Improving Biological Activity Prediction Using Recursive Clustering Algorithm

Publication year: 1400 (Solar Hijri)
Document type: Conference paper
Language: English

The full text of this paper has not been published; only an abstract or extended abstract is available in the database.

National scientific document ID:

ICSB04_012

Indexing date: 20 Mehr 1400

Abstract:

Laboratory procedures are inherently time-consuming, complex and costly, which has led scientists to use alternative methods such as machine learning techniques in ligand-based virtual screening (LBVS) protocols. Predicting or classifying the biological activities of compounds, a task common to all drug discovery approaches, provides a theoretical framework for statistical machine learning techniques. Machine learning is a computational technique applicable in statistical and mathematical research that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In these statistical models, diverse theoretical molecular descriptors derived from different molecular representations encode the physicochemical and structural features of the molecules. This information plays a fundamental role in identifying model parameters, so descriptors are usually used to construct statistical models for biological activity prediction. However, the thousands of available molecular descriptors have been one of the main problems in QSAR/QSPR studies. Feature selection was therefore suggested as a pre-processing step to improve prediction. Feature selection methods for removing irrelevant descriptors have long been a subject of debate in drug design. Over the past two decades, numerous feature selection methods have become popular tools in LBVS. Some of the most standard and prevalent algorithms are (1) for filter methods: the χ²-test, mutual information, factor analysis and principal component analysis; and (2) for wrapper methods: stepwise regression, K-nearest neighbours (KNN), genetic algorithm (GA), random forest (RF), support vector machine (SVM), Bayes classifier, kernel-based methods such as Gaussian processes, one-against-all, restricted Boltzmann machine (RBM), and clustering methods such as fuzzy clustering and the k-means algorithm.
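To illustrate the filter methods listed above, the following minimal sketch ranks descriptors with a χ² statistic against class labels. It is not taken from the paper: the function names `chi2_scores` and `select_top_k`, the synthetic binary data, and the choice of the χ² filter are all assumptions made for illustration, and the formula assumes non-negative descriptors such as fingerprint bits or counts.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic of each non-negative descriptor against class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)  # one-hot labels, (n_samples, n_classes)
    observed = Y.T @ X                                   # per-class descriptor totals
    feature_total = X.sum(axis=0)
    class_prob = Y.mean(axis=0)
    expected = np.outer(class_prob, feature_total)       # totals expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (observed - expected) ** 2 / expected
    terms = np.nan_to_num(terms, nan=0.0, posinf=0.0)    # all-zero descriptors score 0
    return terms.sum(axis=0)

def select_top_k(X, y, k):
    """Indices of the k descriptors with the highest chi-squared score."""
    scores = chi2_scores(X, y)
    return np.argsort(scores)[::-1][:k]

# Hypothetical data: 200 compounds, 50 binary descriptors, binary activity label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.integers(0, 2, size=(200, 50))
X[:, 0] = y                       # descriptor 0 is made perfectly informative
top = select_top_k(X, y, 5)
scores = chi2_scores(X, y)
```

A filter of this kind scores each descriptor independently of any downstream model, which is what distinguishes it from the wrapper methods in the same list.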
In the field of prediction, the main goal is to foresee the biological activity of new molecules for which no activity information is available. Some of the methods used in this research are known as quantitative structure-activity relationship (QSAR) approaches. QSAR techniques rely on both molecular descriptors and biological activities in the training set, and on molecular descriptors alone in the test set. Many linear and nonlinear models have been applied in the QSAR approach. The artificial neural network (ANN) is one of the most popular nonlinear methods used in drug design. The efficacy of this technique depends on selecting the optimal features among the molecular descriptors. Algorithmic simplicity and runtime are its advantages over more recent methods (e.g. deep learning), while its main disadvantage is that results degrade as the number of molecular descriptors grows. The best solution for an in-house database appears to be a pre-processing step that finds the relevant features. Pre-processing also helps to overcome other problems of this technique, such as getting stuck in local minima and being prone to over-fitting during learning. In over-fitting, prediction accuracy is very good on the training set but very low on the test set, or a chance correlation between predicted and experimental results sometimes arises. This article attempts to detect and extract the optimal descriptors using a new embedded algorithm that combines recursive clustering based on the k-means method with a genetic algorithm (GA) (Figure 1). As a proof of concept, it was applied to compounds of eight different targets. To evaluate the proposed algorithm, an ANN was applied to predict molecular biological activity.
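The abstract does not give the details of the embedded algorithm, so the following is only one plausible reading, sketched under stated assumptions: descriptors are grouped by recursive k-means until each group is small, one representative per group is kept, and a simple GA then searches over subsets of the representatives. All function names, parameters (`max_size`, `pop_size`, `n_gen`, `p_mut`), and the synthetic data are hypothetical, and a least-squares fit stands in for the ANN as the fitness model to keep the example short.

```python
import numpy as np

def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means; returns a cluster label for each row of `points`."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):            # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def recursive_clusters(points, idx, rng, max_size=5, k=2):
    """Split descriptor indices `idx` with k-means until groups have <= max_size members."""
    if len(idx) <= max_size:
        return [idx]
    labels = kmeans(points[idx], k, rng)
    groups = [idx[labels == j] for j in range(k) if np.any(labels == j)]
    if len(groups) == 1:                       # k-means could not split this group
        return groups
    out = []
    for g in groups:
        out.extend(recursive_clusters(points, g, rng, max_size, k))
    return out

def representatives(points, groups):
    """Pick, for each group, the descriptor closest to the group mean."""
    reps = []
    for g in groups:
        center = points[g].mean(axis=0)
        reps.append(g[((points[g] - center) ** 2).sum(axis=1).argmin()])
    return np.array(reps)

def fitness(mask, X_tr, y_tr, X_va, y_va):
    """Negative validation MSE of a least-squares fit (assumed stand-in for the ANN)."""
    if not mask.any():
        return -np.inf
    A = np.c_[X_tr[:, mask], np.ones(len(X_tr))]
    w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    pred = np.c_[X_va[:, mask], np.ones(len(X_va))] @ w
    return -np.mean((pred - y_va) ** 2)

def genetic_select(X_tr, y_tr, X_va, y_va, rng, pop_size=20, n_gen=30, p_mut=0.1):
    """Tiny GA over boolean masks; assumes at least two candidate descriptors."""
    n = X_tr.shape[1]
    pop = rng.random((pop_size, n)) < 0.5
    for _ in range(n_gen):
        scores = np.array([fitness(m, X_tr, y_tr, X_va, y_va) for m in pop])
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]  # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.choice(len(parents), 2, replace=False)]
            cut = rng.integers(1, n)                       # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut                 # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(m, X_tr, y_tr, X_va, y_va) for m in pop])
    return pop[scores.argmax()]

# Hypothetical data: 6 latent factors, each observed as 5 noisy, correlated descriptors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(120, 6))
X = np.repeat(latent, 5, axis=1) + 0.1 * rng.normal(size=(120, 30))
y = latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=120)

Z = (X - X.mean(axis=0)) / X.std(axis=0)                   # standardize descriptors
groups = recursive_clusters(Z.T, np.arange(X.shape[1]), rng, max_size=5)
reps = representatives(Z.T, groups)
best = genetic_select(Z[:80][:, reps], y[:80], Z[80:][:, reps], y[80:], rng)
selected = reps[best]                                      # final descriptor indices
```

Clustering first shrinks the search space the GA has to explore, and scoring masks on a held-out split rather than the training split is one simple guard against the over-fitting described above.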

Authors

Fahimeh Ghasemi

Department of Bioinformatics and Systems Biology, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Hezar-Jerib Ave., Isfahan, IR Iran, 81746 73461.