8th International Electronic Conference
on Synthetic Organic Chemistry. ECSOC-8. 1-30 November 2004.
http://www.lugo.usc.es/~qoseijas/ECSOC-8/
MARCH-INSIDE Methodology Describing Physico-Chemical Properties of Amino Acids
Ronal Ramos de Armas1,2*, Humberto González Díaz 1,3,*, Reinaldo Molina 1,4, Eugenio Uriarte 3.
1 Chemical Bioactives Center, Central University of “Las Villas” 54830, Cuba.
2 Department of Chemistry, Central University of “Las Villas” 54830, Cuba.
3 Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela 15706, Spain.
4 Universität Rostock, FB Chemie, Albert-Einstein-Str. 3a, D 18059 Rostock, Germany.
* Corresponding author: e-mail
[email protected] or [email protected] Phone: 5342281473; Fax: 5342281230
Abstract: Stochastic-based descriptors generated by the MARCH-INSIDE methodology are applied to the prediction of several amino acid properties related to electronic, hydrophobicity and size-dependent parameters. Linear regression is the statistical technique employed to find these quantitative correlations. The models obtained explain more than 85% of the experimental variance and are comparable, in the quality of their statistical parameters and in predictive power, with those previously reported for the prediction of the isoelectric point, which relied on robust variable-selection methodologies such as Genetic Algorithms and Partial Least Squares (PLS), as well as neural networks for the quantitative correlation.
Key words: Stochastic Process, Markov Chain, Molecular descriptors, QSPR, Isoelectric Point, Thermodynamic Potentials, Chromatographic Properties, logD, Van der Waals Volume
There is growing interest in the study of natural and synthetic peptides and their roles in various biological processes, and finding relationships between the amino acid composition of proteins and some of their properties remains a great challenge 1. Knowledge of many amino acid properties is essential for an adequate separation of amino acids, and even of proteins, because a single amino acid substitution can drastically change the physico-chemical and biological behavior of a protein. On the other hand, the relatively small amounts of these compounds (amino acids) that can be extracted from nature, and the highly time- and resource-consuming processes of determining their physical constants, are serious disadvantages when studying either natural or synthetic amino acids 2. That is why QSPR studies have become a very promising tool for the study of a given amino acid property, even before the amino acid is synthesized in the laboratory or isolated from a natural environment 2-4. The variation in the side chain (R) of an amino acid [(H2N)(R)(H)CCOOH] is the main cause of variation in its properties and is therefore the center of attention in the study of these compounds. To this end, several scales that take into account hydrophobicity, electronic and size-dependent interactions have been defined. These scales can also be considered as molecular descriptors for QSPR and QSAR studies of these compounds as well as of peptides and proteins. One of the first attempts to define a general scale of amino acid properties was reported by Kidera 5, who defined the KOKOS descriptors by means of a Principal Component Analysis (PCA) of 188 physico-chemical properties of the 20 naturally occurring amino acids. Ten principal components were extracted, describing most of their conformational, steric and hydrophobic properties as well as their tendency to form α-helix and β-sheet secondary structures in proteins.
Later, Hellberg 6 attempted to condense this scale into a more useful one for subsequent application to the study of proteins and peptides, taking into account 20 properties of the 20 amino acids naturally found in proteins (Molecular Weight, pKa, pI, Rf values, Chemical Shifts in NMR, Van der Waals Volume, log P, log D and several Thermodynamic Potentials). This study was also carried out by means of a Principal Component Analysis, yielding three components that describe Hydrophobicity, Size and Electronic-Dependent characteristics, named z1, z2 and z3 respectively (z-scale). This scale was later modified by adding 9 properties related to HPLC measurements under different conditions 7, and by Jonsson 9 and Sandberg 8, who added 35 and 67 amino acids respectively and extended the scale to 5 components. These scales have been widely employed in modeling peptide and protein biological properties such as: Bradykinin Potentiation 6, 7, Oxytocin Analogues, Oncostatic Activity, Pepstatin Analogues 7, Elastase Substrates, Neurotensin Analogues 8, G-Protein Classification 10, Protein-Drug Interactions and the characterization of Bacterial Proteins 11. Such scales can also be obtained by means of neural networks 12 and applied to the prediction of the secondary structure of proteins. Despite the importance of predicting and characterizing amino acid properties by means of QSPR studies, relatively few reports on this subject have been found in the literature. Among the most significant are the prediction of the isoelectric point of amino acids reported by Pogliani 4, who used connectivity indices to obtain a two-variable model, and the report of Liu 2, in which a Genetic Algorithm combined with Partial Least Squares was used as the variable selection method and a support vector machine (SVM), a novel type of learning machine, was used for the first time to develop a QSPR model relating the structures of 35 amino acids to their isoelectric points.
This latter paper reported an R value of 0.97 and a standard error of estimation of 0.238. The application of new molecular descriptors to the prediction of amino acid properties by means of QSPR studies is therefore a challenging field in Bioinformatics and Drug Design.
The MARCH-INSIDE (Markovian Chemicals In Silico Design) methodology has been developed by our research group to generate molecular descriptors based on Markov Chain theory. This approach has been successfully employed in QSPR and QSAR studies, including studies related to Proteomics and Nucleic Acid-Drug interactions. The approach describes changes in the electron distribution and vibrational decay with time throughout the molecular backbone. The method allowed us to introduce physically meaningful stochastic graph invariants for the study of molecular properties, and it has demonstrated flexibility across many different problems. One of the applications involved the prediction of the flukicidal activity of novel drugs (flukes are tiny intestinal parasites) 13. More recently, the MARCH-INSIDE approach has been applied to the fast-track experimental discovery of novel anticancer compounds 14. Additionally, promising results have been found in the modeling of the interaction between drugs and HIV-packaging-region RNA in the field of bioinformatics 15. An alternative formulation of our approach in terms of negentropies gives more physical sense to our models for drug-RNA interactions 16. The prediction of the biological activities of peptides and of NMR shifts in proteins are problems that can also be addressed using this approach 17. Codification of chirality and other 3D structural features constitutes another advantage of this method 18. The latter possibility has allowed the estimation of the level of agranulocytosis that is chemically induced by drugs 19. The main objective of this article is the application of the MARCH-INSIDE methodology to the prediction of several physico-chemical properties of amino acids.
2.1- General Definitions:
A precise definition of the descriptors generated by this methodology can be found in several reports of its application to the study of biological properties 13-19. Briefly, the MARCH-INSIDE methodology considers the electron layers of every atom in the molecule as the states of a Markov chain (MCH). The source of the molecular descriptors is the 1П matrix (the one-step electron-transition stochastic matrix), a square n x n matrix (n being the number of atoms in the molecule) whose elements (pij) are calculated as the ratio between the electron-withdrawing power of the j-th atom and the sum of the same quantity over all the atoms covalently linked to the i-th atom. A second matrix, AПk, can be defined as the product of a 1 x n vector (AП0), whose elements (Aπ0(j)) are calculated in the same way as pij but with the sum carried out over all the atoms in the molecule, and kП, which is the k-th power of the 1П matrix. These matrices can then be used to generate three families of molecular descriptors:
Self-Return Probabilities (SRπk): defined as the trace (the sum over the pii values) of the k-th power of the 1П matrix. They codify the attraction of an atom or group of atoms for its electrons (the electrons that were at the atom or the given group of atoms at time t0) at any time tk, located k steps away or less.
Absolute Probabilities (Absπk(j)): codify the attraction of the j-th atom over any electron in the molecule at any time tk after traveling by different paths of less than k steps.
Electronic Delocalization Entropy (Θk): describes the entropy involved in the electron attraction at least k steps away, beginning with the j-th atom.
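The three descriptor families defined above can be illustrated with a minimal numerical sketch. It is not the MARCH-INSIDE software itself, and several choices are assumptions made only for illustration: Pauling electronegativities stand in for the electron-withdrawing power of each atom, each atom is counted among its own neighbours when the rows of the one-step matrix are normalized, the entropy is taken as a Shannon entropy of the absolute probabilities without any physical constant, and the three-atom C-C-O chain is a toy example. The function names are hypothetical.

```python
import numpy as np

def transition_matrix(adjacency, chi):
    """One-step stochastic matrix: p_ij = chi_j / sum of chi over atoms linked to i."""
    chi = np.asarray(chi, dtype=float)
    # Assumption: atom i is included in its own neighbourhood (self-loop).
    linked = np.asarray(adjacency, dtype=float) + np.eye(len(chi))
    weighted = linked * chi  # entry (i, j) is chi_j if j is linked to i, else 0
    return weighted / weighted.sum(axis=1, keepdims=True)

def markov_descriptors(adjacency, chi, k):
    """Return (self-return probability, absolute probabilities, entropy) at order k."""
    chi = np.asarray(chi, dtype=float)
    p_k = np.linalg.matrix_power(transition_matrix(adjacency, chi), k)
    pi_0 = chi / chi.sum()            # initial vector: sum taken over all atoms
    abs_k = pi_0 @ p_k                # absolute probabilities after k steps
    sr_k = float(np.trace(p_k))       # total self-return probability
    # Assumption: Shannon-type entropy of the absolute probabilities.
    theta_k = float(-np.sum(abs_k * np.log(abs_k)))
    return sr_k, abs_k, theta_k

# Toy heavy-atom chain C-C-O with Pauling electronegativities (illustrative only).
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
chi = [2.55, 2.55, 3.44]

for k in range(1, 4):
    sr_k, abs_k, theta_k = markov_descriptors(adjacency, chi, k)
    print(k, round(sr_k, 3), np.round(abs_k, 3), round(theta_k, 3))
```

Local descriptors, such as those restricted to heteroatoms, would correspond under the same assumptions to summing only the diagonal elements or absolute probabilities of the selected atoms.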
2.2- Descriptors Calculation and Physico-Chemical properties of amino acids:
The molecular descriptors (SRπk, Θk, Absπk) were calculated with the experimental software MARCH-INSIDE version 2.0 20. The chemical structure is introduced directly by using the Draw Mode of the software. The structure can then be saved, after which the Calculus Mode of the software can be selected to obtain local (as desired) and total molecular descriptors up to the 10th order. In this case the total descriptors were calculated, as well as the local descriptors over the residue (R), the heteroatoms, and the hydrogens directly bonded to heteroatoms. The physico-chemical properties of amino acids were extracted from the reports of Hellberg 6 (the first z-scale report) and Liu 2, the latter with a larger data set including synthetic amino acids (up to 35) for the prediction of the pI. Table 1 shows the properties modeled in this study.
The STATISTICA software 21 was used to develop the Multiple Linear Regression models. Several statistical parameters were taken into account to assess the statistical quality of the models: squared correlation coefficient (R2), Standard Error of Determination (s), Fisher ratio (F) and Cross Validation Regression Coefficient (R2cv or q2).
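As a rough illustration of how the four statistical parameters listed above relate to one another, the following sketch fits an ordinary least-squares model with NumPy and computes R2, s, F and q2. It is not the STATISTICA procedure: the leave-one-out scheme for q2, the definition q2 = 1 - PRESS/SStot, the function name and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def mlr_statistics(X, y):
    """Fit y = b0 + X b by least squares and return R2, s, F and leave-one-out q2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, c = X.shape                          # n compounds, c descriptors
    A = np.column_stack([np.ones(n), X])    # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    s = np.sqrt(ss_res / (n - c - 1))       # standard error of the regression
    f = (r2 / c) / ((1.0 - r2) / (n - c - 1))
    # Assumption: q2 from leave-one-out cross-validation (PRESS statistic).
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += float((y[i] - A[i] @ b) ** 2)
    q2 = 1.0 - press / ss_tot
    return {"r2": r2, "s": s, "F": f, "q2": q2}

# Synthetic data standing in for descriptors and a measured property.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=20)

stats = mlr_statistics(X, y)
print({name: round(value, 3) for name, value in stats.items()})
```

Because each leave-one-out residual is at least as large in magnitude as the corresponding fitted residual, q2 computed this way never exceeds R2, which is why it is the more conservative indicator of predictive power.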
3.1- Modeling some of the Properties reported for the development of the Z-scale 6.
All the above equations showed high linearity, explaining (except for equation 3.3) more than 85% of the experimental variance in the property being modeled. The q2 values, higher than 0.65 in all cases, support the predictive power of the equations, that is, their ability to predict a compound not included in the training set. The contribution of the presence of heteroatoms, as well as of hydrogens bonded to heteroatoms, stands out in all the equations.
The presence of locally defined variables makes the interpretation of the variables in the models straightforward. Let us analyze equation 3.1 as an example. The model for the prediction of the Isoelectric Point contains variables such as SRπ(H − Het) (related to the number of hydrogens capable of hydrogen-bonding interactions) with a positive contribution. This contribution is easily explained if we consider that the principal heteroatom giving basic properties to an amino acid is nitrogen, which has the highest valence (3) among the heteroatoms most commonly found in these compounds, compared with oxygen (2) and sulfur (2 in Cys and Met). Equation 3.1 also contains the variable SRπ(Het), which measures the electron-withdrawing power of the heteroatoms. As stated before, basic amino acids have a high nitrogen content, whereas acidic amino acids have a higher content of oxygen, an atom with a higher electron-withdrawing power than nitrogen; this explains the decrease in the pI value as the content of the more electronegative element (oxygen) increases. The other equations can be interpreted in the same way as the first one. The observed and predicted values are shown in Table 2 and Figure 3.1.
Figure 1 (Cont.) Plot of Observed and Predicted Values according to Equations 3.1-3.7
3.2- Modelling the Isoelectric Point of 35 amino acids according to the data reported by Liu 2
In this case the best model obtained was:
In this model, despite favorable statistical values for linearity and predictive power, the figures were not as good as those reported by Liu (2004) (R2 = 0.95 and s = 0.238). The presence of an outlier, the amino acid Citrulline, was detected by several methods 21. After the elimination of this compound from the data set we achieved a considerable improvement in the parameters of the model, as shown by the following equation (Eq. 3.9):
As can be seen from the above equation, the linearity parameters increase dramatically (R2 = 0.937), explaining about 14% more of the variance in the experimental data, as well as about 24% more of the variance when a new compound is predicted (q2 = 0.899). Comparing these values with our reference work 2, and in spite of the fact that powerful variable-selection methods and robust methods for developing the quantitative relationships (Genetic Algorithm, PLS and Neural Networks) were used in that report, the MARCH-INSIDE methodology, using only Forward Stepwise as the variable-selection strategy, uses a smaller set of variables and reaches very similar results, as can be seen in Table 3.
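The outlier screening and refitting step described above for Citrulline can be sketched as follows. The text does not state which diagnostics were applied, so the use of internally standardized residuals, the cut-off of 2.5, the function names and the toy data with one planted outlier are all assumptions made for illustration and do not reproduce the actual data set.

```python
import numpy as np

def standardized_residuals(X, y):
    """Internally standardized residuals of a least-squares fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = A.shape[0] - A.shape[1]
    s = np.sqrt(resid @ resid / dof)
    # Leverage h_ii: diagonal of the hat matrix A (A'A)^-1 A'.
    leverage = np.sum(A * np.linalg.pinv(A).T, axis=1)
    return resid / (s * np.sqrt(1.0 - leverage))

def r_squared(X, y):
    """Squared correlation coefficient of the least-squares fit."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))

# Toy data: a linear trend with a small deterministic wobble and one planted outlier.
x = np.arange(10.0)
y = 2.0 * x + 1.0 + 0.1 * np.sin(x)
y[4] += 8.0
X = x.reshape(-1, 1)

r = standardized_residuals(X, y)
flagged = np.flatnonzero(np.abs(r) > 2.5)   # assumed cut-off, not from the source
keep = np.setdiff1d(np.arange(len(y)), flagged)

r2_before = r_squared(X, y)
r2_after = r_squared(X[keep], y[keep])
print(flagged, round(r2_before, 3), round(r2_after, 3))
```

In practice a flagged compound would be removed only after checking that its deviation has a chemical or experimental explanation, since deleting points solely to improve R2 inflates the apparent quality of the model.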
The FITNESS value is a parameter that can be calculated in the following way 2:
Where n is the number of compounds in the data set, q2 is the Cross Validation Regression Coefficient and c is the number of variables/components in the equation. As can be seen, this parameter allows us to compare equations with different numbers of variables and of compounds in the data set. Taking a close look at this table, we find that the same content of information is condensed into almost half of the variables (compare the FITNESS values for Liu's work and equation 3.8), with almost the same results (compare the R values of Liu's work and equation 3.9). The observed and predicted values for this data set according to equations 3.8 and 3.9 are shown in Table 4 and Figure 3.2.
From the above results, the prediction of physico-chemical properties of amino acids as diverse as the isoelectric point (electronic), log D, ∆Gvapor-H2O and ∆Gorg-H2O (hydrophobic) and the Van der Waals volume (size) can be achieved successfully by applying the MARCH-INSIDE methodology. The results were obtained with simpler statistical techniques for variable selection and for building the quantitative relationships, and are comparable with those previously reported using more elaborate and robust techniques.
1.- Nelson D.L., Cox M.M. Lehninger Principles of Biochemistry, 3rd Ed. Worth Publishers, New York, 2000.
2.- Liu H. X., Zhang R. S., Yao X. J., Liu M. C., Hu Z. D., Fan B. T. Prediction of the Isoelectric Point of an Amino Acid Based on GA-PLS and SVMs. J. Chem. Inf. Comput. Sci. 2004, 44, 161-167.
3.- Todeschini, R., Consonni, V. Handbook of Molecular Descriptors. (Mannhold, R., Kubinyi, H., Timmerman, H., Eds.) Wiley-VCH 2000. 667pp.
4.- Pogliani L. Molecular connectivity model for determination of isoelectric point of amino acids. J. Pharm. Sci. 1992, 81, 334-336.
5.- Kidera, A., Konishi, Y., Ooi, T., Scheraga, H.A. Statistical Analysis of the physical properties of the 20 Naturally Occurring Amino Acids. J. Prot. Chem. 1985, 4, 23-55.
6.- Hellberg, S., Sjostrom, M., Wold, S. An example of Peptide Quantitative Structure-Activity Relationship. Acta Chem. Scand. 1986, 40, 135-140.
7.- Hellberg, S., Sjostrom, M., Skagerberg, B., Wold, S. Peptide Quantitative Structure-Activity Relationships, a Multivariate Approach. J. Med. Chem. 1987, 30, 1126-1135.
8.- Sandberg, M., Eriksson, L., Jonsson, J., Sjostrom, M., Wold, S. New Chemical Descriptors relevant for the design of Biologically Active Peptides. A multivariate characterization of 87 Amino Acids. J. Med. Chem. 1998, 41, 2481-2491.
9.- Jonsson, J., Eriksson, L., Hellberg, S., Sjostrom, M., Wold, S. Multivariate Parametrization of 55 Coded and Non-Coded Amino Acids. Quant. Struct. Act. Relat. 1989, 8, 204-209.
10.- Lapinsh M., Gutcaits A., Prusis P., Post C., Lundstedt T.R., Wikberg J.E.S. Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences. Protein Science. 2002, 11, 795–805.
11.- Sjöström, M., Rännar, S., Wieslander, Å. Polypeptide sequence property relationships in Escherichia coli based on auto cross covariances. Chemometr. Intell. Lab. Syst. 1995, 29, 295–305.
12.- Meiler J., Müller M., Zeidler A., Schmäschke F. Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. J. Mol. Model. 2001, 7, 360–369.
13.- González, D.H., Olazábal, E., Castañedo, N., Hernández, S.I., Morales, A., Serrano, H.S., González, J., Ramos de Armas, R. Markovian chemicals “in silico” design (MARCH-INSIDE), a promising approach for computer aided molecular design II: experimental and theoretical assessment of a novel method for virtual screening of fasciolicides. J. Mol. Mod. 2002, 8, 237-245.
14.- González, D.H., Gia, O., Uriarte, E., Hernández, I., Ramos, R., Chaviano, M., Seijo, S., Castillo, J.A., Morales, L., Santana, L., Akpaloo, D., Molina, E., Cruz, M., Torres, L. A., Cabrera, M.A. Markovian chemicals "in silico" design (MARCH-INSIDE), a promising approach for computer-aided molecular design I: discovery of anticancer compounds. J. Mol. Mod. 2003, 9, 395-407.
15.- González, D.H., Ramos, de A. R., Molina R. Vibrational markovian modelling of footprints after the interaction of antibiotics with the packaging region of HIV type 1. Bull. Math. Biol. 2003, 65, 991-1002.
16.- González, D.H., Ramos, de A. R., Molina R. Markovian negentropies in bioinformatics. 1. A picture of footprints after the interaction of the HIV-1 ψ-RNA packaging region with drugs. Bioinformatics. 2003, 19, 2079-2087.
17.- González, D.H., Ramos, de A. R., Uriarte, E. (b). In Silico Markovian Bioinformatics for predicting 1Hα-NMR chemical shifts in mouse epidermis growth factor (m-EGF). Online J. Bioinf. 2002, 1, 83-95.
18.- González, D.H., Hernández, S.I., Uriarte, E., Santana, L. Symmetry considerations in markovian chemicals “in silico” design (MARCH-INSIDE). I: central chirality codification, classification of ACE inhibitors and prediction of σ-receptor antagonist activities. Comput. Biol. Chem. 2003, 27, 217-227.
19.- González, D. H., Marrero, Y., Hernández, I., Bastida, I., Tenorio, I., Nasco, O., Uriarte, E., Castañedo, N., Cabrera, M., Aguila, E., Marrero, O., Morales, A., Pérez, M. 3D-MEDNEs: an alternative "in silico" technique for chemical research in toxicology. 1. Prediction of chemically induced agranulocytosis. Chem. Res. Tox. 2003, 16, 1318-1327.
20.- González D.-H., Molina-Ruiz, R., Hernández I. MARCH-INSIDE version 2.0 (Markovian Chemicals “In Silico” Design), Chemicals Bio-actives Center, Central University of “Las Villas”, Cuba 2003. This is a preliminary experimental version; a future professional version shall be available to the public. For any information about it, send an e-mail to the corresponding author [email protected].
21.- StatSoft, Inc. 2002 Statistica 6.0. Copyright © 1984-2002.