7th International Electronic Conference on Synthetic Organic Chemistry (ECSOC-7), http://www.mdpi.net/ecsoc-7, 1-30 November 2003
[C006]
STOCHASTIC-BASED DESCRIPTORS FOR MODELING BIOLOGICAL PROPERTIES OF PEPTIDES: MODELING ANGIOTENSIN-CONVERTING ENZYME INHIBITION OF DIPEPTIDES.
Ronal Ramos a,*, Humberto González Díaz b, Alexander Duran a, Maykel Perez c,b
a Chemistry Department, Chemistry and Pharmacy Faculty, Central University of Las Villas, Santa Clara Villa Clara, Cuba. PC 54830.
b Department of Drug Design, Chemical Bioactive Center, Central University of Las Villas. Cuba. Santa Clara Villa Clara, Cuba. PC 54830.
c Sugar Cane Experimental Center, Villa Clara Cuba.
*Corresponding author e-mail address: ronal@medscape.com
Abstract: A stochastic-based molecular descriptor was introduced to predict the inhibitory activity of a set of 58 dipeptides against Angiotensin-Converting Enzyme (ACE). For this purpose, several models (equations) were obtained with total descriptors and with local descriptors defined over the side chain at each amino acid position. In all cases good regression models were obtained, explaining at least 74% of the total variance and performing well in the cross-validation procedures employed (at most 16% of the explained variance is lost when the models are used to predict a compound left out of the fit).
Key Words: QSAR, Dipeptides, ACE inhibition, Markov Theory, Stochastic Process.
INTRODUCTION
Quantitative Structure-Activity Relationship (QSAR) techniques have become indispensable in all aspects of research into the molecular interpretation of biological properties (1). Several kinds of methodologies yielding different molecular descriptors have been defined and widely employed to study physicochemical and biological properties (2-4). A great deal of research has been carried out in the specific field of connectivity indices (5-6).
On the other hand, several problems arise when QSAR studies are applied to peptides and proteins. Peptides comprise a great variety of biologically active linear and cyclic biopolymers, which can be divided according to their functions into different classes such as antibiotics, enzyme inhibitors and substrates, hormones and regulatory factors, peptide alkaloids, toxins and sweeteners. Most of these bioactive peptides consist of a large number of amino acid residues, but there are also small drugs consisting of up to three amino acid residues. Despite these difficulties, there have been successful reports in recent decades that describe their molecular structure in a quantitative way (7-8).
Special attention has been paid to Angiotensin and related enzymes, with both peptidic and non-peptidic derivatives. Krovat (9) reported chemical-feature-based pharmacophore models elaborated for angiotensin II receptor subtype 1 (AT(1)) antagonists using both a quantitative and a qualitative approach (Catalyst HypoGen and HipHop algorithms, respectively). The literature also reports studies on these specific compounds employing combined conformational and chemometric analysis (10), steric parameters (11) and hologram-based methodologies. Several references have also been found dealing with QSAR studies of ACE inhibitors employing minimum analogue peptide sets (MAPS) of nine dipeptides and, for a set of 58 dipeptides inhibiting Angiotensin-Converting Enzyme, descriptors encoding 3-D information (12-14) and the molecular holographic distance vector (15).
Despite the breadth of the methodologies employed in the above-mentioned studies, they are still limited in that they either do not allow a straightforward interpretation of the resulting QSAR in terms of the physicochemical factors important for biological activity, as with the z-scales, or require preliminary procedures, e.g. structural alignment in CoMFA (4) or identification of a putative bioactive conformation derived from several methods, as in CoMFA and classical 3-D QSAR (16). All these facts can introduce uncertainties into QSAR models, especially when dealing with a structurally diverse dataset of highly flexible compounds such as peptides and proteins. That is why it is still reasonable and meaningful to construct a set of descriptors that possess good predictive ability, speed and ease of use without 3-D structural information.
The use of the stochastic matrix formalism as a source of simple molecular descriptors had not appeared in the literature before 2002. Our research group recently developed a new methodology (MARCH-INSIDE), based on a Markov chain formalism, to codify molecular structure for virtual screening and rational drug discovery; it has been applied to the rational design of flukicidal drugs (17). These ideas have been extended to the study of protein structure-property relationships (18). Thus, taking into consideration all the aspects highlighted in this introductory section, the present paper has very specific aims. Primarily, the present work deals with the QSAR study of a set of 58 dipeptides and their ACE-inhibition activity, applying the MARCH-INSIDE methodology in order to prove its applicability to the prediction of biological properties of amino acids. The results were compared with previous reports, and the contribution of the amino acid residue at either position of the dipeptide was also analyzed, leading to very interesting conclusions.
March-Inside Methodology
Markovian Chemicals “in Silico” Design (MARCH-INSIDE). The MARCH-INSIDE methodology uses Markov chains (MCH) (19-20) to codify information about molecular structure. This procedure considers as states of the MCH the external electron layers (valence shells) of the atom cores in the molecule (21). The method uses as a source of molecular descriptors the matrix 1P, called the 1-step electron-transition stochastic matrix. 1P is built as a square table of order n, where n represents the number of atoms in the molecule. Its elements (1pij) are the transition probabilities:
$$ {}^{1}p_{ij} = \frac{\chi_j}{\sum_{m=1}^{\delta+1} \chi_m} \qquad (1) $$
where χj is the electronegativity of the atom aj, which is bonded to the atom ai, and the sum in the denominator runs up to δ + 1, with δ taken here to be the number of atoms bonded to ai (17,18-21). The elements of 1P (1pij) are defined to codify information about the strength with which atoms withdraw electrons from their neighbors in the molecule. The MARCH-INSIDE molecular descriptors used in the previous paper (35) are defined as:
$$ {}^{SR}\pi_k(S_m) = \sum_{i \in S_m} {}^{k}p_{ii} \qquad (2) $$
These molecular descriptors are the traces of the kth-step electron-transition stochastic matrices (kP). These matrices are the successive powers of 1P. The trace (Tr) (22-25) is the sum of the main diagonal elements (kpii) of kP. The construction of the 1P matrix for Nitrilo-acetyl Fluoride is shown in Figure 1. It can be observed that the pij values are proportional to the electronegativity of the atom aj (the atom that attracts the electrons of ai). Conversely, the pij values are inversely related to the electronegativity of the atoms that “compete” with aj to withdraw electrons from ai. In equation (2), Sm represents a specific group of atoms in the molecule. When Sm contains all the atoms in the molecule, the term SRπk(S) becomes a global molecular index and we write only SRπk.
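The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MARCH-INSIDE implementation: it assumes Pauling electronegativities, a hydrogen-suppressed graph that ignores bond multiplicity, that the name Nitrilo-acetyl Fluoride denotes N#C-C(=O)-F, and that the "+1" in δ + 1 means atom i itself is counted in the denominator. The atom numbering and the function names `transition_matrix` and `self_return` are hypothetical.

```python
import numpy as np

# Pauling electronegativities (an assumed scale; the text does not name one).
PAULING = {"C": 2.55, "N": 3.04, "O": 3.44, "F": 3.98}


def transition_matrix(atoms, bonds):
    """Row-stochastic 1P: p_ij = chi_j / (sum of chi over atom i and its neighbors)."""
    n = len(atoms)
    chi = np.array([PAULING[a] for a in atoms])
    reach = np.eye(n)  # assumption: an atom may also retain its own electrons
    for i, j in bonds:
        reach[i, j] = reach[j, i] = 1.0
    weights = reach * chi  # entry (i, j) is chi_j if j is i or bonded to i, else 0
    return weights / weights.sum(axis=1, keepdims=True)


def self_return(P, k, subset=None):
    """Sum of the diagonal elements of P**k over `subset` (all atoms if None)."""
    Pk = np.linalg.matrix_power(P, k)
    idx = range(P.shape[0]) if subset is None else subset
    return float(sum(Pk[i, i] for i in idx))


# Hypothetical numbering for N#C-C(=O)-F: N0, C1, C2, O3, F4.
atoms = ["N", "C", "C", "O", "F"]
bonds = [(0, 1), (1, 2), (2, 3), (2, 4)]
P1 = transition_matrix(atoms, bonds)

total = [self_return(P1, k) for k in range(4)]                    # global SRpi_k
local = [self_return(P1, k, subset=[2, 3, 4]) for k in range(4)]  # C(=O)F group only
print(np.round(P1, 3))
print(total)
print(local)
```

Because the local descriptor is a partial trace, the values over any partition of the atoms add up to the global descriptor for the same k.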

Fig. 1 Construction of the 1P matrix for Nitrilo-acetyl Fluoride.
Markovian Electronic Delocalization Entropies (MEDEs)
The MCH theory permits us to explore the potential of the 1P matrix as a source of simple molecular descriptors that have a clear physical meaning and are based on the physical and informational concept of entropy within this stochastic theory. These molecular descriptors, which we shall define here as kth-MEDEs, have the use of the entropy concept in common with those reported previously (2,26). Thus, we can expect that the use of entropy in the generalization of MARCH-INSIDE may not only clarify the physical meaning of this method but also further improve its effectiveness in QSTR studies.
In equation (3) the sum in the denominator is carried out over all the atoms in the molecule (n) instead of only up to δ + 1, with χj again denoting the electronegativity of atom j. We call this new probability the 0-step absolute Markovian probability of electronic delocalization for atom j (Ap0(j)); note the use of A (absolute) rather than SR (self-return). These probabilities are normalized, i.e. the Ap0(j) values summed over the n atoms are equal to 1.
$$ {}^{A}p_0(j) = \frac{\chi_j}{\sum_{m=1}^{n} \chi_m} \qquad (3) $$
Following the MCH theory formulation (19, 24), the physical meaning of Ap0(j) is very clear. Under the approximation that electronegativity describes the electron-withdrawing strength of atoms in the molecule (21), the Ap0(j) are the probabilities with which a specific atom j attracts any electrons in the molecule in 0 steps. The calculation of any Apk(j) value, which is the probability with which a specific atom j attracts any electrons in the molecule in k steps, is straightforward:
$$ {}^{A}\mathbf{P}_k = {}^{A}\mathbf{P}_0 \cdot {}^{k}\mathbf{P} = {}^{A}\mathbf{P}_0 \cdot \left({}^{1}\mathbf{P}\right)^{k} \qquad (4) $$
where APk are 1 × n vectors whose elements Apk(j) are the probabilities sought, AP0 is a 1 × n vector whose elements are the Ap0(j) probabilities for the n atoms in the molecule, and kP are the kth natural powers of the 1P matrix. It is now very simple to calculate the kth-step Markovian Electron Delocalization Entropies, which represent the entropy involved in the attraction of electrons at least k steps (bonds) away by any atom j in the molecule (Θk(j)). The sum of the Θk(j) over all n atoms in the molecule gives the kth MEDEs used here as total molecular descriptors:
$$ \Theta_k = \sum_{j=1}^{n} \Theta_k(j) = -k_B \sum_{j=1}^{n} {}^{A}p_k(j) \, \ln {}^{A}p_k(j) \qquad (5) $$
We use natural logarithms instead of their base-2 counterparts, as is usual in Shannon's expression, and we multiply by kB (Boltzmann's constant). This only implies a change of scale from bits (more commonly used in information theory) to kJ/K (more familiar to the chemist) (27) and, in any case, the physical meaning remains the same.
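A minimal sketch of this entropy calculation is given below, assuming only what the text states: an initial vector of absolute probabilities proportional to electronegativity, propagation by powers of 1P, and a Shannon-type sum with natural logarithms scaled by Boltzmann's constant. The function names, the three-atom toy chain and its electronegativity values are made up for illustration; the constant is given in J/K, and the `k_b` argument can be used to rescale to other units.

```python
import numpy as np

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def absolute_probabilities(chi, P, k):
    """1 x n vector APk = AP0 . P**k, with AP0(j) = chi_j / sum(chi)."""
    chi = np.asarray(chi, dtype=float)
    p0 = chi / chi.sum()
    return p0 @ np.linalg.matrix_power(np.asarray(P, dtype=float), k)


def mede(chi, P, k, subset=None, k_b=K_B):
    """-k_b * sum_j p_k(j) * ln p_k(j) over `subset` (all atoms if None)."""
    pk = absolute_probabilities(chi, P, k)
    if subset is not None:
        pk = pk[list(subset)]
    pk = pk[pk > 0]  # the term 0 * ln 0 is taken as 0
    return float(-k_b * np.sum(pk * np.log(pk)))


# Toy three-atom chain A-B-C with made-up electronegativities (illustration only).
chi = [2.5, 3.0, 3.5]
reach = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]], dtype=float)
P1 = reach * chi / (reach * chi).sum(axis=1, keepdims=True)

for k in range(3):
    print(k, absolute_probabilities(chi, P1, k), mede(chi, P1, k))
```

Since 1P is row-stochastic, each APk vector remains normalized, so the total descriptor is a proper entropy for every k; passing `subset` gives a local value over a chosen group of atoms such as a side chain.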
Angiotensin-Converting Enzyme Inhibitors
A set of 58 dipeptides inhibiting ACE was obtained from the report of Hellberg et al (14). The dipeptides and the ACE inhibition data in their study were taken from a previous compilation (28) and synthesized as part of the development of Captopril.
Molecular Descriptor Generation And Statistical Analysis
The calculation of SRπk or Θk for any organic or inorganic molecule was implemented in the software MARCH-INSIDE (29). The chemical structure is input directly using the molecular graphics of the software's draw mode. The structure can then be saved, after which the calculation option is selected and the first 10 values of each descriptor (total and local molecular indices) are computed. The local calculations were carried out over the side-chain residues at each position of the dipeptide molecule. At this point, we have a set of electronic delocalization entropies (Θk), self-return probabilities (SRπk) and absolute probabilities (absπk) for each type of calculation (total, over R1 and over R2). STATISTICA (30) and its Multiple Regression Analysis module were used to obtain the mathematical relationship between ACE inhibition and molecular structure (total and local molecular descriptors). Several statistical parameters were calculated, such as the squared correlation coefficient (R2), the root mean square error (s), the F statistic and the cross-validation regression coefficient (R2cv, reported below as Q2).
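The statistical parameters listed above can be illustrated with a short NumPy sketch. It is not the STATISTICA procedure used in this work: it assumes ordinary least squares with an intercept, takes s as the standard error of estimate with n - p - 1 degrees of freedom, and computes the leave-one-out Q2 by refitting n times. The data are synthetic random numbers, not the descriptors or activities of the 58 dipeptides, and the name `regression_stats` is hypothetical.

```python
import numpy as np


def regression_stats(X, y):
    """Ordinary least squares with intercept; returns R2, s, F and leave-one-out Q2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = float(np.sum((y - A @ beta) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    dof = n - p - 1
    r2 = 1.0 - ss_res / ss_tot
    s = float(np.sqrt(ss_res / dof))
    f = (r2 / p) / ((1.0 - r2) / dof)

    # Leave-one-out: refit without case i, then predict case i.
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += float((y[i] - A[i] @ b) ** 2)
    q2 = 1.0 - press / ss_tot
    return {"R2": r2, "s": s, "F": f, "Q2": q2}


# Synthetic example: 58 cases, 3 made-up descriptors, noisy linear response.
rng = np.random.default_rng(0)
X = rng.normal(size=(58, 3))
y = 3.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.4, size=58)
stats = regression_stats(X, y)
print({name: round(value, 3) for name, value in stats.items()})
```

For a least-squares fit Q2 can never exceed R2, which is why the gap between the two is a useful indicator of how much of the fitted variance survives prediction.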
RESULTS AND DISCUSSION.
Regression Analysis:
In seeking a quantitative equation to model this property, several models arose comprising either total or local molecular descriptors. The best one of each set was:
(6)
(7)
(8)
The statistical parameters of the above equations are shown in the following table (Table 1).
| Eq. | N | R2 | Q2 | SD | F | Variables | ρ | (1-Q2/R2), % |
|---|---|---|---|---|---|---|---|---|
| 8 | 58 | 0.847 | 0.801 | 0.419 | 39.479 | 7 | 8.3 | 5.4 |
| 7 | 58 | 0.740 | 0.623 | 0.563 | 13.351 | 10 | 5.8 | 16.0 |
| 6 | 58 | 0.835 | 0.760 | 0.450 | 24.84 | 10 | 5.8 | 9.0 |
As can be seen, the three models showed good linearity (R2), explaining at least 74% of the variance in the experimental data; models 6 and 8 explain more than 83% of it. The ρ coefficient, the ratio of the number of cases to the number of variables (it should be > 5 for linear models), takes an acceptable value of 5.8 for models 6 and 7 and a very good value of 8.3 for model 8. The Leave-One-Out cross-validation results (Q2) also show the superiority of models 6 and 8 over model 7. The ratio (1-Q2/R2) measures the percentage of the variance explained by the model that is lost when predicting a left-out compound. These values were 9, 16 and 5.4% for models 6, 7 and 8 respectively, showing that more than 90% of the variance explained in the linearity test remains explained in the prediction test for models 6 and 8, and about 84% for model 7. Models 6 and 8 also show better values of the Fisher ratio and SD than model 7. The observed and predicted values for each model are given in Table 2 and Fig. 2.
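The two derived quantities discussed above can be checked directly from the reported R2, Q2 and number of variables. The sketch below assumes that ρ is simply the number of cases divided by the number of variables, which reproduces the reported 8.3 and 5.8; the function name `derived` is hypothetical.

```python
N = 58  # number of dipeptides

# (R2, Q2, number of variables) as reported for each equation.
models = {6: (0.835, 0.760, 10), 7: (0.740, 0.623, 10), 8: (0.847, 0.801, 7)}


def derived(n_cases, r2, q2, n_vars):
    """Return rho (cases per variable) and the % of explained variance lost in LOO."""
    rho = n_cases / n_vars
    lost_pct = 100.0 * (1.0 - q2 / r2)
    return rho, lost_pct


for eq, (r2, q2, n_vars) in sorted(models.items()):
    rho, lost_pct = derived(N, r2, q2, n_vars)
    print(f"Eq. {eq}: rho = {rho:.1f}, lost = {lost_pct:.1f}%")
```

This prints 5.8 and 9.0% for Eq. 6, 5.8 and 15.8% for Eq. 7, and 8.3 and 5.4% for Eq. 8, in agreement with the percentages quoted in the text (15.8% being quoted as 16%).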
Table 2. Predicted and observed values of pIC50.
a Observed value of pIC50. b, c, d Predicted values from Eq. 8, 7 and 6, respectively.
| DIPEP | Obs a | Pred b (R2) | Pred c (R1) | Pred d (Total) | DIPEP | Obs a | Pred b (R2) | Pred c (R1) | Pred d (Total) |
|---|---|---|---|---|---|---|---|---|---|
| Ala-Ala | 3.21 | 2.93 | 2.96 | 3.32 | Gly-Thr | 2.24 | 2.74 | 2.72 | 2.79 |
| Ala-Gly | 2.60 | 2.39 | 2.49 | 2.77 | Gly-Trp | 4.52 | 4.25 | 2.95 | 3.78 |
| Ala-Phe | 3.72 | 3.80 | 3.95 | 3.79 | Gly-Tyr | 3.68 | 3.75 | 2.90 | 3.35 |
| Ala-Pro | 3.64 | 3.55 | 3.43 | 3.46 | Gly-Val | 2.34 | 2.57 | 2.75 | 2.47 |
| Ala-Trp | 5.00 | 5.23 | 4.25 | 4.73 | His-Gly | 2.20 | 2.35 | 2.19 | 1.72 |
| Ala-Tyr | 4.06 | 4.27 | 4.03 | 4.12 | His-Leu | 2.49 | 2.81 | 2.79 | 2.39 |
| Arg-Ala | 3.34 | 2.78 | 3.63 | 3.19 | Ile-Gly | 2.92 | 2.41 | 3.30 | 2.91 |
| Arg-Phe | 3.64 | 3.30 | 4.21 | 4.13 | Ile-Phe | 3.03 | 4.09 | 4.46 | 4.06 |
| Arg-Pro | 3.74 | 3.27 | 3.69 | 3.99 | Ile-Pro | 3.89 | 3.76 | 3.86 | 3.98 |
| Arg-Trp | 4.80 | 4.94 | 5.03 | 5.10 | Ile-Trp | 5.70 | 5.59 | 5.16 | 5.08 |
| Asp-Ala | 2.42 | 2.83 | 1.83 | 2.16 | Ile-Tyr | 5.43 | 4.58 | 4.60 | 4.45 |
| Asp-Gly | 1.85 | 2.36 | 1.88 | 1.82 | Leu-Ala | 3.51 | 2.93 | 3.00 | 2.82 |
| Gln-Gly | 2.13 | 2.36 | 1.85 | 2.76 | Leu-Gly | 2.06 | 2.39 | 2.22 | 2.33 |
| Glu-Ala | 2.00 | 2.81 | 2.14 | 2.22 | Lys-Ala | 3.42 | 2.85 | 3.46 | 2.82 |
| Glu-Gly | 2.00 | 2.36 | 1.70 | 1.81 | Lys-Gly | 2.49 | 2.37 | 2.53 | 2.41 |
| Gly-Ala | 2.70 | 2.69 | 2.53 | 2.39 | Met-Gly | 2.32 | 2.38 | 2.84 | 2.75 |
| Gly-Arg | 2.49 | 2.91 | 2.95 | 2.92 | Phe-Arg | 3.04 | 3.34 | 3.09 | 3.19 |
| Gly-Asp | 2.04 | 1.73 | 2.73 | 1.89 | Phe-Gly | 2.43 | 2.37 | 1.58 | 2.33 |
| Gly-Gln | 2.15 | 2.37 | 2.83 | 2.78 | Pro-Gly | 1.77 | 2.41 | 1.73 | 1.90 |
| Gly-Glu | 2.27 | 1.84 | 2.82 | 1.96 | Ser-Gly | 2.07 | 2.37 | 2.23 | 1.67 |
| Gly-Gly | 2.14 | 2.34 | 2.37 | 2.11 | Thr-Gly | 2.00 | 2.41 | 2.31 | 2.21 |
| Gly-His | 2.51 | 2.35 | 2.82 | 2.86 | Trp-Gly | 2.23 | 2.34 | 2.13 | 2.88 |
| Gly-Ile | 2.92 | 2.57 | 2.79 | 2.69 | Tyr-Ala | 2.70 | 2.82 | 3.32 | 3.07 |
| Gly-Leu | 2.60 | 2.58 | 2.81 | 3.05 | Tyr-Gly | 3.34 | 2.36 | 2.94 | 2.75 |
| Gly-Lys | 2.27 | 2.18 | 2.90 | 2.49 | Val-Gly | 2.96 | 2.42 | 3.11 | 3.14 |
| Gly-Met | 2.85 | 2.83 | 2.82 | 2.48 | Val-Phe | 4.28 | 4.17 | 4.17 | 4.30 |
| Gly-Phe | 3.20 | 3.30 | 2.87 | 3.06 | Val-Pro | 3.38 | 3.81 | 3.97 | 4.11 |
| Gly-Pro | 3.35 | 3.03 | 2.71 | 3.05 | Val-Trp | 5.80 | 5.53 | 4.67 | 5.20 |
| Gly-Ser | 2.42 | 2.23 | 2.65 | 2.34 | Val-Tyr | 4.66 | 4.66 | 4.30 | 4.66 |
Fig. 2 Predicted vs. observed values (panel label: R1).
These results can be compared with previous reports on the same series. The results are summarized in the following table (Table 3):
Table 3. Results obtained with other QSAR approaches compared with the MARCH-INSIDE approach.
| N | R2 | SD | Q2 | Descriptor | Ref |
|---|---|---|---|---|---|
| 58 | 0.769 | - | - | Z scores | Collantes, 1995 |
| 58 | 0.878 | 0.35 | 0.75 | MHDV scores | Shushen, 2001 |
| 58 | 0.705 | 0.54 | - | T scores | Shushen, 2001 |
| 58 | 0.747 | 0.50 | - | MD-WHIM scores | Shushen, 2001 |
| 58 | 0.847 | 0.42 | 0.80 | absπR2k SRπR2k ΘkR2 (Eq. 8) | This paper |
| 58 | 0.740 | 0.56 | 0.62 | absπR1k SRπR1k ΘkR1 (Eq. 7) | This paper |
| 58 | 0.835 | 0.45 | 0.76 | SRπk Θk (Eq. 6) | This paper |
As can be seen from the above table, our methodology gives results similar to those of well-established descriptors such as Z, MHDV and MD-WHIM scores (and better ones if only Eqs. 6 and 8 are compared with the Z and MD-WHIM descriptors). Even for MHDV, which at first sight seems to give better statistical parameters, our two best models (6 and 8) show higher predictive ability in cross-validation, as reflected in their Q2 values.
On the other hand, a close look at the residuals of the predicted cases for each model shows that cases such as Ile-Phe have the largest residuals and deleted residuals (the residuals can be obtained from Table 2). This behaviour has been reported previously (32) and may be due to assaying the cyclized form of the dipeptide, a very common phenomenon in peptides containing lipophilic amino acids such as Phe and Ile.
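As a small worked illustration of the residual analysis, the snippet below computes observed-minus-predicted residuals for a few rows copied from Table 2. Only plain residuals are shown; deleted residuals would require refitting the regression equations, which are not reproduced here. The selection of rows and the function name `residuals` are choices made for this sketch.

```python
# A few rows copied from Table 2: (observed, pred. Eq. 8, pred. Eq. 7, pred. Eq. 6).
rows = {
    "Ile-Phe": (3.03, 4.09, 4.46, 4.06),
    "Gly-Trp": (4.52, 4.25, 2.95, 3.78),
    "Tyr-Gly": (3.34, 2.36, 2.94, 2.75),
    "Val-Tyr": (4.66, 4.66, 4.30, 4.66),
}


def residuals(table):
    """Observed minus predicted pIC50 for each equation, rounded to 2 decimals."""
    return {
        name: tuple(round(obs - pred, 2) for pred in preds)
        for name, (obs, *preds) in table.items()
    }


for name, res in residuals(rows).items():
    print(f"{name}: Eq. 8 {res[0]:+.2f}, Eq. 7 {res[1]:+.2f}, Eq. 6 {res[2]:+.2f}")
```

For Ile-Phe the three residuals are -1.06, -1.43 and -1.03, i.e. all three equations over-predict its activity by more than one log unit.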
Interpreting these models solely from the coefficients of the variables in the regression equations is quite complicated, but in all the equations the contribution of entropic and probability factors (both self-return and absolute) can be pointed out, except for equation 8, where the entropic factors are absent.
Angiotensin-Converting Enzyme is the target of many antihypertensive drugs. In this work several models were found, by applying several statistical techniques, that allow the quantitative and qualitative estimation of the possible inhibitory activity of dipeptides. The multiple regression models explain at least 74% of the experimental variance in the data (two of them explain more than 80%); all of them have adequate statistical parameters and high stability in the leave-one-out (LOO) cross-validation procedure. These results confirm the applicability of stochastic-based descriptors to describing biological properties of dipeptides. This methodology is being applied to other, more complicated problems in bioinformatics, such as RNA-drug interactions and mutation-based protein stability.