Study of amino acid correlation in proteins and its application in structural analysis

Du Qishi a*, Wang Shuqing a, Wei Dongqing a, Li Aixiu a,b
(a Institute of Bioinformatics and Drug Discovery, Tianjin Normal University; b Medical College of Chinese People's Armed Police Forces, Tianjin, 300162)

Abstract In this study we calculate the correlation coefficients among the 20 amino acids in four types of proteins (α, β, α/β, and α+β), 204 proteins in total, and analyze the resulting correlations. The results show that amino acid correlations can be divided into strong positive correlation, strong negative correlation, weak correlation, and no correlation. The correlations among amino acids differ considerably between the four protein types and reflect the structural characteristics of each type. We analyze the relationship between amino acid correlation and protein structure, and explain the physicochemical origins of amino acid correlations.
Keywords Amino acids; Protein structure; Bioinformatics; Chemometrics

Amino Acid Correlation in Proteins and Its Application in Structural Analysis

Du Qishi *a, Wang Shuqing a, Li Aixiu a,b
(a Institute of Bioinformatics and Drug Development, Tianjin Normal University, Tianjin, 300074; b Chinese People's Armed Police Force Medical College, Tianjin, 300162)

Accepted February 22, 2004. This project was funded by the National Natural Science Foundation of China (20373048) and the Basic Science Key Project of the Tianjin Municipal Science and Technology Commission (023618211).

Abstract In this paper, the correlations among the 20 amino acids in 204 proteins of 4 types (type α, type β, type α/β, and type α+β) are calculated and analyzed. The study finds that the correlation between amino acids can be divided into strong positive correlation, strong negative correlation, weak correlation, and no correlation. Amino acids are the building blocks of proteins, and the correlations among the 20 amino acids in the different protein types reflect the matching rules among these building blocks and represent the structural features of the proteins. This paper analyzes the relationship between some amino acid correlations and protein structure, and explains the origin of amino acid correlations in terms of physical and chemical properties.
Key words amino acid; protein structure; bioinformatics; chemometrics

Amino acids are the building blocks from which proteins are constructed. To build structural units (such as the α-helix and the β-strand) and overall structures (such as globular proteins) with specific conformations, there must be a certain matching among these building blocks. That is, within a given protein fold, a particular amino acid tends to be connected to some amino acids but not to others, so there is a certain correlation between amino acids. In this study we apply the correlation analysis method of chemometrics [1] to the amino acids in type α, type β, type α/β, and type α+β proteins. The correlation of amino acids arises from their physical and chemical properties, such as the hydrophilicity or hydrophobicity of the side chains, the volume of the side chains, and the number of hydrogen bond donors and acceptors. We then analyze the physicochemical origin of some amino acid correlations and give a reasonable explanation for them.

  1. Amino acid correlation analysis
    Assume that N known proteins form a set S, which is the union of m subsets $S_\xi$,

$$ S = S_1 \cup S_2 \cup \cdots \cup S_m = \bigcup_{\xi=1}^{m} S_\xi \tag{1} $$

Each subset $S_\xi$ corresponds to one protein class and contains $n_\xi$ proteins, so that $N = \sum_{\xi} n_\xi$. Each protein is a vector $X_{\xi,k}$, that is, a point in the 20-dimensional amino acid space,

$$ X_{\xi,k} = \left( x_{\xi,k,1},\; x_{\xi,k,2},\; \ldots,\; x_{\xi,k,20} \right)^{T} \tag{2} $$

In formula (2), $x_{\xi,k,i}$ is the percentage frequency of the i-th amino acid in the k-th protein of subset $S_\xi$, subject to the normalization condition (the sum is 100 if the frequencies are expressed in percent),

$$ \sum_{i=1}^{20} x_{\xi,k,i} = 1 \tag{3} $$

Each subset $S_\xi$ has a standard vector (or mean vector),

$$ \bar{X}_{\xi} = \left( \bar{x}_{\xi,1},\; \bar{x}_{\xi,2},\; \ldots,\; \bar{x}_{\xi,20} \right)^{T} \tag{4} $$

whose components are the average values of the corresponding amino acid composition over all proteins in the subset $S_\xi$,

$$ \bar{x}_{\xi,i} = \frac{1}{n_\xi} \sum_{k=1}^{n_\xi} x_{\xi,k,i} \tag{5} $$

The percentage compositions of the proteins in the subset $S_\xi$ constitute the matrix $[X_\xi]_{n_\xi \times 20}$. We calculate the covariance matrix $[C_\xi]_{20 \times 20}$ of the subset $S_\xi$ according to the following formula (the constant prefactor cancels in formula (7)),

$$ c_{\xi,i,j} = \frac{1}{n_\xi - 1} \sum_{k=1}^{n_\xi} \left( x_{\xi,k,i} - \bar{x}_{\xi,i} \right) \left( x_{\xi,k,j} - \bar{x}_{\xi,j} \right) \tag{6} $$

    The covariance matrix expresses the dispersion of the protein compositions in the subset about the mean value. In earlier protein structure prediction based on amino acid composition [2-5], the focus was on the "distance" between the composition of an unknown protein and the standard protein composition of each class, and the unknown protein was assigned to the protein type with the smallest distance. The amino acid correlation analysis method proposed in this paper approaches the problem from a different angle and is founded on the correlation between amino acids. The method assumes that the amounts of the various amino acids can vary widely within the same type of protein, but that these variations constrain one another and follow certain correlations. For this reason, we start from the covariance matrix $C_\xi$ and construct the correlation matrix $[R_\xi]_{20 \times 20}$ between the amino acids of the subset $S_\xi$,

$$ r_{\xi,i,j} = \frac{c_{\xi,i,j}}{\sqrt{c_{\xi,i,i}\, c_{\xi,j,j}}} \tag{7} $$

The correlation matrix $R_\xi$ is a symmetric matrix whose diagonal elements are all 1, $r_{\xi,i,i} = 1$. There should be certain matching rules between amino acids when they form the secondary structure units of a protein (α-helix, β-strand, etc.): one amino acid may be connected to some amino acids more frequently, and to other amino acids less frequently. The correlation matrix is derived from the protein composition matrix, expresses the quantitative dependence between amino acids, and reflects the structural characteristics of each protein type.
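As a minimal sketch of formulas (4) to (7), the following Python code computes composition vectors, the covariance about the mean vector, and the correlation matrix for one protein class. The function names, the toy sequences, and the n-1 divisor in the covariance are assumptions made for illustration and are not taken from the original work; the divisor cancels in the correlation coefficient.

```python
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def composition(sequence):
    """Fractional amino acid composition of one sequence (components sum to 1)."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total for c in counts]


def correlation_matrix(rows):
    """Correlation matrix between the columns of `rows` (one composition per protein)."""
    n, d = len(rows), len(rows[0])
    # Mean (standard) vector of the class, formulas (4) and (5).
    mean = [sum(r[i] for r in rows) / n for i in range(d)]
    # Covariance matrix, formula (6); the 1/(n-1) prefactor is an assumption.
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    # Correlation matrix, formula (7).
    corr = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            denom = sqrt(cov[i][i] * cov[j][j])
            if denom > 0:
                corr[i][j] = cov[i][j] / denom
            else:
                # A residue with zero variance has no defined correlation;
                # this sketch reports 1 on the diagonal and 0 elsewhere.
                corr[i][j] = 1.0 if i == j else 0.0
    return corr


# Hypothetical toy sequences, NOT proteins from Table 1.
toy_sequences = ["AADDIIKL", "ADDDIIIV", "AAAADIVV", "AVVVVDKK"]
rows = [composition(s) for s in toy_sequences]
corr = correlation_matrix(rows)
d, i = AMINO_ACIDS.index("D"), AMINO_ACIDS.index("I")
print("r(D, I) =", round(corr[d][i], 3))
```

Applied to the real composition matrix of one class (for example the 61 β-type proteins), the same routine would yield a 20×20 matrix of the kind listed in Tables 2 and 3.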

  2. Calculation results
    We used the protein training set of reference [6] and selected 204 proteins of 4 types from the Protein Data Bank (PDB): 52 of type α, 61 of type β, 45 of type α/β, and 46 of type α+β. The PDB codes of the proteins are listed in Table 1.

Table 1 PDB codes of the 204 proteins used in the calculations [6]

1. α type (52 proteins)
1vls_ 1ithA 1spgA 1pbxB 1emy_ 2pghB
1aep_ 1ilk_ 2fal_ 2gdm_ 1fslA 1sctB 1hdaB 1hdaA
1ash_ 1maz_ 2hbg_ 2lhb_ 1helm_ 1babB 1hdsA 1hrm_
1bcfA 1mls_ 3sdhA 1hdsB 1lht_ 2asr_ 1ibeB 1mA
1cnt1 1rhgA 1allA 1myt_ August 1 1babA 1mbs_ 1vlk_
1when_ 1spgB 1flp_ 1osa_ 1outB 1bgc_ 2mm1_
1hlb_ 1sra_ 1ibeA 1sctA 1pbxA 1bgeA 2pghA

2. β type (61 proteins)
1yna_ 1bbdH 1nldH 1bafL 1iaiM 1nsnH
1bbt2 3hhrC 8fabA 1eapA 1opgL 1bjmA 1igcL 1plgH
1cfb_ 6fabl 1flrH 1gafL 1ospL 1bqlH 1ikfL 1plgL
1edhA 8fabB 1ggiH 1gbg_ 1vgeL 1bqlL 1indL 1tetH
1gen_ 1pex_ 1indH 1ggiL 2fbjL 1dfbL 1macA 1xnd_
1sacA 1vcaA 1JELH 1ghfH 2mcg1 1forL 1mamL 1yuhA
1tcrA 1mfbL 2cgrH 1hilB 7fabL 1ghfL 1mreH 3hfmH
2ayh_ 1gnhA 7fabH 1ncbL 1acyL 1iaiL 1ngqH

3. α/β type (45 proteins)
1nar_ 1vdc_ 2ebn_ 1obr_ 1lwiA 1cerO
1amp_ 1ghr_ 1pbn_ 1vpt_ 3pga1 1cnv_ 1wsaB 1gia_
1ceo_ 1gym_ 1pfkA 1xel_ 8abp_ 1exp_ 2alr_ 2lip_
1cvl_ 1lbA 1sbp_ 1xyzA 1emp_ 1trb_ 3ecaA 1ula_
1d or A 1lucA 1 scA 2bgu_ 1gdhA 1ghsA 4pfk_ 2gbp_
1gca_ March 1 1thtA 2ctc_ 1lucB 1hdgO 1agx_

4. α+β type (46 proteins)
1gtqA 1mA 1seiA 1vhiA 1apyB 2prd_
2aak_ 1def_ 1hjrA 1msc_ 1sfe_ 1vsd_ 1div_ 1hup_
1afb1 1doi_ 1htp_ 1nhkL 1snc_ 1whtB 1pvuA 1nueA
1bplA 1epaB 1ino_ 1pkp_ 1std_ 1ytbB 1pc_ 1cdwA
1cof_ 1file_ 1itg_ 1poc_ 1tfe_ 2tbd_ 1qcqA 1pne_
1 chick_ 1grj_ 1lit_ 1rbu_ 1vhh_ 8atcB 1ril_ 2 km1

    The standard protein (mean composition vector) of each subset was calculated according to formulas (4) and (5), and the results are shown in Fig. 1. The standard proteins of type α and type β differ considerably in composition, but those of type α/β and type α+β are very similar. This result shows that it is difficult to distinguish α/β-type from α+β-type proteins by amino acid composition alone.
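The difficulty described above can be illustrated with a composition-distance classifier of the kind used in earlier work, where an unknown protein is assigned to the class whose standard vector is closest. The sketch below is only an illustration: the Euclidean distance, the function names, and the three-component toy vectors are all assumptions (they are not the 20-component standard proteins of Fig. 1, and references [2-5] may use a different distance measure).

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two composition vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_class(x, standards):
    """Name of the class whose standard vector is closest to composition x."""
    return min(standards, key=lambda name: euclidean(x, standards[name]))


# Hypothetical 3-component standard vectors, chosen so that the two mixed
# classes nearly coincide, as the text reports for alpha/beta and alpha+beta.
standards = {
    "alpha": [0.50, 0.30, 0.20],
    "beta": [0.20, 0.30, 0.50],
    "alpha/beta": [0.34, 0.33, 0.33],
    "alpha+beta": [0.33, 0.33, 0.34],
}

unknown = [0.45, 0.30, 0.25]  # hypothetical query composition
print(nearest_class(unknown, standards))
print(euclidean(standards["alpha"], standards["beta"]))
print(euclidean(standards["alpha/beta"], standards["alpha+beta"]))
```

When two standard vectors are nearly identical, as in the last distance printed, small fluctuations in composition decide the assignment, which is why a criterion based on correlations rather than on mean composition is of interest.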

Table 2 Correlation coefficients between the 20 amino acids in type α proteins, calculated for the proteins listed in Table 1 (lower triangle)

A 1
C -0.300 1
D -0.281 -0.011 1
E -0.565 -0.016 0.215 1
F 0.126 0.091 0.174 -0.132 1
G 0.216 -0.162 -0.140 -0.182 0.035 1
H -0.194 -0.181 -0.130 0.022 0.019 0.164 1
I -0.269 -0.130 0.158 0.369 -0.390 0.036 -0.249 1
K -0.036 -0.316 0.182 0.141 0.314 0.151 0.402 0.145 1
L -0.276 0.446 -0.214 -0.087 -0.101 -0.049 0.340 -0.413 -0.208 1
M -0.202 0.013 0.178 0.167 -0.219 -0.138 -0.583 0.326 -0.377 -0.278 1
N 0.027 -0.046 -0.015 -0.064 0.073 -0.380 -0.241 -0.182 -0.190 -0.138 0.165 1
P -0.085 0.465 -0.421 -0.217 -0.065 -0.044 0.101 -0.127 -0.341 0.402 -0.147 -0.318 1
Q -0.056 0.242 -0.487 0.110 -0.359 -0.258 -0.248 -0.026 -0.506 0.310 0.286 0.233 0.175 1
R -0.541 0.368 0.265 0.244 -0.218 -0.361 -0.363 0.185 -0.456 0.029 0.540 0.158 -0.036 0.195 1
S 0.498 -0.173 -0.195 -0.436 -0.054 -0.122 -0.210 -0.233 -0.314 -0.077 0.079 -0.086 0.118 -0.034 -0.058 1
T 0.037 -0.097 -0.074 -0.091 -0.240 -0.416 -0.013 -0.013 -0.274 -0.121 0.103 -0.036 0.246 0.157 0.080 0.167 1
V 0.277 -0.226 0.075 -0.421 0.246 0.235 0.067 -0.310 0.105 -0.203 -0.392 0.070 -0.099 -0.514 -0.273 0.055 -0.114 1
W 0.060 -0.189 -0.130 0.021 -0.004 0.134 0.055 0.187 0.013 -0.416 -0.102 -0.002 0.067 -0.094 -0.160 -0.046 -0.036 0.232 1
Y -0.297 0.186 0.262 0.031 -0.264 -0.288 -0.298 0.176 -0.136 -0.110 0.394 0.139 -0.147 0.022 0.569 -0.001 0.100 -0.200 -0.141
A C D E F G H I K L M N P Q R S T V W

Table 3 Correlation coefficients between the 20 amino acids in type β proteins, calculated for the proteins listed in Table 1 (lower triangle)

A 1
C -0.051 1
D -0.275 -0.256 1
E 0.021 -0.057 0.258 1
F -0.178 -0.489 0.484 0.204 1
G 0.017 -0.447 -0.274 -0.397 0.031 1
H -0.071 -0.138 0.022 0.273 0.122 0.104 1
I -0.291 -0.387 0.660 0.433 0.492 -0.281 0.114 1
K 0.010 0.098 0.227 0.272 0.373 -0.334 -0.210 0.223 1
L -0.213 0.350 -0.099 0.030 0.027 -0.086 -0.078 -0.218 0.191 1
M -0.226 -0.083 0.037 0.109 0.119 -0.175 0.021 0.003 0.134 -0.258 1
N -0.155 -0.593 0.328 0.029 0.366 0.268 0.287 0.413 -0.255 -0.599 0.152 1
P 0.373 0.103 -0.154 -0.017 -0.130 -0.304 -0.140 -0.159 0.208 0.214 -0.055 -0.466 1
Q 0.133 0.307 0.027 -0.099 -0.184 -0.456 -0.277 0.027 -0.054 -0.006 -0.291 -0.149 -0.065 1
R -0.378 0.014 0.448 0.292 0.193 -0.255 0.140 0.451 -0.168 0.177 -0.036 0.079 -0.223 0.076 1
S -0.049 0.608 -0.329 -0.445 -0.506 -0.210 -0.465 -0.321 -0.051 0.115 -0.169 -0.431 -0.049 0.455 -0.103 1
T 0.065 0.380 -0.482 -0.321 -0.566 0.116 -0.023 -0.568 -0.325 -0.006 0.086 -0.220 -0.054 0.030 -0.327 0.358 1
V 0.300 -0.007 -0.466 -0.156 -0.281 0.147 0.033 -0.543 -0.276 0.196 -0.015 -0.363 0.493 -0.181 -0.263 -0.111 0.198 1
W -0.042 -0.432 -0.126 -0.205 0.157 0.551 0.113 -0.058 -0.164 -0.290 0.139 0.328 0.015 -0.475 -0.352 -0.428 -0.069 0.042 1
Y -0.112 -0.475 -0.014 -0.412 0.032 0.609 -0.016 -0.081 -0.160 -0.362 0.117 0.402 -0.376 -0.230 -0.218 -0.119 -0.109 -0.138 0.498
A C D E F G H I K L M N P Q R S T V W

    We calculated the correlation matrices according to formulas (6) and (7). Because the matrices are symmetric, only the lower triangular part is listed. Table 2 gives the correlation matrix of type α proteins and Table 3 that of type β proteins. Among the 20 amino acids there are $C_{20}^{2} = 190$ amino acid pairs. The absolute value of a correlation coefficient lies between 0 and 1, ranging from completely uncorrelated to completely correlated, and its sign distinguishes positive from negative correlation. Most of the correlation coefficients in Table 2 and Table 3 are far smaller than 1 in absolute value, indicating that there is no specific correlation between most amino acids and that they can largely substitute for one another. However, there are also individual amino acid pairs whose correlation coefficients are significantly larger in magnitude than the average. According to the size of the correlation coefficient, we divide amino acid correlation into strong positive correlation (correlation coefficient > 0.5), strong negative correlation (correlation coefficient < -0.5), weak correlation (|correlation coefficient| < 0.05), and no correlation (|correlation coefficient| < 0.01). The correlations of several amino acid pairs with strong positive or strong negative correlation in type α and type β proteins are shown in Fig. 2, Fig. 3, Fig. 4, and Fig. 5.

Fig. 1 The average amino acid content of the 4 types of proteins (52 of type α, 61 of type β, 45 of type α/β, and 46 of type α+β). Type α and type β differ considerably, but the compositions of type α/β and type α+β are very similar.

Fig. 2 Correlation between amino acids H and M in type α proteins, correlation coefficient $R_{\text{H-M}} = -0.583$. H and M are negatively correlated: when the content of H is high, the content of M is low.

Fig. 3 Correlation between amino acids R and Y in type α proteins, correlation coefficient $R_{\text{R-Y}} = 0.569$. R and Y are positively correlated: when the content of R is high, the content of Y is also high.

Fig. 4 Correlation between amino acids D and I in type β proteins, correlation coefficient $R_{\text{D-I}} = 0.660$. D and I are positively correlated: when the content of D is high, the content of I is also high.

Fig. 5 Correlation between amino acids V and I in type β proteins, correlation coefficient $R_{\text{V-I}} = -0.543$. V and I are negatively correlated: when the content of V is high, the content of I is low.
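The classification by correlation coefficient described above can be written as a small function. This is a sketch under stated assumptions: the function name and the "unclassified" label are introduced here for illustration, since the text gives no name to coefficients with magnitude between 0.05 and 0.5. The example coefficients are read from Table 2 (type α proteins).

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def classify(r):
    """Map a correlation coefficient to the categories used in the text."""
    if r > 0.5:
        return "strong positive"
    if r < -0.5:
        return "strong negative"
    if abs(r) < 0.01:
        return "no correlation"
    if abs(r) < 0.05:
        return "weak"
    # Assumption: the text assigns no category to 0.05 <= |r| <= 0.5.
    return "unclassified"


# Number of distinct amino acid pairs among the 20 residues.
n_pairs = len(list(combinations(AMINO_ACIDS, 2)))
print(n_pairs)

# Coefficients taken from Table 2 (type alpha proteins).
examples = {
    ("H", "M"): -0.583,
    ("R", "Y"): 0.569,
    ("N", "W"): -0.002,
    ("C", "D"): -0.011,
    ("A", "S"): 0.498,
}
for pair, r in examples.items():
    print(pair, r, classify(r))
```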

  3. Analysis of results
    The correlation coefficient of largest magnitude in type α proteins is that between amino acids M and H, $R_{\text{M-H}} = -0.583$, indicating a mutual restriction between M and H in the form of a strong negative correlation (Fig. 2). It is worth noting that neither M nor H occurs very frequently in α-type proteins (see Fig. 1). The largest correlation coefficient in type β proteins is that between amino acids D and I, $R_{\text{D-I}} = 0.660$, indicating a strong positive correlation between D and I (Fig. 4).

Figure 6 Molecular structural formulas and side chain properties of aspartic acid (D, Asp), isoleucine (I, Ile), and valine (V, Val)

    The correlation between amino acids reflects the restrictions they impose on one another. We take the β-type protein 1acyL, the HIV-1 GP120 protein, as an example to illustrate the physicochemical origin of the strong positive correlation between aspartic acid (D, Asp) and isoleucine (I, Ile) in β-type proteins. Figure 6 shows the structural formulas of D (Asp), I (Ile), and V (Val). Aspartic acid D has a strongly hydrophilic side chain, $-\mathrm{CH_2COO^-}$, whereas the side chains of isoleucine I and valine V are both strongly hydrophobic.

Figure 7 Amino acid sequence and secondary structure analysis of the HIV-1 GP120 protein 1acyL [7,8]

    Figure 7 shows the amino acid sequence and secondary structure analysis of the HIV-1 GP120 protein (1acyL) [7,8]. In Fig. 7 we found four places where amino acids D and I are directly adjacent, which are marked in bold letters in the figure. Interestingly, all four occur at the beginning or end of a β-strand, that is, where the chain turns. The hydrophobic and hydrophilic properties of amino acids play a very important role in protein structure [9,10]. Rose's research [11,12] showed that the hydrophobicity of amino acids in proteins is related to the hydrophobic regions of the sequence, and that the turns between secondary structure units can be located on this basis. Since D is a strongly hydrophilic amino acid and I is a strongly hydrophobic amino acid, D and I often appear as a pair at the turn of a β-strand, which results in a strong positive correlation between D and I. Besides appearing in pairs, D and I also occur scattered at many other positions, so their distribution is a mixture of random and correlated occurrence, and the overall correlation coefficient is not very high (0.660).
    Figure 5 shows the strong negative correlation between V (Val) and I (Ile) in β-type proteins. The molecular structures of V and I in Figure 6 show that both are strongly hydrophobic amino acids that differ by only one methylene group ($-\mathrm{CH_2}-$). In the hydrophobic parts of a β-type protein, the presence of one of the two reduces the probability of the other, which may be why V and I show a strong negative correlation.
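The search for directly adjacent D and I residues described above can be sketched as follows. The function name and the example sequence are hypothetical (the string below is made up for illustration and is not the 1acyL sequence of Figure 7), and the sketch only locates the pairs; whether a pair sits at the end of a β-strand still has to be read from the secondary structure assignment.

```python
def adjacent_pairs(sequence, a, b):
    """0-based positions i where residues i and i+1 are a and b, in either order."""
    wanted = {a, b}
    return [i for i in range(len(sequence) - 1)
            if {sequence[i], sequence[i + 1]} == wanted]


# Hypothetical sequence used only to exercise the function.
toy_sequence = "DIQMTQSPIDVLTI"
positions = adjacent_pairs(toy_sequence, "D", "I")
print(positions)
for i in positions:
    print(i, toy_sequence[i:i + 2])
```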

  4. Conclusion
    Both theoretical analysis and computational practice have shown that there are specific correlations between amino acids in the various types of proteins. The correlation between amino acids originates from their specific physicochemical properties. Some amino acid pairs have obvious correlations that are easy to explain, such as the positive correlation between aspartic acid D and isoleucine I and the negative correlation between isoleucine I and valine V in type β proteins. However, because of the complexity of protein structure, not all correlations arise between adjacent amino acids, as they do for D and I. Correlations may also occur between amino acids that are relatively far apart in the sequence. The underlying causes of such correlations are not obvious and are not easy to explain; we will gradually decipher them in future work. Amino acids are the basic building blocks of proteins, and the correlations among them are the matching rules of these building blocks, representing the structural features of each protein type. The study of amino acid correlation can be used in the prediction of protein structure type, which we will report in future studies.

REFERENCES
[1] Xu Lu. Chemometrics. Science Publishing House, 1985.
[2] Chou K C. Current Protein and Peptide Science, 2000,1: 171.
[3] Chou K C. Proteins: Structure, Function and Genetics, 1995, 21: 319.
[4] Chou K C. FEBS Letters, 1995, 363: 127.
[5] Chou K C, Liu W, Maggiora G M et al. Proteins: Structure, Function and Genetics, 1998, 31: 97.
[6] Berman H M, Westbrook J, Feng Z et al. Nucleic Acids Research, 2000, 28: 235.
[7] Ghiara J B, Stura E A, Stanfield R L et al. Science, 1994, 264: 82.
[8] http://www.rcsb.org/pdb/
[9] Qi J X, Xiao Y. Science B, 2003, 47 (6): 425.
[10] Qiu J D, Liang R P, Zou X Y et al. Chemical Journal, 2003, 61: 748.
[11] Rose G D, Wolfenden R. Hydrogen bonding, hydrophobicity, packing, and protein folding. Annu Rev Biophys Biomol Struct, 1993, 22: 381.
[12] Mandell A J, Selz K A, Shlesinger M F. Physica A, 1997, 244: 254.