Du Qishi a,*, Wang Shuqing a, Wei Dongqing a, Li Aixiu a,b
(a Institute of Bioinformatics and Drug Discovery, Tianjin Normal University; b Medical College of Chinese People’s Armed Police Forces, Tianjin, 300162)
Abstract In this research we calculate the correlation coefficients and analyze the correlations among the 20 amino acids in four protein structural classes (α, β, α/β, and α+β), 204 proteins in total. Our results show that amino acid correlations can be divided into strong positive correlation, strong negative correlation, weak correlation, and no correlation. The amino acid correlations in the four protein classes are quite different and reflect the structural characteristics of each class. We analyze the relationship between amino acid correlation and protein structure and explain the physicochemical origins of the amino acid correlations.
Keywords Amino acids; Protein structure; Bioinformatics; Chemometrics
Amino Acid Correlation in Proteins and Its Application in Structural Analysis
Du Qishi a,*, Wang Shuqing a, Li Aixiu a,b
(a Institute of Bioinformatics and Drug Discovery, Tianjin Normal University, Tianjin, 300074; b Medical College of Chinese People’s Armed Police Forces, Tianjin, 300162)
Accepted February 22, 2004. This project was funded by the National Natural Science Foundation of China (20373048) and the Basic Science Key Project of Tianjin Municipal Science and Technology Commission (023618211).
Abstract In this paper, the correlations between the 20 amino acids in 204 proteins of 4 types (type α, type β, type α/β, and type α+β) were calculated and analyzed. The study found that the correlations between amino acids can be divided into strong positive correlation, strong negative correlation, weak correlation, and no correlation. Amino acids are the building blocks of proteins, and the correlations of the 20 amino acids in different types of proteins reflect the matching rules among these building blocks and represent the structural features of the proteins. This paper analyzes the relationship between some amino acid correlations and protein structure, and explains the origin of the correlations in terms of physicochemical properties.
Key words amino acid; protein structure; bioinformatics; chemometrics

Amino acids are the building materials of proteins. To build structural units (such as the α-helix and the β-strand) and overall structures (such as globular proteins) with specific conformations, there must be a certain matching between these building components. That is, within a specific protein conformation a given amino acid can be combined with some amino acids but not with others, so there are definite correlations between amino acids. In this study we apply the correlation analysis method of chemometrics [1] to analyze the correlations between amino acids in type α, type β, type α/β, and type α+β proteins. The correlations between amino acids originate in their physicochemical properties, such as the hydrophilicity and hydrophobicity of the side chains, the volume of the side chains, and the number of hydrogen bond donors and acceptors. We then analyze the physicochemical origin of some of the amino acid correlations and give a reasonable explanation for them.
1 Amino Acid Correlation Analysis

Suppose that N known proteins form a set S, which is the union of m subsets $S_\xi$,

$$S = S_1 \cup S_2 \cup \cdots \cup S_m \tag{1}$$

Each subset $S_\xi$ corresponds to one protein class and contains $n_\xi$ proteins, so that $N = \sum_\xi n_\xi$. Each protein is a vector $\mathbf{X}_{\xi,k}$, or a point in the 20-dimensional amino acid space,

$$\mathbf{X}_{\xi,k} = (x_{\xi,k,1}, x_{\xi,k,2}, \ldots, x_{\xi,k,20}) \tag{2}$$

In formula (2), $x_{\xi,k,i}$ is the percentage frequency of the i-th amino acid in the k-th protein of subset $S_\xi$, subject to the normalization condition

$$\sum_{i=1}^{20} x_{\xi,k,i} = 100\% \tag{3}$$

Each subset $S_\xi$ has a standard vector (or mean vector),

$$\bar{\mathbf{X}}_{\xi} = (\bar{x}_{\xi,1}, \bar{x}_{\xi,2}, \ldots, \bar{x}_{\xi,20}) \tag{4}$$

whose components are the averages of the corresponding amino acid compositions over all proteins in subset $S_\xi$,

$$\bar{x}_{\xi,i} = \frac{1}{n_\xi} \sum_{k=1}^{n_\xi} x_{\xi,k,i}, \qquad i = 1, 2, \ldots, 20 \tag{5}$$

The percentage compositions of the proteins in subset $S_\xi$ constitute the matrix $[X_\xi]_{n_\xi \times 20}$. We calculate the covariance matrix $[C_\xi]_{20 \times 20}$ of subset $S_\xi$ according to the following formula,

$$C_{\xi,i,j} = \frac{1}{n_\xi - 1} \sum_{k=1}^{n_\xi} (x_{\xi,k,i} - \bar{x}_{\xi,i})(x_{\xi,k,j} - \bar{x}_{\xi,j}) \tag{6}$$

The covariance matrix expresses how widely the protein compositions in the subset are dispersed about the mean. In earlier protein structure prediction based on amino acid composition [2-5], the focus was on the “distance” between the composition of an unknown protein and the standard protein compositions: the unknown protein is assigned to the protein type with the smallest distance. The amino acid correlation analysis proposed in this paper approaches the problem from a different angle and is based on the correlation between amino acids. The method assumes that the amounts of the various amino acids in proteins of the same type can vary widely, but that these variations constrain one another and follow definite correlations. For this reason, we start from the covariance matrix $C_\xi$ and construct the correlation matrix $[R_\xi]_{20 \times 20}$ between the amino acids of subset $S_\xi$,

$$R_{\xi,i,j} = \frac{C_{\xi,i,j}}{\sqrt{C_{\xi,i,i}\, C_{\xi,j,j}}} \tag{7}$$

The correlation matrix $R_\xi$ is symmetric, and its diagonal elements are all 1, $R_{\xi,i,i} = 1$. There should be certain matching rules between amino acids when they form the secondary structure units of a protein (α-helix, β-strand, etc.): an amino acid may be combined with some amino acids more frequently and with others less frequently. The correlation matrix is derived from the protein composition matrix, expresses the quantitative dependence between amino acids, and reflects the structural characteristics of the protein types.
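The chain from composition vectors to the correlation matrix can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names, the toy sequences, the use of fractions that sum to 1 rather than percentages, and the 1/(n-1) covariance factor are all assumptions. The correlation coefficients do not depend on that factor or on the choice between fractions and percentages, because both cancel in the ratio.

```python
import math

# Standard one-letter codes, in the alphabetical order used by Tables 2 and 3.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fractional frequency of each of the 20 amino acids (components sum to 1)."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # letters outside the 20 standard codes are ignored
    return [c / total for c in counts]


def mean_vector(rows):
    """Component-wise mean of the composition vectors (the standard vector)."""
    n = len(rows)
    return [sum(row[i] for row in rows) / n for i in range(len(rows[0]))]


def covariance_matrix(rows):
    """Sample covariance between components; the 1/(n-1) factor is an assumption."""
    n = len(rows)
    mean = mean_vector(rows)
    dim = len(mean)
    return [
        [sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
         for j in range(dim)]
        for i in range(dim)
    ]


def correlation_matrix(rows):
    """Scale the covariance matrix so that every diagonal element equals 1."""
    cov = covariance_matrix(rows)
    dim = len(cov)
    corr = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(dim):
            if i == j:
                corr[i][j] = 1.0
                continue
            denom = math.sqrt(cov[i][i] * cov[j][j])
            # Convention for this sketch: a component with zero variance gets 0.
            corr[i][j] = cov[i][j] / denom if denom > 0 else 0.0
    return corr


# Hypothetical toy sequences standing in for the proteins of one class.
toy_class = ["ADDIIKLV", "AADIKKLLV", "DDIIIKVVA", "AKLVDIDI"]
rows = [composition(seq) for seq in toy_class]
R = correlation_matrix(rows)
d, i = AMINO_ACIDS.index("D"), AMINO_ACIDS.index("I")
print(f"R(D, I) = {R[d][i]:.3f}")
```

With real data, `rows` would hold one composition vector per protein of a class, for example the 61 type β proteins, and `R` would correspond to the full 20×20 matrix whose lower triangle is tabulated.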
2 Calculation Results
We used the protein training set of reference [6], selecting 204 proteins of 4 types from the Protein Data Bank (PDB): 52 of type α, 61 of type β, 45 of type α/β, and 46 of type α+β. The PDB codes of the proteins are listed in Table 1.
Table 1 PDB codes of the 204 proteins used in the calculations [6]
1. Type α (52) | 1vls_ | 1ithA | 1spgA | 1pbxB | 1emy_ | 2pghB | |
1aep_ | 1ilk_ | 2fal_ | 2gdm_ | 1fslA | 1sctB | 1hdaB | 1hdaA |
1ash_ | 1maz_ | 2hbg_ | 2lhb_ | 1helm_ | 1babB | 1hdsA | 1hrm_ |
1bcfA | 1mls_ | 3sdhA | 1hdsB | 1lht_ | 2asr_ | 1ibeB | 1mA |
1cnt1 | 1rhgA | 1allA | 1myt_ | August 1 | 1babA | 1mbs_ | 1vlk_ |
1when_ | 1spgB | 1flp_ | 1osa_ | 1outB | 1bgc_ | 2mm1_ | |
1hlb_ | 1sra_ | 1ibeA | 1sctA | 1pbxA | 1bgeA | 2pghA | |
2. Type β (61) | 1yna_ | 1bbdH | 1nldH | 1bafL | 1iaiM | 1nsnH | |
1bbt2 | 3hhrC | 8fabA | 1eapA | 1opgL | 1bjmA | 1igcL | 1plgH |
1cfb_ | 6fabL | 1flrH | 1gafL | 1ospL | 1bqlH | 1ikfL | 1plgL |
1edhA | 8fabB | 1ggiH | 1gbg_ | 1vgeL | 1bqlL | 1indL | 1tetH |
1gen_ | 1pex_ | 1indH | 1ggiL | 2fbjL | 1dfbL | 1macA | 1xnd_ |
1sacA | 1vcaA | 1jelH | 1ghfH | 2mcg1 | 1forL | 1mamL | 1yuhA |
1tcrA | 1mfbL | 2cgrH | 1hilB | 7fabL | 1ghfL | 1mreH | 3hfmH |
2ayh_ | 1gnhA | 7fabH | 1ncbL | 1acyL | 1iaiL | 1ngqH | |
3. Type α/β (45) | 1nar_ | 1vdc_ | 2ebn_ | 1obr_ | 1lwiA | 1cerO | |
1amp_ | 1ghr_ | 1pbn_ | 1vpt_ | 3pga1 | 1cnv_ | 1wsaB | 1gia_ |
1ceo_ | 1gym_ | 1pfkA | 1xel_ | 8abp_ | 1exp_ | 2alr_ | 2lip_ |
1cvl_ | 1lbA | 1sbp_ | 1xyzA | 1emp_ | 1trb_ | 3ecaA | 1ula_ |
1d or A | 1lucA | 1 scA | 2bgu_ | 1gdhA | 1ghsA | 4pfk_ | 2gbp_ |
1gca_ | March 1 | 1thtA | 2ctc_ | 1lucB | 1hdgO | 1agx_ | |
4. Type α+β (46) | 1gtqA | 1mA | 1seiA | 1vhiA | 1apyB | 2prd_ | |
2aak_ | 1def_ | 1hjrA | 1msc_ | 1sfe_ | 1vsd_ | 1div_ | 1hup_ |
1afb1 | 1doi_ | 1htp_ | 1nhkL | 1snc_ | 1whtB | 1pvuA | 1nueA |
1bplA | 1epaB | 1ino_ | 1pkp_ | 1std_ | 1ytbB | 1pc_ | 1cdwA |
1cof_ | 1file_ | 1itg_ | 1poc_ | 1tfe_ | 2tbd_ | 1qcqA | 1pne_ |
1 chick_ | 1grj_ | 1lit_ | 1rbu_ | 1vhh_ | 8atcB | 1ril_ | 2 km1 |
According to formulas (4) and (5), the standard protein (mean composition vector) of each subset was calculated; the results are shown in Fig. 1. The compositions of the standard proteins of type α and type β differ considerably, whereas those of type α/β and type α+β are very similar. This result shows that it is difficult to distinguish α/β proteins from α+β proteins by amino acid composition alone.
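The difficulty can be made concrete with a small sketch of composition-based assignment, in which a query protein goes to the class whose standard vector is nearest. The Euclidean distance, the function names, and the 3-component vectors below are illustrative assumptions only; the cited composition methods [2-5] use their own distance definitions, and real standard vectors have 20 components.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two composition vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_class(query, standards):
    """Name of the class whose standard vector lies closest to the query."""
    return min(standards, key=lambda name: euclidean_distance(query, standards[name]))


# Hypothetical 3-component standard vectors (real ones have 20 components).
standards = {
    "alpha": [0.50, 0.30, 0.20],
    "beta": [0.20, 0.30, 0.50],
    "alpha/beta": [0.34, 0.33, 0.33],
    "alpha+beta": [0.33, 0.34, 0.33],
}
query = [0.35, 0.32, 0.33]
for name, vec in standards.items():
    print(name, round(euclidean_distance(query, vec), 4))
print("nearest:", nearest_class(query, standards))
```

When two standard vectors nearly coincide, as the toy "alpha/beta" and "alpha+beta" entries do here, the two distances differ only marginally and the assignment becomes sensitive to small changes in composition.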
Table 2 Correlation coefficients between the 20 amino acids in type α proteins, calculated from the proteins listed in Table 1
A | 1 | ||||||||||||||||||
C | -0.300 | 1 | |||||||||||||||||
D | -0.281 | -0.011 | 1 | ||||||||||||||||
E | -0.565 | -0.016 | 0.215 | 1 | |||||||||||||||
F | 0.126 | 0.091 | 0.174 | -0.132 | 1 | ||||||||||||||
G | 0.216 | -0.162 | -0.140 | -0.182 | 0.035 | 1 | |||||||||||||
H | -0.194 | -0.181 | -0.130 | 0.022 | 0.019 | 0.164 | 1 | ||||||||||||
I | -0.269 | -0.130 | 0.158 | 0.369 | -0.390 | 0.036 | -0.249 | 1 | |||||||||||
K | -0.036 | -0.316 | 0.182 | 0.141 | 0.314 | 0.151 | 0.402 | 0.145 | 1 | ||||||||||
L | -0.276 | 0.446 | -0.214 | -0.087 | -0.101 | -0.049 | 0.340 | -0.413 | -0.208 | 1 | |||||||||
M | -0.202 | 0.013 | 0.178 | 0.167 | -0.219 | -0.138 | -0.583 | 0.326 | -0.377 | -0.278 | 1 | ||||||||
N | 0.027 | -0.046 | -0.015 | -0.064 | 0.073 | -0.380 | -0.241 | -0.182 | -0.190 | -0.138 | 0.165 | 1 | |||||||
P | -0.085 | 0.465 | -0.421 | -0.217 | -0.065 | -0.044 | 0.101 | -0.127 | -0.341 | 0.402 | -0.147 | -0.318 | 1 | ||||||
Q | -0.056 | 0.242 | -0.487 | 0.110 | -0.359 | -0.258 | -0.248 | -0.026 | -0.506 | 0.310 | 0.286 | 0.233 | 0.175 | 1 | |||||
R | -0.541 | 0.368 | 0.265 | 0.244 | -0.218 | -0.361 | -0.363 | 0.185 | -0.456 | 0.029 | 0.540 | 0.158 | -0.036 | 0.195 | 1 | ||||
S | 0.498 | -0.173 | -0.195 | -0.436 | -0.054 | -0.122 | -0.210 | -0.233 | -0.314 | -0.077 | 0.079 | -0.086 | 0.118 | -0.034 | -0.058 | 1 | |||
T | 0.037 | -0.097 | -0.074 | -0.091 | -0.240 | -0.416 | -0.013 | -0.013 | -0.274 | -0.121 | 0.103 | -0.036 | 0.246 | 0.157 | 0.080 | 0.167 | 1 | ||
V | 0.277 | -0.226 | 0.075 | -0.421 | 0.246 | 0.235 | 0.067 | -0.310 | 0.105 | -0.203 | -0.392 | 0.070 | -0.099 | -0.514 | -0.273 | 0.055 | -0.114 | 1 | |
W | 0.060 | -0.189 | -0.130 | 0.021 | -0.004 | 0.134 | 0.055 | 0.187 | 0.013 | -0.416 | -0.102 | -0.002 | 0.067 | -0.094 | -0.160 | -0.046 | -0.036 | 0.232 | 1 |
Y | -0.297 | 0.186 | 0.262 | 0.031 | -0.264 | -0.288 | -0.298 | 0.176 | -0.136 | -0.110 | 0.394 | 0.139 | -0.147 | 0.022 | 0.569 | -0.001 | 0.100 | -0.200 | -0.141 |
A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W |
Table 3 Correlation coefficients between the 20 amino acids in type β proteins, calculated from the proteins listed in Table 1
A | 1 | ||||||||||||||||||
C | -0.051 | 1 | |||||||||||||||||
D | -0.275 | -0.256 | 1 | ||||||||||||||||
E | 0.021 | -0.057 | 0.258 | 1 | |||||||||||||||
F | -0.178 | -0.489 | 0.484 | 0.204 | 1 | ||||||||||||||
G | 0.017 | -0.447 | -0.274 | -0.397 | 0.031 | 1 | |||||||||||||
H | -0.071 | -0.138 | 0.022 | 0.273 | 0.122 | 0.104 | 1 | ||||||||||||
I | -0.291 | -0.387 | 0.660 | 0.433 | 0.492 | -0.281 | 0.114 | 1 | |||||||||||
K | 0.010 | 0.098 | 0.227 | 0.272 | 0.373 | -0.334 | -0.210 | 0.223 | 1 | ||||||||||
L | -0.213 | 0.350 | -0.099 | 0.030 | 0.027 | -0.086 | -0.078 | -0.218 | 0.191 | 1 | |||||||||
M | -0.226 | -0.083 | 0.037 | 0.109 | 0.119 | -0.175 | 0.021 | 0.003 | 0.134 | -0.258 | 1 | ||||||||
N | -0.155 | -0.593 | 0.328 | 0.029 | 0.366 | 0.268 | 0.287 | 0.413 | -0.255 | -0.599 | 0.152 | 1 | |||||||
P | 0.373 | 0.103 | -0.154 | -0.017 | -0.130 | -0.304 | -0.140 | -0.159 | 0.208 | 0.214 | -0.055 | -0.466 | 1 | ||||||
Q | 0.133 | 0.307 | 0.027 | -0.099 | -0.184 | -0.456 | -0.277 | 0.027 | -0.054 | -0.006 | -0.291 | -0.149 | -0.065 | 1 | |||||
R | -0.378 | 0.014 | 0.448 | 0.292 | 0.193 | -0.255 | 0.140 | 0.451 | -0.168 | 0.177 | -0.036 | 0.079 | -0.223 | 0.076 | 1 | ||||
S | -0.049 | 0.608 | -0.329 | -0.445 | -0.506 | -0.210 | -0.465 | -0.321 | -0.051 | 0.115 | -0.169 | -0.431 | -0.049 | 0.455 | -0.103 | 1 | |||
T | 0.065 | 0.380 | -0.482 | -0.321 | -0.566 | 0.116 | -0.023 | -0.568 | -0.325 | -0.006 | 0.086 | -0.220 | -0.054 | 0.030 | -0.327 | 0.358 | 1 | ||
V | 0.300 | -0.007 | -0.466 | -0.156 | -0.281 | 0.147 | 0.033 | -0.543 | -0.276 | 0.196 | -0.015 | -0.363 | 0.493 | -0.181 | -0.263 | -0.111 | 0.198 | 1 | |
W | -0.042 | -0.432 | -0.126 | -0.205 | 0.157 | 0.551 | 0.113 | -0.058 | -0.164 | -0.290 | 0.139 | 0.328 | 0.015 | -0.475 | -0.352 | -0.428 | -0.069 | 0.042 | 1 |
Y | -0.112 | -0.475 | -0.014 | -0.412 | 0.032 | 0.609 | -0.016 | -0.081 | -0.160 | -0.362 | 0.117 | 0.402 | -0.376 | -0.230 | -0.218 | -0.119 | -0.109 | -0.138 | 0.498 |
A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W |
We calculated the correlation matrices according to formulas (6) and (7). Because the matrices are symmetric, only the lower triangular part of the correlation matrix of type α proteins is listed in Table 2; Table 3 gives the correlation matrix of type β proteins. Among the 20 amino acids there are $C_{20}^{2} = 190$ amino acid pairs. A correlation coefficient lies between -1 and 1, and its absolute value ranges from 0 (completely uncorrelated) to 1 (completely correlated). Most of the correlation coefficients in Table 2 and Table 3 are far smaller than 1 in absolute value, indicating that most amino acid pairs have no specific correlation and can substitute for one another to a large degree. A few amino acid pairs, however, have correlation coefficients markedly larger than the average. According to the size of the correlation coefficient, we divide amino acid correlations into strong positive correlation (correlation coefficient > 0.5), strong negative correlation (correlation coefficient < -0.5), weak correlation (|correlation coefficient| < 0.05), and no correlation (|correlation coefficient| < 0.01). The correlations of several amino acid pairs with strong positive or strong negative correlation in type α and type β proteins are shown in Fig. 2, Fig. 3, Fig. 4, and Fig. 5.

Fig. 1 The average amino acid content of the 4 types of proteins (52 of type α, 61 of type β, 45 of type α/β, and 46 of type α+β). Type α and type β differ considerably, but the compositions of type α/β and type α+β are very similar.
Fig. 2 Correlation between amino acids H and M in type α proteins, correlation coefficient R_{H-M} = -0.583. Amino acids H and M are negatively correlated: when the content of H is high, the content of M is low.
Fig. 3 Correlation between amino acids R and Y in type α proteins, correlation coefficient R_{R-Y} = 0.569. Amino acids R and Y are positively correlated: when the content of R is high, the content of Y is also high.
Fig. 4 Correlation between amino acids D and I in type β proteins, correlation coefficient R_{D-I} = 0.660. Amino acids D and I are positively correlated: when the content of D is high, the content of I is also high.
Fig. 5 Correlation between amino acids V and I in type β proteins, correlation coefficient R_{V-I} = -0.543. Amino acids V and I are negatively correlated: when the content of V is high, the content of I is low.
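The four categories and their thresholds translate directly into a small classifier. The sketch below is illustrative: the function name is hypothetical, and the "unclassified" label for coefficients with absolute value between 0.05 and 0.5 is an assumption, since the text assigns no name to that range. The example coefficients are read from Tables 2 and 3.

```python
from math import comb


def classify(r):
    """Assign a correlation coefficient to one of the categories named in the text."""
    if r > 0.5:
        return "strong positive"
    if r < -0.5:
        return "strong negative"
    if abs(r) < 0.01:  # checked before "weak" because the ranges are nested
        return "no correlation"
    if abs(r) < 0.05:
        return "weak"
    return "unclassified"  # assumption: the text does not name this range


# Number of distinct amino acid pairs among 20 amino acids.
print("pairs:", comb(20, 2))

# Coefficients read from Table 2 (type alpha) and Table 3 (type beta).
examples = {
    "H-M (alpha)": -0.583,
    "R-Y (alpha)": 0.569,
    "D-I (beta)": 0.660,
    "V-I (beta)": -0.543,
    "F-W (alpha)": -0.004,
    "C-M (alpha)": 0.013,
    "D-E (alpha)": 0.215,
}
for pair, r in examples.items():
    print(pair, r, classify(r))
```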
3 Analysis of Results
The largest correlation coefficient in absolute value for type α proteins is between amino acids M and H, R_{M-H} = -0.583, which indicates a mutual constraint between M and H in the form of a strong negative correlation (Fig. 2). It is worth noting that neither M nor H occurs with high frequency in type α proteins (see Fig. 1). The largest correlation coefficient for type β proteins is between amino acids D and I, R_{D-I} = 0.660, which indicates a strong positive correlation between D and I (Fig. 4).

Fig. 6 Molecular structural formulas and side chain properties of aspartic acid (D, Asp), isoleucine (I, Ile), and valine (V, Val)

The correlation between amino acids reflects the mutual constraints between them. We take the HIV-1 GP120 protein (1acyL), a type β protein, as an example to illustrate the physicochemical origin of the strong positive correlation between aspartic acid (D, Asp) and isoleucine (I, Ile) in type β proteins. Fig. 6 shows the structural formulas of D (Asp), I (Ile), and V (Val). Aspartic acid D has a strongly hydrophilic side chain, CH2COO−, whereas the side chains of isoleucine I and valine V are both strongly hydrophobic.
Fig. 7 Amino acid sequence and secondary structure analysis of the HIV-1 GP120 protein 1acyL [7,8]

Fig. 7 shows the amino acid sequence and secondary structure analysis of the HIV-1 GP120 protein (1acyL) [7,8]. In Fig. 7 we found four places where amino acids D and I are directly adjacent; they are marked in bold in the figure. Interestingly, all four occur at the head or tail of a β-strand, that is, at the turns of the β-sheet. The hydrophobic and hydrophilic properties of amino acids play a very important role in protein structure [9,10]. Earlier work [11,12] showed that the hydrophobicity of the amino acids in a protein is related to the hydrophobic regions of the sequence, and that the turns between secondary structure units can be identified on this basis. Because D is strongly hydrophilic and I is strongly hydrophobic, D and I often appear as a pair at the turn of a β-strand, which produces the strong positive correlation between them. Besides appearing in pairs, D and I also occur scattered at many other positions, so their distribution is a mixture of random and correlated placement, and the overall correlation coefficient is not very high (0.660).
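Locating directly adjacent D and I residues in a sequence, as was done by inspection of Fig. 7, can be sketched as follows. The function name and the toy sequence are hypothetical; the actual 1acyL sequence is not reproduced here.

```python
def adjacent_pair_positions(sequence, first, second):
    """0-based indices i where residues i and i+1 are the two given amino acids,
    in either order."""
    pairs = {first + second, second + first}
    return [i for i in range(len(sequence) - 1) if sequence[i:i + 2] in pairs]


# Hypothetical toy sequence, not the real 1acyL sequence.
toy_sequence = "MKDIVLTIDGA"
positions = adjacent_pair_positions(toy_sequence, "D", "I")
print(positions)
```

Comparing the returned positions with a per-residue secondary structure assignment would show whether the pairs fall at the ends of β-strands.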
Fig. 5 shows a strong negative correlation between V (Val) and I (Ile) in type β proteins. The molecular structures of V and I in Fig. 6 show that both are strongly hydrophobic amino acids that differ by only one methylene group (-CH2-). In the hydrophobic part of a type β protein, the presence of one of the two reduces the probability of the other, which may be why V and I show a strong negative correlation.
4 Conclusion

Both theoretical analysis and computational practice show that there are specific correlations between amino acids in the various types of proteins. These correlations originate in the specific physicochemical properties of the amino acids. Some amino acid pairs have obvious correlations that are easy to explain, such as the positive correlation between aspartic acid D and isoleucine I and the negative correlation between isoleucine I and valine V in type β proteins. However, because protein structure is complex, not all correlations occur between adjacent amino acids, as they do for D and I; correlations may also occur between amino acids that are relatively far apart. The underlying causes of such correlations are not obvious and are not easy to explain. We will work them out step by step in future studies. Amino acids are the basic building blocks of proteins, and the correlations among them are the matching rules of these building blocks and represent the structural features of the protein types. The study of amino acid correlation can be used to predict protein structure type, which we will report in future work.
REFERENCES
[1] Xu Lu. Chemometrics. Science Publishing House, 1985.
[2] Chou K C. Current Protein and Peptide Science, 2000,1: 171.
[3] Chou K C. Proteins: Structure, Function and Genetics, 1995, 21: 319.
[4] Chou K C. FEBS Letters, 1995, 363: 127.
[5] Chou K C, Liu W, Maggiora G M et al. Proteins: Structure, Function and Genetics, 1998, 31: 97.
[6] Berman H M, Westbrook J, Feng Z et al. Nucleic Acids Research, 2000, 28: 235.
[7] Ghiara J B, Stura E A, Stanfield R L et al. Science, 1994, 264: 82.
[8] http://www.rcsb.org/pdb/
[9] Qi J X, Xiao Y. Science B, 2003, 47 (6): 425.
[10] Qiu J D, Liang R P, Zou X Y et al. Chemical Journal, 2003, 61: 748.
[11] Rose G D, Wolfenden R. Hydrogen bonding, hydrophobicity, packing, and protein folding. Annu Rev Biophys Biomol Struct, 1993, 22: 381.
[12] Mandell A J, Selz K A, Shlesinger M F. Physica A, 1997, 244: 254.