1 Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
2 Celera Genomics, Rockville, Maryland 20850, USA
Reprint requests to: Dr. Jonathan King, Department of Biology, Massachusetts Institute of Technology, Room 68-330, Cambridge, MA 02139, USA; e-mail: jaking{at}mit.edu; fax: (617) 252-1843.
(RECEIVED August 1, 2000; FINAL REVISION January 23, 2001; ACCEPTED February 27, 2001)
Article and publication are at www.proteinscience.org/cgi/doi/10.1110/
| Abstract |
|---|
Keywords: Sequence database; bioinformatics; aggregation; inclusion body; hydrophobic residues
| Introduction |
|---|
The failure of protein folding has emerged as an important biological and biotechnological problem (Mitraki and King 1989). A common failure mode is the self-association of folding intermediates, leading to an aggregated biological state (Speed et al. 1997; Wetzel 1997). In a number of well-studied systems this off-pathway reaction has been shown to be associated with partially folded intermediates. A key role of heat shock chaperonins in cells is recognizing junctional misfolded intermediates and helping them avoid self-association and the kinetically trapped aggregated state. The isolation of mutants influencing this partitioning establishes that it is a property of amino acid sequences (Mitraki et al. 1991).
In recent years, several increasingly detailed computational and theoretical models have been developed to address questions of aggregate structure and dynamics currently inaccessible to experimental techniques (De Young et al. 1993; Patro and Przybycien 1994, 1996; Gupta et al. 1998). In prior work, we reported on a lattice-based computer simulation model for studying the propensity of polypeptide chains to aggregate during simulated folding in a solution (Istrail et al. 1999). The key idea of the method was to follow two identical chains folding independently and periodically test how the folding intermediates could associate with one another in an energetically favorable manner. On the basis of our results, we concluded that propensity of a sequence to aggregate within the model is a consistent property of its amino acid sequence and correlates with measurable properties of the sequence.
One particular property found to predict a propensity to aggregate was the grouping of hydrophobic residues within a peptide into small numbers of consecutive blocks. For a given number of hydrophobic residues, the computer model showed increasing propensity to aggregate as the hydrophobic residues became concentrated in fewer continuous subsequences. This property is of interest for the present work because it implies a testable prediction about actual sequences known to fold in aqueous solution: they will have evolved to select against long blocks of consecutive hydrophobic residues in order to promote a low loss of proteins to aggregation. The present work examines whether selection against long hydrophobic sequences within globular proteins is reflected in actual published sequence data. Our methodology is therefore similar to that adopted by Broome and Hecht (2000), who studied statistical distributions of small patterns of hydrophobic and hydrophilic residues to support the hypothesis that a pattern of alternating hydrophobic-hydrophilic residues predisposes sequences to aggregation.
The specific issue of hydrophobic run lengths addressed by the present work has been previously examined by White and Jacobs (1990). They performed a careful statistical analysis of extant protein sequences, testing the probability, given the hypothesis that residue hydrophobicities are assigned independently at random, of the number of hydrophobic or hydrophilic runs differing from the expectation by at least as much as that observed in each individual sequence. They concluded that the majority of individual proteins examined have distributions of hydrophobic run lengths statistically indistinguishable from those expected for random sequences. White and Jacobs (1990) argued from this result in favor of the hypothesis that extant protein sequences are essentially random except at small numbers of conserved sites, consistent with the evolutionary hypothesis that current biologically relevant proteins evolved from a large initial pool of random sequences.
Although the result of White and Jacobs (1990) appears to argue against our hypothesis, the questions we ask differ from those originally examined by them, primarily in that we limit our search to proteins known to fold in aqueous solution and in that we compile statistics only for an entire database, rather than for individual sequences. This latter difference is subtle but important in allowing genuine statistical effects that are small in absolute magnitude to rise to the level of statistical significance. In later work, White and Jacobs (1993) explored aggregate statistics on run lengths of hydrophobic versus hydrophilic residues and found a slight bias toward shorter consecutive blocks. We extend these studies primarily in examining statistics for proteins known to fold in aqueous solution and in contrasting those results to statistics derived from known membrane sequences and to complete proteomes.
Polypeptide chains that do refold to their native state in aqueous buffers generally fail to reach this state in the presence of detergents, lipid vesicles, or organic solvents. In contrast, the refolding of integral membrane proteins in vitro requires both lipid vesicles and surfactants. These experimental observations together with theoretical considerations indicate that the folding of these two classes of proteins proceeds through very different intermediates and pathways, and requires different environmental conditions.
The overall statistical content of protein sequences has since been examined by Strait and Dewey (1996), who analyzed a protein sequence database in terms of its information entropy by different measures of information content. They determined that actual protein sequences carry significantly less information than is theoretically possible from a 20-letter alphabet. Analyses specifically focusing on hydrophobicity have mainly been aimed at screening for transmembrane segments of proteins via hydrophobicity scales, a technique pioneered by Kyte and Doolittle (1982). Subsequent experience has borne out the association of long hydrophobic stretches with transmembrane helices.
| Results and Discussion |
|---|
A quantitative measure of the concentration of hydrophobic residues into blocks is the number of alternations in the database, in which a hydrophobic residue is immediately followed by a nonhydrophobic residue or vice versa; the sum of this quantity over all 2753 sequences was computed. In addition, a histogram was recorded of the lengths of all maximal sequences of consecutive hydrophobic residues within the protein sequences analyzed. Hydrophobic residues were defined for the purposes of this study to be Ala, Ile, Leu, Met, Phe, Pro, Trp, Tyr, and Val.
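The two per-sequence measurements described above can be sketched as follows. This is a minimal illustration, not the code used in the study; it assumes sequences are given as strings of standard one-letter amino acid codes, and the function names are hypothetical.

```python
from collections import Counter

# Hydrophobic set as defined in the text: Ala, Ile, Leu, Met, Phe, Pro, Trp, Tyr, Val.
# The one-letter encoding is an assumption of this sketch.
HYDROPHOBIC = frozenset("AILMFPWYV")


def count_alternations(seq, hydrophobic=HYDROPHOBIC):
    """Count adjacent residue pairs whose hydrophobic classes differ."""
    flags = [res in hydrophobic for res in seq]
    return sum(a != b for a, b in zip(flags, flags[1:]))


def block_length_histogram(seq, hydrophobic=HYDROPHOBIC):
    """Histogram of lengths of maximal runs of consecutive hydrophobic residues."""
    hist = Counter()
    run = 0
    for res in seq:
        if res in hydrophobic:
            run += 1
        else:
            if run:
                hist[run] += 1
            run = 0
    if run:  # a run that reaches the end of the sequence
        hist[run] += 1
    return hist
```

Summing `count_alternations` and merging the histograms over every sequence would give the database-wide quantities discussed in the text.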
Data were also recorded on the distribution of sequence lengths in the database and on the probability pH that a randomly selected residue within the database was hydrophobic. A prediction of the expected number of alternations given the frequency of hydrophobic residues and the sequence lengths in the database was calculated. Predicted values were also derived computationally for the expected number of blocks of consecutive hydrophobic residues of each possible length, given the measured distribution of sequence lengths, assuming that each residue was assigned independently at random to be hydrophobic with probability pH, as described in the Materials and Methods section. The result provides a measure of the expected block length frequencies given a sufficiently large database and the assumption that no factors influence the relative positions of hydrophobic residues within each sequence.
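Under the independence assumption stated above, the expected counts have simple closed forms that follow from linearity of expectation. The sketch below is one way to compute them and is offered as an illustration only: it is not the recurrence P(n,k,m) of the Materials and Methods section, and the function names and the analytic variance of the alternation count are assumptions introduced here.

```python
def expected_alternations(n, p):
    """Mean number of alternations in a length-n sequence, i.i.d. with P(H) = p."""
    return max(n - 1, 0) * 2.0 * p * (1.0 - p)


def variance_alternations(n, p):
    """Variance of the alternation count; only overlapping adjacent pairs covary."""
    if n < 2:
        return 0.0
    q = 2.0 * p * (1.0 - p)                     # P(one adjacent pair alternates)
    cov = p * (1.0 - p) * (1.0 - 2.0 * p) ** 2   # covariance of neighboring pairs
    return (n - 1) * q * (1.0 - q) + 2.0 * (n - 2) * cov


def expected_blocks(n, k, p):
    """Mean number of maximal hydrophobic runs of length exactly k."""
    if k < 1 or k > n:
        return 0.0
    if k == n:
        return p ** k
    ends = 2.0 * p ** k * (1.0 - p)                   # run touching either terminus
    interior = (n - k - 1) * p ** k * (1.0 - p) ** 2  # run flanked on both sides
    return ends + interior


def database_expectations(lengths, p, max_k):
    """Sum the per-sequence expectations over a list of sequence lengths."""
    mean_alt = sum(expected_alternations(n, p) for n in lengths)
    sd_alt = sum(variance_alternations(n, p) for n in lengths) ** 0.5
    blocks = {k: sum(expected_blocks(n, k, p) for n in lengths)
              for k in range(1, max_k + 1)}
    return mean_alt, sd_alt, blocks
```

Because sequences are treated as independent, per-sequence means and variances simply add across the database, so only the list of sequence lengths and pH are needed.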
Both measures suggest long blocks of hydrophobic residues are suppressed relative to what would be expected if residues were chosen independently of their neighbors. For our definition of hydrophobic residues, pH was found to be 0.451. The expected number of alternations given this value of pH was calculated as 232,542, compared to a measured value of 237,716. The difference of 5,174 (2.225% of expected) is 14.97 standard deviations and is therefore unlikely to be caused by chance. Figure 1 shows the measured and expected values of hydrophobic block lengths for non-membrane-associated proteins. The measured values slightly exceed those predicted under the assumption of independence of residues for hydrophobic block lengths one, two, and three. However, they fall significantly lower for all longer block lengths except 16, for which one block was detected whereas 0.388607 blocks were expected. No hydrophobic block lengths longer than 16 were observed in the sequence data. Table 1 lists those sequences containing hydrophobic blocks of length 12 or longer, identified by PDB ID (Bernstein et al. 1977).
How important is our specific choice of residues to classify as hydrophobic? To examine this question, we repeated our statistical calculations on numbers of alternations for the database for other possible selections of hydrophobic residues. We altered our set of hydrophobic residues in 20 successive recalculations by individually subtracting out each amino acid we classified as hydrophobic and by individually adding in each residue we classified as nonhydrophobic. The resulting data are summarized in Table 2. In each case, a single amino acid change still results in a statistically significant elevation in the number of alternations, although usually to a lesser degree than with our originally chosen set of hydrophobics. The most dramatic increase occurs with the addition of glycine, increasing the percent elevation over the expectation from 2.225% to 2.964%. The most dramatic decrease comes from adding in aspartate, decreasing the percent elevation to 0.968%.
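The 20 single-residue variations of the hydrophobic set can be enumerated mechanically. The following sketch assumes one-letter residue codes and uses a hypothetical labeling scheme ("+G" for adding glycine, "-A" for removing alanine); it only generates the sets and does not reproduce the statistics of Table 2.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"        # the 20 standard one-letter codes
BASE_HYDROPHOBIC = frozenset("AILMFPWYV")   # set defined in the text


def single_residue_variants(base=BASE_HYDROPHOBIC, alphabet=AMINO_ACIDS):
    """Yield (label, residue_set) for each one-residue change to the base set."""
    for res in alphabet:
        if res in base:
            yield f"-{res}", base - {res}   # drop a residue classified as hydrophobic
        else:
            yield f"+{res}", base | {res}   # add a residue classified as nonhydrophobic


variants = dict(single_residue_variants())
```

Each resulting set can then be passed to whatever routine recomputes pH and the measured and expected alternation counts.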
Does removing membrane and cell-surface proteins from the database introduce a sample bias? This decision should not undermine the detection of a pattern in the database if it reflects properties of polypeptide chains from globular proteins that fold in aqueous solution. The selection of sequences to exclude from the analysis was based on this functional criterion, rather than on sequence. Furthermore, the expected values were computed using amino acid frequencies taken from the database after membrane and cell-surface proteins had been excluded, and the statistical suppression is therefore genuine given the overall hydrophobic content of non-membrane-associated sequences.
When similar statistics are computed from the excluded membrane and cell-surface proteins, also removing redundant sequences and those with amino acids of unspecified type, the results are qualitatively reversed. The histogram of expected and measured block frequencies for the membrane and cell-surface proteins is illustrated in Figure 2. The figure shows that long hydrophobic block frequencies are generally elevated relative to statistically expected values, assuming independent residue selection. However, the measured number of alternations is still slightly higher than expected with 6242 counted and 6144 expected, a difference of 1.595% of expected, or 1.764 standard deviations. Although we attribute the elevated frequencies largely to the contribution of transmembrane helices, blocks of the length required for a full transmembrane helix are not observed owing to the presence of nonhydrophobic residues within the helices. Although our simulation model says nothing about membrane-associated sequences, the statistical data suggest that some of the effects predicted for sequences folding in aqueous solution are reversed for those that are membrane-associated.
), 1,431,462 counted versus 1,454,311 expected for S. cerevisiae (a difference of 1.571%, or 25.95 standard deviations), and 4,885,131 counted versus 4,998,382 expected for C. elegans (a difference of 2.266%, or 70.13 standard deviations).
It might be argued that fully folded proteins cannot tolerate long blocks of hydrophobic residues and remain soluble. However, we know of no reason why long hydrophobic blocks could not be accommodated internally to the structure of a globular protein after it has completed the folding process. Such structures can occur, as some of those listed in Table 1 illustrate. Figure 4 shows the structure of UDP-N-acetylglucosamine enolpyruvyl transferase, which accommodates its 12-residue hydrophobic block in a buried alpha helix, providing an excellent example of how a folded protein can accommodate a long string of consecutive hydrophobic residues. If such structures can be stable in a fully folded protein, then the question remains why they are statistically rare. We suggest that constraints imposed by the process of folding, as opposed to the structural needs of a fully folded protein, may be partially responsible for the observed effects. The results might also be an artifact of an overabundance of certain sequence motifs or elements of secondary structure that favor short hydrophobic blocks. The fact that the excess of short blocks consists primarily of length 2 blocks is not, however, consistent with any common structural motif known to us.
The disparity in alternations is small as a percentage of total alternations and would not be expected to reach statistical significance in individual sequences if it were distributed evenly across the database, even though it is quite significant for the database as a whole. Similarly, although disparities in block lengths show up as significant in the database as a whole, the absolute number of residues expected to be involved in long blocks is a small percentage of the total number of residues. A disparity in those frequencies over the whole proteome may therefore not appear to be statistically significant when individual sequences are examined in isolation. We believe that our data and those of White and Jacobs (1990, 1993) are both consistent with two interpretations: a small systematic bias in block lengths across the whole database or large differences in a minority of sequences and no differences in others. We are not aware of a method for deciding between these two hypotheses. We therefore believe the apparent disparity between our results and those of White and Jacobs (1990, 1993) reflects our asking subtly different questions, guided by a concern for the influence of sequence on the process of folding as opposed to final folded states alone. Neither do our results rule out the premise of White and Jacobs that sequences of biologically relevant proteins may have evolved from an initially random set, with the caveat that selective pressures may have eliminated a noticeable subset of that initial set.
On the basis of our prior simulation results, we suggest that the missing sequences reflect in part the evolutionary fitness constraint imposed by selective pressure to avoid off-pathway aggregation. As proteins evolved to favor sequences that could fold reliably, a noticeable suppression of sequences with long strings of consecutive hydrophobic residues occurred. This effect could create a counterbalance to the drive for high hydrophobicity identified by Moult and Unger (1991) as a key predictor of rapid folding, creating the need for a balance between these two competing factors of folding rate and aggregation propensity. This conclusion is consistent with results from our earlier lattice simulations (Istrail et al. 1999), which found that rapid folders in our lattice model tended to have high aggregabilities per unit time, apparently because of a correlation of both fast folding and high propensity to aggregate with high hydrophobic content. The reversed results seen for membrane-associated proteins may reflect how evolutionary selection for sequence operates differently on membrane-associated sequences than on those evolved for aqueous environments. In itself, the similarity between the membrane-associated sequences and the complete-ORF databases does not necessarily mean that membrane-associated proteins form a large fraction of actual ORFs, only that complete genomes contain many membrane-associated proteins that are able to produce a significant effect on the data at long block lengths, where soluble proteins produce few data points. However, the fact that elevations of long blocks in complete genomes are more pronounced than in the membrane proteins of solved structure does suggest that the solved membrane proteins are not fully representative of membrane proteins in general.
The results from membrane-associated and complete-ORF databases strengthen our contention that some constraint imposed by the requirements of folding in aqueous solution is suppressing the number of hydrophobic residues found in long consecutive blocks in soluble proteins. These conclusions suggest that aggregation constraints may contribute to the observation of Strait and Dewey (1996) that actual protein sequences carry considerably less information than a 20-letter alphabet theoretically allows. They further suggest the importance of considering propensity to aggregate as a design constraint in protein evolution on a par with rapid folding and with stability and functional fitness of the native state. The confirmation of an important statistical prediction of our abstract computer model also supports the validity of that model and suggests the benefits of refining computational methodologies for exploring protein folding and aggregation. The tremendous increase in data available for analysis in recent years suggests that similar database analysis techniques for locating interesting proteins or supporting general hypotheses about protein behavior are likely to become increasingly valuable.
| Materials and methods |
|---|
Expected hydrophobic block length frequencies given the assumption of independence between positions were calculated via a function P(n,k,m), expressing the probability for fixed pH that a sequence of length n has exactly m hydrophobic blocks of length exactly k. P(n,k,m) can be calculated via a recurrence relation by considering the following cases:
k the string consists of i hydrophobic residues, followed by a hydrophilic residue, followed by a string with no length k hydrophobic blocks. Therefore,
Standard deviations of the expected block counts were estimated via simulations. Simulation trials were performed by creating a random database of sequences with the same lengths as those in the measured database but with residues assigned independently at random, with the measured probability pH that a given residue was hydrophobic. Block lengths within the random database were then measured as was done with the actual data. Standard deviations were computed based on 1,000,000 such trials.
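The simulation procedure can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the original program: the function names, the use of Python's `random` module, the fixed seed, and the population standard deviation (`pstdev`) are all choices made here, and a practical run would use far fewer than 1,000,000 trials in pure Python.

```python
import random
from collections import Counter
from statistics import pstdev


def random_block_counts(lengths, p, rng):
    """Block-length histogram for one random database with P(hydrophobic) = p."""
    hist = Counter()
    for n in lengths:
        run = 0
        for _ in range(n):
            if rng.random() < p:
                run += 1
            else:
                if run:
                    hist[run] += 1
                run = 0
        if run:  # close a run that reaches the end of the sequence
            hist[run] += 1
    return hist


def simulate_block_sd(lengths, p, trials, max_k, seed=0):
    """Estimate the standard deviation of each block-length count by simulation."""
    rng = random.Random(seed)
    samples = {k: [] for k in range(1, max_k + 1)}
    for _ in range(trials):
        hist = random_block_counts(lengths, p, rng)
        for k in samples:
            samples[k].append(hist[k])  # Counter returns 0 for absent lengths
    return {k: pstdev(v) for k, v in samples.items()}
```

For example, `simulate_block_sd(lengths, 0.451, trials=1000, max_k=16)` would return one estimated standard deviation per block length from 1 to 16, where `lengths` is the list of sequence lengths in the measured database.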
When computing statistics for all possible partitions of the set of amino acids into two groups, we first counted the number of occurrences of all amino acids and all pairs of consecutive amino acids in the databases. We calculated the number of alternations for each partition by summing the counts of all consecutive residue pairs assigned to different groups by the partition. We similarly calculated pH by dividing the sum of the values for residues classified as hydrophobic in a given partition by the total number of amino acids in the database. Given pH, we then calculated the expectation and standard deviation of numbers of alternations for each partition as described above.
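Tallying residues and consecutive pairs once, then reusing the tallies for every partition, avoids rescanning the database. A minimal sketch of that idea follows, assuming sequences are strings of one-letter codes; the function names are hypothetical.

```python
from collections import Counter


def tally(sequences):
    """Count single residues and consecutive residue pairs across a database."""
    singles, pairs = Counter(), Counter()
    for seq in sequences:
        singles.update(seq)
        pairs.update(zip(seq, seq[1:]))  # pairs never span two sequences
    return singles, pairs


def partition_stats(singles, pairs, hydrophobic):
    """Return (alternations, pH) for one partition using only the tallies."""
    alternations = sum(count for (a, b), count in pairs.items()
                       if (a in hydrophobic) != (b in hydrophobic))
    total = sum(singles.values())
    p_h = sum(c for res, c in singles.items() if res in hydrophobic) / total
    return alternations, p_h
```

With at most 400 distinct residue pairs, evaluating one partition costs a few hundred operations regardless of database size, which is what makes scanning many partitions practical.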
| Acknowledgments |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| References |
|---|
Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.E., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., and Tasumi, M. 1977. The Protein Data Bank: A computer-based archival file for macromolecular structures. J. Mol. Biol. 112: 535-542.
Blattner, F.R., Plunkett, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453-1462.
Brenner, S.E., Koehl, P., and Levitt, M. 2000. The ASTRAL compendium for protein structure and sequence analysis. Nucl. Acids Res. 28: 254-256.
Broome, B.M. and Hecht, M.H. 2000. Nature disfavors sequences of alternating polar and non-polar amino acids: Implications for amyloidogenesis. J. Mol. Biol. 296: 961-968.
The C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282: 2012-2018.
Cohen, C. and Parry, D.A.D. 1986. α-Helical coiled coils: A widespread motif in proteins. Trends Biochem. Sci. 11: 245-248.
De Young, L.R., Fink, A.L., and Dill, K.A. 1993. Aggregation of globular proteins. Accounts Chem. Res. 26: 614-620.
Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., et al. 1996. Life with 6000 genes. Science 274: 546-567.
Gupta, P., Hall, C.K., and Voegler, A.C. 1998. Effect of denaturant and protein concentrations upon protein refolding and aggregation: A simple lattice model. Protein Sci. 7: 2642-2652.
Istrail, S., Schwartz, R., and King, J.A. 1999. Lattice simulations of aggregation funnels for protein folding. J. Comp. Biol. 6: 143-162.
Kyte, J. and Doolittle, R.F. 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157: 105-132.
Mitraki, A. and King, J. 1989. Protein folding intermediates and inclusion body formation. Bio/Technology 7: 690-697.
Mitraki, A., Fane, B., Haase-Pettingell, C., Sturtevant, J., and King, J. 1991. Global suppression of protein folding defects and inclusion body formation. Science 253: 54-58.
Moult, J. and Unger, R. 1991. An analysis of protein folding pathways. Biochem. 30: 3816-3824.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536-540.
Patro, S.Y. and Przybycien, T.M. 1994. Simulations of kinetically irreversible protein aggregate structure. Biophys. J. 66: 1274-1289.
Patro, S.Y. and Przybycien, T.M. 1996. Simulations of reversible protein aggregate and crystal structure. Biophys. J. 70: 2888-2902.
Rose, G.D. and Roy, S. 1980. Hydrophobic basis of packing in globular proteins. Proc. Natl. Acad. Sci. USA 77: 4643-4647.
Skarzynski, T., Mistry, A., Wonacott, A., Hutchinson, S.E., Kelly, V.A., and Duncan, K. 1996. Structure of UDP-N-acetylglucosamine enolpyruvyl transferase, an enzyme essential for the synthesis of bacterial peptidoglycan, complexed with substrate UDP-N-acetylglucosamine and the drug fosfomycin. Structure 4: 1465-1474.
Speed, M.A., King, J., and Wang, D.I.C. 1997. Polymerization mechanism of polypeptide chain aggregation. Biotech. Bioeng. 54: 333-343.
Strait, B.J. and Dewey, T.G. 1996. The Shannon information entropy of protein sequences. Biophys. J. 71: 148-155.
Tomita, M. and Marchesi, V.T. 1975. Amino-acid sequence and oligosaccharide attachment sites of human erythrocyte glycophorin. Proc. Natl. Acad. Sci. USA 72: 2964-2968.
von Heijne, G. 1994. Membrane proteins: From sequence to structure. Ann. Rev. Biophys. Biomol. Struct. 23: 167-192.
Wetzel, R. 1997. Protein misassembly. Adv. Prot. Chem. 50: 330-350.
White, S.H. and Jacobs, R.E. 1990. Statistical distribution of hydrophobic residues along the length of protein chains: Implications for protein folding and evolution. Biophys. J. 57: 911-921.
White, S.H. and Jacobs, R.E. 1993. The evolution of proteins from random amino acid sequences. I. Evidence from the lengthwise distribution of amino acids in modern protein sequences. J. Mol. Evol. 36: 79-95.