Proteins and Genes, Singletons and Species
Branko Kozulić
Gentius Ltd, Petra Kasandrića 6, 23000 Zadar, Croatia
Abstract
Recent experimental data from proteomics and genomics are interpreted here in ways that challenge the predominant viewpoint in biology according to which the four evolutionary processes, namely mutation, recombination, natural selection and genetic drift, are sufficient to explain the origination of species. The predominant viewpoint appears incompatible with the finding that the sequenced genome of each species contains hundreds, or even thousands, of unique genes - the genes that are not shared with any other species. These unique genes and proteins, singletons, define the very character of every species. Moreover, the distribution of protein families from the sequenced genomes indicates that the complexity of genomes grows in a manner different from that of
Introduction
One of the first issues encountered in the early studies of proteins was their large size. In 1936, under the assumption that a protein has a molecular weight of 20,000, Swiss physicists
formation of a particular protein molecule corresponded to one against 10^321 [1]. Such estimates compelled French biophysicist Pierre Lecomte du Noüy to question any scenario of unguided origination of proteins, for this huge number of different protein molecules, if made, would have a volume many times larger than the volume of the whole universe [2, 3]. In 1953, as part of his Nobel lecture, Hermann Staudinger contrasted the chance of formation of a particular 100,000 molecular weight protein - one in 10^1270 - to the number of water molecules present in Earth’s oceans - a mere 10^46 [4]. In 1957, Isaac Asimov calculated that if the whole universe were packed with neutrinos, and if each neutrino represented a computer generating per second one billion proteins each of a different sequence over the entire universe’s life, the total number of proteins generated would have reached just 10^179 [5].
Prominent mathematicians and biologists discussed this mathematical challenge to neo-Darwinian evolution at a special meeting in 1966 [6], but, as noted by Salisbury [7], the question is whether the attending biologists understood the nature and magnitude of the challenge. Over subsequent decades, the same challenge has been repeatedly raised by some scientists only to be defused by others, until its relevance apparently became unclear. Thus physicist Charles Townes could remark: “The biologists may at first seem fortunate because they have not run into brick walls such as physicists hit in finding quantum or relativistic phenomena that are so strange and different. But this may be because biologists have not yet penetrated far enough towards the really difficult problems where radical changes of viewpoints may be essential” [8]. Here I argue that biologists have actually run into brick walls; hence it is time for radical changes of viewpoints.
Size of protein sequence space
One strategy for defusing the problem associated with the finding of functional proteins by random search through the enormous protein sequence space has been to arbitrarily reduce the size of that space. Because the space size is related to protein length (L) as 20^L, where 20 denotes the number of different amino acids of which proteins are made, the number of unique protein sequences will rapidly decrease if one assumes that the number of different amino acids can be less than 20. The same is true if one takes small L values. Dryden et al. used this strategy to illustrate the feasibility of searching through the whole protein sequence space on Earth, estimating that the maximal number of different proteins that could have been formed on planet Earth in geological time was 4 x 10^43 [9]. In the laboratory, researchers have designed functional proteins with fewer than 20 amino acids [10, 11], but in nature all living organisms studied thus far, from bacteria to man, use all 20 amino acids to build their proteins. Therefore, the conclusions based on the calculations that rely on fewer than 20 amino acids are irrelevant in biology. Concerning protein length, the reported median lengths of bacterial and eukaryotic proteins are 267 and 361 amino acids, respectively [12]. Furthermore, about 30% of proteins in eukaryotes have more than 500 amino acids, while about 7% of them have more than 1,000 amino acids [13]. The largest known protein, titin, is built of more than 30,000 amino acids [14]. Only such experimentally found values for L are meaningful for calculating the real size of the protein sequence space, which thus corresponds to a median figure of 10^347 (20^267) for bacterial, and 10^470 (20^361) for eukaryotic proteins.
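The two median figures follow from taking the base-10 logarithm of 20^L. A minimal Python sketch of that arithmetic is given here, using only the median lengths quoted from [12]; the function name is illustrative and not part of any cited work.

```python
import math

def log10_sequence_space(length, alphabet_size=20):
    """Return log10 of alphabet_size**length, the number of possible sequences."""
    return length * math.log10(alphabet_size)

# Median protein lengths quoted in the text from [12].
for label, length in [("bacterial", 267), ("eukaryotic", 361)]:
    exponent = log10_sequence_space(length)
    print(f"{label}: 20^{length} is about 10^{exponent:.0f}")
```

Working with logarithms avoids handling integers hundreds of digits long and makes the dependence on L explicit: each additional residue adds about 1.3 to the exponent.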
Protein structure space
Even a small protein composed of 100 amino acids comes from a set of 10^130 different possible sequences. As Lau and Dill stated in 1990 [15], it is essentially impossible for chance to find a particular sequence in a set of such magnitude, just as it is for a monkey dancing on a typewriter to produce a Shakespearean play. Because this general argument of low probability gained importance “as support for creationism” [15], Lau and Dill proposed the “structure” hypothesis according to which nature seeks only a compact protein conformation with the proper active site. This is an alternative to the view that nature “seeks” a particular sequence. Since proteins of many different sequences can attain one kind of compact conformation, the structure hypothesis reduced the searchable space, and was thus perceived to increase the likelihood of finding a functional protein by a random process, such as random mutations of
In the two decades since the above proposition, scientists have used various criteria to order protein structure space. The primary information about
rely on curators who delineate domains and folds within 3D structure of each individual protein, and both classifications bring in taxonomy to involve evolutionary relationships. Since the two basic entities of classification, domains and folds, are subjectively rather than mathematically derived, the recognition of new folds and the quantification of similarity among folds are difficult
While some argue that a protein fold, and its relationship to other folds, cannot be defined without considering the evolutionary context [23, 24], others define relationships between protein folds purely mathematically in terms of a continuous similarity curve. The number of folds sufficient for describing all protein structures then depends on the chosen similarity cut-off value [18, 25, 26]. Recently, a new classification was described based on supersecondary motifs (Smotifs), which are entities smaller than domains and folds. Smotifs are composed of the two regular secondary structure elements,
Regardless of whether the 324 Smotifs or 1,233 folds - or a similar number of other basic elements - are sufficient for describing all 3D protein structures, the existence of the enormous number of possible protein sequences necessarily means that a structure defined by any particular fold or combination of Smotifs might be populated by a huge number of unique protein sequences. Instances of proteins having essentially identical structure but different sequences, with sequence similarity even below 10%, are well known [22, 31, 32].
Sequence similarity of
All evolutionary models rely on how certain changes affect fitness. But is changing a protein fold beneficial or detrimental to fitness? Or, is maintaining a protein fold beneficial or detrimental? Under physiological conditions, native metamorphic proteins are known to exist in two alternative folds and both of them appear to be beneficial
[42] described 21 such prior substitutions; each one of them would have represented a crossroad with thousands of directions had these substitutions occurred in vivo instead of in vitro. Population genetics modeling becomes complicated when dealing with multiple amino acid substitutions in one protein
the fitness effects due to 3D structural changes in a series of proteins undergoing such mutations. But, as a matter of principle, how can one possibly talk about a separate or additional fitness effect due to a 3D structural change if the protein sequence determines its structure, and the structure determines function and the function determines fitness? My literature search for publications describing evolutionary modeling based on fitness effects of protein structures gave no results. And according to a paper published in 2008: “the precise determinants of the evolutionary fitness of protein structures remain unknown” [47] – 18 years since Lau and Dill proposed the “structure hypothesis” [15]. On the other hand, in a number of papers it was shown that all relationships in the protein structure space can be described in purely mathematical terms [18,
[29]. If all relationships in the protein structure space can be described fully without the need to invoke evolutionary explanations, then such explanations should not be invoked at all (Ockham’s razor).
Frequency of functional proteins in protein sequence space
A single mutation, an insertion or a deletion, can in theory force a protein to switch its fold and acquire a new function, especially when the number of inserted or deleted nucleotides is not a multiple of 3. Such mutations are known as frameshift mutations, as they completely change the amino acid sequence downstream of the mutation point. The probability that the new sequence is functional in combination with the unchanged upstream sequence correlates with the frequency of folds in the protein sequence space. While scientists generally agree that only a minority of all possible protein sequences has the property to fold and create a stable 3D structure, the figure adequate to quantify that minority has been a subject of much debate.
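The effect of a frameshift can be illustrated with a short Python sketch. It translates a DNA string with the standard genetic code and compares the product before and after inserting one nucleotide. The 15-nucleotide sequence and the inserted base are hypothetical examples chosen for illustration, not data from any cited study.

```python
# Standard genetic code, with codons ordered by first, second and third base in "TCAG".
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

def translate(dna):
    """Translate complete codons of a DNA string into one-letter amino acids ('*' = stop)."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        first, second, third = (BASES.index(base) for base in dna[i:i + 3])
        protein.append(AMINO_ACIDS[16 * first + 4 * second + third])
    return "".join(protein)

# Hypothetical coding sequence and a single-nucleotide insertion after the first codon.
original = "ATGGCCAAAGGTCTG"
mutant = original[:3] + "T" + original[3:]

print(translate(original))  # reading frame of the original sequence
print(translate(mutant))    # every codon after the insertion point is read differently
```

In this toy case only the first residue, encoded upstream of the insertion, is shared by the two products; everything downstream is replaced.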
In 1976, Hubert Yockey estimated the probability of about
conclusion that random assembly of amino acids could not have produced a single enzyme during 4.5 billion years [48, 53]. On the other hand, Taylor et al. estimated that a random protein library of about 10^24 members would be sufficient for finding one chorismate mutase molecule [54]. Moreover, from an actual library of 6 x 10^12 proteins each containing 80 contiguous random amino acids, Keefe and Szostak isolated four ATP binding proteins and concluded that the frequency of functional proteins in the sequence space may be as high as 1 in 10^11, allowing for their discovery by entirely stochastic means [55]. However, subsequent in vivo studies with this
[55]. The importance of distinguishing the results of in vitro from in vivo studies is highlighted by the finding that only a tiny fraction, one in about 10^10, of the active mutants of triosephosphate isomerase functioned properly in vivo [57]. It is also important to note that nucleotide binding protein families are among the most populous of all: the
A “macromolecular miracle”
In general, there are two aspects of biological function of every protein, and both depend on correct 3D structure. Each protein specifically recognizes its cellular or extracellular counterpart: for example an enzyme its substrate, hormone its receptor, lectin sugar, repressor DNA, etc. In addition, proteins interact continuously or transiently with other proteins, forming an interactive network. This second aspect is no less important, as illustrated in many studies of
amino acids (which make up the polypeptide chain) in the correct order” [61, italics in original].
Let us assess the highest probability for finding this correct order by random trials and call it, to stay in line with Crick’s term, a “macromolecular miracle”. The experimental data of Keefe and Szostak indicate - if one disregards the above described reservations - that one from a set of 10^11 randomly assembled polypeptides can be functional in vitro, whereas the data of Silverman et al. [57] show that of the 10^10 in vitro functional proteins just one may function properly in vivo. The combination of these two figures then defines a “macromolecular miracle” as a probability of one against 10^21. For simplicity, let us round this figure to one against 10^20.
It is important to recognize that the one in 10^20 represents the upper limit, and as such this figure is in agreement with all previous lower probability estimates. Moreover, there are two components that contribute to this figure: first, there is a component related to the particular activity of a protein - for example enzymatic activity that can be assayed in vitro or in vivo - and second, there is a component related to proper functioning of that protein in the cellular context: in a biochemical pathway, cycle or complex. Taking into account both contributions is an essential requirement because a synthetic protein nicely active in the test tube can be lethal in the cellular context, as shown by Stomel et al. for the
In the context of protein sequences, the figure of one in 10^20 means that along a polypeptide chain the identity of amino acids at only 15 positions would stay fixed; at each other position there could be any one of the 20 amino acids. With a 50 amino acid peptide, for example, the expectation is then to find 10^45 functional sequences out of the 10^65 (20^50) possible ones. That expectation seems unrealistic. With the median length in eukaryotes of 361 amino acids, the expectation to find 10^450 functional proteins and only 10^20 nonfunctional ones looks utterly ridiculous. Thus, allowing for the probability of finding one functional protein among 10^20 random sequences is obviously extremely generous, bordering on unreasonably generous. Nevertheless, for the sake of simplicity let us keep this figure for “macromolecular miracle” and apply it to all proteins regardless of their length and cellular context.
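The arithmetic behind these figures can be retraced in a few lines of Python. The sketch below assumes only the numbers quoted in the text (one in 10^11 in vitro, one in 10^10 in vivo, and the rounded one in 10^20); the function names are illustrative.

```python
import math

LOG10_20 = math.log10(20)

# Figures quoted in the text, expressed as base-10 exponents.
log10_in_vitro = -11   # one functional polypeptide in 10^11 random ones
log10_in_vivo = -10    # one in vivo functional protein in 10^10 in vitro active ones
log10_combined = log10_in_vitro + log10_in_vivo   # independent factors multiply

def fixed_positions(log10_probability):
    """Number of fully specified positions equivalent to the given probability."""
    return -log10_probability / LOG10_20

def log10_functional(length, log10_probability=-20):
    """log10 of the expected number of functional sequences among 20**length."""
    return length * LOG10_20 + log10_probability

print(log10_combined)                      # exponent before rounding to -20
print(f"{fixed_positions(-20):.1f}")       # roughly 15 fixed positions
print(f"{log10_functional(50):.0f}")       # exponent for a 50-residue peptide
print(f"{log10_functional(361):.0f}")      # exponent for the median eukaryotic length
```

Treating the in vitro and in vivo fractions as independent factors is the assumption that permits adding the exponents.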
To put the 10^20 figure in the context of observable objects, about 10^20 squares each measuring 1 mm^2 would cover the whole surface of planet Earth (5.1 x 10^14 m^2). Searching through such squares to find a single one with the correct number, at a rate of 1000 per second, would take 10^17 seconds, or 3.2 billion years. Yet, based on the above discussed experimental data, one in 10^20 is the highest probability that a blind search has for finding among random sequences an in vivo functional protein. This figure denotes the minimal height of the brick wall.
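The unit conversions in this comparison can be checked with a brief Python sketch. The surface area and search rate are those given in the text; the year length of 365.25 days is an assumption made here for the conversion.

```python
EARTH_SURFACE_M2 = 5.1e14          # surface of planet Earth, from the text
MM2_PER_M2 = 1_000 ** 2            # 1 m^2 = 10^6 mm^2
SEARCH_RATE_PER_S = 1_000          # squares inspected per second, from the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumed year length

squares_on_earth = EARTH_SURFACE_M2 * MM2_PER_M2
search_time_s = 1e20 / SEARCH_RATE_PER_S
search_time_years = search_time_s / SECONDS_PER_YEAR

print(f"1 mm^2 squares covering Earth: {squares_on_earth:.1e}")
print(f"time to inspect 10^20 squares: {search_time_s:.0e} s")
print(f"which is about {search_time_years / 1e9:.1f} billion years")
```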
Size of the currently known protein sequence space
One result of rapid advances in DNA sequencing technology is the acquisition of protein sequence data at an exponential rate: a recent extrapolation suggests that the number of known protein sequences will reach one trillion (10^12) in 2050 [62]. Currently, several online databases collect protein sequence information and provide various tools for data visualization and analysis. To mention just two of them: present (October 2010) SIMAP database contains over 39 million
What have we learned from these tens of millions of protein sequences originating from the genomes of more than one thousand species? When proteins of similar sequences are grouped into families, their distribution follows a
thousands of member proteins having similar sequences, while, at the other extreme, there are thousands of families with just a few members. The most numerous are “families” with only one member; these lone proteins are usually called singletons. This regularity was evident already from the analysis of 20 genomes in 2001 [66], and 83 genomes in 2003 [69]. As more sequences were added to the databases more novel families were discovered, so that according to one estimate about 180,000 families were needed for complete coverage of the sequences in the Pfam database from 2008 [71]. Another study, published in the same year, identified 190,000 protein families with more than 5 members - and additionally about 600,000 singletons - in a set of 1.9 million distinct protein sequences [73].
Novel protein sequences and scaling in
Systems having many interactive members, where the members are sometimes called nodes or vertices, are often depicted as a network in which connectivity among the members is best described by a
By plotting, on a
earthquakes of low magnitudes, and an ever decreasing number of stronger earthquakes (Fig. 2b). Moreover, based on common appearance of actors in the same movie, actors’ collaboration network also shows a
Distribution of protein families in sequenced genomes is illustrated by a similar graph (Fig. 2d). Comparable distributions have been observed with protein datasets from individual sequenced genomes [65, 80], as well as with the datasets that encompassed all sequenced genomes at various time points
The first condition that the networks of Figure 2 must fulfill is a continuous addition of new members [78]. Thus, new actors continuously appear in movies, new earthquakes happen and new scientific papers get published. Roughly one person in 10^5 acts in a movie, earthquakes make up one in fewer than 10^5 geological phenomena, and the fraction of scientific papers among all publications is higher than one in 10^5. So, to enter the respective network - to become the first point at the head of the distribution - the newcomers must overcome a barrier not higher than one against 10^5. After the entry, to become prominent the newcomers have a chance of about one in 10^5 again. Evidently, the two barriers, of entering and of becoming prominent, are comparable, give or take a few orders of magnitude. What would happen if the entry barrier were one thousand trillion (10^15) times higher? Obviously, if just one in 10^20 persons could become an actor, we would know of no actors: there would be no records of them, and analogously, there would be no records of scientific papers and earthquakes. And without the records, no one could construct distribution graphs.
The frequency of functional proteins among random sequences is at most one in 10^20 (see above). The proteins of unrelated sequences are as different as the proteins of random sequences [22, 81, 82] - and singletons by definition are exactly such unrelated proteins.
Thus, to enter the distribution graph as a newcomer (Fig. 2d), each new protein (singleton) must overcome the entry barrier of one against at least 10^20. After the entry, a singleton’s chance of becoming prominent, that is to grow into one of the largest protein families, is about one in 10^5 (Fig. 2d). Thus, it is much more difficult for a protein to become biologically functional than to become, in many variations, widespread: the entry barrier is at least fifteen orders of magnitude higher than the prominence barrier. This huge difference between the entry and prominence barriers is what makes the protein family distribution graph unique. In spite of this high entry barrier, in the sequenced genomes the protein newcomers (singletons) always represent the largest, most common, group: if it were otherwise, the distribution graph would break down. The mathematical models that incorporate data from all sequenced genomes in effect “spy” on nature [21]. With the help of one such model we have just uncovered something remarkable: in living organisms the most unlikely phenomenon can be the most common one. This feature clearly distinguishes the complexity of living organisms from the complexity of
Modeling of protein family distributions
Several research groups have attempted to model and explain various aspects of the observed
sources of singletons. In another attempt, Hughes and Liberles proposed that just gene duplication and different pseudogenisation rates between gene families were sufficient for emergence of the
Horizontal gene transfer is common in prokaryotes but rare in eukaryotes
The distribution of protein folds and domains also follows a
Dokholyan et al. have attempted to explain their protein domain universe graph (PDUG) in terms of gene duplication and sequence divergence only [21]. In their explanation, however, implicit was the assumption that in the protein structure space there were just two alternatives: the old domain and a new domain, where each one of the two domains conferred functionality to the protein regardless of the sequence divergence. That assumption is not plausible because a vast majority of proteins would be
Singletons, orphans,
In addition to the term singleton, other terms, with a similar if not synonymous meaning, have been used to denote proteins and genes having no relatives. Thus, Siew and Fischer define genomic ORFans as orphan open reading frames (ORF) with no significant sequence similarity to other ORFs [103, 104]. Wilson et al. suggest that orphans should be named “taxonomically restricted genes” (TRGs) [105, 106], and state that the abundance of orphan genes is amongst the greatest surprises uncovered by the sequencing of eukaryotic and bacterial genomes [105]. Earlier, Russell Doolittle affirmed that there are large numbers of unidentified genes in a variety of organisms, with the origin and function of these unique sequences remaining “baffling mysteries” [107].
In order to understand why the finding of singletons
Siew and Fischer succinctly described the issues at stake: “If proteins in different organisms have descended from common ancestral proteins by duplication and adaptive variation, why is that so many today show no similarity to each other?” And further: “Do these rapidly evolving ORFans correspond to nonessential proteins or to species determinants?” [103].
A recent study, based on 573 sequenced bacterial genomes, has concluded that the entire pool of bacterial genes - the bacterial
[112]. The trend towards higher numbers of singletons per genome seems to coincide with a higher proportion of the eukaryotic genomes sequenced. In other words, eukaryotes generally contain a larger number of singletons than eubacteria and archaea.
When a relative of a singleton is found, together the two proteins create a family. In the absence of biochemical data, nothing can be said about the biological function of that protein family as long as no established domain or structural motif is discernible from the amino acid sequences. Such proteins of obscure function, or POFs, make up about 25% of the proteins found in each genome [113, 114]. POFs tend to be shorter than the proteins of defined function [114].
Today, almost ten years since the announcement of the first draft of the human genome sequence, no structural assignment is available for about 38% of human proteins [64]: at present we thus lack basic information about a large fraction of the proteins of human proteome [115]. In the initial publications on the sequence of the human genome, functional characterization of all proteins was recognized as one of the research priorities [116, 117], because understanding human biology is impossible without understanding the function of each individual protein. Subsequently, Richard Roberts called for a
observed folds [120, 121]. It should be noted that although a solved protein 3D structure represents an important piece of information, alone it is insufficient or even misleading for functional characterization of that protein
Cumulative changes in the total number of identified singletons, and their abundance in relation to other protein sequence families, can be followed from the studies that have periodically summarized advances in sequencing of the genomes of various species. Thus, in 2003, based on the data from 83 genomes, Enright et al. [69] identified 41,133 singletons from a total of 449,033 protein sequences. In this dataset the singletons made up 9.2% of all proteins. By dividing the number of singletons by the number of genomes (41,133/83), we can see that there were on average 495 singletons in each genome. Interestingly, the same study reported that just 48 protein families were common to the genomes of all species. In this dataset, therefore, on average the unique proteins outnumber the common proteins by an order of magnitude (495 versus 48).
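The per-genome averages in this paragraph are simple ratios. They can be reproduced with the following Python sketch, which uses only the 2003 counts quoted above from Enright et al. [69]; the variable names are illustrative.

```python
# Counts quoted in the text from Enright et al. [69].
singletons = 41_133
sequences = 449_033
genomes = 83
common_families = 48

singletons_per_genome = singletons / genomes
singleton_share = singletons / sequences
unique_to_common = singletons_per_genome / common_families

print(f"singletons per genome: {singletons_per_genome:.1f}")
print(f"share of all sequences: {singleton_share:.1%}")
print(f"unique versus common proteins: {unique_to_common:.1f} to 1")
```

The same division applies to any later dataset once its singleton and genome counts are known, which allows a like-for-like comparison across studies.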
Based on the data from 120 sequenced genomes, in 2004 Grant et al. reported on the presence of 112,000 singletons within 600,000 sequences [96]. This corresponds to 933 singletons per genome. In 2005, Orengo and Thornton reported on the presence of about 150,000 singletons in 150 sequenced genomes [72]. In 2006, within 203 sequenced genomes and 633,546 non-identical sequences Marsden et al. identified 158,798 singletons [97]; thus the singletons made 24% of all sequences and there were on average 782 singletons in each genome. In 2008, Yeats et al. [73] found around 600,000 singletons in 527 species - 50 eukaryotes, 437 eubacteria and 39 archaea - corresponding to 1,139 singletons per species. No information about the number of singletons is available in the most recent summary of the data from over 1100 sequenced genomes encompassing nearly 10 million sequences [64]. In spite of the missing recent data on singletons, the results of the above calculations are sufficient for an unambiguous conclusion: each species possesses hundreds, or even thousands, of unique genes - the genes that are not shared with any other species. This conclusion is in full agreement with the
Singletons as species determinants
A mere idea about the existence of
Figure 3 shows how the number of unique genes (singletons), expressed as an average per sequenced genome, changed with the total number of the genomes sequenced. Evidently, the number of singletons tends to increase, from several hundreds to more than one thousand. The presence of a large number of unique genes in each species represents a new biological reality. Moreover, the singletons as a group appear to be the most distinctive constituent of all individuals of one species, because that group of singletons is lacking in all individuals of all other species. The conclusion that the singletons are the determinants of the biological phenomenon of species then follows logically. In System of Logic, John Stuart Mill outlined his Second Canon or Method of Difference [133]: “If an instance in which the phenomenon under investigation occurs, and an instance in which it does not occur, have every circumstance in common save one, that one occurring only in the former; the circumstance in which alone the two instances differ, is the effect, or the cause, or an indispensable part of the cause, of the phenomenon.”
Until recently, most attention has been paid to the genes that are shared among species, rather than to those that are different. But when the unique genes are studied, they are found to be the ones that are crucial for the very character of the species, or the whole taxon
Folding of proteins – domains are not basic units of evolution
Structural annotation of proteins from newly sequenced genomes is typically successful for about 50% of all proteins [58, 64, 70, 128]. At first, this result seems surprising in view of the statements about near completeness, or 100% completeness, of the inventory of protein folds [27, 29, 137, 138]. In fact, that success rate is in accordance with the notion that many proteins with unrelated sequences acquire essentially the same 3D structure, as discussed above. The proteins of partially or largely disordered structure, as well as membrane proteins, also contribute to this group of
The amino acid sequence of a protein determines its structure, which in turn determines its function. In a cell, the structure forms mostly spontaneously by an interplay of attractive and repulsive forces among amino acid side chains, between them and the backbone and among various parts of the backbone, with the participation of hydrophobic interactions, hydrogen bonds, ionic bonds and van der Waals interactions
As a solution to the problem of limited CPU power for predicting the structure of a protein from its sequence, researchers have developed a scientific discovery game, Foldit. The game integrates human visual
The idea that protein domains represent conserved units of evolution [72, 108,
That hypothesis - that evolution strives to preserve a protein domain once it stumbles upon it - contradicts the
Conclusions
The huge amount of DNA sequence data accumulated over the past decade has provided key insights about uniqueness of living organisms. The most important insight is that the genome of each species contains hundreds, or even thousands, of unique genes - the genes that are not shared with any other species. The origin of species is thus intrinsically related to these unique genes.
Each unique gene, and accordingly each novel functional protein encoded by that gene, however, represents a major problem for evolutionary theory because unique proteins are as unrelated as the proteins of random sequences - and among random sequences functional proteins are exceedingly rare. Experimental data reviewed here suggest that at most one functional protein can be found among 10^20 proteins of random sequences. Hence every discovery of a novel functional protein (singleton) testifies to the successful overcoming of the probability barrier of one against at least 10^20, the probability defined here as a “macromolecular miracle”. More than one million such “macromolecular miracles” are present in the genomes of about two thousand species sequenced thus far. Assuming that this correlation will hold with the rest of about 10 million different species that live on Earth [157], the total number of “macromolecular miracles” in all genomes could reach 10 billion. These 10^10 unique proteins would still represent a tiny fraction of the 10^470 possible proteins of the median eukaryotic size.
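The extrapolation in this paragraph is a linear scaling of the observed average. A rough Python sketch follows; it takes "more than one million" as exactly one million and "about two thousand" as exactly two thousand, both simplifying assumptions, so the result is a lower-end estimate of the same order of magnitude as the 10 billion quoted above.

```python
import math

# Rounded inputs taken from the text; the exact values are assumptions.
singletons_found = 1_000_000      # "more than one million"
species_sequenced = 2_000         # "about two thousand species"
species_on_earth = 10_000_000     # "about 10 million different species" [157]

singletons_per_species = singletons_found / species_sequenced
extrapolated_total = singletons_per_species * species_on_earth

print(f"average singletons per species: {singletons_per_species:.0f}")
print(f"extrapolated total: {extrapolated_total:.0e}")
print(f"order of magnitude: 10^{round(math.log10(extrapolated_total))}")
```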
If just 200 unique proteins are present in each species, the probability of their simultaneous appearance is one against at least 10^4,000. Probabilistic resources of our universe are much, much smaller; they allow for a maximum of 10^149 events [158] and thus could account for a
Evolutionary biologists of earlier generations have not anticipated [164, 165] the challenge that singletons pose to contemporary biologists. By discovering millions of unique genes biologists have run into brick walls similar to those hit by physicists with the discovery of quantum phenomena. The predominant viewpoint in biology has become untenable: we are witnessing a scientific revolution of unprecedented proportions.
References
1. Guye CE (1942) L’Evolution
2. Lecomte du Noüy P (1949) The Road to Reason. Longmans, Green and Co. (New York, London, Toronto).
3. Lecomte du Noüy P (1947) Human Destiny. Longmans, Green and Co. (New York, London, Toronto).
4. Staudinger H (1953) Macromolecular chemistry. In: Nobel Lectures, Chemistry 1942-1962. Elsevier Publishing Company, 1964 (Amsterdam) pp.
5. Asimov I (1957, with new material 1976) Only a Trillion. ACE Books (New York) pp
6. Moorhead PS, Kaplan MM, eds. (1967) Mathematical Challenges to the Neo-Darwinian Interpretation of Evolution. Wistar Institute Press (Philadelphia).
7. Salisbury FB (1969) Natural selection and the complexity of the gene. Nature 224: 342-343.
8. Townes CH (1998) Logic and uncertainties in science and religion. In: Science and Religion: The New Consonance. Peters T, ed. Westview Press, Inc. pp.
9. Dryden DTF, Thomson AR, White JH (2008) How much of protein sequence space has been explored by life on Earth? J R Soc Interface 5:
10. Walter KU, Vamvaca K, Hilvert D (2005) An active enzyme constructed from a 9-amino acid alphabet. J Biol Chem 280:
11.Tanaka J, Doi N, Takashima H, Yanagawa H (2010) Comparative characterization of
19:
12.Brocchieri L, Karlin S (2005) Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 33:
13.Rost B (2002) Did evolution leap to create the protein universe? Curr Opin Struct Biol
12:
14.Bang ML, Centner T, Fornoff T, Geach AJ, Gotthardt M, McNabb M, Witt CC, Labeit D, Gregorio CC, Granzier H, Labeit S (2001) The complete gene sequence of titin, expression of an unusual approximately
15.Lau KF, Dill KA (1990) Theory for protein mutability and biogenesis. Proc Natl Acad Sci USA
16.Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:
17.Orengo CA, Michie AD, Jones DT, Swindells MB, Thornton JM (1997) CATH: a hierarchic classification of protein domain structures. Structure 5:
18.Sippl MJ (2009) Fold space unlimited. Curr Opin Struct Biol
19.Sadreyev RI, Kim BH, Grishin NV (2009)
20.Redfern OC, Dessailly B, Orengo CA (2008) Exploring the structure and function paradigm. Curr Opin Struct Biol 18:
21.Dokholyan NV, Shakhnovich B, Shakhnovich EI (2002) Expanding protein universe from the biological Big Bang. Proc Natl Acad Sci USA 99:
22.Pearson WR, Sierk ML (2005) The limits of protein sequence comparison? Curr Opin Struct Biol 15:
23.Taylor WR (2007) Evolutionary transitions in protein fold space. Curr Opin Struct Biol
24.Valas RE, Yang S, Bourne PE (2009) Nothing about protein structure classification makes sense except in the light of evolution. Curr Opin Struct Biol 19:
25.Suhrer SJ, Wiederstein M, Gruber M, Sippl MJ (2009) COPS – a novel workbench for explorations in the fold space. Nucleic Acids Res 37:
26.Sippl MJ, Suhrer SJ, Gruber M, Wiederstein M (2008) A discrete view on fold space. Bioinformatics
27.
protein folds. PLoS Computational Biology 6:e1000750. doi:10.1371/journal.pcbi.1000750.
28.
29.Skolnick J, Arakaki AK, Lee SY, Brylinski M (2009) The continuity of protein structure space is an intrinsic property of proteins. Proc Natl Acad Sci USA 106:15690-15695. doi:10.1073/pnas.0907683106.
30.Rackovsky S (2009) Sequence physical properties encode the global organization of protein structure space. Proc Natl Acad Sci USA
31.Gao J, Li Z (2010) Uncover the conserved property underlying
32.Cheng H, Kim BH, Grishin NV (2008) MALISAM: a database of structurally analogous motifs in proteins. Nucleic Acids Res 36:
33.Rost B (1997) Protein structures sustain evolutionary drift. Fold Des 2:
34.Kosloff M, Kolodny R (2007)
35.Hasegawa H, Holm L (2009) Advances and pitfalls of protein structural alignment. Curr Opin Struct Biol 19:
36.Murzin AG (2008) Metamorphic proteins. Science 320:
37.Bryan PN, Orban J (2010) Proteins that switch folds. Curr Opin Struct Biol
38.Gambin Y, Schug A, Lemke EA, Lavinder JL, Ferreon ACM, Magliery TJ, Onuchic JN, Deniz AA (2009) Direct
39.Colby DW, Prusiner SB (2011) Prions. Cold Spring Harb Perspect Biol 3:a006833. doi:10.1101/cshperspect.a006833.
40.Dyson HJ, Wright PE (2005) Intrinsically unstructured proteins and their functions. Nature Reviews: Molecular Cell Biology
41.Bloom JD, Arnold FH (2009) In the light of directed evolution: pathways of adaptive protein evolution. Proc Natl Acad Sci USA 106:
42.Alexander PA, He Y, Chen Y, Orban J, Bryan PN (2009) A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci USA 106:
43.Behe MJ, Snoke DW (2004) Simulating evolution by gene duplication of protein features that require multiple amino acid residues. Protein Sci 13:
44.Lynch M (2010) Scaling expectations for the time to establishment of complex adaptations. Proc Natl Acad Sci USA 107:
45.Lynch M, Abegg A (2010) The rate of establishment of complex adaptations. Mol Biol Evol 27:
46.Axe DD (2010) The limits of complex adaptation: an analysis based on a simple model of structured bacterial populations.
47.Zeldovich KB, Shakhnovich EI (2008) Understanding protein evolution: from protein physics to Darwinian selection. Annu Rev Phys Chem 59:
48.Yockey HP (1977) A calculation of the probability of spontaneous biogenesis by information theory. J Theor Biol
49.
50.Axe DD (2004) Estimating the prevalence of protein sequences adopting functional enzyme folds. J Mol Biol
51.Eden M (1967) Inadequacies of
52.
53.Axe DD (2010) The case against a Darwinian origin of protein folds.
54.Taylor SV, Walter KU, Kast P, Hilvert D (2001) Searching sequence space for protein catalysts. Proc Natl Acad Sci USA
55.Keefe AD, Szostak JW (2001) Functional proteins from a
56.Stomel JM, Wilson JW, Leon MA, Stafford P, Chaput JC (2009) A
57.Silverman JA, Balakrishnan R, Harbury PB (2001) Reverse engineering the (β/α)8 barrel fold. Proc Natl Acad Sci USA
58.Dessailly BH, Nair R, Jaroszewski L, Fajardo JE, Kouranov A, Lee D, Fiser A, Godzik A, Rost B, Orengo C (2009)
59.Kelly WP, Stumpf MPH (2008)
analyses. Curr Opin Biotechnol 19:
60.Figeys D (2008) Mapping the human protein interactome. Cell Research 18:
61.Crick F (1981) Life itself, Its Origin and Nature. Simon and Schuster (New York), pp. 51.
62.Levitt M (2009) Nature of the protein universe. Proc Natl Acad Sci USA 106: 11079-11084. doi:10.1073/pnas.0905029106.
63.Rattei T, Tischler P, Götz S, Jehl MA, Hoser J, Arnold R, Conesa A, Mewes HW (2010) SIMAP – a comprehensive database of
64.Lees J, Yeats C, Redfern O, Clegg A, Orengo C (2010) Gene3D: merging structure and function for a thousand genomes. Nucleic Acids Res 38:
65.Huynen MA, van Nimwegen E (1998) The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 15:
66.Qian J, Luscombe NM, Gerstein M (2001) Protein family and fold occurrence in genomes:
67.Luscombe NM, Qian J, Zhang Z, Johnson T, Gerstein M (2002) The dominance of the population by a selected few:
68.Unger R, Uliel S, Havlin S (2003) Scaling law in sizes of protein sequence families: from
69.Enright AJ, Kunin V, Ouzounis CA (2003) Protein families and TRIBES in genome sequence space. Nucleic Acids Res 31:
70.Lee D, Grant A, Marsden RL, Orengo C (2005) Identification and distribution of protein families in 120 completed genomes using Gene3D. Proteins 59:
71.Sammut SJ, Finn RD, Bateman A (2008) Pfam 10 years on: 10 000 families and still growing. Brief Bioinform 9:
72.Orengo CA, Thornton JM (2005) Protein families and their evolution – a structural perspective. Annu Rev Biochem 74: doi:10.1146/annurev.biochem.74.082803.133029.
73.Yeats C, Lees J, Reid A, Kellam P, Martin N, Liu X, Orengo C (2008) Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res 36:
74.Adamic LA. Zipf,
75.Huberman BA, Adamic LA (1999) Growth dynamics of the
76.Makse HA, Havlin S, Stanley HE (1995) Modelling urban growth patterns. Nature 377: 608.
77.Redner S (1998) How popular is your paper? An empirical study of the citation distribution. Eur Phys J B 4:
78.Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286:
79.Gisiger T (2001) Scale invariance in biology: coincidence or footprint of a universal mechanism? Biol Rev 76:
80.Wuchty S (2001)
81.Lavelle DT, Pearson WR (2010) Globally, unrelated protein sequences appear random. Bioinformatics 26:
82.Weber C, Barton GJ (2001) Estimation of
83.Karev GP, Wolf YI, Rzhetsky AY, Berezovskaya FS, Koonin EV (2002) Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol Biol 2:18.
84.Karev GP, Wolf YI, Koonin EV (2003) Simple stochastic birth and death models of genome evolution: was there enough time for us to evolve? Bioinformatics 19: 1889-1900.
85.Karev GP, Wolf YI, Berezovskaya FS, Koonin EV (2004) Gene family evolution: an
86.Karev GP, Berezovskaya FS, Koonin EV (2005) Modeling genome evolution with a diffusion approximation of a
87.Novozhilov AS, Karev GP, Koonin EV (2006) Biological applications of the theory of
88.Hughes T, Liberles DA (2008) The
89.Boto L (2010) Horizontal gene transfer in evolution: facts and challenges. Proc R Soc B 277:
90.Keeling PJ, Palmer JD (2008) Horizontal gene transfer in eukaryotic evolution. Nature Rev Genet 9:
91.Lercher MJ, Pal C (2007) Integration of horizontally transferred genes into regulatory interaction networks takes many million years. Mol Biol Evol 25:
92.Keeling PJ (2009) Functional and ecological impacts of horizontal gene transfer in eukaryotes. Curr Opin Genet & Develop 19:
93.Ragan MA, Beiko RG (2009) Lateral genetic transfer: open issues. Phil Trans R Soc B 364:
94.McDaniel LD, Young E, Delaney J, Ruhnau F, Ritchie KB, Paul JH (2010) High frequency of horizontal gene transfer in the oceans. Science 330: 50. doi:10.1126/science.1192243.
95.Coulson AFW, Moult J (2002) A unifold, mesofold, and superfold model of protein fold use. Proteins 46:
96.Grant A, Lee D, Orengo C (2004) Progress towards mapping the universe of protein folds. Genome Biol 5:107.
97.Marsden RL, Lee D, Maibaum M, Yeats C, Orengo CA (2006) Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space. Nucleic Acids Res
98.Daugherty PS, Chen G, Iversen BL, Georgiou G (2000) Quantitative analysis of the effect of the mutation frequency on the affinity maturation of single chain Fv antibodies. Proc Natl Acad Sci
99.Drummond DA, Iversen BL, Georgiou G, Arnold FH (2005) Why
100.Bloom JD, Silberg JJ, Wilke CO, Drummond DA, Adami C, Arnold FH (2005) Thermodynamic prediction of protein neutrality. Proc Natl Acad Sci 102:
101.Kunichika K, Hashimoto Y, Imoto T (2002) Robustness of hen lysozyme monitored by random mutations. Protein Engineering 15:
102.Zeldovich KB, Chen P, Shakhnovich BE, Shakhnovich EI (2007) A
103.Siew N, Fischer D (2003) Twenty thousand ORFan microbial protein families for the Biologist? Structure
104.Siew N, Fischer D (2003) Unravelling the ORFan puzzle. Comp Funct Genom 4: 432-441. doi:10.1002/cfg.311.
105.Wilson GA, Bertrand N, Patel Y, Hughes JB, Feil EJ, Field D (2005) Orphans as taxonomically restricted and ecologically important genes. Microbiology 151: 2499-2501.
106.Wilson GA, Feil EJ, Lilley AK, Field D (2007)
107.Doolittle RF (2002) Microbial genomes multiply. Nature 416:
108.Chothia C, Gough J, Vogel C, Teichmann SA (2003) Evolution of the protein repertoire. Science 300:
109.Chothia C (1992) Proteins. One thousand families for the molecular biologist. Nature 357:
110.The Gene Ontology Consortium (2000) Gene Ontology: tool for the unification of biology. Nature Genetics
111.Lapierre P, Gogarten JP (2009) Estimating the size of the bacterial
112.Yooseph S, Sutton G, Rusch DB et al. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biology 5:e16. doi: 10.1371/journal.pbio.0050016.
113.Gollery M, Harper J, Cushman J, Mittler T, Mittler R (2007) POFs: what we don’t know can hurt us. Trends in Plant Science
114.Gollery M, Harper J, Cushman J, Mittler T, Girke T, Zhu JK,
115.Hanson AD, Pribat A, Waller JC, de
116.International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:
117.Venter JC et al. (2001) The sequence of the human genome. Science 291:
118.Roberts RJ (2004) Identifying protein function: a call for community action. PLoS Biology
119.Lespinet O, Labedan B (2006) Orphan enzymes could be an unexplored reservoir of new drug targets. Drug Discovery Today 11:
120.Siew N, Fischer D (2004) Structural biology sheds light on the puzzle of genomic ORFans. J Mol Biol 342:
121.Jaroszewski L, Li Z, Krishna SS, Bakolitsa C et al. (2009) Exploration of uncharted regions of the protein universe. PLoS Biol 7(9): e1000205. doi:10.1371/journal.pbio.1000205.
122.Wong WC,
sequence homology. PLoS Comput Biol 6:e1000867. doi: 10.1371/journal.pcbi.1000867.
123.Gerlt JA (2007) A protein structure (or function?) initiative. Structure 15:
124.Omelchenko MV, Galperin MY, Wolf YI, Koonin EV (2010)
125.Raes J, Harrington ED, Singh AH, Bork P (2007) Protein function space: viewing the limits or limited by our view? Curr Opin Struct Biol 17:
126.Long M (2001) Evolution of novel genes. Curr Opin Struct Biol 11:
127.Nahon JL (2003) Birth of
118:
128.Marsden RL, Lewis TA, Orengo CA (2007) Towards a comprehensive structural coverage of completed genomes: a structural genomics viewpoint. BMC Bioinformatics 8:86.
129.Schmidt EE, Davies CJ (2007) The origins of polypeptide domains. Bioessays 29: 262-270. doi: 10.1002/bies.20546.
130.Kaessmann H (2010) Origins, evolution, and phenotypic impact of new genes. Genome Res 20:
131.Marsden RL, Ranea JAG, Sillero A, Redfern O, Yeats C, Maibaum M, Lee D, Addou S, Reeves GA, Dallman TJ, Orengo CA (2006) Exploiting protein structure data to explore the evolution of protein function and biological complexity. Phil Trans R Soc 361:
132.
133.Mill JS (1882) A System of Logic, Ratiocinative And Inductive, Eighth Edition. Harper & Brothers (New York) [Ebook 27942] pp.483.
134.Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TCG (2009) More than just orphans: are
25:
135.Lin H, Moghe G, Ouyang S, Iezzoni A, Shiu SH, Gu X, Buell CR (2010) Comparative analyses reveal distinct sets of
136.Johnson BR, Tsutsui ND (2011) Taxonomically restricted genes are associated with the evolution of sociality in the honey bee. BMC Genomics 12:164.
137.Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, Orengo CA (2009) The CATH classification revisited – architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res
138.Zhang Y, Hubner IA, Arakaki AK, Shakhnovich E, Skolnick J (2006) On the origin and highly likely completeness of
139.Dill KA, Ozkan SB, Shell MS, Weikl TR (2008) The protein folding problem. Annu Rev Biophys
140.Fersht AR (2008) From the first protein structures to our current knowledge of protein folding: delights and scepticism. Nature Reviews Molecular Cell Biology 9:
141.Dill KA, Ozkan SB, Weikl TR, Chodera JD, Voelz VA (2007) The protein folding problem: when will it be solved? Curr Opin Struct Biol 17:
142.Kim DE, Blum B, Bradley P, Baker D (2009) Sampling bottlenecks in de novo protein structure prediction. J Mol Biol 393:
143.Cooper S, Khatib F, Treuille A, Barbero J, Lee J, Beenen M,
144.Cooper S, Treuille A, Barbero J et al. (2010) The challenge of designing scientific discovery games. Proceedings of the
145.Koder RL, Dutton PL (2006) Intelligent design: de novo engineering of proteins with specified functions. Dalton Trans
146.Butterfoss GL, Kuhlman B (2005)
Annu Rev Biophys Biomol Struct 35:
147.Jha RK,
148.Schmidt am Busch M, Sedano A, Simonson T (2010) Computational protein design: validation and possible relevance as a tool for homology searching and fold recognition. PLoS ONE 5:e10410. doi:10.1371/journal.pone.0010410.
149.Leisola M, Turunen O (2007) Protein engineering: opportunities and challenges. Appl Microbiol Biotechnol 75:
150.Liu S, Liu S, Zhu X, Liang H, Cao A, Chang Z, Lai L (2007) Nonnatural protein-protein
151.Gough J (2005) Convergent evolution of domain architecture (is rare). Bioinformatics 21:
152.Vogel C, Teichmann SA,
duplication and recombination. J Mol Biol 346:
153.Han JH, Batey S, Nickson AA, Teichmann SA, Clarke J (2007) The folding and evolution of multidomain proteins. Nature Rev Mol Cell Biol
154.
155.Apic G, Russell RB (2010) Domain recombination: a workhorse for evolutionary innovation. Sci Signal 3: pe30. doi:10.1126/scisignal.3139pe30.
156.Lynch M (2007) The frailty of adaptive hypotheses for the origins of organismal complexity. Proc Natl Acad Sci USA 104:
157.Wilson EO (2003) The encyclopedia of life. Trends Ecol Evol 18:
158.Abel DL (2009) The universal plausibility metric (UPM) & principle (UPP). Theor Biol Med Model 6:27.
159.Ayala F, Escalante A, O’Huigin C, Klein J (1994) Molecular genetics of speciation and human origins. Proc Natl Acad Sci 91:
160.Coyne J, Orr HA (2004) Speciation, Sinauer Associates, Sunderland, USA.
161.Phadnis N, Orr HA (2009) A single gene causes both male sterility and segregation distortion in Drosophila hybrids. Science
162.Mihola O, Trachtulec Z, Vlcek C, Schimenti JC, Forejt J (2008) A mouse speciation gene encodes a meiotic histone H3 methyltransferase. Science 323:
163.Nosil P, Schluter D (2011) The genes underlying the process of speciation. Trends Ecol Evol 26:
164.Smith JM (1970) Natural selection and the concept of a protein space. Nature 225: 563-564.
165.Jacob F (1977) Evolution and tinkering. Science 196:
Figure 1. A pair of proteins of similar sequences but different structures. [Image panels A, B, C and D not reproduced.]
Figure 2. [Four-panel plot not reproduced. Recoverable axis labels: A, number of papers cited versus number of citations per paper; B, number of earthquakes per year versus earthquake magnitude (Richter scale); C, probability P(k) versus actor's connections, k; D, number of families of given size (%) versus protein family size.]
Figure 3. The average number of singletons present in the genome of one species. The values were obtained by dividing the number of singletons by the number of sequenced genomes as reported at various time points. [Plot not reproduced. Vertical axis: number of singletons per genome, scale 200 to 1200. Horizontal axis: number of sequenced genomes, at 83, 120, 150, 203 and 527.]
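The averaging described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used to prepare the figure: the function name average_singletons_per_genome and the toy input values are assumptions introduced here, and the toy numbers are not data from this paper or from its references.

```python
from typing import Sequence


def average_singletons_per_genome(
    singleton_totals: Sequence[float],
    genome_counts: Sequence[int],
) -> list[float]:
    """Divide each reported singleton total by the matching genome count.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, not taken from the paper.
    """
    if len(singleton_totals) != len(genome_counts):
        raise ValueError("need one singleton total per genome count")
    averages = []
    for total, genomes in zip(singleton_totals, genome_counts):
        if genomes <= 0:
            raise ValueError("genome count must be positive")
        averages.append(total / genomes)
    return averages


if __name__ == "__main__":
    # Toy inputs for illustration only; these are NOT the paper's data.
    toy_genome_counts = [10, 20, 40]
    toy_singleton_totals = [5000, 12000, 30000]
    averages = average_singletons_per_genome(
        toy_singleton_totals, toy_genome_counts
    )
    for genomes, avg in zip(toy_genome_counts, averages):
        print(f"{genomes} genomes: {avg:.1f} singletons per genome")
```

With real inputs, each time point would pair the total number of singletons reported in a given study with the number of genomes sequenced at that time, so that a rising, flat, or falling average can be read directly from the returned list.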