Journal of Virology, November 2005, p. 14095-14101, Vol. 79, No. 22
0022-538X/05/$08.00+0 doi:10.1128/JVI.79.22.14095-14101.2005
Copyright © 2005, American Society for Microbiology. All Rights Reserved.
Information Génomique et Structurale, UPR CNRS 2589, 31 Chemin Joseph-Aiguier, 13402 Marseille Cedex 20, France
Received 27 June 2005/ Accepted 8 August 2005
The recent discovery (16) and subsequent genome sequencing (29) of the largest known virus to date, Acanthamoeba polyphaga Mimivirus, has raised a number of fundamental questions about what had been thought to be established boundaries between viruses and cellular life forms (5, 7, 14). In particular, the size of the mimivirus virion is comparable to that of a mycobacterium. Its genome, containing close to 1.2 million nucleotides (nt) and coding for 911 predicted proteins, holds more than twice as much genetic information as small bacteria find sufficient for life. Moreover, the mimivirus genome hosts a wide spectrum of genes that have never been found in such combination in a virus, in particular, a large set of genes related to protein transcription and translation. On the other hand, what is rather common for a viral genome is the fact that a large fraction of the mimivirus genes display only weak or no homology to any other known genes in the databases. Raoult et al. (29) were able to assign putative functions to only one-third (298/911) of the mimivirus genes, while this ratio is much higher for the genomes of all fully sequenced "living" organisms.
Here, I set out to investigate how many of these genes of unknown origin may have been generated through duplication processes within the mimivirus genome itself and how these duplications may then have shaped the mimivirus genome. The aim of this work was to identify and characterize events of gene and genome duplication in the mimivirus genome in order to shed new light on the origin of the mimivirus' exceptionally large size and on the importance of gene duplication in large DNA viruses in general. I report evidence for an ancient event of duplication of a large part of the mimivirus chromosome, as well as for numerous tandem gene duplication events, and I will show that some of these duplication events may play a role in virus-host adaptation.
Remote protein homology detection was done by pairwise Hidden Markov Model (HMM) comparison using the HHsearch package (32), together with HMMs based on multiple alignments from the conserved domain database CDD (20), i.e., COG (33), SMART (17), PFAM (3), and SCOP (22). Multiple alignments of the paralogous genes were computed using the latest version of the T-Coffee package with advanced alignment options (23, 27, 28). Secondary-structure predictions from PSIPRED (13) were included in the HMM-HMM comparison as described previously (32). Results of the HMM search and multiple alignments are available at http://igs-server.cnrs-mrs.fr/suhre/mimiparalogues/.
Genome sequences of all fully sequenced viral genomes (as of November 2004) were downloaded from the National Center for Biotechnology Information viral genomes project (2) at http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html. All 223 genomes with more than 50 annotated genes were included in the analysis.
A total of 347 paralogous genes in 77 families were detected by this method when a conservative detection cutoff e-value in the (PSI-)BLAST search of 10⁻¹⁰ was applied. When a more permissive (10⁻⁵) or a more stringent (10⁻²⁵) e-value was used, 398 and 244 paralogous genes in 86 and 58 families, respectively, were detected. Thus, between 26.3% and 35.0% of the mimivirus genes have at least one homologue in the virus' genome, depending on the choice of the e-value cutoff. To test for a possible dependence on gene annotation, the mimivirus genome was split into nonoverlapping segments 1,000 nucleotides in length. These segments were compared to segments of the same size but overlapping each other by 50%, using BLAST at the nucleotide level (BLASTN) and at the amino acid level after translation in all six reading frames (TBLASTX). The results were comparable to those found using BLAST at the gene level (BLASTP) in regard to our conclusions with respect to the overall genome and gene duplications, except that these methods were less sensitive and yielded fewer hits at lower sequence identity levels, especially in the BLASTN case. As these computations did not reveal any unexpected new insights but confirmed the robustness of the approach with respect to the applied detection algorithm (BLASTP), BLASTN and TBLASTX results will not be further presented in this paper.
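The annotation-independent check described above can be sketched as follows. This is a minimal illustration, not the script used for the paper: the function name `split_windows`, the handling of the trailing partial segment, and the toy sequence are all assumptions introduced here.

```python
def split_windows(sequence, size=1000, overlap=0.0):
    """Return (start, segment) pairs of fixed-size windows.

    overlap is the fraction shared by consecutive windows: 0.0 gives
    nonoverlapping segments, 0.5 gives segments overlapping by 50%.
    A shorter trailing segment is kept (an assumption; the paper does
    not say how the chromosome end was handled).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(round(size * (1.0 - overlap))))
    windows = []
    for start in range(0, len(sequence), step):
        windows.append((start, sequence[start:start + size]))
        if start + size >= len(sequence):
            break  # the last window already reaches the sequence end
    return windows


# Toy 2,500-nt sequence standing in for the genome (not real data).
toy = "ACGT" * 625
queries = split_windows(toy, size=1000, overlap=0.0)
subjects = split_windows(toy, size=1000, overlap=0.5)
print([start for start, _ in queries])   # window starts of the query set
print([start for start, _ in subjects])  # window starts of the subject set
```

Each query segment would then be compared against the half-overlapping subject segments with BLASTN or TBLASTX; running BLAST itself is outside the scope of this sketch.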
The orientation and location of gene duplication events are not random. The mimivirus genome is coded on a linear chromosome that may adopt a circular topology through noncovalent interactions between two 900-nt-long repeated sequences near the chromosome ends, as observed in some other large DNA viruses (29). The fraction of duplicated genes that are inserted in parallel orientation to the coding direction of the matching gene (cis) is, at 20.2% (22.1% for e = 10⁻⁵; 16.4% for e = 10⁻²⁵), nearly twice as high as the fraction of genes that are duplicated in antiparallel orientation (trans), which is 11.7% (12.9% for e = 10⁻⁵; 9.95% for e = 10⁻²⁵). Sixty-one percent of all pairs of genes that are duplicated in trans are located on different halves of the mimivirus chromosome, whereas 79% of the duplications in cis occur on the same chromosome half. A large number of tandem, or near-tandem, gene duplications were detected, the most striking case consisting of an 11-fold duplication of genes L175 to L185 (dubbed Lcluster here; see below for gene locations and orientations of the largest families of paralogues). The overall trend is that cis duplications are more localized (often tandem or near tandem), while trans duplications are more likely to occur across the chromosome center. This trend becomes visible when corresponding best-matching pairs that are duplicated in cis and trans, respectively, are connected (Fig. 1).
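The cis/trans and chromosome-half bookkeeping described above could be tallied along the following lines. This is a sketch under stated assumptions: strand is read from the leading L or R of the gene name (the naming convention used in the gene lists of this article), the chromosome length is taken from the end coordinate quoted in the Fig. 2 legend, and the gene names and coordinates in the example are placeholders, not real mimivirus data.

```python
from collections import Counter

# End coordinate quoted in the Fig. 2 legend, used here as the chromosome length.
GENOME_LENGTH = 1_181_404


def orientation(name_a, name_b):
    """Return 'cis' if both genes lie on the same strand, else 'trans'."""
    return "cis" if name_a[0] == name_b[0] else "trans"


def same_half(pos_a, pos_b, genome_length=GENOME_LENGTH):
    """True if both positions fall on the same side of the chromosome center."""
    midpoint = genome_length / 2
    return (pos_a < midpoint) == (pos_b < midpoint)


def tally(pairs):
    """Count gene pairs by orientation and chromosome half.

    pairs: iterable of (name_a, pos_a, name_b, pos_b).
    """
    counts = Counter()
    for name_a, pos_a, name_b, pos_b in pairs:
        half = "same half" if same_half(pos_a, pos_b) else "different halves"
        counts[(orientation(name_a, name_b), half)] += 1
    return counts


# Placeholder gene names (numbers above 911 do not exist in the genome)
# with made-up coordinates, for illustration only.
example_pairs = [
    ("L1001", 150_000, "L1002", 151_500),
    ("L1003", 100_000, "R1004", 260_000),
    ("L1005", 300_000, "R1006", 1_050_000),
]
result = tally(example_pairs)
for key, count in sorted(result.items()):
    print(key, count)
```

Dividing each count by the total for its orientation gives fractions of the kind quoted in the text (for example, the share of trans pairs that span both chromosome halves).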
FIG. 1. Correspondence between homologous genes in the mimivirus genome; shown are the best BLASTP hits for all 911 mimivirus genes (e-value = 10⁻¹⁰). Blue lines indicate gene pairs that are duplicated in parallel orientation (cis), that is, genes that are either both found on the strand coding in the positive direction or both on the strand coding in the negative direction; red lines correspond to genes duplicated in antiparallel orientation (trans). The top frame and the third frame from the top connect homologous genes represented by their positions on the chromosome; the second frame from the top indicates the chromosomal locations of all duplicated genes; the bottom frame gives the chromosomal fractions that are duplicated in parallel (blue), in antiparallel direction (red), and both (black) averaged over a window size of 25,000 nucleotides. Green lines indicate the positions of the three tRNA-Leu genes that were identified in the mimivirus genome.
200,000-nt-long telomeric chromosome fraction, followed by a rearrangement (immediately or later) around its center. Interestingly, three tRNA-Leu genes are found duplicated in concert with this event(s). They are highly conserved (displaying only four point mutations), while the adjacent genome regions accumulated such a large number of mutations that homology at the nucleotide level has become difficult to identify. Figure 3 shows the frequency distribution of all gene duplication events. A pronounced maximum for trans duplications is observed at a sequence identity level of 25%, which characterizes the segmental gene duplication as a more ancient event. cis duplications also peak at this value and are likely to correspond to older tandem duplication events. A second pronounced maximum at the 50% sequence identity level for cis duplications suggests a more recent origin for the corresponding tandem duplications (i.e., the Lcluster).
FIG. 2. Correspondence between antiparallel duplicated genes (trans) in the two "telomeric" regions, 0 to 250,000 and 931,404 to 1,181,404; the latter region is presented in the reverse direction. Potentially syntenic regions are marked by identical colors.
FIG. 3. Distribution of BLASTP hits (weighted by the alignment length) as a function of sequence identity between matching genes (interval size, 5%). Blue, genes duplicated in parallel direction; red, genes duplicated in antiparallel direction.
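A length-weighted identity distribution of the kind shown in Fig. 3 can be computed as sketched below. The function name, the alignment of the 5% bins to multiples of five, and the example hits are assumptions made for illustration; they are not taken from the original analysis.

```python
def weighted_identity_histogram(hits, bin_width=5):
    """Sum alignment lengths per sequence-identity bin.

    hits: iterable of (percent_identity, alignment_length).
    Returns {bin_start: summed alignment length}, sorted by bin start.
    """
    histogram = {}
    for identity, length in hits:
        # Clamp 100% identity into the top bin so the bins cover [0, 100].
        bin_start = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        histogram[bin_start] = histogram.get(bin_start, 0) + length
    return dict(sorted(histogram.items()))


# Made-up hits, for illustration only.
hits = [(26.0, 300), (27.5, 200), (52.0, 400), (100.0, 50)]
hist = weighted_identity_histogram(hits)
print(hist)
```

Computing one such histogram for cis pairs and one for trans pairs would give the two curves whose maxima are compared in the text.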
However, in a case where multiple copies of a gene are found in the genome, the idea of using profile or HMM search methods can be taken a step further. Different methods of this type have recently been developed (30, 32, 35). They allow the comparison of an aligned set of genes (the paralogous genes) to a database of annotated profiles, or HMMs, with much higher sensitivity than sequence-to-sequence and sequence-to-profile comparisons. Here, I use the HHsearch software (32), which, in addition to HMM-HMM comparison, evaluates the correspondence between the predicted secondary protein structure of the query protein and those of the potential hits (using observed structure information from the Protein Data Bank where available). The result of an HHsearch for a single family of paralogues is then a list of hits, ranked by the probability that a hit is a true positive. For all families of paralogues, these results, together with the corresponding multiple alignments that were used to build the HMMs, are available at http://igs-server.cnrs-mrs.fr/suhre/mimiparalogues/. This data set may serve as a starting point for further analysis of a given mimivirus paralogue family.
Some of the larger paralogous families are related to virus-host interactions. Figure 4 (left) plots all paralogous genes by their positions on the chromosome. Hot spots of local tandem duplication activities can be detected and are particularly pronounced for the gene family N172 (Lcluster). A clustered view of all genes is given in Fig. 4 (right). By far the largest paralogous gene family (N14), with 66 members, contains the ankyrin double-helix repeat proteins (L14 L22 L23 L25 L36 L42 L45 L56 L59 L62 L63 L66 L72 L88 L91 L93 L99 L100 L109 L112 L120 L121 L122 L148 R229 R267 L279 L482 L483 R579 L589 R600 R601 R602 R603 R634 L675 L715 R760 R777 R784 R787 R789 R791 R797 R810 R825 R835 R837 R838 R840 R844 R845 R846 R847 R848 L863 L864 R873 R875 R880 R886 R896 R901 R903 R911). (In these lists, genes are numbered in increasing order by their positions on the linear mimivirus chromosome. The letter L indicates genes that are transcribed to the left [negative strand], and the letter R stands for genes transcribed to the right [positive strand]. Tandem [cis] duplications can be identified by successive numbering and identical letters [e.g., L121 and L122 are adjacent genes that are coded on the same strand].) Ankyrin repeat-containing proteins are ubiquitously found in large paralogous families in both viral and bacterial genomes. These genes are thought to play structural roles in the cell and are not discussed further here.
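The rule stated above for reading tandem (cis) duplications off a gene list, successive numbering with identical strand letters, can be applied mechanically. The following is a minimal sketch; the helper names `parse_gene` and `tandem_runs` are introduced here for illustration, and the strict adjacency test does not capture near-tandem duplications.

```python
import re


def parse_gene(name):
    """Split a gene name such as 'L121' into (strand_letter, number)."""
    match = re.fullmatch(r"([LR])(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected gene name: {name!r}")
    return match.group(1), int(match.group(2))


def tandem_runs(family):
    """Return runs of genes with successive numbers on the same strand."""
    genes = sorted((parse_gene(name) for name in family), key=lambda g: g[1])
    if not genes:
        return []
    runs, current = [], [genes[0]]
    for strand, number in genes[1:]:
        prev_strand, prev_number = current[-1]
        if strand == prev_strand and number == prev_number + 1:
            current.append((strand, number))
        else:
            if len(current) > 1:
                runs.append(current)
            current = [(strand, number)]
    if len(current) > 1:
        runs.append(current)
    return [[f"{strand}{number}" for strand, number in run] for run in runs]


# A subset of the N14 family members listed in the text.
subset = "L120 L121 L122 L148 R229 R600 R601 R602 R603 R634".split()
runs = tandem_runs(subset)
print(runs)
```

Applied to a full family list, the same function gives a quick overview of where local tandem duplication hot spots lie.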
FIG. 4. Dot plot of paralogous genes as a function of their positions on the mimivirus chromosome (left) and clustered by paralogous family (right). Matching genes are marked by red dots. Paralogous families are named for the gene that was used to initiate the PSI-BLAST search (e.g., the N14 family was seeded using gene L14).
170 amino-acid-long N-terminal domain that clearly matches the BTB/POZ domain. The BTB/POZ domain mediates homomeric dimerization, and in some instances heteromeric dimerization. POZ domains from several zinc finger proteins have been shown to mediate transcriptional repression and to interact with components of histone deacetylase corepressor complexes. The best matches to proteins with known structure are the promyelocytic leukemia zinc finger protein (PDBid 1buo) and the B-cell lymphoma 6 protein (PDBid 1r28). The genes from the N35 paralogue family are thus likely to play a role in transcriptional regulation. The third-largest cluster (N172; Lcluster) (L172 L174 L175 L176 L177 L178 L179 L180 L181 L182 L183 L184 L185 L697) is also the most exceptional in regard to its 12-fold tandem repeat of proteins. A multiple alignment of these genes indicates that they code for real proteins and that these proteins are likely under selective pressure. For instance, the amino acid type is often conserved within aligned columns, and stretches without any insertions and deletions are followed by indel-rich regions (signatures of structure elements and loop regions, respectively). However, no clear function could be attributed to this cluster, and it has no significant match outside the mimivirus genome. The highest-scoring hits from remote-homology detection, albeit well below certainty levels in regard to the probability that these are true positives, are sometimes linked to interaction with RNA.
The cluster N165 (L60 L162 L165 L166 L167 L168 L170 R286 L414 L415), which is found close to the Lcluster, also contains only genes that are annotated as unknown, most of them containing several Pfam FNIP repeats. Again, using remote-homology detection, I can identify an N-terminal domain that matches the Pfam F-box domain, which is a receptor for ubiquitination targets. This relatively conserved structural motif is present in numerous proteins and serves as a link between a target protein and a ubiquitin-conjugating enzyme. The SCF complex (i.e., Skp1-Cullin-F-box) plays a role similar to that of an E3 ligase in the ubiquitin protein degradation pathway. Different F-box proteins as a part of the SCF complex recruit particular substrates for ubiquitination through specific protein-protein interaction domains. Interestingly, several copies of ubiquitin-conjugating enzymes are also present in the mimivirus genome (e.g., gene L460), as well as a ubiquitin-specific protease (R319). Thus, the genes in cluster N165 can be predicted to play a role in protein degradation using the ubiquitin pathway.
About cluster N226 (L226 L228 R734 L764 L766 L767 L768 L769 L774), little can be said at present. Cluster N232 (L232 L268 R436 R517 L670 L673 R818 R826 R831), on the other hand, contains genes that are predicted to encode protein kinases and that may thus play roles in different cell regulatory processes.
Other notable families, not discussed in more detail here, are family N137, which contains proteins with glycosyltransferase domains; family N105, with remote homologies to potassium channel tetramerization domains; and families N73 and N430, which are similar to yeast and poxvirus transcription factors, respectively. Other interesting families that invite further investigation are N425, which contains the major capsid protein, and the family pair N79 (transposase)/N80 (site-specific integrase-resolvase), which contains three adjacent pairs of transposase/resolvase genes (L79/R80, R104/L103, L770/R771), as well as N238 (L71 R196 R238 R240 R241 L668 L669), which contains collagen triple helix repeats.
Using multiple alignments, together with remote-homology detection methods based on Hidden Markov Model comparison, I attribute putative functions to some of the larger paralogous gene families. These attributions indicate that a number of these duplicated mimivirus genes are likely to interfere with important host processes, such as transcription control, protein degradation, and different cell regulatory processes. The toleration and fixation of such important genome expansions under selective conditions may be explained by mimivirus' particular lifestyle, that is, the fact that mimivirus mimics microbial prey for its amoebal "predator" in order to enter its host by phagocytosis. Thus, in order to represent an interesting prey for the amoeba, mimivirus has to maintain bacterial size (15) and can thus more easily tolerate a large genome size than its smaller cousins. With this constraint comes the evolutionary advantage of being able to host a larger spectrum of genes capable of interfering with host defenses, very much in contrast to the situation of small viruses that are optimized for rapid and economic replication and that survive with a rather minimal gene set (for a detailed discussion, see reference 5). Interestingly, if the same detection algorithm is applied to other large DNA viruses, a log-linear trend becomes visible between the number of paralogous genes and the gene content of the genome (Fig. 5).
FIG. 5. Numbers of paralogous genes and paralogous gene families as a function of the numbers of predicted genes in the genomes. Results are presented for two different e-values, 10⁻⁵ (blue/cyan) and 10⁻¹⁰ (red/orange). PBCV1, Paramecium bursaria Chlorella virus 1; Shrimp, Shrimp white spot syndrome virus; Irido, Invertebrate iridescent virus 6; Phages (from largest to smallest), bacteriophage KVP40/bacteriophage Aeh1/Pseudomonas phage phiKZ; Pox (from largest to smallest), Canarypox virus/Amsacta moorei entomopoxvirus/Fowlpox virus. For the large DNA viruses, a log-linear relationship between gene duplication and gene content is observed (red line).
Searching the Sargasso Sea environmental genome shotgun-sequencing data set (34), Ghedin and Claverie (10) detected the presence of close relatives of mimivirus in this marine environment. While a large number of the mimivirus genes are found to have a BLAST hit to this data set, none of the genes from the N172 and N226 clusters (with the exception of a spurious match for gene L177) are found in the Sargasso Sea data set. This may be an indication of a more recent emergence of these two families.
The large fraction of viral genes that exhibit no or only remote homology to genes in any other organism, including different viruses (12), is commonly attributed to an assumed faster evolution of viral genes than their bacterial and eukaryotic counterparts. If this assumption is correct, the genes of the two families N172 and N226 may have evolved from an ancient ancestor to a point where no similarity at the sequence level to their orthologues in other genomes can be detected. Determining the three-dimensional structures of members of these (and other) families may therefore answer the question of the origin of these at present mimivirus-specific genes. Comparing the structures of different paralogues may then contribute more generally to our understanding of the evolution of viral genes, as they have evolved in a unique environment in a single genome context, i.e., in a situation where differences in G+C content or constraints related to metabolic differences due to the availability of different amino acids need not be considered.
I believe that gene and genome duplications in large DNA viruses can be analyzed much as is currently done for members of the other three domains of life. For example, reconstructing duplication history has received extensive attention recently. Zhang et al. (38) present a method for inferring the duplication history of tandem-repeat sequences that may be readily applied to mimivirus tandem gene duplications. Davis and Petrov (6) demonstrated that genes that have generated duplicates in the Caenorhabditis elegans and S. cerevisiae genomes were 25% to 50% more constrained prior to duplication than the genes that failed to leave duplicates. They further showed that conserved genes have been consistently prolific in generating duplicates for hundreds of millions of years in these two species, that is, that the set of duplicate genes is biased. This observation may allow us to narrow the range of putative roles of the duplicated mimivirus genes whose functions are still completely unknown.
My analysis shows that a large fraction of the mimivirus genes originated from repeated tandem gene duplications and from segmental genome duplication events, the order of magnitude of the duplications being comparable to what is commonly observed in bacteria, archaea, and eukaryotes. This is compatible with the view that the large DNA viruses establish a deeply rooted branch on the tree of life rather than representing just a collection of genes gathered during their passage through diverse cellular host organisms (see also the discussion in references 21 and 24).
I thank Johannes Söding for assistance with the use of the HHsearch program and acknowledge helpful discussions with my colleagues at the Laboratory IGS, in particular, C. Abergel, S. Audic, G. Blanc, C. Notredame, H. Ogata, and J.-M. Claverie.