Philosophical Transactions of the Royal Society B: Biological Sciences
Open Access Review article

The mystery of extreme non-coding conservation

Nathan Harmston, Anja Barešić and Boris Lenhard

Institute of Clinical Sciences, Faculty of Medicine, Imperial College London and MRC Clinical Sciences Centre, Hammersmith Hospital Campus, Du Cane Road, London W12 0NN, UK (all authors)

Department of Informatics, University of Bergen, Thormøhlensgate 55, 5008 Bergen, Norway (Boris Lenhard)

Published: https://doi.org/10.1098/rstb.2013.0021

Abstract

Regions of several dozen to several hundred base pairs of extreme conservation have been found in non-coding regions in all metazoan genomes. The distribution of these elements within and across genomes has suggested that many have roles as transcriptional regulatory elements in multi-cellular organization, differentiation and development. Currently, there is no known mechanism or function that would account for this level of conservation at the observed evolutionary distances. Previous studies have found that, while these regions are under strong purifying selection, and not mutational coldspots, deletion of entire regions in mice does not necessarily lead to identifiable changes in phenotype during development. These opposing findings lead to several questions regarding their functional importance and why they are under strong selection in the first place. In this perspective, we discuss the methods and techniques used in identifying and dissecting these regions, their observed patterns of conservation, and review the current hypotheses on their functional significance.

1. Introduction

It has been estimated that between 5% and 10% of the human genome is evolving at rates slower than neutral [1,2]. Only 1.2% of the genome encodes proteins, and the remainder is presumed to be non-coding regions of regulatory and/or structural relevance. While there has been evidence that functionally equivalent non-coding regions can have negligible sequence similarities, and even lineage-specific transcription factor (TF) binding patterns [3,4], sequence-level conservation is still a generally applicable criterion indicative of functional conservation.

This review focuses on non-coding genomic sequences showing exceptionally high levels of similarity across species, often greater than among the exons of genes encoding perfectly conserved polypeptides [3–5]. These elements were discovered genome-wide independently by several groups in 2003–2005 [5,6] and were reported under different names and with varying conservation criteria. In the first published genome-wide report [6], the authors reported 481 sequences completely identical between human and mouse spanning 200 bp or more, whereas Sandelin et al. [5] and Woolfe et al. [7] used lower thresholds combined with a larger evolutionary distance (mammals versus fish) to show that, in addition to extreme conservation, many of these elements have been conserved across more than 400 million years of vertebrate evolution. These elements also seem to represent merely the extremes of a distribution of overall highly conserved elements [1,8].

In this review, we shall use the term conserved non-coding elements, or CNEs, as a general term for all these elements. Many other names have been used by different groups, along with different conservation criteria. The conservation criteria consist of (i) a minimal sequence identity (seq. id.) between the species under consideration and (ii) a minimal sequence length over which this identity is achieved. Bejerano et al. [6] referred to elements as ultraconserved elements (UCEs), which are 100% conserved over their entire length, also known as ultraconserved non-coding elements (UCNEs) [9]. Relaxation of these thresholds enables the identification of elements over larger evolutionary distances, which are still more conserved than would be expected if these elements were neutrally evolving. Other names for these elements include conserved non-coding elements (CNEs), conserved non-coding sequence (CNS) [10], highly conserved non-coding elements (HCNEs) [11], ultraconserved regions (95% identity over at least 50 bp) [5], extremely conserved elements [12], highly conserved non-coding regions (HCNRs) [13], hyperconserved elements [14], long conserved non-coding sequences (95% over at least 500 bp) [15] and conserved non-genic sequences (CNS) [16]. In spite of the different names, they yield highly overlapping sets of elements representative of the same underlying phenomenon of extreme conservation. Several large-scale, publicly available sets of CNEs have been produced; they are listed in table 1.

Table 1. Publicly available CNE datasets.

name | CNE definition | species | dataset size | source
ANCORA [17] | 70–100% seq. id. over 30 or 50 bp window | Metazoa | 494 human–mouse (a) | http://ancora.genereg.net/
cneViewer [18] | user-specified | human–zebrafish | 73187 (b) | http://bioinformatics.bc.edu/chuanglab/cneViewer/
CONDOR [19] | 65% seq. id. over 40 bp | mammalian–fugu | >7000 (c) | http://condor.nimr.mrc.ac.uk/
TFCONES (d) | 70% seq. id. over 100 bp | human–mouse | 58 954 | http://tfcones.fugu-sg.org/
 | 65% seq. id. over 50 bp | human–fugu | 2843 |
UCNEbase [20] | >95% seq. id. over 200 bp (human–chicken) | 18 vertebrate species | 4351 | http://ccg.vital-it.ch/UCNEbase
VISTA Enhancer Browser [21] | 100% seq. id. over >200 bp | human–mouse | 1951 (c) | http://enhancer.lbl.gov

(a) 100% seq. id. over 200 bp.

(b) For the minimum threshold of 50% seq. id. over 50 bp.

(c) Includes in vivo functional assay information.

(d) Exclusively surrounding TF genes.
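
The CNE definitions in table 1 all take the same form: a minimal percentage identity sustained over a minimal window length. As a rough illustration only (not the procedure used by any of the listed resources), the following sketch scans a gapped pairwise alignment for windows meeting such a threshold and merges overlapping hits. The function name, its parameters and the treatment of gap columns as mismatches are assumptions made for this example.

```python
def conserved_windows(aln_a, aln_b, min_identity_pct=70, window=50):
    """Return merged half-open (start, end) alignment-column intervals.

    Every reported interval is the union of overlapping windows of
    `window` columns whose identity is >= `min_identity_pct` percent.
    Assumption for this sketch: a column with a gap ('-') in either
    row counts as a mismatch.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    a, b = aln_a.upper(), aln_b.upper()
    # 1 where both rows carry the same non-gap base, else 0
    match = [1 if x == y and x != "-" else 0 for x, y in zip(a, b)]
    hits = []
    running = sum(match[:window])  # matches in the first window
    for start in range(0, len(match) - window + 1):
        if start > 0:
            # slide the window one column to the right
            running += match[start + window - 1] - match[start - 1]
        # integer comparison avoids floating-point threshold issues
        if running * 100 >= min_identity_pct * window:
            if hits and start <= hits[-1][1]:
                hits[-1] = (hits[-1][0], start + window)  # extend last hit
            else:
                hits.append((start, start + window))
    return hits
```

With `min_identity_pct=100` and `window=200` this corresponds in spirit to the ultraconserved-element criterion, whereas lower thresholds such as 70% over 50 columns resemble the relaxed definitions used at larger evolutionary distances.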

The level of conservation of these sequences [6], their location within vertebrate genomes [5] and their distribution throughout the vertebrate lineage [7] suggested that these were candidates for regulatory elements important in the early stages of vertebrate development, differentiation and coordination between cells. These functions have since been experimentally confirmed for a number of elements [3,22,23].

Although these elements have been primarily identified in vertebrates, equivalent elements have been found to be pervasive throughout Metazoa, although only a few seem to be conserved between deuterostomes and protostomes [24]. Recently, CNEs have also been shown to exist in plants ([25]; see below). This suggests that these elements and the presently unknown cause of their extreme conservation are of very ancient origin, possibly going back to the origins of eukaryotic multi-cellularity.

(a) The unexplained nature of extreme conservation

The distribution of CNEs within the genome and their level of conservation poses one of the most interesting open questions about genomic sequences: what is the reason for such extreme conservation?

To date, no plausible explanation has been proposed for either the source of selective pressure or a potential direct mechanism which would result in such a high level of conservation as seen in a subset of conserved non-coding elements (see the examples in figure 1). No imaginable combination of overlapping TF binding sites (TFBSs) could account for them, and the accumulating ChIP-seq data provide no evidence for massive amounts of combinatorial TF binding to those elements. Furthermore, no known complementary RNA products exist that could target them across their entire conserved length, and no plausible mechanism of active maintenance of the sequence has been proposed. However, their pervasive nature and implication in developmental and multi-cellular processes suggest that the unknown source of conservation holds a key to understanding the regulation of development and differentiation in general.

Figure 1. Multiple sequence alignments (Multiz alignment of 46 vertebrate genomes) of a set of sequences that are highly conserved over vertebrates. Dots represent bases that are identical to the human GRCh37/hg19 assembly and orange lines represent gaps. (a) Alignments for a CNE near the SOX2 locus, chr3: 180 462 261–180 462 515 and (b) a CNE located at chr3: 177 077 799–177 077 901, which is missing in dog and chicken.

This paper aims to review what is known about CNEs, their currently identified functional and structural features, conservation patterns and their prevalence in the tree of life. Finally, we provide an overview of current opinions on the mechanisms of their emergence, conservation and evolution.

2. General features of extremely conserved elements

(a) Distribution within genomes and its consequences

The location and distribution of CNEs within a genome is not random: they appear in clusters, more often around genes encoding crucial regulators of early development than expected by chance [5–7,26]. The distribution and density of CNEs within the vicinity of the developmental gene SOX2 is shown in figure 2. Even though there is virtually no sequence homology between the CNEs identified among the genomes of the Drosophila genus and those identified in vertebrates, they tend to be associated with the same functional classes of genes. These elements are also enriched close to genes involved in ion flow across membranes and cell–cell communication, but are underrepresented near housekeeping genes [27]. CNEs are also enriched in 3′ untranslated regions (3′-UTRs) of regulatory genes (less so in invertebrates) [1].

Figure 2. Overview of the SOX2 locus, its associated gene desert and its local neighbourhood, specifically the 2.4 Mb region on human chr3 centred around the SOX2 gene. (a) Location of CNEs flanking human SOX2, present between human (Hg) and mouse (Mm) (90% identity over 50 base pairs—shown in dark green), human and chicken (Gg; 90% identity over 50 base pairs—shown in yellow) and human and tetraodon (Tn; 70% identity over 50 base pairs—shown in light green). (b) As the distance increases from SOX2, the density of HCNEs decreases dramatically.

There is ample evidence that CNEs are required to be kept in cis with the gene they are involved in regulating (their target gene). This has constrained how the genome is organized [28,29] and has led to the maintenance of large regions of synteny conserved over large evolutionary distances, populated by a set of CNEs targeting one particular gene, referred to as genomic regulatory blocks (GRBs) [30–32]. The neighbourhoods of many of these target genes are devoid of other genes (gene deserts) and are heavily populated by CNEs [22] (figure 2), although there exist many examples where they contain bystander genes which contain CNEs of target genes but appear not to be responsive to regulation by them, reflecting differences in their promoter architecture [33] or the importance of the structural organization of the locus [34]. The distribution of CNEs within a GRB tends to show a high density of CNEs around the (predicted or experimentally demonstrated) target gene (including its introns and the introns of bystander genes), with the density decreasing at larger distances from the promoter of the target gene. In total, CNE clusters can span up to a couple of megabases around their target genes [5,35].
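
The decline in CNE density with distance from the target gene promoter can be tabulated with a simple binning step. The sketch below is illustrative only: the function name, the use of a single coordinate per CNE (for example its midpoint) and the default bin size and maximum distance are assumptions, loosely chosen to match the 2.4 Mb window shown in figure 2.

```python
def cne_counts_by_distance(cne_positions, tss, bin_size=100_000,
                           max_distance=1_200_000):
    """Count CNEs in bins of absolute distance from a promoter.

    cne_positions: iterable of integer coordinates, one per CNE
                   (assumed here to be CNE midpoints)
    tss:           coordinate of the target gene promoter
    Returns a list in which element i is the number of CNEs whose
    distance from `tss` lies in [i * bin_size, (i + 1) * bin_size),
    pooling upstream and downstream positions.
    """
    n_bins = max_distance // bin_size
    counts = [0] * n_bins
    for pos in cne_positions:
        idx = abs(pos - tss) // bin_size
        if idx < n_bins:  # ignore CNEs beyond the last full bin
            counts[idx] += 1
    return counts
```

For a GRB of the kind described above, the returned counts would be expected to be highest in the first few bins and to fall off towards the edges of the region.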

Considering these elements have been linked to key developmental regulators, it has been proposed they might be used as indicators of loci of yet undiscovered or unannotated developmental genes [5,17]. A subset of developmentally controlled microRNAs were also found to be associated with clusters of deeply conserved CNEs [36], again linking these elements with a particular functional subset of genes.

Several studies identified regions that seem to be mutually exclusive with clusters of CNEs. CNEs were depleted in regions with segmental duplications and copy number variations [37]. In addition, many but not all mammalian loci containing clusters of CNEs were shown to be depleted for transposons [38]. The loci that were depleted in retrotransposon insertions were associated with developmental TFs, suggesting that the cis-regulatory architecture of these genes is unable to tolerate insertions of this type.

(b) Prevalence of extreme conservation across species

The human elements most highly conserved in other species are common to all vertebrates, making vertebrates such as mouse, zebrafish and medaka convenient model organisms for in vivo functional assays of CNE effects on target genes. At larger evolutionary distances, the number of conserved non-coding elements rapidly declines: for example, only 76 CNEs were reported between human and sea lamprey [39], and only 56 between human and the early branching chordate Branchiostoma floridae (amphioxus/lancelet) [40]. Thus, it is not surprising that there is virtually no non-coding sequence similarity to invertebrate CNEs for the orthologous genes, including urochordates as the closest relatives [27].

Recently, very rare individual CNEs were found to show conservation (at lowered thresholds) across larger evolutionary distances. A small number of CNEs were found near the Hox locus in amphioxus versus chicken/mouse [41] or amphioxus versus mouse/Ciona [42] comparisons. Clarke et al. [24] identified two regulatory elements conserved between deuterostomes and protostomes which were found to remain in synteny with their respective genes. In addition, several CNEs have also been identified to show marginal similarity between D. melanogaster and humans [43].

While most CNE studies focused on comparing humans with mammalian or other vertebrate species, several studies found equivalent sets of elements conserved across invertebrate genomes when comparing genomes across a suitable range of evolutionary distances. Equivalent elements were found to be highly conserved across worms of the Caenorhabditis genus [27], Drosophila genus [31,44], across different mosquito genomes [31] and between two species of Ciona (B. Lenhard 2013, unpublished observation). Despite the lack of sequence-level similarity across the different clades, these clade-specific sets of elements have many features in common: they occur in genomic clusters around genes whose protein products themselves regulate embryonic development and differentiation, they constrain genome rearrangements within those clusters, and the loci of their target genes are characterized by broad Polycomb repression and associated broad H3K27me3 marks when they are being held in an inactive state [31].

Siepel et al. [1] analysed conserved elements (in both coding and non-coding regions) by aligning them within clades: five vertebrates, four insect species, seven Saccharomyces and two Caenorhabditis species. Comparing all conserved elements showed an increase in total element frequency among smaller, more compact genomes and larger fractions of non-coding elements in organisms with more complex genomes, i.e. vertebrates.

Finally, elements with similar properties have been reported in plant genomes [25,45]. Many were found in the vicinity of TF genes that regulate plant development, including those that do not have orthologues in Metazoa. This strengthens the hypothesis that clusters of these elements are a functional feature of the regulation of genes involved in development and differentiation, and suggests an even more ancient origin for them.

(c) General sequence properties of conserved non-coding elements

Despite extensive early efforts to find them, there is little evidence for the existence of sequence-level features common to CNEs as a class of genomic elements. CNEs show a biased AT (adenine and thymine) content with (i) increased total AT content within CNEs when compared with surrounding sequences, (ii) a sharp increase in AT frequency at CNE boundaries and (iii) a sharp decline in AT frequency on the boundaries of sequences flanking CNEs [46]. The strength of this pattern depends on the background properties of the genome sequence in question; it is particularly strong in CNEs of genomes with relatively high GC content—fugu, Caenorhabditis elegans and Drosophila melanogaster—and less prominent in mammals [27]. Finally, the AT content of CNEs differs significantly from average gene surroundings, suggesting selective pressure for this sequence feature [46].
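
The AT bias described above can be checked, in a simplified form, by comparing the AT fraction of a CNE with that of its immediate flanks. This is a minimal sketch rather than the analysis performed in the cited studies; the function names, the flank length and the choice to ignore non-ACGT characters are assumptions made for illustration.

```python
def at_fraction(seq):
    """Fraction of A/T among unambiguous bases; None if there are none."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    total = at + gc  # ambiguous characters such as N are ignored
    return at / total if total else None


def at_profile(sequence, start, end, flank=100):
    """AT fraction of the left flank, the CNE itself and the right flank.

    `start` and `end` are 0-based half-open coordinates of the CNE
    within `sequence`; flanks are truncated at the sequence ends.
    """
    left = sequence[max(0, start - flank):start]
    core = sequence[start:end]
    right = sequence[end:end + flank]
    return at_fraction(left), at_fraction(core), at_fraction(right)
```

A CNE showing the reported pattern would give a higher value for the core than for either flank; resolving the sharp changes at the boundaries themselves would require a finer, position-by-position profile averaged over many elements.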

While most studies identify CNEs as the largest stretches of non-coding sequences to satisfy a defined sequence identity threshold, Hare et al. [47] attempted to identify what remains in terms of their functional content after long times of evolutionary separation. They compared six species of sepsids (insects that belong to the same order—Diptera—as Drosophila) with D. melanogaster, which diverged from them approximately 100 million years ago, ensuring the identification of highly diverged regulatory sequences which still drive highly similar expression patterns. They showed that the enhancer of the eve gene contains highly conserved small blocks of only 20–30 nucleotides, enriched in overlapping TFBSs. This finding is in agreement with the billboard model of cis-regulatory modules [48], which proposes that the exact number and order of TFBSs is not necessary for the correct enhancer effect on the target gene. These 20–30 nucleotide clusters of TFBSs may, however, be the smallest blocks selection acts upon in functional CNEs [47]. Woolfe et al. [7] found consistent ordering and mutual positioning of CNEs within vertebrate genomes, suggesting their (yet undetermined) structural/organizational role, although the 56 CNEs conserved between amphioxus and vertebrates show some evidence of shuffling with respect to order and orientation [40].

These studies show that CNEs can have important regulatory functions, although we still cannot account for the pattern or extent of conservation at closer distances. As yet, there seems to be no consensus on the minimal set of features that defines an enhancer whose conserved output is under selection. However, these analyses, along with the relatively high abundance of CNEs in gene deserts, suggest structural importance, making it necessary to view these elements as more complex than just a collection/ordered combination of TFBSs.

(d) Biological function of conserved non-coding elements

The ability of a CNE to drive expression of a gene is typically tested in vivo using transgenic assays, most commonly in mouse [3] or zebrafish [4,49]. A majority of the tested CNEs act as enhancers in reporter constructs [3]. The probability that a conserved sequence has enhancer activity is related to its level of conservation [3] and the density of other conserved sequences in the surrounding locus [50]. Transgenic assays of a number of CNEs lacking enhancer activity revealed that they were able to function as enhancer-blocking insulators [51]. A handful have also been found to be involved in regulating other transcription-related processes, such as splicing [52] and RNA editing [53].

In addition to the sequence of these elements being highly conserved, the majority of CNEs that act as enhancers also show functional conservation over the entire clade in which they are conserved, and very often beyond. A set of lamprey and human CNEs located near the EBF3 gene has been found to upregulate GFP expression in the same set of neurons in zebrafish [39]. However, the expression patterns they are driving in different species can vary dramatically [54]. Transfection of CNEs identified across multiple phyletic groups has found that these elements can still drive expression, although at slightly different development stages [55].

While CNEs detected in three clades independently (insects, worms and vertebrates) do not share sequence similarity, they often associate with and regulate the same set of (often crucial developmental) genes in all three groups [17,27]. This suggests that the involvement of highly conserved non-coding elements in the precise regulation of these genes is crucial for the body plan development within a phylum, whereas recycling regulatory states using the same pool of enhancer sequences in different contexts might be the driving force in the emergence of different body plans during evolution [56]—a phenomenon termed regulatory interaction re-wiring by Vavouri et al. [57]. Tunicates display a typical chordate body plan using a highly diverged set of conserved elements when compared with other chordates [58]; however, the elements still cluster around the same types of genes as in other chordates and indeed other Metazoa.

Given their extreme levels of conservation over long stretches of genomic sequence, it is expected that these elements play important and irreplaceable functions in early development. Surprisingly, at least in some cases, the deletion of large clusters of CNEs yields viable mice with no obvious deleterious phenotypic changes, as shown by transgenic mouse assays [59]. There have been several recent indications that some of the CNEs are phenotypically redundant, or only have phenotypes that are detectable over many generations [60–62]. To sum up, it is impossible to infer functional conservation from sequence conservation and vice versa [63–65].

More than one-third of top disease-associated regions coming out of genome-wide association studies do not contain any coding sequences [66], thus indicating a common role of non-coding sequences in disease [67,68]. Many of those regions are spanned by multiple CNEs [69], making it possible that a number of genetic diseases are associated with CNE function. In order to shed more light on the role of CNEs within the genome, it is thus crucial to look into the evolutionary background of these elements.

3. Origins and evolutionary dynamics of conserved non-coding elements

(a) Purifying selection versus mutational coldspots

One of the first explanations proposed for the existence of CNEs is that they are located within regions associated with very low rates of mutation (mutational coldspots). However, these elements exhibit features which suggest that they are constrained by extreme levels of purifying selection: a lower than expected single nucleotide polymorphism density [6,70], and a derived allele frequency significantly shifted towards ancestral alleles [71]. The frequency of germline mutations in a set of vertebrate CNEs has been found to be similar to that of other genomic regions, suggesting that mutations in these regions can occur, but are actively selected against [15]. Similar signatures of purifying selection have also been identified in insects [72], suggesting that the same constraints apply to these elements across Metazoa. However, although the majority of evidence is in support of these elements being under selection, the observations that the deletion of some of these sequences leads to viable mice [59] and that a number of CNEs accumulate fewer mutations than their flanking regions in colorectal cancer [52] have raised as yet unanswered questions regarding their functional importance and the source of their observed levels of selection.
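
The two polymorphism-based signatures mentioned above, SNP density and derived allele frequency, reduce to simple summary statistics once the SNPs in a region have been polarized against an ancestral allele. The following sketch computes both for a single region; the function name and the input representation are hypothetical, and no significance test is included.

```python
def summarize_region(derived_allele_freqs, region_length_bp):
    """Return (SNPs per kb, mean derived allele frequency) for a region.

    derived_allele_freqs: one value in [0, 1] per SNP in the region,
                          giving the population frequency of the
                          derived (non-ancestral) allele
    region_length_bp:     total length of the region in base pairs
    The mean frequency is None when the region contains no SNPs.
    """
    if region_length_bp <= 0:
        raise ValueError("region_length_bp must be positive")
    n_snps = len(derived_allele_freqs)
    snps_per_kb = 1000 * n_snps / region_length_bp
    mean_daf = sum(derived_allele_freqs) / n_snps if n_snps else None
    return snps_per_kb, mean_daf
```

Under purifying selection, a set of CNEs would be expected to show both a lower SNP density and a lower mean derived allele frequency than matched flanking sequence, whereas a mutational coldspot would be expected to lower the density without shifting the frequency spectrum.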

(b) Emergence and recruitment of conserved non-coding elements

The CNEs in a genome are generally unrelated on the sequence level, the exception being CNEs whose common ancestor can be traced back to a whole-genome duplication (WGD) [7,73–75]. This reflects the fact that CNEs appear to have been derived from a multitude of different sources, including former exons [75,76], introns [44], mobile elements [8,77] and ancient repeats [78].

Some CNEs have been found to originate from retrotransposons [8] and other classes of mobile elements [77], which have been exapted and have since come under selection (reviewed in [79]). This finding is in contrast to the findings of Simons et al. [38], where regions of the genome containing developmental regulatory genes were found to be depleted in transposon insertions. However, it appears that exaptation of these elements can be identified only for ancient insertions, indicating that selection against recent insertions is occurring and is potentially responsible for their depletion around specific genes. It may be that some retrotransposon insertions are preferentially retained in certain contexts as they are useful in creating new cis-regulatory elements. Certain families of transposable elements appear to have sequences that are easily mutated into TFBSs [80]; however, it has been shown that transposable elements from all superfamilies have the ability to come under extreme levels of selection [81]. Hundreds of sequences from the MER21 family of ancestral repeats have been found to have been exapted during evolution [78] and are now identifiable as CNEs within the human genome. These sequences appear to contain a set of even more highly conserved short subsequences, which correspond to putative and known binding motifs, although the authors provided no experimental evidence of TF binding.

A highly conserved exonic enhancer involved in hindbrain development has been found to lie within a conserved element found in all vertebrates [76]; the element itself extends into the flanking introns. This implies that the same selective pressure that can be applied to non-coding elements can also be present within coding regions and overlap with the selective pressure acting to conserve the underlying protein sequence.

In conclusion, although certain types of sequence have a higher propensity to gain regulatory functions, there is no evidence that any specific type of sequence element has an increased probability of being recruited as a CNE. It appears that any sequence within the response range of a gene responsive to long-range regulation, once it provides some important regulatory function, has the potential to become recruited as a CNE.

There is evidence that some CNEs have been recruited either through a process of gradual accumulation or in discrete waves. However, the (still) limited sampling of the vertebrate phylogenetic tree makes it difficult to distinguish between these models. Analysis of the vertebrate phylogeny has found that CNEs appear to be recruited in a lineage-specific manner, with approximately 40% of extant eutherian CNEs being present before the divergence of ray-finned fishes from cartilaginous fishes, 12% appearing in the bony vertebrates, 18% in the tetrapods, and 16% and 10% appearing in the amniotic and therian ancestor, respectively [82]. It appears that CNEs evolved rapidly in the early vertebrate lineage [73], and since the divergence of tetrapods and the teleosts, many tetrapod CNEs have been mutating at an extremely low rate [83]. By analysing substitution rates observed in CNEs, Kim et al. [84] found that two-thirds of CNEs evolved at a rate consistent with a one-parameter model; however, the remainder showed branch-specific changes in the observed mutation rate. This suggests that the adaptive evolution of CNEs may occur in short bursts, and that the selective constraints imposed on certain sets of CNEs have not remained constant during mammalian evolution.

Ryu et al. [43] identified CNEs from several phyla and investigated their patterns of evolution. CNEs were identified not only between higher eukaryotes, but also between species in more primitive phyla (e.g. Porifera and Cnidaria). In all of the examined phyla, CNEs were found to be recruited in clusters around genes belonging to equivalent functional groups. These elements could be clustered into their respective lineages based on their sequence similarity, with no identifiable sequence conservation across distant lineages. Ryu et al. suggested that each group of CNEs arose independently in the ancestors of different phyla, and following divergence of that lineage, came under selection and became fixed. However, any mechanism of selection shared across different phyla, including the source of purifying selection, should have been in place already in their last common ancestor, so it is likely that the species that lived many hundreds of millions of years ago already possessed their own set of CNEs, which diverged by slow but eventually complete turnover in different lineages after their separation. For a further discussion of CNE turnover, see below.

(c) Patterns of loss, gain and divergence of conserved elements

Lowe et al. [85] proposed that, within vertebrates, there have been three distinct periods of CNE recruitment around specific groups of genes. They suggest that this pattern is the result of regulatory innovations, which led to important phenotypic changes during vertebrate evolution. Prior to the divergence of mammals from reptiles and birds, it appears that CNEs were preferentially recruited near TFs and their developmental targets. This was followed by a gradual decline in recruitment near these genes, accompanied by an increase near proteins involved in extracellular signalling, and then an increase in placental mammals near genes responsible for post-translational modification and intracellular signalling. An analysis of CNE gain in the primate and rodent lineage has found that CNEs are either recruited near genes which have not previously been associated with CNEs, or are added near genes which are already flanked by CNEs [86]. The interpretation was that the first set of genes is enriched in functions pertaining to nervous system development, whereas the latter contains genes involved in transcriptional regulation and anatomical development. A set of genes involved in DNA binding and transcriptional regulation was found not to gain new elements in addition to pre-existing ones.

During evolution, the flanking sequences of a CNE can show substantial levels of divergence, whereas a core region remains highly conserved. Comparisons of a well-conserved element identified in Tetraodon show that this element is flanked by lineage-specific mutations in the mammalian and fish lineages. The degree of sequence divergence in the regions surrounding a core CNE [87,88] has led to these elements being proposed as markers for phylogenetic studies, successfully resolving the phylogeny of non-model organisms, in addition to reconstructing the primate and placental tree. Comparisons of human, mouse and primate CNEs suggest the phenomenon of ultraconservation is fragile [89], and that once a mutation within a CNE has become fixed, it appears that the element becomes more susceptible to gaining additional mutations.

Despite being under such high levels of selection, CNEs do show patterns of lineage-specific loss. In several cases, loss of a CNE was shown to be accompanied by detectable alterations in an organism's phenotype and fitness [90,91], further reinforcing their functional importance. It is therefore expected that CNE loss, which negatively affects the fitness of an organism, will be selected against and will not become fixed in populations. Within the rodent lineage, mammalian-specific CNE loss has been estimated to be 300 times less probable than the loss of neutrally evolving sequence [92]. An examination of CNE loss in mammals [93] found that independent CNE loss occurs non-uniformly across the mammalian lineage, with CNEs that are shorter, younger and under less constraint showing a higher likelihood of being lost. The rate of conservation of CNEs dating back to the amniote ancestor is different between mammals and reptiles [94], which have lost similar numbers of CNEs but at different rates.

The current understanding of cis-regulatory evolution proposes that loss of a regulatory element can only occur once the selective pressure on that element is either absent or sufficiently relaxed. This situation can occur by (i) the creation of a new element, which performs the same function, making the original element redundant (known as turnover), (ii) the loss of the pressure on the tissue/phenotype that the enhancer is responsible for or (iii) the loss of the gene it regulates. CNEs are absent from chrY [7], with the exception of the SHOX locus in its pseudo-autosomal region. SHOX-associated CNEs are well conserved between human, dog and fish. Owing to the loss of the SHOX gene in the mouse lineage, no CNEs are identifiable between the human and mouse chrY [15]. However, the loss of the CNE-associated gene is extremely rare and can explain only a small fraction of the observed losses [93].

(d) Turnover of cis-regulatory elements and conserved non-coding elements

The conservation of the expression pattern of a gene is not dependent on the sequence conservation of its regulatory elements [47,63]. It has been found that the cis-regulatory architecture of the yellow gene in Drosophila has changed multiple times during evolution [95]: both the sequence and position of the various enhancers have changed. In addition, enhancers that were responsible for driving expression in specific tissues had changed their genomic location. This region shows no evidence of segmental duplications and transpositions, suggesting that the observed patterns of turnover probably occur owing to the gradual accumulation of mutations, which result in the de novo gain and loss of TFBSs. Small sequence changes can inactivate existing cis-regulatory elements, and can generate new cis-regulatory elements from non-regulatory sequences [96].

Mammals may contain similar amounts of functional sequence, despite loss of many conserved sequences [2,97], suggesting that turnover of functional non-coding sequences is both prevalent and occurring at different rates. The lack of non-coding sequence conservation between different phyla, together with differences in retention of these elements across lineages and between closely related species, suggests that CNEs have been subject to turnover since their initial recruitment.

We propose that extant CNEs are not indispensable and that, given an adequate amount of time, all of these elements will eventually be replaced by new ones that provide equivalent functions (figure 3). On the whole, CNEs in a genome are unrelated at the sequence level [7], and they are absent from regions of segmental duplications and copy number variation [37]. This suggests that duplications involving them are strongly selected against in the cases where the duplicate elements still affect a target gene, or are lost rapidly where they do not. On the other hand, their occurrence in the introns of neighbouring genes and recruitment from diverse existing genomic elements suggests that they appear by a gradual process of mutation, recruitment and selection. The different rates at which these elements turn over reflect differences in the levels of selective constraint, and how likely it is that a replacement element can be recruited without interfering with the function of existing ones.

Figure 3. Schema of our proposed model of CNE turnover. In the common ancestor of two lineages, cis-regulatory elements (shown in light blue) were recruited within the proximity of a gene which was required to be under a specific form of regulation. Over time, other elements were sequentially recruited in both lineages (shown in green and red) and the corresponding ancestral elements were lost. This process continued until all of the elements in the extant set of CNEs no longer contain any of the set of ancestral elements, and these elements are no longer recognizable between lineages. This results in CNEs changing in position and arrangement within the locus, as well as gaining lineage-specific elements.

It has been proposed that CNEs reflect the parallel evolution of regulatory elements for important developmental regulatory genes in different groups [27]. The following model of CNE recruitment and turnover can directly explain this proposition. At some stage during evolution, ancestral developmental cis-regulatory elements appear to have been recruited from sequences near specific sets of genes. These elements provided regulatory innovations that were necessary for development of multi-cellular organisms. Within each of these regions, there was the potential for sequences to gain important functions and for selection on existing elements to be relaxed, allowing them to diverge and be turned over. During evolution, additional genes were recruited to developmental regulatory networks, leading to increasingly complex developmental and morphological features. The presence of clusters of CNEs near orthologous genes in species separated by large evolutionary distances argues for this hypothesis, as does the limited number of CNEs which are found between phyletic groups [24]. In the cases where there are no elements conserved between two distant species around a specific orthologous locus, while they clearly exist between each of the species and its closer relatives, the ancestral elements have completely turned over and are no longer identifiable.
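The turnover model described above can be illustrated with a minimal toy simulation. This is only a sketch under strong simplifying assumptions that are not taken from the literature: every ancestral element is replaced independently with the same per-step probability, the two lineages evolve independently, and replacement elements are never shared between lineages. The function name `simulate_turnover` and its parameters (`n_elements`, `rate`, `steps`, `seed`) are hypothetical and chosen for illustration.

```python
import random


def simulate_turnover(n_elements=100, rate=0.05, steps=60, seed=0):
    """Count ancestral elements still shared by two lineages at each step.

    Toy model (assumption): each retained ancestral element is replaced by a
    lineage-specific element with probability `rate` per step, independently
    in each lineage.
    """
    rng = random.Random(seed)
    # Each lineage tracks which ancestral element IDs it still retains.
    lineage_a = set(range(n_elements))
    lineage_b = set(range(n_elements))
    shared = [len(lineage_a & lineage_b)]
    for _ in range(steps):
        # An element survives the step only if its random draw is >= rate;
        # replaced elements are lineage-specific and are not tracked further.
        lineage_a = {e for e in lineage_a if rng.random() >= rate}
        lineage_b = {e for e in lineage_b if rng.random() >= rate}
        shared.append(len(lineage_a & lineage_b))
    return shared
```

Calling `simulate_turnover()` returns a list whose first entry is the full ancestral set and whose later entries can only stay level or decrease. Under these assumptions the expected shared fraction after t steps is (1 − rate)^(2t), so with enough time no ancestral element remains recognizable between the lineages, even though each lineage still carries a full complement of elements.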

Not all key regulatory elements involved in development are CNEs, which leaves the question of the link between developmental function and source of extreme selective pressure unresolved. As an example, both conserved and non-conserved regulatory sequences are required for controlling developmental genes in the germ layer of zebrafish [98]. It may be that lineage-specific CNEs have the same function as elements that have been lost even if their sequences are not homologous [86]. Some of the new lineage-specific CNEs that are generated by turnover may not be contributing to lineage-specific changes but are required for maintaining important patterns of gene expression as substitute or partially redundant elements.

4. Mechanism of conservation and unexplored potential roles of extremely conserved elements

Despite the amount of research into CNEs, there is as yet no unifying model relating their functional properties to their observed evolutionary dynamics and the extent of conservation. Currently, there is no known biological or biochemical function that requires such large elements to be under such high levels of sequence constraint. Several hypotheses have been suggested to explain their presence based on their potential functions and patterns of conservation; however, serious objections can be raised to all of them.

(a) Hypotheses on the origin of conservation

Based on experiments that support the hypothesis that these elements act as developmental stage-specific enhancers, one would hope that existing models of enhancer architecture would help illuminate this question; however, they only serve to make the issue more perplexing. Enhancers have been classified into distinct groups based on the arrangement of their constitutive TFBSs and the degree of cooperativity between bound TFs [99]. The enhanceosome [100] model features a strict pattern of TFBSs, which in some cases enables cooperativity between bound TFs. Such an arrangement could potentially span over a large number of nucleotides and be subject to high levels of selection. However, the enhanceosome model only requires sequence conservation at the level of binding sites and their interleaving distances—it does not require conservation of the inter-site sequences. As such, this would not lead to the observed long stretches of extreme conservation.
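The argument in this paragraph can be made concrete with a small sketch. The sequences below are invented purely for illustration (they are not real TFBSs or genomic sequence), and `longest_identical_run` is a hypothetical helper. The example assumes an enhanceosome-like arrangement in which two binding sites and their spacing are preserved between species while the inter-site sequence is free to diverge.

```python
def longest_identical_run(seq_a, seq_b):
    """Length of the longest stretch of consecutive identical aligned positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    best = current = 0
    for a, b in zip(seq_a, seq_b):
        # Extend the current run on a match, reset it on a mismatch.
        current = current + 1 if a == b else 0
        best = max(best, current)
    return best


# Hypothetical toy alignment: two 8 bp "binding sites" separated by a 6 bp spacer.
site1, site2 = "TGACGTCA", "GGGAATTC"
species_1 = site1 + "AAAAAA" + site2
species_2 = site1 + "CCCCCC" + site2  # spacer diverged; sites and spacing kept
```

Here `longest_identical_run(species_1, species_2)` is bounded by the length of a single site (8), whereas an element conserved along its whole length would give a run equal to the full 22 bp. Selection on sites and spacing alone would therefore not be expected to produce uninterrupted stretches of identity hundreds of base pairs long.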

The degeneracy of TFBSs is typically thought to suggest that DNA–protein interactions are promiscuous and do not require a perfect binding site. However, mutations within cis-regulatory elements can have large and unexpected effects [101]. Phenotypic and morphological evolution can be directly influenced by mutations which have a small effect size [102]; however, these mutations can be selected against. It may be that mutations within these elements have effects that are subject to extreme levels of purifying selection. One potential explanation is that these elements contain overlapping TFBSs, where alteration of one nucleotide position has effects on multiple overlapping TFBSs and may affect nucleosome positioning and retention. Given the levels of TFBS degeneracy and the weak sequence requirements for nucleosome positioning signal, this would require an extremely dense overlap of functional elements that has never been observed at any regulatory element so far. On the contrary, despite the rapidly growing volume of TF binding and histone modification data [103–105], there is no evidence that CNEs that act as enhancers bind a larger number of DNA binding proteins or have different histone modification marks than regulatory elements lacking their level of conservation. Indeed, for many elements, over a large number of cellular conditions and embryonic stages tested, there is no evidence for any enhancer-associated features from the binding and histone modification data.

The size of these elements and their patterns of divergence and fragmentation suggest that these may not only have one specific function, but are multi-functional. The flanking elements may be important in determining the function and specificity of a CNE [39,106]. This may suggest that these elements are pleiotropic and under selection owing to multiple coinciding functions. However, this hypothesis still relies on TF binding and chromatin features as sources of selective pressure and as such also fails to explain the extent of conservation.

The hierarchical nature of the developmental genetic regulatory networks (GRNs) [107] has suggested that these elements may be involved in the early stages of embryonic development or in a specific period of development [56]. It has been proposed that CNEs may be responsible for regulation at the end of gastrulation (the phylotypic stage), where patterns of gene expression appear to be highly conserved between species [108], the recruitment and persistence of these elements being due to selective pressure to maintain the observed patterns of expression [109]. Furthermore, the enhancers used at the end of gastrulation show a significant increase in the degree of sequence conservation [110]. However, even this hypothesis still supposes that selection is acting at the level of TFBSs, and predicts that all the most conserved CNEs are involved in transcriptional regulation during the phylotypic stage, when this is clearly not the case. Another potential way for CNE evolution to be constrained by the structure of the GRN is that they could be recruited to act at different levels of the GRNs, having multiple functions and potentially large pleiotropic effects [55]. This explanation suffers from the same problem as the previous ones.

In addition, it has been suggested that these elements may be involved in homologous recombination [27,111], which would provide an active mechanism for the elimination of differences between two alleles of the same element. However, this or any other active mechanism would require them to function primarily in the germline, which does not match what is known about their biological function. Nevertheless, because the known biological functions cannot explain the level of conservation, this hypothesis cannot be ruled out at present, despite the lack of any experimental evidence.

Chromosome conformation assays have identified that some CNEs appear to be involved in cis- and trans-interactions with other CNE-rich regions of the genome [112]. CNEs were found to interact with promoters of genes as well as other CNEs. This suggests that these interactions may be involved in the regulation of a set of functionally related genes or in the formation of higher-order chromatin structures. Dimitrieva & Bucher [9] investigated the patterns of CNE retention and loss following WGD and suggested that the majority of CNEs are retained in cis with one copy of the duplicated gene while having been completely lost from the other copy. While this reason for their conservation is appealing, the existence of these interactions has only been reported in one study and it remains to be seen whether these interactions are prevalent and functional. Recent results have suggested that the conformation at developmental loci is highly conserved across mammals [113], which may point to CNEs being involved in the conservation of a set of interactions and higher-order chromatin structure.

(b) Clues about function from chromatin and epigenetic data

As noted earlier, the analysis of the recently released ENCODE data suggests that there is nothing special about CNEs that sets them apart from other regulatory elements in terms of their epigenetic features. However, it has been shown repeatedly that the genes that are regulated by them (and around which they form dense clusters) are associated with special patterns of histone modifications and TF binding. Intriguingly, it has been shown in both human [114,115] and zebrafish [116] that these genes are the most prominent subset of genes that retains histones and histone modifications in the sperm genome. These genes in sperm typically have bivalent promoters (overlapping H3K4me3 and H3K27me3 marks) as well as locus-wide H3K27me3 marks that often cover the entire gene [117]. While these observations do not tell us anything about the role of CNEs in sperm or spermatogenesis, they can help to generate hypotheses about the possible role of CNEs in the germline.

5. Conclusion and perspectives

Since their discovery, research into CNEs has led to several important findings regarding their functional importance and evolutionary dynamics. However, despite 10 years of research, there has been virtually no progress towards answering the question of the origin of these patterns of extreme conservation. A number of hypotheses have been proposed, but most rely on modes of DNA : protein interactions that have never been observed and seem dubious at best. As a consequence, not only do we still lack a plausible mechanism for the conservation of CNEs—we lack even plausible speculations.

It is clear that selection is acting on more than just the sum of the constitutive TFBSs within a CNE. We expect CNEs to be found throughout all of Metazoa and even more broadly throughout multi-cellular organisms. Given the ancient origins of CNE-associated developmental regulation, the model that includes recruitment, selection over large periods of time and turnover is a more parsimonious explanation for their evolutionary dynamics than their independent occurrence in parallel lineages. Further work on the evolutionary dynamics of these elements and new hypotheses about the origin of their conservation are needed in order to begin to understand the mechanism behind this mysterious and fascinating feature of multi-cellular genomes.

Acknowledgements

Many thanks to Vanja Haberle, Petar Glažar, Liz Ing-Simmons and Sarah Langley for their comments.

Funding statement

A.B. is supported by the ZF-HEALTH FP7 Integrated Project. N.H. and B.L. are supported by the Medical Research Council UK. B.L. is also supported by the Department of Informatics, University of Bergen.

Footnotes

One contribution of 12 to a Theme Issue ‘Molecular and functional evolution of transcriptional enhancers in animals’.

© 2013 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/3.0/, which permits unrestricted use, provided the original author and source are credited.

References