The ENCODE Project publishes new genomic insights in special issue of Genome Research

Genome Research publishes online and in print today a special issue dedicated to The ENCODE (ENCyclopedia Of DNA Elements) Project, whose goal is to characterize all functional elements in the human genome. Since the completion of the pilot phase of the project in 2007, covering 1% of the genome, The ENCODE Consortium has fanned out across the genome to study function and regulation on an unprecedented scale. This special issue presents novel findings, methodologies, and resources from ENCODE that bring extensive insight to gene regulation and set the stage for future discoveries. The issue also contains commentary and perspectives on how our views of the genome have changed as a result of The ENCODE Project. The entire issue will be freely available online on September 6 to coordinate with additional ENCODE Consortium publications in Nature, Genome Biology, and other journals.

1. GENCODE presents the most detailed annotation of the genome yet

Since the completion of the pilot phase of The ENCODE Project in 2007, it has been evident that there is much more to a gene than just a sequence that codes for protein, changing our concept of what defines a gene. We now know that the genome is not a set of discrete genes, but rather a complex system of genes and regulatory regions, much of which is transcribed into RNA, including many RNAs that do not code for protein but have critical cellular functions.

When The ENCODE Project was launched, a subgroup of the project called The GENCODE Consortium was established to accurately map and annotate these complex features across the human genome, by both manual curation and computational methods. In this special issue, Harrow and colleagues of The GENCODE Consortium present the latest release of GENCODE gene data, describing a wealth of new information that exceeds the depth of annotation of other community resources.

Also in this issue are detailed reports of experimental validations to complement the GENCODE gene data and novel strategies for further annotating the genome. Howald and colleagues developed the RT-PCR-seq method to show that a substantial portion of exons, the protein-coding regions of genes retained by splicing, are not well annotated by unbiased RNA-sequencing alone, requiring a more targeted strategy in combination.

GENCODE has mapped more than 9,500 long non-coding RNAs (lncRNAs), but until now, only about 100 have been assigned a cellular function. lncRNAs, which are transcribed in a range of human tissues and play roles in gene regulation, are particularly interesting because they do not seem to be as well conserved evolutionarily as genes that code for proteins. Derrien et al. have analyzed the GENCODE lncRNA annotations, integrating the lncRNA data with other ENCODE transcriptome and epigenome data, and present the most comprehensive lncRNA annotation to date. The authors show that approximately one-third of lncRNAs have arisen in the primate lineage, suggesting that there may be important lncRNA functions yet to be discovered.

References:

  • Harrow et al., GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res doi: 10.1101/gr.135350.111
  • Howald et al., Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome. Genome Res doi: 10.1101/gr.134478.111
  • Derrien et al., The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res doi: 10.1101/gr.132159.111

2. ENCODE studies clarify the murky world of RNAs

The ENCODE Project's efforts to annotate the genome include the sequencing of RNA, the message transcribed from DNA to code for proteins and perform other cellular functions. Splicing can produce different forms of that molecule with varied biological functions, but the mechanism and timing by which splicing occurs across the genome have remained poorly understood. Previous studies have shown that splicing can occur while the RNA is still being transcribed from its template.

Now, analyses by The ENCODE Consortium are shedding light on the scale of co-transcriptional splicing genome-wide. In this issue, Tilgner and colleagues analyzed sequencing data from RNA isolated in different regions of the cell, allowing them to define splicing events at different stages and measure which splicing events are occurring during transcription. They found that most RNAs are being spliced while they are transcribed, and interestingly, for lncRNAs, splicing occurs late, and in some cases, not at all.

In previous studies, researchers have found that members of another well-known class of small regulatory RNAs, called microRNAs (miRNAs), are in some cases generated by splicing rather than by the typical miRNA biogenesis pathway; these are called mirtrons. Recently, hundreds of mirtrons were identified in model organisms, but the prevalence of mirtrons in mammals remained unknown. Utilizing the wealth of small RNA datasets produced by The ENCODE Consortium and specialized analysis tools, a study by Ladewig et al. in this issue identified more than 200 mammalian mirtrons, confirming some that had been previously identified, showing evidence for many more that have not been previously characterized, and revealing new insight into the evolution and biology of miRNAs.

References:

  • Tilgner et al., Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Res doi: 10.1101/gr.134445.111
  • Ladewig et al., Discovery of hundreds of mirtrons in mouse and human small RNA data. Genome Res doi: 10.1101/gr.133553.111

3. New views of the genome's regulatory landscape

The ENCODE Project continues to illuminate the complex process of gene regulation and chromatin, the combination of DNA and protein that packages DNA in the nucleus. The scale of new data from The ENCODE Project is allowing more accurate characterization than ever of the factors that regulate gene expression. In this issue, Cheng and colleagues have applied a statistical model to the large-scale ENCODE gene expression and transcription factor binding datasets to assess the accuracy of gene expression prediction. Among a number of insights into the predictability of gene expression, their work suggests that gene expression differences in different cell lines are directly reflected in quantitative differences in transcription factor binding levels, challenging the classic "on" or "off" transcription factor binding model.

In addition to studies investigating the myriad transcription factors in the cell, researchers in The ENCODE Consortium are also investigating the function of specific factors genome-wide. Wang et al. present a genome-wide analysis in diverse cell types of the binding pattern of CTCF, a well-known insulator protein that, when bound, can suppress the effect of regulatory enhancers on their target genes and that plays a role in a number of fundamental genomic processes. The team found that the binding pattern of CTCF is surprisingly plastic yet reproducible, and is significantly different between normal and immortal cells, a finding that could have important implications in cancer.

ENCODE studies are spurring the development of new methods to integrate large genome-wide datasets of different types and to overcome the limitations of current techniques. For example, to investigate the relationship between nucleosome remodeling, histone modifications, and transcription factor binding that governs gene regulation, Kundaje and colleagues have developed a new tool called the Clustered Aggregation Tool (CAGT). The method was applied to datasets of chromatin marks and transcription factor binding to generate an extensive catalog of histone modifications and nucleosome positioning around bound transcription factors. The analysis indicated that both histone modifications and the positions of nucleosomes around transcription factor binding sites are highly heterogeneous, a surprising finding that suggests the features of many regulatory elements are asymmetrical.

References:

  • Cheng et al., Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Res doi: 10.1101/gr.136838.111
  • Wang et al., Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res doi: 10.1101/gr.136101.111
  • Kundaje et al., Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements. Genome Res doi: 10.1101/gr.136366.111

4. Regulatory variation and the genetic basis of disease

The data and analyses of The ENCODE Project will help the research community to understand not only genome function but also disease, with the aim of designing new strategies of treatment and prevention. Much of the effort in the last decade to understand the genetic basis of disease has been through genome-wide association studies. Many genetic variants found to associate with disease lie in non-coding regions and are relatively common in the population, which makes their effects difficult to interpret. This challenge has highlighted the need to understand the influence of genetic variation on the function of genes and regulatory regions.

Two studies in this special ENCODE issue take a step forward in this effort, analyzing the potential functional consequences of individual genetic variants. Vernot and colleagues present the most comprehensive assessment of human regulatory variation yet, analyzing regulatory regions marked by DNase I hypersensitivity, an experimental property that indicates gene activity, together with the whole-genome sequences of 53 people. The authors found that individuals are more likely to have functionally relevant variants in regulatory regions of DNA than in protein-coding regions, and they provide further insights into patterns of regulatory variation at the individual and population levels.

The second study, by Boyle et al., utilized RegulomeDB, a database of ENCODE regulatory data among other sources, to analyze 69 whole-genome sequences and "score" genetic variants to isolate those that may be functionally important. The team identified thousands of potentially functional regulatory variants and estimated that the human genome harbors as much variation, if not more, in regulatory regions as in protein-coding DNA. The authors expect this resource to facilitate the annotation of human genome sequences.

References:

  • Vernot et al., Personal and population genomics of human regulatory variation. Genome Res doi: 10.1101/gr.134890.111
  • Boyle et al., Annotation of functional variation in personal genomes using RegulomeDB. Genome Res doi: 10.1101/gr.137323.112

Source: Cold Spring Harbor Laboratory