Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project

The ENCODE Project Consortium


PREPRINT 06-009:
Nature 447: 799-816 (2007)


As part of the Encyclopedia of DNA Elements (ENCODE) project, the location and organization of the sites of transcription across 1% of the human genome sequence have been determined in multiple cell line and tissue samples using a collection of empirical methods. In parallel, a detailed annotation of the known transcript content of this 1% of the genome has been obtained based on available sequence data. These studies reveal a complex organization of lattice-like networks of protein-coding and non-coding transcription across the genome, indicating that intergenic regions comprise only a small proportion of the genome. Conservatively, 15% of the ENCODE genomic sequences are detected as processed transcripts, of which more than 30% correspond to previously unannotated sites of transcription. These novel sites of transcription have features that distinguish them both from the background of unannotated genomic DNA and from well-characterized protein-coding transcripts and, taken in bulk, do not appear to be under detectable selective evolutionary constraint. Analysis targeted specifically at the protein-coding genes revealed that a large proportion initiate their transcription in a tissue-specific manner at sites more than 100 kb from the annotated portion of the coding gene, often reaching across many intervening genic loci. Overall, the results indicate that more than 90% of the genome in the ENCODE regions is transcribed, on at least one strand, as primary nuclear transcripts.


ENCODE, transcription, transfrags/TARs, RNA