ENCODE facts for kids

ENCODE
Content
Description: Whole-genome database
Contact
Research center: Stanford University
Laboratory: Stanford Genome Technology Center: Cherry Lab (formerly University of California, Santa Cruz)
Authors: Eurie L. Hong and 17 others
Primary citation: PubMed
Release date: 2010

The Encyclopedia of DNA Elements (ENCODE) is a big science project. Its main goal is to find all the important parts of the human genome. Think of it like making a complete map of all the working pieces of our DNA.

ENCODE also helps other scientists by creating a huge collection of information about our genes. This includes data, computer programs, and tools to study DNA. All of this helps us understand how our bodies work at a very tiny level. The project is always growing, adding more types of cells and data. It even looks at the DNA of mice now!

History of ENCODE

The ENCODE project was started in September 2003 by the US National Human Genome Research Institute (NHGRI). It was created after the famous Human Genome Project. That first project mapped out all our DNA. ENCODE's job is to figure out what all those mapped parts actually do.

Many research groups from all over the world work together on ENCODE. All the information they find is shared with everyone through public databases. The first big release of ENCODE data was in 2013. Since then, it has been updated based on what scientists need. The project aims to be a public place for "how-to" guides, ways to analyze data, and the data itself. It also keeps careful records of where the data came from. The project started its fourth main phase in February 2017.

Why ENCODE is Important

Humans have about 20,000 genes that make proteins. These genes make up only about 1.5% of our DNA. For a long time, scientists called the rest of our DNA "junk." But ENCODE's main goal is to find out what the other 98.5% of our DNA does.

It turns out that many parts of this "junk" DNA are actually very important. They act like control switches for our genes. These switches can turn genes on or off, or change how much protein they make. If these switches don't work right, it can lead to diseases. By finding these control parts and understanding how they work, ENCODE helps us learn why certain diseases happen.

ENCODE also wants to be a helpful resource for all scientists. This way, they can better understand how our genome affects our health. This knowledge can then help create new ways to prevent and treat diseases.

The ENCODE Team

The ENCODE team, called the ENCODE Consortium, is mostly made up of scientists funded by the US National Human Genome Research Institute (NHGRI). Other scientists also join the team to help with the project.

When ENCODE first started, only a few research groups were involved. But after 2007, the number of scientists grew to 440, working in 32 labs around the world! Today, the team has different centers, each doing different jobs.

ENCODE is also part of a bigger group called the International Human Epigenome Consortium (IHEC).

The NHGRI wants all the information from ENCODE research to be shared freely and easily. This helps all scientists do more research on our genes. ENCODE makes sure that all its computer programs, methods, and data are clear and can be checked by others.

How the ENCODE Project Works

The ENCODE project has been done in four main parts or "phases." The first two phases, called the pilot phase and technology development phase, started at the same time. Then came the production phase, and the fourth phase is a continuation of the third.

The pilot phase aimed to find the best ways to study large parts of the human genome quickly and affordably. It also helped scientists see what tools were missing or not good enough for such a big project. The technology development phase then worked on creating new lab and computer methods to find more functional parts of DNA. The results from these first two phases helped decide the best way to study the remaining 99% of the human genome in the main production phase.

ENCODE Phase I: The Pilot Project

The pilot phase was like a test run. Scientists tried and compared different ways to study a small part (about 1%) of the human genome. Many different experts worked together to see which methods worked best. At the same time, the technology development phase created new, faster ways to find working parts of DNA. The goal was to find a set of methods that could find all the functional parts in the human genome.

In this pilot project, scientists worked closely together to figure out the best ways to understand our DNA. They picked specific areas of the human genome, about 1% of the total, to study. All the information they found was quickly shared with everyone.

Choosing the DNA Areas to Study

For the ENCODE pilot project, scientists chose specific areas of the human genome. These areas were about 30 million DNA letters long, which is roughly 1% of our total DNA. These chosen areas were used to test how well different methods could find various working parts in human DNA.
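
Here is a small sketch of the math behind "roughly 1%." It assumes the human genome is about 3.1 billion DNA letters long, a commonly quoted round figure that this article does not state. The names used below are made up for this example.

```python
# Rough check of the "about 1%" figure for the pilot regions.
# ASSUMPTION: the human genome is about 3.1 billion DNA letters long.
GENOME_LETTERS = 3_100_000_000
PILOT_LETTERS = 30_000_000  # "about 30 million DNA letters" from the text


def percent_of_genome(letters, genome_letters=GENOME_LETTERS):
    """Return what percentage of the genome a stretch of DNA covers."""
    return 100 * letters / genome_letters


pilot_percent = percent_of_genome(PILOT_LETTERS)
print(f"Pilot regions cover about {pilot_percent:.2f}% of the genome")
```

With these numbers the answer comes out just under 1%, which matches the "roughly 1%" in the text.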

Scientists decided to pick half of these areas by hand and the other half randomly. The areas picked by hand often had well-known genes or other important DNA parts. They also had a lot of existing information to compare with.

The other half of the DNA areas were chosen randomly. This was done to make sure they got a good mix of DNA regions. Some had many genes, and others had fewer. This helped them test their methods on all kinds of DNA.
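
The idea of picking a random mix of gene-rich and gene-poor areas can be sketched like this. It is only an illustration, not the method the ENCODE scientists actually used: the region names, the gene counts, and the cut-off between "few" and "many" genes are all invented for the example.

```python
import random

# HYPOTHETICAL candidate regions: name -> number of genes in the region.
candidates = {
    "region_A": 2, "region_B": 40, "region_C": 5, "region_D": 33,
    "region_E": 1, "region_F": 27, "region_G": 8, "region_H": 51,
}


def pick_mixed_regions(regions, per_group, many_genes_cutoff=20, seed=0):
    """Randomly pick the same number of gene-poor and gene-rich regions."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    gene_poor = sorted(n for n, genes in regions.items() if genes < many_genes_cutoff)
    gene_rich = sorted(n for n, genes in regions.items() if genes >= many_genes_cutoff)
    # Gene-poor picks come first in the result, gene-rich picks second.
    return rng.sample(gene_poor, per_group) + rng.sample(gene_rich, per_group)


chosen = pick_mixed_regions(candidates, per_group=2)
print(chosen)
```

Splitting the candidates into groups before picking means the random choice cannot end up with only gene-rich or only gene-poor areas by accident.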

Pilot Phase Discoveries

The pilot phase was a success! The results were published in 2007. These discoveries greatly increased what we knew about how the human genome works. Here are some key findings:

  • Most of our DNA is active. This means that most parts of our DNA are copied into RNA, even if they don't make proteins.
  • Scientists found many new types of RNA that don't make proteins. Some of these overlap with protein-making genes, and others are in areas once thought to be "silent."
  • Many new starting points for gene activity were found. These starting points often look and act like well-known gene "on" switches.
  • The control switches around these starting points are spread out evenly on both sides, instead of sitting mostly in front of them.
  • How open or closed our DNA is, and certain changes to special proteins called histones, can tell us if a gene is active.
  • Some distant control switches have special marks that tell them apart from gene "on" switches.
  • When DNA copies itself is linked to how open or closed it is.
  • About 5% of our DNA has stayed similar across different mammals over time. This suggests these parts are important. For about 60% of these parts, experiments showed they have a function.
  • Even though experiments found many functional DNA areas, not all parts of these areas showed signs of being important over evolution.
  • Different kinds of functional DNA parts differ a lot in how much they vary from person to person, and in how likely they are to sit in parts of the genome that vary in structure.
  • Surprisingly, many functional parts of DNA don't seem to have been kept the same during mammal evolution, so their DNA letters have been free to change. This might mean there's a large collection of active DNA parts that don't give a specific benefit right now. They could be a "storage" for natural selection, ready to become important later.

ENCODE Phase II: The Production Phase

This picture shows ENCODE data in a special browser. It displays information about how genes are controlled. The gene on the left (ATP2B4) is active in many cells. The gene on the right is active in only a few cell types, like stem cells.

In September 2007, the National Human Genome Research Institute (NHGRI) started funding the main production phase of ENCODE. In this phase, the goal was to study the entire human genome.

Like the pilot project, this effort involved many groups working together. In October 2007, NHGRI gave out over $80 million in grants. This phase also included centers to manage data, analyze data, and develop new technologies. The project truly became global, with 440 scientists from 32 labs worldwide. With new gene sequencing machines, the project grew hugely. Scientists created a massive amount of raw data, about 15 terabytes!

By 2010, ENCODE had produced over 1,000 sets of genome-wide data. These data sets showed which parts of DNA are copied into RNA, which parts likely control genes in certain cells, and which parts are linked to many different proteins. The main tests used in ENCODE included ChIP-seq (to find where proteins bind to DNA), DNase I Hypersensitivity (to find open DNA areas), RNA-seq (to measure RNA), and tests for DNA methylation (chemical changes to DNA).

Production Phase Discoveries

In September 2012, the project released many more results. These were published in 30 different science papers at the same time!

The scientists described how they created and first looked at 1,640 data sets. These sets were designed to find functional parts in the entire human genome. They combined results from different experiments and cell types. They also linked ENCODE data with other information, like areas found in disease studies. All this work showed important things about how the human genome is organized and how it works:

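  • A huge part (over 80%) of the human genome takes part in at least one biochemical event, such as being copied into RNA or being bound by a protein, in at least one cell type. Most of our DNA also lies very close to a spot where gene control happens.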
  • Even DNA parts that are unique to primates or don't show clear signs of being important over evolution still seem to be functional.
  • Scientists divided the genome into seven different states based on its structure. This helped them find nearly 400,000 areas that act like gene "boosters" (enhancers) and over 70,000 areas that act like gene "on" switches (promoters).
  • It's possible to connect how much RNA is made with the DNA structure and where certain proteins bind to DNA at gene "on" switches. This means that how these switches work can explain most of the differences in gene activity.
  • Many small changes in a person's DNA that don't change proteins are found in ENCODE-mapped functional areas. This number is at least as big as changes found in protein-making genes.
  • Small DNA changes linked to diseases are often found in these non-coding functional areas. In many cases, the disease can be linked to a specific cell type or protein that controls genes.
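
The last two points are about checking whether a small DNA change lands inside a mapped functional area. A minimal sketch of that check is below. The positions, area labels, and function name are invented for illustration and do not come from ENCODE data.

```python
# HYPOTHETICAL functional areas on one chromosome: (start, end, label).
# Positions are 1-based and both ends are included.
functional_areas = [
    (1_000, 1_500, "promoter"),
    (8_000, 9_200, "enhancer"),
    (20_000, 20_400, "enhancer"),
]


def areas_containing(position, areas=functional_areas):
    """Return the labels of every functional area that covers a position."""
    return [label for start, end, label in areas if start <= position <= end]


# HYPOTHETICAL positions of small DNA changes found in a disease study.
dna_changes = [1_200, 5_000, 9_200]
for pos in dna_changes:
    hits = areas_containing(pos)
    print(pos, hits if hits else "not in a mapped area")
```

A simple loop like this is fine for a handful of areas. Real genome tools use faster lookups because they deal with millions of areas.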

The most surprising discovery was that a much larger part of human DNA is biologically active than anyone thought before. The ENCODE team reported that they found some kind of biochemical activity in over 80% of the genome. Much of this active DNA is involved in controlling how much the protein-making DNA (which is only about 1.5% of the genome) is used.

Some of the most important new parts of this "encyclopedia" include:

  • A full map of "DNase I hypersensitive sites." These are like open doors in the DNA, showing where control elements are. They found almost 3 million of these sites.
  • A list of short DNA sequences that certain proteins recognize and bind to. They found about 8.4 million such sequences.
  • A first look at the complex network of human proteins that control genes. This network is very complicated, with different levels of control and many feedback loops.
  • A measurement of how much of the human genome can be copied into RNA. This was estimated to be over 75% of the total DNA, much higher than earlier guesses. The project also started to describe the types of RNA made in different places.

Other Related Projects

As the ENCODE project continued, it became involved with other projects that have similar goals.

modENCODE project

The MODel organism ENCyclopedia Of DNA Elements (modENCODE) project is like a spin-off of the original ENCODE project. It focuses on finding functional parts in the DNA of specific model organisms, like fruit flies (Drosophila melanogaster) and tiny worms (Caenorhabditis elegans). Studying these simpler organisms helps scientists test and confirm what they find in human DNA. It's often easier to do experiments on these animals than on humans. This project finished its work in 2012.

modERN

modERN, which stands for "model organism encyclopedia of regulatory networks," grew out of the modENCODE project. It focuses on finding more places where proteins that control genes bind in worms and flies. This project started around the same time as ENCODE's Phase III.

Genomics of Gene Regulation

In early 2015, the NIH started the Genomics of Gene Regulation (GGR) program. This program aims to study how gene networks and pathways work in different body systems. The goal is to better understand how genes are controlled. Even though ENCODE and GGR are separate, the ENCODE data center helps host GGR's information.

Roadmap

In 2008, NIH began the Roadmap Epigenomics Mapping Consortium. Its goal was to create a public collection of human "epigenomic" data. This data helps us understand how genes are controlled without changing the DNA sequence itself. In 2015, they released a big article that showed how they combined information from 127 different human epigenomes. Some of these were part of the ENCODE project.

fruitENCODE project

The fruitENCODE project is a plant version of ENCODE. It aims to collect data on DNA changes, protein modifications, and gene activity in different types of fleshy fruits as they ripen. This helps scientists understand how fruits develop.

Factorbook

The information about where proteins that control genes bind, which was found by the ENCODE project, is available online in a place called Factorbook. Factorbook.org is a Wiki-based database for this kind of data. The first version of Factorbook included:

  • Information from 457 experiments on 119 different gene-controlling proteins in human cells.
  • Details about how special proteins called histones are changed around these binding areas.
  • Patterns in the DNA sequences found in these areas.

See also

  • GENCODE
  • SIMAP
  • Functional genomics
  • Human Genome Project
  • 1000 Genomes Project
  • International HapMap Project
  • List of biological databases