
Code Finders

In search of the meaning of DNA once labeled as Junk

What was once thought to be junk DNA may play an important role in why humans are different from chimps.

The sequencing of the human and many other genomes may be finished, but our understanding of how those sequences can produce a complex functioning organism like a human being remains far from complete. Take our genome, for example; of its three billion base pairs of DNA, genes that code for enzymes, structural components and signaling molecules account for just two percent. So what about the other 98 percent? For many years, scientists had essentially disregarded the runs of As, Ts, Cs and Gs in the genome that didn't code for proteins as genetic gibberish, meaningless syntax that had no relevance to the understanding of our genomic blueprint.

 

However, many scientists are realizing that using the classic 1960s notion of the genetic code -- the translation of triplet codons of sequence into the amino acid building blocks of proteins -- to judge whether a piece of DNA is functional or not leaves out many of the sequence elements needed for a gene to function.
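To make the classic code concrete, here is a minimal Python sketch of triplet-codon translation. It is an illustration only, not a method used by the researchers in this article: the function name `translate` is hypothetical, and the table is the standard genetic code written in the conventional T, C, A, G base order.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read a DNA coding sequence three bases at a time (hypothetical helper).

    Returns one-letter amino acid codes and stops at the first stop codon.
    Codons containing anything other than A, C, G or T become "X".
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # methionine, alanine, tryptophan, then stop
```

Everything such a function cannot read, that is, any sequence outside a protein-coding stretch, is exactly the territory the rest of this article is about.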

 

"That genetic code is really just one of many," said Greg Wray, director of the IGSP's Center for Evolutionary Genomics. "You need another kind of genetic information in order to make, process and package the genetic message; otherwise it would just sit there."

Wray is among several investigators at the IGSP searching for these other genetic and genomic codes among the vast uncharted sequences of DNA. Such codes could regulate how genes become active, how DNA is replicated, how chromosomes are inherited, and how two meters of the DNA double helix gets packed into a tiny three-dimensional sphere within each cell; the list goes on and on. "The first genetic code, which quite deservedly resulted in a Nobel Prize, was actually easier to find," Wray said. "But there is no single thing that every gene uses in its regulation -- it is all highly gene specific. That is why it has taken so long to figure out these other codes. It is not a single other code that just happens to be hard to crack; it is myriad modifications on a very loose theme."

 

With the Human Genome Project concluded, it is as if scientists have all of the nouns in place and now just have to insert the proper verbs and adjectives, along with some punctuation. This additional genomic language includes the promoters, enhancers, repressors, and other regulatory sequences that determine when and where any one of the predicted 25,000 genes is expressed or 'turned on.' IGSP Investigators Terry Furey and Greg Crawford are part of a broader effort to identify each and every one of these functional elements as part of the international ENCODE (Encyclopedia Of DNA Elements) Project. Their specific approach looks for regions of the genome that are not tightly wound up or packaged into the protein-DNA complex known as chromatin, with the idea that these areas of open chromatin contain sequences that control the activity of other genes.

Currently, their interdisciplinary team is scanning these regions for known regulatory codes, such as the binding site of a classic transcription factor -- the protein responsible for turning genes on or off. But they are also trying to uncover novel patterns of sequence that lead to other important factors that have never been discovered before. The researchers are following their quantitative analysis with work in the wet lab to determine exactly what influence each discovered factor has on gene regulation.
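As a rough illustration of what scanning a region for a known binding site can involve, the Python sketch below slides a short motif along a sequence and checks both DNA strands. It is a toy under stated assumptions, not the team's pipeline: the function names, the `max_mismatches` parameter and the TATAAA example motif are all chosen here for illustration, and real analyses typically use probabilistic models of a binding site rather than a single fixed string.

```python
# Map each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(sequence, motif, max_mismatches=0):
    """Report (position, strand) for each window that matches the motif.

    Hypothetical helper: a window counts as a hit when it differs from the
    motif (or its reverse complement) at no more than max_mismatches bases.
    """
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        for start in range(len(sequence) - len(pattern) + 1):
            window = sequence[start:start + len(pattern)]
            mismatches = sum(a != b for a, b in zip(window, pattern))
            if mismatches <= max_mismatches:
                hits.append((start, strand))
    return sorted(hits)


# Illustrative input only: a made-up region and a TATA-like motif.
print(find_sites("GGTATAAAGG", "TATAAA"))
```

Checking the reverse complement matters because a protein can recognize its site on either strand of the double helix.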

 

"It is a very daunting task," said Furey. "I think we are just beginning to shed light on the whole complexity of the regulatory machinery. And I imagine the more we learn, the more complicated it is going to get."

 

Furey and Crawford have turned to Wray to help them whittle down their hundreds of thousands of sequences to ones that have functional meaning. Wray has been lining up their human genome sequences with ones from chimpanzees, gorillas, and other species to map the elements that have been conserved through evolution. Such highly conserved blocks of sequence are a genomic road sign for what nature has considered important for the last 50 to 100 million years, and thus far their studies suggest that over five percent of the genome is actually 'important'. Wray says these evolutionary comparisons are revealing certain elements that appear to be critical for all life, and others that may underlie species-specific traits.
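The idea of lining up sequences and looking for conserved blocks can be sketched in a few lines of Python. This is a simplified illustration, not Wray's method: it assumes the sequences have already been aligned to equal length (with "-" marking gaps), treats only perfectly identical columns as conserved, and uses invented example sequences and a hypothetical `min_length` threshold. Real comparative analyses rely on statistical models of how fast sequences are expected to change.

```python
def conserved_blocks(aligned, min_length=5):
    """Find runs of alignment columns that are identical in every species.

    aligned: dict mapping a species name to its aligned sequence.
    Returns (start, end) pairs, end exclusive, for runs of at least
    min_length columns. Columns containing a gap never count as conserved.
    """
    seqs = [s.upper() for s in aligned.values()]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must have equal length")

    blocks, start = [], None
    # Run one step past the end so a block touching the last column is closed.
    for i in range(length + 1):
        same = (
            i < length
            and seqs[0][i] != "-"
            and all(s[i] == seqs[0][i] for s in seqs)
        )
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


# Invented toy alignment, for illustration only.
toy_alignment = {
    "human": "ACGTACGTTT",
    "chimp": "ACGTACGATT",
    "mouse": "ACGTACCATT",
}
print(conserved_blocks(toy_alignment))
```

Raising or lowering `min_length` trades off how many short, possibly coincidental matches are reported against how many genuine but small conserved elements are missed.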

 

"It enables us to tease apart what you might call the fine-tuning dials of evolution -- the regulatory regions -- versus the master volume control, the protein-coding regions that you can't change without messing everything up," said Wray. "We are finding these little blocks of conserved sequence that are specific to a particular species and are essentially the codes that might tell you how you make a mouse different from a monkey."

Filling in Blank Pages

Although the human genome is said to be fully sequenced, a few blank pages do remain, especially in the repetitive sequences flanking the centromeres, found near the middle of chromosomes. These sequences -- called satellite DNA -- are so notoriously difficult to assemble that only a couple of labs are even attempting the task. One of them is led by Hunt Willard, director of the IGSP.

 


Once ignored, centromeres are now being recognized as critical to genome function.

Ignored for decades as trivial text in complex genomes, the repetitive sequences around the centromere are increasingly being recognized as critical to genome function. Willard's lab finally established their relevance about 10 years ago when they succeeded in making functional artificial chromosomes using satellite DNA. "The key question now," says Willard, "is not whether these sequences work, but rather how they work." To address this question, researchers including IGSP Investigator Beth Sullivan are exploring how centromeres are organized and regulated. She has discovered that centromeres have a unique organization whereby certain histone proteins -- those involved in packaging and compacting DNA -- are alternated in a regular pattern. If that pattern is changed or the proteins are removed, the centromeres fail to do their job of separating chromosomes during cell division.

"We have good evidence that regions of the genome that are highly repetitive -- like at the centromeres -- are regulated more by the three-dimensional chromatin structure than the underlying DNA sequence," said Sullivan.

 

"Surprisingly, it appears that they contain not only the tightly packaged regions that we know to be repressive to gene expression, but also more open regions where genes can actually be actively turned on. Centromeres are much more complicated than we originally thought."

 

Sullivan is now studying what happens when chromosomes fuse to each other to create one chromosome with two centromeres, a rearrangement found in one in a thousand people. By building such dicentric chromosomes from scratch in the laboratory, Sullivan's group has analyzed which chromosomes -- from numbers 1 to 22 -- are most likely to fuse, and how they behave in the cell once they are linked together. Sullivan's experiments have demonstrated that any two chromosomes will fuse together, but only the centromere from one of them will ultimately remain active in the dicentric chromosome. Her laboratory is now looking at the sequence differences between the centromeres to see if there are any genomic cues for choosing one centromere over another.

 

Kristin Scott, an IGSP Associate Investigator, is also interested in the genomic elements needed to establish a functional centromere. She has been using fission yeast as a model system to investigate the codes that are required to assemble the tightly compacted form of DNA -- called heterochromatin -- that is so characteristic of centromeres. Scott has discovered sequence-specific elements that basically behave like molecular stop signs, allowing heterochromatin to form in some regions surrounding the centromere, but not in others. She is currently looking into whether the sequence she identified acts alone or recruits other molecules that might accumulate and behave as a roadblock to additional packaging.

 

Scott and graduate student Bayly Wheeler in the Willard lab have tinkered with several different genomic sequences -- such as those necessary to establish, maintain or spread heterochromatin -- to see what happens when they are taken out of the tightly packaged heterochromatin and placed in more open expanses of DNA. Wheeler has found that she can remove the sequence elements that are required to wind up the DNA in the first place, and the DNA still remains tightly compacted for hundreds of generations.

 

"It is just fascinating to consider how such memory can be maintained in the absence of DNA sequence through so many rounds of mitosis and meiosis," Scott said.

Around the Genome in 60 Minutes

If DNA is the book of life, then replication is the process of making precise copies of that text, so that every cell gets the same book to read. Just like the genomic codes that determine what genes are active or how chromosomes are packaged, DNA replication requires specific sequences, called origins of replication, to initiate the process. Until recently, surprisingly little has been known about how these start sites in the genome are selected. As part of the model organism arm of the ENCODE project, Dave MacAlpine, an IGSP member in the Department of Pharmacology, is currently using gene chips to systematically map all of these sites in the Drosophila genome.

"It turns out replication is quite a huge feat," said MacAlpine. "You have three billion base pairs of DNA, and every cell cycle you've got to copy them all not only accurately, but just one time -- because if you underreplicate or overreplicate the genome it will lead to catastrophic problems. Somehow you have to coordinate thousands of potential start sites across the entire genome in order to finish a cycle of replication in the right amount of time."

 

MacAlpine has crunched the data from the model organism ENCODE project and found that chromatin structure plays a key role in specifying where start sites are located. Apparently, the same genomic modifications that promote the unwinding of the DNA helix to allow genes to be activated may also be involved in opening up the chromatin for the DNA to be copied. In the end, MacAlpine doesn't think that one single code will be driving the DNA duplication machinery. Rather, he predicts that the signals will come from a combination of sequence elements, activated genes and chromatin structure.

Translation, Interrupted

One of the biggest surprises to be uncovered in recent genome studies has been the existence of a class of new 'genes' that do not follow the language of the classic genetic code. These non-coding RNAs -- sequences of DNA that are transcribed into functional RNA molecules but are not translated into proteins -- were initially discovered in the 1990s by both Hunt Willard's laboratory and Shirley Tilghman's laboratory at Princeton. Now it appears that there are many thousands of such non-coding genes in the genome, ranging from the short microRNA genes involved in the regulation of development to much longer non-coding RNAs that control epigenetic silencing.

 

"While these non-coding RNAs do not produce proteins themselves, many of them do serve to fine-tune how much protein is produced by their target genes in the cell," said IGSP Investigator Ashley Chi. "For instance, microRNAs can bind the sequences next to a target gene to keep them from being translated into protein."

 

Chi specifically studies the composition of microRNAs in red blood cells from patients with sickle cell disease. He has found that microRNA levels change dramatically between healthy people and those with the illness. In particular, certain microRNAs are linked to more severe forms of sickle cell disease by making the cells less able to defend against oxidative stress. On the flip side, increased levels of microRNAs may make patients more resistant to malaria.

 

Over a decade ago, no one had even heard of non-coding RNAs. Today, scientists are finding that they can have a significant impact not only on sickle cell disease but also on other illnesses like cancer, autism and Alzheimer's disease. Clearly, much of the "other 98 percent" of our genome is doing something -- whether by directing when and where genes are active or by packaging chromatin in ways that ensure the proper regulation of gene expression and the proper mechanics of chromosomes. And only time will tell what researchers will uncover next as they continue to decode the information stored in our genome.