March 2013—Gene sequencing is hot—and for good reason. Genomic data helps researchers decipher everything from the sequenced organism’s susceptibility to disease to genetic tweaks that could make it work better (though this applies more to plants than animals). “Every assembly we do is the foundation for years of downstream research on that organism,” says Steven Salzberg, director of the Center for Computational Biology in the McKusick-Nathans Institute of Genetic Medicine.
A computer scientist by training, Salzberg realized early in his career that computation held the key to answering major questions in genetics. Today, members of Salzberg’s lab, which relocated to Johns Hopkins in July 2011, create the computer programs that isolate patterns and anomalies in massive gene sequences. With so many organisms sequenced, the lab’s members have recently begun taking a magnifying glass to the data to look at how single genes are expressed.
Salzberg has used his move to Johns Hopkins as an opportunity to collaborate with geneticists, particularly those involved in translational medicine. “To really make progress on solving human diseases, you need to work closely with people who are experts on those diseases and you need to have access to patient samples,” he says.
A key initiative in Salzberg’s lab is to make the software behind gene sequencing, a process known as whole genome assembly, less error-prone. Although genomes can extend across billions of base pairs, the machines that sequence the data can only spit out about a hundred base pairs at a time. So bioinformaticists have created programs that identify overlapping patterns in the fragments to stitch them back together, akin to piecing together a sentence from word scraps. But even the best assembly programs churn out hundreds of mistakes. Unchecked, such computational errors can be misconstrued as crucial genetic information.
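To make the stitching idea concrete, here is a minimal Python sketch of one textbook approach, greedy overlap merging. It is not the lab's software; the function names (`overlap`, `greedy_assemble`), the `min_len` threshold and the toy fragments are all assumptions chosen for illustration, and real assemblers work with far longer reads and must cope with sequencing errors and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Hypothetical toy assembler: stops when no pair overlaps by at least
    `min_len` bases and returns whatever contigs remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to stitch together
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    fragments = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
    print(greedy_assemble(fragments))  # ['ATGGCGTACGTTAGC']
```

A greedy strategy like this is easy to follow but can join the wrong fragments when the same stretch of sequence occurs in more than one place, which is one way assembly mistakes of the kind described above can arise.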
One fix from Salzberg’s lab exploits the fact that the building blocks within a gene follow a preset order. With maps of individual genes already largely available in a central databank, Salzberg and colleagues enabled their assembly program to compare the sequence of blocks within a gene that had been stitched back together to the blocks of the same gene in the databank. They then tested that program on the genome of the domestic cow, Bos taurus. The comparison let the researchers determine when blocks showed up in the right order and when there was a break in the assembly process.
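The order check can be sketched in a few lines of Python. This is an illustrative toy, not the lab's program: the function name `find_block_order_breaks`, the exact string matching and the example sequences are assumptions, and a real tool would use alignment that tolerates small differences between the assembly and the databank entry.

```python
def find_block_order_breaks(reference_blocks, assembled):
    """Report where an assembled sequence disagrees with a known block order.

    `reference_blocks` lists a gene's blocks in the order recorded in the
    databank. Each block is located in `assembled` by exact match, a
    simplifying assumption. Returns (block_index, problem) pairs.
    """
    breaks = []
    last_pos = -1
    for index, block in enumerate(reference_blocks):
        pos = assembled.find(block)
        if pos == -1:
            breaks.append((index, "missing"))
        elif pos < last_pos:
            breaks.append((index, "out of order"))
        else:
            last_pos = pos  # block sits after every block accepted so far
    return breaks


if __name__ == "__main__":
    blocks = ["ATGGCC", "TTGACA", "GGGTAA"]
    good = "ATGGCCNNNTTGACANNNGGGTAA"
    swapped = "ATGGCCNNNGGGTAANNNTTGACA"
    print(find_block_order_breaks(blocks, good))     # []
    print(find_block_order_breaks(blocks, swapped))  # [(2, 'out of order')]
```

An empty result means every block appears in the expected order; anything else flags a spot where the assembly may have been stitched together incorrectly.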
These days, Salzberg and colleagues also spend a lot of time interpreting completed genomes. In 2009, they developed a program that simplified the process of identifying genes with a new RNA-based sequencing technology called RNA-seq. Genes are composed of alternating introns and exons, and the machinery of the cell splices out the introns before translating the sequence into proteins. RNA-seq captures and sequences this spliced RNA, which then must be mapped back onto the genome. The mapping program developed by Salzberg’s lab has revealed that for many genes, the cell performs this splicing process in more than one way, sometimes excluding some exons, and other times including them. In this way, a single gene can create many different proteins. “What we were calling one gene,” Salzberg says, “is actually a collection of closely related genes.”
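A small Python sketch shows how one gene can yield several products. It models only the simplest case described above, in which internal exons are either kept or skipped; the function name `exon_skipping_isoforms`, the coordinates and the toy gene are assumptions for illustration and are not drawn from the lab's mapping program.

```python
from itertools import combinations


def exon_skipping_isoforms(gene, exons):
    """Enumerate spliced transcripts produced by skipping internal exons.

    `exons` is a list of (start, end) slices into `gene`, in order.
    Toy model: the first and last exons are always kept, and every
    subset of the internal exons is allowed. Returns a dict that maps
    the tuple of kept exon indices to the spliced sequence.
    """
    if len(exons) < 2:
        raise ValueError("need at least two exons")
    last = len(exons) - 1
    internal = range(1, last)
    isoforms = {}
    for k in range(len(internal) + 1):
        for kept in combinations(internal, k):
            keep = (0, *kept, last)
            # Introns are removed simply by never copying them.
            isoforms[keep] = "".join(gene[s:e] for s, e in (exons[i] for i in keep))
    return isoforms


if __name__ == "__main__":
    # Upper case marks exons, lower case marks introns.
    gene = "AAAgtccagCCCgtttagGGG"
    exons = [(0, 3), (9, 12), (18, 21)]
    for keep, transcript in exon_skipping_isoforms(gene, exons).items():
        print(keep, transcript)
    # (0, 2) AAAGGG
    # (0, 1, 2) AAACCCGGG
```

With one skippable exon the gene produces two transcripts; each additional internal exon doubles the count in this simplified model.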
That suggests that genes change function across time, which could explain many age-related genetic changes. For instance, Liliana Florea, a member of Salzberg’s lab and assistant professor at Johns Hopkins, has been cataloging all the possible variants of given genes. Once a gene catalogue is complete, she develops algorithms that determine the odds of developing a given variant. Some variants, she says, such as the code for a brain tissue cell appearing in the liver, “can tell us something about cancer.”
In recent years Salzberg has also begun to explore other areas including metagenomics, a field that bridges archaeology, genetics and bioinformatics. People working in metagenomics go to a study site, collect all the genetic information on that site and then parse out what is—or was—there through genetic analysis. In a recent study, for instance, Salzberg and colleagues sought to compare genomes from the woolly mammoth and Columbian mammoth, species thought to have diverged after the Columbian mammoth wandered over the Siberian land bridge some 3 million years ago.
The researchers first collected genetic samples from sites in Utah and Wyoming, where the Columbian mammoth lived. Mitochondrial DNA in the samples “looked a lot like a rough draft of the elephant,” the mammoth’s closest living relative, Salzberg says. That confirmed that the genetic material at hand was indeed from a mammoth and not something else, like bacteria. They then compared those samples to the already sequenced genome of the woolly mammoth. To their surprise, the samples shared 99.5 percent of their mitochondrial DNA. “The fact that they are essentially identical means that at some recent point there was some interbreeding of the species,” Salzberg says.
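A figure like 99.5 percent is a percent identity: the share of compared positions at which two sequences carry the same base. The Python sketch below shows the arithmetic on made-up sequences, assuming the two sequences have already been aligned to equal length with "-" marking gaps. The function name and the data are hypothetical, and the study's actual method is not described here.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned, gap-free positions where two sequences match.

    Assumes both sequences are already aligned to the same length and
    that '-' marks a gap; positions with a gap on either side are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no gap-free positions to compare")
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


if __name__ == "__main__":
    reference = "ACGT" * 50                             # 200 made-up bases
    sample = reference[:100] + "C" + reference[101:]    # one base changed
    print(percent_identity(reference, sample))          # 99.5
```

In this toy case one difference in 200 positions gives 99.5 percent; at the scale of a full mitochondrial genome, which runs to thousands of bases, the same percentage corresponds to proportionally more differences.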
Besides unearthing archaeological mysteries, metagenomics could also reveal what the bacteria on our bodies are telling us. Salzberg has been working with other Johns Hopkins researchers and the National Institutes of Health to identify the different types of bacteria that live on or inside us, from those in our gut to those on our eyelashes. “We’re in the process of cataloging these bacteria to figure out how they affect human health and how people differ,” Salzberg says. In a related project, he is collaborating with Cynthia Sears, a professor of infectious disease in the School of Medicine, to identify and characterize bacteria associated with colon cancer.
Core to the intellectual philosophy of Salzberg’s lab is making genetic data accessible to everyone. All the programs he creates are free, and he and lab member Mihaela Pertea, an assistant professor at Johns Hopkins, have spoken out against gene patents, which require patients to pay for access to their own genetic data. Mutations in the BRCA gene, for instance, significantly increase a woman’s likelihood of developing breast cancer. But Myriad Genetics holds the patent for the BRCA gene, so patients have to pay the company $3,000 to determine their risk.
As costs go down, though, whole-genome sequencing is becoming much more cost-effective than individual gene sequencing. So Salzberg and Pertea developed software that can pull out BRCA and other individual genes from the entire genome and pinpoint mutations. Once sequencing one’s own genome becomes commonplace, such do-it-yourself gene testing could transform the medical landscape, Pertea says. “If you have your genome sequenced, you can use this software,” she says, “and you don’t have to pay thousands of dollars for analysis.”