
Data of the People, by the People and for the People


Jaime Combariza inspects wiring for a stack of processors at MARCC. He wears protective ear gear due to the noise generated by the computers.


By Catherine Kolf

April 2015—Since its rise, the power of biological “big data” has been concentrated in relatively few bioinformatics laboratories, where computer algorithms crunch thousands of data points to reveal patterns and find meaning in genomes and other complex systems. Though driving knowledge forward at an unprecedented speed, the age of big data has created a divide between the bioinformatics “haves” and “have-nots,” crowding out those with big ideas but little access to expensive equipment and infrastructure. With next month’s opening of the Maryland Advanced Research Computing Center (MARCC, pronounced “MAR-see”), however, Johns Hopkins aims to democratize the big data revolution, putting top-of-the-line computing power within reach of every researcher.

Of Bytes and Bits

Like many in his field, Alex Szalay, an astrophysicist and cosmologist at The Johns Hopkins University, has been working with big data for decades. In 1992, he began work on the Sloan Digital Sky Survey project, the equivalent of the human genome project for stargazers. As a member of a large collaboration, Szalay helped build a virtual map of the universe that’s accessible to everyone with an internet connection.

Now Szalay is applying his expertise to biomedical research. One new project, slated to run on MARCC computers, involves newly designed software, written in collaboration with scientists from the McKusick-Nathans Institute of Genetic Medicine and the Department of Computer Science, that can line up fully sequenced genomes against a standard reference genome. With MARCC on board, computations that used to take a day will finish in much less than an hour, enabling his team to crunch several hundred genomes’ worth of data in a matter of days.

Genomes are not the only type of big data flooding biology; experts say we should expect continuous tsunamis of data about the physical world around and inside us, gathered by increasingly sophisticated instruments. A single experiment comparing gene activity in two types of tissue generates 30 to 40 gigabytes of data. A simulation of the workings of the heart generates 1 terabyte of data. A single MRI or CT scan creates 1 to 2 terabytes. None of that data has any value without analysis, but a regular server could take days or even weeks to do it.

The Maryland Advanced Research Computing Center. The large vents on the building's facade suck in air, chill it and send it to cool the computers, which can reach more than 115 degrees Fahrenheit.

MARCC is a $30 million investment from the state of Maryland that will make the potential value of all that data real. At its heart are more than 19,000 processors and 17 petabytes of storage capacity—that’s 17 million gigabytes. All of that power is housed in a 3,786-square-foot space in an unobtrusive structure at the edge of the Johns Hopkins Bayview Medical Center campus. It is shared by the Johns Hopkins University School of Medicine, Bloomberg School of Public Health, Krieger School of Arts and Sciences, Whiting School of Engineering, and the University of Maryland, College Park. Fiber optic cables connect the campuses at a speed of 100 gigabits per second, up to 10,000 times faster than a home Internet connection.
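The storage and network figures above can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units and an ideal link with no protocol overhead. The 1-terabyte payload and the function names are illustrative choices, not anything specified by MARCC.

```python
# Decimal (SI) units, matching the article's "17 million gigabytes" figure.
GB_PER_PB = 1_000_000


def petabytes_to_gigabytes(pb):
    """Convert petabytes to gigabytes using decimal units."""
    return pb * GB_PER_PB


def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and congestion."""
    return size_bytes * 8 / link_bits_per_second


storage_gb = petabytes_to_gigabytes(17)

campus_link_bps = 100e9                   # 100 gigabits per second (from the article)
home_link_bps = campus_link_bps / 10_000  # implied home connection: 10 megabits per second

one_terabyte = 1e12                       # bytes; an illustrative payload size
campus_seconds = transfer_seconds(one_terabyte, campus_link_bps)
home_seconds = transfer_seconds(one_terabyte, home_link_bps)

print(f"{storage_gb:,} GB of storage")
print(f"1 TB over the campus link: {campus_seconds:.0f} seconds")
print(f"1 TB over the home link: {home_seconds / 86_400:.1f} days")
```

Under these assumptions, a terabyte that crosses the campus link in about a minute and a half would take more than nine days over the implied home connection.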

Centralized Power

Jaime Combariza is a computational chemist who came on board as the director of MARCC in June of last year. “MARCC allows all of Johns Hopkins and the University of Maryland to centralize their computing power,” he says. Instead of individual groups using time, money and space to create their own high-performance computing centers, everyone shares the costs of cooling, networking and running the center.

Centralization also means less idle time for the computers, which means less wait time for the researchers. Natalia Trayanova, a professor of biomedical engineering, says she is eagerly awaiting MARCC. She and her team create complex simulations of the heart by compiling everything from MRIs to the latest information on heart-specific proteins. Currently, they use computing centers at Johns Hopkins’ Homewood campus and often have to wait for enough processors to become available, since many of their simulations run on multiple processors at once. If they need 10 and only nine are available, they have to wait. Now, with thousands of processors in a central location, idle computers can be used by any researchers who need them.
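The waiting problem described above can be illustrated with a toy model. The cluster names and idle-processor counts below are invented for illustration. Real schedulers also weigh memory, priorities and network layout, so this is only a sketch of the basic idea that pooling idle processors lets a multi-processor job start sooner.

```python
def can_start(job_size, free_processors):
    """A parallel job starts only if one pool can supply all its processors at once."""
    return job_size <= free_processors


# Hypothetical snapshot: idle processors on three separate departmental clusters.
idle_by_cluster = {"cluster_a": 9, "cluster_b": 4, "cluster_c": 6}
job_size = 10  # the article's example: a simulation that needs 10 processors

# Separate clusters: the job must fit entirely on one of them.
starts_when_separate = any(
    can_start(job_size, idle) for idle in idle_by_cluster.values()
)

# Centralized: the same idle processors form one shared pool.
starts_when_pooled = can_start(job_size, sum(idle_by_cluster.values()))

print("Starts on a separate cluster:", starts_when_separate)
print("Starts on the shared pool:", starts_when_pooled)
```

In this snapshot no single cluster has 10 idle processors, so the job waits, while the shared pool of 19 idle processors can start it immediately.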

Better Questions … Better Answers


Just as profound as speed are the ways new computational approaches are expanding the types of questions researchers can ask. No longer are they limited to looking at the activity of a few genes at a time, for example. Now they can take a complex disorder like autism and assess thousands of genes at a time, with computers to do the heavy analysis. They are also moving toward better personalizing medical treatment to each patient’s unique physiology.

Another big change to be wrought by MARCC has to do with access. Because big data activities require big bucks, big infrastructure and big expertise, individual scientists have had to commit to big data as their primary form of research. With MARCC, the infrastructure is available when needed, and full-time staff members will be on hand to help new and more occasional users with their questions. Steven Salzberg, a professor of biomedical engineering, computer science and biostatistics, predicts that one of the biggest uses of the new center will come from researchers adding “transcriptome analysis” to their experimental toolkits. By assessing RNA instead of DNA, transcriptomes offer insights into the activity levels of each gene. Whereas an organism’s genome is essentially static over time, transcriptomes are unique to each type of cell at each stage of health and development, and can therefore shed light on development and the progression of disease states.

Room to Grow

Eighty percent of MARCC’s computing power is already spoken for, but with enough land for four more identical centers on the lot at Hopkins Bayview, there’s plenty of room to grow, according to Szalay, who predicts that growth will come soon since access to MARCC will make researchers more competitive for grants.

MARCC is the sixth-largest academic computing center in the country, but its administrators are already planning for future technologies and needs. At the top of the wish list is securing a portion of computer space for working with private patient data. As Szalay says, “Data-heavy medicine is just getting started.”