Data Dump: Not All Data Are Created Equal

Data tell us nothing if not properly labeled.

By Catherine Kolf

April 2014—The advance of technology has rapidly multiplied the questions that scientists can ask and answer in their experiments. These new possibilities make this a time of excitement and promise, but they also bring challenges that must be overcome if the promise is to be realized.

One person working to draw attention to these challenges and their solutions is Kenneth Witwer, an assistant professor of molecular and comparative pathobiology at the Institute for Basic Biomedical Sciences at Johns Hopkins. His is a field that generates massive amounts of data that can vary greatly in quality. He became frustrated that so few publications in his field referred readers to repositories where they could see the original data, so he conducted his own quality assessment of 127 large sets of data published in a 10-month period. His findings were troubling, but he believes that good things are happening because of them.

“Data quality is not just about dotting i’s and crossing t’s,” Witwer says. “Patients’ lives and millions of dollars are often at stake.” One of the biggest cases of poor data quality (not included in his review) involved several related studies (see the “60 Minutes” story), published beginning in 2006, that used genetic information from cancer patients to predict which chemotherapy drugs would be most effective. The information guided clinical trials for several years, until other researchers tried time and again to replicate the results, without success. Instead, they found simple errors in data labeling, placement and analysis that sometimes flipped the originally reported predictions.

According to Witwer, the criticisms of the 2006 study resulted in several retractions and ended at least one career, but there’s no telling how many patients could have received better care had the mistakes—and possible misconduct—been caught sooner. Given the clear importance of verifying data quality, Witwer wanted to find out how often data sets from his field were made public, so he systematically reviewed 127 papers and found that less than 40 percent of them were linked to publicly shared data. Knowing that we all clean our homes most thoroughly before visitors arrive, Witwer worried that the unshared results were not “inspection-ready.” Indeed, after giving each paper a quality score based on the stringency of the experiments described in it, Witwer found that papers of the highest caliber were much more likely to link to shared data than the inferior studies.

microRNA Microarrays

One common source of biological “big data” is microarrays, which have enabled groundbreaking discoveries in areas as diverse as genetic diseases, virology and botany. As in the 2006 cancer study, microarrays are used to glean large amounts of gene-related information from small samples. Biotech companies create microarrays by tethering thousands of short sequences of DNA or RNA to a small square of glass or silica in a precise pattern so that they know exactly which sequence is at each position on the “chip.” Scientists then take samples from the cells they study and add them to the chip. The bits of RNA from the sample link together with matching sequences tethered to the chip, which is then analyzed by a special machine that detects how much material is bound to each location. Microarrays were first used to analyze the presence and quantity of messenger RNAs (mRNA), which reveal which genes are active in a given sample.

Witwer uses microarrays to study microRNAs as a window into the immune system’s response to HIV infection. First discovered about 20 years ago, microRNAs are RNA molecules that aren’t involved in creating proteins. In fact, their job is to prevent mRNAs from making proteins by targeting the mRNAs for destruction. Before microRNAs were discovered, it was generally assumed that the end result of an active gene was the production of its corresponding protein. Now we know that a lot can happen in between the production of an mRNA and the production of its corresponding protein. Scientists like Witwer use microarrays to identify the microRNAs produced by cells, because they give us more clues about how protein production is regulated.

Data Repositories

Analyzing the data from microarrays is complicated. First, the sheer volume is enormous: a single microarray chip can contain tens of thousands of test sequences, far too many for a human being to process, so researchers have developed software programs to help. But data still don’t give direct answers to research questions; scientists must interpret the data. And that is where human error—and a desire to produce good results—comes into play.

Ideally, both errors and bias are corrected through peer review, the process by which scientific journal articles are vetted by outside scientists before publication. But huge amounts of data encumber that process: the more data there are to go through, the less likely it is that the review will be thorough.

To provide transparency and to allow other scientists to double-check—and even reuse—data sets, several online sites have arisen as storehouses for publicly shared data. When a study is published in a journal, a link can be included to the site where the data are stored. According to Witwer, many journals have adopted policies to encourage or require data sharing, but even some of those journals that require data sharing don’t always enforce the policy.

You Can’t Get Good Fruit from a Bad Tree

Good results can’t come from poorly designed experiments, and good experiments take a lot of forethought. Among other things, they must be squarely founded on statistics.

Science is dead without statistics. Statistics determine how many samples must be tested before making a generalization like “Gene X is responsible for observation Y.” Statistics determine when a result is meaningful versus when it’s likely due to chance. And statistics determine when a trend is a real trend.  Witwer believes many of the data problems he and others have identified originated before the experiments even started, with an experimental design that wasn’t firmly rooted in statistical understanding.
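The role Witwer assigns to statistics—separating a real trend from a chance fluctuation—can be illustrated with a toy permutation test. This is a sketch for illustration only; the sample sizes, variable names and distribution are assumptions, and real microarray studies use more specialized methods.

```python
import random
import statistics

random.seed(42)

def permutation_p_value(a, b, n_perm=5000):
    """Estimate how often a mean difference at least as large as the
    observed one arises by chance, by reshuffling the group labels."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = a + b
    extreme = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) -
                   statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Two small samples drawn from the SAME distribution: any apparent
# difference between their means is due to chance alone.
group_x = [random.gauss(10, 2) for _ in range(5)]
group_y = [random.gauss(10, 2) for _ in range(5)]
p = permutation_p_value(group_x, group_y)
```

With only five samples per group, seemingly striking differences appear by chance surprisingly often—which is exactly why, as Witwer argues, statistical planning belongs at the design stage, not just at the analysis stage.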

“A poorly formed experiment cannot produce solid results,” he says. “Collaborating with a statistician during the design phase of an experiment is just as important as collaborating with a statistician to interpret the results.” For that reason, institutions should make available statisticians who are familiar with biomedical research, he says. He applauds the school of medicine, where collaborations between researchers and statisticians from the school of public health, for example, are encouraged and facilitated.

You Can’t Make Good Lemonade from Bad Lemons

A good statistician will tell you that not all results are meaningful. In the case of microarrays, even an empty chip will create a tiny signal in the chip reader. To know which numbers are meaningful, a researcher has to know what that “empty signal” is so that a threshold can be determined. If scientists compare two numbers but don’t realize that both are below threshold, they will end up making claims the data don’t support. That is what happened with at least one paper that Witwer found in his review.

[Figure: bar graph with three data points. B cannot be reported as twice as effective as A, since both fall below the sensitivity threshold of the experiment; C is the only valid data point.]
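The below-threshold pitfall described above can be sketched in a few lines of code. The threshold value and function name here are hypothetical, not taken from the study; in practice, the noise floor is estimated from blank-chip controls and varies by platform.

```python
# Hypothetical noise floor for an "empty" chip signal; a real threshold
# depends on the platform and is estimated from blank controls.
DETECTION_THRESHOLD = 50.0

def fold_change(signal_a, signal_b, threshold=DETECTION_THRESHOLD):
    """Compare two probe signals only when both exceed the noise floor."""
    if signal_a < threshold or signal_b < threshold:
        return None  # below threshold, any ratio is meaningless noise
    return signal_b / signal_a

noise_ratio = fold_change(20.0, 40.0)    # B looks "twice" A, but both are noise
valid_ratio = fold_change(100.0, 300.0)  # both above threshold: a valid comparison
```

Without the threshold check, the first comparison would report a twofold difference that is nothing but instrument noise—the very error Witwer found in at least one published paper.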

Dicey statistics can also make bad data look good. Full transparency of results, however, helps reviewing scientists catch mistakes in data processing—mistakes that can mean the difference between a real effect and no effect at all. As in the cancer study, sometimes the errors are as simple as putting data in the wrong column on a spreadsheet: an innocent mistake, but one that can have drastic consequences.

A Picture Is Worth a Thousand Words … Unless You Don’t Know What You’re Looking At

Sometimes pictures speak for themselves. Data never do. In order to be informative, data must be presented in context, which means that the experimental methods must be thoroughly and clearly described. This is particularly true of microRNA data, since a cell’s microRNA makeup changes continually in response to its circumstances. A scientist can share mountains of microRNA microarray data, but, without the contextual information of the experiment, no one else will be able to make sense of the results. It would be like displaying a picture of a broken, graffiti-covered piece of concrete without telling your audience that it came from the Berlin Wall.

“Even details about sample storage are important,” Witwer says. “A sample that is fresh versus frozen can generate very different results. And two samples from the same person, taken at two different time points, can be very different.”

Witwer certainly isn’t the first person to recognize the need for uniform reporting standards. A set of standards called MIAME (pronounced like the city in Florida), or Minimum Information About a Microarray Experiment, was first published in 2001, and compliance is widely encouraged by journal editors. The standards require that data be accompanied by enough experimental information for the results to be interpreted unambiguously and replicated in other labs. But in Witwer’s study, only 27 percent of the papers were MIAME-compliant. A previous study showed that even in the top journal Nature Genetics, only 10 of 18 microarray studies linked to MIAME-compliant data.

While complying with MIAME ultimately rests with the authors, its enforcement requires the combined efforts of journal editors and reviewers, Witwer says. He adds that the most unexpected finding of his study is that journals aren’t fulfilling their role as gatekeepers. “A journal with a good reputation puts its stamp of approval on the data it publishes,” he explains. “Microarray data pose a new challenge to journals, and studies like this help them meet that challenge.”

Waste Not, Want Not

One final plug for sharing large data sets publicly? Frugality. "Data sharing allows the scientific community to maximize the value of each experiment—and to prevent study after study from being built on a faulty foundation,” says Witwer.

Instead of beginning studies by assessing the activity of a handful of genes, many scientists today begin by assessing the activity of the tens of thousands of genes on their microarray chip of choice. Researchers then go on to focus future experiments on a few genes that seem most important for the question at hand, and the rest of the data might never be looked at again. Sharing data allows other labs to avoid the time and expense required for these broad, preliminary experiments.

For those who are anxious about sharing data they might want to pursue later on, Witwer suggests de-identifying the data. “Removing the data labels still allows reviewers to assess the quality of the data without having to immediately turn over all of your hard-earned insights,” he says.

Moving Forward

About 14 months after his review was published, two of the problematic papers Witwer identified have been retracted, and several others are under investigation. Remarkably, he hasn’t received any negative feedback; instead, he has won praise from academics in fields as far removed as philosophy.

Witwer believes the discussions provoked by his assessment have been fruitful. “Some of the journals were open to working with me, and at least one journal changed its policies during the course of my investigation,” he reports. That was PLOS ONE, the journal that publishes the largest amount of microRNA microarray data, and it now includes data sharing on its checklist for reviewers. Witwer thinks that a simple checklist to ensure data sharing is enough to greatly improve the quality of data being submitted.

“The scientific community is generally quite good at catching its own mistakes,” says Witwer, “but the overwhelming amount of data being produced these days has left the traditional process in need of an upgrade. Data sharing is a great solution, and it’s good to see so many journals and researchers already on board.”

Another recent change, which may or may not have been inspired by his review, is the addition of a full-time staffer at the Journal of Biological Chemistry who is in charge of assessing the data integrity of submitted manuscripts. “It’s encouraging to see a respected journal like that taking practical steps to tackle this issue,” says Witwer. “I hope that many others will follow suit.”