Since the advent of the Human Genome Project at the turn of the century, big data has had researchers looking for analytical tools to bring order from the chaos. Creating a maestro to orchestrate the cacophony is the aim of genomic analysis projects like Bioconductor.
Bioconductor is an open-source suite of software packages that enables the worldwide dissemination and analysis of genomic data. Launched in the fall of 2001, Bioconductor provides a common set of tools to facilitate bioinformatic statistical analysis of high-throughput sequencing, DNA microarray, flow cytometry, single-nucleotide polymorphism (SNP), and other genetic data.
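To give a flavor of how researchers pick up these tools, getting started from within R follows the pattern below. This is only a minimal sketch: BiocManager is Bioconductor's real package installer and limma is a real Bioconductor package for microarray analysis, but the snippet assumes a working R installation with network access.

```r
# Minimal sketch of getting started with Bioconductor from R.
# BiocManager is Bioconductor's package installer; limma is one of
# its widely used packages for analyzing microarray data.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("limma")  # fetch a Bioconductor package
library(limma)                 # load it for an analysis session
```

The same `BiocManager::install()` call works for any of the project's packages, which is part of how Bioconductor keeps its worldwide user base on a common, interoperable toolset.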
The software suite, funded in part by the US National Science Foundation, is developed in the statistical computing language R by scientists around the world; a team based primarily at the Fred Hutchinson Cancer Research Center in the US leads the effort.
In some ways, the story of Bioconductor is the story of big data. Prior to tools like Bioconductor, genomic research relied on Perl for processing sequence and text-like data. But there were few tools for working with quantitative data types like microarray data, says Wolfgang Huber, senior scientist at the European Molecular Biology Laboratory in Heidelberg, Germany.
“The working mode was that computation-savvy academic labs would build up analysis systems from scratch, with little regard for interoperability or code re-use,” says Huber. “Lab biologists lived with the expectation that they could satisfy their needs with point-and-click software. Code tended to be closed, obscure, non-portable, and insular.”
The result was a discordant score for deciphering the human genetic code. In contrast, analysis projects like Bioconductor have created a transparent, reproducible code base as a common language for users, and have built a worldwide community that fosters developers from among subject-matter experts.
To say that anything accomplished with Bioconductor could not have been accomplished without it would be arrogant, Huber admits. Scientists, after all, are an inventive lot. Nevertheless, The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), and the US National Human Genome Research Institute (NHGRI) are examples of projects in which Bioconductor was heavily used and helped speed progress, Huber says.
More generally, Bioconductor is responsible for the rapid adoption of microarrays, as well as other functional assays like ChIP-Seq and RNA-Seq, in many research hospitals around the world, Huber says.
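For a sense of what an RNA-Seq analysis looks like in practice, a differential-expression test with Bioconductor's DESeq2 package follows the rough pattern below. The `counts` matrix (genes by samples) and the `samples` data frame with its `condition` column are hypothetical placeholders standing in for a real experiment's data; the sketch assumes DESeq2 is already installed.

```r
# Sketch of a differential-expression analysis with DESeq2.
# Assumes `counts` is a gene-by-sample matrix of RNA-Seq read counts
# and `samples` is a data frame with a `condition` column (e.g.,
# tumor vs. normal) -- both hypothetical placeholders here.
library(DESeq2)

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = samples,
                              design    = ~ condition)
dds <- DESeq(dds)     # fit the model and test each gene
res <- results(dds)   # log2 fold changes and adjusted p-values

head(res[order(res$padj), ])  # genes most associated with condition
```

Workflows like this, shared as open, versioned code rather than one-off lab scripts, are what replaced the closed and insular analysis systems Huber describes.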
Bioconductor partner NHGRI hosts the Encyclopedia of DNA Elements (ENCODE) database, which houses in excess of 1.1 petabytes of data. Bioconductor's global reach suggests the project handles an even larger volume of data overall.
The volume of medical data processed by the Bioconductor project over the last 14 years, though difficult to measure, is likely at the petabyte scale by this point. “More important than the number of bytes,” Huber notes, “is the complexity and richness of the data, the number of cross-relationships, or the importance of a certain disease to society.”
To hold and transfer these large datasets, Bioconductor looks to resources such as the US National Center for Biotechnology Information, TCGA, or the European Bioinformatics Institute. In the US, other repositories for storing, cataloging, and accessing cancer genome sequences include those held by the Cancer Genomics Hub at UC Santa Cruz, and institutes like the Broad Institute and the cBioPortal at Memorial Sloan Kettering Cancer Center.
Big data successes
As outlined in Huber's Nature Methods article, Bioconductor is a resounding success. Success for big data echoes in many registers, however. Under the model of precision medicine, recently prioritized in the US by the White House, data flows from patient to an analysis engine like Bioconductor and finally to a clinic, where the happy end result is a pharmaceutical product for the patient's malady.
Big data has shown the promise of precision medicine by enabling the production of drugs such as imatinib for leukemia, trastuzumab for breast cancer, and gefitinib for lung cancer. Often, the efficacy of a pharmaceutical remedy hinges on a tumor's molecular portrait, so analytical software like Bioconductor provides invaluable assistance to healthcare practitioners. Certain drugs are only effective in patients with a particular mutation, says Carolyn Hutter, program director in the division of genomic medicine at the NHGRI.
Hutter points to the Adjuvant Lung Cancer Enrichment Marker Identification and Sequencing Trials (ALCHEMIST) as an example of how big data is exploring the potential of precision medicine. The trials look for changes in the ALK and EGFR genes, both thought to drive cancer growth. After a patient's tumor is removed, the patient is given crizotinib or erlotinib to see how well the drugs prevent recurrence and improve cancer survival.
Beyond direct pharmaceutical applications, helping determine a cancer's severity is another measure of big data's success in oncology. Thanks to the increased access to genomic information provided by analytical tools like Bioconductor, scientists have learned that a tumor's molecular characterization may be more important than its tissue of origin.
Classifying tumors based on their molecular profile is informative to the patient and their physician; knowing what is happening with a cancer can yield a very different prognosis and course of treatment. “If you can separate people with benign cancer from those with malignant tumors, then you can limit more aggressive treatments – which may have more severe side effects – only to those patients who need them,” says Hutter.
But perhaps the most important metric of success for big data is the scaffolding it provides for future, as-yet-unforeseen advances in health practice and human knowledge. When placed in a larger epistemological context, it is hard to overstate the role big data and tools like Bioconductor play as we come to understand the wealth of information at our fingertips.