- Comprehensive soybean database is enabling a community of researchers.
- XSEDE resources and Cyverse are integral components of a burgeoning soybean knowledge base.
- SoyKB is equipped with user-friendly tools and can be adapted for other species.
Knowledge of the soybean in the US has come a long way since seeds were first smuggled from China in the 1700s. Growing expertise through selective breeding and manipulation of its environment — the warm weather, targeted water, loose soil, and full sunlight — would come from outside the US until the 20th century.
Today, an ambitious project called the Soybean Knowledge Base (SoyKB), developed at the University of Missouri-Columbia (MU), aims to find and share comprehensive genetic and genomic soybean data obtained through the use of high-performance computing (HPC).
SoyKB is a web resource with several analytical tools for all soybean-related data. It promotes deeper understanding through data analysis, helping scientists who want to improve crops develop and verify their hypotheses.
“Our goal, first of all, is to provide a resource for people to find information about the soybean genes, their behavior, their gene expression, the metabolic pathways, and more,” says Dong Xu, one of the principal investigators on the SoyKB project.
From sprout to shout
SoyKB started small, initially focusing on the genomics aspects of soybean data, but added the USDA germplasm data set after a year or two. Now, the dataset holds phenotypic information for about 19,000 soybean germplasm lines.
More than 2,000 unique users log on to the SoyKB website every month, and over 10,000 unique users have used SoyKB since it was developed in 2010. The ultimate goal of SoyKB is to improve soybean traits and help researchers enhance soybean breeding techniques.
The SoyKB project started its computation with the NSF-sponsored eXtreme Science and Engineering Discovery Environment (XSEDE), through an allocation awarded in 2014 on the Stampede supercomputer at the Texas Advanced Computing Center.
In all, it has used about 370,000 core hours on a massive project to sequence and analyze the genomes of over 1,000 soybean germplasm lines.
The main technique used is called resequencing, where genomic variations compared to a reference genome are found for each line. “The data are huge, millions of fragments mapped to a reference,” says Xu. “That's actually a very time consuming process. Resequencing data analysis takes most of our computing time on XSEDE.”
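The core idea of resequencing, comparing each line against a reference and recording where it differs, can be illustrated with a toy sketch. This is not the SoyKB pipeline: real analyses map millions of short reads and use statistical variant callers, whereas this example assumes two short, already-aligned sequences of equal length. The function name `find_variants` and the sequences are made up for illustration.

```python
def find_variants(reference, sample):
    """Return (position, ref_base, sample_base) for every mismatch.

    Positions are 1-based. Both sequences are assumed to be already
    aligned and of equal length (a simplification of real read mapping).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (i + 1, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical sequences, not real soybean data.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
variants = find_variants(reference, sample)
print(variants)
```

Repeating this kind of comparison across a whole genome and more than a thousand lines is what makes the real analysis so computationally expensive.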
SoyKB sought the genetic markers for major soybean traits that include oil and protein content; soybean cyst nematode resistance; resistance to drought, heat and salinity; and healthy root system structure.
“Once we identified the genetic variations of those lines, they are useful for breeding purposes,” says Xu. “It's really valuable data, but without XSEDE, we wouldn't be able to perform our analysis. Now that the data are mostly analyzed, and we have deposited them into SoyKB, other researchers can utilize them to answer questions of their interest.”
XSEDE goes the extra mile
According to Mats Rynge, XSEDE’s chief task was making SoyKB data broadly usable. Rynge is a computer scientist with the Information Sciences Institute (ISI), part of the University of Southern California (USC).
When he wears his XSEDE hat, he's part of the XSEDE Extended Collaborative Support Services (ECSS), a pool of experts who help researchers use the XSEDE HPC resources.
Like the warm weather soybeans require, XSEDE provided the environment of hardware, software, and expertise SoyKB needed to thrive.
Rynge's group at ISI had experience with the Pegasus workflow system, and he thought it would be a good fit for turning the SoyKB analysis into a workflow optimized for supercomputers. Like the flow of water for a data-thirsty SoyKB platform, “Pegasus is a workflow system that can take a set of computational tasks, where one task produces a piece of data that is used by another task downstream,” explains Rynge.
Pegasus ensured that task ordering was correct and that the data were formatted to best suit the parallel processing machines on XSEDE. It also handled the data management between tasks and workflow inputs and outputs.
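The task-ordering idea can be sketched in a few lines. The example below does not use Pegasus or its API; it relies only on Python's standard `graphlib` module (Python 3.9+) to show how a workflow system can derive a valid run order from data dependencies. The task names are hypothetical and are not taken from the SoyKB workflow.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks whose output it consumes.
dependencies = {
    "align_reads": {"fetch_inputs"},
    "sort_alignments": {"align_reads"},
    "call_variants": {"sort_alignments"},
    "merge_results": {"call_variants"},
}


def run_order(deps):
    """Return an execution order in which every task follows its producers."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A real workflow system does much more than this, including staging data and running independent tasks in parallel, but the dependency graph is the piece that guarantees a task never starts before its inputs exist.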
The inputs were moved to the data store of NSF-funded Cyverse (formerly iPlant). Cyverse resources supported the framework that allowed SoyKB to scale up for its thousand genome resequencing project.
“For example, the data store framework really helped us tremendously,” says co-PI Trupti Joshi. “We generated close to 25-30 terabytes of raw data from just one large-scale sequencing project.”
Another reason the SoyKB project has taken root is its suite of informatics and data analysis tools, which meet researchers on their own ground.
“We built a system that stressed the user's perspective,” Joshi says. “From doing analysis with the soybean genome to getting a view of what the gene expression might look like in different soybean tissues, or how certain soybean lines might respond to stress – the tools are complete.”
Looking forward, Xu envisions modifying SoyKB into a genetic platform for other science groups to quickly develop their own knowledge base.
“Basically you could input the genome of any species and some annotations, and that would feed into what we call the 'KBCommons,'” he says. “Scientists can develop a knowledge base for a particular disease, like heart disease or diabetes. Our platform can allow people to generate a specific platform quickly and easily.”
With the help of XSEDE hardware, software, and expertise, SoyKB is blossoming into a rich ecosystem for the community of interdisciplinary researchers, students, industry, and nonscientists hoping to take advantage of the latest science on soybeans.