The Soybean Knowledge Base (SoyKB) project, developed at the University of Missouri-Columbia (MU) to build and publicly share a comprehensive soybean genetic and genomic database, is achieving its goals with the help of high-performance computing.
One of the SoyKB project’s principal investigators, Dong Xu, an MU professor and chair of the computer science department, describes the database as a web resource for all soybean data, ranging from molecular data to field data, that also includes several analytical tools.
“Our goal, first of all, is to provide a resource for people to find information about the soybean genes, their behavior, their gene expression, the metabolic pathways, and more,” Xu said in a press release.
Xu added that SoyKB is more than just a data clearinghouse. It also promotes deeper understanding for scientists who want to improve crops and to develop and verify hypotheses through data analysis.
SoyKB initially focused on the genomics aspects of soybean data, said Trupti Joshi, co-principal investigator and director of translational bioinformatics at MU’s School of Medicine medical research office and an assistant research professor in the school’s Department of Molecular Microbiology and Immunology.
“After a year or two,” Joshi said, “we added the USDA germplasm data set, which gives you phenotypic information for about 19,000 soybean germplasm lines . . . That is when we started building a lot of tools in the informatics suite.”
Joshi said the efforts are helping researchers find connections between genomic data and variations in germplasm lines. Over the years, SoyKB has had users from academia and industry worldwide.
SoyKB’s ultimate goal is to improve soybean traits and to support researchers in developing better soybean breeding techniques.
“Our focus has been mainly on integrating multi-omics data sets about gene expression, protein expression, variations in the soybean, and then bridging it from this translational genomics side to the molecular breeding side, where it affects the soybean researchers and farmers,” Joshi said.
The SoyKB project started its computation on the Stampede supercomputer at The University of Texas at Austin’s Texas Advanced Computing Center (TACC), which it accessed through the National Science Foundation (NSF)-sponsored eXtreme Science and Engineering Discovery Environment (XSEDE). XSEDE is a collection of advanced digital resources that are available to scientists for sharing and analyzing massive datasets in almost every field of research.
Stampede, a Dell PowerEdge C8220 Cluster with Intel Xeon Phi coprocessors, is one of the largest computing systems in the world for open science research. The system provides high-powered computational capabilities to the international research community, enabling breakthrough science. SoyKB used about 370,000 core hours on it to sequence and analyze the genomes of over 1,000 soybean germplasm lines.
SoyKB was essentially a pipeline of Perl scripts when it began using XSEDE, according to Mats Rynge, a computer scientist at the University of Southern California’s Information Sciences Institute (ISI) and a member of the XSEDE Extended Collaborative Support Services (ECSS) effort and Workflow Community Applications Team. ECSS is a pool of experts who help researchers use the XSEDE cyberinfrastructure, a grid made up of some of the most powerful computer hardware and software in the world. With XSEDE, SoyKB researchers accelerated the search for genetic markers that determine major soybean traits such as oil and protein content; soybean cyst nematode resistance; drought, heat, and salinity resistance; and healthy root system structure.
“These data were very useful,” said Xu. “Without XSEDE, we wouldn’t be able to analyze this data. Now that the data are mostly analyzed, and we deposited this data into SoyKB, other researchers can also utilize it to answer questions of their interest.”
Rynge’s ISI group also used the Pegasus workflow management system to turn the SoyKB pipeline from a set of scripts into a workflow optimized for supercomputers. They ensured that the ordering of tasks was correct and that the data were formatted to best suit execution on XSEDE’s parallel processing machines.
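The task-ordering idea can be illustrated with a minimal sketch. It is not the SoyKB or Pegasus code: the task names and the dependency table are hypothetical, and it uses only Python's standard `graphlib` module to show how a workflow described as a dependency graph yields a valid execution order, with independent tasks free to run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks for illustration only; each key maps to the set of
# tasks that must finish before it can start.
dependencies = {
    "align_sample_a": set(),
    "align_sample_b": set(),
    "call_variants_sample_a": {"align_sample_a"},
    "call_variants_sample_b": {"align_sample_b"},
    "merge_variants": {"call_variants_sample_a", "call_variants_sample_b"},
}


def execution_order(deps):
    """Return one valid order in which every task follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


def parallel_batches(deps):
    """Group tasks into batches whose members do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are done
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


order = execution_order(dependencies)
batches = parallel_batches(dependencies)
print(order)
print(batches)
```

In this sketch the two alignment tasks land in the same batch because neither depends on the other, which is the kind of independence a workflow system can exploit on a parallel machine.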
A fully configurable cloud computing environment provided by XSEDE Jetstream also helped broaden SoyKB participation, with workflow inputs moved from MU and hosted on the NSF-funded CyVerse data store. CyVerse is a multi-institution life sciences resource for managing big data, with platforms that provide data storage, bioinformatics tools, image analyses, cloud services, APIs, and more. It supported a framework for SoyKB to scale up its resequencing project.
Another move made by SoyKB investigators was to transfer memory-guzzling genomic analysis from Stampede to TACC’s Wrangler, a data-intensive system launched in 2015. In 2013, NSF awarded TACC and its academic partners Indiana University and the University of Chicago $11.2 million to build and operate Wrangler, which was designed to work closely with the older Stampede system. Wrangler shaved days off SoyKB’s genomic analysis runs.
Rynge said Wrangler is part of the success story.
“When Wrangler came on, it turned out to be a much better fit. We transitioned from Stampede to Wrangler, and we have been very happy with it since,” he said.
Joshi said a highlight of the SoyKB project is the suite of easy-to-use tools that were developed for informatics data analysis: “They are complete all the way from doing analysis with the soybean genome to getting you a view of what the gene expression might look like in different soybean tissues versus how certain soybean lines might respond to stress, whether it is in response to soybean cyst nematode worms or whether it is in response to drought stress.”
Xu said he envisions SoyKB expanding its platform to other systems through something like an app store.
“This means we have many individual tools other than the data analysis pipeline,” Xu said. “We have a genotype-phenotype analysis pipeline. We also developed some visualization capacity. We have more than a dozen tools. We would like to make these tools available to any other databases.”
Xu said another future direction for SoyKB is to make it a generic platform that other science groups can use to develop their own knowledge bases.
Joshi said the project has been a great training opportunity and provides a framework for training the next generation of scientists: “It gets high school students involved, even if they’re simply interested in knowing what a soybean plant looks like and how it responds to stress.”