A team of researchers from Rice University, Baylor College of Medicine (BCM) and the University of Texas at Austin is working together to develop new statistical tools that can find clues about cancer hidden like needles in enormous haystacks of raw data.
“The motivation for this is all of these new high-throughput medical technologies that allow clinicians to produce tons of molecular data about cancer,” says project lead Genevera Allen in a Rice University release. Dr. Allen is an assistant professor in the Department of Statistics and, by courtesy, the Department of Electrical and Computer Engineering at Rice University, and in the Department of Pediatrics-Neurology at Baylor College of Medicine. She is also Rice’s Dobelman Family Junior Chair of Statistics and a member of the Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital.
“For example,” Dr. Allen notes, “when a tumor is removed from a cancer patient, researchers can conduct genomic, proteomic and metabolomic scans that measure nearly every possible aspect of the tumor, including the number and location of genetic mutations and which genes are turned off and on. The end result is that for one tumor, you can have measurements on millions of variables.”
This type of data exists — the National Institutes of Health (NIH) has compiled such profiles for thousands of cancer patients — but scientists don’t yet have a way to use the data to defeat cancer.
Dr. Allen and her collaborators hope to change that, thanks to a new $1.3 million federal grant that will allow them to create a new statistical framework for integrated analysis of multiple sets of high-dimensional data measured on the same group of subjects.
“There are a couple of things that make this challenging,” observes Dr. Allen, who is a principal investigator (PI) on the new grant, which was awarded jointly by the National Science Foundation and the NIH. “First, the data produced by these high-throughput technologies can be very different, so much so that you get into apples-to-oranges problems when you try to make comparisons. Second, for scientists to leverage all of this data and better understand the molecular basis of cancer, these varied ‘omics’ data sets need to be combined into a single multivariate statistical model.”
For example, Dr. Allen explains in the Rice release that some tests, like gene-expression microarrays and methylation arrays, return “continuous data” — numbers with decimal places that represent the amounts of a particular protein or biomarker. Other tests, like RNA-sequencing, return “count data” — integers that indicate how often a biomarker shows up. And for yet other tests, the output is “binary data.” An example of this would be a test for a specific mutation that produces a zero if the mutation does not occur and a one if it does.
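The distinction between these three data types can be sketched with a few toy arrays; the values below are invented purely for illustration:

```python
import numpy as np

# Hypothetical measurements for five tumor samples, one array per "omics" platform.
expression = np.array([2.31, 0.87, 1.05, 3.42, 0.12])  # continuous: microarray intensities
rna_counts = np.array([14, 0, 233, 7, 51])             # counts: RNA-seq reads per biomarker
mutation   = np.array([0, 1, 0, 0, 1])                 # binary: is a specific mutation present?

# Each type calls for a different distributional family -- roughly, Gaussian
# for continuous values, Poisson for counts, Bernoulli for binary indicators --
# which is why a single bell-curve model fits the data so poorly.
```

No bell-shaped curve sensibly describes the second or third array, which is the apples-to-oranges problem Dr. Allen describes.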
“Right now, the state of the art for analyzing these millions of biomarkers would be to create one data matrix — think one Excel spreadsheet — where all the numbers are continuous and can be represented with bell-shaped curves,” says Dr. Allen. “That’s very limiting for two reasons. First, for all noncontinuous variables — like the binary value related to a specific mutation — this isn’t useful. Second, we don’t want to just analyze the mutation status by itself. It’s likely that the mutation affects a bunch of these other variables, like epigenetic markers and which genes are turned on and off. Cancer is complex. It’s the result of many things coming together in a particular way. Why should we analyze each of these variables separately when we’ve got all of this data?”
She notes that developing a framework where continuous and noncontinuous variables can be analyzed simultaneously won’t be easy. For starters, most of the techniques that statisticians have developed for parallel analysis of three or more variables — a process called multivariate analysis — only work for continuous data.
“It is a multivariate problem, and that’s how we’re approaching it,” Dr. Allen says. “But a proper multivariate distribution does not exist for this, so we have to create one mathematically.”
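One generic way statisticians approach this kind of problem, not necessarily the team's exact published algorithm, is neighborhood selection: regress each variable on all the others with a sparsity-inducing penalty, choosing a model family that matches each variable's type, and read conditional-dependence edges off the nonzero coefficients. A rough sketch on invented data, using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import Lasso, LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Invented toy data: a binary mutation shifts one continuous expression level.
mutation = rng.integers(0, 2, size=n)          # binary variable
expr_a = 1.5 * mutation + rng.normal(size=n)   # conditionally dependent on the mutation
expr_b = rng.normal(size=n)                    # unrelated noise
X = np.column_stack([mutation, expr_a, expr_b])

# Neighborhood selection: regress variable j on all the others with an L1
# penalty, using logistic regression for binary variables and the lasso for
# continuous ones (a Poisson model would play the same role for count data).
# Nonzero coefficients suggest conditional-dependence edges in the network.
def neighbors_of(j, X, binary):
    others = np.delete(X, j, axis=1)
    if binary:
        fit = LogisticRegression(penalty="l1", C=0.5, solver="liblinear")
        fit.fit(others, X[:, j].astype(int))
        coef = fit.coef_.ravel()
    else:
        fit = Lasso(alpha=0.2)
        fit.fit(others, X[:, j])
        coef = fit.coef_

    # Indices (into the remaining columns) of the estimated neighbors.
    return np.flatnonzero(np.abs(coef) > 1e-6)

print(neighbors_of(0, X, binary=True))   # mutation's neighbors: expr_a should appear
print(neighbors_of(2, X, binary=False))  # expr_b, pure noise, should link to nothing
```

Stitching the per-variable families into one coherent joint distribution, rather than fitting each regression in isolation, is the mathematical construction the grant is aimed at.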
To do this, Dr. Allen and her collaborators — grant co-PIs Zhandong Liu of BCM and Pradeep Ravikumar, an assistant professor in the Department of Computer Science at the University of Texas at Austin — are creating a mathematical framework that will allow them to find the “conditional dependence relationships” between any two variables.
A member of the International Society of Computational Biology, Dr. Zhandong Liu develops bioinformatics approaches for analyzing high-throughput biological data produced by gene expression arrays, RNA-seq and genomic sequencing. His work integrates multiple data types in the interest of advancing our understanding of neurological diseases. Dr. Liu developed a graphical random walk (GRW)-based algorithm that can accurately predict pathway activity from microarray gene expression data. GRW uses gene-gene interaction data to construct a pathway signature in a manner analogous to particle-particle interactions described by Coulomb’s law. By comparing GRW to other standard approaches, he has demonstrated that GRW can sensitively and specifically predict pathway activity across tissues, species, and platforms. The Liu lab’s long-term goal is to develop computational models and algorithms in genomics and computational biology — tools that will allow researchers to better understand the etiology of neurological diseases from computational and systems biology perspectives.
Dr. Ravikumar heads the Statistical Machine Learning Group in the Department of Computer Science at the University of Texas at Austin and is the assistant director of an upcoming Center for Big Data Analytics at UT Austin. He is also affiliated with the Division of Statistics and Scientific Computation and the Institute for Computational Engineering and Sciences, and he served as program chair for the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS) in 2013. His main area of research is statistical machine learning, where the core problem is to infer conclusions from observations or data — reliably, and with limited computation and limited data. Of particular interest are modern settings where the dimensionality of the data is high and achieving these twin objectives simultaneously is difficult.
To illustrate how conditional dependence works, Dr. Allen suggests considering three variables related to childhood growth — age, IQ and shoe size. In a typical child, all three increase together.
“If we looked at a large dataset, we would see a relationship between IQ and shoe size,” she says. “In reality, there’s no direct relationship between shoe size and IQ. They happen to go up at the same time, but in reality, each of them is conditionally dependent upon age.”
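Dr. Allen's example can be sketched with simulated data: generate children whose age drives both shoe size and IQ score, then compare the raw correlation with the partial correlation after age is regressed out (the growth-rate numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated children: age independently drives both shoe size and IQ score.
age = rng.uniform(5, 15, size=n)
shoe = 0.8 * age + rng.normal(scale=1.0, size=n)
iq = 5.0 * age + rng.normal(scale=10.0, size=n)

# Marginal correlation: shoe size and IQ look strongly related.
marginal = np.corrcoef(shoe, iq)[0, 1]

# Partial correlation given age: regress age out of both variables and
# correlate the residuals. The apparent relationship collapses toward zero,
# revealing that each variable is conditionally dependent only on age.
def residual(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

partial = np.corrcoef(residual(shoe, age), residual(iq, age))[0, 1]
print(f"marginal r = {marginal:.2f}, partial r = {partial:.2f}")
```

The large marginal correlation and near-zero partial correlation are exactly the signature of a spurious link explained away by a shared parent variable.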
For cancer genes, where the relationships are far less obvious, a mathematical technique for deciphering conditional dependence could spare researchers the years of expensive, time-consuming biological experiments otherwise needed to rule out such spurious relationships.
Dr. Allen and her collaborators have already illustrated how to use the technique by producing a network model for a half-million biomarkers related to a type of brain cancer called glioblastoma. The model acts as a sort of road map to guide researchers to the relationships that are most important in the data.
“All these lines tell us which genetic biomarkers are conditionally dependent upon one another,” she says in reference to the myriad connections in the model. “These were all determined mathematically, but our collaborators will test some of these relationships experimentally and confirm that the connections exist.”
Dr. Allen says the team’s technique will also be useful for big data challenges that exist in fields ranging from retail marketing to national security.
“This is a very general mathematical framework,” she says. “That’s why I do math. It works for everything.”
Collaboration and leadership in genetics and neuroscience have allowed faculty at Baylor College of Medicine and Texas Children’s Hospital to discover the underlying causes of dozens of neurological disorders.
Dr. Allen is also an assistant professor of pediatric neurology at BCM and a member of the Jan and Dan Duncan Neurological Research Institute (NRI) at Texas Children’s Hospital, which opened in December 2010. Dedicated to improving the lives of patients facing devastating neurological disorders, the NRI is a basic research institute committed to understanding the pathogenesis of neurological diseases, with the ultimate goal of developing treatments.
The Texas Medical Center (TMC) is the largest in the world, and its institutions treat about 6 million patients each year. The 344,000-square-foot, silver-level LEED-certified NRI building, nestled in the heart of the TMC, brings BCM and TCH researchers working on neurodevelopment and neurological diseases under one roof and provides space to recruit investigators with complementary areas of expertise. The NRI also physically links BCM and TCH to M.D. Anderson Cancer Center, where University of Texas collaborators lend strength in developmental biology and epigenetics. Most importantly, the NRI’s location adjacent to the hospitals fosters relationships between the lab and the clinic, which are integral to the translational research enterprise.
Baylor College of Medicine
Jan and Dan Duncan Neurological Research Institute at Texas Children’s Hospital
Texas Medical Center
University of Texas at Austin