The University of Texas at Austin’s Texas Advanced Computing Center (TACC) has announced the deployment of Wrangler, a next-generation supercomputer that tames big data the way Old West cowboys tamed wild horses: massive computing tasks involving thousands of files that must be opened, examined, and cross-correlated quickly.
Niall Gaffney, a former Hubble Space Telescope scientist and now director of Data Intensive Computing at TACC, says Wrangler fills a gap in the supercomputing resources of XSEDE, the Extreme Science and Engineering Discovery Environment. Supported by the National Science Foundation (NSF), XSEDE is a collection of advanced digital resources that lets scientists share and analyze the massive datasets being produced today in almost every field of research.
Gaffney has managed some of the richest astronomical data ever recorded in terms of scientific and public impact, and in his role at TACC he oversees the center’s big-data strategy, which spans storage systems, data collections, analytics (data mining and statistics), and architectures for data-driven science and data-intensive computing.
TACC is one of America’s leading centers of computational research. Located on UT Austin’s J.J. Pickle Research Campus, the center’s mission is to enable discoveries that advance science and society through the application of advanced computing technologies.
In 2013, the NSF awarded TACC and its academic partners, Indiana University and the University of Chicago, $11.2 million to build and operate Wrangler. The system is designed to work closely with Stampede, TACC’s older 10-petaflop supercomputer, which was ranked the 10th most powerful system in the world on the twice-yearly Top500 list and has been the flagship of TACC’s supercomputer fleet, which also includes Maverick, Longhorn, Ranger, and Stockyard, a 20-petabyte large-scale global file system. TACC reports that since coming online in 2013, Stampede has run more than six million jobs for open science.
The TACC science environment includes high performance computing, visualization, data analysis, storage systems, software, and portal interfaces that enable researchers to make discoveries that help them to understand the world better, plan better cities, and design more precisely targeted drugs. The center’s experts work with thousands of researchers each year on projects that help them work more efficiently through the use of UT Austin’s advanced computing resources.
Founded in June 2001, TACC has grown from a staff of a dozen working with one mid-level Cray supercomputer to more than 110 staff and students who operate several of the most powerful supercomputers and visualization systems in the world, along with the network and data-storage infrastructure to support them. TACC has become one of the leading academic advanced-computing centers in the U.S., deploying and operating since its inception a succession of supercomputers and advanced visualization systems for national programs, supported by funding from the NSF, UT Austin, the UT System, and grants from other federal agencies and private foundations.
“We kept a lot of what was good with systems like Stampede,” Gaffney said in a TACC news release, “but added new things to it like a very large flash storage system, a very large distributed spinning disc storage system, and high-speed network access. This allows people who have data problems that weren’t being fulfilled by systems like Stampede and Lonestar to be able to do those in ways that they never could before.”
Gaffney said Wrangler is equipped to lead the way in the “bumpy world” of data-intensive science research, comparing older supercomputers like Stampede to race cars optimized for speed on smooth, well-defined racecourses. Wrangler, by contrast, is built like a rally car, able to go fast on unpaved, bumpy roads slick with mud and gravel.
“If you take a Ferrari off-road you may want to change the way that the suspension is done,” Gaffney said. “You want to change the way that the entire car is put together, even though it uses the same components, to build something suitable for people who have a different job.”
At Wrangler’s supercomputing heart is a 10-petabyte disk-based storage system and 600 terabytes of flash memory shared via PCIe interconnect across more than 3,000 Intel Haswell-generation compute cores. “All parts of the system can access the same storage,” Gaffney said. “They can work in parallel together on the data that are stored inside this high-speed storage system to get larger results they couldn’t get otherwise.”
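The pattern Gaffney describes, in which many cores work in parallel against one shared store, looks roughly like the following Python sketch; the mount point and file layout are hypothetical stand-ins for Wrangler’s actual file system.

# A minimal sketch of the shared-storage pattern described above: many
# workers on different cores read from the same file system in parallel
# and combine their partial results. Paths and file layout are hypothetical.
import glob
import hashlib
from multiprocessing import Pool

SHARED_FS = "/wrangler/flash/project/data"  # hypothetical mount point

def summarize(path):
    """Read one file from shared storage and return (path, size, digest)."""
    with open(path, "rb") as f:
        data = f.read()
    return path, len(data), hashlib.sha256(data).hexdigest()

if __name__ == "__main__":
    files = glob.glob(f"{SHARED_FS}/*.dat")
    with Pool(processes=16) as pool:           # workers spread across cores
        results = pool.map(summarize, files)   # all workers hit the same storage
    total_bytes = sum(size for _, size, _ in results)
    print(f"{len(results)} files, {total_bytes} bytes scanned in parallel")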
Wrangler’s massive flash storage capacity, which enables real-time analytics at scale, is supplied by DSSD, a startup co-founded by Sun Microsystems alumnus Andy Bechtolsheim and acquired by EMC in May 2015. Bechtolsheim’s influence at TACC goes back to his lead design of the Magnum InfiniBand network switch used in the now-decommissioned Ranger supercomputer, Stampede’s predecessor.
“What’s new is that DSSD took a shortcut between the CPU and the data,” Gaffney said. “The connection from the brain of the computer goes directly to the storage system. There’s no translation in between. It actually allows people to compute directly with some of the fastest storage that you can get your hands on, with no bottlenecks in between.”
Speeding Up the Gene Analysis Pipeline
Gaffney recalls the hangup scientists encountered with OrthoMCL, a code that trawls through DNA sequences to find common genetic ancestry in seemingly unrelated species. The problem was that OrthoMCL compiled databases described as “wild as a bucking bronco.”
“It generates a very large database and then runs computational programs outside and has to interact with this database,” said biologist Rebecca Young, a postdoctoral research fellow in integrative biology at the Hofmann Lab of UT Austin’s Center for Computational Biology & Bioinformatics. “That’s not what Lonestar and Stampede and some of the other TACC resources were set up for.”
Young recalls that when she first used OrthoMCL with online resources, she was only able to pull out 350 comparable genes across 10 species. By contrast, she said, “When I run OrthoMCL on Wrangler, I’m able to get almost 2,000 genes that are comparable across the species. This is an enormous improvement from what is already available. What we’re looking to do with OrthoMCL is to allow us to make an increasing number of comparisons across species when we’re looking at these very divergent, these very ancient species separated by 450 million years of evolution.”
“We were able to go through all of these work cases in anywhere between 15 minutes and six hours. This is a game changer,” said Gaffney, adding that getting quicker results enables scientists to conduct deeper investigations by working with larger data collections, thereby leading to discoveries that would have been previously unattainable.
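OrthoMCL strains conventional HPC systems because it is database-bound rather than compute-bound. The toy Python sketch below (not OrthoMCL itself; the table layout and scores are invented for illustration) shows the shape of the problem: pairwise similarity scores are loaded into a relational database, which is then queried repeatedly to pull out candidate ortholog pairs for clustering.

# A toy illustration of the database-bound pattern described above. The
# real pipeline writes millions of pairwise similarity scores into an
# external database and interacts with it throughout the run.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for an external database server
conn.execute("CREATE TABLE similar (gene_a TEXT, gene_b TEXT, score REAL)")
conn.executemany(
    "INSERT INTO similar VALUES (?, ?, ?)",
    [("sp1_g1", "sp2_g7", 0.92), ("sp2_g7", "sp1_g1", 0.91),
     ("sp1_g2", "sp2_g9", 0.35), ("sp2_g9", "sp1_g2", 0.30)],
)

# Reciprocal pairs above a threshold become candidate ortholog pairs;
# the real code then feeds a normalized version of this graph to clustering.
rows = conn.execute("""
    SELECT s.gene_a, s.gene_b
    FROM similar s JOIN similar t
      ON s.gene_a = t.gene_b AND s.gene_b = t.gene_a
    WHERE s.score > 0.5 AND s.gene_a < s.gene_b
""").fetchall()
print("candidate ortholog pairs:", rows)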
Dark Energy in the Spotlight
UT Austin’s Department of Astronomy is ranked as one of the top 10 astronomy research programs in the United States. “Data is really the biggest challenge with our project,” said UT Austin astronomer and assistant professor Steve Finkelstein, whose current main NSF-funded project is the Hobby-Eberly Telescope Dark Energy Experiment (HETDEX), claimed to be the largest survey of galaxies ever attempted. UT Austin scientists expect HETDEX to map over a million galaxies in three dimensions in the process of discovering thousands of new galaxies, but the project’s main goal is the study of dark energy — a mysterious force that pushes galaxies apart.
“Every single night that we observe — and we plan to observe more or less every single night for at least three years — we’re going to make 200 GB of data,” Finkelstein said. “It’ll measure the spectra of 34,000 points of skylight every six minutes [and] process it. By the end of the night it will actually be able to take all the data together to find new galaxies.”
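Taking the quoted figures at face value, the nightly volumes add up quickly. A short back-of-the-envelope calculation in Python illustrates the scale; the three-year duty cycle and the ten-hour night are assumptions made for the arithmetic.

# Back-of-the-envelope arithmetic for the observing numbers quoted above.
GB_PER_NIGHT = 200
NIGHTS_PER_YEAR = 365          # "more or less every single night"
YEARS = 3                      # "at least three years"

total_tb = GB_PER_NIGHT * NIGHTS_PER_YEAR * YEARS / 1000
print(f"~{total_tb:.0f} TB over {YEARS} years")   # ~219 TB

# 34,000 spectra every six minutes over an assumed 10-hour night:
exposures_per_night = 10 * 60 // 6
print(f"~{34_000 * exposures_per_night:,} spectra per night")  # ~3,400,000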
Optimizing Energy Efficiency in Buildings
Computer scientist Joshua New, principal investigator of the Office of Energy Efficiency & Renewable Energy’s Autotune project at Oak Ridge National Laboratory (ORNL) in Oak Ridge, Tennessee, hopes to take advantage of Wrangler’s big-data capability. Autotune’s objective is to create a software energy-model analog of an existing building that can be used for retrofit planning, retro-commissioning, or measurement and verification (M&V) of efficiency measures. Calibrating the model involves more than 3,000 different inputs, deriving difficult-to-obtain values such as air infiltration rates, actual occupancies and plug loads, degraded equipment efficiencies, and internal thermal mass.
The model is calibrated against measured data such as monthly utility bills, interval meter or sub-meter data (such as Green Button), zone temperatures, or other sensor data streams, in order to generate useful information for determining what an optimal energy-efficient retrofit would entail for a particular building.
The Autotune project aims to replace art with science, and expensive human time with cheap computing time, by using evolutionary computation to calibrate model inputs from any source of measured data that can be mapped to simulation-engine output.
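In outline, the evolutionary approach works like the minimal Python sketch below: a population of candidate model inputs is repeatedly scored against measured data, and the best candidates seed the next generation. The stand-in “simulation” and utility readings are invented; Autotune’s real engine is a full building-energy simulator.

# A minimal sketch of evolutionary calibration: evolve candidate model
# inputs until simulated output matches measured data. All numbers are
# invented for illustration.
import random

measured = [820, 760, 640, 510, 430, 520]  # invented monthly kWh readings

def simulate(infiltration, plug_load):
    """Stand-in for one building-energy simulation run."""
    return [400 * infiltration + 50 * plug_load + 30 * m for m in range(6)]

def fitness(params):
    sim = simulate(*params)
    return -sum((s - m) ** 2 for s, m in zip(sim, measured))  # lower error wins

pop = [(random.uniform(0, 2), random.uniform(0, 10)) for _ in range(50)]
for _ in range(100):                                   # generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                                 # keep the best candidates
    pop = parents + [
        (random.choice(parents)[0] + random.gauss(0, 0.05),   # mutate offspring
         random.choice(parents)[1] + random.gauss(0, 0.25))
        for _ in range(40)
    ]
best = max(pop, key=fitness)
print(f"calibrated inputs: infiltration={best[0]:.2f}, plug_load={best[1]:.2f}")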
“Wrangler has enough horsepower that we can run some very large studies and get meaningful results in a single run,” New said in the TACC release. He currently uses ORNL’s Titan supercomputer, which can run 500,000 simulations and write 45 TB of data to disk in 68 minutes, and he wants to scale out his parametric studies to simulate all 125.1 million buildings in the U.S.
“I think that Wrangler fills a specific niche for us in that we’re turning our analysis into an end-to-end workflow, where we define what parameters we want to vary,” New said. “It creates the sampling matrix. It creates the input files. It does the computationally challenging task of running all the simulations in parallel. It creates the output. Then we run our artificial intelligence and statistical techniques to analyze that data on the back end. Doing that from beginning to end as a solid workflow on Wrangler is something that we’re very excited about.”
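The workflow New outlines maps naturally onto a few lines of Python. In this compressed sketch the parameter names and the simulation stub are hypothetical, but the steps mirror his description: define the parameters, build the sampling matrix, run the cases in parallel, and collect the output for analysis.

# A compressed sketch of the end-to-end parametric workflow described above.
from itertools import product
from multiprocessing import Pool

def run_simulation(case):
    """Stand-in for one building-energy simulation."""
    insulation, setpoint = case
    return {"case": case, "annual_kwh": 12_000 / insulation + 150 * setpoint}

if __name__ == "__main__":
    # 1. Define the parameters to vary and 2. create the sampling matrix.
    insulation_levels = [1.0, 1.5, 2.0]
    setpoints_c = [20, 22, 24]
    matrix = list(product(insulation_levels, setpoints_c))

    # 3. Run all simulations in parallel and 4. collect output for analysis.
    with Pool() as pool:
        results = pool.map(run_simulation, matrix)
    best = min(results, key=lambda r: r["annual_kwh"])
    print("lowest-energy case:", best)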
Gaffney explained that when he talks about Wrangler’s storage capacity, he’s actually referring to a 10-petabyte Lustre-based file system hosted at TACC and replicated at Indiana University. “We want to preserve data,” he said. “The system for Wrangler has been set up for making data a first-class citizen amongst what people do for research, allowing one to hold onto data and curate, share, and work with people with it. Those are the founding tenets of what we wanted to do with Wrangler.”
Mining Fossil Data for Human Origins
Another example of new user research enabled by Wrangler is PaleoCore, an NSF-funded open-source science initiative whose purpose is to develop data standards and digital infrastructure for the field of paleoanthropology. In addressing the common scientific challenge of merging and correlating results from independent research programs, the PaleoCore team hopes to take advantage of Wrangler’s speed with databases to let scientists mine geospatially aware data on all fossils related to human origins. The project would combine older digital collections in formats like Excel worksheets and SQL databases with newer modes of data gathering, such as real-time fossil GPS information collected with Apple iPhones or iPads.
“We’re looking at big opportunities in linked open data,” PaleoCore principal investigator and UT Austin associate anthropology professor Denne Reed said. Linked open data allows queries to derive meaning from relationships identified in seemingly disparate streams of data. “Wrangler is the type of platform that enables that,” Reed said. “It enables us to store large amounts of data, both in terms of photo imagery, satellite imagery, and related things that go along with geospatial data. Then also, it allows us to start looking at ways to effectively link those data with other data repositories in real time.”
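As a concrete, if simplified, illustration of linked open data: fossil records expressed as RDF triples with shared identifiers can be joined by query across repositories. The Python sketch below uses the rdflib library; the namespace, specimen ID, and coordinates are invented, not PaleoCore’s actual vocabulary.

# A small sketch of linked open data: publish fossil records as RDF
# triples, then query them with SPARQL. All identifiers are hypothetical.
from rdflib import Graph, Literal, Namespace

PC = Namespace("http://example.org/paleocore/")   # hypothetical vocabulary
g = Graph()

specimen = PC["specimen/ETH-001"]
g.add((specimen, PC.taxon, Literal("Australopithecus afarensis")))
g.add((specimen, PC.latitude, Literal(11.15)))
g.add((specimen, PC.longitude, Literal(40.58)))
g.add((specimen, PC.recordedWith, Literal("iPhone GPS")))

# A SPARQL query can join these triples with any other dataset that uses
# the same identifiers; that linking is what gives the data its meaning.
results = g.query("""
    PREFIX pc: <http://example.org/paleocore/>
    SELECT ?taxon ?lat ?lon WHERE {
        ?s pc:taxon ?taxon ; pc:latitude ?lat ; pc:longitude ?lon .
    }
""")
for taxon, lat, lon in results:
    print(taxon, lat, lon)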
The Hadoop and Apache Spark frameworks support Wrangler’s shared-memory data analytics, and the system connects to the Internet2 optical network, which delivers 100-gigabit-per-second data throughput to most academic institutions in the country.
“Hadoop is a big buzzword in all of data science at this point,” Gaffney said. “We have all of that, and are able to configure the system to be able to essentially be like the Google search engines are today in data centers. The big difference is that we are servicing a few people at a time, as opposed to Google.”
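In practice, an analysis on a Spark-equipped system like Wrangler might begin with a few lines of PySpark such as the sketch below; the input path and column name are hypothetical.

# A minimal PySpark sketch of the data-center-style analytics Gaffney
# alludes to: scan a large collection in parallel and aggregate.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrangler-sketch").getOrCreate()

# Read many files at once; Spark splits the scan across the cluster.
df = spark.read.json("/wrangler/data/observations/*.json")  # hypothetical path
counts = df.groupBy("instrument").count()                   # parallel aggregation
counts.show()

spark.stop()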
In addition, TACC has the tools and techniques needed to transfer data in parallel. “It’s sort of like being at the supermarket,” he said. “If there’s only one lane open, it is just as fast as one person checking you out. But if you go in and have 15 lanes open, you can spread that traffic across and get more people through in less time.”
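The checkout-lane analogy translates directly into code. In the Python sketch below, shutil.copy stands in for a real parallel transfer tool and the file names are invented; the point is simply one worker versus fifteen.

# The checkout-lane analogy in code: move the same set of files through
# one "lane" and then through fifteen. Paths are hypothetical.
import shutil
from concurrent.futures import ThreadPoolExecutor

files = [f"/staging/chunk_{i:03d}.dat" for i in range(16)]  # hypothetical

def transfer(path):
    shutil.copy(path, "/wrangler/project/incoming/")  # stand-in for a real tool
    return path

# One lane: files move through a single worker, one after another.
for f in files:
    transfer(f)

# Fifteen lanes: the same traffic spread across parallel workers.
with ThreadPoolExecutor(max_workers=15) as pool:
    list(pool.map(transfer, files))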
Wrangler is also more web-enabled than is typical in high-performance computing, with a web portal that lets users manage the system and provides access to web interfaces like VNC, RStudio, and Jupyter notebooks for more desktop-like user interactivity.
“We need these bigger systems for science,” Gaffney emphasized. “We need more kinds of systems. And we need more kinds of users. That’s where we’re pushing toward with these sort of portals. This is going to be the new face, I believe, for many of these systems that we’re moving forward with now. Much more web-driven, much more graphical, much less command line-driven.”
Biologists, astronomers, energy efficiency experts, and paleontologists are examples of the new supercomputer-user community Gaffney hopes Wrangler will attract.
“The NSF shares with TACC great pride in Wrangler’s continuing delivery of world-leading technical throughput performance as an operational resource available to the open science community in specific characteristics most responsive to advance data-focused research,” said Robert Chadduck, the program officer who is overseeing the NSF award.
“There are some great systems and great researchers out there who are doing groundbreaking and very important work on data, to change the way we live and to change the world,” Gaffney concluded. “Wrangler is pushing forth on the sharing of these results, so that everybody can see what’s going on.”