Clemson researchers and IT scientists team up to tackle Big Data

CLEMSON -- Researchers at Clemson University have recently announced an array of breakthroughs in agricultural and life sciences. The data sets behind these achievements are like a mountain next to the molehill of data that was available just a few years ago.

But as the amount of "Big Data" being generated and shared throughout the scientific community continues to grow exponentially, new issues have arisen. Where can all this data be stored and shared cost-effectively? How can it be transferred most efficiently across advanced data networks? How will researchers interact with the data and the global computing infrastructure?

A team of trailblazing scientists and information technologists at Clemson is working to answer these questions by studying ways to simplify collaboration and improve efficiency.

Dr. Alex Feltus, an associate professor in genetics and biochemistry at Clemson, discusses his research at the Palmetto Cluster, a supercomputer owned and operated by Clemson University. Credit: Jim Melvin / Clemson University

"I use genomic data sets to find gene interactions in various crop species," said Alex Feltus, an associate professor in genetics and biochemistry at Clemson. "My goal is to advance crop development cycles to make crops grow fast enough to meet demand in the face of new economic realities imposed by climate change. In the process of doing this, I've also become a Big Data scientist who has to transfer data across networks and process it very quickly using supercomputers like the Palmetto Cluster at Clemson. And I recently found myself -- especially in just the past couple of years -- bumping up against some pretty serious bottlenecks that have slowed down my ability to do my best possible work."

Big Data, defined as data sets too large and complex for traditional computers to handle, is being mined in new and innovative ways to computationally analyze patterns, trends and associations within the field of genomics and a wide range of other disciplines. But significant delays in Big Data transfer can cause scientists to give up on a project before they even start.

"There are many available technologies in place today that can solve the Big Data transfer problem," said Kuang-Ching "KC" Wang, associate professor in electrical and computer engineering and also networking chief technology officer at Clemson. "It's an exciting time for genomics researchers to vastly transform their workflows by leveraging advanced networking and computing technologies. But to get all these technologies working together in the right way requires complex engineering. And that's why we are encouraging genomics researchers to collaborate with their local IT resources, which include IT engineers and computer scientists. This kind of cross-discipline collaboration is reflecting the national research trends."

In their recently published paper titled "The Widening Gulf between Genomics Data Generation and Consumption: A Practical Guide to Big Data Transfer Technology," Feltus, Wang and six other co-authors at Clemson, the University of Utah and the National Center for Biotechnology Information discussed the careful planning and engineering required to move and manage Big Data at the speeds needed for high-throughput science. When properly implemented, sophisticated data transfer networks such as Internet2's Advanced Layer2 Service, combined with advanced applications and software, can improve transfer efficiency by orders of magnitude.

"Universities and other research organizations can spend a lot of money building supercomputers and really fast networks," Feltus said. "But with research computing systems, there's a gulf between the 'technology people' and the 'research people.' We're trying to bring these two groups of experts together and learn to speak a common dialect. The goal of our paper is to expose some of this information technology to the research scientists so that they can better see the big picture."

It won't be long before the information generated by high-throughput DNA sequencing is measured in exabytes. One exabyte equals one quintillion bytes, or one billion gigabytes. A byte is the unit computers use to represent a letter, number or symbol.

In simpler terms, that's a mountain of information so immense it makes Everest look like a molehill.

"The technology landscape is really changing now," Wang said. "New technologies are coming up so fast, even IT experts are struggling to keep up. So to make these new and ever-evolving resources available quickly to a wider range of different communities, IT staffs are more and more working directly with domain science researchers as opposed to remaining in the background waiting to be called upon when needed. Meanwhile, scientists are finding that the IT staffs that are the most open-minded and willing to brainstorm are becoming an invaluable part of the research process."

The National Science Foundation and other high-profile organizations have made Big Data a high priority, and they are encouraging scientists to explore the issues surrounding it in depth. In August 2014, Feltus, Wang and five colleagues received a $1.485 million NSF grant to advance research on next-generation data analysis and sharing. Also in August 2014, Feltus and Walt Ligon at Clemson received a $300,000 NSF grant with Louisiana State and Indiana universities to study collaborative research for computational science. And in September 2012, Wang and James Bottum of Clemson received a $991,000 NSF grant to roll out a high-speed, next-generation campus network to advance cyberinfrastructure.

"NSF is increasingly showing support for these kinds of research collaborations for many of the different problem domains," Wang said. "The sponsoring organizations are saying that we should really combine technology people and domain research people and that's what we're doing here at Clemson."

Feltus, for one, is sold on the concept. He says that working with participants in Wang's CC-NIE (Campus Cyberinfrastructure - Network Infrastructure and Engineering) grant has already uncovered a slew of new research opportunities.

"During my career, I've been studying a handful of organisms," Feltus said. "But because I now have much better access to the data, I'm finding ways to study a lot more of them. I see fantastic opportunities opening up before my eyes. When you are able to give scientists tools that they've never had before, it will inevitably lead to discoveries that will change the world in ways that were once unimaginable."

source: Clemson University