Duke Adds Storage to Manage ‘Big Data’

Additional storage helps researchers analyze and protect data

Sifaka lemur, the genome of which was recently made public. (photo: David Haring)

Duke evolutionary biologist Peter Larsen studies lemurs at the Duke Lemur Center, home to the largest collection of lemurs in the world outside of their natural habitat in Madagascar. For Larsen and the research team of Anne Yoder, director of the Duke Lemur Center, the center is a remarkable resource for lemur genome data.

A genome for one lemur contains about three billion base pairs of DNA. When compressed, that represents about 700 megabytes of data, which fits on a standard compact disc. Yet when researchers collect many sets of DNA, they end up with large volumes of data. Once analysis begins, even more data is generated, requiring significant volumes of storage.

“For evolutionary biology, the amount of DNA sequence data is growing exponentially,” Larsen said. “All of this is a wealth of information that needs to be stored and protected.”
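
The figure of roughly 700 megabytes per genome quoted above can be sanity-checked with a little arithmetic. The sketch below assumes the simplest possible encoding, in which each of the four DNA bases (A, C, G, T) is packed into 2 bits. That encoding and the function name are illustrative assumptions, not a description of the format the Lemur Center actually uses.

```python
# Back-of-the-envelope check of the per-genome size quoted in the article.
BASE_PAIRS = 3_000_000_000  # "about three billion" base pairs, per the article
BITS_PER_BASE = 2           # assumption: A, C, G, T each packed into 2 bits


def packed_genome_bytes(base_pairs: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed to store a sequence at a fixed number of bits per base."""
    total_bits = base_pairs * bits_per_base
    return (total_bits + 7) // 8  # round up to a whole byte


size_bytes = packed_genome_bytes(BASE_PAIRS)
print(f"{size_bytes / 1_000_000:.0f} MB")  # prints "750 MB"
```

Under these assumptions one genome comes to about 750 megabytes, in line with the article's "about 700 megabytes."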

Duke Research Computing, part of the Office of Information Technology (OIT), helps researchers maximize the storage, speed, and security of data, which can be massive. Duke Research Computing recently added 500 terabytes of data storage for Duke researchers like Larsen, bringing the total to two petabytes of storage dedicated to research uses – roughly the equivalent of all content at all U.S. academic research libraries.

“That was a huge breakthrough for us,” Larsen said. “We used up three terabytes in a matter of days and were ready to expand.”

Peter Larsen, Ph.D., research scientist in biology.

Without Duke Research Computing's systems and associated infrastructure support, researchers could be less competitive for grant funding. Many funders, such as the National Institutes of Health (NIH) and the National Science Foundation, require that data management plans be specified when a grant application is submitted. These data management plans often assume a volume of ongoing storage capability that can be difficult for individual researchers to guarantee.

“High performance computing has existed for some time, but now many more researchers are having to consider it,” said Victor Orlikowski of Duke Research Computing. “Folks who previously didn't realize a need for it now say, ‘I have all this data and I have a need for massive processing of data to find subtle trends to further my research.’ Things are more robust now that we have dedicated staff for this at Duke.”

In addition to storage, Larsen uses Duke Research Computing’s “compute cluster” – computers networked together for more power – to analyze genomic information, both to help identify pathogens and to better understand disease in lemurs, the world’s most critically endangered primates.

“One interesting aspect about lemur biology is that some species show signs of cognitive impairment and plaque formation in the brain as they age, symptoms similar to Alzheimer’s disease in humans,” Larsen said. Because lemurs are more closely related to humans than other model species are, the NIH might be interested in lemur biomedical research at the Duke Lemur Center that could help fight Alzheimer’s, heart disease, diabetes and other diseases in humans as well.

“Duke Research Computing has allowed us to form this cross-disciplinary, campus-wide network,” Larsen said. “The Yoder lab is becoming a leader in lemur genomics because we are able to use the compute cluster to work with researchers in cancer biology, molecular genetics and evolutionary biology, to create new research opportunities.”

Larsen has several manuscripts in development featuring his work on lemur genomics, which will have lasting implications for lemur studies and human medicine. See his work featured on the National Geographic channel.


Diagram of lemur antibody sequencing project as highlighted by National Geographic. (Photo: Peter Larsen)