Exploring a New World of Social Data

Researchers need new tools to understand the massive amount of human-created data

This summer, a group of statisticians, mathematicians, and engineers will move into Gross Hall, one floor up from SSRI West. The group is collaborating to build a Big Data toolbox -- techniques to analyze enormous datasets such as the billions of daily posts on social media websites, the 600 million proton collisions per second at the Large Hadron Collider, or the three billion A, C, G, and T letters in the genomic sequence of a single person.

Humankind is creating so much data, at such an astonishing rate, and in such varied forms (numbers, words, images, video) that the standard tools for data analysis are no longer up to the task.

"It's exciting because there are people from a variety of backgrounds," said David Dunson, a professor in the Department of Statistical Sciences, who will move with the Big Data group. "I'm a statistician -- doing inferences, allowing for uncertainty -- but there are also excellent mathematicians, who are really good at characterizing low-dimensional structures and high-dimensional data, and electrical engineers who are really good at algorithms."

In high-dimensional data, each data point has many variables -- for example, a study might cover a group of patients (the data points) with billions of pieces of genomic information (the variables) about each of them.

"When you have millions and billions of observations on a given patient, and the number of patients you have is much smaller, you can't analyze that using traditional methods," Dunson said. Instead, the strategy is to create a "lower-dimensional structure" that will allow the meaning to shine through.
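To make the idea concrete, here is a minimal sketch of one common way to find a lower-dimensional structure when there are far more variables than patients. It uses principal component analysis through a singular value decomposition, which is a stand-in chosen for illustration and not the Bayesian approach Dunson's group uses. The data are simulated, and the names low_dim_summary, n_patients, n_variables, and k are hypothetical.

```python
import numpy as np


def low_dim_summary(X, k):
    """Summarize each row of X (one patient) with k numbers via PCA.

    Returns the k-dimensional scores and the share of total variance
    that those k directions capture.
    """
    Xc = X - X.mean(axis=0)                      # center every variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                    # patient coordinates in k dims
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()
    return scores, explained


# Simulated data (an assumption for illustration): 20 patients, 5,000
# variables each, secretly driven by only 3 hidden factors plus noise.
rng = np.random.default_rng(0)
n_patients, n_variables, k = 20, 5000, 3
latent = rng.normal(size=(n_patients, k))
loadings = rng.normal(size=(k, n_variables))
noise = 0.1 * rng.normal(size=(n_patients, n_variables))
X = latent @ loadings + noise

scores, explained = low_dim_summary(X, k)
print("summary shape:", scores.shape)
print("share of variance kept:", round(float(explained), 3))
```

In this toy setup, three numbers per patient retain nearly all of the variation in 5,000 variables, because the simulated data were built that way. Real genomic data are rarely so tidy, which is part of why new tools are needed.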

Dunson uses Bayesian statistics as a route to low-dimensional structure. Others, such as Rebecca Willett, a self-described electrical engineer with a mathematical bent, use geometrical tools to help describe complicated structures in simpler ways.

"Think about your sheet in a dryer," she said. "That sheet is just a flat, two-dimensional surface, but then it's crumpled up and floating around in a three-dimensional space. It's no longer flat, but still, if you were an ant, at any one point it looks flat."

This concept -- called a manifold -- can be applied to high-dimensional datasets to create something simpler that can be analyzed.
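The ant's-eye view of the sheet can be checked numerically. The sketch below is an illustration under stated assumptions, not a method from the researchers: it rolls a flat sheet into a spiral in three dimensions (a simpler shape than a crumpled sheet), then compares how "thick" the data look globally with how thick they look in a small neighborhood around one point. The helper flatness_ratio and all variable names are hypothetical.

```python
import numpy as np


def flatness_ratio(points):
    """Ratio of the third to the second singular value of centered 3-D points.

    A value near 0 means the points lie close to a flat 2-D plane.
    """
    centered = points - points.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return s[2] / s[1]


# Build the "sheet": two flat coordinates (t, h) rolled up into 3-D space.
rng = np.random.default_rng(0)
n_points = 5000
t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=n_points)   # position along the roll
h = rng.uniform(0.0, 10.0, size=n_points)                  # position across the sheet
sheet = np.column_stack([t * np.cos(t), h, t * np.sin(t)])

# The ant's neighborhood: the 30 points closest to one spot on the sheet.
anchor = sheet[0]
distances = np.linalg.norm(sheet - anchor, axis=1)
neighbors = sheet[np.argsort(distances)[:30]]

global_ratio = flatness_ratio(sheet)
local_ratio = flatness_ratio(neighbors)
print("whole sheet:", round(float(global_ratio), 3))
print("ant's view: ", round(float(local_ratio), 3))
```

The whole sheet fills all three dimensions, so its ratio is far from zero, while the small neighborhood is nearly flat and its ratio is close to zero. Manifold methods exploit that local flatness to recover the two underlying coordinates.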

Tools developed to analyze high-dimensional data in one setting, such as astronomy, can be applied to high-dimensional data in another setting, such as the social sciences.

"They are pretty different kinds of problems on the surface, but they all have similar types of challenges that show up," Willett said.

The new Data Analytics Center in Gross Hall will be co-led by Robert Calderbank, dean of the natural sciences in the Trinity College of Arts and Sciences, and Larry Carin, the William H. Younger professor of electrical and computer engineering. Others who will set up shop in the space include Ingrid Daubechies, James B. Duke professor of mathematics; Mauro Maggioni, professor of mathematics and computer science and electrical and computer engineering; and Guillermo Sapiro, professor of computer science and electrical and computer engineering.

"We all bring a different set of tools to bear on this big underlying problem, but right now we're scattered across campus," Willett said. "After the move to Gross Hall, we're hopeful we'll be able to accomplish a lot more as a team than we've been able to."

The design of the new space -- with plenty of common areas and white boards -- will encourage interaction, and a regular influx of new ideas and energy will be provided by visiting faculty, both from Duke and from the larger scholarly community.

Willett said the move will create new opportunities for students as well. "Right now our students don't interact that much, or when they do see each other, it's a seminar and there's no time to chat," she said.

In Gross Hall, she said, "They'll be together all the time and will be able to gather around a white board on the spur of the moment. My students will get to learn from the statisticians, and the statisticians will get to learn from the mathematicians."

Being just one floor up from SSRI West will provide the scientists of the Data Analytics Center plenty of what they crave: more data. Social scientists have traditionally gathered a lot of data through surveys, but now a whole new world of social data is available. Dunson said an emerging field, which many of his students are interested in, involves writing programs to "scrape" data off web sources, such as Twitter, Facebook, and Google.

Willett hopes SSRI will provide not only more data, but also new challenges. "It's exciting that we might be able to help out with social problems via these collaborations, and it would be great if we have to develop new tools to look at these problems from an entirely new perspective," she said. "Not only do I hope to be able to help SSRI, but I hope they'll be able to help me figure out new and exciting areas of math to study."