With the Research Data Alliance, Dr. Francine Berman is hoping to change the way data is collected, used, and shared to solve problems around the globe.
If anyone understands the deep penetration of data into all facets of modern life, it’s Francine Berman. Berman directs the Center for a Digital Society at Rensselaer Polytechnic Institute in Troy, New York, the nation’s oldest technological research university. From 2001 to 2009, she directed the San Diego Supercomputer Center, home to IBM’s DataStar, one of the most powerful supercomputers on the planet. During that tenure, Business Week dubbed her the “reigning teraflop queen.” (For the uninitiated: This is a compliment. A teraflop is a computing speed of 1 trillion calculations per second.)
Berman has investigated many “Xs” during her odyssey. From 2007 to 2010, she co-chaired the U.S.-U.K. Blue Ribbon Task Force for Sustainable Digital Preservation and Access, work that led the Library of Congress to name her a “Digital Preservation Pioneer.” In honoring Berman with the first-ever Ken Kennedy Award, the IEEE/ACM-CS cited her “influential leadership in the design, development and deployment of national-scale cyberinfrastructure.”
In March 2013, Berman began work on her most challenging project yet, combining her expertise in all aspects of data, from hardware to software and from cutting-edge technical advances to the thorny intricacies of social and political interactions. The project goes by an unassuming name, the Research Data Alliance (RDA). Its goal, however, is revolutionary: to change the way in which data is collected, used, and, most importantly, shared to solve specific problems around the globe.
RDA’s grand mission quickly captured the imagination of researchers. In just over a year, RDA (which Berman co-chairs) has grown from a core group of eight to more than 1,600 members from 70 countries and includes chemists, anthropologists, librarians, programmers, and materials scientists.
RDA is in the process of formalizing a relationship with the Group on Earth Observations (GEO), an international effort to build a Global Earth Observation System of Systems. GEO chief Barbara Ryan calls RDA’s work “essential” and believes that “a strong partnership between GEO and RDA is a key ingredient to both organizations accomplishing their respective missions.”
“It’s pretty clear that in the 21st century, data drives everything, from the health sciences to climate change,” says Berman. “But there’s only so far you can go in solving problems using your own data and your own team. Today, you need to reach across boundaries.”
And that can be difficult, because countless obstacles hinder the free flow of data. Countries restrict information, of course, citing national security issues. So do businesses that fear losing proprietary information to competitors. And even when no one is trying to block data, a host of factors still make sharing difficult or impossible.
“What RDA is doing,” explains Berman, “is working to foster an environment in which infrastructure isn’t a roadblock.”
In recent years, the phrase “Big Data” has been used as shorthand for this infrastructure roadblock. The buzzword is everywhere. A Google search returns several million hits for the term, which has been splashed across the covers of magazines from the Harvard Business Review to the journal Nature to the news magazine Der Spiegel (where it was the only English text on an otherwise German page).
But there’s a problem with this shorthand, says RDA secretary general Mark Parsons. “Calling it ‘Big Data’ misses the point,” he says. “It’s not the volume of data that’s unique. There’s always been more data than we know how to handle.”
Take, for example, British physicist Thomas Young’s prediction that because of the proliferation of published articles, “the sciences will shortly be overwhelmed by their own unwieldy bulk.” Similar critiques are now common across the Internet. Young, however, penned his warning in 1807.
Attempts to get a handle on data may not be new, but the nature of the dilemma is. Data cognoscenti like Parsons parse the issue using what they call the Three Vs of Data: volume, velocity, and variety.
Volume refers to the mountain of data produced by researchers – the one feature that can rightfully be called Big Data. Advances in technology, including the advent of supercomputers, are also increasing the speed (or velocity) at which data is collected. But it’s the final “V,” the variety of data, that most accurately defines this moment in history, says Parsons.
“We’re measuring tons of stuff, but it’s in ‘higgledy-piggledy’ formats,” he explains. “Earth science alone probably has a million ASCII formats. At the same time, we need more diverse information to address the grand challenge science questions.”
Leaps in scientific capabilities are fueling the variety of data collected. The 2009 launch of the Kepler space telescope, for example, has resulted in new kinds of information about planets outside of our solar system — a thousand worlds we didn’t even know existed before.
Back on our own planet, new and creative techniques are also driving the problem. Take the revolution in snow observation, the field Parsons knows best from his days as lead project manager at the National Snow and Ice Data Center in Boulder, Colorado. Researchers have long been limited to physical examination of snow, he explains. Satellites added a number of other possibilities, including analysis of microwave radiation.
“Now,” Parsons says, “we’re using hyperspectral techniques, scatterometry and altimetry,” all of which make for more accurate representations of the snowpack. But they also require researchers to integrate data from a variety of formats. And as research becomes ever more transnational, answering questions such as, “Are you at greater risk of developing asthma in Los Angeles or Mexico City?” depends on combining data from myriad sets. Taken together, these examples reflect the scale of the problem facing modern researchers.
Berman explains RDA’s solution with a simple metaphor: “We’re the bridge builders in the brave new world of data.”
But building bridges has its own form of complexity. An efficient data infrastructure requires the construction of all kinds of new structures. Some are built of code, others are constructed from standards of best practices, and still others from policies at all levels – from how to key in data at a single laboratory to designing massive Web portals for international bodies. In each instance, RDA stresses the impact on the ground over technical wizardry. There’s no room for “bridges to nowhere” in this brave new data world, no matter how dazzling their design.
“Everything we do has to be in service of solving a problem,” Berman says, and uses another analogy to explain RDA’s focus on practicality. “Say I’m in France and I want to read my email. I’m in my hotel room after a busy day and get my laptop out and go to plug it in. But the plug on my power cord doesn’t match the wall outlet and, darn it, I forgot my adapter. So what do I do? I can try to find a store that sells adapters. But maybe the hotel has a computer I can use. Or I can use a different device like my smartphone. The point is, I don’t really care about the plug, or the adapter, or my laptop. I just want to read my email.”
The emphasis on finding solutions to real-world problems is built into RDA through broad Interest Groups (IGs), which then create more tightly focused Working Groups (WGs). (See Table 1, below.) The IGs allow members to coalesce around areas of research, forming communities that include everything from “Digital Practices in History and Ethnography” to “Agricultural Data Interoperability.” The Working Groups produce actionable solutions (called “deliverables”) to well-defined problems within a strict 12-to-18-month time frame. RDA’s first deliverable, a “cookbook” for helping agricultural researchers share data about wheat, is on target to be unveiled in fall 2015.
This brings up a critical element behind the group’s success: its reliance on the power of human interactions. This may seem counterintuitive. After all, the public’s image of computing is of a virtual reality, an infinite stream of ones and zeros zipping through the ether at the speed of light, information untethered to a specific place – or even to individual people. Why, it seems fair to ask, can’t all this be done online?
Because it can’t, says Berman, flatly.
“Nothing takes the place of face-to-face encounters between researchers,” she explains. That’s because, ultimately, RDA isn’t just about data and computers. It’s also about the humans who use them.
Table 1. For more information about a specific group, click on any of the links above