In Search of Dark Data

Figure 1: EuroGEOSS Broker. Image Credit: S. Nativi.

It is very easy for the data painstakingly recorded by scientists to fade into the background once it has been used for a paper or study. The goal of EarthCube, a collaboration between the National Science Foundation and Earth scientists, is to bring that data back into the open and to make sure other teams can see, analyze and build on that work to obtain insights into the environment.

“Everything is so coupled in the Earth system,” said Siri Jodha Singh Khalsa of the National Snow and Ice Data Center (NSIDC) in Boulder, Colorado, and standards lead on the Global Earth Observation System of Systems (GEOSS).

“To get to the pressing problems, especially in the area of global change, in order to predict the trends and predict what is going to happen in the Earth system, you have to understand these dependencies. Science has to take a cross-disciplinary approach,” Khalsa said.

There is a huge amount of data already in existence. The problem has been making it readily accessible.

“A researcher may be funded to do fieldwork. They derive results and they produce an Excel spreadsheet with their measurements, whether it’s worms in the soil or ice crystals in a snow pack,” Khalsa said. “That data set often doesn’t get formally archived. It might be put on a website or included as supplementary information in a journal or sits on an FTP server. That dark data presents a challenge because, usually, the only route to it is through the investigator. But it is something that can be addressed and it’s something that is of great interest to EarthCube.”

Figure 2: EuroGEOSS Broker. Image Credit: S. Nativi.

Data delivery could be easier if the data were converted to common formats. For example, the WaterML format, which currently is limited to data on water resources, lets researchers grab data on groundwater levels, mineral concentrations and other readings in a standard way so that they do not have to work out how to convert the values for use in their own computer systems.

Many new Earth observation techniques rely on the ability to bring data in from different fields. But researchers in those other fields may use different data formats and coordinate systems, and even different terms for the same measurements. Even if scientists wanted to go back to old datasets locked away in Excel spreadsheets and bring them online, they are unlikely to have the time to convert them into a form that can easily be used by workers in other fields.

Information technology provides one answer: The EuroGEOSS project, funded by the European Commission, adopted a technique widely used in other large-scale software-based systems. This technique makes the computer convert data into more convenient forms on the fly instead of forcing scientists to adopt common languages and structures. The core of this system is a ‘broker’ – a computer program that understands the format used by each piece of data in its catalogue so that it can reformat the data automatically.

The EuroGEOSS team split the broker into three elements. The first component, the access broker, pulls information from the online resources in its catalog, converting the values into a format the user’s software understands.
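The access-broker idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EuroGEOSS implementation: the catalog entries, the dataset names and the `access` function are all hypothetical, and two simple formats (CSV and JSON) stand in for the many formats a real broker would handle.

```python
import csv
import io
import json

# Hypothetical catalog: each entry records the format its data is stored in.
CATALOG = {
    "soil-worms": {"format": "csv", "payload": "site,count\nA,12\nB,7\n"},
    "snow-pack": {"format": "json", "payload": '[{"site": "C", "count": 3}]'},
}


def read_csv(payload):
    """Parse CSV text into a list of row dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(payload))]


def read_json(payload):
    """Parse a JSON array of records into a list of row dictionaries."""
    return json.loads(payload)


def write_csv(rows):
    """Serialize row dictionaries as CSV text with a header line."""
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def write_json(rows):
    """Serialize row dictionaries as a JSON array."""
    return json.dumps(rows)


READERS = {"csv": read_csv, "json": read_json}
WRITERS = {"csv": write_csv, "json": write_json}


def access(dataset_id, target_format):
    """Fetch a dataset and convert it on the fly to the requested format."""
    entry = CATALOG[dataset_id]
    rows = READERS[entry["format"]](entry["payload"])
    return WRITERS[target_format](rows)


if __name__ == "__main__":
    # The user asks for CSV even though the source is stored as JSON.
    print(access("snow-pack", "csv"))
```

The point of the design is that the data provider keeps its own format; only the broker needs a reader for it, and adding a new source means registering one more reader rather than asking every user to change software.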

The discovery broker attempts to solve another problem that scientists face when assembling data from different disciplines – working out what data is available and whether it is relevant. The discovery broker makes it easier for them to find information from sources they might not have previously considered.

The key to the broker’s overall success is its support for semantic brokering, says Stefano Nativi of Italy’s National Research Council and a member of the infrastructure board at the Group on Earth Observations (GEO). Terminology across disciplines can change dramatically. This results in scientists overlooking sources of information if they do not perform searches using terms employed outside their own discipline. Semantic brokering automatically looks for synonyms during a search so that researchers interested in water levels within a region do not miss data simply because they asked for “precipitation” but neglected to type in “runoff” or “evapotranspiration.” The semantics broker contains a model that links these terms to “water” so that a search on that will bring up all the related research.
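A toy version of semantic brokering might look like the following Python sketch. It assumes a tiny hand-written concept model and an invented list of datasets; the names `CONCEPTS`, `DATASETS`, `expand_query` and `search` are hypothetical and do not come from the EuroGEOSS software.

```python
# Hypothetical concept model: a broad concept mapped to its related terms.
CONCEPTS = {
    "water": {"precipitation", "runoff", "evapotranspiration"},
}

# Hypothetical catalog entries, each tagged with its discipline's own keywords.
DATASETS = [
    {"title": "Regional rain gauge totals", "keywords": {"precipitation"}},
    {"title": "River basin runoff", "keywords": {"runoff"}},
    {"title": "Sea ice extent", "keywords": {"sea ice"}},
]


def expand_query(term):
    """Return the search term plus every term linked to it in the model."""
    term = term.lower()
    related = {term}
    for concept, members in CONCEPTS.items():
        if term == concept or term in members:
            related.add(concept)
            related.update(members)
    return related


def search(term):
    """Return titles of datasets whose keywords match any expanded term."""
    terms = expand_query(term)
    return [d["title"] for d in DATASETS if d["keywords"] & terms]


if __name__ == "__main__":
    # A search on "precipitation" also surfaces the runoff dataset.
    print(search("precipitation"))
```

Here a search on "water" or on any one of its linked terms returns both water-related datasets, while the unrelated sea ice dataset is left out. A production system would presumably draw on a much richer vocabulary than this single hand-coded concept.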

GEOSS now uses the EuroGEOSS brokering technology for its Discovery and Access Broker (DAB), which the GEOSS community quickly embraced. Just three months after the DAB's introduction in autumn 2011, the number of data resources in the Discovery Framework of the GEOSS Common Infrastructure that could be accessed using the DAB had expanded from a few hundred to more than 28 million. The number continues to rise.

“Within GEOSS, brokering has definitely been accepted and recognized as a real, viable approach,” Nativi said.

Figure 3: Web2.0 Components. Image Credit: Huerta and Diaz.

EuroGEOSS is far from being the only broker in use, although it performs a wider range of tasks, such as semantic mapping, than others do. Recent ‘hack-a-thons,’ used both to check the usability and usefulness of the technology and to encourage end users to try it, have employed three different broker implementations so far. In successive hack-a-thons, participants tested the Esri Geoportal, the Environmental Research Division’s Data Access Program (ERDDAP) and the EuroGEOSS broker.

The hack-a-thons have demonstrated that the broker technology need not just be used with static data from completed experiments. It can capture real-time information relayed live by electronic sensors, such as the data from sensors on buoys floating in the ocean.

“The hack-a-thons are useful for showing some of the strengths and weaknesses of the approach, and what sort of effort is involved to take it further,” Khalsa said.

A team based at the Universitat Jaume I Castellón in Spain has worked on incorporating live social-media data with the Web 2.0 Broker. According to Professor Joaquín Huerta Guijarro of the Universitat Jaume I Castellón, tweets, photos on Flickr and Facebook posts can provide real-time, rapidly updated input on fast-moving ecological events, such as forest fires, or other natural disasters.

A further extension is into computer simulation. One of the subjects discussed at the GEOSS Future Products workshop at the end of March 2013 was that of the model web, which pulls together the wide variety of computer simulations of geophysical processes now available online.

“We need to include models,” Nativi said. “To extract knowledge from data you need processing and modeling. But we have a similar situation to that of data. There are many different models and many different frameworks. There is the need to combine and interconnect the modeling frameworks with data. This is what the brokering approach could address.”

Figure 4: Web2.0 Forest Fires on a smartphone. Image Credit: Universitat Jaume I Castellón.

The question for the community is whether a single broker should be used worldwide – which raises the question of how it should be selected. The alternative is to have multiple brokers, each run by a different group, which can share information between them. But having multiple brokers raises issues of governance.

“With the brokering approach you need to decide how to bring together many different systems managed by diverse organizations,” Nativi said. “You don’t do that by following the classical federal approach where you decide with other people to sign a contract and adopt common standards.

“We are following a different path. We don’t want to bring these people into a common implementation. We want to integrate systems without forcing people into alignment. But we have a concern. How can we govern systems that are not completely bound within an overall system? We need a smart way to govern this system of systems.”

“Building a system of systems is a socio-technical exercise. It is a strong sociological effort on many different levels. As different groups develop their own brokers, there are going to be governance issues related to how the brokers are going to interact,” Khalsa said.

“There is the possibility of an overarching framework to link the different community brokers together and provide secure access if required.

“There has to be a service of the infrastructure itself that is not governed by certain communities but a structure to provide translation and mediation between users and communities.”

Nativi adds, “We have to consider this as a real international effort. Brokering is about bringing together diversity.

“If we don’t address diversity it will be a real pity.”

Chris Edwards is a technology reporter with more than 20 years’ experience of journalism as an editor and writer.
