Lance McKee, Senior Staff Writer
This essay is a follow-on to my essay in the August 2010 issue of Earthzine, “18 Reasons for Open Publication of Geoscience Data.”1 The premise of both is that science can be made more transparent and true to its principles through better use of Information Technology (IT) and a global infrastructure of technical standards that make it easy to publish, discover, assess and access data. This essay argues that in the geosciences, the necessary institutional commitment and technical standards are largely in place, but the standards’ availability and usefulness are not yet well known. As science fiction author William Gibson observed, “The future is here. It’s just not widely distributed yet.”
Evidence of Institutional Commitment
In the 24 August 2010 issue of EOS2, in “Data Citation and Peer Review,” authors Parsons, Duerr and Minster argue that, “The scientific method and the credibility of science rely on full transparency and explicit references to both methods and data.”
Looking back a year, we see that the Geological Society of America (GSA), in its Open Data Access Position Statement3, adopted May 2005 and revised May 2009, “Strongly supports open access to scientific data by all purveyors of such data to promote advancement in research, support education, and improve the economic progress, health, and welfare of society.”
Eight years ago, the US National Academies’ 2002 report “Geoscience Data and Collections: National Resources in Peril”4 referenced the US National Science Foundation (NSF) Division of Earth Sciences (EAR) “Guidelines for Geoscience Data and Collections Preservation and Distribution,” whose “overall purpose and fundamental objective … is to ensure and facilitate full and open access to quality data for research and education in the Earth Sciences. These guidelines are considered to be a binding condition on all EAR-supported projects.”
We see from these examples that there has been an ongoing call for and official commitment to open publication of geoscience data. Progress is evident in programs like the Global Earth Observation System of Systems (GEOSS) and OneGeology, through which national government agencies are beginning to share their data more openly. The EOS article cited above noted other efforts. Progress is also evident in the number of papers on this general subject that were presented at the recent IGARSS 2010 conference.5
Obstacles to Success
Despite all of this, however, data created for most geoscience studies are unavailable, and most of the data that are available are difficult to find and use. In science, as in other domains such as government, geospatial data are hard to discover and access for a number of reasons.6 Scientists and governments are still creating data in idiosyncratic and often complex data formats. Data have been and continue to be created in software-specific files, and there is no guarantee that proprietary databases and database models will be maintained. Metadata, when provided, may be in non-standard schemas and may neglect important elements such as data dictionaries. Some obstacles are new: in the emerging world of service-oriented IT architectures, web-service derived results are ephemeral and typically lack any record of provenance, including service history. (Open standards for tracking geospatial data provenance in a Web services environment do not yet exist.)
To repeat from Parsons, Duerr and Minster: “The scientific method and the credibility of science rely on full transparency and explicit references to both methods and data.” “Climategate,” as well as the simple fact that most geoscience data are not available, suggest that, frustrated by the difficulties summarized above, scientists and the institutions of science have failed to provide the transparency that good science and credibility require.
Today’s Technical Standards Overcome Interoperability Obstacles
The concept of “open science” involves scientists and researchers publishing, discovering, assessing and accessing not only research reports, but also the data and computation on which research findings are based. Current technology has the capacity to meet these functional requirements, but only when the technologies implement existing open software interface and data encoding standards that allow the technologies to interoperate within a worldwide system.
Free and open standards, like TCP/IP and HTTP, encourage innovation and rapid acceptance, resulting in expanded networks of communication and sharing. Users and providers of geospatial technologies and data have been cooperating since 1994 in the Open Geospatial Consortium (OGC)7 to develop free and open standards that enable communication between different geoprocessing systems from different vendors and of different types: GIS, Earth imaging systems, navigation systems, location services, sensor webs, databases, etc. Requirements have come from a wide range of stakeholders, resulting in a framework of open standards that enable, among other things, Web-based applications for publishing, discovering, assessing and accessing geoscientific data and computational resources.
One example is the OGC Catalog Services – Web (CSW) Interface Standard. This standard specifies service interfaces that enable developers to write applications for publishing and discovering geoscience data and services and associated metadata. The CSW standard is designed to work with ISO standard metadata8 as well as other metadata structures or standards.
Implementations of the CSW standard make possible fine-grained searches of many kinds. For example, a wildlife biologist studying ducks in Canada might publish data that happen to include water temperature readings at certain locations. Years later (if the metadata included basic information about the temperature readings), a hydrologist searching for historical surface water temperature data in that region could easily discover this data, along with information about when and how the data were collected. Metadata tools, some free and open source9, are already available that streamline the creation of such metadata, and open source software code is available that streamlines implementation of CSW by software developers.
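To make this concrete, here is a minimal sketch of how the hydrologist’s search might be expressed as a CSW 2.0.2 GetRecords request using key-value-pair encoding. The catalogue endpoint and the search phrase are hypothetical; any CSW-compliant catalogue would accept a request of this shape.

```python
from urllib.parse import urlencode

# Hypothetical catalogue endpoint, for illustration only.
CSW_ENDPOINT = "https://catalog.example.org/csw"

# Key-value-pair parameters for a CSW 2.0.2 GetRecords request that
# searches all metadata text for the phrase "water temperature".
params = {
    "service": "CSW",
    "version": "2.0.2",
    "request": "GetRecords",
    "typeNames": "csw:Record",
    "resultType": "results",
    "elementSetName": "summary",
    "constraintLanguage": "CQL_TEXT",
    "constraint_language_version": "1.1.0",
    "constraint": "AnyText LIKE '%water temperature%'",
}

url = CSW_ENDPOINT + "?" + urlencode(params)
print(url)  # an HTTP GET to this URL would return matching metadata records
```

The point of the key-value-pair form is that discovery requires nothing more exotic than an HTTP GET; the catalogue responds with summary metadata records that tell the researcher what data exist, and where.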
It is important for scientists to begin thinking in terms of Web services rather than file-based computing. Google Maps, for example, is a service offered over the Internet, enabled by the Web. A query returns useful information and little is required of the user in terms of expertise or hardware and software. Wolfram Alpha10 is perhaps a better example, because it is a Web service that provides sophisticated analytical capabilities that operate on many different kinds of data available from government agencies and other sources. The point is that both distributed data and diverse software services can be “in the cloud,” and this rapidly advancing paradigm promises to revolutionize the geosciences.
OGC Web Services standards specify the open interfaces and encodings necessary for building open Web services that provide access to virtually any kind of vector or raster data as well as processing functions that use that data. OGC Sensor Web Enablement standards11 enable developers to make any Web-accessible sensor and/or sensor data repository discoverable, accessible and useable via the Web. This includes Earth observation sensors. Many, but not all, of the standards necessary for chaining of Web services, as in climate models, for example, are available. Others are in development.
Some geoscience communities, notably those involved in hydrology12 and in meteorology and ocean observation13, have begun working in OGC Technical Committee working groups to facilitate their data sharing efforts based on these standards. Typically, this IT standards activity builds on prior data coordination efforts. Other OGC working groups14 focus on topics such as data preservation, geospatial rights management, data quality, geosemantics and workflow, all of which have significance for open science.
Existence of Standards: Necessary But Not Sufficient
This brief discussion of standards leaves many important questions unanswered, such as: What is to be done with currently available data and services on the Web that do not implement standards? How will researchers’ data dictionaries be coordinated for cross-disciplinary studies? How much metadata expertise will be required of scientists, and will each data producer produce their own metadata?
A key question is “Who will pay for this?” David Hastings, creator of the Human Security Index, in a comment on my August Earthzine article, noted, “Geoscience Australia, formerly using the long-established restrictive Crown Copyright, now protects its intellectual property via the 21st century approach of Creative Commons licensing.” Will Geoscience Australia’s embrace of Creative Commons licensing become the norm or remain the exception? Standards from the OGC and other standards organizations provide much of the infrastructure for a market in proprietary scientific information. The market is important, because curation is essential but not free, and governments will almost certainly not pay all the costs. Someone has to review data and edit and index the literature; maintain the information in readable form, and keep it online as platforms change; and promulgate the use of specific standards. Technical standards will be necessary in Web-mediated management of privacy, liability and intellectual property, as well as professional attribution, a main currency of science.
Such questions require institutional responses. Technological change induces institutional change, but can the pace of institutional change keep up with the pace of technological change? Search companies and social networking companies, not geoscience institutions, are the main innovators in “data science”15, which focuses on turning massive datasets and data streams into information products. The institutions of science will need to imagine challenging scenarios. For example, what will be the result of millions, and soon billions, of sensor-packed cell phones, automobiles and buildings streaming location-specific environmental data into public repositories? What if these streams of data and associated, increasingly capable and publicly available cloud services result in a surge of citizen science? How will data integrity be addressed in this scenario?
We can expect funding institutions, publishers, scientific associations, universities and scientists themselves to develop new policies, behaviors, business models, funding propositions, and long-term data curation solutions. This will happen partly in response to new capabilities enabled by new technologies and technical standards, and partly in response to social, economic and political factors.
We know that none of this “just happens.” Each step depends on people making decisions and taking actions. The third article in this series will consider some of the risks and opportunities that will figure in such decisions.
2. EOS, Transactions, American Geophysical Union. http://www.agu.org/pubs/crossref/2010/2010EO340001.shtml
5. Siri Jodha Khalsa and George Percivall, “Geoscience Depends on Geospatial Information Standards,” Geoscience and Remote Sensing Newsletter, December 2010. http://www.grss-ieee.org/wp-content/uploads/2010/03/12.10.pdf
6. Steven P. Morris, “Preserving Geospatial Data: Challenges and Opportunities,” Proceedings of the Indo-US Workshop on International Trends in Digital Preservation, March 24-25, 2009.
15. Mike Loukides, “What is data science?” O’Reilly Radar.
Lance McKee was on the startup team of the OGC in 1994 and currently serves as Senior Staff Writer. Over the years he has served on local not-for-profits (in Worcester, Massachusetts) and written to promote awareness of issues involving climate, energy and watershed awareness. His interests include the evolving use of information technology in science.