NASA Earth Exchange (NEX): Managing Scientific Knowledge and Collaboration

NASA Earth Exchange (NEX) is a collaborative platform that combines state-of-the-art supercomputing, Earth system modeling, remote-sensing data from NASA and other agencies, and a scientific social network to provide an environment in which users can explore and analyze large Earth science data sets, run modeling and analysis codes, collaborate on new or existing projects, and share results within or among communities. A number of technologies are being tested to enhance scientific productivity within the NEX community.

Petr Votava, NASA Ames/University Corp. at Monterey Bay

Dr. Ramakrishna Nemani, NASA Ames Research Center, Moffett Field, California

Figure 1 - An annual (2010 climate year: December 2009 to November 2010) global Web-Enabled Landsat Data (WELD) top-of-atmosphere (TOA) true color reflectance product. WELD is one of the projects using NASA Earth Exchange (NEX) for its computational and data needs. Image Credit: David Roy and Valeriy Kovalskyy, South Dakota State University.

Background

Global change research is conducted in a highly collaborative manner by teams of researchers, including climate scientists, biologists, economists, social scientists and resource managers, distributed around the world. Their work is characterized by the use of community-developed models and analysis codes and by a need to access a broad range of large datasets held in geographically distributed data centers. Stovepipes and segmentation currently limit collaboration and often lead to duplication of effort and thus increased costs. Additionally, as the length and diversity of the global Earth observation data records grow, modeling and analysis of biospheric conditions increasingly require multiple terabytes of data from a variety of models and sensors. With network bandwidth beginning to flatten, transmission of these data from centralized data archives presents an increasing challenge, and the costs associated with local storage and management of data and compute resources are often significant for individual research and development efforts. Sharing community-valued intermediary data sets, results and codes from individual efforts with others who are not in a directly funded collaboration can also be a challenge with respect to time, cost and expertise.

As we move forward, we believe that we can be more effective and efficient, both scientifically and fiscally. Over the past several years, we have been developing the NASA Earth Exchange (NEX), a centralized data, supercomputing and knowledge platform that houses NASA satellite data, climate data and ancillary data, where a focused community may come together to share modeling and analysis codes, scientific results, knowledge and expertise. NEX tries to accomplish this by providing scientists with four key capabilities: 1) A Web-based collaborative environment as well as access to on-line real-time collaboration tools. 2) A data management environment providing discovery and access to key datasets. 3) A “sandbox” computing system where codes can be prototyped and evaluated on a platform that provides access to the NEX data repository, an integrated suite of data, analysis tools and utilities, as well as tools for workflow capture and management. 4) A supercomputing environment to allow large-scale model and data analysis runs and computational experiments.

NEX combines NASA’s Pleiades supercomputing system with over 1 petabyte (PB) of satellite, climate and model datasets, and enables users to share project details with the community through the NEX portal, a Web-based knowledge network. NEX capabilities also include a collaboration facility that enables distributed teams to work together and share with the rest of the community through seminars and other outreach activities. The Pleiades supercomputing architecture, combined with the massive data store and high-speed network, enables NEX to engage large scientific communities and give them the capability to execute modeling and data analysis on a grand scale, which was not previously achievable by most scientists.

Knowledge Management

As more projects are executed on NEX, there is an increased focus on capturing the knowledge base of NEX researchers and providing mechanisms for sharing it with the community in order to facilitate reuse and accelerate the scientific process. There are many possible knowledge contributions to NEX. These can include “Wiki” entries on the NEX portal contributed by a developer, information extracted automatically from a publication, or knowledge captured during the conduct of research on the supercomputing platform. The goal of the NEX knowledge management system is to capture and organize this information and make it easily accessible to the NEX community and beyond.

The knowledge acquisition process has three main facets: data and metadata, workflows and processes, and Web-based information. Once the knowledge is acquired, it is processed in a number of ways, ranging from custom metadata parsers to Natural Language Processing (NLP)-based entity extraction. The processed information is linked with existing taxonomies and aligned with an internal ontology (which heavily reuses a number of external ontologies). This forms a knowledge graph that can then be used to improve users’ search query results and to provide additional analytics capabilities to the NEX system. The graph is an important building block in creating a dynamic knowledge base for the NEX community, where knowledge is both generated and easily shared.

With the deployment of technologies for secure “virtual machines” using hardware virtualization, an opportunity exists to go even further and create complete modeling, analysis and compute environments that are customizable, archivable and transferable. Allowing users to instantiate such environments on large compute infrastructures that are directly connected to data archives may significantly reduce the costs and time associated with scientific efforts, relieving users of the need to redundantly retrieve and integrate data sets and build modeling and analysis codes. NEX is pursuing this development through OpenNEX, a partnership with Amazon Inc., as well as through the NEX OpenSandbox, which provides a private cloud environment collocated with the Pleiades supercomputing platform.
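The acquire, extract, link and search steps of the knowledge graph described above can be illustrated with a small Python sketch. It is not the NEX implementation: the taxonomy, the resource identifiers and the `KnowledgeGraph` class are hypothetical, and a simple dictionary lookup stands in for NLP-based entity extraction.

```python
from collections import defaultdict

# Hypothetical mini-taxonomy mapping surface terms to canonical concepts.
TAXONOMY = {
    "landsat": "Sensor:Landsat",
    "modis": "Sensor:MODIS",
    "ndvi": "Variable:NDVI",
    "reflectance": "Variable:Reflectance",
}


def extract_entities(text):
    """Dictionary lookup standing in for NLP-based entity extraction."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    return sorted(TAXONOMY[w] for w in words if w in TAXONOMY)


class KnowledgeGraph:
    """Bipartite graph linking taxonomy concepts to the resources that mention them."""

    def __init__(self):
        self.edges = defaultdict(set)  # concept -> set of resource ids

    def add_resource(self, resource_id, text):
        # Link the resource to every concept found in its text.
        for concept in extract_entities(text):
            self.edges[concept].add(resource_id)

    def search(self, query):
        # Map the query onto concepts, then follow edges to resources.
        hits = set()
        for concept in extract_entities(query):
            hits |= self.edges.get(concept, set())
        return sorted(hits)


kg = KnowledgeGraph()
kg.add_resource("wiki/weld", "WELD produces Landsat reflectance mosaics.")
kg.add_resource("workflow/ndvi-trend", "Computes NDVI trends from MODIS.")
kg.add_resource("paper/landsat-availability", "Global availability of Landsat observations.")

print(kg.search("Landsat"))
```

Because queries and resources are mapped onto the same concepts, a search for a sensor name returns Wiki pages, workflows and publications alike, regardless of which acquisition channel they came from.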

The OpenNEX platform makes a large collection of NASA’s climate and satellite data available to the research community, students and the public through Amazon Web Services. The focus of OpenNEX is to generate interest and participation from a large number of geoscientists, software engineers, students and the general public for the National Climate Assessment mandated by Congress.

Managing scientific collaborations with VisTrails

NEX uses the VisTrails workflow and provenance management infrastructure to support transparency, reusability and collaboration within a scientific process. VisTrails provides an easy-to-use visual interface for building scientific workflows, in which researchers can develop, explore and share workflows and workflow components. VisTrails is especially useful in support of scientific process development and experiments, because it manages not only the current workflow and its components, but also the entire development history, complete with notes and annotations. Researchers can then easily compare and execute multiple versions of the workflow and rediscover the reasons behind key decisions, debugging steps and edits made during process development. Finally, VisTrails delivers state-of-the-art visualization capability, which further simplifies analysis of large Earth science datasets.
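The idea of keeping the whole development history, rather than only the latest workflow, can be sketched as a tree of annotated changes. The Python below is an illustration only and does not use the VisTrails API: the `VersionTree` class, its methods and the parameter names are hypothetical.

```python
class VersionTree:
    """Stores every workflow version as a change relative to its parent."""

    def __init__(self):
        # Version 0 is the empty workflow at the root of the tree.
        self.versions = {0: {"parent": None, "change": {}, "note": "empty workflow"}}
        self._next_id = 1

    def commit(self, parent, change, note=""):
        """Record a new version; a value of None in `change` deletes that parameter."""
        vid = self._next_id
        self._next_id += 1
        self.versions[vid] = {"parent": parent, "change": dict(change), "note": note}
        return vid

    def materialize(self, vid):
        """Rebuild the workflow parameters by replaying changes from the root."""
        chain = []
        while vid is not None:
            chain.append(self.versions[vid]["change"])
            vid = self.versions[vid]["parent"]
        params = {}
        for change in reversed(chain):
            for key, value in change.items():
                if value is None:
                    params.pop(key, None)
                else:
                    params[key] = value
        return params

    def diff(self, a, b):
        """Return {parameter: (value_in_a, value_in_b)} for parameters that differ."""
        pa, pb = self.materialize(a), self.materialize(b)
        keys = sorted(set(pa) | set(pb))
        return {k: (pa.get(k), pb.get(k)) for k in keys if pa.get(k) != pb.get(k)}


tree = VersionTree()
v1 = tree.commit(0, {"sensor": "Landsat", "cloud_mask": "basic"}, "first working version")
v2 = tree.commit(v1, {"cloud_mask": "strict"}, "too many cloudy pixels in output")
v3 = tree.commit(v1, {"resolution_m": 30}, "branch: make resolution explicit")

print(tree.diff(v2, v3))
```

Since each version records only its change and a note, any earlier version can be rebuilt and rerun, and the note explains why a branch was taken.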

While VisTrails provides a good foundation for the NEX process and provenance infrastructure, there is still a learning curve that at times creates an obstacle to adoption. To improve users’ interaction with the workflow system and to minimize the learning curve, the NEX software platform provides a set of tools and utilities that enable automatic conversion of users’ processes into workflow components or entire workflows. These tools extract information from a running process, identify software components and convert them into a VisTrails workflow, allowing researchers to further explore the process in the VisTrails environment without having to manually convert their code. We are in the process of scaling VisTrails to the Pleiades supercomputing environment to make it applicable to large data and compute experiments. Finally, to improve collaboration, we are working with our partners to enhance the VisTrails recommendation engine, which assists users during workflow development by suggesting specific components based on both the users’ history and the reputation of their contributions in terms of workflows or workflow components.
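One way to picture a recommendation that weighs both history and reputation is the minimal Python sketch below. It is not the VisTrails recommendation engine; the `recommend` function, the component names and the reputation weights are all hypothetical, and the scoring rule (reputation-weighted counts of which component followed the current one) is an assumption chosen for simplicity.

```python
from collections import Counter


def recommend(current, corpus, reputation, top_n=3):
    """Suggest components that followed `current` in past workflows.

    corpus: list of (author, [component, ...]) pairs from workflow history.
    reputation: author -> weight; unknown authors default to 1.0.
    """
    scores = Counter()
    for author, steps in corpus:
        weight = reputation.get(author, 1.0)
        # Look at each consecutive pair of components in the workflow.
        for prev, nxt in zip(steps, steps[1:]):
            if prev == current:
                scores[nxt] += weight
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


corpus = [
    ("alice", ["read_landsat", "cloud_mask", "compute_ndvi"]),
    ("bob", ["read_landsat", "reproject", "compute_ndvi"]),
    ("carol", ["read_landsat", "cloud_mask", "mosaic"]),
]
reputation = {"alice": 2.0, "bob": 5.0, "carol": 1.0}

print(recommend("read_landsat", corpus, reputation))
```

Here a single workflow from a highly rated contributor can outrank a step that appears more often, which is the trade-off a reputation weight introduces.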

As the development of NEX continues, it strives to lower the barrier to entry for data- and compute-intensive science. We hope to offer the community a platform for continuous engagement, where members of the global geoscience communities can work together to address grand challenges in the Earth sciences.

Petr Votava is a senior software engineer at NASA Ames Research Center and University Corporation at Monterey Bay. His research interests include data mining/anomaly detection, knowledge management, semantic Web and software architecture. He is the technical lead for the NEX science platform.

Dr. Ramakrishna Nemani is a senior research scientist at NASA Ames Research Center. His research interests include ecological forecasting and collaborative computing in the Earth sciences.

