The data center community must work to allow researchers more time to spend on analyzing results and less time coding and worrying about file formats and data transfers.
Glenn K. Rutledge
NOAA National Climatic Data Center (NCDC)
Daniel J. Crichton
NASA Jet Propulsion Laboratory (JPL)
NOAA National Centers for Environmental Prediction (NCEP)
In the era of petascale climate General Circulation Models (GCMs), multi-model ensembles, and Numerical Weather Prediction (NWP) data archives, the need for on-demand computational resources and user-driven services has never been more apparent. Society has long recognized the value of cross-discipline science studies that link data across massive databases to provide products for the attainment of knowledge, and not only access to the raw data.
Can users be required to download petabytes of data? What should be the role of the archive in the 21st century, given these new and growing massive databases? How can NWP and GCM data providers and associated data centers and archives begin to provide not only distributed access, but also distributed web services capabilities from a semantic-web standpoint?
Using historical data archives, the National Centers for Environmental Prediction (NCEP) have identified improvements to NWP models that in turn better protect life and property through improved forecasting skill. Delivering such improvements therefore becomes a technology and security implementation problem and not initially a scientific one. The data center community must work to allow researchers more time to spend on analyzing results and less time coding and worrying about file formats and data transfers. We identify some of the existing limitations of traditional archives, discuss examples of model data diagnostics, and explore the many benefits of providing archive-based computational resources on petascale databases from a still-emerging web services-based viewpoint.
A Distributed Data Access Philosophy: NOMADS
A major transition in our ability to evaluate transient GCM simulations and NWP models has been occurring over the last decade. Real-time and retrospective numerical weather prediction analysis, model runs, climate simulations and assessments are proliferating from a handful of national centers to dozens of groups across the world. The rapid growth of model development and higher spatio-temporal resolution, differing model grids, incorporation of historical data and the emergence of new web-based distributed data services and the “cloud” act to complicate data access.
Therefore, the coordination of climate data infrastructure and management must extend beyond traditional organizational boundaries. In 2001, the original NOAA Operational Model Archive and Distribution System (NOMADS; Rutledge, 2001) partnership initiated the formation of an international collaboration for model data access. That collaboration allows for inter-comparisons and currently supports the Intergovernmental Panel on Climate Change (IPCC) Coupled Model Intercomparison Project (CMIP). It is called the Global Organization for Earth System Science Portals (GO-ESSP; Rutledge, 2011) and continues today. It is clear that it is no longer sufficient for any one national center to develop its data archive services alone. Semantic multidisciplinary access (Davies et al., 2003) and analysis are now possible, yet remain elusive in the operational community. Several efforts across NOAA and NASA (Raskin, 2009) continue to show the value of linking data from a semantic standpoint to provide coordinated web services. NOMADS promotes interoperability through the use of standards and format-neutral protocols.
Distributed data access technologies are more than a decade old. Yet despite operational distributed data access using open systems standards, such as NOMADS, and grid computing technologies, such as the Earth System Grid Federation (Williams, 2009), access systems still need to improve in providing users with fast and interoperable data throughput (2010, NWS-NCDC Climate Forecast Systems Users Workshop, Silver Spring MD). For many atmospheric state calculations and strictly non-research weather and climate users, a core set of traditional pre-generated and derived products may satisfy many such high volume requests. There will always be a smaller set of users who require all the data to ensure the integrity, provenance, and reliability of investigators’ datasets under the scientific method.
The need for climate change information has never been greater. City planners, business leaders, and researchers demand such information for climate adaptation purposes. At the same time, the growth in the number of data users and the exponential increase in the volume of data being produced are limiting access to such information.
Today, NOMADS provides access to approximately 650 terabytes of model data arranged in more than 50 million files on tape. Approximately 1.2 million of these files are arranged on disk for immediate access, and a subset of those are pre-aggregated by variable into time series to promote easier access for users. Some data need to be pre-staged from tape prior to access, which is a very slow and tedious process. During 2013, NCDC NOMADS serviced more than 850 terabytes from 116 million requests by 130,000 separate users (an example of data downloaded from NOMADS is provided in Figure 1).
Throughput depends on many factors, such as the users’ own Internet connection speeds and the need to service multiple concurrent users initiating multiple concurrent download sessions. On average, we find that a typical user experiences a throughput of approximately 400 GB per day. Given this throughput, a complete reanalysis product (e.g., Saha et al.) would require a user approximately 154 days to download. While NOMADS is one of NOAA’s most accessed data access sites, petascale archives now warrant a new approach of on-site computational analysis and, for some, Web services rather than data downloading.
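The download estimate can be reproduced with simple arithmetic. The sketch below assumes a sustained rate of 400 GB per day and decimal units (1 PB = 1,000,000 GB). The product size of about 61.6 TB is not stated in the text; it is only what the two quoted figures (154 days at 400 GB per day) imply. The function name is illustrative.

```python
def days_to_download(size_gb: float, rate_gb_per_day: float) -> float:
    """Return the number of days needed to move size_gb at a sustained daily rate."""
    if rate_gb_per_day <= 0:
        raise ValueError("rate_gb_per_day must be positive")
    return size_gb / rate_gb_per_day


RATE_GB_PER_DAY = 400.0  # typical user throughput quoted in the text

# Size implied by the quoted figures (an inference, not a published number).
implied_reanalysis_gb = 154 * RATE_GB_PER_DAY

if __name__ == "__main__":
    print(implied_reanalysis_gb)                               # 61600.0 GB, about 61.6 TB
    print(days_to_download(implied_reanalysis_gb, RATE_GB_PER_DAY))  # 154.0 days
    # One petabyte at the same rate, assuming 1 PB = 1,000,000 GB.
    print(days_to_download(1_000_000, RATE_GB_PER_DAY))        # 2500.0 days
```

At the same rate, a single petabyte would take 2,500 days, or close to seven years, which is the practical argument for moving computation to the archive.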
Climate Web Services as a Form of Remote Computational Analysis
While on-demand computational resources for users have been available for several decades under the name of Grid Computing (Foster and Kesselman, 2004), and secure cloud computing is gaining popularity, on-demand web services for climate and weather products have been slow to emerge. IT security concerns create barriers to user access to CPU resources across institutions. Traditionally, archive centers focus on the role of an archive, and resources for such high-volume computations are limited and seemingly outside the scope of their primary mission. However, data are becoming increasingly unusable because of their volume and the need to download them to local hosts, where resources for storing and using petascale datasets are even more limited.
Common Web-based products and tools will eventually become a cost-effective distributed approach to providing users with the information they require. We envision that such services today could provide support for assessments, downscaling, diagnostics, multi-modeling, and Model-to-obs-Intercomparison Process Studies (MIPS). Some example scripts and capabilities that would be placed in front of the NCDC NOMADS for archive, climate, and weather model users are listed in Table 1.
| Service | Example capabilities |
|---|---|
| Re-gridding of Gridded Fields | Preservation of monthly means when translating from high resolutions to low resolutions and back |
| Format Conversions, Transformations, Data Assimilation, and Geo-Rectifications | Extraction of pseudo-station locations (for example, an Atmospheric Radiation Measurement site from gridded model data; users would select from a menu of analysis techniques) |
| Provide first look diagnostic capabilities | Display of basic thumbnail plots, with brief descriptions of the available data, from Python source code; plots using Open Geospatial Consortium (OGC) Web Map Service (WMS) plotting capabilities directly callable from Unidata’s THREDDS Data Server; average annual cycles, diurnal cycles, annual averages, examination of anomalies, measures of extreme events, climate sensitivity, decadal trends, ratio of variances, Empirical Orthogonal Functions (EOFs), and, where applicable, comparisons to observations |
| Examination of tropical variability | Calculation of ENSO indices; simulation of satellite measures from both models and observations |
| On-line access to Climate Model Analytical Engines | Climate model Data Analysis Tool (CDAT; Williams, 2007) for advanced model diagnostics and model-to-obs inter-comparisons. CDAT can produce thumbnail plots to accurately geo-locate non-discrete points to grids. Max Planck’s Climate Data Operators (CDO; Schulzweida et al., 2010), a user-installed client or a host server front-end capability to manipulate and analyze climate and NWP models. CDO is a collection of command line operators developed for processing and analysis of data produced by a variety of climate and numerical weather prediction models (e.g., for file operations, simple statistics, arithmetic, interpolation, or the calculation of climate indices). Supported file formats are the frequently used model output formats, such as GRIB, netCDF, and several binary formats. With installations in several hundred groups worldwide, the package is widely established in the climate modelling community. The main CDO features are: more than 400 operators available; modular design, easily extendable with new operators; very simple UNIX command line interface; a dataset can be processed by several operators without storing the interim results; most operators handle datasets with missing values; fast processing of large datasets; support of many different grid types; tested on many UNIX/Linux systems, Cygwin, and MacOS-X |
Table 1. Potential web services to support NCDC users.
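One of the services in Table 1, extraction of a pseudo-station from gridded model data, can be illustrated with a short sketch. It assumes a regular latitude/longitude grid with ascending axes and uses bilinear interpolation, which is only one of the analysis techniques a user might select from the envisioned menu. The function names, grid, and coordinates are hypothetical and do not describe an existing NOMADS interface.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def _bracket(axis: Sequence[float], value: float) -> Tuple[int, float]:
    """Return (i, w): index of the lower grid neighbour and the weight towards i + 1."""
    if len(axis) < 2 or not axis[0] <= value <= axis[-1]:
        raise ValueError("value lies outside the grid axis")
    # Clamp so that a value on the upper edge still has a neighbour at i + 1.
    i = min(bisect_right(axis, value) - 1, len(axis) - 2)
    w = (value - axis[i]) / (axis[i + 1] - axis[i])
    return i, w


def pseudo_station(lats: Sequence[float], lons: Sequence[float],
                   field: Sequence[Sequence[float]],
                   lat: float, lon: float) -> float:
    """Bilinearly interpolate field[lat_index][lon_index] to the point (lat, lon)."""
    j, wy = _bracket(lats, lat)
    i, wx = _bracket(lons, lon)
    south = (1 - wx) * field[j][i] + wx * field[j][i + 1]
    north = (1 - wx) * field[j + 1][i] + wx * field[j + 1][i + 1]
    return (1 - wy) * south + wy * north


if __name__ == "__main__":
    # Illustrative 2 x 2 temperature grid in kelvin (made-up values).
    lats = [30.0, 40.0]
    lons = [260.0, 270.0]
    t2m = [[280.0, 282.0],
           [284.0, 286.0]]
    print(pseudo_station(lats, lons, t2m, 35.0, 265.0))  # 283.0
```

Run on the archive side, such a service would return one time series per requested site instead of the full gridded fields.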
There are many client and host-side applications that are appropriate to be used as analytical engines, such as GrADS (Doty et al., 2001), the Live Access Server with its accompanying “Ferret” (Hankin et al., 1996), or commercial packages such as IDL or MATLAB.
NWP Model Improvements using the Data Center Archives
To improve NWP models, numerical modelers create controlled experiments where they compare forecast experiments against control forecasts, or against initial conditions (the analysis), and the “fit” to observations. Thus, the historical forecasts, the model analysis, and the observations are the basis for these comparisons and potential improvements. This paradigm is true for increasing forecast model accuracy through higher resolution, and improved representation of the atmospheric, oceanic and land physical processes. Metrics to verify results also are well-established.
Products from NWP Model Archives
Model output products rely on archives to produce statistics that correct model bias for climate and weather purposes. Products that are generated can be verified against archive data to sharpen downscaling results. The aggregation of archived datasets allows users to enter the dataset matrix across all the relevant dimensions, including forecast time and cycle/date times. This has been successfully done for ensemble and other datasets. In the ensemble case, the dimensions consist of the ensemble component, the forecast time of each realization run, vertical levels for 3-dimensional fields, and latitude and longitude for each variable. Server-side processing is the best way to handle these large datasets, because using memory local to the archive for unpacking and organizing chores is the most efficient way to concatenate data from often widely disparate storage locations. Users can write their own workstation programs or scripts, in any computer language, that embed queries to the archive’s OPeNDAP (Gallagher, 1995) service in non-interactive web download commands. There are high-level freeware software packages, such as GrADS, that communicate directly with our archive server to display data on a user’s workstation, with the software composing the queries to the server as needed. Commercial software packages such as MATLAB and IDL also compose OPeNDAP queries, so any MATLAB/IDL command can be issued on the archive data and the commercial software will compose and send the required queries to the archive.
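As a minimal sketch of what composing such a query can look like, the function below builds an OPeNDAP (DAP2) constraint-expression URL that requests an index-range subset of one variable. The `[start:stride:stop]` hyperslab syntax and the `.ascii` response suffix are part of DAP2; the server address, dataset path, variable name `tmpprs`, and index ranges are hypothetical, and the dimension order simply follows the ensemble case described above.

```python
from typing import Iterable, Tuple


def hyperslab(start: int, stop: int, stride: int = 1) -> str:
    """Format one DAP2 index range; start and stop are inclusive, zero-based indices."""
    if start < 0 or stop < start or stride < 1:
        raise ValueError("need 0 <= start <= stop and stride >= 1")
    return f"[{start}:{stride}:{stop}]"


def opendap_url(dataset_url: str, variable: str,
                ranges: Iterable[Tuple[int, int]],
                response: str = "ascii") -> str:
    """Compose a DAP2 request for a subset of a single variable.

    ranges holds one (start, stop) pair per dimension, in the dataset's own order.
    The variable name is assumed to be a plain identifier that needs no URL escaping.
    """
    constraint = variable + "".join(hyperslab(a, b) for a, b in ranges)
    return f"{dataset_url}.{response}?{constraint}"


if __name__ == "__main__":
    # Hypothetical ensemble dataset: member, forecast time, level, latitude, longitude.
    url = opendap_url(
        "https://example.org/dods/ensemble_run",  # placeholder, not a real NOMADS URL
        "tmpprs",
        [(0, 4), (0, 7), (0, 0), (100, 120), (200, 240)],
    )
    print(url)
```

Only the requested subset crosses the network; the unpacking and aggregation of the underlying files stays on the server. The resulting URL could be passed to any non-interactive web download command.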
A Proposed Systems Architecture
Scaling the analysis of massive, highly distributed data requires considering the topology of the system architecture. In distributed systems, the underlying topology of the system is critical in the “big data” era to ensure that appropriate decisions are made regarding the amount of data moved over the network, the computing resources given, the types of methodologies and algorithms applied to reduce and analyze the data, and where computing is executed.
Traditional data systems have largely implemented topologies where users retrieve and download data from remote archives for analysis on their local systems or through a high-performance computing service. Such systems have assumed that computation is the limiting factor, and have focused on ensuring access to adequate computing environments that can scale the computation. Such topologies have assumed that the movement of data over the network is not the limiting factor. In cases where data were too large, data on physical media have been shipped between systems and users.
As data increases, there is an increasing need to reconsider architectural topologies that support data-intensive computing (NRC Committee on Massive Data, 2013). This requires understanding how the architectural topology will scale as the data size increases. In particular, computing and data reduction should occur as close to the data as possible in order to increase the efficiency of moving data over the network. Achieving this requires understanding how data analysis and data reduction methods can be implemented as services in a distributed system, rather than assuming that such methods will be executed by a user or client program.
Separating the layers of the architecture is a critical principle that should guide modern large-scale data systems. It requires decoupling the computing, storage, services and visualization of a system to allow enough agility in the implementation to support various distributed approaches. Many systems have been built incrementally. Such systems can be difficult to evolve if they have tightly coupled the computing, storage, services and visualization. Redistributing components and services in order to scale in a distributed environment often requires developing a cyberinfrastructure that employs such principles from the outset. How such systems are specifically implemented depends on various parameters of the system, and understanding their effect on scalability requires understanding the bottlenecks for a given set of data analysis use cases. As a result, architectures should allow for Web-based services to be integrated into systems at various points in the architecture in order to reduce the amount of data that is moved and increase the efficiency of computation. Furthermore, implementing data analysis methods that reduce the size of the data as close as possible to where the data is stored becomes critical as the data size increases.
Current technology trends have centered around the use of cloud computing and technologies such as Apache Hadoop as “big data” solutions. Much of this, however, assumes specific use cases and algorithms that are optimized for these types of environments and technologies. One of the major benefits of the cloud is providing storage and computational services that can be collectively managed either privately by an organization or procured through commercial vendors. As mentioned, architecting systems that separate these layers is important because it gives systems more agility in taking advantage of services such as cloud computing.
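The layering principle can be sketched in a few lines. In the illustration below, a storage layer is hidden behind a small interface, and a separate service layer runs a named reduction next to the data and returns only the result. All class, method, and variable names are hypothetical, the in-memory store stands in for tape, disk, or cloud storage, and the sketch is not a description of how NOMADS or ESGF is implemented.

```python
from statistics import fmean
from typing import Callable, Dict, List, Protocol


class Storage(Protocol):
    """Storage layer: anything that can return the values of a named variable."""
    def read(self, variable: str) -> List[float]: ...


class InMemoryStorage:
    """Stand-in backend; a tape, disk, or cloud backend would expose the same method."""
    def __init__(self, data: Dict[str, List[float]]) -> None:
        self._data = data

    def read(self, variable: str) -> List[float]:
        return self._data[variable]


class ReductionService:
    """Service layer: applies a registered reduction close to the data."""
    def __init__(self, storage: Storage,
                 reducers: Dict[str, Callable[[List[float]], float]]) -> None:
        self._storage = storage
        self._reducers = reducers

    def handle(self, variable: str, operation: str) -> Dict[str, object]:
        values = self._storage.read(variable)          # stays on the archive side
        result = self._reducers[operation](values)     # computed next to the data
        return {"variable": variable, "operation": operation, "result": result,
                "values_read": len(values), "values_returned": 1}


if __name__ == "__main__":
    store = InMemoryStorage({"t2m": [280.0, 282.0, 284.0, 286.0]})  # made-up values
    service = ReductionService(store, {"mean": fmean, "max": max})
    print(service.handle("t2m", "mean"))
```

Because the service depends only on the `read` interface, the storage backend can be replaced or relocated without changing the service or its clients, and the client receives one value where it would otherwise download the whole series.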
Coupling Hadoop and compute services may also provide a significant advantage for specific jobs. However, one still needs to consider where and how these topologies and technologies can be applied to specific data-intensive problems and supported by a long-term architectural approach and implementation (e.g., storage, computation, distribution, data analysis services).
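To make the point about algorithms concrete, the sketch below expresses a mean in a map-and-reduce style using only the Python standard library; it does not use or imitate the Hadoop API. Each chunk, standing in for a block of data held on a separate node, is mapped to a (sum, count) pair, and the pairs are combined associatively. The chunk values and function names are illustrative.

```python
from functools import reduce
from typing import Iterable, Sequence, Tuple

Partial = Tuple[float, int]  # (running sum, number of values)


def map_chunk(values: Sequence[float]) -> Partial:
    """Map step: summarise one chunk where it is stored."""
    return (sum(values), len(values))


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial results; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(chunks: Iterable[Sequence[float]]) -> float:
    """Mean over all chunks, computed from per-chunk partial results only."""
    total, count = reduce(combine, map(map_chunk, chunks), (0.0, 0))
    if count == 0:
        raise ValueError("no values supplied")
    return total / count


if __name__ == "__main__":
    # Three unequal chunks of made-up temperatures in kelvin.
    chunks = [[280.0, 282.0], [284.0], [286.0, 288.0, 290.0]]
    print(distributed_mean(chunks))  # 285.0
```

Averaging the three chunk means instead would give roughly 284.3 rather than 285.0, because the chunks differ in size. Even a simple statistic has to be reformulated before it can be distributed, which is part of why the choice of topology and technology depends on the specific analysis.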
Earth science needs to begin employing systems that are constructed for the data-intensive computing era. Upcoming NASA satellite missions will be capturing data and constructing archives in the petabyte range. The next IPCC-related Coupled Model Intercomparison Project, CMIP6, will produce similarly sized repositories of massive model output. The obs4MIPs project, developed for the previous CMIP5, demonstrated the value of bringing together climate model output and observational data in order to support the evaluation of climate models relative to measurements. From a systems standpoint, the implementation of obs4MIPs required the movement of data and the distribution of data from the Earth System Grid (ESG). However, as access to the data, the development of new methods, and the scale of data increase, new approaches are required to consider how data from multiple systems can be brought together efficiently in order to support the needed analysis (Crichton, 2012).
Figure 1 shows the concept of pushing toward distributed, online services and the need to orchestrate computation across such systems. This represents a paradigm shift in how data is managed, distributed, and analyzed across multi-agency Earth Science data systems today. The orchestration requires understanding how and where computation is performed, along with how algorithms and methodologies can be deployed in a distributed architecture. Future systems, such as NOMADS, the Earth System Grid Federation, NASA’s Distributed Active Archive Centers, and other highly distributed, massive scientific data systems, have entered an era where they now need to consider the architectural implications of enabling science on their large data collections. In many ways, this “data science” paradigm is pointing to a need for a whole new approach to bringing together the computing, data, and methodologies to provide a systematic capability for scaling the analysis of massive data sets through the deployment of online, Web-based services.
With a petascale archive of climate and weather models, traditional access methods as well as distributed data access and federated frameworks such as NOMADS and the Earth System Grid Federation (ESGF) are under a heavy burden to provide raw data and even subsets of climate and weather model data. Given an increase in the number of users, growing volumes of data, and the growing need for climate adaptation resources, information alternatives to data access have been presented. Such alternatives include the use of Web-based services exploiting the massive datasets with online tools and computational analysis. Grid technologies have not yet been implemented in many organizations, and the security measures surrounding grid capabilities are slow to be implemented or to gain the confidence of security professionals that such capabilities are indeed secure.
Online clients and services have been presented that provide many examples of common analysis capabilities needed by the climate and weather modelling community. These services can be used in combination with data management technologies such as pre-staging and aggregations of common state variables. Online tools such as CDAT, CDO, and GrADS can provide a path to satisfy many users of high volume model archives. It is clear that given the growth of model archives across data centers, users can no longer be expected to download petabytes of data for analysis. The improvement of numerical weather prediction models using historical observations and other model input and output datasets has been demonstrated, but it requires better access at archive and data centers given the high volumes of many of these data. Due to the exponential growth of model data, users must be granted Web-based access to a suite of basic computations that are placed in front of today’s massive model data archives, rather than forcing multiple users to download petabytes of data. Such an approach requires understanding new architectures, development methodologies, analysis paradigms, and technologies that can be used to usher in this new era of data analysis.
Alpert, J.C., et al., 2009: “High Availability Applications for NOMADS at the NOAA Web Operations Center Aimed at Providing Reliable Real Time Access to Operational Model Data.” American Geophysical Union Spring meeting conference proceedings.
Committee on the Analysis of Massive Data; Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Research Council. “Frontiers in the Analysis of Massive Data,” National Academy Press, 2013.
Crichton, D.J.; Mattmann, C.A.; Cinquini, L.; Braverman, A.; Waliser, D.; Gunson, M.; Hart, A.F.; Goodale, C.E.; Lean, P.; Jinwon Kim, “Sharing Satellite Observations with the Climate-Modeling Community: Software and Architecture,” IEEE Software, vol. 29, no. 5, pp. 73–81, Sept.-Oct. 2012.
Davies, J., Dieter Fensel, Frank van Harmelen (ed.), 2003: “Towards the Semantic Web: Ontology-driven Knowledge Management.” John Wiley & Sons.
Dean, J., Ghemawat, S., MapReduce: Simplified Data Processing on Large Clusters OSDI ’04, pp. 137-150.
Doty, B.E., Wielgosz, J., Gallagher, J., Holloway, D., 2001: GrADS and OPeNDAP. Proceedings of the 17th International Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, Amer. Meteor. Soc., Albuquerque, NM. 385-387.
Foster, I., and Carl Kesselman, 2004: “The Grid 2: Blueprint for a New Computing Infrastructure.” Copyright 2004 Elsevier, Inc. ISBN: 1-55860-933-4
Hankin, S., D.E. Harrison, J. Osborne, J. Davison, and K. O’Brien, 1996: A Strategy and a Tool, FERRET, for Closely Integrated Visualization and Analysis. J. Visualization and Computer Animation, 7, 149-157.
Lorenc, A., 1981: A Global Three-Dimensional Multivariate Statistical Interpolation Scheme. Mon. Wea. Rev.
Raskin, R., 2009: “Enabling Semantic Interoperability for Earth Science Data,” unpublished, Final Report to NASA Earth Science Technology Office (ESTO); JPL.
Rutledge, G.K., 2001: NOMADS, Developments in Teracomputing: Proceedings of the Ninth ECMWF Workshop on the Use of High Performance Computing in Meteorology, World Scientific, Ed.: W. Zwieflhofer and N. Kreitz, 269–276.
Rutledge, G.K., 2011, unpublished: “Final Report of the 2011 Global Organization for Earth Systems Science Portals (GO-ESSP) Workshop.” http://go-essp.gfdl.noaa.gov/2011
Rutledge, G.K., J. Alpert, and W. Ebisuzaki, 2006: NOMADS: A Climate and Weather Model Archive at the National Oceanic and Atmospheric Administration. Bull. Amer. Meteor. Soc., 87, 327-341.
Saha, Suranjana, and Coauthors, 2010: The NCEP Climate Forecast System Reanalysis. Bull. Amer. Meteor. Soc., 91, 1015–1057. doi: 10.1175/2010BAMS3001.1
Schulzweida, U., Kornblueh, L., Quast, R., 2010: Climate Data Operators (CDO) User’s Guide, version 1.4.6. pp 1-173, unpublished. See https://code.zmaw.de/projects/cdo
Williams, D. N., and Coauthors, 2009: The Earth System Grid: Enabling Access to Multimodel Climate Simulation Data. Bull. Amer. Meteor. Soc., 90, 195–205.
Williams, D.N, unpublished 2007: “Climate Data Analysis Tools – Merging Technologies for Climate Change Research”, The 2007 Earth System Grid Meeting, DOE Lawrence Livermore National Laboratory.