Data. To store or not to store, that is the question.

Panel Session on “Big data processing: What is the added value, for which applications?”

3rd October 2019

by René Garello, IEEE Fellow, Past President and Hari Vishnu, Editor, OES Earthzine

Oceans form a significant portion of Earth's area. How do we manage the vast amount of data collected from this area?

With the invention of smartphones, better internet-of-things devices and storage devices, and a growing shift towards internet-based information systems, there has been a big surge in the amount of data collected and recorded these days.

“The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s ^[1]. Based on an International Data Corporation (IDC) report prediction, the global data volume will grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020 ^[2]. By 2025, IDC predicts there will be 163 zettabytes of data ^[3]”, says Wikipedia.
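As a rough check on the figures quoted above, the short Python sketch below converts each one into an annual growth factor. It assumes smooth exponential growth between the quoted endpoints, which the sources do not state, and the function names are purely illustrative. The two figures describe different quantities (per-capita storage capacity versus global data volume), so they are not expected to match.

```python
import math


def annual_factor_from_doubling(months_to_double: float) -> float:
    """Annual growth factor implied by a doubling time given in months."""
    return 2.0 ** (12.0 / months_to_double)


def annual_factor_from_endpoints(start: float, end: float, years: float) -> float:
    """Annual growth factor implied by two values `years` apart,
    assuming smooth exponential growth in between (an assumption)."""
    return (end / start) ** (1.0 / years)


def doubling_time_months(annual_factor: float) -> float:
    """Doubling time in months for a given annual growth factor."""
    return 12.0 * math.log(2.0) / math.log(annual_factor)


# Per-capita storage capacity: doubling every 40 months (figure from the quote).
capacity_factor = annual_factor_from_doubling(40)

# Global data volume: 4.4 ZB in 2013 to 44 ZB in 2020 (figures from the quote).
volume_factor = annual_factor_from_endpoints(4.4, 44.0, 2020 - 2013)

print(f"capacity grows ~{(capacity_factor - 1) * 100:.0f}% per year")
print(f"volume grows ~{(volume_factor - 1) * 100:.0f}% per year, "
      f"doubling every ~{doubling_time_months(volume_factor):.0f} months")
```

Under these assumptions, the tenfold growth in data volume over seven years corresponds to a doubling time of roughly two years, noticeably faster than the 40-month doubling quoted for storage capacity.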

As of 2018, it was reported that 2.5 exabytes of data are generated every day ^[4]. Indeed, ‘big data’ has now become a buzzword and captured the interest of scientists, engineers, lawyers and policy-makers alike. The term usually refers to techniques for finding and using patterns in large datasets for useful purposes.

The panel

A panel discussion on the topic of big data processing was held on 19th June at the Oceans conference in Marseille. The session was billed as a “round table”, and its purpose was to review the impact of big data analytics in ocean technology, to identify the most promising applications, and to discuss how to manage and process the data.

Panel Quote: "The value of data lies in obtaining insights and making decisions"

Big data analytics is a hot topic that touches many domains, some of them part of everyday life. Examples include commercial optimization, online services, economic predictions, predictive maintenance, security through surveillance, and the digitalization of processes.
Less clear is the impact of these functionalities on technical systems and their performance. More and more digital data are being generated by systems through the extended use of digital equipment and intelligent sensors.
For ocean technology, applications include environmental assessment using distributed sensors and buoys, remote sensing using airborne vehicles or satellites, on-board ship control and mission management, and the optimization of power generated by renewable energy systems at sea.

The moderator opened the discussion with a short history of how awareness of big data developed, and of the means proposed to deal with this perceived “problem” ^[5]. A first hint at the concept, described as “growth in volume” or “information explosion”, appeared as early as 1941. In 1944, a projection for the Yale library estimated that by 2040 it would hold over 200 million volumes, occupying about 10,000 km of shelves.

In 1967, the “information explosion” led to several conclusions, such as the need to keep storage to a minimum and to develop compression techniques matched to the rate of information (bandwidth) ^[5]. This problem is clearly still present in telecommunications at large, with the push to pass a maximum of information (text, voice, images, video) through a minimum of space (communication channels).

In 1975, the concept of “segmenting” information, rather than proceeding with blind mass storage, appeared ^[5]. That was a first attempt at structuring storage in a way that went beyond simply storing all information.

In 1990, the question “Saving all the bits?” was addressed with the intent to develop “machines” able to monitor data flows and predict patterns, avoiding the need to record everything. This is a very hot topic nowadays.

To conclude the introduction, René added that in 1996 digital storage was recognized for the first time as more cost-effective than paper storage, and that in 1997 the term “big data” was used for the first time. René also cited the famous mathematician Richard Hamming, who held that “the aim of scientific computing is to get insight into data, and not (merely to churn out) numbers” ^[6]. On a related note, panel member Fabien pointed out that we must ask: is it worth holding on to all the data?

"There is a cost to retaining all the data. However, it is yet to be quantified against the benefits, and we are yet to clearly designate whose responsibility it is to bear it." - John Waterston

The cost of storage

John Waterston observed that there is a cost to retaining all the data. That cost has yet to be weighed against the benefits, and it is not yet clear whose responsibility it is to bear it.

How do we quantify the cost of retaining all the data? One consideration is that the value of data lies in obtaining insights and making decisions. Thus, how immediately useful the data are for making a decision should be one factor in weighing their storage cost. There are other aspects to the cost as well, such as the power required for storage. As an example, René noted that the electric power required to run a single large data center, such as the Google facility in Tennessee, is about 1 gigawatt.

In contrast, Hélène said that the hydrographic community usually prefers not to throw away collected data. This is because hydrographic data are painstaking to obtain and can prove valuable several decades down the line (read about hydrographers acknowledged on World Hydrography Day). This is understandable in hydrography, where it is hard to achieve exhaustive coverage of all regions. Hélène cited an example of how a decades-old shipping lane map helped her identify a route to a remote island.

The panel discussing Big data and its relevance in marine sciences and engineering ²

Democratizing data, making it affordable

The reason scientists most often give for not wanting to throw away data is the large amount of time they have spent developing the sensors, calibrating them for accuracy, and collecting the data. After this effort, the resulting data appear too valuable to discard.

John pointed out that one way to offset the cost of acquiring and storing data in the first place would be to use cheaper sensors in larger quantities and in a smart way. These sensors could be calibrated against high-cost equipment, after which they provide a cheap way to obtain data. They could even be smartphones carried by ordinary people: citizen science could help plug the gap and provide cheap, wide-coverage, real-time data (read related article on citizen science and how it has been useful). An example of this is Google Street View, which collects data from many handheld cameras with limited GPS accuracy and then employs smart processing to build a good representation on Google's street maps, offsetting the disadvantage of sensor quality.

Smart sensing and processing, such as edge processing, must take the lead in determining which data are retained, the panel pointed out. As an analogy, human beings do not retain every single detail they have observed, but rather just the essence, as memories. Machine learning techniques may play an important part in providing tools for crunching big data into simpler, denser blocks that are easier to store but more informative ^[8], or in developing smart sampling strategies that select a sparse set of data points representative of a larger set ^[9].
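To make the idea of deciding at the sensor which data to retain more concrete, here is a minimal Python sketch of a deadband filter: a reading is kept only when it differs from the last retained reading by more than a threshold. This is a deliberately simple stand-in chosen for illustration, not the neural-network or sparse Gaussian process methods of references [8] and [9]; the function name, the threshold and the sample readings are all hypothetical.

```python
from typing import Iterable, List, Tuple


def deadband_filter(
    samples: Iterable[Tuple[float, float]], threshold: float
) -> List[Tuple[float, float]]:
    """Keep a (time, value) sample only when the value has moved by more
    than `threshold` since the last retained sample."""
    retained: List[Tuple[float, float]] = []
    for t, value in samples:
        # Always keep the first sample; afterwards keep only significant changes.
        if not retained or abs(value - retained[-1][1]) > threshold:
            retained.append((t, value))
    return retained


# Hypothetical temperature readings: (time in seconds, value in degrees C).
readings = [(0, 15.00), (1, 15.01), (2, 15.02), (3, 15.60), (4, 15.61), (5, 14.90)]
kept = deadband_filter(readings, threshold=0.5)
print(f"retained {len(kept)} of {len(readings)} readings: {kept}")
```

The trade-off is explicit in the threshold: a larger value stores fewer points but loses small variations, so in practice it would need to be tied to the sensor's accuracy and to the decisions the data are meant to support.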

Final takeaway

In general, the panel’s take was that there may be no one-size-fits-all answer on how to tackle the growing amount of data. A look at how tech giants like Amazon, Google and Facebook handle the problem may offer some guidance. Fabien mentioned some ways in which statistics of the data can be used as a proxy for the data themselves. Also, the more prior information you have, the less data you need to complement it.
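One way to picture statistics standing in for raw data is a running summary that is updated as each reading arrives, after which the reading itself can be discarded. The sketch below uses Welford's streaming algorithm for the mean and variance; it is an illustration chosen here, not a method described by the panel, and the class name and sample values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Streaming count, mean, variance, minimum and maximum (Welford's algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        """Fold one new reading into the summary."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two readings have been seen."""
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # hypothetical sensor values
    stats.update(reading)  # the raw reading is not stored anywhere
print(stats.count, stats.mean, stats.variance, stats.minimum, stats.maximum)
```

The summary occupies a fixed handful of numbers however many readings pass through it, at the price of losing anything the chosen statistics do not capture, such as the order of the readings or the shape of their distribution.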

Hélène stated that identifying good formats in which to store data is also an important part of minimizing storage overheads. A large amount of stored data ends up obsolete for lack of supporting metadata, and yet it survives far too long. A chain of custody must be established for handling data, and the relevant metadata must be transmitted along with it to avoid obsolescence of stored data.

Finally, Richard Spinrad from Oregon State University, who was seated in the audience, pointed out that a distinction must be made between model-generated and measured data, and in how the two are handled. Case in point: the National Oceanic and Atmospheric Administration (NOAA) generated 15 petabytes per day, 75% of which was model-generated. One way to tackle copious amounts of stored data is to make model code available online, so that data consumers can use the models to generate the required data on an as-needed basis. Research into compressed information representation, and into extrapolating data from a few points to a larger space, is also relevant here.
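The suggestion of regenerating model output on demand can be sketched in a few lines of Python. The "model" below is a hypothetical toy (a sinusoid plus seeded pseudo-random noise), not any NOAA model; the point is only that archiving the code and its inputs can be enough to rebuild the output. The daily volumes use the figures quoted above at face value.

```python
import math
import random
from typing import List


def toy_wave_model(seed: int, n_steps: int, amplitude: float = 1.0) -> List[float]:
    """Hypothetical stand-in for a forecast model: a sinusoid plus seeded
    pseudo-random noise. The same inputs always give the same output."""
    rng = random.Random(seed)
    return [
        amplitude * math.sin(2 * math.pi * step / 24) + rng.gauss(0.0, 0.1)
        for step in range(n_steps)
    ]


# Instead of archiving the output, archive only what is needed to rebuild it.
recipe = {"seed": 42, "n_steps": 1000, "amplitude": 2.0}

first_run = toy_wave_model(**recipe)
second_run = toy_wave_model(**recipe)  # regenerated later, on demand
print("outputs identical:", first_run == second_run)

# Split of the quoted daily volume into model-generated and measured data.
total_pb_per_day = 15.0
model_pb_per_day = 0.75 * total_pb_per_day
measured_pb_per_day = total_pb_per_day - model_pb_per_day
print(f"model-generated: {model_pb_per_day} PB/day, measured: {measured_pb_per_day} PB/day")
```

For real models, exact reproducibility also depends on archiving the code version, input data and computing environment, and regeneration trades storage cost for compute cost, so this approach suits data that are needed only occasionally.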

At the end, to quote the panel,

“To store data or not, that is the question” - Fabien Chaillan

References

[1] Hilbert, M.; López, P. (April 2011). "The world's technological capacity to store, communicate, and compute information" (PDF). Science. 332 (6025). doi:10.1126/science.1200970. PMID 21310967.
[2] Hajirahimova, Makrufa Sh.; Aliyeva, Aybeniz S. (2017). "About Big Data Measurement Methodologies and Indicators". International Journal of Modern Education and Computer Science. 9 (10): 1–9. doi:10.5815/ijmecs.2017.10.01.
[3] Reinsel, David; Gantz, John; Rydning, John (13 April 2017). "Data Age 2025: The Evolution of Data to Life-Critical" (PDF). seagate.com.
[4] “How much data do we create everyday?” www.forbes.com, Bernard Marr, May 21 2018, Retrieved 10 Aug 2019
[5] “A very short history of Big data”, www.forbes.com, Gil Press, Dec 21 2013, Retrieved 10 Aug 2019
[6] Groß, M., 1994. Introduction, in: Visual Computing: The Integration of Computer Graphics, Visual Perception and Imaging. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 1–12. doi:10.1007/978-3-642-85023-3_1
[7] Oceans: the great unknown, www.nasa.gov Retrieved 20 July 2019
[8] Hinton, G., Salakhutdinov, R., 2006. Reducing the Dimensionality of Data with Neural Networks. Science 313, 504–507. doi:10.1126/science.1127647
[9] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing Systems, 2006, pp. 1257–1264

¹Image 1: "big-data-analytics" by learn_tek is licensed under CC0 1.0

²Image of the panel discussion is credited to the authors

Read coverage on a Hydrography workshop at Oceans 2019 conference, Oceans-related coverage on Careers for students and Young Professionals, or more articles covering Oceans conferences.
