Good Enough? Nichesourcing in Data Quality Assessment

Earthzine: Crowdsourcing Theme, Original, Themed Articles

Life isn’t simple when you deal with crowdsourced data. A small, niche group of expert volunteers can use a crowdsourcing approach to assess whether the quality of data sourced from official national maps and volunteered geographic information (VGI) is “good enough.” However, “good enough” is a tricky concept when the crowd has a known face and can influence you.

By Francisco J. Lopez-Pellicer (1) and Jesus Barrera (2)

(1) Assistant Professor, Universidad Zaragoza (Spain)

(2) Project Manager, GEOSLAB (Spain)

It is tempting to integrate crowdsourced data into our databases to identify data quality issues (e.g., “is my Earth observation data still valid?”). Here we present a use case developed during the Linked Map project (part of the EU FP7 project PlanetData) named “nichesourcing of data quality assessment.” In this use case, a small group of expert volunteers used a crowdsourcing approach to assess whether the quality of a conflation of data sourced from an official national map (the BTN25, a National Map database of Spain) and a volunteered geographic information (VGI) dataset (the OpenStreetMap database) is good enough. We learned that “good enough” is a tricky concept, especially when the crowd has a known face and can influence you.

Nichesourcing

Nichesourcing [1] is a recent term. It was coined in 2012 to describe a specific type of crowdsourcing where knowledge-intensive tasks are distributed among a small crowd of volunteer experts (e.g., Earth observation experts) rather than a large, mostly anonymous crowd. The niche group is gathered either from distributed experts or from local networks that have the domain knowledge, motivation and ability to contribute. Nichesourcing is not new in the field of Earth observation. The EU FP7 project VOICES describes a case study for digitizing hand-written pluvial data forms from the Sahel region in Africa, where the niche group consisted of people with an African-related background who had access to the Web and were motivated to benefit Africa (e.g., African Diaspora, African youth) [2]. Under some circumstances, the niche group is a subset of a community of practice [3]. A community of practice is characterized by a common purpose, peer affinity based on this purpose, and regular interaction between members that generates social trust and reputation. That is, the members of the niche group are acquaintances.

“Good Enough” in Crowdsourcing

We should be aware that citizen scientists have their own point of view when they produce data, often in conflict with the points of view of government and research agencies. In many scenarios, data produced by citizen scientists is good enough for solving problems and understanding the natural world to suit the citizen scientists’ needs. Of course, a conflict arises as soon as citizen scientists and government and research agencies contradict each other. For example, government and research agencies often distrust crowdsourced citizen science contributions because they believe that their quality is unknown. They fail to understand that the quality of citizen science contributions is good enough for the requirements of citizen scientists. This is a basic premise. However, we should admit that the requirements of many crowdsourcing initiatives are by definition ambiguous and confusing, and not only from the point of view of government and research agencies. In such cases, the crowd may influence the perception of what is “good enough” for a citizen scientist. But how big is the minimum “crowd” that can influence a citizen scientist?

Linked Map platform overview. Image Credit: Linked Map project

Data Quality Assessment and Nichesourcing

In the Linked Map project, we have developed a case study for the application of a nichesourcing approach for tasks related to data quality assessment [4]. The purpose of the case study was to evaluate if a small and engaged group of volunteer experts can produce quality assessment data good enough for decision-making.

The Linked Map platform was developed within the Linked Map project to showcase technologies developed within the project and to support experiments on the use of crowdsourcing techniques for data quality assessment. The platform is a Web portal that enables volunteers to review the reliability of the integration of VGI data and official data by examining matched pairs of features. The site is powered by a semantic backend that uses OGC GeoSPARQL, the geographic query language for Resource Description Framework (RDF) data. The provenance of exposed data was annotated with the W3C PROV ontology. Users can submit their own reviews after analyzing previous non-anonymous reviews made by other users. The platform uses a map as its main user interface (UI) element. Users can employ the map to explore a dataset formed by pairs of features matched from different sources and then assess whether these matches are correct. Furthermore, users can assess whether the geometry of each matched feature is a faithful representation of the real-world feature that it describes. These assessments are stored for further analysis.
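The review workflow described above can be pictured with a small data model. The following Python sketch is illustrative only: the class names, fields and identifiers are assumptions made for this example and do not describe the actual Linked Map implementation, which relied on an RDF backend.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Review:
    """One non-anonymous assessment of a matched pair (hypothetical model)."""
    reviewer: str                             # reviews are signed, so later reviewers see the author
    match_correct: bool                       # do both features describe the same real-world entity?
    geometry_faithful: Optional[bool] = None  # optional judgment on the geometry
    comment: str = ""


@dataclass
class MatchedPair:
    """A feature from the official map matched with a VGI feature."""
    official_feature_id: str                  # hypothetical identifier of a BTN25 feature
    vgi_feature_id: str                       # hypothetical identifier of an OpenStreetMap feature
    reviews: List[Review] = field(default_factory=list)

    def previous_reviews(self) -> List[Review]:
        """Reviews a new reviewer can read before submitting their own."""
        return list(self.reviews)

    def add_review(self, review: Review) -> None:
        self.reviews.append(review)


pair = MatchedPair("btn25:example-1", "osm:example-1")
pair.add_review(Review("alice", match_correct=True, geometry_faithful=True))
pair.add_review(Review("bob", match_correct=True, comment="Agree with alice"))
print(len(pair.previous_reviews()))
```

Because every review carries the reviewer's name and is visible to later reviewers, a model like this makes explicit the channel through which one participant's opinion can reach another.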

The nichesourcing case study involved a data quality assessment task restricted to the administrative limits of the city of Zaragoza, Spain, which cover 969 km². Around 700,000 people reside within these limits (the municipality ranks as the fifth most populous in Spain and 35th in the European Union), and more than 650,000 of them live in the city itself, which occupies approximately 240 km². That is, there is a very dense city center (96 percent of the inhabitants are concentrated in less than 25 percent of the area) surrounded by small rural villages. In addition, a quarter of the surface is a restricted military area (airfield, bases and training facilities).

Zaragoza in the Linked Map platform. Image Credit: Linked Map project

Zaragoza covers a wide range of contexts, including very dense urban zones with a high concentration of features coincident in both datasets as well as rural and military areas with few coincident features. The scenario is a nichesourcing task because it has the three defining characteristics [1]: the task is knowledge-intensive with clear and defined goals, its success is determined by the credibility of the results, and participants are selected from a community of practice focused on GIS which originated in the IAAA Lab at the Universidad Zaragoza.

The area selected for the nichesourcing scenario (41°31'N 1°11'W, 41°49'N 0°31'W) includes Zaragoza city limits. It contains 1,516 features of BCN/BTN25 linked with 2,146 features of OpenStreetMap. Twenty participants were recruited from the community of practice via e-mail and personal contacts. Participants were selected taking into account their Geographic Information System (GIS) expertise and degree of knowledge of Zaragoza. They were provided with additional guidance and the goal of verifying between 15 and 20 mappings each. Participants were advised that adding a review, as well as agreeing or dissenting on a mapping previously reviewed by another participant, counted as a verified mapping.
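For readers who want to reproduce the study area, the corner coordinates can be converted to decimal degrees and used as a simple bounding box. This is a minimal sketch: the function names are invented for illustration, and the test point for central Zaragoza is an approximate location assumed for the example.

```python
def dm_to_decimal(degrees: int, minutes: int, hemisphere: str) -> float:
    """Convert degrees and minutes to signed decimal degrees."""
    value = degrees + minutes / 60.0
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value


# Corners as given in the article: 41°31'N 1°11'W and 41°49'N 0°31'W.
south = dm_to_decimal(41, 31, "N")
west = dm_to_decimal(1, 11, "W")
north = dm_to_decimal(41, 49, "N")
east = dm_to_decimal(0, 31, "W")


def in_study_area(lat: float, lon: float) -> bool:
    """True if the point lies inside the bounding box."""
    return south <= lat <= north and west <= lon <= east


# Approximate location of central Zaragoza (assumed for illustration).
print(in_study_area(41.65, -0.88))
print(round(south, 4), round(west, 4), round(north, 4), round(east, 4))
```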

Expert contributions can be expected to be of high quality, but this is a hypothesis that must be verified. To do so, three participants were secretly instructed to behave as rogue participants: at least five of their reviews had to be wrong. In addition, they were instructed to add a deceptive review only if nobody had previously reviewed the feature, ensuring that the misleading review was the first available review of it. The person in charge of the activity kept a secret list of wrong reviews.
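The seeding rules for rogue participants can be sketched as a small bookkeeping class. This is a hypothetical illustration of the protocol as described, not the tooling used in the experiment; the class, method and identifier names are all assumptions.

```python
from typing import Dict, List, Set, Tuple


class ReviewLog:
    """Tracks reviews per mapping and a secret list of planted wrong reviews."""

    def __init__(self, rogue_reviewers: Set[str]) -> None:
        self.rogue_reviewers = rogue_reviewers
        # mapping id -> list of (reviewer, verdict)
        self.reviews: Dict[str, List[Tuple[str, bool]]] = {}
        # secret list kept by the organizer: (mapping id, reviewer)
        self.secret_wrong: Set[Tuple[str, str]] = set()

    def can_plant(self, mapping_id: str, reviewer: str) -> bool:
        """A deceptive review is allowed only as the first review of a mapping."""
        return reviewer in self.rogue_reviewers and not self.reviews.get(mapping_id)

    def submit(self, mapping_id: str, reviewer: str, verdict: bool,
               deceptive: bool = False) -> None:
        if deceptive:
            if not self.can_plant(mapping_id, reviewer):
                raise ValueError("deceptive reviews must be the first review")
            self.secret_wrong.add((mapping_id, reviewer))
        self.reviews.setdefault(mapping_id, []).append((reviewer, verdict))

    def quota_met(self, reviewer: str, minimum: int = 5) -> bool:
        """Each rogue participant had to submit at least five wrong reviews."""
        planted = sum(1 for _, who in self.secret_wrong if who == reviewer)
        return planted >= minimum


log = ReviewLog({"rogue_1"})
log.submit("m1", "rogue_1", verdict=False, deceptive=True)
log.submit("m1", "expert_7", verdict=False)
print(log.can_plant("m1", "rogue_1"))
```

Keeping the secret list separate from the visible reviews lets the organizer later identify which follow-up reviews were responding to a planted error.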

We expected that, given the background of the participants (experience with GIS tools, living in Zaragoza), their opinions would not be influenced by previous ones. However, we found that reviewers agreed with a deceitful opinion more often than we expected. When the first review of an item was a misleading assessment, 29.2 percent of reviewers expressed agreement with it. After analyzing the results, we started to suspect that this was occurring because reviewers were being influenced by the opinions of other members of the community. To investigate this hypothesis, we created a questionnaire of 10 questions to gather anonymous opinions on the usage of the platform from participants in the experiment. Ten of the 20 participants filled out the survey. Results showed that participants had long experience with GIS (8.5 years on average) and knew the expertise of the other reviewers. Moreover, the survey showed that some surveyed participants, three or four depending on the task, acknowledged that they were often influenced by previous comments.
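One plausible way to compute an agreement figure of this kind is to count, over all mappings whose first review was planted, the share of follow-up reviews that agreed with it. The article does not state the exact denominator used, so the sketch below is an assumption, and the data in it is toy data invented for illustration, not the study's records.

```python
from typing import Dict, List, Set, Tuple

FollowUp = Tuple[str, bool]  # (reviewer, agrees_with_first_review)


def agreement_with_planted(
    follow_ups: Dict[str, List[FollowUp]], planted: Set[str]
) -> float:
    """Share of follow-up reviews that agree with a planted first review."""
    total = 0
    agreeing = 0
    for mapping_id, reviews in follow_ups.items():
        if mapping_id not in planted:
            continue  # only mappings whose first review was deliberately wrong
        total += len(reviews)
        agreeing += sum(1 for _, agrees in reviews if agrees)
    return agreeing / total if total else 0.0


# Toy data, invented for illustration; NOT the study's records.
toy_follow_ups = {
    "m1": [("a", True), ("b", False), ("c", False)],
    "m2": [("d", False)],
    "m3": [("e", True)],  # first review was honest, so it is ignored
}
toy_planted = {"m1", "m2"}

rate = agreement_with_planted(toy_follow_ups, toy_planted)
print(f"{rate:.1%}")
```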

Assessing integration quality in the Linked Map platform. Image Credit: Linked Map project.

Participants acknowledged that the content was not accurate (e.g., believability values are 56.2 percent and 71.1 percent). Not knowing which content is accurate is a well-known risk associated with user-generated content sites [5], so participants should have been aware that reviews from other experts could contain errors. However, the credibility of geographic information rests on the perceived authority of those who produce it: experts in the geographic domain (official geographic information) or locals immersed in the area (volunteered geographic information) [6]. Both factors were present in the niche group and therefore could have reduced the likelihood of dismissing other reviews when a participant had doubts about the quality of the reviewed resource. The survey also suggested a potential third factor: some reviewers are susceptible to being influenced by other members of their community of practice.

Lessons Learned

Our nichesourcing experiment shows that bootstrapping a crowdsourcing experiment with the support of a niche group or a community of practice is easy. The trade-off is that well-known issues found in crowdsourcing initiatives, such as potential bias in the data, do not disappear even when participants belong to a community of practice. Within a niche group, the opinions of some leaders and acquaintances are “good enough” to be followed by some participants, even if these opinions are clearly wrong.

Author Bio

Francisco J. Lopez-Pellicer holds a Ph.D. in computer science and is now a researcher on Open Data, Geo Semantic Web and Geo Web Services at IAAA Labs, Universidad Zaragoza (Spain). He can be reached at fjlopez@unizar.es.

Jesus Barrera is a software engineer and project manager at GeoSpatiumLab S.L (Spain). He can be reached at jesusb@geoslab.com.

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257641. This work has been partially supported by the Spanish Government (project TIN2012-37826-C02-01), the National Geographic Institute (IGN) of Spain, and GeoSpatiumLab S.L.

References

[1] V. de Boer et al., “Nichesourcing: Harnessing the Power of Crowds of Experts,” Journal on Data Semantics III, vol. 7603, no. 3, S. Spaccapietra and E. Zimányi, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 16–20.

[2] B. M. Tesfa, “Nichesourcing: a case study for pluvial data digitalization for the Sahel,” VU University, Amsterdam, NL, 2012.

[3] E. Wenger et al., Cultivating Communities of Practice. Harvard Business Press, 2002.

[4] F. J. Lopez-Pellicer and J. Barrera, “D19.2 Call 2: Linked Map Report on crowdsourcing trade-offs for geospatial data curation,” PlanetData, 2014.

[5] P. Denning et al., “Wikipedia risks,” Commun. ACM, vol. 48, no. 12, pp. 152–152, Dec. 2005.

[6] A. J. Flanagin and M. J. Metzger, “The credibility of volunteered geographic information,” GeoJournal, vol. 72, no. 3, pp. 137–148, Jul. 2008.