Exploring Research Data Repositories with geoextent

Authors

Author1 = {"name": "Sebastian Garzón", "affiliation": "Opening Reproducible Research, Institute for Geoinformatics, University of Münster, Germany", "email": "jgarzon@uni-muenster.de", "orcid": "https://orcid.org/0000-0002-8335-9312"}
Author2 = {"name": "Daniel Nüst", "affiliation": "Opening Reproducible Research, Institute for Geoinformatics, University of Münster, Germany", "email": "daniel.nuest@uni-muenster.de", "orcid": "https://orcid.org/0000-0002-0024-5046"}

Purpose

This notebook presents geoextent, a Python library for reliably extracting the geospatial and temporal extents of files, directories, and repository records. Geospatial and temporal metadata of research data could greatly benefit the discovery of relevant and related datasets (Gregory et al., 2018). However, such metadata are underused in scientific data repositories, except in specialised ones. Many more scientific disciplines collect data and publish work that has some temporal or spatial relation, and these datasets may not be connected through regular search indices based on keywords or full texts. geoextent helps to understand the potential of extracting this information from files shared in data repositories and may be used to integrate geospatial and temporal metadata into repository infrastructures.

Technical contributions

The geoextent library is a wrapper around GDAL (GDAL/OGR contributors, 2021), the most commonly used software for loading and saving geospatial data. Its main contributions are the ease of extracting discovery metadata from data files using GDAL, the handling of the most common cases with defaults that support automation, the aggregation of extents for multiple files or directories, and the integration of retrieval functions for common scientific data repositories. This notebook relies on geoextent version 0.7.1 (Nüst, Garzón & Qamaz, 2021); some helper functions are shared next to the notebook file. The notebook was developed for Python 3.6+ and a standard Jupyter environment (Kluyver et al., 2016). Some cells require a stable connection to the Zenodo API.
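The aggregation of extents for multiple files can be pictured with a small sketch. It is not geoextent's implementation: the function names are hypothetical, and it assumes that every bounding box is given as [min x, min y, max x, max y] in the same coordinate reference system, that time extents are pairs of ISO 8601 date strings, and that a file without a detectable extent is represented by None.

```python
from typing import Iterable, List, Optional, Sequence


def merge_bboxes(bboxes: Iterable[Optional[Sequence[float]]]) -> Optional[List[float]]:
    """Smallest box covering all inputs; None entries (no extent found) are skipped."""
    valid = [b for b in bboxes if b is not None]
    if not valid:
        return None
    return [
        min(b[0] for b in valid),  # westernmost x
        min(b[1] for b in valid),  # southernmost y
        max(b[2] for b in valid),  # easternmost x
        max(b[3] for b in valid),  # northernmost y
    ]


def merge_tboxes(tboxes: Iterable[Optional[Sequence[str]]]) -> Optional[List[str]]:
    """Earliest start and latest end; ISO 8601 dates sort correctly as strings."""
    valid = [t for t in tboxes if t is not None]
    if not valid:
        return None
    return [min(t[0] for t in valid), max(t[1] for t in valid)]


# Hypothetical per-file results for one directory (the second file has no extent).
file_bboxes = [[7.60, 51.94, 7.65, 51.97], None, [7.58, 51.95, 7.62, 52.00]]
file_tboxes = [["2018-11-14", "2018-11-14"], None, ["2017-04-08", "2020-02-06"]]

directory_bbox = merge_bboxes(file_bboxes)
directory_tbox = merge_tboxes(file_tboxes)
print(directory_bbox, directory_tbox)
```

The sketch ignores boxes that cross the antimeridian and files in differing coordinate reference systems, both of which a real aggregation would have to handle.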

Methodology

We performed a case study of Zenodo records to explore the potential of automatically extractable geographic coverage metadata in research data repositories. The case study also validated the features of geoextent and helped improve its automated handling of data types. First, a set of records matching the search term ‘geology&geo’ and below a size limit of 500 MB per record is downloaded, and metadata are extracted with geoextent. The results are stored in a local GeoPackage file and then analyzed: we determine the percentage of records for which geospatial metadata can be extracted automatically and render the extracted geospatial and temporal metadata for visual inspection. Second, we analyze the distribution of the total number of files and the success rate of extraction by repository. Finally, we determine the proportion of files that potentially hold geospatial data and the success rate of extraction for them.
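The record selection step can be sketched as follows. This is an illustration under stated assumptions, not the notebook's own helper code: the function names are hypothetical, the records are assumed to be dictionaries with a "files" list whose entries carry a "size" in bytes, and 500 MB is interpreted as 500 × 10^6 bytes. The search URL is only built, not requested, so the sketch runs offline.

```python
from urllib.parse import urlencode

ZENODO_RECORDS_API = "https://zenodo.org/api/records"
SIZE_LIMIT_BYTES = 500 * 10**6  # assumption: MB read as 10^6 bytes


def build_search_url(query, page=1, page_size=100):
    """Build a records search URL; urlencode escapes the '&' inside the query."""
    params = {"q": query, "page": page, "size": page_size}
    return f"{ZENODO_RECORDS_API}?{urlencode(params)}"


def total_size(record):
    """Sum of file sizes in bytes for one record (assumed record structure)."""
    return sum(f.get("size", 0) for f in record.get("files", []))


def below_size_limit(records, limit=SIZE_LIMIT_BYTES):
    """Keep only records whose files together stay below the limit."""
    return [r for r in records if total_size(r) < limit]


url = build_search_url("geology&geo")
print(url)

# Hypothetical records, for illustration only.
records = [
    {"id": "a", "files": [{"size": 120_000_000}, {"size": 30_000_000}]},
    {"id": "b", "files": [{"size": 750_000_000}]},
]
kept_ids = [r["id"] for r in below_size_limit(records)]
print(kept_ids)
```

Escaping matters here because an unescaped ‘&’ in the search term would be read as a separator between URL parameters.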

Results

The extraction of geospatial and temporal information performed with geoextent suggests that files stored in repositories could fill gaps in the metadata of research data repositories. Our approach to extracting geographic coverage metadata generates a bounding box (bbox) for 14.4% of the repositories explored (i.e., Zenodo records; see Fig. 1) without any manual intervention. This share is considerably higher than the 0.79% of Zenodo records that currently carry geometadata (locations; see Table 1), of which only 0.14% are dataset records, though our search uses a baseline filtered for geospatial records. For the temporal extent (tbox), successful extractions with geoextent are considerably rarer, at only 2.51% (see Fig. 1). This can be ascribed to time being modeled less explicitly than location in common file formats. The difference between geospatial and temporal extractions over the total number of files (see Fig. 2) could result from file formats that remove ambiguity only for geographical features (e.g., shp or tif). Nevertheless, these temporal extents could complement the Zenodo dates parameter, which only concerns the publication time.
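The percentages quoted above follow directly from the status counts reported in Table 2 (46 successes, 259 records without extraction, and 14 geoextent failures for the geospatial extent; 8, 297, and 14 for the temporal extent). A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
# Status counts per record, taken from Table 2 of this notebook.
counts = {
    "geospatial": {"success": 46, "no_extraction": 259, "failure": 14},
    "temporal": {"success": 8, "no_extraction": 297, "failure": 14},
}


def shares(status_counts):
    """Percentage of records per extraction status, rounded to two decimals."""
    total = sum(status_counts.values())
    return {status: round(100 * n / total, 2) for status, n in status_counts.items()}


geospatial_shares = shares(counts["geospatial"])
temporal_shares = shares(counts["temporal"])
print(geospatial_shares)
print(temporal_shares)
```

Both rows sum to 319 records, so the geospatial and temporal shares are directly comparable.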

As the main observation about the explored records, we found that almost 50% of files have a format known to be able to store geospatial information (‘geoformats’) (see Fig. 3). These include standardized file types for geographical information, such as GeoJSON, GeoPackage, NetCDF, and GeoTIFF, as well as less standardized but widely used formats such as CSV or PNG. In terms of records, 51% have at least one file in a format that can store geospatial data. This implies that almost half of the records analyzed do not model location information in their content in a way that could be extracted automatically. Of the records with at least one geoformat file, only 28% had a successful extraction (see Fig. 4). These observations point to the two main challenges of our approach: the absence of data to explore (i.e., no geoformats in records) and a low extraction success rate for the available data (i.e., potential geoformats not providing the required information).
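One way to picture the ‘geoformat’ classification is a lookup on the file extension. The sketch below is an assumption for illustration: the extension set is not geoextent's actual list of supported formats, and the record contents are invented.

```python
from pathlib import PurePosixPath

# Illustrative set only; NOT the authoritative list of formats handled by geoextent.
GEOFORMAT_EXTENSIONS = {".geojson", ".gpkg", ".nc", ".tif", ".tiff", ".shp", ".csv", ".png"}


def is_geoformat(filename):
    """True if the extension suggests the file could hold geospatial data."""
    return PurePosixPath(filename).suffix.lower() in GEOFORMAT_EXTENSIONS


def records_with_potential(records):
    """Ids of records that contain at least one potential geospatial file."""
    return {rid for rid, files in records.items() if any(is_geoformat(f) for f in files)}


# Hypothetical records mapping a record id to its file names.
records = {
    "rec-a": ["paper.pdf", "sites.CSV"],
    "rec-b": ["README.md"],
    "rec-c": ["dem.tif", "notes.txt"],
}

potential = records_with_potential(records)
share = 100 * len(potential) / len(records)
print(sorted(potential), round(share, 1))
```

A match on the extension only marks a file as a candidate; whether an extent can actually be extracted depends on the file's content.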

As for the distribution of files in the repositories, geoformats are present in repositories of all sizes and follow a distribution similar to that of the total number of files per record (see Fig. 5). Among the repositories with successful extractions, a single successful file is the most common outcome. However, for a few repositories the extraction relies on up to 180 successful file extractions (see Fig. 5). This number of successes is only meaningful when analyzed in the context of each repository and compared with its total number of files and its number of files that potentially hold geospatial information. We extracted geospatial information from records with both low and high proportions of geoformats, and the proportion of successful extractions over the number of geoformats varies just as widely (see Fig. 6). Records with no extraction (i.e., 0% success rate over potential) range from 0% to 100% geoformat files. This suggests that geoextent still has room for improvement in handling ambiguous files to increase the total percentage of successful extractions.
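The two per-record quantities compared in this paragraph, the share of geoformat files and the success rate over those potential files, can be written down as a short sketch. The class and function names are hypothetical and the example counts are invented.

```python
from typing import NamedTuple, Optional


class RecordStats(NamedTuple):
    n_files: int       # all files in the record
    n_geoformats: int  # files in a format that can store geospatial data
    n_success: int     # files with a successfully extracted extent


def pct_geoformats(stats: RecordStats) -> float:
    """Share of the record's files that are potential geospatial files."""
    return 100 * stats.n_geoformats / stats.n_files if stats.n_files else 0.0


def success_over_potential(stats: RecordStats) -> Optional[float]:
    """Success rate among potential files; undefined (None) without any."""
    if stats.n_geoformats == 0:
        return None
    return 100 * stats.n_success / stats.n_geoformats


# Invented examples: few geoformats, only geoformats without success, no geoformats.
examples = [RecordStats(20, 2, 1), RecordStats(8, 8, 0), RecordStats(5, 0, 0)]
for s in examples:
    print(s, pct_geoformats(s), success_over_potential(s))
```

Keeping the second rate undefined for records without geoformats avoids mixing "nothing to extract" with "extraction failed".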

The proportion of success by geoformat indicates that, as expected, ambiguous formats such as CSV and PNG account for a large part of the unsuccessful extractions (see Fig. 7). Because these formats do not necessarily store geographical information (e.g., they can hold anything from survey data to DNA sequences), or the geographical information is not easily detectable (e.g., unexpected column names for latitude and longitude), their content would have to be analyzed manually to build a reliable test dataset and possibly derive more rules for automatic extraction. In contrast, standardized formats for geographical data have a higher extraction success rate, which confirms that these files store geospatial features in a way that is more accessible to other researchers than ambiguous formats. However, popular geospatial formats such as GeoPackage, shapefile, or GeoTIFF have extraction success rates between 43% and 58%, indicating that even standardized formats do not guarantee the availability of all required information (e.g., the coordinate reference system may be missing). Similarly, a powerful format such as NetCDF (.nc) has a low extraction rate (6.6%), which shows that, while it can be used for georeferenced data, the files might lack sufficient metadata, or the format might be used much more often for non-geospatial datasets (see Fig. 7).
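Why CSV files are ambiguous becomes clearer with a name-based heuristic. The sketch below is a simplified illustration and not the rule set used by geoextent: the candidate column names are assumptions, and the coordinates are assumed to be decimal degrees.

```python
import csv
import io

# Assumed candidate names; real files often use other, unexpected labels.
LAT_NAMES = {"lat", "latitude", "y"}
LON_NAMES = {"lon", "long", "lng", "longitude", "x"}


def find_column(header, candidates):
    """First header entry whose normalized name is a known candidate."""
    for name in header:
        if name.strip().lower() in candidates:
            return name
    return None


def csv_bbox(text):
    """Return [min_lon, min_lat, max_lon, max_lat], or None if nothing is detectable."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    lat_col = find_column(header, LAT_NAMES)
    lon_col = find_column(header, LON_NAMES)
    if lat_col is None or lon_col is None:
        return None
    lats, lons = [], []
    for row in reader:
        try:
            lat, lon = float(row[lat_col]), float(row[lon_col])
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric coordinates
        lats.append(lat)
        lons.append(lon)
    if not lats:
        return None
    return [min(lons), min(lats), max(lons), max(lats)]


sites = "site,Latitude,Longitude\nA,51.95,7.60\nB,52.00,7.65\nC,n/a,7.70\n"
sequences = "sample,gc_content\ns1,0.41\ns2,0.39\n"
print(csv_bbox(sites))
print(csv_bbox(sequences))
```

A file whose coordinates sit in columns named, say, "northing" and "easting" would be missed by this heuristic, which is the kind of case that calls for manual inspection and additional rules.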

Finally, the bounding boxes (see Map 1 and Image 1) automatically extracted by geoextent suggest that human verification is required to identify problems with the files (e.g., incorrect or incomplete georeferencing) or with geoextent’s approach (e.g., assuming a default coordinate reference system if it is not clearly defined). Authors and data curators could easily identify common errors, e.g., flipped coordinates or absence of coordinate reference systems.

Image 1. Example of bounding boxes extracted by geoextent: (left) correct extraction, (center) partially correct extraction, and (right) erroneous extraction. Classification after human verification.
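Some of the errors mentioned above can be flagged automatically before a human looks at the map. The following plausibility check is a sketch with a hypothetical function name; it assumes boxes given as [min_lon, min_lat, max_lon, max_lat] in WGS 84 degrees and is not part of geoextent.

```python
def check_bbox(bbox):
    """Return a list of plausibility issues for a WGS 84 bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    issues = []
    lon_ok = -180 <= min_lon <= max_lon <= 180
    lat_ok = -90 <= min_lat <= max_lat <= 90
    if not (lon_ok and lat_ok):
        issues.append("outside WGS 84 degree ranges or min/max reversed")
        # If the box is valid once the axes are swapped, the order is probably flipped.
        if -90 <= min_lon <= max_lon <= 90 and -180 <= min_lat <= max_lat <= 180:
            issues.append("valid after swapping axes: coordinates may be flipped")
    if min_lon == max_lon and min_lat == max_lat:
        issues.append("zero-area box: extent collapses to a single point")
    return issues


plausible = check_bbox([7.58, 51.94, 7.65, 52.00])
flipped = check_bbox([51.94, 140.0, 52.00, 141.0])
projected = check_bbox([395000.0, 5750000.0, 410000.0, 5760000.0])
print(plausible, flipped, projected, sep="\n")
```

The check cannot detect every flip: a box around Münster with swapped axes still lies within valid degree ranges, which is one reason why verification by authors or curators remains necessary.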

In conclusion, we observe that extracting geospatial information from records in a general-purpose research data repository could provide geospatial metadata to aid data discovery. Our approach found potential geospatial information in a substantial share of repositories with different characteristics and successfully extracted geospatial information from various file types. We propose that including geoextent in the data curation pipeline, e.g., by suggesting a bounding box based on the data during record creation, could help researchers and data repositories improve not only the quality of the record metadata but also the understandability of the data, e.g., by encouraging non-ambiguous formats.

Funding

Award1 = {"agency": "German Research Foundation (DFG)", "award_code": "PE1632/17-1", "award_URL": "https://gepris.dfg.de/gepris/projekt/415851837"}

Keywords

keywords=["geospatial", "discovery", "metadata", "repositories", "data sharing"]

Citation

Garzón, Sebastian and Nüst, Daniel, 2021. Exploring Research Data Repositories with geoextent. Accessed 2021-05-14 at https://github.com/o2r-project/geoextent/tree/master/showcase.

Suggested next steps

First, the survey of Zenodo records can be extended to include records based on more search terms, or even all records, and to include larger records. Second, the record retrieval features of geoextent can be extended to additional research data repositories. These can be general-purpose ones, e.g., Figshare or OSF, but also specialised repositories for geospatial data, e.g., Pangaea or GFZ Data Services. For the specialised repositories, the extracted metadata should be compared with the metadata held by the platform. Third, the development of geoextent will continue, e.g., to support more data types, to offer more output options for integration into other tools, or to communicate progress to users or to the tools that include geoextent. In particular, support for more data formats and increased stability of the library could make it possible to integrate geoextent into open source data repository software, such as InvenioRDM (the base software of Zenodo). This would turn geospatial and temporal metadata into regular record-level metadata, which authors can validate when creating new records and which can aid interdisciplinary collaboration through novel connections between datasets.
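For the comparison between extracted extents and the metadata held by a specialised repository, one possible measure is the intersection over union (IoU) of the two bounding boxes. This is a suggestion for illustration, not a method used in this notebook; the function name is hypothetical and both boxes are assumed to be [min x, min y, max x, max y] in the same coordinate reference system.

```python
def bbox_iou(a, b):
    """Intersection over union of two boxes: 1.0 for identical, 0.0 for disjoint."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Invented boxes: one extracted from files, one taken from platform metadata.
extracted = [0.0, 0.0, 2.0, 2.0]
platform = [1.0, 1.0, 3.0, 3.0]
print(bbox_iou(extracted, platform))
```

Areas are computed in coordinate units, so the value is only a rough indicator for boxes in degrees, in particular at high latitudes.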

Table 1. Zenodo records statistics by record type
Type of record	Number of open access records	Number of open access records with geometadata	% of records with geometadata	% of all records with geometadata
publication	997743	13970	1.400	99.850
poster	8397	0	0.000	0.000
presentation	21225	0	0.000	0.000
dataset	75542	19	0.025	0.136
image	601753	0	0.000	0.000
video	3286	0	0.000	0.000
software	53737	1	0.002	0.007
lesson	2555	0	0.000	0.000
physicalobject	28	0	0.000	0.000
other	6168	1	0.016	0.007
record	1770435	13991	0.790	100.000

Table 2. Repository extraction status by parameter
Parameter	% Successful extractions	% No extraction	% geoextent failure	Number of successes	Number of no extractions	Number of geoextent failures
Geospatial extraction	14.42	81.19	4.39	46	259	14
Temporal extraction	2.51	93.10	4.39	8	297	14

Table 3. Files extraction status by parameter
Parameter	% Successful extractions	% No extraction
Geospatial extraction	2.13	97.87
Temporal extraction	0.08	99.92

Table 4. Geospatial extraction by files
Statistic	Geospatial extraction
Total number of files	25955
Number of files with potential	12954
% Files with potential	49.91
% Successful extractions over potential	4.28
% No extractions over potential	95.72

Table 5. Geospatial extraction by repositories
Statistic	Geospatial extraction
Total repositories	319
Repositories explored	301
Number of repositories with potential	164
% Repositories with potential	51.41
% Successful extractions over potential	28.05
% No extractions over potential	71.95

Table 6. Distribution of the number of files by repositories
Statistic	Files in repository	Potential files with geoextent	Files with geoextent
Mean	86.23	43.04	1.84
Standard deviation	457.79	400.70	11.62
Min	1	0	0
25%	3	0	0
50%	12	1	0
75%	31	4	0
Max	6515	6515	180

EarthCube 2021 Call for Notebooks


Setup

Library import

Local library import

Parameter definitions

Data import

Data processing and analysis

geoextent usage

Supported file types

Individual files

Multiple files

Data repositories

Command-line interface

Case study

Zenodo geometadata

Table 1. Zenodo records statistics by record type

Collect data

Analysis

Load data

Extraction results

Figure 1. Repository extraction status by parameter

Table 2. Repository extraction status by parameter

Figure 2. Files extraction status by parameter

Table 3. Files extraction status by parameter

Figure 3. Geospatial extraction status by potential files

Table 4. Geospatial extraction by files

Figure 4. Geospatial extraction status by repositories with potential files

Table 5. Geospatial extraction by repositories

Figure 5. Files distribution over repositories

Table 6. Distribution of the number of files by repositories

Figure 6. Percentage of potential geospatial files and success rate of extraction

Figure 7. Success rate of extraction by file format

Visualization of extracted geospatial extents

Map 1. Extracted bounding boxes visualization

References