cf-xarray: Scale your analysis across datasets with less data wrangling and more metadata handling¶
Author(s)¶
Author1 = {“name”: “Mattia Almansi”, “affiliation”: “National Oceanography Centre”, “email”: “mattia.almansi@noc.ac.uk”, “orcid”: “0000-0001-6849-3647”}
Author2 = {“name”: “Deepak Cherian”, “affiliation”: “National Center for Atmospheric Research”, “email”: “deepak@cherian.net”, “orcid”: “0000-0002-6861-8734”}
Author3 = {“name”: “Pascal Bourgault”, “affiliation”: “Ouranos Inc.”, “email”: “bourgault.pascal@ouranos.ca”, “orcid”: “0000-0003-1192-0403”}
Table of Contents
Purpose¶
This notebook demonstrates how the cf-xarray Python package (Cherian et al, 2021) helps climate data scientists to process several CF-compliant datasets from a variety of sources. Under the hood, CF-xarray decodes and makes use of the widely adopted Climate and Forecast (CF) conventions. Therefore, workflows integrating cf-xarray do not need knowledge of arbitrary dataset-specific metadata such as variable names.
Xarray (Hoyer et al, 2021) is a Python package that enables easy and convenient labelled data analytics by allowing users to leverage metadata such as dimension names and coordinate labels. Xarray provides two core data structures:
DataArray: a container wrapping a single multidimensional array with metadata.
Dataset: a dict-like container of multiple DataArrays.
cf-xarray uses Xarray’s plugin interface, or “accessor”, to provide extensive
functionality on both Datasets and DataArrays under the .cf
namespace.
For example, the zonal average of an Xarray Dataset ds
is seamlessly
calculated as ds.cf.mean("longitude")
on any CF-compliant dataset, regardless
of the actual name of the “longitude” variable (e.g., "lon"
, "lon_rho"
,
"long"
, …).
Technical contributions¶
Development of cf-xarray, an extension library that
adds awareness of the Climate and Forecast (CF) conventions to core Xarray functionality.
Provides utility functions with minimal dependencies, allowing easy integration of CF-aware functionality in other packages such as xesmf (for regridding).
Methodology¶
This notebook can be executed in a pre-configured interactive environment at: https://binder.pangeo.io/v2/gh/malmans2/cf-xarray-earthcube/main?filepath=DC_01_cf-xarray.ipynb
Results¶
This notebook contains use case examples demonstrating the following functionalities of cf-xarray:
Seamless analysis of various CF-compliant datasets.
Standardization of datasets to comply with CF conventions.
Inference of grid-cell coordinates and bounds using CF conventions.
Integration with other libraries (xESMF).
Funding¶
N/A
Keywords¶
Include up to 5 keywords, using the template below.
keywords=[“cf-conventions”, “xarray”, “netcdf”]
Citation¶
Almansi, M. and Cherian, D. and Bourgault, P. (2021). cf-xarray: Scale your analysis across datasets with less data wrangling and more metadata handling. 2021 EarthCube Annual Meeting. Accessed 14/5/2021 at https://github.com/malmans2/cf-xarray-earthcube
Work In Progress - improvements¶
N/A
Suggested next steps¶
It is common for datasets to not be perfectly CF-compliant. Here we work around
these deficiencies using assign_coordinates_and_cell_measures
. cf_xarray
is
considering adding heuristics to guess such metadata attributes, possibly using
other metadata conventions such as SGRID.
Acknowledgements¶
We acknowledge contributions from all cf-xarray contributors: https://github.com/xarray-contrib/cf-xarray/graphs/contributors. We also acknowledge MetPy for providing inspiration and various criteria for identifying CF variables. Discussion with Jon Thielen was instrumental in development of cf-xarray. We also acknowledge the Pangeo project for collating and providing an immense amount of datasets on the cloud that motivated this work, as well as provoking, enabling, and fostering discussions that led to the development of this project.
Setup¶
Library import¶
# The package demonstrated here
import cf_xarray as cfxr
# For parallelization
import dask
# For loading shared data
import intake
# Visualizations
import matplotlib.pyplot as plt
# For basic data manipulation
import numpy as np
import xarray as xr
# For regridding
import xesmf as xe
# silence a minor warning
dask.config.set(**{"array.slicing.split_large_chunks": False})
Parameter definitions¶
# Paths and urls pointing to data
MOM6_GRID_PATH = "./data/ocean_grid_sym_OM4_05.nc"
MOM6_DATA_URL = "http://35.188.34.63:8080/thredds/dodsC/OM4p5/ocean_monthly_z.200301-200712.nc4"
CMIP6_OCE_CATALOG = "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
CMIP6_OCE_EXPERIMENT = dict(
table_id="Omon",
grid_label="gn",
source_id="ACCESS-CM2",
# source_id="GFDL-CM4",
member_id="r1i1p1f1",
experiment_id=["historical"],
variable_id=["thetao", "volcello"],
)
CMIP6_ICE_PATH = "data/sic_SImon_CCCma-CanESM5_ssp245_r13i1p2f1_2020.nc"
NCEP_PATH = "data/air_temperature.nc"
Data import¶
This notebook uses a variety of publicly available data:
ds_mom6
: Data from the Modular Ocean Model - v6 (Adcroft et al, 2019)ds_cmip6_oce
: Ocean Data from the Climate Model Intercomparison Project - Phase 6ds_cmip6_ice_ice
: Ice Data from the Climate Model Intercomparison Project - Phase 6ds_ncep
: Data from the National Centers for Atmospheric Prediction Reanalysis (Kalnay et al, 1996)
def assign_coordinates_and_cell_measures(ds):
"""
Functions to add missing CF metadata (coordinates and cell measures).
Fully CF-compliant datasets do not need this pre-processing.
Functions to automatically assign missing coordinates
and measures metadata will be implemented in cf_xarray.
See https://github.com/xarray-contrib/cf-xarray/issues/201
Parameters
----------
ds: xarray.Dataset
Dataset to modify
"""
for varname, variable in ds.data_vars.items():
# Add coordinates attribute when the dimensions
# of a coordinate variable are a subset of those of
# a variable
coordinates = []
for coord in sum(ds.cf.coordinates.values(), []):
if set(ds[coord].dims) <= set(variable.dims):
coordinates.append(coord)
# sets an attribute like "geolon geolat"
if coordinates:
variable.attrs["coordinates"] = " ".join(coordinates)
else:
variable.attrs.pop("coordinates", None)
# Add cell_measures attribute when appropriate measures are available
cell_measures = {}
possible_measures = {
"cell_thickness",
"cell_area",
"ocean_volume",
} & set(ds.cf.standard_names)
for stdname in possible_measures:
key = stdname.split("_")[-1]
value = ds.cf.standard_names[stdname]
for measure in value:
if (
set(ds[measure].dims) <= set(variable.dims)
and measure != varname
):
cell_measures[key] = measure
if cell_measures:
# sets an attribute like "area: areacello volume: volcello"
variable.attrs["cell_measures"] = " ".join(
[f"{k}: {v}" for k, v in cell_measures.items()]
)
else:
variable.attrs.pop("cell_measures", None)
MOM6¶
Read grid and data variables for a MOM6 ocean model simulation
# Open grid and data variables, then merge them together to one dataset
grid = xr.open_dataset(MOM6_GRID_PATH, chunks={})
ds = xr.open_dataset(
MOM6_DATA_URL,
chunks={"time": 1},
)
ds_mom6 = xr.merge([grid, ds], compat="override")
# Illustrate the equivalent of a curvilinear grid case,
# where axes and coordinates are different
axes = ["xh", "xq", "yh", "yq"]
ds_mom6 = ds_mom6.drop_vars(axes)
ds_mom6 = ds_mom6.assign_coords({axis: ds_mom6[axis] for axis in axes})
ds_mom6 = ds_mom6.set_coords(
[
var
for var in ds_mom6.variables
for prefix in ["geo"]
if var.startswith(prefix)
]
)
assign_coordinates_and_cell_measures(ds_mom6)
CMIP6 Ocean¶
Read a historical CMIP6 simulation from the ACCESS-OM2 climate modelling system.
# Use intake-esm to access data on Pangeo Cloud
col = intake.open_esm_datastore(CMIP6_OCE_CATALOG)
cat = col().search(**CMIP6_OCE_EXPERIMENT)
ddict = cat.to_dataset_dict(
zarr_kwargs={
"consolidated": True,
"decode_times": True,
"use_cftime": True,
}
)
_, ds_cmip6_oce = ddict.popitem()
assign_coordinates_and_cell_measures(ds_cmip6_oce)
CMIP6 Ice¶
ds_cmip6_ice = xr.open_dataset(CMIP6_ICE_PATH)
NCEP¶
ds_ncep = xr.tutorial.open_dataset(NCEP_PATH)
Data processing and analysis¶
Overview of cf-xarray¶
cf-xarray uses Xarray’s plugin interface, or “accessors”, to provide extensive
functionality on both Datasets and DataArrays under the .cf
namespace.
.cf
is an entrypoint for cf-xarray functionality¶
When cf_xarray
is imported, the cf
accessor is automatically added to any
xarray
object. cf-xarray is able to wrap most of Xarray’s functions. The repr
for Dataset.cf
prints out a list of detected “CF names” and the corresponding
dataset-specific variable names. cf-xarray parses the attributes associated with
each variable in the Dataset to build these mappings
# After `import cf_xarray`, the cf_xarray accessor has been added to the xarray object
ds_mom6.cf
Use CF metadata in standard xarray methods¶
With standard xarray
syntax, one would specify the variable names on the
right-hand side of the mappings printed above. With cf_xarray, one can instead
use the standardized “CF names” on the left-hand side.
For example, the next two cells show two ways of calculating an average along the vertical dimension
# xarray way:
ds_mom6.mean(["z_i", "z_l"])
# cf_xarray way:
# By calling .cf.mean we can provide "Z" which is then rewritten to ["z_i", "z_l"]
# This statement is entirely equivalent to ds_mom6.mean(["z_i", "z_l"])
ds_mom6.cf.mean("Z")
cf_xarray
knows that z_i
and z_l
correspond Z
axes because the two
variables have a CF-compliant attribute cartesian_axis: Z
. A full list of
criteria used by cf-xarray is documented
here.
for var_name in ["z_i", "z_l"]:
print(f"{var_name}: {ds_mom6[var_name].attrs['cartesian_axis']}")
Dictionaries mapping CF keys to variable names¶
The example object contains variables lying on staggered grids. Therefore, a CF
key can be associated with multiple variables. cf_xarray
provides several
properties that return dictionaries mapping CF keys to lists of variable names,
such as:
.cf.axes
.cf.coordinates
.cf.cell_measures
.cf.standard_names
.cf.bounds
# maps "axes" to variable names
ds_mom6.cf.axes
# maps CF standard name to variable name
ds.cf.standard_names
Dealing with incomplete metadata¶
The usefulness of cf-xarray fully depends on the amount of CF-compliant metadata present in a dataset. Although many datasets have incomplete metadata, in most cases one can guess appropriate metadata by looking at variable names.
We can use Dataset.cf.guess_coord_axis
to identify, guess, and add missing CF
metadata for “axes” (“X”, “Y”, “Z”, “T”) and “coordinates” (“latitude”,
“longitude”, “time”). It does so by using regular expressions to parse variable
names and make reasonable guesses.
We also demonstrate the verbose
mode so that the user can double check
cf-xarray’s inferences.
First, note that no X
or Y
axes variables have been detected in the current
dataset:
# The "axes" property maps axes names 'X', 'Y', 'Z', 'T' to variable names in the dataset
# here the metadata only identify 'Z' and 'T'
ds_mom6.cf.axes
Now we will ask cf_xarray
to autoguess more axes:
ds_mom6 = ds_mom6.cf.guess_coord_axis(verbose=True)
# The `X` and `Y` axes variables that have been detected are sensible!
ds_mom6.cf.axes
Indexing using CF keys¶
CF metadata precisely describes the physical quantities being represented by all variables. More importantly, CF conventions also describe links between different variables in a dataset.
Here we examine the “sea floor depth” variable. First, we pick it out using standard xarray syntax:
xr_da = ds_mom6["deptho"]
xr_da
Note that, in the output above, the cell_methods
attribute indicates that the
areacello
variable contains the appopriate “cell area” for this variable (see
under “Attributes”). However, this variable is not associated with the DataArray
(see under “Coordinates”).
Now we pick out the variable using the cf
accessor and the appropriate
standard name:
cf_da = ds_mom6.cf["sea_floor_depth_below_geoid"]
cf_da
Now notice that areacello
is present under “Coordinates”. This is because
cf_xarray
decodes CF metadata linking variables with each other (e.g.,
coordinates
, cell_measures
, ancillary_variables
).
As opposed to xr_da
, cf_da
extracted in the previous cell contains all
cell_measures
associated with the variable extracted.
additional_coords = set(cf_da.coords) - set(xr_da.coords)
print("Cell measure extracted by cf_xarray:", additional_coords)
Automagic plotting¶
cf_xarray
automagically sets some optional keyword arguments for plotting
functions. As opposed to xarray
, in the example below cf_xarray
assigns the
appropriate coordinates to the plot axes (i.e., longitude and latitude).
cf_xarray
does so by parsing the "coordinates"
attribute to identify the
appropriate latitude and longitude variables:
fig, (xr_ax, cf_ax) = plt.subplots(1, 2, figsize=(12, 4))
# left: xarray plot
xr_da.plot(ax=xr_ax)
xr_ax.set_title("xarray")
# right: cf_xarray plot
cf_da.cf.plot(ax=cf_ax)
cf_ax.set_title("cf_xarray")
fig.tight_layout()
CF keys expansion¶
As mentioned above, the example dataset is characterized by multiple dimensions
associated with the same spatial axes. Such information is decoded by
cf_xarray
and is used under the hood of wrapped functions. In the example
below, the CF Axes keys (i.e., “X”, “Y”, and “Z”) are expanded and multiple
dimensions are sliced at once:
ds_mom6_sliced = ds_mom6.cf.isel(
X=slice(10), Y=slice(10), Z=slice(10), T=slice(10)
)
print("Original dataset sizes:", dict(ds_mom6.sizes))
print(" Sliced dataset sizes:", dict(ds_mom6_sliced.sizes))
Use case examples¶
Seamlessly extract statistics from CF-compliant datasets¶
cf_xarray
allows one to use the same code on a wide variety of CF-compliant
objects that each has their own nomenclature.
There are two approaches to leveraging cf-xarray in applications.
Access variables through the .cf
interface¶
In the example below, we define a function that uses many cf_xarray
features,
then we apply to objects with different dimension and coordinate names. All
cf_xarray functionality is accessed using the .cf
accessor
def compute_top_10m_temp_anomaly(ds):
"""
Compute the volume weighted temperature anomaly from the climatology.
Parameters
----------
ds: xarray.Dataset
Dataset to analyze
Returns
-------
DataArray
"""
# Compute and plot line
with xr.set_options(keep_attrs=True):
# Extract ocean potential temperature
da = ds.cf["sea_water_potential_temperature"]
# Fill missing cell volumes with zeros
da = da.cf.assign_coords(volume=da.cf.coords["volume"].fillna(0))
# Select temperature in the top 10m in 2003
da = da.cf.sel(T="2003", Z=slice(0, 10))
# Compute volume-weighted mean temperature
da = da.cf.weighted("volume").mean(["X", "Y", "Z"])
# Calculate an anomaly relative to the time mean
da = da - da.cf.mean("T")
# Update metadata
da.attrs["standard_name"] += "_anomaly"
da.attrs["long_name"] += " Anomaly"
return da.squeeze(drop=True)
# Run the function on two different datasets and compare the results
compute_top_10m_temp_anomaly(ds_mom6).cf.plot(label="ds_mom6")
compute_top_10m_temp_anomaly(ds_cmip6_oce).cf.plot(label="ds_cmip6_oce")
_ = plt.legend()
Standardize datasets using cf_xarray¶
Alternatively, cf_xarray
provides utility functions to rename variables and
dimensions in one object to match another object. Matching variables/dimensions
are determined using CF metadata. One might choose to use this approach of
standardizing datasets prior to passing them through a data processing pipeline.
Here we illustrate the rename_like
feature.
cf_xarray also supports renaming datasets through .cf.rename
mom6_da = ds_mom6.cf["sea_water_potential_temperature"]
cmip6_da = ds_cmip6_oce.cf["sea_water_potential_temperature"]
renamed_mom6_da = mom6_da.cf.rename_like(cmip6_da)
print(" MOM6 dimensions:", mom6_da.dims)
print(" CMIP6 dimensions:", cmip6_da.dims)
print("renamed MOM6 dimensions:", renamed_mom6_da.dims)
Regridding with xESMF¶
cf-xarray
is used by xESMF, a
regridding package wrapping the powerful Fortran libray “ESMF”. This example
will regrid sea ice concentration data extracted from the CCCma-CanESM5 model, a
participant in the CMIP6 experiment.
Our original sea ice data is on a tripolar grid. The target grid is a regular grid used by the NCEP reanalysis dataset that ships with Xarray. This regridding problem requires providing grid cell corners to xESMF in a specific CF-compatible format. Here we illustrate utility functions provided by cf_xarray to make this task easy and convenient.
# Let's look at the grid shape itself and the data for one time step
fig, axs = plt.subplots(ncols=2, figsize=(12, 4))
# Notice how with .cf we will use the same keyword arguments
# Although here we explicitely pass the coordinate standard names for the plot axes,
# cf_xarray default scatter plot would produce the same results.
scatter_kwargs = dict(x="longitude", y="latitude", s=0.1)
# CMIP6: Input grid
ds_cmip6_ice.cf.plot.scatter(**scatter_kwargs, ax=axs[0])
axs[0].set_title(
"The input horizontal grid points as seen on a lat/lon map."
"\nOnly the northern hemisphere is shown."
)
axs[0].set_ylim(0, 90)
# NCEP: Target grid
ds_ncep.cf.plot.scatter(**scatter_kwargs, ax=axs[1])
axs[1].set_title("The target horizontal grid points")
axs[1].set_ylim(0, 90)
fig.tight_layout()
We will regrid the sea ice data using the “conservative” method. Please refer to
the xESMF documentation for details about the
different algorithms.
The important information here is that the “conservative” regridding algorithms
requires the grid points coordinates, but also the grid corners coordinates.
While the target grid doesn’t provide them, they are easily computable with the
help of cf_xarray
:
# Make a reasonable guess of the bounds of the spatial coordinates:
ds_ncep = ds_ncep.cf.add_bounds(["latitude", "longitude"])
ds_ncep
This was easy since the grid is regular (i.e., latitude and longitude are 1D). Inferring bounds of 2D grids is not yet supported by cf-xarray.
Luckily, our sea ice data includes the corner coordinates in the
vertices_latitude
and vertices_longitude
variables. However, xESMF expects a
format that is different from the CF convention followed here. No worries,
cf_xarray
has an helper method just for this:
# Get the bounds variable and convert them to "vertices" format
# Order=None, means that we do not know if the bounds are
# listed clockwise or counterclockwise, so we ask cf_xarray to try both.
lat_corners = cfxr.bounds_to_vertices(
ds_cmip6_ice.vertices_latitude, "vertices", order=None
)
lon_corners = cfxr.bounds_to_vertices(
ds_cmip6_ice.vertices_longitude, "vertices", order=None
)
# We are using special variable names "lon_b" and "lat_b" for easier detection by xESMF
ds_in = ds_cmip6_ice.assign(lon_b=lon_corners, lat_b=lat_corners)
ds_in
Finally, the regridding is performed with xESMF. Under the hood, it uses cf-xarray to get the coordinates and their bounds, so we do not need to worry about renaming. The only exception is the input grid’s corners, where we used hardcoded variable names because of the difference between xESMF’s and CF’s syntaxes for 2D grid bounds.
# Regrid
regridder = xe.Regridder(ds_in, ds_ncep, "conservative")
sic_reg = regridder(ds_in.cf["sea_ice_area_fraction"])
# Plot the results
sic_reg.isel(time=0).plot()
_ = plt.title("Regridded sic data (Jan 2020)")
Conclusion¶
This notebook was a quick walkthrough of a few core cf-xarray features that enable scaling analysis pipelines across datasets. It is fairly common for datasets to not have consistent terminology, and be imperfectly tagged with CF attributes. cf_xaray both allows you to leverage the presence of attributes, and provides utility functions to quickly fix imperfect tagging. For more see the documentation, specifically the introductory notebook.
References¶
Adcroft, Alistair, Whit G Anderson, V Balaji, Chris Blanton, Mitchell Bushuk, Carolina O Dufour, John P Dunne, Stephen M Griffies, Robert Hallberg, Matthew J Harrison, Isaac M Held, Malte Jansen, Jasmin G John, John P Krasting, Amy R Langenhorst, Sonya Legg, Zhi Liang, Colleen McHugh, Aparna Radhakrishnan, Brandon G Reichl, Anthony Rosati, Bonita L Samuels, Andrew Shao, Ronald J Stouffer, Michael Winton, Andrew T Wittenberg, Baoqiang Xiang, Niki Zadeh, and Rong Zhang, October 2019: The GFDL Global Ocean and Sea Ice Model OM4.0: Model Description and Simulation Features. Journal of Advances in Modeling Earth Systems, 11(10), DOI:10.1029/2019MS001726.
Cherian, Deepak, Almansi, Mattia, Bourgault, Pascal, keewis, Kent, Julia, Thyng, Kristen, … Chauhan, Subhendra Singh. (2021, May 11). xarray-contrib/cf-xarray: (Version v0.5.2). Zenodo. http://doi.org/10.5281/zenodo.4749736
Hoyer, Stephan, Hamman, Joe, Roos, Maximilian, keewis, Cherian, Deepak, Fitzgerald, Clark … Bovy, Benoit. (2021, May 19). pydata/xarray: v0.18.2 (Version v0.18.2). Zenodo. http://doi.org/10.5281/zenodo.4774304
Kalnay, E., Kanamitsu, M., Kistler, R., Collins, W., Deaven, D., Gandin, L., Iredell, M., Saha, S., White, G., Woollen, J., Zhu, Y., Chelliah, M., Ebisuzaki, W., Higgins, W., Janowiak, J., Mo, K. C., Ropelewski, C., Wang, J., Leetmaa, A., Reynolds, R., Jenne, R., & Joseph, D. (1996). The NCEP/NCAR 40-Year Reanalysis Project, Bulletin of the American Meteorological Society, 77(3), 437-472.