Frequency-Domain Analysis of Large Datasets

Author(s)

Paige Martin and Ryan Abernathey

  • Author1 = {"name": "Paige Martin", "affiliation": "Australian National University/Lamont-Doherty Earth Observatory", "email": "paigemar@umich.edu", "orcid": "0000-0003-3538-633X"}

  • Author2 = {"name": "Ryan Abernathey", "affiliation": "Lamont-Doherty Earth Observatory", "email": "rpa@ldeo.columbia.edu", "orcid": "0000-0001-5999-4917"}

Purpose

Climate model datasets are typically stored as global snapshots, i.e. chunked in time rather than space. For many workflows this chunking works well (e.g. computations across the spatial domain at every point in time). However, this storage format creates serious challenges for processing long time series at each point in space, as is the case for frequency-domain analysis. For large datasets with frequent (e.g. daily) output, it is not feasible to simply load and process the full time series at each spatial point, even with the help of distributed computing frameworks such as Dask.

This notebook provides an example scientific workflow for performing frequency-domain analysis on large datasets. Specifically, this notebook presents a workflow for computing the power spectrum of sea surface temperature in the Community Earth System Model (CESM). While we carry out computations on a specific model, the main goal of this notebook is not to interpret scientific results from our computations, but rather to provide a working example of frequency-domain analysis on large datasets that others could follow.

Because the goal of this notebook is to provide an example of a workflow that works on large datasets, we have chosen to use a large dataset (CESM) that is available on Pangeo Cloud. This notebook is therefore developed for a Jupyter Hub environment that can access data stored on the Pangeo Cloud (e.g. PangeoBinder).

Technical contributions

  • demonstrates how to quickly rechunk data, e.g. from chunks in time to chunks in space, using the package Rechunker

  • demonstrates how to easily perform Fourier analysis using the package xrft

  • shows that all of these steps, with the use of Xarray and Dask, can be taken with large datasets

Methodology

The notebook follows three main steps:

  1. Rechunk the data. We begin by rechunking the CESM sea surface temperature (SST) output from global, daily snapshots to chunks in space and 5-year chunks in time. This step is accomplished using the package rechunker.

  2. Fourier analysis. Next we compute the power spectrum of SST using the package xrft, which nicely integrates with Xarray and Dask. Within xrft, we are also able to easily apply detrending and windowing functions to our data, and also account for the fact that our data are real (with no imaginary components).

  3. Visualize the data. Last, we average over various frequency bands to show the spatial distribution of the SST power spectrum as global maps.

Between each of these steps, we would typically write out the intermediate data. Specifically, we would write out the rechunked data, as well as the processed power spectra. Being able to write out data at intermediate steps is crucial to this workflow. However, due to the inability to write out data from Binder, we instead import previously rechunked data and use a spatial subset to compute and plot the power spectrum in this notebook. We still include all code necessary to run every step if a user wishes to run this notebook elsewhere that allows for data to be written out.

Results

This notebook presents a feasible example for performing frequency-domain analysis on large datasets. Specifically, this notebook demonstrates how to quickly rechunk ~500GB of data from chunking in time to chunking in space. It also demonstrates how to pair the xrft library with the rechunker library to perform frequency-domain spectral analysis (here power spectra) and obtain interpretable results. We finish with a few sample plots to round out the workflow. This notebook is meant to serve as an example for others who wish to perform similar types of analysis.

Funding

  • Award1 = {"agency": "Gordon and Betty Moore Foundation", "award_code": "", "award_URL": "https://www.moore.org"}

Keywords

keywords=["frequency-domain", "Pangeo", "spectral analysis", "rechunking", "cloud computing"]

Citation

Martin and Abernathey 2021. Frequency-Domain Analysis of Large Datasets. Accessed at https://github.com/paigem/EC2021_Martin_and_Abernathey.

Acknowledgements

We thank the Pangeo community for developing and maintaining most of the packages used in this notebook. We also acknowledge Pangeo Cloud, which provides the computing power for this analysis.

Setup

Library import

# Reading in data
import intake
import gcsfs
import zarr
import os

# Data manipulation
import xarray as xr
import dask.array as dsa
from rechunker import rechunk
import xrft

# Distributed computing
from dask_gateway import Gateway
from dask.distributed import Client

# Visualization
import matplotlib.pyplot as plt

Data import, processing, and analysis

Step 1: Open and rechunk the original data

This dataset is stored on the Pangeo Cloud and contains daily output for 41 model years.

Access the data

# Access CESM POP2 control output from Pangeo Cloud data catalog
cat = intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean/CESM_POP.yaml")
item = cat['CESM_POP_hires_control']  # Specify CESM high resolution control run
ds_orig = item.to_dask().reset_coords(drop=True)  # drop unneeded coordinates for efficiency
ds_orig

We are interested in working with SST. Let's make a quick plot to see what the data look like; coarsening the field first makes the plot much faster to render.

ds_orig.SST[0].coarsen(nlat=5, nlon=5).mean().plot(figsize=(18, 10))
plt.title(r'SST snapshot ($^{\circ}$C)', fontsize=18)

Note the chunk structure of the original data: contiguous in the spatial dimension and chunked in the time dimension. This is not optimal for frequency-domain spectral analysis.

ds_orig.SST
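A quick way to confirm this layout is to inspect the underlying Dask array directly. The attributes below are standard Dask/Xarray properties; the exact values will depend on the catalog version.

# Inspect the chunk layout and size of the SST variable
print(ds_orig.SST.data.chunksize)      # shape of a single chunk: (time, nlat, nlon)
print(ds_orig.SST.nbytes / 1e9, 'GB')  # approximate total size of the SST variable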

Rechunk the data

Rechunker allows us to transform the chunk structure of the dataset using Dask distributed computing, so that each chunk holds a long time series over a subset of spatial points, which is what the frequency-domain analysis below requires. Since we are only interested in specific variables, we rechunk one variable (in this case SST) at a time.

Note: Rechunking requires the ability to write hundreds of GB to cloud storage. If you are running this notebook in a binder, your environment has read access to Google Cloud Storage, but not write access. (Providing anonymous write access would be a major security threat.) Therefore, below we display the code needed for rechunking in a markdown cell, rather than an executable cell. Instead, in Step 2 below we point to a rechunked dataset that has previously been written and then continue with the Fourier transform steps.

Users running on Pangeo Cloud should be able to run this code with no changes, since credentials are automatically populated. Users who have their own Google Cloud account could modify this code by providing valid credentials and a path to their own storage buckets.

# Open data as Zarr group
gcs = gcsfs.GCSFileSystem(requester_pays=True) # Connect to Google Cloud Storage (GCS)
mapper = gcs.get_mapper(item.urlpath) # get_mapper() returns a key-value store interface to the data's location on GCS
zgroup = zarr.open_consolidated(mapper) # Open the dataset in Zarr format

# Select SST variable
varname = 'SST' # name of variable you wish to rechunk (SST in this case)
array = zgroup[varname] # select only the SST variable from the Zarr group

# Write intermediary data to a temporary path. Items stored here get deleted every 7 days.
scratch_path = os.environ['PANGEO_SCRATCH']

# Define options needed by the `rechunk()` function of `rechunker`.
max_mem = '1GB' # maximum memory rechunker may use per worker while rechunking
target_chunks = (365*5, 90, 180) # 5-year chunks in time, 90 x 180 point chunks in space

# Set paths to temporary and target storage locations and set names of output data
tmp_path = f'{scratch_path}/CESM_POP_hires_control/{varname}_tmp.zarr'
target_path = f'paigem-pangeo/CESM_POP_hires_control/{varname}_target.zarr'
# Delete arrays of the same name that already exist at those paths
def clear_targets():
    for path in tmp_path,target_path:
        try:
            gcs.rm(path + '/.zarray')
        except FileNotFoundError:
            pass
clear_targets()
# Create mappings to temporary and target storage buckets
store_tmp = gcs.get_mapper(tmp_path)
store_target = gcs.get_mapper(target_path)
# Create the rechunking plan by calling `rechunk()`. Displaying `r` shows the source, intermediate, and target arrays.
r = rechunk(array, target_chunks, max_mem,
            store_target, temp_store=store_tmp, executor='dask')
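If you are running this notebook in an environment with local disk but without cloud write access, one possible adaptation (not part of the workflow above) is to point rechunker at local Zarr stores instead of GCS buckets; rechunker accepts plain path strings for its stores, and the paths below are illustrative placeholders.

# Illustrative alternative: write the intermediate and target stores to local disk
# (placeholder paths; the full SST variable needs several hundred GB of free space)
local_tmp_path = 'SST_tmp.zarr'
local_target_path = 'SST_target.zarr'
r_local = rechunk(array, target_chunks, max_mem,
                  local_target_path, temp_store=local_tmp_path, executor='dask')
r_local.execute(retries=10)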

(Set up Dask cluster)

Note that to successfully run rechunk() it is necessary to spin up a Dask cluster. To see how we do so on Pangeo Cloud using Dask Gateway, see the following section on computing the power spectrum.

Execute the rechunk() function

Executing this function will perform the rechunking and save the rechunked SST as intermediate data in the storage bucket designated by target_path defined above.

r.execute(retries=10) # `retries=10` sets the number of times a Dask worker will retry if computing fails (default is 0)

Step 2: Compute the power spectrum of SST

The power spectrum is defined as follows, where a hat denotes a Fourier transform and the star denotes a complex conjugate.

\[ \widehat{SST}^* \widehat{SST} \]
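For a single time series this definition reduces, up to normalization and before the detrending and windowing applied below, to the squared magnitude of the Fourier coefficients. A minimal NumPy sketch for one grid point, with synthetic data standing in for SST, looks like this:

import numpy as np

# Synthetic stand-in for a 10-year daily SST time series at one grid point
sst_point = np.random.randn(3650)
sst_hat = np.fft.rfft(sst_point)                # Fourier transform along time (real FFT)
power = (np.conj(sst_hat) * sst_hat).real       # SST_hat* x SST_hat, i.e. |SST_hat|^2
freqs = np.fft.rfftfreq(sst_point.size, d=1.0)  # frequencies in cycles per day (daily output)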

Here we read in a previously rechunked SST field, which was obtained by following the same steps as shown above.

# Read previously rechunked data from my GCS storage bucket
gcs = gcsfs.GCSFileSystem(requester_pays=True)
varname = 'SST' 
target_path = f'pangeo-paigem/CESM_POP_hires_control/{varname}_target.zarr'
T_rechunked = dsa.from_zarr(gcs.get_mapper(target_path)) # open lazily as a Dask array
T_rechunked

Work with a 10-year subset of the rechunked data

Though we only show the workflow for one 10-year subset below, when running this workflow for scientific output we compute the power spectrum over several (in this case 10-year) windows. Averaging over all windows in the final plots smooths the inherently noisy spectral estimate and improves the statistics.

# Create 10-year subset over first 10 years of output
yr1 = 0
yr2 = 10

T_rechunked = T_rechunked[365*yr1:365*yr2,:,:]

# Convert from dask to xarray DataArray
Txr_rechunked = xr.DataArray(T_rechunked,dims=['time','nlat','nlon'])
Txr_rechunked

Define function to take power spectrum

def take_power_spectrum(var,real_arg):
    
    var = var.chunk({'time': -1}) # the FFT requires a single chunk along the time dimension
    var_filled = var.fillna(0) # fill NaNs with zeros
    
    # Take power spectrum in the time domain, setting time to be a real dimension, with both a linear detrend and a windowing function
    var_hat = xrft.power_spectrum(var_filled,dim='time',real=real_arg,detrend='linear',window=True)
    
    return var_hat

Compute the power spectrum of SST

# Call the function defined above to compute the power spectrum of SST
T_calc_power_spectrum = take_power_spectrum(Txr_rechunked, 'time')

# Take only the real part, and immediately coarsen to a 0.5 degree grid to reduce memory usage
T_power_spectrum = T_calc_power_spectrum.real.coarsen(nlat=5, nlon=5).mean()
T_power_spectrum
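As noted earlier, for the full analysis we would repeat this calculation over several consecutive 10-year windows and average the resulting spectra. A rough sketch of that loop is shown below; T_full and n_windows are illustrative names, and the loop assumes the full-length rechunked array rather than the 10-year subset above.

# Illustrative sketch: average the power spectrum over consecutive 10-year windows
T_full = dsa.from_zarr(gcs.get_mapper(target_path))  # full-length rechunked SST
n_windows = 4  # e.g. the first 40 of the 41 model years, split into 10-year windows
spectra = []
for w in range(n_windows):
    T_window = xr.DataArray(T_full[365*10*w:365*10*(w + 1), :, :],
                            dims=['time', 'nlat', 'nlon'])
    spec = take_power_spectrum(T_window, 'time').real.coarsen(nlat=5, nlon=5).mean()
    spectra.append(spec)
T_power_spectrum_mean = xr.concat(spectra, dim='window').mean('window')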

Load Some Results

So far, all of our calculations have been “lazy”. No computation has actually happened yet. In order to load some data, we will subset the data to a size that can fit into our notebook’s memory. We select a region in the North Atlantic.

region = dict(nlat=slice(290, 380), nlon=slice(50, 220))
T_power_spectrum_NAtl = T_power_spectrum.isel(**region)
T_power_spectrum_NAtl
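Before triggering the computation, it can be useful to check how large this subset will be once loaded; nbytes reports the size of the computed result, not of the task graph.

# Estimated in-memory size of the subset once computed
print(f'{T_power_spectrum_NAtl.nbytes / 1e9:.2f} GB')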

Start Dask cluster

Now that we are ready to compute, we will start a Dask cluster. If running this notebook on Pangeo Binder, the cluster will spin up on Pangeo's Cloud allocation. We use Dask Gateway here (as opposed to, e.g., Dask Kubernetes) because it provides flexible cluster scaling and is the Dask deployment library installed in the Pangeo Cloud infrastructure; the notebook can also be run with other Dask configurations. We assign each worker 8 GB of memory and allow up to 20 workers via adaptive scaling. Note that the cluster may take several minutes to spin up - once you see nonzero worker counts in the cluster widget below, you are ready to continue.

from dask_gateway import Gateway
gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 8 # assign each worker 8 GB of memory
cluster = gateway.new_cluster(options)
cluster.adapt(minimum=1, maximum=20) # use adaptive scaling to allow up to 20 workers
cluster # print out cluster information - you can use this to view when the cluster spins up
# Start the client
from dask.distributed import Client
client = Client(cluster)
client # the link below can be used to access the Dask dashboard

Calling load() triggers the computation; the adaptive cluster scales up as needed to carry it out, and the result is pulled into notebook memory.

T_power_spectrum_NAtl.load()

Re-apply the land mask. Because we filled NaNs (land points) with zeros before the FFT, we rebuild a mask from the original SST field and mask those points out again:

# Rebuild a land/ocean mask from the original SST field, coarsened to match the spectrum, and apply it
mask = (ds_orig.SST[0].notnull().astype(int).coarsen(nlon=5, nlat=5).mean() > 0.2).isel(**region).load()
T_power_spectrum_NAtl = T_power_spectrum_NAtl.where(mask)

Write to Zarr

At this point in the workflow, when processing the full globe, we would write the computed power spectrum back out to cloud storage. However, we again include this cell as Markdown only, since Binder does not have write access.

# Set path to save power spectrum
url = f'paigem-pangeo/CESM_POP_hires_control/SST_power_spectrum_yr{yr1}_{yr2}_earthcube_test.zarr'

# Save to Zarr
T_power_spectrum.to_dataset(name='SST_power_spectrum').to_zarr(gcs.get_mapper(url)) # need to convert the xarray DataArray to a Dataset first

Step 3: Plot Results

Area Average Power Spectrum

T_power_spectrum_NAtl.mean(dim=('nlon', 'nlat')).plot(yscale='log', xscale='log')
plt.title('Power spectrum of SST in North Atlantic region')

Maps of different frequency band averages

bands = {
    "> 1 year": slice(1/(3650), 1/400),
    "~ 1 year": slice(1/370, 1/360),
    "100 days - 1 year": slice(1/360, 1/100),
    "10 days - 100 days": slice(1/100, 1/10),
    "< 10 days": slice(1/10, None)
}
for name, band_slice in bands.items():
    plt.figure()
    T_power_spectrum_NAtl.sel(freq_time=band_slice).mean('freq_time').plot(robust=True)
    plt.title(name)