12. Frequency-domain analysis of large datasets

Climate model datasets are typically stored as global snapshots, i.e. chunked in time rather than in space. This layout works well for many workflows (e.g. computations across the spatial domain at each point in time). However, it creates serious challenges when processing long time series at each point in space, as frequency-domain analysis requires. For large datasets with frequent (e.g. daily) output, it is not feasible to process each spatial point as a single time series in this layout, even with the help of distributed computing tools such as Dask.
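To make the chunking trade-off concrete, the sketch below counts how many chunks must be read to assemble the full time series at a single grid point under two storage layouts. The grid size, record length, chunk shapes, and the helper `chunks_touched` are illustrative assumptions, not the values or tools used for the actual dataset.

```python
def chunks_touched(chunks, selection):
    """Count the chunks that overlap a rectangular selection.

    chunks:    chunk size along each dimension, e.g. (time, lat, lon)
    selection: (start, stop) index range along each dimension
    """
    n = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size        # index of the first chunk touched
        last = (stop - 1) // size    # index of the last chunk touched
        n *= last - first + 1
    return n


# Hypothetical dataset: 10 years of daily output on an 1800 x 3600 grid.
n_time, n_lat, n_lon = 3650, 1800, 3600

# Full time series at one grid point.
point_series = ((0, n_time), (900, 901), (1800, 1801))

# Layout 1: one global snapshot per chunk (chunked in time).
snapshot_chunks = (1, n_lat, n_lon)
# Layout 2: full time series per chunk, split into spatial tiles.
series_chunks = (n_time, 90, 90)

n_snapshot = chunks_touched(snapshot_chunks, point_series)
n_series = chunks_touched(series_chunks, point_series)
print(f"snapshot layout: {n_snapshot} chunks, time-series layout: {n_series} chunk(s)")
```

Under these assumed shapes, the snapshot layout has to open every time step's global field to recover one point's time series, while the spatially tiled layout needs only the single tile that contains the point.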

In this notebook, we show an example workflow for frequency-domain analysis of large datasets in the Pangeo Cloud environment. Specifically, we compute the power spectrum of sea surface temperature from a global ocean model with 0.1˚ horizontal resolution and daily output. Computing a Fourier transform along the time axis requires the entire time series at each point, so we cannot work with data that is chunked in time. The workflow therefore begins with a complete rechunking of the data (using the Rechunker package) from global snapshots to spatially chunked time series. We then compute the Fourier transform of the rechunked data to obtain output in frequency space. Although the computation is intensive, we are able to carry it out using distributed computing via Dask Gateway with adaptive scaling of the Dask cluster.
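As a small in-memory illustration of the final step, the sketch below computes a power spectrum along the time axis of a synthetic `(time, lat, lon)` array using plain NumPy. It is not the Dask and Rechunker workflow itself: the function name `power_spectrum_along_time`, the array sizes, the 90-day test signal, and the simple `|FFT|^2 / n` normalization are all assumptions made for illustration.

```python
import numpy as np


def power_spectrum_along_time(data, dt=1.0):
    """Return (freqs, power) for a real array shaped (time, lat, lon).

    The FFT runs along axis 0, so each grid point needs its full
    time series available at once. `dt` is the sampling interval;
    frequencies are returned in cycles per unit of `dt`.
    """
    n = data.shape[0]
    # Remove the time mean so the zero-frequency bin does not dominate.
    anomalies = data - data.mean(axis=0, keepdims=True)
    spectrum = np.fft.rfft(anomalies, axis=0)
    power = np.abs(spectrum) ** 2 / n   # one simple normalization choice
    freqs = np.fft.rfftfreq(n, d=dt)
    return freqs, power


# Synthetic "daily SST": a 90-day oscillation plus weak noise on a tiny grid.
n_days, n_lat, n_lon = 720, 4, 5
t = np.arange(n_days)
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * t / 90.0)[:, None, None]
sst = signal + 0.1 * rng.standard_normal((n_days, n_lat, n_lon))

freqs, power = power_spectrum_along_time(sst, dt=1.0)

# Frequency (cycles per day) of the strongest peak at every grid point.
peak_freq = freqs[power.argmax(axis=0)]
print(power.shape, peak_freq[0, 0])
```

Because the transform acts on the whole of axis 0, each spatial tile can be processed independently once the data is stored as spatially chunked time series, which is what makes the rechunking step worthwhile.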