Summary

Implications for future development

  • Virtualizing archival file formats greatly improves performance relative to reading the same files with archival file readers such as h5netcdf, which motivates generating virtual references whenever possible.
  • The Web-Optimized Zarr example shows the potential for Zarr overviews to enable highly performant visualization and motivates the development of the GeoZarr and multi-scales Zarr specifications.
  • Pyinstrument showed that a significant fraction of the total time spent resampling Web-Optimized Zarr with rioxarray went towards Xarray importing Pandas and guessing the chunk manager. Both of these costs could be reduced or removed through future development.
  • The dramatic difference between using XESMF with and without pre-generated weights raises the question of whether similar relative performance improvements could be gained by pre-generating weights for reprojection with GDAL. However, pyinstrument shows that only ~1/4 of the time is spent on the actual resampling operation when using COGs, so building specifications for web-optimizing Zarr (i.e., GeoZarr and multi-scales), virtualizing existing datasets, and reducing import times would likely be much simpler and more fruitful activities.
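
To illustrate why pre-generated weights can help, the following is a minimal sketch of the general idea using 1-D linear interpolation in pure Python. It is not how XESMF or GDAL compute or store weights; the function names `build_linear_weights` and `apply_weights` and the toy coordinates are hypothetical and exist only for this example. The point is that the grid-dependent work (locating source cells and computing fractions) happens once, and each additional array only pays for a cheap weighted sum.

```python
from bisect import bisect_right


def build_linear_weights(src_coords, dst_coords):
    """Precompute (left_index, right_index, right_weight) per destination point.

    Assumes src_coords is sorted ascending with at least two points.
    Destination points outside the source range are clamped to the edges.
    """
    weights = []
    last = len(src_coords) - 1
    for x in dst_coords:
        # Index of the source cell containing x, clamped to a valid cell.
        i = min(max(bisect_right(src_coords, x) - 1, 0), last - 1)
        x0, x1 = src_coords[i], src_coords[i + 1]
        # Fractional position inside the cell, clamped to [0, 1].
        w = min(max((x - x0) / (x1 - x0), 0.0), 1.0)
        weights.append((i, i + 1, w))
    return weights


def apply_weights(weights, values):
    """Resample one array of source values using the precomputed weights."""
    return [(1.0 - w) * values[i] + w * values[j] for i, j, w in weights]


# Hypothetical toy grids: the weights depend only on the coordinates.
src = [0.0, 1.0, 2.0, 3.0]
dst = [0.5, 1.5, 2.5]
weights = build_linear_weights(src, dst)  # computed once per grid pair

# Every time step on the same grids reuses the same weights.
time_steps = [[0.0, 10.0, 20.0, 30.0], [1.0, 3.0, 5.0, 7.0]]
resampled = [apply_weights(weights, step) for step in time_steps]
print(resampled)
```

In this sketch the weights could also be saved to disk and reloaded, which is the analogue of pre-generating them. Whether the same saving would carry over to reprojection with GDAL is the open question raised above.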

Summary figures

from plotting import (
    plot_duration_by_weboptimization,
    plot_memory,
    plot_memory_by_weboptimization,
    plot_time,
    plot_time_by_format,
)

Summary figures for comparing resampling methods

# Plot time required for resampling GPM IMERG
gpm_imerg_local = plot_time("gpm_imerg", local=True, format="netcdf")
gpm_imerg_remote = plot_time("gpm_imerg", local=False, format="netcdf")
(gpm_imerg_local + gpm_imerg_remote).cols(1)

# Plot time required for resampling MUR SST
mur_sst_local = plot_time("mursst", local=True, format="netcdf")
mur_sst_remote = plot_time("mursst", local=False, format="netcdf")
(mur_sst_local + mur_sst_remote).cols(1)

# Plot memory required for resampling GPM IMERG
gpm_imerg_local = plot_memory("gpm_imerg", local=True, format="netcdf")
gpm_imerg_remote = plot_memory("gpm_imerg", local=False, format="netcdf")
(gpm_imerg_local + gpm_imerg_remote).cols(1)

# Plot memory required for resampling MUR SST
mur_sst_local = plot_memory("mursst", local=True, format="netcdf")
mur_sst_remote = plot_memory("mursst", local=False, format="netcdf")
(mur_sst_local + mur_sst_remote).cols(1)

Summary figures for comparing storage formats and I/O libraries

plot_time_by_format("mursst")
plot_time_by_format("gpm_imerg")

Summary figures for exploring web-optimization

plot_duration_by_weboptimization()
plot_memory_by_weboptimization()