Store Affine GeoTransform and CRS in grid_mapping variable

Load example dataset from NetCDF into Xarray

import json

import cf_xarray  # noqa
import panel
import rasterio
import rioxarray  # noqa
import xarray as xr
import zarr

# For zarr_format=2 encoding
from numcodecs import Zstd
fp_base = "20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1"
input_path = f"../data/{fp_base}.nc"  # avoid shadowing the built-in `input`
v2_output = f"../output/v2/{fp_base}_geotransform.zarr"
v3_output = f"../output/v3/{fp_base}_geotransform.zarr"
ds = xr.open_dataset(input_path)

Check that all variables have a CF-compliant standard name

standard_names = ds.cf.standard_names
vars_with_standard_names = [v[0] for v in standard_names.values()]
compliant_vars = []
non_compliant_vars = []
for var in ds.variables:
    if var not in vars_with_standard_names:
        non_compliant_vars.append(var)
    else:
        compliant_vars.append(var)
        assert ds[var].attrs["standard_name"]

print(f"These variables do NOT have a CF-compliant standard name: {non_compliant_vars}")
print(f"These variables have a CF-compliant standard name: {compliant_vars}")
These variables do NOT have a CF-compliant standard name: ['analysis_error', 'mask']
These variables have a CF-compliant standard name: ['time', 'lat', 'lon', 'analysed_sst', 'sea_ice_fraction']

Not all the variables in this dataset have a CF-compliant standard name. See https://github.com/zarr-developers/geozarr-spec/issues/60 for a recommendation that CF-compliant standard names should be a “SHOULD” rather than a “MUST” condition in the GeoZarr spec. For now, let’s subset to the variables that do use CF-compliant standard names.

ds = ds[compliant_vars]

Assign CRS and geotransform using rioxarray and rasterio

First, let’s specify the CRS for the dataset.

ds = ds.rio.write_crs("epsg:4326")
# Specify which variable contains CRS information using grid_mapping
for var in ds.data_vars:
    ds[var].attrs["grid_mapping"] = "spatial_ref"

Next, let’s get the appropriate GeoTransform from the bounds, width, and height of one of the data variables.

# Shape is (time, lat, lon); rasterio only needs the spatial dimensions
_, height, width = ds.analysed_sst.shape
transform = rasterio.transform.from_bounds(*ds.analysed_sst.rio.bounds(), width, height)
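For reference, `from_bounds` derives the pixel sizes from the bounds and the raster dimensions. Here is a minimal pure-Python sketch of that arithmetic, using illustrative whole-degree global bounds rather than the exact values returned by `rio.bounds()`:

```python
# Pure-Python sketch of what rasterio.transform.from_bounds computes.
# The bounds below are illustrative global cell-edge bounds, not the
# exact values returned by ds.analysed_sst.rio.bounds().
west, south, east, north = -180.0, -90.0, 180.0, 90.0
width, height = 36000, 17999

pixel_width = (east - west) / width      # x resolution
pixel_height = (north - south) / height  # y resolution (positive here)

# GDAL GeoTransform coefficient order:
# (x_origin, x_res, row_rotation, y_origin, col_rotation, -y_res)
geotransform = (west, pixel_width, 0.0, north, 0.0, -pixel_height)
print(geotransform)
```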

Now, let’s store the GeoTransform as a space-separated string in the GeoTransform attribute of the spatial_ref auxiliary variable.

# Convert transform to GDAL's format
transform = transform.to_gdal()
# Convert transform to space separated string
transform = " ".join([str(i) for i in transform])
# Save as an attribute in the `spatial_ref` variable
ds["spatial_ref"].attrs["GeoTransform"] = transform
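To confirm that the coordinates remain recoverable, a GeoTransform string like this one can be parsed back into pixel-center coordinates. A sketch, assuming a hypothetical 0.01-degree global grid (not necessarily the exact transform stored above):

```python
# Recover pixel-center coordinates from a GDAL GeoTransform string.
# The string below is a hypothetical 0.01-degree global grid.
gt_string = "-180.0 0.01 0.0 90.0 0.0 -0.01"
x_origin, x_res, _, y_origin, _, y_res = (float(v) for v in gt_string.split())

def pixel_center(row, col):
    """Return (lon, lat) of the center of pixel (row, col)."""
    lon = x_origin + (col + 0.5) * x_res
    lat = y_origin + (row + 0.5) * y_res
    return lon, lat

print(pixel_center(0, 0))  # upper-left pixel center
```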

Finally, we’ll remove the explicit coordinates that are redundant with the GeoTransform.

ds = ds.drop_vars(["lat", "lon"])
ds
<xarray.Dataset> Size: 10GB
Dimensions:           (time: 1, lat: 17999, lon: 36000)
Coordinates:
  * time              (time) datetime64[ns] 8B 2002-06-01T09:00:00
    spatial_ref       int64 8B 0
Dimensions without coordinates: lat, lon
Data variables:
    analysed_sst      (time, lat, lon) float64 5GB ...
    sea_ice_fraction  (time, lat, lon) float64 5GB ...
Attributes: (12/47)
    Conventions:                CF-1.5
    title:                      Daily MUR SST, Final product
    summary:                    A merged, multi-sensor L4 Foundation SST anal...
    references:                 http://podaac.jpl.nasa.gov/Multi-scale_Ultra-...
    institution:                Jet Propulsion Laboratory
    history:                    created at nominal 4-day latency; replaced nr...
    ...                         ...
    project:                    NASA Making Earth Science Data Records for Us...
    publisher_name:             GHRSST Project Office
    publisher_url:              http://www.ghrsst.org
    publisher_email:            [email protected]
    processing_level:           L4
    cdm_data_type:              grid

Specify encoding and write to Zarr V2 format

spatial_chunk = 4096
compressor = Zstd(level=1)
encoding = {
    "analysed_sst": {
        "chunks": (1, spatial_chunk, spatial_chunk),
        "compressor": compressor,
    },
    "sea_ice_fraction": {
        "chunks": (1, spatial_chunk, spatial_chunk),
        "compressor": compressor,
    },
}
ds.to_zarr(v2_output, mode="w", consolidated=True, zarr_format=2, encoding=encoding)
<xarray.backends.zarr.ZarrStore at 0x16aacde10>
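With 4096×4096 spatial chunks, each variable splits into a predictable number of chunks. Quick arithmetic for the 1×17999×36000 grid above:

```python
import math

# Number of Zarr chunks per variable for the chunking scheme above
time_len, lat_len, lon_len = 1, 17999, 36000
spatial_chunk = 4096

chunks_lat = math.ceil(lat_len / spatial_chunk)
chunks_lon = math.ceil(lon_len / spatial_chunk)
total_chunks = time_len * chunks_lat * chunks_lon
print(chunks_lat, chunks_lon, total_chunks)  # 5 9 45
```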

Inspect Zarr V2 store

First, let’s look at the structure of Zarr arrays using zarr’s Group.tree() method.

root = zarr.open_group(v2_output)
root.tree()
/
├── analysed_sst (1, 17999, 36000) float64
├── sea_ice_fraction (1, 17999, 36000) float64
├── spatial_ref () int64
└── time (1,) int32

Second, let’s look at what’s actually recorded in the Zarr metadata using the consolidated metadata at the root of the Zarr store.

To keep the output valid JSON, we convert the nan fill_value entries to the string “nan”.

Key observations

  • For each array, metadata is stored under '.zattrs'.
  • All arrays contain a .zattrs/standard_name.
  • The root group specifies that the metadata follows CF conventions, which should be validated.
  • .zattrs/_ARRAY_DIMENSIONS for time contains a list with only the name of the array, indicating that it is a coordinate variable.
  • .zattrs/_ARRAY_DIMENSIONS for spatial_ref contains an empty list, indicating that it is an auxiliary coordinate.
  • .zattrs/_ARRAY_DIMENSIONS for analysed_sst and sea_ice_fraction contains a list referring to other arrays, indicating that they are data variables rather than coordinate variables.
  • .zattrs/grid_mapping for analysed_sst and sea_ice_fraction is "spatial_ref", indicating that CRS information is included in that auxiliary variable’s metadata.
  • spatial_ref/.zattrs contains the OGC WKT for the CRS along with a GeoTransform attribute containing a space-separated, GDAL-format Affine GeoTransform.
  • .zattrs/coordinates for analysed_sst and sea_ice_fraction includes lat and lon, indicating that the GeoTransform produces those coordinates. Which part of the stack adds this coordinates metadata should be investigated further.
panel.extension()
consolidated_metadata_file = f"{v2_output}/.zmetadata"
with open(consolidated_metadata_file) as f:
    metadata = json.load(f)["metadata"]
metadata["sea_ice_fraction/.zarray"]["fill_value"] = str(
    metadata["sea_ice_fraction/.zarray"]["fill_value"]
)
metadata["analysed_sst/.zarray"]["fill_value"] = str(
    metadata["analysed_sst/.zarray"]["fill_value"]
)
panel.pane.JSON(metadata, name="JSON")
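The string conversion above is needed because Python’s json module emits a bare NaN token, which is not valid JSON. A quick self-contained demonstration:

```python
import json

metadata = {"fill_value": float("nan")}

# By default, json.dumps emits bare NaN, which is not valid JSON ...
print(json.dumps(metadata))  # {"fill_value": NaN}

# ... and refuses outright when strict compliance is requested.
try:
    json.dumps(metadata, allow_nan=False)
except ValueError as e:
    print(e)

# Replacing NaN with the string "nan" keeps the document valid JSON.
metadata["fill_value"] = str(metadata["fill_value"])
print(json.dumps(metadata))  # {"fill_value": "nan"}
```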

Specify encoding and write to Zarr V3 format

While GeoZarr v0.4 is specific to Zarr V2, let’s write a Zarr V3 store to get an idea of how GeoZarr could be adapted for Zarr format 3.

spatial_chunk = 4096
compressor = zarr.codecs.ZstdCodec(level=1)
encoding = {
    "analysed_sst": {
        "chunks": (1, spatial_chunk, spatial_chunk),
        "compressors": compressor,
    },
    "sea_ice_fraction": {
        "chunks": (1, spatial_chunk, spatial_chunk),
        "compressors": compressor,
    },
}
ds.to_zarr(v3_output, mode="w", consolidated=True, zarr_format=3, encoding=encoding)
/Users/max/Documents/Code/developmentseed/geozarr-examples/.pixi/envs/test/lib/python3.13/site-packages/zarr/api/asynchronous.py:203: UserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(
<xarray.backends.zarr.ZarrStore at 0x16aacd510>

Key observations

  • For each group and array, metadata is stored under the ‘attributes’ key in ‘zarr.json’.
  • All arrays contain an attributes/standard_name.
  • The dimensions associated with an array are stored under zarr.json/dimension_names (separately from the attributes) rather than in _ARRAY_DIMENSIONS.
  • The node_type field specifies whether a key holds a Zarr Array or a Zarr Group.
  • The coordinates associated with an array are still specified within the array metadata. Currently this is an Xarray implementation detail rather than a part of the GeoZarr specification.
  • The fill_value for sea_ice_fraction and analysed_sst is 0 instead of nan. This is likely an error caused by the fill value not being explicitly specified.
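The key-name change from V2 to V3 can be sketched as a small dictionary translation. This is illustrative only; real conversion tooling handles many more metadata fields:

```python
# Illustrative sketch: translating the Zarr V2 dimension-naming convention
# (_ARRAY_DIMENSIONS attribute) into the Zarr V3 layout (top-level
# dimension_names in zarr.json). Real converters handle far more than this.
v2_attrs = {
    "_ARRAY_DIMENSIONS": ["time", "lat", "lon"],
    "standard_name": "sea_surface_temperature",
}

v3_metadata = {
    "node_type": "array",
    # V3 stores dimension names at the top level of zarr.json ...
    "dimension_names": v2_attrs.pop("_ARRAY_DIMENSIONS"),
    # ... and everything else under "attributes".
    "attributes": v2_attrs,
}
print(v3_metadata)
```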
panel.extension()
consolidated_metadata_file = f"{v3_output}/zarr.json"
with open(consolidated_metadata_file) as f:
    metadata = json.load(f)["consolidated_metadata"]["metadata"]
panel.pane.JSON(metadata, name="JSON")
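One way to address the fill_value observation above would be to pin the fill value explicitly via `_FillValue` in the encoding passed to `to_zarr`. A hedged sketch of such an encoding dict (not verified against this store):

```python
import math

# Hypothetical encoding that pins the fill value explicitly so the
# Zarr V3 writer does not fall back to 0. Not verified against this store.
spatial_chunk = 4096
encoding = {
    "analysed_sst": {
        "chunks": (1, spatial_chunk, spatial_chunk),
        "_FillValue": float("nan"),
    },
    "sea_ice_fraction": {
        "chunks": (1, spatial_chunk, spatial_chunk),
        "_FillValue": float("nan"),
    },
}
print(encoding)
```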