Merge Pipeline¶
lazymerge uses Cubed to lazily merge
geospatial Zarr arrays. The merge is expressed as a cubed.map_blocks
operation, meaning that each output chunk is computed independently and only
the source data needed for that chunk is read.
Overview¶
The pipeline has three phases:
- Target grid construction -- define the output array shape, CRS, resolution, and chunk layout.
- Per-chunk source discovery -- for each output chunk, find which source arrays intersect its spatial extent.
- Per-chunk warp and composite -- read the intersecting source regions, reproject them to the target grid, and composite them into the output chunk.
Target grid¶
merge() creates a target grid from the user-provided CRS, bounding box,
and resolution. This produces a SpatialAttrs (affine transform, shape,
bbox) and a ProjAttrs (CRS code). A lazy Cubed array filled with NaN is
allocated at the target shape and chunk size.
When temporal grouping or multi-band output is requested, extra dimensions
are prepended to the target array: (time, band, y, x), (time, y, x),
or (band, y, x).
Source discovery¶
For each output chunk, the pipeline computes the chunk's bounding box in the target CRS and finds intersecting sources using one of two strategies:
ScanIndex -- a pre-built in-memory index created by scan_store(). Each
source entry stores its spatial extent and CRS, and bounding box intersection
is checked after reprojecting to the target CRS.
DataFusion -- a SQL query against the /meta group in the Zarr store
(via zarr-datafusion-search). The chunk bbox is reprojected to EPSG:4326
and used in a ST_Intersects spatial filter. This path supports
sortby (controlling composite priority) and sql_filter (arbitrary
attribute predicates).
When temporal grouping is active, an additional SQL filter restricts sources to the time window for the current time step.
Overview selection¶
When a source has a multiscale pyramid (indicated by a multiscales
attribute on its Zarr group), the pipeline selects the coarsest overview
level whose pixel size does not exceed the target resolution. This avoids
reading full-resolution data when the output is at a coarser scale.
The target resolution is converted to the source CRS before comparison, handling cases where the target and source use different CRS with different units.
Warp and composite¶
For each source intersecting a chunk:
-
Pixel mapping -- an affine + CRS transform maps every target pixel center to fractional source pixel coordinates. For cross-CRS merges,
pyproj.Transformerhandles the reprojection. -
Region read -- the bounding pixel range in the source is computed from the mapped coordinates. Only this subregion is read from the Zarr array.
-
Resampling -- source pixels are sampled at the mapped coordinates using nearest-neighbor interpolation.
-
NaN compositing -- the warped source data fills NaN pixels in the output. Sources are processed in order (scan index order, or
sortbyorder for DataFusion), and once all pixels in a chunk are filled, remaining sources are skipped (early stop optimization).
Execution model¶
Because the merge is a cubed.map_blocks call, Cubed controls scheduling.
Each chunk is an independent task that can be executed in parallel. The
library reads source data via zarr-python, which supports local
filesystems, cloud object stores, and Icechunk repositories.