Skip to content

Configuration

Datasets and benchmark runs are described as data, not code, under configs/. The schema is pydantic-validated in cng_benchmark.config and loaded with load_dataset_config / load_benchmark_config. Worked examples live in configs/datasets/ and configs/benchmarks/ and are validated by the test suite.

Adding a dataset or a target format must not require touching CI or the manifests — it is a new file here plus (for a new format) a registered adapter.

Dataset descriptor

configs/datasets/<id>.yaml — names where the baseline lives, its format, the candidate target formats, and the object-grouping lever to sweep.

id: example-raster
description: Generic multi-band raster scene used to exercise the harness.
source: s3://example-bucket/rasters/scene.tif
baseline_format: geotiff
target_formats:
  - cog
  - geozarr
grouping_lever:
  # Candidate object-grouping settings to sweep (format-specific keys).
  cog_block_size: [256, 512, 1024]
  zarr_shard_shape: [[1, 1024, 1024], [1, 2048, 2048]]
Field Meaning
id stable dataset identifier
source dataset root: a single object (single-object) or the prefix scenes/granules live under
baseline_format e.g. geotiff
target_formats cloud-native targets to evaluate
reader layout-aware reader that enumerates the dataset's products/components (default single-object)
options reader-specific picks, validated against the reader's typed Options model
grouping_lever format-specific knob(s) that control how bytes group into objects
description optional human note

Readers and layout-specific options

A real delivery is rarely one object. The reader selects a layout-aware Dataset subclass that enumerates the dataset's products (scenes/granules) and the components within each (bands, masks, …). Component selection is layout-specific, so it lives in a typed options block owned by the reader — not in generic benchmark params. Adding a layout is a new subclass + its Options + one registry line; the core config and runner are untouched.

reader Layout options
single-object (default) one product, one component = source none
sentinel2-maja a .zip-per-scene MAJA L2A delivery under source reflectance (FRE/SRE), bands, masks (CLM/EDG/SAT/MG2)
sentinel1-otb-rtc a .zip-per-scene S1Tiling (OTB) RTC gamma0 delivery under source polarizations (VV/VH)
swot-raster100m one netCDF granule per file under source (SWOT L2 HR Raster 100m) variables (CF variables; default wse)
swot-lakesp-prior a .zip-per-pass shapefile delivery under source (SWOT L2 HR LakeSP Prior) none (one .shp member = one component)
swot-pixc one netCDF point-cloud granule per file under source (SWOT L2 HR PIXC) groups (default pixel_cloud); point_variables (allow-list of carried point vars; default all) / exclude_variables (deny-list)
id: sentinel2-l2a-maja
reader: sentinel2-maja          # selects the Sentinel2MajaDataset subclass
source: s3://sentinel2-l2a-sprid/T31TCJ/   # tile root; scenes (zips) underneath
baseline_format: geotiff
target_formats: [cog]
options:                        # validated by Sentinel2MajaOptions
  reflectance: [FRE]
  bands: [B2, B3, B4, B8]
  masks: [CLM, EDG, SAT, MG2]

MAJA members are read on the fly through GDAL's /vsizip//vsis3 chain — no pre-extraction — so the write metric pays the real archive read cost. The member-name patterns (…_FRE_B3.tif, MASKS/…_CLM_R1.tif) live in the reader, never in config.

Not every delivery is a zip. The swot-raster100m reader is the granule layout — one netCDF file per granule, flat under source — where each selected CF variable becomes one component, read in place via GDAL's CF subdataset syntax (NETCDF:"<granule>":<variable>) and converted to a sharded GeoZarr store by the existing GeoZarr adapter. variables defaults to the primary water-surface-elevation variable (wse).

The swot-lakesp-prior reader is the vector arm — a cross-mission proof that a non-raster delivery wires through config alone. It is the same .zip-per-scene shape as the S1/S2 readers, but each pass's zip holds an ESRI Shapefile: every .shp member becomes one component (one pass = one layer), read on the fly via /vsizip//vsis3 (the OGR driver finds the .shx/.dbf/.prj sidecars inside the archive) and converted to a GeoParquet file by the GeoParquet adapter.

The swot-pixc reader is the point-cloud arm — the same granule layout as swot-raster100m, but a component is a netCDF group read as points rather than a CF raster variable. Each selected group (default pixel_cloud) becomes one component, converted to a COPC file by the COPC adapter, whose point loader reads the group with xarray in place. The COPC is content-complete: the geometry (lon/lat/height → x/y/z) and every other per-point variable (sig0, water_frac, the quality flags, …) are carried — the geometry as the LAS point and the rest as LAS extra dimensions, each keeping its source dtype where LAS allows it (so the produced size is a like-for-like basis for comparison, not a geometry-only fraction). point_variables (allow-list) and exclude_variables (deny-list) choose the carried set; the default carries every point-dimensioned variable. A variable whose name collides with a reserved LAS dimension (e.g. classification) is carried under a suffixed name. The COPC adapter and this point-cloud path are reused by the CO3D CARS arm (tiled LAZ → COPC).

Benchmark descriptor

configs/benchmarks/<id>.yaml — names which dataset and formats to exercise, which metrics to collect, and the storage-tier policy that object-size fitness is judged against.

id: synthetic-cog-end-to-end
dataset: synthetic-cog
formats:
  - cog
metrics:
  - write
  - object_size
  - read
  - display
tiers:
  - name: warm
    min_object_bytes: 33554432    # 32 MiB
  - name: cold
    min_object_bytes: 104857600   # 100 MiB
params:
  block_size: 256                 # COG internal tiling — the grouping lever
# Location URIs (optional in the file; usually supplied per-deployment):
# source: s3://bucket/scene.tif   # baseline to convert (COG end-to-end path)
# object_source: s3://bucket/objs/ # existing objects to list (object-only path)
# output: s3://bucket/results/     # where artifacts + the produced object go
Field Meaning
dataset the dataset id this run targets
formats target format(s); the first is used unless overridden
metrics any of write, object_size, read, display
tiers tier policy: a name + minimum recommended mean object size (bytes)
params format params: the grouping lever + run shape (see below)
source / object_source / output location URIs; CLI flags override them

The grouping-lever params are format-specific — the runner resolves the adapter by name and reads what it needs, so adding a format never changes the schema:

Format params levers
cog block_size (internal tiling), compress
geozarr chunk_shape (addressable unit), shard_shape (stored object), codec (zstd/gzip/blosc/none), multiscale_levels; display_titiler_path selects the multidim/xarray TiTiler router for display
geoparquet row_group_rows (rows per row group — the addressable unit a bbox query fetches), spatial_partitioning (spatially order features so each group's covering bbox is tight), compression
copc span (per-node voxel-grid edge — the per-node point budget ≈ span**3), max_depth (octree depth; null derives it from point density), scale (null derives LAS quantisation from the extent)

GeoZarr is a per-component, 2D adapter: each source raster becomes one sharded 2D store (a directory of shard objects), the per-component analogue of the COG arm, so it flows through the same --source and --dataset paths. chunk_shape / shard_shape accept a 2D [y, x] or a 3D [t, y, x] shape (the trailing two, spatial, dims are used) and tolerate a swept list of shapes (the first is taken). Time-stacking the scenes into a 3D cube, and reading a set of objects as a cube, are deferred follow-ups.

Why URIs are usually omitted from the file

Keeping source / output out of the committed config makes it portable across targets. The deployment supplies the concrete URIs via CLI flags (--source, --output) or Helm values, so the same benchmark file runs against the synthetic stack, a kind cluster, or a real bucket unchanged.

Running over a dataset's products (fan-out)

Pass a dataset descriptor with --dataset <dataset.yaml> (or runner.datasetFile in the chart) and the run fans out over the dataset's product(s) instead of a single --source raster. The benchmark carries only run-shape params — the component picks live in the dataset options:

params:
  block_size: 512
  scope: product-set            # product (one scene) | product-set (many)
  products: {prefix: "2015/", limit: 3}    # bounds a product-set enumeration
  samples: {read: 1, display: 1}           # object_size + write cover ALL objects
Param Meaning
scope product (one product) or product-set (the bounded set)
products.prefix / products.limit bound which/how many products a set covers — prefix is a path prefix under source (applied server-side for S3), limit caps the count
samples.read / samples.display how many components per product to sample for read/display (default 1)

object_size and write cover every component; read and display run on the first samples.{read,display} components. The run writes a product-set tree:

<output>/
  product/<scene-id>/result.json   # ObjectSizeProfile over that scene's components
  product/<scene-id>/summary.md
  rollup/result.json               # profile pooled over ALL products' objects
  rollup/summary.md
  summary.md                       # per-product table + roll-up

Each run reuses the BenchmarkRun model; params carries product_id and scope (product / rollup) to tell the per-product runs apart from the pooled roll-up.

Tier policy

Object size is a hard constraint on a tiered object store, so it is first-class. Each tiers entry is a name and the minimum recommended mean object size to qualify for that tier. The result reports every tier the layout satisfies and the coldest (highest) one — or none, if the objects are too small for any tier.

Metrics

Name Collector Reports
object_size metrics/objects.py object_count, total_bytes + the object_profile
write metrics/write.py write_elapsed, write_throughput (output bytes/s, source read included)
read metrics/read.py read_window_count (vector: read_query_count), read_latency_mean/p50, read_decoded_throughput
display metrics/display.py (+ display_tiles.py) per chunk-bucket display_{1,2,4,9}chunk_latency_mean/p50, display_scenarios, plus a display_chunk_layout.png artifact

read and display adapt to the produced object kind: a COG is read with rasterio over /vsis3 and served by TiTiler's /cog endpoints; a GeoZarr store is read zarr-natively over fsspec (GDAL cannot read the sharding_indexed codec) and served by a multidim/xarray TiTiler surface (params.display_titiler_path); a GeoParquet file is read with a bbox/row-group spatial query over fsspec (only the row groups whose covering bbox overlaps are fetched); a COPC file is read with an octree-node spatial query over fsspec (only the octree nodes that overlap the bbox are fetched). The vector and point-cloud arms have no display metric — a table or point cloud is not a TiTiler raster tile. All raster paths emit the same read_* / display_* names; the vector and point-cloud read swap read_window_count for read_query_count and count returned features / points rather than pixels.

read throughput is decoded bytes/s (a fair relative cross-format number), not bytes over the wire; latency reflects the full range-request round-trip.

display does not time a single fixed tile. It inspects the produced object's block/chunk grid and overview/multiscale levels to pick WebMercator tiles that each touch a target number of internal blocks/chunks — 1, 2, 4 and 9+ — and times each, so latency can be read against chunk-crossing. Unreachable buckets (e.g. on a tiny raster) are skipped; the targets default to (1, 2, 4, 9) and can be overridden via params.display_chunk_targets. A display_chunk_layout.png overlaying each served tile on the block/chunk grid is written alongside the object.

The result

A run produces a BenchmarkRun (cng_benchmark.models):

  • run context — timestamp, tool_versions, dataset_id, format_id, params
  • object_profilecount, total_bytes, mean/median/p50/p90/p95/p99, min_bytes/max_bytes, a histogram, and tier_fit / highest_tier
  • object_layouts — per produced object, its partial-access layout, typed per format (discriminated by kind). Every format answers the same "can a client fetch part without the whole" question through its own structure:
  • cog → a CogLayout: is_tiled (range-read friendly vs striped), block_width/block_height, overview_decimations, internal_tiles; summary.md renders a "Tiling layout" table + a tiled/striped count.
  • geozarr → a GeoZarrLayout: chunk_shape (addressable unit), shard_shape (stored object), chunks_per_shard, codec, multiscale_levels, shard_count; summary.md renders a "Chunk/shard layout" table + a shard-object count.
  • geoparquet → a GeoParquetLayout: geometry_column, num_rows, num_row_groups, row_group_rows (the addressable unit a bbox query fetches), and has_bbox_covering (whether spatial pushdown to row groups is possible).
  • copc → a CopcLayout: num_nodes (octree nodes — the addressable units), max_depth, point_count, points_per_node (the largest node, i.e. the realised per-node point budget), and extra_dimensions (the carried point variables, recovered from the LAS ExtraBytes schema — so the run is self-describing about its content); summary.md renders an "Octree layout" table plus the carried-variable list.

Captured for every object (no tile server needed). The chunk-aware display metric also publishes a display_chunk_layout.png next to the sampled object (the block/chunk grid with each served tile's footprint). A COPC run, which has no display tiles, instead publishes a copc_octree_lod.png — the clustered-octree level-of-detail (coarse overview → full detail), the point-cloud structural artifact; its sink URI is in the octree_lod metric detail. - metrics — a list of {name, value, unit, detail} scalars

It is written as result.json and rendered to summary.md (report.py).

Adding a format

Register a FormatAdapter subclass (see formats/cog.py) under a name in FORMATS; the runner resolves it by the name used in a config's formats. No CI or manifest change is needed — see Architecture › Plug-in seams.