Skip to content

Zarr Structure

zarr-datafusion-search uses a set of 1D arrays to store metadata, treating them as if they were columns in a columnar storage format like Parquet.

Schema

The library requires a special group named meta. It discovers each array stored in the meta group and maps its Zarr v3 dtype to an Arrow type to build an Arrow schema. These are the current supported type mappings.

Supported dtype mappings

Zarr v3 dtype Arrow type Notes
bool Boolean
int8 Int8
int16 Int16
int32 Int32
int64 Int64
uint8 UInt8
uint16 UInt16
uint32 UInt32
uint64 UInt64
float16 Float16
float32 Float32
float64 Float64
VariableLengthBytes BinaryView When the field name is bbox, mapped to WKB with EPSG:4326 CRS via the GeoArrow extension type instead
r<N> (raw bits) BinaryView
VariableLengthUTF8 Utf8View
numpy.datetime64[s] Timestamp(Second, None)
numpy.datetime64[ms] Timestamp(Millisecond, None)
numpy.datetime64[us] Timestamp(Microsecond, None)
numpy.datetime64[ns] Timestamp(Nanosecond, None)
complex64 Not supported
complex128 Not supported

Chunking

Because the meta arrays are combined by Datafusion into a single schema they need to maintain chunk alignment. This means that during array creation time, the chunk size must be the same for each array. And as data is appended, it needs to be appended to each array simultaneously. Because the chunks are aligned across arrays, the library can treat the combined chunks similar to a Parquet row group.

This diagram demonstrates the chunk alignment

Indexes

zarr-datafusion-search supports the optional use of materialized indexes to improve scanning performance. We use the following convention, if a meta array is used in the filter predicate zarr-datafusion-search will look for a group called indexes and search for an array of the same name as the predicate array. Currently the only index type supported are R-tree indexes generated by the geo-index library but we will continue to expand index support.

Decoding performance

Whenever possible it is worthwhile to avoid including VariableLengthBytes fields in your query's projection. Variable length decoding incurs a significantly higher cost than decoding fixed width types and can drastically affect query performance. This is especially evident when an index is present and the actual chunks don't need to be materialized for filter operations. See the results of

cargo bench --bench bbox_zarrs_raw
as an example.