Query Pipeline¶

The library uses Apache Datafusion as the foundation of its expression parser and execution planner. To interact with the custom Zarr schema described in zarr.md we use a custom Datafusion TableProvider

The ZarrTableProvider uses several techniques to try and improve filter execution performance. There are 2 main drivers of execution cost.

Retrieving and deserializing Zarr chunks.
Evaluating expressions against chunks.

Non-indexed columns¶

In the most basic case, the TableProvider looks at the "columns" used in the expression predicate and retrieves each chunk for the metadata arrays of these "columns" to build a RecordBatch with a schema of just the predicate columns.

This RecordBatch is evaluated against the filter to find matching rows. If it contains matching rows and there are additional projected "columns" in expression, the chunk for that column's array is loaded and the indexes for the matching rows are selected.

These selected rows are combined and added to the RecordBatchStream.

Indexed columns¶

While the approach described above can be highly performant for many types of columns and expressions, it can be slow for dtypes that are slow to deserialize (variable length types) or when the predicate evaluation is expensive (geodatafusion functions).

In these cases we want to use additional metadata or materialized indexes to 1. Reduce the number of chunks we retrieve and deserialize. 2. Reduce the number of values we check against the filter predicate.

Currently we support using geo-index R-tree indexes. In this case the "basic" pipeline is short circuited. The TableProvder checks if the "column" in the predicate has an index. If an index exists, it uses it find the "rows" that satisfy the predicate. The TableProvider fetches only the chunks for those rows and then only decodes those portions of the chunks. The filter predicate is never used against the raw values.