Quickstart

This guide walks through querying a Zarr store with SQL in four steps: create a store, register it as a DataFusion table, run queries, and add spatial filters.

1. Create a Zarr store

A zarr-datafusion-search store is a Zarr v3 group containing a meta/ subgroup. Each 1D array inside meta/ becomes a SQL column. The simplest way to build a test store is with zarr-python and shapely:

import numpy as np
import shapely
import zarr
from zarr.dtype import VariableLengthBytes, VariableLengthUTF8

root = zarr.open_group("my_store.zarr", mode="w", zarr_format=3)
meta = root.create_group("meta")

# Timestamps
meta.create_array(
    "date",
    data=np.array(["2023-01-01", "2023-01-02", "2023-01-03"], dtype="datetime64[ms]"),
)

# String metadata
collection = meta.create_array(
    "collection",
    shape=(3,),
    dtype=VariableLengthUTF8(),
)
collection[:] = ["sentinel-2", "sentinel-2", "landsat-8"]

# Bounding boxes stored as WKB
bbox = meta.create_array(
    "bbox",
    shape=(3,),
    dtype=VariableLengthBytes(),
)
bbox[:] = shapely.to_wkb([
    shapely.box(-10.0, -10.0, 10.0, 10.0),
    shapely.box(-20.0, -20.0, 20.0, 20.0),
    shapely.box( 30.0,  30.0, 50.0, 50.0),
])

2. Register the store

Use ZarrTable.from_obstore to open the store through obstore, then register it with a DataFusion SessionContext:

from datafusion import SessionContext
from obstore.store import LocalStore
from zarr_datafusion_search import ZarrTable

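store = LocalStore("my_store.zarr")
# from_obstore must be awaited; a bare top-level await needs an
# async-aware environment such as a notebook or IPython session.
zarr_table = await ZarrTable.from_obstore(store, "/meta")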

ctx = SessionContext()
ctx.register_table("my_data", zarr_table)
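
The registration code uses `await` at the top level, which runs as written in a notebook or IPython session but not in a plain script. A minimal sketch of one common workaround is to put the awaited calls inside a coroutine and drive it with `asyncio.run`. To keep the block runnable on its own, `open_table` is a hypothetical stand-in for `ZarrTable.from_obstore(store, "/meta")`; it is not part of the library.

```python
import asyncio


async def open_table(path: str) -> dict:
    # Hypothetical stand-in for `await ZarrTable.from_obstore(store, "/meta")`.
    await asyncio.sleep(0)
    return {"path": path, "columns": ["date", "collection", "bbox"]}


async def main() -> dict:
    # Every awaited call lives inside this one coroutine ...
    return await open_table("my_store.zarr")


# ... which a plain script drives with asyncio.run.
table = asyncio.run(main())
print(table["columns"])
```

In a real script, the body of `main` would presumably build the `LocalStore`, await `ZarrTable.from_obstore`, and return the table for registration with a `SessionContext`.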

3. Query with SQL

Once registered, any DataFusion SQL query works against the table:

# Inspect the schema
df = ctx.sql("SELECT * FROM my_data LIMIT 5")
print(df.schema())
df.show()

# Filter by date
df = ctx.sql("""
    SELECT date, collection
    FROM my_data
    WHERE date >= '2023-01-02'
""")
df.show()
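
As a sanity check on what the date filter should return, the sketch below applies the same predicate to the three sample rows in plain Python. It assumes the string literal in the `WHERE` clause is compared as a date. The `rows` list and the variable names are illustrative only and do not come from the library.

```python
from datetime import date

# The three sample rows written in step 1, as (date, collection) pairs.
rows = [
    (date(2023, 1, 1), "sentinel-2"),
    (date(2023, 1, 2), "sentinel-2"),
    (date(2023, 1, 3), "landsat-8"),
]

cutoff = date.fromisoformat("2023-01-02")

# Same predicate as `WHERE date >= '2023-01-02'`.
matches = [(d.isoformat(), c) for d, c in rows if d >= cutoff]
print(matches)
# [('2023-01-02', 'sentinel-2'), ('2023-01-03', 'landsat-8')]
```

Under that assumption, the SQL query should return the second and third rows.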

4. Spatial queries

Spatial SQL functions are provided by geodatafusion, which is included as a dependency and requires no separate install. Call register_all to make the functions available in your session:

from geodatafusion import register_all

register_all(ctx)

df = ctx.sql("""
    SELECT date, collection
    FROM my_data
    WHERE ST_Intersects(
        bbox,
        ST_GeomFromText('POLYGON((-15 -15, -15 15, 15 15, 15 -15, -15 -15))')
    )
""")
df.show()
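
To see which sample rows the spatial filter should keep, here is a small cross-check in plain Python. It relies on the fact that both the stored geometries and the query polygon are axis-aligned rectangles, so a simple extent-overlap test stands in for `ST_Intersects`. The `boxes_intersect` helper is illustrative and not part of geodatafusion; for general geometries a library such as shapely would be the appropriate tool.

```python
# Axis-aligned boxes as (xmin, ymin, xmax, ymax), matching step 1.
boxes = [
    (-10.0, -10.0, 10.0, 10.0),
    (-20.0, -20.0, 20.0, 20.0),
    (30.0, 30.0, 50.0, 50.0),
]

# The query polygon from the SQL above is itself a box from -15 to 15.
query = (-15.0, -15.0, 15.0, 15.0)


def boxes_intersect(a, b):
    # Two boxes intersect when their extents overlap (or touch) on both axes.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


hits = [i for i, box in enumerate(boxes) if boxes_intersect(box, query)]
print(hits)  # [0, 1]
```

The first two boxes overlap the query rectangle and the third does not, so the SQL query is expected to return the two sentinel-2 rows.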

Using an Icechunk store

If your data lives in an Icechunk repository, open a read-only session and pass it directly to ZarrTable.from_icechunk:

import icechunk
from datafusion import SessionContext
from zarr_datafusion_search import ZarrTable

storage = icechunk.local_filesystem_storage("my_repo")
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")

zarr_table = await ZarrTable.from_icechunk(session=session, group_path="/meta")

ctx = SessionContext()
ctx.register_table("my_data", zarr_table)

From here the SQL interface is identical to the local store example above.