Tiny coordinate chunks¶
The reasons to avoid tiny coordinate chunks are the same as the reasons to avoid tiny data chunks, and they are repeated below. Coordinate chunking deserves a separate mention, though, because many people produce datacubes by concatenating data split across many files. When using Xarray, the chunking of each individual file is propagated to the coordinates of the final datacube. For example, if you concatenate 365 data files that each contain one time step, the time coordinate ends up as 365 chunks of length one, which produces terrible performance.
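As a rough sketch of this failure mode and one way to avoid it (the file pattern daily/*.nc and the output path datacube.zarr are made up for illustration), the chunk encoding inherited from the source files can be dropped and the coordinate loaded into memory so that it is written as a single chunk:

import xarray as xr

# Hypothetical example: one file per day for a year (file names are illustrative).
ds = xr.open_mfdataset("daily/*.nc", combine="by_coords")

# Chunk sizes inherited from the individual files can travel along in .encoding
# and end up as length-one coordinate chunks in the written store.
print(ds["time"].encoding.get("chunksizes"))

# Drop the inherited chunk encoding and load the coordinate into memory so it
# is stored in the datacube as a single chunk.
ds["time"].encoding.pop("chunksizes", None)
ds["time"].encoding.pop("chunks", None)
ds = ds.assign_coords(time=ds["time"].compute())
ds.to_zarr("datacube.zarr", mode="w")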
Reasons for avoiding tiny chunks:
- Inefficient compression since most compression algorithms leverage correlations within a chunk.
- Inefficient data loading when querying large subsets of the datacube due to numerous high-latency GET requests. The excessive GET requests also increase costs.
- Inefficient encoding/decoding due to the number of chunks greatly exceeding available parallelism.
- Issues with parallel computing frameworks like Dask that have a 1:1 mapping between tasks and chunks.
Please note that the issue of too many GET requests can be mitigated by using Zarr V3 sharding or a cloud-native file format that allows multiple chunks to be stored in a single file.
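As a minimal sketch of the sharding option (the store path and array sizes below are illustrative and not part of the benchmark that follows), Zarr V3 can keep small chunks for reading while packing many of them into a single stored object:

import zarr

# Illustrative only: each 1000 x 1000 shard is stored as one object and holds
# one hundred 100 x 100 inner chunks, so the object store sees far fewer
# objects while readers keep fine-grained access.
arr = zarr.create_array(
    store="sharded-example.zarr",  # hypothetical local path
    shape=(10_000, 10_000),
    dtype="float32",
    chunks=(100, 100),      # read/decode granularity
    shards=(1_000, 1_000),  # storage granularity: one object per shard
)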
import datacube_benchmark
import zarr
import pandas as pd
import hvplot.pandas # noqa
from azure.identity import DefaultAzureCredential
from obstore.auth.azure import AzureCredentialProvider
Set constants to use when comparing datacubes
config = datacube_benchmark.Config
config.target_array_size = "1 GB"
config.credential_provider = AzureCredentialProvider(
credential=DefaultAzureCredential()
)
config.create_data = True
config.warmup_samples = 1
zarr.config.set({"async.concurrency": config.zarr_concurrency})
Demonstrating inefficiencies of tiny coordinate chunks¶
Create (or reuse) two Blosc-compressed arrays with 25 MB data chunks: one with unchunked coordinates and one with chunked coordinates
url_for_unchunked_coords = "https://datacubeguide.blob.core.windows.net/performance-testing/unchunked-coords.zarr"
unchunked_coords_store = datacube_benchmark.create_or_open_zarr_store(
url_for_unchunked_coords,
target_chunk_size="25 megabyte",
config=config,
chunked_coords=False,
)
url_for_chunked_coords = "https://datacubeguide.blob.core.windows.net/performance-testing/chunked-coords.zarr"
chunked_coords_store = datacube_benchmark.create_or_open_zarr_store(
url_for_chunked_coords,
target_chunk_size="25 megabyte",
config=config,
chunked_coords=True,
)
Measure the time required to open each Zarr store using Xarray
unchunked_coords_results = datacube_benchmark.benchmark_dataset_open(
unchunked_coords_store,
num_samples=config.num_samples,
warmup_samples=config.warmup_samples,
)
chunked_coords_results = datacube_benchmark.benchmark_dataset_open(
chunked_coords_store,
num_samples=config.num_samples,
warmup_samples=config.warmup_samples,
)
df = pd.concat([chunked_coords_results.T, unchunked_coords_results.T])
df["mean_time"] = df.apply(lambda row: float(row["mean_time"].magnitude), axis=1)
df["zarr_store"] = df["zarr_store"].replace(
{
'object_store://AzureStore(container_name="performance-testing", account_name="datacubeguide", prefix="performance-testing/chunked-coords.zarr")': "Chunked",
'object_store://AzureStore(container_name="performance-testing", account_name="datacubeguide", prefix="performance-testing/unchunked-coords.zarr")': "Unchunked",
}
)
title = "Duration to open dataset using Xarray"
plt = df.hvplot.bar(
x="zarr_store",
y="mean_time",
width=1000,
rot=45,
title=title,
ylabel="Duration (s)",
xlabel="Coordinate chunking",
)
plt
Note how much longer it takes even to open a dataset when the coordinates are split into many chunks.
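To check whether an existing store suffers from this problem, inspect how its coordinate arrays are chunked. A minimal sketch, assuming create_or_open_zarr_store returns a Zarr store object that zarr.open_group accepts:

# Illustrative check: a coordinate stored as many length-one chunks is the
# symptom discussed above.
group = zarr.open_group(chunked_coords_store, mode="r")
for name, array in group.arrays():
    print(name, array.shape, array.chunks)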