Massive data chunks
There are several reasons to avoid overly large chunks in your datacubes:
- In most cases, a chunk is the minimum amount of data that must be fetched for any operation. Consider a datacube with 1 GB chunks: an application or library fetches the full gigabyte even if the user queries a single point represented by an 8-byte value. The user waits for the whole chunk to be fetched and decompressed, and, if the data leaves the provider's network, egress charges apply to the full chunk rather than the few bytes actually needed (see the sketch after this list).
- Many parallelization frameworks, such as Dask, operate on a chunk-by-chunk basis. Splitting your datacube into too few, overly large chunks means you cannot effectively leverage parallel computation.
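As a rough illustration of the first point, here is a minimal sketch. It is independent of the benchmark arrays used below: the toy shape, chunk layout, and in-memory store are made up for the example, and it assumes zarr-python ≥ 3, matching the zarr.config API used later in this notebook. Reading a single value still requires reading and decompressing the entire chunk that contains it:

```python
import numpy as np
import zarr

# Toy array stored as a single large chunk in memory (illustration only).
store = zarr.storage.MemoryStore()
arr = zarr.create_array(
    store,
    shape=(512, 512, 64),
    chunks=(512, 512, 64),  # one chunk covering the whole array
    dtype="float32",
)
arr[:] = np.random.default_rng(0).random(arr.shape, dtype="float32")

value = arr[0, 0, 0]  # request a single 4-byte value...
# ...but the store has to read and decompress the full ~67 MB chunk to serve it.
print(f"Returned {value.nbytes} bytes from a {arr.nbytes / 1e6:.0f} MB chunk")
```

With object storage instead of an in-memory store, that whole-chunk read also becomes a ~67 MB network transfer.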
In [1]:
import datacube_benchmark
import zarr
import pandas as pd
import hvplot.pandas # noqa
from azure.identity import DefaultAzureCredential
from obstore.auth.azure import AzureCredentialProvider
Set constants to use when comparing datacubes
In [2]:
config = datacube_benchmark.Config
config.target_array_size = "1 GB"
config.credential_provider = AzureCredentialProvider(
credential=DefaultAzureCredential()
)
config.warmup_samples = 1
config.create_data = False
In [3]:
zarr.config.set({"async.concurrency": config.zarr_concurrency})
Out[3]:
<donfig.config_obj.ConfigSet at 0x739745abc550>
Demonstrating the performance inefficiencies of overly large chunks
Create (or reuse) a Blosc-compressed array with 25 MB chunks.
In [4]:
url_for_reg_chunks = (
"https://datacubeguide.blob.core.windows.net/performance-testing/reg-chunks.zarr"
)
reg_chunk_arr = datacube_benchmark.create_or_open_zarr_array(
url_for_reg_chunks, target_chunk_size="25 megabyte", config=config
)
Create (or reuse) a Blosc-compressed array with 1 GB chunks.
In [5]:
url_for_massive_chunks = "https://datacubeguide.blob.core.windows.net/performance-testing/massive-chunks.zarr"
massive_chunk_arr = datacube_benchmark.create_or_open_zarr_array(
url_for_massive_chunks, target_chunk_size="1 GB", config=config
)
Test the time required to load a random point, a time series, a spatial slice, and a full scan from each array.
In [6]:
massive_chunk_results = datacube_benchmark.benchmark_access_patterns(
massive_chunk_arr,
num_samples=config.num_samples,
warmup_samples=config.warmup_samples,
).reset_index(drop=True)
reg_chunks_results = datacube_benchmark.benchmark_access_patterns(
reg_chunk_arr,
num_samples=config.num_samples,
warmup_samples=config.warmup_samples,
).reset_index(drop=True)
In [7]:
df = pd.concat([massive_chunk_results, reg_chunks_results])
df["access_pattern"] = df["access_pattern"].replace(
{
"point": "Random point",
"time_series": "Time series",
"spatial_slice": "Spatial slice",
"full": "Full scan",
}
)
df["mean_time"] = df.apply(lambda row: float(row["mean_time"].magnitude), axis=1)
df["chunk_size"] = df.apply(
    lambda row: f"{row['chunk_size'].to('MB').magnitude:,.2f} (MB)", axis=1
)
df
Out[7]:
| | mean_time | median_time | std_time | min_time | max_time | total_samples | access_pattern | array_shape | chunk_shape | chunk_size | nchunks | shard_shape | array_dtype | array_size_memory | array_size_storage | array_compressors | compression_ratio | zarr_concurrency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0 | 0.586832 | 0.5868318689999796 second | 0.0 second | 0.5868318689999796 second | 0.5868318689999796 second | 1 | Random point | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 35.34:1 | 128 |
1 | 0.444508 | 0.4445084300000417 second | 0.0 second | 0.4445084300000417 second | 0.4445084300000417 second | 1 | Time series | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 35.34:1 | 128 |
2 | 0.492603 | 0.49260268600005475 second | 0.0 second | 0.49260268600005475 second | 0.49260268600005475 second | 1 | Spatial slice | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 35.34:1 | 128 |
3 | 1.017236 | 1.0172359650000544 second | 0.0 second | 1.0172359650000544 second | 1.0172359650000544 second | 1 | Full scan | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 35.34:1 | 128 |
0 | 0.027823 | 0.027823419000014837 second | 0.0 second | 0.027823419000014837 second | 0.027823419000014837 second | 1 | Random point | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 53.80:1 | 128 |
1 | 0.072818 | 0.07281813499980672 second | 0.0 second | 0.07281813499980672 second | 0.07281813499980672 second | 1 | Time series | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 53.80:1 | 128 |
2 | 0.052765 | 0.05276520500001425 second | 0.0 second | 0.05276520500001425 second | 0.05276520500001425 second | 1 | Spatial slice | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 53.80:1 | 128 |
3 | 0.921520 | 0.921520467000164 second | 0.0 second | 0.921520467000164 second | 0.921520467000164 second | 1 | Full scan | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 53.80:1 | 128 |
In [8]:
title = "Duration to load data for different access patterns"
plt = df.hvplot.bar(
x="chunk_size",
y="mean_time",
by="access_pattern",
width=1000,
rot=45,
title=title,
ylabel="Duration (s)",
xlabel="Chunk Size, Query type",
)
In [9]:
plt
Out[9]:
(Bar chart: mean duration to load data for different access patterns, grouped by chunk size and query type.)
Note how much longer the queries take when the dataset is stored as a handful of massive (~570 MB) chunks, especially for small subsets of the data such as a time series, spatial slice, or random point.
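If you need to go the other way and turn a chunk-size budget into a chunk shape, a back-of-the-envelope calculation is often a good starting point. The sketch below uses a hypothetical helper (not part of datacube_benchmark); for a ~25 MB budget of float32 data it lands close to the (185, 185, 185) chunk shape used above.

```python
import numpy as np

def roughly_cubic_chunk_edge(target_bytes: float, dtype: str) -> int:
    """Edge length of a roughly cubic chunk whose uncompressed size is ~target_bytes."""
    elements = target_bytes / np.dtype(dtype).itemsize
    return round(elements ** (1 / 3))

edge = roughly_cubic_chunk_edge(25e6, "float32")  # ~184 elements per axis
size_mb = edge**3 * np.dtype("float32").itemsize / 1e6
print(f"({edge}, {edge}, {edge}) chunks ≈ {size_mb:.1f} MB uncompressed")
```

In practice you would also bias the chunk shape toward the access patterns you expect, for example longer along the time axis for time-series queries or a larger spatial footprint for map-style reads.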