Massive data chunks¶
Summary¶
Using massive data chunks leads to poor performance.
Details¶
There are several reasons to avoid overly large chunk sizes in your datacubes:
- In most cases, a chunk is the minimum amount of data that must be fetched for any operation. Consider a datacube with 1 GB chunks: an application or library fetches the full gigabyte even if the user wants only a single point represented by an 8-byte value. The user waits for the whole chunk to be fetched and decompressed, and any egress charges apply to the whole chunk rather than the 8 bytes requested.
- Many parallelization frameworks, such as Dask, operate on a chunk-by-chunk basis. Splitting your datacube into too few, too-large chunks means you cannot effectively leverage parallel computation (see the sketch after this list).
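Below is a minimal sketch of both effects, assuming only Dask is installed; it is independent of the benchmark library used in the rest of this notebook. The 1 GiB chunk size, the array shape, and the 185-element chunk edges are illustrative values chosen to mirror the arrays created later on.

import dask.array as da

# Read amplification: a point query still fetches (and decompresses) the whole chunk.
chunk_bytes = 1 * 1024**3  # illustrative 1 GiB chunk
value_bytes = 8            # a single float64 value
print(f"Read amplification: {chunk_bytes / value_bytes:,.0f}x")

# Parallelism ceiling: Dask schedules at most one task per chunk.
one_chunk = da.zeros((965, 360, 720), chunks=(965, 360, 720), dtype="float32")
many_chunks = da.zeros((965, 360, 720), chunks=(185, 185, 185), dtype="float32")
print(one_chunk.numblocks)    # (1, 1, 1)  -> chunk-wise work runs serially
print(many_chunks.numblocks)  # (6, 2, 4)  -> up to 48 chunk-level tasks in parallel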
Demonstration¶
In [1]:
import datacube_benchmark
import zarr
import pandas as pd
import hvplot.pandas # noqa
Set constants to use when comparing datacubes
In [2]:
config = datacube_benchmark.Config
config.target_array_size = "1 GB"
# config.credential_provider = AzureCredentialProvider(
# credential=DefaultAzureCredential()
# )
config.warmup_samples = 1
config.create_data = False
In [3]:
zarr.config.set({"async.concurrency": config.zarr_concurrency})
Out[3]:
<donfig.config_obj.ConfigSet at 0x108f26350>
Demonstrating the performance penalty of overly large chunks¶
Create (or reuse) a Blosc-compressed array with 25 MB chunks
In [4]:
url_for_reg_chunks = "https://datacubeguide.blob.core.windows.net/performance-testing/performance-testing/reg-chunks.zarr"
reg_chunk_arr = datacube_benchmark.create_or_open_zarr_array(
    url_for_reg_chunks, target_chunk_size="25 megabyte", config=config
)
Create (or reuse) a Blosc-compressed array with 1 GB chunks
In [5]:
url_for_massive_chunks = "https://datacubeguide.blob.core.windows.net/performance-testing/performance-testing/massive-chunks.zarr"
massive_chunk_arr = datacube_benchmark.create_or_open_zarr_array(
    url_for_massive_chunks, target_chunk_size="1 GB", config=config
)
Measure the time required to load a random point, a time series, a spatial slice, and a full scan from each array.
In [6]:
massive_chunk_results = datacube_benchmark.benchmark_access_patterns(
    massive_chunk_arr,
    num_samples=config.num_samples,
    warmup_samples=config.warmup_samples,
).reset_index(drop=True)
reg_chunks_results = datacube_benchmark.benchmark_access_patterns(
    reg_chunk_arr,
    num_samples=config.num_samples,
    warmup_samples=config.warmup_samples,
).reset_index(drop=True)
In [7]:
df = pd.concat([massive_chunk_results, reg_chunks_results])
df["access_pattern"] = df["access_pattern"].replace(
    {
        "point": "Random point",
        "time_series": "Time series",
        "spatial_slice": "Spatial slice",
        "full": "Full scan",
    }
)
df["mean_time"] = df.apply(lambda row: float(row["mean_time"].magnitude), axis=1)
df["chunk_size"] = df.apply(
    lambda row: f"{row['chunk_size'].to('MB').magnitude:,.2f} (MB)", axis=1
)
df
Out[7]:
| | mean_time | median_time | std_time | min_time | max_time | total_samples | access_pattern | array_shape | chunk_shape | chunk_size | nchunks | shard_shape | array_dtype | array_size_memory | array_size_storage | array_compressors | compression_ratio | zarr_concurrency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.570757 | 4.5707570829999895 second | 0.0 second | 4.5707570829999895 second | 4.5707570829999895 second | 1 | Random point | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 35.34:1 | 128 |
| 1 | 9.399285 | 9.399284749999424 second | 0.0 second | 9.399284749999424 second | 9.399284749999424 second | 1 | Time series | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 35.34:1 | 128 |
| 2 | 7.174652 | 7.174651583999548 second | 0.0 second | 7.174651583999548 second | 7.174651583999548 second | 1 | Spatial slice | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 35.34:1 | 128 |
| 3 | 9.732877 | 9.732876707999822 second | 0.0 second | 9.732876707999822 second | 9.732876707999822 second | 1 | Full scan | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 35.34:1 | 128 |
| 0 | 0.574409 | 0.5744085000005725 second | 0.0 second | 0.5744085000005725 second | 0.5744085000005725 second | 1 | Random point | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 53.80:1 | 128 |
| 1 | 0.503665 | 0.5036646669996117 second | 0.0 second | 0.5036646669996117 second | 0.5036646669996117 second | 1 | Time series | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 53.80:1 | 128 |
| 2 | 0.718962 | 0.7189615829993272 second | 0.0 second | 0.7189615829993272 second | 0.7189615829993272 second | 1 | Spatial slice | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 53.80:1 | 128 |
| 3 | 2.883359 | 2.883359292000023 second | 0.0 second | 2.883359292000023 second | 2.883359292000023 second | 1 | Full scan | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 53.80:1 | 128 |
In [8]:
title = "Duration to load data for different access patterns"
plt = df.hvplot.bar(
    x="chunk_size",
    y="mean_time",
    by="access_pattern",
    width=1000,
    rot=45,
    title=title,
    ylabel="Duration (s)",
    xlabel="Chunk Size, Query type",
)
In [9]:
plt
Out[9]: [interactive bar chart: "Duration to load data for different access patterns"]
Note how much longer every query takes when the dataset is split into only a few massive chunks, especially for small subsets of the data such as a time series, spatial slice, or random point.
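As a rough worked comparison using the mean_time values from the table above (timings will vary between runs and environments), the random-point query against the massive-chunk array is roughly eight times slower than against the 25 MB-chunk array:

massive_point = 4.570757  # seconds: random point query, massive-chunk array
regular_point = 0.574409  # seconds: random point query, 25 MB-chunk array
print(f"{massive_point / regular_point:.1f}x slower")  # roughly 8x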