Massive data chunks¶
Summary¶
Using massive data chunks leads to poor performance.
Details¶
There are several reasons to avoid overly large chunk sizes in your datacubes:
- In most cases, a chunk is the minimum amount of data that must be fetched for any operation. Consider a datacube with 1 GB chunks: an application or library fetches the full gigabyte even if the user wants only a single point represented by an 8-byte value. The user waits for the whole chunk to be fetched and decompressed, and any egress charges apply to the whole chunk rather than the 8 bytes requested.
- Many parallelization frameworks, such as Dask, operate on a chunk-by-chunk basis. Splitting your datacube into too few, too-large chunks means you cannot effectively leverage parallel computation (see the sketch after this list).
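Below is a minimal sketch of both effects, assuming only Dask is installed; it is independent of the benchmark library used in the rest of this notebook. The 1 GiB chunk size, the array shape, and the 185-element chunk edges are illustrative values chosen to mirror the arrays created later on.

import dask.array as da

# Read amplification: a point query still fetches (and decompresses) the whole chunk.
chunk_bytes = 1 * 1024**3  # illustrative 1 GiB chunk
value_bytes = 8            # a single float64 value
print(f"Read amplification: {chunk_bytes / value_bytes:,.0f}x")

# Parallelism ceiling: Dask schedules at most one task per chunk.
one_chunk = da.zeros((965, 360, 720), chunks=(965, 360, 720), dtype="float32")
many_chunks = da.zeros((965, 360, 720), chunks=(185, 185, 185), dtype="float32")
print(one_chunk.numblocks)    # (1, 1, 1)  -> chunk-wise work runs serially
print(many_chunks.numblocks)  # (6, 2, 4)  -> up to 48 chunk-level tasks in parallel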
Demonstration¶
In [1]:
import datacube_benchmark
import zarr
import pandas as pd
import hvplot.pandas # noqa
Set constants to use when comparing datacubes
In [2]:
config = datacube_benchmark.Config
config.target_array_size = "1 GB"
# config.credential_provider = AzureCredentialProvider(
# credential=DefaultAzureCredential()
# )
config.warmup_samples = 1
config.create_data = False
In [3]:
zarr.config.set({"async.concurrency": config.zarr_concurrency})
Out[3]:
<donfig.config_obj.ConfigSet at 0x108f26350>
Demonstrating the performance penalty of overly large chunks¶
Create (or reuse) a Blosc-compressed array with 25 MB chunks
In [4]:
url_for_reg_chunks = "https://datacubeguide.blob.core.windows.net/performance-testing/performance-testing/reg-chunks.zarr"
reg_chunk_arr = datacube_benchmark.create_or_open_zarr_array(
    url_for_reg_chunks, target_chunk_size="25 megabyte", config=config
)
Create (or reuse) a Blosc-compressed array with 1 GB chunks
In [5]:
url_for_massive_chunks = "https://datacubeguide.blob.core.windows.net/performance-testing/performance-testing/massive-chunks.zarr"
massive_chunk_arr = datacube_benchmark.create_or_open_zarr_array(
    url_for_massive_chunks, target_chunk_size="1 GB", config=config
)
Measure the time required to load a random point, a time series, a spatial slice, and a full scan from each array.
In [6]:
massive_chunk_results = datacube_benchmark.benchmark_access_patterns(
    massive_chunk_arr,
    num_samples=config.num_samples,
    warmup_samples=config.warmup_samples,
).reset_index(drop=True)
reg_chunks_results = datacube_benchmark.benchmark_access_patterns(
    reg_chunk_arr,
    num_samples=config.num_samples,
    warmup_samples=config.warmup_samples,
).reset_index(drop=True)
In [7]:
df = pd.concat([massive_chunk_results, reg_chunks_results])
df["access_pattern"] = df["access_pattern"].replace(
    {
        "point": "Random point",
        "time_series": "Time series",
        "spatial_slice": "Spatial slice",
        "full": "Full scan",
    }
)
df["mean_time"] = df.apply(lambda row: float(row["mean_time"].magnitude), axis=1)
df["chunk_size"] = df.apply(
    lambda row: f"{row['chunk_size'].to('MB').magnitude:,.2f} (MB)", axis=1
)
df
Out[7]:
| | mean_time | median_time | std_time | min_time | max_time | total_samples | access_pattern | array_shape | chunk_shape | chunk_size | nchunks | shard_shape | array_dtype | array_size_memory | array_size_storage | array_compressors | compression_ratio | zarr_concurrency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.570757 | 4.5707570829999895 second | 0.0 second | 4.5707570829999895 second | 4.5707570829999895 second | 1 | Random point | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 35.34:1 | 128 |
| 1 | 9.399285 | 9.399284749999424 second | 0.0 second | 9.399284749999424 second | 9.399284749999424 second | 1 | Time series | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 35.34:1 | 128 |
| 2 | 7.174652 | 7.174651583999548 second | 0.0 second | 7.174651583999548 second | 7.174651583999548 second | 1 | Spatial slice | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 35.34:1 | 128 |
| 3 | 9.732877 | 9.732876707999822 second | 0.0 second | 9.732876707999822 second | 9.732876707999822 second | 1 | Full scan | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 35.34:1 | 128 |
| 0 | 0.574409 | 0.5744085000005725 second | 0.0 second | 0.5744085000005725 second | 0.5744085000005725 second | 1 | Random point | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 53.80:1 | 128 |
| 1 | 0.503665 | 0.5036646669996117 second | 0.0 second | 0.5036646669996117 second | 0.5036646669996117 second | 1 | Time series | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 53.80:1 | 128 |
| 2 | 0.718962 | 0.7189615829993272 second | 0.0 second | 0.7189615829993272 second | 0.7189615829993272 second | 1 | Spatial slice | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 53.80:1 | 128 |
| 3 | 2.883359 | 2.883359292000023 second | 0.0 second | 2.883359292000023 second | 2.883359292000023 second | 1 | Full scan | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(_tunable_attrs=set(), typesize=4, ... | 53.80:1 | 128 |
In [8]:
title = "Duration to load data for different access patterns"
plt = df.hvplot.bar(
    x="chunk_size",
    y="mean_time",
    by="access_pattern",
    width=1000,
    rot=45,
    title=title,
    ylabel="Duration (s)",
    xlabel="Chunk Size, Query type",
)
In [9]:
plt
Out[9]: [interactive bar chart: "Duration to load data for different access patterns"]
Note how much longer every query takes when the dataset is split into only a few massive chunks, especially for small subsets of the data such as a time series, spatial slice, or random point.
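As a rough worked comparison using the mean_time values from the table above (timings will vary between runs and environments), the random-point query against the massive-chunk array is roughly eight times slower than against the 25 MB-chunk array:

massive_point = 4.570757  # seconds: random point query, massive-chunk array
regular_point = 0.574409  # seconds: random point query, 25 MB-chunk array
print(f"{massive_point / regular_point:.1f}x slower")  # roughly 8x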