Massive data chunks
There are several reasons to avoid overly large chunks in your datacubes:
- In most cases, a chunk is the minimum amount of data that must be fetched for any operation. Consider a datacube with 1 GB chunks: an application or library fetches the full gigabyte even if the user queries a single point represented by an 8-byte value. The user waits for the whole chunk to be fetched and decompressed, and, if the data leaves the provider's network, egress charges apply to the full chunk rather than the few bytes actually needed (see the sketch after this list).
- Many parallelization frameworks, such as Dask, operate on a chunk-by-chunk basis. Splitting your datacube into too few, overly large chunks means you cannot effectively leverage parallel computation.
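As a rough illustration of the first point, here is a minimal sketch. It is independent of the benchmark arrays used below: the toy shape, chunk layout, and in-memory store are made up for the example, and it assumes zarr-python ≥ 3, matching the zarr.config API used later in this notebook. Reading a single value still requires reading and decompressing the entire chunk that contains it:

```python
import numpy as np
import zarr

# Toy array stored as a single large chunk in memory (illustration only).
store = zarr.storage.MemoryStore()
arr = zarr.create_array(
    store,
    shape=(512, 512, 64),
    chunks=(512, 512, 64),  # one chunk covering the whole array
    dtype="float32",
)
arr[:] = np.random.default_rng(0).random(arr.shape, dtype="float32")

value = arr[0, 0, 0]  # request a single 4-byte value...
# ...but the store has to read and decompress the full ~67 MB chunk to serve it.
print(f"Returned {value.nbytes} bytes from a {arr.nbytes / 1e6:.0f} MB chunk")
```

With object storage instead of an in-memory store, that whole-chunk read also becomes a ~67 MB network transfer.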
In [1]:
import datacube_benchmark
import zarr
import pandas as pd
import hvplot.pandas # noqa
from azure.identity import DefaultAzureCredential
from obstore.auth.azure import AzureCredentialProvider
Set constants to use when comparing datacubes
In [2]:
config = datacube_benchmark.Config
config.target_array_size = "1 GB"
config.credential_provider = AzureCredentialProvider(
credential=DefaultAzureCredential()
)
config.warmup_samples = 1
config.create_data = False
In [3]:
zarr.config.set({"async.concurrency": config.zarr_concurrency})
Out[3]:
<donfig.config_obj.ConfigSet at 0x739745abc550>
Demonstrating the performance inefficiencies of overly large chunks
Create (or reuse) a Blosc-compressed array with 25 MB chunks.
In [4]:
url_for_reg_chunks = (
"https://datacubeguide.blob.core.windows.net/performance-testing/reg-chunks.zarr"
)
reg_chunk_arr = datacube_benchmark.create_or_open_zarr_array(
url_for_reg_chunks, target_chunk_size="25 megabyte", config=config
)
Create (or reuse) a Blosc-compressed array with 1 GB chunks.
In [5]:
url_for_massive_chunks = "https://datacubeguide.blob.core.windows.net/performance-testing/massive-chunks.zarr"
massive_chunk_arr = datacube_benchmark.create_or_open_zarr_array(
url_for_massive_chunks, target_chunk_size="1 GB", config=config
)
Test the time required to load a random point, a time series, a spatial slice, and a full scan from each array.
In [6]:
massive_chunk_results = datacube_benchmark.benchmark_access_patterns(
massive_chunk_arr,
num_samples=config.num_samples,
warmup_samples=config.warmup_samples,
).reset_index(drop=True)
reg_chunks_results = datacube_benchmark.benchmark_access_patterns(
reg_chunk_arr,
num_samples=config.num_samples,
warmup_samples=config.warmup_samples,
).reset_index(drop=True)
In [7]:
df = pd.concat([massive_chunk_results, reg_chunks_results])
df["access_pattern"] = df["access_pattern"].replace(
{
"point": "Random point",
"time_series": "Time series",
"spatial_slice": "Spatial slice",
"full": "Full scan",
}
)
df["mean_time"] = df.apply(lambda row: float(row["mean_time"].magnitude), axis=1)
df["chunk_size"] = df.apply(
    lambda row: f"{row['chunk_size'].to('MB').magnitude:,.2f} (MB)", axis=1
)
df
Out[7]:
| | mean_time | median_time | std_time | min_time | max_time | total_samples | access_pattern | array_shape | chunk_shape | chunk_size | nchunks | shard_shape | array_dtype | array_size_memory | array_size_storage | array_compressors | compression_ratio | zarr_concurrency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
0 | 0.586832 | 0.5868318689999796 second | 0.0 second | 0.5868318689999796 second | 0.5868318689999796 second | 1 | Random point | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 35.34:1 | 128 |
1 | 0.444508 | 0.4445084300000417 second | 0.0 second | 0.4445084300000417 second | 0.4445084300000417 second | 1 | Time series | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 35.34:1 | 128 |
2 | 0.492603 | 0.49260268600005475 second | 0.0 second | 0.49260268600005475 second | 0.49260268600005475 second | 1 | Spatial slice | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 35.34:1 | 128 |
3 | 1.017236 | 1.0172359650000544 second | 0.0 second | 1.0172359650000544 second | 1.0172359650000544 second | 1 | Full scan | (965, 360, 720) | (630, 360, 630) | 571.54 (MB) | 4 | None | float32 | 1.000512 gigabyte | 0.02831126 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 35.34:1 | 128 |
0 | 0.027823 | 0.027823419000014837 second | 0.0 second | 0.027823419000014837 second | 0.027823419000014837 second | 1 | Random point | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 53.80:1 | 128 |
1 | 0.072818 | 0.07281813499980672 second | 0.0 second | 0.07281813499980672 second | 0.07281813499980672 second | 1 | Time series | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 53.80:1 | 128 |
2 | 0.052765 | 0.05276520500001425 second | 0.0 second | 0.05276520500001425 second | 0.05276520500001425 second | 1 | Spatial slice | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 53.80:1 | 128 |
3 | 0.921520 | 0.921520467000164 second | 0.0 second | 0.921520467000164 second | 0.921520467000164 second | 1 | Full scan | (965, 360, 720) | (185, 185, 185) | 25.33 (MB) | 48 | None | float32 | 1.000512 gigabyte | 0.018597328 gigabyte | (BloscCodec(typesize=4, cname=<BloscCname.zstd... | 53.80:1 | 128 |
In [8]:
title = "Duration to load data for different access patterns"
plt = df.hvplot.bar(
x="chunk_size",
y="mean_time",
by="access_pattern",
width=1000,
rot=45,
title=title,
ylabel="Duration (s)",
xlabel="Chunk Size, Query type",
)
In [9]:
plt
Out[9]:
(Bar chart: mean duration to load data for different access patterns, grouped by chunk size and query type.)
Note how much longer the queries take when the dataset is stored as a handful of massive (~570 MB) chunks, especially for small subsets of the data such as a time series, spatial slice, or random point.
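If you need to go the other way and turn a chunk-size budget into a chunk shape, a back-of-the-envelope calculation is often a good starting point. The sketch below uses a hypothetical helper (not part of datacube_benchmark); for a ~25 MB budget of float32 data it lands close to the (185, 185, 185) chunk shape used above.

```python
import numpy as np

def roughly_cubic_chunk_edge(target_bytes: float, dtype: str) -> int:
    """Edge length of a roughly cubic chunk whose uncompressed size is ~target_bytes."""
    elements = target_bytes / np.dtype(dtype).itemsize
    return round(elements ** (1 / 3))

edge = roughly_cubic_chunk_edge(25e6, "float32")  # ~184 elements per axis
size_mb = edge**3 * np.dtype("float32").itemsize / 1e6
print(f"({edge}, {edge}, {edge}) chunks ≈ {size_mb:.1f} MB uncompressed")
```

In practice you would also bias the chunk shape toward the access patterns you expect, for example longer along the time axis for time-series queries or a larger spatial footprint for map-style reads.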