HLS Example (https)¶
The Harmonized Landsat Sentinel 2 dataset contains grid-aligned cloud-optimized geotiffs that are great for deep time series analysis because it contains the combined archives from Landsat 8/9 and Sentinel 2A/B/C. NASA MAAP maintains a stac-geoparquet archive of all of the HLS STAC records. This archive solves a common bottleneck that users have encountered when trying to work with HLS STAC records at scale by eliminating the need to query the CMR STAC API!
This notebook demonstrates how to use lazycogs to interact with the HLS STAC Geoparquet Archive in the optimal way.
Note: This tutorial shows how to access the HLS granules using https links. If you are able to run your compute in the AWS us-west-2 region, try the method outlined in HLS Example (us-west-2 only).
import asyncio
import os
from urllib.parse import urlparse
import requests
from async_geotiff import GeoTIFF
from obstore.exceptions import NotFoundError, PermissionDeniedError
from obstore.store import HTTPStore, LocalStore, from_url
from pyproj import Transformer
from rustac import DuckdbClient
import lazycogs
HLS_STAC_GEOPARQUET_PREFIX = "s3://nasa-maap-data-store/file-staging/nasa-map"
HLS_STAC_GEOPARQUET_PATH_FORMAT = (
"hls-stac-geoparquet-archive/v2/{collection}"
"/year={year}/month={month}/{collection}-{year}-{month}.parquet"
)
HLS_STAC_GEOPARQUET_HREF = (
HLS_STAC_GEOPARQUET_PREFIX + "/" + HLS_STAC_GEOPARQUET_PATH_FORMAT
)
HLS_STAC_GEOPARQUET_HIVE_FORMAT = (
"hls-stac-geoparquet-archive/v2/{collection}/**/*.parquet"
)
Download some of the HLS STAC Geoparquet Archive¶
The HLS STAC Geoparquet Archive is stored in a pair of hive-partitioned parquet dataset (one each for HLSS30 and HLSL30). The files are publicly accessible but duckdb requires ListBucket privileges in order to use the hive-partitioning structure. Since ListBucket privileges are not granted to the general public, an alternative is to copy the files you want to a store for which you have full privileges (e.g. your own bucket, local disk).
The following example will download all of the available files for the years 2025 and 2026 (as of May 13, 2026 this is ~1.4 GB).
hls_s3_parquet_store = from_url(
HLS_STAC_GEOPARQUET_PREFIX,
region="us-west-2",
skip_signature=True,
)
hls_local_parquet_store = LocalStore(prefix="data")
hls_stac_geoparquet_paths = [
HLS_STAC_GEOPARQUET_PATH_FORMAT.format(
collection=collection,
year=year,
month=month,
)
for collection in ["HLSS30_2.0", "HLSL30_2.0"]
for year in [2025, 2026]
for month in range(1, 13)
]
async def _check_exists(store, path) -> bool:
try:
_ = await store.head_async(path)
except (FileNotFoundError, NotFoundError, PermissionDeniedError):
return False
else:
return True
async def download_if_missing(path):
exists_on_s3 = await _check_exists(hls_s3_parquet_store, path)
exists_locally = await _check_exists(hls_local_parquet_store, path)
if exists_on_s3 and not exists_locally:
resp = await hls_s3_parquet_store.get_async(path)
await hls_local_parquet_store.put_async(path, resp)
tasks = [download_if_missing(p) for p in hls_stac_geoparquet_paths]
_ = await asyncio.gather(*tasks)
Set up a duckdb client for rustac¶
rustac can be configured to search your local copy of the HLS STAC Geoparquet Archive, taking advantage of the hive-partitioned structure.
duckdb_client = DuckdbClient(use_hive_partitioning=True)
Define the array's spatial extent¶
The HLS granules are projected in UTM coordinate reference systems. To access the underlying data in an undistorted efficient manner the best way is to build an array for a single UTM coordinate system that is aligned with the grid of the COGs themselves.
We can define an AOI for an entire UTM zone like this:
epsg_code = "32615"
utm_zone = 15
dst_crs = f"epsg:{epsg_code}"
east_edge = (utm_zone * 6) - 180
west_edge = east_edge - 6
bbox_4326 = (west_edge, 0, east_edge, 84)
# transform to epsg:4326 for STAC search
transformer = Transformer.from_crs("epsg:4326", dst_crs, always_xy=True)
dst_bbox = transformer.transform_bounds(*bbox_4326)
print(bbox_4326)
print(dst_bbox)
(-96, 0, -90, 84) (166021.44308053772, 0.0, 833978.5569194595, 9329005.182447437)
Retrieve a single STAC item so we can snap our approximate bounding box to the HLS granule grid.
sample_item = duckdb_client.search(
"data/" + HLS_STAC_GEOPARQUET_HIVE_FORMAT.format(collection="HLSS30_2.0"),
bbox=bbox_4326,
filter={"op": "=", "args": [{"property": "proj:epsg"}, epsg_code]},
max_items=1,
)[0]
Use the proj:transform property to align our rough bounding box with the HLS grid
dst_bbox_aligned = lazycogs.align_bbox(
sample_item["properties"]["proj:transform"],
dst_bbox,
)
print(f"aligned bbox: {dst_bbox_aligned}")
aligned bbox: (166020.0, 0.0, 834000.0, 9329010.0)
Configure access to the HLS COGs over HTTP¶
The STAC assets in the stac-geoparquet file contain https hrefs but they require an EDL token to access. obstore's HTTPStore allows you to provide the Authorization: Bearer <token> header via default_headers in the client_options dict.
This will work for you if you have your NASA Earthdata credentials stored in the EARTHDATA_USERNAME and EARTHDATA_PASSWORD environment variables.
def get_earthdata_token(username: str, password: str) -> str:
r = requests.post(
"https://urs.earthdata.nasa.gov/api/users/find_or_create_token",
auth=(username, password),
timeout=10,
)
r.raise_for_status()
return r.json()["access_token"]
token = get_earthdata_token(
os.getenv("EARTHDATA_USERNAME"),
os.getenv("EARTHDATA_PASSWORD"),
)
store_prefix = "https://data.lpdaac.earthdatacloud.nasa.gov"
store = HTTPStore(
store_prefix,
client_options={
"default_headers": {
"Authorization": f"Bearer {token}",
},
},
)
Inspect a sample COG¶
We can use async-geotiff to inspect a sample COG and its overviews:
geotiff = await GeoTIFF.open(
urlparse(sample_item["assets"]["B04"]["href"]).path,
store=store,
)
resolutions = [geotiff.res[0]]
print("full resolution:", geotiff.res)
for i, overview in enumerate(geotiff.overviews, start=1):
resolutions.append(overview.res[0])
print("overview", i, overview.res)
print("nodata value:", geotiff.nodata)
print("data type:", geotiff.dtype)
full resolution: (30.0, 30.0) overview 1 (60.0, 60.0) overview 2 (120.0, 120.0) overview 3 (239.73799126637556, 239.73799126637556) overview 4 (479.4759825327511, 479.4759825327511) nodata value: -9999.0 data type: int16
Open a massive array¶
Now open an array for the full resolution data for the year 2025. To load a lower-resolution array you can select a different value from the resolutions list and lazycogs will use the COGs' overviews instead of the full-resolution data. If you pick something in between one of the listed resolutions lazycogs will have to warp the arrays and do a nearest-neighbor interpolation.
stac_search_args = {
"href": "data/" + HLS_STAC_GEOPARQUET_HIVE_FORMAT.format(collection="HLSS30_2.0"),
"filter": {"op": "=", "args": [{"property": "proj:epsg"}, epsg_code]},
}
da = lazycogs.open(
crs=dst_crs,
bbox=dst_bbox_aligned,
resolution=30,
time_period="P1D", # one time coordinate per day
bands=["B04", "B03", "B02"],
dtype=geotiff.dtype,
nodata=geotiff.nodata,
store=store,
**stac_search_args,
)
da
<xarray.DataArray (band: 3, time: 494, y: 310967, x: 22266)> Size: 21TB
[10261354991004 values with dtype=int16]
Coordinates:
* band (band) <U3 36B 'B04' 'B03' 'B02'
* time (time) datetime64[s] 4kB 2025-01-01 2025-01-02 ... 2026-05-09
* y (y) float64 2MB 9.329e+06 9.329e+06 9.329e+06 ... 45.0 15.0
* x (x) float64 178kB 1.66e+05 1.661e+05 ... 8.34e+05 8.34e+05
spatial_ref int64 8B 0
Indexes:
┌ x RasterIndex (crs=EPSG:32615)
└ y
Attributes:
grid_mapping: spatial_ref
zarr_conventions: [{'schema_url': 'https://raw.githubusercontent.c...
spatial:dimensions: ['y', 'x']
spatial:bbox: (166020.0, 0.0, 834000.0, 9329010.0)
spatial:transform_type: affine
spatial:transform: [30.0, 0.0, 166020.0, 0.0, -30.0, 9329010.0]
spatial:shape: [310967, 22266]
spatial:registration: pixel
_stac_backend: MultiBandStacBackendArray(bands=['B04', 'B03', '...
_stac_time_coords: 2025-01-01 … 2026-05-09 (n=494)
proj:code: EPSG:32615Extract all values for a single point.
%%time
pt = da.sel(
x=576465,
y=5196159,
method="nearest",
).load()
pt
CPU times: user 1min 53s, sys: 8.13 s, total: 2min 1s Wall time: 36.4 s
<xarray.DataArray (band: 3, time: 494)> Size: 3kB
array([[-9999, -9999, 5404, ..., -9999, 1649, -9999],
[-9999, -9999, 5062, ..., -9999, 1705, -9999],
[-9999, -9999, 4619, ..., -9999, 1721, -9999]],
shape=(3, 494), dtype=int16)
Coordinates:
* band (band) <U3 36B 'B04' 'B03' 'B02'
* time (time) datetime64[s] 4kB 2025-01-01 2025-01-02 ... 2026-05-09
spatial_ref int64 8B 0
x float64 8B 5.765e+05
y float64 8B 5.196e+06
Attributes:
grid_mapping: spatial_ref
zarr_conventions: [{'schema_url': 'https://raw.githubusercontent.c...
spatial:dimensions: ['y', 'x']
spatial:bbox: (166020.0, 0.0, 834000.0, 9329010.0)
spatial:transform_type: affine
spatial:transform: [30.0, 0.0, 166020.0, 0.0, -30.0, 9329010.0]
spatial:shape: [310967, 22266]
spatial:registration: pixel
_stac_backend: MultiBandStacBackendArray(bands=['B04', 'B03', '...
_stac_time_coords: 2025-01-01 … 2026-05-09 (n=494)
proj:code: EPSG:32615Operate on a subset in space and time¶
Select a spatial subset for the days with non-nodata values in the point extraction.
subset = da.sel(
x=slice(550_000, 600_000),
y=slice(5_200_000, 5_150_000),
time=pt.time.values[pt.isel(band=0).values != geotiff.nodata],
).sel(time=slice("2026-03-01", "2026-03-11"))
subset
<xarray.DataArray (band: 3, time: 4, y: 1668, x: 1668)> Size: 67MB
[33386688 values with dtype=int16]
Coordinates:
* band (band) <U3 36B 'B04' 'B03' 'B02'
* time (time) datetime64[s] 32B 2026-03-01 2026-03-04 ... 2026-03-11
* y (y) float64 13kB 5.2e+06 5.2e+06 5.2e+06 ... 5.15e+06 5.15e+06
* x (x) float64 13kB 5.5e+05 5.5e+05 5.501e+05 ... 6e+05 6e+05
spatial_ref int64 8B 0
Indexes:
┌ x RasterIndex (crs=EPSG:32615)
└ y
Attributes:
grid_mapping: spatial_ref
zarr_conventions: [{'schema_url': 'https://raw.githubusercontent.c...
spatial:dimensions: ['y', 'x']
spatial:bbox: (166020.0, 0.0, 834000.0, 9329010.0)
spatial:transform_type: affine
spatial:transform: [30.0, 0.0, 166020.0, 0.0, -30.0, 9329010.0]
spatial:shape: [310967, 22266]
spatial:registration: pixel
_stac_backend: MultiBandStacBackendArray(bands=['B04', 'B03', '...
_stac_time_coords: 2025-01-01 … 2026-05-09 (n=494)
proj:code: EPSG:32615%%time
_ = subset.load()
CPU times: user 4.91 s, sys: 1.48 s, total: 6.39 s Wall time: 7.78 s
(
subset.plot.imshow(
rgb="band",
col="time",
col_wrap=2,
vmin=0,
vmax=2000,
size=4,
aspect=subset.shape[3] / subset.shape[2],
)
)
<xarray.plot.facetgrid.FacetGrid at 0x7f1e48e99a90>