STAC at Scale

GSA GtCoP November 2024: Designing and maintaining STAC infrastructure for large-scale applications

Henry Rodman

2024-11-19

Development Seed

  • geospatial engineering and product design consultancy
  • 20 years of operations, with offices in Washington, DC and Lisbon, Portugal, and 54+ staff working across the globe

STAC refresher

SpatioTemporal Asset Catalog

  • STAC spec
  • STAC API spec
  • community-driven

Vision

The goal is for all providers of spatiotemporal assets (Imagery, SAR, Point Clouds, Data Cubes, Full Motion Video, etc) to expose their data as SpatioTemporal Asset Catalogs (STAC), so that new code doesn’t need to be written whenever a new data set or API is released.

  • Data providers don’t need to write new APIs
  • Data consumers don’t need to learn new APIs
  • Tool builders don’t need to build against new APIs

How is it going?

  • Providers: producing STAC metadata and hosting STAC APIs
  • Developers: building awesome applications
  • Consumers: pushing the limits of APIs and STAC spec
  • Community: updating the STAC spec to meet needs of providers, consumers, and developers

The STAC advantage: producers

Benefits for data producers

  • Standardized metadata
    • no need to invent custom schemas
    • use existing validation tools
    • re-use or create extensions for specialized needs
  • Flexible implementation
    • multiple database backends
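To make "standardized metadata" concrete, a STAC item is just JSON with a handful of spec-defined fields. A minimal sketch (the field names come from the STAC spec; the id, collection, and asset href below are made up):

```python
import json
from datetime import datetime, timezone

# A minimal STAC item as a plain dict. Field names follow the STAC spec;
# the id, collection name, and asset href are hypothetical.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-scene-001",
    "collection": "example-collection",
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [[-91.0, 47.5], [-90.6, 47.5], [-90.6, 47.8], [-91.0, 47.8], [-91.0, 47.5]]
        ],
    },
    "bbox": [-91.0, 47.5, -90.6, 47.8],  # always epsg:4326
    "properties": {"datetime": datetime(2023, 8, 9, tzinfo=timezone.utc).isoformat()},
    "assets": {
        "data": {
            "href": "s3://example-bucket/scene-001.tif",
            "type": "image/tiff; application=geotiff",
        }
    },
    "links": [],
}

print(json.dumps(item, indent=2))
```

Because every producer emits this same shape, generic tools (validators, APIs, browsers) work on any collection without custom code.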

Reduced development overhead

  • community benefits
    • shared development costs
    • battle-tested implementations
    • active ecosystem growth

The STAC advantage: developers

Benefits for developers

  • collaboration among many organizations
  • iteratively improve existing tools rather than building from scratch every time
  • dig into details of the spec to support as many use-cases as possible
  • re-use tested components!

STAC Browser

{
  "collections": [
    {
      "id": "sentinel1-grd",
      "type": "Collection",
      "links": [
        {
          "rel": "items",
          "type": "application/geo+json",
          "href": "https://dev.asdi-catalog.org/collections/sentinel1-grd/items"
        },
        {
          "rel": "parent",
          "type": "application/json",
          "href": "https://dev.asdi-catalog.org/"
        }
      ],
      "title": "Sentinel-1 GRD",
      "extent": {
        "spatial": {
          "bbox": [
            [-180, -90, 180, 90]
          ]
        },
        "temporal": {
          "interval": [
            [
              "2014-10-10T00:00:00Z",
              null
            ]
          ]
        }
      }
    }
  ]
}

Explore STAC Browser

Maxar imagery from Maui wildfires in August 2023

STAC at scale: a journey

Outline

LandsatLook

  • user-facing USGS application for visualizing mosaics of Landsat scenes

Microsoft Planetary Computer

  • massive geospatial data catalog
  • dynamic visualization capabilities

Amazon Sustainability Data Initiative

  • public science data
  • event-driven STAC metadata generation pipelines

LandsatLook

  • an early STAC-powered application
  • interactive Landsat imagery browser
  • spurred the development of MosaicJSON
  • inspired the next generation of database backends

Explore LandsatLook

MosaicJSON

  • pre-compute which STAC items are required for each XYZ tile
    • based on search criteria and item spatial extents
  • makes dynamic mosaic visualizations possible!
  • reduces burden on STAC API, but costly to create
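The precomputation can be sketched in plain Python: convert each XYZ tile to its lon/lat bounds, then record which items' bboxes intersect it. This is a stdlib-only toy stand-in for the actual MosaicJSON format, with a made-up item:

```python
import math

def tile_bounds(z: int, x: int, y: int):
    """Lon/lat bounds (west, south, east, north) of an XYZ (slippy map) tile."""
    n = 2 ** z
    def lon(tx): return tx / n * 360.0 - 180.0
    def lat(ty): return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * ty / n))))
    return (lon(x), lat(y + 1), lon(x + 1), lat(y))

def intersects(a, b):
    """True if two (west, south, east, north) boxes overlap."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]

def build_mosaic(items, z, x_range, y_range):
    """Map each XYZ tile to the ids of items whose bbox intersects it."""
    mosaic = {}
    for x in x_range:
        for y in y_range:
            bounds = tile_bounds(z, x, y)
            ids = [i["id"] for i in items if intersects(bounds, i["bbox"])]
            if ids:
                mosaic[f"{z}/{x}/{y}"] = ids
    return mosaic

# hypothetical item footprint near the North Shore of Lake Superior
items = [{"id": "scene-a", "bbox": (-91.0, 47.5, -90.6, 47.8)}]
mosaic = build_mosaic(items, 8, range(62, 65), range(88, 91))
print(mosaic)
```

The tiler then answers each tile request from this lookup table instead of issuing a fresh STAC search, which is why the approach trades API load for an up-front (and potentially expensive) precomputation step.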

Microsoft Planetary Computer

  • built on open standards
  • STAC API with many public datasets
  • continuous flow of new data
  • interactive visualization platform powered by STAC metadata

Explore Planetary Computer

pgstac origin story

How can we optimize for lightning-fast queries?

pgstac was born

pgstac

PostgreSQL schema and functions for Spatio-Temporal Asset Catalog (STAC)

strengths

  • it’s Postgres!
  • built to handle hundreds of millions of items
  • optimized for “needle in haystack” queries
  • hashed searches
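The idea behind hashed searches: hash a canonical encoding of the search body into a stable id, so a registered search can be referenced by hash instead of re-sending (and re-planning) the full query. A simplified stdlib-only illustration; pgstac's actual hashing lives inside the database:

```python
import hashlib
import json

def search_hash(search: dict) -> str:
    """Stable id for a STAC search body: hash a canonical JSON encoding.
    Sorting keys makes the hash independent of dict ordering."""
    canonical = json.dumps(search, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode()).hexdigest()

a = search_hash({"collections": ["sentinel1-grd"], "bbox": [-91.0, 47.5, -90.6, 47.8]})
b = search_hash({"bbox": [-91.0, 47.5, -90.6, 47.8], "collections": ["sentinel1-grd"]})
print(a == b)  # same search, different key order -> same id
```

The hash doubles as a compact, cache-friendly key, which is what makes it useful in tiling URLs (see titiler-pgstac below).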

weaknesses

  • requires some tinkering to achieve optimal performance
  • aggregation statistics can be somewhat slow

titiler-pgstac

  • dynamic mosaic visualizations from STAC collections
  • connects directly to pgstac database
    • (no API requests for STAC queries)
  • hashed searches!

Integrated mosaic tiling

Amazon Sustainability Data Initiative (ASDI)

  • sustainability data situated next to a compute platform
  • continuously updated
  • NODD: partnership with NOAA

ASDI Catalog

stactools-packages

The stactools-packages GitHub organization is a home for datasets and tool packages using the SpatioTemporal Asset Catalog (STAC) specification.

  • user-generated STAC metadata building packages
  • href -> STAC item
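The `href -> STAC item` pattern means each package exposes a `create_item`-style function that turns an asset href into a STAC item. A hypothetical stdlib-only sketch (real stactools packages open the file to read its actual footprint and datetime):

```python
import os.path
from datetime import datetime, timezone

def create_item(href: str) -> dict:
    """Sketch of the stactools create_item pattern: derive STAC item
    metadata from an asset href. A real package would open the file and
    read its footprint, datetime, and projection; here we fake them."""
    item_id = os.path.splitext(os.path.basename(href))[0]
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": None,  # a real package reads the footprint from the file
        "bbox": None,
        "properties": {"datetime": datetime.now(timezone.utc).isoformat()},
        "assets": {"data": {"href": href}},
        "links": [],
    }

item = create_item("s3://example-bucket/scene-001.tif")
print(item["id"])
```

Standardizing on this one function signature is what lets event-driven pipelines (like stactools-pipelines) fan out over new files and emit items with no per-dataset glue code.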

Browse stactools-packages

stactools-pipelines

eoAPI

State-of-the-art services to enable data discovery, search, visualization, and access.

The STAC advantage: consumers

  • don’t need to learn the file-path schema for each new dataset (or think about file operations at all)
  • can rely on tools like pystac-client for API operations and odc-stac for file I/O

Example workflow

  • thanks to the STAC spec we only need a few libraries to interact with data from any collection!
import odc.stac
import pyproj
import pystac_client
import rasterio
import xarray as xr
from shapely.geometry import box
from shapely.ops import transform

Define AOI

  • set up an AOI
crs = pyproj.CRS.from_string("epsg:5070")

# bounding box that surrounds the North Shore of Lake Superior
bbox = (373926, 2744693, 406338, 2765304)
aoi = box(*bbox)

# STAC items **always** store bounding box info in epsg:4326
transformer_4326 = pyproj.Transformer.from_crs(
    crs_from=crs,
    crs_to="epsg:4326",
    always_xy=True,
)

bbox_4326 = transform(transformer_4326.transform, aoi).bounds
print(bbox_4326)
(-91.05816506751593, 47.592675062538454, -90.61526491213358, 47.79503898873347)

odc.stac.load

  • load into an xarray.Dataset with odc.stac.load
  • don’t read any data yet, though (just metadata!)
dem_ds = odc.stac.load(
    stac_items,
    crs=crs,  # assets get reprojected to match crs
    resolution=30,
    chunks={},  # lazy eval
    x=(bbox[0], bbox[2]),  # limit to aoi
    y=(bbox[1], bbox[3]),
)
print(dem_ds)
<xarray.Dataset> Size: 3MB
Dimensions:      (y: 688, x: 1081, time: 1)
Coordinates:
  * y            (y) float64 6kB 2.765e+06 2.765e+06 ... 2.745e+06 2.745e+06
  * x            (x) float64 9kB 3.739e+05 3.74e+05 ... 4.063e+05 4.063e+05
    spatial_ref  int32 4B 5070
  * time         (time) datetime64[ns] 8B 2021-04-22
Data variables:
    data         (time, y, x) float32 3MB dask.array<chunksize=(1, 688, 1081), meta=np.ndarray>

xarray API

  • access to the xarray API for visualization and analysis
  • read data from cloud storage “just-in-time”
with rasterio.Env(aws_no_sign_request=True):
    dem_ds["data"].squeeze().plot.imshow()

New source, same tools

  • read another dataset from a different source
client = pystac_client.Client.open(
    "https://services.terrascope.be/stac"
)

# sometimes necessary before running a collection search
client.add_conforms_to("COLLECTIONS")
collection_search = client.collection_search(
    q="WorldCover", bbox=bbox_4326
)

stac_items = client.search(
    collections=next(collection_search.collections()).id,
    bbox=bbox_4326,
).item_collection()

Load land cover

landcover_ds = odc.stac.load(
    stac_items,
    crs=crs,  # assets get reprojected to match crs
    resolution=30,
    dtype="uint8",
    resampling="nearest",
    chunks={},  # lazy eval
    x=(bbox[0], bbox[2]),  # limit to aoi
    y=(bbox[1], bbox[3]),
)
print(landcover_ds)
<xarray.Dataset> Size: 758kB
Dimensions:                 (y: 688, x: 1081, time: 1)
Coordinates:
  * y                       (y) float64 6kB 2.765e+06 2.765e+06 ... 2.745e+06
  * x                       (x) float64 9kB 3.739e+05 3.74e+05 ... 4.063e+05
    spatial_ref             int32 4B 5070
  * time                    (time) datetime64[ns] 8B 2020-12-31T23:59:59
Data variables:
    ESA_WORLDCOVER_10M_MAP  (time, y, x) uint8 744kB dask.array<chunksize=(1, 688, 1081), meta=np.ndarray>

Merge datasets

  • combine the two Datasets into a single xarray.Dataset
merged = xr.merge(
    [
        landcover_ds.squeeze().rename({"ESA_WORLDCOVER_10M_MAP": "land_cover"}),
        dem_ds.squeeze().rename({"data": "elevation_m"}),
    ],
    compat="override",
)

print(merged)
<xarray.Dataset> Size: 4MB
Dimensions:      (y: 688, x: 1081)
Coordinates:
  * y            (y) float64 6kB 2.765e+06 2.765e+06 ... 2.745e+06 2.745e+06
  * x            (x) float64 9kB 3.739e+05 3.74e+05 ... 4.063e+05 4.063e+05
    spatial_ref  int32 4B 5070
    time         datetime64[ns] 8B 2020-12-31T23:59:59
Data variables:
    land_cover   (y, x) uint8 744kB dask.array<chunksize=(688, 1081), meta=np.ndarray>
    elevation_m  (y, x) float32 3MB dask.array<chunksize=(688, 1081), meta=np.ndarray>

Analyze disparate datasets

  • calculate mean elevation by land cover class
with rasterio.Env(aws_no_sign_request=True):
    stats = (
        merged.elevation_m.groupby(
            merged.land_cover
        ).mean().compute()
    )
   land_cover  elevation_m
0          10          476
1          20          513
2          30          459
3          40          353
4          50          289
5          60          403
6          80          281
7          90          487
8         100          484

Closing thoughts

  • STAC is working on a very large scale already
  • adoption by large organizations has been a huge accelerator
  • more STAC, please

Thank you!