Our ML Tooling 2022 - Part 2: Data Operations



Fig 1. Our updated ML Tooling diagram, zoomed in on the Data Operations section.

Fig 2. The result of our data pipeline that queries a STAC API. The x-axis is longitude and the y-axis is latitude. The "VV" label on the color bar indicates the polarization (vertical transmit, vertical receive) of the C-band radar image.

Fig 3. An example data pipeline that converts a STAC query into tuples of Torch tensors for the image and label pairs. TorchData lets each step in this datapipe stream from cloud storage and supports multiprocessing.

Image by Development Seed

Fig 4. LULC labeling with JOSM. With OSM-seed we can store annotations in a database, allowing our Data Team to divide tasks and conquer large annotation challenges!

Image by Development Seed

Fig 5. Annotating wildlife in Tanzania from aerial imagery for the AI-Assisted Aerial Imagery Analysis (AIAIA) to Map Human-Wildlife Proximity in Tanzania project.

Fig 6. FiftyOne, a tool for visualizing annotated image samples.

  • 1. Use STAC

    Find an existing STAC API that serves your dataset. If your imagery and labels aren’t in a STAC API, you can use open source tools like stactools to create STAC items that can be ingested into a STAC API.

  • 2. Use pystac_client and stackstac

    to load STAC Items into Xarray objects for visualization, analysis, and inspection.

  • 3. Use TorchData and Zen3Geo

    to convert Xarray structures to Torch tensors.
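The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production pipeline: the helper names `build_search_params` and `stac_to_tensor` are hypothetical, as are all of their parameters, and you would supply your own STAC API URL, collection, and asset names. For brevity, the last step converts the Xarray object to a tensor directly with `torch.from_numpy` rather than going through TorchData and Zen3Geo datapipes, so it loads everything eagerly instead of streaming.

```python
from typing import Any, Sequence


def build_search_params(
    collection: str,
    bbox: Sequence[float],
    datetime: str,
    max_items: int = 10,
) -> dict[str, Any]:
    """Assemble keyword arguments for pystac_client's Client.search().

    Hypothetical helper. bbox is (west, south, east, north) in degrees;
    this sketch does not handle boxes that cross the antimeridian.
    """
    west, south, east, north = bbox
    if not (west < east and south < north):
        raise ValueError("bbox must be (west, south, east, north)")
    return {
        "collections": [collection],
        "bbox": [west, south, east, north],
        "datetime": datetime,  # e.g. an ISO 8601 interval "start/end"
        "max_items": max_items,
    }


def stac_to_tensor(
    api_url: str,
    params: dict[str, Any],
    assets: Sequence[str],
    epsg: int,
    resolution: float,
):
    """Query a STAC API, stack the items with stackstac, return a Torch tensor.

    Hypothetical helper. Third-party imports live inside the function so the
    module can be imported without them. Requires network access to the API.
    """
    import pystac_client
    import stackstac
    import torch

    # Step 1: search an existing STAC API for matching items.
    catalog = pystac_client.Client.open(api_url)
    items = catalog.search(**params).item_collection()

    # Step 2: lazily stack the items into a (time, band, y, x) DataArray.
    cube = stackstac.stack(
        items,
        assets=list(assets),
        epsg=epsg,
        resolution=resolution,  # in the units of the chosen EPSG
        bounds_latlon=params["bbox"],
    )

    # Step 3 (simplified): compute the array eagerly and wrap it as a tensor.
    return torch.from_numpy(cube.values)
```

Calling `stac_to_tensor` with the URL of your STAC API and the output of `build_search_params` should return a four-dimensional tensor ordered as time, band, y, x. For datasets that do not fit in memory, the datapipe approach in step 3 is the better fit because it streams samples instead of materializing the whole cube.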

What we're doing.