Open Formats and Tools#

We would be remiss not to discuss some of main building blocks for how analysis is performed in the geospatial industry. Open formats and tools play a crucial role in this regard. Firstly, they promote interoperability by ensuring that data can be seamlessly shared and utilized across diverse platforms, tools, and software applications. This interoperability fosters collaboration, facilitates data sharing, and enables integration with other systems. Moreover, adopting open formats enhances data accessibility, making geospatial data more readily available to a wider audience. This accessibility encourages innovation, research, and development within the geospatial community by lowering barriers to entry.

Additionally, open formats and tools foster community collaboration by creating a conducive environment for developers and users to contribute to the improvement and enhancement of geospatial tools and datasets. This community-driven approach often results in the development of more robust, efficient, and feature-rich solutions. Open formats also promote transparency in geospatial data processing by providing clear specifications and standards for data representation and exchange. This transparency enhances trust and reliability in the data and tools used for analysis.

Furthermore, open formats and tools offer flexibility and customization options for geospatial processing. Users can implement and/or fork these tools to their specific needs, adapt them to different workflows, and extend their functionality through plugins and extensions. Open tools are also designed to handle large-scale geospatial datasets efficiently, employing optimized algorithms, parallel processing techniques, and support for distributed computing. Many open formats and tools adhere to established geospatial standards, ensuring compatibility and interoperability with a wide range of geospatial systems and tools.

In short, open formats and tools are integral components of the geospatial ecosystem, enabling users to work with geospatial data effectively, efficiently, and collaboratively. They promote transparency, accessibility, and innovation within the community, driving advancements in geospatial technology and analysis.

A selection of current best practice tools#

To start, much of what is implemented within the geospatial industry is captured in this Geospatial Python workshop and this Cloud-Optimized Geospatial Formats Guide. It is highly recommended that anyone seeking to learn current standards in practice refer to these resources.

STAC#

STAC, which stands for SpatioTemporal Asset Catalog, is an open standard specification designed to organize and describe geospatial assets in a consistent and interoperable manner. It provides a framework for cataloging satellite imagery, aerial photography, and other geospatial data assets, along with their associated metadata, enabling easy discovery, access, and analysis of Earth observation data. You might have noticed that we have used this tool throughout the chapters thus far to access geospatial data. For example, in the python snippets like the following:

def fetch_sentinel_image(
    stac_url: str,
    collection: str,
    bbox: list,
    start_date: datetime,
    end_date: datetime,
    bands: list,
    max_cloud_cover: int,
    epsg: int
) -> xr.DataArray:
    """
    Fetches image data from a specified STAC catalog.

    Parameters:
        stac_url (str): URL of the STAC catalog.
        collection (str): Name of the collection to search within the STAC catalog.
        bbox (list): Bounding box coordinates [minx, miny, maxx, maxy].
        start_date (datetime): Start date of the time range to search for images.
        end_date (datetime): End date of the time range to search for images.
        bands (list): List of bands to fetch from Sentinel-2 data.
        max_cloud_cover (int): Maximum cloud cover percentage allowed for images.
        epsg (int): EPSG code for the desired coordinate reference system.

    Returns:
        xr.DataArray: Xarray DataArray containing the fetched image data.
    """
    catalog = pystac_client.Client.open(stac_url)

    area_of_interest = box(*bbox)

    search = catalog.search(
        collections=[collection],
        datetime=(start_date, end_date),
        intersects=area_of_interest,
        max_items=100,
        query={"eo:cloud_cover": {"lt": max_cloud_cover}},
    )
    
    ...

Now, you’re invited to learn a bit more about what it is and why it’s a crucial development in the geospatial industry.

STAC establishes a standard language and structure for describing geospatial assets, which is important because this facilitates data organization and sharing in a uniform format. This standardization, in turn, enhances interoperability and compatibility across diverse platforms, tools, and applications.

One of the most useful aspects of STAC is the extent to which metadata can be made comprehensive. The possibility to include information on geospatial assets, such as acquisition date, spatial extent, cloud cover, resolution, and data format (e.g. Cloud Optimized GeoTIFF) enable improved data discovery by allowing for rich search criteria. In effect, this detailed metadata makes it easier for users to locate relevant datasets based on specific temporal, spatial, or thematic criteria.

When working with STAC, users can query with spatiotemporal parameters to search for geospatial assets based on desired temporal and spatial properties, which is essential for identifying datasets pertinent to particular time periods, geographic regions, and domains of research. STAC also accommodates evolving data collections and catalog structures by supporting versioning and incremental updates. This flexibility enables data providers to manage changes to their datasets over time.

Generally, the STAC ecosystem seamlessly integrates with existing geospatial tools and workflows, offering APIs and client libraries for interacting with STAC-compliant catalogs (e.g. Rasterio, PySTAC Client, stackstac, Xarray and rioxarray). This integration simplifies the adoption of STAC within applications and services, fostering openness, collaboration, and innovation within the geospatial community. As an open standard developed through collaborative efforts, STAC benefits from transparent and inclusive development processes, with contributions from various stakeholders across the geospatial landscape.

Lastly, STAC is designed to scale with the increasing volume and diversity of geospatial data. It supports distributed architectures, cloud-based storage, and efficient indexing mechanisms, enabling catalog systems to handle large-scale datasets and high-volume queries with optimal performance. All this to say, STAC plays a vital role in modernizing and democratizing access to geospatial data, facilitating data discovery, interoperability, collaboration, and innovation within the geospatial community.

Xarray#

Xarray is a Python library designed for working with labeled multidimensional arrays, particularly well-suited for handling geospatial data due to its support for labeled dimensions, coordinate metadata, and advanced array operations. Xarray’s importance in the geospatial domain stems from its ability to efficiently manage and analyze large, multi-dimensional datasets commonly encountered in remote sensing and climate science applications. By representing data as labeled arrays with associated metadata, Xarray facilitates easy manipulation, slicing, and aggregation of geospatial datasets, enabling users to perform complex analysis tasks with ease. Its integration with other highly relied upon scientific Python libraries like NumPy, pandas, and Dask further enhances its functionality, allowing seamless interoperability with existing data analysis workflows.

As you’ve also seen throughout the chapters in this guide, Xarray can be used with STAC. More specifically, Xarray can be leveraged to ingest and process geospatial datasets described in STAC format, enabling users to load STAC catalogs, access individual assets, and perform analysis tasks using Xarray’s rich set of array operations. Natively and through a geospatial extension called rioxarray (which we’ll discuss in a bit), Xarray’s compatibility with STAC enhances its utility for geospatial data analysis.

Rasterio#

Rasterio is a Python library that provides functionality for reading, writing, and manipulating geospatial raster datasets. It serves as a bridge between raster data formats and Python’s scientific ecosystem, offering convenient tools for geospatial analysis and visualization. Rasterio’s importance lies in its ability to handle diverse raster formats, including popular ones like GeoTIFF and NetCDF, making it a versatile tool for working with different types of geospatial data. Rasterio provides an intuitive and efficient set of operations for common raster data manipulations such as tiling, resampling, reprojection, and computation of raster statistics. Its integration with other geospatial libraries like NumPy, matplotlib, and GDAL is at the core of its ease of adoption and wide utility, as this allows for seamless incorporation into existing geospatial workflows and applications. As a core raster processing library, Rasterio plays a crucial role in geospatial data processing and analysis.

rioxarray#

rioxarray is an extension of the Xarray library that adds geospatial capabilities for raster data processing. It serves as a bridge between the data manipulation features of Xarray and the spatial functionality provided by libraries like Rasterio and GDAL.

One of the key features of rioxarray is its ability to handle spatial metadata, such as coordinate reference systems (CRS), geotransform parameters, and spatial extent information, which are essential for geospatial analysis and visualization. In order to achieve this, rioxarray leverages the spatial metadata embedded in raster datasets such as that which can be accessed from STAC. The ability to interpret these parameters allows for georeferenced operations, such as reprojecting, resampling, and clipping, directly within the Xarray data structure.

Additionally, rioxarray simplifies reading and writing raster data in various file formats, including GeoTIFF, NetCDF, and HDF5. This flexibility enables further diversification of use in cross-domain workflows and analyses.

As you might have gathered by now after discussing Xarray, rioxarray offers interoperability with the SpatioTemporal Asset Catalog (STAC) standard. By using rioxarray with Xarray and STAC, efficient workflows that collectively discover, access, and process geospatial raster data can be built on top of a robust range of remote sensing and Earth observation datasets.

As the last tool mentioned in this non-exhaustive list of key tools, rioxarray can be summarized as being an enhancement to the geospatial capabilities of Xarray, of which its seamless integration with, and support for, spatial metadata and file formats make it an important tool for geospatial scientists, researchers, and data analysts working with raster data. Additionally, its interoperability with STAC provides a convenient way to access and analyze geospatial datasets from distributed and heterogeneous sources.