Best Practices for Third Party Data Delivery

Authenticating with AWS

Obtain Temporary Credentials

What

When a data provider intends to connect to our AWS account, it should first obtain a set of temporary credentials. These credentials will be scoped to an AWS IAM Role defined on the destination AWS account, with permissions allowing writes to the destination S3 Bucket.

Why

No long-lasting credentials

Temporary credentials serve as an alternative to producing long-lasting credentials. The use of long-lasting credentials is generally discouraged, partly due to the increased management complexity: for example, rotating long-lasting credentials requires a non-trivial amount of coordination between parties. Additionally, long-lasting credentials often conflict with managed cloud platforms' security policies.

File ownership

An alternative strategy to providing credentials to the data provider could be to simply update our S3 Bucket policy to permit writes by the third-party provider's AWS account. However, with this technique the data provider would own the delivered files. Even if the data provider applies the bucket-owner-full-control ACL to the files, problems can still arise. For example, the recipient account will run into permissions issues if it attempts to share the third-party data with another AWS account.

How

Obtaining temporary credentials is achieved via the AWS STS AssumeRole operation.
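
A minimal sketch of assuming the delivery role with boto3 is shown below; the role ARN and session name are placeholders that the destination account would supply.

Example
import boto3

# Assume the delivery role shared by the destination account.
# The ARN below is a placeholder, not a real role.
sts = boto3.client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/data-provider-delivery",
    RoleSessionName="data-delivery",
)
credentials = response["Credentials"]

# Use the temporary credentials to create an S3 client for uploads.
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)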

Additional Reading

Delivering Data

Manifest File

What

A manifest file must accompany the delivery of datasets.

Why

Manifest files are often used as a trigger to begin the ingestion process for the delivered files. As such, it is recommended that they have a constant filename (e.g. manifest.json) and are the final file uploaded in a delivery. It is also recommended that any tooling built to process these files include logic to wait a reasonable amount of time in the event that expected files are missing; if manifests are delivered after all other files, however, this fallback will rarely be exercised.
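
As a minimal sketch of the recommended ordering (the bucket and prefix names here are hypothetical), data files are uploaded first and the constant-named manifest last:

Example
import boto3

s3 = boto3.client("s3")
bucket = "destination-bucket"  # hypothetical
prefix = "deliveries/delivery-0001"  # hypothetical

# Upload all data files first...
for path in ["scene.tif", "scene-metadata.xml"]:
    s3.upload_file(path, bucket, f"{prefix}/{path}")

# ...then upload the manifest last to signal that the delivery is complete.
s3.upload_file("manifest.json", bucket, f"{prefix}/manifest.json")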

How

Data providers may have their own manifest format. These manifest files should, at minimum, contain paths to all files in the delivery.
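
For instance, a provider-defined manifest could be as simple as the following sketch; the field names here are illustrative, not a standard.

Example
import json

# A minimal manifest: the paths of every delivered file.
manifest = {
    "files": [
        "deliveries/delivery-0001/scene.tif",
        "deliveries/delivery-0001/scene-metadata.xml",
    ]
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)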

If a data provider does not have an established pattern for delivery manifests, it is recommended that they implement the STAC Item specification for each item that shares a geometry and timestamp.
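
A minimal sketch of a STAC Item for a single asset is shown below; the id, geometry, timestamp, and asset values are all placeholders.

Example
# A minimal STAC Item (spec version 1.0.0) for one asset sharing a
# geometry and timestamp; all values are placeholders.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "scene-0001",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]],
    },
    "bbox": [0.0, 0.0, 1.0, 1.0],
    "properties": {"datetime": "2023-01-01T00:00:00Z"},
    "links": [],
    "assets": {
        "data": {"href": "./scene.tif", "type": "image/tiff; application=geotiff"}
    },
}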

Verified Integrity

What

The integrity of the uploaded data should be verified at time of upload.

How

Amazon S3 provides two techniques for verifying the integrity of uploaded objects.

Preferred: Tagged with trailing checksums

From the S3 Docs:

When uploading objects to Amazon S3, you can either provide a precalculated checksum for the object or use an AWS SDK to automatically create trailing checksums on your behalf. If you decide to use a trailing checksum, Amazon S3 automatically generates the checksum by using your specified algorithm and uses it to validate the integrity of the object during upload.

The AWS SDKs provide functionality to generate the desired checksum at time of upload.

Example
import boto3

# Ask the SDK to compute a trailing SHA-256 checksum during upload;
# S3 uses it to validate the object's integrity server-side.
with open(file_path, "rb") as f:
    boto3.client("s3").put_object(
        Bucket=bucket_name,
        Key=filename,
        Body=f,
        ChecksumAlgorithm="SHA256",
    )

However, if a known checksum is available, it may be preferable to provide that manually.

Example
import boto3

# Supply a precalculated checksum; S3 rejects the upload if it does not
# match. ChecksumSHA256 expects the base64-encoded SHA-256 digest.
with open(file_path, "rb") as f:
    boto3.client("s3").put_object(
        Bucket=bucket_name,
        Key=filename,
        Body=f,
        ChecksumSHA256=checksum,
    )

With this technique, S3 objects will contain metadata reporting their checksums. This is available via the GetObjectAttributes API.

Example
import boto3

# Retrieve the stored checksum from the object's attributes.
r = boto3.client("s3").get_object_attributes(
    Bucket=bucket,
    Key=key,
    ObjectAttributes=["Checksum"],
)
print(r["Checksum"]["ChecksumSHA256"])

Legacy: Uploaded with Content-MD5 header

From the AWS S3 PutObject API documentation:

To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error.
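
A minimal sketch of this legacy approach with boto3 follows, reusing the hypothetical file_path, bucket_name, and filename variables from the examples above; the MD5 digest is computed locally and passed base64-encoded.

Example
import base64
import hashlib

import boto3

# Compute the MD5 digest locally and pass it base64-encoded so that
# S3 can verify the object against it and reject corrupted uploads.
with open(file_path, "rb") as f:
    body = f.read()

md5_b64 = base64.b64encode(hashlib.md5(body).digest()).decode()

boto3.client("s3").put_object(
    Bucket=bucket_name,
    Key=filename,
    Body=body,
    ContentMD5=md5_b64,
)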

Additional Reading

Prefixed by unique value

What

All files within a delivery should share a unique prefix.

Why

By placing each delivery under a unique prefix, files with common names (e.g. manifest.json) won't conflict across deliveries. Additionally, S3 applies request rate limits per prefix within a bucket; using a prefix per delivery helps avoid hitting those limits.
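
A minimal sketch of a prefix-per-delivery layout is shown below; the bucket name and the UUID-based prefix scheme are illustrative choices, not requirements.

Example
import uuid

import boto3

s3 = boto3.client("s3")
bucket = "destination-bucket"  # hypothetical
# Generate a unique prefix for this delivery.
prefix = f"deliveries/{uuid.uuid4()}"

# All files in the delivery share the prefix; the manifest goes last.
for path in ["scene.tif", "manifest.json"]:
    s3.upload_file(path, bucket, f"{prefix}/{path}")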

Additional Reading