Best Practices for Third Party Data Delivery
Authenticating with AWS
Obtain Temporary Credentials
What
When a data provider intends to deliver data to our AWS account, it should first obtain a set of temporary credentials. These credentials will be scoped to an AWS IAM Role defined on the destination AWS account, with permissions allowing writes to the destination S3 Bucket.
Why
No long-lasting credentials
Temporary credentials serve as an alternative to producing long-lasting credentials. The use of long-lasting credentials is generally discouraged, partly due to the increased management complexity. For example, rotating long-lasting credentials requires a non-trivial amount of coordination between parties. Additionally, long-lasting credentials often run afoul of managed cloud platforms' security policies.
File ownership
An alternative strategy to providing credentials to the data provider would be to simply update our S3 Bucket policy to permit writes from the third-party provider's AWS account. However, with this technique the data provider would be the owner of the delivered files. Even if the data provider applies the bucket-owner-full-control ACL to the files, problems can still arise. For example, the recipient account will run into permissions issues if it attempts to share the third-party data with another AWS account.
How
Obtaining temporary credentials is achieved via the AWS STS AssumeRole operation.
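For instance, a data provider using boto3 might assume the role as follows. This is a minimal sketch; the role ARN and session name are illustrative placeholders.
Example
import boto3

# Assume the delivery role defined on the destination AWS account.
sts = boto3.client("sts")
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ThirdPartyDeliveryRole",  # placeholder
    RoleSessionName="data-delivery",  # placeholder
    DurationSeconds=3600,
)
credentials = response["Credentials"]

# Use the temporary credentials for subsequent S3 operations.
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)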
Additional Reading
- [AWS IAM Docs] Providing access to AWS accounts owned by third parties
- [AWS Knowledge Center] How can I provide cross-account access to objects that are in Amazon S3 buckets?
Delivering Data
Manifest File
What
A manifest file must accompany the delivery of datasets.
Why
Manifest files are often used as a trigger to begin the ingestion process for the delivered files. As such, it is recommended that they have a constant filename (e.g. manifest.json) and are the final file uploaded in a delivery. It is also recommended that any tooling built to process these files include logic to wait a reasonable amount of time in the event that any expected files are missing; however, if manifests are delivered after all other files, this fallback will rarely be exercised.
How
Data providers may have their own manifest format. These manifest files should at the very minimum contain paths to all files that are to be delivered.
If the data providers do not have an established pattern for delivery manifests, it is recommended that they implement the STAC Item specification for each item that shares a geometry & timestamp.
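As a rough sketch of the upload ordering and manifest contents (bucket_name, delivery_prefix, and files_to_deliver are hypothetical names, and the flat JSON layout is only one possible format), a delivery script might record every uploaded key and upload the manifest last:
Example
import json

import boto3

s3 = boto3.client("s3")
delivered_keys = []

# Upload all data files first, recording each key for the manifest.
for file_path, key in files_to_deliver:
    s3.upload_file(file_path, bucket_name, key)
    delivered_keys.append(key)

# Upload the manifest last so that it can safely trigger ingestion.
s3.put_object(
    Bucket=bucket_name,
    Key=f"{delivery_prefix}/manifest.json",
    Body=json.dumps({"files": delivered_keys}).encode("utf-8"),
)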
Verified Integrity
What
The integrity of uploaded data should be verified at time of upload.
How
AWS S3 provides two techniques for verifying the integrity of uploaded objects.
Preferred: Tagged with trailing checksums
From the S3 Docs:
When uploading objects to Amazon S3, you can either provide a precalculated checksum for the object or use an AWS SDK to automatically create trailing checksums on your behalf. If you decide to use a trailing checksum, Amazon S3 automatically generates the checksum by using your specified algorithm and uses it to validate the integrity of the object during upload.
The AWS SDKs provide functionality to generate the desired checksum at time of upload.
Example
import boto3

# Upload the file, asking the SDK to compute a trailing SHA-256 checksum.
with open(file_path, "rb") as f:
    boto3.client("s3").put_object(
        Bucket=bucket_name,
        Key=filename,
        Body=f,
        ChecksumAlgorithm="SHA256",
    )
However, if a known checksum is available, it may be preferable to provide that manually.
Example
import boto3

# Upload the file with a precalculated checksum; S3 validates it on receipt.
# Note: `checksum` must be the base64-encoded SHA-256 digest of the object.
with open(file_path, "rb") as f:
    boto3.client("s3").put_object(
        Bucket=bucket_name,
        Key=filename,
        Body=f,
        ChecksumSHA256=checksum,
    )
With this technique, S3 objects will contain metadata reporting their checksums. This is available via the get-object-attributes API endpoint.
Example
import boto3

# Retrieve the stored checksum metadata for a previously uploaded object.
r = boto3.client("s3").get_object_attributes(
    Bucket=bucket,
    Key=key,
    ObjectAttributes=["Checksum"],
)
print(r["Checksum"]["ChecksumSHA256"])
Legacy: Uploaded with Content-MD5 header
From the AWS S3 PutObject API documentation:
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error.
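As a sketch of this legacy approach (file_path, bucket_name, and filename are placeholders as above, and the file is read into memory for simplicity):
Example
import base64
import hashlib

import boto3

with open(file_path, "rb") as f:
    data = f.read()

# Content-MD5 must be the base64-encoded 128-bit MD5 digest of the body.
md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode("utf-8")

boto3.client("s3").put_object(
    Bucket=bucket_name,
    Key=filename,
    Body=data,
    ContentMD5=md5_b64,
)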
Additional Reading
- [S3 Docs] Checking object integrity
- [AWS Knowledge Center] How can I check the integrity of an object uploaded to Amazon S3?
Prefixed by a unique value
What
All files within a delivery should share a unique prefix.
Why
By placing each delivery under a unique prefix, common filenames (e.g. manifest.json) won't conflict. Additionally, S3 applies request rate limits per prefix in the bucket; using a prefix per delivery greatly helps avoid hitting rate limit issues.
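One possible scheme (purely illustrative) combines the delivery date with a random component to guarantee uniqueness:
Example
import uuid
from datetime import datetime, timezone

# Produces keys like deliveries/2024/06/01/<uuid>/manifest.json
delivery_prefix = f"deliveries/{datetime.now(timezone.utc):%Y/%m/%d}/{uuid.uuid4()}"
manifest_key = f"{delivery_prefix}/manifest.json"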