# Cookbook
## List objects

Use the `obstore.list` method.
```python
import obstore as obs

# Create a Store
store = get_object_store()

# Recursively list all files below the 'data' path.
# 1. On AWS S3 this would be the 'data/' prefix
# 2. On a local filesystem, this would be the 'data' directory
prefix = "data"

# Get a stream of metadata objects:
list_stream = obs.list(store, prefix)

# Print info
for batch in list_stream:
    for meta in batch:
        print(f"Name: {meta.path}, size: {meta.size}")
```
## List objects as Arrow

The default `list` behavior creates many small Python `dict`s. When listing a large bucket, generating these Python objects can add up to a lot of overhead.

Instead, you may consider passing `return_arrow=True` to `obstore.list` to return each chunk of list results as an Arrow `RecordBatch`. This can be much faster than materializing Python objects for each row because Arrow can be shared zero-copy between Rust and Python.
This Arrow integration requires the `arro3-core` dependency, a lightweight Arrow implementation. You can convert the emitted `RecordBatch` to `pyarrow` (zero-copy) by passing it to `pyarrow.record_batch`, or to `polars` (also zero-copy) by passing it to `polars.DataFrame`.
```python
import obstore as obs

# Create a Store
store = get_object_store()

# Get a stream of Arrow RecordBatches of metadata
list_stream = obs.list(store, prefix="data", return_arrow=True)
for record_batch in list_stream:
    print(record_batch.num_rows)
```
Here's a working example with the `sentinel-cogs` bucket in AWS Open Data:
```python
import obstore as obs
import pandas as pd
import pyarrow as pa
from obstore.store import S3Store

store = S3Store("sentinel-cogs", region="us-west-2", skip_signature=True)
stream = obs.list(store, chunk_size=20, return_arrow=True)

for record_batch in stream:
    # Convert to pyarrow (zero-copy), then to pandas for easy export to a
    # Markdown table
    df = pa.record_batch(record_batch).to_pandas()
    print(df.iloc[:5].to_markdown(index=False))
    break
```
The Arrow record batch looks like the following:

| path | last_modified | size | e_tag | version |
|---|---|---|---|---|
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |
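As noted above, each emitted `RecordBatch` can also be loaded into Polars without copying. Here's a minimal sketch of that conversion, assuming `polars` is installed and that `get_object_store()` returns a configured store as in the earlier examples:

```python
import obstore as obs
import polars as pl

store = get_object_store()

for record_batch in obs.list(store, prefix="data", return_arrow=True):
    # polars.DataFrame ingests the Arrow data zero-copy via the Arrow
    # PyCapsule interface
    df = pl.DataFrame(record_batch)
    print(df.head())
```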
## Fetch objects

Use the `obstore.get` function to fetch data bytes from remote storage or files in the local filesystem.
```python
import obstore as obs

# Create a Store
store = get_object_store()

# Retrieve a specific file
path = "data/file01.parquet"

# Fetch just the file metadata
meta = obs.head(store, path)
print(meta)

# Fetch the object including metadata
result = obs.get(store, path)
assert result.meta == meta

# Buffer the entire object in memory
buffer = result.bytes()
assert len(buffer) == meta.size

# Alternatively stream the bytes from object storage
stream = obs.get(store, path).stream()

# We can now iterate over the stream
total_buffer_len = 0
for chunk in stream:
    total_buffer_len += len(chunk)

assert total_buffer_len == meta.size
```
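The same pattern works from async code via `obstore.get_async` (also used in the copy example below). The sketch here assumes the result's `stream()` can be iterated with `async for`; check the API reference for your installed version:

```python
import obstore as obs

async def fetch_size(store, path: str) -> int:
    # Fetch the object without blocking the event loop
    result = await obs.get_async(store, path)

    # Stream the body chunk by chunk and count the bytes
    total = 0
    async for chunk in result.stream():
        total += len(chunk)
    return total
```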
## Put object

Use the `obstore.put` function to atomically write data. `obstore.put` will automatically use multipart uploads for large input data.
```python
import obstore as obs

store = get_object_store()
path = "data/file1"
content = b"hello"

obs.put(store, path, content)
```
You can also upload local files:
```python
from pathlib import Path

import obstore as obs

store = get_object_store()
path = "data/file1"
content = Path("path/to/local/file")

obs.put(store, path, content)
```
Or file-like objects:
```python
import obstore as obs

store = get_object_store()
path = "data/file1"

with open("path/to/local/file", "rb") as content:
    obs.put(store, path, content)
```
Or iterables:
```python
import obstore as obs

def bytes_iter():
    for i in range(5):
        yield b"foo"

store = get_object_store()
path = "data/file1"
content = bytes_iter()

obs.put(store, path, content)
```
Or async iterables:
```python
import obstore as obs

async def bytes_stream():
    for i in range(5):
        yield b"foo"

store = get_object_store()
path = "data/file1"
content = bytes_stream()

obs.put(store, path, content)
```
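If you need more control over large uploads, `obstore.put` exposes tuning knobs for its multipart behavior. The keyword arguments in this sketch (`chunk_size`, `max_concurrency`) are assumptions based on recent obstore releases, so verify them against the API reference for your installed version:

```python
from pathlib import Path

import obstore as obs

store = get_object_store()
path = "data/large-file"

# Assumed keyword arguments -- check the obstore API reference:
# chunk_size sets the size of each multipart part, and max_concurrency
# caps how many parts are uploaded in parallel.
obs.put(
    store,
    path,
    Path("path/to/large/local/file"),
    chunk_size=16 * 1024 * 1024,
    max_concurrency=8,
)
```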
## Copy objects from one store to another

Perhaps you have data in AWS S3 that you need to copy to Google Cloud Storage. It's easy to stream a `get` from one store directly to the `put` of another.
**Note:** Using the async API is required for this.
```python
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file1"

# This only constructs the stream, it doesn't materialize the data in memory
resp = await obs.get_async(store1, path1, timeout="2min")

# A streaming upload is created to copy the file to path2
await obs.put_async(store2, path2, resp)
```
**Note:** You may need to increase the download timeout for large source files. The timeout defaults to 30 seconds, which may not be long enough to upload the file to the destination. You may set the `timeout` parameter in the `client_options` passed to the initial `get_async` call.
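One way to raise the timeout is at store construction time. This sketch assumes an `S3Store` source and that `client_options` accepts a `"timeout"` key with a human-readable duration string; check the client configuration docs for your version:

```python
from obstore.store import S3Store

# Sketch: raise the HTTP request timeout so large downloads have time to
# complete while the copy streams to the destination. The "timeout" key and
# the "2min" duration string are assumptions -- verify against the docs.
store1 = S3Store(
    "source-bucket",
    region="us-west-2",
    client_options={"timeout": "2min"},
)
```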