# Pickle Implementation
Last edited 2025-02-04.
Pickle support is important but hard to implement. It's important because Dask and similar libraries commonly use pickle to move state across distributed workers.
## Background
There are two ways to implement pickle support:

- Implementing `__getstate__` and `__setstate__`. The return value of `__getstate__` can be pretty much any picklable object, and it gets passed back into `__setstate__` to be unpacked and set as the internal state. (A sketch of this follows the list.)
- Implementing a constructor (tagged in Rust with `#[new]`) plus `__getnewargs_ex__`. `__getnewargs_ex__` must return a tuple of `(args: tuple, kwargs: dict)`, which pickle passes back to the `#[new]` constructor. This can be simpler when you already have a `#[new]` implemented and when the parameters to that `#[new]` function are easily serializable. However, it might require that every constructor argument itself be picklable. (I wonder if there's a way to recursively pickle things?)
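As a rough illustration of the first option, here's a minimal pyo3 sketch, assuming a recent pyo3; the `Counter` class and its state are purely illustrative, not part of obstore:

```rust
use pyo3::prelude::*;

#[pyclass]
struct Counter {
    value: u64,
}

#[pymethods]
impl Counter {
    #[new]
    fn new() -> Self {
        Self { value: 0 }
    }

    // Pickle stores whatever `__getstate__` returns (anything picklable)...
    fn __getstate__(&self) -> u64 {
        self.value
    }

    // ...and hands it back here when reconstructing the object.
    fn __setstate__(&mut self, state: u64) {
        self.value = state;
    }
}
```

The second option, `__getnewargs_ex__`, is sketched under Implementation below.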
We can't extract the configuration back out of a "finished" `object_store` instance like an `AmazonS3`. We could extract some configuration out of a builder instance like `AmazonS3Builder`, but then every time we used that class we'd have to call `build()` on the Rust side, which would probably hurt performance significantly. Therefore, the only(?) possible way to implement pickle support inside obstore is to persist the store configuration separately inside the `#[pyclass]`.
## Implementation
- Store the configuration information a second time in the `#[pyclass]` of each store. (It's already implicitly stored by the underlying Rust `ObjectStore` instance.)
- Handle prefixes automatically within each Python store class.
- Implement `__getnewargs_ex__`, returning the parameters that were passed to `#[new]` (see the sketch after this list).
- Implement `IntoPyObject` for each configuration object: the store-specific config, client options, and retry options.
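A rough sketch of that shape, assuming a recent pyo3 (0.23-style APIs) and the `object_store` crate; the flat `HashMap<String, String>` config and the error handling are stand-ins for obstore's actual types:

```rust
use std::collections::HashMap;
use std::sync::Arc;

use object_store::aws::{AmazonS3, AmazonS3Builder, AmazonS3ConfigKey};
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;
use pyo3::types::{PyDict, PyTuple};

#[pyclass(frozen)]
struct S3Store {
    // The built store, used for all object operations.
    store: Arc<AmazonS3>,
    // Second copy of the configuration, kept only so the instance can be
    // rebuilt via `#[new]` when unpickled on another worker.
    config: HashMap<String, String>,
}

#[pymethods]
impl S3Store {
    #[new]
    fn new(config: HashMap<String, String>) -> PyResult<Self> {
        let mut builder = AmazonS3Builder::new();
        for (key, value) in &config {
            let key = key
                .parse::<AmazonS3ConfigKey>()
                .map_err(|e| PyValueError::new_err(e.to_string()))?;
            builder = builder.with_config(key, value.as_str());
        }
        let store = builder
            .build()
            .map_err(|e| PyValueError::new_err(e.to_string()))?;
        Ok(Self {
            store: Arc::new(store),
            config,
        })
    }

    // Pickle calls this to learn which (args, kwargs) to feed back into
    // `#[new]` when reconstructing the object.
    fn __getnewargs_ex__<'py>(
        &self,
        py: Python<'py>,
    ) -> PyResult<(Bound<'py, PyTuple>, Bound<'py, PyDict>)> {
        let args = PyTuple::new(py, [self.config.clone()])?;
        Ok((args, PyDict::new(py)))
    }
}
```

With this in place, `pickle.loads(pickle.dumps(store))` on the Python side reconstructs the store by calling `#[new]` again with the persisted config.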
## Benefits
- It should be relatively straightforward to implement for any of the raw stores: `S3Store`, `GCSStore`, `AzureStore`, `LocalStore`, and `MemoryStore`.
- We're already validating `PyAmazonS3Config`, `PyClientOptions`, and `PyRetryConfig`, so it isn't much extra work to also store those on the Rust class. We can then serialize them to Python objects in `__getnewargs_ex__` and have Python automatically pass them back to `#[new]`.
- Using `__getnewargs_ex__` means we don't need to add serde support; we can use `IntoPyObject` and `FromPyObject` for all (de)serialization and have only a single code path (see the sketch after this list).
- We can persist all of the store-specific config, the client options, and the retry config, so the pickled instances should be exactly the same as the originals.
- It supports any of the builder classmethods, e.g. `from_env`, `from_url`, etc.
- Most of the time, the stored configuration is just a few strings, so it should increase memory usage only slightly and shouldn't otherwise affect runtime performance (assuming you reuse a store instead of creating a new one each time).
- Since we don't allow the store classes to be mutated from Python after creation, there's no risk of the two copies of the configuration getting out of sync.
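To make that single code path concrete, here's a minimal sketch of the pattern for a retry-config-like object, again assuming a recent pyo3; `RetryConfigSketch` and its fields are hypothetical stand-ins for obstore's actual `PyRetryConfig`:

```rust
use pyo3::prelude::*;
use pyo3::types::PyDict;

// Hypothetical config object; extracted from a Python dict on input...
#[derive(Clone, FromPyObject)]
#[pyo3(from_item_all)]
struct RetryConfigSketch {
    max_retries: usize,
    retry_timeout_secs: u64,
}

// ...and converted back to a Python dict in `__getnewargs_ex__`, so the
// same values round-trip through `#[new]` with no serde involved.
impl<'py> IntoPyObject<'py> for RetryConfigSketch {
    type Target = PyDict;
    type Output = Bound<'py, PyDict>;
    type Error = PyErr;

    fn into_pyobject(self, py: Python<'py>) -> Result<Self::Output, Self::Error> {
        let dict = PyDict::new(py);
        dict.set_item("max_retries", self.max_retries)?;
        dict.set_item("retry_timeout_secs", self.retry_timeout_secs)?;
        Ok(dict)
    }
}
```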
## Drawbacks
- Because `url` has deferred parsing in the `object_store` builders, we need to special-case `url` handling. Passing a `url` to a store with `with_url` means that `object_store` doesn't actually parse the URL until the `build()` method, and at that point we can no longer access config information from the built `AmazonS3`. Without special-casing this URL handling, pickling would fail for instances created from `from_url`.

  We handle this by vendoring the small amount of URL parsing from upstream: we apply the URL parsing onto our own config HashMaps and then apply those to the builder, so our configs and those used by the raw stores stay in sync. See developmentseed/obstore!209 and the sketch at the end of this section.
- It's unclear how to support middleware such as `PrefixStore`, because middleware has to wrap an arbitrary object. Is there a way to recursively pickle and unpickle the thing it's wrapping?
- (Implemented) If we can't find a way to support pickling of arbitrary middleware, we could alternatively apply a `PrefixStore` internally and automatically inside an `S3Store`, `GCSStore`, or `AzureStore`. (NOTE: we should maybe benchmark the overhead a `PrefixStore` causes, in case it's something we don't want to force on everyone. Then again, if the `S3Store` stored an arbitrary `Arc<dyn ObjectStore>`, we could prefix when asked to and skip the prefix otherwise, but maybe that would conflict with signing, if signing requires a raw `object_store::AmazonS3` instance?) Alternatively, we could create the `PrefixStore` on demand, since it looks virtually free.
- We don't currently implement pickle support for `MemoryStore`, as we don't have a way to serialize the memory state across workers.
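To make the URL special-casing concrete, here's a hypothetical sketch of the eager-parsing idea; `parse_s3_url` is an illustrative name, and the real vendored logic handles many more URL shapes (virtual-hosted-style endpoints, regions, and so on):

```rust
use std::collections::HashMap;

// Eagerly derive config entries from an s3:// URL, instead of letting the
// builder defer parsing to `build()`. The resulting map is what gets stored
// on the pyclass and replayed through `__getnewargs_ex__`.
fn parse_s3_url(url: &str) -> Result<HashMap<String, String>, String> {
    let rest = url
        .strip_prefix("s3://")
        .ok_or_else(|| format!("unsupported URL scheme: {url}"))?;
    let bucket = rest
        .split('/')
        .next()
        .filter(|bucket| !bucket.is_empty())
        .ok_or_else(|| format!("no bucket in URL: {url}"))?;
    let mut config = HashMap::new();
    config.insert("bucket".to_string(), bucket.to_string());
    Ok(config)
}
```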