
// the find

uber/petastorm

★ 1,889 · Python · Apache-2.0 · updated Jan 2026

Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.

Petastorm is a data access library from Uber ATG that lets you train ML models directly from Parquet files, bridging the gap between data warehouse storage and PyTorch/TensorFlow training loops. It adds a schema layer (Unischema) on top of standard Parquet to handle multidimensional arrays and image codecs that plain Parquet can't express. Best fit for teams already running Spark pipelines who want to feed those datasets into deep learning without a separate ETL step.
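To make the schema layer concrete, here is a minimal sketch of a Unischema declaration. The field names, dtypes, and shapes are hypothetical and chosen only for illustration; the classes used (`Unischema`, `UnischemaField`, `ScalarCodec`, `CompressedImageCodec`, `NdarrayCodec`) come from petastorm itself. Imports sit inside the function so the snippet loads even where petastorm, PySpark, and NumPy are not installed; calling `build_schema()` does require all three.

```python
# Hypothetical field layout: (name, numpy dtype name, shape, codec kind).
# A None dimension marks a variable-length axis.
FIELDS = [
    ("id", "int32", (), "scalar"),
    ("image", "uint8", (128, 256, 3), "png"),
    ("embedding", "float32", (None, 64), "ndarray"),
]


def build_schema(name="FrameSchema"):
    """Build a petastorm Unischema from FIELDS.

    Requires petastorm, pyspark, and numpy at call time.
    """
    import numpy as np
    from pyspark.sql.types import IntegerType
    from petastorm.codecs import CompressedImageCodec, NdarrayCodec, ScalarCodec
    from petastorm.unischema import Unischema, UnischemaField

    def codec_for(kind):
        # Pick the codec that serializes the field into a Parquet column.
        if kind == "scalar":
            return ScalarCodec(IntegerType())
        if kind == "png":
            return CompressedImageCodec("png")
        return NdarrayCodec()

    return Unischema(name, [
        # Last argument is `nullable`; every field here is required.
        UnischemaField(fname, getattr(np, dtype), shape, codec_for(kind), False)
        for fname, dtype, shape, kind in FIELDS
    ])
```

The image and embedding columns are the ones plain Parquet has no native type for: the codecs serialize them into binary columns, which is also why a dataset written this way needs petastorm to decode it.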

The Spark Dataset Converter API is the most practically useful part: materializing a DataFrame to Parquet and wrapping it in a tf.data.Dataset or a PyTorch DataLoader takes a few lines and saves real plumbing work. The Reader's parallelism options (thread pool, process pool, or a single-threaded `dummy` pool) are well thought out, and the single-threaded mode alone is worth knowing about for debugging. `make_batch_reader` reads plain, non-Petastorm Parquet, a good escape hatch that makes the library usable on datasets you didn't create. InMemBatchedDataLoader for GPU-side caching is a practical optimization that most people would otherwise hand-roll.
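As a rough sketch of how the Reader options fit together: `make_reader` and `make_batch_reader` both accept `reader_pool_type` (`'thread'`, `'process'`, or `'dummy'`) and `workers_count`. The helper names `reader_options` and `open_reader`, the mode label `"debug"`, and the dataset URL in the usage note are assumptions made for this example, not part of petastorm.

```python
# Map a human-friendly mode name to petastorm's reader_pool_type value.
# "debug" is this sketch's own label for the single-threaded 'dummy' pool.
POOL_TYPES = {"thread": "thread", "process": "process", "debug": "dummy"}


def reader_options(mode="thread", workers=10):
    """Return keyword arguments for make_reader / make_batch_reader."""
    if mode not in POOL_TYPES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(POOL_TYPES)}")
    return {"reader_pool_type": POOL_TYPES[mode], "workers_count": workers}


def open_reader(dataset_url, mode="thread", workers=10, petastorm_dataset=True):
    """Open a reader over a Parquet dataset (requires petastorm at call time).

    petastorm_dataset=True  -> make_reader, for datasets written with a Unischema.
    petastorm_dataset=False -> make_batch_reader, for plain Parquet.
    """
    from petastorm import make_batch_reader, make_reader

    factory = make_reader if petastorm_dataset else make_batch_reader
    return factory(dataset_url, **reader_options(mode, workers))
```

With petastorm installed, something like `with open_reader("file:///tmp/my_dataset", mode="debug") as reader:` followed by `for row in reader:` runs decoding in the calling thread, so breakpoints and stack traces behave normally. Switching `mode` back to `"thread"` or `"process"` restores parallel loading without touching the training loop.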

The most recent commit activity was on 2026-01-02, but the project has effectively been in maintenance mode since Uber sold ATG to Aurora (announced in December 2020): occasional commits land, yet there is no active feature development and open issues accumulate without response. The Unischema/custom codec layer is a library-specific abstraction on top of Parquet that locks you in. The code is Apache-2.0, but datasets written with petastorm require petastorm to read them properly, which is a real migration cost if you ever want to switch. The TensorFlow examples in the docs use the v1 Session API style, which has been deprecated for years, so they are stale. Writing a dataset depends on PySpark, which is heavy; there's no lightweight write path for teams not running Spark.
