
// the find

CrunchyData/pg_parquet

★ 679 · Rust · NOASSERTION · updated Nov 2025

Copy to/from Parquet in S3, Azure Blob Storage, Google Cloud Storage, http(s) stores, local files, or standard input/output streams from within PostgreSQL

A PostgreSQL extension written in Rust that hooks into COPY TO/FROM to read and write Parquet files directly from S3, Azure Blob Storage, GCS, or local disk. Built on Apache Arrow and pgrx. If you want to move data between Postgres and a data lake without an ETL pipeline in the middle, this is the most direct path.

The COPY hook integration is clean: you get glob pattern support for reading multiple Parquet files in one statement, which most similar tools don't handle. Type coverage is thorough: composite types, arrays, jsonb, UUID, and geometry via WKB all round-trip correctly. The `parquet.schema()` / `parquet.metadata()` / `parquet.column_stats()` introspection functions are immediately useful for debugging schema mismatches. File splitting via `file_size_bytes` means you can write large exports in S3-friendly chunks without scripting it yourself.
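To make the features above concrete, here is a minimal Python sketch that only builds the SQL text you would send to Postgres. The table name, bucket, and paths are hypothetical, and the helper functions are ours, not part of pg_parquet. The option spelling (`format 'parquet'`, `file_size_bytes '100MB'`) follows the names mentioned in this write-up and should be checked against the project README before use.

```python
from typing import Optional


def quote_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def copy_to_parquet(table: str, uri: str, file_size_bytes: Optional[str] = None) -> str:
    """Build a COPY ... TO statement; `table` is assumed to be a trusted identifier."""
    options = ["format 'parquet'"]
    if file_size_bytes is not None:
        # Assumed option syntax: a size string such as '100MB'.
        options.append(f"file_size_bytes {quote_literal(file_size_bytes)}")
    return f"COPY {table} TO {quote_literal(uri)} WITH ({', '.join(options)});"


def copy_from_parquet(table: str, uri_pattern: str) -> str:
    """Build a COPY ... FROM statement; the URI may contain a glob pattern."""
    return f"COPY {table} FROM {quote_literal(uri_pattern)} WITH (format 'parquet');"


def schema_query(uri: str) -> str:
    """Build a query that inspects a Parquet file's schema via parquet.schema()."""
    return f"SELECT * FROM parquet.schema({quote_literal(uri)});"


# Hypothetical table and bucket, for illustration only.
print(copy_to_parquet("events", "s3://example-bucket/events", "100MB"))
print(copy_from_parquet("events", "s3://example-bucket/events/*.parquet"))
print(schema_query("s3://example-bucket/events/part-0.parquet"))
```

Running the generated statements requires the extension to be installed and object-store credentials to be configured, neither of which this sketch covers.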

Installation requires building from source against a matching pgrx version pinned in Cargo.toml. The extension is not on PGXN and there are no prebuilt binaries for common distros, so managed Postgres (RDS, Cloud SQL, Supabase) is a non-starter. The `crunchy_map` type is only useful if you're on Crunchy Bridge, which is a product lock-in smell. An unspecified `numeric` silently defaults to precision 38 / scale 9 and throws a runtime error if your data overflows, which will surprise anyone with unconstrained numeric columns. There is no streaming or async path for large COPY TO operations; you're holding locks and memory for the full export duration.
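If you have unconstrained `numeric` columns, it may be worth checking ahead of time whether your values fit the default. The sketch below is a standalone check written with Python's `decimal` module, not part of pg_parquet; the function name is ours. It treats a value as fitting when it has at most 9 fractional digits and at most 38 - 9 = 29 integer digits. It conservatively rejects excess fractional digits (including trailing zeros) and non-finite values such as NaN; how the extension itself handles those cases is not covered here and is an assumption.

```python
from decimal import Decimal

PRECISION = 38  # default total digits, per the write-up above
SCALE = 9       # default fractional digits, per the write-up above


def fits_decimal(value: Decimal, precision: int = PRECISION, scale: int = SCALE) -> bool:
    """Return True if `value` fits DECIMAL(precision, scale) without rounding."""
    if not value.is_finite():
        return False  # NaN and infinities are treated as not fitting (assumption)
    if value == 0:
        return True
    _sign, digits, exponent = value.as_tuple()
    frac_digits = max(0, -exponent)              # digits after the decimal point
    int_digits = max(0, len(digits) + exponent)  # digits before the decimal point
    return frac_digits <= scale and int_digits <= precision - scale


samples = [
    Decimal("12345678901234567890123456789.123456789"),  # 29 integer + 9 fractional digits
    Decimal("1" + "0" * 29),                             # 30 integer digits
    Decimal("0.0000000001"),                             # 10 fractional digits
]
for sample in samples:
    print(sample, fits_decimal(sample))
```

Declaring an explicit precision and scale on the column avoids relying on the default in the first place.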

