
// the find

apache/iceberg-python

★ 1,068 · Python · Apache-2.0 · updated Jun 2026

PyIceberg

PyIceberg is the official Python client for Apache Iceberg, the open table format for large analytic datasets. It lets you read and write Iceberg tables directly from Python without going through Spark or Trino — useful for data engineering pipelines that don't want a JVM in the critical path.

Catalog support is broad: REST, Glue, Hive, DynamoDB, BigQuery Metastore, and SQL all ship as first-class options, not afterthoughts. The Avro reader has a Cython fast path (`decoder_fast.pyx`), so manifest scanning doesn't run at pure-Python speed. The expression DSL and row-filter syntax are implemented with pushdown visitors, so filters are evaluated during scan planning rather than applied as an in-memory post-filter. Development is active, with 509 forks and a push two days before this write-up; this isn't abandoned Apache incubator shelf-ware.
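To make the pushdown point concrete, here is a toy sketch of the general idea: a visitor walks a filter expression and checks it against per-file min/max column statistics, so files that cannot contain matching rows are skipped before any data is read. This is not PyIceberg's code. The class names (`GreaterThanOrEqual`, `LessThan`, `And`, `FileStats`) and the functions `might_match` and `plan_files` are hypothetical and defined here only for illustration.

```python
from dataclasses import dataclass
from functools import singledispatch


# Hypothetical expression nodes for this sketch (not PyIceberg's classes).
@dataclass(frozen=True)
class GreaterThanOrEqual:
    column: str
    value: int


@dataclass(frozen=True)
class LessThan:
    column: str
    value: int


@dataclass(frozen=True)
class And:
    left: object
    right: object


# Hypothetical per-file metadata: lower and upper bound for each column.
@dataclass(frozen=True)
class FileStats:
    path: str
    lower: dict
    upper: dict


@singledispatch
def might_match(expr, stats: FileStats) -> bool:
    """Return False only when the file provably contains no matching rows."""
    raise TypeError(f"unsupported expression: {expr!r}")


@might_match.register
def _(expr: GreaterThanOrEqual, stats: FileStats) -> bool:
    # Some row can satisfy col >= v only if the file's max is at least v.
    return stats.upper[expr.column] >= expr.value


@might_match.register
def _(expr: LessThan, stats: FileStats) -> bool:
    # Some row can satisfy col < v only if the file's min is below v.
    return stats.lower[expr.column] < expr.value


@might_match.register
def _(expr: And, stats: FileStats) -> bool:
    # Both sides must be satisfiable for the file to be worth reading.
    return might_match(expr.left, stats) and might_match(expr.right, stats)


def plan_files(expr, files):
    """Keep only the files whose statistics do not rule out the filter."""
    return [f.path for f in files if might_match(expr, f)]


files = [
    FileStats("a.parquet", {"id": 0}, {"id": 99}),
    FileStats("b.parquet", {"id": 100}, {"id": 199}),
    FileStats("c.parquet", {"id": 200}, {"id": 299}),
]
row_filter = And(GreaterThanOrEqual("id", 150), LessThan("id", 200))
print(plan_files(row_filter, files))  # ['b.parquet']
```

A post-filter would open all three files and discard rows afterwards. The pruning pass opens one. Rows inside the surviving file still need a row-level filter, since the statistics only prove that a match is possible.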

Write support is still catching up to the Java implementation; complex merge-on-read delete handling and full upsert semantics are incomplete enough that the repo ships an `upsert_util.py` as a separate utility rather than a first-class table operation. The dependency footprint is heavy: you're pulling in PyArrow, fsspec, and optionally Cython just to touch a catalog. There is no way to read directly into pandas without going through Arrow. Integration test setup requires Docker Compose with Spark and Hive images, which makes local contribution setup a half-day project.

