// the find
vaexio/vaex
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
Vaex is an out-of-core DataFrame library for Python that memory-maps HDF5 and Arrow files so you can work on datasets larger than RAM without loading them into memory. It targets data scientists who hit the pandas memory wall and need billion-row aggregations on a laptop. Performance comes from lazy evaluation, a C++ extension layer, and zero-copy operations, not distributed compute.
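To make the memory-mapping idea concrete, here is a minimal standard-library sketch. It is not vaex's implementation: the raw float64 file layout, the file name, and the helpers `write_column`, `mapped_sum`, and `demo` are all hypothetical and exist only for illustration. The point is that the file is mapped, not read, and the aggregate is computed one chunk at a time.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.calcsize("d")  # bytes per float64 value (native byte order)


def write_column(path, values):
    """Write floats as raw float64 values (hypothetical stand-in for a column file)."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}d", *values))


def mapped_sum(path, chunk_rows=1024):
    """Sum a float64 column through a memory map, one chunk at a time."""
    if os.path.getsize(path) == 0:
        return 0.0  # an empty file cannot be memory-mapped
    total = 0.0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n_rows = len(mm) // ITEM
        for start in range(0, n_rows, chunk_rows):
            count = min(chunk_rows, n_rows - start)
            # Decode only this chunk from the mapping; the OS pages data in on demand.
            chunk = struct.unpack_from(f"{count}d", mm, start * ITEM)
            total += sum(chunk)
    return total


def demo():
    """Write a small column to a temp file and aggregate it via the mapping."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "x.f64")
        write_column(path, [float(i) for i in range(10_000)])
        return mapped_sum(path)
```

Only one chunk of decoded values is held in Python at a time, so peak memory is bounded by `chunk_rows`, not by the file size. A real engine does the per-chunk work in compiled code, which is where the speed comes from.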
The lazy expression system is genuinely useful: column transformations are deferred until a statistic is actually computed, so you can write feature-engineering logic without materializing intermediate arrays. The C++ aggregation engine is fast: groupby on categorical columns does reach the advertised billion-rows-per-second range on real hardware. The join implementation avoids materializing the right-hand table, a meaningful memory saving versus pandas merge. Arrow/S3 support means you can stream directly from cloud storage with the same API as local files.
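The deferred-evaluation idea is easy to sketch in plain Python. The toy below is an illustration only, not vaex's API or internals: `Expr`, `col`, `mean`, and `rows` are hypothetical names. Building the expression records the operations; nothing is computed until a statistic streams over the data.

```python
import operator


class Expr:
    """A deferred column expression: records operations, computes nothing yet."""

    def __init__(self, fn, label):
        self._fn = fn        # maps one row (a dict) to a value
        self.label = label   # human-readable form of the expression

    def _binary(self, other, op, symbol):
        if not isinstance(other, Expr):
            # Wrap plain constants so both operands are expressions.
            other = Expr(lambda row, v=other: v, repr(other))
        return Expr(lambda row: op(self._fn(row), other._fn(row)),
                    f"({self.label} {symbol} {other.label})")

    def __add__(self, other):
        return self._binary(other, operator.add, "+")

    def __mul__(self, other):
        return self._binary(other, operator.mul, "*")

    def evaluate(self, row):
        return self._fn(row)


def col(name):
    """Reference a column by name without reading any data."""
    return Expr(lambda row: row[name], name)


def mean(expr, rows):
    """Stream over the rows once; no intermediate column is stored.

    Raises ZeroDivisionError if there are no rows.
    """
    total, count = 0.0, 0
    for row in rows:
        total += expr.evaluate(row)
        count += 1
    return total / count


def rows():
    """Hypothetical data source yielding one row at a time."""
    for i in range(1, 5):
        yield {"x": float(i), "y": 2.0}


feature = col("x") * col("y") + 1   # only builds the expression tree
result = mean(feature, rows())      # evaluation happens here, row by row
```

The trade-off is that every statistic re-evaluates the expression over the data, so a derived column that is reused many times may still be worth materializing once.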
The project is effectively in maintenance mode: meaningful activity is sporadic and the Slack community has gone quiet, so you're largely on your own when something breaks. API coverage versus pandas is incomplete in frustrating ways. Window functions, complex multi-index operations, and some string methods are missing or behave differently, so real ETL pipelines hit edge cases constantly. The HDF5 dependency is a practical headache: converting existing CSVs or Parquet files to vaex's native format adds an upfront step, and the tooling for that conversion is underspecified. Python 3.10+ compatibility has had rough patches because the C++ extension build system hasn't kept pace with the ecosystem.