// the find

pandas-dev/pandas

★ 48,979 · Python · BSD-3-Clause · updated Jun 2026

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

pandas is the standard tabular data manipulation library for Python: DataFrames, Series, groupby, merges, time series, and I/O for most common formats (CSV, Parquet, Excel, SQL, JSON, and more). It's the layer that sits between your raw data and any analysis or ML pipeline. If you're doing data work in Python, you're already using it or you're about to.

The groupby + agg + transform pipeline is expressive once you know it: split-apply-combine covers most real-world aggregation patterns without reaching for loops. Time series support is deep: date range generation, frequency resampling, rolling windows, and tz-aware indexes handle edge cases that trip up hand-rolled code. The ASV benchmark suite is serious, and it exists to catch performance regressions before release, which matters for a library used at this scale. Copy-on-Write semantics (now the default) finally make mutation behavior predictable after years of SettingWithCopyWarning confusion.
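To make the split-apply-combine and resampling points concrete, here is a minimal sketch. The `sales` and `daily` data are made up for illustration; only the pandas calls themselves (`groupby`, `agg`, `transform`, `date_range`, `resample`, `rolling`) are real API.

```python
import pandas as pd

# Hypothetical sales data, invented for this example.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "units": [10, 30, 5, 15, 40],
    }
)

# agg reduces each group to one row per region.
summary = sales.groupby("region")["units"].agg(["sum", "mean"])

# transform returns a result the same length as the input, so it
# aligns row-for-row and can be assigned straight back as a column.
sales["region_total"] = sales.groupby("region")["units"].transform("sum")
sales["share"] = sales["units"] / sales["region_total"]

# Hypothetical daily series: 14 days starting on a Monday.
daily = pd.Series(
    range(1, 15),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# Downsample to weekly totals (weeks end on Sunday by default for "W").
weekly = daily.resample("W").sum()

# 3-day rolling mean; the first two entries are NaN because the
# window is not yet full.
rolling_mean = daily.rolling(window=3).mean()
```

The distinction worth remembering is that `agg` changes the shape (one row per group) while `transform` keeps it, which is why the per-row share calculation needs no merge back onto the original frame.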

Memory usage is brutal on wide or string-heavy DataFrames: object-dtype columns hold references to individual Python objects rather than packed bytes, so a 1M-row string column stored as object will eat RAM you don't expect. Performance falls off a cliff on large datasets compared to polars or DuckDB; groupby on 100M rows is where you start regretting not using something columnar from the start. The API surface has accumulated 15 years of decisions: there are three ways to select rows, two index alignment behaviors, and enough deprecated kwargs that Stack Overflow answers from 2019 silently do the wrong thing today. No native parallelism: most operations run single-threaded unless you explicitly bring in Dask or Modin.
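The object-dtype cost is easy to measure yourself. The sketch below assumes a low-cardinality string column (the `labels` data is hypothetical) and compares it against the same values stored as `category`. Actual savings depend on how many distinct values the column has, so treat this as an illustration, not a benchmark.

```python
import pandas as pd

n = 100_000

# Hypothetical column: four distinct strings repeated many times,
# forced to object dtype so each cell is a separate Python str.
labels = pd.Series(["alpha", "beta", "gamma", "delta"] * (n // 4), dtype="object")

# deep=True counts the Python string objects themselves,
# not just the 8-byte pointers stored in the array.
object_bytes = labels.memory_usage(deep=True)

# category stores small integer codes plus one copy of each distinct value.
category_bytes = labels.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

Without `deep=True`, `memory_usage` reports only the pointer array for object columns, which is how the real footprint ends up being a surprise.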
