// the find
great-expectations/great_expectations
Always know what to expect from your data.
Great Expectations is a Python library for defining and running data quality checks ('Expectations') against dataframes, SQL tables, and file-based data sources. It's aimed at data engineers and ML practitioners who want to assert contracts on their data pipelines, similar to unit tests but for data. The generated HTML documentation of validation results is a genuinely useful feature for team communication.
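To make the "unit tests for data" framing concrete, here is a minimal plain-Python sketch of what a single Expectation amounts to. It deliberately does not use the Great Expectations API: the `expect_values_between` helper, the sample `rows`, and the shape of the result dictionary are all hypothetical stand-ins chosen for illustration.

```python
from typing import Any


def expect_values_between(
    rows: list[dict[str, Any]], column: str, min_value: float, max_value: float
) -> dict[str, Any]:
    """Hypothetical stand-in for an Expectation.

    Checks that every value in `column` lies in [min_value, max_value];
    missing (None) values are treated as unexpected.
    """
    unexpected = [
        row[column]
        for row in rows
        if row[column] is None or not (min_value <= row[column] <= max_value)
    ]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "unexpected_values": unexpected,
    }


# Hypothetical sample data: one row violates the contract.
rows = [
    {"trip_id": 1, "passenger_count": 2},
    {"trip_id": 2, "passenger_count": 0},
    {"trip_id": 3, "passenger_count": 4},
]

result = expect_values_between(rows, "passenger_count", min_value=1, max_value=6)
if not result["success"]:
    print(f"{result['unexpected_count']} unexpected value(s): {result['unexpected_values']}")
```

The sample fails the check because one row has a passenger count of 0. The point of returning a structured result rather than raising is that it can be aggregated and rendered into a report, which is roughly the role the library's validation results and data docs play.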
- Wide datasource coverage: Pandas, Spark, and most major SQL databases (Postgres, MySQL, MSSQL, BigQuery, Snowflake, Databricks, Trino, etc.) are supported, with docker-compose fixtures for local testing of each.
- Extensible Expectation system: you can subclass and register custom Expectations with relatively little boilerplate, and the contrib/ directory shows real-world examples including statistical tests (KS test, chi-square, Benford's Law).
- Auto-generated data docs: validation results can be rendered to HTML with per-column statistics and pass/fail summaries, and teams do use these pages in practice to communicate with stakeholders.
- Active CI discipline: pre-commit hooks, Ruff linting, Azure Pipelines, per-Python-version constraint files, and SQLAlchemy compatibility matrix tests indicate the project takes backward compatibility seriously.
- The API has broken compatibility multiple times across major versions (v2 Batch Kwargs → v3 BatchRequest → current Core API), and migration is painful; production deployments regularly hit undocumented behavioral changes after upgrades.
- Initial setup friction is high: concepts like DataContext, ExpectationSuite, Checkpoint, BatchRequest, and ValidationDefinition have non-obvious relationships, and the docs try to cover all historical API shapes simultaneously, making them confusing.
- Performance on large datasets is a real problem: many Expectations materialize full column data into memory on Pandas backends, and even SQL-native Expectations can generate N+1 query patterns when validating many columns.
- The cloud/SaaS push (GX Cloud) means the OSS version occasionally feels like a loss-leader: some workflow improvements land in the paid tier first, and there is limited transparency about the open-source roadmap.
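The query-pattern concern in the performance point above can be illustrated with a small sketch using Python's built-in `sqlite3` module. This is not the SQL that Great Expectations generates; the `trips` table, its columns, and both helper functions are hypothetical. It only contrasts issuing one query per validated column with folding the same checks into a single aggregate query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_id INTEGER, fare REAL, tip REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, 12.5, 2.0), (2, None, 1.0), (3, 8.0, None), (None, 5.0, None)],
)

# Fixed, trusted column names; never interpolate user input into SQL like this.
columns = ["trip_id", "fare", "tip"]


def null_counts_per_column(conn, columns):
    """One round trip per column: the query count grows with the column count."""
    counts = {}
    for col in columns:
        (counts[col],) = conn.execute(
            f"SELECT COUNT(*) FROM trips WHERE {col} IS NULL"
        ).fetchone()
    return counts


def null_counts_batched(conn, columns):
    """One round trip total: all null counts come from a single aggregate query."""
    select_list = ", ".join(
        f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END)" for col in columns
    )
    row = conn.execute(f"SELECT {select_list} FROM trips").fetchone()
    return dict(zip(columns, row))


per_column = null_counts_per_column(conn, columns)
batched = null_counts_batched(conn, columns)
assert per_column == batched  # same answer from 3 queries versus 1
```

Against an in-memory database the difference is negligible, but against a remote warehouse each extra round trip adds latency and often a full table scan, which is why wide tables tend to make the per-column pattern expensive.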