// the find
fivetran/great_expectations
Always know what to expect from your data.
Great Expectations is a data validation library for Python that lets you write assertions about your data (column types, value ranges, null rates, set membership) and run them against DataFrames, SQL tables, or files. It's the de facto standard for data quality testing in Python ML and data pipelines, backed by a company (GX) and a large community. It's aimed primarily at data engineers and ML engineers who need to gate pipelines on data shape.
The Expectation DSL is genuinely expressive and readable: `expect_column_values_to_be_between` is unambiguous in a way that a raw assert statement isn't. Auto-generated Data Docs give you HTML validation reports without writing a single template. The connector coverage is wide (Pandas, Spark, most SQL dialects, cloud warehouses), so you're unlikely to hit a source that isn't supported. The custom Expectation system is well designed: subclassing and registering a new check is straightforward, and the contrib directory shows it actually gets used.
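To make the readability point concrete, here is a minimal plain-Python sketch of the difference between a raw assert and a named, declarative check. It deliberately does not call Great Expectations, whose API differs across major versions. `ExpectationResult`, `REGISTRY`, `register_expectation`, and `expect_values_to_be_between` are hypothetical names invented for illustration and are not GX's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExpectationResult:
    """Structured outcome of one check (hypothetical, not the GX result type)."""
    expectation: str
    success: bool
    unexpected_count: int
    unexpected_values: List[float] = field(default_factory=list)


# Hypothetical registry mapping a readable name to a check function.
REGISTRY: Dict[str, Callable[..., ExpectationResult]] = {}


def register_expectation(func):
    """Register a check under its function name so it can be looked up by name."""
    REGISTRY[func.__name__] = func
    return func


@register_expectation
def expect_values_to_be_between(values, min_value, max_value):
    """Collect every value outside the inclusive range [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return ExpectationResult(
        expectation="expect_values_to_be_between",
        success=not unexpected,
        unexpected_count=len(unexpected),
        unexpected_values=unexpected,
    )


ages = [34, 51, 27, -3, 140]

# A raw assert would only raise AssertionError, with no detail on what failed:
# assert all(0 <= a <= 120 for a in ages)

# The named check states the rule and reports everything that violated it.
result = REGISTRY["expect_values_to_be_between"](ages, min_value=0, max_value=120)
print(result.success, result.unexpected_count, result.unexpected_values)
# prints: False 2 [-3, 140]
```

The structured result is what makes reports and pipeline gating possible: a caller can branch on `success` or log `unexpected_values` instead of catching an exception.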
The API has broken compatibility multiple times across major versions (v2 → v3 → GX Core), and migration paths are painful: if you have existing suites from two years ago, expect to rewrite them, not port them. Cold start is heavy: the Data Context abstraction, Suites, Checkpoints, and Stores are a lot of concepts to absorb before you validate your first column. The cloud/SaaS product (GX Cloud) is clearly where the company's attention goes now; OSS development feels slower, and some flagship features like Data Assistants were deprecated. Performance on large datasets degrades badly because many Expectations fall back to loading data into memory even when you're connected to a SQL warehouse that could evaluate them natively.
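The performance concern comes down to where a check is evaluated. As a rough illustration only, the sketch below uses the standard-library `sqlite3` module as a stand-in for a warehouse and contrasts fetching every row into Python with asking the database for a single aggregate. The table `trips`, the column `fare`, and both helper functions are hypothetical, and this is not a description of how GX's execution engines work internally.

```python
import sqlite3


def count_out_of_range_in_memory(conn, min_value, max_value):
    """Pull the whole column into Python, then filter: cost grows with table size."""
    rows = conn.execute("SELECT fare FROM trips").fetchall()
    return sum(1 for (fare,) in rows if not (min_value <= fare <= max_value))


def count_out_of_range_pushed_down(conn, min_value, max_value):
    """Let the database evaluate the range check and return a single number."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM trips WHERE fare NOT BETWEEN ? AND ?",
        (min_value, max_value),
    ).fetchone()
    return count


# In-memory SQLite database standing in for a SQL warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (fare REAL NOT NULL)")
conn.executemany(
    "INSERT INTO trips (fare) VALUES (?)",
    [(12.5,), (-4.0,), (30.0,), (999.0,), (8.25,)],
)

in_memory = count_out_of_range_in_memory(conn, 0, 500)
pushed_down = count_out_of_range_pushed_down(conn, 0, 500)
print(in_memory, pushed_down)
# prints: 2 2
```

Both functions return the same answer. The difference is that the first moves every row across the connection, while the second moves one integer, which is the gap that matters on warehouse-sized tables.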