// the find
Data-Centric-AI-Community/fg-data-profiling
1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.
fg-data-profiling (formerly pandas-profiling, then ydata-profiling) generates a detailed HTML/JSON EDA report from a pandas or Spark DataFrame in one line of code. It covers univariate stats, correlations, missing data patterns, time-series analysis, and data quality alerts. Aimed at data scientists who want a quick first-pass look at a new dataset without writing boilerplate.
The alert system is genuinely useful: it automatically flags high cardinality, skewed distributions, near-duplicates, and non-stationary time series without you having to know to ask for them. Spark support means it scales beyond single-machine pandas DataFrames, which is rare for this category of tool. The Great Expectations integration lets you turn a profiling report directly into a test suite, which closes the loop from exploration to validation. The HTML report is self-contained with bundled assets, so you can share it as a single file without a running server.
This repo has been renamed twice in two years (pandas-profiling → ydata-profiling → fg-data-profiling), which is a red flag for long-term stability: each rename breaks imports and causes pip dependency confusion. The Spark implementation is visibly thinner than the pandas one; several pandas-specific analyses (image, file, URL types) have no Spark equivalent, so you get a degraded experience at scale. Reports on large DataFrames can be slow and memory-hungry because the correlation and interaction computations don't sample by default; you have to know to set `minimal=True` or downsample the DataFrame yourself before profiling (`samples.head` only controls how many rows the report's sample section displays). The commercial upsell to 'YData Fabric' is baked into the README and docs, which makes it harder to tell what's genuinely open-source versus a loss-leader.
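One way to soften both the rename churn and the slow-by-default behaviour is a small import shim plus a `minimal=True` wrapper. The sketch below is illustrative only: `first_importable`, `build_minimal_report`, and `KNOWN_MODULE_NAMES` are hypothetical helper names, and the list covers only the two earlier module names (`ydata_profiling`, `pandas_profiling`). The import name used by the current fg-data-profiling release is not stated here, so treat it as something to confirm and add yourself.

```python
import importlib
from types import ModuleType
from typing import Iterable, Optional

# Module names from the two earlier incarnations of the project.
# Assumption: the fg-data-profiling import name is unknown here; append it
# to this tuple once you have confirmed it.
KNOWN_MODULE_NAMES = ("ydata_profiling", "pandas_profiling")


def first_importable(names: Iterable[str]) -> Optional[ModuleType]:
    """Return the first module in `names` that imports cleanly, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not installed (or failed to import); try the next candidate.
            continue
    return None


def build_minimal_report(df, title="First-pass profile"):
    """Build a report for `df` with the library's minimal preset enabled."""
    profiling = first_importable(KNOWN_MODULE_NAMES)
    if profiling is None:
        raise ImportError(f"none of {KNOWN_MODULE_NAMES} is installed")
    # minimal=True turns off the expensive parts of the report, such as
    # correlations and interactions, which matters on large DataFrames.
    return profiling.ProfileReport(df, title=title, minimal=True)
```

With one of those packages installed, `build_minimal_report(df).to_file("report.html")` should write the self-contained HTML report. Pinning a single package name in your requirements file is still the more robust fix; the shim only keeps notebooks running across environments that have different incarnations installed.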