// the find
reata/sqllineage
SQL Lineage Analysis Tool powered by Python
SQLLineage parses SQL statements and extracts table-level and column-level data lineage, storing it as a directed graph you can query or visualize. It supports multiple SQL dialects via sqlfluff and sqlparse backends, and can connect to a real database via SQLAlchemy to resolve wildcards and unqualified column references. Aimed at data engineers who need to audit pipelines or build data catalogs without wiring up a full governance platform.
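To make the "directed graph you can query" idea concrete, here is a toy sketch using only the Python standard library. It is not SQLLineage's internal representation or API: the table names, the `EDGES` list, and the `build_parents` and `upstream` helpers are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical table-level edges (source -> target), e.g. as they might be
# extracted from INSERT ... SELECT statements. Names are illustrative only.
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "mart.revenue"),
    ("staging.customers", "mart.revenue"),
]


def build_parents(edges):
    """Map each target table to the set of tables it reads from."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    return parents


def upstream(table, parents):
    """Return every table that `table` depends on, directly or transitively."""
    seen = set()
    stack = [table]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


parents = build_parents(EDGES)
print(sorted(upstream("mart.revenue", parents)))
# ['raw.customers', 'raw.orders', 'staging.customers', 'staging.orders']
```

The same traversal run over child edges instead of parent edges would answer the impact-analysis question: which downstream tables are affected if a source changes.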
Column-level lineage that actually traces through CTEs, subqueries, and JOINs, not just table-to-table edges. Dialect awareness via sqlfluff means Hive, SparkSQL, and TSQL quirks are handled rather than silently misparsed. The metadata integration (SQLAlchemy plus any database it supports), which resolves `SELECT *` into actual columns, is the right call: without it, wildcard expansion is just guessing. The test suite is extensive and organized by SQL construct type, which is exactly how you'd want to catch parser regressions.
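Why metadata matters for `SELECT *` can be shown with a toy resolver. This is a sketch under stated assumptions, not how SQLLineage implements it: the `CATALOG` dict stands in for what a live database connection would report, `expand_wildcard` is a hypothetical helper, and target columns are assumed to keep the source column names.

```python
# Hypothetical catalog: table name -> ordered column names. In practice this
# information would come from a live database, not a hard-coded dict.
CATALOG = {
    "staging.orders": ["order_id", "customer_id", "amount"],
}


def expand_wildcard(source_table, target_table, catalog):
    """Turn `INSERT INTO target SELECT * FROM source` into column-level edges.

    Returns (source_column, target_column) pairs, assuming target columns
    keep the source names. Without catalog metadata the wildcard cannot be
    expanded, so a single unresolved `*` edge is returned instead.
    """
    columns = catalog.get(source_table)
    if columns is None:
        return [(f"{source_table}.*", f"{target_table}.*")]
    return [(f"{source_table}.{c}", f"{target_table}.{c}") for c in columns]


# Known table: one edge per real column.
print(expand_wildcard("staging.orders", "mart.orders", CATALOG))
# Unknown table: the lineage stays at the unresolved wildcard.
print(expand_wildcard("staging.unknown", "mart.unknown", CATALOG))
```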
Column lineage breaks down without metadata: wildcards stay unresolved and unqualified columns get no source table, which is the common case in real warehouse SQL. The dual-parser architecture (sqlparse + sqlfluff) means two code paths have to be maintained for every SQL construct; sqlparse is known to be fragile on complex queries, and the README quietly steers you toward sqlfluff without deprecating the old path. There is no incremental or streaming mode. You feed it a SQL string or file each time, so integrating it into a CI pipeline that processes thousands of migration files means building the orchestration yourself. The visualization is a local Flask server that opens a browser tab, which is fine for demos but no help when you need to embed lineage graphs in an existing tool.
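The orchestration you would have to build for CI might look roughly like the sketch below: skip files whose content hash has not changed since the last run. Everything here is an assumption for illustration. `analyze_changed_files`, the JSON cache layout, and the `extract_lineage` callable are hypothetical, and `extract_lineage` stands in for whatever wrapper you write around the lineage tool.

```python
import hashlib
import json
from pathlib import Path


def analyze_changed_files(sql_dir, cache_path, extract_lineage):
    """Run `extract_lineage` only on .sql files whose content changed.

    `extract_lineage` is a caller-supplied (hypothetical) function that takes
    SQL text and returns a JSON-serializable result.
    """
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    results = {}
    for path in sorted(Path(sql_dir).rglob("*.sql")):
        sql = path.read_text()
        digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        entry = cache.get(str(path))
        if entry is None or entry["sha256"] != digest:
            # New or modified file: re-run the (expensive) analysis.
            entry = {"sha256": digest, "lineage": extract_lineage(sql)}
        results[str(path)] = entry
    # Persisting only `results` also drops cache entries for deleted files.
    cache_file.write_text(json.dumps(results, indent=2))
    return {name: entry["lineage"] for name, entry in results.items()}
```

Keying the cache on a content hash rather than a modification time keeps it stable across fresh CI checkouts, where timestamps are usually reset.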