// the find
tobymao/sqlglot
Python SQL Parser and Transpiler
SQLGlot is a zero-dependency SQL parser, transpiler, and optimizer written in Python. It handles 31 dialects — BigQuery, Snowflake, DuckDB, Spark, TSQL, and more — and lets you parse, modify, and re-emit SQL programmatically. The target audience is tooling authors: data pipeline frameworks, query analyzers, IDE plugins, anything that needs to speak multiple SQL dialects without maintaining a parser per database.
1. No dependencies by default — drops straight into any Python project without pulling in a tree of transitive junk. The optional `[c]` extra adds mypyc-compiled extensions that cut parse time to ~25% of pure Python, competitive with Rust bindings on most workloads. 2. The AST is fully traversable and mutable via `find_all`, `transform`, and `diff` — you can write structural SQL rewrites in a few lines rather than string-mangling. 3. The optimizer is schema-aware and actually useful: predicate pushdown, CTE elimination, type annotation, and full TPC-H/TPC-DS test fixtures to verify correctness. That's a high bar most SQL libraries don't clear. 4. Adoption list is serious — Apache Superset, Dagster, Ibis, dlt. These are not toy users; they stress the edge cases.
1. Transpilation accuracy is 'best effort' and the docs say so. Type-sensitive transforms (e.g., implicit casting differences between Snowflake and BigQuery) silently produce wrong SQL unless you pass a schema and run the full optimizer, which most callers don't do. You can ship subtly broken queries without realizing it. 2. The Python executor exists 'not to be fast' — their words. It is genuinely useful for unit testing but the moment someone tries to run any real data through it in production they will have a bad time. The distinction between 'for testing' and 'production engine' is easy to miss. 3. Community dialects (Teradata, Oracle, Druid, etc.) have no SLA on bugs. If your company runs on one of those, you are effectively the maintainer for any edge cases you hit. 4. The optimizer requires explicit schema input to do its best work; without it, type inference stalls and several rewrites simply don't fire. For ad-hoc query tools where schema isn't known ahead of time, you get a noticeably weaker result.