// the find
MarquezProject/marquez
Collect, aggregate, and visualize a data ecosystem's metadata
Marquez is a metadata service for tracking data lineage: which datasets your jobs consume and produce, and how that graph changes over time. It implements the OpenLineage standard, so if your Spark/Airflow/dbt jobs already emit OpenLineage events, Marquez can ingest them with minimal integration work. The target audience is data engineering teams running multi-tool pipelines who are tired of asking 'where did this table come from?'
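To make "emit OpenLineage events" concrete, here is a minimal sketch of building one run event by hand and posting it to Marquez. It assumes a local instance exposing the lineage endpoint at http://localhost:5000/api/v1/lineage. The namespace, job and dataset names, and producer URL are hypothetical, and the schemaURL version is an assumption to check against the spec version you target. In practice the Spark/Airflow/dbt integrations build these events for you.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib import request

# Assumption: local Marquez instance; adjust host/port for your deployment.
MARQUEZ_URL = "http://localhost:5000/api/v1/lineage"


def build_run_event(event_type, job_name, inputs, outputs,
                    run_id=None, namespace="example"):
    """Build a minimal OpenLineage run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        # Hypothetical producer identifier for this sketch.
        "producer": "https://example.com/hypothetical-producer",
        # Assumed spec version; verify against the OpenLineage spec.
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json"
                     "#/definitions/RunEvent",
    }


def send_event(event, url=MARQUEZ_URL):
    """POST the event as JSON and return the HTTP status code."""
    req = request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status
```

Calling send_event(build_run_event("COMPLETE", "daily_orders", ["raw.orders"], ["analytics.orders"])) against a running instance should register the job and both datasets. Note that the request carries no credentials, which matters for the auth caveat further down.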
OpenLineage support is the real draw: it's an actual open standard with integrations across Spark, Airflow, dbt, Flink, and more, so you're not writing custom instrumentation for each tool. The lineage graph visualization in the web UI is genuinely useful for impact analysis when a source dataset changes. The GraphQL endpoint (even if beta) means you can query lineage relationships programmatically without paging through REST responses. LF AI & Data graduation and real adopters (Astronomer, Northwestern Mutual) indicate this isn't abandonware.
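For a sense of what "impact analysis" means over a lineage graph, here is a small sketch that walks downstream from one dataset. The edge list is hypothetical hand-written data, not Marquez's API response format. In a real setup you would fetch the graph from Marquez and adapt it to this shape.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical lineage, alternating dataset -> job -> dataset.
EDGES = [
    ("dataset:raw.orders", "job:clean_orders"),
    ("job:clean_orders", "dataset:staging.orders"),
    ("dataset:staging.orders", "job:daily_revenue"),
    ("job:daily_revenue", "dataset:analytics.revenue"),
    ("dataset:raw.users", "job:daily_revenue"),
]

# Everything that could be affected if raw.orders changes.
affected = downstream(EDGES, "dataset:raw.orders")
```

A breadth-first walk is enough here because the question is only "what is reachable", not "by which path".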
No auth out of the box: the README explicitly says the HTTP API requires no authentication or authorization, which means you're one misconfigured firewall rule away from exposing your entire data lineage graph to the internet. The GraphQL API has been 'beta' long enough to raise doubts about whether it'll ever stabilize. Column-level lineage is supported, but the feature surface is clearly secondary to job/dataset lineage: the ColumnLineageDao exists, yet the documentation barely covers it. The Flyway migration history shows 70+ migrations over the project's life, so a schema upgrade on a long-running instance is a gamble; rehearse it against a copy of the database before touching production.