// the find

alanchn31/Data-Engineering-Projects

★ 1,021 · Jupyter Notebook · updated Feb 2023

Personal Data Engineering Projects

A collection of seven data engineering projects from Udacity's data engineering nanodegree, covering the standard stack: Postgres, Cassandra, Redshift, S3/Spark, Airflow, and MongoDB. It's aimed at people learning data engineering fundamentals or working through the same Udacity course, not practitioners building production systems.

Each project is self-contained with its own README and schema diagrams, making it easy to understand the scope before reading code. The Airflow project puts its custom operators in a plugins directory rather than inlining task logic in the DAG file, which is the right pattern. The progression from relational modeling to NoSQL to warehousing to pipelines mirrors how the domain actually builds on itself. The Scrapy+MongoDB project is a practical end-to-end example that goes beyond toy SQL exercises.

The repo was last pushed in February 2023 and is tied to a specific Udacity curriculum: the Airflow version is old enough that the DAG patterns won't run on current Airflow 2.x without modification. Everything runs against the fictional Sparkify music dataset, so there's zero adaptation to real-world data messiness: no schema drift, no late arrivals, no deduplication logic. The data quality checks in the Airflow operators are stub-level (row count > 0) and not useful as a reference for actual validation. There are no tests anywhere: the notebooks work as demos, but you can't verify they still run.
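
To make the gap between a stub-level check and actual validation concrete, here is a minimal sketch. It is not code from the repo: the function names, the songplays table, and its columns are hypothetical, and it uses Python's built-in sqlite3 in place of Redshift so it runs anywhere. The point is that a row-count check passes on data that null and duplicate-key checks would reject.

```python
import sqlite3


def check_row_count(conn, table):
    """Stub-level check: passes whenever the table has at least one row."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count > 0


def check_no_nulls(conn, table, column):
    """Passes only if the given column contains no NULL values."""
    (nulls,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()
    return nulls == 0


def check_unique(conn, table, column):
    """Passes only if no value in the given column appears more than once."""
    (dupes,) = conn.execute(
        f"SELECT COUNT(*) FROM ("
        f"SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1)"
    ).fetchone()
    return dupes == 0


def build_demo_table():
    """Hypothetical table with one NULL user_id and one duplicated key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE songplays (songplay_id INTEGER, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO songplays VALUES (?, ?)",
        [(1, 10), (2, None), (2, 11)],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_table()
    # Table and column names are trusted constants in this sketch,
    # so interpolating them into the SQL string is acceptable here.
    print("row count > 0:", check_row_count(conn, "songplays"))
    print("no NULL user_id:", check_no_nulls(conn, "songplays", "user_id"))
    print("unique songplay_id:", check_unique(conn, "songplays", "songplay_id"))
```

On the demo table the row-count check succeeds while the other two fail, which is the kind of bad load a count-only check lets through.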
