// the find

san089/Udacity-Data-Engineering-Projects

★ 1,919 · Python · NOASSERTION · updated Aug 2022

A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.

A collection of six Udacity Data Engineering nanodegree projects covering Postgres/Cassandra data modeling, Redshift data warehousing, EMR/Spark data lakes, and Airflow pipelines. It's course homework made public: useful as a reference for someone learning these tools in sequence, but not a production library or framework.

The Airflow project includes custom operators (stage_redshift, load_fact, load_dimension, data_quality) that are actually reusable patterns, not just notebook exercises. The Redshift IaC script automates cluster provisioning via boto3, which is a more honest way to learn infrastructure than clicking through the console. Covers a sensible progression from single-node Postgres to distributed Spark/EMR, so the stack complexity scales with the concepts. The Cassandra project correctly designs tables around query patterns rather than treating it like a relational DB.
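To make the "provision via boto3" idea concrete, here is a minimal sketch of scripted Redshift cluster creation. It is not the repo's script: the function names, default node type, node count, and region are assumptions for illustration, and the node type in particular may need updating given how AWS offerings have changed. Only `boto3.client("redshift").create_cluster` and its keyword arguments are the real API.

```python
def build_cluster_params(identifier, db_name, user, password, role_arn,
                         node_type="dc2.large", num_nodes=4):
    """Build keyword arguments for the Redshift create_cluster call.

    The defaults (node_type, num_nodes) are illustrative assumptions,
    not values taken from the repo.
    """
    params = {
        "ClusterIdentifier": identifier,
        "NodeType": node_type,
        "DBName": db_name,
        "MasterUsername": user,
        "MasterUserPassword": password,
        "IamRoles": [role_arn],  # role the cluster assumes, e.g. to read S3
    }
    if num_nodes > 1:
        params["ClusterType"] = "multi-node"
        params["NumberOfNodes"] = num_nodes
    else:
        # NumberOfNodes is only supplied for multi-node clusters.
        params["ClusterType"] = "single-node"
    return params


def create_cluster(params, region="us-west-2"):
    """Submit the request. Needs boto3 installed and AWS credentials configured."""
    import boto3  # third-party; imported lazily so the builder works without it

    client = boto3.client("redshift", region_name=region)
    return client.create_cluster(**params)
```

Keeping the parameters in a plain dict means the configuration can be reviewed or diffed before anything is created, which is the main advantage over clicking through the console.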

Dead since August 2022: the AWS services referenced (especially EMR configs and Redshift node types) have drifted, and some setup steps will fail without adjustment. Everything runs against a toy music dataset bundled in the repo; there's no path to swapping in real data. No tests anywhere; the 'data quality' operator checks row counts, but that's it. The capstone links out to a separate repo instead of being included, so this collection is incomplete as presented.
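For a sense of how thin a row-count-only check is, here is a rough sketch of that pattern. It uses an in-memory SQLite database as a stand-in for Redshift, and the function and table names are hypothetical, not copied from the repo's operator.

```python
import sqlite3


def check_row_counts(conn, tables):
    """Return {table: row_count}; raise ValueError if any table is empty.

    Table names are interpolated into the SQL because identifiers cannot be
    bound as parameters, so they must come from a trusted list.
    """
    counts = {}
    for table in tables:
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        if count < 1:
            raise ValueError(f"Data quality check failed: {table} has no rows")
        counts[table] = count
    return counts


def make_demo_db():
    """Build a tiny in-memory database with hypothetical table names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE songplays (id INTEGER)")
    conn.execute("CREATE TABLE users (id INTEGER)")
    conn.execute("CREATE TABLE artists (id INTEGER)")  # left empty on purpose
    conn.executemany("INSERT INTO songplays VALUES (?)", [(1,), (2,)])
    conn.execute("INSERT INTO users VALUES (1)")
    return conn
```

A check like this catches a load that wrote nothing, but it says nothing about nulls, duplicates, or wrong values, which is the gap noted above.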
