// the find

san089/goodreads_etl_pipeline

★ 1,513 · Python · MIT · updated Mar 2020

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

A portfolio-style data engineering project that builds a full pipeline from the GoodReads API through S3 landing zones, Spark on EMR, and into Redshift — with Airflow orchestrating the whole thing. It's aimed at data engineering learners who want to see all the AWS pieces wired together end-to-end. Not production software; treat it as a reference implementation.

The architecture covers the full classic data lake pattern (landing → working → processed → warehouse) rather than jumping straight to a destination table, which is a useful learning structure. The fake data generator for load testing is a thoughtful addition; most tutorial projects skip this entirely. Custom Airflow operators for analytics queries and data quality checks show how to extend Airflow beyond the built-in operators, which is the part most tutorials gloss over. The scenarios section (100x data, 100 concurrent users) at least shows the author thinking through scaling trade-offs, even if the answers are shallow.
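For readers who haven't written one, a custom data quality operator can be quite small. The sketch below is not taken from the repo: the class name `RowCountCheckOperator`, the `run_query` callable, and the `authors` table are all hypothetical, and SQLite stands in for Redshift so the example runs anywhere. If Airflow isn't installed, a minimal stand-in base class is used instead.

```python
import sqlite3

try:
    # Use the real base class when Airflow is installed.
    from airflow.models import BaseOperator
except ImportError:
    # Minimal stand-in so the sketch runs without Airflow.
    class BaseOperator:
        def __init__(self, task_id, **kwargs):
            self.task_id = task_id


class RowCountCheckOperator(BaseOperator):
    """Hypothetical operator: fail the task if any listed table is empty."""

    def __init__(self, tables, run_query, **kwargs):
        super().__init__(**kwargs)
        self.tables = tables
        # run_query is an assumed callable: SQL string in, list of rows out.
        self.run_query = run_query

    def execute(self, context):
        counts = {}
        for table in self.tables:
            rows = self.run_query(f"SELECT COUNT(*) FROM {table}")
            count = rows[0][0]
            if count == 0:
                raise ValueError(f"Data quality check failed: {table} is empty")
            counts[table] = count
        return counts


# Usage: an in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (id INTEGER)")
conn.execute("INSERT INTO authors VALUES (1), (2)")

check = RowCountCheckOperator(
    task_id="check_authors",
    tables=["authors"],
    run_query=lambda sql: conn.execute(sql).fetchall(),
)
print(check.execute(context={}))  # {'authors': 2}
```

Raising from `execute` is what marks the task as failed in Airflow, so a check like this can gate downstream tasks. Passing the query function in, rather than opening a connection inside the operator, is a choice made here only to keep the sketch testable.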

Last commit is March 2020: six years stale, built on Airflow 1.x patterns and Python 3.6 on EMR, so none of it works out of the box with anything current. The setup instructions require manually SSHing into EC2, copying files by hand, and running pip installs on the cluster; there's no infrastructure-as-code beyond a half-linked CloudFormation script in a separate repo. Security is an afterthought: Redshift credentials appear to be passed around as plaintext in config, and the psycopg2 connection to Redshift from EMR workers is a pattern you'd never want in production. The 'testing the limits' section claims 1.6 TB/day throughput from a 10-minute DAG run on a 3-node m5.xlarge cluster, which doesn't hold up to scrutiny: those numbers are almost certainly from the faker generating sequential local writes, not real distributed ETL.
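To put the throughput claim in concrete units, here is a back-of-the-envelope conversion. It rests on two assumptions that are not stated anywhere above: decimal units (1 TB = 10^12 bytes), and that the daily figure is a linear extrapolation of a single 10-minute run. The function name is made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # assumption: decimal terabytes


def implied_rates(tb_per_day, run_minutes, nodes):
    """Convert a TB/day claim into per-second, per-run, and per-node figures."""
    bytes_per_second = tb_per_day * TB / SECONDS_PER_DAY
    return {
        "mb_per_second": bytes_per_second / 10**6,
        "gb_per_run": bytes_per_second * run_minutes * 60 / 10**9,
        "mb_per_second_per_node": bytes_per_second / nodes / 10**6,
    }


rates = implied_rates(tb_per_day=1.6, run_minutes=10, nodes=3)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")
# mb_per_second: 18.5
# gb_per_run: 11.1
# mb_per_second_per_node: 6.2
```

The arithmetic only restates the claim in smaller units. It does not show which stage of the pipeline those bytes passed through, which is the question raised above.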
