// the find

damklis/DataEngineeringProject

★ 1,411 · Python · MIT · updated Dec 2022

Example end-to-end data engineering project.

A complete data pipeline demo that ingests news from RSS feeds and moves it through Kafka → MongoDB → Elasticsearch, with a Django REST API for querying. It's a portfolio/learning project that wires up about a dozen open-source tools to show how they fit together. It is aimed at data engineers who want to see a realistic (if simplified) streaming architecture running locally.

The component selection is realistic: Kafka Connect with Debezium for CDC, MinIO as a local S3 stand-in, and a proper CQRS split between write (MongoDB) and read (Elasticsearch) models. The proxy pool with rotating user agents is a practical touch most tutorial projects skip. Tests cover both the scraping and API layers rather than stopping at a single smoke test. The `manage.sh` wrapper keeps the Docker Compose lifecycle simple enough to actually run.
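For readers who haven't seen the proxy-pool pattern, here is a minimal sketch of the idea, not the repository's actual code. The `RotatingPool` class, the proxy addresses, and the user-agent strings are all hypothetical, and no network calls are made; the sketch only shows how each outgoing request could get the next proxy and a randomly chosen user agent.

```python
import random

# Hypothetical values for illustration only; not taken from the repository.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
]
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]


class RotatingPool:
    """Round-robin over proxies, pairing each with a random user agent."""

    def __init__(self, proxies, user_agents, seed=None):
        self._proxies = list(proxies)
        self._user_agents = list(user_agents)
        self._rng = random.Random(seed)
        self._index = 0

    def next_request_config(self):
        # Fail loudly when the pool is exhausted instead of scraping nothing.
        if not self._proxies:
            raise RuntimeError("proxy pool is empty")
        proxy = self._proxies[self._index % len(self._proxies)]
        self._index += 1
        return {
            "proxy": proxy,
            "headers": {"User-Agent": self._rng.choice(self._user_agents)},
        }

    def mark_dead(self, proxy):
        # Drop a proxy that timed out or was blocked.
        if proxy in self._proxies:
            self._proxies.remove(proxy)


pool = RotatingPool(PROXIES, USER_AGENTS, seed=0)
first = pool.next_request_config()
second = pool.next_request_config()
```

The returned dictionary is shaped so it could be handed to an HTTP client of your choice. Raising on an empty pool is a deliberate choice in this sketch: a stale proxy list then surfaces as an error rather than an empty result set.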

The repo has been abandoned since December 2022, is pinned to Python 3.8, and its CI badge points to a dead Travis CI URL, so you're on your own if anything breaks. Running this locally requires 8 GB of Docker memory for what amounts to a news aggregator, which tells you the infrastructure is the demo, not the product. The Airflow DAG uses public proxy scraping as its data source, so it will fail silently whenever those proxy lists go stale. There is no schema registry for Kafka, so the plain JSON (non-Avro) messages will quietly drift if you extend the pipeline.
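If you do extend the pipeline without a schema registry, one cheap guard is to validate each JSON message at the consumer boundary. The sketch below is an assumption-laden illustration, not something the repository ships: the field names in `EXPECTED_FIELDS` and the `validate_message` helper are hypothetical, and the real message shape may differ.

```python
import json

# Hypothetical field names and types; the repository's real messages may differ.
EXPECTED_FIELDS = {"title": str, "link": str, "published": str}


def validate_message(raw: bytes) -> dict:
    """Decode a JSON message and reject it if it drifts from the expected shape."""
    message = json.loads(raw)
    if not isinstance(message, dict):
        raise ValueError("schema drift: message is not a JSON object")
    missing = [name for name in EXPECTED_FIELDS if name not in message]
    wrong_type = [
        name
        for name, expected in EXPECTED_FIELDS.items()
        if name in message and not isinstance(message[name], expected)
    ]
    if missing or wrong_type:
        raise ValueError(
            f"schema drift: missing={missing}, wrong_type={wrong_type}"
        )
    return message


good = validate_message(
    b'{"title": "Hello", "link": "https://example.com/a", "published": "2022-12-01"}'
)
```

This only checks the presence and type of top-level fields. It is a stopgap, not a replacement for a registry with versioned schemas.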

View on GitHub →