// the find

geekyouth/SZT-bigdata

★ 2,461 · Scala · NOASSERTION · updated May 2026

Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟

A big data pipeline that ingests Shenzhen Metro tap card data (1.3M records from 2018) and routes it through Kafka → Flink → Redis/ES/HBase/ClickHouse, then builds a Hive data warehouse with ~20 analytical metrics. It's an educational project explicitly designed to touch as many big data technologies as possible with one dataset, not to solve a production problem.

The HBase rowkey design (reversed card number for balanced distribution) is a real engineering decision, not a tutorial copy-paste. The data warehouse layering (ODS→DWD→DWS→ADS) follows industry conventions correctly, which is rare for a solo learning project. The card number deobfuscation analysis (section 2.10) is a genuinely interesting forensic exercise on the dataset. Docker Compose configs for ELK and Prometheus/Grafana are included, which lowers the barrier to spinning up dependencies.
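The reversed-card-number rowkey mentioned above can be illustrated with a minimal sketch. HBase stores rows sorted lexicographically by rowkey and splits them into regions by contiguous key range, so keys that share a long common prefix tend to land in the same region. Reversing the card number moves its fast-varying trailing digits to the front of the key. The object name, the `rowKey` helper, and the sample card numbers below are hypothetical and are not taken from the repository.

```scala
// Illustrative sketch only: names and sample values are assumptions,
// not code from geekyouth/SZT-bigdata.
object RowKeySketch {

  /** Reverse the card number so its fast-varying trailing digits lead the key. */
  def rowKey(cardNo: String): String = cardNo.reverse

  def main(args: Array[String]): Unit = {
    // Made-up card numbers that share a long common prefix.
    val cards = Seq("660000001", "660000002", "660000003")

    // Unreversed, all three keys start with "66000000" and sort next to each other.
    // Reversed, they start with "1", "2", "3" and can fall into different key ranges.
    cards.foreach(c => println(s"$c -> ${rowKey(c)}"))
  }
}
```

The trade-off is that range scans over consecutive card numbers are no longer contiguous, which is usually acceptable when lookups are by a single card.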

The dataset is a single day from 2018 and the government data source link is dead, so you're working with a static 133.7MB snapshot forever, which limits any real-world relevance. Flink 1.10 and CDH 6.2 are both many releases behind; CDH is now a paid product, so the recommended setup is commercially blocked for new users. The project is explicitly incomplete: HBase real-time write, Spark batch analysis, and the DataV dashboard are all on the TODO list and haven't moved since 2020. Running this requires 40GB+ RAM across at least three machines, so the barrier to reproducing the full setup is high.
