finds.dev

// the find

MrSuiChuan/data-warehouse-learning

★ 1,154 · Java · Artistic-2.0 · updated Apr 2026

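[2026 latest edition] Big data, data analytics, e-commerce system, real-time data warehouse, offline data warehouse, data lake: construction plans and hands-on code, covering the components #flink #paimon #doris #seatunnel #dolphinscheduler #datart #dinky #hudi #iceberg.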

A Chinese-language learning project demonstrating how to build a data warehouse on an e-commerce dataset, covering both offline (Doris + SeaTunnel + DolphinScheduler) and real-time lakehouse (Flink + Paimon/Hudi/Iceberg) stacks. It's aimed at data engineers preparing for job interviews or learning the big data toolchain popular in China; it is not something you'd fork and run in production.

The side-by-side comparison of four storage layers (the Doris OLAP database plus the Paimon, Hudi, and Iceberg table formats) against the same ODS→DWD→DIM→DWS→ADS pipeline is genuinely useful for understanding the tradeoffs between them. The SeaTunnel connector library is exhaustive: 50+ connector config examples covering CDC sources, cloud object stores, and obscure sinks that are hard to find documented elsewhere. The Flink section covers a broad operational surface (watermarks, state backends, checkpoints, restart strategies, interval joins), each as a standalone runnable demo rather than one giant monolith. Version pinning in the software table is specific (Flink 1.18.1, Paimon 0.8, Hudi 0.15.0), so you can actually reproduce the environment rather than guessing.
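For readers unfamiliar with the layer names, here is a minimal plain-Python sketch of what an ODS→DWD→DIM→DWS→ADS flow does to a handful of order records. It is not taken from the repo, where this logic would be expressed in SQL on Doris or Flink; the table contents, field names, and function names are all hypothetical and chosen only to illustrate the role of each layer.

```python
from collections import defaultdict
from datetime import datetime

# ODS: hypothetical raw records as landed, all strings, including one
# duplicate row and one row with a malformed amount.
ODS_ORDERS = [
    {"order_id": "1", "sku_id": "A", "amount": "20.00", "ts": "2026-04-01 10:00:00"},
    {"order_id": "1", "sku_id": "A", "amount": "20.00", "ts": "2026-04-01 10:00:00"},
    {"order_id": "2", "sku_id": "B", "amount": "5.50", "ts": "2026-04-01 11:30:00"},
    {"order_id": "3", "sku_id": "A", "amount": "oops", "ts": "2026-04-02 09:15:00"},
    {"order_id": "4", "sku_id": "B", "amount": "4.50", "ts": "2026-04-02 12:00:00"},
]

# DIM: a small lookup table used to enrich fact rows (hypothetical values).
DIM_SKU = {"A": "books", "B": "snacks"}


def build_dwd(ods_rows):
    """DWD: deduplicate by order_id, cast types, drop rows that fail to parse."""
    seen, cleaned = set(), []
    for row in ods_rows:
        if row["order_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
            day = datetime.strptime(row["ts"], "%Y-%m-%d %H:%M:%S").date().isoformat()
        except ValueError:
            continue  # malformed row: skipped in this sketch
        seen.add(row["order_id"])
        cleaned.append(
            {"order_id": int(row["order_id"]), "sku_id": row["sku_id"],
             "amount": amount, "day": day}
        )
    return cleaned


def build_dws(dwd_rows, dim_sku):
    """DWS: enrich with DIM and aggregate revenue per (day, category)."""
    totals = defaultdict(float)
    for row in dwd_rows:
        category = dim_sku.get(row["sku_id"], "unknown")
        totals[(row["day"], category)] += row["amount"]
    return dict(totals)


def build_ads(dws_totals):
    """ADS: a report-ready view, here total revenue per category."""
    report = defaultdict(float)
    for (_day, category), amount in dws_totals.items():
        report[category] += amount
    return dict(report)


dwd = build_dwd(ODS_ORDERS)
dws = build_dws(dwd, DIM_SKU)
ads = build_ads(dws)
```

Each layer reads only from the one before it, so the same shape can be rebuilt on a different storage engine by swapping the implementation of one step.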

There is essentially no Java source code in the warehouse layer: the data processing logic lives in SQL scripts that aren't included in the repo tree, only screenshots of results. Anyone trying to learn the DWD/DWS transformation logic has to squint at PNG files. The entire repo assumes you're on CentOS 8 with a manual cluster setup; there is no Docker Compose for the main warehouse stack, no CI, and no way to run this without provisioning 8+ services by hand. The learn_llm section is a completely unrelated LLM fine-tuning tangent shoved into the same Maven project, with Python files living under src/main/java, a sign this grew by accretion rather than design. Documentation is Chinese-only, which limits its audience for a repo that covers tools (Flink, Iceberg, SeaTunnel) with substantial English-speaking userbases.

