// the find

gangly/datafaker

★ 644 · Python · updated Apr 2026

Datafaker is a large-scale test data and flow test data generation tool. Datafaker fakes data and inserts to varied data sources. Test data generation tool.

Datafaker is a CLI tool for generating fake test data and inserting it directly into relational databases, Kafka, Hive, HBase, and MongoDB. You describe your schema in a metadata file, and it generates rows at whatever volume and interval you need. It's aimed at backend developers and QA engineers who need realistic data in test databases without hand-crafting INSERT statements.

Broad sink support is the real selling point: covering MySQL, PostgreSQL, Oracle, Kafka, HBase, Hive, MongoDB, and Elasticsearch in one tool is unusual. The streaming mode with configurable intervals is useful for Kafka pipeline testing, where you need sustained data flow rather than a one-shot batch. Multi-table foreign-key coherence comes from enumerated types: tables draw their key values from a shared fixed pool, a practical approach that sidesteps the orphaned-FK problem in generated data. The test suite is comprehensive, with unit, integration, and functional layers and good coverage of the data type modules.
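To make the shared-pool idea concrete, here is a minimal sketch in plain Python. It does not use Datafaker or its metadata format; the pool, table shapes, and function names (`USER_ID_POOL`, `make_users`, `make_orders`) are all hypothetical and only illustrate why drawing foreign keys from one fixed pool avoids orphaned rows.

```python
import random

# Hypothetical fixed pool of key values shared by the parent and child tables.
USER_ID_POOL = [1001, 1002, 1003, 1004, 1005]


def make_users(pool):
    """Build one parent row for each ID in the shared pool."""
    return [{"user_id": uid, "name": f"user_{uid}"} for uid in pool]


def make_orders(pool, count, rng):
    """Build child rows whose user_id is drawn only from the shared pool."""
    return [
        {
            "order_id": i,
            "user_id": rng.choice(pool),
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(1, count + 1)
    ]


rng = random.Random(42)  # seeded so repeated runs produce the same rows
users = make_users(USER_ID_POOL)
orders = make_orders(USER_ID_POOL, 20, rng)

# Check referential integrity: no order should point at a missing user.
parent_ids = {u["user_id"] for u in users}
orphans = [o for o in orders if o["user_id"] not in parent_ids]
print(len(orphans))  # 0
```

The trade-off is that key values come from a small, fixed set, so the generated data has limited cardinality on those columns.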

Python 2.7 compatibility is still listed as a feature in 2026, which signals the project hasn't moved with the times; shipping setup.py instead of pyproject.toml points the same way. The metadata format is a custom text-based DSL that you have to learn and hand-maintain. There is no introspection of live database schemas, so you'll be duplicating your DDL. The last meaningful activity appears to predate the most recent push by a couple of years, and the 644-star count suggests limited adoption outside the original Chinese developer community. There is no support for modern data warehouses (BigQuery, Snowflake, Databricks) that have replaced Hive in many orgs.
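For context on what schema introspection would save, here is a rough sketch of reading column names and types from a live database instead of retyping them. This is not a Datafaker feature. It uses SQLite only because it ships with Python (SQLite is not one of the sinks listed above), and the `users` table and `introspect_columns` helper are made up for illustration.

```python
import sqlite3

# Throwaway in-memory database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)


def introspect_columns(conn, table):
    """Return (column_name, declared_type) pairs for a table.

    PRAGMA table_info yields rows of
    (cid, name, type, notnull, dflt_value, pk).
    The table name is interpolated directly, so pass trusted names only.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(name, col_type) for _cid, name, col_type, _notnull, _default, _pk in rows]


columns = introspect_columns(conn, "users")
print(columns)  # [('id', 'INTEGER'), ('name', 'TEXT'), ('age', 'INTEGER')]
```

A generator with this kind of step could derive a first-draft schema description from the database itself, leaving only the value distributions to be written by hand.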
