// the find
gangly/datafaker
Datafaker is a large-scale test data and flow test data generation tool. Datafaker fakes data and inserts to varied data sources. Test data generation tool
Datafaker is a CLI tool for generating fake test data and inserting it directly into databases, Kafka, Hive, HBase, and MongoDB. You describe your schema via a metadata file and it generates rows at whatever volume and interval you need. It's aimed at backend developers and QA engineers who need realistic data in test databases without hand-crafting INSERT statements.
Broad sink support is the real selling point: MySQL, PostgreSQL, Oracle, Kafka, HBase, Hive, MongoDB, and Elasticsearch in one tool is unusual. The streaming mode with configurable intervals is useful for Kafka pipeline testing where you need sustained data flow, not just a one-shot batch. Multi-table FK coherence via enumerated types (shared fixed pools across tables) is a practical approach that sidesteps the 'orphaned FK' problem in generated data: child rows only ever reference key values that the parent table also draws from. The test suite is comprehensive, with unit, integration, and functional layers and good coverage of the data type modules.
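To make the shared-pool idea concrete, here is a minimal sketch in plain Python. It does not use Datafaker's metadata syntax or API; the function `make_rows`, the table shapes, and the column names are all hypothetical and exist only to illustrate why drawing foreign keys from a fixed pool avoids orphaned references.

```python
import random


def make_rows(n_users, n_orders, seed=0):
    """Generate two related tables whose FK values come from one shared pool.

    Hypothetical illustration only: this is not Datafaker's implementation.
    """
    rng = random.Random(seed)

    # The fixed pool of key values that both tables draw from.
    user_id_pool = list(range(1, n_users + 1))

    # Parent table: one row per pool value.
    users = [{"user_id": uid, "name": f"user_{uid}"} for uid in user_id_pool]

    # Child table: every FK is sampled from the same pool,
    # so it always points at an existing parent row.
    orders = [
        {
            "order_id": i,
            "user_id": rng.choice(user_id_pool),
            "amount": round(rng.uniform(1, 500), 2),
        }
        for i in range(1, n_orders + 1)
    ]
    return users, orders


if __name__ == "__main__":
    users, orders = make_rows(n_users=5, n_orders=20)
    known_ids = {u["user_id"] for u in users}
    orphans = [o for o in orders if o["user_id"] not in known_ids]
    print(f"{len(orders)} orders, {len(orphans)} orphaned")
```

The trade-off is that the pool has to be enumerated up front, so this works best for small or medium parent tables rather than ones with millions of keys.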
Python 2.7 compatibility is still listed as a feature in 2026, which signals the project hasn't moved with the times; setup.py instead of pyproject.toml confirms this. The metadata format is a custom text-based DSL that you have to learn and hand-maintain. There is no introspection of live database schemas, so you'll be duplicating your DDL. The last meaningful activity appears to predate the last push date by a couple of years, and the 644-star count suggests limited adoption outside the original Chinese developer community. There is no support for modern data warehouses (BigQuery, Snowflake, Databricks) that have largely replaced Hive in most orgs.