// the find

georgian-io/LLM-Finetuning-Toolkit

★ 871 · Python · Apache-2.0 · updated May 2026

Toolkit for fine-tuning, ablating and unit-testing open-source LLMs.

A config-driven CLI for running LLM fine-tuning experiments, ablation studies, and basic QA testing against open-source models via HuggingFace + PEFT. You define one YAML file, point it at a dataset and one or more models, and it fans out training runs across all combinations. Aimed at ML engineers who want systematic experimentation without wiring up the boilerplate every time.
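To make the fan-out concrete, here is a minimal sketch of expanding a set of ablation axes into one run per combination. It is not the toolkit's code: the axis names (`model`, `lora_rank`, `prompt_template`), the placeholder values, and the `expand_matrix` helper are all assumptions made for illustration, and the real YAML schema may look quite different.

```python
from itertools import product

# Hypothetical ablation axes. The key names and values are illustrative
# only and are not taken from the toolkit's actual config schema.
ablation = {
    "model": ["model-a", "model-b"],
    "lora_rank": [8, 16],
    "prompt_template": ["short", "detailed"],
}


def expand_matrix(axes):
    """Return one config dict per combination of the axis values."""
    keys = list(axes)
    value_lists = [axes[key] for key in keys]
    # product() yields the Cartesian product, varying the last axis fastest.
    return [dict(zip(keys, values)) for values in product(*value_lists)]


runs = expand_matrix(ablation)
print(len(runs))  # 2 models x 2 ranks x 2 templates = 8 runs
print(runs[0])
```

The point of the sketch is the cost model: every axis you add multiplies the number of training runs, so a three-axis config with a few values each gets expensive quickly.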

The ablation matrix in a single YAML is genuinely useful: you specify multiple models, LoRA ranks, and prompt templates, and every combination runs automatically, which saves real setup time. The hash-based artifact directory means interrupted runs resume without re-downloading or re-generating datasets, a practical detail most similar tools miss. The QA test suite is a real differentiator: shipping CSV-driven tests for things like JSON validity and dot-product similarity against model output is more rigorous than 'eyeball the generations'. Code structure is clean: data, finetune, inference, and qa are properly separated modules, not one giant script.
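Two of these ideas are easy to sketch in plain Python. The block below is an assumption-laden illustration, not the toolkit's implementation: how the repo actually derives its hash, what it names its directories, and how its QA tests are defined are not shown here, so `config_hash`, `artifact_dir`, `is_valid_json`, and `dot_product_similarity` are hypothetical helpers.

```python
import hashlib
import json
from pathlib import Path


def config_hash(config: dict) -> str:
    """Derive a short, stable identifier from a config dict (hypothetical scheme)."""
    # Sorting keys makes the hash independent of key order in the config.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def artifact_dir(root: Path, config: dict) -> Path:
    """Map a config to its artifact directory; the same config gives the same path."""
    return root / config_hash(config)


def is_valid_json(text: str) -> bool:
    """QA-style check: does the model output parse as JSON?"""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def dot_product_similarity(a, b):
    """QA-style check: dot product of two equal-length embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


run_config = {"model": "model-a", "lora_rank": 8}
print(artifact_dir(Path("experiments"), run_config))
print(is_valid_json('{"answer": 42}'), is_valid_json("{answer: 42}"))
```

Because the directory name depends only on the config contents, a restarted run with an unchanged config lands in the same directory and can skip any step whose output already exists there.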

The model roster in the docs (Llama 2, Falcon, Mistral 7B) is frozen in time: the repo topics and examples haven't caught up to anything post-2023, and there's no evidence that Llama 3, Gemma, or Phi-3/4 work without manual config surgery. The QA metrics are shallow: word overlap and length tests are table stakes, but there's no integration with evals like LLM-as-judge or standard benchmarks (MMLU, MT-Bench), so you can confirm output length but not whether the fine-tuned model is actually better. The llama2/ and mistral/ directories at the repo root are dead weight: standalone scripts that predate the CLI and aren't wired into anything, so they only create confusion about what the canonical entry point is. There is no distributed training support; if your model doesn't fit on one GPU with QLoRA, you're on your own.
