Portfolio
Project

Teaching a small local LLM to write my git commits

Fine-tuning tiny open-weight models to turn a git diff into a clean Conventional Commit message — privately, offline, and fast enough to live in a git hook.

1. The problem

Writing git commits has always been my favourite part of programming… no, of course I’m joking — that’s exactly why I was about to spend a few dozen hours automating something that takes seconds.

Every developer has written lazy commit messages. But commit history is documentation: it’s the record of why a codebase looks the way it does. Automating a decent first draft is a reasonable thing to want.

So let’s scrape some commit data, fine-tune a few models on GPUs, evaluate them honestly, and find the best way to write our commit messages.

(Everything here — code, dataset, and models — is open source; links at the end if you want to dig in.)

2. The approach: why local, why fine-tuned

So how do we do this? We have a few different options:

So let’s fine-tune our first model!

3. The data

Let’s start with the most important thing: our training data. Its quality is probably the single biggest factor in output quality, we should be careful. Gathering and filtering good data can be hard — that’s literally a job — but fortunately it’s fairly simple in our case.

We need git diffs and their corresponding commit messages, and there are plenty on GitHub. The plan: find open-source repos that follow Conventional Commits, clone them, then run a Python script to extract (diff, message) pairs into a JSONL file.

One detail: I fine-tune specifically for TypeScript, because that’s what I mostly use in my current internship at Botpress. Here are some of the repos I chose: Angular, Tanstack Query, Vite, Nuxt… and of course the Botpress OSS repo, to match my actual workflow.

The script clones each repo, walks the commits, and for every commit grabs the diff and the message. But I don’t keep every commit — many are badly formatted (developers are sometimes lazy) and others just aren’t interesting. The main filters:

Lesson learned: my first dataset iteration used only 9 repos and was less strict one the rules (e.g too many repetitions) — not enough diversity. The current dataset has 13.

Running the script gives ~12k (diff, message) pairs — enough for fine-tuning, evaluation, and testing.

How many examples do we actually need?

Honestly, nobody knows precisely, and the literature gives a wide range depending on assumptions. What the empirical evidence suggests, for LoRA on a narrow, well-defined task like this (consistent format, short output, constrained vocabulary):

So our ~12k is more than enough. Why is this hard to pin down? It depends on:

Diversity also matters for generalization. I deliberately mixed:

Here’s is the shape of my dataset:

0k0.5k1k1.5k2k2.5k3k3.5k4k4.5k↑ examplesfixchoredocsfeattestrefactorcibuildperfstyle
Commit-type distribution across the 12,433 examples (10 Conventional Commit types)
nuxttypescript-eslintbotpressprismatypeormvitesttanstack-queryvuejs-coretrpcviteangularcommitlinttwenty0k0.2k0.4k0.6k0.8k1kexamples →
Examples per source repository (13 repos)
050100150200250300350400450↑ examples20406080100120140commit message length (chars) →p50 = 54p75 = 68p95 = 91p99 = 116
Commit message length distribution
0k0.2k0.4k0.6k0.8k1k1.2k↑ examples0k2k4k6k8k10k12kdiff length (chars) →p50 = 2,024p75 = 3,882p95 = 7,827p99 = 10,318
Diff length distribution

In addition, I made different analysis to compare the repos:

If you’re curious, here is the notebook to check the dataset quality: [insert link]

This analysis allowed me to tell the second version of the dataset was better than the first one.

Even with all these precautions, there’s a real data limitation: the training data came from curated open-source TypeScript repos, and may not generalize to codebases with different conventions or commit hygiene.

After the extra filtering we have 12,433 higher-quality examples. In practice I trim further (to keep examples under 4096 tokens) and hold out 5% for evaluation and 5% for testing, leaving 11,178 examples for fine-tuning.

4. Training

Originally I wanted to fine-tune on my own machine. My first tests used Hugging Face Transformers and then MLX (I’m on a MacBook M4 Pro, 48 GB — my work laptop; otherwise I use Linux btw). It worked, but it was long and the laptop ran hot. I’ll keep the Mac for inference only.

So I looked for free GPU options. Kaggle has a generous offer (~30h GPU/week), but fine-tuning the smallest model with Transformers was still ~9h — doable, but painful when you want to run lots of experiments. I switched to Google Colab Pro ($10/month), which in theory gives you an H100 (80 GB VRAM, 3.35 TB/s)… except they’re never available. The A100 (40 GB, 1.5 TB/s) is available and more than enough here. With Unsloth for performance, fine-tuning dropped to ~20 minutes. Quite impressive. So let’s invest $10 conscientiously and fine-tune our models.

Here are the models I want to fine-tune and compare:

Why these? They’re all open-weight, so I can run them locally for everyday use. I deliberately chose very small models: the task is narrow, I’ll fine-tune them, and I want them fast and light on memory. I favoured the Qwen-Coder models because they were pretrained on lots of code, so they should be better at mapping a diff to a message than general-purpose models (I’ll verify that later). I included a few general-purpose and newer models to compare.

To fine-tune them I use LoRA (Low-Rank Adaptation): instead of updating all the parameters (there are a lot), which is compute-intensive and slow, we freeze the base model and train a small pair of “adapter” matrices added to a few weight layers. I use r=16 with alpha = 2·r = 32 , which is the default value for my model size.

On Qwen2.5-Coder-3B that’s only ~30M trainable params out of 3B — we train roughly 1% of the parameters. Why does that work? The base model already knows code syntax, language, and summarization. It doesn’t need to relearn any of that — it just needs a small, low-dimensional nudge to consistently map “diff → commit message.” That nudge is exactly what LoRA’s compact update can capture, giving most of the benefit of full fine-tuning at a fraction of the cost and memory.

I set max input length to 4096 tokens, since git diffs can be long, then filter the dataset to that limit (which means tokenizing examples beforehand to filter).

Do we need a system prompt? Not strictly — a task-specific fine-tune learns from data patterns alone. But a system prompt makes inference more robust and explicit, so I’ll use“ one:

SYSTEM_PROMPT = (
    "You are an expert software engineer. "
    "Given a git diff, write a single concise conventional"
    "commit message in the imperative mood. "
    "Output only the commit message, nothing else."
)

At each step I pass only the diff, and the model returns the message — nothing else. The key rule: be consistent between training and inference. Use the exact same system prompt wording at inference as during training, or omit it in both.

Some of the training arguments I used for fine-tuning Qwen2.5-Coder-0.5B-Instruct:

training_args = SFTConfig(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=3,
    learning_rate=1e-4,
    warmup_steps=50,
    bf16=True,
    max_length=4096,
    packing=True,
)
(go to the notebook to see every parameter I used)

A few choices worth explaining:

I don’t use LoRA dropout to enable Unsloth’s optimizations. It’s not a problem here because the dataset (11k examples) is large enough relative to the ~1.75% of parameters being trained that overfitting risk is low.

Each model fine-tuning took between 20 and 30 minutes depending on the model size.

0.800.850.900.951.001.051.101.151.201.251.30↑ loss50100150200250300350400training step →training losseval loss
Training and eval loss during fine-tuning of Qwen2.5-Coder-1.5B-Instruct

5. Evaluation: how do you know it works?

This is probably the most important part of the whole project. It’s easy to fine-tune a model and convince yourself it’s better because the outputs look nicer — but proving it is harder, and it requires actually measuring, and being honest about what the measurements can and can’t tell you.

I evaluated on a test set of 621 examples the models had never seen during training (the 5% held out earlier). First step was generating the 621 predictions for each variant of each model. For this task, the T4 GPU was more than enough, I could even have generated then on my computer (remember, that’s the goal of this project).

The two variants for each model are:

Here is the system prompt given to each baseline model. This one is more detailed than the fine-tuned prompt to be a fair comparison.

BASELINE_SYSTEM_PROMPT = (
    "You are an expert software developer. "
    "Given a git diff, generate a single conventional concise commit message.\n\n"
    "Format: <type>(<optional-scope>): <description>\n"
    "Allowed types: feat, fix, docs, style, refactor, perf, test, build, ci, chore, revert.\n\n"
    "Rules:\n"
    "- The description must be in imperative mood (e.g. 'add', 'fix', 'remove') — never past tense (e.g. 'added') or gerund (e.g. 'adding').\n"
    "- Total length must be between 10 and 72 characters.\n"
    "- Do not end with a period, exclamation mark, or ellipsis.\n"
    "Output only the commit message, nothing else."
)

I used three layers of evaluation, each catching something the others miss.

ROUGE-L. Measures the longest common subsequence between the generated message and the human reference. It’s a metric with a real weakness: it rewards overlapping words and word order, not correctness. A model could score well while describing the wrong change, just by using similar phrasing. I keep it anyway because it’s standard, cheap, and useful as a sanity check rather than a verdict.

Structural checks. Simple, deterministic, and arguably more useful here than ROUGE-L: is the message between 10 and 72 characters? Does it follow the type(scope): description format? These don’t measure quality, but they measure whether the model learned the format — which, for a tool that has to slot into a real workflow, matters as much as semantic accuracy.

LLM-as-judge. For 200 test examples, I had GPT-4o-mini score each generated message against the diff and the human reference on accuracy, conciseness, and a relative judgment (better/equal/worse than the human reference), at temperature 0 for reproducibility. The limitation: using an LLM to judge an LLM has known biases — it can favour certain phrasing styles, and it’s not ground truth, just another signal. I treat it as one data point among three, not the deciding one. I started with GPT-5.4-nano but switched to GPT-4o-mini because the smaller judge wasn’t precise enough and often penalised a good commit message for the wrong reason. All the LLM calls for this analysis costed me less than a dollar.

Before the aggregate numbers, here are a few examples to judge for yourself — the diff, both model outputs, and the human reference:

Enough speaking, let’s see the results!

6. Results

Model ROUGE-L valid_length% conventional% judge_accuracy judge_conciseness vs_ref_better% vs_ref_equal% vs_ref_worse% total_score
Qwen2.5-Coder-3B-Instruct_finetuned 0.403 86.2% 99.8% 3.895 4.965 4.5% 31.0% 64.5% 0.789
Qwen2.5-Coder-1.5B-Instruct_finetuned 0.371 88.9% 100.0% 3.755 4.960 2.5% 30.5% 67.0% 0.778
Qwen2.5-Coder-0.5B-Instruct_finetuned 0.390 88.2% 100.0% 3.600 5.000 1.5% 23.5% 75.0% 0.773
Llama-3.2-1B-Instruct_finetuned 0.398 89.5% 100.0% 3.480 4.995 3.0% 20.0% 77.0% 0.769
Qwen2.5-Coder-0.5B-Instruct_finetuned_old-dataset 0.370 90.8% 100.0% 3.505 4.970 4.0% 16.0% 80.0% 0.765
Qwen2.5-Coder-3B-Instruct_baseline 0.320 68.8% 99.8% 4.110 4.935 16.0% 25.5% 58.5% 0.758
Qwen2.5-1.5B-Instruct_finetuned 0.322 90.8% 100.0% 3.415 4.965 1.5% 16.5% 82.0% 0.750
Qwen2.5-0.5B-Instruct_finetuned 0.363 84.1% 100.0% 3.400 4.960 3.0% 17.5% 79.5% 0.747
Qwen2.5-Coder-1.5B-Instruct_baseline 0.281 72.3% 98.1% 3.600 4.915 11.0% 15.5% 73.5% 0.719
Qwen2.5-1.5B-Instruct_baseline 0.243 77.3% 93.4% 3.520 4.890 6.5% 10.5% 83.0% 0.703
Qwen2.5-0.5B-Instruct_baseline 0.208 55.9% 95.2% 2.920 4.795 6.0% 5.5% 88.5% 0.626
Qwen2.5-Coder-0.5B-Instruct_baseline 0.230 53.1% 91.8% 3.035 4.695 8.0% 6.5% 85.5% 0.621
Llama-3.2-1B-Instruct_baseline 0.151 86.5% 36.2% 3.110 4.855 1.0% 19.0% 80.0% 0.557

Here is the total score formula:

total_score = 0.25 × accuracy_norm + 0.20 × ROUGE-L + 0.20 × conventional + 0.20 × concise_norm + 0.15 × valid_length

The headline number first: across every model I tested, fine-tuning improved every structural and ROUGE metric, with no exceptions. ROUGE-L went up between +26% and +164% relative depending on the model — the biggest jumps coming from the weakest baselines (Llama-3.2-1B, lowest at the start, gained the most); Conventional Commits compliance hit 100% for every fine-tuned model except the 3B (99.8%); valid-length compliance jumped by 3 to 35 percentage points. If I stopped here, the conclusion would be simple: fine-tuning seems to work. But the more interesting story is coming.

Smaller fine-tuned beats bigger un-tuned. The fine-tuned 0.5B model scores higher in total score (0.773) than the un-tuned 3B baseline (0.758). A model six times smaller, specialized on this one task, outperforms a much larger general-purpose model that’s never seen an example of how I want my commits written. This is the main argument for fine-tuning over “just prompt a bigger model” made concrete in the numbers — model size doesn’t substitute for task-specific training, at least not for something this narrow.

baseline fine-tuned
Qwen2.5-Coder-3BQwen2.5-Coder-1.5BQwen2.5-Coder-0.5BLlama-3.2-1BQwen2.5-1.5BQwen2.5-0.5B0.00.20.40.60.81.0total score →0.7580.7190.6210.5570.7030.6260.7890.7780.7730.7690.7500.747+4%+8%+24%+38%+7%+19%
Total score, baseline vs fine-tuned, per model (higher is better)

Diminishing returns as the base model grows. Looking at the relative improvement from fine-tuning at each size, the gain shrinks fast: +24% total score for 0.5B, +8% for 1.5B, +4% for 3B (on Qwen-2.5-coder models). The bigger the base model, the less there is for the adapter to teach it — Qwen2.5-Coder-3B already does a decent job at this task zero-shot, so fine-tuning is polishing rather than transforming. This is a useful number when deciding whether fine-tuning is worth the effort on a given model size: the smaller the model, the bigger the payoff.

Qwen2.5-Coder other models
0%5%10%15%20%25%30%35%40%45%↑ total-score gain (%)0.5B1B1.5B3Bbase model size →Qwen2.5-0.5BQwen2.5-1.5BLlama-3.2-1B+24%+8%+4%
Relative total-score gain from fine-tuning, by base-model size

Code-pretraining still matters, even with fine-tuning. Comparing Qwen2.5-Coder against the plain Qwen2.5 at matched sizes, the Coder variant wins both before fine-tuning (+2.3% at 1.5B) and after (+3.7% at 1.5B). The effect is modest, but consistent in direction. The model’s coding-specific pretraining and the LoRA adapter aren’t redundant, they compound.

One of the finding I find most interesting: fine-tuning trades rare brilliance for reliable competence. This shows up clearly in the LLM-as-judge’s relative verdicts. At every model size, the chance that fine-tuning makes the model actually beat the human-written reference goes down: 8.0%→1.5% at 0.5B, 11.0%→2.5% at 1.5B, 16.0%→4.5% at 3B. At the same time, the chance the model matches the human reference goes up substantially: 6.5%→23.5%, 15.5%→30.5%, 25.5%→31.0%. Said another way: the base model, prompted with explicit instructions, occasionally produces a surprisingly good commit message by getting creative — and just as often produces something off the rails. Fine-tuning sands off both ends of that distribution. The result is a model that almost never writes a great commit message, but reliably writes a good-enough one. For a tool meant to run unattended in a git hook, I’d take consistency over occasional brilliance but it’s worth showing that this is a tradeoff, not a strict improvement.

better than reference equal worse
0.5B1.5B3Bbaselinefinetunedbaselinefinetunedbaselinefinetuned0%20%40%60%80%100%share of test set →
LLM-judge verdict vs. the human reference, baseline vs. fine-tuned, per model size (Qwen2.5-Coder).

One result I can’t fully explain: judge_accuracy drops for the 3B model after fine-tuning (4.110 → 3.895), the only metric, on the only model, where fine-tuning makes things worse by GPT-4o’s judgment. Every other model improves on this metric after fine-tuning. I don’t have a confirmed explanation — my best guesses are that the 3B model already had enough competence that the LoRA adapter nudged it toward shorter, more “trained” phrasing at a small cost to factual completeness, or that this is noise from judging a 200-example subsample rather than the full test set. I’d want a larger judge sample before treating this as a real effect rather than variance.

Dataset quality: a real improvement, but smaller than I expected compared to the work of improving the dataset. Comparing the 0.5B model fine-tuned on the old, less diverse dataset against the new one, ROUGE-L moved from 0.371 to 0.391 — a real gain, but a modest one. More diverse data with stricter rules helped, but the single biggest lever in this project was still doing any task-specific fine-tuning at all. Maybe better quality commits, filtered with an LLM-as-judge would have given better results.

baseline old dataset new dataset
ROUGE-Ltotal score0.00.10.20.30.40.50.60.70.80.90.2300.3700.3900.6210.7650.773
Dataset quality on Qwen2.5-Coder-0.5B

One result shaped how I read everything else: all models lose to the human-written reference the vast majority of the time — 92.7% for the baseline, 96.7% for the fine-tuned model. At first glance that makes the fine-tuned model sound worse. But the human authors had something neither model had: full context on the codebase, the ticket, and the conversation that led to the change. Beating that is a different, much harder problem than “write a plausible commit message from a diff alone.”

To conclude, the fine-tune reliably fixes format (Conventional Commits, length) — which is exactly what makes it usable as a drop-in tool — and modestly improves semantic quality. It does not, and is not expected to, beat a human author who knows the codebase.

base model
 docs/troubleshooting/typed-linting/Performance.mdx | 1 +
 docs/troubleshooting/typed-linting/index.mdx       | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/troubleshooting/typed-linting/Performance.mdx b/docs/troubleshooting/typed-linting/Performance.mdx
index b17cd9937..4f2737128 100644
--- a/docs/troubleshooting/typed-linting/Performance.mdx
+++ b/docs/troubleshooting/typed-linting/Performance.mdx
@@ -180,6 +180,7 @@ Instead of globs that use `**` to recursively check all folders, prefer paths th
 // @ts-check
 
 import js from '@eslint/js';
+import { defineConfig } from 'eslint/config';
 import tseslint from 'typescript-eslint';
 
 export default defineConfig({
diff --git a/docs/troubleshooting/typed-linting/index.mdx b/docs/troubleshooting/typed-linting/index.mdx
index 163c2558b..3622c7b23 100644
--- a/docs/troubleshooting/typed-linting/index.mdx
+++ b/docs/troubleshooting/typed-linting/index.mdx
@@ -30,7 +30,7 @@ For example, to disable type-checked linting on all `.js` files:
   <TabItem value="Flat Config">
 
 ```js title="eslint.config.mjs"
-import defineConfig from 'eslint/config';
+import { defineConfig } from 'eslint/config';
 import tseslint from 'typescript-eslint';
 
 export default defineConfig(
baseline
docs(troubleshooting/typed-linting): Add TypeScript linting configuration for performance issues.
fine-tuned
docs: update typed-linting performance section
reference
docs: fix `defineConfig` typos

A practical note before moving to deployment: every number above comes from full-precision (BF16) models running on Colab. The version that will actually ships to my laptop is quantized to GGUF (Q4_K_M), and I haven’t yet established whether that introduces a meaningful drop.

7. Deployment: from notebook to git hook

This is where the project becomes a real tool: merge the LoRA adapter → export to GGUF → run with Ollama → wire it into a prepare-commit-msg hook.

There’s a memory/latency tradeoff in how you run the model:

8. Limitations and what’s next

All things considered, this works quite well: the fine-tuned model is genuinely better and consistent than the baseline on every metric I tracked. But “better than baseline” isn’t the same as perfect. So here are the limitations I ran into, and what could be improved.

9. Wrapping up

That was another story about automating a few seconds of typing into a complete ML project: scraping and filtering ~12k diff/message pairs, LoRA fine-tuning a handful of small models on a Colab A100 with Unsloth, a three-layer evaluation (ROUGE-L, structural checks, LLM-as-judge), and shipping the result as a local prepare-commit-msg hook. The takeaway I keep coming back to: a tiny model you can run offline, for free and privately, writes clean Conventional Commits — and for a task this narrow, task-specific fine-tuning beats prompting a larger model.

If you want to dig into the code, run the notebooks, or fine-tune your own:

Thanks for reading — and may your commit history finally read like documentation :)