Teaching a 0.8B Model to Debug Its Own Code
What It Is
A training technique that teaches small language models to fix their own programming errors by learning from specific failure patterns. Instead of trying to make a tiny model memorize correct solutions, this approach trains it to interpret test failures and adjust code accordingly.
The process creates a feedback loop: the model generates code, automated tests reveal what broke, the model attempts repairs, and successful fixes become training data. Using LoRA (Low-Rank Adaptation), the model learns on pairs of broken code and working fixes - specifically focusing on how to respond when shown exact failure messages with inputs, expected outputs, and actual outputs.
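To make the training data concrete, here is a minimal sketch of what one repair example might look like. The exact prompt template is an assumption; the source only specifies that each example pairs broken code and its failure message (input, expected output, actual output) with the working fix.

```python
# One hypothetical repair training example. The prompt shows the broken
# code plus the exact failure message; the completion is the fixed code.
broken = "def add(a, b):\n    return a - b"
failure = "Input: (2, 3)\nExpected: 5\nGot: -1"
fixed = "def add(a, b):\n    return a + b"

prompt = (
    "The following code fails its tests.\n\n"
    f"Code:\n{broken}\n\n"
    f"Failure:\n{failure}\n\n"
    "Fixed code:\n"
)
example = {"prompt": prompt, "completion": fixed}
```

The model is trained on the transformation, not the solution: it sees the failure context in the prompt and learns to emit the repair.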
Recent experiments with a 0.8B parameter model demonstrated this on HumanEval, a standard coding benchmark. After training on just 13 repair pairs for 3 minutes on a MacBook Air M4, single-pass performance jumped from 16/50 to 28/50 correct solutions - a 75% improvement. The training required roughly 10GB RAM, while inference ran on 6GB.
Why It Matters
This challenges assumptions about what small models can learn. Rather than treating them as compressed knowledge stores that need to memorize solutions, this technique teaches a meta-skill: interpreting failure information and making targeted corrections.
The implications extend beyond coding. Any task with automatic verification - SQL queries, mathematical proofs, data transformations - becomes a candidate for this training approach. Teams working with resource-constrained environments can potentially deploy smaller models that adapt to feedback rather than requiring massive parameter counts to store domain knowledge.
The efficiency stands out. Three minutes of training on consumer hardware produced measurable improvements. Organizations experimenting with on-device AI or edge deployment gain a practical method for enhancing model capabilities without cloud infrastructure or expensive GPU clusters.
What the model actually learned reveals something unexpected: it didn't memorize better solutions. Without failure feedback, its cold performance stayed mediocre even after training. But when given failure information, the trained model repaired substantially more solutions than it could before training. The model learned to leverage the debugging context itself.
Getting Started
Implementation requires three components: a code-generating model, automated test execution, and LoRA training infrastructure.
Start by setting up test harnesses that capture precise failure information:
def run_test(code, test_input, expected):
    """Run generated code on one test case; return a failure message
    with input, expected, and actual output, or None if it passes."""
    try:
        result = execute_code(code, test_input)
        if result != expected:
            return f"Input: {test_input}\nExpected: {expected}\nGot: {result}"
    except Exception as e:
        return f"Error: {e}"
    return None
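The harness above assumes an `execute_code` helper that runs the model's code. A minimal sketch, assuming the generated code defines a function named `solution` (both the helper name and the entry-point convention are illustrative):

```python
# Minimal, unsandboxed sketch: exec the generated code in a fresh
# namespace and call its entry function on the test input.
def execute_code(code, test_input, entry="solution"):
    namespace = {}
    exec(code, namespace)                  # define the model's function
    return namespace[entry](*test_input)   # call it with the test input

execute_code("def solution(x):\n    return x * 2", (3,))  # → 6
```

A real harness should add sandboxing and timeouts, since model-generated code can loop forever or have side effects.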
Collect broken/fixed pairs by running the model’s initial attempts, showing it failure messages, and saving successful repairs. Each training example should include the original code, the specific failure message, and the corrected version.
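The collection loop above can be sketched as follows. `generate` (wrapping the model) and `run_test` (returning a failure message or None, as in the harness earlier) are assumed interfaces; the prompt format is illustrative:

```python
# Collect (broken, failure, fixed) training triples. Only repairs that
# actually pass their tests are kept as training data.
def collect_repair_pairs(tasks, generate, run_test):
    pairs = []
    for task in tasks:
        code = generate(task["prompt"])
        failure = run_test(code, task["input"], task["expected"])
        if failure is None:
            continue  # first attempt already passed; nothing to learn
        repair_prompt = f"{code}\n\nFailure:\n{failure}\n\nFixed code:\n"
        fixed = generate(repair_prompt)
        if run_test(fixed, task["input"], task["expected"]) is None:
            pairs.append({"broken": code, "failure": failure, "fixed": fixed})
    return pairs
```

Note the filter at the end: only verified repairs become training examples, which is what keeps a 13-example dataset clean enough to work.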
For LoRA training, frameworks like Hugging Face PEFT work on standard hardware. The key is training specifically on the repair pattern - not just correct solutions, but the transformation from broken code plus failure context to working code.
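With PEFT, the adapter setup is a few lines. The hyperparameters below are illustrative defaults, not the values from the original experiment:

```python
# Hedged sketch of a LoRA adapter config with Hugging Face PEFT.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                 # low-rank dimension of the adapter matrices
    lora_alpha=32,        # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach to attention projections
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # wraps any HF causal LM
```

Training then runs over the repair pairs formatted as prompt/completion text, with only the adapter weights (a small fraction of the model) updated.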
Models around 0.8B parameters from sources like https://huggingface.co/models provide starting points. Qwen2.5-Coder and DeepSeek-Coder variants in this size range have shown responsiveness to this training approach.
Context
Traditional fine-tuning for coding models emphasizes exposure to correct solutions at scale. This technique inverts that priority, focusing on the debugging process itself with minimal examples.
The approach has clear boundaries. It requires automated verification - domains without objective correctness checks won’t benefit. The model still needs baseline coding ability; this enhances debugging skills rather than teaching programming from scratch.
Interestingly, typical scaling strategies showed diminishing returns: larger pools of training examples and lower sampling temperatures didn't improve results. Sometimes constrained compute produces better outcomes when the task involves learning a specific pattern rather than accumulating knowledge.
Compared to retrieval-augmented generation or tool-using agents, this method bakes the debugging capability directly into model weights. No external systems needed during inference, though the tradeoff is task-specific training rather than general-purpose enhancement.
The RAM requirements remain practical - 10GB for training, 6GB for inference - putting this within reach of developers working on modern laptops rather than requiring dedicated ML infrastructure.