Students Train SOTA Code Models on Single GPUs

What It Is

A group of students has demonstrated that training state-of-the-art coding models no longer requires expensive multi-GPU clusters. Using DeepSpeed’s ZeRO-3 optimization technique, they successfully fine-tuned a 14-billion-parameter model on a single NVIDIA A6000 GPU, achieving competitive performance on coding benchmarks. The key innovation involves offloading optimizer states and model parameters to CPU RAM during training, dramatically reducing GPU memory requirements while maintaining training efficiency.

The approach centers on DeepSpeed’s memory optimization strategies, which partition model states across available resources. Instead of keeping everything in precious GPU memory, ZeRO-3 stores optimizer states and parameters in system RAM, transferring only what’s needed for each training step. This architectural shift transforms what was once a 1.6-month training process into a two-week sprint, making advanced model development accessible to researchers without institutional compute budgets.
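To see why offloading is necessary at this scale, a back-of-the-envelope calculation helps. Following the ZeRO paper’s accounting of roughly 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights, momentum, and variance), the model states alone far exceed a single A6000’s 48 GB of VRAM:

```python
# Rough memory accounting for mixed-precision Adam training,
# per the ZeRO paper's ~16 bytes/parameter estimate:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights,
# momentum, and variance (4 + 4 + 4). Activations and framework
# overhead are ignored, so real usage is even higher.

PARAMS = 14e9      # 14B-parameter model
GB = 1024**3

fp16_weights = PARAMS * 2
fp16_grads   = PARAMS * 2
fp32_optim   = PARAMS * 12   # master weights + Adam momentum + variance

total_gb = (fp16_weights + fp16_grads + fp32_optim) / GB
vram_gb = 48                 # NVIDIA A6000

print(f"model states: ~{total_gb:.0f} GB vs {vram_gb} GB of VRAM")
# model states: ~209 GB vs 48 GB of VRAM
```

Roughly 200+ GB of model state against 48 GB of VRAM is exactly the gap ZeRO-3’s CPU offload closes, since system RAM in that range is far cheaper than equivalent GPU memory.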

Why It Matters

This development fundamentally alters the economics of AI research. Graduate students, independent researchers, and small teams can now experiment with models that previously required $50,000+ in cloud computing costs or access to university clusters. The democratization extends beyond academia: startups and individual developers can fine-tune specialized coding assistants for niche programming languages or domain-specific tasks without venture funding.

The 41.7% Pass@1 score on LiveCodeBench (https://livecodebench.github.io/) demonstrates that single-GPU training doesn’t mean compromising on quality. This metric measures how often a model generates correct code on the first attempt, and achieving over 40% puts these student-trained models in competitive territory with commercially developed alternatives. Organizations can now iterate faster on custom models, testing different training approaches or datasets without waiting weeks for cluster availability.
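Pass@k scores like this are typically computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021); with k=1 it reduces to the fraction of correct samples per problem. A minimal sketch, with made-up sample counts for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one
    of k samples drawn from n generations, of which c are correct,
    passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 the estimator reduces to c / n per problem; Pass@1 is the
# mean over problems. These (samples, correct) counts are invented.
results = [(10, 6), (10, 2), (10, 5)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"Pass@1 = {score:.1%}")  # Pass@1 = 43.3%
```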

The broader ecosystem benefits from increased experimentation velocity. When training costs drop by an order of magnitude, researchers explore more architectural variations, data mixtures, and training techniques. This acceleration in the research cycle typically leads to faster progress across the field, as successful approaches get identified and shared more quickly.

Getting Started

Training a coding model with DeepSpeed requires setting up the optimization configuration correctly. First, clone the reference implementation and install DeepSpeed (`pip install deepspeed`).

The critical component is the DeepSpeed configuration file. Create or modify deepspeed_config.json to enable ZeRO-3 offloading:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"}
  },
  "train_batch_size": 16,
  "gradient_accumulation_steps": 4
}
```
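As a sanity check on these numbers: DeepSpeed requires train_batch_size to equal the per-GPU micro-batch size times gradient_accumulation_steps times the number of GPUs, so the configuration above implies a micro-batch of 4 on a single GPU:

```python
# DeepSpeed validates that train_batch_size ==
#   micro_batch_per_gpu * gradient_accumulation_steps * num_gpus,
# so the config above fixes the per-step micro-batch on one GPU.
train_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 1

micro_batch_per_gpu = train_batch_size // (gradient_accumulation_steps * num_gpus)
print(micro_batch_per_gpu)  # 4 sequences per forward/backward pass
```

If the micro-batch of 4 still overflows VRAM, raising gradient_accumulation_steps (while keeping train_batch_size fixed) shrinks it without changing the effective batch size.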

Launch training with the DeepSpeed runtime: `deepspeed train.py --deepspeed_config deepspeed_config.json` (substitute your actual training script for train.py).

Monitor GPU memory usage with nvidia-smi to verify that offloading is working: VRAM usage should remain stable rather than growing throughout training. Training can be stopped with Ctrl+C and resumed later, provided the training script saves checkpoints regularly, which is useful for adjusting hyperparameters or checking intermediate results.

Context

DeepSpeed isn’t the only memory optimization framework available. PyTorch’s Fully Sharded Data Parallel (FSDP) offers similar capabilities with tighter integration into the PyTorch ecosystem. Microsoft’s own ZeRO++ extends the original ZeRO approach with additional communication optimizations. However, DeepSpeed’s maturity and extensive documentation make it the most accessible option for researchers new to large-scale training.

The tradeoff for single-GPU training is time rather than quality. Offloading to CPU RAM introduces data transfer overhead, extending training duration compared to multi-GPU setups with everything in VRAM. For teams with access to multiple GPUs, distributed training with ZeRO-2 (which keeps parameters in GPU memory) often provides better throughput.
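For that multi-GPU case, the change to the earlier configuration is small. A sketch of a ZeRO-2 variant, dropping the CPU offload so parameters and the sharded optimizer states stay in GPU memory:

```json
{
  "zero_optimization": {
    "stage": 2
  },
  "train_batch_size": 16,
  "gradient_accumulation_steps": 4
}
```

Stage 2 partitions optimizer states and gradients across GPUs but replicates the parameters, which is why it trades memory savings for throughput.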

Model size remains a constraint: even with aggressive offloading, 70B+ parameter models still struggle on consumer hardware due to CPU RAM limitations. The sweet spot currently sits around 7B-20B parameters, where a single high-end GPU can handle training with reasonable iteration times.