Fine-Tuning LLMs
debt(d9/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints explicitly state 'automated: no' and the code_pattern describes a strategic decision (choosing fine-tuning when prompt engineering would suffice). There is no linter, compiler, or SAST tool that can catch this misapplication — it only becomes apparent after spending significant compute budget and observing underwhelming real-world results.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix notes that fine-tuning requires hundreds of high-quality examples, costs money, and produces a model that must be re-fine-tuned as the base model improves. Undoing a wrong fine-tuning decision means abandoning the fine-tuned model, constructing a RAG pipeline or redesigning prompts, potentially refactoring all calling code around the model interface, and re-evaluating — a significant cross-cutting effort across data pipelines, inference code, and evaluation infrastructure.
Closest to 'strong gravitational pull' (b7). Once fine-tuning is committed to, every future model update requires repeating the fine-tuning process. The fine-tuned model becomes load-bearing: deployment infrastructure, versioning, evaluation pipelines, and data curation workflows are all shaped by this choice. Applies across both web and cli contexts per applies_to, meaning it touches multiple workstreams and teams.
Closest to 'serious trap (contradicts how a similar concept works elsewhere)' (t7). The canonical misconception is that 'fine-tuning adds new knowledge to a model' — a very intuitive but incorrect belief, since fine-tuning improves style and task performance, not factual grounding. This directly contradicts the mental model most developers carry from analogous training concepts, and the common_mistakes reinforce that developers repeatedly fall into this trap, spending significant resources before discovering the error.
Also Known As
TL;DR
Fine-tuning specialises a model's style and task behaviour by updating its weights. Reach for it only after prompt engineering falls short, and never to add new knowledge; use RAG for that.
Explanation
Fine-tuning updates model weights on a curated dataset, specialising the model for a domain or task. Full fine-tuning updates all weights (expensive, requires GPUs). PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA update a small fraction of weights, which is tractable on consumer hardware.
- When to fine-tune: consistent tone/style (not achievable with prompting), domain terminology, or format adherence.
- When NOT to fine-tune: adding new knowledge (use RAG), one-off tasks (use prompt engineering), or small datasets (risk of overfitting).
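For teams that use a hosted fine-tuning API rather than training locally, the flow looks roughly like this. A minimal sketch, assuming the OpenAI Node SDK (the openai npm package, v4+); the file path and base-model name are placeholders:

import fs from 'node:fs';
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function launchFineTune(): Promise<void> {
  // Upload the curated JSONL training set (hundreds of examples at minimum).
  const file = await client.files.create({
    file: fs.createReadStream('train.jsonl'), // placeholder path
    purpose: 'fine-tune',
  });

  // Start the fine-tuning job against a base-model snapshot.
  const job = await client.fineTuning.jobs.create({
    model: 'gpt-4o-mini-2024-07-18', // placeholder base model
    training_file: file.id,
  });

  console.log(`job ${job.id}: ${job.status}`);
}

launchFineTune().catch(console.error);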
Common Misconception
That fine-tuning adds new knowledge to a model. It does not: fine-tuning improves style, format adherence, and task performance, not factual grounding. For new or current knowledge, use RAG.
Why It Matters
A wrong fine-tuning decision is silent until production: no automated tool flags it, the compute budget is already spent, and unwinding it means redesigning prompts or building a RAG pipeline and re-running evaluation. Every base-model update then forces the fine-tuning process to be repeated.
Common Mistakes
- Fine-tuning on small datasets: under ~1000 examples the model risks overfitting and losing generality.
- Using fine-tuning to add current knowledge — training data has a cutoff; use RAG for dynamic knowledge.
- Not evaluating the fine-tuned model against a held-out test set: training accuracy does not predict real performance (see the evaluation sketch after this list).
- Fine-tuning before trying prompt engineering — prompt engineering is cheaper and often sufficient.
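A sketch of the held-out evaluation from the third mistake above. The 20% split and the exact-match metric are simplifying assumptions (real evaluations usually need task-specific scoring), and the model call is abstracted behind a generate callback:

type Example = { prompt: string; completion: string };

// Fisher-Yates shuffle so the held-out set is a random sample.
function shuffle<T>(items: T[]): T[] {
  const a = [...items];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Reserve a fraction of the data that the model never trains on.
function split(data: Example[], holdout = 0.2): { train: Example[]; test: Example[] } {
  const shuffled = shuffle(data);
  const cut = Math.floor(shuffled.length * (1 - holdout));
  return { train: shuffled.slice(0, cut), test: shuffled.slice(cut) };
}

// Exact match is a placeholder metric; swap in task-appropriate scoring.
async function evaluate(
  generate: (prompt: string) => Promise<string>,
  testSet: Example[],
): Promise<number> {
  let correct = 0;
  for (const ex of testSet) {
    if ((await generate(ex.prompt)).trim() === ex.completion.trim()) correct++;
  }
  return correct / testSet.length;
}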
Code Examples
// Anti-pattern: fine-tuning for a task that prompt engineering handles.
// Expensive approach: collect 5000 examples, train for 3 hours, $200 GPU cost.
// Task: 'Summarise PHP documentation in 2 sentences'
// This is trivially solved with a system prompt:
const systemPrompt =
  'You are a PHP documentation summariser. ' +
  'Respond with exactly 2 concise sentences.';
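Using that prompt with a chat completion call, sketched here with the OpenAI Node SDK and reusing systemPrompt from above; phpDocExcerpt and the model name are hypothetical:

import OpenAI from 'openai';

const client = new OpenAI();
const phpDocExcerpt = '...'; // hypothetical: the documentation text to summarise

const response = await client.chat.completions.create({
  model: 'gpt-4o-mini', // placeholder model name
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: phpDocExcerpt },
  ],
});
console.log(response.choices[0].message.content);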
// Appropriate use case for fine-tuning:
// Task: generate code comments matching our internal style guide.
// 1000+ examples of before/after pairs are available.
// The required consistency is not achievable with prompting alone.
// LoRA fine-tune (parameter-efficient):
// Training data format:
// {"prompt": "Add PHPDoc for: public function process(Order $order): bool",
// "completion": "/**\n * Processes order payment and updates inventory.\n * @param Order $order The order to process\n * @return bool True if successful\n */"}
// Cheaper alternative first: few-shot prompting with 10 examples in context
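A sketch of that few-shot approach. The message shape follows the common chat-API convention ({ role, content }); the first demonstration pair reuses the training example above, and the final user turn is an illustrative new input:

const fewShotMessages = [
  { role: 'system', content: 'Add PHPDoc comments in our internal style.' },
  // Demonstration pair 1 of ~10, reusing the training example above:
  { role: 'user', content: 'Add PHPDoc for: public function process(Order $order): bool' },
  {
    role: 'assistant',
    content:
      '/**\n * Processes order payment and updates inventory.\n' +
      ' * @param Order $order The order to process\n' +
      ' * @return bool True if successful\n */',
  },
  // ...more demonstration pairs, then the actual request (hypothetical input):
  { role: 'user', content: 'Add PHPDoc for: public function cancel(Order $order): void' },
];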