
Fine-Tuning LLMs

Category: ai_ml · Level: Advanced
DEBT score: d9 / e7 / b7 / t7
d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). The detection_hints explicitly state 'automated: no' and the code_pattern describes a strategic decision (choosing fine-tuning when prompt engineering would suffice). There is no linter, compiler, or SAST tool that can catch this misapplication — it only becomes apparent after spending significant compute budget and observing underwhelming real-world results.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix notes that fine-tuning requires hundreds of high-quality examples, costs money, and produces a model that must be re-fine-tuned as the base model improves. Undoing a wrong fine-tuning decision means abandoning the fine-tuned model, constructing a RAG pipeline or redesigning prompts, potentially refactoring all calling code around the model interface, and re-evaluating — a significant cross-cutting effort across data pipelines, inference code, and evaluation infrastructure.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). Once fine-tuning is committed to, every future model update requires repeating the fine-tuning process. The fine-tuned model becomes load-bearing: deployment infrastructure, versioning, evaluation pipelines, and data curation workflows are all shaped by this choice. Applies across both web and cli contexts per applies_to, meaning it touches multiple workstreams and teams.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap (contradicts how a similar concept works elsewhere)' (t7). The canonical misconception is that 'fine-tuning adds new knowledge to a model' — a very intuitive but incorrect belief, since fine-tuning improves style and task performance, not factual grounding. This directly contradicts the mental model most developers carry from analogous training concepts, and the common_mistakes reinforce that developers repeatedly fall into this trap, spending significant resources before discovering the error.


Also Known As

fine-tuning, LoRA, PEFT, model fine-tuning

TL;DR

Training a pre-trained LLM on domain-specific data to improve performance on a specific task — more expensive and complex than prompt engineering but produces more consistent results.

Explanation

Fine-tuning updates model weights on a curated dataset, specialising the model for a domain or task. Full fine-tuning updates all weights (expensive, requires GPUs). PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA update a small fraction of weights, which is tractable on consumer hardware.

When to fine-tune:

  • Consistent tone or style not achievable with prompting
  • Domain terminology
  • Strict format adherence

When NOT to fine-tune:

  • Adding new knowledge (use RAG)
  • One-off tasks (use prompt engineering)
  • Small datasets (risk of overfitting)
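The claim that LoRA is "tractable on consumer hardware" comes down to simple arithmetic: for a weight matrix of shape d × k, full fine-tuning trains all d·k entries, while a rank-r LoRA adapter trains only r·(d + k). A minimal sketch of that ratio (the 4096 × 4096 shape below is illustrative, not taken from any specific model):

```javascript
// Fraction of a weight matrix's parameters that a rank-r LoRA adapter trains.
// Full fine-tuning updates d*k weights; LoRA trains two low-rank matrices,
// A (r x k) and B (d x r), for a total of r*(d + k) trainable parameters.
function loraTrainableFraction(d, k, r) {
  const full = d * k;        // parameters updated by full fine-tuning
  const lora = r * (d + k);  // parameters in the low-rank A and B matrices
  return lora / full;
}

// e.g. a 4096x4096 projection matrix with rank-8 adapters:
const frac = loraTrainableFraction(4096, 4096, 8);
console.log((frac * 100).toFixed(2) + "% of the weights are trained"); // 0.39%
```

With under half a percent of the weights trainable per matrix, optimizer state and gradients shrink proportionally, which is why low ranks (r of 4 to 64) are common choices.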

Common Misconception

"Fine-tuning adds new knowledge to a model." In reality, fine-tuning improves task performance and style; for adding specific facts or current data, RAG is more appropriate.
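To make the RAG alternative concrete, here is a minimal sketch: instead of training facts into the weights, retrieve the relevant document at query time and place it in the prompt. The documents, the keyword-overlap retriever, and the prompt wording are all illustrative inventions; production systems use embedding similarity rather than word overlap.

```javascript
// Toy document store standing in for a real knowledge base.
const docs = [
  "Refund policy: customers may return items within 30 days.",
  "Shipping: standard delivery takes 3-5 business days.",
];

// Toy retriever: score each document by how many query words it contains.
function retrieve(query, documents) {
  const words = new Set(query.toLowerCase().split(/\W+/));
  let best = documents[0];
  let bestScore = -1;
  for (const doc of documents) {
    const score = doc.toLowerCase().split(/\W+/)
      .filter((w) => words.has(w)).length;
    if (score > bestScore) { best = doc; bestScore = score; }
  }
  return best;
}

// The retrieved document becomes grounding context in the prompt,
// so the model answers from current data it was never trained on.
function buildPrompt(query) {
  return `Answer using only this context:\n${retrieve(query, docs)}\n\nQuestion: ${query}`;
}

console.log(buildPrompt("How long do I have to return an item?"));
```

Updating the knowledge here means editing `docs`, with no retraining; that is the core reason RAG beats fine-tuning for facts that change.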

Why It Matters

Fine-tuning vs RAG vs prompt engineering is a fundamental architectural decision — choosing fine-tuning when RAG is appropriate wastes significant compute budget without better results.

Common Mistakes

  • Fine-tuning on small datasets — under ~1000 examples risks overfitting; the model loses generality.
  • Using fine-tuning to add current knowledge — training data has a cutoff; use RAG for dynamic knowledge.
  • Not evaluating the fine-tuned model against a held-out test set — training accuracy does not predict real performance.
  • Fine-tuning before trying prompt engineering — prompt engineering is cheaper and often sufficient.
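The third mistake above, skipping a held-out test set, is cheap to avoid. A minimal sketch of the split (the function name and ratio are illustrative):

```javascript
// Hold out a slice of the example set before training so the fine-tuned
// model is scored on data it never saw. Uses a Fisher-Yates shuffle so the
// held-out slice is a random sample, not the tail of the file.
function trainTestSplit(examples, testRatio = 0.2) {
  const shuffled = [...examples];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const testSize = Math.max(1, Math.floor(shuffled.length * testRatio));
  return {
    test: shuffled.slice(0, testSize),  // never shown to the trainer
    train: shuffled.slice(testSize),    // used for fine-tuning
  };
}

const { train, test } = trainTestSplit(Array.from({ length: 100 }, (_, i) => i));
console.log(train.length, test.length); // 80 20
```

Evaluating the base model with a good prompt on the same held-out set also gives the baseline that tells you whether fine-tuning was worth it at all.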

Code Examples

✗ Vulnerable
// Fine-tuning for a task that prompt engineering handles:
// Expensive approach: collect 5000 examples, train for 3 hours, $200 GPU cost
// Task: 'Summarise PHP documentation in 2 sentences'

// This is trivially solved with a system prompt:
const systemPrompt = `You are a PHP documentation summariser.
Respond with exactly 2 concise sentences.`;
✓ Fixed
// Fine-tuning appropriate use case:
// Task: generate code comments in our internal style guide
// 1000+ examples of before/after pairs
// Consistent style not achievable with prompting

// LoRA fine-tune (efficient):
// Training data format:
// {"prompt": "Add PHPDoc for: public function process(Order $order): bool",
//  "completion": "/**\n * Processes order payment and updates inventory.\n * @param Order $order The order to process\n * @return bool True if successful\n */"}

// Cheaper alternative first: few-shot prompting with 10 examples in context
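The "cheaper alternative first" line above can be sketched as code. This is an assumed helper, not a library API: it packs a handful of before/after pairs into the prompt so the model imitates the house style in-context, with no training run at all. The example pairs are made up for illustration.

```javascript
// Hypothetical few-shot pairs demonstrating the target comment style.
const examples = [
  { input: "function save(User $u): void", output: "/** Persists the user. */" },
  { input: "function load(int $id): User", output: "/** Loads a user by id. */" },
];

// Build a few-shot prompt: task instruction, worked examples, then the new input.
function fewShotPrompt(task, pairs, newInput) {
  const shots = pairs
    .map((p) => `Input: ${p.input}\nOutput: ${p.output}`)
    .join("\n\n");
  return `${task}\n\n${shots}\n\nInput: ${newInput}\nOutput:`;
}

const prompt = fewShotPrompt(
  "Write a PHPDoc comment in our house style for each signature.",
  examples,
  "function process(Order $order): bool"
);
console.log(prompt);
```

If ten in-context examples already produce acceptable style, the fine-tuning run, its dataset curation, and its re-training treadmill are avoided entirely.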

Added 15 Mar 2026
Edited 22 Mar 2026
DEV INTEL Tools & Severity
🔵 Info ⚙ Fix effort: High
⚡ Quick Fix
Try prompt engineering and RAG before fine-tuning: fine-tuning requires hundreds of high-quality examples, costs money, and produces a model that must be re-fine-tuned whenever the base model improves.
📦 Applies To
any, web, cli
🔍 Detection Hints
Considering fine-tuning for tone/format tasks that a system prompt would handle; no eval set in place to measure improvement from fine-tuning.
Auto-detectable: ✗ No
🤖 AI Agent
Confidence: Low · False positives: High · Manual fix · Fix effort: High · Context: File · Tests: Update
