Fine-Tuning LLMs
debt(d9/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9). The detection_hints explicitly state 'automated: no' and the code_pattern describes a strategic decision (choosing fine-tuning when prompt engineering would suffice). There is no linter, compiler, or SAST tool that can catch this misapplication — it only becomes apparent after spending significant compute budget and observing underwhelming real-world results.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix notes that fine-tuning requires hundreds of high-quality examples, costs money, and produces a model that must be re-fine-tuned as the base model improves. Undoing a wrong fine-tuning decision means abandoning the fine-tuned model, constructing a RAG pipeline or redesigning prompts, potentially refactoring all calling code around the model interface, and re-evaluating — a significant cross-cutting effort across data pipelines, inference code, and evaluation infrastructure.
Closest to 'strong gravitational pull' (b7). Once fine-tuning is committed to, every future model update requires repeating the fine-tuning process. The fine-tuned model becomes load-bearing: deployment infrastructure, versioning, evaluation pipelines, and data curation workflows are all shaped by this choice. Applies across both web and cli contexts per applies_to, meaning it touches multiple workstreams and teams.
Closest to 'serious trap (contradicts how a similar concept works elsewhere)' (t7). The canonical misconception is that 'fine-tuning adds new knowledge to a model' — a very intuitive but incorrect belief, since fine-tuning improves style and task performance, not factual grounding. This directly contradicts the mental model most developers carry from analogous training concepts, and the common_mistakes reinforce that developers repeatedly fall into this trap, spending significant resources before discovering the error.
Also Known As
TL;DR
Fine-tuning specialises a model's style and task behaviour by updating its weights. Reach for it only after prompt engineering falls short, and never to add new knowledge; use RAG for that.
Explanation
Fine-tuning updates model weights on a curated dataset, specialising the model for a domain or task. Full fine-tuning updates all weights (expensive, requires GPUs). PEFT (Parameter-Efficient Fine-Tuning) methods like LoRA update a small fraction of weights, which is tractable on consumer hardware.
- When to fine-tune: consistent tone/style (not achievable with prompting), domain terminology, or format adherence.
- When NOT to fine-tune: adding new knowledge (use RAG), one-off tasks (use prompt engineering), or small datasets (risk of overfitting).
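For teams that use a hosted fine-tuning API rather than training locally, the flow looks roughly like this. A minimal sketch, assuming the OpenAI Node SDK (the openai npm package, v4+); the file path and base-model name are placeholders:

import fs from 'node:fs';
import OpenAI from 'openai';

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function launchFineTune(): Promise<void> {
  // Upload the curated JSONL training set (hundreds of examples at minimum).
  const file = await client.files.create({
    file: fs.createReadStream('train.jsonl'), // placeholder path
    purpose: 'fine-tune',
  });

  // Start the fine-tuning job against a base-model snapshot.
  const job = await client.fineTuning.jobs.create({
    model: 'gpt-4o-mini-2024-07-18', // placeholder base model
    training_file: file.id,
  });

  console.log(`job ${job.id}: ${job.status}`);
}

launchFineTune().catch(console.error);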
Common Misconception
That fine-tuning adds new knowledge to a model. It does not: fine-tuning improves style, format adherence, and task performance, not factual grounding. For new or current knowledge, use RAG.
Why It Matters
A wrong fine-tuning decision is silent until production: no automated tool flags it, the compute budget is already spent, and unwinding it means redesigning prompts or building a RAG pipeline and re-running evaluation. Every base-model update then forces the fine-tuning process to be repeated.
Common Mistakes
- Fine-tuning on small datasets: under ~1000 examples the model risks overfitting and losing generality.
- Using fine-tuning to add current knowledge — training data has a cutoff; use RAG for dynamic knowledge.
- Not evaluating the fine-tuned model against a held-out test set: training accuracy does not predict real performance (see the evaluation sketch after this list).
- Fine-tuning before trying prompt engineering — prompt engineering is cheaper and often sufficient.
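A sketch of the held-out evaluation from the third mistake above. The 20% split and the exact-match metric are simplifying assumptions (real evaluations usually need task-specific scoring), and the model call is abstracted behind a generate callback:

type Example = { prompt: string; completion: string };

// Fisher-Yates shuffle so the held-out set is a random sample.
function shuffle<T>(items: T[]): T[] {
  const a = [...items];
  for (let i = a.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

// Reserve a fraction of the data that the model never trains on.
function split(data: Example[], holdout = 0.2): { train: Example[]; test: Example[] } {
  const shuffled = shuffle(data);
  const cut = Math.floor(shuffled.length * (1 - holdout));
  return { train: shuffled.slice(0, cut), test: shuffled.slice(cut) };
}

// Exact match is a placeholder metric; swap in task-appropriate scoring.
async function evaluate(
  generate: (prompt: string) => Promise<string>,
  testSet: Example[],
): Promise<number> {
  let correct = 0;
  for (const ex of testSet) {
    if ((await generate(ex.prompt)).trim() === ex.completion.trim()) correct++;
  }
  return correct / testSet.length;
}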
Code Examples
// Anti-pattern: fine-tuning for a task that prompt engineering handles.
// Expensive approach: collect 5000 examples, train for 3 hours, $200 GPU cost.
// Task: 'Summarise PHP documentation in 2 sentences'
// This is trivially solved with a system prompt:
const systemPrompt =
  'You are a PHP documentation summariser. ' +
  'Respond with exactly 2 concise sentences.';
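Using that prompt with a chat completion call, sketched here with the OpenAI Node SDK and reusing systemPrompt from above; phpDocExcerpt and the model name are hypothetical:

import OpenAI from 'openai';

const client = new OpenAI();
const phpDocExcerpt = '...'; // hypothetical: the documentation text to summarise

const response = await client.chat.completions.create({
  model: 'gpt-4o-mini', // placeholder model name
  messages: [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: phpDocExcerpt },
  ],
});
console.log(response.choices[0].message.content);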
// Appropriate use case for fine-tuning:
// Task: generate code comments matching our internal style guide.
// 1000+ examples of before/after pairs are available.
// The required consistency is not achievable with prompting alone.
// LoRA fine-tune (parameter-efficient):
// Training data format:
// {"prompt": "Add PHPDoc for: public function process(Order $order): bool",
// "completion": "/**\n * Processes order payment and updates inventory.\n * @param Order $order The order to process\n * @return bool True if successful\n */"}
// Cheaper alternative first: few-shot prompting with 10 examples in context
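A sketch of that few-shot approach. The message shape follows the common chat-API convention ({ role, content }); the first demonstration pair reuses the training example above, and the final user turn is an illustrative new input:

const fewShotMessages = [
  { role: 'system', content: 'Add PHPDoc comments in our internal style.' },
  // Demonstration pair 1 of ~10, reusing the training example above:
  { role: 'user', content: 'Add PHPDoc for: public function process(Order $order): bool' },
  {
    role: 'assistant',
    content:
      '/**\n * Processes order payment and updates inventory.\n' +
      ' * @param Order $order The order to process\n' +
      ' * @return bool True if successful\n */',
  },
  // ...more demonstration pairs, then the actual request (hypothetical input):
  { role: 'user', content: 'Add PHPDoc for: public function cancel(Order $order): void' },
];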