ML Types
debt(d9/e7/b5/t7)
Closest to 'silent in production until users hit it' (d9). Detection_hints confirm 'automated: no' — there is no tool that catches paradigm misselection. Choosing clustering over supervised classification, or building a custom model when an API suffices, produces no error, warning, or lint signal; the consequence is only visible when model results underperform in production.
Closest to 'cross-cutting refactor across the codebase' (e7). Switching ML paradigms (e.g., from unsupervised clustering to supervised classification) means re-collecting or labelling data, retraining or swapping models, re-evaluating pipelines, and potentially re-architecting integration code. The quick_fix note acknowledges that for consumers of ML APIs the fix is simpler, but the common_mistakes list shows the real cost when the wrong paradigm is deeply embedded in a data pipeline.
Closest to 'persistent productivity tax' (b5). The choice of ML paradigm shapes data collection strategy, labelling effort, API or model selection, and evaluation metrics across multiple work streams. It applies to both web and cli contexts per applies_to. While it doesn't define the entire system shape, it imposes ongoing costs on data, model, and integration decisions throughout the project lifecycle.
Closest to 'serious trap' (t7). The misconception field directly states a high-confidence wrong belief: developers assume LLMs use purely supervised learning, when modern LLMs use self-supervised pre-training plus RLHF. The belief is carried over from classical supervised ML education, where the closely related concept (supervised learning) really does work that way. That makes it a paradigm-level misconception that can drive incorrect architectural and data-labelling decisions.
Also Known As
TL;DR
Explanation
- Supervised: learns from labelled input→output pairs. Examples: classification (spam/not spam), regression (predict price).
- Unsupervised: finds structure in unlabelled data. Examples: clustering (K-means), dimensionality reduction, anomaly detection.
- Self-supervised: the model generates its own labels from the data. Example: GPT predicts the next token.
- Reinforcement learning: an agent learns a policy from rewards. Examples: game playing, RLHF for fine-tuning LLMs.
Choosing the right paradigm depends on whether labels are available, whether you need groups or predictions, and latency requirements.
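The selection criteria above can be sketched as a small decision helper. This is a rule-of-thumb illustration only: `Paradigm`, `choose_paradigm`, and its three boolean parameters are hypothetical names introduced here, and latency requirements are not modelled.

```python
from enum import Enum


class Paradigm(Enum):
    SUPERVISED = "supervised"
    UNSUPERVISED = "unsupervised"
    SELF_SUPERVISED = "self-supervised"
    REINFORCEMENT = "reinforcement"


def choose_paradigm(has_labels: bool,
                    has_reward_signal: bool,
                    can_derive_targets_from_data: bool) -> Paradigm:
    """Map what the data offers to a paradigm (hypothetical rule of thumb)."""
    if has_labels:
        # Labelled input->output pairs exist: predict with them.
        return Paradigm.SUPERVISED
    if has_reward_signal:
        # No labels, but actions can be scored (e.g. engagement).
        return Paradigm.REINFORCEMENT
    if can_derive_targets_from_data:
        # Targets come from the data itself (e.g. the next token).
        return Paradigm.SELF_SUPERVISED
    # Nothing to predict against: look for groups or structure instead.
    return Paradigm.UNSUPERVISED


# Fraud detection with labelled history should not fall through to clustering.
fraud_choice = choose_paradigm(has_labels=True,
                               has_reward_signal=False,
                               can_derive_targets_from_data=False)
print(fraud_choice.value)
```

The ordering of the checks is a design assumption: labels are consulted first because ignoring available labels is one of the mistakes this entry warns about.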
Common Misconception
Why It Matters
Common Mistakes
- Supervised learning without sufficient labelled data — model learns noise
- Using clustering when supervised labels are available — worse results
- Ignoring class imbalance in supervised classification
- Not considering self-supervised approaches when labelling is expensive
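One common way to address the class-imbalance mistake listed above is to weight each class inversely to its frequency. The following is a minimal sketch using only the standard library; `balanced_class_weights` is a hypothetical helper, and the 95/5 label split is made-up illustration data, not a measured fraud rate.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return weight = n_samples / (n_classes * class_count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Illustrative, made-up label distribution: 95 legitimate, 5 fraudulent.
labels = ["ok"] * 95 + ["fraud"] * 5
weights = balanced_class_weights(labels)

# The rare class receives the larger weight, so its errors cost more in training.
for cls, weight in weights.items():
    print(cls, round(weight, 3))
```

Weights like these are typically passed to a classifier's loss function; whether reweighting, resampling, or a different evaluation metric is the better remedy depends on the problem.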
Code Examples
// Goal: detect fraud (labelled historical data exists)
// Wrong choice: unsupervised clustering (ignores labels)
// Should use: supervised binary classification with fraud labels
// Matching ML type to problem:
// Churn prediction (labelled) → supervised: logistic regression
// Customer segments (no predefined groups) → unsupervised: K-means
// Optimise recommendations (engagement signal) → reinforcement: bandit
// PHP code completion (large PHP codebase) → self-supervised: next-token
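To make the self-supervised row concrete, here is a minimal sketch of how next-token training pairs can be derived from raw text with no human labelling. `next_token_pairs`, the whitespace tokenisation, and the sample PHP snippet are all assumptions made for illustration; real code-completion models use learned tokenisers and far longer contexts.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) pairs where the target is the following token."""
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        pairs.append((context, tokens[i]))
    return pairs


# Hypothetical sample: a tiny PHP snippet split on whitespace.
tokens = "<?php echo $name ;".split()
pairs = next_token_pairs(tokens, context_size=2)

# Every target is taken from the data itself, so no labelling effort is needed.
for context, target in pairs:
    print(context, "->", target)
```

The pairs have the same shape as a supervised dataset; the only difference is that the labels come from the corpus rather than from annotators.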