
AI Synthetic Data Generation

ai_ml Intermediate
debt(d9/e7/b7/t9)
d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). detection_hints.automated is no; the failure mode (distribution narrowing, memorisation leakage, model collapse) is silent until real-world deployment exposes tail cases. No linter or SAST catches naive synthetic data pipelines.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix requires adding holdout validation sets, deduplication, distributional fidelity measurement (KS/Wasserstein), and provenance tagging — these span the data pipeline, training, and evaluation stages, not a single-call swap.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). applies_to spans queue-worker, cli, and library contexts; once synthetic data is mixed into training corpora without provenance, every future evaluation, audit, and debugging session is shaped by that choice, and untangling real-vs-synthetic retroactively is painful.

t9 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'catastrophic trap' (t9). The misconception is explicit: developers assume synthetic data is privacy-safe because no real records are copied, when generators routinely memorise and leak PII verbatim absent differential privacy. The 'obvious' privacy intuition is precisely wrong.


Also Known As

synthetic data, data augmentation with AI, generative data, artificial training data

TL;DR

Using generative models to produce artificial training, testing, or augmentation data that mimics the statistical properties of real datasets without exposing originals.

Explanation

Synthetic data generation uses LLMs, diffusion models, GANs, or statistical samplers to create artificial records that resemble real data in structure and distribution. Common uses include augmenting small training sets, generating edge cases for testing, producing privacy-preserving substitutes for sensitive datasets (medical, financial), bootstrapping fine-tuning datasets for new tasks, and stress-testing pipelines with realistic-looking but non-identifying records. Approaches range from prompt-based LLM generation ('Write 100 customer support tickets about billing issues') to model-based methods (tabular GANs like CTGAN, diffusion for images) to programmatic templating with controlled randomness.

The technique is powerful but trap-laden. Synthetic data inherits and often amplifies biases in the generator; an LLM trained on biased corpora will produce biased synthetics. Mode collapse is common: generators produce a narrow slice of plausible outputs rather than the full distribution, so models trained on synthetics fail on real-world tail cases. Recursive training (training new models on synthetic data from previous models) causes model collapse, where output quality degrades across generations. Privacy is not automatic: synthetic records can leak memorised training examples, especially with low-temperature LLM generation or insufficient differential privacy noise.

Quality validation requires comparing distributions (KS tests, Wasserstein distance), checking for memorisation, and measuring downstream task performance against real-data baselines. Synthetic data is best used to augment, not replace, real data, with explicit provenance tracking so synthetic and real records can be separated during evaluation.
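The distribution comparisons named above can be sketched for a single numeric feature (for example, ticket length or order amount). This is a minimal illustration, not a reference implementation: the function names ksStatistic and wasserstein1d are hypothetical, the Wasserstein version assumes both samples have the same size, and neither computes a p-value.

```php
<?php
declare(strict_types=1);

/**
 * Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
 * empirical CDFs of two numeric samples (0 = identical, 1 = fully separated).
 */
function ksStatistic(array $a, array $b): float
{
    sort($a);
    sort($b);
    $n = count($a);
    $m = count($b);
    $i = 0;
    $j = 0;
    $d = 0.0;
    while ($i < $n && $j < $m) {
        $x = min($a[$i], $b[$j]);
        // Step past every value equal to $x in both samples so ties are handled.
        while ($i < $n && $a[$i] <= $x) $i++;
        while ($j < $m && $b[$j] <= $x) $j++;
        $d = max($d, abs($i / $n - $j / $m));
    }
    return (float) $d;
}

/**
 * 1-D Wasserstein-1 distance for two equal-size samples: the mean absolute
 * difference between the sorted values. (Assumption: equal sizes only.)
 */
function wasserstein1d(array $a, array $b): float
{
    if (count($a) === 0 || count($a) !== count($b)) {
        throw new InvalidArgumentException('Samples must be non-empty and equal in size.');
    }
    sort($a);
    sort($b);
    $total = 0.0;
    foreach ($a as $k => $value) {
        $total += abs($value - $b[$k]);
    }
    return $total / count($a);
}
```

In practice each feature would be compared separately, with real holdout values on one side and synthetic values on the other. What counts as an acceptable distance depends on the feature's scale, so thresholds need calibrating per dataset.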

Common Misconception

Synthetic data is automatically privacy-safe because no real records are copied. In reality, generators frequently memorise and reproduce training examples verbatim, and naive synthetic datasets can leak personally identifiable information without formal differential privacy guarantees.
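One rough way to look for verbatim reproduction is to flag synthetic texts that share a long word sequence with the sensitive source data. The sketch below assumes plain-text records and uses hypothetical helper names (wordNgrams, findVerbatimOverlap). It only catches exact overlaps after lowercasing, so it complements rather than replaces formal differential privacy.

```php
<?php
declare(strict_types=1);

/** Split text into lowercase word n-grams, returned as a set (array keys). */
function wordNgrams(string $text, int $n): array
{
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $grams = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $grams[implode(' ', array_slice($words, $i, $n))] = true;
    }
    return $grams;
}

/**
 * Return the indexes of synthetic texts that share at least one word n-gram
 * with any source text. A long shared n-gram suggests verbatim copying.
 */
function findVerbatimOverlap(array $synthetic, array $source, int $n = 8): array
{
    $sourceGrams = [];
    foreach ($source as $text) {
        $sourceGrams += wordNgrams($text, $n);
    }
    $flagged = [];
    foreach ($synthetic as $index => $text) {
        if (array_intersect_key(wordNgrams($text, $n), $sourceGrams)) {
            $flagged[] = $index;
        }
    }
    return $flagged;
}
```

A smaller n catches shorter copied fragments but raises more false positives on common phrases; the default of 8 here is an illustrative choice, not a recommended standard.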

Why It Matters

Synthetic data can unblock projects with data scarcity or privacy constraints, but a poorly generated dataset silently injects bias, narrows distribution coverage, and produces models that pass internal benchmarks while failing in production on real-world tail cases.

Common Mistakes

  • Training a model exclusively on synthetic data without holding out a real validation set: benchmark scores then reflect agreement with the generator, not real-world performance.
  • Assuming synthetic equals private without applying differential privacy or memorisation audits.
  • Recursively training generators on their own outputs, accelerating model collapse and distribution narrowing.
  • Not measuring distributional fidelity (KS test, Wasserstein distance) or downstream accuracy against a real-data baseline before shipping synthetic data into a pipeline.
  • Mixing synthetic and real records without provenance flags, making it impossible to diagnose failures or audit datasets later.
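Two of the mistakes above (no real holdout, no provenance flags) can be avoided at the point where the corpus is assembled. The following is a sketch under assumptions: records are associative arrays, the '_provenance' key mirrors the one used in the Fixed example, and the function names and the 20% default holdout share are illustrative.

```php
<?php
declare(strict_types=1);

/** Stamp every record with where it came from. */
function tagProvenance(array $records, string $origin): array
{
    return array_map(
        fn(array $r) => array_merge($r, ['_provenance' => $origin]),
        array_values($records)
    );
}

/**
 * Reserve a share of REAL records for evaluation, then build the training
 * set from the remaining real records plus all synthetic ones.
 */
function splitWithRealHoldout(array $real, array $synthetic, float $holdoutShare = 0.2, int $seed = 42): array
{
    if (count($real) < 2) {
        throw new InvalidArgumentException('Need at least two real records to hold one out.');
    }
    $real = tagProvenance($real, 'real');
    mt_srand($seed);  // fixed seed so the split is reproducible
    shuffle($real);
    $holdoutSize = max(1, (int) floor(count($real) * $holdoutShare));
    return [
        'holdout' => array_slice($real, 0, $holdoutSize),
        'train' => array_merge(
            array_slice($real, $holdoutSize),
            tagProvenance($synthetic, 'synthetic')
        ),
    ];
}
```

Because the holdout is carved out before any synthetic rows are added, evaluation never touches generated data, and the tag lets later audits filter the training set by origin.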

Avoid When

  • Training models intended for high-stakes domains (medical diagnosis, credit decisions) on synthetic-only data without rigorous real-world validation.
  • Recursively training generators on outputs from prior generations - model collapse degrades distribution coverage rapidly.
  • Using synthetic data as a privacy guarantee without applying differential privacy or memorisation audits.
  • Replacing real evaluation sets with synthetic ones - synthetic-on-synthetic benchmarks are not predictive of production behaviour.

When To Use

  • Augmenting small real datasets where collection is expensive or slow.
  • Generating edge cases and adversarial examples that real data does not cover.
  • Producing privacy-substitute datasets for development and staging environments, paired with formal differential privacy guarantees.
  • Bootstrapping fine-tuning corpora for new tasks where no labelled data yet exists, then iterating with real feedback.
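For edge-case generation, programmatic templating with a seeded random source is often enough and keeps the dataset reproducible. This sketch assumes a simple '{slot}' placeholder syntax and a hypothetical generateFromTemplate function; the slot pools are where deliberately awkward values (zero, negative, very large amounts) would go.

```php
<?php
declare(strict_types=1);

/**
 * Fill a template with values drawn from per-slot pools using a seeded
 * generator, so the same seed reproduces the same dataset.
 * Each pool must be a non-empty, zero-indexed list of strings.
 */
function generateFromTemplate(string $template, array $slots, int $count, int $seed): array
{
    mt_srand($seed);
    $records = [];
    for ($i = 0; $i < $count; $i++) {
        $replacements = [];
        foreach ($slots as $name => $pool) {
            $replacements['{' . $name . '}'] = $pool[mt_rand(0, count($pool) - 1)];
        }
        $records[] = [
            'text' => strtr($template, $replacements),
            '_provenance' => 'synthetic',  // tag at creation time
        ];
    }
    return $records;
}

$tickets = generateFromTemplate(
    'I was charged {amount} for {product} and want a refund.',
    [
        'amount' => ['0.00', '-15.99', '999999.99'],  // boundary values
        'product' => ['the annual plan', 'a cancelled order'],
    ],
    20,
    1234
);
```

Templated records cover known edge cases cheaply but are less varied than model-generated text, so they suit test fixtures better than large training sets.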

Code Examples

✗ Vulnerable
// Generate fine-tuning data with an LLM, train, ship
$examples = [];
for ($i = 0; $i < 10000; $i++) {
    $response = $llm->complete([
        'model' => 'gpt-4o',
        'temperature' => 0.2,  // low - generator collapses to narrow outputs
        'messages' => [['role' => 'user', 'content' => 'Generate a customer complaint and resolution.']]
    ]);
    $examples[] = json_decode($response, true);
}

// No deduplication, no distribution check, no real-data validation
file_put_contents('train.jsonl', implode("\n", array_map('json_encode', $examples)));
$model = $finetuner->train('train.jsonl');  // model overfits to synthetic mode
$accuracy = evaluateOn($model, 'train.jsonl');  // self-eval, meaningless
✓ Fixed
// Generate with diversity controls, validate, mix with real data
$examples = [];
$seen = [];
foreach ($seedPrompts as $seed) {
    $response = $llm->complete([
        'model' => 'gpt-4o',
        'temperature' => 0.9,
        'top_p' => 0.95,
        'messages' => [['role' => 'user', 'content' => $seed]]
    ]);
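    $record = json_decode($response, true);
    if (!is_array($record)) continue;  // skip malformed or non-JSON output
    $hash = md5(json_encode($record));
    if (isset($seen[$hash])) continue;  // exact-match dedupe; near-duplicates need fuzzy matching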
    $seen[$hash] = true;
    $record['_provenance'] = 'synthetic';
    $examples[] = $record;
}

// Memorisation audit against the sensitive source corpus
$leaks = $auditor->checkMemorisation($examples, $sensitiveCorpus);
if (count($leaks) > 0) throw new RuntimeException('PII leak detected');

// Distribution check vs real holdout (0.3 is a placeholder; calibrate the threshold per dataset)
$drift = $stats->wasserstein($examples, $realHoldout);
if ($drift > 0.3) throw new RuntimeException('Distribution drift too high');

// Tag real records too, then train on the mixed corpus; evaluate ONLY on the real holdout
$realTagged = array_map(fn($r) => $r + ['_provenance' => 'real'], $realTrain);
$mixed = array_merge($realTagged, $examples);
$model = $finetuner->train($mixed);
$accuracy = evaluateOn($model, $realHoldout);

Added 12 May 2026
DEV INTEL Tools & Severity
🟡 Medium ⚙ Fix effort: Medium
⚡ Quick Fix
Always hold out a real validation set, deduplicate generated records, measure distributional fidelity against real data, and tag synthetic rows with provenance so they can be filtered during evaluation.
📦 Applies To
any, queue-worker, cli, library
🔗 Prerequisites
🔍 Detection Hints
LLM-driven loops producing training examples written directly to a dataset file with no deduplication, distribution check, or real-data holdout validation
Auto-detectable: ✗ No
⚠ Related Problems
🤖 AI Agent
Confidence: Medium · False Positives: Medium · ✗ Manual fix · Fix: Medium · Context: File · Tests: Regenerate
