Search Indexing Pipeline

Search PHP 7.0+ Intermediate
debt(d7/e5/b5/t5)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7), because sync indexing in request paths shows up via Horizon/Datadog latency traces but isn't flagged by static tools; detection_hints.automated is 'no'.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5), since the quick_fix requires moving indexing into queue jobs wired to model events — a refactor across model observers, job classes, and indexer service, not a one-liner.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5), as the pipeline applies across web and queue-worker contexts and every searchable model must integrate with it; analyser/synonym choices shape ongoing search work.

t5 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'notable trap (a documented gotcha most devs eventually learn)' (t5), matching the misconception that the index is just a DB copy — devs eventually learn about stemming, analysers, and async indexing after hitting the gotchas.


Also Known As

inverted index, document indexing, search pipeline, tokenisation

TL;DR

The process of transforming raw content into a searchable index — extraction, normalisation, tokenisation, stemming, and index writing with incremental update strategies.

Explanation

A search indexing pipeline has five stages:

  1. Extract: fetch content from the database, files, or an API.
  2. Normalise: lowercase, strip HTML, remove special characters.
  3. Tokenise: split the text into terms.
  4. Analyse: apply a stemmer (runs → run, running → run), expand synonyms, and remove stop words.
  5. Index: write the inverted index (term → list of document IDs with positions).

Incremental indexing handles routine updates, using updated_at timestamp queries, change data capture (CDC) from the database, or event-driven indexing via domain events. A full re-index is reserved for changes to the schema or analyser configuration, because those alter how every document is stored.
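The normalise, tokenise, analyse, and index stages above can be sketched in plain PHP. This is a minimal illustration under stated assumptions: every function name is hypothetical, the input is an in-memory array standing in for the extract stage, text is treated as ASCII, and the stemmer is a crude suffix stripper rather than a real algorithm such as Porter or Snowball.

```php
<?php
declare(strict_types=1);

/** Normalise: strip HTML, lowercase, collapse non-alphanumerics to spaces (ASCII only). */
function normalise(string $raw): string
{
    // Pad "<" so adjacent elements do not glue their words together when tags go.
    $text = html_entity_decode(strip_tags(str_replace('<', ' <', $raw)));
    return trim((string) preg_replace('/[^a-z0-9]+/', ' ', strtolower($text)));
}

/** Tokenise: split on whitespace; array keys are the token positions. */
function tokenise(string $text): array
{
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Crude illustrative stemmer: strips "ing" and a plural "s" only. */
function stem(string $token): string
{
    if (strlen($token) > 5 && substr($token, -3) === 'ing') {
        $stem = substr($token, 0, -3);
        $n = strlen($stem);
        // Undouble a trailing consonant: "runn" becomes "run".
        if ($n > 2 && $stem[$n - 1] === $stem[$n - 2]) {
            $stem = substr($stem, 0, -1);
        }
        return $stem;
    }
    if (strlen($token) > 3 && substr($token, -1) === 's' && substr($token, -2) !== 'ss') {
        return substr($token, 0, -1);
    }
    return $token;
}

/** Analyse: drop stop words, stem the rest, keep original positions. */
function analyse(array $tokens): array
{
    $stopWords = ['a', 'an', 'the', 'and', 'of', 'in', 'on'];
    $terms = [];
    foreach ($tokens as $position => $token) {
        if (in_array($token, $stopWords, true)) {
            continue;
        }
        $terms[$position] = stem($token);
    }
    return $terms;
}

/** Index: build term => [docId => [positions]]. */
function buildInvertedIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $docId => $raw) {
        foreach (analyse(tokenise(normalise($raw))) as $position => $term) {
            $index[$term][$docId][] = $position;
        }
    }
    return $index;
}

$index = buildInvertedIndex([
    1 => '<strong>PHP</strong> runs the indexing job',
    2 => 'Running PHP in a queue worker',
]);

print_r($index['run']); // document IDs and positions for the stem "run"
```

In this sketch both "runs" (document 1) and "Running" (document 2) land under the single term "run", while the tag name "strong" and the stop word "the" never reach the index. A dedicated engine performs the same stages with configurable analysers.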

Common Misconception

Misconception: the search index is just a copy of the database. In reality the index is a transformed, analysed, denormalised structure optimised for retrieval, and building it well requires deliberate choices about what to include and how to analyse it.

Why It Matters

A search index built without stemming treats 'authenticate' and 'authentication' as different terms — users searching for 'authentication' miss results containing 'authenticate' only.
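The effect can be reproduced with a toy example. The sketch below assumes a hypothetical crudeStem() helper that only strips a few suffixes; it is not a real stemming algorithm, but it is enough to show why the exact-term lookup misses and the stemmed lookup hits.

```php
<?php
declare(strict_types=1);

/**
 * Crude suffix stripper for illustration only (hypothetical helper, not
 * Porter or Snowball). Longer suffixes are tried first.
 */
function crudeStem(string $term): string
{
    foreach (['ation', 'ate', 'ing', 's'] as $suffix) {
        $len = strlen($suffix);
        if (strlen($term) > $len + 2 && substr($term, -$len) === $suffix) {
            return substr($term, 0, -$len);
        }
    }
    return $term;
}

/** Index one document's terms as a set, optionally stemming them first. */
function indexTerms(array $terms, bool $stem): array
{
    $index = [];
    foreach ($terms as $term) {
        $index[$stem ? crudeStem($term) : $term] = true;
    }
    return $index;
}

$docTerms = ['users', 'authenticate', 'with', 'tokens'];
$query    = 'authentication';

$exact   = indexTerms($docTerms, false);
$stemmed = indexTerms($docTerms, true);

var_dump(isset($exact[$query]));              // bool(false): the exact terms differ
var_dump(isset($stemmed[crudeStem($query)])); // bool(true): both reduce to "authentic"
```

The query has to pass through the same analysis as the indexed text; stemming only one side would produce a mismatch of its own.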

Common Mistakes

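  • Synchronous re-indexing on every write: index updates should run asynchronously via a queue.
  • No incremental indexing: a full re-index on every change costs O(n) in the size of the corpus, regardless of how small the change is.
  • Indexing HTML without stripping tags: markup pollutes the index, so tag names such as 'strong' become searchable terms and, with a whitespace tokeniser, '<strong>PHP</strong>' is stored as a single token that never matches 'PHP'.
  • No synonym configuration: 'oauth' and 'open authorisation' are treated as unrelated terms.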
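As one way to avoid the full re-index, the updated_at approach can be sketched with a watermark: remember the newest timestamp already indexed and only touch rows changed after it. The function names, the row shape, and the in-memory arrays below are assumptions made for illustration; a real implementation would run the equivalent filter as a database query.

```php
<?php
declare(strict_types=1);

/**
 * Return rows modified after the watermark. Against a database this would be
 * a query along the lines of: WHERE updated_at > :watermark.
 */
function changedSince(array $rows, int $watermark): array
{
    return array_filter($rows, function (array $row) use ($watermark): bool {
        return $row['updated_at'] > $watermark;
    });
}

/** Upsert only the changed rows into the index and return the new watermark. */
function indexIncrementally(array $rows, array &$index, int $watermark): int
{
    foreach (changedSince($rows, $watermark) as $row) {
        $index[$row['slug']] = strip_tags($row['body']);
        $watermark = max($watermark, $row['updated_at']);
    }
    return $watermark;
}

$rows = [
    ['slug' => 'oauth', 'body' => '<p>Open authorisation</p>', 'updated_at' => 100],
    ['slug' => 'jwt',   'body' => 'JSON Web Token',            'updated_at' => 200],
];
$index = [];

// First run: nothing indexed yet, so both rows are processed.
$watermark = indexIncrementally($rows, $index, 0);

// One row changes; the next run touches that row only.
$rows[0]['body'] = 'Open authorisation framework';
$rows[0]['updated_at'] = 300;
$touched   = count(changedSince($rows, $watermark));
$watermark = indexIncrementally($rows, $index, $watermark);

printf("re-indexed %d row(s), watermark now %d\n", $touched, $watermark);
```

A timestamp poll of this kind cannot see hard-deleted rows, which is one reason to combine it with soft deletes, CDC, or event-driven indexing.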

Code Examples

✗ Vulnerable
// Synchronous full re-index on every term save:
public function save(Term $term): void {
    $this->db->save($term);
    // Blocks request — re-indexes ALL 800 terms on every save:
    foreach ($this->db->findAll() as $t) {
        $this->searchEngine->index($t); // O(n) on every save!
    }
}
✓ Fixed
// Async incremental indexing via queue:
public function save(Term $term): void {
    $this->db->save($term);
    // Queue just this term for indexing:
    $this->queue->dispatch(new IndexTermJob($term->slug));
    // Response returns immediately — indexing happens in background
}

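// Queue worker:
class IndexTermJob {
    /** @var string */
    private $slug;

    public function __construct(string $slug) {
        $this->slug = $slug;
    }

    // Dependencies are injected when the job runs (TermRepository is an assumed name):
    public function handle(TermRepository $db, SearchIndex $index): void {
        $term = $db->find($this->slug);
        if ($term === null) {
            return; // term was deleted before the job ran; nothing to index
        }
        $index->upsert($term->slug, [
            'term'     => $term->term,
            'body'     => strip_tags($term->long),
            'category' => $term->category,
        ]);
    }
}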
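The same flow, including the deleted event, can be exercised end to end without a framework. The sketch below is an assumption-laden stand-in: InMemoryQueue and drain() are hypothetical names, the queue is a plain array rather than Redis or a Laravel queue, and the "index" is an associative array.

```php
<?php
declare(strict_types=1);

/** Hypothetical in-memory stand-in for a real queue backend. */
final class InMemoryQueue
{
    /** @var array[] */
    private $jobs = [];

    public function push(string $operation, string $slug)
    {
        $this->jobs[] = ['op' => $operation, 'slug' => $slug];
    }

    public function pop()
    {
        return array_shift($this->jobs); // null once the queue is empty
    }
}

/** Worker loop: applies queued operations to the index, one term per job. */
function drain(InMemoryQueue $queue, array $store, array &$index): int
{
    $processed = 0;
    while (($job = $queue->pop()) !== null) {
        $slug = $job['slug'];
        if ($job['op'] === 'delete' || !isset($store[$slug])) {
            unset($index[$slug]); // deleted, or already gone when the job ran
        } else {
            $index[$slug] = strip_tags($store[$slug]);
        }
        $processed++;
    }
    return $processed;
}

$store = ['oauth' => '<p>Open authorisation</p>', 'jwt' => 'JSON Web Token'];
$index = [];
$queue = new InMemoryQueue();

// "saved" events enqueue an upsert; the request would return right after push().
$queue->push('upsert', 'oauth');
$queue->push('upsert', 'jwt');
drain($queue, $store, $index);

// "deleted" event: the row is removed, then a delete is enqueued for the index.
unset($store['jwt']);
$queue->push('delete', 'jwt');
$processed = drain($queue, $store, $index);

print_r($index); // only the "oauth" entry remains
```

Each job carries just the slug, so the worker always reads the current state of the record and a stale job cannot write outdated content into the index.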

Added 16 Mar 2026
Edited 22 Mar 2026
DEV INTEL Tools & Severity
🟠 High ⚙ Fix effort: Medium
⚡ Quick Fix
Build the indexing pipeline as a queue job triggered by model events (saved, deleted) — sync indexing in HTTP requests blocks responses and causes timeouts on large reindexes
📦 Applies To
PHP 7.0+ web queue-worker laravel
🔗 Prerequisites
🔍 Detection Hints
Elasticsearch/Meilisearch update called synchronously in HTTP request; full reindex blocking web request; no queue for search index updates
Auto-detectable: ✗ No
Tools: laravel-horizon, datadog
⚠ Related Problems
🤖 AI Agent
Confidence: Medium False Positives: Medium ✗ Manual fix Fix: Medium Context: File Tests: Update

