
Search Indexing Pipeline

Search PHP 7.0+ Intermediate

debt(d7/e5/b5/t5)

d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7), because sync indexing in request paths shows up via Horizon/Datadog latency traces but isn't flagged by static tools; detection_hints.automated is 'no'.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5), since the quick_fix requires moving indexing into queue jobs wired to model events — a refactor across model observers, job classes, and indexer service, not a one-liner.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5), as the pipeline applies across web and queue-worker contexts and every searchable model must integrate with it; analyser/synonym choices shape ongoing search work.

t5 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'notable trap (a documented gotcha most devs eventually learn)' (t5), matching the misconception that the index is just a DB copy — devs eventually learn about stemming, analysers, and async indexing after hitting the gotchas.

Scored by claude-opus-4-7 · 2026-05-11 · reviewed by human

Also Known As

inverted index, document indexing, search pipeline, tokenisation

TL;DR

The process of transforming raw content into a searchable index — extraction, normalisation, tokenisation, stemming, and index writing with incremental update strategies.

Explanation

A search indexing pipeline has five stages: (1) Extract: fetch content from a database, files, or an API; (2) Normalise: lowercase, strip HTML, remove special characters; (3) Tokenise: split the text into terms; (4) Analyse: apply a stemmer (run→run, running→run), expand synonyms, and remove stop words; (5) Index: write the inverted index (term → list of document IDs with positions). Incremental indexing handles updates through updated_at timestamp queries, change data capture (CDC) from the database, or event-driven indexing via domain events. A full re-index is reserved for schema or analyser changes, because those alter how every document is represented in the index.
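The stages above can be sketched in a few dozen lines of plain PHP. This is a minimal illustration, not a production analyser: all function names (normalise, tokenise, stem, analyse, buildIndex, search) are hypothetical, the stemmer is a toy suffix stripper rather than a real algorithm such as Porter, and the stop word list is an arbitrary sample.

```php
<?php
declare(strict_types=1);

/** Normalise: strip HTML tags, decode entities, lowercase (ASCII only here). */
function normalise(string $raw): string {
    $text = html_entity_decode(strip_tags($raw), ENT_QUOTES, 'UTF-8');
    return strtolower($text); // use mb_strtolower() for non-ASCII content
}

/** Tokenise: split on anything that is not a lowercase letter or digit. */
function tokenise(string $text): array {
    return preg_split('/[^a-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Toy stemmer (assumption: a real pipeline would use a proper stemming library). */
function stem(string $token): string {
    foreach (['ation', 'ate', 'ing', 's'] as $suffix) {
        $cut = strlen($token) - strlen($suffix);
        if ($cut >= 3 && substr($token, $cut) === $suffix) {
            return substr($token, 0, $cut);
        }
    }
    return $token;
}

/** Analyse: returns position => stemmed term, with stop words dropped. */
function analyse(string $raw): array {
    $stopWords = ['a', 'an', 'and', 'the', 'of', 'to', 'is', 'in'];
    $terms = [];
    foreach (tokenise(normalise($raw)) as $position => $token) {
        if (in_array($token, $stopWords, true)) {
            continue; // original positions are kept, so gaps remain
        }
        $terms[$position] = stem($token);
    }
    return $terms;
}

/** Index: term => document ID => list of positions. */
function buildIndex(array $documents): array {
    $index = [];
    foreach ($documents as $docId => $raw) {
        foreach (analyse($raw) as $position => $term) {
            $index[$term][$docId][] = $position;
        }
    }
    return $index;
}

/** Return the IDs of documents that contain every analysed query term. */
function search(array $index, string $query): array {
    $hits = null;
    foreach (analyse($query) as $term) {
        $docIds = array_keys($index[$term] ?? []);
        $hits = $hits === null ? $docIds : array_values(array_intersect($hits, $docIds));
    }
    return $hits ?? [];
}

$index = buildIndex([
    'login'  => '<p>How to <strong>authenticate</strong> users</p>',
    'tokens' => 'Token-based authentication in PHP',
]);

$results = search($index, 'authentication'); // ['login', 'tokens']
```

Because the query goes through the same analyse() function as the documents, 'authentication' and 'authenticate' both reduce to the same term and match each other. Running the identical analysis chain at index time and query time is the design choice that makes this work.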

Common Misconception

✗ The search index is just a copy of the database — the index is a transformed, analysed, denormalised structure optimised for retrieval; building it well requires deliberate choices about what to include and how to analyse it.

Why It Matters

A search index built without stemming treats 'authenticate' and 'authentication' as different terms, so users searching for 'authentication' miss documents that contain only 'authenticate'.

Common Mistakes

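Synchronous re-indexing on every write: index updates should run asynchronously via a queue.
No incremental indexing: a full re-index on every change costs O(n) in the number of documents, regardless of how small the change is.
Indexing HTML without stripping tags: depending on the tokeniser, '<strong>PHP</strong>' either becomes a single token that won't match 'PHP' or adds tag names such as 'strong' to the index as noise.
No synonym configuration: 'oauth' and 'open authorisation' are treated as unrelated terms.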
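As a sketch of the incremental alternative to a full re-index, the snippet below re-indexes only rows whose updated_at is newer than a stored watermark. It is an assumption-laden illustration: the function names are hypothetical, the $rows array stands in for a database query, the index is a plain array keyed by slug, and the timestamps are made-up sample values. Timestamp polling of this kind does not notice hard deletes, and rows sharing the exact watermark timestamp need extra care.

```php
<?php
declare(strict_types=1);

/**
 * Return rows changed after the watermark, oldest first.
 * Stands in for a query like:
 *   SELECT slug, body, updated_at FROM terms
 *   WHERE updated_at > :watermark ORDER BY updated_at
 */
function changedSince(array $rows, string $watermark): array {
    $changed = array_filter($rows, function (array $row) use ($watermark): bool {
        // 'Y-m-d H:i:s' strings compare correctly as plain strings
        return $row['updated_at'] > $watermark;
    });
    usort($changed, function (array $a, array $b): int {
        return strcmp($a['updated_at'], $b['updated_at']);
    });
    return $changed;
}

/** Upsert only the changed rows and return the new watermark. */
function indexIncrementally(array $rows, array &$index, string $watermark): string {
    foreach (changedSince($rows, $watermark) as $row) {
        $index[$row['slug']] = strip_tags($row['body']); // upsert keyed by slug
        $watermark = $row['updated_at'];
    }
    return $watermark;
}

// Sample data (illustrative values only):
$rows = [
    ['slug' => 'oauth', 'body' => '<strong>OAuth</strong> basics', 'updated_at' => '2024-01-01 10:00:00'],
    ['slug' => 'jwt',   'body' => 'JSON Web Tokens',                'updated_at' => '2024-01-02 09:30:00'],
];

$index = [];
$watermark = indexIncrementally($rows, $index, '2024-01-01 12:00:00');
// Only 'jwt' is indexed; $watermark is now '2024-01-02 09:30:00'
```

The work done per run is proportional to the number of changed rows, not to the size of the whole collection. The returned watermark would be persisted between runs so the next pass starts where this one stopped.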

Code Examples

✗ Vulnerable

// Synchronous full re-index on every term save:
public function save(Term $term): void {
    $this->db->save($term);
    // Blocks request — re-indexes ALL 800 terms on every save:
    foreach ($this->db->findAll() as $t) {
        $this->searchEngine->index($t); // O(n) on every save!
    }
}

✓ Fixed

// Async incremental indexing via queue:
public function save(Term $term): void {
    $this->db->save($term);
    // Queue just this term for indexing:
    $this->queue->dispatch(new IndexTermJob($term->slug));
    // Response returns immediately — indexing happens in background
}

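// Queue worker:
class IndexTermJob {
    /** @var string Slug of the term to (re)index */
    private $slug;

    public function __construct(string $slug) {
        $this->slug = $slug;
    }

    // TermRepository is a hypothetical name for the type behind $this->db above;
    // both dependencies are assumed to be injected by the queue framework.
    public function handle(TermRepository $db, SearchIndex $index): void {
        $term = $db->find($this->slug);
        if ($term === null) {
            // Term was deleted before the job ran, so there is nothing to index
            return;
        }
        $index->upsert($term->slug, [
            'term'     => $term->term,
            'body'     => strip_tags($term->long), // index plain text, not markup
            'category' => $term->category,
        ]);
    }
}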

References

https://en.wikipedia.org/wiki/Inverted_index