When should you NOT use Suffix Array?

The text changes frequently, since rebuilding the array on every mutation negates the preprocessing payoff. You only need a single substring search, where a direct scan or strpos is simpler and fast enough. Memory for the index plus LCP array exceeds what the environment allows for the input size.

When is Suffix Array the right choice?

A large, mostly static text receives many substring or pattern queries. You need full-text search, longest repeated substring, or distinct substring counts. You want suffix-tree query power with lower memory and better cache locality. Building a Burrows-Wheeler transform or FM-index for compression or genomic search.

← Back to glossary

Suffix Array

Algorithms PHP 7.0+ Advanced

debt(d8/e6/b4/t6)

d8 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9), reduced to d8 because detection_hints.automated is 'no' and only a regex code_pattern exists. A naive O(n^2 log n) build works correctly on small inputs but silently times out on large texts in production; no linter catches the performance choice.

e6 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5), nudged to e6 because the quick_fix requires swapping the naive sort for prefix-doubling or SA-IS, adding LCP array computation, and introducing a caching layer for the static text — a meaningful multi-part change within the string-search component rather than a one-line patch.

b4 Burden Structural debt — long-term weight of choosing wrong

Closest to 'localised tax' (b3), raised to b4 because applies_to spans cli/library/web and the suffix array plus its LCP array become a shared index that downstream query code depends on, imposing a modest persistent tax beyond a single isolated component.

t6 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7), softened to t6 grounded in the misconception that the array stores actual suffix strings or all rotations when it only stores sorted integer positions, plus common_mistakes around confusing suffix sorting with Burrows-Wheeler rotation ordering and off-by-one binary search boundaries — the obvious mental model is wrong but learnable.

About DEBT scoring → scored by claude-opus-4-8 · 2026-06-11 · reviewed by human

Also Known As

suffix sorting SA-IS sorted suffix index

TL;DR

A sorted array of all suffix starting positions of a string, enabling fast substring search in O(m log n) with far less memory than a suffix tree.

Explanation

A suffix array is the sorted list of starting indices of every suffix of a string. For a string S of length n, it holds a permutation of 0..n-1 such that the suffixes beginning at those positions are in lexicographic order. Once built, you can locate any pattern P of length m by binary searching the array in O(m log n) time, because all occurrences of P appear as a contiguous block of suffixes that share P as a prefix.

The naive construction sorts all n suffixes directly, which is O(n^2 log n) because each comparison can scan up to n characters. Better algorithms - the prefix-doubling approach (O(n log n) or O(n log^2 n) depending on the sort), DC3/skew (O(n)), or SA-IS (O(n)) - make the structure practical for large texts and genomic data.

Suffix arrays are usually paired with an LCP (longest common prefix) array, which records the length of the shared prefix between each pair of adjacent suffixes. The LCP array turns the suffix array into a tool for counting distinct substrings, finding the longest repeated substring, and speeding up pattern matching to O(m + log n).

Compared with a suffix tree, a suffix array stores the same searching power in a flat integer array. It uses roughly 4n bytes plus the text instead of the much larger node-and-pointer layout of a tree, has better cache behaviour, and is simpler to serialise. The trade-off is that some operations that are constant time on a suffix tree need the auxiliary LCP array and a little extra logic on a suffix array.

Note the naming overlap: the user described storing rotations in sorted order, which is the Burrows-Wheeler / suffix-automaton family. A classic suffix array sorts suffixes, not full rotations, though the two ideas are closely related and the BWT is derived directly from the suffix array. Reach for a suffix array when you must run many substring queries against a large, mostly static text such as a document index, log corpus, or DNA sequence.

Diagram

flowchart TD
    TEXT[Text: banana$] --> SUF[Enumerate suffixes]
    SUF --> SORT[Sort suffixes lexicographically]
    SORT --> SA[Suffix array of start indices]
    SA --> LCP[Optional LCP array]
    SA --> SEARCH[Binary search pattern in O m log n]
    LCP --> FAST[Speed up to O m plus log n]
    style SA fill:#238636,color:#fff
    style SEARCH fill:#1f6feb,color:#fff

Common Misconception

✗ A suffix array stores the actual suffix strings or all rotations of the text. In practice it stores only integer starting positions of suffixes, sorted lexicographically - the characters are read from the original string on demand.

Why It Matters

Suffix arrays let you preprocess a large static text once and then answer many substring queries fast, powering full-text search, log analysis, and bioinformatics with far less memory than a suffix tree.

Common Mistakes

Building with the naive O(n^2 log n) sort on large inputs, causing timeouts where prefix-doubling or SA-IS would finish quickly.
Forgetting to compute the LCP array, then re-scanning characters and losing the O(m + log n) search speed.
Confusing suffix sorting with full-rotation (Burrows-Wheeler) sorting, which gives a different ordering for the wraparound suffixes.
Rebuilding the array on every query instead of caching it for a static text - the whole point is to amortise construction over many lookups.
Off-by-one errors at the binary search boundaries that miss the first or last occurrence of a pattern.

Avoid When

The text changes frequently, since rebuilding the array on every mutation negates the preprocessing payoff.
You only need a single substring search, where a direct scan or strpos is simpler and fast enough.
Memory for the index plus LCP array exceeds what the environment allows for the input size.

When To Use

A large, mostly static text receives many substring or pattern queries.
You need full-text search, longest repeated substring, or distinct substring counts.
You want suffix-tree query power with lower memory and better cache locality.
Building a Burrows-Wheeler transform or FM-index for compression or genomic search.

Code Examples

✗ Vulnerable

// Naive O(n^2 log n): sorts whole suffix strings each comparison.
function buildSuffixArray(string $s): array {
    $n = strlen($s);
    $suffixes = [];
    for ($i = 0; $i < $n; $i++) {
        $suffixes[$i] = substr($s, $i); // copies up to n chars per suffix
    }
    // sort compares full strings: each compare is O(n)
    asort($suffixes);
    return array_keys($suffixes);
}
// For a 1M-char log this allocates O(n^2) bytes and crawls.

✓ Fixed

// Prefix-doubling: O(n log^2 n) time, integer ranks, no string copies.
function buildSuffixArray(string $s): array {
    $n = strlen($s);
    $sa = range(0, $n - 1);
    // initial rank = char code, then compacted to dense ranks each round
    $rank = array_map('ord', str_split($s));
    $tmp = array_fill(0, $n, 0);
    for ($k = 1; ; $k <<= 1) {
        $cmp = function (int $a, int $b) use (&$rank, $k, $n): int {
            if ($rank[$a] !== $rank[$b]) return $rank[$a] <=> $rank[$b];
            $ra = ($a + $k < $n) ? $rank[$a + $k] : -1;
            $rb = ($b + $k < $n) ? $rank[$b + $k] : -1;
            return $ra <=> $rb;
        };
        usort($sa, $cmp);
        $tmp[$sa[0]] = 0;
        for ($i = 1; $i < $n; $i++) {
            $tmp[$sa[$i]] = $tmp[$sa[$i - 1]] + ($cmp($sa[$i - 1], $sa[$i]) ? 1 : 0);
        }
        $rank = $tmp;
        if ($rank[$sa[$n - 1]] === $n - 1 || $k >= $n) break; // all ranks distinct
    }
    return $sa;
}

Suffix Array

Also Known As

TL;DR

Explanation

Diagram

Common Misconception

Why It Matters

Common Mistakes

Avoid When

When To Use

Code Examples

Tags

References

Suffix Array

Also Known As

TL;DR

Explanation

Diagram

Common Misconception

Why It Matters

Common Mistakes

Avoid When

When To Use

Code Examples

Tags

Related Terms

References