{
    "slug": "suffix_array",
    "term": "Suffix Array",
    "category": "algorithms",
    "difficulty": "advanced",
    "short": "A sorted array of all suffix starting positions of a string, enabling fast substring search in O(m log n) with far less memory than a suffix tree.",
    "long": "A suffix array is the sorted list of starting indices of every suffix of a string. For a string S of length n, it holds a permutation of 0..n-1 such that the suffixes beginning at those positions are in lexicographic order. Once built, you can locate any pattern P of length m by binary searching the array in O(m log n) time, because all occurrences of P appear as a contiguous block of suffixes that share P as a prefix.\n\nThe naive construction sorts all n suffixes directly, which is O(n^2 log n) because each comparison can scan up to n characters. Better algorithms - the prefix-doubling approach (O(n log n) or O(n log^2 n) depending on the sort), DC3/skew (O(n)), or SA-IS (O(n)) - make the structure practical for large texts and genomic data.\n\nSuffix arrays are usually paired with an LCP (longest common prefix) array, which records the length of the shared prefix between each pair of adjacent suffixes in sorted order. The LCP array turns the suffix array into a tool for counting distinct substrings (n(n+1)/2 minus the sum of the LCP values), finding the longest repeated substring (the maximum LCP value), and, with additional precomputed LCP information for the binary-search intervals, speeding up pattern matching to O(m + log n).\n\nCompared with a suffix tree, a suffix array offers the same search capability in a flat integer array. With 32-bit indices it uses roughly 4n bytes plus the text instead of the much larger node-and-pointer layout of a tree, has better cache behaviour, and is simpler to serialise. The trade-off is that some operations that are constant time on a suffix tree need the auxiliary LCP array and a little extra logic on a suffix array.\n\nDo not confuse suffix sorting with rotation sorting. Sorting all rotations of the text is the basis of the Burrows-Wheeler transform (BWT); a classic suffix array sorts suffixes, not full rotations. The two ideas are closely related: when the text ends with a unique sentinel character smaller than every other character, both orders coincide, and the BWT can be derived directly from the suffix array. Reach for a suffix array when you must run many substring queries against a large, mostly static text such as a document index, log corpus, or DNA sequence.",
    "aliases": [
        "suffix sorting",
        "SA-IS",
        "sorted suffix index"
    ],
    "tags": [
        "algorithms",
        "string-matching",
        "data-structures",
        "indexing",
        "substring-search"
    ],
    "misconception": "A suffix array stores the actual suffix strings or all rotations of the text. In practice it stores only integer starting positions of suffixes, sorted lexicographically - the characters are read from the original string on demand.",
    "why_it_matters": "Suffix arrays let you preprocess a large static text once and then answer many substring queries fast, powering full-text search, log analysis, and bioinformatics with far less memory than a suffix tree.",
    "common_mistakes": [
        "Building with the naive O(n^2 log n) sort on large inputs, causing timeouts where prefix-doubling or SA-IS would finish quickly.",
        "Skipping the LCP array, so every binary-search step re-compares the pattern from its first character and search stays at O(m log n) instead of O(m + log n).",
        "Confusing suffix sorting with full-rotation (Burrows-Wheeler) sorting, which can order entries differently when the text has no unique end-of-text sentinel.",
        "Rebuilding the array on every query instead of caching it for a static text - the whole point is to amortise construction over many lookups.",
        "Off-by-one errors at the binary search boundaries that miss the first or last occurrence of a pattern."
    ],
    "when_to_use": [
        "A large, mostly static text receives many substring or pattern queries.",
        "You need full-text search, longest repeated substring, or distinct substring counts.",
        "You want suffix-tree query power with lower memory and better cache locality.",
        "Building a Burrows-Wheeler transform or FM-index for compression or genomic search."
    ],
    "avoid_when": [
        "The text changes frequently, since rebuilding the array on every mutation negates the preprocessing payoff.",
        "You only need a single substring search, where a direct scan or strpos is simpler and fast enough.",
        "Memory for the index plus LCP array exceeds what the environment allows for the input size."
    ],
    "related": [
        "string_algorithms",
        "searching_algorithms",
        "sorting_algorithms",
        "two_pointer_technique",
        "divide_and_conquer"
    ],
    "prerequisites": [
        "sorting_algorithms",
        "string_algorithms",
        "big_o_notation",
        "searching_algorithms"
    ],
    "refs": [
        "https://en.wikipedia.org/wiki/Suffix_array",
        "https://web.stanford.edu/class/cs97si/suffix-array.pdf",
        "https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform"
    ],
    "bad_code": "// Naive O(n^2 log n): materialises every suffix as its own string, then sorts the strings.\nfunction buildSuffixArray(string $s): array {\n    $n = strlen($s);\n    $suffixes = [];\n    for ($i = 0; $i < $n; $i++) {\n        $suffixes[$i] = substr($s, $i); // copies up to n chars per suffix\n    }\n    // each string comparison is O(n); SORT_STRING keeps digit-only suffixes from being compared as numbers\n    asort($suffixes, SORT_STRING);\n    return array_keys($suffixes);\n}\n// For a 1M-char log this allocates O(n^2) bytes and crawls.",
    "good_code": "// Prefix-doubling: O(n log^2 n) time, integer ranks, no string copies.\nfunction buildSuffixArray(string $s): array {\n    $n = strlen($s);\n    if ($n === 0) return []; // range(0, -1) would yield [0, -1], so handle empty input first\n    $sa = range(0, $n - 1);\n    // initial rank = char code, then compacted to dense ranks each round\n    $rank = array_map('ord', str_split($s));\n    $tmp = array_fill(0, $n, 0);\n    for ($k = 1; ; $k <<= 1) {\n        // compare suffixes by (rank of first k chars, rank of next k chars); -1 marks past-the-end\n        $cmp = function (int $a, int $b) use (&$rank, $k, $n): int {\n            if ($rank[$a] !== $rank[$b]) return $rank[$a] <=> $rank[$b];\n            $ra = ($a + $k < $n) ? $rank[$a + $k] : -1;\n            $rb = ($b + $k < $n) ? $rank[$b + $k] : -1;\n            return $ra <=> $rb;\n        };\n        usort($sa, $cmp);\n        // re-rank: bump the rank whenever adjacent sorted suffixes differ on their first 2k chars\n        $tmp[$sa[0]] = 0;\n        for ($i = 1; $i < $n; $i++) {\n            $tmp[$sa[$i]] = $tmp[$sa[$i - 1]] + ($cmp($sa[$i - 1], $sa[$i]) ? 1 : 0);\n        }\n        $rank = $tmp;\n        if ($rank[$sa[$n - 1]] === $n - 1 || $k >= $n) break; // all ranks distinct\n    }\n    return $sa;\n}",
    "quick_fix": "Build the suffix array once with prefix-doubling or SA-IS, cache it for the static text, and binary search patterns instead of re-scanning the whole string per query.",
    "severity": "medium",
    "effort": "high",
    "created": "2026-06-11",
    "updated": "2026-06-11",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/suffix_array",
        "html_url": "https://codeclaritylab.com/glossary/suffix_array",
        "json_url": "https://codeclaritylab.com/glossary/suffix_array.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Suffix Array](https://codeclaritylab.com/glossary/suffix_array) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/suffix_array"
            }
        }
    }
}