Cardinality in Observability
debt(d9/e7/b7/t9)
Closest to 'silent in production until users hit it' (d9). The detection_hints.tools field is not specified, and by nature cardinality explosions are invisible during development — dev environments have only a handful of users/sessions so label cardinality appears low. The explosion only manifests in production under real traffic, typically when the monitoring system starts degrading or crashing, which is exactly the d9 pattern.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix describes auditing every metric label and moving high-cardinality values to logs or traces. This is not a single-line swap — it requires identifying all affected metrics across the codebase, removing or replacing label values, ensuring the same information is captured in structured logs or trace attributes, and potentially restructuring dashboards and alerts that relied on those labels. This is a cross-cutting change touching instrumentation code, logging setup, and monitoring configuration.
Closest to 'strong gravitational pull' (b7). The choice of metric labels, once made and deployed to production, shapes every future observability decision: dashboards, alerts, and SLOs are built on top of those metrics. Rolling back or changing label cardinality requires coordinated changes to metrics emission, Prometheus configuration, dashboards, and alert rules. The applies_to covers both web and cli contexts, giving this wide reach across the PHP application.
Closest to 'catastrophic trap — the obvious way is always wrong' (t9). The misconception field states it explicitly: 'Adding more labels to metrics makes them more useful.' This is the instinctive, intuitive action every developer takes — they want user-level or request-level detail, so they add user_id or request_id as a label. This obvious approach is precisely what causes cardinality explosions and can bring down the entire monitoring infrastructure. The math in the misconception (10^5 = 100,000 or even 1 billion time series) shows that the natural developer instinct directly causes catastrophic failure, matching t9.
Also Known As
TL;DR
Explanation
In metrics systems like Prometheus, every unique combination of label values creates a separate time series, and each active series is held in memory. A metric with labels {method, status_code}, 10 methods, and 5 status codes yields at most 10 × 5 = 50 time series, which is manageable. Adding a user_id label with 1 million users raises that ceiling to 1 million × 10 × 5 = 50 million time series, enough to exhaust Prometheus's memory and crash it. This is the cardinality explosion problem. High-cardinality values (user IDs, request IDs, email addresses, IP addresses, session tokens) must never be used as metric labels; they belong in traces and logs. The correct pattern is to use low-cardinality labels (normalized endpoint path, status class like 2xx/4xx/5xx, service name) in metrics and to record high-cardinality identifiers in structured logs or trace attributes, where they are stored per event rather than as index dimensions.
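The arithmetic above can be sketched as a small helper. This is an illustrative estimate only: `estimateSeriesCount` is a hypothetical function name, not part of any Prometheus client, and the product is an upper bound because not every label combination necessarily occurs in practice.

```php
<?php
declare(strict_types=1);

/**
 * Upper bound on the number of time series for one metric:
 * the product of the distinct-value counts of its labels.
 * Hypothetical helper for illustration only.
 *
 * @param array<string, int> $labelCardinalities label name => distinct values
 */
function estimateSeriesCount(array $labelCardinalities): int
{
    // array_product() returns 1 for an empty array, which matches
    // a metric with no labels (exactly one series).
    return (int) array_product($labelCardinalities);
}

$low  = estimateSeriesCount(['method' => 10, 'status_code' => 5]);
$high = estimateSeriesCount(['method' => 10, 'status_code' => 5, 'user_id' => 1000000]);

echo $low, PHP_EOL;   // 50
echo $high, PHP_EOL;  // 50000000
```

Running the estimate before a new label ships makes the jump from 50 to 50 million series visible at review time rather than in production.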
Watch Out
Common Misconception
Why It Matters
Common Mistakes
- Adding user IDs, session IDs, or request IDs as Prometheus label values — each unique value creates a new time series.
- Using unbounded string values from user input as metric labels — even seemingly low-cardinality values like product names can grow unbounded.
- Not auditing label cardinality before adding metrics to production — test with realistic data, not dev data with five users.
- Conflating high-cardinality identification (belongs in traces) with low-cardinality aggregation (belongs in metrics) — both are valuable; they just go in different places.
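One way to keep a label such as a product name bounded is to map anything outside a known set to a single fallback value. The following is a minimal sketch, assuming a fixed allowlist; `boundedLabelValue` and the example catalogue are hypothetical and do not come from a specific library.

```php
<?php
declare(strict_types=1);

/**
 * Return $value if it is in the known set, otherwise a fixed fallback,
 * so the label can never take more than count($allowed) + 1 values.
 * Hypothetical helper for illustration only.
 *
 * @param string[] $allowed
 */
function boundedLabelValue(string $value, array $allowed, string $fallback = 'other'): string
{
    return in_array($value, $allowed, true) ? $value : $fallback;
}

$knownProducts = ['basic', 'pro', 'enterprise']; // assumed fixed catalogue

echo boundedLabelValue('pro', $knownProducts), PHP_EOL;              // pro
echo boundedLabelValue('custom-plan-8841', $knownProducts), PHP_EOL; // other
```

The original value is not lost if it is also written to a structured log or trace attribute; only the metric label is collapsed.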
Avoid When
- When discussing the number of metrics collected (cardinality refers to label combinations, not metric count).
- When explaining trace sampling or log retention policies (cardinality is metrics-specific; traces and logs handle high-cardinality values natively).
- When troubleshooting latency or query performance in dashboards (cardinality causes memory exhaustion, not query slowness per se).
- When designing alerting thresholds or SLO targets (cardinality is an infrastructure scaling problem, not a business or service-level concern).
When To Use
- Choosing what to expose as metric labels when instrumenting a service — reject user_id, request_id, IP address, and other unbounded identifiers; keep only dimensions with fixed, small counts like environment, service, method, status.
- Debugging why your Prometheus instance is consuming unexpectedly high memory or scraping is slow: a cardinality explosion from a recently added label is a common cause and a good first suspect.
- Designing alerting rules and dashboards — if you're tempted to alert or visualize on a high-cardinality label, move that identifier to logs or traces instead and use low-cardinality aggregations in metrics.
- Setting up retention and storage capacity planning for a metrics backend: estimate an upper bound on the time series count by multiplying the cardinality of each label dimension; an estimate running into millions of series for a single metric signals that you need to remove or bucket high-cardinality labels before ingestion.
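The first point in this list can be enforced mechanically by filtering label names against an allowlist before they reach the metrics client. This is a rough sketch: the `ALLOWED_LABELS` constant and `filterLabels` function are hypothetical names, and the allowed set simply mirrors the example dimensions mentioned above.

```php
<?php
declare(strict_types=1);

// Assumed allowlist of low-cardinality label names (illustrative only).
const ALLOWED_LABELS = ['environment', 'service', 'method', 'status'];

/**
 * Drop every label whose name is not on the allowlist.
 * Hypothetical helper for illustration only.
 *
 * @param array<string, string> $labels label name => label value
 * @return array<string, string>
 */
function filterLabels(array $labels): array
{
    // array_flip() turns the allowlist into keys so we can intersect by key.
    return array_intersect_key($labels, array_flip(ALLOWED_LABELS));
}

$safe = filterLabels([
    'method'  => 'GET',
    'status'  => '200',
    'user_id' => '12345', // unbounded identifier, removed by the filter
]);

print_r($safe); // only 'method' and 'status' remain
```

Silently dropping labels is one design choice; throwing an exception in development builds would surface the mistake earlier.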
Code Examples
// High cardinality: crashes Prometheus with many users
$counter->labels([
    'user_id'    => $userId,     // millions of unique values
    'request_id' => $requestId,  // billions of unique values
    'endpoint'   => $path,       // fine only if the set of paths is limited
])->inc();

// Low cardinality labels only: scalable metrics
$counter->labels([
    'endpoint'     => $normalizedPath,  // /users/{id} not /users/12345
    'status_class' => $statusClass,     // '2xx', '4xx', '5xx'
    'method'       => $httpMethod,      // GET, POST, etc.
])->inc();

// High-cardinality context goes in the trace span
$span->setAttribute('user.id', $userId)
     ->setAttribute('request.id', $requestId);
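The `$normalizedPath` and `$statusClass` values in the low-cardinality example are not defined in the snippets above. One possible way to derive them is sketched below, assuming that dynamic path segments are purely numeric IDs or UUIDs. Routing frameworks usually expose the matched route template, which is likely a more reliable source. Both function names are hypothetical.

```php
<?php
declare(strict_types=1);

/** Collapse an HTTP status code into its class: 200 => '2xx', 404 => '4xx'. */
function statusClass(int $statusCode): string
{
    return intdiv($statusCode, 100) . 'xx';
}

/**
 * Replace numeric and UUID path segments with a placeholder so that
 * /users/12345 and /users/67890 share one label value.
 */
function normalizePath(string $path): string
{
    $uuid = '[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}';

    // Match a whole segment: preceded by '/', followed by '/' or end of string.
    // Fall back to the input if preg_replace() reports an error (returns null).
    return preg_replace('#/(?:\d+|' . $uuid . ')(?=/|$)#', '/{id}', $path) ?? $path;
}

echo statusClass(404), PHP_EOL;                        // 4xx
echo normalizePath('/users/12345'), PHP_EOL;           // /users/{id}
echo normalizePath('/users/12345/orders/7'), PHP_EOL;  // /users/{id}/orders/{id}
```

Segments that merely contain digits, such as `/v2/users`, are left untouched because the pattern only matches segments that are entirely numeric or a UUID.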