General latency when navigating application

Postmortem

December 01, 2025 at 10:28 PM

Root Cause Analysis – Redis Cache Storage Exhaustion
Date of incident: 2025-12-01
Incident window: ~1:10 PM–1:45 PM MST
Impact: Elevated latency and intermittent API timeouts on the 1.0 platform

1. Summary

At 1:10 PM MST, the Redis cache supporting the 1.0 platform reached 100% storage utilization, causing cache write failures and forcing the platform to fall back to slower database queries. This resulted in significant API latency increases, ultimately leading to timeouts for some users. An alert was triggered at 1:17 PM, the issue was identified as Redis storage exhaustion, and the Redis instance was resized. Normal operation resumed at 1:45 PM.

2. Impact
Affected system: 1.0 Platform APIs

User impact:
Some API requests experienced elevated latency
A subset of users experienced full timeouts

Business impact:
Temporary performance degradation
Increased system load during fallback operations

Duration: Approximately 35 minutes

3. Root Cause
The Redis cache used to optimize 1.0 platform queries ran out of available storage. Once storage was exhausted:

Redis evicted cached entries and cache writes failed
Systems reliant on cached query results began performing full database queries
The resulting database load contributed to higher latency and some request timeouts
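
The fallback behavior described above matches the common cache-aside pattern. The following is a minimal Python sketch under stated assumptions, not the platform's actual code: `SimulatedCache`, `CacheFullError`, `get_with_fallback`, and `run_query` are hypothetical names, and the in-memory class only imitates a cache that has run out of storage.

```python
class CacheFullError(Exception):
    """Raised by the simulated cache when no storage is left."""


class SimulatedCache:
    """Hypothetical in-memory stand-in for a cache with a fixed key capacity."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        # Reject new keys once the cache is full (assumed failure mode).
        if key not in self.store and len(self.store) >= self.capacity:
            raise CacheFullError(f"no room for {key!r}")
        self.store[key] = value


def get_with_fallback(cache, key, run_query, stats):
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    cached = cache.get(key)
    if cached is not None:
        stats["cache_hits"] += 1
        return cached
    # Miss: run the slower query against the database.
    stats["db_queries"] += 1
    value = run_query(key)
    try:
        cache.set(key, value)
    except CacheFullError:
        # The write is dropped, so the next request for this key misses again.
        stats["write_failures"] += 1
    return value


def demo():
    stats = {"cache_hits": 0, "db_queries": 0, "write_failures": 0}
    cache = SimulatedCache(capacity=1)
    run_query = lambda key: f"result-for-{key}"
    # "a" fits in the cache; "b" does not, so every read of "b" hits the database.
    for key in ["a", "a", "b", "b", "b"]:
        get_with_fallback(cache, key, run_query, stats)
    return stats
```

In this sketch, once the cache is full every request for an uncached key turns into a database query, which is the kind of load shift the root cause describes.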

Contributing factor:
A faulty Redis storage monitoring alert failed to notify the operations team before the cache reached saturation.

4. Timeline
Time (MST)      Event
1:10 PM         Redis cache reaches 100% storage capacity; platform begins falling back to database queries, increasing latency.
1:17 PM         Operations alert triggered due to API timeouts.
1:20–1:40 PM    Investigation identifies Redis storage exhaustion; Redis instance is resized.
1:45 PM         Resized Redis instance becomes fully operational; traffic and latency return to normal ranges.

5. Resolution
The Redis instance was resized to increase available storage capacity. Platform traffic and API latency returned to baseline levels once the resize was completed.

6. Preventative Measures
Completed
Fixed the faulty Redis storage monitor, ensuring future alerts will correctly trigger before storage is fully consumed.
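
As an illustration of alerting before storage is fully consumed, here is a minimal Python sketch of a threshold check. It is not the monitor that was fixed. The function names and the 75% and 90% thresholds are assumptions chosen for the example. The dictionary keys follow the `used_memory` and `maxmemory` fields reported by the Redis `INFO memory` command (for example, as returned by redis-py's `Redis.info("memory")`); a managed Redis service may expose storage metrics differently.

```python
def storage_utilization(memory_info):
    """Return used/limit as a fraction, or None when no limit is reported."""
    used = memory_info["used_memory"]
    limit = memory_info.get("maxmemory", 0)
    if limit <= 0:
        # A maxmemory of 0 means no configured limit, so no ratio can be computed.
        return None
    return used / limit


def classify(memory_info, warn_at=0.75, page_at=0.90):
    """Map utilization to an alert level; thresholds are illustrative assumptions."""
    utilization = storage_utilization(memory_info)
    if utilization is None:
        return "unknown"
    if utilization >= page_at:
        return "page"
    if utilization >= warn_at:
        return "warn"
    return "ok"
```

Treating the `"unknown"` result as something to alert on, rather than ignoring it, is one way to keep a monitor from failing silently.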

Resolved

December 01, 2025 at 9:01 PM

This incident has been resolved. Thank you for your patience.

Monitoring

December 01, 2025 at 8:44 PM

We implemented a fix and are currently monitoring the result.

Investigating

December 01, 2025 at 8:30 PM

We are currently investigating this incident.

GUIDEcx - General latency when navigating application – Incident details
