Root Cause Analysis – Redis Cache Storage Exhaustion
Date of incident: 2025-12-01
Incident window: ~1:10 PM–1:45 PM MST
Impact: Elevated latency and intermittent API timeouts on the 1.0 platform
1. Summary
At 1:10 PM MST, the Redis cache supporting the 1.0 platform reached 100% storage utilization, causing cache write failures and forcing the platform to fall back to slower database queries. This resulted in significant API latency increases, ultimately leading to timeouts for some users. An alert was triggered at 1:17 PM, the issue was identified as Redis storage exhaustion, and the Redis instance was resized. Normal operation resumed at 1:45 PM.
2. Impact
Affected system: 1.0 Platform APIs
User impact:
Some API requests experienced elevated latency
A subset of users experienced full timeouts
Business impact:
Temporary performance degradation
Increased system load during fallback operations
Duration: Approximately 35 minutes
3. Root Cause
The Redis cache used to optimize 1.0 platform queries ran out of available storage. Once storage was exhausted:
Redis began evicting keys, and cache writes failed
Systems reliant on cached query results began performing full database queries
The resulting database load increased API latency and caused some requests to time out
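The failure chain above can be illustrated with a minimal cache-aside sketch. Everything in it is an assumption made for illustration: `BoundedCache` is a hypothetical in-memory stand-in for the Redis cache, not the platform's actual client, and `query_database` is a placeholder for the real query path. The point is only to show why a cache that rejects writes turns every subsequent read of an uncached key into a database query.

```python
class CacheFullError(Exception):
    """Raised by the stand-in cache when it has no room for a new key."""


class BoundedCache:
    """Hypothetical in-memory stand-in for the Redis cache, with a hard key limit."""

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        # New keys are rejected once the cache is full; existing keys can be updated.
        if key not in self._data and len(self._data) >= self.max_keys:
            raise CacheFullError(key)
        self._data[key] = value


def make_reader(cache, query_database):
    """Build a cache-aside read function plus counters for observing fallback."""
    stats = {"db_queries": 0, "failed_cache_writes": 0}

    def read(key):
        cached = cache.get(key)
        if cached is not None:
            return cached
        # Cache miss: fall back to the slower database query.
        stats["db_queries"] += 1
        value = query_database(key)
        try:
            cache.set(key, value)
        except CacheFullError:
            # The write is dropped, so the next read of this key misses again.
            stats["failed_cache_writes"] += 1
        return value

    return read, stats


def demo():
    cache = BoundedCache(max_keys=2)
    read, stats = make_reader(cache, query_database=lambda key: f"row-for-{key}")
    # "a" and "b" fill the cache; every read of "c" then goes to the database.
    for key in ["a", "b", "c", "c", "c", "a"]:
        read(key)
    return stats
```

In this sketch, six reads produce five database queries once the cache is full, because the three reads of the uncached key can never be stored. A real Redis deployment behaves differently depending on its eviction policy, so the sketch models the symptom rather than the exact mechanism.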
Contributing factor:
A faulty Redis storage monitoring alert failed to notify the operations team before the cache reached saturation.
4. Timeline
Time (MST)      Event
1:10 PM         Redis cache reaches 100% storage capacity; platform begins falling back to database queries, increasing latency.
1:17 PM         Operations alert triggered due to API timeouts.
1:20–1:40 PM    Investigation identifies Redis storage exhaustion; Redis instance is resized.
1:45 PM         Resized Redis instance becomes fully operational; traffic and latency return to normal ranges.
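The intervals implied by the timeline can be checked with a short calculation. This is a sketch only: the timestamps are taken from the timeline above, the variable names are illustrative, and the times are treated as local MST wall-clock values with no timezone handling.

```python
from datetime import datetime

# Timestamps from the timeline (MST, 2025-12-01), written in 24-hour form.
saturation = datetime(2025, 12, 1, 13, 10)  # cache reaches 100% storage
alert = datetime(2025, 12, 1, 13, 17)       # operations alert triggered
recovery = datetime(2025, 12, 1, 13, 45)    # resized instance fully operational


def minutes(delta):
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)


detection_delay = minutes(alert - saturation)    # saturation to first alert
alert_to_recovery = minutes(recovery - alert)    # first alert to recovery
total_duration = minutes(recovery - saturation)  # full incident window
```

Under these inputs the detection delay is 7 minutes, alert to recovery is 28 minutes, and the total of 35 minutes matches the duration stated in the Impact section.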
5. Resolution
The Redis instance was resized to increase available storage capacity. Platform traffic and API latency returned to baseline levels once the resize was complete.
6. Preventative Measures
Completed
Fixed the faulty Redis storage monitor, ensuring future alerts will correctly trigger before storage is fully consumed.
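As an illustration of what alerting before storage is fully consumed could look like, the sketch below classifies utilization against two thresholds. It is not the repaired monitor itself: the function name and the 80% and 90% thresholds are assumptions chosen for the example. In a real deployment the inputs might come from the `used_memory` and `maxmemory` fields of Redis `INFO memory`, or from the hosting provider's storage metrics.

```python
def storage_alert_level(used_bytes, max_bytes, warn_at=0.80, critical_at=0.90):
    """Classify cache storage utilization as "ok", "warning", or "critical".

    The default thresholds are illustrative assumptions, not the values
    used by the platform's monitor.
    """
    if max_bytes <= 0:
        # A non-positive limit means utilization cannot be computed.
        raise ValueError("max_bytes must be positive")
    utilization = used_bytes / max_bytes
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return "ok"
```

With these assumed thresholds, a cache at 85% of capacity would return "warning" and one at 95% would return "critical", so the operations team would be notified well before the 100% point reached in this incident.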