Post-Mortem: Messaging Service Incident (May 28, 2024)
Summary
Following the release of a new messaging feature early on May 28, 2024, users experienced general slowness in messaging-related requests. The new feature increased the number of requests hitting our gateway, triggering rate limit rejections that had not been encountered before. In addition, messaging failed to load in the task drawer due to an invalid customer-related feature flag configuration, and channel loading was slow because of an expensive database query.
Resolution
The issues were resolved by taking four key actions:
- Increasing resources on the affected services
- Temporarily scaling gateway services to handle the increased request volume, then deploying a permanent fix that adjusted rate limits to more reasonable values
- Correcting the feature flag configuration so that messaging loads in the task drawer
- Optimizing the expensive database query to retrieve only the necessary data, reducing contention on the database (see the query sketch below)
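For reference, the sketch below illustrates the shape of the query change: replacing a broad query that pulled full message payloads with a projection limited to the columns the channel list actually renders. The table and column names (`channels`, `messages`, and so on) are illustrative placeholders, not our actual schema.

```python
# Hypothetical sketch of the query change. Schema names are placeholders.
from typing import Any

# Before: broad query that dragged entire message rows across the wire
# and held database resources longer than necessary.
BROAD_QUERY = """
    SELECT *
    FROM channels
    JOIN messages ON messages.channel_id = channels.id
    WHERE channels.workspace_id = ?
"""

# After: narrow projection limited to what the channel view displays.
NARROW_QUERY = """
    SELECT channels.id,
           channels.name,
           channels.last_activity_at,
           COUNT(messages.id) AS unread_count
    FROM channels
    LEFT JOIN messages
      ON messages.channel_id = channels.id AND messages.read = 0
    WHERE channels.workspace_id = ?
    GROUP BY channels.id, channels.name, channels.last_activity_at
"""

def load_channels(conn: Any, workspace_id: int) -> list[tuple]:
    """Run the narrowed query against any DB-API compatible connection."""
    cur = conn.cursor()
    cur.execute(NARROW_QUERY, (workspace_id,))
    return cur.fetchall()
```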
Incident Timeline
| Time | Event |
|---|---|
| 2:30 AM MDT, May 28 | New messaging feature released |
| 8:12 AM MDT, May 28 | Customer reports of issues received |
| 8:30 AM MDT, May 28 | Resources adjusted on affected services |
| 9:00 AM MDT, May 28 | War Room initiated for coordinated response |
| 11:33 AM MDT, May 28 | Optimized query deployed |
| 11:35 AM MDT, May 28 | Feature flag configuration adjusted |
| 12:34 PM MDT, May 28 | Adjusted rate limit configuration deployed |
| 12:50 PM MDT, May 28 | All issues resolved; system returned to normal operation |
Root Causes
- Resource Constraints: Insufficient resources were allocated to several services to handle the increased load from the new messaging feature
- Gateway Rate Limiting: Rate limits on gateway services were too restrictive, causing legitimate requests to be rejected when traffic increased (illustrated in the sketch following this list)
- Feature Flag Misconfiguration: An invalid customer-related feature flag configuration prevented messaging from loading in the task drawer
- Inefficient Database Queries: Certain queries retrieved far more data than needed, causing database contention and slowing channel loading
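For context on the rate limiting failure mode, here is a minimal token-bucket sketch showing how a limit tuned to pre-release traffic begins rejecting legitimate requests once volume rises. The rates and burst sizes are illustrative placeholders, not our real gateway configuration.

```python
# Minimal token-bucket sketch illustrating why a tight per-client limit
# rejects legitimate traffic once a new feature raises request volume.
# The numbers below are illustrative, not the real gateway settings.
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate_per_sec: float          # steady-state refill rate
    burst: float                 # maximum bucket size
    tokens: float = field(init=False)
    last_refill: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.burst
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request rejected (e.g. 429 at the gateway)

# Pre-incident style limit: sized for old traffic, too small after the release.
old_limit = TokenBucket(rate_per_sec=5, burst=10)
# Post-incident limit: raised to a value validated against observed peak traffic.
new_limit = TokenBucket(rate_per_sec=50, burst=100)
```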
Additional Notes
To prevent similar issues in the future, we will be looking into the following:
- Proactive monitoring of rate limit thresholds, especially during new feature releases (see the monitoring sketch after this list)
- Load testing with realistic traffic patterns prior to major feature releases
- Database query optimization reviews as part of the deployment checklist
- Automated scaling policies for critical gateway services
- Enhanced monitoring coverage for the areas most affected by this incident, such as messaging request latency and error rates
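As a rough illustration of the proactive rate limit monitoring mentioned above, the sketch below alerts when rate-limited (HTTP 429) responses exceed a small fraction of gateway traffic over a rolling window. The metric shape and threshold are assumptions for illustration, not an existing dashboard or alerting rule.

```python
# Hypothetical sketch of a rate-limit monitoring check: alert when 429
# responses exceed a small fraction of total gateway traffic in a window.
# Metric names and the threshold are placeholders.
from dataclasses import dataclass

@dataclass
class WindowCounts:
    total_requests: int
    rate_limited: int   # responses rejected with HTTP 429

def should_alert(window: WindowCounts, threshold: float = 0.01) -> bool:
    """Fire an alert if more than `threshold` of requests were rate limited."""
    if window.total_requests == 0:
        return False
    return (window.rate_limited / window.total_requests) > threshold

# Example: 1.5% of requests rejected in the window -> alert fires.
assert should_alert(WindowCounts(total_requests=10_000, rate_limited=150))
```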