Post-Mortem: Messaging Service Incident (May 28, 2024)
Summary
Following the release of a new messaging feature early on May 28, 2024, users experienced general slowness in messaging-related requests. The new feature increased the number of requests hitting our gateway, triggering rate limit rejections that had not been encountered before. In addition, messaging failed to load in the task drawer due to an invalid customer-related feature flag configuration, and channel loading was slow because of an expensive database query.
Resolution
The issues were resolved by taking four key actions:
- Increasing resources on the affected services
- Temporarily scaling gateway services to handle the increased request volume, then deploying a permanent fix that adjusted rate limits to more reasonable values
- Correcting the feature flag configuration so that messaging loads in the task drawer
- Optimizing the expensive database query to retrieve only the necessary data, reducing contention on the database (see the query sketch below)
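For reference, the sketch below illustrates the shape of the query change: replacing a broad query that pulled full message payloads with a projection limited to the columns the channel list actually renders. The table and column names (`channels`, `messages`, and so on) are illustrative placeholders, not our actual schema.

```python
# Hypothetical sketch of the query change. Schema names are placeholders.
from typing import Any

# Before: broad query that dragged entire message rows across the wire
# and held database resources longer than necessary.
BROAD_QUERY = """
    SELECT *
    FROM channels
    JOIN messages ON messages.channel_id = channels.id
    WHERE channels.workspace_id = ?
"""

# After: narrow projection limited to what the channel view displays.
NARROW_QUERY = """
    SELECT channels.id,
           channels.name,
           channels.last_activity_at,
           COUNT(messages.id) AS unread_count
    FROM channels
    LEFT JOIN messages
      ON messages.channel_id = channels.id AND messages.read = 0
    WHERE channels.workspace_id = ?
    GROUP BY channels.id, channels.name, channels.last_activity_at
"""

def load_channels(conn: Any, workspace_id: int) -> list[tuple]:
    """Run the narrowed query against any DB-API compatible connection."""
    cur = conn.cursor()
    cur.execute(NARROW_QUERY, (workspace_id,))
    return cur.fetchall()
```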
Incident Timeline
| Time | Event |
|---|---|
| 2:30 AM MDT, May 28 | New messaging feature released |
| 8:12 AM MDT, May 28 | Customer reports of issues received |
| 8:30 AM MDT, May 28 | Resources adjusted on affected services |
| 9:00 AM MDT, May 28 | War Room initiated for coordinated response |
| 11:33 AM MDT, May 28 | Optimized query deployed |
| 11:35 AM MDT, May 28 | Feature flag configuration adjusted |
| 12:34 PM MDT, May 28 | Adjusted rate limit configuration deployed |
| 12:50 PM MDT, May 28 | All issues resolved; system returned to normal operation |
Root Causes
- Resource Constraints: Insufficient resources were allocated to several services to handle the increased load from the new messaging feature
- Gateway Rate Limiting: Rate limits on gateway services were too restrictive, causing legitimate requests to be rejected when traffic increased (illustrated in the sketch following this list)
- Feature Flag Misconfiguration: An invalid customer-related feature flag configuration prevented messaging from loading in the task drawer
- Inefficient Database Queries: Certain queries retrieved far more data than needed, causing database contention and slowing channel loading
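For context on the rate limiting failure mode, here is a minimal token-bucket sketch showing how a limit tuned to pre-release traffic begins rejecting legitimate requests once volume rises. The rates and burst sizes are illustrative placeholders, not our real gateway configuration.

```python
# Minimal token-bucket sketch illustrating why a tight per-client limit
# rejects legitimate traffic once a new feature raises request volume.
# The numbers below are illustrative, not the real gateway settings.
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate_per_sec: float          # steady-state refill rate
    burst: float                 # maximum bucket size
    tokens: float = field(init=False)
    last_refill: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.burst
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request rejected (e.g. 429 at the gateway)

# Pre-incident style limit: sized for old traffic, too small after the release.
old_limit = TokenBucket(rate_per_sec=5, burst=10)
# Post-incident limit: raised to a value validated against observed peak traffic.
new_limit = TokenBucket(rate_per_sec=50, burst=100)
```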
Additional Notes
To prevent similar issues in the future, we will be looking into the following:
- Proactive monitoring of rate limit thresholds, especially during new feature releases (see the monitoring sketch after this list)
- Load testing with realistic traffic patterns prior to major feature releases
- Database query optimization reviews as part of the deployment checklist
- Automated scaling policies for critical gateway services
- Enhanced monitoring coverage for the areas most affected by this incident, such as messaging request latency and error rates
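As a rough illustration of the proactive rate limit monitoring mentioned above, the sketch below alerts when rate-limited (HTTP 429) responses exceed a small fraction of gateway traffic over a rolling window. The metric shape and threshold are assumptions for illustration, not an existing dashboard or alerting rule.

```python
# Hypothetical sketch of a rate-limit monitoring check: alert when 429
# responses exceed a small fraction of total gateway traffic in a window.
# Metric names and the threshold are placeholders.
from dataclasses import dataclass

@dataclass
class WindowCounts:
    total_requests: int
    rate_limited: int   # responses rejected with HTTP 429

def should_alert(window: WindowCounts, threshold: float = 0.01) -> bool:
    """Fire an alert if more than `threshold` of requests were rate limited."""
    if window.total_requests == 0:
        return False
    return (window.rate_limited / window.total_requests) > threshold

# Example: 1.5% of requests rejected in the window -> alert fires.
assert should_alert(WindowCounts(total_requests=10_000, rate_limited=150))
```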