Post-Mortem: Image Registry Rate Limiting (June 10th, 2024)
Summary
Following increased system load today, users experienced delays in message and email processing. Image registry rate limiting prevented services from restarting properly, which caused dependent systems to fall behind in processing and resulted in delayed message and email delivery.
Resolution
The issues were resolved by taking two key actions:
Monitoring existing services - Kept the existing services under close observation to ensure they continued performing as well as possible while the underlying image pull failures were investigated.
Image registry migration - Migrated affected services to pull their images from a more reliable registry that is not subject to the same rate limits (a sketch of this step follows).
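As a rough illustration of the migration step, the sketch below mirrors an image from a rate-limited registry into an internal one using the Docker CLI. The registry hosts and image names are hypothetical placeholders, not the actual ones involved in the incident, and the exact mechanism (mirroring versus simply repointing manifests at an existing mirror) is an assumption.

```python
# Mirror an image out of the rate-limited registry and into an internal one,
# so services can be repointed at the mirror. All names are placeholders.
import subprocess

SOURCE_IMAGE = "docker.io/acme/worker:1.4.2"                      # rate-limited registry (placeholder)
MIRROR_IMAGE = "registry.internal.example.com/acme/worker:1.4.2"  # internal mirror (placeholder)

def mirror_image(source: str, target: str) -> None:
    """Pull the image once, retag it, and push it to the internal registry."""
    subprocess.run(["docker", "pull", source], check=True)
    subprocess.run(["docker", "tag", source, target], check=True)
    subprocess.run(["docker", "push", target], check=True)

if __name__ == "__main__":
    mirror_image(SOURCE_IMAGE, MIRROR_IMAGE)
    print(f"Mirrored {SOURCE_IMAGE} -> {MIRROR_IMAGE}; update service manifests to reference the mirror.")
```

Once the image is available in the mirror, each affected service's image reference is updated to the new registry host so restarts no longer depend on the rate-limited source.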
Incident Timeline
| Time | Event |
|---|---|
| 1:15 PM | First error detected |
| 1:17 PM | Engineering team notified of the incident |
| 1:34 PM | Root cause identified as image registry rate limiting |
| 2:50 PM | Fix deployed to production (services migrated to new image registry) |
| 3:46 PM | All issues resolved, system returned to normal operation |
Root Causes
Image Registry Rate Limiting: The primary image registry enforced restrictive rate limits that prevented services from pulling their images during restart events, blocking the restart of services that were failing health checks.
Service Health Check Failures: Services were restarting due to health check failures, and the image pull failures that followed during those restarts created a cascading effect that impacted message and email processing across the platform.
Additional Notes
To prevent similar issues in the future, the following measures will be implemented:
Image pre-caching - Implement image pre-caching on nodes to reduce dependency on external registry pulls during scaling events (see the first sketch after this list).
Enhanced monitoring - Add specific monitoring for image pull failures and registry rate limit warnings (see the second sketch after this list).
Root cause analysis - Investigate the persistent health check failures experienced by the affected service.
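As a minimal sketch of the pre-caching idea, the script below pulls a fixed list of critical images into a node's local cache ahead of time, so that a later restart can use the cached copy instead of hitting the registry. It assumes Docker is available on the node and that the script runs at node provisioning time or on a schedule; the image names are placeholders.

```python
# Pre-pull critical images into the node-local cache so service restarts do
# not depend on a live registry pull. Image names below are placeholders.
import subprocess

CRITICAL_IMAGES = [
    "registry.internal.example.com/acme/worker:1.4.2",   # placeholder
    "registry.internal.example.com/acme/mailer:2.0.1",   # placeholder
]

def prewarm(images: list[str]) -> None:
    """Pull each image now; subsequent restarts hit the local cache."""
    for image in images:
        subprocess.run(["docker", "pull", image], check=True)

if __name__ == "__main__":
    prewarm(CRITICAL_IMAGES)
```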
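For the enhanced monitoring item, one possible starting point is a periodic check for containers stuck on image pulls. The sketch below assumes a Kubernetes-based platform and the official `kubernetes` Python client; neither is confirmed by this report, so treat it as one possible shape of the check rather than the planned implementation.

```python
# Periodically scan for containers stuck on image pulls, assuming a
# Kubernetes-based platform (an assumption; the orchestrator is not named
# in this report) and the official `kubernetes` Python client.
from kubernetes import client, config

PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull"}

def find_pull_failures():
    """Return (namespace, pod, container, reason) for containers failing to pull images."""
    config.load_kube_config()  # use config.load_incluster_config() when run inside the cluster
    v1 = client.CoreV1Api()
    failures = []
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for status in (pod.status.container_statuses or []):
            waiting = status.state.waiting
            if waiting and waiting.reason in PULL_FAILURE_REASONS:
                failures.append((pod.metadata.namespace, pod.metadata.name,
                                 status.name, waiting.reason))
    return failures

if __name__ == "__main__":
    for ns, pod, container, reason in find_pull_failures():
        print(f"{ns}/{pod} container={container} reason={reason}")
```

In practice this kind of check would feed an alerting system so that pull failures and registry rate limit warnings page the team before dependent services fall behind.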