Post-Mortem: Image Registry Rate Limiting (June 10th, 2024)
Summary
Following increased system load today, users experienced delays in message and email processing. Image registry rate limiting prevented services from restarting properly, which caused dependent systems to fall behind in processing and resulted in delayed message and email delivery.
Resolution
The issues were resolved by taking two key actions:
Monitoring existing services - Kept the existing services under close observation to ensure they continued performing as well as possible while the underlying image pull failures were investigated.
Image registry migration - Migrated affected services to pull their images from a more reliable registry that is not subject to the same rate limits (a sketch of this step follows).
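As a rough illustration of the migration step, the sketch below mirrors an image from a rate-limited registry into an internal one using the Docker CLI. The registry hosts and image names are hypothetical placeholders, not the actual ones involved in the incident, and the exact mechanism (mirroring versus simply repointing manifests at an existing mirror) is an assumption.

```python
# Mirror an image out of the rate-limited registry and into an internal one,
# so services can be repointed at the mirror. All names are placeholders.
import subprocess

SOURCE_IMAGE = "docker.io/acme/worker:1.4.2"                      # rate-limited registry (placeholder)
MIRROR_IMAGE = "registry.internal.example.com/acme/worker:1.4.2"  # internal mirror (placeholder)

def mirror_image(source: str, target: str) -> None:
    """Pull the image once, retag it, and push it to the internal registry."""
    subprocess.run(["docker", "pull", source], check=True)
    subprocess.run(["docker", "tag", source, target], check=True)
    subprocess.run(["docker", "push", target], check=True)

if __name__ == "__main__":
    mirror_image(SOURCE_IMAGE, MIRROR_IMAGE)
    print(f"Mirrored {SOURCE_IMAGE} -> {MIRROR_IMAGE}; update service manifests to reference the mirror.")
```

Once the image is available in the mirror, each affected service's image reference is updated to the new registry host so restarts no longer depend on the rate-limited source.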
Incident Timeline
| Time | Event |
|---|---|
| 1:15 PM | First error detected |
| 1:17 PM | Engineering team notified of the incident |
| 1:34 PM | Root cause identified as image registry rate limiting |
| 2:50 PM | Fix deployed to production (services migrated to new image registry) |
| 3:46 PM | All issues resolved, system returned to normal operation |
Root Causes
Image Registry Rate Limiting: The primary image registry enforced restrictive rate limits that prevented services from pulling their images during restart events, blocking the restart of services that were failing health checks.
Service Health Check Failures: Services were restarting due to health check failures, and the image pull failures that followed during those restarts created a cascading effect that impacted message and email processing across the platform.
Additional Notes
To prevent similar issues in the future, the following measures will be implemented:
Image pre-caching - Implement image pre-caching on nodes to reduce dependency on external registry pulls during scaling events (see the first sketch after this list).
Enhanced monitoring - Add specific monitoring for image pull failures and registry rate limit warnings (see the second sketch after this list).
Root cause analysis - Investigate the persistent health check failures experienced by the affected service.
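As a minimal sketch of the pre-caching idea, the script below pulls a fixed list of critical images into a node's local cache ahead of time, so that a later restart can use the cached copy instead of hitting the registry. It assumes Docker is available on the node and that the script runs at node provisioning time or on a schedule; the image names are placeholders.

```python
# Pre-pull critical images into the node-local cache so service restarts do
# not depend on a live registry pull. Image names below are placeholders.
import subprocess

CRITICAL_IMAGES = [
    "registry.internal.example.com/acme/worker:1.4.2",   # placeholder
    "registry.internal.example.com/acme/mailer:2.0.1",   # placeholder
]

def prewarm(images: list[str]) -> None:
    """Pull each image now; subsequent restarts hit the local cache."""
    for image in images:
        subprocess.run(["docker", "pull", image], check=True)

if __name__ == "__main__":
    prewarm(CRITICAL_IMAGES)
```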
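For the enhanced monitoring item, one possible starting point is a periodic check for containers stuck on image pulls. The sketch below assumes a Kubernetes-based platform and the official `kubernetes` Python client; neither is confirmed by this report, so treat it as one possible shape of the check rather than the planned implementation.

```python
# Periodically scan for containers stuck on image pulls, assuming a
# Kubernetes-based platform (an assumption; the orchestrator is not named
# in this report) and the official `kubernetes` Python client.
from kubernetes import client, config

PULL_FAILURE_REASONS = {"ImagePullBackOff", "ErrImagePull"}

def find_pull_failures():
    """Return (namespace, pod, container, reason) for containers failing to pull images."""
    config.load_kube_config()  # use config.load_incluster_config() when run inside the cluster
    v1 = client.CoreV1Api()
    failures = []
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for status in (pod.status.container_statuses or []):
            waiting = status.state.waiting
            if waiting and waiting.reason in PULL_FAILURE_REASONS:
                failures.append((pod.metadata.namespace, pod.metadata.name,
                                 status.name, waiting.reason))
    return failures

if __name__ == "__main__":
    for ns, pod, container, reason in find_pull_failures():
        print(f"{ns}/{pod} container={container} reason={reason}")
```

In practice this kind of check would feed an alerting system so that pull failures and registry rate limit warnings page the team before dependent services fall behind.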