GUIDEcx - Intermittent latency throughout the application – Incident details


Intermittent latency throughout the application

Resolved
Degraded performance
Started 12 days ago. Lasted about 1 hour.

Affected

• Web Application - Degraded performance from 8:59 PM to 9:48 PM
• Project Management - Degraded performance from 8:59 PM to 9:48 PM
• Compass Customer Portal - Degraded performance from 8:59 PM to 9:48 PM
• Resource Management - Degraded performance from 8:59 PM to 9:48 PM
• Advanced Time Tracking - Degraded performance from 8:59 PM to 9:48 PM
• Report Navigator and Report Builder - Degraded performance from 8:59 PM to 9:48 PM

Updates
  • Postmortem

    Post-Mortem: Image registry rate limiting (June 10th 2024)

    Summary

    Following increased system load today, users experienced delays in messaging and email processing. Image registry rate limiting prevented services from restarting properly, and the resulting cascade caused dependent systems to fall behind in processing, delaying message and email delivery.

    Resolution

    The issues were resolved by taking two key actions:

    • Monitoring existing services - Monitored existing services to ensure they remained as performant as possible while investigating the underlying image pull failures.

    • Image registry migration - Converted affected services to pull images from a more reliable image registry location that does not have the same rate limiting constraints (a rough sketch of this kind of change follows below).
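
    As an illustration only, the sketch below repoints a workload's container image at a mirrored registry using the Kubernetes Python client. It assumes a Kubernetes-style deployment, which this post-mortem does not state explicitly, and the deployment, container, and registry names are hypothetical placeholders rather than actual GUIDEcx services.

    # Hypothetical sketch (assumes Kubernetes): repoint a Deployment's container
    # image at a mirrored registry that is not subject to the same pull rate limits.
    from kubernetes import client, config

    def migrate_image_registry(deployment: str, container: str, new_image: str,
                               namespace: str = "default") -> None:
        """Patch a Deployment so the named container pulls from a different registry."""
        config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
        apps = client.AppsV1Api()

        # Strategic merge patch: only the image of the matching container changes,
        # which triggers a normal rolling restart of the Deployment.
        patch = {
            "spec": {
                "template": {
                    "spec": {
                        "containers": [{"name": container, "image": new_image}]
                    }
                }
            }
        }
        apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)

    # Example call with placeholder names:
    # migrate_image_registry("message-worker", "worker",
    #                        "mirror.example.com/guidecx/message-worker:stable")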

    Incident Timeline

    • 1:15 PM - First error detected
    • 1:17 PM - Engineering team notified of the incident
    • 1:34 PM - Root cause identified as image registry rate limiting
    • 2:50 PM - Fix deployed to production (services migrated to the new image registry)
    • 3:46 PM - All issues resolved, system returned to normal operation

    Root Causes

    • Image Registry Rate Limiting: The primary image registry had restrictive rate limits that prevented services from pulling server images during restart events, blocking the successful restart of services that were failing health checks.

    • Service Health Check Failures: Services were restarting due to health check failures, and the subsequent server image pull failures during restart created a cascading effect that impacted message and email processing across the platform (a sketch of how such stuck pulls can be detected follows this list).
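
    For illustration, and again assuming a Kubernetes-style deployment (not confirmed by this post-mortem), the sketch below shows one way this failure mode surfaces: pods whose containers are stuck in ErrImagePull or ImagePullBackOff while waiting on a rate-limited registry. All names here are placeholders.

    # Hypothetical sketch (assumes Kubernetes): list pods whose containers are stuck
    # waiting on image pulls, the failure mode described in the root causes above.
    from kubernetes import client, config

    PULL_FAILURE_REASONS = {"ErrImagePull", "ImagePullBackOff"}

    def find_stuck_image_pulls() -> None:
        config.load_kube_config()
        core = client.CoreV1Api()
        for pod in core.list_pod_for_all_namespaces(watch=False).items:
            for status in (pod.status.container_statuses or []):
                waiting = status.state.waiting if status.state else None
                if waiting and waiting.reason in PULL_FAILURE_REASONS:
                    print(f"{pod.metadata.namespace}/{pod.metadata.name} "
                          f"[{status.name}]: {waiting.reason} - {waiting.message}")

    if __name__ == "__main__":
        find_stuck_image_pulls()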

    Additional Notes

    To prevent similar issues in the future, the following measures will be implemented:

    • Image pre-caching - Implement image pre-caching on nodes to reduce dependency on external registry pulls during scaling events.

    • Enhanced monitoring - Add specific monitoring for image pull failures and registry rate limit warnings (see the sketch after this list).

    • Root Cause Analysis - Investigate the persistent health check failures experienced by the service.
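
    As a sketch of what the enhanced monitoring could look like, again assuming a Kubernetes-style deployment and hypothetical match strings, the snippet below scans warning events for image pull failures and flags messages that look like registry rate limiting, so they can be forwarded to an alerting system.

    # Hypothetical sketch (assumes Kubernetes): surface image pull failures and
    # registry rate-limit warnings from cluster events for alerting.
    from kubernetes import client, config

    RATE_LIMIT_MARKERS = ("toomanyrequests", "rate limit", "429")

    def check_image_pull_events() -> None:
        config.load_kube_config()
        core = client.CoreV1Api()
        events = core.list_event_for_all_namespaces(field_selector="type=Warning")
        for event in events.items:
            message = (event.message or "").lower()
            if "pull" not in message:
                continue  # keep only image-pull related warnings
            label = "RATE-LIMIT" if any(m in message for m in RATE_LIMIT_MARKERS) else "PULL-FAILURE"
            obj = event.involved_object
            print(f"[{label}] {obj.namespace}/{obj.name}: {event.message}")

    if __name__ == "__main__":
        check_image_pull_events()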

  • Resolved

    The issue has been resolved and users should be able to navigate without significant lag or latency.

  • Identified

    We are continuing to work on a fix for this incident. We believe we have identified the issue and are deploying a potential fix. If that works, we will update to Monitoring status; otherwise, we will revert to Investigating status.

  • Investigating

    We are currently investigating this incident. Users are seeing slowness or errors with Custom Fields, Messages, and general navigation. Thank you for your patience; we will continue to post updates as more information becomes available.