GUIDEcx - Platform Stability Issues – Incident details

All systems operational

Platform Stability Issues

Resolved
Partial outage
Started over 3 years ago
Lasted about 4 hours
Updates
  • Update

    New containers are now autoscaling successfully. An outdated Docker agent on an autoscaling group caused duplicate host mappings, which led one target group to fail health checks and caused a minority of the tasks in the autoscaling cluster to restart continually. Requests routed to healthy tasks behaved normally. Requests routed to one of these “unhealthy” tasks succeeded only during the roughly 5 minutes the task worked before restarting, hence the sporadic success or failure of requests.

    A breaking API change in Docker required updating the ECS agents from 1.55.1 to 1.55.3. Once the upgrade was completed, autoscaling and host port mapping worked correctly.

  • Update

    The fix for the stability issues continues to hold. We are working with IaaS support on the autoscaling issues.

  • Resolved

    The instability, in which some requests succeeded and others failed, has been resolved. The team is working on root cause analysis and on re-enabling autoscaling.

  • Identified

    The instability is caused by new containers failing health checks 5 to 10 minutes after starting. Load balancers have been updated to ensure requests are routed to the old containers, which are healthy.

  • Investigating

    We are currently investigating an issue of partial failures of different API requests, leading to partially loaded or broken pages.
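
The root cause noted in the updates, ECS agents left on 1.55.1 when 1.55.3 was required, is the kind of drift that can be spotted with a simple version comparison. The following is a minimal sketch, not the tooling used during this incident. It assumes instance records shaped like the entries the AWS ECS API returns from describe_container_instances (an "ec2InstanceId" and a "versionInfo" with an "agentVersion"). The function names and the sample instance IDs are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.55.3' into (1, 55, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def outdated_instances(instances: Iterable[dict], minimum: str) -> List[str]:
    """Return the EC2 instance IDs whose ECS agent is older than `minimum`."""
    floor = parse_version(minimum)
    return [
        inst["ec2InstanceId"]
        for inst in instances
        if parse_version(inst["versionInfo"]["agentVersion"]) < floor
    ]


# Hypothetical sample data; real records would come from the ECS API.
sample = [
    {"ec2InstanceId": "i-aaa", "versionInfo": {"agentVersion": "1.55.1"}},
    {"ec2InstanceId": "i-bbb", "versionInfo": {"agentVersion": "1.55.3"}},
]

print(outdated_instances(sample, "1.55.3"))  # prints ['i-aaa']
```

Comparing parsed integer tuples rather than raw strings avoids misordering versions such as 1.55.10 and 1.55.3.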