Platform Stability Issues

Updates

Update
November 02, 2021 at 2:30 AM UTC
New containers are now autoscaling successfully. An outdated Docker agent on an autoscaling group caused duplicate host mappings, which led one target group to fail health checks and caused a minority of the tasks in the autoscaling cluster to restart continually. Requests routed to healthy tasks behaved normally. Requests routed to one of these “unhealthy” tasks succeeded only while that task was up, and each such task worked for only 5 minutes, hence the sporadic success or failure of requests.

A breaking API change in Docker required updating the ECS agents from 1.55.1 to 1.55.3. Once the upgrade was completed, autoscaling and host port mapping worked correctly.
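
As a rough illustration of the duplicate host mapping condition described above, the following Python sketch flags any host port claimed by more than one task. The task IDs, host names, port numbers, and the function `find_duplicate_host_ports` are hypothetical; they are not taken from the incident tooling or from any AWS API.

```python
from collections import defaultdict


def find_duplicate_host_ports(mappings):
    """Return {(host, port): [task_ids]} for host ports claimed by more than one task.

    `mappings` is an iterable of (task_id, host, host_port) tuples.
    """
    claims = defaultdict(list)
    for task_id, host, port in mappings:
        claims[(host, port)].append(task_id)
    # Keep only the host/port pairs that more than one task is mapped to.
    return {key: tasks for key, tasks in claims.items() if len(tasks) > 1}


# Hypothetical sample data, for illustration only.
mappings = [
    ("task-a", "host-1", 32768),
    ("task-b", "host-1", 32769),
    ("task-c", "host-1", 32768),  # collides with task-a on host-1
    ("task-d", "host-2", 32768),  # same port, different host: no collision
]

duplicates = find_duplicate_host_ports(mappings)
print(duplicates)
```

With this sample data, only port 32768 on `host-1` is reported, because it is the only host and port pair shared by two tasks.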
Update
November 01, 2021 at 9:00 PM UTC
The fix for the stability issues continues to hold. We are working with IaaS support on the autoscaling issues.
Resolved
November 01, 2021 at 5:50 PM UTC
The instability, in which some requests succeeded and others failed, has been resolved. The team is working on root cause analysis and on re-enabling autoscaling.
Identified
November 01, 2021 at 5:30 PM UTC
The instability is caused by new containers failing health checks 5 to 10 minutes after starting. Load balancers have been updated to ensure requests are routed to the old containers, which are healthy.
Investigating
November 01, 2021 at 5:10 PM UTC
We are currently investigating partial failures of various API requests, which are leading to partially loaded or broken pages.

GUIDEcx - Platform Stability Issues – Incident details
