- Update
New containers are now autoscaling successfully. An outdated Docker agent on an autoscaling group caused duplicate host mappings, which led one target group to fail health checks and caused a minority of the tasks in the autoscaling cluster to restart continually. Requests routed to healthy tasks behaved normally. Requests routed to one of the “unhealthy” tasks succeeded only during the roughly 5 minutes that the task kept working, hence the sporadic success and failure of requests.
A breaking API change in Docker required updating the ECS agents from 1.55.1 to 1.55.3. Once the upgrade was completed, autoscaling and host port mapping worked correctly.
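For readers who want to check for a similar condition, here is a minimal sketch of how duplicate host mappings might be detected. It assumes "duplicate" means two tasks claiming the same host port on the same instance. The function name, the tuple layout, and the sample data are all hypothetical and were not taken from the incident.

```python
from collections import defaultdict


def find_duplicate_host_ports(mappings):
    """Return {(instance_id, host_port): [task_ids]} for ports claimed by more than one task.

    `mappings` is an iterable of (instance_id, host_port, task_id) tuples.
    This layout is an assumption for illustration, not an ECS API shape.
    """
    claims = defaultdict(list)
    for instance_id, host_port, task_id in mappings:
        claims[(instance_id, host_port)].append(task_id)
    # Keep only the (instance, port) pairs that more than one task claims.
    return {key: tasks for key, tasks in claims.items() if len(tasks) > 1}


# Hypothetical sample data: two tasks share port 32768 on instance i-aaa.
sample = [
    ("i-aaa", 32768, "task-1"),
    ("i-aaa", 32768, "task-2"),
    ("i-bbb", 32768, "task-3"),
]
duplicates = find_duplicate_host_ports(sample)
```

The same port on a different instance (`i-bbb` above) is not flagged, since a host port only conflicts within a single host.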
- Update
The fix for the stability issues continues to work. We are working with IaaS support on the autoscaling issues.
- Resolved
The instability, in which some requests succeeded and others failed, has been resolved. The team is working on root cause analysis and on re-enabling autoscaling.
- Identified
The instability is caused by new containers failing health checks 5 to 10 minutes after starting. Load balancers have been updated to ensure requests are routed to the old containers, which are healthy.
- Investigating
We are currently investigating an issue in which various API requests partially fail, leading to partially loaded or broken pages.