GUIDEcx - Core API Services Unavailable – Incident details


Core API Services Unavailable

Resolved
Major outage
Started 15 days ago · Lasted about 7 hours

Affected

Web Application

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Project Management

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Compass Customer Portal

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Resource Management

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Advanced Time Tracking

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Report Navigator and Report Builder

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Updates
  • Resolved
    Update

    RCA

    Summary:

    The GUIDEcx API services run on AWS ECS Clusters, with Auto Scaling Group (ASG) configurations managing EC2 instances and ECS task placement. On October 8, 2024, the engineering team noticed that the ECS Cluster was not provisioning or de-provisioning EC2 instances correctly, which could impact the ability of API services to scale with demand. After refreshing the EC2 instances for the primary cluster, services appeared healthy and were scaling as expected.

    However, around 12:00 AM EDT, the clusters returned to an unhealthy state, causing ECS to remove EC2 instances that had active services running on them. This issue compounded as the cluster's unhealthy state prevented GUIDEcx services from automatically scaling back up, leaving all services stuck in a "Pending" status and resulting in a system-wide outage of GUIDEcx API services.
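The failure mode described above can be illustrated with a minimal, hypothetical model (this is not the GUIDEcx codebase): ECS can only place a task on a registered container instance, so when the ASG terminates every instance without replacements, existing tasks are killed and re-queued, and nothing can move out of "Pending".

```python
# Hypothetical sketch of why ECS tasks stall in "Pending" when an unhealthy
# ASG removes all container instances. Names and structure are illustrative.

from dataclasses import dataclass, field

@dataclass
class Cluster:
    instances: list                                  # registered EC2 container instances
    pending_tasks: list = field(default_factory=list)
    running_tasks: list = field(default_factory=list)

    def place_tasks(self):
        # ECS can only start a task while a registered instance is available.
        # With zero instances, every task stays pending indefinitely.
        while self.pending_tasks and self.instances:
            self.running_tasks.append(self.pending_tasks.pop(0))

def scale_in(cluster, count):
    # An unhealthy ASG terminating instances also kills the tasks running on
    # them; without replacement capacity the service cannot recover on its own.
    for _ in range(min(count, len(cluster.instances))):
        cluster.instances.pop()
    cluster.pending_tasks.extend(cluster.running_tasks)
    cluster.running_tasks = []

cluster = Cluster(instances=["i-1", "i-2"], pending_tasks=["api-task"])
cluster.place_tasks()   # task runs while capacity exists
scale_in(cluster, 2)    # ASG removes both instances, killing the running task
cluster.place_tasks()   # no instances left to place on: task is stuck pending
```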

    Resolution:

    The issue was resolved by refreshing the EC2 instances in the ASG and ensuring that "Scale In Protection" was being applied correctly to new instances. This allowed the "Pending" ECS Cluster apps to start on the newly restarted instances.
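As a rough sketch, the two remediation steps map onto standard Auto Scaling API calls. The ASG name below is a placeholder, and the helper functions are illustrative, not GUIDEcx's actual tooling; building the request parameters as plain dicts keeps them easy to audit before the calls are made.

```python
# Hedged sketch of the remediation using the AWS Auto Scaling API via boto3.
# The ASG name is hypothetical; only the parameter shapes are real.

ASG_NAME = "example-api-asg"  # placeholder, not the real ASG name

def protection_params(asg_name):
    # Parameters for autoscaling.update_auto_scaling_group(): ensure
    # scale-in protection applies to newly launched instances.
    return {
        "AutoScalingGroupName": asg_name,
        "NewInstancesProtectedFromScaleIn": True,
    }

def refresh_params(asg_name):
    # Parameters for autoscaling.start_instance_refresh(): replace unhealthy
    # EC2 instances while keeping part of the fleet in service.
    return {
        "AutoScalingGroupName": asg_name,
        "Preferences": {"MinHealthyPercentage": 50},
    }

# The actual calls would look like this (requires AWS credentials):
# import boto3
# asg = boto3.client("autoscaling")
# asg.update_auto_scaling_group(**protection_params(ASG_NAME))
# asg.start_instance_refresh(**refresh_params(ASG_NAME))
```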

    Incident Timeline (in EDT):

    • 11:45 PM, October 8: Configuration changes were made to the ASG to address down-scaling issues.

    • 12:00 AM, October 9: Automated monitoring first detected instability.

    • 3:00 AM, October 9: Incident was first reported by customers, but most systems were still functioning.

    • 8:00 AM, October 9: A scaling event caused ECS to remove EC2 instances running primary services, leading to a complete outage as ECS services had zero running tasks.

    • 9:45 AM, October 9: The cause was identified, and the resolution was implemented.

    • 10:00 AM, October 9: Incident fully resolved.

    Additional Notes:

    Our automated monitoring and alerting system detected the initial signs of instability at 12:00 AM EDT. However, the team's response to these alerts did not follow our standard incident response program, leading to a delayed resolution. As a result, we have reinforced training for on-call engineers and improved our escalation policies to ensure timely responses to these automated alerts in the future.
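An escalation policy of the kind described above can be sketched as a simple tiered-deadline rule. This is an illustrative model, not GUIDEcx's actual paging configuration; the tier deadlines are assumed values.

```python
# Illustrative sketch of a tiered alert-escalation rule: an unacknowledged
# alert pages the next tier after each deadline instead of waiting on the
# primary on-call engineer indefinitely. Deadlines here are assumptions.

def escalation_level(minutes_unacknowledged, deadlines=(15, 30, 60)):
    """Return how many escalation tiers an unacknowledged alert has climbed.

    deadlines: minutes after which the alert pages the next tier, e.g.
    primary on-call -> secondary on-call -> engineering manager.
    """
    return sum(1 for d in deadlines if minutes_unacknowledged >= d)
```

Under this rule, an alert firing at 12:00 AM and still unacknowledged three hours later would have paged every tier well before the 8:00 AM full outage.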

    Additionally, since the auto-scaling configurations for these ECS Clusters had not been modified in over 18 months, we believe the root cause of this unhealthy behavior is vendor-related, and we are engaging with the vendor to ensure this issue does not recur.

  • Resolved
    Resolved

    Issue is fully resolved

  • Monitoring
    Monitoring

    The fix has been applied and we are monitoring to ensure it remains stable. All services are accessible again.

  • Identified
    Identified

    AWS ECS Cluster is not auto-provisioning new instances, preventing API services from autoscaling.

  • Investigating
    Investigating

    We are investigating this incident where core API services are unavailable in all regions. This is resulting in outages for the web application and integrations.