GUIDEcx - Core API Services Unavailable – Incident details

Core API Services Unavailable

Resolved
Major outage
Started about 1 month ago
Lasted about 7 hours

Affected

Web Application

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Project Management

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Compass Customer Portal

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Resource Management

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Advanced Time Tracking

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Report Navigator and Report Builder

Operational from 7:00 AM to 1:52 PM, Major outage from 1:52 PM to 2:04 PM

Updates
  • Update

    RCA

    Summary:

    The GUIDEcx API services run on AWS ECS clusters, backed by Auto Scaling Group (ASG) configurations that manage EC2 instances and ECS task placement. On October 8, 2024, the engineering team noticed that the ECS cluster was not provisioning or de-provisioning EC2 instances correctly, which could impair the ability of API services to scale according to demand. After the team refreshed the EC2 instances for the primary cluster, services appeared healthy and were scaling as expected.

    However, around 12:00 AM EDT on October 9, the clusters returned to an unhealthy state, causing ECS to remove EC2 instances that had active services running on them. The problem compounded because the unhealthy cluster state prevented GUIDEcx services from automatically scaling back up, leaving all services stuck in a "Pending" status and resulting in a system-wide outage of GUIDEcx API services.
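
    The failure mode described above (services that want tasks but have none running) can be checked for directly. The following is a minimal sketch, not the GUIDEcx monitoring setup: it assumes a boto3 ECS client is passed in as `ecs_client`, and the function name, cluster name, and service names are hypothetical. It relies only on the `describe_services` call and its `desiredCount` and `runningCount` fields.

```python
from typing import Any


def find_stalled_services(ecs_client: Any, cluster: str, service_names: list[str]) -> list[str]:
    """Return the names of services that want tasks but have none running.

    `ecs_client` is assumed to be a boto3 ECS client, e.g. boto3.client("ecs").
    """
    stalled = []
    # describe_services accepts at most 10 services per call, so query in batches.
    for start in range(0, len(service_names), 10):
        batch = service_names[start:start + 10]
        response = ecs_client.describe_services(cluster=cluster, services=batch)
        for service in response.get("services", []):
            # A service with a positive desired count and zero running tasks
            # is the "stuck in Pending" condition described in the summary.
            if service["desiredCount"] > 0 and service["runningCount"] == 0:
                stalled.append(service["serviceName"])
    return stalled
```

    A check like this could run on a schedule and page the on-call engineer when the returned list is non-empty; the threshold and alert routing are left out because the report does not describe them.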

    Resolution:

    The issue was resolved by refreshing the EC2 instances in the ASG and ensuring that "Scale In Protection" was being applied correctly to new instances. This allowed the "Pending" ECS Cluster apps to start on the newly restarted instances.
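
    As an illustration of the scale-in protection step, the sketch below finds in-service instances in an ASG that lack protection and enables it. It is an assumption-laden example rather than the procedure the team ran: `asg_client` is assumed to be a boto3 Auto Scaling client, the function and group names are hypothetical, and the batch size of 50 reflects the per-call limit documented for `set_instance_protection` at the time of writing.

```python
from typing import Any


def protect_in_service_instances(asg_client: Any, group_name: str) -> list[str]:
    """Enable scale-in protection on unprotected InService instances.

    `asg_client` is assumed to be boto3.client("autoscaling").
    Returns the instance IDs that were newly protected.
    """
    response = asg_client.describe_auto_scaling_groups(AutoScalingGroupNames=[group_name])
    groups = response["AutoScalingGroups"]
    if not groups:
        return []

    unprotected = [
        instance["InstanceId"]
        for instance in groups[0]["Instances"]
        if instance["LifecycleState"] == "InService"
        and not instance["ProtectedFromScaleIn"]
    ]

    # Send the IDs in batches to stay within the per-call instance limit.
    for start in range(0, len(unprotected), 50):
        asg_client.set_instance_protection(
            AutoScalingGroupName=group_name,
            InstanceIds=unprotected[start:start + 50],
            ProtectedFromScaleIn=True,
        )
    return unprotected
```

    Protecting only instances in the `InService` state avoids touching instances that are already launching or terminating. Whether protection should instead be managed by the ECS capacity provider depends on the cluster configuration, which the report does not specify.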

    Incident Timeline (in EDT):

    • 11:45 PM, October 8: Configuration changes were made to the ASG to address down-scaling issues.

    • 12:00 AM, October 9: Automated monitoring first detected instability.

    • 3:00 AM, October 9: Incident was first reported by customers, but most systems were still functioning.

    • 8:00 AM, October 9: A scaling event caused ECS to remove EC2 instances running primary services, leading to a complete outage as ECS services had zero running tasks.

    • 9:45 AM, October 9: The cause was identified, and the resolution was implemented.

    • 10:00 AM, October 9: Incident fully resolved.
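
    The intervals implied by the timeline above can be worked out with a short calculation. This is only an arithmetic sketch: the variable names are illustrative, the times are treated as naive local EDT values, and the 12:00 AM detection is taken to fall on October 9, as the sequence of events implies.

```python
from datetime import datetime

# Timeline entries in EDT; the labels are illustrative, not from the report.
detected = datetime(2024, 10, 9, 0, 0)          # automated monitoring alert
customer_report = datetime(2024, 10, 9, 3, 0)   # first customer report
full_outage = datetime(2024, 10, 9, 8, 0)       # ECS services at zero running tasks
resolved = datetime(2024, 10, 9, 10, 0)         # incident fully resolved


def hours_between(start: datetime, end: datetime) -> float:
    """Return the elapsed time between two timestamps in hours."""
    return (end - start).total_seconds() / 3600


durations = {
    "detection_to_resolution": hours_between(detected, resolved),
    "customer_report_to_resolution": hours_between(customer_report, resolved),
    "complete_outage": hours_between(full_outage, resolved),
}
```

    Under these assumptions, the interval from the first customer report to resolution is 7 hours, which matches the "about 7 hours" duration shown for the incident. The complete outage lasted 2 hours, and 10 hours passed between the first automated alert and resolution.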

    Additional Notes:

    Our automated monitoring and alerting system detected the initial signs of instability at 12:00 AM EDT. However, the team's response to these alerts did not follow our standard incident response program, leading to a delayed resolution. As a result, we have reinforced training for on-call engineers and improved our escalation policies to ensure timely responses to these automated alerts in the future.

    Additionally, since the auto-scaling configurations for these ECS Clusters have not been modified in over 18 months, we believe the root cause of this unhealthy behavior is vendor-related, and we are currently engaging with the vendor to ensure this issue does not recur.

  • Resolved

    The issue is fully resolved.

  • Monitoring

    Fix has been applied. We are monitoring to ensure it stays stable. Access is up for all services.

  • Identified

    The AWS ECS cluster is not auto-provisioning new instances, which is preventing API services from autoscaling.

  • Investigating

    We are investigating an incident in which core API services are unavailable in all regions. This is causing outages for the web application and integrations.