GUIDEcx - Notice history

Notice history

Oct 2024

Core API Services Unavailable
  • Update

    RCA

    Summary:

    The GUIDEcx API services run on AWS ECS Clusters, supported by Auto Scaling Group (ASG) configurations that manage EC2 instances and ECS task placement. On October 8, 2024, the engineering team noticed that the ECS Cluster was not provisioning or de-provisioning EC2 instances correctly, which could impact the ability of API services to scale with demand. After refreshing the EC2 instances for the primary cluster, services appeared healthy and were scaling as expected.

    However, around 12:00 AM EDT, the clusters returned to an unhealthy state, causing ECS to remove EC2 instances that had active services running on them. This issue compounded as the cluster's unhealthy state prevented GUIDEcx services from automatically scaling back up, leaving all services stuck in a "Pending" status and resulting in a system-wide outage of GUIDEcx API services.
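
    The setup described above, in which ECS schedules tasks onto EC2 instances that the ASG provisions and de-provisions, is typically wired together through an ECS capacity provider with managed scaling. The following is a minimal sketch of that wiring using boto3; the capacity provider name, ASG ARN, and scaling values are hypothetical and are not GUIDEcx's actual configuration.

    import boto3

    ecs = boto3.client("ecs", region_name="us-east-1")

    ASG_ARN = (  # hypothetical ARN for illustration only
        "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:"
        "example-uuid:autoScalingGroupName/guidecx-api-asg"
    )

    # Register a capacity provider backed by the existing Auto Scaling Group.
    # Managed scaling lets ECS grow and shrink the ASG to match task demand;
    # managed termination protection (which requires instance scale-in
    # protection on the ASG) keeps ECS from removing instances that still
    # have non-daemon tasks running on them.
    ecs.create_capacity_provider(
        name="guidecx-api-capacity-provider",  # hypothetical name
        autoScalingGroupProvider={
            "autoScalingGroupArn": ASG_ARN,
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": 100,
                "minimumScalingStepSize": 1,
                "maximumScalingStepSize": 10,
            },
            "managedTerminationProtection": "ENABLED",
        },
    )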

    Resolution:

    The issue was resolved by refreshing the EC2 instances in the ASG and ensuring that "Scale In Protection" was applied correctly to new instances. This allowed the "Pending" ECS services to start on the newly refreshed instances.
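
    As an illustration of the resolution steps described above, the sketch below uses boto3 to ensure that new ASG instances receive scale-in protection and then triggers a rolling instance refresh. The ASG name and refresh preferences are assumptions for illustration, not the actual values used.

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    ASG_NAME = "guidecx-api-asg"  # hypothetical ASG name

    # Ensure every instance the ASG launches from now on is protected from
    # scale-in, so instances with running ECS tasks are not removed when the
    # group scales down.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME,
        NewInstancesProtectedFromScaleIn=True,
    )

    # Replace the unhealthy instances with fresh ones while keeping a
    # minimum amount of healthy capacity in service.
    autoscaling.start_instance_refresh(
        AutoScalingGroupName=ASG_NAME,
        Strategy="Rolling",
        Preferences={
            "MinHealthyPercentage": 90,
            "InstanceWarmup": 300,
        },
    )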

    Incident Timeline (in EDT):

    • 11:45 PM, October 8: Configuration changes were made to the ASG to address down-scaling issues.

    • 12:00 AM, October 9: Automated monitoring first detected instability.

    • 3:00 AM, October 9: Incident was first reported by customers, but most systems were still functioning.

    • 8:00 AM, October 9: A scaling event caused ECS to remove EC2 instances running primary services, leading to a complete outage as ECS services had zero running tasks.

    • 9:45 AM, October 9: The cause was identified, and the resolution was implemented.

    • 10:00 AM, October 9: Incident fully resolved.

    Additional Notes:

    Our automated monitoring and alerting system detected the initial signs of instability at 12:00 AM EDT. However, the team's response to these alerts did not follow our standard incident response program, leading to a delayed resolution. As a result, we have reinforced training for on-call engineers and improved our escalation policies to ensure timely responses to these automated alerts in the future.

    Additionally, since the auto-scaling configurations for these ECS Clusters have not been modified in over 18 months, we believe the root cause of this unhealthy behavior is vendor-related, and we are currently engaging with the vendor to ensure this issue does not recur.

  • Resolved

    Issue is fully resolved

  • Monitoring

    A fix has been applied. We are monitoring to ensure it remains stable. Access is restored for all services.

  • Identified

    AWS ECS Cluster is not auto-provisioning new instances, preventing API services from autoscaling.

  • Investigating

    We are investigating this incident where core API services are unavailable in all regions. This is resulting in outages for the web application and integrations.

Users unable to access projects
  • Update

    Root Cause Analysis

    Issue Summary:

    On the morning of October 4, 2024, at 10:00 AM MST, an error spike occurred when loading project plans, following the release of an improved database view designed to speed up project statistics loading and improve overall database performance. The issue was resolved by 10:40 AM MST.

    Root Cause:

    The new database view was incompatible with old pods, which caused errors when they accessed the updated view. Our automated rollback process was triggered in response to the error spike, but it only rolled back the application deployment, not the database schema. As a result, all pods continued to access the incompatible view, extending the period of disruption.
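
    One way to avoid this class of incompatibility is an expand-and-contract rollout, where the improved view is published under a new name so that old pods keep reading the old view until every pod runs the new release. The sketch below illustrates the idea with psycopg2 against a PostgreSQL database; the connection string, view names, and columns are hypothetical, not the actual schema involved.

    import psycopg2

    conn = psycopg2.connect("dbname=guidecx")  # hypothetical connection
    cur = conn.cursor()

    # Expand: add the improved view under a new name instead of replacing
    # the view the currently running pods depend on.
    cur.execute("""
        CREATE VIEW project_statistics_v2 AS
        SELECT project_id,
               COUNT(*)      AS task_count,
               AVG(duration) AS avg_task_duration
        FROM project_tasks
        GROUP BY project_id
    """)
    conn.commit()

    # New pods read project_statistics_v2; old pods keep using the original
    # project_statistics view and stay functional during the rollout.
    # Contract: drop the old view only after the deployment has fully
    # replaced every old pod, e.g.:
    # cur.execute("DROP VIEW project_statistics")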

    Resolution:

    At 10:30 AM MST, we redeployed the latest version of the application, ensuring that all pods were compatible with the new database view, which resolved the issue.

    Preventive Measures:

    • Improve the canary release cycle to better isolate and test database changes before full production rollouts.

    • Enhance the automated rollback process to include database schema rollbacks when necessary; a sketch of such a combined rollback follows this list.
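
    A minimal sketch of such a combined rollback step is shown below, assuming a Kubernetes deployment (the pods mentioned above) and Alembic-managed migrations; the deployment name and tooling are assumptions for illustration only.

    import subprocess

    def roll_back_release(deployment: str = "guidecx-api") -> None:
        """Revert both the application pods and the schema change they shipped with."""
        # Roll the Kubernetes deployment back to its previous revision.
        subprocess.run(
            ["kubectl", "rollout", "undo", f"deployment/{deployment}"],
            check=True,
        )
        # Also revert the most recent database migration (here, the new view),
        # so the pods that come back up see a schema they understand.
        subprocess.run(["alembic", "downgrade", "-1"], check=True)

    if __name__ == "__main__":
        roll_back_release()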

  • Resolved

    This issue has been resolved.

  • Monitoring

    We implemented a fix and are currently monitoring the result. The initial results are positive and it appears that access to projects has been restored.

  • Update

    We are currently investigating this incident. The behavior we are seeing is that users access the Project page, click on a project, and rather than being directed to the Plan view, are redirected back to the Project page.

  • Investigating

    We are currently investigating this incident.

Sep 2024

No notices reported this month
