GUIDEcx - Notice history

All systems operational

100% uptime


Notice history

Dec 2024

No notices reported this month

Nov 2024

Some API requests experiencing slower responses than usual
  • Update

    RCA

    Summary:

    The latest GUIDEcx API services run on AWS EKS Clusters. On the evening of November 19, 2024, a scheduled deployment introduced changes to a particular service related to new project and task-related features. This service also manages custom fields, certain application pages, and interfaces with Open API requests concerning custom fields.

    A specific database query, in certain instances, returned more data than necessary. Although these queries executed within acceptable time limits, the excessive amount of data returned created a bottleneck for other queries requesting similar data. While it can sometimes be acceptable for queries to return significant amounts of data, this query was intended to return smaller amounts, as it was being run frequently across multiple distributed services. This led to a backlog of requests, which in turn created pressure on messages, events, and other services reliant on the same data. This pressure caused the slowness and inconsistencies experienced during the incident.


    Resolution:

    The issue was resolved by identifying the cases where more data was being requested than intended and modifying those queries to only return the necessary data. Once this adjustment was made, queries executed efficiently, and the backlog of messages, events, and services was processed, allowing the system to return to normal operations.
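The RCA does not include the query itself, but the fix it describes — returning only the rows and columns a caller actually needs rather than the full data set — can be sketched against a hypothetical custom-fields table (the schema, table, and function names below are illustrative, not GUIDEcx's actual code):

```python
import sqlite3

# In-memory stand-in for a custom-fields store (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE custom_fields (id INTEGER, project_id INTEGER, name TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO custom_fields VALUES (?, ?, ?, ?)",
    [(i, i % 100, f"field_{i}", "x" * 50) for i in range(1000)],
)

def fields_broad():
    # Before: fetch every row, then filter in application code.
    # Run frequently across distributed services, this moves far
    # more data than needed and starves similar queries.
    rows = conn.execute("SELECT * FROM custom_fields").fetchall()
    return [r for r in rows if r[1] == 7]

def fields_narrow(project_id):
    # After: the database returns only the rows and columns needed.
    return conn.execute(
        "SELECT id, name, value FROM custom_fields WHERE project_id = ?",
        (project_id,),
    ).fetchall()

# Same answer either way, but the narrow query transfers ~1% of the data.
assert len(fields_broad()) == len(fields_narrow(7)) == 10
```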


    Incident Timeline (in EST):

    • 6:30 PM, November 19: Code released for feature updates. Services appeared healthy and scaled appropriately.

    • 6:48 PM, November 19: Automated monitoring detects minor instability, but the incident resolves itself before a support representative can investigate further.

    • 8:00 AM, November 20: Engineering support discovers errors related to requests failing to connect to services. A backlog of requests causes some services to restart, resulting in intermittent failures.

    • 10:52 AM, November 20: Support receives reports of API errors (later identified as timeouts), incomplete page loads, and missing custom fields on projects. Open API traffic spikes during this time (an expected occurrence, but it increases the frequency of the problematic query execution).

    • 12:24 PM, November 20: The root cause is identified, and a resolution is implemented.

    • 1:54 PM, November 20: The incident is fully resolved. System back pressure is processed successfully, and operations return to normal.


    Additional Notes:

    Three specific actions are being implemented to mitigate similar issues in the future:

    1. Automated alerts will flag queries that return larger-than-expected payloads to clients. Use cases will define reasonable maximum thresholds for data loads.

    2. Automated alerts will detect system backlogs related to messaging. Use cases will define reasonable maximum thresholds for such backlogs.

    3. APIs associated with AWS EKS services will require filterable criteria for database queries unless explicitly defined otherwise.
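The first two mitigations amount to threshold-based alerting on payload size and backlog depth. A minimal sketch of that check (thresholds, metric names, and functions here are illustrative assumptions, not GUIDEcx's actual monitoring configuration, which would normally live in a metrics pipeline rather than inline code):

```python
# Assumed per-use-case ceilings; real values would be defined per use case,
# as the notes above describe.
MAX_PAYLOAD_BYTES = 1_000_000
MAX_BACKLOG_DEPTH = 5_000

def check_payload(query_name: str, payload_bytes: int) -> list[str]:
    """Flag queries returning larger-than-expected payloads (mitigation 1)."""
    if payload_bytes > MAX_PAYLOAD_BYTES:
        return [f"ALERT: {query_name} returned {payload_bytes} bytes "
                f"(limit {MAX_PAYLOAD_BYTES})"]
    return []

def check_backlog(queue_name: str, depth: int) -> list[str]:
    """Flag messaging backlogs past a defined ceiling (mitigation 2)."""
    if depth > MAX_BACKLOG_DEPTH:
        return [f"ALERT: {queue_name} backlog at {depth} messages "
                f"(limit {MAX_BACKLOG_DEPTH})"]
    return []

assert check_payload("custom_fields_query", 2_500_000)  # oversized -> alert
assert not check_backlog("events", 120)                 # healthy -> silent
```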

  • Resolved

    This incident has been resolved. Our team will continue to monitor the situation, but users are no longer receiving API errors, and the Team page and custom fields are loading properly.

  • Identified

    The team has identified a fix and has started implementing it. We will continue to provide updates as progress is made.

  • Update

    We are still investigating this incident. Our monitoring tools have surfaced some insights, and we are chasing down a few theories. We also understand that the issue is not limited to work done via the API: it is leaking into the UI experience as well, including issues with custom fields and the new Team page.

  • Investigating
    We are currently investigating this incident.

Oct 2024

Core API Services Unavailable
  • Update

    RCA

    Summary:

    The GUIDEcx API services run on AWS ECS Clusters, supported by Auto Scaling Group (ASG) configurations for managing EC2 instances and ECS task placement. On October 8, 2024, the engineering team noticed that the ECS Cluster was not provisioning or de-provisioning EC2 instances correctly, which could potentially impact the ability of API services to scale according to demand. After refreshing the EC2 instances for the primary cluster, services appeared healthy and were scaling as expected.

    However, around 12:00 AM EDT, the clusters returned to an unhealthy state, causing ECS to remove EC2 instances that had active services running on them. This issue compounded as the cluster's unhealthy state prevented GUIDEcx services from automatically scaling back up, leaving all services stuck in a "Pending" status and resulting in a system-wide outage of GUIDEcx API services.

    Resolution:

    The issue was resolved by refreshing the EC2 instances in the ASG and ensuring that “Scale In Protection” was being applied correctly to new instances. This allowed the "Pending" ECS Cluster apps to start on the newly restarted instances.
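As a rough illustration of why "Scale In Protection" matters here: when an ASG scales in, instances without protection are eligible for termination even if they are still running tasks. The simplified simulation below is not actual AWS behavior (which is configured via the ASG's `ProtectedFromScaleIn` flag and ECS managed termination protection); the instance IDs and structure are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    running_tasks: int
    scale_in_protected: bool

def scale_in_candidates(instances):
    """Instances an ASG may terminate on scale-in: unprotected ones only.

    The incident arose because protection was not applied correctly to
    new instances, so instances with active services were eligible here.
    """
    return [i for i in instances if not i.scale_in_protected]

fleet = [
    Instance("i-aaa", running_tasks=4, scale_in_protected=True),
    Instance("i-bbb", running_tasks=3, scale_in_protected=False),  # busy but unprotected
    Instance("i-ccc", running_tasks=0, scale_in_protected=False),  # idle, safe to remove
]

# Without protection applied correctly, i-bbb (running live services)
# is terminated along with the idle instance.
doomed = scale_in_candidates(fleet)
assert [i.instance_id for i in doomed] == ["i-bbb", "i-ccc"]
```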

    Incident Timeline (in EDT):

    • 11:45 PM, October 8: Configuration changes were made to the ASG to address down-scaling issues.

    • 12:00 AM, October 9: Automated monitoring first detected instability.

    • 3:00 AM, October 9: Incident was first reported by customers, but most systems were still functioning.

    • 8:00 AM, October 9: A scaling event caused ECS to remove EC2 instances running primary services, leading to a complete outage as ECS services had zero running tasks.

    • 9:45 AM, October 9: The cause was identified, and the resolution was implemented.

    • 10:00 AM, October 9: Incident fully resolved.

    Additional Notes:

    Our automated monitoring and alerting system detected the initial signs of instability at 12:00 AM EDT. However, the team's response to these alerts did not follow our standard incident response program, leading to a delayed resolution. As a result, we have reinforced training for on-call engineers and improved our escalation policies to ensure timely responses to these automated alerts in the future.

    Additionally, since the auto-scaling configurations for these ECS Clusters have not been modified in over 18 months, we believe the root cause of this unhealthy behavior is vendor-related, and we are currently engaging with the vendor to ensure this issue does not recur.

  • Resolved

    The issue is fully resolved.

  • Monitoring

    A fix has been applied, and we are monitoring to ensure the system remains stable. Access has been restored for all services.

  • Identified

    AWS ECS Cluster is not auto-provisioning new instances, preventing API services from autoscaling.

  • Investigating

    We are investigating this incident where core API services are unavailable in all regions. This is resulting in outages for the web application and integrations.

Users unable to access projects
  • Update

    Root Cause Analysis

    Issue Summary:

    On the morning of October 4, 2024, at 10:00 AM MST, an error spike occurred in project plan loading following the release of an improved database view designed to speed up project statistics loading and improve overall database performance. The issue was resolved by 10:40 AM MST.

    Root Cause:

    The new database view was incompatible with old pods, which caused errors when they accessed the updated view. Our automated rollback process was triggered in response to the error spike, but it only rolled back the application deployment, not the database schema. As a result, all pods continued to access the incompatible view, extending the period of disruption.

    Resolution:

    At 10:30 AM MST, we redeployed the latest version of the application, ensuring that all pods were compatible with the new database view, resolving the issue.

    Preventive Measures:

    • Improve the canary release cycle to better isolate and test database changes before full production rollouts.

    • Enhance the automated rollback process to include database schema rollbacks when necessary.
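The second measure — rolling the database schema back together with the application — can be sketched as a paired version record, so a rollback reverts both or neither and old pods never face a view they cannot read. The version numbers and structure below are illustrative, not GUIDEcx's actual deployment tooling:

```python
# Each release records both the app version and the schema version it
# requires, so a rollback restores a compatible pair (illustrative sketch).
history = []  # stack of (app_version, schema_version) releases

def deploy(app_version: str, schema_version: int):
    history.append((app_version, schema_version))

def rollback():
    """Revert to the previous release: application AND schema together.

    The incident's automated rollback reverted only the application,
    leaving pods pointed at an incompatible database view.
    """
    history.pop()
    return history[-1]

deploy("v1.41", schema_version=7)
deploy("v1.42", schema_version=8)   # the new database view ships here

app, schema = rollback()            # error spike -> roll back both
assert (app, schema) == ("v1.41", 7)  # pods and schema compatible again
```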

  • Resolved

    This issue has been resolved.

  • Monitoring

    We implemented a fix and are currently monitoring the result. The initial results are positive, and it appears that access to projects has been restored.

  • Update

    We are currently investigating this incident. The behavior we are seeing is that users access the Project page, click on a project, and rather than being directed to the Plan view, are redirected back to the Project page.

  • Investigating
    We are currently investigating this incident.
