GUIDEcx - Notice history

All systems operational

100% - uptime

Workato Website - Operational

Workato Email notifications - Operational

Workbot for Teams - Operational

Workbot for Slack - Operational

Recipe runtime for job execution - Operational

Recipe Webhook ingestion - Operational

Recipe API gateway - Operational

Notice history

Oct 2024

Core API Services Unavailable
  • Resolved
    Update

    RCA

    Summary:

    The GUIDEcx API services run on AWS ECS Clusters, supported by Auto Scaling Group (ASG) configurations for managing EC2 instances and ECS task placement. On October 8, 2024, the engineering team noticed that the ECS Cluster was not provisioning or de-provisioning EC2 instances correctly, which could potentially impact the ability of API services to scale according to demand. After refreshing the EC2 instances for the primary cluster, services appeared healthy and were scaling as expected.

    However, around 12:00 AM EDT, the clusters returned to an unhealthy state, causing ECS to remove EC2 instances that had active services running on them. This issue compounded as the cluster's unhealthy state prevented GUIDEcx services from automatically scaling back up, leaving all services stuck in a "Pending" status and resulting in a system-wide outage of GUIDEcx API services.

    Resolution:

    The issue was resolved by refreshing the EC2 instances in the ASG and ensuring that “Scale In Protection” was being applied correctly to new instances. This allowed the "Pending" ECS Cluster apps to start on the newly restarted instances.
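
    The "Scale In Protection" check described above can be sketched as follows. This is a minimal illustration, not the tooling GUIDEcx used: the helper names `unprotected_instances` and `protect_all` are hypothetical, the `asg` dictionary is assumed to have the shape of one entry in the `AutoScalingGroups` list returned by the EC2 Auto Scaling `DescribeAutoScalingGroups` API, and `client` is assumed to behave like `boto3.client("autoscaling")`.

```python
from typing import Any, Dict, List


def unprotected_instances(asg: Dict[str, Any]) -> List[str]:
    """Return IDs of in-service instances that lack scale-in protection.

    `asg` is assumed to be one entry of the "AutoScalingGroups" list
    returned by DescribeAutoScalingGroups.
    """
    return [
        inst["InstanceId"]
        for inst in asg.get("Instances", [])
        if inst.get("LifecycleState") == "InService"
        and not inst.get("ProtectedFromScaleIn", False)
    ]


def protect_all(client: Any, asg: Dict[str, Any], batch_size: int = 50) -> List[str]:
    """Enable scale-in protection on every unprotected in-service instance.

    `client` is assumed to expose set_instance_protection like the boto3
    Auto Scaling client. `batch_size` is a conservative assumption; check
    the API's per-call limit before relying on it.
    """
    ids = unprotected_instances(asg)
    for start in range(0, len(ids), batch_size):
        client.set_instance_protection(
            AutoScalingGroupName=asg["AutoScalingGroupName"],
            InstanceIds=ids[start:start + batch_size],
            ProtectedFromScaleIn=True,
        )
    return ids
```

    With real AWS access, `asg` would typically come from `client.describe_auto_scaling_groups(AutoScalingGroupNames=[...])["AutoScalingGroups"][0]`. Protecting existing instances is separate from the group-level setting that protects newly launched instances; this notice does not say which mechanism was adjusted.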

    Incident Timeline (in EDT):

    • 11:45 PM, October 8: Configuration changes were made to the ASG to address down-scaling issues.

    • 12:00 AM, October 9: Automated monitoring first detected instability.

    • 3:00 AM, October 9: Incident was first reported by customers, but most systems were still functioning.

    • 8:00 AM, October 9: A scaling event caused ECS to remove EC2 instances running primary services, leading to a complete outage as ECS services had zero running tasks.

    • 9:45 AM, October 9: The cause was identified, and the resolution was implemented.

    • 10:00 AM, October 9: Incident fully resolved.

    Additional Notes:

    Our automated monitoring and alerting system detected the initial signs of instability at 12:00 AM EDT. However, the team's response to these alerts did not follow our standard incident response program, which delayed the resolution. As a result, we have reinforced training for on-call engineers and improved our escalation policies to ensure timely responses to these automated alerts in the future.

    Additionally, since the auto-scaling configurations for these ECS Clusters have not been modified in over 18 months, we believe the root cause of this unhealthy behavior is vendor-related, and we are currently engaging with the vendor to ensure this issue does not recur.

  • Resolved
    Resolved

    Issue is fully resolved

  • Monitoring
    Monitoring

    Fix has been applied. We are monitoring to ensure it stays stable. Access is up for all services.

  • Identified
    Identified

    AWS ECS Cluster is not auto-provisioning new instances, preventing API services from autoscaling.

  • Investigating
    Investigating

    We are investigating this incident where core API services are unavailable in all regions. This is resulting in outages for the web application and integrations.

Users unable to access projects
  • Resolved
    Update

    Root Cause Analysis

    Issue Summary:

    On the morning of October 4, 2024, at 10:00 AM MST, an error spike occurred on project plan loading after the release of an improved database view designed to enhance project statistics load time and overall database performance. The issue was resolved by 10:40 AM MST.

    Root Cause:

    The new database view was incompatible with old pods, which caused errors when they accessed the updated view. Our automated rollback process was triggered in response to the error spike, but it only rolled back the application deployment, not the database schema. As a result, all pods continued to access the incompatible view, extending the period of disruption.

    Resolution:

    At 10:30 AM MST, we redeployed the latest version of the application, ensuring that all pods were compatible with the new database view, resolving the issue.

    Preventive Measures:

    • Improve the canary release cycle to better isolate and test database changes before full production rollouts.

    • Enhance the automated rollback process to include database schema rollbacks when necessary.
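
    One way to picture a rollback that covers the database schema as well as the application is sketched below. It is an illustration under stated assumptions, not GUIDEcx's rollback process: the `Migration` and `Release` types, the `rollback_plan` function, and the step wording are all hypothetical. The sketch reverts schema changes first so that the previous application version starts against a compatible view; the opposite order is also defensible and depends on which version can tolerate the mismatch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Migration:
    """A schema change shipped with a release (hypothetical type)."""
    name: str
    down_sql: str  # statement that restores the previous schema state


@dataclass
class Release:
    """An application version plus the schema changes it introduced."""
    app_version: str
    migrations: List[Migration] = field(default_factory=list)


def rollback_plan(current: Release, previous: Release) -> List[str]:
    """List the steps to return from `current` to `previous`.

    Schema changes are undone in reverse order of application, and only
    then is the previous application version redeployed.
    """
    steps = [
        f"revert schema change {m.name}: {m.down_sql}"
        for m in reversed(current.migrations)
    ]
    steps.append(f"deploy application {previous.app_version}")
    return steps
```

    A release with no migrations yields a single deploy step, which matches an application-only rollback. A release that changed a view yields the view revert followed by the deploy.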

  • Resolved
    Resolved

    This issue has been resolved.

  • Monitoring
    Monitoring

    We implemented a fix and are currently monitoring the result. The initial results are positive and it appears that access to projects has been restored.

  • Investigating
    Update

    We are currently investigating this incident. Users who open the Project page and click on a project are redirected back to the Project page instead of being taken to the Plan view.

  • Investigating
    Investigating

    We are currently investigating this incident.

Sep 2024

No notices reported this month

Aug 2024

Login Instability Issues
  • Resolved
    Update

    Summary

    During a routine upgrade of our system infrastructure, we encountered an issue related to the rate-limiting of image downloads from an external service. This rate limit disrupted the startup of essential services, leading to a temporary outage that affected the availability of certain features.

    What Happened

    The issue occurred during the upgrade process when the rapid and simultaneous restarting of multiple system components led to a higher-than-usual number of download requests within a short time frame. This exceeded the limits set by the external service provider, disrupting the startup of critical services.
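
    The effect described above can be illustrated with a small sliding-window count. This is a sketch under assumed numbers: the window length, the limit, and the restart timings are hypothetical and are not taken from the external provider's actual policy. The helper names `peak_pulls` and `exceeds_limit` are also hypothetical.

```python
from typing import List


def peak_pulls(pull_times: List[float], window_s: float) -> int:
    """Largest number of pulls inside any half-open window of `window_s` seconds."""
    times = sorted(pull_times)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Slide the left edge forward until the window spans less than window_s.
        while t - times[start] >= window_s:
            start += 1
        best = max(best, end - start + 1)
    return best


def exceeds_limit(pull_times: List[float], window_s: float, limit: int) -> bool:
    """True if any window of `window_s` seconds contains more than `limit` pulls."""
    return peak_pulls(pull_times, window_s) > limit


# Hypothetical scenario: 30 components each pull one image.
burst = [i * 2.0 for i in range(30)]       # all restart within one minute
spread = [i * 300.0 for i in range(30)]    # one restart every five minutes

print(exceeds_limit(burst, window_s=3600, limit=20))   # True
print(exceeds_limit(spread, window_s=3600, limit=20))  # False
```

    With the same number of pulls, the burst places all 30 inside one hour-long window, while the spread-out schedule never places more than 12 in any window. The spread-out schedule is shown only as a comparison.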

    How We Fixed It

    • Enhanced Access: At 3:10 PM ET, we upgraded our access to the external service, allowing for higher download limits. A new access credential was created and applied, which allowed the impacted services to restart successfully.

    • Configuration Update: We updated our system configurations to ensure more reliable access to required components in the future, reducing the likelihood of similar issues.

    What We've Done to Prevent This in the Future

    To prevent this from happening again, we took several steps:

    • Image Caching & Version Control: We implemented changes to cache frequently used components and pin them to specific versions, reducing the need to download them from external sources repeatedly and avoiding future rate limits.

    • Upgraded Service Plan: We upgraded our plan with the external service provider to a higher tier, increasing our allowed download capacity and providing more robust support for future operations.

  • Resolved
    Resolved

    The login fix was successful. Monitoring has proven stable. All access is restored.

  • Monitoring
    Monitoring

    We implemented a fix and are currently monitoring the result.

  • Identified
    Identified

    Diagnosis is complete. We are working on solving the main login issue now.

  • Investigating
    Investigating

    We are currently investigating this incident that's causing intermittent login issues.

System unavailable
  • Resolved
    Resolved

    Fix has been applied by Vercel. Access is consistently restored. All systems operational.

  • Monitoring
    Update

    Access is back again. We will monitor to ensure it remains stable.

  • Monitoring
    Update

    Vercel has identified the deeper root cause on their end and is working on resolving the issue. Thousands of sites and systems around the world are affected by the same issue, so we expect it to be resolved within the next few minutes.

  • Monitoring
    Update

    Vercel status has been integrated into our status page for the top section called "Web Application". Real time updates on their progress fixing the hosting of the front end interface can be monitored in the "Edge Functions" and "Edge Middleware" components.

  • Monitoring
    Update

    Vercel is volatile right now and access is temporarily lost again. We are continuing to escalate with them.

  • Monitoring
    Monitoring

    Vercel is back up again. The web interface is available. We will monitor the status for a few more minutes.

  • Identified
    Identified

    Vercel is having an outage. GUIDEcx uses Vercel to host the front-end web interface. We are escalating with them now.

  • Investigating
    Investigating

    We are investigating unavailability of the web interface. The API and recipes are still working. No data has been lost.
