GUIDEcx - Notice history

All systems operational

100% uptime

Workato Website - Operational

Workato Email notifications - Operational

Workbot for Teams - Operational

Workbot for Slack - Operational

Recipe runtime for job execution - Operational

Recipe Webhook ingestion - Operational

Recipe API gateway - Operational

Notice history

Jun 2025

Intermittent latency throughout the application
  • Postmortem

    Post-Mortem: Image Registry Rate Limiting (June 10, 2025)

    Summary

    Following increased system load today, users experienced delays in messaging and email processing. Image registry rate limiting prevented services from restarting properly, which caused dependent systems to fall behind in processing and resulted in delayed message and email delivery.

    Resolution

    The issues were resolved by taking two key actions:

    • Monitoring existing services - Kept existing services under close watch to ensure they continued running as performantly as possible while the underlying image pull failures were investigated.

    • Image registry migration - Converted affected services to pull images from a more reliable image registry location that does not have the same rate limiting constraints (a minimal sketch of this kind of change follows).
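
    Illustration only: the notice does not say how the migration was carried out, but a small Python sketch of repointing image references from the rate-limited registry to an internal mirror might look like the following. The registry hostnames and image name are placeholders, not values from the incident.

    # Hypothetical sketch: rewrite a service's image reference so it is
    # pulled from an internal mirror instead of the rate-limited registry.
    # Both registry hostnames below are assumed placeholder values.
    RATE_LIMITED_REGISTRY = "registry.example.com"
    MIRROR_REGISTRY = "mirror.internal.example.com"

    def remap_image(image_ref: str) -> str:
        """Return the same image reference, served from the mirror."""
        if image_ref.startswith(RATE_LIMITED_REGISTRY + "/"):
            return MIRROR_REGISTRY + image_ref[len(RATE_LIMITED_REGISTRY):]
        return image_ref

    if __name__ == "__main__":
        print(remap_image("registry.example.com/messaging/worker:1.42.0"))
        # -> mirror.internal.example.com/messaging/worker:1.42.0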

    Incident Timeline

    Time       Event
    1:15 PM    First error detected
    1:17 PM    Engineering team notified of the incident
    1:34 PM    Root cause identified as image registry rate limiting
    2:50 PM    Fix deployed to production (services migrated to new image registry)
    3:46 PM    All issues resolved, system returned to normal operation

    Root Causes

    • Image Registry Rate Limiting: The primary image registry had restrictive rate limits that prevented services from pulling server images during restart events, blocking the successful restart of services that were failing health checks.

    • Service Health Check Failures: Services were restarting due to health check failures, and the subsequent image pull failures during those restarts created a cascading effect that impacted message and email processing across the platform.

    Additional Notes

    To prevent similar issues in the future, the following measures will be implemented:

    • Image pre-caching - Implement image pre-caching on nodes to reduce dependency on external registry pulls during scaling events (a sketch of this approach follows this list).

    • Enhanced monitoring - Add specific monitoring for image pull failures and registry rate limit warnings.

    • Root cause analysis - Investigate the persistent health check failures experienced by the service.
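
    The pre-caching measure above could, for example, be approximated by pre-pulling the images a node is expected to run. The sketch below assumes a Docker-compatible CLI is available on the node; the image names are placeholders.

    # Hypothetical node-level pre-cache: pull expected images ahead of time
    # so restarts and scale-ups do not depend on the external registry.
    import subprocess

    PRE_CACHE_IMAGES = [
        "mirror.internal.example.com/messaging/worker:1.42.0",
        "mirror.internal.example.com/email/dispatcher:2.7.3",
    ]

    def pre_cache(images):
        for image in images:
            # "docker pull" is effectively a no-op if the image is already cached.
            result = subprocess.run(["docker", "pull", image],
                                    capture_output=True, text=True)
            if result.returncode != 0:
                print(f"pre-cache failed for {image}: {result.stderr.strip()}")
            else:
                print(f"cached {image}")

    if __name__ == "__main__":
        pre_cache(PRE_CACHE_IMAGES)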

  • Resolved

    The issue has been resolved and users should be able to navigate without significant lag or latency.

  • Identified

    We are continuing to work on a fix for this incident. We believe we have identified the issue and are deploying a potential fix. If that works, we will update to Monitoring status; otherwise, we will revert to Investigating status.

  • Investigating

    We are currently investigating this incident. Users are seeing slowness or errors with Custom Fields, Messages, and general navigation. Thank you for your patience; we will continue to post updates as more information becomes available.

May 2025

Intermittent latency throughout the application
  • Postmortem

    Post-Mortem: Messaging Service Incident (May 28, 2025)

    Summary

    Following the release of a new messaging feature early on May 28, 2025, users experienced general slowness in messaging-related requests. The new feature increased the number of requests hitting our gateway, triggering rate limit issues that had not been encountered before. In addition, messaging failed to load in the task drawer because of invalid customer-related feature flag configurations, and channel loading was slow due to an expensive database query.

    Resolution

    The issues were resolved by taking three key actions:

    1. Increasing resources on affected services

    2. Temporarily scaling gateway services to handle the increased request volume while implementing a permanent fix that adjusted rate limits to more reasonable values

    3. Optimizing the expensive database query to retrieve only necessary data, reducing contention on the database (a sketch of this kind of query narrowing follows)
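
    The actual query is not included in this notice, but the kind of narrowing described in step 3 can be illustrated with a small, self-contained example. The table and column names below are assumptions, not the production schema.

    # Illustrative only: fetch just the fields a channel list needs instead
    # of every column for every message, and bound the result size.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE messages (
        id INTEGER PRIMARY KEY, channel_id INTEGER,
        body TEXT, attachments BLOB, created_at TEXT)""")
    conn.executemany(
        "INSERT INTO messages (channel_id, body, attachments, created_at) VALUES (?, ?, ?, ?)",
        [(42, "hello", None, "2025-05-28T08:00:00"),
         (42, "world", None, "2025-05-28T08:05:00")])

    # Before: pulls every column, including large bodies and attachments.
    expensive = "SELECT * FROM messages WHERE channel_id = ?"

    # After: only the fields needed to render the channel row, with a limit.
    optimized = """SELECT id, created_at FROM messages
                   WHERE channel_id = ?
                   ORDER BY created_at DESC LIMIT 20"""

    print("wide row:", conn.execute(expensive, (42,)).fetchone())
    print("slim row:", conn.execute(optimized, (42,)).fetchone())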

    Incident Timeline

    Time                    Event
    2:30 AM MDT, May 28     New messaging feature released
    8:12 AM MDT, May 28     Customer reports of issues received
    8:30 AM MDT, May 28     Resources adjusted on affected services
    9:00 AM MDT, May 28     War Room initiated for coordinated response
    11:33 AM MDT, May 28    Optimized query deployed
    11:35 AM MDT, May 28    Feature flag configuration adjusted
    12:34 PM MDT, May 28    Adjusted rate limit configuration deployed
    12:50 PM MDT, May 28    All issues resolved, system returned to normal operation

    Root Causes

    • Resource Constraints: Insufficient resources allocated to select services to handle the increased load from the new messaging feature

    • Gateway Rate Limiting: Rate limits on gateway services were too restrictive, causing legitimate requests to be denied when traffic increased (a minimal illustration follows this list)

    • Inefficient Database Queries: Certain queries were retrieving excessive data, causing database contention and slowing down channel loading
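
    The gateway's actual rate limiting mechanism is not described in this notice; the token-bucket sketch below, with made-up numbers, simply illustrates how a limit tuned for pre-launch traffic starts denying legitimate requests once volume increases, and how raising the limit removes those denials.

    # Illustrative token-bucket limiter; rates and traffic are invented.
    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: float):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = burst

        def allow(self, elapsed: float) -> bool:
            # Refill based on elapsed time, then spend one token if available.
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    def denied(limit_per_sec: float, requests_per_sec: int, seconds: int) -> int:
        bucket = TokenBucket(limit_per_sec, burst=limit_per_sec)
        step = 1.0 / requests_per_sec
        return sum(0 if bucket.allow(step) else 1
                   for _ in range(requests_per_sec * seconds))

    # Traffic roughly doubles after the feature launch: 100 req/s vs a 60 req/s limit.
    print("denied at the old limit:", denied(60, 100, 10))    # many denials
    print("denied at a raised limit:", denied(150, 100, 10))  # none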

    Additional Notes

    To prevent similar issues in the future, we will be looking into the following:

    • Proactive monitoring for rate limit thresholds, especially during new feature releases

    • Load testing with realistic traffic patterns prior to major feature releases

    • Database query optimization reviews as part of the deployment checklist

    • Automated scaling policies for critical gateway services

    • Enhanced existing monitoring to include coverage for areas primarily affected by the incident, such as messaging request latency and error rates (a monitoring sketch follows this list)
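
    One possible shape for the rate-limit monitoring called out above is sketched below; the threshold, window, and status-code convention are assumptions, not values from this incident.

    # Hypothetical check: alert when more than 1% of recent gateway
    # responses in a 5-minute window were HTTP 429 (rate limited).
    from collections import deque
    import time

    WINDOW_SECONDS = 300
    ALERT_RATIO = 0.01

    class RateLimitMonitor:
        def __init__(self):
            self.events = deque()  # (timestamp, was_rate_limited)

        def record(self, status_code: int) -> None:
            now = time.time()
            self.events.append((now, status_code == 429))
            while self.events and now - self.events[0][0] > WINDOW_SECONDS:
                self.events.popleft()

        def should_alert(self) -> bool:
            if not self.events:
                return False
            limited = sum(1 for _, flagged in self.events if flagged)
            return limited / len(self.events) > ALERT_RATIO

    monitor = RateLimitMonitor()
    for status in [200, 200, 429, 200, 429, 429]:
        monitor.record(status)
    print("alert" if monitor.should_alert() else "ok")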

  • Resolved

    This incident has been resolved. Thank you for your patience. We will review what caused the issue and will share our learnings and the steps we have taken to help prevent future disruptions. Please reach out to Support if you experience any further latency.

  • Monitoring

    We implemented a fix and are currently monitoring the result. Navigation, project creation, task creation, and other common actions are all seeing improved speed.

  • Update

    Our team is actively investigating the issue impacting performance. We've identified a likely root cause and are preparing a fix that will be deployed shortly. We’re closely monitoring the situation and will provide updates as soon as more information becomes available. Thank you for your patience as we work to resolve this quickly and thoroughly.

  • Update

    We are currently investigating this incident. Again, we have a few ideas about a potential root cause, but haven't identified the culprit yet.

  • Investigating

    We are currently investigating this incident. We have already boosted resources to reduce lag, but still haven't found the root issue. We have a couple of leads we are still investigating.
