GUIDEcx - Notice history

All systems operational

100% - uptime

Workato Website - Operational

Workato Email notifications - Operational

Workbot for Teams - Operational

Workbot for Slack - Operational

Recipe runtime for job execution - Operational

Recipe Webhook ingestion - Operational

Recipe API gateway - Operational

Notice history

Sep 2025

Issues with logging in affecting several users.
  • Postmortem

    Post-Mortem: Access-Audit Service MongoDB Connection Incident (September 16, 2025)


    Summary

    On September 16, 2025, the access-audit service experienced widespread connection failures to MongoDB Atlas, which caused major latency in the login flow and service disruptions. MongoDB was taking too long to respond, so requests sent to it timed out. MongoDB was making infrastructure changes at the time that affected us and other customers. We resolved the issue by deploying a workaround that handles MongoDB timeouts more gracefully.


    Resolution

    The issue was resolved by deploying a change to the access-audit service that gracefully handles the timeouts.



    Incident Timeline

    Time (MDT)  Date    Status
    4:34 PM     Sep 16  Engineering was alerted through our automated alerts that we were seeing unexpected latency. Posts were also made to the engineering channel to alert other engineers.
    4:47 PM     Sep 16  Support raises alarm that they are seeing customer impact in the Product Support channel.
    4:53 PM     Sep 16  Engineering assembles in a war room and communicates to the org they are investigating.
    5:01 PM     Sep 16  The situation is deemed an incident and the status page is updated to indicate investigation is underway.
    5:18 PM     Sep 16  Engineering team actively investigating, root cause not yet identified.
    5:34 PM     Sep 16  Login access restored with ~1 minute delay, users can navigate normally once logged in.
    6:08 PM     Sep 16  Fix implemented and deployed, users should be able to log in without delay.
    6:14 PM     Sep 16  Incident resolved, login flow restored to normal operation.


    Root Causes

    • MongoDB experienced an issue while implementing a feature flag for serverless databases, which caused latency for us and for other MongoDB customers.


    Observed Evidence:

    Contributing Factors:

    • The access-audit service did not gracefully handle MongoDB connection timeouts, causing complete service failures instead of degraded operation.

    • Authentication endpoints awaited the response from audit calls, even though neither a successful audit nor an audit error affected whether the user could log in. The fix stops awaiting that response, so login, logout, and other requests (e.g., SendEmail) continue processing regardless of the audit's outcome.


    Additional Notes

    To prevent similar issues in the future, we will:

    • Implement better timeout handling and retry logic for MongoDB connections.

    • Add better logging to indicate connection issues.

    • Adjust our Mongo code to follow our existing database connection processes.
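
    As a rough sketch of what timeout handling, retry logic, and connection logging could look like together, the helper below wraps any async operation with a per-attempt timeout, exponential backoff, and a warning log on each failure. The helper name, its parameters, and the default values are hypothetical and are not taken from the planned changes.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("mongo_retry_sketch")


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    timeout: float = 1.0,
    base_delay: float = 0.1,
) -> T:
    """Run `operation` with a per-attempt timeout, retrying with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Log every failed attempt so connection issues are visible.
            logger.warning("attempt %d/%d failed: %r", attempt, attempts, exc)
            if attempt < attempts:
                # Back off: base_delay, then 2x, then 4x, and so on.
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ConnectionError(f"operation failed after {attempts} attempts") from last_error
```

    A caller would pass a zero-argument coroutine function that performs the database call, for example `await call_with_retry(lambda: fetch_record(record_id))`, where `fetch_record` is a placeholder for whatever query is being protected.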

  • Resolved

    This incident has been resolved. Thank you for your patience while we restored the login flow.

  • Monitoring

    We implemented a fix and are currently monitoring the result. The fix has been deployed, and users should be able to log in without any delay. Users who were already logged in during the degraded performance did not experience any additional slowness.

  • Update

    We are still investigating this incident. Login access has been restored, although it can take up to a minute or so to log in. Once logged in, users can navigate the app as expected at normal speeds. Users should be able to log in now, but will experience a minor delay after entering their password.

  • Update

    We are currently investigating this incident. This is our top priority right now, and our engineering team is actively investigating potential solutions. A root cause has yet to be identified.

  • Investigating

    We are currently investigating this incident.

Aug 2025

No notices reported this month

Jul 2025

No notices reported this month

Jul 2025 to Sep 2025
