Affected
Major outage from 11:01 PM to 11:34 PM, Degraded performance from 11:34 PM to 12:14 AM
- Postmortem
Post-Mortem: Service Provider (Atlas) MongoDB Connection Incident (September 16, 2025)
Summary
On September 16, 2025, the access-audit service experienced widespread connection failures to MongoDB Atlas, causing severe latency in the login flow and service disruptions. MongoDB was taking too long to respond, so requests sent to it timed out. MongoDB Atlas was making infrastructure changes that affected us and other customers. We resolved the issue by deploying a workaround that handles MongoDB timeouts more gracefully.
Resolution
The issue was resolved by:
Deploying a change to the access-audit service to gracefully handle the timeouts.
Incident Timeline
| Time (MDT) | Date | Status |
| --- | --- | --- |
| 4:34 PM | Sep 16 | Automated alerts notified Engineering of unexpected latency. Posts were also made to the engineering channel to alert other engineers. |
| 4:47 PM | Sep 16 | Support raised the alarm in the Product Support channel that they were seeing customer impact. |
| 4:53 PM | Sep 16 | Engineering assembled in a war room and communicated to the org that they were investigating. |
| 5:01 PM | Sep 16 | The situation was deemed an incident and the status page was updated to indicate that an investigation was underway. |
| 5:18 PM | Sep 16 | Engineering team actively investigating; root cause not yet identified. |
| 5:34 PM | Sep 16 | Login access restored with a ~1 minute delay; users could navigate normally once logged in. |
| 6:08 PM | Sep 16 | Fix implemented and deployed; users should be able to log in without delay. |
| 6:14 PM | Sep 16 | Incident resolved; login flow restored to normal operation. |
Root Causes
Atlas experienced an issue when implementing a feature flag for serverless MongoDB databases, causing latency for us and for other Atlas customers.
Observed Evidence:
Contributing Factors:
The access-audit service did not gracefully handle MongoDB connection timeouts, causing complete service failures instead of degraded operation.
Authentication endpoints waited for the response from audit calls, even though neither a successful audit nor an audit error affects whether a user can log in. The fix stops awaiting that response, so login, logout, and other requests (e.g., SendEmail) continue processing regardless of the audit's outcome.
Additional Notes
To prevent similar issues in the future, we will be:
Implementing better timeout handling and retry logic for MongoDB connections.
Adding better logging to indicate connection issues.
Adjusting our MongoDB code to follow our existing database connection processes.
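As a rough illustration of what timeout handling, retry logic, and connection logging could look like together, here is a minimal sketch. It assumes an asyncio-style Python service; the helper name `call_with_timeout_and_retry` and its default values are hypothetical, not a description of the changes we will ship.

```python
import asyncio
import logging

logger = logging.getLogger("access_audit.db")


async def call_with_timeout_and_retry(
    operation, *, timeout_s=2.0, attempts=3, base_delay_s=0.1
):
    """Run `operation` (a zero-argument coroutine function) with a
    per-attempt timeout, retrying with exponential backoff.

    Re-raises the last error once all attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(operation(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # Log every failed attempt so connection issues are visible.
            logger.warning(
                "db call failed (attempt %d/%d): %r", attempt, attempts, exc
            )
            if attempt == attempts:
                raise
            # Back off before retrying: base, 2x base, 4x base, ...
            await asyncio.sleep(base_delay_s * 2 ** (attempt - 1))
```

Bounding each attempt keeps a slow dependency from holding a request open indefinitely, and the backoff avoids adding load to a database that is already struggling.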
- Resolved
This incident has been resolved. Thank you for your patience as we worked to restore the login flow.
- Monitoring
We implemented a fix and are currently monitoring the result. The fix has been deployed and users should be able to log in without any delay. Users who were already logged in during the degraded performance did not experience any additional slowness while logged in.
- Update
We are still investigating this incident. Login access has been restored, although it can take up to a minute or so to log in. Once logged in, users can navigate the app as expected at normal speeds. Users should be able to log in now, but will experience a minor delay after entering their password.
- Update
We are currently investigating this incident. This is our top priority right now and our engineering team is actively investigating potential solutions. A root cause has yet to be identified.
- Investigating
We are currently investigating this incident.