Post-Mortem: Access-Audit Service MongoDB Connection Incident (September 16, 2025)
Summary
On September 16, 2025, the access-audit service experienced widespread connection failures to MongoDB Atlas, causing major latency in the login flow and service disruptions. MongoDB was responding too slowly, so requests sent to it timed out. MongoDB was making infrastructure changes at the time that affected us and other clients. The issue was resolved by deploying a workaround that handles MongoDB timeouts more gracefully.
Resolution
The issue was resolved by:
Deploying a change to the access-audit service to gracefully handle the timeouts.
Incident Timeline
Time (MDT) | Date | Status |
4:34 PM | Sep 16 | Automated alerts notified Engineering of unexpected latency. Engineers also posted in the engineering channel to alert others. |
4:47 PM | Sep 16 | Support raised the alarm in the Product Support channel that customers were being impacted. |
4:53 PM | Sep 16 | Engineering assembled in a war room and told the org that an investigation was underway. |
5:01 PM | Sep 16 | The situation was declared an incident and the status page was updated to show that an investigation was underway. |
5:18 PM | Sep 16 | Engineering was still actively investigating; root cause not yet identified. |
5:34 PM | Sep 16 | Login access was restored with a ~1 minute delay; users could navigate normally once logged in. |
6:08 PM | Sep 16 | Fix implemented and deployed; users could log in without delay. |
6:14 PM | Sep 16 | Incident resolved; login flow restored to normal operation. |
Root Causes
Observed Evidence:
Contributing Factors:
The service did not gracefully handle MongoDB connection timeouts, causing complete service failures instead of degraded operation.
Authentication endpoints awaited the response from audit calls, even though the outcome of an audit call (success or error) had no bearing on whether the user could log in. The fix stops awaiting that response, so login, logout, and other requests (e.g., SendEmail) continue processing regardless of the audit's response.
Additional Notes
To prevent similar issues in the future, we will:
Implement better timeout handling and retry logic for MongoDB connections.
Add better logging to indicate connection issues.
Adjust the Mongo code to follow our existing database connection processes.
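The timeout, retry, and logging items above could take roughly the following shape. This is a driver-agnostic sketch rather than the planned implementation: `call_with_timeout_and_retry` and its parameters (`attempts`, `timeout_s`, `base_delay_s`) are hypothetical, the values are placeholders, and a real MongoDB driver's own timeout settings would normally be configured alongside a wrapper like this.

```python
import asyncio
import logging

logger = logging.getLogger("access_audit_sketch")


async def call_with_timeout_and_retry(
    operation, *, attempts=3, timeout_s=2.0, base_delay_s=0.1
):
    """Run an async operation with a per-attempt timeout and exponential backoff.

    `operation` is a zero-argument callable returning an awaitable,
    for example a function that performs one database call.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # Bound each attempt so a hung connection cannot stall the caller.
            return await asyncio.wait_for(operation(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Log every failure so connection issues are visible early.
            logger.warning(
                "database call failed (attempt %d/%d): %r", attempt, attempts, exc
            )
            if attempt < attempts:
                # Back off: base, 2x base, 4x base, ...
                await asyncio.sleep(base_delay_s * 2 ** (attempt - 1))
    raise last_error
```

A caller would wrap a single database call, for example `await call_with_timeout_and_retry(lambda: save_event(event))`, where `save_event` is whatever function performs the write. Only timeouts and connection errors are retried here; other exceptions propagate immediately, since retrying them is unlikely to help.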