Notice history

Postmortem

September 17, 2025 at 11:25 PM

Post-Mortem: Service Provider (Atlas) MongoDB Connection Incident (September 16, 2025)

Summary

On September 16, 2025, the access-audit service experienced widespread connection failures to MongoDB Atlas, causing major latency in the login flow and service disruptions. MongoDB was taking too long to respond, so requests sent to it timed out. Atlas was making infrastructure changes that affected us and other customers. We resolved the issue by deploying a workaround that handles MongoDB timeouts more gracefully.

Resolution

The issue was resolved by:

Deploying a change to the access-audit service to gracefully handle the timeouts.

Incident Timeline

Time (MDT)	Date	Status
4:34 PM	Sep 16	Automated alerts notified Engineering of unexpected latency. Engineers also posted in the engineering channel to alert others.
4:47 PM	Sep 16	Support raised an alarm in the Product Support channel that customers were being impacted.
4:53 PM	Sep 16	Engineering assembled in a war room and told the org that they were investigating.
5:01 PM	Sep 16	The situation was declared an incident, and the status page was updated to show that an investigation was underway.
5:18 PM	Sep 16	Engineering was actively investigating; the root cause had not yet been identified.
5:34 PM	Sep 16	Login access was restored with a delay of about 1 minute; users could navigate normally once logged in.
6:08 PM	Sep 16	The fix was implemented and deployed; users could log in without delay.
6:14 PM	Sep 16	The incident was resolved and the login flow returned to normal operation.

Root Causes

Atlas experienced an issue while implementing a feature flag for serverless MongoDB databases, which caused latency for us and for other Atlas customers.

Observed Evidence:

Contributing Factors:

The service did not gracefully handle MongoDB connection timeouts, causing complete service failures instead of degraded operation.
Authentication endpoints waited for the response from audit calls, even though the outcome of an audit call (success or error) did not affect whether a user could log in. The fix stopped awaiting that response, so login, logout, and other requests (e.g., SendEmail) now continue processing regardless of the audit's response.

Additional Notes

To prevent similar issues in the future, we will be:

Implementing better timeout handling and retry logic for MongoDB connections.
Adding better logging to indicate connection issues.
Adjusting our MongoDB code to follow our existing database connection processes.
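As a rough illustration of the first two follow-up items, the sketch below wraps a database call in a per-attempt timeout, retries with exponential backoff, and logs each failure. It is driver-agnostic and hypothetical: `call_with_retry`, its parameters, and the default values are assumptions, not the service's planned implementation.

```python
import asyncio
import logging

logger = logging.getLogger("mongo_connection")


async def call_with_retry(operation, *, attempts=3, timeout=1.0, base_delay=0.1):
    """Run `operation()` with a per-attempt timeout and exponential backoff.

    `operation` is any zero-argument coroutine function, for example a
    wrapper around a database query (hypothetical in this sketch).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return await asyncio.wait_for(operation(), timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Log every failure so connection issues are visible early.
            logger.warning(
                "database call failed (attempt %d/%d): %r", attempt, attempts, exc
            )
            if attempt < attempts:
                # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

Retries help with brief blips but add load during a sustained outage, so the attempt count and total time budget should stay small on user-facing paths such as login.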

Resolved

September 17, 2025 at 12:14 AM

This incident has been resolved. Thank you for your patience as we worked to restore the login flow.

Monitoring

September 17, 2025 at 12:08 AM

We implemented a fix and are currently monitoring the result. The fix has been deployed and users should be able to log in without any delay. Users who were already logged in during the degraded performance did not experience any additional slowness.

Update

September 16, 2025 at 11:34 PM

We are still investigating this incident. Login access has been restored, although it can take up to a minute or so to log in. Once logged in, users can navigate the app as expected at normal speeds. Users should be able to log in now, but will experience a minor delay after entering their password.

Update

September 16, 2025 at 11:18 PM

We are currently investigating this incident. This is our top priority right now, and our engineering team is actively investigating potential solutions. A root cause has yet to be identified.

Investigating

September 16, 2025 at 11:01 PM

We are currently investigating this incident.

GUIDEcx - Notice history