GUIDEcx - Intermittent latency throughout the application – Incident details


Intermittent latency throughout the application

Resolved
Degraded performance
Started about 1 month ago. Lasted about 3 hours.

Affected

All of the following components experienced degraded performance from 3:54 PM to 6:04 PM and were operational again from 6:04 PM to 6:50 PM:

  • Web Application
  • Project Management
  • Compass Customer Portal
  • Resource Management
  • Advanced Time Tracking
  • Report Navigator and Report Builder

Updates
  • Postmortem

    Post-Mortem: Messaging Service Incident (May 28, 2024)

    Summary

    Following the release of a new messaging feature early on May 28, 2024, users experienced general slowness in messaging-related requests. The new feature increased the number of requests hitting our gateway, triggering rate-limit issues that had not been encountered before. In addition, messaging failed to load in the task drawer because of invalid customer-related feature flag configurations, and channel loading was slow because of an expensive database query.
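
    The task-drawer failure is the classic case of a flag evaluation with no safe fallback. As a sketch only (the flag client, flag key, and context fields below are hypothetical, not GUIDEcx's actual implementation), a customer-scoped flag can be evaluated defensively so that an invalid or missing configuration degrades to the previous behavior instead of failing the request:

    ```python
    # Hypothetical sketch: evaluate a customer-scoped feature flag defensively.
    # The flag client, flag key, and context fields are illustrative only.
    import logging

    logger = logging.getLogger(__name__)

    FLAG_KEY = "messaging-in-task-drawer"  # hypothetical flag name


    def should_show_messaging(flag_client, customer_id: str) -> bool:
        """Return True if the new messaging UI should load for this customer."""
        try:
            return bool(
                flag_client.evaluate(FLAG_KEY, context={"customer_id": customer_id})
            )
        except Exception:
            # Invalid configuration, missing flag, or evaluation error: fall back
            # to the old (working) experience rather than breaking the task drawer.
            logger.warning("Could not evaluate flag %s; falling back", FLAG_KEY)
            return False
    ```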

    Resolution

    The issues were resolved by taking three key actions:

    1. Increasing resources on affected services

    2. Temporarily scaling gateway services to handle the increased request volume while implementing a permanent fix that adjusted rate limits to more reasonable values

    3. Optimizing the expensive database query to retrieve only necessary data, reducing contention on the database
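
    The schema and the specific query were not published, so the following is only an illustrative sketch of the kind of change described in step 3 (it assumes a PostgreSQL-style channels/messages schema that is not GUIDEcx's actual one): instead of pulling entire message payloads just to render a channel list, select only the columns the list displays and bound the result set.

    ```python
    # Illustrative only: the real schema, query, and data-access layer are not public.
    # The point is the shape of the optimization, not the exact SQL.

    # Before: loads every column of every message just to render a channel list,
    # keeping the database busy far longer than necessary.
    CHANNELS_BEFORE = """
    SELECT m.*
    FROM channels c
    JOIN messages m ON m.channel_id = c.id
    WHERE c.project_id = %(project_id)s
    """

    # After: fetch only what the channel list displays and cap the result set,
    # letting an index on (channel_id, created_at) do most of the work.
    CHANNELS_AFTER = """
    SELECT c.id,
           c.name,
           MAX(m.created_at) AS last_activity,
           COUNT(m.id) FILTER (WHERE m.read_at IS NULL) AS unread_count
    FROM channels c
    LEFT JOIN messages m ON m.channel_id = c.id
    WHERE c.project_id = %(project_id)s
    GROUP BY c.id, c.name
    ORDER BY last_activity DESC NULLS LAST
    LIMIT 50
    """
    ```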

    Incident Timeline (all times MDT, May 28)

    • 2:30 AM - New messaging feature released

    • 8:12 AM - Customer reports of issues received

    • 8:30 AM - Resources adjusted on affected services

    • 9:00 AM - War Room initiated for coordinated response

    • 11:33 AM - Optimized query deployed

    • 11:35 AM - Feature flag configuration adjusted

    • 12:34 PM - Adjusted rate limit configuration deployed

    • 12:50 PM - All issues resolved; system returned to normal operation

    Root Causes

    • Resource Constraints: Insufficient resources allocated to select services to handle the increased load from the new messaging feature

    • Gateway Rate Limiting: Rate limits on gateway services were too restrictive, causing legitimate requests to be denied when traffic increased (see the sketch after this list)

    • Inefficient Database Queries: Certain queries were retrieving excessive data, causing database contention and slowing down channel loading
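
    GUIDEcx has not published its gateway configuration, so the following is only a sketch of the failure mode named in the Gateway Rate Limiting root cause above: a token-bucket limiter whose capacity was tuned for pre-release traffic starts rejecting legitimate requests once a new feature raises request volume, and the fix is to raise the rate and burst to values that match observed traffic.

    ```python
    # Sketch of the rate-limiting failure mode; not GUIDEcx's gateway code,
    # and the numbers are illustrative.
    import time


    class TokenBucket:
        """Classic token bucket: `rate` tokens per second, `burst` capacity."""

        def __init__(self, rate: float, burst: float):
            self.rate = rate
            self.burst = burst
            self.tokens = burst
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # the gateway would answer HTTP 429 here


    old_limit = TokenBucket(rate=100, burst=200)  # sized for pre-release traffic
    new_limit = TokenBucket(rate=400, burst=800)  # "more reasonable values" after the fix
    ```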

    Additional Notes

    To prevent similar issues in the future, we will be looking into the following:

    • Proactive monitoring for rate limit thresholds, especially during new feature releases

    • Load testing with realistic traffic patterns prior to major feature releases

    • Database query optimization reviews as part of the deployment checklist

    • Automated scaling policies for critical gateway services

    • Enhanced existing monitoring to include coverage for areas primarily affected by the incident, such as messaging request latency and error rates
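
    As a sketch of what that last item could look like in practice (the metric names, window, and thresholds below are assumptions, not GUIDEcx's actual monitoring setup), an alert can fire when the p95 latency of messaging requests or the share of rate-limited responses at the gateway crosses a threshold:

    ```python
    # Hypothetical monitoring check; metric names and thresholds are assumptions.
    from dataclasses import dataclass
    from statistics import quantiles


    @dataclass
    class Window:
        """Recent observations for messaging endpoints (e.g. the last 5 minutes)."""
        latencies_ms: list[float]  # per-request latency
        total_requests: int
        rate_limited: int          # responses rejected with HTTP 429 at the gateway


    def should_alert(w: Window, p95_limit_ms: float = 1500.0, max_429_ratio: float = 0.01) -> bool:
        """Alert if messaging latency or the rate-limited share looks unhealthy."""
        p95 = quantiles(w.latencies_ms, n=20)[18] if len(w.latencies_ms) >= 2 else 0.0
        ratio_429 = w.rate_limited / w.total_requests if w.total_requests else 0.0
        return p95 > p95_limit_ms or ratio_429 > max_429_ratio
    ```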

  • Resolved

    This incident has been resolved. Thank you for your patience. We will review what caused the issue and share our learnings along with the steps we have taken to help prevent future disruptions. Please reach out to Support if you experience any further latency.

  • Monitoring

    We have implemented a fix and are currently monitoring the result. Navigation, project creation, task creation, and other common actions are all responding faster.

  • Update

    Our team is actively investigating the issue impacting performance. We've identified a likely root cause and are preparing a fix that will be deployed shortly. We’re closely monitoring the situation and will provide updates as soon as more information becomes available. Thank you for your patience as we work to resolve this quickly and thoroughly.

  • Update

    We are currently investigating this incident. Again, we have a few ideas about a potential root cause, but haven't identified the culprit yet.

  • Investigating

    We are continuing to investigate this incident. We have already increased resources to reduce the lag, but we have not yet found the root cause. We are still pursuing a couple of leads.