GUIDEcx - Intermittent latency throughout the application – Incident details


Intermittent latency throughout the application

Resolved
Degraded performance
Started about 1 month ago. Lasted about 3 hours.

Affected

All of the following components experienced degraded performance from 3:54 PM to 6:04 PM and were operational again from 6:04 PM to 6:50 PM:

  • Web Application
  • Project Management
  • Compass Customer Portal
  • Resource Management
  • Advanced Time Tracking
  • Report Navigator and Report Builder

Updates
  • Postmortem

    Post-Mortem: Messaging Service Incident (May 28, 2024)

    Summary

    Following the release of a new messaging feature early on May 28, 2024, users experienced general slowness in messaging-related requests. The new feature increased the number of requests hitting our gateway, triggering rate-limit issues that had not been encountered before. In addition, messaging failed to load in the task drawer because of invalid customer-related feature flag configurations, and channel loading was slow because of an expensive database query.
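
    The task-drawer failure is the classic case of a flag evaluation with no safe fallback. As a sketch only (the flag client, flag key, and context fields below are hypothetical, not GUIDEcx's actual implementation), a customer-scoped flag can be evaluated defensively so that an invalid or missing configuration degrades to the previous behavior instead of failing the request:

    ```python
    # Hypothetical sketch: evaluate a customer-scoped feature flag defensively.
    # The flag client, flag key, and context fields are illustrative only.
    import logging

    logger = logging.getLogger(__name__)

    FLAG_KEY = "messaging-in-task-drawer"  # hypothetical flag name


    def should_show_messaging(flag_client, customer_id: str) -> bool:
        """Return True if the new messaging UI should load for this customer."""
        try:
            return bool(
                flag_client.evaluate(FLAG_KEY, context={"customer_id": customer_id})
            )
        except Exception:
            # Invalid configuration, missing flag, or evaluation error: fall back
            # to the old (working) experience rather than breaking the task drawer.
            logger.warning("Could not evaluate flag %s; falling back", FLAG_KEY)
            return False
    ```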

    Resolution

    The issues were resolved by taking three key actions:

    1. Increasing resources on affected services

    2. Temporarily scaling gateway services to handle the increased request volume while implementing a permanent fix that adjusted rate limits to more reasonable values

    3. Optimizing the expensive database query to retrieve only necessary data, reducing contention on the database
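
    The schema and the specific query were not published, so the following is only an illustrative sketch of the kind of change described in step 3 (it assumes a PostgreSQL-style channels/messages schema that is not GUIDEcx's actual one): instead of pulling entire message payloads just to render a channel list, select only the columns the list displays and bound the result set.

    ```python
    # Illustrative only: the real schema, query, and data-access layer are not public.
    # The point is the shape of the optimization, not the exact SQL.

    # Before: loads every column of every message just to render a channel list,
    # keeping the database busy far longer than necessary.
    CHANNELS_BEFORE = """
    SELECT m.*
    FROM channels c
    JOIN messages m ON m.channel_id = c.id
    WHERE c.project_id = %(project_id)s
    """

    # After: fetch only what the channel list displays and cap the result set,
    # letting an index on (channel_id, created_at) do most of the work.
    CHANNELS_AFTER = """
    SELECT c.id,
           c.name,
           MAX(m.created_at) AS last_activity,
           COUNT(m.id) FILTER (WHERE m.read_at IS NULL) AS unread_count
    FROM channels c
    LEFT JOIN messages m ON m.channel_id = c.id
    WHERE c.project_id = %(project_id)s
    GROUP BY c.id, c.name
    ORDER BY last_activity DESC NULLS LAST
    LIMIT 50
    """
    ```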

    Incident Timeline (all times MDT, May 28)

    • 2:30 AM - New messaging feature released

    • 8:12 AM - Customer reports of issues received

    • 8:30 AM - Resources adjusted on affected services

    • 9:00 AM - War Room initiated for coordinated response

    • 11:33 AM - Optimized query deployed

    • 11:35 AM - Feature flag configuration adjusted

    • 12:34 PM - Adjusted rate limit configuration deployed

    • 12:50 PM - All issues resolved; system returned to normal operation

    Root Causes

    • Resource Constraints: Insufficient resources allocated to select services to handle the increased load from the new messaging feature

    • Gateway Rate Limiting: Rate limits on gateway services were too restrictive, causing legitimate requests to be denied when traffic increased (see the sketch after this list)

    • Inefficient Database Queries: Certain queries were retrieving excessive data, causing database contention and slowing down channel loading
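
    GUIDEcx has not published its gateway configuration, so the following is only a sketch of the failure mode named in the Gateway Rate Limiting root cause above: a token-bucket limiter whose capacity was tuned for pre-release traffic starts rejecting legitimate requests once a new feature raises request volume, and the fix is to raise the rate and burst to values that match observed traffic.

    ```python
    # Sketch of the rate-limiting failure mode; not GUIDEcx's gateway code,
    # and the numbers are illustrative.
    import time


    class TokenBucket:
        """Classic token bucket: `rate` tokens per second, `burst` capacity."""

        def __init__(self, rate: float, burst: float):
            self.rate = rate
            self.burst = burst
            self.tokens = burst
            self.updated = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # the gateway would answer HTTP 429 here


    old_limit = TokenBucket(rate=100, burst=200)  # sized for pre-release traffic
    new_limit = TokenBucket(rate=400, burst=800)  # "more reasonable values" after the fix
    ```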

    Additional Notes

    To prevent similar issues in the future, we will be looking into the following:

    • Proactive monitoring for rate limit thresholds, especially during new feature releases

    • Load testing with realistic traffic patterns prior to major feature releases

    • Database query optimization reviews as part of the deployment checklist

    • Automated scaling policies for critical gateway services

    • Enhanced existing monitoring to include coverage for areas primarily affected by the incident, such as messaging request latency and error rates
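
    As a sketch of what that last item could look like in practice (the metric names, window, and thresholds below are assumptions, not GUIDEcx's actual monitoring setup), an alert can fire when the p95 latency of messaging requests or the share of rate-limited responses at the gateway crosses a threshold:

    ```python
    # Hypothetical monitoring check; metric names and thresholds are assumptions.
    from dataclasses import dataclass
    from statistics import quantiles


    @dataclass
    class Window:
        """Recent observations for messaging endpoints (e.g. the last 5 minutes)."""
        latencies_ms: list[float]  # per-request latency
        total_requests: int
        rate_limited: int          # responses rejected with HTTP 429 at the gateway


    def should_alert(w: Window, p95_limit_ms: float = 1500.0, max_429_ratio: float = 0.01) -> bool:
        """Alert if messaging latency or the rate-limited share looks unhealthy."""
        p95 = quantiles(w.latencies_ms, n=20)[18] if len(w.latencies_ms) >= 2 else 0.0
        ratio_429 = w.rate_limited / w.total_requests if w.total_requests else 0.0
        return p95 > p95_limit_ms or ratio_429 > max_429_ratio
    ```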

  • Resolved

    This incident has been resolved. Thank you for your patience. We will review what caused the issue and share our learnings along with the steps we have taken to help prevent future disruptions. Please reach out to Support if you experience any further latency.

  • Monitoring

    We have implemented a fix and are currently monitoring the result. Navigation, project creation, task creation, and other common actions are all responding faster.

  • Update

    Our team is actively investigating the issue impacting performance. We've identified a likely root cause and are preparing a fix that will be deployed shortly. We’re closely monitoring the situation and will provide updates as soon as more information becomes available. Thank you for your patience as we work to resolve this quickly and thoroughly.

  • Update

    We are currently investigating this incident. Again, we have a few ideas about a potential root cause, but haven't identified the culprit yet.

  • Investigating

    We are continuing to investigate this incident. We have already increased resources to reduce the lag, but we have not yet found the root cause. We are still pursuing a couple of leads.