GUIDEcx - Some API requests experiencing slower responses than usual – Incident details


Some API requests experiencing slower responses than usual

Resolved
Degraded performance
Started about 1 month ago
Lasted about 3 hours

Affected

Web Application

Operational from 4:13 PM to 6:54 PM

Project Management

Operational from 4:13 PM to 6:54 PM

OpenAPI

Degraded performance from 4:13 PM to 6:54 PM, Operational from 6:54 PM to 12:26 AM

API v1

Degraded performance from 4:13 PM to 6:54 PM, Operational from 6:54 PM to 12:26 AM

API v2

Degraded performance from 4:13 PM to 6:54 PM, Operational from 6:54 PM to 12:26 AM

Updates
  • Update

    RCA

    Summary:

    The latest GUIDEcx API services run on AWS EKS clusters. On the evening of November 19, 2024, a scheduled deployment changed one of these services as part of new project and task-related features. This service also manages custom fields, serves certain application pages, and handles Open API requests concerning custom fields.

    In certain cases, a specific database query returned more data than necessary. Although each query executed within acceptable time limits, the excess data created a bottleneck for other queries requesting similar data. Returning a large amount of data can be acceptable for some queries, but this one was intended to return small results because it runs frequently across multiple distributed services. The result was a backlog of requests, which in turn put pressure on messages, events, and other services reliant on the same data. This pressure caused the slowness and inconsistencies experienced during the incident.
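
    To illustrate the mechanism described above, here is a minimal sketch. It is not a model of the actual GUIDEcx system. It assumes a hypothetical service that can move a fixed number of rows per time step and compares a query returning a small result with one returning ten times as much. The function name, the capacity, the row counts, and the request rate are all invented for illustration.

```python
def simulate_backlog(rows_per_request, requests_per_tick, capacity_rows_per_tick, ticks):
    """Return the number of queued requests after each tick.

    All parameters are hypothetical; they only illustrate how a larger
    result per request reduces how many requests fit into a fixed budget.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        # New requests arrive at a constant rate.
        backlog += requests_per_tick
        # The service can only finish as many requests as its row budget allows.
        served = min(backlog, capacity_rows_per_tick // rows_per_request)
        backlog -= served
        history.append(backlog)
    return history


# Same traffic and capacity, different result sizes (illustrative numbers).
narrow = simulate_backlog(rows_per_request=50, requests_per_tick=10,
                          capacity_rows_per_tick=1000, ticks=5)
wide = simulate_backlog(rows_per_request=500, requests_per_tick=10,
                        capacity_rows_per_tick=1000, ticks=5)

print(narrow)  # [0, 0, 0, 0, 0]
print(wide)    # [8, 16, 24, 32, 40]
```

    In this toy setup each individual request still completes, but the larger result leaves room for fewer requests per step, so the queue grows steadily under unchanged traffic.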


    Resolution:

    The issue was resolved by identifying the cases where more data was being requested than intended and modifying those queries to return only the necessary data. Once this adjustment was made, queries executed efficiently, and the backlog of messages, events, and services was processed, allowing the system to return to normal operations.
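
    The report does not describe the actual query change. As a hedged illustration only, one common way to make a query return only the necessary data is to select just the needed columns and filter by the entity being requested. The sketch below uses an in-memory SQLite database; the custom_fields table, its columns, and both function names are hypothetical and are not taken from the GUIDEcx schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical custom_fields table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE custom_fields ("
        "id INTEGER PRIMARY KEY, project_id INTEGER, name TEXT, value TEXT)"
    )
    # 100 hypothetical projects with 20 custom fields each.
    rows = [(p, f"field_{i}", "x") for p in range(1, 101) for i in range(20)]
    conn.executemany(
        "INSERT INTO custom_fields (project_id, name, value) VALUES (?, ?, ?)", rows
    )
    return conn


def fetch_all_fields(conn):
    # Over-broad: returns every row, even if the caller needs one project.
    return conn.execute(
        "SELECT id, project_id, name, value FROM custom_fields"
    ).fetchall()


def fetch_fields_for_project(conn, project_id):
    # Narrowed: only the columns and rows the caller actually needs.
    return conn.execute(
        "SELECT name, value FROM custom_fields WHERE project_id = ?",
        (project_id,),
    ).fetchall()


conn = build_demo_db()
print(len(fetch_all_fields(conn)))             # 2000
print(len(fetch_fields_for_project(conn, 7)))  # 20
```

    Both queries are fast in isolation, which matches the report's observation that execution time alone did not reveal the problem. The difference shows up in the volume of data each call returns.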


    Incident Timeline (in EST):

    • 6:30 PM, November 19: Code is released for feature updates. Services appear healthy and scale appropriately.

    • 6:48 PM, November 19: Automated monitoring detects minor instability, but the incident resolves itself before a support representative can investigate further.

    • 8:00 AM, November 20: Engineering support discovers errors related to requests failing to connect to services. A backlog of requests causes some services to restart, resulting in intermittent failures.

    • 10:52 AM, November 20: Support receives reports of API errors (later identified as timeouts), incomplete page loads, and missing custom fields on projects. Open API traffic spikes during this time (an expected occurrence, but it increases the frequency of the problematic query execution).

    • 12:24 PM, November 20: The root cause is identified, and a resolution is implemented.

    • 1:54 PM, November 20: The incident is fully resolved. System back pressure is processed successfully, and operations return to normal.


    Additional Notes:

    Three specific actions are being implemented to mitigate similar issues in the future:

    1. Automated alerts will flag queries that return larger-than-expected payloads to clients. Use cases will define reasonable maximum thresholds for data loads.

    2. Automated alerts will detect system backlogs related to messaging. Use cases will define reasonable maximum thresholds for such backlogs.

    3. APIs associated with AWS EKS services will require filterable criteria for database queries unless explicitly defined otherwise.
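
    The three actions above could be sketched roughly as follows. This is an assumption-laden illustration, not GUIDEcx's implementation: the Threshold class, the check_threshold and require_filters functions, the metric names, and the limits are all hypothetical. The report says that use cases will define the real thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """A hypothetical per-use-case maximum, e.g. rows returned or queue depth."""
    name: str
    limit: int


def check_threshold(threshold, observed):
    """Return an alert message if the observed value exceeds the limit, else None."""
    if observed > threshold.limit:
        return (f"ALERT {threshold.name}: observed {observed} "
                f"exceeds limit {threshold.limit}")
    return None


def require_filters(filters, allow_unfiltered=False):
    """Reject query requests without filter criteria unless explicitly allowed."""
    if not filters and not allow_unfiltered:
        raise ValueError("query rejected: at least one filter criterion is required")
    return dict(filters or {})


# Actions 1 and 2: illustrative limits for payload size and messaging backlog.
payload_rows = Threshold("custom_fields_payload_rows", 500)
queue_depth = Threshold("message_queue_depth", 1000)

print(check_threshold(payload_rows, 2000))
# ALERT custom_fields_payload_rows: observed 2000 exceeds limit 500
print(check_threshold(queue_depth, 40))
# None

# Action 3: a filtered request passes, an unfiltered one is rejected by default.
print(require_filters({"project_id": 7}))
# {'project_id': 7}
```

    The allow_unfiltered flag stands in for the "unless explicitly defined otherwise" exception in action 3, so that an unfiltered query has to be a deliberate choice.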

  • Resolved

    This incident has been resolved. Users are no longer receiving errors from the API, and the Team page and Custom Fields are loading properly. Our team will continue to monitor the situation.

  • Identified

    The team has identified a fix and has started implementing it. We will continue to post updates as progress is made.

  • Update

    We are still investigating this incident. Our monitoring tools have surfaced some insights, and we are pursuing a few theories. We also understand that the impact is not limited to work done via the API: some issues are affecting the UI as well, including custom fields and the new Team page.

  • Investigating
    We are currently investigating this incident.