Summary
We recently encountered an issue related to a new feature release, which temporarily affected project creation and led to slower performance in some areas of our platform. Specifically, customers saw projects stuck in the "Creating" state with a loading screen that wouldn't go away.
What Happened
During the release of a new feature, part of our project creation process failed unexpectedly. The projects themselves were created successfully, but the background process that completes their setup hit an error and retried the setup multiple times. These retries created duplicate templates within projects and prevented some projects from completing their setup.
Additionally, the repeated attempts to finalize project creation increased the workload on our system, leading to slower response times for certain actions, such as syncing data with external tools like Jira and Salesforce.
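For readers curious about the mechanics, a retried setup step creates duplicates when it is not safe to run more than once. The sketch below is illustrative only and is not our production code: the `Project` class, the `finish_setup` function, and the template names are hypothetical. It shows one common safeguard, which is to skip work that an earlier attempt already completed, so that a retry after a partial failure adds nothing twice.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical in-memory stand-in for a project record."""
    name: str
    state: str = "Creating"
    templates: dict = field(default_factory=dict)  # keyed by template id


def finish_setup(project, template_ids, fail_after=None):
    """Attach templates, then mark the project ready.

    Safe to re-run: a template that is already attached is skipped.
    fail_after simulates an error once that many templates were processed.
    """
    for index, template_id in enumerate(template_ids):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("simulated setup failure")
        if template_id in project.templates:
            continue  # already attached by an earlier attempt
        project.templates[template_id] = {"id": template_id}
    project.state = "Ready"


project = Project(name="demo")
try:
    # First attempt fails partway through, after one template is attached.
    finish_setup(project, ["kickoff", "review"], fail_after=1)
except RuntimeError:
    pass  # a job runner would schedule a retry here

# The retry completes the setup without re-adding the first template.
finish_setup(project, ["kickoff", "review"])
```

Keying templates by id is what makes the retry harmless here; a setup step that appended to a list on every attempt would reproduce the duplication described above.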
How We Fixed It
Rapid Response: Within 15 minutes of the issue being reported, our team implemented a fix so that new projects could complete their setup.
Clean-Up: The cleanup of impacted jobs followed this schedule:
2:30 PM ET: We completed tests of a script to remove duplicate templates. A total of 735 duplicates were removed, along with duplicate milestones, tasks, and attachments.
5:46 PM ET: All projects stuck in a "Creating" state were updated, and we confirmed that known customer projects were accessible.
5:56 PM ET: The system processed the remaining background tasks, and overall performance returned to normal as the load on our database decreased throughout the day.
What We've Done to Prevent This in the Future
To ensure this doesn't happen again, we took several steps:
Enhanced Quality Control: We reviewed our release processes and introduced additional quality checks and safeguards so that similar issues are caught before they reach production.
Improved Monitoring: We reinstated and enhanced our monitoring systems to quickly detect and alert us to any issues affecting critical processes like project creation.
System Improvements: We also made improvements to how our background processes handle spikes in activity, ensuring that our system remains responsive even during high-demand periods.
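As a general illustration of how background retries can be kept from overloading a system, the sketch below limits the number of attempts and spaces them out with capped, randomized exponential backoff. It is a minimal example under stated assumptions, not a description of our implementation: `run_with_retries`, its parameters, and the `flaky` job are all hypothetical.

```python
import random
import time


def run_with_retries(job, max_attempts=5, base_delay=1.0, max_delay=30.0,
                     sleep=time.sleep):
    """Run job(); on failure, wait with capped, jittered exponential backoff.

    Re-raises the last error once max_attempts is reached, so a job that
    keeps failing stops instead of retrying forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            # The wait ceiling doubles per attempt but never exceeds max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


calls = []


def flaky():
    """Hypothetical job that fails twice, then succeeds."""
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient failure")
    return "done"


waits = []  # record the waits instead of actually sleeping
result = run_with_retries(flaky, sleep=waits.append)
```

The attempt limit bounds the total work a failing job can generate, and the random jitter keeps many failing jobs from retrying at the same moment.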
We understand the impact this may have had on your experience, and we are committed to learning from this event to provide you with a more reliable service moving forward.