Summary
During a routine infrastructure upgrade, image downloads from an external service were rate-limited. Essential services that depend on those images could not start, causing a temporary outage that made certain features unavailable.
What Happened
The upgrade restarted many system components rapidly and at nearly the same time, which generated far more image download requests than usual within a short time frame. The request volume exceeded the rate limit set by the external service provider, and critical services could not complete their startup.
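The effect of simultaneous restarts on a rate limit can be illustrated with a small sketch. Every number here is hypothetical (the limit, the window length, the component count, and the images per component are assumptions chosen for illustration, not figures from the incident); the point is only that the same total number of downloads can stay under or exceed a limit depending on how tightly the restarts are clustered.

```python
def peak_requests_in_window(timestamps, window_seconds):
    """Return the largest number of requests that fall inside any sliding window."""
    ordered = sorted(timestamps)
    peak = 0
    start = 0
    for end, end_time in enumerate(ordered):
        # Drop requests that are a full window or more older than the current one.
        while end_time - ordered[start] >= window_seconds:
            start += 1
        peak = max(peak, end - start + 1)
    return peak


# Hypothetical values, for illustration only.
RATE_LIMIT = 100          # assumed downloads allowed per window
WINDOW_SECONDS = 3600     # assumed window length
COMPONENTS = 40           # assumed number of restarting components
IMAGES_PER_COMPONENT = 3  # assumed downloads per restart

# Upgrade scenario: one component restarts every second.
burst = [i for i in range(COMPONENTS) for _ in range(IMAGES_PER_COMPONENT)]

# Normal operation: one component restarts every five minutes.
spread = [i * 300 for i in range(COMPONENTS) for _ in range(IMAGES_PER_COMPONENT)]

for label, timestamps in (("burst", burst), ("spread", spread)):
    peak = peak_requests_in_window(timestamps, WINDOW_SECONDS)
    status = "exceeds" if peak > RATE_LIMIT else "stays under"
    print(f"{label}: peak {peak} requests per window, {status} the limit of {RATE_LIMIT}")
```

Under these assumed numbers, both scenarios issue 120 downloads in total, but only the clustered restarts push the per-window peak past the limit.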
How We Fixed It
Enhanced Access: At 3:10 PM ET, we upgraded our access to the external service, which raised our download limit. We then created and applied a new access credential, after which the impacted services restarted successfully.
Configuration Update: We updated our system configuration to make access to required components more reliable, reducing the likelihood of similar issues.
What We've Done to Prevent This in the Future
To prevent this from happening again, we took the following steps:
Image Caching & Version Control: We now cache frequently used components and pin them to specific versions. This reduces how often they must be downloaded from external sources and lowers the risk of hitting rate limits again.
Upgraded Service Plan: We moved to a higher tier of the external service provider's plan, which increases our allowed download capacity and provides more robust support for future operations.
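The version-pinning step above could be enforced with a check along the following lines. This is a minimal sketch, assuming the components are container images referenced in the common name:tag or name@sha256:digest form; the function name is_pinned and the example references are hypothetical and do not describe our actual tooling.

```python
import re

# A sha256 digest is the prefix "sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def is_pinned(image_ref: str) -> bool:
    """Return True if the reference names a digest or an explicit tag other than 'latest'."""
    _name, has_digest, digest = image_ref.partition("@")
    if has_digest:
        return bool(DIGEST_RE.fullmatch(digest))
    # Only the last path segment can carry a tag; an earlier colon may be a registry port.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag means the implicit, moving 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


# Hypothetical references, for illustration only.
examples = [
    "registry.example.com:5000/team/app",
    "registry.example.com:5000/team/app:latest",
    "registry.example.com:5000/team/app:2.4.1",
    "registry.example.com:5000/team/app@sha256:" + "a" * 64,
]
for ref in examples:
    print(f"{ref}: {'pinned' if is_pinned(ref) else 'NOT pinned'}")
```

A check like this could run before a deployment so that unpinned references are caught early. Pinning by digest is the stricter option, since a tag can be moved to point at different content while a digest cannot.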