Service is operating normally: [RESOLVED] Increase in error rates
At 2:20PM PDT today we began the final phase of a staged deployment of a new software component to our US-Standard region. This component had already been rolled out successfully to our other regions and to multiple data centers within the US-Standard region earlier in the week. At 2:27PM, as the final phase of the deployment was completing, error rates began increasing and triggered our alarms. At 2:28PM the automated deployment completed, and by 2:38PM rollback of the new software had been initiated in all data centers within the US-Standard region. By 2:54PM the rollback was complete and the system had fully recovered. The root cause was a reduction in throughput under heavy load in the newly deployed software. We did not see this change in behavior in our pre-deployment testing or in our other regions running this version of the software. To prevent recurrence, we are increasing the load used in our pre-deployment load testing. We are also adding alarms to better detect changes in key scaling characteristics for this component so that we can identify issues earlier in the phased deployment process.