Archive for the ‘Amazon Simple Storage Service (US)’ Category.
August 20, 2010, 3:50 pm
Starting at 2:12PM PDT a small subset of requests routed to our East Coast facilities experienced connection timeouts. The service was fully recovered by 2:33PM PDT. The service is now operating normally.
August 20, 2010, 3:32 pm
We are investigating timeouts for a subset of requests routed to our East Coast facilities for our US Standard endpoint.
August 18, 2010, 11:50 am
Starting at 10:04AM PDT a small subset of requests routed to our East Coast facilities experienced connection timeouts. The service began recovering at 10:21AM PDT and was fully recovered by 10:29AM PDT. The service is now operating normally.
August 18, 2010, 11:24 am
We are investigating timeouts for a subset of requests routed to our East Coast facilities for our US Standard endpoint.
August 5, 2010, 12:04 am
At 2:20PM PDT today we began the final phase of a staged deployment of a new software component to our US-Standard region. This software component had already been successfully rolled out to our other regions and to multiple data centers within the US-Standard region earlier in the week. At 2:27PM, as the final phase of the deployment was completing, error rates began increasing and triggered our alarms. At 2:28PM the automated deployment completed, and by 2:38PM rollback of the new software had been initiated in all data centers within the US-Standard region. By 2:54PM the rollback was complete and the system was fully recovered.
The root cause was a reduction in throughput under heavy load for the newly deployed software. We did not see this change in behavior in our pre-deployment testing or in our other regions running this version of the software. To prevent recurrence, we are increasing the load in our pre-deployment load testing. We are also adding alarms to better detect changes in key scaling characteristics for this component so that we can identify issues earlier in the phased deployment process.
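The pattern described above, deploying in phases and rolling back when an alarm fires, can be sketched roughly as follows. This is an illustrative sketch only, not Amazon's deployment tooling: the function `phased_deploy`, its callback parameters, and the `max_error_rate` threshold are all hypothetical names chosen for the example.

```python
from typing import Callable, List


def phased_deploy(
    phases: List[List[str]],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    error_rate: Callable[[], float],
    max_error_rate: float,
) -> bool:
    """Deploy phase by phase; roll back everything if the alarm trips.

    All names here are hypothetical. Returns True if every phase was
    deployed, False if a rollback occurred.
    """
    deployed: List[str] = []
    for phase in phases:
        for target in phase:
            deploy(target)
            deployed.append(target)
        # Check the alarm after each phase, before widening the rollout.
        if error_rate() > max_error_rate:
            # Undo in reverse order so the newest targets revert first.
            for target in reversed(deployed):
                rollback(target)
            return False
    return True
```

A check like this only catches what the alarm measures. An alarm on throughput under load, in addition to error rate, is the kind of signal that could trip in an earlier phase.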
August 4, 2010, 4:17 pm
The increased error rates lasted from 2:27 to 2:54 PM PDT. The service has fully recovered and is operating normally.
August 4, 2010, 3:40 pm
We are currently investigating an increase in error rates affecting S3.
June 3, 2010, 9:00 am
From 1:48PM to 2:19PM PDT on 6/1/2010, the US-Standard region experienced elevated error rates. The errors were the result of a new algorithm we deployed a few weeks ago that was designed to increase PUT request throughput. This code introduced an error in Amazon S3’s indexing layer. The resulting problem was only triggered when three conditions held at once: there was a very high PUT request rate to a bucket, we had manually enabled the bucket to leverage the new algorithm, and the bucket had Versioning enabled. The confluence of these conditions is rare, and this is why we hadn’t seen this problem in the several weeks since we’d deployed this change.
When Amazon S3 encountered this set of conditions, it caused an unhandled exception. When the unhandled exception was encountered, the index server processing the request would fail and restart. Because of the high PUT request rate for the bucket, requests were spread across many different servers in the system for processing. Each of these servers encountered the same set of conditions, then failed and restarted. While an index server is restarting, it cannot process requests. Once a significant portion of Amazon S3’s index servers were restarting, Amazon S3 didn’t have enough index server capacity to handle the system-wide request load, causing the elevated error rates in this Region.
In this situation, we’d typically throttle requests to the bucket encountering the problem in order to prevent it from impacting a broad set of buckets. However, because of the way the index server failed, our logs didn’t contain the bucket name for the request that caused the server to fail. As a result, we weren’t able to quickly identify that requests to this bucket were causing the problem.
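To illustrate what throttling a single bucket can look like, here is a minimal token-bucket sketch. It is an assumption-laden example, not a description of how Amazon S3 implements throttling: the class `BucketThrottle`, its methods, and the requests-per-second limit are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict


class BucketThrottle:
    """Token-bucket rate limiter applied only to named buckets (hypothetical)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._limits: Dict[str, float] = {}  # bucket name -> requests/second
        self._tokens: Dict[str, float] = {}  # bucket name -> tokens available
        self._last: Dict[str, float] = {}    # bucket name -> last refill time

    def throttle(self, bucket: str, requests_per_second: float) -> None:
        """Start limiting one bucket; it begins with a one-second burst."""
        self._limits[bucket] = requests_per_second
        self._tokens[bucket] = requests_per_second
        self._last[bucket] = self._clock()

    def allow(self, bucket: str) -> bool:
        """Return True if a request to this bucket may proceed."""
        limit = self._limits.get(bucket)
        if limit is None:
            return True  # buckets that are not throttled are unaffected
        now = self._clock()
        elapsed = now - self._last[bucket]
        self._last[bucket] = now
        # Refill in proportion to elapsed time, capped at the limit.
        self._tokens[bucket] = min(limit, self._tokens[bucket] + elapsed * limit)
        if self._tokens[bucket] >= 1.0:
            self._tokens[bucket] -= 1.0
            return True
        return False
```

The limiter is keyed by bucket name, which is why it depends on knowing which bucket is responsible: without that name there is nothing to pass to `throttle`.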
We’re changing the way that logging is done for index servers to ensure we have the bucket information in situations like this. We’ve also identified the issue in the algorithm and fixed it. We are also reviewing our test coverage to ensure that we have adequate coverage for this type of scenario.
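One general way to make sure the bucket name survives a failure is to log it before any processing starts, and again if the request fails. The sketch below shows that idea using Python's standard `logging` module. It is not the actual index server change: the function `handle_put`, the `process` callback, and the logger name are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("index_server")  # hypothetical logger name


def handle_put(bucket: str, key: str, process: Callable[[str, str], None]) -> None:
    """Process a PUT request, logging the bucket before and on failure."""
    # Record the bucket before doing any work, so the log already names it
    # even if processing later brings the server down.
    logger.info("PUT start bucket=%s key=%s", bucket, key)
    try:
        process(bucket, key)
    except Exception:
        # Attach the bucket to the failure record, then re-raise so the
        # caller's normal error handling still runs.
        logger.exception("PUT failed bucket=%s key=%s", bucket, key)
        raise
```

Logging before the work begins matters because a process that dies abruptly may never reach its exception handler. The earlier record is then the only one that identifies the bucket.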
June 1, 2010, 3:30 pm
Error rates are back down to normal, but we are continuing to monitor closely.
June 1, 2010, 3:01 pm
We are investigating elevated error rates.