Service disruption: [RESOLVED] Elevated error rates

From 1:48 PM to 2:19 PM PDT on 6/1/2010, Amazon S3 in the US-Standard region experienced elevated error rates. The errors were the result of a new algorithm we deployed a few weeks ago that was designed to increase PUT request throughput. This code introduced an error in Amazon S3’s indexing layer. The resulting problem was triggered only when three conditions held at once: there was a very high PUT request rate to a bucket, we had manually enabled the bucket to use the new algorithm, and the bucket had Versioning enabled. This confluence of conditions is rare, which is why we hadn’t seen the problem in the several weeks since we deployed the change.

When Amazon S3 encountered this set of conditions yesterday, it caused an unhandled exception, and the index server processing the request failed and restarted. Because of the high PUT request rate for the bucket, its requests were spread across many different servers in the system for processing. Each of these servers encountered the same set of conditions, failed, and restarted. While an index server is restarting, it cannot process requests. Once a significant portion of Amazon S3’s index servers were restarting, Amazon S3 didn’t have enough index server capacity to handle the system-wide request load, which caused the elevated error rates in this Region.

In this situation, we’d typically throttle requests to the bucket encountering the problem in order to prevent it from impacting a broad set of buckets. However, because of the way the index server failed, our logs didn’t contain the bucket name for the request that caused the server to fail. As a result, we weren’t able to quickly identify that requests to this bucket were causing the problem.
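
To illustrate what throttling requests to a single bucket can look like, here is a minimal sketch of a per-bucket token-bucket rate limiter. It is not Amazon S3's implementation; the class name `BucketThrottle` and the parameters `rate_per_sec`, `burst`, and `clock` are hypothetical and chosen only for illustration.

```python
import time


class BucketThrottle:
    """Per-bucket token-bucket rate limiter (hypothetical, illustrative only)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec   # tokens refilled per second, per bucket
        self.burst = burst         # maximum tokens a bucket can accumulate
        self.clock = clock         # injectable time source, useful for testing
        self._state = {}           # bucket name -> (tokens, last refill time)

    def allow(self, bucket_name):
        """Return True if a request to bucket_name may proceed right now."""
        now = self.clock()
        tokens, last = self._state.get(bucket_name, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self._state[bucket_name] = (tokens - 1, now)
            return True
        self._state[bucket_name] = (tokens, now)
        return False
```

Because each bucket has its own token count, rejecting requests to one very busy bucket leaves requests to other buckets unaffected. A scheme like this only helps if the operator knows which bucket name to limit, and that was the information missing from the logs in this event.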

We’re changing the way that logging is done for index servers to ensure we have the bucket information in situations like this. We’ve also identified the issue in the algorithm and fixed it. Finally, we’re reviewing our test coverage to ensure that it adequately covers this type of scenario.
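
One general way to make sure the bucket name reaches the logs is to record it before any processing begins, and again if the request fails. The sketch below shows that pattern using Python's standard `logging` module. It is an assumption about the general technique, not a description of the actual change to the index servers; `process_request`, `handler`, and the logger name `index_server` are hypothetical.

```python
import logging

logger = logging.getLogger("index_server")  # hypothetical logger name


def process_request(bucket_name, key, handler):
    """Run handler(bucket_name, key), logging the bucket before and on failure."""
    # Log the bucket before doing any work, so the entry already exists
    # even if the handler later brings the process down.
    logger.info("begin request bucket=%s key=%s", bucket_name, key)
    try:
        return handler(bucket_name, key)
    except Exception:
        # Attach the bucket name to the failure record, then re-raise so
        # the caller's normal failure handling still applies.
        logger.exception("request failed bucket=%s key=%s", bucket_name, key)
        raise
```

With both entries in place, an operator reading the logs after a failure can see which bucket the failing request targeted and throttle that bucket.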

