Archive for the ‘Amazon Simple Storage Service (US)’ Category.
July 13, 2011, 12:40 pm
We’d like to give some additional information on the 18-minute disruption in our US-Standard Region on July 12, 2011. At 4:13PM PDT, memory errors on one of our servers introduced malformed system-health information into the system. Because we use Error Correction Code (ECC) memory and an Extended Error Correction (EEC) process, memory errors on our servers are extremely rare. This malformed system-health information was then distributed across the storage servers in the region. When the storage servers attempted to process the malformed system-health information, they detected that it was invalid. Unfortunately, a mistake in the storage server error-handling code caused it to protect the storage servers too aggressively: on detecting the invalid information, the servers shut themselves down and stopped serving requests. This caused elevated errors for all APIs that save or retrieve objects (GET, PUT, POST, etc.). At 4:31PM PDT we corrected the system-health information and our servers automatically started serving requests again.
We have made changes to our system-health handling routines so that when they recognize malformed system-health information, they continue to operate using the previous value instead of shutting down. In addition, they will log where the malformed system-health information originated and raise alarms so that we can take the server with memory errors down and fix it. We coded, tested, and deployed these changes last night (July 12), finishing at 7:01PM PDT.
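The change described above, keeping the previous value when a malformed update arrives, logging its origin, and raising an alarm, can be illustrated with a minimal Python sketch. This is not the S3 implementation: the payload shape, the names `HealthInfo`, `parse_health`, and `HealthState`, and the in-memory `alarms` list are all assumptions made for illustration.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, List

logger = logging.getLogger("system_health")


@dataclass(frozen=True)
class HealthInfo:
    """Hypothetical shape of one system-health update."""
    source: str
    healthy_servers: frozenset


def parse_health(payload: Any) -> HealthInfo:
    """Validate a raw update; raise ValueError if it is malformed."""
    if not isinstance(payload, dict):
        raise ValueError("payload is not a mapping")
    source = payload.get("source")
    servers = payload.get("healthy_servers")
    if not isinstance(source, str) or not isinstance(servers, list):
        raise ValueError("missing or mistyped fields")
    if not all(isinstance(s, str) for s in servers):
        raise ValueError("server names must be strings")
    return HealthInfo(source, frozenset(servers))


@dataclass
class HealthState:
    """Holds the last known-good health info for one storage server."""
    current: HealthInfo
    alarms: List[str] = field(default_factory=list)  # stand-in for a real alarm system

    def update(self, payload: Any) -> HealthInfo:
        try:
            self.current = parse_health(payload)
        except ValueError as exc:
            # Keep serving with the previous value instead of shutting down.
            origin = "<unknown>"
            if isinstance(payload, dict):
                origin = str(payload.get("source", "<unknown>"))
            logger.warning("malformed health info from %s: %s", origin, exc)
            self.alarms.append(origin)
        return self.current
```

In this sketch a bad update never replaces `current`, so the caller keeps working from the last valid view while the recorded origin points operators at the server to investigate.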
July 12, 2011, 5:45 pm
Between 4:13PM and 4:31PM PDT we experienced high error rates in our East Coast and West Coast facilities for GET, PUT and POST. The issue has been resolved and the service is operating normally.
July 12, 2011, 5:31 pm
We are experiencing high error rates in US-Standard in both our East Coast and West Coast facilities. The Amazon S3 US-West-1 region is operating normally.
July 12, 2011, 5:23 pm
We are currently investigating increased error rates in the East Coast facilities.
June 28, 2011, 3:40 pm
Between 1:45 PST and 2:28 PST, some customers experienced problems loading the Amazon S3 Console. This issue has been resolved and the console is operating normally.
June 28, 2011, 3:31 pm
Some customers are currently unable to load the Amazon S3 Console. We are actively working to resolve the issue.
April 12, 2011, 5:37 am
We just wanted to provide a little more information on the List-Buckets API issue yesterday morning. From 1:39AM PDT to 5:39AM PDT, calls to the List-Buckets API returned incomplete results. The List-Buckets API calls completed successfully, but in many cases the results included only a subset of the customer’s buckets. This caused the AWS Management Console to show an incomplete list of buckets or indicate there were no buckets at all. Note that during this event all Amazon S3 buckets were still online, calls to GET, PUT, and DELETE were operating normally, and no customer data was lost.
The cause of this event was a software update. Last week we deployed a software update to a subset of data centers in each region as part of a staged deployment. Over the last week we encountered no issues. Monday night, we completed the worldwide deployment. This deployment included a change to the internal mechanism S3 uses to list bucket metadata. This internal mechanism is used in many places, including the process that synchronizes metadata between our primary bucket index and the secondary index that supports the List-Buckets API. Last night we encountered an unexpected error condition that caused the internal mechanism for listing bucket metadata to return a partial set of results to the synchronization process. The synchronization process updated the secondary index based on the partial results and incorrectly removed the buckets that were not included. Subsequent calls to the List-Buckets API then returned incomplete results. We fixed the incomplete results issue by repopulating the secondary index, which was completed at 5:39AM PDT on Tuesday. At this point, the List-Buckets API was returning complete results and we began rolling back the defective software update worldwide. This rollback was completed at 7:00AM PDT.
We tested this change before we began the deployment, but clearly missed this case. We won’t deploy another change until we’ve extended our testing and alarming to ensure we’ve covered this and related cases. We will also change our synchronization process to halt and alarm if it appears that there’s been a large, sudden decrease in the number of buckets, as this is a highly unusual case. We apologize for any inconvenience or confusion that resulted from this event.
Sincerely,
The Amazon S3 team
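As an illustration of the safeguard mentioned in the post above (halting the synchronization process when the number of buckets drops sharply), here is a minimal Python sketch. It is not the S3 synchronization code: the function `sync_secondary_index`, the exception `SyncHalted`, the set-based index, and the 10% threshold are all assumptions chosen for the example.

```python
class SyncHalted(Exception):
    """Raised when a sync would remove a suspiciously large share of buckets."""


def sync_secondary_index(secondary: set, listed: set,
                         max_drop_fraction: float = 0.10) -> set:
    """Return the new secondary index contents based on a listing.

    `secondary` is the current secondary index and `listed` is what the
    (hypothetical) primary-index listing returned. If more than
    `max_drop_fraction` of the existing buckets would be removed, the
    listing is treated as possibly partial and the sync halts instead
    of applying it.
    """
    removed = secondary - listed
    if secondary and len(removed) / len(secondary) > max_drop_fraction:
        raise SyncHalted(
            f"refusing to remove {len(removed)} of {len(secondary)} buckets"
        )
    # Small changes (a few deletions, any number of additions) are applied.
    return set(listed)
```

A caller would catch `SyncHalted`, leave the secondary index untouched, and raise an alarm for an operator to review. The right threshold is a judgment call; the value here is only a placeholder.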
March 23, 2011, 4:29 pm
Between 10:29AM PDT and 3:11PM PDT we experienced increased connection timeouts to our West Coast facilities in the US-Standard region. The issue has been resolved and the service is operating normally.
March 23, 2011, 3:52 pm
We have confirmed that a subset of requests routed to our West Coast facilities in the US-Standard region is experiencing connection timeouts. We are working on resolving the issue.
March 23, 2011, 3:03 pm
We are currently investigating reports of increased connection timeouts for requests that are routed to our West Coast facilities.