rrs – Shlomo Swidler

AWS recently added support for receiving Simple Notification Service notifications when S3 loses a Reduced Redundancy Storage S3 object. This raises a number of questions:

What the heck does that even mean?
Why would I want to do that?
Under what conditions does it make financial sense to do that?

Let’s take a look at these questions, and we’ll also do a bit of brainstorming (please participate!) to design a service that puts it all together.

What is S3 Reduced Redundancy Storage?

Standard objects stored in S3 have “eleven nines” of durability annually. This means 99.999999999% of your objects stored in S3 will still be there after one year. On average, you will need to store 100,000,000,000 – that’s one hundred billion – objects in standard S3 storage before you will, on average, have one of them disappear over a year’s time. Pretty great.

Reduced Redundancy Storage (RRS) is a different class of S3 storage that, in effect, has a lower durability: 99.99% annually. On average, you will need to store only 10,000 objects in RRS S3 before you should expect one of them to disappear over a year’s time. Not quite as great, but still more than 400 times better than a traditional hard drive.

When an RRS object is lost S3 will return an HTTP 405 response code, and your application is supposed to be built to understand that and take the appropriate action: most likely regenerate the object from its source objects, which have been stored elsewhere more reliably – probably in standard eleven-nines S3. It’s less expensive for AWS to provide a lower durability class of service, and therefore RRS storage is priced accordingly: it’s about 2/3 the cost of standard S3 storage.

RRS is great for derived objects – for example, image thumbnails. The source object – the full-quality image or video – can be used to recreate the derived object – the thumbnail – without losing any information. All it costs to create the derived object is time and CPU power. And that’s most likely why you’re creating the derived objects and storing them in S3: to act as a cache so the app server does not need to spend time and CPU power recreating them for every request. Using S3 RRS as a cache will save you 1/3 of your storage costs for the derived objects, but you’ll need to occasionally recreate a derived object in your application.

How Do You Handle Objects Stored in RRS?

If you serve the derived objects to clients directly from S3 – as many web apps do with their images – your clients will occasionally get a HTTP 405 response code (about once a year for every 10,000 RRS objects stored). The more objects you store the higher the likelihood of a client’s browser encountering a HTTP 405 error – and most browsers show ugly messages when they get a 405 error. So your application should do some checking.

To get your application to check for a lost object you can do the following: Send S3 an HTTP HEAD request for the object before giving the client its URL. If the object exists then the HEAD request will succeed. If the object is lost the HEAD request will return a 405 error. Once you’re sure the object is in S3 (either the HEAD request succeeded, or you recreated the derived object and stored it again in S3), give the object’s URL to the client.

All that HEAD checking is a lot of overhead: each S3 RRS URL needs to be checked every time it’s served. You can add a cache of the URL of objects you’ve checked recently and skip those. This will cut down on the overhead and reduce your S3 bill – remember that each HEAD request costs 1/10,000 of a cent – but it’s still a bunch of unnecessary work because most of the time you check its HEAD the object will still be there.

Using Simple Notification Service with RRS

Wouldn’t it be great if you could be notified when S3 RRS loses an object?

You can. AWS’s announcement introduces a way to receive notification – via Simple Notification Service, SNS – when S3 RRS detects that an object has been lost. This means you no longer need your application to check for 405s before serving objects. Instead you can have your application listen for SNS notifications (either via HTTP or via email or via SQS) and proactively process them to restore lost objects.

Okay, it’s not really true that your application no longer needs to check for lost objects. The latency between the actual loss of an object and the time you recreate and replace it is still nonzero, and during that time you probably want your application to behave nicely.

[An aside: I do wonder what the expected latency is between the object’s loss and the SNS notification. I’ve asked on the Forums and in a comment to Jeff Barr’s blog post – I’ll update this article when I have an answer.]

When Does it Make Financial Sense to Use S3 RRS?

While you save on storage costs for using S3 RRS you still need to devote resources to recreating any lost objects. How can you decide when it makes sense to go with RRS despite the need to recreate lost objects?

There are a number of factors that influence the cost of recreating lost derived objects:

Bandwidth to get the source object from S3 and return the derived object to S3. If you perform the processing inside the same EC2 region as the S3 region you’re using then this cost is zero.
CPU to perform the transformation of the source object into the derived object.
S3 requests for GETting the source object and PUTting the derived object.

I’ve prepared a spreadsheet analyzing these costs for various different numbers of objects, sizes of objects, and CPU-hours required for each derived object.

For 100,000 source objects of average 5MB size stored in Standard S3, each of which creates 5 derived objects of average 500KB size stored in RRS and requiring 1 second of CPU time to recreate, the savings in choosing RRS is $12.50 per month. Accounting for the cost of recreating lost derived objects reduces that savings to $12.37.

For the same types of objects but requiring 15 minutes of CPU time to recreate each derived object the net savings overall is $12.28. Still very close to the entire savings generated by using RRS.

For up to about 500,000 source objects it doesn’t pay to launch a dedicated m1.small instance just for the sake of recreating lost RRS objects. An m1.small costs $61.20 per month, which is approximately the same as the net savings from 500,000 source objects of average 5MB size with 5 derived objects each of average size 500KB. At this level of usage, if you have spare capacity on an existing instance then it would make financial sense to run the recreating process there.

For larger objects the savings is also almost the entire amount saved by using RRS, and the amounts saved are larger than the cost of a single m1.small so it already pays to launch your own instance for the processing.

For larger numbers of objects the savings is also almost the entire amount saved by using RRS.

As far down as you go in the spreadsheet, and as much as you may play with the numbers, it makes financial sense to use RRS and have a mechanism to recreate derived objects.

Which leads us to the the brainstorming.

Why Should I Worry About Lost Objects?

Let’s face it, nobody wants to operate a service that is not core to their business. Most likely, creating the derived objects from the source object is not your business core competency. Creating thumbnails and still frame video captures is commodity stuff.

So let’s imagine a service that does the transformation, storage in S3, and maintenance of RRS derived objects for you so you don’t have to.

You’d drop off your source object in your bucket in S3. Then you’d send an SQS message to the service containing the new source object’s key and a list of the transformations you want applied. As Jeff Bar suggests in his blog, the service would process the message and create derived objects (stored in RRS) whose keys (the name) would be composed of the source object’s name and the name of the transformation applied. You’d know how to construct the name of every derived object, so you would know how to access them. The service would subscribe to the RRS SNS notifications and recreate the derived objects when they are lost.

This service would need a way for clients to discover the supported file types and the supported transformations for each file type.

As we pointed out above, there is a lot of potential financial savings in using RRS, so such a service has plenty of margin to price itself profitably, below the cost of standard S3 storage.

What else would such a service need? Please comment.

If you build such a service, please cut me in for 30% for giving you the idea. Or, at least acknowledge me in your blog.