Poking Holes in CloudFront-Based Sites for Dynamic Content

As of February 2011, AWS S3 has been able to serve static websites, giving you superior availability for unchanging (or seldom-changing) content. But most websites today are not static; dynamic elements drive essential features such as personalized pages, targeted advertisements, and shopping carts. Today’s release from AWS, CloudFront Support for Dynamic Content, alleviates some of the challenge of running dynamic websites. You can now configure a custom set of URL patterns to always be passed through to the origin server. This allows you to “poke holes” in the CDN cache through which dynamic content can be served.

Some web sites, such as this one, appear to be static but are driven by dynamic code. WordPress renders each page on every request. Though excellent tools exist to provide caching for WordPress, these tools still require your web server to process WordPress’s PHP scripts. Heavy traffic or poor hosting choices can still overwhelm your web server.

Poking Holes

It’s relatively easy to configure your entire domain to be served from CloudFront. What do you need to think about when you poke holes in a CloudFront distribution? Here are two important items: admin pages and form actions.

Admin pages

The last thing you want is for your site’s control panel to be statically served. You need an accurate picture of the current situation in order to manage your site. In WordPress, this includes everything in the /wp-admin/* path as well as the /wp-login.php page.

Form actions

Your site most likely does something with the information people submit in forms – search with it, store it, or otherwise process it. If not, why collect it? In order to process the submitted information you need to handle it dynamically in your web application, and that means the submit action can’t lead to a static page. Make sure your form submission actions – such as search and feedback links – pass through to the webserver directly.

A great technique for feedback forms is to use WuFoo, where you can visually construct forms and integrate them into your website with a simple snippet of JavaScript. This means that your page can remain static – the JavaScript code dynamically inserts the form, and WuFoo handles the processing, stops the spam, and sends you the results via email.

When Content Isn’t So Dynamic

Sometimes content changes infrequently – for example, your favicon probably changes rarely. Blog posts, once written, seldom change. Serving these items from a CDN is still an effective way to reduce load on your webserver and reduce latency for your users. But when things do change – such as updated images, additional comments, or new posts – how can you use CloudFront to serve the new content? How can you make sure CloudFront works well with your updated content?

Object versioning

A common technique used to enable updating static objects is called object versioning. This means adding a version number to the file name, and updating the link to the file when a new version is available. This technique also allows an entire set of resources to be versioned at once, when you create a versioned directory name to hold the resources.

Object versioning works well with CloudFront. In fact, it is the recommended way to update static resources that change infrequently. The alternative method, invalidating objects, is more expensive and difficult to control.
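
To make this concrete, here is a minimal sketch of a versioned naming scheme; the directory-style “v7/” prefix and the helper method are illustrative, not a standard:

// Illustrative only: put a version in the object key, and update the
// page that references it whenever the resource changes.
public class VersionedKeys {
    // "css/style.css" + version 7 -> "v7/css/style.css"
    // (a versioned directory name versions a whole set of resources at once)
    public static String versionedKey(String version, String resourceName) {
        return "v" + version + "/" + resourceName;
    }

    public static void main(String[] args) {
        System.out.println(versionedKey("7", "css/style.css"));
        // prints: v7/css/style.css
        // When the stylesheet changes, upload it as v8/css/style.css and update
        // the links; CloudFront fetches the new key on its first request.
    }
}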

Combining the Above Techniques

You can use a combination of the above techniques to create a low-latency service that caches sometimes-dynamic content. For example, a WordPress blog could be optimized by integrating these techniques into the WordPress engine, perhaps via a plugin. Here’s what you’d do:

  • Create a CloudFront distribution for the site, setting its custom origin to point to the webserver.
  • Poke holes in the distribution necessary for the admin, login, and forms pages.
  • Create new versions of pages, images, etc. when they change, and new versions of the pages that refer to them.

Even though WordPress generates each page via PHP, this collection of techniques allows the pages to be served via CloudFront and also be updated when changes occur. I don’t know of a plugin that combines all these techniques, but I suspect the good folks at W3-EDGE, producers of the W3 Total Cache performance optimization framework I mentioned above, are already working on it.

S3 Reduced Redundancy Storage with Simple Notification Service: What, Why, and When

AWS recently added support for receiving Simple Notification Service notifications when S3 loses a Reduced Redundancy Storage (RRS) object. This raises a number of questions:

  • What the heck does that even mean?
  • Why would I want to do that?
  • Under what conditions does it make financial sense to do that?

Let’s take a look at these questions, and we’ll also do a bit of brainstorming (please participate!) to design a service that puts it all together.

What is S3 Reduced Redundancy Storage?

Standard objects stored in S3 have “eleven nines” of durability annually. This means 99.999999999% of your objects stored in S3 will still be there after one year. On average, you would need to store 100,000,000,000 – that’s one hundred billion – objects in standard S3 storage before you should expect one of them to disappear over a year’s time. Pretty great.

Reduced Redundancy Storage (RRS) is a different class of S3 storage that, in effect, has a lower durability: 99.99% annually. On average, you will need to store only 10,000 objects in RRS S3 before you should expect one of them to disappear over a year’s time. Not quite as great, but still more than 400 times better than a traditional hard drive.

When an RRS object is lost S3 will return an HTTP 405 response code, and your application is supposed to be built to understand that and take the appropriate action: most likely regenerate the object from its source objects, which have been stored elsewhere more reliably – probably in standard eleven-nines S3. It’s less expensive for AWS to provide a lower durability class of service, and therefore RRS storage is priced accordingly: it’s about 2/3 the cost of standard S3 storage.

RRS is great for derived objects – for example, image thumbnails. The source object – the full-quality image or video – can be used to recreate the derived object – the thumbnail – without losing any information. All it costs to create the derived object is time and CPU power. And that’s most likely why you’re creating the derived objects and storing them in S3: to act as a cache so the app server does not need to spend time and CPU power recreating them for every request. Using S3 RRS as a cache will save you 1/3 of your storage costs for the derived objects, but you’ll need to occasionally recreate a derived object in your application.

How Do You Handle Objects Stored in RRS?

If you serve the derived objects to clients directly from S3 – as many web apps do with their images – your clients will occasionally get an HTTP 405 response code (about once a year for every 10,000 RRS objects stored). The more objects you store, the higher the likelihood of a client’s browser encountering an HTTP 405 error – and most browsers show ugly messages when they get a 405 error. So your application should do some checking.

To get your application to check for a lost object you can do the following: Send S3 an HTTP HEAD request for the object before giving the client its URL. If the object exists then the HEAD request will succeed. If the object is lost the HEAD request will return a 405 error. Once you’re sure the object is in S3 (either the HEAD request succeeded, or you recreated the derived object and stored it again in S3), give the object’s URL to the client.

All that HEAD checking is a lot of overhead: each S3 RRS URL needs to be checked every time it’s served. You can add a cache of the URLs of objects you’ve checked recently and skip those. This will cut down on the overhead and reduce your S3 bill – remember that each HEAD request costs 1/10,000 of a cent – but it’s still a bunch of unnecessary work, because most of the time you check, the object will still be there.
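
Here is a minimal sketch of that HEAD-plus-cache check, using only the JDK’s HttpURLConnection against a public object URL. The class name, the one-hour cache window, and the URL scheme are assumptions for illustration; on a miss your application would still need to recreate and re-upload the derived object.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Sketch: check that an RRS-stored object still exists before handing its URL
// to a client, remembering recent positive answers to avoid repeated HEADs.
public class RrsHeadChecker {
    // naive cache of recently verified URLs (illustrative only)
    private final Map<String, Long> verified =
        Collections.synchronizedMap(new HashMap<String, Long>());
    private static final long CACHE_MILLIS = 60 * 60 * 1000L; // remember for 1 hour

    public boolean objectExists(String url) throws Exception {
        Long checkedAt = verified.get(url);
        if (checkedAt != null && System.currentTimeMillis() - checkedAt < CACHE_MILLIS) {
            return true; // verified recently: skip the HEAD request
        }
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");
        int code = conn.getResponseCode();
        conn.disconnect();
        if (code == 200) {
            verified.put(url, System.currentTimeMillis());
            return true;
        }
        return false; // 405 (or other error): recreate the derived object and re-upload it
    }
}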

Using Simple Notification Service with RRS

Wouldn’t it be great if you could be notified when S3 RRS loses an object?

You can. AWS’s announcement introduces a way to receive notification – via Simple Notification Service, SNS – when S3 RRS detects that an object has been lost. This means you no longer need your application to check for 405s before serving objects. Instead you can have your application listen for SNS notifications (either via HTTP or via email or via SQS) and proactively process them to restore lost objects.

Okay, it’s not really true that your application no longer needs to check for lost objects. The latency between the actual loss of an object and the time you recreate and replace it is still nonzero, and during that time you probably want your application to behave nicely.

[An aside: I do wonder what the expected latency is between the object’s loss and the SNS notification. I’ve asked on the Forums and in a comment to Jeff Barr’s blog post – I’ll update this article when I have an answer.]
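
For the SQS flavor of such a listener, here is a rough sketch using the AWS SDK for Java (Jets3t does not cover SQS). The queue URL is made up and assumed to be subscribed to the SNS topic carrying the loss notifications, and extractLostKey() and recreateDerivedObject() are hypothetical placeholders, since the exact notification payload isn’t shown here.

import java.util.List;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

// Sketch: poll an SQS queue subscribed to the S3 RRS loss-notification topic,
// and recreate each lost derived object as notifications arrive.
public class LostObjectListener {
    public static void main(String[] args) throws Exception {
        String queueUrl = "https://queue.amazonaws.com/123456789012/rrs-lost-objects"; // made up
        AmazonSQSClient sqs = new AmazonSQSClient(
            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        while (true) {
            List<Message> messages =
                sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)).getMessages();
            for (Message m : messages) {
                String lostKey = extractLostKey(m.getBody());   // parse the SNS JSON payload
                recreateDerivedObject(lostKey);                 // regenerate and PUT back to S3
                sqs.deleteMessage(new DeleteMessageRequest(queueUrl, m.getReceiptHandle()));
            }
            Thread.sleep(5000); // simple polling interval
        }
    }

    private static String extractLostKey(String snsMessageBody) { return "..."; } // hypothetical helper
    private static void recreateDerivedObject(String key) { /* application-specific */ }
}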

When Does it Make Financial Sense to Use S3 RRS?

While you save on storage costs for using S3 RRS you still need to devote resources to recreating any lost objects. How can you decide when it makes sense to go with RRS despite the need to recreate lost objects?

There are a number of factors that influence the cost of recreating lost derived objects:

  • Bandwidth to get the source object from S3 and return the derived object to S3. If you perform the processing inside the same EC2 region as the S3 region you’re using then this cost is zero.
  • CPU to perform the transformation of the source object into the derived object.
  • S3 requests for GETting the source object and PUTting the derived object.

I’ve prepared a spreadsheet analyzing these costs for various different numbers of objects, sizes of objects, and CPU-hours required for each derived object.

For 100,000 source objects of average 5MB size stored in Standard S3, each of which creates 5 derived objects of average 500KB size stored in RRS and requiring 1 second of CPU time to recreate, the savings in choosing RRS is $12.50 per month. Accounting for the cost of recreating lost derived objects reduces that savings to $12.37.
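
To see roughly where that number comes from: 100,000 sources × 5 derived objects × 500KB is about 250GB of derived data. At the approximate prices of the time – around $0.15 per GB-month for standard S3 versus $0.10 per GB-month for RRS – storing those 250GB in RRS saves about $0.05 per GB-month, or roughly $12.50 per month, before subtracting the (small) cost of recreating the occasional lost object.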

For the same types of objects but requiring 15 minutes of CPU time to recreate each derived object the net savings overall is $12.28. Still very close to the entire savings generated by using RRS.

For up to about 500,000 source objects it doesn’t pay to launch a dedicated m1.small instance just for the sake of recreating lost RRS objects. An m1.small costs $61.20 per month, which is approximately the same as the net savings from 500,000 source objects of average 5MB size with 5 derived objects each of average size 500KB. At this level of usage, if you have spare capacity on an existing instance then it would make financial sense to run the recreating process there.

For larger objects the savings are likewise almost the entire amount saved by using RRS, and because the amounts saved exceed the cost of a single m1.small, it already pays to launch your own instance for the processing.

For larger numbers of objects the savings are also almost the entire amount saved by using RRS.

However far down you go in the spreadsheet, and however much you play with the numbers, it makes financial sense to use RRS and have a mechanism to recreate lost derived objects.

Which leads us to the brainstorming.

Why Should I Worry About Lost Objects?

Let’s face it: nobody wants to operate a service that is not core to their business. Most likely, creating the derived objects from the source object is not your core business competency. Creating thumbnails and still-frame video captures is commodity stuff.

So let’s imagine a service that does the transformation, storage in S3, and maintenance of RRS derived objects for you so you don’t have to.

You’d drop off your source object in your bucket in S3. Then you’d send an SQS message to the service containing the new source object’s key and a list of the transformations you want applied. As Jeff Barr suggests in his blog post, the service would process the message and create derived objects (stored in RRS) whose keys (names) would be composed of the source object’s name and the name of the transformation applied. You’d know how to construct the name of every derived object, so you would know how to access them. The service would subscribe to the RRS SNS notifications and recreate the derived objects when they are lost.
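
A possible naming convention might look like the following sketch; the “__” separator and the transformation name are made up, the point is only that every derived key is computable from the source key:

// Illustrative naming scheme: derived key = source key + "__" + transformation.
public class DerivedKeys {
    public static String derivedKey(String sourceKey, String transformation) {
        return sourceKey + "__" + transformation;
    }

    public static void main(String[] args) {
        String source = "photos/2010/vacation.jpg";
        System.out.println(derivedKey(source, "thumb-100x100"));
        // prints: photos/2010/vacation.jpg__thumb-100x100
        // The client never needs to ask the service where the thumbnail lives.
    }
}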

This service would need a way for clients to discover the supported file types and the supported transformations for each file type.

As we pointed out above, there is a lot of potential financial savings in using RRS, so such a service has plenty of margin to price itself profitably, below the cost of standard S3 storage.

What else would such a service need? Please comment.

If you build such a service, please cut me in for 30% for giving you the idea. Or, at least acknowledge me in your blog.

How I Moved 5% of All Objects in S3 with Jets3t

This is a true story about a lot of data. The cast of characters is as follows:

The Protagonist: Me.

The Hero: Jets3t, a Java library for using Amazon S3.

The Villain: Decisions made long ago, for forgotten reasons.

Innocent Bystanders: My client.

Once Upon a Time…

Amazon S3 is a great place to store media files and allows these files to be served directly from S3, instead of from your web server, thereby saving your server’s network and CPU for more important tasks. There are some gotchas with serving files directly from S3, and it is these gotchas that had my client locked in to paying for bandwidth and CPU to serve media files directly from their own web server.

You see, a few years ago when my client first created their S3 bucket, they named it media.example.com. Public objects in that bucket could be accessed via the URL http://s3.amazonaws.com/media.example.com/objectKey or via the Virtual Host style URL http://media.example.com.s3.amazonaws.com/objectKey. If you’re just serving images via HTTP then this can work for you. But you might have a good reason to convince the browser that all the media is being served from your domain media.example.com (for example, when using Flash, which requires an appropriately configured crossdomain.xml). Using a URL that lives at s3.amazonaws.com or a subdomain of that host will not suffice for these situations.

Luckily, S3 lets you set up your DNS in a special manner, convincing the world that the same object lives at the URL http://media.example.com/objectKey. All you need to do is to set up a DNS CNAME alias pointing media.example.com to media.example.com.s3.amazonaws.com. The request will be routed to S3, which will look at the HTTP Host header and discover the bucket name media.example.com.

So what’s the problem? That’s all great for a bucket whose name works in DNS. But it won’t work for a bucket with a name like Bucket.example.com, because DNS host names are case-insensitive while S3 bucket names are not; there are limitations on the name of a bucket if you want to use the DNS alias. This is where we reveal a secret: the bucket was not really named media.example.com. For some reason nobody remembers, the bucket was named Media.example.com – with a capital letter, which is not allowed in a bucket name that is to be addressed via DNS. This makes all the difference in the world, because S3 cannot serve this bucket via the Virtual Host method – you get a NoSuchBucket error when you try to access http://Media.example.com.s3.amazonaws.com/objectKey (equivalent to http://Media.example.com/objectKey with the appropriate DNS CNAME in place).

As a workaround, my client developed an application that dynamically loaded the media onto the server and served it directly from there. This server served media.example.com, and it would essentially do the following for each requested file:

  1. Do we already have this objectKey on our local filesystem? If yes, go to step 3.
  2. Fetch the object from S3 via http://s3.amazonaws.com/Media.example.com/objectKey and save it to the local filesystem.
  3. Serve the file from the local filesystem.

This workaround allowed the client to release URLs that looked correct, but it required a separate server for the job, costing extra time (on a cache miss) and money (to operate the server).

The challenge? To remove the need for this caching server and allow the URLs to be served directly from S3 via media.example.com.

Just Move the Objects, Right?

It might seem obvious: Why not simply move the objects to a correctly-named bucket? Turns out that’s not quite so simple to do in this case.

Obviously, if I was dealing with a few hundred, thousand, or even tens of thousands of objects, I could use a GUI tool such as CloudBerry Explorer or the S3Fox Organizer Firefox Extension. But this client is a popular web site, and has been storing media in the bucket for a few years already. They had 5 billion objects in the bucket (which is 5% of the total number of objects in S3). These tools crashed upon viewing the bucket. So, no GUI for you.

S3 is a hosted object storage system. Why not just use its MOVE command (via the API) to move the objects from the wrong bucket to the correctly-named bucket? Well, it turns out that S3 has no MOVE command.

Thankfully, S3 has a COPY command which allows you to copy an object on the server side, without downloading the object’s contents and uploading them again to the new location. Using some creative programming you can put together a COPY and a DELETE (only if the COPY succeeded!) to simulate a MOVE. I tried using the boto Python library, but it choked on manipulating any object in the bucket named Media.example.com – even though it’s a legal name, it’s just not recommended – so I couldn’t use this tool. The Java-based Jets3t library was able to handle this unfortunate bucket name just fine, and it also provides a convenience method to move objects via COPY and DELETE. The objects in this bucket are immutable, so we don’t need to worry about consistency.

So I’m all set with Jets3t.

Or so I thought.

First Attempt: Make a List

My first attempt was to:

  1. List all the objects in the bucket and put them in a database.
  2. Run many client programs that requested the “next” object key from the database and deleted the entry from the database when it was successfully moved to the correctly-named bucket.

This approach would provide a way to make sure all the objects were moved successfully.

Unfortunately, listing so many objects took too long – I allowed a process to list the bucket’s contents for a full 24 hours before killing it. I don’t know how far it got, but I didn’t feel like waiting around for it to finish dumping its output to a file, then waiting some more to import the list into a database.

Second Attempt: Make a Smaller List

I thought about the metadata I had: The objects in the bucket all had object keys with a particular structure:

/binNumber/oneObjectKey

binNumber was a number from 0 to 4.5 million, and each binNumber served as the prefix for approximately 1200 objects (which works out to about 5.4 billion objects total in the bucket). The names of these objects were essentially random letters and numbers after the binNumber/ component. S3 has a “list objects with this prefix” method. Using this method you can get a list of object keys that begin with a specific prefix – which is perfect for my needs, since it will return a list of very manageable size.

So I coded up something quick in Java using Jets3t. Here’s the initial code snippet:

// Imports (not shown in the original): java.util.Map, java.util.concurrent.*,
// org.jets3t.service.*, org.jets3t.service.impl.rest.httpclient.RestS3Service,
// org.jets3t.service.model.*, org.jets3t.service.acl.AccessControlList,
// org.jets3t.service.security.AWSCredentials
public class MoveObjects {
    private static final String AWS_ACCESS_KEY_ID = .... ;
    private static final String AWS_SECRET_ACCESS_KEY = .... ;
    private static final String SOURCE_BUCKET_NAME = "Media.example.com";
    private static final String DEST_BUCKET_NAME = "media.example.com";

    public static void main(String[] args) throws Exception {
        AWSCredentials awsCredentials = new AWSCredentials(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY);
        S3Service restService = new RestS3Service(awsCredentials);
        S3Bucket sourceBucket = restService.getBucket(SOURCE_BUCKET_NAME);
        final String delimiter = "/";
        String[] prefixes = new String[...];
        for (int i = 0; i < prefixes.length; ++i) {
            // fill the list of binNumbers from the command-line args (not shown)
            prefixes[i] = String.valueOf(...);
        }
        ExecutorService tPool = Executors.newFixedThreadPool(32);
        long delay = 50;
        for (String prefix : prefixes) {
            // list all the objects under this binNumber prefix
            S3Object[] sourceObjects = restService.listObjects(sourceBucket, prefix + delimiter, delimiter);
            if (sourceObjects != null && sourceObjects.length > 0) {
                System.out.println(" At key " + sourceObjects[0].getKey() + ", this many: " + sourceObjects.length);
                for (int i = 0; i < sourceObjects.length; ++i) {
                    final S3Object sourceObject = sourceObjects[i];
                    final String sourceObjectKey = sourceObject.getKey();
                    sourceObject.setAcl(AccessControlList.REST_CANNED_PUBLIC_READ);
                    Mover mover = new Mover(restService, sourceObject, sourceObjectKey);
                    // enqueue the move; back off and retry if the pool rejects it
                    while (true) {
                        try {
                            tPool.execute(mover);
                            delay = 50;
                            break;
                        } catch (RejectedExecutionException r) {
                            System.out.println("Queue full: waiting " + delay + " ms");
                            Thread.sleep(delay); // backoff and retry
                            delay += 50;
                        }
                    }
                }
            }
        }
        tPool.shutdown();
        tPool.awaitTermination(360000, TimeUnit.SECONDS);
        System.out.println(" Completed!");
    }

    // moves a single object via COPY-then-DELETE (Jets3t's moveObject convenience method)
    private static class Mover implements Runnable {
        final S3Service restService;
        final S3Object sourceObject;
        final String sourceObjectKey;

        Mover(final S3Service restService, final S3Object sourceObject, final String sourceObjectKey) {
            this.restService = restService;
            this.sourceObject = sourceObject;
            this.sourceObjectKey = sourceObjectKey;
        }

        public void run() {
            Map moveResult = null;
            try {
                moveResult = restService.moveObject(SOURCE_BUCKET_NAME, sourceObjectKey, DEST_BUCKET_NAME, sourceObject, false);
                if (moveResult.containsKey("DeleteException")) {
                    System.out.println("Error: " + sourceObjectKey);
                }
            } catch (S3ServiceException e) {
                System.out.println("Error: " + sourceObjectKey + " EXCEPTION: " + e.getMessage());
            }
        }
    }
}

The code uses an Executor to control a pool of threads, each of which is given a single object to move, encapsulated in a Mover. All objects with a given prefix (binNumber) are listed and then added to the Executor’s pool to be moved. The initial setup of Jets3t with the credentials and the building of the array of prefixes are not shown.

We need to be concerned that the thread pool will fill up faster than we can handle the operations we’re enqueueing, so we have backoff-and-retry logic in that code. But, notice we don’t care if a particular object’s move operation fails. This is because we will run the same program again a second time, after covering all the binNumber prefixes, to catch any objects that have been left behind (and a third time, too – until no more objects are left in the source bucket).

I ran this code on an EC2 m1.xlarge instance in two simultaneous processes, each of which was given half of the binNumber prefixes to work with. I settled on 32 threads in the thread pool after a few experiments showed this number ran the fastest. I made sure to set the proper number of underlying HTTP connections for Jets3t to use, with these arguments: -Ds3service.max-thread-count=32 -Dhttpclient.max-connections=60 . Things were going well for a few hours.

Third Attempt: Make it More Robust

After a few hours I noticed that the rate of progress was slowing. I didn’t have exact numbers, but I saw that things were just taking longer in minute 350 than they had taken in minute 10. I could have taken on the challenge of debugging long-running, multithreaded code. Or I could hack in a workaround.

The workaround I chose was to force the program to terminate every hour, and to restart itself. I added the following code to the main method:

    // exit every hour
    Timer t = new Timer(true);
    TimerTask tt = new TimerTask() {
    	public void run() {
    		System.out.println("Killing myself!");
    		System.exit(42);
    	}
    };
    final long dieMillis = 3600 * 1000;
    t.schedule(tt, dieMillis);

And I wrapped the program in a “forever” wrapper script:

#! /bin/bash

while true; do
	DATE=`date`
	echo $DATE: $0: launching $*
	$* 2>&1
done

This script is invoked as follows:

ARGS=... ./forever.sh nohup java -Ds3service.max-thread-count=32 -Dhttpclient.max-connections=60 -classpath bin/:lib/jets3t-0.7.2.jar:lib/commons-logging-1.1.1.jar:lib/commons-httpclient-3.1.jar:lib/commons-codec-1.3.jar com.orchestratus.s3.MoveObjects $ARGS >> nohup.out 2>&1 &

Whenever the Java program terminates, the forever wrapper script re-launches it with the same arguments. This works properly because the only objects left in the source bucket are those that have not yet been moved. Eventually this ran to completion: the program would start, check all its binNumber prefixes, find nothing, exit, restart, find nothing, exit, restart, etc.

The whole process took 5 days to completely move all objects to the new bucket. Then I gave my client the privilege of deleting the Media.example.com bucket.

Lessons Learned

Here are some important lessons I learned and reinforced through this project.

Use the metadata to your benefit

Sometimes the only thing you know about a problem is its shape, not its actual contents. In this case I knew the general structure of the object keys, and this was enough to go on even if I couldn’t discover every object key a priori. This is a key principle when working with large amounts of data: the metadata is your friend.

Robustness is a feature

It took a few iterations until I got to a point where things were running consistently fast. And it took some advanced planning to design a scheme that would gracefully tolerate failure to move some objects. But without these features I would have had to manually intervene when problems arose. Don’t let intermittent failure delay a long-running process.

Sometimes it doesn’t pay to debug

I used an ugly workaround – forcing the process to restart every hour – instead of debugging the actual underlying problem causing it to gradually slow down. For this one-off code, written for my specific circumstances, I decided that was a more effective approach than getting bogged down making it correct. Brute force ran fast enough, so the code didn’t need to be truly fixed.

Repeatability

I’ve been thinking about how someone would repeat my experiments and discover improvements to the techniques I employed. We could probably get by without actually copying and deleting the objects; instead we could perform two successive calls – perhaps to fetch different metadata headers. We’d need some public S3 bucket with many millions of objects in it to make a comparable test case. And we’d need an S3 account whose owner is willing to let users play in it.

Any takers?

Read-After-Write Consistency in Amazon S3

S3 has an “eventual consistency” model, which presents certain limitations on how S3 can be used. Today, Amazon released an improvement called “read-after-write consistency” in the EU and US-west regions (it’s there, hidden at the bottom of the blog post). Here’s an explanation of what this is, and why it’s cool.

What is Eventual Consistency?

Consistency is a key concept in data storage: it describes when changes committed to a system are visible to all participants. Classic transactional databases employ various levels of consistency, but the gold standard is that once a transaction commits, the changes are guaranteed to be visible to all participants. A change committed at millisecond 1 is guaranteed to be available to all views of the system – all queries – immediately thereafter.

Eventual consistency relaxes the rules a bit, allowing a time lag between the point the data is committed to storage and the point where it is visible to all others. A change committed at millisecond 1 might be visible to all immediately. It might not be visible to all until millisecond 500. It might not even be visible to all until millisecond 1000. But, eventually it will be visible to all clients. Eventual consistency is a key engineering tradeoff employed in building distributed systems.

One issue with eventual consistency is that there’s no theoretical limit to how long you need to wait until all clients see the committed data. A delay must be employed (either explicitly or implicitly) to ensure the changes will be visible to all clients.

Practically speaking, I’ve observed that changes committed to S3 become visible to all within less than 2 seconds. If your distributed system reads data shortly after it was written to eventually consistent storage (such as S3) you’ll experience higher latency as a result of the compensating delays.
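
As an illustration, here is the kind of compensating read-with-retry you end up writing against an eventually consistent bucket; this sketch uses Jets3t, and the retry count and delays are arbitrary:

import org.jets3t.service.S3Service;
import org.jets3t.service.S3ServiceException;
import org.jets3t.service.model.S3Bucket;
import org.jets3t.service.model.S3Object;

// Sketch: read an object that may have been written only moments ago.
// Under eventual consistency the first GET may fail, so back off and retry.
public class ConsistentRead {
    public static S3Object readWithRetry(S3Service s3, S3Bucket bucket, String key)
            throws Exception {
        long delay = 250; // ms
        for (int attempt = 0; attempt < 5; ++attempt) {
            try {
                return s3.getObject(bucket, key);
            } catch (S3ServiceException notVisibleYet) {
                Thread.sleep(delay); // compensating delay
                delay *= 2;
            }
        }
        throw new RuntimeException("Object " + key + " still not visible");
    }
}

With read-after-write consistency, a GET of a newly created key succeeds on the first attempt, and the loop collapses into a single call.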

What is Read-After-Write Consistency?

Read-after-write consistency tightens things up a bit, guaranteeing immediate visibility of new data to all clients. With read-after-write consistency, a newly created object or file or table row will immediately be visible, without any delays.

Note that read-after-write is not complete consistency: there’s also read-after-update and read-after-delete. Read-after-update consistency would allow edits to an existing file or changes to an already-existing object or updates of an existing table row to be immediately visible to all clients. That’s not the same thing as read-after-write, which is only for new data. Read-after-delete would guarantee that reading a deleted object or file or table row will fail for all clients, immediately. That, too, is different from read-after-write, which only relates to the creation of data.

Why is Read-After-Write Consistency Useful?

Read-after-write consistency allows you to build distributed systems with less latency. As touched on above, without read-after-write consistency you’ll need to incorporate some kind of delay to ensure that the data you just wrote will be visible to the other parts of your system.

But no longer. If you use S3 in the US-west or EU regions (or other regions supporting read-after-write consistency), your systems need not wait for the data to become available.

Update March 2011: As more S3 regions come online they seem to be getting the same features as US-West. So far the AP-Singapore and AP-Tokyo regions also support Read-After-Write consistency. US Standard does not.

Update June 2012: As pointed out in the comments below, more S3 regions now support read-after-write consistency: US-West Oregon, SA-São Paulo, and AP-Tokyo. It’s not easy keeping up with the pace of AWS’s updates!

Why Only in the AWS US-West and EU Regions, and Not in US Standard?

Read-after-write consistency for AWS S3 was initially available only in the US-West and EU regions, not the US Standard region. I asked Jeff Barr of AWS blogging fame why, and his answer makes a lot of sense:

This is a feature for EU and US-West. US Standard is bi-coastal and doesn’t have read-after-write consistency.

Aha! I had forgotten about the way Amazon defines its S3 regions. US Standard has servers on both the east and west coasts (remember, this is S3, not EC2) in the same logical “region”. The engineering challenges in providing read-after-write consistency are greatly magnified when the geographical area expands from a single coast to the whole continent. The fundamental physical limitation is the speed of light, which takes at least 16 milliseconds to cross the US coast-to-coast (that’s in a vacuum – it takes at least four times as long over the internet due to the latency introduced by routers and switches along the way).

If you use S3 and want to take advantage of the read-after-write consistency, make sure you understand the cost implications: some other regions have higher storage and bandwidth costs than the US-Standard region.

Next Up: SQS Improvements?

Some vague theorizing:

It’s been suggested that AWS Simple Queue Service leverages S3 under the hood. The improved S3 consistency model could be used to provide better consistency for SQS as well. Is this in the works? Jeff Barr, any comment? 🙂

Amazon S3 Gotcha: Using Virtual Host URLs with HTTPS

Amazon S3 is a great place to store static content for your web site. If the content is sensitive you’ll want to prevent the content from being visible while in transit from the S3 servers to the client. The standard way to secure the content during transfer is by https – simply request the content via an https URL. However, this approach has a problem: it does not work for content in S3 buckets that are accessed via a virtual host URL. Here is an examination of the problem and a workaround.

Accessing S3 Buckets via Virtual Host URLs

S3 provides two ways to access your content. One way uses s3.amazonaws.com host name URLs, such as this:

http://s3.amazonaws.com/mybucket.mydomain.com/myObjectKey

The other way to access your S3 content uses a virtual host name in the URL:

http://mybucket.mydomain.com.s3.amazonaws.com/myObjectKey

Both of these URLs map to the same object in S3.

You can make the virtual host name URL shorter by setting up a DNS CNAME that maps mybucket.mydomain.com to mybucket.mydomain.com.s3.amazonaws.com. With this DNS CNAME alias in place, the above URL can also be written as follows:

http://mybucket.mydomain.com/myObjectKey

This shorter virtual host name URL works only if you set up the DNS CNAME alias for the bucket.

Virtual host names in S3 are a convenient feature because they allow you to hide the actual location of the content from the end-user: you can provide the URL http://mybucket.mydomain.com/myObjectKey and then freely change the DNS entry for mybucket.mydomain.com (to point to an actual server, perhaps) without changing the application. With the CNAME alias pointing to mybucket.mydomain.com.s3.amazonaws.com, end-users do not know that the content is actually being served from S3. Without the DNS CNAME alias you’ll need to explicitly use one of the URLs that contain s3.amazonaws.com in the host name.

The Problem with Accessing S3 via https URLs

https encrypts the transferred data and prevents it from being recovered by anyone other than the client and the server. Thus, it is the natural choice for applications where protecting the content in transit is important. However, https relies on internet host names for verifying the identity certificate of the server, and so it is very sensitive to the host name specified in the URL.

To illustrate this more clearly, consider the servers at s3.amazonaws.com. They all have a certificate issued to *.s3.amazonaws.com. [“Huh?” you say. Yes, the SSL certificate for a site specifies the host name that the certificate represents. Part of the handshaking that sets up the secure connection ensures that the host name of the certificate matches the host name in the request. The * indicates a wildcard certificate, and means that the certificate is also valid for any single-level subdomain.] If you request the https URL https://s3.amazonaws.com/mybucket.mydomain.com/myObjectKey, then the certificate’s host name matches the requested URL’s host name component, and the secure connection can be established.

If you request an object in a bucket without any periods in its name via a virtual host https URL, things also work fine. The requested URL can be https://aSimpleBucketName.s3.amazonaws.com/myObjectKey. This request will arrive at an S3 server (whose certificate was issued to *.s3.amazonaws.com), which will notice that the URL’s host name is indeed a subdomain of s3.amazonaws.com, and the secure connection will succeed.

However, if you request the virtual host URL https://mybucket.mydomain.com.s3.amazonaws.com/myObjectKey, what happens? The host name component of the URL is mybucket.mydomain.com.s3.amazonaws.com, but the actual server that gets the request is an S3 server whose certificate was issued to *.s3.amazonaws.com. Is mybucket.mydomain.com.s3.amazonaws.com a subdomain of s3.amazonaws.com? It depends who you ask, but most up-to-date browsers and SSL implementations will say “no.” A multi-level subdomain – that is, one where the portion matching the * itself contains a period – is not considered to be a proper subdomain by recent Firefox, Internet Explorer, Java, and wget clients. So the client will report that the server’s SSL certificate, issued to *.s3.amazonaws.com, does not match the host name of the request, mybucket.mydomain.com.s3.amazonaws.com, and refuse to establish a secure connection.
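
To make the matching rule concrete, here is a deliberately simplified sketch of the comparison such clients perform (real implementations follow RFC 2818 and are stricter; this is illustration only):

// Simplified illustration of wildcard certificate matching: the "*" may match
// exactly one DNS label, so a host with extra dots in that position fails.
public class WildcardMatch {
    public static boolean matches(String certName, String hostName) {
        if (!certName.startsWith("*.")) {
            return certName.equalsIgnoreCase(hostName);
        }
        String suffix = certName.substring(1);               // ".s3.amazonaws.com"
        if (!hostName.toLowerCase().endsWith(suffix)) {
            return false;
        }
        String matchedByStar = hostName.substring(0, hostName.length() - suffix.length());
        return !matchedByStar.contains(".");                 // must be a single label
    }

    public static void main(String[] args) {
        System.out.println(matches("*.s3.amazonaws.com", "simplebucket.s3.amazonaws.com"));          // true
        System.out.println(matches("*.s3.amazonaws.com", "mybucket.mydomain.com.s3.amazonaws.com")); // false
        System.out.println(matches("*.s3.amazonaws.com", "mybucket.mydomain.com"));                  // false
    }
}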

The same problem occurs when you request the virtual host https URL https://mybucket.mydomain.com/myObjectKey. The request arrives – after the client discovers that mybucket.mydomain.com is a DNS CNAME alias for mybucket.mydomain.com.s3.amazonaws.com – at an S3 server with an SSL certificate issued to *.s3.amazonaws.com. In this case the host name mybucket.mydomain.com clearly does not match the host name on the certificate, so the secure connection again fails.

Here is what a failed certificate check looked like in Firefox 3.5 when requesting https://images.mydrifts.com.s3.amazonaws.com/someContent.txt: the browser refused the connection and displayed its “This Connection is Untrusted” warning page.

Here is what happens in Java:

javax.net.ssl.SSLHandshakeException: java.security.cert.CertificateException: No subject alternative DNS name matching media.mydrifts.com.s3.amazonaws.com found.
at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Alerts.java:174)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1591)
at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:187)
at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Handshaker.java:181)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:975)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:123)
at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Handshaker.java:516)
at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Handshaker.java:454)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:884)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1096)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1123)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1107)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:405)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:166)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:977)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:373)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:318)
Caused by: java.security.cert.CertificateException: No subject alternative DNS name matching media.mydrifts.com.s3.amazonaws.com found.
at sun.security.util.HostnameChecker.matchDNS(HostnameChecker.java:193)
at sun.security.util.HostnameChecker.match(HostnameChecker.java:77)
at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:264)
at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:250)
at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:954)

And here is what happens in wget:

$ wget -nv https://media.mydrifts.com.s3.amazonaws.com/someContent.txt
ERROR: Certificate verification error for media.mydrifts.com.s3.amazonaws.com: unable to get local issuer certificate
ERROR: certificate common name `*.s3.amazonaws.com' doesn't match requested host name `media.mydrifts.com.s3.amazonaws.com'.
To connect to media.mydrifts.com.s3.amazonaws.com insecurely, use `--no-check-certificate'.
Unable to establish SSL connection.

Requesting the https URL using the DNS CNAME images.mydrifts.com results in the same errors, with the messages saying that the certificate *.s3.amazonaws.com does not match the requested host name images.mydrifts.com.

Notice that the browsers and wget clients offer a way to circumvent the mismatched SSL certificate. You could, theoretically, ask your users to add an exception to the browser’s security settings. However, most web users are scared off by a “This Connection is Untrusted” message, and will turn away when confronted with that screen.

How to Access S3 via https URLs

As pointed out above, there are two forms of S3 URLs that work with https:

https://s3.amazonaws.com/mybucket.mydomain.com/myObjectKey

and this:

https://simplebucketname.s3.amazonaws.com/myObjectKey

So, in order to get https to work seamlessly with your S3 buckets, you need to do one of the following (see the sketch after the list):

  • choose a bucket whose name contains no periods and use the virtual host URL, such as https://simplebucketname.s3.amazonaws.com/myObjectKey or
  • use the URL form that specifies the bucket name separately, after the host name, like this: https://s3.amazonaws.com/mybucket.mydomain.com/myObjectKey.
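
Here is a minimal sketch of that decision in code; the bucket and key names are placeholders:

// Sketch: build an https URL for an S3 object, falling back to the
// path-style form when the bucket name contains periods, so that the
// *.s3.amazonaws.com certificate always matches the host name.
public class S3HttpsUrl {
    public static String httpsUrl(String bucket, String key) {
        if (!bucket.contains(".")) {
            return "https://" + bucket + ".s3.amazonaws.com/" + key;   // virtual host style
        }
        return "https://s3.amazonaws.com/" + bucket + "/" + key;       // path style
    }

    public static void main(String[] args) {
        System.out.println(httpsUrl("simplebucketname", "myObjectKey"));
        System.out.println(httpsUrl("mybucket.mydomain.com", "myObjectKey"));
    }
}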

Update 25 Aug 2009: For buckets created via the CreateBucketConfiguration API call, the only option is to use the virtual host URL. This is documented in the S3 docs here.