Cloud Developer Tips

Recapture Unused EC2 Minutes

How much time is “wasted” in the paid-for but unused portion of the hour when you terminate an instance? How can you recapture this time – which represents compute power – and put it to good use? After all, you’ve paid for it already. This article presents a technique for repurposing an instance after you’re “done” with it, until the current billing hour is up. It’s inspired by a tweet from DEVOPS_BORAT:

We have new startup CloudJanitor. We recycle old or unuse cloud instance. Need only your cloud account login!

To clarify, we’re talking about per-hour pricing in public cloud IaaS services, where partial hours consumed are billed as whole hours. AWS EC2 is the most prominent example of a cloud with this pricing policy (search the EC2 pricing page for “partial”). Under this policy, terminating (or stopping) an instance after it has been running for 121 minutes results in a usage charge for three hours, “wasting” an extra 59 minutes that you have paid for but not used.
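That whole-hour rounding can be sketched in a couple of lines of shell arithmetic, using the 121-minute example above:

```shell
# whole-hour billing: a 121-minute run is billed as 3 full hours
usedMins=121
billedHours=$(( (usedMins + 59) / 60 ))        # ceiling division: 3
wastedMins=$(( billedHours * 60 - usedMins ))  # 59 paid-for but unused minutes
echo "$billedHours hours billed, $wastedMins minutes wasted"
```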

What’s Involved

You might think it’s easy to repurpose an instance: just Stop it (if it’s EBS-backed), change its root volume to a new one, and Start the instance again. Not so fast: Stopping an EC2 instance immediately ends the current billing hour before you can use it all, and when you Start the instance again a new billing hour begins – so we can’t Stop the instance. We also can’t Terminate the instance – that would also immediately curtail the billing hour and prevent us from utilizing it. Instead, we’re going to reboot the instance, which does not affect the billing.

We’ll need an EBS volume that has a bootable distro on it – let’s call this the “beneficiary” volume, because it’s going to benefit from the extra time on the clock. The beneficiary volume should have the same distro as the “normal” root volume has. [Actually, to be more precise, it need only have a distro that works with the same kernel that the instance is currently running.] I’ve tested this technique with Ubuntu 10.04 Lucid and 10.10 Maverick.

One of the great things about the Ubuntu images is how easy it is to play this root volume switcheroo: these distros boot from any volume that has the label uec-rootfs. To change the root volume we’ll change the volume labels, so a different volume is used as the root filesystem upon reboot.

It’s very important to disassociate the instance from all external hooks, such as Auto Scaling triggers and Elastic Load Balancers, before you repurpose it. Otherwise the beneficiary workload will influence those no-longer-relevant systems. However, this may not be possible if you use hooks that cannot be decoupled from the instance, such as a CloudWatch Dimension of ImageId, InstanceId, or InstanceType.

The network I/O incurred during the recaptured time may be subject to additional charges. In EC2, only communications between instances in the same availability zone, or between EC2 and S3 in the same region, are free of charge.

You’ll need to make sure the beneficiary workload only accepts communications on ports that are open in the normal instance’s security groups. It’s not possible to add or remove security groups while an instance is running. You also wouldn’t want to be modifying the security groups dynamically because that will influence all instances in those security groups – and you may have other instances that are still performing their normal workload.

The really cool thing about this technique is that it can be used on both EBS-backed and instance-store instances. However, you’ll need to prepare separate beneficiary volumes (or snapshots) for 32-bit and 64-bit instances.

How to Do it

There are three stages in repurposing an instance:

  1. Preparing the beneficiary volume (or snapshot).
  2. Preparing the normal workload image.
  3. Actually repurposing the instance.

Stages 1 and 2 are only performed once. Stage 3 is performed for every instance you want to repurpose.

Preparing the beneficiary snapshot

First we’re going to prepare the beneficiary snapshot. Beginning with a pristine Ubuntu 10.10 Maverick EBS-based instance (at the time of publishing this article that’s ami-ccf405a5 for 32-bit instances), let’s create a clone of the root filesystem:

ec2-run-instances ami-ccf405a5 -k my-keypair -t m1.small -g default

ec2-describe-instances $instanceId # use the instanceId output by the previous command

Wait for the instance to be “running”. Once it is, identify the volumeId of the root volume – it will be indicated in the ec2-describe-instances output, the one attached to device /dev/sda1.
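If you’re scripting this step, the volumeId can be extracted with a little awk. This is a sketch: the sample line mimics the API tools’ whitespace-separated BLOCKDEVICE output format, and vol-1234abcd is a made-up ID:

```shell
# pull the root volume's ID out of the describe output
describeOutput='BLOCKDEVICE  /dev/sda1  vol-1234abcd  2011-02-01T10:00:00+0000'
# on a real instance: describeOutput=$(ec2-describe-instances $instanceId)
volumeId=$(echo "$describeOutput" | awk '$1 == "BLOCKDEVICE" && $2 == "/dev/sda1" { print $3 }')
echo $volumeId   # vol-1234abcd
```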

At this point you have a running Ubuntu 10.10 instance. For real-world usage you’ll want to customize this instance by installing the beneficiary workload and arranging for it to automatically start up on boot. (I recommend Folding@home as a worthy beneficiary project.)

Now we create the beneficiary snapshot:

ec2-create-snapshot $volumeId #use the volumeId from the previous command

And now we have the beneficiary snapshot.

Preparing the normal workload image

Begin with the same base AMI that you used for the beneficiary snapshot. Launch it and customize it to contain your normal workload stuff. You’ll also need to put in a custom script that will perform the repurposing. Here’s what that script will do:

  1. Determine how much time is left on the clock in the current billing hour. If it’s not enough time to prepare and to reboot into the beneficiary volume, just force ourselves to shut down.
  2. Disassociate any external hooks the instance might participate in: remove it from ELBs, force it to fail any Auto-Scaling health checks, and make sure it’s not handling “normal” workloads anymore.
  3. Attach the beneficiary volume to the instance.
  4. Change the volume labels so the beneficiary volume will become the root filesystem at the next reboot.
  5. Edit the startup scripts on the beneficiary volume to start a self-destruct timer.
  6. Reboot.

The following script performs steps 1, 4, 5, and 6, and clearly indicates where you should perform steps 2 and 3.

#! /bin/bash
# reboot into the attached EBS volume on the specified device, but terminate
# before this billing hour is complete.
# requires the first argument to be the device on which the EBS volume is attached

device=$1
mountPoint=/tmp/beneficiaryRoot
safetyMarginMinutes=1 # set this to how long it takes to attach and reboot

# make sure we have at least "safetyMargin" minutes left this hour.
# any meta-data path works here: the file's Last-Modified timestamp reflects the launch time
t=$(mktemp)
if wget -q -O $t http://169.254.169.254/latest/meta-data/local-ipv4 ; then
	# add 60 seconds artificially as a safety margin
	let runningSecs=$(( `date +%s` - `date -r $t +%s` ))+60
	rm -f $t
	let runningSecsThisHour=$runningSecs%3600
	let runningMinsThisHour=$runningSecsThisHour/60
	let leftMins=60-$runningMinsThisHour-$safetyMarginMinutes
	# start shutdown one minute earlier than actually required
	let shutdownDelayMins=$leftMins-1
	if [[ $shutdownDelayMins -lt 2 || $shutdownDelayMins -gt 59 ]]; then
		echo "Not enough time left this hour. Shutting down now."
		shutdown -h now
		exit 0
	fi
else
	rm -f $t
	echo "Failed to determine remaining minutes in this billable hour. Shutting down now."
	shutdown -h now
	exit 1
fi

## here is where you would disassociate this instance from ELBs,
# force it to fail AutoScaling health checks, and otherwise make sure
# it does not participate in "normal" activities.

## here is where you would attach the beneficiary volume to $device
# ec2-create-volume --snapshot snap-00000000 -z this_availability_zone
# don't forget to wait until the volume is "available"

# ec2-attach-volume . . . and don't forget to wait until the volume is "attached"

## (optionally) force the beneficiary volume to be deleted when this instance terminates:
# ec2-modify-instance-attribute --block-device-mapping "$device=::true" this_instance_id

## get the beneficiary volume ready to be rebooted into
# change the filesystem labels
e2label /dev/sda1 old-uec-rootfs
e2label $device uec-rootfs
# mount the beneficiary volume
mkdir -m 000 $mountPoint
mount $device $mountPoint
# install the self-destruct timer
sed -i -e "s/^exit 0$/shutdown -h +$shutdownDelayMins\nexit 0/" $mountPoint/etc/rc.local
# neutralize the self-destruct for subsequent boots
sed -i -e "s#^exit 0#chmod -x /etc/rc.local\nexit 0#" $mountPoint/etc/rc.local
# get out
umount $mountPoint
rmdir $mountPoint

# do the deed
shutdown -r now
exit 0

Save this script into the instance you’re preparing for the normal workload (perhaps, as the root user, into /root/) and chmod it to 744.

Now, make your application detect when its normal workload is completed – this exact method will be very specific to your application. Add in a hook there to invoke this script as the root user, passing it the device on which the beneficiary volume will be attached. For example, the following command will cause the instance to repurpose itself to a volume attached on /dev/sdp:

sudo /root/ /dev/sdp

Once all this is set up, use the usual EC2 AMI creation methods to create your normal workload image (either as an instance-store AMI or as an EBS-backed AMI).

Actually repurposing the instance

Now that everything is prepared, this is the easy part. Your normal workload image can be launched. When it is finished, the repurposing script will be invoked and the instance will be rebooted into the beneficiary volume. The repurposed instance will self-destruct before the billing hour is complete.

You can force this repurposing to happen by explicitly invoking the command at an SSH prompt on the instance:

sudo /root/ /dev/sdp

Notice that you will be immediately kicked out of your SSH session – either the instance will reboot or the instance will terminate itself because there isn’t enough time left in the current billable hour. If it’s just a reboot (which happens when there is significant time left in the current billing hour) then be aware: the SSH host key will most likely be different on the repurposed instance than it was originally, and you may need to clean up your local ~/.ssh/known_hosts file, removing the entry for the instance, before you can SSH in again.

Cloud Developer Tips

Play “Chicken” with Spot Instances

AWS Spot Instances have an interesting economic characteristic that makes it possible to game the system a little. Like all EC2 instances, when you initiate termination of a Spot Instance then you incur a charge for the entire hour, even if you’ve used less than a full hour. But, when AWS terminates the instance due to the spot price exceeding the bid price, you do not pay for the current hour.

What if your Spot Instance could wait, after finishing its work, to see if AWS will terminate it involuntarily in this hour and avoid the hour’s cost? In the worst case, your instance can kill itself in the last few minutes of the hour and you will not have incurred any extra unplanned cost. In the best case, the spot price will rise above the instance’s bid price before the hour is up, AWS will terminate the instance involuntarily, and you will not be charged for that entire hour. Wouldn’t this technique reduce costs, especially when performed at large scale?

I call this technique Playing Chicken, after the game of that name, because it exhibits the same basic dynamics:

  • Whoever “swerves” (terminates) first, loses (pays for the hour)
  • If nobody “swerves” (terminates), then an undesirable situation occurs (the instance remains running)

How to Play Chicken

Playing Chicken is really as simple as running a script on the instance when you’re done with the work. Here’s such a script:

#! /bin/bash
# any meta-data path works here: the file's Last-Modified timestamp reflects the launch time
t=$(mktemp)
if wget -q -O $t http://169.254.169.254/latest/meta-data/local-ipv4 ; then
	# add 60 seconds artificially as a safety margin
	let runningSecs=$(( `date +%s` - `date -r $t +%s` ))+60
	rm -f $t
	let runningSecsThisHour=$runningSecs%3600
	let runningMinsThisHour=$runningSecsThisHour/60
	let leftMins=60-$runningMinsThisHour
	# start shutdown one minute earlier than actually required
	let shutdownDelayMins=$leftMins-1
	if [[ $shutdownDelayMins -gt 1 && $shutdownDelayMins -lt 60 ]]; then
		echo "Shutting down in $shutdownDelayMins mins."
		# TODO: Notify off-instance listener that the game of chicken has begun
		sudo shutdown -h +$shutdownDelayMins
	else
		echo "Shutting down now."
		sudo shutdown -h now
	fi
	exit 0
fi
rm -f $t
echo "Failed to determine remaining minutes in this billable hour. Terminating now."
sudo shutdown -h now
exit 1

This script uses the technique published by Dmitriy Samovskiy to determine the launch time of the current instance without using the EC2 API, relying on the instance meta-data instead. We include a safety margin of about two minutes: the elapsed-time calculation conservatively adds one minute, and the shutdown sequence begins one minute earlier than strictly required.

You would run this script on the instance when the Spot Instance is done with its work instead of terminating the instance immediately. You also can add a hook at the indicated place to notify an off-instance listener that the game of chicken has begun, to allow you to track the savings delivered by this technique.

Warning: Make sure you really understand what this script does before you use it. If you mistakenly schedule an instance to be shut down you can cancel it with this command, run on the instance: sudo shutdown -c

How Much is Saved by Playing Chicken?

The extent to which you can benefit from playing chicken depends on a number of factors:

  • The difference between the spot price and your instance’s bid price. The further away the spot price is from your bid, the less likely it is that the spot price will hit the bid and save you money.
  • The volatility of the spot price. The more volatile the spot price, the more likely it will hit the bid and save you money.
  • The number of Spot Instances you terminate in a given period of time. If you normally don’t terminate any Spot Instances then you won’t save anything; if you terminate many then you can potentially save an hour’s worth of cost for each of them.
  • The EC2 Region and instance type. The actual spot price varies by region and instance type, so the potential savings depends on these factors as well.
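As a crude starting point for such a model, here’s a back-of-envelope sketch. The probability, price, and volume numbers are pure assumptions, not measurements:

```shell
# expected monthly savings = P(bid is hit while waiting) x hourly cost x terminations
awk 'BEGIN {
  pHit = 0.05          # assumed chance the spot price hits your bid during the wait
  hourlyCost = 0.085   # assumed spot price paid, $/hour
  terminations = 1000  # assumed spot instances terminated per month
  printf "expected savings: $%.2f/month\n", pHit * hourlyCost * terminations
}'
```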

I’m looking for help to work out a model that can describe the potential savings. If you are interested and able to help with the financial math, please get in touch.

Update: Hat tip to Simon Wardley who pointed out the site CloudExchange that shows great visualizations of the spot prices by region and instance type. This may help you formulate a bidding strategy.

Cloud Developer Tips, The Business of IT

CloudConnect 2011 Platforms and Ecosystems BOF

This March 7-10 2011 I will be in Santa Clara for the CloudConnect conference. There are many reasons you should go too, and I’d like to highlight one session that you should not miss: the Platforms and Ecosystems BOF, on Tuesday March 8 at 6:00 – 7:30 PM. Read on for a detailed description of this BOF session and why it promises to be worthwhile.

[Full disclosure: I’m the track chair for the Design Patterns track and I’m running the Platforms and Ecosystems BOF. The event organizers are sponsoring my hotel for the conference, and like all conference speakers my admission to the event is covered.]

CloudConnect 2011 promises to be a high-quality conference, as last year’s was. This year you will be able to learn all about design patterns for cloud applications in the Design Patterns track I’m leading. You’ll also be able to hear from an all-star lineup about many aspects of using cloud: cloud economics, cloud security, culture, risks, and governance, data and storage, devops and automation, performance and monitoring, and private clouds.

But I’m most looking forward to the Platforms and Ecosystems BOF because the format of the event promises to foster great discussions.

The BOF Format

The BOF will be conducted as a… well, it’s hard to describe in words only, so here is a picture:

BOF Overview

At three fixed points around the outside of the room will be three topics: Public IaaS, Public PaaS, and Private Cloud Stacks. There will also be three themes which rotate around the room at each interval; these themes are: Workload Portability, Monitoring & Control, and Avoiding Vendor Lock-in. At any one time there will be three discussions taking place, one in each corner, focusing on the particular combination of that theme and topic.

Here is an example of the first set of discussions:

BOF Session 1

And here is the second set of discussions, which will take place after one “turn” of the inner “wheel”:

BOF Session 2

And here is the final set of discussions, after the second turn of the inner wheel:

BOF Session 3

In all, nine discussions are conducted. Here is a single matrix to summarize:

BOF Summary Matrix

Anatomy of a Discussion

What makes a discussion worthwhile? Interesting questions, focus, and varied opinions. The discussions in the BOF will have all of these elements.

Interesting Questions

The questions above are just “seeder” questions. They are representative questions related to the intersection of each topic and theme, but they are by no means the only questions. Do you have questions you’d like to see discussed? Please, leave a comment below. Better yet, attend the BOF and raise them there, in the appropriate discussion.


Focus

Nothing sucks more than a pointless tangent or a single person monopolizing the floor. Each discussion will be shepherded by a capable moderator, who will keep things focused on the subject area and encourage everyone’s participation.

Varied Opinions

Topic experts and theme experts will participate in every discussion.

Vendors will be present as well – but the moderators will make sure they do not abuse the forum.

And interested parties – such as you! – will be there too. In unconference style, the audience will drive the discussion.

These three elements all together provide the necessary ingredients to make sure things stay interesting.

At the Center

What, you may ask, is at the center of the room? I’m glad you asked. There’ll be beer and refreshments at the center. This is also where you can conduct spin-off discussions if you like.

The Platforms and Ecosystems BOF at CloudConnect is the perfect place to bring your cloud insights, case studies, anecdotes, and questions. It’s a great forum to discuss the ideas you’ve picked up during the day at the conference, in a less formal atmosphere where deep discussion is on the agenda.

Here’s a discount code for 25% off registration: CNXFCC03. I hope to see you there.

Cloud Developer Tips

Lessons Learned from Using Multiple Cloud APIs

Adrian Cole, author of the jClouds library, has an excellent writeup of the trajectory the library’s development followed as it added support for more cloud providers’ APIs.

[Update August 2011: Adrian’s blog has since been taken down, so that link no longer works.]

Some important takeaways for application developers:

  • Your unit tests are as valuable as your code. The tests ensure the code works to spec, and they should be used as frequently as possible during development.
  • Make your code easy to test, with sensible defaults that require no external dependencies: e.g. don’t require internet connectivity.
  • Cloud limitations, both general (such as eventual consistency) and cloud-specific (such as a limit on the number of buckets per S3 account), will require careful consideration in your code (and tests).
  • Some things can only really be tested against a live cloud service. As Adrian points out, the only way to test that an instance launched with the desired customizations is to ssh in to that instance and explore it from the inside. This is not testable using an offline stub emulator.

But the key lesson developers can learn is: Whenever possible, use an existing library to interface with your clouds. As Adrian’s post makes patently clear, a lot of effort goes into ensuring the library works properly with the various supported APIs, and you can only benefit by leveraging those accomplishments.

On the other side of the fence, API developers can also learn from Adrian’s article. As William Vambenepe recently commented:

Rather than spending hours obsessing about the finer points of your API, spend the time writing love letters to [boto author] Mitch and Adrian so they support you in their libraries.

In fact, Adrian’s post can be viewed as a TODO list for API creators who want to encourage adoption.

API authors should also refer to Steve Loughran’s Cloud Tools Manifesto for more great ideas on how to make life easy for developers.

Cloud Developer Tips

AWS Auto-Scaling and ELB with Reliable Root Domain Handling

Update May 2011: Now that AWS Route 53 can be used to allow an ELB to host a domain zone apex, the technique described here is no longer necessary. Cool, but not necessary.

Someone really has to implement this. I’ve had this draft sitting around ever since AWS announced support for improved CloudWatch alerts and AutoScaling policies (August 2010), but I haven’t yet turned it into a clear set of commands to follow. If you do, please comment.


You want an auto-scaled, load-balanced pool of web servers to host your site at your root domain. Unfortunately it’s not so simple, because AWS Elastic Load Balancer can’t be used to host a domain apex (AKA a root domain). One of the longest threads on the AWS Developer Forum discusses this limitation: because ELB utilizes DNS CNAMEs, which are not legal for root domain entries, ELB does not support root domains.

An often-suggested workaround is to use an instance with an Elastic IP address to host the root domain, via standard static DNS, with the web server redirecting all root domain requests to the subdomain (www) served by the ELB. There are four drawbacks to this approach:

  1. The instance with the Elastic IP address is liable to be terminated by auto-scaling, leaving requests to the root domain unanswered.
  2. The instance with the Elastic IP address might fail unnaturally, again leaving requests to the root domain unanswered.
  3. Even when traffic is very low, we need at least two instances running: the one handling the root domain outside the auto-scaled ELB group (due to issue #1) and the one inside the auto-scaled ELB group (to handle the actual traffic hitting the ELB-managed subdomain).
  4. The redirect adds additional latency to requests hitting the root domain.
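That root-domain redirect is typically just a couple of lines of web server configuration; for example, with Apache (the domain names are placeholders):

```apache
# root-domain vhost: send everything to the ELB-hosted subdomain
<VirtualHost *:80>
    Redirect permanent /
</VirtualHost>
```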

While we can’t do anything about the fourth issue, what follows is a technique to handle the first three issues.

The Idea

The idea is built on these principles:

  • The instance with the Elastic IP is outside the auto-scaled group so it will not be terminated by auto-scaling.
  • The instance with the Elastic IP is managed using AWS tools to ensure the root domain service is automatically recovered if the instance dies unexpectedly.
  • The auto-scaling group can scale back to zero size, so only a single instance is required to serve low traffic volumes.

How do we put these together?

Here’s how:

  1. Create an AMI for your web server. The AMI will need some special boot-time hooks, which are described below. The web server should be set up to redirect root domain traffic to the subdomain that you’ll want to associate with the ELB, and to serve the subdomain normally.
  2. Create an ELB for the site’s subdomain with a meaningful Health Check (e.g. a URL that exercises representative areas of the application).
  3. Create an AutoScaling group with min=1 and max=1 instances of that AMI. This AutoScaling group will benefit from the default health checks that such groups have, and if EC2 reports the instance is degraded it will be replaced. The LaunchConfiguration for this AutoScaling group should specify user-data that indicates this instance is the “root domain” instance. Upon booting, the instance will notice this flag in the user data, associate the Elastic IP address with itself, and add itself to the ELB.
    Note: At this point, we have a reliably-hosted single instance hosting the root domain and the subdomain.
  4. Create a second AutoScaling group (the “ELB AutoScaling group”) that uses the same AMI, with min=0 instances – the max can be anything you want it to be – and set it up to use the ELB’s Health Check. The LaunchConfiguration for this group should not contain the above-mentioned special flag – these are not root domain instances.
  5. Create an Alarm that looks at the CPUUtilization across all instances of the AMI, and connect it to the “scale up” and “scale down” Policies for the ELB AutoScaling group.

That is the basic idea. The result will be:

  • The root domain is hosted on an instance that redirects to the ELB subdomain. This instance is managed by a standalone Auto Scaling group that will replace the instance if it becomes degraded. This instance is also a member of the ELB, so it serves the subdomain traffic as well.
  • A second AutoScaling group manages the “overflow” traffic, measured by the CPUUtilization of all the running instances of the AMI.


Here are the missing pieces:

  1. A script that can be run as a boot-time hook that checks the user-data for a special flag. When this flag is detected, the script associates the root domain’s Elastic IP address (which should be specified in the user-data) and adds the instance to the ELB (whose name is also specified in the user-data). This will likely require AWS Credentials to be placed on the instance – perhaps in the user-data itself (be sure you understand the security implications of this) as well as a library such as boto or the AWS SDK to perform the AWS API calls.
  2. The explicit step-by-step instructions for carrying out steps 1 through 5 above using the relevant AWS command-line tools.

Do you have these missing pieces? If so, please comment.
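To get the ball rolling on missing piece #1, here’s a hedged sketch of the user-data handling. The ROOT_DOMAIN_INSTANCE, ELASTIC_IP, and ELB_NAME field names (and the example values) are my own invention, and the commented-out calls assume the classic EC2/ELB command-line tools are installed and credentialed:

```shell
# parse the repurposing flags out of the instance's user-data
parse_user_data() {
    local userData="$1"
    echo "$userData" | grep -q '^ROOT_DOMAIN_INSTANCE=true$' || return 1
    eip=$(echo "$userData" | sed -n 's/^ELASTIC_IP=//p')
    elbName=$(echo "$userData" | sed -n 's/^ELB_NAME=//p')
}

# on a real instance the user-data comes from the meta-data service:
# userData=$(wget -q -O - http://169.254.169.254/latest/user-data)
userData='ROOT_DOMAIN_INSTANCE=true
ELB_NAME=my-load-balancer'

if parse_user_data "$userData"; then
    echo "claiming $eip and joining $elbName"
    # instanceId=$(wget -q -O - http://169.254.169.254/latest/meta-data/instance-id)
    # ec2-associate-address $eip -i $instanceId
    # elb-register-instances-with-lb $elbName --instances $instanceId
fi
```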

Cloud Developer Tips

Using Elastic Beanstalk via command-line on a Mac? Keep that OS X Install DVD handy

The title pretty much says it all.

Elastic Beanstalk is the new service from Amazon Web Services offering you easier deployment of Java WAR files. More languages and platforms are expected to be supported in the future.

Most people will use the service via the convenient web console, but if you want to automate things you’ll either end up using the command-line tools (CLI tools) or the API in the Java SDK (until their other SDKs add Beanstalk support).

But, if you’re running on a Mac, you’ll have a problem running the command-line tools:

Shlomos-MacBook-Pro:ec2 shlomo$ elastic-beanstalk-describe-applications
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- json (LoadError)
from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rubygems/custom_require.rb:31:in `require'
from /Users/shlomo/ec2/elasticbeanstalk/bin/../lib/aws/client/awsqueryhandler.rb:2
from /Users/shlomo/ec2/elasticbeanstalk/bin/../lib/aws/client/awsquery.rb:4:in `require'
from /Users/shlomo/ec2/elasticbeanstalk/bin/../lib/aws/client/awsquery.rb:4
from /Users/shlomo/ec2/elasticbeanstalk/bin/../lib/aws/elasticbeanstalk.rb:19:in `require'
from /Users/shlomo/ec2/elasticbeanstalk/bin/../lib/aws/elasticbeanstalk.rb:19
from /Users/shlomo/ec2/elasticbeanstalk/bin/setup.rb:18:in `require'
from /Users/shlomo/ec2/elasticbeanstalk/bin/setup.rb:18
from /Users/shlomo/ec2/elasticbeanstalk/bin/elastic-beanstalk-describe-applications:18:in `require'
from /Users/shlomo/ec2/elasticbeanstalk/bin/elastic-beanstalk-describe-applications:18

If you remember from the README (which you read, of course 😉), there was some vague mention of this:

If you're using Ruby 1.8, you will have to install the JSON gem:
gem install json

OK, let’s try that:

Shlomos-MacBook-Pro:ec2 shlomo$ sudo gem install json
Building native extensions. This could take a while...
ERROR: Error installing json:
ERROR: Failed to build gem native extension.

/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby extconf.rb
mkmf.rb can't find header files for ruby at /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/ruby.h

Gem files will remain installed in /Library/Ruby/Gems/1.8/gems/json-1.4.6 for inspection.
Results logged to /Library/Ruby/Gems/1.8/gems/json-1.4.6/ext/json/ext/generator/gem_make.out

The first complaint is that the Ruby gem tools are not in the path. Let’s fix that and try again:

Shlomos-MacBook-Pro:ec2 shlomo$ export PATH=$PATH:/Users/shlomo/.gem/ruby/1.8/bin
Shlomos-MacBook-Pro:ec2 shlomo$ sudo gem install json
Building native extensions. This could take a while...
ERROR: Error installing json:
ERROR: Failed to build gem native extension.

/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/bin/ruby extconf.rb
mkmf.rb can't find header files for ruby at /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/ruby.h

Gem files will remain installed in /Library/Ruby/Gems/1.8/gems/json-1.4.6 for inspection.
Results logged to /Library/Ruby/Gems/1.8/gems/json-1.4.6/ext/json/ext/generator/gem_make.out

Uh oh, no dice. What now? StackOverflow to the rescue:

The ruby headers don’t come installed with the base ruby install with Mac OS X. These can be found on Mac OS X Install Disc 2 by installing the XCode Tools.

If you’re like me and you don’t carry around the OS X Install DVD wherever you go, you’re stuck.
Any readers in Seoul with an OS X 10.6 Install DVD?

Update 20 Jan 2011: I’ve gotten some comments that made me realize the title really didn’t say it all. Some clarifications are in order.

Some people pointed out that I should just download the XCode .dmg DVD image – all 3.4 GB of it. Unfortunately that wasn’t applicable for me at the time: I was connected via 3G, tethered to my Android phone. I’ve never tried to download 3.4 GB on that connection, and I don’t plan to try today: it would be expensive.

See the great comment below by Beltran (who works for Bitnami) about a great solution.

Cloud Developer Tips

Using AWS Route 53 to Keep Track of EC2 Instances

This article is a guest post by Guy Rosen, CEO of Onavo and author of the Jack of All Clouds blog. Guy was one of the first people to produce hard numbers on cloud adoption for site hosting, and he continues to publish regular updates to this research in his State of the Cloud series. These days he runs his startup Onavo which uses the cloud to offer smartphone users a way to slash overpriced data roaming costs.

In this article, Guy provides another technique to track changes to your dynamic cloud services automatically, now possible because AWS has released Route 53, its DNS service. Take it away, Guy.

While one of the greatest things about EC2 is the way you can spin up, stop and start instances to your heart’s desire, things get sticky when it comes to actually connecting to an instance. When an instance boots (or comes up after being in the Stopped state), Amazon assigns a pair of unique IPs (and DNS names) that you can use to connect: a private IP used when connecting from another machine in EC2, and a public IP used to connect from the outside. The thing is, when you start and stop dozens of machines daily you lose track of these constantly changing IPs. How many of you have found, like me, that each time you want to connect to a machine (or hook up a pair of machines that need to communicate with each other, such as a web and database server) you find yourself going back to your EC2 console to copy and paste the IP?

This morning I got fed up with this, and since Amazon launched their new Route 53 service I figured the time was ripe to make things right. Here’s what I came up with: a (really) small script that takes your EC2 instance list and plugs it into DNS. You can then refer to your machines not by their IP but by their instance ID (which is preserved across stops and starts of EBS-backed instances) or by a user-readable tag you assign to a machine (such as “webserver”).

Here’s what you do:

  1. Sign up to Amazon Route 53.
  2. Download and install cli53 (follow the instructions to download the latest Boto and dnspython)
  3. Set up a domain/subdomain you want to use for the mapping (e.g.,
    1. Set it up on Route53 using cli53:
      ./ create
    2. Use your domain provider’s interface to set Amazon’s DNS servers (reported in the response to the create command)
    3. Run the following script (replace any details and paths, emphasized in bold, with your own):

      #!/bin/tcsh -f
      set root=`dirname $0`
      setenv EC2_HOME /usr/local/ec2-api-tools
      setenv EC2_CERT $root/ec2_x509_cert.pem
      setenv EC2_PRIVATE_KEY $root/ec2_x509_private.pem
      setenv AWS_ACCESS_KEY_ID myawsaccesskeyid
      setenv AWS_SECRET_ACCESS_KEY mysecretaccesskey

      $EC2_HOME/bin/ec2-describe-instances | \
      perl -ne '/^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)/ \
      and do { $dns = $2; print "$1 $dns\n" }; /^TAG.+\sShortName\s+(\S+)/ \
      and print "$1 $dns\n"' | \
      perl -ane 'print "$F[0] CNAME $F[1] --replace\n"' | \
      xargs -n 4 $root/cli53/ \
      rrcreate -x 60

Voila! You now have DNS names that point to your instances. To make things more helpful, if you add a tag called ShortName to your instances it will be picked up, letting you create user-readable names as well. The script creates CNAME records, which means that you will automatically get internal EC2 IPs when querying from inside EC2 and public IPs from the outside.
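For readers allergic to perl, here is a rough Python sketch of the same parsing step – turning `ec2-describe-instances` text output into (name, DNS) pairs for CNAME records. The field positions assumed below match the classic tab-separated API-tools output, but treat them as assumptions and check against your own output:

```python
import re

def cname_records(describe_output):
    """Parse ec2-describe-instances text output into (name, dns) pairs.

    Emits one pair per instance ID, plus one per ShortName tag,
    mirroring the perl pipeline above."""
    records = []
    dns = None
    for line in describe_output.splitlines():
        m = re.match(r'^INSTANCE\s+(i-\S+).*?(\S+\.amazonaws\.com)', line)
        if m:
            dns = m.group(2)
            records.append((m.group(1), dns))
            continue
        m = re.match(r'^TAG.+\sShortName\s+(\S+)', line)
        if m and dns:
            # Tag lines follow their instance line, so reuse the last DNS seen
            records.append((m.group(1), dns))
    return records
```

Each pair would then be fed to cli53's rrcreate just as the xargs stage of the one-liner does.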

Put this script somewhere, run it in a cron – and you’ll have an auto-updating DNS zone for your EC2 servers.

Short disclaimer: the script above is a horrendous one-liner that makes many assumptions and only roughly works. It works for me, but no guarantees.

Cloud Developer Tips

S3 Reduced Redundancy Storage with Simple Notification Service: What, Why, and When

AWS recently added support for receiving Simple Notification Service notifications when S3 loses a Reduced Redundancy Storage (RRS) object. This raises a number of questions:

  • What the heck does that even mean?
  • Why would I want to do that?
  • Under what conditions does it make financial sense to do that?

Let’s take a look at these questions, and we’ll also do a bit of brainstorming (please participate!) to design a service that puts it all together.

What is S3 Reduced Redundancy Storage?

Standard objects stored in S3 have “eleven nines” of durability annually. This means 99.999999999% of your objects stored in S3 will still be there after one year. You would need to store 100,000,000,000 – that’s one hundred billion – objects in standard S3 storage before you should expect, on average, to have one of them disappear over a year’s time. Pretty great.

Reduced Redundancy Storage (RRS) is a different class of S3 storage that, in effect, has a lower durability: 99.99% annually. On average, you will need to store only 10,000 objects in RRS S3 before you should expect one of them to disappear over a year’s time. Not quite as great, but still more than 400 times better than a traditional hard drive.
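These durability figures translate directly into expected annual losses, and the arithmetic is worth sanity-checking:

```python
def expected_annual_losses(num_objects, durability):
    """Expected number of objects lost per year at a given annual durability."""
    return num_objects * (1 - durability)

# Standard S3: eleven nines -- about one expected loss per hundred billion objects.
standard = expected_annual_losses(100_000_000_000, 0.99999999999)

# RRS: four nines -- about one expected loss per ten thousand objects.
rrs = expected_annual_losses(10_000, 0.9999)
```

Both come out to roughly one lost object per year, which is where the “one in a hundred billion” and “one in ten thousand” figures above come from.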

When an RRS object is lost, S3 will return an HTTP 405 response code, and your application is supposed to be built to understand that and take the appropriate action: most likely regenerate the object from its source objects, which have been stored elsewhere more reliably – probably in standard eleven-nines S3. It’s less expensive for AWS to provide a lower-durability class of service, and therefore RRS storage is priced accordingly: it’s about 2/3 the cost of standard S3 storage.

RRS is great for derived objects – for example, image thumbnails. The source object – the full-quality image or video – can be used to recreate the derived object – the thumbnail – without losing any information. All it costs to create the derived object is time and CPU power. And that’s most likely why you’re creating the derived objects and storing them in S3: to act as a cache so the app server does not need to spend time and CPU power recreating them for every request. Using S3 RRS as a cache will save you 1/3 of your storage costs for the derived objects, but you’ll need to occasionally recreate a derived object in your application.

How Do You Handle Objects Stored in RRS?

If you serve the derived objects to clients directly from S3 – as many web apps do with their images – your clients will occasionally get an HTTP 405 response code (about once a year for every 10,000 RRS objects stored). The more objects you store, the higher the likelihood of a client’s browser encountering an HTTP 405 error – and most browsers show ugly messages when they get one. So your application should do some checking.

To get your application to check for a lost object you can do the following: Send S3 an HTTP HEAD request for the object before giving the client its URL. If the object exists then the HEAD request will succeed. If the object is lost the HEAD request will return a 405 error. Once you’re sure the object is in S3 (either the HEAD request succeeded, or you recreated the derived object and stored it again in S3), give the object’s URL to the client.

All that HEAD checking is a lot of overhead: each S3 RRS URL needs to be checked every time it’s served. You can add a cache of the URLs of objects you’ve checked recently and skip those. This will cut down on the overhead and reduce your S3 bill – remember that each HEAD request costs 1/10,000 of a cent – but it’s still a bunch of unnecessary work, because most of the time you check, the object will still be there.
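The check-before-serve flow, including the recently-checked cache, might be sketched like this. The `head_object`, `recreate` and `url_for` helpers are hypothetical stand-ins for your S3 client and your thumbnail-generation code, not real library calls:

```python
import time

CACHE_TTL = 3600  # seconds to trust a previous successful HEAD check
_checked = {}     # key -> timestamp of last successful check

def safe_url(key, head_object, recreate, url_for):
    """Return a servable URL for an RRS object, recreating it if S3 lost it.

    head_object(key) -> True if the object exists (HEAD succeeded), False on 405.
    recreate(key)    -> regenerates the derived object and PUTs it back to S3.
    url_for(key)     -> the object's public URL.
    """
    now = time.time()
    last = _checked.get(key)
    if last is None or now - last > CACHE_TTL:
        if not head_object(key):   # lost: S3 answered with a 405
            recreate(key)
        _checked[key] = now
    return url_for(key)
```

Tuning CACHE_TTL trades HEAD-request cost against how long a client might see a stale URL for a lost object.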

Using Simple Notification Service with RRS

Wouldn’t it be great if you could be notified when S3 RRS loses an object?

You can. AWS’s announcement introduces a way to receive notification – via Simple Notification Service, SNS – when S3 RRS detects that an object has been lost. This means you no longer need your application to check for 405s before serving objects. Instead you can have your application listen for SNS notifications (either via HTTP or via email or via SQS) and proactively process them to restore lost objects.

Okay, it’s not really true that your application no longer needs to check for lost objects. The latency between the actual loss of an object and the time you recreate and replace it is still nonzero, and during that time you probably want your application to behave nicely.
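A listener for those notifications could look roughly like this. The payload field names (Event, Bucket, Key) and the event name are assumptions based on AWS’s announced S3 event format – verify them against the actual schema before relying on this:

```python
import json

def handle_sns_message(raw_message, recreate):
    """Process an SNS notification about a lost RRS object.

    Assumed payload: a JSON document naming the event, the bucket,
    and the lost object's key (field names are assumptions)."""
    msg = json.loads(raw_message)
    if msg.get("Event") != "s3:ReducedRedundancyLostObject":
        return False  # not a lost-object notification; ignore it
    recreate(msg["Bucket"], msg["Key"])
    return True
```

The same handler works whether the notification arrives via an HTTP endpoint or is pulled off an SQS queue subscribed to the topic.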

[An aside: I do wonder what the expected latency is between the object’s loss and the SNS notification. I’ve asked on the Forums and in a comment to Jeff Barr’s blog post – I’ll update this article when I have an answer.]

When Does it Make Financial Sense to Use S3 RRS?

While you save on storage costs for using S3 RRS you still need to devote resources to recreating any lost objects. How can you decide when it makes sense to go with RRS despite the need to recreate lost objects?

There are a number of factors that influence the cost of recreating lost derived objects:

  • Bandwidth to get the source object from S3 and return the derived object to S3. If you perform the processing inside the same EC2 region as the S3 region you’re using then this cost is zero.
  • CPU to perform the transformation of the source object into the derived object.
  • S3 requests for GETting the source object and PUTting the derived object.

I’ve prepared a spreadsheet analyzing these costs for various different numbers of objects, sizes of objects, and CPU-hours required for each derived object.

For 100,000 source objects of average 5MB size stored in Standard S3, each of which creates 5 derived objects of average 500KB size stored in RRS and requiring 1 second of CPU time to recreate, the savings in choosing RRS is $12.50 per month. Accounting for the cost of recreating lost derived objects reduces that savings to $12.37.
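The gross savings figure is easy to sanity-check. The prices below are the 2010-era rates of $0.15/GB-month for Standard S3 and $0.10/GB-month for RRS (treat them as assumptions – check current pricing), and 500KB is taken as 0.0005 GB, as the spreadsheet does:

```python
STANDARD_PER_GB = 0.15  # 2010-era Standard S3 price, $/GB-month (assumption)
RRS_PER_GB = 0.10       # 2010-era RRS price, $/GB-month (assumption)

def monthly_rrs_savings(num_derived, avg_size_gb):
    """Gross monthly savings from storing derived objects in RRS vs Standard."""
    total_gb = num_derived * avg_size_gb
    return total_gb * (STANDARD_PER_GB - RRS_PER_GB)

# 100,000 sources x 5 derived objects of 500KB (0.0005 GB) each:
savings = monthly_rrs_savings(100_000 * 5, 0.0005)
```

That yields the $12.50/month gross figure; subtracting the expected cost of recreating the handful of objects lost each month gives the $12.37 net.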

For the same types of objects but requiring 15 minutes of CPU time to recreate each derived object the net savings overall is $12.28. Still very close to the entire savings generated by using RRS.

For up to about 500,000 source objects it doesn’t pay to launch a dedicated m1.small instance just for the sake of recreating lost RRS objects. An m1.small costs $61.20 per month, which is approximately the same as the net savings from 500,000 source objects of average 5MB size with 5 derived objects each of average size 500KB. At this level of usage, if you have spare capacity on an existing instance then it would make financial sense to run the recreating process there.

For larger objects the savings is also almost the entire amount saved by using RRS, and the amounts saved are larger than the cost of a single m1.small so it already pays to launch your own instance for the processing.

For larger numbers of objects the savings is also almost the entire amount saved by using RRS.

As far down as you go in the spreadsheet, and as much as you may play with the numbers, it makes financial sense to use RRS and have a mechanism to recreate derived objects.

Which leads us to the brainstorming.

Why Should I Worry About Lost Objects?

Let’s face it, nobody wants to operate a service that is not core to their business. Most likely, creating the derived objects from the source object is not your business’s core competency. Creating thumbnails and still-frame video captures is commodity stuff.

So let’s imagine a service that does the transformation, storage in S3, and maintenance of RRS derived objects for you so you don’t have to.

You’d drop off your source object in your bucket in S3. Then you’d send an SQS message to the service containing the new source object’s key and a list of the transformations you want applied. As Jeff Barr suggests in his blog, the service would process the message and create derived objects (stored in RRS) whose keys (names) would be composed of the source object’s name and the name of the transformation applied. You’d know how to construct the name of every derived object, so you would know how to access them. The service would subscribe to the RRS SNS notifications and recreate the derived objects when they are lost.
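The key-naming convention might be sketched like this – the separator and scheme are illustrative, not part of any actual service:

```python
def derived_key(source_key, transformation):
    """Compose a derived object's S3 key from its source key and the
    transformation name, so clients can reconstruct it without a lookup."""
    return "%s.%s" % (source_key, transformation)

def source_of(derived):
    """Recover the source key and transformation name from a derived key."""
    source, _, transformation = derived.rpartition(".")
    return source, transformation
```

Because the mapping is deterministic in both directions, the service can recreate a lost derived object from an SNS notification carrying only the derived key.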

This service would need a way for clients to discover the supported file types and the supported transformations for each file type.

As we pointed out above, there is a lot of potential financial savings in using RRS, so such a service has plenty of margin to price itself profitably, below the cost of standard S3 storage.

What else would such a service need? Please comment.

If you build such a service, please cut me in for 30% for giving you the idea. Or, at least acknowledge me in your blog.

Cloud Developer Tips

Storing AWS Credentials on an EBS Snapshot Securely

Thanks to reader Ewout and his comment on my article How to Keep Your AWS Credentials on an EC2 Instance Securely for suggesting an additional method of transferring credentials: via a snapshot. It’s similar to burning credentials into an AMI, but easier to do and less prone to accidental inclusion in the application’s AMI.

Read on for a discussion of how to implement this technique.

How to Store AWS Credentials on an EBS Snapshot

This is how to store a secret on an EBS snapshot. You do this only once, or whenever you need to change the secret.

We’re going to automate as much as possible to make it easy to do. Here’s the command that launches an instance with a newly created 1GB EBS volume, formats it, mounts it, and sets up the root user to be accessible via ssh and scp. The new EBS volume created will not be deleted when the instance is terminated.

$ ec2-run-instances -b /dev/sdf=:1:false -t m1.small -k \
my-keypair -g default ami-6743ae0e -d '#! /bin/bash
yes | mkfs.ext3 /dev/sdf
mkdir -m 000 /secretVol
mount -t ext3 -o noatime /dev/sdf /secretVol
cp /home/ubuntu/.ssh/authorized_keys /root/.ssh/'

We set up the root user to be accessible via ssh and scp so that we can store the secrets on the EBS volume by copying them directly to the volume as root. Here’s how we do that:

$ ls -l
total 24
-r--r--r-- 1 shlomo  shlomo  916 Jun 20  2010 cert-NT63JNE4VSDEMH6VHLHBGHWV3DRFDECP.pem
-r--------  1 shlomo  shlomo   90 Jun  1  2010 creds
-r-------- 1 shlomo  shlomo  926 Jun 20  2010 pk-NT63JNE4VSDEMH6VHLHBGHWV3DRFDECP.pem
$ scp -i /path/to/id_rsa-my-keypair * root@

Our secret is now on the EBS volume, visible only to the root user.

We’re almost done. Of course you want to test that your application can access the secret appropriately. Once you’ve done that you can terminate the instance – don’t worry, the volume will not be deleted due to the “:false” specification in our launch command.

$ ec2-terminate-instances $instance
$ ec2-describe-volumes
VOLUME	vol-7ce48a15	1		us-east-1b	available	2010-07-18T17:34:01+0000
VOLUME	vol-7ee48a17	15	snap-5e4bec36	us-east-1b	deleting	2010-07-18T17:34:02+0000

Note that the root EBS volume is being deleted but the new 1GB volume we created and stored the secret on is intact.

Now we’re ready for the final two steps:
Snapshot the volume with the secret:

$ ec2-create-snapshot $secretVolume
SNAPSHOT	snap-2ec73045	vol-7ce48a15	pending	2010-07-18T18:05:39+0000		540528830757	1

And, once the snapshot completes, delete the volume:

$ ec2-describe-snapshots -o self
SNAPSHOT	snap-2ec73045	vol-7ce48a15	completed	2010-07-18T18:05:40+0000	100%	540528830757	1
$ ec2-delete-volume $secretVolume
VOLUME	vol-7ce48a15
# save the snapshot ID
$ secretSnapshot=snap-2ec73045

Now you have a snapshot $secretSnapshot with your credentials stored on it.

How to Use Credentials Stored on an EBS Snapshot

Of course you can create a new volume from the snapshot, attach the volume to your instance, mount the volume to the filesystem, and access the secrets via the root user. But here’s a way to do all that at instance launch time:

$ ec2-run-instances -b /dev/sdf=$secretSnapshot -t m1.small -k \
my-keypair -g default ami-6743ae0e -d '#! /bin/bash
mkdir -m 000 /secretVol
mount -t ext3 -o noatime /dev/sdf /secretVol
# make sure it gets remounted if we reboot
echo "/dev/sdf /secretVol ext3 noatime 0 0" > /etc/fstab'

This one-liner uses the -b option of ec2-run-instances to specify that a new volume be created from $secretSnapshot and attached at /dev/sdf; this volume will be automatically deleted when the instance terminates. The user-data script sets up the filesystem mount point and mounts the volume there, also ensuring that the volume will be remounted if the instance reboots.
Check it out, a new volume was created for /dev/sdf:

$ ec2-describe-instances
RESERVATION	r-e4f2608f	540528830757	default
INSTANCE	i-155b857f	ami-6743ae0e			pending	my-keypair	0		m1.small	2010-07-19T15:51:13+0000	us-east-1b	aki-5f15f636	ari-d5709dbc	monitoring-disabled					ebs
BLOCKDEVICE	/dev/sda1	vol-8a721be3	2010-07-19T15:51:22.000Z
BLOCKDEVICE	/dev/sdf	vol-88721be1	2010-07-19T15:51:22.000Z

Let’s make sure the files are there. SSHing into the instance (as the ubuntu user) we then see:

$ ls -la /secretVol
ls: cannot open directory /secretVol: Permission denied
$ sudo ls -l /secretVol
total 28
-r--------  1 root root   916 2010-07-18 17:52 cert-NT63JNE4VSDEMH6VHLHBGHWV3DRFDECP.pem
-r--------  1 root root    90 2010-07-18 17:52 creds
dr--------  2 root   root   16384 2010-07-18 17:42 lost+found
-r--------  1 root root   926 2010-07-18 17:52 pk-NT63JNE4VSDEMH6VHLHBGHWV3DRFDECP.pem

Your application running the instance (you’ll install it by adding to the user-data script, right?) will need root privileges to access those secrets.

Cloud Developer Tips

Track Changes to your Dynamic Cloud Services Automatically

Dynamic infrastructure can be a pain to accommodate in applications. How do you keep track of the set of web servers in your dynamically scaling web farm? How do your apps keep up with which server is currently running what service? How can applications be written so they don’t need to care if a service gets moved to a different machine? There are a number of techniques available, and I’m happy to share implementation code for one that I’ve found useful.

One thing common to all these techniques: they all allow the application code to refer to services by name instead of IP address. This makes sense because the whole point is not to care about the IP address running the service. Every one of these techniques offers a way to translate the name of the service into an IP address behind the scenes, without your application knowing about it. Where the techniques differ is in how they provide this indirection.

Note that there are four usage scenarios that we might want to support:

  1. Service inside the cloud, client inside the cloud
  2. Service inside the cloud, client outside the cloud
  3. Service outside the cloud, client inside the cloud
  4. Service outside the cloud, client outside the cloud

Let’s take a look at a few techniques to provide loose coupling between dynamically movable services and their IP addresses, and see how they can support these usage scenarios.

Dynamic DNS

Dynamic DNS is the classic way of handling dynamically assigned roles: DNS entries on a DNS server are updated via an API (usually HTTP/S) when a server claims a given role. The DNS entry is updated to point to the IP address of the server claiming that role. For example, your DNS may have a record for the master database. When the production deployment’s master database starts up it can register itself with the DNS provider to claim that DNS record, pointing the entry to its own IP address. Any client of the database can use the host name to refer to the master database, and as long as the machine that last claimed that DNS entry is still alive, it will work.

When running your service within EC2, Dynamic DNS servers running outside EC2 will see the source IP address for the Dynamic DNS registration request as the public IP address of the instance. So if your Dynamic DNS is hosted outside EC2 you can’t easily register the internal IP addresses. Often you want to register the internal IP address because from within the same EC2 region it costs less to use the private IP address than the public IP addresses. One way to use Dynamic DNS with private IPs is to build your own Dynamic DNS service within EC2 and set up all your application instances to use that DNS server for your domain’s DNS lookups. When instances register with that EC2-based DNS server, the Dynamic DNS service will detect the source of the registration request as being the internal IP address for the instance, and it will assign that internal IP address to the DNS record.

Another way to use Dynamic DNS with internal IP addresses is to use DNS services such as DNSMadeEasy whose API allows you to specify the IP address of the server in the registration request. You can use the EC2 instance metadata service to discover your instance’s internal IP address.

Here’s how Dynamic DNS fares in each of the above usage scenarios:

Scenario 1: Service in the cloud, client inside the cloud: Only if you run your own DNS inside EC2 or use a special DNS service that supports specifying the internal IP address.
Scenario 2: Service in the cloud, client outside the cloud: Can use public Dynamic DNS providers.
Scenario 3: Service outside the cloud, client inside the cloud: Can use public Dynamic DNS providers.
Scenario 4: Service outside the cloud, client outside the cloud: Can use public Dynamic DNS providers.

Update window: Changes are available immediately to all DNS servers that respect the zero TTL on the Dynamic DNS server (guaranteed only for Scenario 1). DNS propagation delay penalty may still apply because not all DNS servers between the client and your Dynamic DNS service necessarily respect TTLs properly.

Pros: For public IP addresses only, easy to integrate into existing scripts.

Cons: Running your own DNS (to support private IP addresses) is not trivial, and introduces a single point of failure.

Bottom line: Dynamic DNS is useful when both the service and the clients are in the cloud; and for other usage scenarios if a DNS propagation delay is acceptable.

Elastic IP Addresses

In AWS you can have an Elastic IP address: an IP address that can be associated with any instance within a given region. It’s very useful when you want to move your service to a different instance (perhaps because the old one died?) without changing DNS and waiting for those changes to propagate across the internet to your clients. You can put code into the startup sequence of your instances that associates the desired Elastic IP address, making this approach very scriptable. For added flexibility you can write those scripts to accept configurable input (via settings in the user-data or some data stored in S3 or SimpleDB) that specifies which Elastic IP address to associate with the instance.

A cool feature of Elastic IP addresses: if clients use the DNS name of the Elastic IP address instead of the numeric IP address you get extra flexibility: clients within EC2 will get routed via the internal IP address to the service while clients outside EC2 will get routed via the public IP address. This seamlessly minimizes your bandwidth cost. To take advantage of this you can put a CNAME entry in your domain’s DNS records.

Summary of Elastic IP addresses:

Scenario 1: Service in the cloud, client inside the cloud: Trivial, client should use Elastic IP’s DNS name (or set up a CNAME).
Scenario 2: Service in the cloud, client outside the cloud: Trivial, client should use Elastic IP’s DNS name (or set up a CNAME).
Scenario 3: Service outside the cloud, client inside the cloud: Elastic IPs do not help here.
Scenario 4: Service outside the cloud, client outside the cloud: Elastic IPs do not help here.

Update window: Changes are available in under a minute.

Pros: Requires minimal setup, easy to script.

Cons: No support for running the service outside the cloud.

Bottom line: Elastic IPs are useful when the service is inside the cloud and an approximately one minute update window is acceptable.

Generating Hosts Files

Before the OS queries DNS for the IP address of a hostname it checks in the hosts file. If you control the OS of the client you can generate the hosts file with the entries you need. If you don’t control the OS of the client then this technique won’t help.

There are three important ingredients to get this to work:

  1. A central repository that stores the current name-to-IP address mappings.
  2. A method to update the repository when mappings are updated.
  3. A method to regenerate the hosts file on each client, running on a regular schedule.

The central repository can be S3 or SimpleDB, or a database, or security group tags. If you’re concerned about storing your AWS access credentials on each client (and if these clients are web servers then they may not need your AWS credentials at all) then the database is a natural fit (and web servers probably already talk to the database anyway).

If your service is inside the cloud and you want to support clients both inside and outside the cloud you’ll need to maintain two separate repository tables – one containing the internal IP addresses of the services (for use generating the hosts file of clients inside the cloud) and the other containing the public IP addresses of the services (for use generating the hosts file of clients outside the cloud).

Summary of Generating Hosts Files:

Scenario 1: Service in the cloud, client inside the cloud: Only if you control the client’s OS, and register the service’s internal IP address.
Scenario 2: Service in the cloud, client outside the cloud: Only if you control the client’s OS, and register the service’s public IP address.
Scenario 3: Service outside the cloud, client inside the cloud: Only if you control the client’s OS.
Scenario 4: Service outside the cloud, client outside the cloud: Only if you control the client’s OS.

Update Window: Controllable via the frequency with which you regenerate the hosts file. Can be as short as a few seconds.

Pros: Works on any client whose OS you control, whether inside or outside the cloud, and with services either inside or outside the cloud. And, assuming your application already uses a database, this technique adds no additional single points of failure.

Cons: Requires you to control the client’s OS.

Bottom line: Good for all scenarios where the client’s OS is under your control and you need refresh times of a few seconds.

A Closer Look at Generating Hosts Files

Here is an implementation of this technique using a database as the repository, using Java wrapped in a shell script to regenerate the hosts file, and using Java code to perform the updates. This implementation was inspired by the work of Edward M. Goldberg of myCloudWatcher.

Creating the Repository

Here is the command to create the necessary database (“Hosts”) and table (“hosts”):

mysql -h dbHostname -u dbUsername -pDBPassword -e \
'CREATE DATABASE IF NOT EXISTS Hosts; \
USE Hosts; \
CREATE TABLE \`hosts\` ( \
\`record\` TEXT \
); \
INSERT INTO \`hosts\` VALUES (" localhost localhost.localdomain");'

Notice that we pre-populate the repository with an entry for “localhost”. This is necessary because the process that updates the hosts file will completely overwrite the old one, and that’s where the localhost entry is supposed to live. Removing the localhost entry could wreak havoc on networking services – so we preserve it by ensuring a localhost entry is in the repository.

Updating the Repository

To claim a certain role – identified by a hostname, in this example “webserver1” – with a given IP address, you register it in the repository. Here’s the one-liner:

mysql -h dbHostname -u dbUsername -pDBPassword -e \
'DELETE FROM Hosts.\`hosts\` WHERE record LIKE "% webserver1"; \
INSERT INTO Hosts.\`hosts\` (\`record\`) VALUES ("   webserver1");'

The registration process can be performed on the client itself or by an outside agent. Make sure you substitute the real host name and the correct IP address.

On an EC2 instance you can get the private and public IP addresses of the instance via the instance metadata URLs. For example:

$ privateIp=$(curl --silent
$ echo $privateIp
$ publicIp=$(curl --silent
$ echo $publicIp

Regenerating the Hosts File

The final piece is recreating the hosts file based on the contents of the database table. Notice how the table records are already in the correct format for a hosts file. It would be simple to dump the output of the entire table to the hosts file:

mysql -h dbHostname -u dbUsername -pDBPassword --silent --column-names=0 -e \
'SELECT \`record\` FROM Hosts.\`hosts\`' | uniq > /etc/hosts  # This is simple and wrong

But it would also be wrong to do that! Every so often the database connection might fail and you’d be left with a hosts file that was completely borked – and that would prevent the client from properly resolving the hostnames of your services. It’s safer to only overwrite the hosts file if the SQL query actually returns results. Here’s some Java code that does that:

Connection conn = DriverManager.getConnection("jdbc:mysql://" + dbHostname + "/?user=" +
	dbUsername + "&password=" + dbPassword);
String outputFileName = "/etc/hosts";
Statement stmt = conn.createStatement();
ResultSet res = stmt.executeQuery("SELECT record FROM Hosts.hosts");
HashSet<String> uniqueMe = new HashSet<String>();
PrintStream out = System.out;
if (res.isBeforeFirst()) {
	// Only redirect output to the hosts file if the query returned rows
	out = new PrintStream(outputFileName);
}
while ( {
	String record = res.getString(1);
	if (uniqueMe.add(record)) {
		out.println(record);
	}
}
out.close();
This code uses the MySQL Connector/J JDBC driver. It makes sure only to overwrite the hosts file if there were actual records returned from the database query.

Scheduling the Regeneration

Now that you have a script that regenerates that hosts file (you did wrap that Java program into a script, right?) you need to place that script on each client and schedule a cron job to run it regularly. Via cron you can run it as often as every minute if you want – it adds a negligible amount of load to the database server so feel free – but if you need more frequent updates you’ll need to write your own driver to call the regeneration script more frequently.

If you find this technique helpful – or have any questions about it – I’d be happy to hear from you in the comments.

Update December 2010: Guy Rosen guest-authored this article on using AWS’s DNS service Route 53 to track instances.