
Elastic Load Balancer: An Elasticity Gotcha

If you use AWS’s Elastic Load Balancer to let your EC2 application scale, as I do, then you’ll want to know about this gotcha recently reported in the AWS forums. By all appearances it is something Amazon should fix. Until they do, you can reduce (but not eliminate) your exposure to the problem by keeping a small TTL on your ELB’s DNS CNAME entry. Read on for details.

The Gotcha

As your ELB-balanced application experiences an increasing load, some of the traffic received by your back-end instances may be traffic that does not belong to your application. And, after your application experiences a sustained heavy load and then traffic subsides, some of your application’s traffic may be lost or misdirected to other EC2 instances that are not yours.

Update March 2010: It appears AWS has changed the behavior of ELB so this is no longer a likely issue. See below for more details.

Why it Happens

In my article about how ELB works, I describe how ELB resolves its DNS name to a pool of IP addresses, and that this pool increases and decreases in size according to the load placed on the service. Each of the IP addresses in the pool is a “virtual appliance”, acting as a load balancer to distribute the connections among your back-end instances. This gives ELB two levels of elasticity: the pool of virtual appliance IP addresses, and your pool of back-end instances.
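
To see this first level of elasticity for yourself, resolve the ELB’s DNS name a few times and watch the set of IP addresses change. Here is a minimal sketch using only Python’s standard library; the hostname is a made-up placeholder, so substitute your own ELB’s DNS name:

    # Sketch: observe the pool of virtual-appliance IPs behind an ELB's DNS name.
    # The hostname is a made-up placeholder; substitute your own ELB's DNS name.
    import socket
    import time

    ELB_HOSTNAME = "my-elb-1234567890.us-east-1.elb.amazonaws.com"  # hypothetical

    def resolve_pool(hostname):
        """Return the set of IPv4 addresses the hostname currently resolves to."""
        infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
        return {sockaddr[0] for (_, _, _, _, sockaddr) in infos}

    if __name__ == "__main__":
        # Poll a few times; under changing load the set of addresses grows or shrinks.
        for _ in range(3):
            print(time.strftime("%H:%M:%S"), sorted(resolve_pool(ELB_HOSTNAME)))
            time.sleep(60)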

Before they are assigned to a specific ELB, the virtual appliance IP addresses are available for use by any ELB, waiting in a global pool. When an ELB needs to increase its pool of virtual appliances due to load, it gets a new IP address from the global pool and begins resolving the ELB DNS name to that IP address in addition to the ones it already uses. And when an ELB notices decreasing load, it releases one of its virtual appliance IP addresses back to the global pool, and no longer returns that IP address when resolving the ELB DNS name. According to testing performed by AWS forum user wizardofcrowds, ELB scales up under sustained load by increasing its pool of IP addresses at the rate of one additional address every 5 minutes. And, ELB scales down by relinquishing an IP address at the rate of one every 2 hours. Thus it is possible that a single ELB virtual appliance IP address can be in service to a number of different ELBs over the course of a few hours.

The problem is that DNS resolution is cached at many layers across the internet. When the ELB scales up and gets a new virtual appliance IP address from the global pool, some client somewhere might still be using that IP address as the resolution of a different ELB’s DNS name. This other ELB might not even belong to you. A few hours ago, another ELB with a different DNS name returned that IP address from a DNS lookup. Now, that IP address is serving your ELB. But some client somewhere may still be using that IP address to attempt to reach an application that is not yours.

The flip side occurs when the ELB scales down and releases a virtual appliance IP address back to the global pool. Some client somewhere might continue resolving your ELB’s DNS name to the now-relinquished IP address. When the address is returned to the pool, that client’s attempts to connect to your service will fail. If that same virtual appliance IP is then put into service for another ELB, then the client working with the cached but no-longer-current DNS resolution for your ELB DNS name will be directed to the other ELB virtual appliance, and then onward to back-end instances that are not yours.

So your application served by ELB may receive traffic destined for other ELBs during increasing load, and may experience lost traffic during decreasing load.

What is the Solution?

Fundamentally, this issue is caused by badly-configured DNS implementations. Some DNS servers (including those of some major ISPs) ignore the TTL (“time to live”) setting of the original DNS record, and thus end up resolving DNS names to an expired IP address. Some DNS clients (browsers such as IE7, and Java programs by default) also ignore DNS TTLs, causing the same problem. Short of fixing all the misconfigured DNS servers and patching all the IE and Java VMs, however, the issue cannot be solved. So a workaround is the best we can hope for.

You, the EC2 user, can minimize the risk that a well-behaved client will experience this issue. Set up your DNS CNAME entries for the ELB to have a small TTL – 120 seconds is good. This will help for clients whose DNS honors the TTL, but not for clients that ignore TTLs or for clients using DNS servers that ignore TTLs.
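
Purely as an illustration of that TTL setting: if your zone happens to be hosted in Route 53, the CNAME can be created with a 120-second TTL via boto3 roughly like this (the hosted zone ID, record name, and ELB DNS name are placeholders, and any DNS provider that lets you set the TTL works just as well):

    # Sketch: set a 120-second TTL on the CNAME that points at the ELB.
    # Assumes the zone is hosted in Route 53 and boto3 is configured;
    # the zone ID, record name, and ELB DNS name are hypothetical.
    import boto3

    route53 = boto3.client("route53")

    route53.change_resource_record_sets(
        HostedZoneId="Z1EXAMPLE",   # hypothetical hosted zone ID
        ChangeBatch={
            "Comment": "Short TTL so well-behaved clients re-resolve the ELB often",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "CNAME",
                    "TTL": 120,     # the small TTL suggested above
                    "ResourceRecords": [
                        {"Value": "my-elb-1234567890.us-east-1.elb.amazonaws.com"}
                    ],
                },
            }],
        },
    )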

Amazon can work around the problem on their end. When an ELB needs to scale up and use a new virtual appliance IP address, that address could remain “reserved” for its use for a longer time. When the ELB scales down and releases the virtual appliance IP address, that address would not be reused by another ELB until the reservation period has expired. This would prevent “recent” ELB virtual appliance IP addresses from being reused by other ELBs, and would reduce the risk of misdirecting traffic.

Update March 2010: SanD@AWS has shared that ELB IP addresses will continue to direct traffic to the ELB for one hour after being withdrawn from that ELB’s DNS pool. Hooray!

It should be noted that DNS caching and TTLs influence all load balancing solutions that rely on DNS (such as round-robin DNS), so this issue is not unique to ELB. Caching DNS entries is a good thing for the internet in general, but not all implementations honor the TTL of the cached DNS records. Services relying on DNS for scalability need to be designed with this in mind.

22 replies on “Elastic Load Balancer: An Elasticity Gotcha”

Using DNS as a load balancer is a bad idea. That’s not what it was designed for. Not only do you have to deal with stale caches out there, DNS is also inherently unaware of whether an instance it’s routing traffic to still has the server software up & running. Nor does it know anything about load metrics; it blindly splits the traffic equally amongst all servers.

@Drew,

I have been experiencing this very issue with my ELB ever since March 5th. Everything worked before then, but since then we’ve been getting reports of stale caches every day.

I know you say this is supposed to be fixed, but is it really fixed? Is there anything that might cause this still?

@Drew,

I do consider the issue with the ELB IP address being repurposed resolved, since the 1-hour delay before it can be reused by another ELB should be sufficient for all but the most egregiously broken DNS environments.

It seems to me that the issue you’re experiencing is not related to the ELB IP address, but rather to a different issue: the instance you originally registered in the ELB was not removed from the ELB when it shut down. The IP addresses of that instance remained in the load balancer pool, and when another user’s instance was assigned those IP addresses, your ELB routed traffic there. As mentioned in the thread, fixing this problem is on the AWS roadmap, but – as usual – no timeframe is given.

Thanks, @Shlomo. I’ve finally gotten to the bottom of this, and as you suspect, it is *not* an ELB issue (other than a desire to have fixed ELB IP addresses). Amazon’s 1 hour delay is working as expected.

We are proxying to the ELB from an Apache web server, and it turns out that Apache is secretly caching DNS lookups despite a short TTL on the underlying DNS record, with no obvious setting to enable, disable, or tweak this behavior in Apache. Any Apache process that happens to be running when an ELB IP is removed, and that continues to live longer than 1 hour, may end up in this situation.

While I now understand the issue, I’m currently looking for a solution… shortening Apache process lifespan, running Apache in worker mode, restarting Apache every hour… all would suffice. I’m still hoping to find a magical Apache setting that will resolve this for me.
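
A rough sketch of what the “restart Apache” option could look like: a small watchdog that re-resolves the ELB name and does an apachectl graceful whenever the answer changes, so long-lived workers drop their stale lookups (the hostname and the apachectl invocation are assumptions, not our actual setup):

    # Sketch of the "restart Apache" stopgap: gracefully reload Apache whenever
    # the ELB's DNS answer changes, so worker processes re-resolve the name.
    # The hostname and the apachectl command are assumptions for illustration.
    import socket
    import subprocess
    import time

    ELB_HOSTNAME = "my-elb-1234567890.us-east-1.elb.amazonaws.com"  # hypothetical
    CHECK_INTERVAL_SECONDS = 60

    def current_ips(hostname):
        infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
        return {sockaddr[0] for (_, _, _, _, sockaddr) in infos}

    def main():
        last = current_ips(ELB_HOSTNAME)
        while True:
            time.sleep(CHECK_INTERVAL_SECONDS)
            now = current_ips(ELB_HOSTNAME)
            if now != last:
                # "graceful" lets in-flight requests finish before workers recycle.
                subprocess.call(["apachectl", "graceful"])
                last = now

    if __name__ == "__main__":
        main()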

Thanks for your input!

@Drew,

I’m not familiar with any way to control the TTL of Apache’s DNS lookups.

Have you considered using squid or varnish as a reverse proxy instead of apache?

Yes. We’re using Apache for two reasons: 1) legacy, and 2) it performs server-side includes in the response, which are handled by Apache mod_include, mod_python, etc.

Varnish is certainly a much better proxy solution, we’re just not decoupled enough from our legacy (non-cloud) environment yet. 🙂

We’re testing out a few options right now.

I’m seeing this “gotcha” as a really HUGE issue for us right now. One of these domains alone was more than doubling the HTTP requests being sent to one of my servers, so it adversely affects my capacity.

The corollary question that I see no way of answering is, how much of MY traffic destined to MY server is being incorrectly redirected by an ELB to OTHER Amazon servers sitting behind ELBs?

We set the TTL to 60 for our CNAME DNS entry, and I opened up a case with Amazon, but we don’t seem to be getting a resolution to this — I’ll update this thread with any progress on that front.

I am seeing a TON of traffic destined to many other websites in my web server logs, and coming from the same IP addresses of ELBs that send me good traffic. So this is not hackers trying to crack into my systems, it’s the ELB forwarding traffic to my server that doesn’t belong to me. I clearly see domain names of where the traffic should go, and DNS lookups of those domain names (unsurprisingly) resolve to CNAME records of an ELB.
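
To put a rough number on it, a crude tally of the access log works as a first estimate, provided the log records the Host header (or the full request URI) the way mine does; the log path and hostname list below are placeholders:

    # Crude sketch: estimate how much of an access log is traffic meant for
    # other people's domains, by counting lines that never mention one of our
    # own hostnames. Assumes the Host header (or full URI) appears in the log.
    # The log path and hostname list are placeholders.
    OUR_HOSTNAMES = ("www.example.com", "example.com")   # hypothetical
    ACCESS_LOG = "/var/log/apache2/access.log"           # adjust to your setup

    ours = other = 0
    with open(ACCESS_LOG, encoding="utf-8", errors="replace") as log:
        for line in log:
            if any(name in line for name in OUR_HOSTNAMES):
                ours += 1
            else:
                other += 1

    total = ours + other
    if total:
        print(f"{other} of {total} requests ({100.0 * other / total:.1f}%) "
              "appear to be meant for someone else")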

I looked into TTLs of the other folks, and saw numbers (in seconds) all over the map: 1800, 3600, 68424, 220, 14400. So what could I possibly do to NOT get that traffic, if these other folks have such long TTLs?

What could fix this in the ELB architecture? Perhaps another field the ELB should have in its API is the FQDN(s) that I CNAME’d in my DNS, so it could compare those against incoming traffic before forwarding it. However, if it finds a mismatch, the ELB should figure out what the correct EC2 destination instance is and forward the request there, so that it doesn’t simply get lost.

I think what Daniel is experiencing is slightly different than what this thread started out as. And from what I can tell, it’s not necessarily Amazon’s problem… This might be a case of what you earlier labeled “egregiously broken DNS environments”. That is… a DNS entry that improperly points to an IP address rather than an ELB’s CNAME.

When this happens, there is NOTHING you can do to prevent traffic from hitting your servers should you be the recipient of a recycled IP (ELB, EC2 or otherwise).

So even though Amazon is innocent in this example, they are likely the only ones who can provide any protection from it. What if, like in CloudFront, you could specify alternate hosts (CNAMEs) that an ELB should respond to? Requests made via other Hosts could be discarded. I believe this is what Daniel’s idea about FQDNs is getting at.
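
Until something like that exists on the ELB side, the same idea can at least be applied at the web tier: with Apache, a default catch-all virtual host can swallow requests for unknown Host headers, and if a Python/WSGI app server happens to sit behind the proxy, a tiny middleware can do the same. A minimal sketch of the latter, with placeholder hostnames:

    # Sketch: reject requests whose Host header is not one of ours, so traffic
    # arriving via a recycled ELB IP never reaches the application.
    # Only relevant if a Python/WSGI app server is in the stack; the hostnames
    # and the wrapped application are placeholders.
    ALLOWED_HOSTS = {"www.example.com", "example.com"}   # hypothetical

    def host_filter(app):
        """WSGI middleware that 404s requests for hosts we don't serve."""
        def wrapped(environ, start_response):
            host = environ.get("HTTP_HOST", "").split(":")[0].lower()
            if host not in ALLOWED_HOSTS:
                start_response("404 Not Found", [("Content-Type", "text/plain")])
                return [b"unknown host\n"]
            return app(environ, start_response)
        return wrapped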

Wow… I responded to this post on Shlomo’s site over 4 years ago and I still get regular inquiries into how or whether I resolved the Apache DNS cache issue. I think that means that Apache needs to clarify some docs or something… It’s been a long time since I thought about this issue, but let me post an update here for the sake of everyone else.

I don’t believe I ever really “solved” the problem but did make some changes that rendered the issue moot.

1) First, Amazon actually made some changes to the allocation of their IPs which greatly reduced the potential for a bad IP to be cached. They hang onto their IPs for some time before reallocating them to new hosts.

2) I changed our Apache MPM from Worker to Prefork (or vice versa). This reduced the length of time that a given thread ran. New threads would each do a new DNS lookup.

3) Started using Nginx or HAProxy instead of Apache where possible.

Answer not satisfying? Tell me about it!

I have an AWS setup with an external ELB with two registered Apache instances; requests from Apache go to an internal ELB which has two Tomcat nodes registered. I have enabled sticky sessions on both ELBs. How do I set up my Apache httpd.conf to route to the correct app server as determined by the internal ELB?

@Madan Nadgauda,

The problem is that the AWSELB cookie is used by both ELBs for stickiness. One of the ELBs is masking the cookie of the other.

The solution is not trivial. It involves a proxy layer that renames the AWSELB cookie in the requests and responses.
I would use mitmproxy in Upstream Proxy Mode to do this, configuring it to receive all traffic from the ELB and direct it to Apache after editing the cookie name. It would also need to edit the cookie name in the response.
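
A rough sketch of such a mitmproxy addon, assuming a recent mitmproxy and picking an arbitrary replacement cookie name, might look like the following (untested):

    # Rough, untested sketch of the cookie-renaming idea as a mitmproxy addon.
    # Assumes a recent mitmproxy; INTERNALAWSELB is an arbitrary made-up name.
    # Load it with something like: mitmdump -s rename_awselb.py (mode and ports
    # depend on your setup).
    from mitmproxy import http

    INTERNAL = "INTERNALAWSELB"   # hypothetical name for the inner ELB's cookie

    def request(flow: http.HTTPFlow) -> None:
        # Towards Apache and the internal ELB: restore the inner cookie's
        # original name so the internal ELB recognizes it for stickiness.
        cookies = flow.request.cookies
        if INTERNAL in cookies:
            value = cookies[INTERNAL]
            del cookies[INTERNAL]
            cookies["AWSELB"] = value   # replaces the outer ELB's cookie downstream

    def response(flow: http.HTTPFlow) -> None:
        # Towards the client: rename the internal ELB's Set-Cookie so it does
        # not collide with the outer ELB's own AWSELB cookie in the browser.
        rewritten = []
        for set_cookie in flow.response.headers.get_all("set-cookie"):
            if set_cookie.startswith("AWSELB="):
                set_cookie = INTERNAL + set_cookie[len("AWSELB"):]
            rewritten.append(set_cookie)
        if rewritten:
            flow.response.headers.set_all("set-cookie", rewritten)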

If you get this working, please share the configuration.

Does your outermost ELB actually require sticky sessions at all? If the external ELB can be configured as non-sticky, then it would leave the AWSELB cookie untouched when delivering traffic, so the AWSELB cookie set by the internal ELB would not get jumbled up. Furthermore, load from the external ELB would be distributed more evenly between the Apaches attached to it, which should be a good thing.

Some cookie renaming can also be done with Apache Header edit directives, but Apache directive execution order is a bit hazy, and complex configurations bring you a world of pain sooner than you could imagine. Look into nginx or other alternatives if you need a complicated setup.
