---
"
mattj 6 hours ago
| link |
So the issue here is two-fold:
- It's very hard to do 'intelligent routing' at scale.
- Random routing plays poorly with request times that have a really bad tail (median is 50ms, 99th is 3 seconds).
The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).
reply
aristus 6 hours ago
| link |
I do perf work at Facebook, and over time I've become more and more convinced that the most crucial metric is the width of the latency histogram. Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.
reply
jhspaybar 5 hours ago
| link |
I can chime in here that I have had similar experiences at another large-scale place :). Some requests would take a second or more to complete, with the vast majority finishing in under 100 ms. A solution was put in place that added about 5 ms to the average request, but also crushed the long tail (it just doesn't even exist anymore), and everything is hugely more stable and responsive.
reply
genwin 22 minutes ago
| link |
How was 5 ms added? Multiple sleep states per request?
I imagine the long tail disappears in a similar way that a traffic jam is prevented by lowering the speed limit.
reply
harshreality 5 hours ago
| link |
I seem to recall Google mentioning on some blog several years ago that high variance in response latency degrades user experience much more than slightly higher average request times. I can't find the link though; if anyone has it, I'd be grateful.
reply
nostrademons 5 hours ago
| link |
Jeff Dean wrote a paper on it for CACM:
http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scal...
There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first fails to respond after a relatively short period of time (say, 95th-percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically, since it's very unlikely that the second request will hit the same overloaded server.
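A rough sketch of what that could look like (Go here just for illustration; the backend URLs, the 300 ms hedge delay standing in for the 95th-percentile latency, and all the names are made up, not anything Heroku actually does):

    package main

    import (
        "context"
        "fmt"
        "math/rand"
        "net/http"
        "time"
    )

    // hedgeAfter stands in for the observed 95th-percentile latency.
    const hedgeAfter = 300 * time.Millisecond

    type result struct {
        resp *http.Response
        err  error
    }

    // hedgedGet sends the request to one randomly chosen backend; if that copy
    // hasn't answered (or has failed) within hedgeAfter, it sends a second copy
    // to a different backend and returns whichever responds first. Cancelling
    // the shared context tears down the copy that loses the race.
    func hedgedGet(ctx context.Context, backends []string, path string) (*http.Response, error) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel()

        results := make(chan result, 2)
        fire := func(backend string) {
            req, err := http.NewRequestWithContext(ctx, "GET", backend+path, nil)
            if err != nil {
                results <- result{nil, err}
                return
            }
            resp, err := http.DefaultClient.Do(req)
            results <- result{resp, err}
        }

        first := rand.Intn(len(backends))
        second := (first + 1) % len(backends) // assumes at least two backends
        go fire(backends[first])

        hedged := false
        timer := time.NewTimer(hedgeAfter)
        defer timer.Stop()

        var lastErr error
        for received := 0; received < 2; {
            select {
            case r := <-results:
                if r.err == nil {
                    return r.resp, nil // winner; the deferred cancel kills the loser
                }
                lastErr = r.err
                received++
                if !hedged { // first copy failed outright, so hedge immediately
                    hedged = true
                    go fire(backends[second])
                }
            case <-timer.C: // first copy is slow, fire the backup
                if !hedged {
                    hedged = true
                    go fire(backends[second])
                }
            }
        }
        return nil, lastErr
    }

    func main() {
        backends := []string{"http://dyno-a.internal", "http://dyno-b.internal"}
        if resp, err := hedgedGet(context.Background(), backends, "/health"); err == nil {
            fmt.Println("status:", resp.StatusCode)
            resp.Body.Close()
        }
    }

Cancelling the shared context is what kills whichever copy is still in flight once a response comes back; and as noted downthread, you'd only want to hedge requests that are safe to run twice.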
reply
badgar 4 hours ago
| link |
> There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request times fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.
Your solution doesn't work if requests aren't idempotent.
reply
gojomo 4 hours ago
| link |
A relatively easy fix, for read-only or idempotent requests. Also, if long-tail latency requests wind up being run twice, this technique might accelerate tip-over saturation. Still, this 'hedged request' idea is good to keep in mind, thanks for the pointer.
The 'tied request' idea from the Dean paper is neat, too, and Heroku could possibly implement that, and give dyno request-handlers the ability to check: "did I win the race to handle this, or can this request be dropped?"
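A toy sketch of that check, with an in-process claim table standing in for whatever shared, fast store the routers and dynos would actually have to coordinate through (all names hypothetical):

    package main

    import (
        "fmt"
        "sync"
    )

    // ClaimTable records which worker first claimed each request id. It is a
    // stand-in for a shared store; as written it only works inside one process,
    // which is enough to show the shape of the check.
    type ClaimTable struct{ m sync.Map }

    // Claim reports whether this worker won the race for the given request id.
    func (c *ClaimTable) Claim(requestID, worker string) bool {
        winner, _ := c.m.LoadOrStore(requestID, worker)
        return winner == worker
    }

    func handle(claims *ClaimTable, requestID, worker string) {
        if !claims.Claim(requestID, worker) {
            fmt.Printf("%s: lost the race for %s, dropping it\n", worker, requestID)
            return
        }
        fmt.Printf("%s: won %s, doing the work\n", worker, requestID)
    }

    func main() {
        claims := &ClaimTable{}
        var wg sync.WaitGroup
        // The mesh router would tag both copies of a request with the same id.
        for _, dyno := range []string{"dyno-a", "dyno-b"} {
            wg.Add(1)
            go func(d string) {
                defer wg.Done()
                handle(claims, "req-42", d)
            }(dyno)
        }
        wg.Wait()
    }

In the paper the servers also cross-cancel each other once one starts executing; the claim check above is just the minimal "can this copy be dropped?" test described here.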
reply
fizx 1 hour ago
| link |
Right now, Heroku has one inbound load-balancer that's out of their control (probably ELB(s)). This load balancer hits another layer of mesh routers that Heroku does control, and that perform all of Heroku's magic. In order for "intelligent routing," which is more commonly known as "least-conn" routing, to work amongst the mesh layer, all of the mesh routers would have to share state with each other in real time, which makes this a hard problem.
Alternatively, Heroku can introduce a third layer between the mesh routers and the inbound random load balancer. This layer consistently hashes (http://en.wikipedia.org/wiki/Consistent_hashing) the api-key/primary key of your app, and sends you to a single mesh router for all of your requests. Mesh routers are/should be blazing fast relative to Rails dynos, so this isn't really a bottleneck for your app. Since the one mesh router can maintain connection state for your app, Heroku can implement a least-conn strategy. If that mesh router dies, another router can be automatically chosen.
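For illustration, a minimal version of that hashing layer (router names, virtual-node count, and the FNV hash are arbitrary choices here, not what Heroku's mesh actually uses):

    package main

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    // Ring maps hash points to mesh routers. Each router is placed at several
    // points (virtual nodes) so load stays roughly even and only ~1/N of apps
    // move when a router is added or dies.
    type Ring struct {
        points []uint32          // sorted hash points on the ring
        owner  map[uint32]string // hash point -> mesh router
    }

    func hashKey(s string) uint32 {
        h := fnv.New32a()
        h.Write([]byte(s))
        return h.Sum32()
    }

    func NewRing(routers []string, vnodes int) *Ring {
        r := &Ring{owner: make(map[uint32]string)}
        for _, router := range routers {
            for i := 0; i < vnodes; i++ {
                p := hashKey(fmt.Sprintf("%s#%d", router, i))
                r.points = append(r.points, p)
                r.owner[p] = router
            }
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
        return r
    }

    // RouterFor picks the first ring point at or after the app's hash,
    // wrapping around to the start of the ring if necessary.
    func (r *Ring) RouterFor(apiKey string) string {
        h := hashKey(apiKey)
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
        if i == len(r.points) {
            i = 0
        }
        return r.owner[r.points[i]]
    }

    func main() {
        ring := NewRing([]string{"mesh-1", "mesh-2", "mesh-3"}, 64)
        fmt.Println(ring.RouterFor("some-app-api-key")) // same router every time
    }

Because every request for a given api-key lands on the same mesh router, that router can count open connections per dyno and do least-conn locally, with no cross-router state to share; if a router dies, its apps just fall to the next point on the ring.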
reply
gleb 4 hours ago
| link |
From experience, this is an incredibly effective way to