---
"
mattj 6 hours ago
| link |
So the issue here is two-fold:
- It's very hard to do 'intelligent routing' at scale.
- Random routing plays poorly with request times that have a really bad tail (median is 50ms, 99th is 3 seconds).
The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).
reply
aristus 6 hours ago
| link |
I do perf work at Facebook, and over time I've become more and more convinced that the most crucial metric is the width of the latency histogram. Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.
reply
jhspaybar 5 hours ago
| link |
I can chime in here that I have had similar experiences at another large scale place :). Some requests would take a second or more to complete, with the vast majority finishing in under 100 ms. A solution was put in place that added about 5 ms to the average request, but also crushed the long tail (it just doesn't even exist anymore) and everything is hugely more stable and responsive.
reply
genwin 22 minutes ago
| link |
How was 5 ms added? Multiple sleep states per request?
I imagine the long tail disappears in a similar way that a traffic jam is prevented by lowering the speed limit.
reply
harshreality 5 hours ago
| link |
I seem to recall Google mentioning on some blog several years ago that high variance in response latency degrades user experience much more than slightly higher average request times. I can't find the link though; if anyone has it, I'd be grateful.
reply
nostrademons 5 hours ago
| link |
Jeff Dean wrote a paper on it for CACM:
http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scal...
There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.
reply
badgar 4 hours ago
| link |
> There's a relatively easy fix for Heroku. They should do random routing with a backup second request sent if the first request fails to respond after a relatively short period of time (say, 95th percentile latency), killing any outstanding requests when the first response comes back in. The amount of bookkeeping required for this is a lot less than full-on intelligent routing, but it can reduce tail latency dramatically since it's very unlikely that the second request will hit the same overloaded server.
Your solution doesn't work if requests aren't idempotent.
reply
gojomo 4 hours ago
| link |
A relatively easy fix, for read-only or idempotent requests. Also, if long-tail latency requests wind up being run twice, this technique might accelerate tip-over saturation. Still, this 'hedged request' idea is good to keep in mind, thanks for the pointer.
The 'tied request' idea from the Dean paper is neat, too, and Heroku could possibly implement that, and give dyno request-handlers the ability to check, "did I win the race to handle this, or can this request be dropped?"
reply
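(my note: a quick Python sketch of the hedged-request idea from this subthread -- only sensible for idempotent requests, per the objections above. The fetch() callable, replica list, and 95 ms hedge delay are made up for illustration:)

import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

pool = ThreadPoolExecutor(max_workers=8)

def hedged_request(fetch, replicas, hedge_after=0.095):
    # fetch(replica) -> response; hedge_after ~ 95th percentile latency, in seconds
    first, backup = random.sample(replicas, 2)
    futures = [pool.submit(fetch, first)]
    done, _ = wait(futures, timeout=hedge_after, return_when=FIRST_COMPLETED)
    if not done:                                    # primary is slow: fire the backup
        futures.append(pool.submit(fetch, backup))
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()   # best effort; a real router would abort the in-flight request
    return next(iter(done)).result()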
fizx 1 hour ago
| link |
Right now, Heroku has one inbound load balancer that's out of their control (probably ELB(s)). This load balancer hits another layer of mesh routers that Heroku does control, and that perform all of Heroku's magic. In order for "intelligent routing" (more commonly known as "least-conn" routing) to work amongst the mesh layer, all of the mesh routers would have to share state with each other in real time, which makes this a hard problem.
Alternatively, Heroku can introduce a third layer between the mesh routers and the inbound random load balancer. This layer consistently hashes (http://en.wikipedia.org/wiki/Consistent_hashing) the api-key/primary key of your app, and sends you to a single mesh router for all of your requests. Mesh routers are/should be blazing fast relative to Rails dynos, so this isn't really a bottleneck for your app. Since the one mesh router can maintain connection state for your app, Heroku can implement a least-conn strategy. If the mesh router dies, another router can be automatically chosen.
reply
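(my note: a rough Python sketch of that consistent-hashing layer -- the standard hash-ring-with-virtual-nodes construction; the router names and vnode count are invented:)

import bisect, hashlib

def _h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    # map each app (by api-key) to one mesh router; if a router is removed,
    # only its slice of the keyspace moves to the next router on the ring
    def __init__(self, routers, vnodes=100):
        self._ring = sorted((_h("%s#%d" % (r, i)), r)
                            for r in routers for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def router_for(self, api_key):
        i = bisect.bisect(self._points, _h(api_key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["mesh-1", "mesh-2", "mesh-3"])
print(ring.router_for("app-1234"))   # the same app always lands on the same router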
gleb 4 hours ago
| link |
From experience, this is an incredibly effective way to DoS yourself. It was the default behaviour of nginx LB ages ago. Maybe only on EngineYard. Doesn't really matter as nobody uses nginx LB anymore.
Even ignoring the POST requests problem (yup, it tried to replay those), properly cancelling a request on all levels of a multi-level Rails stack is very hard/not possible in practice. So you end up DoSing the hard-to-scale lower levels of the stack (e.g. the database) at the expense of the easy-to-scale LB.
reply
javajosh 4 hours ago
| link |
This is something I've read in networked game literature: players react far better to consistent and high latency than to inconsistent and low latency, even if the averages are lower in the latter case. (It might even have been a John Carmack article).
reply
mvgoogler 5 hours ago
| link |
Sounds like Jeff Dean :-)
http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scal...
reply
bmohlenhoff 5 hours ago
| link |
Correct, this matches my observations as well. I'd trade an increase in mean latency for a decrease in worst-case latency anytime. It makes it so much easier to reason about how many resources are needed for a given workload when your latency is bounded.
reply
cmccabe 3 hours ago
| link |
> Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny.
Yeah, it's a lot more practical than implementing QoS, isn't it?
reply
dbpatterson 5 hours ago
| link |
That's probably true, but the value that Heroku is selling (and they charge a lot for it!) is that you _don't_ need to deal with this - that they will balance that out for you.
reply
cmccabe 3 hours ago
| link |
> The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).
The Golang runtime uses non-blocking I/O to get around this problem. "
---
" (incidentally, javascript's security model is so rubbish that it is effectively impossible to prevent code-poisoning from stealing keys, even with a signed extension).
A signed extension doesn't really protect you with JavaScript. The open security model of the DOM means that you don't have to change the code in question; you can simply run some additional code that installs an event handler in an appropriate place to grab keys as they pass. "
---
some example stuff that people complained about in another language:
--- ideas from 'release it!'
SOA integration points should always have load balancing, circuit breakers (remember that this server is down/overloaded and stop trying to call it), and timeouts, and maybe handshakes (asking 'can you handle another connection?' before making the connection) -- a toy circuit-breaker sketch follows at the end of these notes
beware of blocking; if you have blocking, always have a timeout!
a subclass should not be able to declare a method synchronized (a blocking critical section) if it is implementing an interface or a superclass, and that method is not synchronized in the interface -- this violates Liskov substitution
other properties to think about for Liskov: side-effect-free-ness; only-throws-these-exceptions-ness
watch out for blocking calls within critical sections
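a toy Python sketch of the circuit-breaker idea above, with the timeout reminder folded into a comment; the failure threshold and reset window are arbitrary:

import time

class CircuitBreaker:
    # wrap calls to one integration point: after max_failures consecutive
    # failures, stop calling it (fail fast) until reset_after seconds pass
    def __init__(self, call, max_failures=5, reset_after=30.0):
        self.call = call                 # the wrapped call should itself have a timeout!
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def __call__(self, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: downstream server marked bad")
            self.opened_at = None        # half-open: let one trial call through
        try:
            result = self.call(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()
            raise
        self.failures = 0
        return result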
---
timeouts as security capability / a channel like stdin
---
from my Release It! notes;
131: asynchronous connections between systems are more stable (but substantially more complicated) than synchronous ones; e.g. pub/sub message passing is more stable than synchronous request/response protocols like HTTP. A continuum:
- in-process method calls (a function call into a library; same time, same host, same process)
- IPC (e.g. shared memory, pipes, semaphores, events; same host, same time, different process)
- remote procedure calls (XML-RPC, HTTP; same time, different host, different process) -- my note -- this continuum neglects the RPC/REST distinction (one part of which is the statelessness of HTTP)
- messaging middleware (MQ, pub-sub, smtp, sms; different time, different host, different process)
- tuple spaces
---
finally a clear explanation of tuple spaces:
http://software-carpentry.org/blog/2011/03/tuple-spaces-or-good-ideas-dont-always-win.html
they sound great. let's do them.
here's something i wrote about them for the c2.com wiki:
The structure has six primitive operations: put, copy, take, try_copy, try_take, and eval. put places a tuple into the bag. copy finds and reads a tuple. take is like copy but also removes the tuple after reading it. copy and take block if they cannot find a tuple matching the query; try_copy and try_take are the non-blocking versions. eval forks a new process.
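a toy in-process Python version of those primitives, just to pin down the semantics (a real tuple space is a shared/distributed structure, and the wildcard-matching scheme here is my own simplification):

import threading

class TupleSpace:
    def __init__(self):
        self._bag, self._cv = [], threading.Condition()

    def put(self, tup):
        with self._cv:
            self._bag.append(tup)
            self._cv.notify_all()

    def _match(self, pattern):        # None in the pattern is a wildcard
        for t in self._bag:
            if len(t) == len(pattern) and all(p is None or p == v
                                              for p, v in zip(pattern, t)):
                return t
        return None

    def copy(self, pattern):          # blocking read; the tuple stays in the bag
        with self._cv:
            while (t := self._match(pattern)) is None:
                self._cv.wait()
            return t

    def take(self, pattern):          # blocking read + remove
        with self._cv:
            while (t := self._match(pattern)) is None:
                self._cv.wait()
            self._bag.remove(t)
            return t

    def try_copy(self, pattern):      # non-blocking versions return None on a miss
        with self._cv:
            return self._match(pattern)

    def try_take(self, pattern):
        with self._cv:
            t = self._match(pattern)
            if t is not None:
                self._bag.remove(t)
            return t

    def eval(self, fn, *args):        # fork a worker that eventually put()s its result
        threading.Thread(target=lambda: self.put(fn(*args)), daemon=True).start()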
---
on the note of tuplespaces vs. message passing:
if we have tuplespaces, may as well add transactions to get STM (is this what STM is or is it different?). if we have channels, they should be buffered (async) or unbuffered like in Go, and have timeouts, and a multicast option -- maybe they should just be something like zeromq.
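a rough Python sketch of such a channel -- buffered, with send/recv timeouts and a crude multicast via tee(); a true unbuffered Go-style rendezvous would need more machinery and is omitted:

import queue

class Channel:
    def __init__(self, capacity=1):
        self._q = queue.Queue(maxsize=capacity)   # capacity > 0: buffered/async
        self._subs = []

    def send(self, msg, timeout=None):
        self._q.put(msg, timeout=timeout)          # raises queue.Full on timeout
        for sub in self._subs:                     # crude multicast to subscribers
            sub.send(msg, timeout=timeout)

    def recv(self, timeout=None):
        return self._q.get(timeout=timeout)        # raises queue.Empty on timeout

    def tee(self):
        sub = Channel(capacity=self._q.maxsize or 1)
        self._subs.append(sub)
        return sub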
how do go channels compare to zeromq? https://www.google.com/search?client=ubuntu&channel=fs&q=go+channels+zeromq&ie=utf-8&oe=utf-8
"
Fast asynchronous message passing.
This is what ZeroMQ gives you. But it gives you it in a form different to that of Erlang: in Erlang, processes are values and message-passing channels are anonymous; in ZeroMQ, channels are values and processes are anonymous. ZeroMQ is more like Go than Erlang. If you want the actor model (that Erlang is based on), you have to encode it in your language of choice, yourself. "
http://www.rabbitmq.com/blog/2011/06/30/zeromq-erlang/
hmm.. seems like we need to support both named processes and named channels, and convenience omissions of either channel or process... hmm..
how does STM compare to tuplespaces?
--- the ingredients of Erlang: " Fast process creation/destruction.
Ability to support » 10 000 concurrent processes with largely unchanged characteristics.
A programming model where processes are lightweight values -- and a good scheduler -- make concurrent programming much easier, in a similar way to garbage collection. It frees you from resource micro-management so you can spend more time reasoning about other things.
Fast asynchronous message passing.
This is what ZeroMQ gives you. But it gives you it in a form different to that of Erlang: in Erlang, processes are values and message-passing channels are anonymous; in ZeroMQ, channels are values and processes are anonymous. ZeroMQ is more like Go than Erlang. If you want the actor model (that Erlang is based on), you have to encode it in your language of choice, yourself.
Copying message-passing semantics (share-nothing concurrency).
Notably, Erlang enforces this. In other languages, shared memory and the trap of using it (usually unwittingly) doesn't go away.
Process monitoring.
Erlang comes with a substantial library, battle-tested over decades, for building highly concurrent, distributed, and fault-tolerant systems. Crucial to this is process monitoring -- notification of process termination. This allows sophisticated process management strategies; in particular, using supervisor hierarchies to firewall core parts of the system from more failure-prone parts of the system.
Selective message reception.
" http://www.rabbitmq.com/blog/2011/06/30/zeromq-erlang/ (from http://www.zeromq.org/whitepapers:multithreading-magic )
" As far as selective message reception goes, the dual of (Erlang) "from a pool of messages, specify the types to receive" is (Go/ZMQ) "from a pool of channels, specify the ones to select". One message type per channel. "
--
---
mb or mb not look into http://dl.acm.org/citation.cfm?id=1774536
Ruby refinements:
http://www.rubyinside.com/ruby-refinements-an-overview-of-a-new-proposed-ruby-feature-3978.html
" In a nutshell, refinements clear up the messiness of Ruby's monkey patching abilities by letting you override methods within a specific context only. Magnus Holm has done a great write up of the concepts involved, and presents a good example:
module TimeExtensions
  refine Fixnum do
    def minutes; self * 60; end
  end
end

class MyApp
  using TimeExtensions

  def initialize
    p 2.minutes
  end
end

MyApp.new   # => 120
p 2.minutes # => NoMethodError "
refinements are not a property of the value, they are lexically scoped. So you can't easily mimic this with prototype inheritance.
---
if lst is a list value, how to distinguish between things like lst[3] and lst.extend if lst[3] is just lst.__get__(3) and lst.extend is just lst.__get__(:extend)? e.g. don't want 'keys lst' to include ':extend'. Obviously, perspectives.
But this means that we can syntactically distinguish indexing into collections from attribute reference, by allowing a syntax like lst[3] to be syntactic sugar for a choice of perspective combined with a __get__.
in addition, this may allow us to do refinements.
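a toy Python sketch of that idea -- item access and attribute access both funnel through one get() tagged with a perspective, so keys() enumerates only the item perspective (the class and method names are just for illustration):

class PerspectiveList:
    def __init__(self, items):
        self._items = list(items)
        self._methods = {"extend": self._items.extend,
                         "append": self._items.append}

    def get(self, perspective, key):
        if perspective == "item":
            return self._items[key]
        return self._methods[key]

    def __getitem__(self, i):            # lst[3]      =>  get('item', 3)
        return self.get("item", i)

    def __getattr__(self, name):         # lst.extend  =>  get('attr', 'extend')
        return self.get("attr", name)

    def keys(self):                      # only the 'item' perspective is enumerated
        return range(len(self._items))

lst = PerspectiveList([10, 20, 30, 40])
print(lst[3])            # 40
lst.extend([50])         # method lookup, not visible in keys()
print(list(lst.keys()))  # [0, 1, 2, 3, 4]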
---
hmm, actually i think prototype inheritance can help -- the key is that instead of the instance containing a reference to a specific prototype, it contains the name of the type, which is resolved to a prototype based on the lexically scoped type:prototype mapping. hmm maybe this isnt different from classes/instances though, when you put it that way.
anyhow what does this type do? it pre-populates the attributes of the value in an overridable fashion (you might call it a template, although that's different from the C++ usage of that word).
however, i would prefer dynamic scoping to lexical scoping for refinements. i think lexically scoped refinements are merely cosmetic, in that if you refine String::capitalize, then instead of doing that you could just replace every instance of
x.capitalize
in your code with
if x isa String: x.mycapitalize else: x.capitalize
dynamically scoped refinements would be like semiglobals; their impact disappears when you ascend up the call stack past the guy who put the refinement in place
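a rough Python sketch of dynamically scoped refinements as "semiglobals" -- a stack of refinement frames installed for the dynamic extent of a with-block, so a helper defined elsewhere still sees them, and they vanish when the call stack unwinds past the installer (names are made up):

from contextlib import contextmanager

_refinements = []        # stack of {(type, method_name): function}, innermost last

@contextmanager
def refining(typ, name, fn):
    # install a refinement for the dynamic extent of the with-block
    _refinements.append({(typ, name): fn})
    try:
        yield
    finally:
        _refinements.pop()

def call(obj, name, *args):
    # method call that honours dynamically scoped refinements
    for frame in reversed(_refinements):
        fn = frame.get((type(obj), name))
        if fn is not None:
            return fn(obj, *args)
    return getattr(obj, name)(*args)

def shouty(s):
    return s.upper() + "!"

def helper(x):
    return call(x, "capitalize")     # defined outside the with-block, still refined

with refining(str, "capitalize", shouty):
    print(helper("hello"))           # HELLO!  (refined, dynamically scoped)
print(helper("hello"))               # Hello   (back to the builtin)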
" Simple Generators v. Lazy Evaluation
Oleg Kiselyov, Simon Peyton-Jones and Amr Sabry: Simple Generators:
Incremental stream processing, pervasive in practice, makes the best case for lazy evaluation. Lazy evaluation promotes modularity, letting us glue together separately developed stream producers, consumers and transformers. Lazy list processing has become a cardinal feature of Haskell. It also brings the worst in lazy evaluation: its incompatibility with effects and unpredictable and often extraordinary use of memory. Much of the Haskell programming lore are the ways to get around lazy evaluation.
We propose a programming style for incremental stream processing based on typed simple generators. It promotes modularity and decoupling of producers and consumers just like lazy evaluation. Simple generators, however, expose the implicit suspension and resumption inherent in lazy evaluation as computational effects, and hence are robust in the presence of other effects. Simple generators let us accurately reason about memory consumption and latency. The remarkable implementation simplicity and efficiency of simple generators strongly motivates investigating and pushing the limits of their expressiveness.
To substantiate our claims we give a new solution to the notorious pretty-printing problem. Like earlier solutions, it is linear, backtracking-free and with bounded latency. It is also modular, structured as a cascade of separately developed stream transducers, which makes it simpler to write, test and to precisely analyze latency, time and space consumption. It is compatible with effects including IO, letting us read the source document from a file, and format it as we read.
This is fascinating work that shows how to gain the benefits of lazy evaluation - decoupling of producers, transformers, and consumers of data, and producing only as much data as needed - in a strict, effectful setting that works well with resources that need to be disposed of once computation is done, e.g. file handles.
The basic idea is that of Common Lisp signal handling: use a hierarchical, dynamically-scoped chain of handler procedures, which get called - on the stack, without unwinding it - to parameterize code. In this case, the producer code (which e.g. reads a file character by character) is the parameterized code: every time data (a character) is produced, it calls the dynamically innermost handler procedure with the data (it yields the data to the handler). This handler is the data consumer (it could e.g. print the received character to the console). Through dynamic scoping, each handler may also have a super-handler, to which it may yield data. In this way, data flows containing multiple transformers can be composed.
I especially like the OCaml version of the code, which is just a page of code, implementing a dynamically-scoped chain of handlers. After that we can already write map and fold in this framework (fold using a loop and a state cell, notably.) There's more sample code.
This also ties in with mainstream yield. By Manuel J. Simoni at 2013-02-21 13:30
| Fun | Paradigms | Software Engineering | other blogs | 4584 reads |
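my note: a tiny Python sketch of that handler-chain idea (not the paper's Haskell/OCaml) -- a dynamically scoped stack of consumers; the producer calls yield_item, the innermost handler consumes, and a transformer forwards to its own super-handler:

_handlers = []                      # dynamically scoped stack of consumers

def yield_item(x):
    # producer side: hand x to the innermost handler on the stack
    _handlers[-1](x)

def with_handler(handler, body):
    # run body() with handler installed as the innermost consumer
    _handlers.append(handler)
    try:
        return body()
    finally:
        _handlers.pop()

def mapping(f, body):
    # a transformer: consume items, yield f(item) onward to its super-handler
    outer = _handlers[-1]
    return with_handler(lambda x: outer(f(x)), body)

def producer():
    for ch in "hello":
        yield_item(ch)

out = []
with_handler(out.append, lambda: mapping(str.upper, producer))
print(out)                          # ['H', 'E', 'L', 'L', 'O']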
Nice, I was going to post this too.
One thing I wondered about in this paper was the obsession with weight vs expressivity. The restrictions adopted by the "simple generator" strategy make generators less powerful. Is a copying single-shot generator really that much slower that these tradeoffs become attractive? By Andy Wingo at Thu, 2013-02-21 15:19
| login or register to post comments |
I also liked the repmin example, and how the solution was to turn the tree into a stream -- the opposite of the SSAX work Kiselyov did 10 years ago. By Andy Wingo at Thu, 2013-02-21 15:21
| login or register to post comments |
The repmin solution presented in the paper still does 2 traversals over the stream, which somewhat defeats the purpose IMHO. The attractiveness of the original repmin is that it does two things in one traversal. By ninegua at Sun, 2013-02-24 21:29
| login or register to post comments |
I'm not sure I see the point. If I remember correctly, the lazy repmin builds a thunk structure that exactly mirror the tree. So yes, you need only one traversal to build it, but then forcing a result out of it is isomorphic to a second traversal.
(It's a bit like when you use continuation-passing style to make any program tail-recursive: yes, it's tail-recursive, but the memory consumption corresponding to stack frames for non-tail calls is now needed to build the chain of continuation closures. It's an interesting transform, but you still have to take into account the memory usage corresponding to continuation closures, just as here you should consider the control pattern corresponding to forcing the chained thunks.) By gasche at Sun, 2013-02-24 22:27
| login or register to post comments |
No, in a lazy language repmin can link nodes to a new box for the min value, which isn't known until the whole tree is traversed. By Bruno Martinez at Sun, 2013-02-24 23:46
| login or register to post comments |
I didn't read it in detail, but am I correct in thinking that this cannot express zip? By Jules Jacobs at Thu, 2013-02-21 19:04
| login or register to post comments |
I assume by zip, in this case, you mean:
(Producer m x, Producer m y) -> Producer m (x,y)
Inability to zip (or split) streams seems to be a common weakness of many of Haskell's pipe/conduit/iteratee/etc. models. You can partially make up for it by clever use of `Either`. By dmbarbour at Thu, 2013-02-21 19:36
| login or register to post comments |
Simple generators indeed cannot do zip: they can express nested but not parallel loops. Iterators of CLU (which was the first real language with yield) made this limitation clear. I believe the balance of simplicity of implementation vs expressivity is often an illuminating question. For example, we gained a lot of insight investigating a deliberately restricted version of Turing Machines: read-only Turing machines, or Finite State Automata. Primitive recursive functions are another example.
Iteratees and their many variations are in a different league. They do capture and reify a part of their continuation. Therefore, they can do zip. Iteratees can do much more: the ordinary ZipWith means reading two sources in lockstep. Iteratees can read two sources in arbitrary, statically unknown amounts -- for example, merge two sorted sources:
http://okmij.org/ftp/Streams.html#2enum1iter
One can also distribute one source to several sinks, which is simpler. By Oleg at Sat, 2013-02-23 09:13
| login or register to post comments |
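my note: the merge-two-sorted-sources example in Python generators (which, like iteratees, reify their suspension point, so they can pull from either source in arbitrary amounts -- exactly what simple generators can't express):

def merge_sorted(xs, ys):
    # merge two sorted streams lazily, pulling from each source only as needed
    # (uses None as an end-of-stream sentinel, so streams must not contain None)
    xs, ys = iter(xs), iter(ys)
    x, y = next(xs, None), next(ys, None)
    while x is not None and y is not None:
        if x <= y:
            yield x
            x = next(xs, None)
        else:
            yield y
            y = next(ys, None)
    while x is not None:              # drain whichever source is left
        yield x
        x = next(xs, None)
    while y is not None:
        yield y
        y = next(ys, None)

print(list(merge_sorted([1, 4, 9], [2, 3, 10])))  # [1, 2, 3, 4, 9, 10]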
http://okmij.org/ftp/Streams.html
---
http://coffeescript.org/#literate
http://aptiverse.com/blog/css_features/
---
---
Consider bash:

if [ `which foo` == "/usr/bin/foo" ]; then
    do_stuff
fi

compared to Ruby:

if `which foo`.strip == "/usr/bin/foo"
  do_stuff
end

compared to Python:

import os
if os.popen("which foo").read().strip() == "/usr/bin/foo":
    do_stuff()
---
" Having a class model instead of a prototype model also helps—much as I like working with prototypes—because you can easily eliminate late binding (“hash lookups”) and allocations, two of the performance killers mentioned in the slides. The operative word there is easily. You can analyse and JIT all you want, but you cannot feasibly solve at runtime the fundamental constraints imposed by the language. "
" I really like the basic convention in C that you should pass needed allocations in as arguments (and it is virtually always possible) allowing these allocations to be members of of members of other allocations, aggregating allocations so there are much fewer. "
"
pcwalton 1 day ago
| link |
Related to this is the importance of deforestation. Some good links:
Deforestation is basically eliminating intermediate data structures, which is similar to what the "int(s.split("-", 1)[1])" versus "atoi(strchr(s, '-') + 1)" slides are about. If you consider strings as just lists of characters, then it's basically a deforestation problem: the goal is to eliminate all the intermediate lists of lists that are constructed. (It's something of a peculiar case though, because in order to transform into the C code you need to not only observe that indexing an rvalue via [1] and throwing the rest away means that the list doesn't have to be constructed at all, but you also need to allow strings to share underlying buffer space—the latter optimization isn't deforestation per se.)
I don't know if there's been much effort into deforestation optimizations for dynamic languages, but perhaps this is an area that compilers and research should be focusing on more.
On another minor note, I do think that the deck is a little too quick to dismiss garbage collection as an irrelevant problem. For most server apps I'm totally willing to believe that GC doesn't matter, but for interactive apps on the client (think touch-sensitive mobile apps and games) where you have to render each frame in under 16 ms, unpredictable latency starts to matter a lot.
reply
cwzwarich 1 day ago
| link |
Automatic deforestation can remove intermediate results in a pipeline of computation, but it can not rewrite a program that is based around the querying / updating of a fixed data structure to use an efficient imperative equivalent throughout.
reply
lucian1900 1 day ago
| link |
Deforestation is easily done in lazy languages like Haskell.
As for GC, it would be nice to have good real time GCs in runtimes.
reply
dons 1 day ago
| link |
It has nothing to do with laziness. It has everything to do with the guaranteed absence of side effects.
Deforestation is /more useful/ in strict languages, because allocation of temporary structures costs more. So fusion on strict arrays is better than on lazy streams.
You just can't do it unless you can freely reorder multiple loops, and to do that you need a proof there are no side effects. Haskell just makes that trivial.
reply
klodolph 1 day ago
| link |
> As for GC, it would be nice to have good real time GCs in runtimes.
After decades of GC research, I think the conclusion is, "Yeah, that would be nice." Current state of the art gives us some very nice GCs that penalize either throughput or predictability. One of my favorite stories about GC is here:
http://samsaffron.com/archive/2011/10/28/in-managed-code-we-...
reply
gngeal 11 hours ago
| link |
"Deforestation is easily done in lazy languages like Haskell."
You can also do it in stream-based or data-flow-based languages. Or in pretty much any DSL you decide to implement, if the semantics of the language itself is reasonable.
reply "
---
" irahul 15 hours ago
| link |
Mike Pall of luajit fame has an interesting take on it.
http://www.reddit.com/r/programming/comments/19gv4c/why_pyth...
<quote>
While I agree with the first part ("excuses"), the "hard" things mentioned in the second part are a) not that hard and b) solved issues (just not in PyPy).
Hash tables: Both v8 and LuaJIT manage to specialize hash table lookups and bring them to similar performance as C structs (1). Interestingly, with very different approaches. So there's little reason NOT to use objects, dictionaries, tables, maps or whatever it's called in your favorite language.
(1) If you really, really care about the last 10% or direct interoperability with C, LuaJIT offers native C structs via its FFI. And PyPy has inherited the FFI design, so they should be able to get the same performance someday. I'm sure v8 has something to offer for that, too.
Allocations: LuaJIT has allocation sinking, which is able to eliminate the mentioned temporary allocations. Incidentally, the link shows how that's done for a x,y,z point class! And it works the same for ALL cases: arrays {1,2,3} (on top of a generic table), hash tables {x=1,y=2,z=3} or FFI C structs.
String handling: Same as above -- a buffer is just a temporary allocation and can be sunk, too. Provided the stores (copies) are eliminated first. The extracted parts can be forwarded to the integer conversion from the original string. Then all copies and references are dead and the allocation itself can be eliminated. LuaJIT will get all of that string handling extravaganza with the v2.1 branch -- parts of the new buffer handling are already in the git repo. I'm sure the v8 guys have something up their sleeves, too.
I/O read buffer: Same reasoning. The read creates a temporary buffer which is lazily interned to a string, ditto for the lstrip. The interning is sunk, the copies are sunk, the buffer is sunk (the innermost buffer is reused). This turns it into something very similar to the C code.
Pre-sizing aggregates: The size info can be backpropagated to the aggreagate creation from scalar evolution analysis. SCEV is already in LuaJIT?