notes-books-releaseIt

started writing this on page 63; todo: go back and add old stuff. the #s at the beginning of lines are page #s.

section 4 Stability Anti-Patterns

section 4.1 integration points

54 (and others) SOA integration points should always have load balancing, circuit breakers (remember that this server is down/overloaded and stop trying to call it), and timeouts, and maybe handshaking (asking 'can you handle another connection?' before making the connection)

beware of blocking; if you have blocking, always have a timeout!

section 4.2 chain reactions

50 watch out for chain-reaction failures among a server pool (the bulkhead pattern, i.e. splitting the pool into multiple pools that don't load-balance into each other, can help)

section 4.3 cascading failures

section 4.4 users

62 sessions are the Achilles' heel of web applications -- they take up lots of memory -- so try to put as little data in the session as possible. you want to be able to delete old sessions whenever memory is tight, so don't put anything in there that would need to be persisted before the session could be deleted.

66 CDNs like Akamai can help block large-scale screen scrapers

section 4.5 blocked threads

71 need to have an external bot service (a mock client, not in the same data centre) check the availability of your site; internal measures (e.g. monitoring your own logs) aren't enough

71 don't try to roll your own connection pool class (concurrency is hard)

72 don't use synchronized methods on domain objects (my note: i guess this is like saying 'don't use shared-memory concurrency'?)

74 watch out for blocking calls within critical sections; also watch out for calls to things like caches, which can block on external SOA calls you don't think about

76 blocked threads are the proximate cause of most failures

76 especially common are blocked threads in resource pools, especially database connection pools

76 use timeouts

76 beware of proprietary third-party libraries that you don't have the source code for

76 don't roll your own concurrent component 'primitives' such as connection pools and pub/sub queues

section 4.6 attacks of self-denial

77 if you email a special offer or discount code to a 'select group of customers' and it goes fast, so that people want to grab it as soon as it becomes available, a zillion people will find out about it and bring down your site. at the least, if you do this, have a special server cluster that handles just that page, so that these people don't bring down the rest of the site too.

79 keep lines of communication between marketing and ops open so that ops knows about special offers that may generate extraordinary traffic. no one in marketing should send out deep links, or links with session IDs embedded in URLs. the first click for special offers should have special static landing pages.

section 4.7 scaling effects

80 watch out for O(n^2) scaling if your system requires each server to talk to each other server individually. if each server is saying the same thing to many peers, consider using broadcast, multicast, or pub/sub (in ascending order of efficiency and complexity) instead
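(my arithmetic, not the book's) to see the O(n^2) growth: with point-to-point connections, 10 servers means 10 x 9 = 90 directed conversations, but 100 servers means 100 x 99 = 9,900 -- a pool 10x bigger generates roughly 100x the chatter, whereas pub/sub stays at one publish per message regardless of pool size.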

82 watch out for resource contention, i.e. any shared resource, especially one that cannot itself be easily horizontally scaled. make sure the system does fine when each shared resource is under heavy load.

83 shared-nothing is the most scalable. of course, at some point you usually want to persist something, so there will be something shared. sometimes you even want to preserve transient data across a failover, e.g. you want the replacement server to still have every user's session so that users don't notice anything during failover.

section 4.8 unbalanced capacities

87 capacity mismatch within your system: if you have more frontend servers than backend services for some function, then when something changes and that one function is suddenly needed a lot, you'll fail. so use Circuit Breaker, Handshaking, and Bulkheads to contain the failure. Stress both sides of the interface: make sure backend performance gracefully degrades if the frontend hammers it, and make sure the frontend can handle a slow, crashed, or hanging backend.

section 4.9 slow responses

89 a backend service that generates a very slow response can be worse than refusing a connection or returning an error.

90 Fail Fast: "consider sending an immediate error response" to new requests "when the average response time exceeds" a time limit

section 4.10 SLA inversion

92 if you plan to offer an SLA, realize that the best you can do is the product of the SLA levels (e.g. 99.95%) of your service providers (and that a single service provider with no SLA may as well be a zero, so you get a zero) -- unless your SLA explicitly talks only about certain features being available, and those features can stay available even if your service providers fail. An "SLA inversion" is when you try to offer an SLA level above this number. (note: EC2's SLA level is 99.95% as of this writing; this refers to the 'Region Unavailable' error state)
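(my arithmetic, not the book's) example of the product rule: if your service depends on three providers with SLAs of 99.9%, 99.95%, and 99.5%, the best composite availability you can honestly promise is 0.999 x 0.9995 x 0.995 = about 0.9935, i.e. roughly 99.35% -- worse than any single provider.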

section 4.11 unbounded result sets

97 always have LIMIT clauses (PostgreSQL and MySQL term) in database queries -- otherwise someday you'll get a 10-million-row result set where previously you had been getting 100 rows. beware ORM tools, which often don't limit results arising from an association. "beware any relationship that can accumulate unlimited children, such as orders to order lines or user profiles to site visits...audit trail of changes are also suspect". this applies not just to database calls but to any query to a service (e.g. RESTful services)
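a minimal sketch of what this looks like in practice (my code, not the book's; the table and column names are made up) -- cap the result set both in the SQL and at the JDBC level:

    import java.sql.*;
    import java.util.ArrayList;
    import java.util.List;

    public class RecentOrders {
        private static final int MAX_ROWS = 100;   // hard cap on how many rows we will ever pull back

        public List<Long> recentOrderIds(Connection conn, long userId) throws SQLException {
            // the LIMIT clause bounds the result set even if this user has millions of orders
            String sql = "SELECT id FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT " + MAX_ROWS;
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, userId);
                ps.setMaxRows(MAX_ROWS);           // belt and braces: JDBC-level cap as well
                try (ResultSet rs = ps.executeQuery()) {
                    List<Long> ids = new ArrayList<>();
                    while (rs.next()) ids.add(rs.getLong(1));
                    return ids;
                }
            }
        }
    }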

section 5 Stability Patterns

section 5.1 Timeouts

102 a ballpark/default/first guess setting for timeouts (in web SOAs, i'm guessing) is 250 ms
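a minimal sketch (my code, not the book's; the URL and the 250 ms figure are just placeholders) of setting explicit connect and read timeouts on an outbound HTTP call, so a slow or hung dependency can't block the calling thread indefinitely:

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class InventoryClient {
        public int fetchStockLevel(String sku) throws IOException {
            URL url = new URL("http://inventory.internal/stock/" + sku);   // hypothetical service
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setConnectTimeout(250);   // ms allowed to establish the TCP connection
            conn.setReadTimeout(250);      // ms allowed for each read before giving up
            try (InputStream in = conn.getInputStream()) {
                return Integer.parseInt(new String(in.readAllBytes()).trim());
            } finally {
                conn.disconnect();
            }
        }
    }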

section 5.2 Circuit Breaker

107 a Circuit Breaker is something that stops trying to use a service when it fails a bunch of times in a row (and then tries again after a timeout). When a Circuit Breaker trips, it should send an email to Ops or something. "Likewise, operations needs some way to directly trip or reset the circuit breaker"
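a minimal sketch of the idea (my code, not the book's): trip after N consecutive failures, stay open for a cool-down period, then let one trial call through. a real one would also notify ops and expose a manual trip/reset control, as the quote says:

    import java.util.concurrent.Callable;

    public class CircuitBreaker {
        private final int failureThreshold;
        private final long resetTimeoutMillis;
        private int consecutiveFailures = 0;
        private long openedAt = 0;

        public CircuitBreaker(int failureThreshold, long resetTimeoutMillis) {
            this.failureThreshold = failureThreshold;
            this.resetTimeoutMillis = resetTimeoutMillis;
        }

        public synchronized <T> T call(Callable<T> protectedCall) throws Exception {
            if (consecutiveFailures >= failureThreshold
                    && System.currentTimeMillis() - openedAt < resetTimeoutMillis) {
                throw new IllegalStateException("circuit open: skipping call");  // fail fast; alert ops here
            }
            try {
                T result = protectedCall.call();   // half-open: one trial call after the cool-down
                consecutiveFailures = 0;           // success closes the circuit again
                return result;
            } catch (Exception e) {
                if (++consecutiveFailures >= failureThreshold) openedAt = System.currentTimeMillis();
                throw e;
            }
        }
    }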

section 5.3 bulkheads

108 example for bulkheads: if Foo and Bar both use service Baz, then a bug in Foo that causes it to suddenly use too much Baz will also take out Bar. So use a bulkhead -- dedicate some Baz capacity to each of Foo and Bar.
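a minimal sketch of a bulkhead (my code, not the book's), using a semaphore to cap how many concurrent Baz calls one caller may hold; give Foo and Bar separate instances so Foo's bug can only exhaust Foo's partition:

    import java.util.concurrent.Callable;
    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    public class Bulkhead {
        private final Semaphore permits;

        public Bulkhead(int maxConcurrentCalls) {
            this.permits = new Semaphore(maxConcurrentCalls);
        }

        public <T> T call(Callable<T> bazCall) throws Exception {
            // if this caller's partition is exhausted, it fails fast instead of starving the other partition
            if (!permits.tryAcquire(100, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("bulkhead full");
            }
            try {
                return bazCall.call();
            } finally {
                permits.release();
            }
        }
    }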

section 5.4 steady state

113 if people have to log into production systems periodically, e.g. to clean up log files or to purge data, this will invariably lead to fiddling. Fiddling leads to mistakes that take down the production system. Therefore, the system should be able to run indefinitely without human intervention.

113 in order to run indefinitely without human intervention, the system needs to automatically, periodically purge data (e.g. log files, database records, caches)
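a minimal sketch of one such purge job (my code, not the book's; the table name, the 90-day retention, and the Postgres-flavoured SQL are made-up examples):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import javax.sql.DataSource;

    public class AuditTrailPurger {
        public static void schedule(DataSource ds) {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                // delete audit rows older than 90 days so the table cannot grow without bound
                try (Connection c = ds.getConnection();
                     PreparedStatement ps = c.prepareStatement(
                         "DELETE FROM audit_trail WHERE created_at < NOW() - INTERVAL '90 days'")) {
                    ps.executeUpdate();
                } catch (Exception e) {
                    // log and carry on; the next scheduled run will try again
                }
            }, 0, 24, TimeUnit.HOURS);
        }
    }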

114 will the application still work if items are missing from the middle of collections? (Hibernate won't)

118 watch out for caches whose sizes grow over time, too. Limit the amount of memory a cache can consume.
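a minimal sketch of a size-bounded cache (my code, not the book's) using LinkedHashMap's access order plus the removeEldestEntry hook; note it caps entry count, which is only a proxy for memory:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public BoundedCache(int maxEntries) {
            super(16, 0.75f, true);   // accessOrder = true gives least-recently-used eviction order
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;   // evict the LRU entry once the cap is exceeded
        }
    }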

section 5.5 fail fast

120 when a request comes in, the application should first check the state of the connections to other services that it will need and the state of the circuit breakers for those connections. if there will be a transaction, open the transaction. perform basic parameter checking (but not to the extent of violating encapsulation). this way, if any of the prerequisites for handling the request aren't met, you fail fast and return an error. Return different errors for 'system unavailable' and 'user error', lest an upstream client trip a circuit breaker on an OK system over a bunch of user errors.
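a minimal sketch of the shape of such a handler (my code, not the book's; the names and status codes are made up), checking cheap prerequisites before doing any real work and distinguishing user errors from system unavailability:

    public class FailFastHandler {
        interface HealthCheck { boolean isAvailable(); }   // e.g. backed by a circuit breaker's state

        private final HealthCheck database;
        private final HealthCheck paymentGateway;

        public FailFastHandler(HealthCheck database, HealthCheck paymentGateway) {
            this.database = database;
            this.paymentGateway = paymentGateway;
        }

        /** Returns an HTTP-ish status code: 4xx = user error, 5xx = system unavailable. */
        public int handle(String cartId) {
            if (cartId == null || cartId.isEmpty()) {
                return 400;   // user error: upstream clients should NOT trip circuit breakers on this
            }
            if (!database.isAvailable() || !paymentGateway.isAvailable()) {
                return 503;   // system unavailable: fail fast before doing any work
            }
            // ... only now acquire connections, open the transaction, and do the real work ...
            return 200;
        }
    }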

section 5.6 handshaking

section 5.7 test harness

126 list of some failure modes of socket connections:

" • It can be refused. • It can sit in a listen queue until the caller times out. • The remote end can reply with a SYN/ACK and then never send any data. • The remote end can send nothing but RESET packets. • The remote end can report a full receive window and never drain the data. • The connection can be established, but the remote end never sends a byte of data. • The connection can be established, but packets could be lost caus- ing retransmit delays. • The connection can be established, but the remote end never acknowledges receiving a packet, causing endless retransmits. • The service can accept a request, send response headers (suppos- ing HTTP), and never send the response body. • The service can send one byte of the response every thirty seconds. • The service can send a response of HTML instead of the expected XML. • The service can send megabytes when kilobytes are expected. • The service can refuse all authentication credentials. "

127 mock objects, integration tests, test harnesses: mock objects differ from integration tests in that in integration tests you actually run all the services at once, whereas with mock objects you replace some services with local objects that mimic them (or mimic their failures). Test harnesses differ from mock objects in that a mock object is test code mimicking a service locally, whereas a test harness is test code that sits across the network, like the real service (this distinction can be abstracted to other layer boundaries besides 'network'). Test harnesses differ from integration tests in that the thing acting as the service in a test harness is just test code, whereas in an integration test it is the actual service. Test harnesses can be used to explore failure modes that mock objects and integration tests cannot, e.g. network-level failures such as replying with a SYN/ACK and then never sending any data. A mock object cannot mimic a network failure (it can try to, by throwing the exception that is supposed to be thrown in case of that failure, but that doesn't test whether your stack would actually throw that exception if the failure actually occurred). An integration test can only mimic such a failure if you insert a special 'test mode' into the real service, essentially turning it into a combination of production system and test harness (dangerous -- what if you accidentally turn on the test mode in production someday?), or if you find a way to cause that kind of failure in the real system. The lesson: do at least both integration testing and test harnesses, and use the test harnesses to explore failures that arise from a lower level of abstraction than mock objects can reach.
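a minimal sketch of a deliberately misbehaving test harness (my code, not the book's; the port number is arbitrary) that exercises one of the failure modes listed above -- the connection is established but the remote end never sends a byte. point your real client at this port and check whether its read timeout actually fires:

    import java.net.ServerSocket;
    import java.net.Socket;

    public class SilentServer {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(10200)) {
                while (true) {
                    Socket client = server.accept();   // accept the connection...
                    // ...and deliberately never read from it, write to it, or close it
                    System.out.println("holding connection from " + client.getRemoteSocketAddress());
                }
            }
        }
    }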

131 asynchronous connections between systems are more stable (but substantially more complicated) than synchronous ones; e.g. pub/sub message passing is more stable than synchronous request/response protocols like HTTP. a continuum: in-process method calls (a function call into a library; same time, same host, same process); IPC (e.g. shared memory, pipes, semaphores, events; same time, same host, different process); remote procedure calls (XML-RPC, HTTP; same time, different host, different process) (my note: this continuum neglects the RPC/REST distinction, one part of which is the statelessness of HTTP); messaging middleware (MQ, pub/sub, SMTP, SMS; different time, different host, different process); tuple spaces

section 7.3 load testing

143 some people say "users" when they mean "users plus bots", and some people say "concurrent users" when they mean "concurrent sessions". A session stays active for a while, until a timeout has elapsed; by that time, the user has long since left. So concurrent sessions overestimate concurrent users.

section 8 capacity

section 8.1 defining capacity

151 "Performance measures how fast the system processes a single trans- action. This can be measured in isolation or under load. The system’s performance has a major impact on its throughput. Even if using the word performance, the customer does not really care about perfor- mance. Customers are interested in either throughput or capacity. End users, on the other hand, don’t care about overall capacity; they care only about the performance of their own transactions. They can’t log in to the servers to see whether the applications are running. As far as they know, when the response time exceeds their expectation, the system is down. Throughput describes the number of transactions the system can pro- cess in a given time span. The system’s performance clearly affects its throughput but not necessarily in a linear way. Throughput is always limited by a constraint in the system—a bottleneck. Optimizing per- formance of any nonbottleneck part of the system will not increase throughput. Scalability is commonly used two different ways. First, it can describe how throughput changes under varying loads. A graph of requests per second versus response time measures scalability. In the second sense, it refers to the modes of scaling supported by a system. I will use the word scalability in the sense of adding capacity to the system. Finally, the maximum throughput a system can sustain, for a given workload, while maintaining an acceptable response time for each indi- vidual transaction is its capacity. Notice that the definition of capacity includes several important vari- ables. There is no single fixed number that you can regard as your capacity. If the workload changes—perhaps because users are inter- ested in different services around the holidays—then your capacity might be dramatically different. This definition also requires a judgment. What constitutes an “accept- able response time?” For an ecommerce retailer, any response time longer than two seconds will cause customers to walk away. For a financial exchange, it could be shorter—on the order of milliseconds. A travel reservation system, on the other hand, might be allowed five hundred milliseconds for any availability search but thirty seconds to confirm a reservation. "

section 8.2 constraints

154 identify the bottlenecks in the system in the following way. First, build a causal model of how the variables move as load increases while the system is happy. Do this by collecting data, identifying exogenous variables (like "number of incoming requests per second" and "time of day"), and finding endogenous variables (like "database queries per second") with a high correlation to one or more exogenous variables. Then iterate, applying your domain knowledge to determine the direction of causation from those variables to others (e.g. now find variables highly correlated with "database queries per second", maybe "application server response time", which you know primarily depends on database queries per second and not the other way around). This lets you construct a graphical model of causation for the happy system. Now, as the system comes under critical load, one of those correlations will break down first as some variable hits a capacity limit and can't continue to increase/decrease even though its driving variables keep increasing. E.g. maybe your database hits a concurrency limit and can't serve any more simultaneous queries even when incoming requests increase further. There you have your bottleneck.
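a minimal sketch of the correlation step (my code, not the book's): compute the Pearson correlation between two sampled series, e.g. "incoming requests per second" vs "database queries per second". a correlation that holds under normal load but flattens out as load grows is a hint that the dependent variable has hit its capacity limit:

    public class Correlation {
        // Pearson correlation coefficient between two equal-length sampled time series
        public static double pearson(double[] x, double[] y) {
            int n = x.length;
            double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
            for (int i = 0; i < n; i++) {
                sumX += x[i];
                sumY += y[i];
                sumXY += x[i] * y[i];
                sumXX += x[i] * x[i];
                sumYY += y[i] * y[i];
            }
            double cov  = sumXY - sumX * sumY / n;
            double varX = sumXX - sumX * sumX / n;
            double varY = sumYY - sumY * sumY / n;
            return cov / Math.sqrt(varX * varY);
        }
    }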

section 9.1 resource pool contention

" Eliminate contention under normal loads

During “regular peak” operation, there should be no contention for resources. Regular peak load would occur on a typical day outside your company’s peak season.

If possible, size resource pools to the request thread pool

If there’s always a resource ready when a request-handling thread needs it, then you have no efficiency loss to overhead. For database connections, the added connections mainly consume RAM on the database server, which, while expensive, is less costly than lost revenue. Be careful, however, that a single database server can handle the maximum number of connections. During a failover situation, one node of a database cluster must serve all the queries—and all the connections.

Prevent vicious cycles

Resource contention causes transactions to take longer. Slower transactions cause more resource contention. If you have more than one resource pool, this cycle can cause throughput to drop exponentially as response time goes up.

Watch for the Blocked Threads pattern "

section 9.2 excessive JSP fragments

Many Java application servers come configured by default to keep all Java classes in the permanent generation of the heap. So if you have a million different JSPs, the classes they compile into will all be cached in the heap forever. That's no good. So either don't have so many different JSPs (don't put content in your JSPs), or don't use the -noclassgc JVM argument.

section 9.3 AJAX

"The best places to apply AJAX are those interactions that represent a single task in the user’s mind. In Gmail’s case, that task is “send an email.” It could be any multistep interaction that would ordinarily take a few pages to complete. If you have “wireframes” for your site, look for a linear chain of pages that eventually returns to the home page or some other nexus. "

Use JSON, not XML.

Use session affinity.

Don't use js 'eval' to parse JSON (security hazard).

" Make sure your AJAX requests include a session ID cookie or query parameter. (This is much easier with session cookies than query parameters!) If you don’t, your application server will create a new, wasted session for every AJAX request. "

section 9.4 Overstaying Sessions

" The common default timeout of thirty minutes is overkill. For a live site, examine your traffic patterns to find the average and standard deviation of the time between page requests that you would still call a session. In other words, two visits in the same day but hours apart are not the same workflow. Likewise, two visits from the same user, both starting at the home page one hour apart are probably not the same session either. On the other hand, two requests twenty-nine minutes apart but from one deep page to another deep page, probably are the same activity in the user’s eyes. A good bet is to set the session timeout to one standard deviation past the average of that delay. In practice, this will be about ten minutes for a retail site, five for a media gateway, and up to twenty for travel-industry sites. "

Even better, however, is to use the session only as an in-memory cache of persistent data -- users should not notice if a session is lost. persist shopping carts and search results in the DB, not in the session (the session can hold keys to them).

case studies of failure

16

Recovery-oriented computing

"damage containment, automatic fault detection, and component-level restartability"

section 17 Transparency

section 17.1 Perspectives

Historical

"

Predictive

"* How many customers per day can we handle? • When do I have to buy more servers (or disk, bandwidth, or any other computing resource)? • Can we make it through this holiday season? (Notice that this requires a projection about a projection, which doesn’t just double the possibility for error but squares it.) "

Present status

"

Events and metrics should be categorized as normal and abnormal, and abnormal events as low/medium/high danger. In the absence of other information, 2 standard deviations is a good rule of thumb for 'normal'.

" For continuous metrics, a handy rule-of-thumb definition for nominal would be “the mean value for this time period plus or minus two stan- dard deviations.” The choice of time period is where it gets interesting. For most traffic-driven metrics, the time period that shows the most stable correlation will be the “hour of the week”—that is, 2 p.m. on Tuesday. The day of the month means little....For a retailer, the “day of week” pattern will be overlaid on a strong “week of year” cycle. There is no one right answer for all organizations.

"

" Most systems have a daily rhythm of expected events. Those might be feeds from other systems, extracts to ship out to other systems, or just batch jobs to integrate with legacy systems. Whatever the pur- pose, those jobs become just as much a part of the system as the web or database servers. Their execution falls in the category of “required expected events.” The dashboard should be able to represent those expected events, whether or not they’ve occurred, and whether they succeeded or not. A startling number of business-level issues can be traced back to batch jobs failing invisibly for 33 days straight. "

Dashboard

" Green

All of the following must be true:

• All expected events have occurred. • No abnormal events have occurred. • All metrics are nominal. • All states are fully operational.

Yellow At least one of the following is true:

• An expected event has not occurred. • At least one abnormal event, with a medium severity, has occurred. • One or more parameters is above or below nominal. • A noncritical state is not fully operational. (For example, a circuit breaker has cut off a noncritical feature.)

Red At least one of the following is true:

• A required event has not occurred. • At least one abnormal event, with high severity, has occurred. • One or more parameters is far above or below nominal. • A critical state is not at its expected value. (For example, "accepting requests” is false when it should be true.) "

" For the most utility, the dashboard should be able to present different facets of the overall sys- tem to different users. An engineer in operations probably cares first about the component-level view. A developer is more likely to want an application-centric view, whereas a business sponsor probably wants a view rolled up to the feature or business process level. Clearly, this implies that the dashboard should know the linkages between these different views. When observing a component-level outage—for exam- ple, a network failure—an administrator should be able to see which business processes are affected. "

Instantaneous behavior

" Instantaneous behavior answers the question, “What the is going on?” People will be most interested in instantaneous behavior when an incident is already underway. "

log files, etc

book recommendations

70 recommends the book 'Concurrent Programming in Java'

other notes from reviews of the book

"

"General design" contains advice about networking (integration with local and remote servers), security (principle of least privilege and managing passwords), availability (load balancing and clustering), and administration (QA vs production environments, configuration files, anticipating failure in start-up and shut-down, and an administrative interface).

"Operations" contains advice about recovery-oriented computing (surviving failure by restarting components, et al.), transparency (allowing a view of the system's internals), and adaptation (managing change). "

other notes not from the book:

watch out for random routing in a load balancer: by chance, some servers will be routed a large number of anomalously heavy requests (by heavy i mean requests taking much longer than usual to serve), and then routed a bunch of light requests behind them; the unlucky light requests will experience a long latency, which is avoidable. if you must route randomly, then a server should at least be able to 'bounce' requests when it is much busier than the average server ( http://news.ycombinator.com/item?id=5215884 http://news.ycombinator.com/item?id=5218083 )

" So the issue here is two-fold: - It's very hard to do 'intelligent routing' at scale. - Random routing plays poorly with request times with a really bad tail (median is 50ms, 99th is 3 seconds)

The solution here is to figure out why your 99th is 3 seconds. Once you solve that, randomized routing won't hurt you anymore. You hit this exact same problem in a non-preemptive multi-tasking system (like gevent or golang).

reply

aristus 8 days ago

link

I do perf work at Facebook, and over time I've become more and more convinced that the most crucial metric is the width of the latency histogram. Narrowing your latency band --even if it makes the average case worse-- makes so many systems problems better (top of the list: load balancing) it's not even funny. " -- http://news.ycombinator.com/item?id=5216385

misc notes

you gotta test what happens when each service goes down, to make sure other things that don't need to go down, don't. e.g. on an e-commerce site, if the service that tells you whether an item is available for in-store pickup goes down (hangs, not just crashes), the rest of the site shouldn't. be sure to test both crashes and hangs of each component.

netflix's 'chaos monkey' approach: actually kill random live services while the site is live, just to do real-world testing (at a time when you are ready for a problem, and at least you will know the root cause of the problem)