Internet infrastructure (software-business-specific)

Tools (software-business-specific)

The Joel test

1. Do you use source control?
2. Can you make a build in one step?
3. Do you make daily builds?
4. Do you have a bug database?
5. Do you fix bugs before writing new code?
6. Do you have an up-to-date schedule?
7. Do you have a spec?
8. Do programmers have quiet working conditions?
9. Do you use the best tools money can buy?
10. Do you have testers?
11. Do new candidates write code during their interview?
12. Do you do hallway usability testing?



Having many small services, compared to one or a few monoliths, is good in larger organizations but a hassle in small ones. Don't do microservices until you have at least 3 ops guys on staff (thanks P.R.).


Practice restoring from backup sometimes.

Have alerts, and test them sometimes to make sure they are getting through (not blocked as spam, etc).

On production systems, have a non-default (preferably colored) prompt that is visibly distinct (and maybe a different prompt for staging, so that if a prompt accidentally gets left at the default on a production machine, that's visible too). Try to avoid ever executing commands manually on a production system.
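As a rough illustration of the distinct-prompt idea, here is a minimal sketch that builds a bash PS1 string per environment. The environment names, the colors, and the function name `bash_prompt` are assumptions made for this example, not part of any standard tool.

```python
# Hypothetical environment names and colors; adjust to your own setup.
ANSI_BACKGROUNDS = {
    "production": "41",  # red background
    "staging": "43",     # yellow background
}


def bash_prompt(environment: str) -> str:
    """Return a bash PS1 string that labels the environment in color."""
    if environment not in ANSI_BACKGROUNDS:
        raise ValueError(f"unknown environment: {environment!r}")
    color = ANSI_BACKGROUNDS[environment]
    # \[ and \] tell bash the enclosed bytes are non-printing,
    # so line editing still computes the prompt width correctly.
    start = f"\\[\\033[1;37;{color}m\\]"
    reset = "\\[\\033[0m\\]"
    return f"{start}[{environment.upper()}]{reset} \\u@\\h:\\w\\$ "


if __name__ == "__main__":
    print(bash_prompt("production"))
```

The printed string could be assigned to PS1 in a shell startup file on each machine. Because both production and staging get their own label, a machine still showing the plain default prompt stands out as misconfigured.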


Have checklists/flowcharts of what to do in case of various kinds of outages. These are sometimes called 'runbooks'. Example:

A lot of problems in outages arise from the person trying to fix the outage making a mistake and making it worse. If there is an outage and you get to the end of the checklist or there is no checklist, and you think you have an idea for how to fix things, think through what you plan to do and write it down before doing it. To avoid people stepping on each other's toes, make sure you communicate what you are about to (try to) do to others, and get confirmation that they heard it (maybe even have them repeat it back to you), before doing it. When about to run a command on production that cannot be undone, pause and think: is this the right directory? Is this the right server (is it staging or production? is it the master or the secondary?)? Are these the right parameters? Can I simulate the command beforehand? If possible, have two people on hand to work together and double-check each other (to prevent accidents).
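The "pause and think: is this the right server?" step can be partly mechanized. The following is a minimal sketch, assuming a wrapper script is acceptable in your environment; the function name `confirm_irreversible` and its parameters are hypothetical. It refuses to proceed on the wrong host and otherwise makes the operator retype the hostname.

```python
import socket
from typing import Callable, Optional


def confirm_irreversible(
    command: str,
    expected_host: str,
    actual_host: Optional[str] = None,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only on the expected host and after the operator retypes its name."""
    host = actual_host if actual_host is not None else socket.gethostname()
    if host != expected_host:
        # Wrong machine: stop before the operator can confirm anything.
        print(f"Refusing: this is {host!r}, not {expected_host!r}.")
        return False
    print(f"About to run on {host!r}: {command}")
    typed = ask("Type the hostname to confirm: ")
    return typed.strip() == host
```

Retyping the hostname is slower than answering "y", which is the point: it forces a look at which machine the command is about to touch. It does not replace a second person double-checking.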

Replication failure is not quite an outage but is similar in terms of urgency. If you are in the cloud, then instead of messing around on a live primary or secondary server to deal with this, usually you should be spinning up a new server and reconfiguring that one instead. Then you only remove the old, live server after the new one is working. The procedure for spinning up and inserting the new server should be the same as when you do upgrades.


Be reluctant to blame individuals for making mistakes that led to outages. Typically, you want to design your processes to be robust to individual mistakes, rather than relying on individuals not to make mistakes. If individuals are going to be blamed, this makes people less likely to be open about what happened, which makes it harder to fix the process.


Hash user passwords.
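A minimal sketch of salted password hashing using only the Python standard library. PBKDF2-HMAC-SHA256 is used here because it ships with Python; the iteration count and function names are assumptions for the example, and a dedicated scheme such as bcrypt, scrypt, or Argon2 may suit you better.

```python
import hashlib
import hmac
import os

# Assumption: tune this to your hardware and to current guidance.
ITERATIONS = 600_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); store both, never the plaintext password."""
    salt = os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected)
```

The per-user salt means two users with the same password get different digests, and the deliberately slow key derivation raises the cost of guessing passwords if the database leaks.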

Use two-factor authentication for admin/IT/dev employees.

Never store passwords in source controlled files.
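One common alternative is to read secrets from the process environment at startup, so that nothing sensitive is committed. This is a minimal sketch; the helper name `load_secret` and the variable name `DB_PASSWORD` are hypothetical.

```python
import os
from typing import Mapping


def load_secret(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Fetch a secret from the environment, failing loudly if it is missing."""
    value = environ.get(name)
    if not value:
        # Fail at startup rather than falling back to a hard-coded default.
        raise RuntimeError(f"missing required secret: {name}")
    return value


# Usage (hypothetical variable name):
# db_password = load_secret("DB_PASSWORD")
```

How the variable gets set (a secrets manager, deployment tooling, an untracked local file) is a separate choice; the point of the sketch is only that the code in source control contains the name of the secret, not its value.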


Some common types of tests:

It is sometimes said that there should be a 'pyramid' of many unit tests, fewer integration tests, fewer still end-to-end tests, and fewest manual tests. This is because unit tests, at the bottom of the pyramid, are easy to write and maintain and clearly identify what is wrong when they fail, but can miss many potential problems; whereas end-to-end tests miss little but are hard to write and brittle, and when they fail you still have a lot of debugging to do to find the root problem.

Some other types of automated tests:

Some other types of manual tests:

The Joel Test includes "Do you have testers?" as a question because some kinds of testing (especially manual testing) can be performed by people with less specialized skill than software development, so it is uneconomical to pay expensive software developers to do what could be done by a cheaper dedicated tester.


Some tips from [1] (most of these are direct quotes, sometimes paraphrased) (you might want to just follow that link and read the original, because it has pictures that make it clear what they mean):

More tips:


nnq 1 day ago

Your advice is actually good for a slightly unexpected reason: most people can't really understand the difference between UI and UX. Even worse are those that don't get it that "UI design" is of itself completely different from "graphic design". So if you tell someone "UX is terribly important", 90% of the time they will focus on "UI design" or "graphic design" and miss the point entirely while wasting tons of resources.

You can have:

So yeah... if what I wrote above ain't obvious to you, then don't focus on UX, because you'll actually do it wrong anyway. Focus on the problem you solve, on the product you develop to solve it, and on selling the solution. If the UX is not extremely bad or you're not in a "fashion/fad" driven niche, it will probably be ok despite suboptimal UX if you do the rest right.


idlewords 1 day ago

One thing not on this list that is an easy win is to make your pages fast. This will separate you from 90%+ of competitors. A lot of the bloat in modern web apps and web pages is easy to remove, and people will love you for it.

I never mentioned it in Pinboard marketing, but I moved heaven and earth to keep the site fast for everyone, and it was speed that gave the site its first toehold.

That said, I think the leap from 100 to 10,000 users is very hard, and I don't know of any advice about how to cross that gap.



sre tips ops triage incident ticket

great article:


foo101 3 days ago

I really like the point about runbooks/playbooks.

> We ended up embedding these dashboards within Confluence runbooks/playbooks followed by diagnosing/triaging, resolving, and escalation information. We also ended up associating these runbooks/playbooks with the alerts and had the links outputted into the operational chat along with the alert in question so people could easily follow it back.

When I used to work for Amazon, as a developer, I was required to write a playbook for every microservice I developed. The playbook had to be so detailed that, in theory, any site reliability engineer who has no knowledge of the service should be able to read the playbook and perform the following activities:

It took a lot of documentation and excellent organization of such documentation to keep the services up and running.


twic 3 days ago

A far-out old employer of mine decided that their standard format for alerts, sent by applications to the central monitoring system, would include a field for a URL pointing to some relevant documentation.

I think this was mostly pushed through by sysadmins annoyed at getting alerts from new applications that didn't mean anything to them.
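A minimal sketch of what such an alert format might look like, with the documentation link as a required field and a chat message that carries the link along with the alert. The class name, field names, and example URL are all hypothetical; the comments above do not describe the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str        # application that raised the alert
    summary: str       # one-line description of the problem
    severity: str      # e.g. "page" or "ticket" (assumed values)
    runbook_url: str   # link to the relevant runbook/playbook

    def __post_init__(self) -> None:
        # Reject alerts that arrive without a documentation link.
        if not self.runbook_url.startswith(("http://", "https://")):
            raise ValueError("every alert must link to documentation")


def to_chat_message(alert: Alert) -> str:
    """Format the alert for an operational chat channel, runbook link included."""
    header = f"[{alert.severity.upper()}] {alert.source}: {alert.summary}"
    return f"{header}\nRunbook: {alert.runbook_url}"
```

Making the link mandatory at construction time means a new application cannot start sending alerts until someone has written at least a stub of documentation for them.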


peterwwillis 3 days ago

I partly rely on incident reports and issues as part of my documentation. Sometimes you will get an issue like "disk filling up", and maybe someone will troubleshoot it and resolve it with a summary comment of "cleaned up free space in X process". Instead of making that the end of it, create a new issue which describes the problem and steps to resolve in detail. Update the issue over time as necessary. Add a tag to the issue called 'runbook'. Then mark related issues as duplicates of this one issue. It's kind of horrible, but it seamlessly integrates runbooks with your issue tracking.




"Elaborate usability tests are a waste of resources. The best results come from testing no more than 5 users and running as many small tests as you can afford." -- [2]