Bayle Shanks's website: notes-business-startups-startups-stSoftwareSpecific

Internet infrastructure (software-business-specific)

Tools (software-business-specific)

The Joel test

" 1. Do you use source control? 2. Can you make a build in one step? 3. Do you make daily builds? 4. Do you have a bug database? 5. Do you fix bugs before writing new code? 6. Do you have an up-to-date schedule? 7. Do you have a spec? 8. Do programmers have quiet working conditions? 9. Do you use the best tools money can buy? 10. Do you have testers? 11. Do new candidates write code during their interview? 12. Do you do hallway usability testing? "

-- http://www.joelonsoftware.com/articles/fog0000000043.html

Microservices

Having many small services, compared to one or a few monoliths, is good in larger organizations but a hassle in small ones. Don't do microservices until you have at least 3 ops guys on staff (thanks P.R.).

Operations

Practice restoring from backup sometimes.

Have alerts, and test them sometimes to make sure they are getting through (not blocked as spam, etc).

On production systems, have a non-default (preferably colored) prompt that is visibly distinct (and maybe a different prompt for staging, so that if a prompt accidentally gets left at the default on a production machine, that's visible too). Try to avoid ever executing commands manually on a production system.
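One way to get a visibly distinct prompt per environment is to generate the prompt string from the environment name. The following is a minimal sketch, assuming a bash shell; the function name build_ps1, the environment names, and the color choices are all made up for illustration.

```python
# ANSI color codes per environment (illustrative choices, not a standard).
COLORS = {
    "production": "41;97",  # bright white text on a red background
    "staging": "43;30",     # black text on a yellow background
}


def build_ps1(environment: str) -> str:
    """Return a bash PS1 string with a colored [ENVIRONMENT] tag in front."""
    if environment not in COLORS:
        raise ValueError(f"unknown environment: {environment!r}")
    color = COLORS[environment]
    label = environment.upper()
    # \[ and \] tell bash that the enclosed escape codes take no screen width.
    return rf"\[\e[{color}m\][{label}]\[\e[0m\] \u@\h:\w\$ "
```

The returned string could be assigned to PS1 in the shell startup file of each machine. A machine still showing the default prompt then stands out as unconfigured.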

Outages

Have checklists/flowcharts of what to do in case of various kinds of outages. These are sometimes called 'runbooks'. Example: https://gitlab.com/gitlab-com/runbooks

A lot of problems in outages arise from the person trying to fix the outage making a mistake and making it worse. If there is an outage and you get to the end of the checklist, or there is no checklist, and you think you have an idea for how to fix things, think through what you plan to do and write it down before doing it. To avoid people stepping on each other's toes, tell the others what you are about to (try to) do, and get confirmation that they heard it (maybe even have them repeat it back to you), before doing it. When about to run a command on production that cannot be undone, pause and think: Is this the right directory? Is this the right server (staging or production, primary or secondary)? Are these the right parameters? Can I simulate the command beforehand? If possible, have two people on hand to work together and double-check each other (to prevent accidents).
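The "pause and think" step can be partly enforced in tooling. Here is a minimal sketch of a guard that refuses to proceed unless the machine is the one you meant and the operator retypes its hostname. The function name confirm_destructive and its parameters are hypothetical; the current_host and ask parameters exist only so the check can be exercised without a real terminal.

```python
import socket
from typing import Callable, Optional


def confirm_destructive(
    command: str,
    expected_host: str,
    current_host: Optional[str] = None,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True only if we are on expected_host and the operator retypes it."""
    host = current_host if current_host is not None else socket.gethostname()
    if host != expected_host:
        # Wrong server: stop before the operator can confirm out of habit.
        print(f"Refusing: this is {host!r}, but you expected {expected_host!r}")
        return False
    print(f"About to run on {host}: {command}")
    typed = ask("Type the hostname to confirm: ")
    return typed.strip() == expected_host
```

Retyping the hostname is slower than pressing "y", which is the point: it forces a look at which server the command is about to hit.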

Replication failure is not quite an outage but is similar in terms of urgency. If you are in the cloud, then instead of messing around on a live primary or secondary server to deal with this, usually you should be spinning up a new server and reconfiguring that one instead. Then you only remove the old, live server after the new one is working. The procedure for spinning up and inserting the new server should be the same as when you do upgrades.

Postmortems

Be reluctant to blame individuals for making mistakes that led to outages. Typically, you want to design your processes to be robust to individual mistakes, rather than relying on individuals not to make mistakes. If individuals are going to be blamed, this makes people less likely to be open about what happened, which makes it harder to fix the process.

Security

Hash user passwords.
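A minimal sketch of salted password hashing using only the Python standard library (PBKDF2 with SHA-256). The iteration count is an assumption and should be tuned to your hardware; a dedicated password-hashing library may be a better choice in practice.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 600_000  # assumed work factor; raise it as hardware gets faster


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store both, never the plaintext password."""
    salt = os.urandom(16)  # fresh random salt for every password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected)
```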

Use two-factor authentication for admin/IT/dev employees.

Never store passwords in source controlled files.
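One common alternative is to read secrets from the environment at startup, so the source tree never contains them. A minimal sketch; the helper name get_secret and the variable name in the usage note are hypothetical.

```python
import os


def get_secret(name: str) -> str:
    """Fetch a secret from the environment, failing loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        # Fail at startup rather than later with a confusing login error.
        raise RuntimeError(f"missing required secret: {name}")
    return value
```

Usage would look like get_secret("DB_PASSWORD"), with the variable set by whatever deploys the service rather than by a file in the repository.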

Testing

Some common types of tests:

Manual functionality tests: A person uses the software (often according to prewritten scripts) and looks for bugs
Automated functionality tests: There are many kinds of these. Within automated functionality tests, at least three types of tests are commonly distinguished:
End-to-end tests: Tests that attempt to simply automate scripted manual functionality testing, usually using some sort of UI automation software
Unit tests: Tests that test relatively small portions of code in isolation
Integration tests: Sometimes used as a catch-all for tests of larger scale than unit tests but smaller scale than end-to-end tests. Examples might be:
- testing an application with a UI by issuing commands via an application's internal (or exposed) API
- testing just under the level of the UI, that is to say, instead of using external UI automation software you fake user inputs by manually calling whatever application-specific functions normally would be called by the UI layer of the application
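To make the distinction concrete, here is a small sketch using Python's unittest module. The application code (line_total and CartController) is invented for illustration. The first test class is a unit test of one function in isolation; the second is an integration test "just under the UI", calling the handler that a button click would normally call.

```python
import unittest


def line_total(unit_price_cents: int, quantity: int) -> int:
    """Price of one cart line, in cents."""
    if quantity < 0:
        raise ValueError("quantity must be non-negative")
    return unit_price_cents * quantity


class CartController:
    """Hypothetical layer that the UI calls when the user clicks 'Add'."""

    def __init__(self):
        self.lines = []

    def on_add_clicked(self, unit_price_cents: int, quantity_text: str) -> None:
        # The UI hands over raw text from the quantity field.
        self.lines.append(line_total(unit_price_cents, int(quantity_text)))

    def total_cents(self) -> int:
        return sum(self.lines)


class LineTotalUnitTest(unittest.TestCase):
    def test_multiplies_price_by_quantity(self):
        self.assertEqual(line_total(250, 3), 750)

    def test_rejects_negative_quantity(self):
        with self.assertRaises(ValueError):
            line_total(250, -1)


class CartIntegrationTest(unittest.TestCase):
    def test_two_simulated_clicks(self):
        cart = CartController()
        cart.on_add_clicked(250, "3")  # fake the user input, no UI automation
        cart.on_add_clicked(100, "2")
        self.assertEqual(cart.total_cents(), 950)
```

If saved to a file, these can be run with "python -m unittest" followed by the file name.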

It is sometimes said that there should be a 'pyramid' of many unit tests, fewer integration tests, fewer still end-to-end tests, and fewest manual tests. This is because, at the bottom of the pyramid, unit tests are easy to write and maintain and clearly identify what is wrong when they fail, but can miss many potential problems; whereas end-to-end tests don't miss much but are hard to write and brittle, and when they fail you still have a lot of debugging to do to figure out the root problem.

Some other types of automated tests:

Load tests: Even if a system works, perhaps it fails when there is high 'load', that is, when it is being asked to do a difficult task, or asked to do a lot of stuff at once.
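A toy illustration of the "a lot of stuff at once" case: call some task many times from several threads and look at the latencies. This is only a sketch with made-up function names (run_load, percentile); real load tests are usually done with dedicated tools against a staging copy of the system.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_load(task: Callable[[], None], requests: int, concurrency: int) -> List[float]:
    """Call task `requests` times on `concurrency` threads; return latencies in seconds."""

    def timed(_):
        start = time.perf_counter()
        task()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(requests)))


def percentile(values: List[float], fraction: float) -> float:
    """Rough percentile: the value at position fraction * len in sorted order."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]
```

For example, run_load(lambda: time.sleep(0.01), requests=200, concurrency=20) followed by percentile(latencies, 0.95) gives a rough 95th-percentile latency; in a real test the task would be a request to the system under test.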

Some other types of manual tests:

Usability testing: Watch a user using the product to identify places where the UI is difficult
Penetration testing: Have a person or group (a 'pentester' or 'red team') attempt to attack the security of your software (and/or organization)

The Joel Test includes "Do you have testers?" as a question because some kinds of testing (especially manual testing) can be performed by people with less specialized skill than software development, so it is uneconomical to pay expensive software developers to do what could be done by a cheaper dedicated tester.

Webdesign

Some tips from [1]. Most of these are direct quotes, some are paraphrased; you might want to just follow that link and read the original, because it has pictures that make clear what they mean:

Use color and weight to create hierarchy instead of size
Don’t use grey text on colored backgrounds; instead, make the text closer to the background color. Do this either by using white text with reduced opacity, or by hand-picking a color based on the background color
Offset your shadows, e.g. the 4px vertical offset in "box-shadow: 0 4px 6px 0 hsla(0, 0%, 0%, 0.2);"
Use fewer borders. Instead, use box shadows, or different background colors, or extra whitespace
Don’t blow up icons that are meant to be small. Even if they are vector images, they will lack detail. If small icons are all you’ve got, try enclosing them inside another shape and giving the shape a background color.
Use color accent borders to add color to a bland design
Not every button needs a background color. Semantics are an important part of button design, but there’s a more important dimension that’s commonly forgotten: hierarchy. Secondary actions should have lower contrast background colors or even just outline styles. Tertiary actions should usually just be styled like links. What about destructive actions, shouldn’t they always be red? No, not if they are not the primary action. Save the big, red, and bold styling for when that negative action actually is the primary action in the interface, like in a confirmation dialog.

More tips:

todo

nnq 1 day ago

Your advice is actually good for a slightly unexpected reason: most people can't really understand the difference between UI and UX. Even worse are those that don't get it that "UI design" is of itself completely different from "graphic design". So if you tell someone "UX is terribly important", 90% of the time they will focus on "UI design" or "graphic design" and miss the point entirely while wasting tons of resources.

You can have:

products with horrible UIs and horrible graphic design, but awesome user experience (UXs) -- because they solve a problem perfectly while enabling a good workflow, or are extensible and adaptable to new unanticipated workflows
products with horrible graphic design but great UIs -- if the user flow is intuitive and productive, the interface responsive and discoverable, it really doesn't matter how shit looks
products with great UIs and awesome graphic design but horrible UX -- the UI may be both intuitive and awesomely designed and responsive... but if it enables wrong or sub-optimal workflows more than the right ones, nudging users towards bad mindsets/perspectives/workflows, it lowers everyone's productivity and sooner or later people will realize that they are dragged down or disabled by that product with the "great" UI

So yeah... if what I wrote above ain't obvious to you, then don't focus on UX, because you'll actually do it wrong anyway. Focus on the problem you solve, on the product you develop to solve it, and on selling the solution. If the UX is not extremely bad or you're not in a "fashion/fad" driven niche, it will probably be ok despite suboptimal UX if you do the rest right.

idlewords 1 day ago

One thing not on this list that is an easy win is to make your pages fast. This will separate you from 90%+ of competitors. A lot of the bloat in modern web apps and web pages is easy to remove, and people will love you for it.

I never mentioned it in Pinboard marketing, but I moved heaven and earth to keep the site fast for everyone, and it was speed that gave the site its first toehold.

That said, I think the leap from 100 to 10,000 users is very hard, and I don't know of any advice about how to cross that gap.

---

sre tips ops triage incident ticket

great article:

https://zwischenzugs.com/2017/04/04/things-i-learned-managing-site-reliability-for-some-of-the-worlds-busiest-gambling-sites/

---

foo101 3 days ago

I really like the point about runbooks/playbooks.

> We ended up embedding these dashboard within Confluence runbooks/playbooks followed by diagnosing/triaging, resolving, and escalation information. We also ended up associating these runbooks/playbooks with the alerts and had the links outputted into the operational chat along with the alert in question so people could easily follow it back.

When I used to work for Amazon, as a developer, I was required to write a playbook for every microservice I developed. The playbook had to be so detailed that, in theory, any site reliability engineer who has no knowledge of the service should be able to read the playbook and perform the following activities:

Understand what the service does.
Learn all the curl commands to run to test each service component in isolation and see which ones are not behaving as expected.
Learn how to connect to the actual physical/virtual/cloud systems that keep the service running.
Learn which log files to check for evidence of problems.
Learn which configuration files to edit.
Learn how to restart the service.
Learn how to roll back the service to an earlier known good version.
Learn the resolutions to common issues seen earlier.
Perform a checklist of activities to be performed to ensure all components are in good health.
Find out which development team of ours to page if the issue remains unresolved.

It took a lot of documentation and excellent organization of such documentation to keep the services up and running.

twic 3 days ago

A far-out old employer of mine decided that their standard format for alerts, sent by applications to the central monitoring system, would include a field for a URL pointing to some relevant documentation.

I think this was mostly pushed through by sysadmins annoyed at getting alerts from new applications that didn't mean anything to them.
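The idea of an alert format with a mandatory documentation link can be sketched as a small data type. This is an illustration only; the class name, field names, and message layout are invented, not the format twic's employer used.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    runbook_url: str  # required: every alert must point at its documentation

    def __post_init__(self):
        parsed = urlparse(self.runbook_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError("alert must carry an http(s) runbook URL")

    def to_chat_message(self) -> str:
        """Format the alert for an ops chat channel, link included."""
        return f"[{self.service}] {self.summary}\nRunbook: {self.runbook_url}"
```

Because the URL is validated when the alert is constructed, an application that tries to send an alert without documentation fails in development rather than paging someone with a message that means nothing to them.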

peterwwillis 3 days ago

I partly rely on incident reports and issues as part of my documentation. Sometimes you will get an issue like "disk filling up", and maybe someone will troubleshoot it and resolve it with a summary comment of "cleaned up free space in X process". Instead of making that the end of it, create a new issue which describes the problem and steps to resolve in detail. Update the issue over time as necessary. Add a tag to the issue called 'runbook'. Then mark related issues as duplicates of this one issue. It's kind of horrible, but it seamlessly integrates runbooks with your issue tracking.

---

links: https://stripe.com/atlas/guides/business-of-saas

---

"Elaborate usability tests are a waste of resources. The best results come from testing no more than 5 users and running as many small tests as you can afford." -- [2]

---