Release It: Write software for production

We need to design software to run in production. That’s the main lesson of Michael T. Nygard’s Release It. We often think of shipping the system as the end of the project, when in practice it is just the start.

Release It is an enjoyable book with some excellent production war stories. It suffers from being a little too broad in concepts, and a little too narrow in examples (all enterprise J2EE webapps). Despite this I’d recommend spending some time with it because it advocates a very important and easily overlooked idea: don’t code to pass the QA tests, code to avoid the 3am support call.

What follows are my notes from the book, which are a mixture of what the book says and what I think, grouped in categories that make sense to me.

Design

Think about failure modes as you build. Expect and handle failure. A router between your application and the database will get upgraded, and start closing your TCP connections after 30 seconds of idle time. Your app is only idle that long on Sunday mornings. You can’t test for everything, but you can write your code to recover from failure.

Think about making the inevitable breaks less bad. It’s much easier to do this at development time than to retrofit it later.

Decouple parts of the system. Which ones can you survive without? For how long? Split them apart using middleware such as a message queue, or store-and-forward (the email model). Decoupling components means one failing won’t take out the others.

For every external call (disk I/O, socket I/O, sub-process, pipe) ask:

  • What if I can’t connect? What if it takes 10 minutes to connect? Remote side may be gone, or overloaded.
  • What if the connection drops? What if it hangs, or takes 2 minutes to respond? Some middleware (or the remote side) may have dropped our connection without telling us. TCP will re-transmit into the void for several minutes.
  • What if 10k requests arrive at once? What if remote is using the connection, but is very slow?

Always use timeouts. Don’t wait forever for anything.
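
To make that concrete, here is a minimal Go sketch (the URL and the five-second budgets are placeholders): every outbound call gets an explicit deadline, instead of the default behaviour of waiting forever.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Don't use http.DefaultClient for production calls: it has no timeout,
	// so a hung remote can hold this goroutine (and its connection) forever.
	client := &http.Client{Timeout: 5 * time.Second}

	// A context deadline bounds the whole unit of work: connection setup,
	// TLS handshake, and reading the response.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com/", nil)
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("call failed or timed out:", err) // handle it, don't hang
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```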

For every incoming piece of data ask:

  • What’s the maximum size it could be?
  • What’s the maximum size we can handle?

Add a LIMIT to database queries. Stop reading incoming data beyond a certain size.
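
Something like this (Go; the handler and the 1 MiB cap are made up for illustration) bounds how much of a request body we will ever read. The same idea applies to queries: SELECT ... LIMIT 100 rather than fetching whatever happens to be there.

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// maxBodyBytes is an arbitrary ceiling for this sketch; pick one that fits
// your actual payloads.
const maxBodyBytes = 1 << 20 // 1 MiB

func handleUpload(w http.ResponseWriter, r *http.Request) {
	// Refuse to read an unbounded body; the reader errors once the client
	// has sent more than maxBodyBytes.
	r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes)

	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
		return
	}
	_ = body // process the bounded payload
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/upload", handleUpload)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```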

What is the maximum allowed size for any in-memory cache?

Connection pools are a common bottleneck. Make sure they are big enough and don’t block (unless number of connections db can handle is our capacity bound). We use connection pools to avoid the connection setup cost, so if you block on the pool all the gain is lost. What do we do when all connections in the pool are in use? Make some more? Wait a limited amount of time? (then what?).
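
With Go’s database/sql, for example, those decisions can be made explicit. The numbers below are placeholders and the postgres driver is just an example:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // any driver; postgres is just an example
)

func main() {
	db, err := sql.Open("postgres", "postgres://app:secret@localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Size the pool deliberately rather than accepting the defaults.
	db.SetMaxOpenConns(50)                  // hard ceiling: what the database can actually handle
	db.SetMaxIdleConns(10)                  // keep warm connections around to avoid setup cost
	db.SetConnMaxLifetime(30 * time.Minute) // recycle before idle-killing middleboxes do it for us

	// Bound how long we wait for a connection; if we can't get one quickly,
	// fail the request instead of queueing behind the pool forever.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := db.PingContext(ctx); err != nil {
		log.Println("no connection available, failing fast:", err)
	}
}
```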

Pay double attention to any external / resource requests that happen inside a locked region, because a hung external resource will cause deadlocks in your system.

Retry things (the db might have done a hot failover, a virtual IP might have moved, etc), but only a limited number of times. We don’t want to hammer an already struggling external system. Most failures are not intermittent; they mean something is actually wrong at the other end (full disk, remote server down, etc), so don’t waste time and resources on too many retries.
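
A bounded retry loop might look like this (Go; the three-attempt budget and the backoff values are arbitrary):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callRemote stands in for any operation against an external system.
func callRemote() error {
	return errors.New("connection refused") // always fails in this sketch
}

// withRetries tries op a few times with jittered backoff, then gives up.
// A small fixed budget avoids hammering a system that is already struggling.
func withRetries(attempts int, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		// Exponential backoff with jitter, so retries from many clients
		// don't all arrive in lockstep.
		sleep := time.Duration(1<<i)*100*time.Millisecond +
			time.Duration(rand.Intn(100))*time.Millisecond
		time.Sleep(sleep)
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	if err := withRetries(3, callRemote); err != nil {
		fmt.Println(err) // surface the failure instead of retrying forever
	}
}
```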

The Circuit Breaker pattern helps manage retries. A circuit breaker is a wrapper for dangerous operations (things that can fail, e.g. a remote system call). Normally it performs the operation (‘closed’). If too many operations fail (e.g. timeout) it stops trying and returns errors immediately (‘open’), and of course raises an alert to the monitoring system. While open, it regularly lets a single operation through to probe whether the service has returned; if it has, the breaker closes and normal operation resumes.
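
Here is a stripped-down circuit breaker, just to make the states concrete; a real implementation (or a library) would track the half-open probe and metrics more carefully:

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Breaker wraps a dangerous operation. After maxFailures consecutive errors
// it "opens" and rejects calls immediately; once cooldown has passed it lets
// a single probe call through to see whether the remote service is back.
type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	openedAt    time.Time
	cooldown    time.Duration
}

var ErrOpen = errors.New("circuit breaker open")

func (b *Breaker) Call(op func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // fail fast; don't touch the struggling service
	}
	b.mu.Unlock()

	err := op()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // trip (or re-trip): alert monitoring here
		}
		return err
	}
	b.failures = 0 // a success closes the breaker again
	return nil
}

func main() {
	remote := &Breaker{maxFailures: 5, cooldown: 30 * time.Second}
	_ = remote.Call(func() error { return nil }) // wrap any remote call like this
}
```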

Avoid config file bloat. Keep system properties (db password, IP addresses) separate from application settings (feature flags, essential plumbing, etc). The first is often handled by the operations team and needs to be kept secure; the second is handled by the developers and isn’t sensitive information. 12-factor puts the first type of setting in environment variables, Google puts them in command line parameters.
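
One way to keep the two kinds of settings apart (a Go sketch; the names are made up): operational secrets come from the environment, application settings from flags the developers own.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// System properties: owned by operations, potentially sensitive,
	// injected per environment (the 12-factor style).
	dbURL := os.Getenv("DATABASE_URL")

	// Application settings: owned by developers, not sensitive, fine to
	// keep in the repo or pass on the command line.
	newSearch := flag.Bool("enable-new-search", false, "feature flag for the new search path")
	flag.Parse()

	fmt.Println("db configured:", dbURL != "", "new search:", *newSearch)
}
```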

Fail early: Don’t accept any work (typically connections) until the app is ready. Do a basic self check, add some connections to various connection pools (and ping them), and report status to monitoring system, then start working. You might be bringing up a server during peak load, so it needs to be ready before it opens the doors.
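
A sketch of the idea in Go: do the self check first, and only then open the listening socket, so the load balancer sees ‘connection refused’ rather than slow or broken responses. What warmUp checks is up to you.

```go
package main

import (
	"log"
	"net"
	"net/http"
)

// warmUp is the basic self check: touch the dependencies we need before we
// accept any traffic (open the connection pool, ping the db, load caches).
func warmUp() error {
	return nil // the real checks go here
}

func main() {
	if err := warmUp(); err != nil {
		log.Fatalf("not ready, refusing to start: %v", err)
	}

	// Only now open the listening socket. Until this point the load balancer
	// sees "connection refused", not slow or broken responses.
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	log.Println("ready, accepting work")
	log.Fatal(http.Serve(ln, nil))
}
```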

Fail fast: Don’t accept a unit of work unless you expect to complete it. Start by checking all the necessary resources (can get a db conn from pool, remote system has not tripped Circuit Breaker, not at capacity, etc). If we won’t be able to complete the request, return an error immediately.
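
For example (Go; the 100-slot limit stands in for whatever your load tests say you can handle), gate each request on a capacity check and return an error immediately when the gate is full:

```go
package main

import (
	"log"
	"net/http"
)

// slots is a simple capacity gate: at most 100 requests in flight.
var slots = make(chan struct{}, 100)

func handle(w http.ResponseWriter, r *http.Request) {
	select {
	case slots <- struct{}{}:
		defer func() { <-slots }()
	default:
		// At capacity: say so immediately rather than queueing the request
		// and timing it out later.
		http.Error(w, "over capacity, try again shortly", http.StatusServiceUnavailable)
		return
	}
	// Also check other preconditions here: the pool has a free connection,
	// the remote system's circuit breaker is closed, and so on.
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handle)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```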

Shut down cleanly. Stop accepting new work but let current work finish (with a timeout). We’ll need this to stop the old version when we release a new one.
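
In Go this is roughly http.Server.Shutdown with a deadline; the 30-second grace period below is a placeholder:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for the deploy tooling to ask us to stop.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Stop accepting new connections and let in-flight requests finish,
	// but don't wait forever.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Println("forced shutdown:", err)
	}
}
```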

Design for transparency. Add monitoring from the start.

Anything useful here? Recovery-oriented computing

Testing

Do longevity testing. Not extreme load, but alternating periods of high and low load. The system should survive for as long as you have between releases (a week? a month?). This will expose memory and resource leaks. The high / low variation will expose idle connection timeouts (such as a connection sitting in a pool).

Do integration testing, write tests that use the full system. Run the test suite against your beta / QA environment, run it during the longevity test.

Be perverse in your testing. Make TCP connections hang, respond very slowly, read or write large amounts of random data, repeat the exact same request / message 1000 times, etc.

Test with realistic (large) data volumes. Things like poor SQL queries, missing indexes, or slow algorithms are often only revealed with larger data sets.

QA can’t always have the same number of nodes as production, but make sure it matches on zero, one, or many. If production has 8 app servers (‘many’), make sure QA has at least two. Virtual machines can help lower hardware costs whilst keeping node counts closer to production. Get QA as close to production as possible, ideally a perfect equivalent. Spend the money; it’s nearly always a big saving compared to production downtime. See Blue / Green deployments.

Capacity

Do load testing. How much load can the system handle before errors or slowdowns? This is tricky, as often the load testing client fails first. It’s not something you can do in an hour between other tasks; take it seriously.

When the load test charts show an inflexion (a ‘knee’), that usually means a bounding constraint has been reached. Figure out what it is: CPU, memory, disk I/O, number of threads, number of connections, etc.

Reaching capacity usually triggers a stability problem, so how do we limit the load to what we can handle? Options:

  • handshaking: Use protocol to tell upstream to back off. TCP does this.
  • healthcheck: Make an endpoint which says No if we are at capacity. The healthcheck should do a general health check, more than just respond OK. Most load balancers can poll it and remove a back-end server from the pool if the healthcheck says no (see the sketch after this list).
  • limit connections at the load balancer level, e.g. a maximum per back-end server.
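
A healthcheck handler along those lines might look like this (the /healthz path and the atCapacity flag are my own placeholders):

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// atCapacity is flipped by whatever tracks load (sessions, in-flight work).
var atCapacity atomic.Bool

func healthHandler(w http.ResponseWriter, r *http.Request) {
	// Do a real check, not just "the process is up": can we reach the
	// database, are the circuit breakers for required remote systems closed?
	if atCapacity.Load() {
		http.Error(w, "unavailable", http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/healthz", healthHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```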

In a webapp the number of sessions is a capacity limit. Each session uses RAM, maybe CPU and network bandwidth (if it is shared between servers), and persists until it times out (often 30 mins). Clients that don’t have cookies (crawlers, bots) may create lots of sessions. If you can make the session just an in-memory cache of on-disk data (last search, shopping cart contents, etc) then you get both the benefit of effectively infinite sessions (users hate session expiry) and the freedom to be very aggressive about purging idle sessions.

All the usual performance work applies to increasing capacity too.

Scaling

Vertical: Bigger box, go big. Db servers (at least the master / writes) usually have to scale vertically.

Horizontal: Add more boxes, go wide. This is preferred. App servers can often scale horizontally.

CPU cycles are only cheap in financial terms. They cost in clock time. Disk space is cheap, but if it’s local you have to multiply extra usage by number of servers, plus the RAID overhead.

Recovery

Prepare (and practice using) information gathering scripts for production. When something in production breaks you want to a) gather data to isolate the cause, but b) get it going again as fast as possible. The scripts let you gather the data you need for a) very quickly, so as not to delay b).

Monitoring

Monitoring is about transparency. It is needed for immediate and historical reasons.

Make a dashboard for current (recent past) state, with red / green signals (red bad, green good). It lets you see at a glance what the system is doing. Maybe have both a business and an operations dashboard, as those user groups typically care about different indicators.

Historical trends (daily, weekly, monthly) are your best guess of the future; next Saturday’s workload will be most like last Saturday’s. They help answer “what happened before that server crashed?”, and predict when you’re going to need to buy more disk space or another server.

Have an interface that makes correlating metrics easy. A single metric across all servers, and multiple metrics on a single server. When traffic goes up, what other metrics go up?

Monitor as a real user, in a remote location. Have a bot doing an as-realistic-as-possible interaction. There are many services to do this for web apps. Most metrics of interest come from inside the app (in the same process), but if you can monitor it externally use that instead.

Store monitoring data off-system.

Things to monitor:

  • System: CPU, Load, Memory, Net I/O, Disk I/O
  • App: Garbage collection times, num threads, num external connections, data about each connection pool (size, wait times, usage), data about caches (size, hit rate), circuit breaker info.
  • Work load: Number of work units (typically net requests) processed, average time, errors / work units aborted, time of last request. Traffic levels for anonymous vs registered users.
  • Integration points: Avg response time from remote systems, num active, num errors, time of last expected events (e.g. in / out data feeds, backups).
  • Business metrics: Whatever they care about.

Monitor expected events as well as unexpected. e.g. You need to know if the batch jobs didn’t run, the incoming feed didn’t arrive.

If possible define an acceptable range for each metric and alert (or at least log) when it goes out of range. For many metrics that’s very difficult; Thanksgiving traffic is ‘normal’ for Thanksgiving, not for any other part of the year.

All the alerting, thresholds, and display decisions should happen outside of the app, in the monitoring system.

Operation

Aggressively archive data off to remote systems. Do all reporting and ad-hoc queries against the remote system (or a read-only slave db). “Ad-hoc” typically means less tested or not tested at all; keep those queries away from the critical system.

Multi-home the production servers (two IP addresses, one or more of them private). Production traffic uses one interface; admin traffic such as backups, data archiving, replication, and monitoring uses the other. This prevents admin traffic from flooding the production network, and keeps monitoring going during e.g. a DDOS attack. For security, bind the admin features only to the admin interface; it’s a separate listener.

Allow settings to be changed live. Things like which features are enabled, size of different connection pools, length of timeouts, etc. Don’t force a server restart to reload config. If you have to change config, it’s usually because something went wrong, load is high, and you can’t afford to reboot.
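
One way to do this in Go (the Settings fields and the SIGHUP trigger are illustrative): keep the live settings behind an atomic value and swap them on reload, so handlers always read the latest values without a restart.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// Settings holds whatever we want to change without a restart.
type Settings struct {
	RequestTimeout time.Duration
	NewSearch      bool
}

var current atomic.Value // holds *Settings

func load() *Settings {
	// In real life, read a file or a config service; hard-coded here.
	return &Settings{RequestTimeout: 5 * time.Second}
}

func main() {
	current.Store(load())

	// Re-read settings on SIGHUP instead of restarting the server.
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	go func() {
		for range hup {
			current.Store(load())
			log.Println("settings reloaded")
		}
	}()

	// Handlers read the latest value each time they need it.
	s := current.Load().(*Settings)
	log.Println("request timeout is", s.RequestTimeout)
	// ... start the server here.
}
```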

Write log messages for operations staff (which might be you, at 2am). An error should be something actionable. Write log files that are easy to scan (align fields, keep each message to one line). Log a work unit id (transaction id, user id, etc) so you can group all the messages for a given work unit.
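
For example, with Go’s log/slog (the ids and field names are made up): one line per event, key=value fields, and the work unit id attached to every message.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// One line per event, key=value fields, so logs align and grep well.
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))

	// Attach the work unit id to every message, so one request's messages
	// can be pulled out of a busy log.
	reqLog := logger.With("request_id", "r-12345", "user_id", 42)

	reqLog.Info("payment authorised", "amount_cents", 1999)
	reqLog.Error("email send failed, will retry from the outbox",
		"provider", "smtp-relay-1") // actionable: says what happens next
}
```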

Don’t login to production servers (unless something really bad has happened). Automate and stay away. Don’t fiddle. Automate deployment, data purging (db tables, log files, any data that grows), anything else you might need production server access for.

Consider a separate set of servers for very important traffic – critical services, a sub group of users, certain third-party systems, etc. If the main service fails the critical ones are not impacted.

Release

Releasing needs to be easy, so you can do it often. Getting good at releasing is a goal in itself, focus on it, from very early on.

Version everything: files, endpoints, protocols, etc. Use a specific version. This allows multiple versions of the code to co-exist.

Release when the software is ready, not on an arbitrary date set months ago (unless you are a product vendor with a PR machine).

Release in phases to avoid downtime:

  • Expansion: Add new features, new db columns, new files, etc. Current site not affected. Don’t add db constraints yet, because old code is still running. Write backwards-compatible code. Upgrade data on-the-fly but make sure it stays compatible with old. During this stage there is extra code to bridge the two versions.
  • Rollout: Push new code to servers one by one, gradually. The backwards-compatible code can co-exist happily. Ideally handover socket connections, or hot-reload code, but at least shut down cleanly (see Design section).
  • Cleanup: Once all servers are on new version and it’s stable, drop old columns, add referential integrity, remove the bridging code.