Graham King

Facebook’s code quality problem

tl;dr: It looks like Facebook is getting the textbook results of ignoring code quality.

Update: More examples, and insights from ex-employees in the reddit discussion

Facebook has a software quality problem. I’m going to try to convince you with three examples. This is important because it demonstrates the time-honored principle that quality matters. It demonstrates it, as Facebook engineers like to say, at scale. I don’t work at Facebook or any competitor; I’m just an observer.

Exhibit A: “iOS can’t handle our scale”

About a month ago a Facebook engineer gave this presentation: iOS at Facebook, which was followed by a discussion on reddit.

The Facebook iOS app has over 18,000 Objective-C classes, and in a single week 429 people contributed to it. That’s 429 people working, in some way, on the Facebook iOS app. Rather than draw the obvious lesson that there are too many people working on this application, the presentation goes on to blame everything from git to Xcode for those 18,000 classes.

This comment from ChadBan on reddit sums it up:

All I can think of when reading this is Martin Fowler’s Design Stamina Hypothesis on what happens to a system without architecture. It becomes harder and takes longer to add new features versus a system where architecture is golden. Facebook’s solution to a downward curve seems to be to just throw more developers at it until it bends north. I’d never want anyone in my tiny team thinking this is what the cool kids are doing. I’d never want to work this way, but it works for them.

Exhibit B: Maybe use a ramdisk?

Fast Database Restarts at Facebook. The second exhibit is from Facebook Research. Serious stuff. On the surface it sounds like an interesting article; I read it because of this line:

Our key observation is that we can decouple the memory lifetime from the process lifetime.

The idea is similar to storing your data in memcached or redis, restarting your process, and fetching it back out; you probably do this already. The only difference is that they store the data in shared memory instead of redis / memcached. The shared memory part is actually a red herring, but it takes the paper until the Conclusion to admit it.
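To make that decoupling concrete, here is a minimal C sketch using POSIX shared memory on Linux. The segment name and payload are invented for illustration, not taken from the paper; the point is that a segment created with shm_open outlives the process, so a second run of the same program finds the data still in memory:

    /* Minimal sketch: a POSIX shared memory segment outlives the
       process that created it, so data placed there survives a
       restart. Name and size are illustrative. Build: cc demo.c -lrt */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SEG_NAME "/cache_segment"
    #define SEG_SIZE 4096

    int main(void) {
        /* O_CREAT without O_EXCL: attach if the segment already exists. */
        int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
        if (fd == -1) { perror("shm_open"); return 1; }
        if (ftruncate(fd, SEG_SIZE) == -1) { perror("ftruncate"); return 1; }

        char *mem = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) { perror("mmap"); return 1; }

        if (mem[0] == '\0') {
            /* First run: "warm" the cache. New segments are zero-filled. */
            strcpy(mem, "state written before restart");
            printf("stored state; now restart me\n");
        } else {
            /* Second run: the data is still there, no disk reload. */
            printf("recovered without touching disk: %s\n", mem);
        }
        munmap(mem, SEG_SIZE);
        close(fd);
        return 0;
    }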

They were already persisting data to disk between restarts, but the reload from disk was too slow: “Reading about 120 GB of data from disk takes 20-25 minutes; reading that data in its disk format and translating it to its in-memory format takes 2.5-3 hours.” It’s not the disk that’s making things slow, it’s the format conversion. You have to wait until the conclusion for them to realize this: “One large overhead in disk recovery is translating from the disk format to the heap memory format. We are planning to use the shared memory format described in this paper as the disk format.” What they did was write new store/reload code that works with shared memory, with its own new format converter.
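The appeal of making the disk format identical to the memory format is easy to see. A hedged sketch of the idea (the record type is invented here; a real store also has to handle pointers, alignment, and versioning): when records are written byte-for-byte in their live layout, recovery is one bulk read with no translation pass.

    /* Sketch: disk format == memory format, so reload is a bulk read
       with zero parsing or conversion. Illustrative only; assumes a
       fixed-layout, pointer-free record type on one architecture. */
    #include <stdio.h>

    typedef struct {
        long   key;
        double value;
    } record;

    int main(void) {
        record table[3] = {{1, 1.5}, {2, 2.5}, {3, 3.5}};

        /* Save: dump the live array byte-for-byte. */
        FILE *f = fopen("snapshot.bin", "wb");
        if (!f) { perror("fopen"); return 1; }
        fwrite(table, sizeof table, 1, f);
        fclose(f);

        /* Recover: one read, no format translation step. */
        record loaded[3];
        f = fopen("snapshot.bin", "rb");
        if (!f) { perror("fopen"); return 1; }
        if (fread(loaded, sizeof loaded, 1, f) != 1) {
            fprintf(stderr, "short read\n");
            return 1;
        }
        fclose(f);

        printf("key=%ld value=%g\n", loaded[1].key, loaded[1].value);
        return 0;
    }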

If you are a diligent reader of your Kerrisk (and you should be), you will notice on page 275 (section 14.10) that shared memory on Linux is implemented with a tmpfs filesystem. And tmpfs is how Linux does a ramdisk, one which “consumes only as much memory and swap space as is currently required for the files it holds”.

So, if your save-to-disk format conversion routines are making your code slow, and you are going to have to re-write them anyway, and you want to “decouple the memory lifetime from the process lifetime”, wouldn’t you just write your disk files to a ramdisk? Surely they noticed this too, but by then it was too late; they had to move fast and publish things.
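The equivalence is not hypothetical. On Linux the POSIX shared memory namespace is a tmpfs mount, usually at /dev/shm, so the segment from the sketch above is literally a file on a ramdisk, readable with plain file I/O (the path below assumes the earlier example has run, and a standard /dev/shm mount):

    /* shm_open("/cache_segment", ...) on Linux creates the tmpfs file
       /dev/shm/cache_segment. Ordinary open()/read() sees the same
       bytes: "shared memory" and "file on a ramdisk" are one mechanism. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[128] = {0};
        int fd = open("/dev/shm/cache_segment", O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }
        if (read(fd, buf, sizeof buf - 1) < 0) { perror("read"); return 1; }
        printf("read via the tmpfs path: %s\n", buf);
        close(fd);
        return 0;
    }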

Exhibit C: Our site works when the engineers go on holiday

Fail at Scale. Facebook recognizes they have a reliability problem, and they have a team on the case. They identified one of the causes quite easily:

Figure 1a shows how incidents happened substantially less on Saturday and Sunday even though traffic to the site remains consistent throughout the week. Figure 1b shows a six-month period during which there were only two weeks with no incidents: the week of Christmas and the week when employees are expected to write peer reviews for each other.

These two data points seem to suggest that when Facebook employees are not actively making changes to infrastructure because they are busy with other things (weekends, holidays, or even performance reviews), the site experiences higher levels of reliability.

The article moves on without pausing to ask whether releases that regularly break your site should count as a normal part of the software engineering process.

Conclusion

Facebook is very successful, manifestly has some great engineers and unlimited money, and yet seems to have big issues with software quality. I take two lessons from this:

  • Culture matters. The “Hack” and “Move fast and break things” culture must make it very hard for developers to focus on quality.

  • Quality matters. We all know that if you don’t focus on quality it will come back to bite you.

    • Making even small changes will become increasingly difficult. Eric Evans in Domain-Driven Design: “When complexity gets out of hand, developers can no longer understand the software well enough to change or extend it easily and safely.” Exhibit A showed Facebook needing a huge staff just to maintain momentum on a big ball of mud.
    • Releases will break things, because you don’t understand the relationships well enough to predict the impact of your changes. Exhibit C showed just that.

Next time management or clients try to convince you to move faster and throw quality under the bus, you can say: sure, that will work, as long as we can hire 429 engineers to work on our iOS app.