November 1, 2015

Facebook’s code quality problem

Posted in Software at 00:15 by graham

tl;dr: It looks like Facebook is getting the textbook results of ignoring code quality.

Update: More examples, and insights from ex-employees in the reddit discussion

Facebook has a software quality problem. I’m going to try to convince you with three examples. This is important because it demonstrates the time-honored principle that quality matters. It demonstrates it, as Facebook engineers like to say, at scale. I don’t work at Facebook or any competitor; I’m just an observer.

Exhibit A: “iOS can’t handle our scale”

About a month ago a Facebook engineer gave this presentation: iOS at Facebook, which was followed by a discussion on reddit.

The Facebook iOS app has over 18,000 Objective-C classes, and in a single week 429 people contributed to it. That’s 429 people working, in some way, on the Facebook iOS app. Rather than take the obvious lesson that there are too many people working on this application, the presentation goes on to blame everything from git to Xcode for those 18,000 classes.

This comment from ChadBan on reddit sums it up:

All I can think of when reading this is Martin Fowler’s Design Stamina Hypothesis on what happens to a system without architecture. It becomes harder and takes longer to add new features versus a system where architecture is golden. Facebook’s solution to a downward curve seems to be to just throw more developers at it until it bends north. I’d never want anyone in my tiny team thinking this is what the cool kids are doing. I’d never want to work this way, but it works for them.

Exhibit B: Maybe use a ramdisk?

Fast Database Restarts at Facebook. The second exhibit is from Facebook Research. Serious stuff. On the surface it sounds like an interesting article; I read it because of this:

Our key observation is that we can decouple the memory lifetime from the process lifetime.

The idea is similar to storing your data in memcached or redis, restarting your process, and fetching it back out; you probably do this already. The only difference is that they store the data in shared memory instead of redis / memcached. The shared memory part is actually a red herring, but it takes the paper until the Conclusion to admit it.
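If that pattern sounds abstract, here is a minimal sketch of it, assuming a local Redis instance and the redis-py client (the key name and helper functions are my own illustration, not anything from the paper):

    import pickle
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def save_state(state):
        # Before shutting down: serialize the in-memory data and park it in
        # Redis, where it outlives the process.
        r.set("app:state", pickle.dumps(state))

    def load_state():
        # After the restarted process comes back up: fetch the data and
        # rebuild the in-memory structures.
        raw = r.get("app:state")
        return pickle.loads(raw) if raw is not None else {}

Swap Redis for a shared memory segment and you have the shape of the paper.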

They were already persisting data to disk between restarts, but the reload from disk was too slow: “Reading about 120 GB of data from disk takes 20-25 minutes; reading that data in its disk format and translating it to its in-memory format takes 2.5-3 hours.” It’s not the disk that’s making things slow; it’s the format conversion. You have to wait until the conclusion for them to realize this: “One large overhead in disk recovery is translating from the disk format to the heap memory format. We are planning to use the shared memory format described in this paper as the disk format.” What they did was write new store/reload code that works with shared memory, with its own new format converter.
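To see why the translation step, not the I/O, dominates, here is a toy contrast (my own illustration, not the paper’s code) between a format that has to be parsed and rebuilt on the heap at load time and a packed format that can be mapped and used in place:

    import numpy as np

    values = np.arange(1_000_000, dtype=np.int64)

    # "Disk format": text rows that must be parsed and converted into heap
    # objects on every load -- this is the translation step that dominates.
    np.savetxt("table.txt", values, fmt="%d")
    reloaded_slow = np.loadtxt("table.txt", dtype=np.int64)

    # "Shared-memory-shaped format": raw packed bytes that can be memory-
    # mapped and read in place, with no per-record translation.
    values.tofile("table.bin")
    reloaded_fast = np.memmap("table.bin", dtype=np.int64, mode="r")

The paper’s fix amounts to adopting the second kind of format; where the bytes live is a separate question.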

If you are a diligent reader of your Kerrisk (and you should be), you will notice on page 275 (section 14.10) that shared memory on Linux is implemented with a tmpfs filesystem. And tmpfs is how Linux does ramdisks, which “consumes only as much memory and swap space as is currently required for the files it holds”.
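You can check this on a typical Linux box. The snippet below (my own demo; the segment name is made up) creates a POSIX shared memory object and shows it turning up as an ordinary file on the tmpfs mounted at /dev/shm:

    from multiprocessing import shared_memory
    from pathlib import Path

    # Create a POSIX shared memory segment (Python 3.8+).
    shm = shared_memory.SharedMemory(create=True, size=1024, name="demo_segment")

    # On Linux the segment is just a file on the /dev/shm tmpfs mount.
    print(Path("/dev/shm/demo_segment").exists())   # True

    shm.close()
    shm.unlink()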

So, if your save-to-disk format conversion routines are making your code slow, and you are going to have to re-write them anyway, and you want to “decouple the memory lifetime from the process lifetime”, wouldn’t you just write your disk files to a ramdisk? Surely they noticed this too, but by then it was too late: they had to move fast and publish things.
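For the record, the ramdisk version is about as unexciting as it sounds: keep the one new, fast serializer and point the dump file at a tmpfs path. A sketch, with pickle and a made-up file name standing in for whatever their new format code actually does:

    import pickle
    from pathlib import Path

    # /dev/shm is tmpfs on typical Linux setups, so this "disk" file lives in
    # RAM and survives a process restart (though not a machine reboot).
    DUMP = Path("/dev/shm/db_snapshot.bin")

    def save(table):
        # pickle is only a stand-in for the paper's fast snapshot format.
        DUMP.write_bytes(pickle.dumps(table))

    def restore():
        return pickle.loads(DUMP.read_bytes())

One format converter, one code path, and the memory lifetime is still decoupled from the process lifetime.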

Exhibit C: Our site works when the engineers go on holiday

Fail at Scale. Facebook recognizes they have a reliability problem, and they have a team on the case. They identified one of the causes quite easily:

Figure 1a shows how incidents happened substantially less on Saturday and Sunday even though traffic to the site remains consistent throughout the week. Figure 1b shows a six-month period during which there were only two weeks with no incidents: the week of Christmas and the week when employees are expected to write peer reviews for each other.

These two data points seem to suggest that when Facebook employees are not actively making changes to infrastructure because they are busy with other things (weekends, holidays, or even performance reviews), the site experiences higher levels of reliability.

The article moves on, without wondering whether releases regularly breaking your app are a normal part of the software engineering process.

Conclusion

Facebook is very successful, manifestly has some great engineers and effectively unlimited money, and yet still seems to have big issues with software quality. I take two lessons from this:

  • Culture matters. The “Hack” and “Move fast and break things” culture must make it very hard for developers to focus on quality.

  • Quality matters. We all know that if you don’t focus on quality it will come back to bite you.

    • Making even small changes will become increasingly difficult. Eric Evans in Domain-Driven Design: “When complexity gets out of hand, developers can no longer understand the software well enough to change or extend it easily and safely.” Exhibit A showed Facebook needing a huge staff just to keep up momentum while maintaining a big ball of mud.
    • Releases will break things, because you don’t understand the relationships well enough to predict the impact of your changes. Exhibit C showed just that.

Next time management or clients try to convince you to move faster and throw quality under the bus, you can say sure, that will work, as long as you can hire 429 engineers to work on our iOS app.

23 Comments »

  1. Alex said,

    July 22, 2016 at 13:58

    Code quality is not only Facebook’s problem – sad but true

  2. Guilherme Garnier said,

    November 23, 2015 at 15:22

    It seems like Stack Overflow is following the same path: http://blog.guilhermegarnier.com/2015/11/software-quality-isnt-optional/

  3. John said,

    November 5, 2015 at 21:55

    Barry said, November 3, 2015 at 02:43

    “And yet facebook has more active users at any given moment than any other site in the world. How many startups run by functional programming snobs succeed vs the move fast and break things cohort?”

    Until it breaks big time and it becomes the most legendary software catastrophe ever.

    Moreover, I don’t know about you guys, but when I get my arse out of my job and talk to normal people, we developers definitely are hated because we move fast and break things. Maybe it’s time to move a little slower and break a few less things? Just a humble proposal, but well, you can feel some users are really pissed….

  4. StuffIsReal said,

    November 4, 2015 at 17:23

    128 gigs of mutable data. Someone got fired :D

  5. Matt Perry said,

    November 4, 2015 at 13:00

    From the slides:

    “You may now have the impression that Facebook is staffed by superhumans” – I am absolutely not thinking that, reading this. Please hire an architect.

    “People who aren’t afraid to rewrite iOS from the ground up” – just afraid of rewriting your own app from the ground up?

    This kind of attitude is so anti-user. I’m always conscious not to waste people’s bytes.

  6. Peter Marreck said,

    November 4, 2015 at 06:53

    Concerned Citizen:

    Of course I know what a database is. I also know that Facebook’s database can’t possibly be as small as 120GB, which implies that that’s all important runtime data that would be even costlier to not preserve somehow (perhaps an expensive reconstruction from DB data). Which… smells.

  7. Ducky said,

    November 3, 2015 at 19:02

    @SomeAnonymous You mis-understood. Using RAMDISK, they avoided conversions. RAMDISK = file system.

  8. Matthew Campbell said,

    November 3, 2015 at 18:20

    Perhaps they ship software more than refactoring the same code continually, breaking production often ;) Code that is used will have more bugs

  9. Concerned Citizen said,

    November 3, 2015 at 17:52

    Peter Marreck, you understand what a database is, right? Mutable data is kinda their thing.

  10. Andrew said,

    November 3, 2015 at 15:26

    You’ve left out this debacle where they had to do an ugly hack to the dalvik runtime on Android because they have way too many methods in the code

    https://www.facebook.com/notes/facebook-engineering/under-the-hood-dalvik-patch-for-facebook-for-android/10151345597798920

  11. Tim McCormack said,

    November 3, 2015 at 13:22

    Part C is less convincing; change always introduces risk, so for me the question is whether the weekday rates are unacceptably higher than the weekend baselines. (I also don’t know what their SLA is, and therefore how bad those violations are…)

    Put another way, I’m more concerned about there being SLA violations on the weekends at all; they should mostly be happening during the work week!

    “The article moves on, without wondering whether releases regularly breaking your app are a normal part of the software engineering process.”

    For better or worse (my opinion: worse) we are in an industry where the customers do not hold service providers to a high quality standard. You look at those graphs and see “they have a code quality problem” (or devops quality problem, really), but I look at that graph and see “management has decided this is an acceptable level of fail”. I think it’s a valid critique of the industry to ask that question, but the paper’s authors probably take it as a given.

  12. Beb said,

    November 3, 2015 at 10:45

    Are you sure the problems came before the extra 428-7 developers? I might wager a guess it was largely acceptable until after the human bloat.

  13. anony said,

    November 3, 2015 at 09:18

    I did some work for Amex about 25 years ago, and the pattern that everything works from Thanksgiving til January was because Amex intentionally had code freeze in late October. No fucking with the cash cow until she has been milked.

    Facebook might be doing something similar.

  14. Menwn said,

    November 3, 2015 at 09:04

    About the third case.

    They seem to attribute it to developers making changes, but I was wondering if at least a part of it is due to a lack of incident reporting during weekends and holidays: fewer people on the job reporting incidents (less staff, junior staff on call not recognizing incidents as such, less motivation to report incidents at times like these).

  15. EJ said,

    November 3, 2015 at 08:45

    You’re assuming the data structure used to write to a spinning disk is the same as the one optimal for writing to shared memory / ramdisk. If they aren’t, then having two different serializations makes sense, and ramdisk vs. shared memory doesn’t really matter for the second one.

  16. Ej said,

    November 3, 2015 at 08:34

    If they used a ramdisk, they would have had to use file system commands to get at their data. With shared memory, they can directly access portions of memory using native application data structures.

  17. Another Voice said,

    November 3, 2015 at 08:00

    Your conclusions are completely wrong..

    You are not an engineer working at “facebook-scale.” You can conclude whatever you want to because you never have to face the problems Facebook has to solve every day.

    There are no general-wisdom, common design solutions for problems at that kind of scale. You have to adapt as you keep moving.

    Without a definition of “Quality” the whole discussion is irrelevant, anyway..

  18. Kannan Goundan said,

    November 3, 2015 at 07:33

    Re: the ramdisk thing. I’m trying to figure out if there may have been good reasons for using different formats for disk and for shared memory.

    For an on-disk format, they need incremental updates on a block device to be efficient. For the shared memory format, they need fast, roughly-in-place conversion to/from the heap format.

    An append-only journal-style format, for example, is great for incremental updates, but not so great for quick restores. The shared memory format they chose is basically an inlined/packed version of the heap format, which wouldn’t be good for incremental updates on disk.

    Is there a format that meets both requirements?

    (All that said, the paper should have at least discussed this stuff. Makes me wonder if you’re right and that they didn’t even think of it.)

  19. Peter Marreck said,

    November 3, 2015 at 06:36

    If your application has 120GB of mutable inflight data… You’ve already lost the fscking battle.

  20. barry said,

    November 3, 2015 at 02:43

    And yet facebook has more active users at any given moment than any other site in the world. How many startups run by functional programming snobs succeed vs the move fast and break things cohort? Perhaps perfection is an unrealistic and distracting target as the future is uncertain, hard to predict and is therefore the enemy of good.

  21. AnotherAnonymous said,

    November 3, 2015 at 02:42

    Facebook’s engineers are great, but their product is crap. Seems like management thinks they can get better by throwing engineers at the problem… they need to rethink the whole network, the interaction between users, the news feed design… they need to stop, think, then implement something not the move fast crap.

  22. graham said,

    November 2, 2015 at 23:43

    @SomeAnonymous What that quote means is that it’s the format conversion that takes the time; it doesn’t matter whether you write to memory or disk. In the conclusion they say they are going to re-write the disk storage routines to use the format conversion from the shared memory storage. Because shared memory is in fact a ramdisk (on Linux), they could have written the format conversion once and used a ramdisk directly. Much simpler.

  23. SomeAnonymous said,

    November 2, 2015 at 22:42

    “Reading about 120 GB of data from disk takes 20-25 minutes; reading that data in its disk format and translating it to its in-memory format takes 2.5-3 hours”

    So by writing to ram disk you only save 20-25 minutes and still have to pay for 2.5-3 hours of conversion time. By using shared memory they save it. Smart
