March 13, 2012

In-memory key-value store in C, Go and Python

Posted in Software at 06:00 by graham

On paternity leave for my second child, I found myself writing an in-memory hashmap (a poor-man’s memcached), in Go, Python and C. I was wondering how hard it would be to replace memcached if we wanted to do something unusual with our key-value store. I also wanted to compare my knowledge of the languages, and, well, I get bored easily!

The code is on github as Key-Value-Polyglot.

Each version implements enough of the get and set commands from the memcached protocol that we can test them with a memcached client.

The wonderful Internet

Thanks to a great discussion on Hacker News, in the comments here, and on reddit, I have heavily updated this post. You’re reading the new improved version.

Having the Internet review your code is humbling. It is also an amazing way to learn. The code, and my knowledge, have improved tremendously thanks to the feedback.

So far wonderful folks have contributed versions in Java, Scala, Haskell, Node.js, Ruby (event machine), Python gevent and Python diesel. If you write a version in a different language (see Spec at bottom), please send it to me, either directly or as a pull-request on github. I’ll add it to the repo.

Networking

Respond to each request with a single write. Or switch TCP_NODELAY on.

I initially thought the Go version was many times faster than the C and Python versions, but that was due to quirks in my implementation.

The C and Python versions were doing multiple small writes to respond to each “get” request. On the server side, Nagle’s algorithm was buffering those small writes, waiting for an ACK from the client for its previous send. On the client side, delayed ACKs meant the client wasn’t sending the needed acknowledgment until a timeout. A textbook case, assuming you know about that textbook.

The solution was to combine those multiple writes into a single one. An alternative solution would be to switch on TCP_NODELAY.
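
To make that concrete, here is a minimal Python sketch of both fixes. The function names are mine, for illustration, not code from the repo:

    import socket

    def respond_single_write(conn, key, value):
        # Fix 1: build the whole "get" response and send it with one call,
        # so there are no small writes for Nagle's algorithm to hold back.
        response = (b"VALUE " + key + b" 0 " + str(len(value)).encode() +
                    b"\r\n" + value + b"\r\nEND\r\n")
        conn.sendall(response)

    def disable_nagle(conn):
        # Fix 2: tell the kernel to send small writes immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)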

The Go version didn’t require me to know about TCP_NODELAY, which I appreciated.

Internally, Go sets the socket to be non-blocking and uses epoll and futex to wait for data. Memcached does the same thing, via libevent. I only used a single client, and the implementations are not thread-safe, so these details don’t matter right now. They would matter with a large number of clients.

Writing it: Layers over Unix

Sockets: Our languages are wrapping POSIX socket system calls. C maps the calls exactly, and Python maps the C calls very closely. Go does a bit more work for us, replacing socket-bind-listen-accept with just listen-accept.
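
Spelled out in Python, the classic sequence looks like this (a sketch; the port matches the spec below):

    import socket

    # socket, bind, listen, accept: Go's net.Listen("tcp", ":11211")
    # collapses the first three into a single call.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 11211))
    listener.listen(5)
    conn, addr = listener.accept()  # blocks until a client connects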

Multiple concurrent requests: Go has the magic go statement, Python has the threading library, and C has pthread_create. None of my implementations are thread safe (thanks timtadh on HN).
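
For example, the thread-per-connection pattern in Python is only a few lines; handle_connection here is a stand-in for the per-connection loop, and, like the originals, nothing protects the shared dict:

    import threading

    def serve(listener):
        while True:
            conn, _addr = listener.accept()
            # One thread per connection, the moral equivalent of
            # `go handleConn(conn)` in Go or pthread_create in C.
            t = threading.Thread(target=handle_connection, args=(conn,))
            t.daemon = True
            t.start()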

Read buffering: Managing the data you get from the socket was easy, once commenters taught me about C’s fdopen. Sockets are just streams of data; it’s up to the protocol to say when a message starts and ends. For the memcached protocol, we need to be able to read until \r\n, and also to read a specific number of bytes (which might include \r\n).

Go has the bufio package. C lets you treat a socket as a file, thanks to fdopen. Python does the same thing with socket.makefile, which presumably wraps fdopen.
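
A sketch of the two read operations in Python, via makefile (conn is the accepted socket from the earlier sketch; binary mode keeps \r\n intact on Python 3):

    reader = conn.makefile("rb")

    line = reader.readline()     # read up to and including \r\n,
                                 # e.g. "get greeting\r\n"
    value = reader.read(length)  # read exactly `length` bytes (from the
                                 # set header), which may contain \r\n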

My C is rusty, so that version took me the longest to write, and is the least legible. The biggest slowdown came from the buffering I originally implemented around the socket’s recv call. Commenters (thanks kbob) pointed out I didn’t need to do that, and I replaced it.

The Python version took me a while to get right, and that’s mostly my fault. I knew Python wouldn’t make me write my own buffering code, and I spent too long playing with timeout and non-blocking setups. Once I realized I could treat the socket as a file (‘makefile’) things got easier.

Python 3’s universal newline support kindly normalizes \r\n to \n, but we disable it here to preserve Python 2 compatibility.

The Go version was the easiest to write because I had already used the bufio package, which solves our stream buffering problem.

Read on for how to build and profile the different versions yourself.


Build

  • Go version: You’ll need a recent weekly snapshot of Go (hg update weekly), or Go v1 if it is released by the time you read this. Run it with: go run memg.go, or compile it with go build memg.go and run the executable it makes.

  • Python version: Runs under Python 2.7 or Python 3; just run the script.

  • C version: Build with gcc -std=gnu99 -g -W -Wall -pedantic -pthread -o memg memg.c. There should be no warnings.

Timing

This is explicitly not a benchmark, unless the question is “what can an average programmer do in a few hours?”.

All the versions perform about the same for me, and any differences are probably more due to my implementation than anything inherent in the languages. As antirez said in the Hacker News comments:

In order to perform such a comparison you can’t write three small throw-away programs, you need to optimize each version at your best for weeks to start to be meaningful, and you need an expert in the three systems.

To time the different versions on your machine run: time python test.py. That script requires python 2 and pylibmc (pip install pylibmc or sudo apt-get install python-pylibmc).

Try test.py against memcached too.
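
If you want to poke at a server from Python directly, pylibmc makes it a couple of lines. This is roughly what test.py exercises, not a copy of it:

    import pylibmc

    mc = pylibmc.Client(["127.0.0.1:11211"])
    mc.set("greeting", "Hello World")
    print(mc.get("greeting"))  # Hello World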

Profiling

To make profiling easier, each version accepts a --single command line parameter, so that it exits after a single connection.

strace was very useful for inspecting all the versions (strace -T and strace -c). Run the following for a summary of the system calls issued, and run test.py in a different window.

strace -c <the memg version you want> --single

Python

Start it in single mode, under the profiler:

python -m cProfile -o stats memg.py --single

Run test.py in a different window. View the output with pstats:

$ python
>>> import pstats
>>> p = pstats.Stats("stats")
>>> p.strip_dirs().sort_stats("cumulative").print_stats(20)

Go

Build an executable: go build memg.go. Run memg under prof, with --single so that it exits after the test connection closes:

go tool prof memg --single

See: prof command

You can get more detail using the runtime/pprof package and go tool pprof to examine the created file.

C

C has Valgrind, and its callgrind tool is very nice:

valgrind --tool=callgrind ./memg --single

Run test.py in a different window. View the output with kcachegrind:

kcachegrind callgrind.out.<num>

Spec

If you’re interested in trying it in a different language, or a different approach, here’s the short spec.

The key-value stores are in-memory hash maps. They listen on port 11211, and accept set and get commands, so that you can reproduce the following telnet interaction:

telnet 127.0.0.1 11211

set greeting 0 0 11     # User
Hello World             # User, 11 bytes
STORED                  # Server response
get greeting            # User
VALUE greeting 0 11     # Server response
Hello World             # Server response
END                     # Server response

You type the ‘set’, ‘Hello World’ and ‘get’ lines. The rest are replies. In the ‘set’ line, the ‘11’ is the length of the value (‘Hello World’) in bytes. The two zeros are ignored. All lines terminate with \r\n.
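
To make the spec concrete, here is a minimal sketch of a conforming per-connection handler in Python. It is illustrative only, and skips the error handling and command variants a real server needs:

    store = {}

    def handle(conn):
        reader = conn.makefile("rb")
        while True:
            line = reader.readline()
            if not line:
                break                       # client closed the connection
            parts = line.split()
            if parts[0] == b"set":
                key, length = parts[1], int(parts[4])
                value = reader.read(length)
                reader.read(2)              # swallow the trailing \r\n
                store[key] = value
                conn.sendall(b"STORED\r\n")
            elif parts[0] == b"get":
                key, reply = parts[1], b""
                if key in store:
                    value = store[key]
                    reply = (b"VALUE " + key + b" 0 " +
                             str(len(value)).encode() + b"\r\n" +
                             value + b"\r\n")
                # One write per response, per the Networking note above.
                conn.sendall(reply + b"END\r\n")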

The repo has a test.py script which uses pylibmc. To get pylibmc:

  • sudo pip install pylibmc
  • OR sudo apt-get install python-pylibmc

If you make a version in a different language, please send it to me on github, or link it in the comments.

17 Comments »

  1. Taral said,

    March 22, 2012 at 04:43

    FYI – you can use “go run memg.go” instead of build & run separately.

  2. graham said,

    March 22, 2012 at 02:03

    @mallocator No, I don’t think it’s malloc. I straced it, and the C version is spending 95% of its time in the recv call. For what it’s worth, I originally had a ring buffer, to minimise allocation, but I removed it because it didn’t help performance and complicated the code. When I get a moment I’m going to try removing my buffering entirely, and using fdopen like kbob suggests here.

  3. mallocator said,

    March 21, 2012 at 15:50

    Don’t you think this might also have something to do with all the malloc calls in the C version? Doesn’t Go do some pooling and other management?

  4. kbob said,

    March 21, 2012 at 14:35

    Two comments on the C version.

    (a) You could have used stdio for your I/O buffering needs.

     FILE *f = fdopen(conn, "r+");

    Then use fgets(), fread(), fprintf(), fwrite(), fflush() and fclose().
    

    (b) You allocate memory with malloc() but never free() any of it. C does not have garbage collection…

    Nonetheless, an interesting experiment.

  5. Clément Stenac said,

    March 21, 2012 at 13:39

    Hi,

    I think I disagree with the diagnosis that this is due to the use of libevent. Instead, the huge difference is due to the fact that each answer is written in the C version as several network messages, and to the particular implementation of the pylibmc client.

    If we look at the client side, we see:

    1332333112.003584 write(3, "get 485 \r\n", 10) = 10 <0.000130>
    1332333112.003837 read(3, "VALUE 485 0 3\r\n", 8196) = 15 <0.000070>
    1332333112.004036 read(3, 0x1243630, 8196) = -1 EAGAIN (Resource temporarily unavailable) <0.000017>
    1332333112.004162 poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}]) <0.038909>
    1332333112.043162 read(3, "485\r\nEND\r\n", 8196) = 10 <0.000089>
    1332333112.043389 write(3, "get 486 \r\n", 10) = 10 <0.000088>
    

    As we can see, the server starts responding very shortly after the write (about 50 microseconds later), and the client gets the beginning of the answer, but not the rest, so the client falls back to polling, and can do that up to 4 times (as we send 4 lines) –> in total, the client needs 8ms to get the data, simply because it reverts to polling (and probably goes through lots of Python code each time).

    If we make a simple change to the C server to perform only one send call, then the client strace now looks like:

    1332333501.632992 write(3, "get 482 \r\n", 10) = 10 <0.000013>
    1332333501.633024 read(3, "VALUE 482 0 3\r\n482\r\nEND\r\n", 8196) = 25 <0.000007>
    1332333501.633055 write(3, "get 483 \r\n", 10) = 10 <0.000013>
    

    So, we now have a cycle time of about 60µs –> the C server is now about 100 times faster.

    That being said, I fully agree with the global point, which is that Go provides something which works simply out of the box, without having to concentrate on such details :)

  6. rog peppe said,

    March 21, 2012 at 13:28

    C has stdio – you don’t have to implement your own buffering.

  7. Bernt said,

    March 21, 2012 at 12:44

    I think you should modify the python code to use sockfile.write/flush instead of socket.sendall .

    This speeds it up more than x100.

  8. Jason T said,

    March 21, 2012 at 11:28

    C++ implementation using my gevent/Go-like library: https://github.com/toffaletti/libten/blob/master/examples/memg.cc

    FYI, pprof can be used to profile C also: http://goog-perftools.sourceforge.net/doc/cpu_profiler.html

  9. PB said,

    March 21, 2012 at 11:16

    You can use fdopen and let stdio do the buffering in C.

  10. Farid said,

    March 21, 2012 at 10:55

    I think to be fair, you should have used gevent or twisted for python, or libevent/libev/libuv for C.

    Maybe you should try to add NodeJs to the comparison too.

  11. Kura said,

    March 21, 2012 at 10:27

    Interesting stuff, I may write a Stackless Python version and see how it compares over the weekend.

  12. Will said,

    March 21, 2012 at 09:58

    Nice work :)

    You might like http://williamedwardscoder.tumblr.com/post/13363076806/buffcacher

  13. Questions said,

    March 21, 2012 at 09:01

    How about using python 3.x?

    What kind of cache eviction algo are you using?

  14. graham said,

    March 21, 2012 at 07:36

    @Christoph I’d like to see that too! Please fork it on github

  15. Christoph said,

    March 21, 2012 at 07:31

    I would be curious to see the result of a C++ implementation using boost asio. Could you show us the numbers?

  16. Samuel Williams said,

    March 21, 2012 at 06:51

    Nice summary.

    I think it is worth considering that C isn’t really designed to provide a high level API by default. In a sense, this makes a direct comparison a bit unfair IMHO. But, if you used libev{ent} or some other “high” level C communication API, you’d get buffering and polling functionality too in C comparable to APIs that already exist in Python and Go.

    In this case, the point you made is clear though – that the “equivalent” (for some fairly arbitrary definition of equivalent) Go program can be faster by default. This begs the question: I wonder if there are any cases where this doesn’t pay off?

  17. PM said,

    March 21, 2012 at 01:36

    If you want that “hidden epoll” goodness in Python, consider using gevent.
