May 31, 2012

Finally understanding Unicode and UTF-8

Posted in Software at 23:25 by graham

Unicode maps 32-bit (4 byte) integers, also called code points or runes, to characters. UTF-8 is a way of storing those code points using fewer than 4 bytes per character.

Hex 61 (decimal 97) is the Unicode code point for a, hex e5 (229) is å, and hex 16a1 (5793) is ᚡ. Unicode is how most modern programming languages represent strings: Java, .Net (C#, VB.Net), Go, and Python 3, for example. Code points are usually written as two hexadecimal bytes (four digits) prefixed by \u, or four bytes (eight digits) prefixed by \U. In Python 3 this will display ᚡ:

print('\u16a1')

The low byte of that 32-bit integer (the code point) covers most characters used by European languages. The first 128 code points (hex values 00 to 7f) are the same as ASCII: hex 61 is both the Unicode and ASCII code for a. The next 128 code points (0x80-0xff) are the same as ISO-8859-1, also called latin-1: e5 (229) is both the Unicode and ISO-8859-1 code for å.
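You can check that correspondence in Python 3, since ord gives the Unicode code point of a character:

```python
# The low code points line up with ASCII and ISO-8859-1.
assert ord('a') == 0x61   # same value as the ASCII code for 'a'
assert ord('å') == 0xe5   # same value as the ISO-8859-1 code for 'å'

# Decoding a latin-1 byte gives the code point with the same value.
assert bytes([0xe5]).decode('iso-8859-1') == '\u00e5'
```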

The first two bytes cover characters for almost all modern languages. It is extremely rare to need the full 4 bytes, as the upper bytes are mostly empty. A rare exception: sad kitty, code point \U0001F640, needs three bytes. It broke WordPress when I put it in this post – that’s how common characters above two bytes are!
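A quick check that sad kitty's code point no longer fits in two bytes:

```python
# 0x1f640 is bigger than 0xffff, so the code point needs a third byte.
kitty = '\U0001F640'
assert hex(ord(kitty)) == '0x1f640'   # bytes: 01 f6 40
assert ord(kitty) > 0xffff
```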

An encoding is a mapping from Unicode code points to bytes. If you use the code points directly as their mapping (4 bytes per code point) you have UTF-32. So 00 00 00 61 is UTF-32 (big-endian) for Unicode code point 61, which is a.
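Python's utf-32-be codec shows those four bytes directly:

```python
# UTF-32 big-endian stores each code point as four bytes, as-is.
assert 'a'.encode('utf-32-be') == b'\x00\x00\x00\x61'
assert 'ᚡ'.encode('utf-32-be') == b'\x00\x00\x16\xa1'
```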

English speakers will usually only need one byte, and other language users two, so there are more efficient encodings. The most common Unicode encoding is UTF-8.

The first 128 values of UTF-8 map directly to Unicode code points, and hence to ASCII codes. Byte 61 is UTF-8 for Unicode code point 61, which is character a. If you only ever use values up to 127 (hex 7f), UTF-8, Unicode code points, and ASCII are all the same. This makes confusion easy.

Above 127, UTF-8 uses between two and four bytes for each code point. c3 a5 is UTF-8 for Unicode code point u00e5, which is å. In Python 3:

bytes([0xc3, 0xa5]).decode("utf8")

This means UTF-8 is not compatible with ISO-8859-1.
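You can see the incompatibility by misreading UTF-8 bytes as latin-1 – the classic mojibake:

```python
# The UTF-8 bytes for å, misread as ISO-8859-1, become two other characters.
utf8_bytes = 'å'.encode('utf8')                 # b'\xc3\xa5'
assert utf8_bytes.decode('iso-8859-1') == 'Ã¥'  # one character became two
```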

When you receive a string of bytes, you also need to know its encoding to interpret it as Unicode. Luckily it is quite easy to test for valid UTF-8. In Go you use the Valid function of unicode/utf8. In Python you try to .decode("utf8") and catch the UnicodeDecodeError.
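A minimal sketch of the Python approach, wrapped in a helper (is_utf8 is my name, not part of the standard library):

```python
def is_utf8(data: bytes) -> bool:
    """Report whether data is a valid UTF-8 byte sequence."""
    try:
        data.decode('utf8')
        return True
    except UnicodeDecodeError:
        return False

assert is_utf8(bytes([0xc3, 0xa5]))   # the two-byte sequence for å
assert not is_utf8(bytes([0xa5]))     # a stray continuation byte
```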

In summary:

UTF-8     UTF-32       Unicode code point  ASCII  ISO-8859-1  Character
61        00 00 00 61  61                  61     61          a
c3 a5     00 00 00 e5  e5 (229)            None   e5          å
e1 9a a1  00 00 16 a1  16a1 (5793)         None   None        ᚡ

3 Comments »

  1. Lee said,

    January 8, 2014 at 01:05

    Minor correction, the code point for the first entry (a) is actually 97 while the hex version of that is 61.

  2. Ravin said,

    October 4, 2013 at 11:27

    Good stuff. I was reading about Unicode just now, because my program using Unicode values was acting weird. Your post made my understanding clearer. Thanks!

  3. binith said,

    January 18, 2013 at 10:39

    great post , this has been an ever green question in my mind, now its clearer. Im surprised there are no other comments here, happy to be the first one

Leave a Comment

Note: Your comment will only appear on the site once I approve it manually. This can take a day or two. Thanks for taking the time to comment.