May 31, 2012

Finally understanding Unicode and UTF-8

Posted in Software at 23:25 by graham

Unicode maps 32-bit (4 byte) integers, also called code points or runes, to characters. UTF-8 is a way of storing those code points using fewer than 4 bytes per character.

Hex 61 (decimal 97) is the Unicode code point for a, hex e5 (229) is å, and hex 16a1 (5793) is ᚡ. Unicode is how most modern programming languages represent strings: Java, .Net (C#, VB.Net), Go, and Python 3, for example. Code points are usually written as two hexadecimal bytes (four digits) prefixed by \u, or four bytes (eight digits) prefixed by \U. In Python 3 this will display ᚡ:

print('\u16a1')

The low byte of that 32-bit integer (the code point) covers most characters used by European languages. The first 128 code points (hex values 00 to 7f) are the same as ASCII: hex 61 is both the Unicode and ASCII code for a. The next 128 code points (0x80-0xff) are the same as ISO-8859-1, also called latin-1: e5 (229) is both the Unicode and ISO-8859-1 code for å.
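You can check that correspondence in Python 3, since ord gives the Unicode code point of a character:

```python
# The low code points line up with ASCII and ISO-8859-1.
assert ord('a') == 0x61   # same value as the ASCII code for 'a'
assert ord('å') == 0xe5   # same value as the ISO-8859-1 code for 'å'

# Decoding a latin-1 byte gives the code point with the same value.
assert bytes([0xe5]).decode('iso-8859-1') == '\u00e5'
```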

The first two bytes cover characters for almost all modern languages. It is extremely rare to need the full 4 bytes, as the upper bytes are mostly empty. A rare exception: sad kitty, code point \U0001F640, needs three bytes. It broke WordPress when I put it in this post – that’s how common characters above two bytes are!
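A quick check that sad kitty's code point no longer fits in two bytes:

```python
# 0x1f640 is bigger than 0xffff, so the code point needs a third byte.
kitty = '\U0001F640'
assert hex(ord(kitty)) == '0x1f640'   # bytes: 01 f6 40
assert ord(kitty) > 0xffff
```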

An encoding is a mapping from Unicode code points to bytes. If you use the code points directly as their mapping (4 bytes per code point) you have UTF-32. So 00 00 00 61 is UTF-32 (big-endian) for Unicode code point 61, which is a.
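Python's utf-32-be codec shows those four bytes directly:

```python
# UTF-32 big-endian stores each code point as four bytes, as-is.
assert 'a'.encode('utf-32-be') == b'\x00\x00\x00\x61'
assert 'ᚡ'.encode('utf-32-be') == b'\x00\x00\x16\xa1'
```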

English speakers will usually only need one byte, and other language users two, so there are more efficient encodings. The most common Unicode encoding is UTF-8.

The first 128 values of UTF-8 map directly to Unicode code points, and hence to ASCII codes. Byte 61 is UTF-8 for Unicode code point 61, which is character a. If you only ever use values up to 127 (hex 7f), UTF-8, Unicode code points, and ASCII are all the same. This makes confusion easy.

Above 127, UTF-8 uses between two and four bytes for each code point. c3 a5 is UTF-8 for Unicode code point u00e5, which is å. In Python 3:

bytes([0xc3, 0xa5]).decode("utf8")

This means UTF-8 is not compatible with ISO-8859-1.
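You can see the incompatibility by misreading UTF-8 bytes as latin-1 – the classic mojibake:

```python
# The UTF-8 bytes for å, misread as ISO-8859-1, become two other characters.
utf8_bytes = 'å'.encode('utf8')                 # b'\xc3\xa5'
assert utf8_bytes.decode('iso-8859-1') == 'Ã¥'  # one character became two
```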

When you receive a string of bytes, you also need to know its encoding to interpret it as Unicode. Luckily it is quite easy to test for valid UTF-8. In Go you use the Valid function of unicode/utf8. In Python you try to .decode("utf8") and catch the UnicodeDecodeError.
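A minimal sketch of the Python approach, wrapped in a helper (is_utf8 is my name, not part of the standard library):

```python
def is_utf8(data: bytes) -> bool:
    """Report whether data is a valid UTF-8 byte sequence."""
    try:
        data.decode('utf8')
        return True
    except UnicodeDecodeError:
        return False

assert is_utf8(bytes([0xc3, 0xa5]))   # the two-byte sequence for å
assert not is_utf8(bytes([0xa5]))     # a stray continuation byte
```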

In summary:

UTF-8     UTF-32       Unicode code point  ASCII  ISO-8859-1  Character
61        00 00 00 61  61                  61     61          a
c3 a5     00 00 00 e5  e5 (229)            None   e5          å
e1 9a a1  00 00 16 a1  16a1 (5793)         None   None        ᚡ

3 Comments »

  1. Lee said,

    January 8, 2014 at 01:05

    Minor correction, the code point for the first entry (a) is actually 97 while the hex version of that is 61.

  2. Ravin said,

    October 4, 2013 at 11:27

    Good stuff. I was reading about Unicode just now, because my program using Unicode values was acting weird. Your post made my understanding clearer. Thanks!

  3. binith said,

    January 18, 2013 at 10:39

    great post , this has been an ever green question in my mind, now its clearer. Im surprised there are no other comments here, happy to be the first one

Leave a Comment

Note: Your comment will only appear on the site once I approve it manually. This can take a day or two. Thanks for taking the time to comment.