August 11, 2011

Unicode in Python 2: Decode in, encode out

Posted in Software at 00:24 by graham

In Python 2 you need to convert between encoded strings and unicode. It’s easy if you follow these three simple rules:

Decode all input strings

name = input_name.decode('utf8', 'ignore')

You need to decode all input text: filenames, file contents, console input, database contents, socket data, etc. If you are using Django, it already does this for you, as much as it can.

The trick is figuring out which encoding you have. Python’s documentation lists every encoding it supports in the “Standard Encodings” table of the codecs module. The main ones I try are utf_8, latin_1 and cp1250.
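Here is a minimal sketch of trying those encodings in turn. The helper name decode_input is made up for illustration, not something from the standard library. Note the order: latin_1 maps every possible byte to a character, so it never fails and has to go last, otherwise cp1250 would never get a turn.

```python
def decode_input(raw, encodings=('utf_8', 'cp1250', 'latin_1')):
    """Return unicode text for a byte string, trying each encoding in turn.

    Hypothetical helper: the name and the default order are assumptions.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding, try the next one
            continue
    raise ValueError('none of the encodings could decode the input')


name = decode_input(b'caf\xc3\xa9')   # valid UTF-8, so the first attempt wins
```

A decode that succeeds only proves the bytes are valid in that encoding, not that the guess was right, so treat this as a fallback for when you really don't know what you were given.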

Inside, work only with unicode

That means prefixing string literals with ‘u’:

my_thing = u'Something'

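Be careful when concatenating: always use a u. If you mix a byte string with unicode, Python 2 decodes the bytes implicitly as ASCII, which fails as soon as it meets a non-ASCII byte.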

my_other = part_one + u'-' + part_two
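To see why mixing the two types is risky, here is a small sketch with made-up values. It is written so it runs on both Python 2 and Python 3: Python 2 raises UnicodeDecodeError from its implicit ASCII decode, while Python 3 refuses to mix the types at all and raises TypeError.

```python
part_one = b'caf\xc3\xa9'   # still a UTF-8 byte string: nobody decoded it
part_two = u'au lait'

try:
    my_other = part_one + u'-' + part_two
    mixed_ok = True
except (UnicodeDecodeError, TypeError):
    # Python 2: implicit ASCII decode chokes on the \xc3 byte
    # Python 3: bytes and str cannot be concatenated
    mixed_ok = False

# Decode first, then every piece is unicode and concatenation is safe
my_other = part_one.decode('utf8') + u'-' + part_two
```

The nasty part in Python 2 is that the mixed version works fine while the bytes happen to be plain ASCII, so the bug only shows up when real-world data arrives.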

Encode all output strings

output_name = name.encode('utf8')
print(output_name)

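Strings (bytes) have an encoding. The ‘ignore’ argument to decode, used in the first rule above, tells it to throw away any bytes that aren’t valid UTF-8. That loses data silently, so use it only when dropping stray bytes is acceptable.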
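As a rough illustration of what the error handler does, this sketch decodes a Latin-1 byte string as if it were UTF-8. The sample value is made up; 'strict' (the default) and 'replace' are the other standard handlers that decode accepts.

```python
raw = b'caf\xe9'   # "café" encoded as Latin-1, so not valid UTF-8

# 'ignore' silently drops the offending byte
ignored = raw.decode('utf8', 'ignore')     # u'caf'

# 'replace' substitutes U+FFFD, so the damage stays visible
replaced = raw.decode('utf8', 'replace')   # u'caf\ufffd'

# The default, 'strict', raises instead of guessing
try:
    raw.decode('utf8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

If you would rather find out about a wrong encoding than lose characters, leave the second argument off and handle the exception.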

Unicode does not have an encoding. Hence you need to decode the byte string, specifying the encoding, to get unicode. When you want to output, you need to encode your unicode back to a byte string, again specifying the encoding.
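Putting the three rules together, a round trip might look like the sketch below. The values are invented for the example. It shows that the same unicode text turns into different bytes depending on the encoding you pick on the way out.

```python
raw = b'caf\xc3\xa9'                  # input: 5 bytes, "café" as UTF-8

text = raw.decode('utf8')             # decode in: 4 characters, no encoding
shout = text.upper() + u'!'           # inside, work only with unicode

as_utf8 = shout.encode('utf8')        # encode out: b'CAF\xc3\x89!'
as_latin1 = shout.encode('latin_1')   # same text, other bytes: b'CAF\xc9!'

byte_count = len(raw)                 # 5
char_count = len(text)                # 4
```

The lengths differ because len counts bytes on a byte string and characters on unicode, which is one more reason to do all the real work on the decoded value.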

In Python 3, all strings are Unicode (raw bytes get their own separate type), and happiness fills the land.
