Friday 10:50 a.m.–11:20 a.m.

Character encoding and Unicode in Python: How to (╯°□°)╯︵ ┻━┻ with dignity

Travis Fischer, Esther Nam

Audience level:
Intermediate
Category:
Best Practices & Patterns

Description

Every developer will inevitably feel the pain of character encoding issues. We will cover the fundamentals every Python developer should know about character encoding and Unicode. We will teach you how to identify the types of problems that occur when dealing with character encoding, and outline a set of best practices and useful libraries that can be used to avoid and fix character encoding issues.

Abstract

When it comes to passing text from one application to another, developers will inevitably encounter (or create!) issues that are caused by the improper handling of character encoding. This talk will briefly cover the basic Unicode “facts of life”, describe the kinds of Unicode issues that commonly occur when dealing with data from multiple sources, and introduce a set of best practices and tools that can take the pain out of fixing many common Unicode encoding issues. Our goal in giving this talk is to help developers recognize the importance of handling Unicode encoding correctly. Attendees should come away with a better understanding of how Unicode works in Python, a better appreciation for why it matters, and an awareness of various strategies for dealing with cases where encoding issues break software. Attendees will be able to differentiate between the kinds of Unicode problems their application design should deal with (or prevent) and those that, in the end, can only be solved by writing some Python or using a library. We will walk through some real-world scenarios that caused us to bang our heads against the wall in frustration until we began to fully understand how Unicode works. We will cover how to avoid these problems and introduce some tools (including the python-ftfy library) that have solved some of the unavoidable problems for us.
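
As a rough illustration of the kind of issue described above, here is a minimal Python 3 sketch that uses only the standard library. The sample string and variable names are our own assumptions and are not taken from the talk. It shows text encoded as UTF-8 by one application and then decoded with the wrong codec by another, which produces garbled text (often called mojibake).

```python
# Illustrative sketch only; the sample string and names are assumptions.
original = "café"

# Encode the text to bytes as UTF-8, as a sending application might.
utf8_bytes = original.encode("utf-8")

# A receiver that wrongly assumes Latin-1 turns the same bytes into mojibake.
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # cafÃ©

# Latin-1 maps every byte to a character, so this particular mistake
# can be undone by reversing the two steps.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True

# A codec that cannot represent the bytes raises an error instead of
# silently producing garbage.
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

Reversing the mistake by hand only works when you know which wrong codec was applied. When the source of the corruption is unknown, a repair library such as python-ftfy may be a better fit.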