Friday 3:15 p.m.–3:45 p.m.

The Guts of Unicode in Python

Benjamin Peterson

Audience level: Experienced
Category: Python Internals

Description

This talk will examine how Python's internal Unicode representation has changed from its introduction through the latest major changes in Python 3.3. I'll present properties of the current Unicode implementation like algorithmic complexity and standard compliance. The talk will also compare Unicode in Python with some other languages. Finally, I'll look into the future of Python's Unicode.

Abstract

History

  • The unicode type is added in Python 2.0 (PEP 100). It is intended as an optional feature. unicode is implicitly coercible to and from str, although a codecs system is provided for explicit conversion. Only the BMP is supported.
  • UCS4 support is added in Python 2.2 (PEP 261). It is selected by a configure option and mostly works, but it creates subtle portability issues and makes binary C extensions painful.
  • Unicode becomes the default string type in Python 3. Everything is supposed to deal with Unicode correctly. Various challenges are faced: PEP 383, IO, codecs. Not everything falls into place immediately.
  • The Unicode implementation becomes much more complex as time goes on. (See the size of unicodeobject.c.)

The Great PEP 393 Overhaul

  • This lands in Python 3.3.
  • Allows each string to use the most compact memory representation: 1, 2, or 4 bytes per code point.
  • ASCII strings are specially marked.
  • The maximum code point invariant: a string is always stored in the narrowest representation that can hold its largest code point.
  • Increases compliance with the Unicode Standard.
  • Ends the UCS2/UCS4 mess.
  • Various memory/speed/complexity/correctness tradeoffs are involved.
  • Tricks are used to stay backwards compatible with C extensions written for Python 3.2 and earlier.
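
The flexible representation described above can be observed from Python itself. The following is a minimal sketch, assuming CPython 3.3 or later; the helper name bytes_per_code_point and the sample characters are illustrative choices, and sys.getsizeof reports an implementation detail that other interpreters may not share.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point by growing a run of `ch` by n items.

    Hypothetical helper for illustration; it relies on CPython allocating
    exactly as much character storage as the string needs.
    """
    short = ch * 10          # avoid cached single-character strings
    long = ch * (10 + n)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n

samples = {
    "ASCII U+0061": "a",
    "Latin-1 U+00E9": "\u00e9",
    "BMP U+20AC": "\u20ac",
    "astral U+1F600": "\U0001F600",
}

# Map each sample to the width of the representation CPython picked for it.
widths = {label: bytes_per_code_point(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(label, "->", width, "byte(s) per code point")
```

On such an interpreter the ASCII and Latin-1 samples should report 1 byte, the BMP sample 2 bytes, and the astral sample 4 bytes.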

Consequences of the Current Unicode Implementation

  • No more UCS2 vs UCS4.
  • Speed and complexity are less intuitive now.
  • Common cases are optimized. ASCII strings do well and are often special-cased.
  • Copying is often needed when manipulating strings with different internal representations.
  • Extra scanning is also occasionally needed to maintain invariants.
  • UTF-8 is wicked fast.
  • C extensions using the old Unicode API will have a performance penalty.
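
Two of the consequences listed above are easy to check interactively: the end of the narrow/wide build split, and the copying that happens when strings of different widths are combined. This is a rough sketch, assuming CPython 3.3 or later; the variable names are illustrative and the exact byte counts vary between versions.

```python
import sys

# Since Python 3.3 every build covers the full Unicode range.
print(hex(sys.maxunicode))       # 0x10ffff

ascii_text = "a" * 1000          # fits the 1-byte representation
astral = "\U0001F600"            # needs the 4-byte representation
mixed = ascii_text + astral      # the whole result is stored 4 bytes wide

# One extra code point, but every existing character had to be copied
# into wider storage, so the object grows by thousands of bytes.
growth = sys.getsizeof(mixed) - sys.getsizeof(ascii_text)
print(len(mixed) - len(ascii_text), "extra code point,", growth, "extra bytes")
```

The point of the sketch is the order of magnitude: appending a single wide character to an ASCII string costs a copy of the entire string, not just a few bytes.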

Comparison with other languages

  • The leaky abstraction of UTF-16 "codepoints" afflicts many languages today, such as JavaScript, Java, and .NET variants.
  • Other languages have taken different representation approaches, such as using UTF-8 internally. Applications in low-level languages seem to be preferring this approach, too.
  • Perl has some of the best Unicode support today, especially with respect to regular expressions. It uses UTF-8 internally.
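
The difference between these representations shows up as soon as a string contains a code point outside the BMP. As a small illustration (assuming Python 3.3 or later, with illustrative variable names), the same text has three different "lengths" depending on whether code points, UTF-16 code units, or UTF-8 bytes are counted:

```python
text = "a\U0001F600"  # one ASCII letter plus one code point outside the BMP

code_points = len(text)                           # what Python 3.3+ reports
utf16_units = len(text.encode("utf-16-le")) // 2  # what a UTF-16 based length reports
utf8_bytes = len(text.encode("utf-8"))            # storage in a UTF-8 representation

print(code_points, utf16_units, utf8_bytes)       # 2 3 5
```

A language whose string length counts UTF-16 code units would report 3 here, because the astral code point is stored as a surrogate pair.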

The Future

  • Fixing regular expressions and Unicode. Perhaps by bringing the regex module into the standard library?
  • More important Unicode algorithms need to be implemented: collation, bidi, and segmentation.
  • Also, bringing our current algorithms up to spec.
  • Continued optimization work.
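
To make the gaps listed above concrete, here is a short sketch using only the standard unicodedata module. It is an illustration of what plain str operations do and do not cover, not a proposal for how the missing algorithms would be implemented; the sample strings are arbitrary.

```python
import unicodedata

# Segmentation: len() counts code points, not user-perceived characters.
decomposed = "e\u0301"                        # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))         # 2 1

# Collation: sorted() compares code points, so U+00E9 sorts after 'z'.
words = ["zebra", "\u00e9clair"]
ordered = sorted(words)
print(ordered == ["zebra", "\u00e9clair"])    # True

# Bidi: the per-character class is available, the reordering algorithm is not.
bidi_class = unicodedata.bidirectional("\u05d0")  # HEBREW LETTER ALEF
print(bidi_class)                             # R
```

Normalization and character properties are already exposed, but language-aware ordering, grapheme boundaries, and bidirectional reordering are left to the application.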