String sanitization in Python

Posted on 16 Jul 2007 in development django python

Sometimes users want to bring text from an editor like Word into your web forms. You will often find nasty little characters hiding in the text, like ‘\u2022’ (a.k.a. the notorious bullet). These characters will normally throw errors if you try to convert them to ASCII:

>>> u'\u2022'.encode('ascii')
Traceback (most recent call last):
  File "<console>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode ... (yadda yadda)

To sanitize these strings and make them XML/HTML safe:

>>> u'\u2022'.encode('ascii', 'xmlcharrefreplace')
'&#8226;'

It translates the invalid characters into their XML equivalents. Woo! You can also use 'ignore' or 'replace' (replaces with ?):

>>> u'\u2022'.encode('ascii', 'ignore')
''
>>> u'\u2022'.encode('ascii', 'replace')
'?'

If you’re getting nasty Unicode errors from your templates in Django now that they’ve merged the Unicode branch, this might help as a quick fix.