2

Is there a simple regular expression to match all Unicode quotes? Or does one have to hand-code it like this:

quotes = ur"[\"'\u2018\u2019\u201c\u201d]"

Thank you for reading.

Brian

tchrist
Brian M. Hunt

2 Answers

5

Python's `re` module doesn't support Unicode property escapes, so you can't match on the Pi and Pf properties (`\p{Pi}`, `\p{Pf}`). I guess your solution is as good as it gets.

You might also want to consider the "false quotation marks" that are sadly in use: the grave and acute accents (` and ´), which are `\u0060` and `\u00B4`.

Then there are guillemets (« » ‹ ›); do you want those, too? Use `\u00AB\u00BB\u2039\u203A` for those.

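Also, your command has a little bug: because you're using a raw string, the backslash in `\"` is kept and ends up inside the character class, so the pattern also matches a literal backslash. Use a triple-quoted raw string instead, so the double quote needs no escaping: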

>>> quotes = ur"[\"'\u2018\u2019\u201c\u201d\u0060\u00b4]"
>>> "\\" in quotes
True
>>> quotes
u'[\\"\'\u2018\u2019\u201c\u201d`\xb4]'
>>> quotes = ur"""["'\u2018\u2019\u201c\u201d\u0060\u00b4]"""
>>> "\\" in quotes
False
>>> quotes
u'["\'\u2018\u2019\u201c\u201d`\xb4]'
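
For anyone on Python 3, where the `ur` prefix no longer exists, here is a rough sketch of the same character class with the accents and guillemets added. The `normalize_quotes` helper is a made-up name for illustration, not something from the question.

```python
import re

# Python 3 str literals are already Unicode; the \uXXXX escapes are
# resolved by Python itself, and the triple quotes let both ASCII
# quote characters appear without a backslash.
QUOTES = re.compile(
    """["'\u2018\u2019\u201c\u201d\u0060\u00b4\u00ab\u00bb\u2039\u203a]"""
)


def normalize_quotes(text):
    """Hypothetical helper: replace every matched quote with a plain '"'."""
    return QUOTES.sub('"', text)


print(normalize_quotes("\u201cHi\u201d and \u00abok\u00bb"))
```

Since the literal is not raw, no stray backslash ends up in the pattern.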
Tim Pietzcker
  • 1
    Not yet; there is a rewrite of the `re` module underway, but I have no idea when/if it will be merged into the main development branch. I doubt it will be there before Python 3.3. – Tim Pietzcker Jun 28 '10 at 05:23
  • 1
    Even if they aren't yet available in the `re` module, you can still import the `unicodedata` module and do `quotes = ''.join(c for c in (chr(i) for i in range(0x110000)) if unicodedata.category(c) in ('Pf', 'Pi'))`. – dan04 Jul 19 '16 at 20:49
5

Quotation marks will often have the Unicode category Pi (punctuation, initial quote) or Pf (punctuation, final quote). You'll have to handle the "neutral" quotation marks ' and " manually.
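
If the regex engine can't match on those categories directly, one possible workaround is to build the character class yourself with the standard `unicodedata` module. This is only a sketch for Python 3; the names `curly` and `QUOTE_RE` are mine. Note that the result depends on the Unicode data shipped with your Python, that Pi/Pf also contain a few marks that aren't really quotes, and that low quotes such as „ (U+201E) are in category Ps and so are not picked up.

```python
import re
import sys
import unicodedata

# Collect every code point whose general category is Pi (initial quote)
# or Pf (final quote).
curly = "".join(
    ch
    for ch in map(chr, range(sys.maxunicode + 1))
    if unicodedata.category(ch) in ("Pi", "Pf")
)

# Add the neutral ASCII quotes by hand and wrap everything in a
# character class; re.escape keeps the class safe whatever it contains.
QUOTE_RE = re.compile("[" + re.escape("\"'" + curly) + "]")

print(QUOTE_RE.findall("\u201cHi\u201d, she said, 'bye' \u00abok\u00bb"))
```

Scanning all code points takes a moment, so build the pattern once and reuse it.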

dan04
  • 2
    +1: Man, I overlooked this. I have corrected my answer. Sadly, Python doesn't support Unicode properties (yet). He didn't specify Python, but I'm guessing this from his code sample and his previous question. – Tim Pietzcker Jun 27 '10 at 21:31