26

I currently use re.findall to find and isolate words after the '#' character for hash tags in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)

It searches str1 and finds all the hashtags. This works, but it doesn't account for accented characters such as áéíóúñü¿.

If one of these letters is in str1, the match stops at the letter before it. For example, #yogenfrüz would be saved as #yogenfr.

I need to account for all the accented letters used in German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz.

How can I go about doing this?

Ibrahim Najjar
deadlock

3 Answers

31

Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

Regex101 Demo
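
To make the difference concrete, here is a minimal sketch comparing the ASCII-only pattern from the question with the `\w` pattern. The sample string is made up for illustration, and the accented letters are written as escapes so they are guaranteed to be single precomposed code points. It assumes Python 3, where str patterns are already Unicode-aware and the re.UNICODE flag is redundant but harmless.

```python
import re

# Hypothetical sample text: "#yogenfrüz" and "#café" with precomposed letters.
str1 = "Trying #yogenfr\u00fcz and #caf\u00e9 today"

# ASCII-only character class from the question: stops at the first accented letter.
ascii_tags = re.findall(r'#([A-Za-z0-9_]+)', str1)

# \w with re.UNICODE: matches letters, digits and underscore from any script.
unicode_tags = re.findall(r'#(\w+)', str1, re.UNICODE)

print(ascii_tags)    # ['yogenfr', 'caf']
print(unicode_tags)  # ['yogenfrüz', 'café']
```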

EDIT: Check the useful comment below from Martijn Pieters.

Ibrahim Najjar
  • 10
    Small caveat: `\w` won't match combined codepoints, so `a` and [U+0301 COMBINING ACUTE ACCENT](https://codepoints.net/U+0301) won't be matched, even though that *prints* as `á`. You may want to normalise to NFC, first. – Martijn Pieters Oct 02 '16 at 13:09
  • 1
    @MartijnPieters Thanks for sharing, always something extra to learn. – Ibrahim Najjar Oct 04 '16 at 14:29
  • @IbrahimNajjar can you implement the fix mentioned by Martijn Pieters to your solution? Thanks. – Robert Valencia Apr 17 '17 at 18:00
  • 2
    @RobertValencia Unless you really come across the situation he describes then my solution still works with accented characters. I am honestly not a Unicode expert and don't know the details exactly but if you want to normalize like he suggests then check the other answer to this question. Hope that helps – Ibrahim Najjar Apr 22 '17 at 15:26
  • I see. Thanks @IbrahimNajjar – Robert Valencia Apr 22 '17 at 16:28
  • Interestingly enough, I think I'm experiencing the exact problem @MartijnPieters described, except with é and e. I used the solution suggested by Berk below, and then decoded bytes object back to a string. Thanks all! – bddicken Sep 13 '17 at 18:31
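
The caveat raised in the comments can be sketched as follows. This is only an illustration, assuming Python 3 and a made-up input in which á is stored as `a` plus U+0301; normalizing to NFC before matching is one way to handle it, not necessarily what any commenter ended up using.

```python
import re
import unicodedata

# 'a' followed by U+0301 COMBINING ACUTE ACCENT prints as "á" but is two code points.
decomposed = "#ma\u0301s"

# Without normalization, \w stops before the combining mark.
before = re.findall(r'#(\w+)', decomposed)

# NFC composes the base letter and the combining mark into one code point (U+00E1).
composed = unicodedata.normalize('NFC', decomposed)
after = re.findall(r'#(\w+)', composed)

print(before)  # ['ma']
print(after)   # ['más']
```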
8

I know this question is a little old, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex.

hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

which will return ['yogenfrüz']
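
One thing worth knowing about this range: U+00C0 to U+00FF also contains the symbols × (U+00D7) and ÷ (U+00F7), and it does not reach letters outside Latin-1 such as œ (U+0153). The sketch below, assuming Python 3 and a made-up sample string, shows one possible way to split the range so the two symbols are skipped.

```python
import re

# Hypothetical sample text: "#yogenfrüz #3×4", written with escapes for clarity.
str1 = "#yogenfr\u00fcz #3\u00d74"

# The full range U+00C0..U+00FF accepts the multiplication sign as well.
wide = re.findall(r'#([A-Za-z0-9_\u00c0-\u00ff]+)', str1)

# Splitting the range around U+00D7 and U+00F7 leaves only the letters.
strict = re.findall(
    r'#([A-Za-z0-9_\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]+)', str1
)

print(wide)    # ['yogenfrüz', '3×4']
print(strict)  # ['yogenfrüz', '3']
```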

Hope this helps someone else.

zanga
4

You may also want to use

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

The linked question asked how to convert accented characters into their plain equivalents, for example a Unicode à into a standard a. Assuming you have loaded your Unicode text into a variable called my_unicode, normalizing à into a is as simple as the two lines above.

Explicit example:

>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
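
The session above has the Python 2 look (u'' literals and a plain 'aa' result). On Python 3 the same encode call returns a bytes object, so a decode step is needed to get a str back. A minimal sketch, where `strip_accents` is a hypothetical helper name and not part of unicodedata:

```python
import unicodedata

def strip_accents(text):
    """Return text with accents removed (hypothetical helper for illustration)."""
    # NFD splits each accented letter into a base letter plus combining mark(s).
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks, along with
    # any other non-ASCII character. Decode to turn the bytes back into str.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

result = strip_accents("\u00e0\u00e0 #yogenfr\u00fcz")
print(result)  # aa #yogenfruz
```

Note that this replaces the accented letters rather than matching them, so the extracted hashtag would be yogenfruz, not yogenfrüz.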

Check this answer, it helped me a lot: How to convert unicode accented characters to pure ascii without accents?

Berk