How to account for accent characters for regex in Python?

Question

I currently use re.findall to find and isolate words after the '#' character for hash tags in a string:

hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)

It searches str1 and finds all the hashtags. This works, but it doesn't account for accented characters such as áéíóúñü¿.

If one of these characters is in str1, the hashtag is saved only up to the letter before it. For example, #yogenfrüz would be saved as #yogenfr.

I need to be able to account for all accented letters used in German, Dutch, French and Spanish so that I can save hashtags like #yogenfrüz.

How can I go about doing this?

@AshwiniChaudhary: the UNICODE flag won't make the range used match non-ASCII characters, no. If you tell regex to match `a-z`, it takes the literal range, not the human interpretation that `a` and `á` somehow are the same thing. — Martijn Pieters, Oct 02 '16 at 13:07
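
To illustrate the comment above, here is a minimal sketch (assuming Python 3; the helper name `find_ascii_hashtags` is made up for this example) showing that an explicit `A-Za-z` range stays literal whether or not `re.UNICODE` is passed:

```python
import re

# The character class from the question: ASCII letters, digits and underscore only.
ASCII_PATTERN = r"#([A-Za-z0-9_]+)"

def find_ascii_hashtags(text, flags=0):
    """Hypothetical helper: apply the question's pattern with optional flags."""
    return re.findall(ASCII_PATTERN, text, flags)

# Both calls stop at the first character outside the literal ranges ("ü").
print(find_ascii_hashtags("#yogenfrüz"))
print(find_ascii_hashtags("#yogenfrüz", re.UNICODE))
```

Both calls should give the truncated `yogenfr`, because the flag only affects shorthand classes such as `\w`, not explicit ranges.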

Ibrahim Najjar · Accepted Answer · 2016-10-04T14:30:27.510

31

Try the following:

hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)

EDIT: Check the useful comment below from Martijn Pieters.
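
A note for Python 3 readers, as a sketch rather than part of the original answer: for `str` patterns `\w` is already Unicode-aware there, so `re.UNICODE` is redundant (it mattered in Python 2). The function names and the sample string below are assumptions for illustration; `re.ASCII` is used only to reproduce the truncated behaviour for comparison.

```python
import re

def find_hashtags(text):
    # In Python 3, \w in a str pattern matches Unicode word characters by default.
    return re.findall(r"#(\w+)", text)

def find_hashtags_ascii_only(text):
    # re.ASCII restricts \w to [a-zA-Z0-9_], which brings the truncation back.
    return re.findall(r"#(\w+)", text, re.ASCII)

sample = "Trying #yogenfrüz and #café_olé today!"
print(find_hashtags(sample))             # full hashtags, accents kept
print(find_hashtags_ascii_only(sample))  # cut off at the first accented letter
```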

edited Oct 04 '16 at 14:30

answered Sep 06 '13 at 17:52

Ibrahim Najjar

Small caveat: `\w` won't match combined codepoints, so `a` and [U+0301 COMBINING ACUTE ACCENT](https://codepoints.net/U+0301) won't be matched, even though that *prints* as `á`. You may want to normalise to NFC, first. – Martijn Pieters Oct 02 '16 at 13:09
@MartijnPieters Thanks for sharing, always something extra to learn. – Ibrahim Najjar Oct 04 '16 at 14:29
@IbrahimNajjar can you implement the fix mentioned by Martijn Pieters to your solution? Thanks. – Robert Valencia Apr 17 '17 at 18:00
@RobertValencia Unless you actually come across the situation he describes, my solution still works with accented characters. I am honestly not a Unicode expert and don't know the details exactly, but if you want to normalize as he suggests, check the other answer to this question. Hope that helps. – Ibrahim Najjar Apr 22 '17 at 15:26
I see. Thanks @IbrahimNajjar – Robert Valencia Apr 22 '17 at 16:28
Interestingly enough, I think I'm experiencing the exact problem @MartijnPieters described, except with é and e. I used the solution suggested by Berk below, and then decoded bytes object back to a string. Thanks all! – bddicken Sep 13 '17 at 18:31
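
The combining-codepoint caveat raised in the comments can be sketched as follows, assuming Python 3. The helper name `extract_hashtags` is hypothetical; the idea is to normalise to NFC before matching so that a base letter plus a combining accent becomes one precomposed letter that `\w` accepts.

```python
import re
import unicodedata

def extract_hashtags(text):
    """Hypothetical helper: compose characters (NFC) first, then match with \\w."""
    composed = unicodedata.normalize("NFC", text)
    return re.findall(r"#(\w+)", composed)

# "e" followed by U+0301 COMBINING ACUTE ACCENT renders like "é".
decomposed = "#cafe\u0301"

# Without normalisation, \w stops before the combining mark.
print(re.findall(r"#(\w+)", decomposed))

# With NFC normalisation the accent is part of a single letter and is kept.
print(extract_hashtags(decomposed))
```

Note that NFC keeps the accents, unlike the NFD-plus-ASCII approach in the other answer, which removes them.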

score 8 · Answer 2 · answered May 12 '21 at 08:00

I know this question is a little old, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex:

hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)

For a string containing #yogenfrüz, this returns ['yogenfrüz'].

Hope this'll help anyone else.
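
Two properties of the À-ÿ range are worth checking before relying on it; the sketch below (Python 3, with a made-up helper name `find_latin1_hashtags`) illustrates them. The range also contains the non-letters × (215) and ÷ (247), and it stops at code point 255, so letters such as the French œ (339) fall outside it.

```python
import re

# Same pattern as above: ASCII word characters plus code points 192 to 255.
LATIN1_PATTERN = r"#([A-Za-z0-9_À-ÿ]+)"

def find_latin1_hashtags(text):
    """Hypothetical helper wrapping the range-based pattern."""
    return re.findall(LATIN1_PATTERN, text)

print(find_latin1_hashtags("#yogenfrüz"))  # ü (252) is inside the range
print(find_latin1_hashtags("#a×b"))        # × (215) is inside the range too
print(find_latin1_hashtags("#cœur"))       # œ (339) is outside, so the tag is cut
```

If those edge cases matter, `\w` from the accepted answer may be the simpler choice.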

score 4 · Answer 3 · edited May 23 '17 at 12:26

You may also want to use

import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

This comes from an answer to the question: how do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?

Assume you have loaded your Unicode text into a variable called my_unicode. Normalizing à into a is as simple as the two lines above. An explicit example:

>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'

Check this answer, which helped me a lot: How to convert unicode accented characters to pure ascii without accents?
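
In Python 3, `str.encode` returns a `bytes` object, so the result needs a decode step to get a string back, as one commenter noted. A minimal sketch, with `strip_accents` as a hypothetical helper name:

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: decompose (NFD), drop non-ASCII, return a str."""
    decomposed = unicodedata.normalize("NFD", text)
    # encode(..., "ignore") discards the combining marks; decode turns bytes into str.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("àà"))
print(strip_accents("yogenfrüz"))
# Characters with no decomposition, such as ß, are dropped entirely.
print(strip_accents("straße"))
```

Keep in mind that this removes accents rather than preserving them, so it changes the hashtag text instead of matching it as written.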

answered Feb 12 '17 at 19:41

Berk

great answer berk! someone will definitely find this useful! – deadlock Feb 13 '17 at 06:04
I downvoted because this is the opposite of what OP wants. They want to account for accents; so removing them is not a solution. – bfontaine Feb 10 '22 at 20:28
