0

I'm working with a JSON which has some fields filled in by humans in different countries. Some scripts share a letter, but the each script has distinct unicode IDs.

Here are some examples of this:

Р is the Cyrillic script has the unicode ID 0x420.P in the Latin alphabet has the unicode ID 0x50.

е and e have unicode IDs 0x435 and 0x45 respectively.

М and M have unicode IDs 0x41c and 0x4d.

This results in hard to understand behavior:

'МРе' in 'MPe'
False

'MPe' in 'MPe'
True

I don't have any way to control the incoming data.

What is the best way to handle alike unicode characters in Python? Phrased differently, what is the best way to make 'MPe' in 'MPe' is True regardless of the script that the data comes in?

ahoy
  • 1
  • Note: they are really different characters, so do not simplify them. OTOH for security reason you may treat them differently. For python, check e.g. https://pypi.org/project/confusables/ . For original source, see: https://util.unicode.org/UnicodeJsps/confusables.jsp . See also general answer (not Python specific) in https://stackoverflow.com/questions/9491890/is-there-a-list-of-characters-that-look-similar-to-english-letters – Giacomo Catenazzi Aug 26 '21 at 07:57

0 Answers0