I am working with a bunch of labeled text data generated by customers of my company. I often come across strings with weird characters like this: .

My machine learning models don't like these characters, so I've resorted to simply removing them from strings, following this answer on Stack Overflow. This often partially or completely destroys the meaning of the data I'm working with.

What I'd like to do is this:

normal_text = glitchy_to_ascii('   ')
print(normal_text)
'My name is Sam'

Is there some standard way of doing this? Are there Python packages out there? Also, what is this sort of text called? I've seen it called 'glitchy' text on various websites.
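One possible starting point, using only the standard library, is Unicode compatibility normalization. The sketch below assumes the "glitchy" characters are styled variants (Fraktur, bold, fullwidth and similar) that carry a compatibility decomposition to plain letters; anything without such a decomposition is simply dropped, so it is not a complete solution. The function name `glitchy_to_ascii` comes from the question; the sample string is built here purely for illustration.

```python
import unicodedata


def glitchy_to_ascii(text: str) -> str:
    """Fold styled Unicode letters to ASCII where a compatibility mapping exists."""
    # NFKD replaces compatibility characters (e.g. mathematical Fraktur letters)
    # with their plain equivalents and splits accented letters into
    # base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops whatever is still non-ASCII,
    # including the combining marks produced above.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Illustrative input: "Sam" written with mathematical Fraktur letters.
sample = (
    "\N{MATHEMATICAL FRAKTUR CAPITAL S}"
    "\N{MATHEMATICAL FRAKTUR SMALL A}"
    "\N{MATHEMATICAL FRAKTUR SMALL M}"
)
print(glitchy_to_ascii(sample))  # Sam
```

Characters that only look like Latin letters (for example Cyrillic or Greek lookalikes) have no compatibility decomposition, so this approach removes them rather than converting them.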

  • For this, I collect all the possible similar characters, arrange them in a dictionary, and check each character against that dictionary. – Ghost Ops Oct 07 '21 at 06:42
  • @GhostOps yah I was thinking of doing the same. I thought there might be a package out there or something but alas... – randomdatascientist Oct 07 '21 at 06:43
  • Yeah, but checking every duplicate of a character is a tough task. [This](https://util.unicode.org/UnicodeJsps/confusables.jsp) website shows all the possible duplicates of a character. I did the same in the past and now have a dictionary of letters like that, but you know what... I sometimes got confused about where to put a letter, as there are some like `' '` – Ghost Ops Oct 07 '21 at 06:50
  • @GhostOps Oh yah that would be really cool if I could get that dictionary... How can we do this? – randomdatascientist Oct 07 '21 at 06:54
  • @randomdatascientist I guess sphennings found what you want: a package that does the same thing as my dictionary. Sadly, my dictionary doesn't have any emoji support... I need to add it, or maybe I'll try [this](https://stackoverflow.com/a/62819752/16693888) – Ghost Ops Oct 07 '21 at 06:57
  • this is [fraktur](https://en.wikipedia.org/wiki/Fraktur) font. you'll find data on its unicode range on that page – diggusbickus Oct 07 '21 at 06:59
  • @sphennings It partially answers the question - I'll have to take a look at the package referenced there (Unidecode). Perhaps this question could be left up to help others find the answer you referenced as it was hard for me to find any answers on StackOverflow. – randomdatascientist Oct 07 '21 at 07:00
  • @diggusbickus Looking for other weird fonts too. Thanks for identifying this particular font example though. – randomdatascientist Oct 07 '21 at 07:01
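The dictionary approach discussed in the comments could look roughly like the sketch below. The table `LOOKALIKES` and the helper names are hypothetical and deliberately tiny; a real table would be much larger, for instance built by hand or derived from the confusables data linked above.

```python
# Hypothetical lookalike table: each key is a non-Latin character that
# resembles the ASCII letter it maps to. Extend as new characters turn up.
LOOKALIKES = {
    "\N{CYRILLIC SMALL LETTER A}": "a",
    "\N{CYRILLIC SMALL LETTER IE}": "e",
    "\N{CYRILLIC SMALL LETTER O}": "o",
    "\N{GREEK CAPITAL LETTER ALPHA}": "A",
}

# str.translate needs a table keyed by code point; maketrans builds it.
TABLE = str.maketrans(LOOKALIKES)


def replace_lookalikes(text: str) -> str:
    """Replace every character found in LOOKALIKES with its ASCII stand-in."""
    return text.translate(TABLE)


def unmapped_characters(text: str) -> list[str]:
    """List non-ASCII characters the table does not cover yet."""
    return sorted({ch for ch in text if ord(ch) > 127 and ch not in LOOKALIKES})


mixed = "S\N{CYRILLIC SMALL LETTER A}m"
print(replace_lookalikes(mixed))  # Sam
```

Reporting the unmapped characters instead of silently deleting them makes it easier to grow the dictionary over time without losing meaning in the data.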
