Remove zero width space unicode character from Python string

Question

I have a string in Python like this:

u'\u200cHealth & Fitness'

How can I remove the `\u200c` part from the string?

my bad, the encoding should be `ascii` as Arount answered below — Chen A., Sep 11 '17 at 11:48

score 46 · Accepted Answer · answered Sep 11 '17 at 11:29


You can encode it into ascii and ignore errors:

u'\u200cHealth & Fitness'.encode('ascii', 'ignore')

Output (in Python 2; in Python 3 the result is the bytes object b'Health & Fitness'):

'Health & Fitness'

answered Sep 11 '17 at 11:29

Arount


5

This obviously works in the above example but you are forcing the string into ascii losing all unicode chars, which obviously is not a solution that works for all – Martin Massera Jul 28 '19 at 14:05
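
To illustrate the concern raised in the comment above, here is a minimal Python 3 sketch. The helper name `ascii_only` and the accented sample string are made up for illustration; only the first string comes from the question. In Python 3 `encode` returns `bytes`, so a `decode('ascii')` step is assumed here to get a `str` back.

```python
# Sample from the question, plus a made-up string with legitimate non-ASCII text.
question_text = '\u200cHealth & Fitness'
accented_text = '\u200cCaf\u00e9 & Cr\u00e8me'

def ascii_only(text):
    """Drop every non-ASCII character (the accepted answer's approach)."""
    return text.encode('ascii', 'ignore').decode('ascii')

cleaned_question = ascii_only(question_text)  # 'Health & Fitness'
cleaned_accented = ascii_only(accented_text)  # 'Caf & Crme': the accents are lost too

print(repr(cleaned_question))
print(repr(cleaned_accented))
```

The zero-width character is removed in both cases, but so is every other character outside ASCII, which is usually not what you want for real-world text.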

Hayat · Answer 2 · 2018-12-01T04:28:25.320

29

If you have a string that contains a Unicode character, like

s = "Airports Council International \u2013 North America"

then you can try:

newString = (s.encode('ascii', 'ignore')).decode("utf-8")

and the output will be (note the double space left where the en dash was removed):

Airports Council International  North America


edited Dec 01 '18 at 04:28

answered Feb 21 '18 at 07:47

Hayat


1

shouldn't we decode 'ascii' after encoding to ascii – Vaibhav Vishal Dec 05 '18 at 05:37
If you have a list of strings, you can adapt this as a list comprehension: `list_text_fixed = [(s.encode('ascii', 'ignore')).decode("utf-8") for s in list_text]` – timothyjgraham Sep 10 '19 at 04:48

score 16 · Answer 3 · edited Jul 28 '19 at 14:19


I just use replace, since I don't need that character:

varstring.replace(u'\u200c', '')

Or in your case:

u'\u200cHealth & Fitness'.replace(u'\u200c', '')

edited Jul 28 '19 at 14:19

joanis


answered Mar 28 '19 at 15:06

Sitti Munirah Abdul Razak


5

This is actually better than the accepted answer in most strings. The \u200c is a zero width non joiner, which is an unusual whitespace-type character that `strip()` ignores. In most cases with unicode strs you do not want to `encode(ascii, ignore)`. – Chet Mar 28 '19 at 15:41
1

This is the more general solution, since encoding to ascii may remove other Unicode characters as well. – prosti Dec 03 '19 at 14:31
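
Building on the `replace` approach and the comments above, here is a sketch that removes several zero-width characters in one pass with `str.translate` and leaves all other Unicode text intact. The helper name `strip_zero_width` and the particular set of code points (U+200B, U+200C, U+200D, U+FEFF) are assumptions for illustration. Adjust the set to your data, since the joiner characters are meaningful in some scripts and in emoji sequences.

```python
# Hypothetical set of zero-width code points to drop; extend or trim as needed.
ZERO_WIDTH = {
    0x200B: None,  # ZERO WIDTH SPACE
    0x200C: None,  # ZERO WIDTH NON-JOINER
    0x200D: None,  # ZERO WIDTH JOINER
    0xFEFF: None,  # ZERO WIDTH NO-BREAK SPACE (byte order mark)
}

def strip_zero_width(text):
    """Remove the code points listed in ZERO_WIDTH, wherever they occur."""
    return text.translate(ZERO_WIDTH)

sample = '\u200cCaf\u00e9\u200b & Fitness'

print(repr(sample.strip()))            # strip() leaves the leading U+200C in place
print(repr(strip_zero_width(sample)))  # zero-width characters gone, accent kept
```

Mapping a code point to `None` in the translation table deletes it, so this behaves like chaining several `replace` calls but scans the string only once.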

score 3 · Answer 4 · answered Dec 11 '18 at 10:41


For me, the following worked:

mystring.encode('ascii', 'ignore').decode('unicode_escape')

answered Dec 11 '18 at 10:41

Diana


2

You could improve your answer by explaining _why_ this code works, and what you're doing here. That way, others can be educated. – RyanZim Dec 11 '18 at 13:44
tbh, that was a 'Frankenstein' version of all answers that I had previously found but which didn't work. I can't really explain why this one worked over the rest of solutions in my case.. – Diana Oct 23 '19 at 11:19
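
One possible explanation, offered as an assumption rather than something confirmed in this thread: the `encode('ascii', 'ignore')` step is what drops `\u200c`, and the `unicode_escape` decode then additionally interprets any backslash escape sequences left in the text. The Python 3 sketch below uses a hypothetical helper name, `ascii_then_unescape`, and made-up sample strings to show both effects, including a case where the second step changes text you may have wanted to keep.

```python
def ascii_then_unescape(text):
    """The approach from this answer: ASCII-encode, then decode escape sequences."""
    return text.encode('ascii', 'ignore').decode('unicode_escape')

# A real U+200C character is removed by the encode step.
plain = ascii_then_unescape('\u200cHealth & Fitness')

# A literal backslash followed by 't' is turned into a tab by the decode step.
windows_path = ascii_then_unescape('\u200cC:\\temp')

print(repr(plain))         # 'Health & Fitness'
print(repr(windows_path))  # the backslash-t pair is now a single tab character
```

If the input never contains backslashes, the decode step behaves like a plain ASCII decode, which may be why it appeared to work where other variants did not.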

score 1 · Answer 5 · answered Jan 12 '21 at 17:50

In the specific case in the question, where the string is prefixed with a single u'\u200c' character, the solution is as simple as taking a slice that does not include the first character.

original = u'\u200cHealth & Fitness'
fixed = original[1:]

If the leading character may or may not be present, str.lstrip may be used:

original = u'\u200cHealth & Fitness'
fixed = original.lstrip(u'\u200c')

The same solutions work in Python 3. From Python 3.9, str.removeprefix is also available:

original = u'\u200cHealth & Fitness'
fixed = original.removeprefix(u'\u200c')
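
The two methods differ when the character is repeated: `lstrip` treats its argument as a set of characters and removes every leading occurrence, while `removeprefix` removes the prefix at most once. A small sketch of the difference follows; the helper `remove_one_prefix` and the doubled sample string are made up for illustration, and the fallback branch is an assumption for interpreters older than 3.9.

```python
ZWNJ = '\u200c'

def remove_one_prefix(text, prefix):
    """Remove a single leading `prefix`, like str.removeprefix (Python 3.9+)."""
    if hasattr(text, 'removeprefix'):
        return text.removeprefix(prefix)
    # Fallback for older Python versions.
    return text[len(prefix):] if text.startswith(prefix) else text

doubled = ZWNJ + ZWNJ + 'Health & Fitness'

stripped_all = doubled.lstrip(ZWNJ)              # removes both leading characters
stripped_one = remove_one_prefix(doubled, ZWNJ)  # removes only the first one

print(repr(stripped_all))
print(repr(stripped_one))
```

Neither method touches a `\u200c` that appears in the middle or at the end of the string.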
