Translate pandas DataFrame using dictionary sorted by word length

Question

I have imported an Excel file into a pandas DataFrame, which I'm trying to translate and then export back to Excel.

For example, say this is my data set:

import pandas as pd

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}
data = [['cool guy'], ['cool'], ['guy']]
df = pd.DataFrame(data, columns=['WORDS'])


print(df)
#    WORDS   
# 0  cool guy   
# 1  cool  
# 2  guy

The easiest solution would be to use the built-in pandas replace method. However, if you use:

df['WORDS'] = df['WORDS'].replace(d, regex=True)

The result is:

print(df)
#    WORDS   
# 0  chill dude   
# 1  chill  
# 2  dude

("cool guy" is not translated to "bro" as intended.)

This could be solved by sorting the dictionary by the longest word first. I tried to use this function:

import re
def replace_words(col, dictionary):
    # sort keys by length, in reverse order
    for item in sorted(dictionary.keys(), key = len, reverse = True):
        col = re.sub(item, dictionary[item], col)
    return col

But:

df['WORDS'] = replace_words(df['WORDS'], d)

This results in a TypeError: expected string or bytes-like object

Trying to convert each row to a string did not help either:

...*
col = re.sub(item, dictionary[item], [str(row) for row in col])

Does anyone have any solution or different approach I could try?

Unless I'm misunderstanding don't you just want `replace` without regex? `df['WORDS'] = df['WORDS'].replace(d)` — Henry Ecker, Jul 17 '21 at 18:11
I think you need `df['WORDS'].replace(dict(sorted(d.items(), key=lambda k: len(k[0]), reverse=True)), regex=True)`. — Henry Yik, Jul 17 '21 at 18:31
@HenryEcker I seem to have misunderstood the need for regex. Simply replace(d) was enough, as you said! Thank you! — Tomas Storås, Jul 17 '21 at 19:34

score 1 · Accepted Answer · answered Jul 17 '21 at 18:29

Let us try replace without regex=True, which matches whole cell values:

df.WORDS.replace(d)
Out[307]: 
0      bro
1    chill
2     dude
Name: WORDS, dtype: object
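As a minimal sketch of why this works: without regex=True, replace compares each dictionary key against the entire cell value, so "cool guy" can only match the key "cool guy" and key order does not matter. The extra row "nice" is an assumption added here to show that values missing from the dictionary pass through unchanged.

```python
import pandas as pd

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}

# "nice" is not a key in d; it is included only to show the fallback behaviour.
df = pd.DataFrame({"WORDS": ["cool guy", "cool", "guy", "nice"]})

# Exact (whole-value) matching: no regex, so no partial replacements.
exact = df["WORDS"].replace(d)
print(exact.tolist())
```

This only helps when each cell holds exactly one dictionary key; keys embedded in longer text are left alone.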

BENY

replace(d) was enough for it to work! Thank you! – Tomas Storås Jul 17 '21 at 19:37
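If the cells held longer text with the keys embedded in it, exact-match replace would not be enough, and the longest-first idea from the question becomes relevant again. The following is only a sketch under that assumption: the helper name translate and the sample sentences are hypothetical. It builds one alternation pattern with the longest keys first and applies it per cell with apply, which also avoids the TypeError from passing a whole Series to re.sub.

```python
import re
import pandas as pd

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}

# Longest keys first, so "cool guy" is tried before "cool" at each position.
keys = sorted(d, key=len, reverse=True)
# re.escape guards against keys that contain regex metacharacters.
pattern = re.compile("|".join(re.escape(k) for k in keys))

def translate(text):
    # Hypothetical helper: a single pass over one string, looking up each match in d.
    return pattern.sub(lambda m: d[m.group(0)], text)

# Sample sentences are made up for illustration.
s = pd.Series(["a cool guy indeed", "cool", "some guy"], name="WORDS")
result = s.apply(translate)
print(result.tolist())
```

Because everything happens in one pass, a replacement is never re-translated by a later key. The pattern has no word boundaries, so "cooler" would become "chiller"; wrapping the alternation in \b markers is one way to prevent that if it matters.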

score 0 · Answer 2 · answered Jul 17 '21 at 18:11

df['WORDS'] = df['WORDS'].apply(lambda x: d[x])

This will do the job, as long as every value in the column is a key in d.

Diyar Mohammady

This doesn't work if there is a word in the df that is not in the dictionary. It causes a key error. – Tomas Storås Jul 17 '21 at 18:53
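One way around the KeyError, sketched here as an assumption rather than as part of the original answer, is to look values up with dict.get and fall back to the original value when it is not in the dictionary. The value "unknown" is made up to trigger that case.

```python
import pandas as pd

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}

# "unknown" is deliberately absent from d.
s = pd.Series(["cool guy", "cool", "guy", "unknown"], name="WORDS")

# d.get(x, x) returns the translation if present, otherwise x itself.
translated = s.map(lambda x: d.get(x, x))
print(translated.tolist())
```

Like replace(d), this matches whole cell values only.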
