
I am working with fuzzy keyword matching. The first dataset consists of 20,180 rows and the second of about 10,000 rows. I am using the .apply method to find matches, with a progress bar to track iterations per second; it reports around 2-3 iterations per second. How do I increase the speed, or is there a better approach than this code for fuzzy matching that gives faster results?

df1['match']=df1['title'].progress_apply(lambda x: process.extractOne(x,df_conm['conm'].to_list(),score_cutoff=100))
df1


    Please post code as text, not image. Be copy/paste friendly. – tdelaney Sep 09 '21 at 22:56
  • 1
    One simple improvement is to compute `lst = df_conm['conm'].to_list()`, then use lst in `df1['match'] = ...`. This way you're not recomputing df_conm['conm'].to_list() for every row in df1 (i.e. 20180 rows). – DarrylG Sep 09 '21 at 23:13
  • Still the same; it is taking around 3 hours to get the output. – Achillies Sep 09 '21 at 23:23
  • There was no improvement in time when you changed to `df1['match'] = df1['title'].progress_apply(lambda x: process.extractOne(x, lst, score_cutoff=100))`? – DarrylG Sep 09 '21 at 23:29
  • A bigger improvement can be obtained by using [rapidfuzz](https://github.com/maxbachmann/RapidFuzz) (rather than fuzzywuzzy) as illustrated by [Is there a way to modify this code to reduce run time?](https://stackoverflow.com/questions/68483600/is-there-a-way-to-modify-this-code-to-reduce-run-time/68494221?r=SearchResults&s=1|8.9407#68494221) – DarrylG Sep 09 '21 at 23:39
  • 1
    As a side note it makes no sense to use a score_cutoff of 100. This means that only exact matches will be considered. – maxbachmann Sep 10 '21 at 09:51
  • I’m voting to close this question because it belongs to https://codereview.stackexchange.com/ – stackprotector Sep 16 '21 at 06:03

0 Answers