0

I am looking for a way in R to indicate in one dataframe if a given string (in a specific column for each of the thousands of rows) is similar to another string in a different, similary large dataframe.

What I've done:

I have two dataframes with news headlines (among other columns). I want to identify for which news event (headline) in dataframe A a corresponding headline exists in dataframe B.

Initially I tried to join the two based on exact matches.

semi_join(A,
          B %>% dplyr::select(headline),
          by = c("Headline" = "headline"))

What issues I found with my approach:

Unfortunately, I found only very few matches. The reason is that many headlines in the two datasets are similar but not 100% exact matches. Please find three examples below:

A <- c("United States : Canon Extends uniFLOW Secure Print Support for Multi-Vendor Environments", "Canon Introduces the New PowerShot G3: A Digital Camera That Enhances Image Quality While Offering Photographers Greater Creative Flexibility", "Canon Introduces the PowerShot S230: the World's Smallest 3 Megapixel Digital Camera -a-")

B <- c("Canon U.S.A., Inc. Extends uniFLOW Secure Print Support for Multi-Vendor Environments", "Canon Introduces the New PowerShot G3", "Canon Introduces the PowerShot S230 the World's Smallest 3 Megapixel Digital Camera")

When comparing the two you will see that for the first the headline is very similar except for the initial piece "United States :" vs. "Canon U.S.A., Inc.". For the second there is a sub-headline in dataframe A which does not exist in dataframe B. For the third there is the additional "-a-" at the end as well as the ":" in the middle.

Consequently, I removed punctuation, converted everything to lower case, and used stemming (not sure if I need the latter but I figured it also wouldn't hurt.

What I am looking for:

I am looking for a way to compare the two dataframes and if a headline in either one of the two is a match to a (sub)string / headline in the other one, I would like to indicate it as a match. Preferably I would get a result like this: I want to keep all the information from dataframe A and I want an additional column with the headline from dataframe B mapped to it if it is a good match (empty if it is not a good match).

The match is likely not symmetric (e.g., for the second example above, 100% of the words from B are in A but because A has an additional sub-headline not all words from A are in B - still I would like this to be a match)

What I have tried to solve this (and failed):

I looked at fuzzyjoin but I am not sure if I found what I am looking for. I am fairly new to R so maybe fuzzyjoin has a solution for me but I am not able to spot it / understand how I would make it work.

I did find a suggestion on stack overflow (here) but received an error message (for which I consulted stack overflow again and tried what is suggested here). I believe the way how my data looks like, the solutions above are not viable.

Highly appreciate your input. Thank you!

AlexanderP
  • 106
  • 6

0 Answers0