0

Hi I am trying to join two large datasets >10000 entries each. To do this I have created a ‘unique ID’ - a combination of full name and date of birth which are present in both. However, the datasets have spelling mistakes/ different characters in the IDs so when using left join many won’t match. I don’t have access to fuzyjoin/ match so can’t use this to partially match them. Someone has suggested using adist(). How am I able to use this to match and merge the datasets or to flag ones which are close to matching? As simple as possible please I have only been using R for a few weeks! Examples of code would be amazing

Beth
  • 1
  • 1
  • 1
    In order for us to help you, please edit your question to include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). For example, to produce a minimal data set, you can use `head()`, `subset()`, or the indices. Then use `dput()` to give us something that can be put in R immediately. Also, please make sure you know what to do [when someone answers your question](https://stackoverflow.com/help/someone-answers). More info can be found at StackOverflow's [help center](https://stackoverflow.com/help). Thank you! – iamericfletcher Mar 18 '21 at 18:10
  • What do you mean by "don't have access to fuzzyjoin"? Does this mean we should not suggest other packages outside base R? – Jon Spring Mar 18 '21 at 18:11
  • My work do not allow us to install packages from cran, so can only access those which they store on the system. Includes most commonly used packages just not everything – Beth Mar 18 '21 at 18:20

3 Answers3

0

You could just rename them to names that are spelled correctly:

df$correct_spelling <- df$incorrect_spelling
  • 1
    One quick addendum to Jeremy's answer, if you don't already have a column of correct spellings, you would have to change each misspelling with something like `df$UniqueIDColumn[df$UniqueIDColumn == "WrongSpelling"] – Sean McKenzie Mar 18 '21 at 20:43
0

This may a bit of a manual solution, but perhaps a base - R solution would be to look through unique values of the join fields and correct any that are misspelled using the grep() function and creating a crosswalk to merge into the dataframes with misspelled unique IDs. Here's a trivial example of what I mean:

Let's say we have a dataframe of scientists and their year of birth, and then we have a second dataframe with the scientists' names and their field of study, but the "names" column is riddled with spelling errors. Here is the code to make the example dataframes:

##Fake Data##
Names<-c("Jill", "Maria", "Carlos", "DeAndre") #Names
BirthYears<-c(1974, 1980, 1991, 1985) # Birthyears 
Field<-c("Astronomy", "Geology", "Medicine", "Ecology") # Fields of science
Mispelled<-c("Deandre", "Marai", "Jil", "Clarlos")# Names misspelled

##Creating Dataframes##
DF<-data.frame(Names=Names, Birth=BirthYears) # Dataframe with correct spellings
DF2<-data.frame(Names=Mispelled, Field=Field) # Dataframe with incorrect spellings we want to fix and merge

What we can do is find all the unique values of the correctly spelled and the incorrectly spelled versions of the scientists' names using a regular expression replacement function gsub().

Mispelled2<-unique(DF2$Names) # Get unique values of names from misspelled dataframe
Correct<-unique(DF$Names) # Get unique values of names from correctly spelled dataframe

fix<-NULL #blank vector to save results from loop

for(i in 1:length(Mispelled2)){#Looping through all unique mispelled values
  ptn<-paste("^",substring(Mispelled2[i],1,1), "+", sep="") #Creating a regular expression string to find patterns similar to the correct name
  fix[i]<-grep(ptn, Correct, value=TRUE) #Finding and saving replacement name values
}#End loop

You'll have to come up with the regular expressions necessary for your situation, here is a link to a cheatsheet with how to build regular expressions

https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf

Now we can make a dataframe crosswalking the misspelled names with the correct spelling ie., Row 1 would have "Deandre" and "DeAndre" Row 2 would have "Jil" and "Jill."

CWX<-data.frame(Name_wrong=Mispelled2, Name_correct=fix)

Finally we merge the crosswalk to the dataframe with the incorrect spellings, and then merge the resultant dataframe to the dataframe with the correct spellings

Mispelled3<-merge(DF2, CWX, by.x="Names", by.y="Name_wrong")
Joined_DF<-merge(DF, Mispelled3[,-1], by.x="Names", by.y="Name_correct")
Sean McKenzie
  • 623
  • 2
  • 12
  • Do you know if something similar could be done using adist( ), working out the edit distances? And only pairing if it’s a certain number away? – Beth Mar 19 '21 at 18:02
  • Yes, I think so, although probably you are looking for `agrep()` which allows for fuzzy string matching. If you go that route, it would be almost the same code as above, except you would want to use `fix[i] – Sean McKenzie Mar 19 '21 at 20:23
  • How could I edit the regular expression so it would match either the beginning and/or the middle and/or the end? – Beth Apr 05 '21 at 11:33
  • Hi Beth, I think to do something like that, you would have to write a custom function using if/else statements. I think I understand what you want... Basically if your Id were something like "Carmen_1990_Chemistry", and the misspelled version of the ID were "Carmne_1990_Cemistry," you would want a regular expression that would match to 1990. But I am not 100% sure this is what you are after. If you have some example data of what you are wanting, please post it, and I will see if I can come up with custom function for you. – Sean McKenzie Apr 05 '21 at 15:20
  • Hi Sean, essentially yes so if an ID were “deborahoziajames04/14/2000” and I wanted to be able to match this to “oziajames04/14/2000”. But also would need to be able to match it if it were “deborahozia04/14/2000” or deborahoziajames10/14/1990 -so the misspelt names could be missing different parts – Beth Apr 05 '21 at 16:58
  • Hi Beth, I'm going to post a follow up answer below with the function – Sean McKenzie Apr 12 '21 at 16:18
0

Here is what I was able to come up with for your question about matching in multiple ways. It's a bit clunky, but it works with this below example data. The trick is making the call to agrep() sensitive enough to not match names that partially match but are truly different, but flexible enough that it allows for partial matches and misspellings:

Example1<-"deborahoziajames04/14/2000"
Example2<-"Somepersonnotdeborah04/15/2002"
Example3<-"AnotherpersonnamedJames01/23/1995"
Misspelled1<-"oziajames04/14/2000"
Misspelled2<-"deborahozia04/14/2000"
Misspelled3<-"deborahoziajames10/14/1990"
Misspelled4<-"personnamedJames"

String<-c(Example1, Example2, Example3)
Misspelled<-c(Misspelled1, Misspelled2, Misspelled3, Misspelled4)

Spell_Correct<-function(String, Misspelled){
  out<-NULL
 for(i in 1:length(Misspelled)){
  
    ptn_front<-paste('^', Misspelled[i], "\\B", sep="") 
    ptn_mid<-paste('\\B', Misspelled[i], "\\B", sep="") 
    ptn_end<-paste('\\B', Misspelled[i], "$", sep="")
  
      ptn<-c(ptn_front, ptn_mid, ptn_end)
    
    Nchar_M<-nchar(Misspelled[i])
    Nchar_S<-nchar(String)
    
      out_front<-agrep(pattern=ptn[1], x=String, value=TRUE, max.distance=0.3, ignore.case=TRUE, costs = c(0.6, 0.6, 0.6))
      out_mid<-agrep(pattern=ptn[2], x=String, value=TRUE, max.distance=0.3, ignore.case=TRUE, costs = c(0.6, 0.6, 0.6))
      out_end<-agrep(pattern=ptn[3], x=String, value=TRUE, max.distance=0.3, ignore.case=TRUE, costs = c(0.6, 0.6, 0.6))
      
      out_test<-list(out_front, out_mid, out_end)
      for (j in 1:length(out_test)){
        if(length(out_test[j])==1)
          use_me<-out_test[j]
      }
      out[i]<-use_me
 }
  return(unlist(out))
}

Spell_Correct(String, Misspelled)

Basically this just repeating the previous answer multiple times by using the loop and tweaking the regular expression to try a beginning, middle, and end call to agrep(). Depending on how bad the misspellings are, you may need to play around with the max.distance and cost arguments. Good Luck. Take Care, -Sean

Sean McKenzie
  • 623
  • 2
  • 12