
I have a dataset with a field containing individuals' names. Some of the names are near-duplicates that differ only slightly, e.g. 'CANON INDIA PVT. LTD' and 'CANON INDIA PVT. LTD.', 'Antila,Thomas' and 'ANTILA THOMAS', or 'Z_SANDSTONE COOLING LTD' and 'SANDSTONE COOLING LTD'. I need to identify such fuzzy duplicates and create a new subset containing these records. The actual table is huge, so here is a small sample:

| Name                    |   City  |
|-------------------------|:-------:|
| CANON PVT. LTD          | Georgia |
| Antila,Thomas           | Georgia |
| Greg                    | Georgia |
| St.Luke's Hospital      | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| St.Luke's Hospital      | Georgia |
| CANON PVT. LTD.         | Georgia |
| SANDSTONE COOLING LTD   | Georgia |
| Greg                    | Georgia |
| ANTILA,THOMAS           | Georgia |

I want the output to be:

| Name                    |   City  |
|-------------------------|:-------:|
| CANON PVT. LTD          | Georgia |
| CANON PVT. LTD.         | Georgia |
| Antila,Thomas           | Georgia |
| ANTILA,THOMAS           | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| SANDSTONE COOLING LTD   | Georgia |

I tried using RecordLinkage and agrep, but both just return the original data.

```r
library(RecordLinkage)
ClosestMatch2 = function(string, stringVector){
  distance = levenshteinSim(string, stringVector)
  stringVector[distance == max(distance)]
}
Fuzzy_duplicate = ClosestMatch2(df$Name, df$Name)
```

The other method was:

```r
lapply(df$Name, agrep, df$Name, value = TRUE)
```

agrep only gives me a list of matches per name (vector indices, or the matching strings when value = TRUE). How can I instead extract, as a data frame, all the records whose names are similar to another record's name?
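For reference, here is a minimal sketch of how the index vectors from agrep could be turned back into rows. It rebuilds the sample table above as `df`; the `max.distance = 0.1` setting and the helper names `hits`, `is_fuzzy_dup` and `fuzzy_df` are assumptions for illustration, not a tuned solution. A row is kept only if it approximately matches some row with a non-identical name, which is how the desired output appears to exclude exact repeats such as Greg.

```r
# Sample data copied from the table in the question
df <- data.frame(
  Name = c("CANON PVT. LTD", "Antila,Thomas", "Greg", "St.Luke's Hospital",
           "Z_SANDSTONE COOLING LTD", "St.Luke's Hospital", "CANON PVT. LTD.",
           "SANDSTONE COOLING LTD", "Greg", "ANTILA,THOMAS"),
  City = "Georgia",
  stringsAsFactors = FALSE
)

# For each name, the row indices that agrep() treats as approximate matches
# (max.distance = 0.1 is an assumed tolerance, not a recommendation)
hits <- lapply(df$Name, agrep, x = df$Name,
               max.distance = 0.1, ignore.case = TRUE)

# Flag a row when at least one of its matches has a different spelling;
# rows that only match identical copies of themselves are not flagged
is_fuzzy_dup <- vapply(seq_along(hits), function(i) {
  any(df$Name[hits[[i]]] != df$Name[i])
}, logical(1))

# Subset the original data frame, keeping all columns
fuzzy_df <- df[is_fuzzy_dup, ]
```

Note that agrep looks for the pattern approximately anywhere inside the target string, so a short name can match inside a longer unrelated one on real data.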

Jazz
  • I would first do some data cleaning using the tm library and regex, e.g. remove punctuation, convert all letters to upper or lower case, remove excess whitespace, and remove common company suffixes (e.g. LTD), and then match using a distance metric (Levenshtein is an option, but I think there are more suitable ones). After that the subsetting should be rather easy. – Koot6133 Jul 12 '19 at 13:38
  • @Chase I went through a similar post https://stackoverflow.com/questions/6044112/how-to-measure-similarity-between-string. How can I extract all the records as a dataframe containing similar records instead of vector indices? – Jazz Jul 12 '19 at 17:26
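
The clean-then-match approach suggested in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses base R (`gsub`, `adist`) rather than the tm library, it does not strip company suffixes, and the `normalise` helper and the `max_dist = 2` threshold are hypothetical choices that would need tuning on real data.

```r
# Sample data copied from the table in the question
df <- data.frame(
  Name = c("CANON PVT. LTD", "Antila,Thomas", "Greg", "St.Luke's Hospital",
           "Z_SANDSTONE COOLING LTD", "St.Luke's Hospital", "CANON PVT. LTD.",
           "SANDSTONE COOLING LTD", "Greg", "ANTILA,THOMAS"),
  City = "Georgia",
  stringsAsFactors = FALSE
)

# Hypothetical helper: upper-case, turn punctuation into spaces,
# collapse runs of whitespace, and trim both ends
normalise <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}
key <- normalise(df$Name)

# Pairwise generalized Levenshtein distances between the cleaned names
d <- adist(key)

# Assumed threshold on the edit distance of the cleaned names
max_dist <- 2

# A pair counts as a fuzzy duplicate when the cleaned names are close
# but the original spellings differ (so exact repeats are not flagged)
close_enough <- d <= max_dist
different    <- outer(df$Name, df$Name, "!=")
is_fuzzy_dup <- rowSums(close_enough & different) > 0

# Rows are returned in their original order, with all columns kept
fuzzy_df <- df[is_fuzzy_dup, ]
```

On the sample data this keeps the two CANON rows, the two Antila rows and the two SANDSTONE rows, and drops Greg and St.Luke's Hospital. Since `adist` builds a full n-by-n matrix, a very large table would likely need blocking (e.g. comparing only within the same City) before this is practical.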
