Function in R for getting a 0 in new column when value in another column equals any of the rows in another dataset

Question

I have a list of names in one dataset and a column for 'name' in another dataset. I want R to give me a new column that says 1 if any of the names in my first dataset appear in the column 'name' in that row. In other words, I want it to go row by row and, for the value in a cell of that row, look in my first dataset. If the value appears in my first dataset, I want it coded as a 1 in a new column. Can you help? I apologize for not providing the data structure; it's my first time posting. Here is what I am trying to do.

    myDataSet1 <- as.data.frame( cbind( "firstname" = c("Jenny", "Jane", "Jessica", "Jamie", "Hannah"),
                                        "year" = c(2018, 2019, 2020, 2021, 2022) ) )

    myDataSet2 <- as.data.frame( cbind( "name" = c("Jenny", "John", "Andy", "Jamie", "Hannah", "Donny"),
                                        "dob" = c(1, 2, 3, 4, 5, 6) ) )

I want to know whether each name in the myDataSet1$firstname column appears anywhere in the myDataSet2$name column. In this case, the ideal result would look like this.

myDataSet1

    firstname  year  namematch
    Jenny      2018  1
    Jane       2019  0
    Jessica    2020  0
    Jamie      2021  1
    Hannah     2022  1

Welcome. Please read [this post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and provide a reproducible example of your data and expected output. Thank you. — user438383, Apr 26 '22 at 15:20
I am closing this question, please provide a reproducible example as suggested by @user438383. — DaveArmstrong, Apr 26 '22 at 17:03

score 0 · Accepted Answer · answered Apr 26 '22 at 15:42

Please supply an example of your data. In the meantime, I'm guessing with some made-up data:

    myDataSet1 <- as.data.frame( cbind( "PersonName" = c("Peter", "Jane", "John", "Louis", "Hannah"), 
                                        "NumberOfDogs" = c(9, 2, 5, 3, 5) ) )
    
    myDataSet2 <- as.data.frame( cbind( "Name" = c("Nora", "John", "Andy", "Louis", "Hannah", "Donny"), 
                                        "NumberOfCats" = c(1, 2, 3, 4, 5, 6) ) )
    myDataSet1
    myDataSet2
    
    # This applies an anonymous function to each name in myDataSet1$PersonName,
    # tests whether it appears anywhere in myDataSet2$Name, and returns the result as 0/1.
    myDataSet1$IsInDataSet2 <- sapply(myDataSet1$PersonName, 
                                      function(currentName) as.integer( currentName %in% myDataSet2$Name) )

Result

myDataSet1

      PersonName NumberOfDogs IsInDataSet2
    1      Peter            9            0
    2       Jane            2            0
    3       John            5            1  # contained in myDataSet2
    4      Louis            3            1  # contained in myDataSet2
    5     Hannah            5            1  # contained in myDataSet2
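
Because `%in%` is already vectorized, the same 0/1 column can probably be computed without `sapply`, which may matter when one data frame has hundreds of thousands of rows. The sketch below reuses the example data from this answer, built with `data.frame()` so the numeric columns stay numeric. `testNamePresence` is shown as one possible named-function form of the anonymous function, and `IsInDataSet2Vec` is a hypothetical column name introduced here only for comparison.

```r
# Example data from the answer; data.frame() keeps the numbers numeric.
myDataSet1 <- data.frame(
  PersonName   = c("Peter", "Jane", "John", "Louis", "Hannah"),
  NumberOfDogs = c(9, 2, 5, 3, 5),
  stringsAsFactors = FALSE
)
myDataSet2 <- data.frame(
  Name         = c("Nora", "John", "Andy", "Louis", "Hannah", "Donny"),
  NumberOfCats = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Named-function form of the sapply() call (the name is illustrative).
testNamePresence <- function(currentName) {
  as.integer(currentName %in% myDataSet2$Name)
}
myDataSet1$IsInDataSet2 <- sapply(myDataSet1$PersonName, testNamePresence,
                                  USE.NAMES = FALSE)

# Vectorized form: %in% tests every element of the left-hand vector at once.
myDataSet1$IsInDataSet2Vec <- as.integer(myDataSet1$PersonName %in% myDataSet2$Name)

# Both columns should agree.
identical(myDataSet1$IsInDataSet2, myDataSet1$IsInDataSet2Vec)
```

The vectorized line does one lookup over the whole column instead of one function call per row, so it is usually the simpler choice when no per-row logic is needed.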

L D

@L D thank you. This is exactly what I am trying to do. Essentially, look at any of the values in "PersonName" and see if any of them match the row value in my data frame. If so, give me a 1. I'm trying to run the command in R that you wrote up. It's taking quite a while (one df has 500k rows, and the other has 6k). While I wait: I don't understand what "currentName" refers to in this command. Could you explain? Thanks! – oiuerl Apr 26 '22 at 16:58
Good! The `sapply` function applies a function with a single parameter (here named `currentName`) to each element of a given vector (here `myDataSet1$PersonName`). A slightly more readable version could avoid the inline anonymous function, i.e., creating `testNamePresence` – L D Apr 26 '22 at 17:18
Thank you so much. I edited, but the question is closed and I messed up so bad that I can't ask again until tomorrow. :( Thanks so much for your help! – oiuerl Apr 26 '22 at 17:26
Good, you're welcome. Yeah, I noticed; no harm done! You can ask in the comments if needed. Also, don't forget to upvote the answer – L D Apr 26 '22 at 17:35
Oh, I tried! But I'm so new it won't let me upvote. R finally finished running the commands, but it didn't work! It returned a 0 for everything :( Thanks for trying :) – oiuerl Apr 26 '22 at 19:21
There aren't many places where this code could go wrong. It is possible, however, that the name columns in the data frame you search and in the source data frame hold _factors_ instead of _strings_. You can check this with `str(dataset)`; e.g., `str(iris)` shows that the last column of that data frame is a factor rather than a string. Before R 4.0.0, R loaded strings as factors by default unless `stringsAsFactors = FALSE` was set. This could explain why it failed. Also, you can test on part of the dataset rather than waiting too long, e.g. `head(dataset, 100)` takes the first 100 rows. – L D Apr 26 '22 at 19:29
Thank you! I tested str on my dataset and it looks like the variable of interest in both datasets is character – oiuerl Apr 26 '22 at 19:34
Good, then I would check the format of the texts (trailing spaces, other whitespace, encoding, lowercase/uppercase, etc.). If that looks fine, test the names for direct equality. For example, take any name from Dataset 1 and test whether it can be found in Dataset 2: `dataset1[1, "Name"] %in% dataset2[,"Name"]`, or better, check a known match directly: `dataset1[1, "Name"] == dataset2[12345, "Name"]`. If the answer is TRUE, something is wrong with the code. If FALSE, something is wrong with the texts – L D Apr 26 '22 at 19:45
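
The checks suggested in the comments above can be sketched roughly as follows. The data is made up to show how stray whitespace and letter case can make every match fail; it is not the asker's data. `normalizeName` is a hypothetical helper introduced here, while `str()`, `trimws()`, and `tolower()` are base R.

```r
# Made-up data: names differ only by a trailing space or letter case.
dataset1 <- data.frame(Name = c("Jenny ", "JAMIE", "Jane"), stringsAsFactors = FALSE)
dataset2 <- data.frame(Name = c("Jenny", "Jamie", "Donny"), stringsAsFactors = FALSE)

# 1. Check the column types (character vs. factor).
str(dataset1$Name)
str(dataset2$Name)

# 2. A raw comparison finds no matches because of the formatting differences.
rawMatch <- as.integer(dataset1$Name %in% dataset2$Name)

# 3. Normalize both sides before comparing (normalizeName is a hypothetical helper).
normalizeName <- function(x) tolower(trimws(as.character(x)))
cleanMatch <- as.integer(normalizeName(dataset1$Name) %in% normalizeName(dataset2$Name))

rawMatch    # 0 0 0
cleanMatch  # 1 1 0
```

If the normalized comparison finds matches that the raw one misses, the text formatting is the likely cause rather than the matching code.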
