I have a list of names in one dataset and a column for 'name' in another dataset. I want R to give me a new column that says 1 if any of the names in my first dataset appear in the column 'name' in that row. In other words, I want it to go row by row and, for the value in that row's cell, look in my first dataset. If the value appears in my first dataset, I want it coded as a 1 in a new column. Can you help? I apologize for not providing the data structure; it's my first time posting. Here is what I am trying to do.

    myDataSet1 <- as.data.frame( cbind( "firstname" = c("Jenny", "Jane", "Jessica", "Jamie", "Hannah"), "year" = c(2018, 2019, 2020, 2021, 2022) ) )

    myDataSet2 <- as.data.frame( cbind( "name" = c("Jenny", "John", "Andy", "Jamie", "Hannah", "Donny"), "dob" = c(1, 2, 3, 4, 5, 6) ) )

I want to know whether the name in each row of myDataSet1$firstname appears anywhere in the myDataSet2$name column. In this case, the ideal result would look like this.

    myDataSet1

    firstname  year  namematch
    Jenny      2018  1
    Jane       2019  0
    Jessica    2020  0
    Jamie      2021  1
    Hannah     2022  1
L D
oiuerl
  • Welcome. Please read [this post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and provide a reproducible example of your data and expected output. Thank you. – user438383 Apr 26 '22 at 15:20
  • I am closing this question, please provide a reproducible example as suggested by @user438383. – DaveArmstrong Apr 26 '22 at 17:03
  • I added details now, sorry about that. – oiuerl Apr 26 '22 at 17:11

1 Answer

Please supply an example of your data. In the meantime, I'm guessing with some made-up data:

    myDataSet1 <- as.data.frame( cbind( "PersonName" = c("Peter", "Jane", "John", "Louis", "Hannah"), 
                                        "NumberOfDogs" = c(9, 2, 5, 3, 5) ) )
    
    myDataSet2 <- as.data.frame( cbind( "Name" = c("Nora", "John", "Andy", "Louis", "Hannah", "Donny"), 
                                        "NumberOfCats" = c(1, 2, 3, 4, 5, 6) ) )
    myDataSet1
    myDataSet2
    
    # Apply an anonymous function to each name in myDataSet1$PersonName:
    # test whether it appears anywhere in myDataSet2$Name and return the result as 0/1.
    myDataSet1$IsInDataSet2 <- sapply(myDataSet1$PersonName, 
                                      function(currentName) as.integer( currentName %in% myDataSet2$Name) ) 

Result:

    myDataSet1

      PersonName NumberOfDogs IsInDataSet2
    1      Peter            9            0
    2       Jane            2            0
    3       John            5            1  # contained in myDataSet2
    4      Louis            3            1  # contained in myDataSet2
    5     Hannah            5            1  # contained in myDataSet2
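
Since `%in%` is already vectorized in base R, the `sapply` wrapper is not strictly needed. Here is a minimal sketch, assuming the same toy data as above (rebuilt with `data.frame()` only so the block is self-contained). It should produce the same 0/1 column, and it will likely run much faster on large inputs because the lookup is done in a single call rather than once per row.

```r
# Same toy data as in the answer, rebuilt so this block runs on its own.
myDataSet1 <- data.frame(PersonName   = c("Peter", "Jane", "John", "Louis", "Hannah"),
                         NumberOfDogs = c(9, 2, 5, 3, 5),
                         stringsAsFactors = FALSE)
myDataSet2 <- data.frame(Name         = c("Nora", "John", "Andy", "Louis", "Hannah", "Donny"),
                         NumberOfCats = c(1, 2, 3, 4, 5, 6),
                         stringsAsFactors = FALSE)

# %in% tests every element of the left-hand vector against the right-hand
# vector in one call and returns a logical vector; as.integer() turns
# TRUE/FALSE into 1/0.
myDataSet1$IsInDataSet2 <- as.integer(myDataSet1$PersonName %in% myDataSet2$Name)

print(myDataSet1)
```

With real data, only the two column references would need to change.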
L D
  • @L D thank you. This is exactly what I am trying to do. Essentially, look at the values in "PersonName" and see if any of them match the row value in my data frame. If so, give me a 1. I'm trying to execute the command in R that you wrote up. It's taking quite a while (one df has 500k rows, and the other has 6k). As I wait, I'm wondering: I don't understand what "currentName" refers to in this command. Could you explain? Thanks! – oiuerl Apr 26 '22 at 16:58
  • Good! The `sapply` function applies a function with a single parameter (here named `currentName`) to each element of a given vector (here `myDataSet1$PersonName`). A slightly more readable version could avoid the inline anonymous function, i.e., by creating a named function such as `testNamePresence` – L D Apr 26 '22 at 17:18
  • Thank you so much. I edited, but the question is closed and I messed up so bad that I can't ask again until tomorrow. :( Thanks so much for your help! – oiuerl Apr 26 '22 at 17:26
  • Good, you're welcome. Yeah, I've noticed; no harm done! You can ask in the comments if needed. Also, do not forget to upvote the answer – L D Apr 26 '22 at 17:35
  • Oh, I tried! But I'm so new it won't let me upvote. R finally finished running the commands, but it didn't work! It returned a 0 for everything :( Thanks for trying :) – oiuerl Apr 26 '22 at 19:21
  • There aren't many places where this code could go wrong. It is possible, however, that the data frame you search in and the source data frame have _factors_ instead of _strings_ in the name columns. You can check this with the function `str(dataset)`; e.g., `str(iris)` shows that the last column of that data frame is a factor rather than a string. Before R 4.0.0, R loaded strings as factors by default unless `stringsAsFactors = FALSE` was set. This could explain why it failed. You can also test on part of the dataset rather than waiting too long; e.g., `head(dataset, 100)` takes the first 100 rows. – L D Apr 26 '22 at 19:29
  • Thank you! I tested str on my dataset and it looks like the variable of interest in both datasets is character – oiuerl Apr 26 '22 at 19:34
  • Good, then I would check the format of the texts (leading/trailing spaces, other whitespace, encoding, lowercase/uppercase, etc.), and if this looks fine, test direct equivalence of the names. For example, take any name from Dataset 1 and test whether it can be found in Dataset 2: `dataset1[1, "Name"] %in% dataset2[,"Name"]`, or better, check the match directly: `dataset1[1, "Name"] == dataset2[12345, "Name"]`. If the answer is TRUE, something is wrong with the code. If FALSE, something is wrong with the texts – L D Apr 26 '22 at 19:45
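
The formatting checks suggested in the last comment can be sketched roughly as follows. This is only an illustration under assumptions: `normalizeName` is a hypothetical helper, not something from the thread, and the two name vectors are made-up values with deliberate whitespace and case differences. It shows how an exact match can return 0 everywhere while a match on trimmed, lowercased names finds the overlaps.

```r
# Hypothetical helper (an assumption for illustration): strip leading and
# trailing whitespace, then lowercase, so cosmetic differences do not
# block a match.
normalizeName <- function(x) tolower(trimws(x))

# Made-up example values with deliberate formatting noise.
namesA <- c("Jenny ", "JANE", "Jamie")
namesB <- c("jenny", "Jane", "Hannah")

# Exact comparison: nothing matches because of the trailing space and case.
rawMatch <- as.integer(namesA %in% namesB)

# Comparison after normalizing both sides.
cleanMatch <- as.integer(normalizeName(namesA) %in% normalizeName(namesB))

print(rawMatch)
print(cleanMatch)
```

If the two results differ on a sample of the real data, the problem is most likely in the text formatting rather than in the matching code.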