I have four data frames containing data from four countries. Each data frame contains data from only one country. First, I create a variable Country so that when I join I know which IDs belong to which countries.
I want to join all four together. The problem comes when I use lapply to full_join all my data frames. Since I'm holding the sample_MAL data frame constant in the full_join it duplicates sample_MAL in my output data frame each time it loops. I obviously need it to not do that. Since I could have as many as 100 country data frames, I'd like a looping solution to avoid having to do it all manually.
#Create a data frame with each of the vendor IDs, labeled by country
IND <- tibble(rid = 1:10, PID = 1:10)
MAL <- tibble(rid = 11:24, PID = 11:24)
US <- tibble(rid = 25:41,PID = 25:41)
HK <- tibble(rid = 42:99, PID = 42:99)
#Adds the Country variable with values in each data frame
sample_IND <- IND %>% mutate(Country = "Indonesia") %>% select(PID,rid,Country)
sample_MAL <- MAL %>% mutate(Country = "Malaysia") %>% select(PID,rid,Country)
sample_HK <- HK %>% mutate(Country = "Hong Kong") %>% select(PID,rid,Country)
sample_US <- US %>% mutate(Country = "United States") %>% select(PID,rid,Country)
#Creates a list of 3 of my 4 data frames
df.list <- list(sample_IND,sample_HK,sample_US)
#Apply inner_joins across the list to the sample_MAL data frame
joined_df <- lapply(df.list, FUN = function(var){
full_join(sample_MAL,var)})
new.df <- bind_rows(joined_df)
When I summarize the data, I see that sample_MAL has been duplicated 3 times, as it has 42n when it should have 14n.
new.df %>%
group_by(Country) %>%
summarize(
n = n()
)
I know that I can use unique() to dedupe, which works fine is this case, but I'm hoping to find a solution that simply joins all files correctly the first time.