0

I have four data frames containing data from four countries. Each data frame contains data from only one country. First, I create a variable Country so that when I join I know which IDs belong to which countries.

I want to join all four together. The problem comes when I use lapply to full_join all my data frames. Since I'm holding the sample_MAL data frame constant in the full_join it duplicates sample_MAL in my output data frame each time it loops. I obviously need it to not do that. Since I could have as many as 100 country data frames, I'd like a looping solution to avoid having to do it all manually.

#Create a data frame with each of the vendor IDs, labeled by country
IND <- tibble(rid = 1:10, PID = 1:10)
MAL <- tibble(rid = 11:24, PID = 11:24)
US <- tibble(rid = 25:41,PID = 25:41)
HK <- tibble(rid = 42:99, PID = 42:99)

#Adds the Country variable with values in each data frame
sample_IND <- IND %>% mutate(Country = "Indonesia") %>% select(PID,rid,Country)
sample_MAL <- MAL %>% mutate(Country = "Malaysia") %>% select(PID,rid,Country)
sample_HK <- HK %>% mutate(Country = "Hong Kong") %>% select(PID,rid,Country)
sample_US <- US %>% mutate(Country = "United States") %>% select(PID,rid,Country)

#Creates a list of 3 of my 4 data frames
df.list <- list(sample_IND,sample_HK,sample_US)

#Apply inner_joins across the list to the sample_MAL data frame
joined_df <- lapply(df.list, FUN = function(var){
                full_join(sample_MAL,var)})

new.df <- bind_rows(joined_df)

When I summarize the data, I see that sample_MAL has been duplicated 3 times, as it has 42n when it should have 14n.

new.df %>% 
    group_by(Country) %>% 
    summarize(
        n = n()
    )

I know that I can use unique() to dedupe, which works fine is this case, but I'm hoping to find a solution that simply joins all files correctly the first time.

0 Answers0