Insert values in one column in dataframe based on another reference table

Question

I currently have a product table and an ingredient table, as follows:

Product (actual: 13 columns, truncated for ease of explanation):

Product_code  Product_name  Ingredient  Brand_name
A123          Product1      NA          BrandA
B123          Product2      NA          BrandB
C123          Product3      NA          BrandC

Ingredient (actual has 2 columns, as shown):

Product_code  Ingredient
A123          Ing1
A123          Ing2
B123          IngA
B123          IngB
B123          IngC

The aim is "fill in" the ingredient details for each product in the "Product" table such that it appears as such:

Combined table:

Product_code  Product_name  Ingredient  Brand_name
A123          Product1      Ing1        BrandA
A123          Product1      Ing2        BrandA
B123          Product2      IngA        BrandB
B123          Product2      IngB        BrandB
B123          Product2      IngC        BrandB
C123          Product3      NA          BrandC

As some products have one ingredient while others have multiple, I want to be able to duplicate new rows for products with multiple ingredients (all other column values to retain the same except for the "Ingredient" column). Products without data in the "Ingredients" table will remain as "NA".

I have tried using dplyr::right_join but it left me with "NA" values for all the other columns other than "Product_code" and "Ingredient". Hoping to get some advice on how to proceed with this. Thanks!

Don't you need `full_join` instead? `full_join(Product, Ingredient, by = 'Product_code')` — Ronak Shah, Apr 07 '20 at 06:40

score 3 · Accepted Answer · answered Apr 07 '20 at 06:40

I think you are looking for full_join. It is also OK to remove the Ingredient column in the Product data frame as it does not provide information of the ingredient.

library(dplyr)

combined <- Ingrident %>%
  full_join(Product %>% select(-Ingredient), by = c("Product_code"))
combined
#   Product_code Ingredient Product_name Brand_name
# 1         A123       Ing1     Product1     BrandA
# 2         A123       Ing2     Product1     BrandA
# 3         B123       IngA     Product2     BrandB
# 4         B123       IngB     Product2     BrandB
# 5         B123       IngC     Product2     BrandB
# 6         C123       <NA>     Product3     BrandC

DATA

Product <- read.table(text = "Product_code  Product_name  Ingredient  Brand_name
A123          Product1      NA          BrandA
B123          Product2      NA          BrandB
C123          Product3      NA          BrandC",
                  header = TRUE, stringsAsFactors = FALSE)

Ingredient <- read.table(text = "Product_code  Ingredient
A123          Ing1
A123          Ing2
B123          IngA
B123          IngB
B123          IngC",
                        header = TRUE, stringsAsFactors = FALSE)

@Prahlad Thanks for the base R solution. Two suggestion. 1. Do not use an object name as `c` because `c` is a function in R and that causes confusions. 2. Remove the `Ingredient` column in the `Product` data frame before the merge otherwise there will be duplicated `Ingredient` column with `.x` and `.y` in the column names. — www, Apr 07 '20 at 06:55

score 0 · Answer 2 · answered Apr 07 '20 at 06:35

0

If you don't want NAs, look at dplyr::inner_join. With right join you are asking as much rows as your ingredients dataframe. Non matching keys in the left dataframe will have NAs

answered Apr 07 '20 at 06:35

linog

5,350
3
13
23

thanks @linog! I previously tried `inner_join` but it gave me less rows than my original "Product" table, that's why I considered `right_join`. I just tried `inner_join` again and it worked! Realised I had to `group_by` "Product_code" first before joining. Thanks alot again! – Desmond Apr 07 '20 at 06:42

Insert values in one column in dataframe based on another reference table

2 Answers2