cartesian product with dplyr R

Question

I'm trying to find the dplyr function for cartesian product. I've two simple data.frame with no common variable:

x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))

I would like to reproduce the result of

merge(x,y)

  x y
1 a 1
2 b 1
3 c 1
4 a 2
5 b 2
6 c 2
7 a 3
8 b 3
9 c 3

I've already looked for this (for example here or here) without finding anything useful.

Thank you very much

`expand.grid(x$x,y$y)` or `CJ(x$x, y$y)` from data.table – akrun Apr 05 '17 at 10:23 — akrun, Apr 05 '17 at 10:23

score 51 · Accepted Answer · answered Apr 03 '18 at 13:03

51

Use crossing from the tidyr package:

x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))

crossing(x, y)

Result:

answered Apr 03 '18 at 13:03

Gregor Sturm

2,494
24
31

dsz · Answer 2 · 2019-02-27T06:59:05.250

8

Apologies to all: the below example does not appear to work with data.frames or data.tables.

When x and y are database tbls (tbl_dbi / tbl_sql) you can now also do:

full_join(x, y, by = character())

Added to dplyr at the end of 2017, and also gets translated to a CROSS JOIN in the DB world. Saves the nastiness of having to introduce the fake variables.

edited Feb 27 '19 at 06:59

answered Jun 21 '18 at 14:22

dsz

3,927
33
31

3

I'm running with dplyr 0.7.6 and this gives the error `Error in full_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches) : by must specify variables to join by` for the given example. Any ideas? – Lyngbakr Aug 29 '18 at 17:29
Same error as @Lyngbakr with dplyr 0.7.8. Did this ever appear in dplyr? Absent this, I use `crossing()`, as in the accepted answer. – banbh Feb 25 '19 at 14:39
@andyyy What was the underlying type when this error occur? I've seen odd and different behaviours especially on data.tables. – dsz Feb 26 '19 at 21:41

score 6 · Answer 3 · answered Apr 05 '17 at 10:28

If we need a tidyverse output, we can use expand from tidyr

library(tidyverse)
y %>% 
   expand(y, x= x$x) %>%
   select(x,y)
# A tibble: 9 × 2
#       x     y
#  <fctr> <dbl>
#1      a     1
#2      b     1
#3      c     1
#4      a     2
#5      b     2
#6      c     2
#7      a     3
#8      b     3
#9      c     3

score 5 · Answer 4 · answered May 23 '17 at 10:32

When faced with this problem, I tend to do something like this:

x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
x %>% mutate(temp=1) %>% 
inner_join(y %>% mutate(temp=1),by="temp") %>%
dplyr::select(-temp)

If x and y are multi-column data frames, but I want to do every combination of a row of x with a row of y, then this is neater than any expand.grid() option that I can come up with

score 4 · Answer 5 · edited Feb 13 '19 at 09:05

4

expand.grid(x=c("a","b","c"),y=c(1,2,3))

Edit: Consider also this following elegant solution from "Y T" for n more complex data.frame :

https://stackoverflow.com/a/21911221/5350791

in short:

expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))
expand.grid.df(df1, df2, df3)

edited Feb 13 '19 at 09:05

markus

24,556
5
34
51

answered Apr 05 '17 at 10:28

Damien Cormann

189
6

There is also [expand_grid](https://tidyr.tidyverse.org/reference/expand_grid.html) in package tidyr. – Jean-Claude Arbaut Dec 05 '20 at 11:51

Aaron C · Answer 6 · 2019-11-27T20:11:07.280

4

This is a continuation of dsz's comment. Idea came from: http://jarrettmeyer.com/2018/07/10/cross-join-dplyr.

tbl_1$fake <- 1
tbl_2$fake <- 1
my_cross_join <- full_join(tbl_1, tbl_2, by = "fake") %>%
                 select(-fake)

I tested this on four columns of data ranging in size from 4 to 640 obs, and it took about 1.08 seconds.

edited Nov 27 '19 at 20:11

answered Dec 17 '18 at 02:53

Aaron C

109
7

I like this one. It's nice and simple, but not very performant in my experience. – talbe009 Nov 26 '19 at 20:53

score 1 · Answer 7 · answered Apr 23 '21 at 17:17

Using two answers above, using full_join() with by = character() seems to be faster:

library(tidyverse)
library(microbenchmark)

df <- data.frame(blah = 1:10)

microbenchmark(diamonds %>% crossing(df))
Unit: milliseconds
                      expr      min       lq     mean   median       uq     max neval
 diamonds %>% crossing(df) 21.70086 22.63943 23.72622 23.01447 24.25333 30.3367   100
microbenchmark(diamonds %>% full_join(df, by = character()))
Unit: milliseconds
                                         expr      min       lq     mean   median       uq      max neval
 diamonds %>% full_join(df, by = character()) 9.814783 10.23155 10.76592 10.44343 11.18464 15.71868   100

cartesian product with dplyr R

7 Answers7

Linked

Related