
I am trying to do a cross join (from the original question here), and I have 500 GB of RAM. The problem is that the final data.table has more than 2^31 rows, so I get this error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Is there a way to override this? When I add by=.EACHI, I get the error:

  'by' or 'keyby' is supplied but not j

I know this question is not in an ideal reproducible format (my apologies!), but I am not sure that is strictly necessary for an answer. Am I missing something, or is data.table simply limited in this way?

I am aware only of this question from 2013, which seems to suggest data.table could not do this back then.

This is the code that causes the error:

  pfill = q[, k := t + 1][q2[, k := tprm], on = .(k), nomatch = 0L, allow.cartesian = TRUE][, k := NULL]
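
For context, the size of such a join can be estimated before running it by counting rows per key on each side and summing the products. The following is only a rough sketch on toy data: the small `q` and `q2` tables and the columns `x` and `y` are hypothetical stand-ins, since the real tables are not shown here.

```r
library(data.table)

# Toy stand-ins for the real tables (hypothetical data)
q  <- data.table(t = c(1L, 1L, 2L), x = 1:3)
q2 <- data.table(tprm = c(2L, 2L, 3L), y = 1:3)

# Number of rows per join key on each side (k = t + 1 in q, k = tprm in q2)
nq  <- q[, .(n_q = .N), by = .(k = t + 1L)]
nq2 <- q2[, .(n_q2 = .N), by = .(k = tprm)]

# An inner join produces n_q * n_q2 rows for every key present in both tables.
# as.numeric() avoids integer overflow when the products get large.
est <- nq[nq2, on = .(k), nomatch = 0L][, sum(as.numeric(n_q) * n_q2)]

est                          # estimated number of rows in the join
est > .Machine$integer.max   # TRUE would mean the result exceeds 2^31 - 1 rows
```

On the real data, an `est` above `.Machine$integer.max` would confirm that the row limit in the error message, rather than available memory, is what blocks the join.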
wolfsatthedoor
  • did you pass in `allow.cartesian=TRUE`? Can you show your code that causes this error? – chinsoon12 Apr 08 '20 at 00:04
  • @chinsoon12 Good to see you! You actually helped me a long while back with the join! :) – wolfsatthedoor Apr 08 '20 at 00:07
  • hi @wolfsatthedoor, do you really need all the rows from the join? Or can you add one more joining key? It is probably a many-to-many join causing the huge allocation. I think there are some discussions on this in github/rdatatable that you might want to check out – chinsoon12 Apr 08 '20 at 00:13
  • @chinsoon12 I really do need all the rows, unfortunately. Is data.table simply unable to handle any data table with more than 2 billion rows? – wolfsatthedoor Apr 08 '20 at 00:32
  • You might need to search GitHub for discussions as I don’t have access right now – chinsoon12 Apr 08 '20 at 00:38
  • see https://github.com/Rdatatable/data.table/issues/3957 – jangorecki Apr 08 '20 at 18:55
