Let's say we have the following scenario:

val partitionedTabA = notPartitionedTabA.repartition($"id")
val joinedTab = notPartitionedTabB.join(partitionedTabA, "id")
  1. Do we need to repartition notPartitionedTabB as well if we want to avoid a shuffle? The two tables don't have the same partitioner. I see people sometimes explicitly repartition both RDDs/DataFrames/Datasets (e.g. https://towardsdatascience.com/should-i-repartition-836f7842298c, Example I) and sometimes not (e.g. "Partition data for efficient joining for Spark dataframe/dataset").
  2. Will partitionedTabA, notPartitionedTabB and joinedTab have the same partitioning after the join?
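
Both questions can be probed empirically by inspecting the physical plan, where shuffles appear as Exchange nodes. The following is only a minimal sketch, not a definitive answer: the sample rows, the local master, the object name JoinPartitioningCheck and the helper planText are made up for illustration, and broadcast joins and adaptive query execution are switched off on the assumption that this makes the static plan easier to read.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinPartitioningCheck {
  // Hypothetical helper: the physical plan as text, so it can be searched for "Exchange".
  def planText(df: DataFrame): String = df.queryExecution.executedPlan.toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("join-partitioning-check")
      // Assumptions for readability: no broadcast join, no adaptive re-planning.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the two tables in the question.
    val notPartitionedTabA = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "a")
    val notPartitionedTabB = Seq((1, "b1"), (2, "b2"), (4, "b4")).toDF("id", "b")

    val partitionedTabA = notPartitionedTabA.repartition($"id")
    val joinedTab = notPartitionedTabB.join(partitionedTabA, "id")

    // Question 1: look at which side(s) of the join sit under an Exchange node.
    joinedTab.explain()

    // Question 2: compare the partitioning Spark reports for each DataFrame.
    Seq(
      "partitionedTabA"    -> partitionedTabA,
      "notPartitionedTabB" -> notPartitionedTabB,
      "joinedTab"          -> joinedTab
    ).foreach { case (name, df) =>
      val plan = df.queryExecution.executedPlan
      println(s"$name: ${df.rdd.getNumPartitions} partitions, ${plan.outputPartitioning}")
    }

    spark.stop()
  }
}
```

Running the same check with notPartitionedTabB.repartition($"id") in place of notPartitionedTabB should show whether the explicit repartition changes the number of Exchange nodes in the plan.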
atos