Spark Partitioning and avoiding shuffle

Asked Oct 13 '21 at 21:23

Let's say we have the following scenario:

val partitionedTabA = notPartitionedTabA.repartition($"id")
val joinedTab = notPartitionedTabB.join(partitionedTabA, "id")

Do we need to repartition notPartitionedTabB on "id" as well if we want to avoid a shuffle during the join? As written, the two sides don't have the same partitioner. I see people sometimes explicitly repartitioning both RDDs/DataFrames/Datasets (e.g. https://towardsdatascience.com/should-i-repartition-836f7842298c - Example I) and sometimes only one of them (e.g. Partition data for efficient joining for Spark dataframe/dataset).
Will partitionedTabA, notPartitionedTabB, and joinedTab have the same partitioning after the join?
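
One way to check this empirically is to print the physical plan of the join and look for Exchange nodes, since each one corresponds to a shuffle. The following is only a minimal sketch: the local SparkSession, the object name ShuffleCheck, the tiny sample tables, and the two config settings (broadcast joins and adaptive execution turned off so the plan stays easy to read) are assumptions for illustration and not part of the original scenario.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver object; only the two val lines marked below come from the question.
object ShuffleCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("shuffle-check")
      // Assumption: disable broadcast joins so the tiny demo tables are not broadcast.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      // Assumption: disable adaptive execution so the printed plan is the final one.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._ // needed for the $"id" column syntax and toDF

    // Made-up sample data standing in for the real tables.
    val notPartitionedTabA = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "a")
    val notPartitionedTabB = Seq((1, "b1"), (2, "b2"), (4, "b4")).toDF("id", "b")

    // The scenario from the question.
    val partitionedTabA = notPartitionedTabA.repartition($"id")
    val joinedTab = notPartitionedTabB.join(partitionedTabA, "id")

    // Every "Exchange" node in the printed physical plan is a shuffle;
    // check which side(s) of the join have one.
    joinedTab.explain()

    // Compare the partition counts of the inputs and of the join result.
    println(s"partitionedTabA:    ${partitionedTabA.rdd.getNumPartitions}")
    println(s"notPartitionedTabB: ${notPartitionedTabB.rdd.getNumPartitions}")
    println(s"joinedTab:          ${joinedTab.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Adding a repartition($"id") on notPartitionedTabB as well and comparing the two printed plans should show whether the extra explicit repartition changes the number of Exchange nodes.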

asked Oct 13 '21 at 21:23

atos

0 Answers