Let's say we have the following scenario:
val partitionedTabA = notPartitionedTabA.repartition($"id")
val joinedTab = notPartitionedTabB.join(partitionedTabA, "id")
- Do we need to repartition notPartitionedTabB too if we want to avoid a shuffle? They don't have the same partitioner. I see people sometimes explicitly repartitioning both RDDs/DataFrames/Datasets (e.g. https://towardsdatascience.com/should-i-repartition-836f7842298c, Example I) and sometimes not (e.g. "Partition data for efficient joining for Spark dataframe/dataset").
- Will partitionedTabA, notPartitionedTabB and joinedTab have the same partitioning after the join?
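
For reference, here is a minimal sketch of how the scenario could be reproduced and inspected locally. The sample data, the column names valueA/valueB, the local SparkSession, and the disabled broadcast threshold are my assumptions for illustration, not part of the real setup. It only prints the physical plan and the partition counts so the shuffles can be checked by eye.

```scala
import org.apache.spark.sql.SparkSession

object JoinPartitioningCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-partitioning-check")
      // Assumption: disable broadcast joins so that the tiny sample tables
      // are not simply broadcast instead of being joined via a shuffle.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real tables.
    val notPartitionedTabA = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "valueA")
    val notPartitionedTabB = Seq((1, "b1"), (2, "b2"), (4, "b4")).toDF("id", "valueB")

    // Same two statements as in the scenario above.
    val partitionedTabA = notPartitionedTabA.repartition($"id")
    val joinedTab = notPartitionedTabB.join(partitionedTabA, "id")

    // Print the physical plan; Exchange nodes in the output mark shuffles.
    joinedTab.explain()

    // Print the number of partitions of each side and of the join result.
    println(s"partitionedTabA:    ${partitionedTabA.rdd.getNumPartitions}")
    println(s"notPartitionedTabB: ${notPartitionedTabB.rdd.getNumPartitions}")
    println(s"joinedTab:          ${joinedTab.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Comparing the plan with and without an extra notPartitionedTabB.repartition($"id") should show whether the second repartition changes the number of Exchange nodes.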