1

What exactly does spark.sql.shuffle.partitions refer to? Are we talking of the number of partitions that is the results of a wide transformation, or something that happens in the middle as in some sort of intermediary partitioning before the result partition of the wide transformation?

Because in my understanding, as per a wide transformation we have

Parents RDDs -> shuffle files -> Child RDDs

What does the spark.sql.shuffle.partitions parameter refer to here? The shuffles files or the CHILD RDDs or something else that I ignored?

ZygD
  • 10,844
  • 36
  • 65
  • 84
MaatDeamon
  • 8,778
  • 6
  • 49
  • 112

1 Answers1

1

This is already explained in the official docs:

spark.sql.shuffle.partitions 200 Configures the number of partitions to use when shuffling data for joins or aggregations.

In other words it is the number of partitions of the child Dataset.

vinsce
  • 697
  • 6
  • 15