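A suitable number of partitions for data read from files can be estimated with the formula below: total input size divided by the target partition size, rounded up.
Step 1: Find the total size of the file or folder at the given path. I tested this on a local filesystem; for S3 or HDFS, compute the size in whatever way suits your storage.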
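import os

def find_folder_size(path):
    """Return the total size in bytes of all files under path, recursively."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_file():
            total += entry.stat().st_size
        elif entry.is_dir():
            # Recurse into subdirectories and add their sizes.
            total += find_folder_size(entry.path)
    return total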
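For S3, the same total can be obtained by summing object sizes under a prefix. The following is only a minimal sketch, assuming a boto3 S3 client is passed in; the function name `find_s3_prefix_size` is hypothetical and not part of the original answer. It relies on the `list_objects_v2` paginator, where each listed object carries a `Size` field in bytes.

```python
def find_s3_prefix_size(s3_client, bucket, prefix):
    """Sum the sizes (in bytes) of all objects under an S3 prefix.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    total = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page has no "Contents" key when no objects match the prefix.
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total
```

The result is in bytes, so it can be used in place of `find_folder_size(...)` in the next step.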
Step 2: Apply the formula.
import math

target_partition_size = 200  # target partition size in MB, e.g. 100 or 200
total_size = find_folder_size(paths)  # paths: the input location that df was read from
print('Total size: {}'.format(total_size))
# Convert bytes to MB, divide by the target size and round up (never below 1).
num_partitions = max(1, int(math.ceil(total_size / 1024.0 / 1024.0 / float(target_partition_size))))
print(num_partitions)
PARTITION_COLUMN_NAME = ['a', 'c']  # columns to partition by
df = df.repartition(num_partitions, *PARTITION_COLUMN_NAME)
or
df = df.repartition(num_partitions)
This approach works for both large and small datasets, because the number of partitions scales with the input size.
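To make the arithmetic concrete, here is a small worked sketch of the same formula without Spark. The helper name `estimate_num_partitions` and the sample sizes are illustrative assumptions, not part of the original code.

```python
import math

def estimate_num_partitions(total_size_bytes, target_partition_mb=200):
    """Round up (size in MB / target partition size in MB), with a floor of 1."""
    size_mb = total_size_bytes / 1024.0 / 1024.0
    return max(1, int(math.ceil(size_mb / target_partition_mb)))

ten_gib = 10 * 1024 ** 3                            # 10240 MB of input
print(estimate_num_partitions(ten_gib))             # 10240 / 200 = 51.2 -> 52
print(estimate_num_partitions(ten_gib, 100))        # 10240 / 100 = 102.4 -> 103
print(estimate_num_partitions(50 * 1024 ** 2))      # 50 / 200 = 0.25 -> 1
```

Halving the target partition size roughly doubles the partition count, and very small inputs still get one partition.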