
In an AWS Glue job, the heap memory usage of the driver is going wild and I'm unsure what's going on.

The job has mainly two phases: joins (of Parquet-saved tables) and post-processing, where most of the transformations take place.

During the joins phase, the driver's memory consumption is in line with the workers' (max around 15-20%). If I do not broadcast the tables, I can lower the memory usage a little (10-15%). I've applied a cache/unpersist mechanism to keep only the latest version of the dataframe, unpersisting the previous one. Nonetheless, the memory usage increases linearly.
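
For reference, the cache/unpersist rotation looks roughly like this (the paths, table names and join key are just illustrative, not the real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative inputs -- the real job reads Parquet-backed Glue tables.
df = spark.read.parquet("s3://my-bucket/table_a/")

for path in ["s3://my-bucket/table_b/", "s3://my-bucket/table_c/"]:
    other = spark.read.parquet(path)
    new_df = df.join(other, on="id", how="left")

    new_df.cache()
    new_df.count()      # materialize the new cached version
    df.unpersist()      # drop the previous cached version
    df = new_df
```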

Question time: is it possible to free all the driver's memory from whatever previous steps it went through? At certain "checkpoints" all I care about is a single dataframe in its latest "version" (which I already cache, unpersisting the previously cached version). Even in this part of the job, it feels like the driver is keeping something that is not required.
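
To make the intent concrete, this is roughly what I'd like to happen at such a "checkpoint": keep only the latest dataframe and let everything that produced it be released. I haven't actually wired Spark's `checkpoint()` into the job, so treat this only as a sketch (the checkpoint directory is made up):

```python
# Sketch only: truncate the lineage with an eager checkpoint, hoping that the
# plan/state accumulated by all previous steps can then be garbage-collected.
spark.sparkContext.setCheckpointDir("s3://my-bucket/checkpoints/")  # made-up path

df = df.checkpoint(eager=True)         # writes df out and cuts its lineage
# or, without touching S3:
# df = df.localCheckpoint(eager=True)  # keeps the data on the executors instead
```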

The "serious"problem starts when I do the series of transformations. They are some df.select(...).distinct(), lots of "simple" df.withColumn (meaning tuple-wise calculations, like .withColumn('new_column', when((col('bool_column') == lit(False)) & (col('another_column') != ''), col('a_column') * -1).otherwise(col('b_column')))).

I also do some windowing operations (always specifying `Window.partitionBy(*columns)`). Previously I used `groupBy` and then a join, but I experimentally found that using `sum().over(window)` is more efficient, as it doesn't require the subsequent join; see the example below.
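
For example, a grouped sum that I used to compute with `groupBy` + join is now done with a window (column names are illustrative):

```python
from pyspark.sql import Window, functions as F

# Old approach: aggregate, then join the totals back onto the original rows.
totals = df.groupBy("group_col").agg(F.sum("amount").alias("group_total"))
df_old = df.join(totals, on="group_col", how="left")

# Current approach: the same total as a window function, no extra join needed.
w = Window.partitionBy("group_col")
df_new = df.withColumn("group_total", F.sum("amount").over(w))
```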

All of this makes the driver's memory keep increasing and never go down, eventually resulting in a heap space out-of-memory exception.

The memory usage graph follows; the split is where most of the transformations begin.

[driver memory usage graph]

Interestingly enough, the driver's peak heap usage is at 67%, which is roughly 6.7 GB of RAM, not 100% (I assume Glue does some black magic in there, or the graph reports total memory rather than just the heap).

The driver's usage vs. the workers' (their peak is at 24%) follows.

[driver vs. workers memory usage graph]

I wonder what's causing this usage on the driver.

I do not collect and I do not broadcast (at least not in the main part of the job, where most of the memory is being used). The driver's memory allocation is 10 GB.

Can you please help me?

Thanks

Jack
If all you care about is the last created dataframe, checkpointing is your best option. See [this](https://stackoverflow.com/q/35127720/8279585) SO question or refer to the Spark docs for more info. BTW, unpersisting a dataframe does not unpersist it then and there -- it takes some time to get rid of the allocation. Try `gc.collect()` to experiment. – samkart May 31 '22 at 10:23

0 Answers