
How to find the size (in MB) of a dataframe in pyspark

df = spark.read.json("/Filestore/tables/test.json")

I want to find the size of df, or of test.json.

Aravindh
Do these answer your question? [How to estimate dataframe real size in pyspark?](https://stackoverflow.com/questions/37077432/how-to-estimate-dataframe-real-size-in-pyspark), https://stackoverflow.com/questions/46228138/how-to-find-pyspark-dataframe-memory-usage, https://stackoverflow.com/questions/39652767/pyspark-2-0-the-size-or-shape-of-a-dataframe – mazaneicha Jun 16 '20 at 15:19

1 Answer


In general this is not easy. You can:

  • use org.apache.spark.util.SizeEstimator (see the first sketch below)
  • use an approach which involves caching, see e.g. https://stackoverflow.com/a/49529028/1138523 (second sketch below)
  • use df.inputFiles() and another API to get the file size directly; I did this using the Hadoop FileSystem API (How to get file size). Note that this only works if the dataframe was not filtered/aggregated (third sketch below)
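
A minimal sketch of the SizeEstimator route, assuming PySpark as in the question. SizeEstimator is a JVM class, so the call goes through the py4j gateway; `spark._jvm` and `df._jdf` are internal handles rather than public API and may change between versions:

```python
df = spark.read.json("/Filestore/tables/test.json")

# Estimate the in-memory footprint via Spark's JVM-side SizeEstimator.
# Caveat: applied to the Dataset wrapper, this measures the object graph
# (including the query plan), which may not match the size of the data.
size_bytes = spark._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
print("Estimated size: {:.2f} MB".format(size_bytes / (1024 * 1024)))
```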
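
The caching approach from the linked answer, sketched here: materialize the dataframe, then read Catalyst's size statistics. The `_jsparkSession`/`executePlan` internals below match Spark 2.x and have changed in later releases, so treat this as version-dependent:

```python
df = spark.read.json("/Filestore/tables/test.json")

# Materialize the dataframe so the statistics reflect the actual data.
df.cache()
df.count()

# Read sizeInBytes from the optimized plan's statistics (internal API).
catalyst_plan = df._jdf.queryExecution().logical()
size_bytes = (spark._jsparkSession.sessionState()
              .executePlan(catalyst_plan)
              .optimizedPlan().stats().sizeInBytes())
print("Estimated size: {:.2f} MB".format(size_bytes / (1024 * 1024)))
```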
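
And a sketch of the third option, summing the on-disk size of the files behind df.inputFiles() with the Hadoop FileSystem API (again through py4j). This measures the source files rather than the dataframe itself, which is why it only makes sense when the dataframe has not been filtered or aggregated:

```python
df = spark.read.json("/Filestore/tables/test.json")

# Sum the on-disk size of every file backing the dataframe.
hadoop_conf = spark._jsc.hadoopConfiguration()
total_bytes = 0
for file_uri in df.inputFiles():
    path = spark._jvm.org.apache.hadoop.fs.Path(file_uri)
    fs = path.getFileSystem(hadoop_conf)
    total_bytes += fs.getFileStatus(path).getLen()

print("Input size: {:.2f} MB".format(total_bytes / (1024 * 1024)))
```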
Raphael Roth