
I have a Spark DataFrame, which I can convert to a pandas DataFrame using the toPandas() method available in PySpark.

I have the following questions about this:

  1. Does this conversion defeat the purpose of using Spark itself (distributed computing)?
  2. The dataset is going to be huge, so what about speed and memory issues?
  3. If somebody could also explain what exactly happens with this one line of code, that would really help.

Thanks

function

1 Answer


Yes. Once toPandas() is called on a Spark DataFrame, the data leaves the distributed system: all rows are collected to the driver, and the new pandas DataFrame lives in the memory of the cluster's driver node.

If the Spark DataFrame is huge and does not fit into driver memory, the driver will run out of memory and crash.
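
One way to avoid that crash is to check the size before converting. The following is only a minimal sketch: the helper name to_pandas_guarded and the max_rows threshold are made up for illustration and are not part of PySpark, and a row count is only a rough proxy for memory use. It assumes sdf is a pyspark.sql.DataFrame and uses only its limit(), count() and toPandas() methods.

```python
def to_pandas_guarded(sdf, max_rows=100_000):
    """Convert a Spark DataFrame to pandas only if it is small enough.

    `sdf` is assumed to be a pyspark.sql.DataFrame.
    `max_rows` is an arbitrary illustrative threshold, not a Spark setting.
    """
    # limit() is lazy; count() is computed on the executors and only a
    # single integer comes back to the driver, so this check does not
    # pull the data itself. Limiting to max_rows + 1 avoids a full count.
    n = sdf.limit(max_rows + 1).count()
    if n > max_rows:
        raise ValueError(
            f"DataFrame has more than {max_rows} rows; "
            "filter or aggregate it in Spark before calling toPandas()"
        )
    # toPandas() collects every row to the driver and builds one
    # in-memory pandas DataFrame there, outside the distributed system.
    return sdf.toPandas()
```

In practice this usually means doing the heavy work (filters, joins, aggregations) in Spark and converting only the small result, for example to_pandas_guarded(spark_df.groupBy("key").count()), where spark_df and "key" are placeholder names.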

WoodChopper