What happens when a Spark DataFrame is converted to a pandas DataFrame using the toPandas() method?

Question

I have a Spark DataFrame which I can convert to a pandas DataFrame using the toPandas() method available in PySpark.

I have the following questions about this:

Does this conversion defeat the purpose of using Spark in the first place (distributed computing)?
The dataset is going to be huge, so what about speed and memory issues?
If somebody could also explain what exactly happens with this one line of code, that would really help.

Thanks

WoodChopper · Answer 1 · 2016-05-30T13:30:35.083

6

Yes. Once toPandas() is called on a Spark DataFrame, the data leaves the distributed system: all rows are collected to the driver, and the new pandas DataFrame lives in the memory of the cluster's driver node.

If the Spark DataFrame is huge and does not fit into driver memory, the job will crash.
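One way to reduce the risk of that crash is to check the size on the cluster before converting. The following is only a minimal sketch: the helper name to_pandas_guarded and the max_rows threshold are made up for illustration, and row count is only a rough proxy for how much driver memory the result needs. It uses nothing beyond the count() and toPandas() methods of a Spark DataFrame.

```python
def to_pandas_guarded(sdf, max_rows=100_000):
    """Convert a Spark DataFrame to pandas only if it looks small enough.

    Hypothetical helper for illustration; `max_rows` is an arbitrary
    threshold, not a Spark setting.
    """
    # count() is computed by the executors; only one number is sent
    # back to the driver, so this step stays distributed.
    n_rows = sdf.count()
    if n_rows > max_rows:
        raise ValueError(
            f"Refusing to collect {n_rows} rows to the driver "
            f"(limit is {max_rows}); filter or aggregate first."
        )
    # toPandas() pulls every row to the driver and builds a local,
    # single-machine pandas DataFrame from them.
    return sdf.toPandas()
```

In practice the usual pattern is to do the heavy filtering, joining and aggregating in Spark and call something like this only on the small final result.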

edited May 30 '16 at 13:30

answered May 28 '16 at 14:15
