
My development environment is a single-user workstation with 4 cores, not running Spark or HDFS. I have a CSV file that is too big to fit in memory. I want to save it as a Parquet file and analyze it locally with existing tools, but keep the option of moving it to a Spark cluster later and analyzing it with Spark.

Is there any way to do this row-by-row without moving the file over to the Spark cluster?

I'm looking for a pure-python solution that does not involve the use of Spark.

vy32
  • Have a look at this thread; it may answer your question: https://stackoverflow.com/questions/42900757/sequentially-read-huge-csv-file-in-python – ascripter Jan 25 '18 at 17:53
  • Can you please mention why you are concerned about moving the file to the Spark cluster? – Ahmed Kamal Jan 25 '18 at 21:07
  • Possible duplicate of [How to save a huge pandas dataframe to hdfs?](https://stackoverflow.com/questions/47393001/how-to-save-a-huge-pandas-dataframe-to-hdfs) – Alper t. Turker Jan 25 '18 at 23:59
  • Not a duplicate, as I want to do this without Spark and without HDFS. – vy32 Jan 26 '18 at 00:33

2 Answers


There is no issue with reading files larger than memory. Spark can handle cases like this without any adjustments, and

spark.read.csv(in_path).write.parquet(out_path)

will work just fine, as long as the input doesn't use an unsplittable compression format (gzip, for example).
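
For reference, here is a minimal sketch of running that one-liner in local mode on the workstation's 4 cores (the local[4] master, the header option, and the path names are assumptions added for illustration, not part of the original answer):

from pyspark.sql import SparkSession

# Hypothetical input/output paths.
in_path = "big_input.csv"
out_path = "big_output.parquet"

# Local mode uses the workstation's cores directly; no cluster or HDFS is required.
spark = SparkSession.builder.master("local[4]").appName("csv-to-parquet").getOrCreate()

spark.read.option("header", "true").csv(in_path).write.parquet(out_path)

spark.stop()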

Alper t. Turker

The pyarrow.parquet function write_to_dataset() looks like it may do this. https://arrow.apache.org/docs/python/parquet.html#writing-to-partitioned-datasets

However, I can't find detailed documentation for this function at the moment; you may need to look at the source code to see what it does: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py, line 1154 at time of writing.

The pyarrow.parquet.ParquetWriter class might also do it.
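
For illustration, a rough sketch of the chunked ParquetWriter approach, assuming the CSV is read with pandas in chunks (the chunk size and path names are placeholders, not something taken from the pyarrow docs):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_path = "big_input.csv"          # placeholder paths
parquet_path = "big_output.parquet"

writer = None
# Read the CSV in fixed-size chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv(csv_path, chunksize=100_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # Create the writer lazily so its schema comes from the first chunk.
        writer = pq.ParquetWriter(parquet_path, table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()

One caveat: the schema is inferred from the first chunk, so a column that is empty or typed differently in a later chunk will cause a schema mismatch; passing explicit dtypes to read_csv avoids that.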

DavidH