
My development environment is a single-user workstation with 4 cores, not running Spark or HDFS. I have a CSV file that is too big to fit in memory. I want to save it as a Parquet file and analyze it locally with existing tools, but keep the option of moving it to a Spark cluster later and analyzing it with Spark.

Is there any way to do this row-by-row without moving the file over to the Spark cluster?

I'm looking for a pure-python solution that does not involve the use of Spark.

vy32
  • Have a look at this thread; it may answer your question: https://stackoverflow.com/questions/42900757/sequentially-read-huge-csv-file-in-python – ascripter Jan 25 '18 at 17:53
  • Can you please mention why you are concerned about moving the file to the Spark cluster? – Ahmed Kamal Jan 25 '18 at 21:07
  • Possible duplicate of [How to save a huge pandas dataframe to hdfs?](https://stackoverflow.com/questions/47393001/how-to-save-a-huge-pandas-dataframe-to-hdfs) – Alper t. Turker Jan 25 '18 at 23:59
  • Not a duplicate, as I want to do this without Spark and without HDFS. – vy32 Jan 26 '18 at 00:33

2 Answers


There is no issue with reading files larger than memory. Spark can handle cases like this without any adjustments, and

spark.read.csv(in_path).write.parquet(out_path)

will work just fine, as long as the input doesn't use an unsplittable compression format (gzip, for example).
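
For reference, here is a minimal sketch of running that one-liner in local mode on the workstation's 4 cores (the local[4] master, the header option, and the path names are assumptions added for illustration, not part of the original answer):

from pyspark.sql import SparkSession

# Hypothetical input/output paths.
in_path = "big_input.csv"
out_path = "big_output.parquet"

# Local mode uses the workstation's cores directly; no cluster or HDFS is required.
spark = SparkSession.builder.master("local[4]").appName("csv-to-parquet").getOrCreate()

spark.read.option("header", "true").csv(in_path).write.parquet(out_path)

spark.stop()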

Alper t. Turker

The pyarrow.parquet function write_to_dataset() looks like it may do this. https://arrow.apache.org/docs/python/parquet.html#writing-to-partitioned-datasets

However, I can't find detailed documentation for this function at the moment; you may need to look at the source code to see what it does: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py, line 1154 at time of writing.

The pyarrow.parquet.ParquetWriter class might also do it.
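
For illustration, a rough sketch of the chunked ParquetWriter approach, assuming the CSV is read with pandas in chunks (the chunk size and path names are placeholders, not something taken from the pyarrow docs):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_path = "big_input.csv"          # placeholder paths
parquet_path = "big_output.parquet"

writer = None
# Read the CSV in fixed-size chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv(csv_path, chunksize=100_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # Create the writer lazily so its schema comes from the first chunk.
        writer = pq.ParquetWriter(parquet_path, table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()

One caveat: the schema is inferred from the first chunk, so a column that is empty or typed differently in a later chunk will cause a schema mismatch; passing explicit dtypes to read_csv avoids that.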

DavidH