
I would like to read in a file with the following structure with Apache Spark.

628344092\t20070220\t200702\t2007\t2007.1370

The delimiter is \t. How can I implement this while using spark.read.csv()?
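Before reaching for Spark, you can sanity-check that the delimiter really is a tab by parsing a single sample line with Python's standard csv module. This is only an illustration of the file format, not a substitute for Spark on a large file:

```python
import csv
import io

# A sample row from the question; the fields are tab-separated.
row = "628344092\t20070220\t200702\t2007\t2007.1370"

# csv.reader accepts any single-character delimiter, including a tab.
reader = csv.reader(io.StringIO(row), delimiter="\t")
fields = next(reader)
print(fields)  # ['628344092', '20070220', '200702', '2007', '2007.1370']
```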

The csv is far too big for pandas, which takes ages to read the file. Is there a way that works similarly to

pandas.read_csv(file, sep = '\t')

Thanks a lot!

samthebest
inneb

2 Answers


Use spark.read.option("delimiter", "\t").csv(file), or use sep instead of delimiter.

If the delimiter is literally the two characters \t (a backslash followed by "t") rather than the tab character, escape the backslash: spark.read.option("delimiter", "\\t").csv(file)
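The distinction above is easy to check in plain Python: in source code, "\t" is a single tab character, while "\\t" (equivalently the raw string r"\t") is two characters, a backslash and a "t". A quick sketch:

```python
# "\t" is one character (a tab); "\\t" is two (backslash + "t").
tab = "\t"
escaped = "\\t"

print(len(tab))          # 1
print(len(escaped))      # 2
print(escaped == r"\t")  # True: a raw string leaves the backslash intact
```

Whichever form actually separates the fields in your file is the one to pass as the delimiter.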

T. Gawęda

This works for me and is clearer (to me). As you mentioned, in pandas you would do:

df_pandas = pandas.read_csv(file_path, sep = '\t')

In spark:

df_spark = spark.read.csv(file_path, sep ='\t', header = True)

Please note that if the first row of your csv does not contain the column names, you should set header = False, like this:

df_spark = spark.read.csv(file_path, sep ='\t', header = False)

You can change the separator (sep) to fit your data.

Tom