
I would like to read in a file with the following structure with Apache Spark.

628344092\t20070220\t200702\t2007\t2007.1370

The delimiter is \t. How can I implement this while using spark.read.csv()?
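Before reaching for Spark, you can sanity-check that the delimiter really is a tab by parsing a single sample line with Python's standard csv module. This is only an illustration of the file format, not a substitute for Spark on a large file:

```python
import csv
import io

# A sample row from the question; the fields are tab-separated.
row = "628344092\t20070220\t200702\t2007\t2007.1370"

# csv.reader accepts any single-character delimiter, including a tab.
reader = csv.reader(io.StringIO(row), delimiter="\t")
fields = next(reader)
print(fields)  # ['628344092', '20070220', '200702', '2007', '2007.1370']
```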

The csv is far too big for pandas, which takes ages to read the file. Is there a way that works similarly to

pandas.read_csv(file, sep = '\t')

Thanks a lot!

samthebest
inneb

2 Answers


Use spark.read.option("delimiter", "\t").csv(file), or use sep instead of delimiter.

If the delimiter is literally the two characters \t (a backslash followed by "t") rather than the tab character, escape the backslash: spark.read.option("delimiter", "\\t").csv(file)
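The distinction above is easy to check in plain Python: in source code, "\t" is a single tab character, while "\\t" (equivalently the raw string r"\t") is two characters, a backslash and a "t". A quick sketch:

```python
# "\t" is one character (a tab); "\\t" is two (backslash + "t").
tab = "\t"
escaped = "\\t"

print(len(tab))          # 1
print(len(escaped))      # 2
print(escaped == r"\t")  # True: a raw string leaves the backslash intact
```

Whichever form actually separates the fields in your file is the one to pass as the delimiter.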

T. Gawęda

This works for me and is clearer (to me). As you mentioned, in pandas you would do:

df_pandas = pandas.read_csv(file_path, sep = '\t')

In spark:

df_spark = spark.read.csv(file_path, sep ='\t', header = True)

Please note that if the first row of your csv does not contain the column names, you should set header = False, like this:

df_spark = spark.read.csv(file_path, sep ='\t', header = False)

You can change the separator (sep) to fit your data.

Tom