Read a few Parquet files at the same time in Spark

Question

I can read several JSON files at the same time using * (star):

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for Parquet? The star doesn't work.

score 23 · Answer 1 · answered May 18 '16 at 08:59

FYI, you can also read a subset of Parquet files using the wildcard symbol *:

sqlContext.read.parquet("/path/to/dir/part_*.gz")

Or read multiple Parquet files by specifying them explicitly:

sqlContext.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")

answered May 18 '16 at 08:59

Boris

In addition, you can use a Hadoop glob pattern or take advantage of Spark's partitioning scheme; see https://stackoverflow.com/a/41712465/179014. – asmaier Sep 12 '17 at 15:58

score 18 · Answer 2 · edited Jul 24 '18 at 03:17

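# hdfs_path is the base path of the dataset and must be defined beforehand.
# Each entry is a glob: hour=2* matches every hour directory starting with "2".
InputPath = [hdfs_path + "parquets/date=18-07-23/hour=2*/*.parquet",
             hdfs_path + "parquets/date=18-07-24/hour=0*/*.parquet"]

# read.parquet takes paths as separate arguments, so unpack the list.
df = spark.read.parquet(*InputPath)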
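When there are more than a couple of date/hour partitions, the path list can be generated rather than typed out. The following is a minimal sketch, not part of the original answer: the helper name build_partition_globs and the base URI hdfs://namenode/data/ are hypothetical, and the directory layout is assumed to match the one shown above.

```python
def build_partition_globs(base_path, date_hour_patterns):
    """Return one glob string per (date, hour pattern) pair, in input order."""
    base = base_path.rstrip("/")
    return [
        f"{base}/parquets/date={date}/hour={hour}/*.parquet"
        for date, hour in date_hour_patterns
    ]


# Hypothetical base URI; it plays the role of hdfs_path in the answer.
input_paths = build_partition_globs(
    "hdfs://namenode/data/",
    [("18-07-23", "2*"), ("18-07-24", "0*")],
)

# With an existing SparkSession named spark, the list is unpacked as before:
#     df = spark.read.parquet(*input_paths)
```

Keeping the path construction separate from the Spark call makes it possible to inspect the globs before any data is read.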

edited Jul 24 '18 at 03:17

4b0

answered Jul 24 '18 at 03:09

user6602391

In my case, I first filter the files in S3 and then pass the list to read.parquet(). Thanks! – Carlos Gomez Dec 06 '19 at 13:53
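The filter-then-read approach from the comment above might look roughly like this. It is a sketch under stated assumptions: the object keys are taken as an already-fetched list (how they are listed from S3 is left out), the helper names select_parquet_keys and read_selected are hypothetical, and the s3a:// scheme is an assumption about the cluster's configuration. Note that fnmatchcase's * also matches "/", so it is looser than a Hadoop glob.

```python
from fnmatch import fnmatchcase


def select_parquet_keys(keys, pattern):
    """Keep the keys that match a shell-style pattern, in sorted order."""
    return sorted(k for k in keys if fnmatchcase(k, pattern))


def read_selected(spark, bucket, keys, pattern):
    """Filter keys first, then hand the surviving paths to read.parquet."""
    # The s3a:// scheme is an assumption; adjust it to the deployment.
    paths = [f"s3a://{bucket}/{k}" for k in select_parquet_keys(keys, pattern)]
    if not paths:
        raise ValueError("no keys matched " + pattern)
    # read.parquet takes paths as separate arguments, so unpack the list.
    return spark.read.parquet(*paths)
```

Raising on an empty match avoids calling read.parquet with no paths at all, which is harder to diagnose than an explicit error.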

score 12 · Accepted Answer · answered May 24 '15 at 16:18

See this issue on the Spark JIRA. Reading Parquet files through a wildcard is supported from Spark 1.4 onwards.

Without upgrading to 1.4, you could point at the top-level directory:

sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want, and pass them to parquetFile (it accepts varargs).

answered May 24 '15 at 16:18

dpeacock

I get `AttributeError: 'SQLContext' object has no attribute 'parquetFile'` – Soren Oct 11 '18 at 17:51
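The AttributeError in the comment above is what one would expect on newer Spark releases: to my knowledge, parquetFile was deprecated in Spark 1.4 in favour of read.parquet and removed in 2.0. A small compatibility wrapper, sketched here with the hypothetical name read_parquet_compat, prefers the newer reader and falls back to parquetFile only when no read attribute exists.

```python
def read_parquet_compat(sql_context, *paths):
    """Read Parquet paths with read.parquet if present, else parquetFile."""
    reader = getattr(sql_context, "read", None)
    if reader is not None:
        # Spark 1.4 and later: the DataFrameReader accepts several paths.
        return reader.parquet(*paths)
    # Older releases: parquetFile also accepts varargs, as the answer notes.
    return sql_context.parquetFile(*paths)
```

On a current installation the call reduces to sqlContext.read.parquet('/path/to/dir/'), which loads every Parquet file in the directory.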

score 5 · Answer 4 · edited Jan 15 '19 at 11:03

To read, give the file path with a '*' wildcard.

Example:

pqtDF = sqlContext.read.parquet("Path_*.parquet")

edited Jan 15 '19 at 11:03

Suraj Rao

answered Jan 15 '19 at 10:57

Idrees
