11

I can read several JSON files at once using a * (star) wildcard:

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for Parquet? The star doesn't work.

SkyFox

4 Answers

23

FYI, you can also:

  • read a subset of Parquet files using the wildcard symbol *: sqlContext.read.parquet("/path/to/dir/part_*.gz")

  • read multiple Parquet files by specifying them explicitly: sqlContext.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")

Boris
  • In addition, you can use a Hadoop glob pattern or take advantage of Spark's partitioning scheme; see https://stackoverflow.com/a/41712465/179014 . – asmaier Sep 12 '17 at 15:58
18
InputPath = [hdfs_path + "parquets/date=18-07-23/hour=2*/*.parquet",
             hdfs_path + "parquets/date=18-07-24/hour=0*/*.parquet"]

df = spark.read.parquet(*InputPath)
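
When the set of partitions to load changes often, the list of patterns can be generated instead of typed by hand. A minimal sketch, assuming the `date=.../hour=...` directory layout used above; `partition_globs` is a hypothetical helper (not part of Spark), and the `hdfs_path` value is only a placeholder:

```python
def partition_globs(base_path, selections):
    """Build one glob pattern per (date, hour_pattern) pair.

    Hypothetical helper: assumes the layout
    <base_path>parquets/date=<date>/hour=<hour>/*.parquet
    """
    return [
        "{}parquets/date={}/hour={}/*.parquet".format(base_path, date, hour)
        for date, hour in selections
    ]


# Placeholder base path; substitute your own.
hdfs_path = "hdfs:///placeholder/"

input_paths = partition_globs(
    hdfs_path,
    [("18-07-23", "2*"), ("18-07-24", "0*")],
)
```

The resulting list can be unpacked into the reader in the same way, i.e. `spark.read.parquet(*input_paths)`.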
4b0
user6602391
12

See this issue on the Spark JIRA. Wildcards for Parquet are supported from Spark 1.4 onwards.

If you cannot upgrade to 1.4, you can either point at the top-level directory:

sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want, and pass them to parquetFile (it accepts varargs).
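
The second option can be sketched roughly as follows. This is an illustration under assumptions: it uses Python's standard `glob` module on a local filesystem rather than the HDFS API, and `list_parquet_files` and `read_parquet_files` are hypothetical helper names. The `load` argument is any callable that accepts several paths as varargs, such as `sqlContext.parquetFile`:

```python
import glob


def list_parquet_files(pattern):
    """Expand a local glob pattern into a sorted list of matching paths."""
    return sorted(glob.glob(pattern))


def read_parquet_files(load, pattern):
    """Expand the pattern ourselves, then pass each path as a separate argument.

    `load` is any callable taking varargs paths, e.g. sqlContext.parquetFile.
    """
    paths = list_parquet_files(pattern)
    if not paths:
        # Fail early with a clear message instead of calling load() with no paths.
        raise ValueError("no files match pattern: {}".format(pattern))
    return load(*paths)
```

For files on a local path, a call might look like `read_parquet_files(sqlContext.parquetFile, '/path/to/dir/*.parquet')`. For HDFS, the listing step would need to be replaced with an HDFS client of your choice.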

dpeacock
5

To read: give the file path with a '*' wildcard.

Example

pqtDF=sqlContext.read.parquet("Path_*.parquet")
Suraj Rao
Idrees