I need to read parquet files from multiple directories.

For example:

 Dir
  |
  |---- dir1
  |       |---- .parquet
  |       |---- .parquet
  |
  |---- dir2
          |---- .parquet
          |---- .parquet
          |---- .parquet

Is there a way to read these files into a single pandas DataFrame?

Note: all of the Parquet files were generated using PySpark.

Ahmad Senousi

1 Answer

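Use read_parquet in a list comprehension and concat the results. glob can collect all the files; the ** pattern matches nested directories when recursive=True is passed (Python 3.5+):

import pandas as pd
import glob

# recursive=True makes ** match any depth of subdirectories under Dir
files = sorted(glob.glob('Dir/**/*.parquet', recursive=True))
# ignore_index=True gives the combined frame a fresh 0..n-1 index
df = pd.concat([pd.read_parquet(fp) for fp in files], ignore_index=True)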
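
The same idea can also be sketched with pathlib, which makes it easy to skip anything that is not a regular file. This is only an illustration, not part of the original answer: the helper names find_parquet_files and load_parquet_tree are hypothetical, and the is_file check assumes that a Spark job may have written output directories whose names themselves end in .parquet.

```python
from pathlib import Path


def find_parquet_files(root):
    """Return every regular *.parquet file under root, at any depth, sorted."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*.parquet")
        # Keep regular, non-hidden files only. Spark's _SUCCESS marker and
        # .crc checksum files do not match the *.parquet pattern at all.
        if p.is_file() and not p.name.startswith(".")
    )


def load_parquet_tree(root):
    """Read all Parquet files under root into one DataFrame."""
    import pandas as pd  # imported lazily so discovery works without pandas

    files = find_parquet_files(root)
    if not files:
        raise FileNotFoundError(f"no .parquet files found under {root}")
    # Each file gets its own index, so rebuild a single one for the result.
    return pd.concat([pd.read_parquet(fp) for fp in files], ignore_index=True)
```

With the layout from the question, load_parquet_tree('Dir') would return the combined frame, provided a Parquet engine is installed for pandas.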
jezrael
  • I got this error ```RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']``` – Ahmad Senousi Jan 15 '20 at 05:07
  • @AhmadSuliman - Check [this](https://stackoverflow.com/questions/50800748/decompression-snappy-not-available-with-fastparquet) – jezrael Jan 15 '20 at 05:12
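
One possible way to approach the error in the comment above, sketched under assumptions: the message appears to come from the fastparquet engine running without a Snappy codec (as the linked question's title suggests), and the usual pyarrow builds are assumed to be able to read Snappy-compressed files. The helper name pick_parquet_engine is hypothetical; the engine argument of read_parquet is a real pandas parameter.

```python
from importlib.util import find_spec


def pick_parquet_engine():
    """Return 'pyarrow' if importable, else 'fastparquet', else None."""
    for name in ("pyarrow", "fastparquet"):
        if find_spec(name) is not None:
            return name
    return None


engine = pick_parquet_engine()
print(engine)
# Then pass it explicitly when reading, e.g.:
# df = pd.read_parquet(fp, engine=engine)  # fp: path to one Parquet file
```

If this prints fastparquet, installing pyarrow (or a Snappy package for fastparquet, as discussed in the linked question) may be enough to get past the error.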