
I need to create a DataFrame with the pandas library from Parquet files hosted on a Google Cloud Storage bucket. I have searched the documentation and online examples but can't seem to figure out how to go about it.

Could you please assist me by pointing me in the right direction?

I am not looking for a ready-made solution, but for a place where I could look for further information so that I can devise my own solution.

Thank you in advance.

User9102d82

2 Answers


You can use the gcsfs and pyarrow libraries to do so.

import gcsfs
from pyarrow import parquet

url = "gs://bucket_name/.../folder_name"
fs = gcsfs.GCSFileSystem()

# Assuming your Parquet files start with the `part-` prefix
files = ["gs://" + path for path in fs.glob(url + "/part-*")]
ds = parquet.ParquetDataset(files, filesystem=fs)
df = ds.read().to_pandas()
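
Depending on your pyarrow version, you may also be able to pass the folder path directly and let ParquetDataset discover the files itself; a sketch under that assumption, reusing the same placeholder url and fs as above:

# A sketch: some pyarrow versions accept the folder itself and
# discover the Parquet files inside it
ds = parquet.ParquetDataset(url, filesystem=fs)
df = ds.read().to_pandas()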
Terence

You can read it with pandas.read_parquet like this:

import pandas

df = pandas.read_parquet('gs://bucket_name/file_name')

Additionally, you will need the gcsfs library and either pyarrow or fastparquet installed.

Don't forget to provide credentials if you are accessing a private bucket.
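
For example, a minimal sketch of passing an explicit service-account key, assuming pandas >= 1.2 (which forwards storage_options to gcsfs) and a hypothetical key file path:

import pandas

# 'key.json' is a hypothetical path to a service-account key file;
# storage_options is forwarded to gcsfs (requires pandas >= 1.2)
df = pandas.read_parquet(
    'gs://bucket_name/file_name',
    storage_options={'token': 'key.json'},
)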

Emil Gi
  • Hi, thank you for your answer. It is close, but there is an issue. The said method reads a single Parquet file, agreed, but if a folder has multiple Parquet files it doesn't work. Or is there some other option to be added? Basically, I will not know whether there will be a single Parquet file or multiple ones, and that is what I need to handle. – User9102d82 Feb 26 '20 at 11:11
  • You can get a list of files in the bucket and then iterate over it with a loop, reading the files one by one (a sketch follows these comments). Refer to [this question](https://stackoverflow.com/q/54988092/12232507) for an example. I don't think there is a method to read the entire bucket at once. – Emil Gi Feb 26 '20 at 11:57
  • I would recommend you kindly post your last comment as an answer and I shall accept it, as there is no other alternative. – User9102d82 Mar 30 '20 at 09:59
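
A minimal sketch of that loop, assuming placeholder bucket and folder names and default gcsfs credentials:

import gcsfs
import pandas

fs = gcsfs.GCSFileSystem()

# List every Parquet file under the folder; glob returns paths
# without the gs:// scheme, so it is added back before reading
paths = fs.glob('bucket_name/folder_name/*.parquet')
df = pandas.concat(
    (pandas.read_parquet('gs://' + p) for p in paths),
    ignore_index=True,
)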