
I need to create a DataFrame with the pandas library from Parquet files hosted on a Google Cloud Storage bucket. I have searched the documentation and online examples but can't seem to figure out how to go about it.

Could you please assist me by pointing me in the right direction?

I am not looking for a ready-made solution, but for a place where I could look for further information so that I can devise my own solution.

Thank you in advance.

User9102d82

2 Answers


You can use the gcsfs and pyarrow libraries to do so.

import gcsfs
from pyarrow import parquet

url = "gs://bucket_name/.../folder_name"
fs = gcsfs.GCSFileSystem()

# Assuming your Parquet files start with the `part-` prefix
files = ["gs://" + path for path in fs.glob(url + "/part-*")]
ds = parquet.ParquetDataset(files, filesystem=fs)
df = ds.read().to_pandas()
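
Depending on your pyarrow version, you may also be able to pass the folder path directly and let ParquetDataset discover the files itself; a sketch under that assumption, reusing the same placeholder url and fs as above:

# A sketch: some pyarrow versions accept the folder itself and
# discover the Parquet files inside it
ds = parquet.ParquetDataset(url, filesystem=fs)
df = ds.read().to_pandas()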
Terence

You can read it with pandas.read_parquet like this:

import pandas

df = pandas.read_parquet('gs://bucket_name/file_name')

Additionally, you will need the gcsfs library and either pyarrow or fastparquet installed.

Don't forget to provide credentials if you are accessing a private bucket.
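
For example, a minimal sketch of passing an explicit service-account key, assuming pandas >= 1.2 (which forwards storage_options to gcsfs) and a hypothetical key file path:

import pandas

# 'key.json' is a hypothetical path to a service-account key file;
# storage_options is forwarded to gcsfs (requires pandas >= 1.2)
df = pandas.read_parquet(
    'gs://bucket_name/file_name',
    storage_options={'token': 'key.json'},
)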

Emil Gi
  • Hi, thank you for your answer. It is close, but there is an issue. The said method reads a single Parquet file, agreed, but if a folder has multiple Parquet files it doesn't work. Or is there some other option to be added? Basically, I will not know whether there will be a single Parquet file or multiple ones, and that is what I need to handle. – User9102d82 Feb 26 '20 at 11:11
  • You can get a list of files in the bucket and then iterate over it with a loop, reading the files one by one (a sketch follows these comments). Refer to [this question](https://stackoverflow.com/q/54988092/12232507) for an example. I don't think there is a method to read the entire bucket at once. – Emil Gi Feb 26 '20 at 11:57
  • I would recommend you kindly post your last comment as an answer and I shall accept it, as there is no other alternative. – User9102d82 Mar 30 '20 at 09:59
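
A minimal sketch of that loop, assuming placeholder bucket and folder names and default gcsfs credentials:

import gcsfs
import pandas

fs = gcsfs.GCSFileSystem()

# List every Parquet file under the folder; glob returns paths
# without the gs:// scheme, so it is added back before reading
paths = fs.glob('bucket_name/folder_name/*.parquet')
df = pandas.concat(
    (pandas.read_parquet('gs://' + p) for p in paths),
    ignore_index=True,
)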