This thread explains how to read a bucket of Parquet files into a pandas DataFrame. Suppose my bucket is structured as follows:

s3://my_bucket
                           ...
                           PRE date=2021-05-28/
                           PRE date=2021-05-29/
                           PRE date=2021-06-01/
                           PRE date=2021-06-02/
                           PRE date=2021-06-03/
                           PRE date=2021-06-04/
                           PRE date=2021-06-05/
                           PRE date=2021-06-06/
                           PRE date=2021-06-07/
                           PRE date=2021-06-08/
                           PRE date=2021-06-09/
                           PRE date=2021-06-16/
                           PRE date=2021-06-17/
                           PRE date=2021-06-18/
                           PRE date=2021-06-22/
                           PRE date=2021-06-23/
                           ...

Each sub-bucket (prefix) corresponds to a date. Suppose I don't know which sub-buckets are faulty, and attempting to read them may raise errors or even kill the Python process. I want to exclude those sub-buckets. What would be a good way, if any, to do this, starting from this snippet:

import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# Reads every date=... prefix in one call, so a single faulty prefix fails the whole read
pandas_dataframe = pq.ParquetDataset('s3://my_bucket/', filesystem=s3).read_pandas().to_pandas()
Tristan Tran
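
One possible direction, offered as a minimal sketch rather than a tested solution: read each `date=...` prefix on its own and skip the ones that raise. The helper names (`read_partition`, `read_skipping_faulty`, `load_bucket`) are hypothetical, and the bucket name `my_bucket` is taken from the listing above. The sketch assumes that reading a single prefix does not bring back the `date` partition column, so it re-adds that column from the path. Note that `try`/`except` only catches Python exceptions; a read that kills the interpreter outright would still take this loop down, and handling that case would likely require running each read in a separate process.

```python
def read_partition(path, filesystem):
    """Read one date=... prefix into a pandas DataFrame."""
    # Imported lazily so the generic helper below has no hard dependency on pyarrow.
    import pyarrow.parquet as pq

    df = pq.ParquetDataset(path, filesystem=filesystem).read_pandas().to_pandas()
    # Assumption: the partition column is not inferred when reading a single prefix,
    # so recover it from the directory name (e.g. "my_bucket/date=2021-05-28").
    df["date"] = path.rstrip("/").split("date=")[-1]
    return df


def read_skipping_faulty(paths, read_one):
    """Try each partition separately; return (loaded results, failed paths with errors)."""
    loaded, failed = [], []
    for path in paths:
        try:
            loaded.append(read_one(path))
        except Exception as exc:  # broad on purpose: the kind of fault is unknown
            failed.append((path, repr(exc)))
    return loaded, failed


def load_bucket(bucket="my_bucket"):
    """Hypothetical entry point: concatenate every readable date=... prefix."""
    import pandas as pd
    import s3fs

    s3 = s3fs.S3FileSystem()
    # glob returns keys without the "s3://" scheme, e.g. "my_bucket/date=2021-05-28"
    paths = sorted(s3.glob(f"{bucket}/date=*"))
    frames, failed = read_skipping_faulty(paths, lambda p: read_partition(p, s3))
    for path, err in failed:
        print(f"skipped {path}: {err}")
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```

Calling `load_bucket()` would then return a single DataFrame built from the readable prefixes and print the ones that were skipped. Reading prefix by prefix is slower than one bulk read, but it isolates failures to the prefix that caused them.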

0 Answers