This thread explains how to read a bucket containing parquet files into a pandas data frame. Suppose my bucket structure is as follows:
s3://my_bucket
...
PRE date=2021-05-28/
PRE date=2021-05-29/
PRE date=2021-06-01/
PRE date=2021-06-02/
PRE date=2021-06-03/
PRE date=2021-06-04/
PRE date=2021-06-05/
PRE date=2021-06-06/
PRE date=2021-06-07/
PRE date=2021-06-08/
PRE date=2021-06-09/
PRE date=2021-06-16/
PRE date=2021-06-17/
PRE date=2021-06-18/
PRE date=2021-06-22/
PRE date=2021-06-23/
...
Each sub-bucket is a date partition. Suppose I don't know in advance which sub-buckets are faulty; attempting to read them may raise errors or even kill the Python process. I want to exclude those sub-buckets. What would be a good way, if any, to do this, starting from this snippet:
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.ParquetDataset('s3://my_bucket/', filesystem=s3).read_pandas().to_pandas()
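
One approach I have considered is to read each date=... prefix separately and skip the ones that fail. Here is a minimal sketch of that idea, assuming each prefix can be read on its own (my_bucket is the bucket from the listing above; the per-prefix loop is my own guess, not something from the linked thread):

import pandas as pd
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

frames = []
# s3.ls returns the date=... prefixes as keys like 'my_bucket/date=2021-05-28'
for prefix in s3.ls('my_bucket'):
    try:
        # Read one partition at a time so a faulty one can be skipped
        dataset = pq.ParquetDataset(f's3://{prefix}', filesystem=s3)
        frames.append(dataset.read_pandas().to_pandas())
    except Exception as exc:
        print(f'Skipping {prefix}: {exc}')

pandas_dataframe = pd.concat(frames, ignore_index=True)

The try/except only catches errors raised inside Python, though; it would not protect against a read that kills the process outright, which is why I am asking whether there is a better way.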