33

I have a pandas dataframe. I want to write this dataframe to a parquet file in S3. I need sample code for this. I tried to google it, but I could not find a working sample.

Alexsander

4 Answers

45

For your reference, the following code works for me.

s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')

In order to use to_parquet, you need pyarrow or fastparquet to be installed. Also, make sure you have the correct information in your config and credentials files, located in the ~/.aws folder.

Edit: Additionally, s3fs is needed. See https://stackoverflow.com/a/54006942/1862909
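
Putting it together, a minimal end-to-end sketch (the bucket and folder names are placeholders; it assumes pandas plus pyarrow or fastparquet, and s3fs, are installed):

import pandas as pd

# example dataframe; credentials are picked up by s3fs from the
# ~/.aws config/credentials files or from environment variables
df = pd.DataFrame({"col": [1, 2, 3]})

s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')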

otmezger
Wai Kiat
  • I have multiple profiles in my .aws/config and credentials files... is there a way to set which profile to use? (I suppose setting my ENV var: AWS_PROFILE= would work, but would be nice to do it in code) – Brian Wylie Oct 24 '19 at 17:07
  • Yes, you first import `boto3`, then set your profile using `session = boto3.Session(profile_name="your_profile")` – Wai Kiat Oct 25 '19 at 02:24
  • For completeness, if you want a `.parquet` output file, drop the compression arg and change the file name to `.parquet`: `s3_url = 's3://bucket/folder/bucket.parquet'; df.to_parquet(s3_url)` – Rajat Shah Sep 24 '21 at 04:33
  • Fully agree with ending the filename as .parquet, because .gzip implies you need to unzip it. My comment is to warn of a caveat with to_parquet(...): if you use engine=fastparquet and provide partition_cols, to_parquet leaves a trail of directories starting with "s3:" in your working dir. Be warned. – michaelgbj Nov 11 '21 at 16:58
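
Following up on the profile discussion in the comments above: since pandas forwards `storage_options` to s3fs, the profile can also be passed straight through `to_parquet`. A sketch, assuming pandas >= 1.2 and a version of s3fs that accepts a `profile` option; "my-profile" and the bucket path are placeholders:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# "profile" is forwarded to s3fs, which reads the named profile
# from ~/.aws/config and ~/.aws/credentials
df.to_parquet(
    "s3://bucket/folder/data.parquet",
    storage_options={"profile": "my-profile"},
)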
16

The function below writes the parquet output to an in-memory buffer and then sends buffer.getvalue() to S3, without any need to save the parquet file locally.

Also, since you're creating an S3 client, you can supply credentials using AWS keys that are stored locally, in an Airflow connection, or in AWS Secrets Manager.

from io import BytesIO, StringIO


def dataframe_to_s3(s3_client, input_dataframe, bucket_name, filepath, format):

    # serialize the dataframe into an in-memory buffer in the requested format
    if format == 'parquet':
        out_buffer = BytesIO()
        input_dataframe.to_parquet(out_buffer, index=False)

    elif format == 'csv':
        out_buffer = StringIO()
        input_dataframe.to_csv(out_buffer, index=False)

    # upload the buffer contents to S3 under the given key
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())

s3_client is nothing but a boto3 S3 client object. Hope this helps!
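
For example, a minimal usage sketch of the function above (the bucket and key names are placeholders):

import boto3
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# plain boto3 S3 client; credentials come from the usual boto3 chain
# (environment variables, ~/.aws files, an assumed role, etc.)
s3_client = boto3.client("s3")

dataframe_to_s3(s3_client, df, "my-bucket", "folder/data.parquet", "parquet")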

courtesy- https://stackoverflow.com/a/40615630/12036254

gurjarprateek
  • For anyone wondering what input_dataframe.to_parquet is: https://stackoverflow.com/questions/41066582/python-save-pandas-data-frame-to-parquet-file – JOHN Feb 18 '20 at 06:43
  • For data having timestamps: https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error – JOHN Feb 18 '20 at 07:08
  • I followed this and got garbage values written in the file. :( What could be going wrong? – ShwetaJ Nov 16 '20 at 16:39
  • @gurjarprateek, it seems some of the data is being lost even though I'm not seeing any errors. At first I believed it to be a lack of memory (the DFs are somewhat large), but I'd expect an error message – Lucas Abreu Feb 02 '22 at 15:43
15

First ensure that you have pyarrow or fastparquet installed with pandas.

Then install boto3 and the AWS CLI. Use the AWS CLI to set up the config and credentials files, located in the ~/.aws folder.

Here is a simple script using pyarrow and boto3 to create a temporary parquet file and then send it to AWS S3.

Sample code including imports:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import boto3


def main():
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient='index')
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")


def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # convert the dataframe to an Arrow table and write it to a local parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)

    # upload the local file to S3 (read it in binary mode)
    s3 = boto3.client("s3")
    with open(fileName, "rb") as f:
        object_data = f.read()
        s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)
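
If you'd rather skip the temporary local file, a variation of the same idea is to let pyarrow write straight to S3. This is only a sketch, not part of the original answer; it assumes a recent pyarrow with pyarrow.fs.S3FileSystem, and the region, bucket and key are placeholders:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

df = pd.DataFrame({"data1": ["value1"]})
table = pa.Table.from_pandas(df)

# S3FileSystem uses the standard AWS credential chain; note that the path
# has no "s3://" scheme when a filesystem object is passed explicitly
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(table, "bucket/folder/test/file.parquet", filesystem=s3)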
pitchblack408
andreas
12

For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between pandas, S3, and Parquet.

To install, run:

pip install awswrangler

If you want to write your pandas dataframe as a parquet file to S3, do:

import awswrangler as wr
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)
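
As a usage note, the same library can read the file back into pandas (the path below is the same placeholder as above):

import awswrangler as wr

# read the parquet file back into a pandas DataFrame
df = wr.s3.read_parquet(path="s3://my-bucket/key/my-file.parquet")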
Vincent Claes
  • Caveat: unlike `pandas.DataFrame.to_parquet()`, wrangler has no option to pass kwargs to the underlying parquet library. This means that you can't set lower-level options if you need to. I ran into this issue when PyArrow failed to infer the table schema; in pandas, you can work around this by [explicitly defining](https://stackoverflow.com/a/66805787/4212158) a PyArrow schema – crypdick Jul 13 '21 at 15:10