33

I have a pandas dataframe. I want to write this dataframe to a parquet file in S3. I need sample code for this. I tried to google it, but I could not find a working sample.

Alexsander

4 Answers

45

For your reference, the following code works for me.

s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')

In order to use to_parquet, you need pyarrow or fastparquet to be installed. Also, make sure you have the correct information in your config and credentials files, located in the ~/.aws folder.

Edit: Additionally, s3fs is needed. See https://stackoverflow.com/a/54006942/1862909
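
Putting it together, a minimal end-to-end sketch (the bucket and folder names are placeholders; it assumes pandas plus pyarrow or fastparquet, and s3fs, are installed):

import pandas as pd

# example dataframe; credentials are picked up by s3fs from the
# ~/.aws config/credentials files or from environment variables
df = pd.DataFrame({"col": [1, 2, 3]})

s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')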

otmezger
Wai Kiat
  • I have multiple profiles in my .aws/config and credentials files... is there a way to set which profile to use? (I suppose setting my ENV var: AWS_PROFILE= would work, but would be nice to do it in code) – Brian Wylie Oct 24 '19 at 17:07
  • Yes, you first import `boto3`, then set your profile using `session = boto3.Session(profile_name="your_profile")` – Wai Kiat Oct 25 '19 at 02:24
  • For completeness, if you want a `.parquet` output file, drop the compression arg and change the file name to `.parquet`: `s3_url = 's3://bucket/folder/bucket.parquet'; df.to_parquet(s3_url)` – Rajat Shah Sep 24 '21 at 04:33
  • Fully agree with ending the filename as .parquet, because .gzip implies you need to unzip it. My comment is to warn of a caveat with to_parquet(...): if you use engine=fastparquet and provide partition_cols, to_parquet leaves a trail of directories starting with "s3:" in your working dir. Be warned. – michaelgbj Nov 11 '21 at 16:58
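
Following up on the profile discussion in the comments above: since pandas forwards `storage_options` to s3fs, the profile can also be passed straight through `to_parquet`. A sketch, assuming pandas >= 1.2 and a version of s3fs that accepts a `profile` option; "my-profile" and the bucket path are placeholders:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# "profile" is forwarded to s3fs, which reads the named profile
# from ~/.aws/config and ~/.aws/credentials
df.to_parquet(
    "s3://bucket/folder/data.parquet",
    storage_options={"profile": "my-profile"},
)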
16

The function below writes the parquet output to an in-memory buffer and then sends buffer.getvalue() to S3, without any need to save the parquet file locally.

Also, since you're creating an S3 client, you can supply credentials using AWS keys that are stored locally, in an Airflow connection, or in AWS Secrets Manager.

from io import BytesIO, StringIO


def dataframe_to_s3(s3_client, input_dataframe, bucket_name, filepath, format):

    # serialize the dataframe into an in-memory buffer in the requested format
    if format == 'parquet':
        out_buffer = BytesIO()
        input_dataframe.to_parquet(out_buffer, index=False)

    elif format == 'csv':
        out_buffer = StringIO()
        input_dataframe.to_csv(out_buffer, index=False)

    # upload the buffer contents to S3 under the given key
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())

s3_client is nothing but a boto3 S3 client object. Hope this helps!
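
For example, a minimal usage sketch of the function above (the bucket and key names are placeholders):

import boto3
import pandas as pd

df = pd.DataFrame({"col": [1, 2, 3]})

# plain boto3 S3 client; credentials come from the usual boto3 chain
# (environment variables, ~/.aws files, an assumed role, etc.)
s3_client = boto3.client("s3")

dataframe_to_s3(s3_client, df, "my-bucket", "folder/data.parquet", "parquet")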

courtesy- https://stackoverflow.com/a/40615630/12036254

gurjarprateek
  • For anyone wondering what input_dataframe.to_parquet is: https://stackoverflow.com/questions/41066582/python-save-pandas-data-frame-to-parquet-file – JOHN Feb 18 '20 at 06:43
  • For data having timestamps: https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error – JOHN Feb 18 '20 at 07:08
  • I followed this and got garbage values written in the file. :( What could be going wrong? – ShwetaJ Nov 16 '20 at 16:39
  • @gurjarprateek, it seems some of the data is being lost even though I'm not seeing any errors. At first I believed it to be a lack of memory (the DFs are somewhat large), but I'd expect an error message – Lucas Abreu Feb 02 '22 at 15:43
15

First ensure that you have pyarrow or fastparquet installed with pandas.

Then install boto3 and the AWS CLI. Use the AWS CLI to set up the config and credentials files, located in the ~/.aws folder.

Here is a simple script using pyarrow and boto3 to create a temporary parquet file and then send it to AWS S3.

Sample code including imports:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import boto3


def main():
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient='index')
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")


def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # convert the dataframe to an Arrow table and write it to a local parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)

    # upload the local file to S3 (read it in binary mode)
    s3 = boto3.client("s3")
    with open(fileName, "rb") as f:
        object_data = f.read()
        s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)
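
If you'd rather skip the temporary local file, a variation of the same idea is to let pyarrow write straight to S3. This is only a sketch, not part of the original answer; it assumes a recent pyarrow with pyarrow.fs.S3FileSystem, and the region, bucket and key are placeholders:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

df = pd.DataFrame({"data1": ["value1"]})
table = pa.Table.from_pandas(df)

# S3FileSystem uses the standard AWS credential chain; note that the path
# has no "s3://" scheme when a filesystem object is passed explicitly
s3 = fs.S3FileSystem(region="us-east-1")
pq.write_table(table, "bucket/folder/test/file.parquet", filesystem=s3)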
pitchblack408
andreas
12

For Python 3.6+, AWS has a library called aws-data-wrangler that helps with the integration between pandas, S3, and Parquet.

To install, run:

pip install awswrangler

If you want to write your pandas dataframe as a parquet file to S3, do:

import awswrangler as wr
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)
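
As a usage note, the same library can read the file back into pandas (the path below is the same placeholder as above):

import awswrangler as wr

# read the parquet file back into a pandas DataFrame
df = wr.s3.read_parquet(path="s3://my-bucket/key/my-file.parquet")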
Vincent Claes
  • Caveat: unlike `pandas.DataFrame.to_parquet()`, wrangler has no option to pass kwargs to the underlying parquet library. This means that you can't set lower-level options if you need to. I ran into this issue when PyArrow failed to infer the table schema; in pandas, you can work around this by [explicitly defining](https://stackoverflow.com/a/66805787/4212158) a PyArrow schema – crypdick Jul 13 '21 at 15:10