13

I am trying to write a pandas dataframe to the parquet file format (support was introduced in pandas 0.21.0) in append mode. However, instead of appending to the existing file, the file is overwritten with the new data. What am I missing?

The write syntax is:

df.to_parquet(path, mode='append')

The read syntax is:

pd.read_parquet(path)
Siraj S.
  • [try opening the file in append mode](https://stackoverflow.com/a/17531025/1278112) – Shihe Zhang Nov 09 '17 at 05:57
  • this does not work (makes no difference from the previous situation) – Siraj S. Nov 09 '17 at 18:16
  • From this link https://stackoverflow.com/questions/39234391/how-to-append-data-to-an-existing-parquet-file it looks like append is not supported in the parquet client API. – Siraj S. Nov 09 '17 at 18:19
  • [In the doc](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.to_parquet.html#pandas-dataframe-to-parquet) there is no `append` mode for the `to_parquet()` API. If you want to append to a file, the `append` mode is for the file. That's what I tried to express earlier. – Shihe Zhang Nov 10 '17 at 01:04
  • See here: https://stackoverflow.com/questions/47113813/using-pyarrow-how-do-you-append-to-parquet-file – Andrey May 24 '21 at 10:17

4 Answers

6

To append, do this:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.read_csv('content.csv')
output = "/Users/myTable.parquet"

# Create a parquet table from your dataframe
table = pa.Table.from_pandas(dataframe)

# Write to the dataset: root_path is treated as a directory,
# and each call adds a new part file under it
pq.write_to_dataset(table, root_path=output)

Repeated calls write new part files under `root_path`, so this effectively appends to your table.
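
To verify, read the dataset directory back; pandas with the pyarrow engine combines every part file into a single dataframe. A small sketch reusing the `output` path from above:

import pandas as pd

# Reading the dataset directory returns the rows of all part files combined
df_all = pd.read_parquet(output)
print(len(df_all))  # grows with each write_to_dataset call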

Victor Faro
3

I used the AWS Data Wrangler library. It works like a charm.

Below are the reference docs

https://aws-data-wrangler.readthedocs.io/en/latest/stubs/awswrangler.s3.to_parquet.html

I read from a Kinesis stream and used the kinesis-python library to consume the messages and write them to S3. I have not included the JSON-processing logic, as this post deals with the problem of being unable to append data to S3. The code was executed in an AWS SageMaker Jupyter notebook.

Below is the sample code I used:

!pip install awswrangler
import awswrangler as wr
import pandas as pd

event_data = pd.DataFrame({'a': [a], 'b': [b], 'c': [c], 'd': [d], 'e': [e], 'f': [f], 'g': [g]},
                          columns=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
#print(event_data)
s3_path = "s3://<your bucket>/table/temp/<your folder name>/e=" + e + "/f=" + str(f)
try:
    wr.s3.to_parquet(
        df=event_data,
        path=s3_path,
        dataset=True,
        partition_cols=['e', 'f'],
        mode="append",
        database="wat_q4_stg",
        table="raw_data_v3",
        catalog_versioning=True  # Optional
    )
    print("write successful")
except Exception as err:  # avoid shadowing the column variable e
    print(str(err))

Happy to help with any clarifications. In a few other posts I have read the suggestion to read the data back and overwrite it again, but that slows down as the data gets larger; it is inefficient.
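
If you later need to read the appended dataset back, awswrangler can load every partition under the prefix. A sketch, assuming the same placeholder bucket path as above:

import awswrangler as wr

# Read all partitions written under the dataset prefix
df_all = wr.s3.read_parquet(path="s3://<your bucket>/table/temp/<your folder name>/", dataset=True)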

Naveen Srikanth
2

There is no append mode in pandas' `to_parquet()`. What you can do instead is read the existing file, append the new data, and write the result back, overwriting the original file.
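
A minimal sketch of that read-concatenate-overwrite pattern (the function name and path are hypothetical):

import os
import pandas as pd

def append_to_parquet(new_df: pd.DataFrame, path: str) -> None:
    # Read the existing file if present, stack the new rows on, and overwrite
    if os.path.exists(path):
        existing = pd.read_parquet(path)
        new_df = pd.concat([existing, new_df], ignore_index=True)
    new_df.to_parquet(path, index=False)

append_to_parquet(df, 'myTable.parquet')

Note that this rewrites the whole file on every call, which gets slow as the data grows.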

ben26941
-1

Pandas `to_parquet` can handle both single files and directories containing multiple files. Pandas will silently overwrite the file if it already exists. To append to a parquet dataset, just add a new file to the same parquet directory:

import os
import datetime
import pandas as pd

os.makedirs(path, exist_ok=True)

# write append: each write adds a uniquely named file to the directory
# (replace the naming logic with what works for you)
filename = f'{datetime.datetime.utcnow().timestamp()}.parquet'
df.to_parquet(os.path.join(path, filename))

# read: pandas combines every file in the directory into one dataframe
pd.read_parquet(path)
natbusa