
I am saving a Spark DataFrame to an S3 bucket. The default storage class for the saved files is STANDARD, but I need it to be STANDARD_IA. Is there an option to achieve this? I have looked through the Spark source code and found no such option for Spark's DataFrameWriter in https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

Below is the code I am using to write to S3:

val df = spark.sql(<sql>)
df.coalesce(1).write.mode("overwrite").parquet(<s3path>)

Edit: I am now using a CopyObjectRequest to change the storage class of the parquet files after they are written:

val copyObjectRequest = new CopyObjectRequest(bucket, key, bucket, key).withStorageClass(<storageClass>)
s3Client.copyObject(copyObjectRequest)
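
Roughly, the full post-write pass looks like the sketch below (the bucket name and prefix are placeholders, and it assumes the AWS SDK for Java v1 that CopyObjectRequest comes from):

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{CopyObjectRequest, StorageClass}

object ChangeStorageClass {
  def main(args: Array[String]): Unit = {
    val bucket = "my-bucket"          // placeholder bucket name
    val prefix = "output/parquet/"    // placeholder prefix where Spark wrote the files

    val s3Client = AmazonS3ClientBuilder.defaultClient()

    // List the objects Spark produced under the prefix (part files, _SUCCESS, etc.).
    // A single listObjectsV2 page (up to 1000 keys) is enough for a coalesce(1) output.
    val keys = s3Client.listObjectsV2(bucket, prefix)
      .getObjectSummaries.asScala.map(_.getKey)

    // Copy each object onto itself with the new storage class;
    // an in-place copy is how S3 rewrites object metadata such as the storage class.
    keys.foreach { key =>
      val request = new CopyObjectRequest(bucket, key, bucket, key)
        .withStorageClass(StorageClass.StandardInfrequentAccess)
      s3Client.copyObject(request)
    }
  }
}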
tusher
1 Answer


Not possible with the S3A connector; it's up for a volunteer to implement, with all the tests, in HADOOP-12020. FWIW, it's the tests that will be the hard part. I don't know about Amazon's own connectors.

Why not just define a lifecycle for the bucket and have things moved over every night?
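
If that route works for your data, a minimal sketch of such a lifecycle rule with the AWS SDK for Java v1 might look like this (the bucket name, rule id, and prefix are placeholders; the 30-day transition mirrors the minimum object age S3 requires before moving to STANDARD_IA):

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{BucketLifecycleConfiguration, StorageClass}
import com.amazonaws.services.s3.model.lifecycle.{LifecycleFilter, LifecyclePrefixPredicate}

object SetLifecycleRule {
  def main(args: Array[String]): Unit = {
    val bucket = "my-bucket"   // placeholder bucket name

    // Move everything under the prefix to STANDARD_IA 30 days after creation
    // (30 days is the minimum age S3 allows for a STANDARD_IA transition).
    val rule = new BucketLifecycleConfiguration.Rule()
      .withId("move-parquet-to-standard-ia")   // placeholder rule id
      .withFilter(new LifecycleFilter(new LifecyclePrefixPredicate("output/parquet/")))
      .addTransition(new BucketLifecycleConfiguration.Transition()
        .withDays(30)
        .withStorageClass(StorageClass.StandardInfrequentAccess))
      .withStatus(BucketLifecycleConfiguration.ENABLED)

    val s3Client = AmazonS3ClientBuilder.defaultClient()
    s3Client.setBucketLifecycleConfiguration(
      bucket,
      new BucketLifecycleConfiguration().withRules(java.util.Arrays.asList(rule)))
  }
}

The rule only has to be applied once per bucket; S3 then performs the nightly transitions itself.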

stevel
  • "Why not just define a lifecycle for the bucket and have things moved over every night?" - it's because you can move objects to OneZone AI only after 30 days. it makes a lot of sense to upload directly with OZ-IA – Vladimir Semashkin Apr 27 '22 at 06:39
  • Aah, that's a slightly different use case from Glacier. If there's a way to mark files in that category during upload, it'd be viable. As usual, a contributor to the OSS codebase is expected to add new tests and declare which endpoint they ran the current tests against... – stevel Apr 27 '22 at 18:33