26

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job and only overwriting last week's data.

The default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written.

Madhava Carrillo
  • Before Spark 2.3.0 there was a JIRA for this; it is fixed in 2.3.0: https://issues.apache.org/jira/browse/SPARK-20236 – wandermonk Apr 24 '18 at 18:31

2 Answers

63

Since Spark 2.3.0 this is an option when overwriting a table. To use it, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode must be overwrite. Example in Scala:

spark.conf.set(
  "spark.sql.sources.partitionOverwriteMode", "dynamic"
)
data.write.mode("overwrite").insertInto("partitioned_table")

I recommend repartitioning by your partition column before writing, so you won't end up with 400 files per folder.
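For example, a minimal sketch combining both steps (assuming the table is partitioned by a `date` column; the column and table names are illustrative):

import org.apache.spark.sql.functions.col

// Only the partitions present in `data` are overwritten (Spark >= 2.3.0).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

data
  .repartition(col("date"))  // all rows for a given date land in one task, so roughly one file per folder
  .write
  .mode("overwrite")
  .insertInto("partitioned_table")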

Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
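A rough sketch of that workaround (assuming a Hive table partitioned by `date`; the table name and partition spec are illustrative):

// 1. Drop the partitions that are about to be recomputed (standard Hive DDL).
spark.sql("ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (date='2018-04-01')")

// 2. Append the recomputed data; partitions that were not dropped stay intact.
data.write.mode("append").insertInto("partitioned_table")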

wiesiu_p
Madhava Carrillo
  • How do you delete multiple partitions in a HIVE table? – Neeraj Bhadani Jul 18 '18 at 13:24
  • That's quite a different topic; check this out: http://bigdataprogrammers.com/drop-multiple-partitions-in-hive/ – Madhava Carrillo Jul 20 '18 at 20:20
  • I have already tried the solution, but it does not allow deleting the partitions with operators like "" – Neeraj Bhadani Jul 22 '18 at 18:45
  • Hi, I tried this but it didn't work for me. I had to pass overwrite=True as an argument to the insertInto method: https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.insertInto But thanks for the clue, this rocks. – suhprano Dec 10 '18 at 23:16
  • https://stackoverflow.com/questions/54246038/what-is-the-fastest-way-to-property-update-hdfs-data-with-spark can you help me? – pavel_orekhov Jan 18 '19 at 00:15
  • Doesn't work with `insertInto` but works perfectly with `save`. – cph_sto Aug 21 '19 at 12:29
  • @MadhavaCarrillo Hi, are you aware of a setting to only delete the partition, instead of replacing it? Thanks – cph_sto Aug 22 '19 at 10:37
  • @cph_sto Not really, no. You can use drop partition from Hive: `alter table t1 drop if exists partition (p1=1);`, or even use comparators, like `alter table t drop partition (PARTITION_COL>1);` – Madhava Carrillo Aug 22 '19 at 13:18
  • @neerajbhadani You can use those in Hive, in a Hive CLI. – Madhava Carrillo Aug 22 '19 at 13:21
  • @suhprano I experienced that in Python, not in Scala. – Madhava Carrillo Aug 22 '19 at 13:22
  • @MadhavaCarrillo Hi, many thanks for confirming it. Actually, I had posted a question [here](https://stackoverflow.com/questions/57520227/how-to-delete-a-particular-month-from-a-parquet-file-partitioned-by-month) and someone answered with a similar answer, but I am using `external tables` and not `internal tables`. In addition to that, I am not on HIVE; instead I am using `Jupyter-PySpark`. I don't know how to do it there. – cph_sto Aug 22 '19 at 15:04
  • Note that your target table format has to support dynamic partitioning. If it does not and you apply the steps above, be aware that your table might be deleted! Learned that the hard way, though luckily I had made a backup beforehand. – Markus Jan 07 '22 at 14:02
21

Just FYI, for PySpark users: make sure to set overwrite=True in insertInto, otherwise the mode will be changed to append.

From the source code:

def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode(
        "overwrite" if overwrite else "append"
    ).insertInto(tableName)

This is how to use it:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.write.insertInto("partitioned_table", overwrite=True)

Or the SQL version works fine:

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

For the documentation, look here.

Ali Bey