
I am trying to replicate the SQL database feature of maintaining primary keys using the Databricks Delta approach, where the data is written to blob storage such as ADLS Gen2 or AWS S3.

I want an auto-incremented primary key feature using Databricks Delta.

Existing approach - read the latest row count (or maximum key) from the table and assign new primary keys from there, roughly as sketched below. However, this approach does not work in a parallel processing environment, where concurrent writers end up generating duplicate primary keys.
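
A minimal sketch of that row-count style approach, assuming a Databricks notebook where `spark` is available and `incoming_df` is the new batch to append; the table path and column names are hypothetical. Two jobs running this concurrently can both read the same maximum key, which is where the duplicates come from.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Hypothetical Delta table on blob storage (ADLS Gen2 / S3 mount).
table_path = "/mnt/adls2/my_table"

# Read the current maximum key; 0 if the table is empty.
existing = spark.read.format("delta").load(table_path)
max_key = existing.agg(F.max("pk")).collect()[0][0] or 0

# Number the incoming rows and offset by the current maximum.
# Note: a window with no partitioning funnels all rows through one task.
new_rows = (incoming_df
            .withColumn("rn", F.row_number().over(Window.orderBy(F.lit(1))))
            .withColumn("pk", F.col("rn") + F.lit(max_key))
            .drop("rn"))

new_rows.write.format("delta").mode("append").save(table_path)
```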

mn0102
  • Possible duplicate of [Primary keys with Apache Spark](https://stackoverflow.com/questions/33102727/primary-keys-with-apache-spark) – simon_dmorias Aug 27 '19 at 14:50
  • I've flagged as duplicate. This isn't a Databricks Delta issue - rather a Spark in general issue. Ideally I would not use an incremental key - they don't work in a distributed world. Instead try a guid - or look at a function called monotonicallyIncreasingId. – simon_dmorias Aug 27 '19 at 14:52
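
A minimal sketch of the two alternatives suggested in the comment above, again assuming a Databricks notebook with `spark`, an input DataFrame `incoming_df`, and a hypothetical table path. Neither option produces gap-free, consecutive keys; they only guarantee uniqueness.

```python
from pyspark.sql import functions as F

table_path = "/mnt/adls2/my_table"  # hypothetical Delta path

# Option 1: a GUID surrogate key via Spark SQL's uuid() function.
# Globally unique and safe under concurrent writes, but not sequential.
with_guid = incoming_df.withColumn("pk", F.expr("uuid()"))

# Option 2: monotonically_increasing_id() assigns unique, increasing
# 64-bit ids within a DataFrame, but the values are not consecutive
# and are not stable across separate runs or jobs.
with_mono = incoming_df.withColumn("pk", F.monotonically_increasing_id())

with_guid.write.format("delta").mode("append").save(table_path)
```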

0 Answers