
I'm having difficulty finding anything covering this.

This answer is similar: Get the same hash value for a Pandas DataFrame each time. I'm looking for the same logic, returning a sha256 repeatably when passing in a dataframe, but using Databricks / Spark dataframes rather than Pandas.

Thanks,

Steve Homer
  • What is your use case? What do you really want to achieve? This is not available out of the box as far as I know. You can hash each record, collect these hashes as an array to the driver if small enough, and hash that again (see the first sketch below). – Georg Heiler Sep 30 '20 at 21:42
  • The dataframes can be very large, i.e. 100 TB; you cannot feed such an input into the hash function in one go. If you need to compare dataframes you can set a unique identifier for each record, let's call it row_id (this could also be the hash of each row). Then you can compare the dataframes by comparing the two sets of row_ids, and add further criteria such as row count (see the second sketch below). – abiratsis Oct 01 '20 at 15:01
  • Unless the data is small enough to be collected on the driver, some kind of [reduce](https://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/Dataset.html#reduce(func:(T,T)=%3ET):T) function will be involved to enable parallelism. This function must be commutative and associative (see the third sketch below). Is it a hard requirement to use sha256? Or could it be any function? – werner Oct 04 '20 at 16:05
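
A minimal PySpark sketch of Georg Heiler's suggestion, assuming the per-row hashes fit in driver memory; the column separator and the sort are my additions to make the digest independent of row order and partitioning:

```python
import hashlib

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# sha256 each row: cast every column to string and concatenate with a
# separator so ("ab", "c") and ("a", "bc") hash differently. Note that
# concat_ws silently skips nulls, so null handling may need more care.
per_row = df.select(
    F.sha2(
        F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns]),
        256,
    ).alias("h")
)

# Sort the hashes so the final digest does not depend on partitioning
# or row order, collect them to the driver, and hash the concatenation.
row_hashes = [r.h for r in per_row.orderBy("h").collect()]
digest = hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()
print(digest)  # same dataframe contents -> same digest
```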
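
A sketch of abiratsis's comparison approach, using the per-row hash as the row_id; `with_row_id` and `same_contents` are illustrative names, not an existing API:

```python
from pyspark.sql import functions as F

def with_row_id(df):
    """Add a deterministic row_id column: the sha256 of all columns."""
    cols = [F.col(c).cast("string") for c in df.columns]
    return df.withColumn("row_id", F.sha2(F.concat_ws("||", *cols), 256))

def same_contents(df_a, df_b):
    """Compare two dataframes by their sets of row_ids."""
    a = with_row_id(df_a).select("row_id")
    b = with_row_id(df_b).select("row_id")
    # exceptAll (Spark 2.4+) respects multiplicity, so checking both
    # directions also covers the row-count criterion for duplicates.
    return a.exceptAll(b).count() == 0 and b.exceptAll(a).count() == 0
```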
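
Finally, a sketch along the lines of werner's comment. sha256 itself is neither commutative nor associative, so the reduce below combines the per-row hashes with addition modulo 2^256 instead (a substitution on my part, not something from the thread), and sha256 is applied once more at the end if a sha256-shaped digest is required:

```python
import hashlib

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Per-row sha256 as before, then interpret each hex digest as an integer.
per_row = df.select(
    F.sha2(
        F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns]),
        256,
    ).alias("h")
)

# Addition modulo 2**256 is commutative and associative, so RDD.reduce
# can combine partial results in any order across partitions. (XOR would
# also work, but identical duplicate rows would cancel each other out.)
combined = per_row.rdd.map(lambda r: int(r.h, 16)).reduce(
    lambda x, y: (x + y) % (1 << 256)
)
digest = hashlib.sha256(f"{combined:064x}".encode("utf-8")).hexdigest()
print(digest)
```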

0 Answers