
Is there a way to compare two values of type double in PySpark, with a specified margin of error? Essentially similar to this post, but in PySpark.

Something like:

df = ...  # some DataFrame with two columns, RESULT1 and RESULT2

df = df.withColumn('compare', when(col('RESULT1') == col('RESULT2') +/- 0.05*col('RESULT2'), lit("match")).otherwise(lit("no match")))

But in a more elegant way?

ZygD
thentangler

2 Answers


You can use between as the condition:

df2 = df.withColumn(
    'compare',
    when(
        col('RESULT1').between(0.95*col('RESULT2'), 1.05*col('RESULT2')), 
        lit("match")
    ).otherwise(
        lit("no match")
    )
)
mck

You can also write the condition as |RESULT1 - RESULT2| <= 0.05 * RESULT2:

from pyspark.sql import functions as F

df1 = df.withColumn(
    'compare',
    F.when(
        F.abs(F.col('RESULT1') - F.col("RESULT2")) <= 0.05 * F.col("RESULT2"),
        F.lit("match")
    ).otherwise(F.lit("no match"))
)
blackbishop