21

Spark 2.4 introduced new useful Spark SQL functions involving arrays but I was a little bit puzzled when I find out that the result of: select array_remove(array(1, 2, 3, null, 3), null) is null and not [1, 2, 3, 3].

Is this an expected behavior? Is it possible to remove nulls using array_remove?

As a side note, for now the alternative I am using is a higher order function in databricks:

select filter(array(1, 2, 3, null, 3), x -> x is not null)

datapug
  • 1,991
  • 1
  • 12
  • 30
  • 3
    The alternative is the way to go. `array_remove` depends on notion of equality, and [equality with `NULL` is undefined](https://stackoverflow.com/q/41533290/6910411). – zero323 Jan 12 '19 at 13:46

4 Answers4

17

To answer your first question, Is this an expected behavior? , Yes. Because the official notebook(https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html) points out "Remove all elements that equal to the given element from the given array." and NULL corresponds to undefined values & the results will also not defined.

So,I think NULL s are out of the purview of this function.

Better you found out a way to overcome this, you can also use spark.sql("""SELECT array_except(array(1, 2, 3, 3, null, 3, 3,3, 4, 5), array(null))""").show() but the downside is that the result will be without duplicates.

Sarath Chandra Vema
  • 763
  • 1
  • 6
  • 12
9

You can do something like this in Spark 2:

import org.apache.spark.sql.functions._
import org.apache.spark.sql._

/**
  * Array without nulls
  * For complex types, you are responsible for passing in a nullPlaceholder of the same type as elements in the array
  */
def non_null_array(columns: Seq[Column], nullPlaceholder: Any = "רכוב כל יום"): Column =
  array_remove(array(columns.map(c => coalesce(c, lit(nullPlaceholder))): _*), nullPlaceholder)

In Spark 3, there is new array filter function and you can do:

df.select(filter(col("array_column"), x => x.isNotNull))
pathikrit
  • 30,920
  • 35
  • 134
  • 215
5

https://docs.databricks.com/_static/notebooks/apache-spark-2.4-functions.html

array_remove(array, T): array Remove all elements that equal to the given element from the given array.

Note: I only referred the documentation and they have taken the same data. **null can never be equal to null.

Shubhanshu
  • 164
  • 2
  • 11
-1

I don't think you can use array_remove() or array_except() for your problem. However, though it's not a very good solution, but it may help.

@F.udf("array<string>")
def udf_remove_nulls(arr):
    return [i for i in arr if i is not None]

df = df.withColumn("col_wo_nulls", udf_remove_nulls(df["array_column"]))
Loochie
  • 2,066
  • 10
  • 18