29

I want to do something like this:

df.replace('empty-value', None, 'NAME')

Basically, I want to replace some value with NULL, but this function does not accept None. How can I do this?

talloaktrees

4 Answers

62

You can combine a when clause with a NULL literal and type casting as follows:

from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])

def replace(column, value):
    # keep the original value where it differs; everything else becomes NULL
    return when(column != value, column).otherwise(lit(None))

df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## |  x|   y|
## +---+----+
## |  1| foo|
## |  2|null|
## +---+----+

It doesn't introduce BatchPythonEvaluation, and because of that it should be significantly more efficient than using a UDF.
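
If you want to verify this yourself, you can inspect the physical plan with explain(); the exact operator name depends on your Spark version (newer releases call the Python-evaluation step BatchEvalPython), but no such step should appear here:

# no Python evaluation step should show up in the plan for this
# pure-SQL-expression version; a UDF-based version would add one
df.withColumn("y", replace(col("y"), "bar")).explain()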

zero323
9

This will replace empty-value with None in your name column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()

Output:

[Row(key=1, name=None), Row(key=2, name=u'something else')]

By using the old name as the first parameter in withColumn, it replaces the existing name column with the new one generated by the UDF.

Daniel Zolnai
  • I didn't think of trying UDFs, that seems to be the way to go – talloaktrees Apr 27 '16 at 21:02
  • 2
    Your code will run faster if you use native Spark functions rather than UDFs - see the other answers. (That's why they have more upvotes than the accepted answer) – RobinL Jan 15 '20 at 11:26
6

You could also simply use a dict for the first argument of replace. I tried it, and it seems to accept None as a replacement value.

df = df.replace({'empty-value':None}, subset=['NAME'])

Note that your 'empty-value' needs to be hashable.
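
For reference, here is a minimal end-to-end sketch (assuming a SparkSession named spark, as in modern PySpark; the exact behavior of replace with None may vary across Spark versions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'empty-value'), (2, 'something else')], ['key', 'NAME'])

# the dict maps each old value to its replacement; None produces NULL
df.replace({'empty-value': None}, subset=['NAME']).show()
## +---+--------------+
## |key|          NAME|
## +---+--------------+
## |  1|          null|
## |  2|something else|
## +---+--------------+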

Willem
5

The best alternative is to use when combined with a NULL literal. Example:

from pyspark.sql.functions import when, col

# rows that fail the condition get NULL, since there is no otherwise clause
df = df.withColumn('foo', when(col('foo') != 'empty-value', col('foo')))

If you want to replace several values with null, you can either use | inside the when condition or the powerful create_map function, as sketched below.
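
For example, here is an illustrative sketch of the multi-value case (the column name and values are made up; isin is a compact equivalent of chaining conditions with |):

from pyspark.sql.functions import when, col

# keep foo only where it matches none of the listed values;
# rows that fail the condition fall through to NULL
df = df.withColumn(
    'foo',
    when(~col('foo').isin('empty-value', 'n/a'), col('foo')))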

It is important to note that the worst way to solve this is with a UDF: UDFs give your code great versatility, but they come with a huge performance penalty.

Chp11