29

I want to do something like this:

df.replace('empty-value', None, 'NAME')

Basically, I want to replace some value with NULL, but this function does not accept None. How can I do this?

talloaktrees

4 Answers

62

You can combine a when clause with a NULL literal and type casting as follows:

from pyspark.sql.functions import when, lit, col

df = sc.parallelize([(1, "foo"), (2, "bar")]).toDF(["x", "y"])

def replace(column, value):
    # keep the original value where it differs; everything else becomes NULL
    return when(column != value, column).otherwise(lit(None))

df.withColumn("y", replace(col("y"), "bar")).show()
## +---+----+
## |  x|   y|
## +---+----+
## |  1| foo|
## |  2|null|
## +---+----+

It doesn't introduce BatchPythonEvaluation, and because of that it should be significantly more efficient than using a UDF.
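
If you want to verify this yourself, you can inspect the physical plan with explain(); the exact operator name depends on your Spark version (newer releases call the Python-evaluation step BatchEvalPython), but no such step should appear here:

# no Python evaluation step should show up in the plan for this
# pure-SQL-expression version; a UDF-based version would add one
df.withColumn("y", replace(col("y"), "bar")).explain()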

zero323
9

This will replace empty-value with None in your name column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


df = sc.parallelize([(1, "empty-value"), (2, "something else")]).toDF(["key", "name"])
new_column_udf = udf(lambda name: None if name == "empty-value" else name, StringType())
new_df = df.withColumn("name", new_column_udf(df.name))
new_df.collect()

Output:

[Row(key=1, name=None), Row(key=2, name=u'something else')]

By using the old name as the first parameter in withColumn, it replaces the existing name column with the new one generated by the UDF.

Daniel Zolnai
  • I didn't think of trying UDFs, that seems to be the way to go – talloaktrees Apr 27 '16 at 21:02
  • 2
    Your code will run faster if you use native Spark functions rather than UDFs - see the other answers. (That's why they have more upvotes than the accepted answer) – RobinL Jan 15 '20 at 11:26
6

You could also simply use a dict for the first argument of replace. I tried it, and it seems to accept None as a replacement value.

df = df.replace({'empty-value':None}, subset=['NAME'])

Note that your 'empty-value' needs to be hashable.
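
For reference, here is a minimal end-to-end sketch (assuming a SparkSession named spark, as in modern PySpark; the exact behavior of replace with None may vary across Spark versions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 'empty-value'), (2, 'something else')], ['key', 'NAME'])

# the dict maps each old value to its replacement; None produces NULL
df.replace({'empty-value': None}, subset=['NAME']).show()
## +---+--------------+
## |key|          NAME|
## +---+--------------+
## |  1|          null|
## |  2|something else|
## +---+--------------+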

Willem
5

The best alternative is to use when combined with a NULL literal. Example:

from pyspark.sql.functions import when, col

# rows that fail the condition get NULL, since there is no otherwise clause
df = df.withColumn('foo', when(col('foo') != 'empty-value', col('foo')))

If you want to replace several values with null, you can either use | inside the when condition or the powerful create_map function, as sketched below.
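
For example, here is an illustrative sketch of the multi-value case (the column name and values are made up; isin is a compact equivalent of chaining conditions with |):

from pyspark.sql.functions import when, col

# keep foo only where it matches none of the listed values;
# rows that fail the condition fall through to NULL
df = df.withColumn(
    'foo',
    when(~col('foo').isin('empty-value', 'n/a'), col('foo')))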

It is important to note that the worst way to solve this is with a UDF: UDFs give your code great versatility, but they come with a huge performance penalty.

Chp11