
[I am new to PySpark. If this duplicates an existing question that I could not find, please point me to it. Thanks.]

I have a dataset where, out of every 4 consecutive values, the first one is valid but the remaining 3 are NaN. (This happened because the sampling rate in one column was one-fourth that of the other columns.) Something like:

ColA
-----
 3.4
 NaN
 NaN
 NaN
 6.3
 NaN
 NaN
 NaN

and so on. What is the most efficient, idiomatic PySpark way of replacing these three NaN values with the preceding valid value?

Efficiency is a consideration, as I have 3.8 billion rows with that pattern repeating (microsecond-resolution sensor readings).

Many thanks.

  • I think you can use window functions, but I would have to see the rest of the dataframe, not just one column. – pissall May 26 '18 at 10:13
  • The rest of the data is unrelated to this column. You can assume a few neighbouring columns with some non-NaN random values. Thanks. – ImranAli May 26 '18 at 10:29
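A minimal sketch of the window-function approach suggested in the comments, assuming the rows carry a monotonically increasing timestamp column (here called `ts`, a hypothetical name) that defines their order. Since `last(..., ignorenulls=True)` skips nulls but not NaNs, the NaNs are first mapped to null and then forward-filled over an ordered window:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy version of the data; `ts` is an assumed ordering column.
    df = spark.createDataFrame(
        [(1, 3.4), (2, float("nan")), (3, float("nan")), (4, float("nan")),
         (5, 6.3), (6, float("nan")), (7, float("nan")), (8, float("nan"))],
        ["ts", "ColA"],
    )

    # NaN -> null, because last(..., ignorenulls=True) skips nulls, not NaNs.
    cleaned = F.when(F.isnan("ColA"), None).otherwise(F.col("ColA"))

    # Forward fill: take the last non-null value up to the current row.
    w = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df_filled = df.withColumn("ColA", F.last(cleaned, ignorenulls=True).over(w))

    df_filled.show()

One caveat at this scale: an ordered window with no `partitionBy` pulls all rows into a single partition, which will not scale to 3.8 billion rows. In practice you would partition on a coarser key derived from the timestamp (e.g. an hour or day bucket) and handle fills across partition boundaries separately.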
