
[I am new to PySpark. If this duplicates an existing question that I could not find, please point me to it. Thanks.]

I have a dataset where, out of every 4 consecutive values, the first one is valid but the remaining 3 are NaN. (This happened because the sampling rate in one column was one-fourth that of the other columns.) Something like:

ColA
-----
 3.4
 NaN
 NaN
 NaN
 6.3
 NaN
 NaN
 NaN

and so on. What is the most efficient, idiomatic PySpark way of replacing these three NaN values with the preceding valid value?

Efficiency is a consideration, as I have 3.8 billion rows with that pattern repeating (microsecond-resolution sensor readings).

Many thanks.

  • I think you can use window functions, but I would have to see the rest of the dataframe, not just one column. – pissall May 26 '18 at 10:13
  • The rest of the data is unrelated to this column. You can assume a few neighbouring columns with some non-NaN random values. Thanks. – ImranAli May 26 '18 at 10:29
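A minimal sketch of the window-function approach suggested in the comments, assuming the rows carry a monotonically increasing timestamp column (here called `ts`, a hypothetical name) that defines their order. Since `last(..., ignorenulls=True)` skips nulls but not NaNs, the NaNs are first mapped to null and then forward-filled over an ordered window:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy version of the data; `ts` is an assumed ordering column.
    df = spark.createDataFrame(
        [(1, 3.4), (2, float("nan")), (3, float("nan")), (4, float("nan")),
         (5, 6.3), (6, float("nan")), (7, float("nan")), (8, float("nan"))],
        ["ts", "ColA"],
    )

    # NaN -> null, because last(..., ignorenulls=True) skips nulls, not NaNs.
    cleaned = F.when(F.isnan("ColA"), None).otherwise(F.col("ColA"))

    # Forward fill: take the last non-null value up to the current row.
    w = Window.orderBy("ts").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    df_filled = df.withColumn("ColA", F.last(cleaned, ignorenulls=True).over(w))

    df_filled.show()

One caveat at this scale: an ordered window with no `partitionBy` pulls all rows into a single partition, which will not scale to 3.8 billion rows. In practice you would partition on a coarser key derived from the timestamp (e.g. an hour or day bucket) and handle fills across partition boundaries separately.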
