[I am new to PySpark. If this is a duplicate of some existing question, though I could not find one, please point me to it. Thanks.]
I have a dataset where, out of every 4 consecutive values, the first one is valid but the remaining 3 are NaN. (This happened because the sampling rate in one column was one-fourth that of the other columns.) Something like:
ColA
-----
3.4
NaN
NaN
NaN
6.3
NaN
NaN
NaN
and so on. What is the most efficient, PySpark-idiomatic way of replacing these three NaN values with the preceding valid value?
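To make the desired output concrete, here is the transformation sketched in plain Python (a reference implementation only, not the Spark solution I am after; the function name `forward_fill` is just for illustration):

```python
import math

def forward_fill(values):
    """Replace each NaN with the most recent preceding non-NaN value."""
    filled = []
    last_valid = math.nan  # leading NaNs (if any) would stay NaN
    for v in values:
        if not math.isnan(v):
            last_valid = v  # remember the latest valid reading
        filled.append(last_valid)
    return filled

col_a = [3.4, math.nan, math.nan, math.nan, 6.3, math.nan, math.nan, math.nan]
print(forward_fill(col_a))  # [3.4, 3.4, 3.4, 3.4, 6.3, 6.3, 6.3, 6.3]
```

So every NaN should be filled with the last valid value above it, but done in a distributed, PySpark-native way rather than a sequential Python loop.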
Efficiency is a real concern, as I have 3.8 billion rows with this pattern repeating (microsecond-resolution sensor readings).
Many thanks.