I have a set of datasets with sequential measurements. Since these sets are quite large (>80,000 measurements), I decided to simplify them by applying a Simple Moving Average (SMA) and then keeping one value every n measurements.

Each set belongs to a patient and we want to see the effect of a certain lifestyle on the parameter we are measuring, as described in this question.

However, the sets contain missing values, so the SMA cannot be applied directly.

How should I treat the missing values? I thought of two solutions: eliminating the missing values, or substituting each one with the previous value, on the assumption that, since consecutive measurements are biologically linked, a value is not exceptionally different from the preceding one. Another solution is to estimate the missing value by averaging the values immediately before and after it.
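
To make the three options concrete, here is a minimal pandas sketch. The series `values`, the window of 3 and the step of 3 are made-up placeholders, not my real data or settings, and linear interpolation only equals the mean of the two neighbours when a single value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical one-per-second readings; NaN marks a missing measurement.
values = pd.Series([10.0, 11.0, np.nan, 13.0, np.nan, 15.0])

# Option 1: eliminate the missing values (this shortens the series).
dropped = values.dropna()

# Option 2: substitute each missing value with the previous one.
carried = values.ffill()

# Option 3: linear interpolation; for an isolated gap this is the
# average of the value before and the value after.
averaged = values.interpolate(method="linear")

# Simple moving average over 3 samples, then keep every 3rd point.
sma = averaged.rolling(window=3).mean()
thinned = sma.iloc[2::3]
```

Dropping values changes the spacing between samples, so a fixed window no longer covers a fixed time span; the other two options keep the original time grid.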

So, which solution is better? Do you suggest other solutions?

Bakaburg

1 Answer

I would ask yourself the following questions:

  1. How often do the missing values occur?
  2. Are the missing values biased to a particular characteristic that would materially change your results?

Once you have answered these questions, you can decide whether to exclude the missing values or adjust for them. For example, if they occur rarely and do not seem to be biased in any way (that is, they occur at random), I would probably exclude them and document why.
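
A rough way to answer the first question, and to get a hint about the second, is to measure how many values are missing and whether they cluster into long gaps. This is only a sketch with made-up data in a hypothetical series `values`; long runs of missing values are a warning sign, not a formal test of randomness.

```python
import numpy as np
import pandas as pd

# Hypothetical readings for one patient; NaN marks a missing measurement.
values = pd.Series([5.0, np.nan, 5.2, 5.1, np.nan, np.nan, 5.4, 5.3])

is_missing = values.isna()

# Question 1: share of samples that are missing.
missing_fraction = is_missing.mean()

# Give every run of consecutive identical flags its own id,
# then count the length of each run that consists of missing values.
run_id = is_missing.astype(int).diff().ne(0).cumsum()
gap_lengths = is_missing[is_missing].groupby(run_id[is_missing]).size()
longest_gap = int(gap_lengths.max()) if not gap_lengths.empty else 0

print(f"missing: {missing_fraction:.1%}, longest gap: {longest_gap} samples")
```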

If you want to get more sophisticated, you can build a separate model to try to fill in these values. However, the cost of this may not be worth the benefit.

Andre Silva
  • The number of missing values varies from subject to subject, from 0 to 2% of values. We suppose the missing values are simply due to the machine failing to record a value. With one value per second, a single measurement doesn't have much influence, and there are surely overestimation errors alongside the NA values. That's why we are averaging by minute with SMA. – Bakaburg Jan 14 '14 at 19:50
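
With one value per second and at most about 2% missing, one possible approach is a per-minute moving average that simply skips the missing samples within each window. The sketch below uses simulated data; the 60-sample window matches the one-minute averaging mentioned in the comment, while the 54-sample threshold (90% of the window) is an arbitrary assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Simulated 3 minutes of one-per-second readings (hypothetical values).
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(70.0, 2.0, size=180))
values.iloc[[10, 11, 95]] = np.nan  # a few missing measurements

# 60-sample moving average; NaN samples are ignored inside a window,
# and a window yields NaN unless at least 54 samples are valid.
sma = values.rolling(window=60, min_periods=54).mean()

# Keep one averaged value per minute (the end of each full window).
per_minute = sma.iloc[59::60]
```

Each retained value is the mean of the valid samples in that minute, so no imputation is needed as long as gaps stay short relative to the window.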