Getting rid of sparks in sample data

Question

Possible Duplicate:
Simple algorithm for online outlier detection of a generic time series
Getting rid of spikes in sample data

How could I get rid of sparky (i.e. spiky) values in a discrete data set, but in a "smoothed out" manner?

Take, for instance, the following data:

[image: plot of the sample data]

There are two sparks at about 20000, but the next-highest value, at 600, is also considered a spark.

I've managed to bring the very high ones down to zero with:

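# Shape parameters of the beta distribution used as a weighting curve.
a = 2
b = 5
beta_dist = RealDistribution('beta', [a, b])

# Rescale the raw values into [0, 1]; 19968 is hard-coded here and is
# presumably the largest value in insertions (the list of raw samples).
f(x) = x / 19968
normalized_insertions = [f(i) for i in insertions]

# Pair each normalized value with the beta density (pdf) at that point.
# The density is 0 at x = 1, so the largest values are mapped to zero.
insertions_pairs = [(i, beta_dist.distribution_function(i)) for i in normalized_insertions]
plot_b = beta_dist.plot()

show(list_plot(insertions_pairs) + plot_b)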

I have no idea how to handle the lower ones. The maximum should be reached at 100; perhaps the parameters of the beta distribution need a little more tweaking?
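
For reference, the mode of a Beta(a, b) density with a, b > 1 lies at (a - 1)/(a + b - 2). With a = 2 and b = 5 that is 0.2, which corresponds to roughly 3994 on the original scale, not 100. The following is a minimal sketch in plain Python, assuming the scale factor 19968 from the code above; the helper names `beta_mode` and `b_for_mode` are hypothetical and not part of Sage.

```python
def beta_mode(a, b):
    """Mode of a Beta(a, b) density; valid only for a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1.0) / (a + b - 2.0)


def b_for_mode(a, mode):
    """Solve (a - 1) / (a + b - 2) = mode for b, keeping a fixed."""
    if a <= 1 or not 0.0 < mode < 1.0:
        raise ValueError("need a > 1 and 0 < mode < 1")
    return (a - 1.0) / mode - a + 2.0


SCALE = 19968.0  # normalization constant taken from the question's code

# Where the current parameters (a = 2, b = 5) put the peak, on the original scale.
current_peak = beta_mode(2, 5) * SCALE

# Value of b that moves the peak to 100 on the original scale, with a = 2 fixed.
target_b = b_for_mode(2, 100.0 / SCALE)
```

With a = 2 this gives b of about 199.68, which is a very sharply peaked density. Whether such a weighting is appropriate for the data is a separate question.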

Currently, it looks like this: [image: plot of the weighted data]

If possible, please use Sage as a reference for your explanations.

You say that the maximum should be 100, but you don't want to delete the data points? Are the spikes important for whatever you are doing? You could take the natural log or log10 of all the numbers and work with the log-transformed data. — pgericson, Sep 14 '12 at 11:42
I agree with pgericson that you would have to eliminate the spikes, or impute lower values for them, in order to get a fit with a maximum at 100. I am not completely clear about your exact approach. But if you think the spikes are errors that should be imputed or deleted, you should have a firm, rational basis for modifying your data. If you impute values for the spikes, what is your imputation method and how do you justify it?
You need to explain precisely what you are doing and supply your rationale before we can intelligently evaluate what you have done, or even suggest a better approach. — Michael R. Chernick, Sep 14 '12 at 15:08
I don't see why this was migrated. I thought it was closed on CV as a duplicate. While many statistics-related questions appear on both Mathematics and CV, I do not see anything to suggest that it is more suitable for Mathematics. — Michael R. Chernick, Sep 14 '12 at 17:07
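
As a rough illustration of the log transform pgericson suggests in the comments above, here is a minimal sketch in plain Python. The `sample` values are hypothetical and only mimic the description in the question; the `offset` parameter is an added assumption that keeps zero values finite.

```python
import math


def log10_transform(values, offset=1.0):
    """Return log10(v + offset) for each value; offset keeps zeros finite."""
    return [math.log10(v + offset) for v in values]


# Hypothetical sample: mostly small values, one near 600, two near 20000.
sample = [12, 40, 95, 100, 600, 20000, 19968, 7]
transformed = log10_transform(sample)
```

On the log scale the two largest values sit only a few units above the rest instead of dominating the plot, and no data points are deleted.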

