Say I have some data obtained from real-world measurements; it is therefore noisy, but otherwise shows multiple distinct "ranges" of operation. Each range could be seen as being best modeled by a specific function, so I also call them "functional ranges", for lack of a better term (apologies if I'm using the wrong terminology throughout this question; feel free to correct me).
I have attempted to model this kind of data with sim_signal_y, which demonstrates three ranges of operation, in the code below:
import math
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
from scipy import signal

def get_constructed_function():
    flat_level = 0.07
    r1_len = 100
    region_1_x = np.linspace(0, r1_len-1, r1_len) # 100 steps for range [0, 99], default endpoint=True
    region_1_y = np.repeat(flat_level, len(region_1_x))
    r2_len = 200
    r2_fact = 0.039
    r2_offsx = math.log(flat_level)/r2_fact # math.log defaults to natural log
    region_2_x = np.linspace(r1_len, r1_len+r2_len-1, r2_len)
    region_2_y = np.exp(r2_fact*(region_2_x+r2_offsx-r1_len))
    r2_ep = (region_2_x[-1], region_2_y[-1])   # end point of region 2
    r2_prep = (region_2_x[-2], region_2_y[-2]) # point just before the end point
    tanr2end = (r2_ep[1]-r2_prep[1])/(r2_ep[0]-r2_prep[0]) # approximate slope at the end of region 2
    r12_len = r1_len + r2_len
    r3_len = 700
    r3_fact = 0.067
    r3_xtan = ( tanr2end )/(-2*r3_fact*r3_fact)
    r3_y_xtan_nc = -(r3_fact*r3_xtan)**2
    r3_offsx = -r3_fact*( r2_ep[0] + abs(r3_xtan) )
    r3_yconst = r2_ep[1] - r3_y_xtan_nc
    region_3_x = np.linspace(r12_len, r12_len+r3_len-1, r3_len)
    region_3_y = r3_yconst-np.square(r3_fact*region_3_x+r3_offsx)
    all_x = np.concatenate( (region_1_x, region_2_x, region_3_x) )
    all_y = np.concatenate( (region_1_y, region_2_y, region_3_y) )
    return (all_x, all_y)
def noisify(in_y):
    # partly via https://stackoverflow.com/q/70216468
    num_elems = len(in_y)
    noise_orig_ampl = 1.0
    noise_y_first = np.random.default_rng().uniform(-noise_orig_ampl, noise_orig_ampl, num_elems)
    # low-pass the uniform white noise, so it looks more like measurement noise
    sos = signal.butter(N=2, Wn=1000, btype='lowpass', fs=4000, output='sos')
    noise_y_filt = signal.sosfilt(sos, noise_y_first)
    noise_y_filt_scaled = ( noise_y_filt / np.max(noise_y_filt) )*0.05
    ret_y = np.add(in_y, noise_y_filt_scaled)
    return ret_y
def get_sim_signal():
    func_orig_x, func_orig_y = get_constructed_function()
    ax.plot(func_orig_x, func_orig_y, ls='dashed', color='black') # comment this line out, to hide the "original" constructed function
    ret_y = noisify(func_orig_y)
    return func_orig_x, ret_y
matplotlib.rc('axes.formatter', useoffset=False) # nowork for 1e125
fig, ax = plt.subplots()
arr_x, sim_signal_y = get_sim_signal()
ax.plot(arr_x, sim_signal_y, color='teal')

# attempt to fit sim_signal
order = 30
z = np.polyfit(arr_x, sim_signal_y, deg=order)
p = np.poly1d(z) # class object for dealing with polynomials
ax.plot(arr_x, p(arr_x), color='darksalmon')

ax.get_yaxis().get_major_formatter().set_useOffset(False) # nowork for 1e125
ax.grid(True)
fig.canvas.draw_idle()
plt.show()
Basically, sim_signal_y in the example above can be seen to demonstrate these three ranges of operation, given that they are used a priori in its construction:
- Constant - for low x, the output values cluster around a constant value (taken to be 0.07 as an example in the code)
  - The noise is low enough that it is perceptible that this region tends to be constant
- Exponential - a region of fast exponential growth, followed by
- Quadratic - a region of slower, quadratic growth
So, to remove the influence of noise, the code above also shows that I've tried to model this data with polyfit. Overall (seen over the whole range), for order=30, the modelling looks quite good:
... however, the problem is that for the low values in the constant range, the model is way off, as it oscillates (green/teal is the signal; pink/orange is the model):
... and also, the oscillations are quite visible at the end of the final region:
I am aware that it is likely impossible to model this kind of data in its entirety; however, I would like to find an approach that would allow me to specify different requirements per region - such as the following:
- Model the constant region with (as close as possible to) a "flat" line, at a value as close as possible to the "real" one (which in this example is a priori 0.07)
- Model the final part of the data (say, from 950-1000 on the abscissa in this example) as closely as possible (i.e. the model line should pass through as many data points in this range as possible, and definitely land on the end point)
- Model at least the start of the exponential growth region with a decent (though not necessarily best) approximation
- Larger errors in the other remaining regions are tolerable
Is there a modelling approach like this, that I could use in Python (matplotlib/numpy/scipy/etc)?
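One partial idea in this direction, for reference: both np.polyfit and np.polynomial.Polynomial.fit accept a w= argument of per-sample weights, which can at least emphasize the regions listed above. Below is a minimal self-contained sketch of that idea - note that it uses a simplified piecewise toy signal standing in for sim_signal_y, arbitrarily chosen weight values, and Polynomial.fit rather than np.polyfit (the former rescales x internally, which keeps a deg=30 fit better conditioned) - but I don't know whether this is the right way to express such requirements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for sim_signal_y: flat -> exponential -> quadratic, roughly continuous
x = np.arange(1000, dtype=float)
y = np.piecewise(
    x,
    [x < 100, (x >= 100) & (x < 300), x >= 300],
    [0.07,
     lambda t: 0.07 * np.exp(0.039 * (t - 100)),
     lambda t: 0.07 * np.exp(0.039 * 200) + 6.7 * (t - 300) + 0.005 * (t - 300) ** 2],
)
y_noisy = y + rng.normal(scale=0.05, size=x.size)

# Per-sample weights: strongly favour the constant region and the last 50 samples
w = np.ones_like(x)
w[x < 100] = 100.0   # push the fit towards "flat" at low x
w[x >= 950] = 100.0  # push the fit through the final samples

p_w = np.polynomial.Polynomial.fit(x, y_noisy, deg=30, w=w)  # weighted fit
p_u = np.polynomial.Polynomial.fit(x, y_noisy, deg=30)       # unweighted, for comparison

flat = x < 100
print("flat-region max error, weighted:  ", np.max(np.abs(p_w(x[flat]) - 0.07)))
print("flat-region max error, unweighted:", np.max(np.abs(p_u(x[flat]) - 0.07)))
print("end-point error, weighted:", abs(p_w(x[-1]) - y_noisy[-1]))
```

This reduces the oscillation in the weighted regions at the cost of larger errors elsewhere, but it doesn't let me state "flat" or "passes through the end point" as hard constraints, which is why I'm asking whether a more principled approach exists.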