I am fitting several models to data sets of unknown size. The models range from linear and quadratic models to ODEs; however, the parameter identification is always linear, and I use OLS. The fitted parameters are stored at the end, because they reflect crucial properties of a system. During manual validation of the approaches I relied heavily on graphical checks such as QQ-plots and histograms to check whether the residuals, and thus the goodness of fit, are reasonable.
However, graphical inspection will no longer be possible, because I will receive a lot of data sets. Each data set describes a different system, and the data sets can also vary considerably in size. That is why I am exploring which characteristic quantities I could store to describe the quality of the fits within one data set, so that I can pinpoint data sets whose fit differs markedly from the others and investigate them in more detail. I cannot do this by looking at the identified parameters themselves, as they are allowed to differ between systems. I want to identify parameters that might deviate because of a bad fit and then analyze the corresponding data set. For example, I will store the $R^2$ value of each fit.
I also tried relying on the p-value of a chi-squared test of the hypothesis that the residuals of the fit follow a normal distribution. I ran into the problem that the chi-squared test does not give me reliable and reproducible results: quite minor changes in the data change the outcome, especially if the data set is large, as discussed in "Is normality testing 'essentially useless'?".
Basically, I want to know from the residuals whether they contain anomalies or trends that are not captured by the $R^2$ value and could thus degrade or invalidate the specific parameter. I was thinking about calculating higher moments such as skewness and kurtosis as well, but I am not sure whether these values will suffice. I also thought about fitting linear or quadratic models to the residuals themselves and using the fitted parameters to judge whether trends or anomalies occur, but I am not sure whether that is overkill.
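To make the idea concrete, here is a minimal sketch of the kind of per-fit summary I have in mind. It uses only the Python standard library and assumes the residuals are given in observation order. The function name `residual_summary` and the choice of quantities are purely illustrative, not an established recipe. Since OLS residuals are orthogonal to the regressors that are already in the model, the sketch looks at curvature and serial correlation along the observation order rather than at a linear trend in an included regressor.

```python
import math


def residual_summary(residuals):
    """Scale-free summaries of a residual vector (hypothetical helper).

    Assumes at least three residuals, in observation order, with
    non-zero variance. All returned values are dimensionless, so they
    can be compared across data sets of different size.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]

    # Central moments (population versions, dividing by n).
    m2 = sum(c ** 2 for c in centered) / n
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n

    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution

    # Lag-1 autocorrelation: neighbouring residuals that move together
    # hint at a trend or missing dynamics.
    lag1 = sum(a * b for a, b in zip(centered[:-1], centered[1:])) / (n * m2)

    # Correlation between the residuals and the squared, centered
    # observation index: a crude indicator of a U-shaped pattern.
    pos_sq = [(i - (n - 1) / 2) ** 2 for i in range(n)]
    pos_sq_mean = sum(pos_sq) / n
    pos_sq_c = [p - pos_sq_mean for p in pos_sq]
    cov = sum(c * p for c, p in zip(centered, pos_sq_c)) / n
    pos_sq_var = sum(p ** 2 for p in pos_sq_c) / n
    curvature_corr = cov / math.sqrt(m2 * pos_sq_var)

    return {
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
        "lag1_autocorr": lag1,
        "curvature_corr": curvature_corr,
    }


if __name__ == "__main__":
    # Residuals with a pure U-shape along the observation order.
    u_shaped = [(i - 50) ** 2 for i in range(101)]
    print(residual_summary(u_shaped))
```

Each fit would then contribute one small dictionary, and data sets whose values stand out from the rest could be flagged for manual inspection.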
Furthermore, I tried to understand the idea behind directed tests as proposed in https://stats.stackexchange.com/a/30053 to see whether they could be an applicable solution, but I was not able to grasp it.
Do you know any concepts or quantities that could meet my needs?
I was also considering using the test statistics themselves, but this will not work: for example, parameter X may be fitted from 500 data tuples in one data set and from 2000 data tuples in another. The absolute test statistic will therefore not be of much help, as I need something normalized.
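The size dependence can be illustrated with a small sketch. It swaps in the Jarque-Bera statistic, $JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)$ with sample skewness $S$ and excess kurtosis $K$, instead of the chi-squared test I actually used, simply because it is easy to write down. The data are made up: the same five residual values are repeated to obtain samples of 500 and 2000 points with an identical shape.

```python
def jarque_bera(residuals):
    """Jarque-Bera statistic n/6 * (S^2 + K^2/4), population moments."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    m2 = sum(c ** 2 for c in centered) / n
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)


# Made-up, clearly skewed residual pattern (illustrative values only).
base = [-1.0, -0.5, 0.0, 0.2, 3.0]
small = base * 100  # 500 residuals
large = base * 400  # 2000 residuals with exactly the same shape

jb_small = jarque_bera(small)
jb_large = jarque_bera(large)

# The raw statistic grows in proportion to the sample size ...
print("ratio of raw statistics:", jb_large / jb_small)
# ... while the statistic divided by n is the same for both samples.
print("statistic per observation:", jb_small / len(small), jb_large / len(large))
```

Dividing by $n$ leaves a quantity that depends only on the shape of the residual distribution, which is the kind of normalization I am after. Whether something analogous is sensible for the chi-squared statistic (for example $\sqrt{\chi^2 / n}$, known as Cohen's $w$) is part of my question.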
– bluhub Jan 15 '23 at 19:37