
I am fitting several models to data sets of varying, a priori unknown size. The models range from linear and quadratic to ODE-based; however, the parameter identification is always linear and I am using OLS. The model parameters are stored at the end, as they reflect crucial properties of a system. During the manual validation of the approaches I relied heavily on graphical checks such as QQ-plots and histograms to check whether the residuals, and thus the goodness of fit, are reasonable.

However, graphical inspection will no longer be possible, as I will receive many data sets, each describing a different system, and the data sets can also vary quite a lot in size. That is why I am exploring storing characteristic quantities that describe the quality of the fits within one data set, so that I can pinpoint data sets and investigate in more detail why a fit differs markedly from those of the other data sets. I cannot do this by looking at the identified parameters either, as they are allowed to differ. I want to identify parameters that might vary due to bad fits and subsequently analyze the data set. For example, I will store the $R^2$-value of the fits.
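To make the idea concrete, here is a minimal sketch of computing and storing $R^2$ per fit with NumPy; `fit_ols_r2` is a hypothetical helper name, not an existing API:

```python
import numpy as np

def fit_ols_r2(x, y):
    """OLS fit with intercept; returns parameters and R^2 (hypothetical helper)."""
    X = np.column_stack([np.ones(len(x)), x])      # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares solution
    resid = y - X @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Synthetic example: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 2.0 + 3.0 * x + rng.normal(0, 1, 500)
beta, r2 = fit_ols_r2(x, y)
```

Storing one such `r2` per fit per data set gives a first, cheap screening quantity, even though (as noted in the comments below) $R^2$ alone says nothing about violated model assumptions.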

I also tried relying on the p-value of a chi-squared goodness-of-fit test, with the hypothesis that the residuals of the fit follow a normal distribution. I ran into the problem that the chi-squared test does not give me reliable and reproducible results under quite minor changes of the data, especially if the data set is large - as discussed in Is normality testing 'essentially useless'?.
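The large-sample problem can be reproduced in a few lines. The sketch below uses SciPy's D'Agostino-Pearson `normaltest` as a stand-in for the chi-squared test (an assumption on my part, not the exact test from the question): residuals with a practically negligible deviation from normality pass at n = 500 but are flagged at n = 20000.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# t(30) is very close to normal (excess kurtosis ~0.23):
# a deviation that is usually irrelevant in practice.
small = rng.standard_t(df=30, size=500)
large = rng.standard_t(df=30, size=20000)

# At large n the test detects even this tiny deviation.
p_small = stats.normaltest(small).pvalue
p_large = stats.normaltest(large).pvalue
```

This is why a p-value is a poor quantity to store across data sets of different sizes: it mixes the size of the deviation with the sample size.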

Basically, I want to know by looking at the residuals whether there are anomalies or trends in them that are not explained by the $R^2$-value and could thus worsen or invalidate this specific parameter. I was also thinking about calculating higher moments such as skewness and kurtosis, but I am not sure whether these values will suffice. I also thought about fitting linear or quadratic models to the residuals themselves and using the parameters to judge occurring trends or anomalies, but I am not sure if that is overkill.
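Both ideas (higher moments and trend fits on the residuals) are cheap to compute on standardized residuals, which makes the resulting numbers comparable across data sets. A sketch of one possible summary, with `residual_summary` as a hypothetical helper name:

```python
import numpy as np
from scipy import stats

def residual_summary(x, resid):
    """Scale-free residual diagnostics (one possible set, not the only one)."""
    z = (resid - resid.mean()) / resid.std(ddof=1)  # standardize residuals
    # Higher moments: both ~0 for well-behaved (normal) residuals.
    skew = stats.skew(z)
    kurt = stats.kurtosis(z)  # excess kurtosis
    # Trend check: quadratic fit of standardized residuals vs. predictor;
    # clearly nonzero coefficients hint at systematic structure.
    quad, lin, _ = np.polyfit(x, z, 2)
    # Lag-1 autocorrelation (Durbin-Watson-like idea for ordered data).
    ac1 = np.corrcoef(z[:-1], z[1:])[0, 1]
    return {"skew": skew, "kurtosis": kurt,
            "trend_quad": quad, "trend_lin": lin, "acf1": ac1}

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 1000))
good = rng.normal(0, 1, 1000)         # well-behaved residuals
bad = good + 0.05 * (x - 5) ** 2      # residuals hiding a quadratic trend
s_good = residual_summary(x, good)
s_bad = residual_summary(x, bad)
```

Here `s_bad["trend_quad"]` comes out clearly larger in magnitude than `s_good["trend_quad"]`, while the plain moments of `bad` barely change, which suggests the trend fit is not overkill but complementary to skewness/kurtosis.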

Furthermore, I tried to understand the idea behind directed tests as proposed in https://stats.stackexchange.com/a/30053 and whether this could be an applicable solution, but I was not able to grasp the idea behind it.

Do you know of any concepts or quantities that could meet my needs?

bluhub
  • "graphical inspection will not be possible anymore" - are you sure? There are lots of visualisations for big data sets, see, e.g., here (just to give an example): https://towardsdatascience.com/how-to-create-fast-and-accurate-scatter-plots-with-lots-of-data-in-python-a1d3f578e551 – Christian Hennig Jan 13 '23 at 22:27
  • I actually think so. I guess I was not precise enough in describing my particular use case. In the end I will receive a lot of different data sets from different systems to which I will apply the fits. I need to identify data sets, and subsequently the fits within them, which might be suspicious or deviate significantly from the fits of other data sets. I cannot do this by just looking at the identified parameters, as they are allowed to vary. I rather want to identify varying parameters due to suspicious fits – bluhub Jan 14 '23 at 07:48
  • This is a hard question, and it'll depend on details. How many data sets are we talking about, and how big are the individual data sets? There are formal tests of model assumptions that you could run. To what extent these are useful depends on the individual sample sizes - if samples are too large, tests will always reject even if assumptions are not violated in a critical way. Still in some cases the value of the test statistic (for example Kolmogorov-Smirnov) may be informative if not just to simply "reject" or "accept". – Christian Hennig Jan 15 '23 at 10:31
  • There is also robust outlier identification, but if individual samples are too large that may be computationally prohibitive, and if dimensions are too high, it may not work that well. In my view this is an interesting problem, but beyond the level where one can expect to get a free solution by strangers on the internet. – Christian Hennig Jan 15 '23 at 10:34
  • The chi-squared test, by the way, isn't a very good normality test and doesn't even provide a good normality diagnostic (which may be of interest even when one understands the problem with standard normality testing). $R^2$ doesn't contain information about model assumptions; rather, it's about the signal-to-noise ratio assuming the standard model (although as such it may still be of interest). – Christian Hennig Jan 15 '23 at 10:37
  • The number of data sets will be ~4k. The size of an individual data set is again not fixed and can vary quite a lot. But to give some sort of feeling: a fit is done based on a minimum of 500 data tuples and a maximum of 20k data tuples.

    I was also considering using the test statistics, but this will not work as-is: for example, a fit of parameter X can be based on 500 data tuples in one data set and on 2000 data tuples in another. So using the absolute test statistic will not be of much help, as I need something normalized.

    – bluhub Jan 15 '23 at 19:37
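Picking up the Kolmogorov-Smirnov suggestion from the comments above: the KS statistic itself is a sup-distance between empirical and reference CDF, so it always lies in [0, 1] regardless of sample size and can be stored as a descriptive (not inferential) discrepancy measure. A minimal sketch with SciPy (note the caveat that estimating mean and standard deviation from the residuals makes the formal p-value invalid, Lilliefors-style, but the distance itself remains usable as a descriptive quantity):

```python
import numpy as np
from scipy import stats

def ks_distance(resid):
    """KS sup-distance between standardized residuals and N(0,1).

    Descriptive only: mean/std are estimated from the data, so the
    kstest p-value would be miscalibrated, but the distance D in [0, 1]
    is comparable across data sets of different sizes.
    """
    z = (resid - resid.mean()) / resid.std(ddof=1)
    return stats.kstest(z, "norm").statistic

rng = np.random.default_rng(3)
d_small = ks_distance(rng.normal(0, 1, 500))    # n = 500
d_large = ks_distance(rng.normal(0, 1, 20000))  # n = 20k
```

For normal residuals D shrinks roughly like $1/\sqrt{n}$, so if absolute comparability is wanted one could also store $\sqrt{n}\,D$; either way the raw distance is a bounded, size-agnostic quantity to screen on.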
  • And you are right, I am also starting to think that there will not be one parameter as the holy grail which covers all the details. That's why I opted for using the p-value of the normality test, but this failed quickly due to the varying sample sizes used for the fits – bluhub Jan 15 '23 at 19:43
  • I presume it will be several values: outlier identification, quantities reflecting normality (e.g. skewness, kurtosis), heteroskedasticity, correlation of residuals, etc. – bluhub Jan 15 '23 at 19:51
  • That makes sense to me. – Christian Hennig Jan 15 '23 at 21:29

0 Answers