
I have a small doubt. I am a novice, so please help me with your guidance.

I have a dependent variable Log(Y) and a set of around 50 independent variables, say x1, x2, x3, ..., x50. I have tried many transformations (normalization, mean subtraction, reciprocals, square roots, and others), but I was not able to improve my F value above about 15.

But if I divide Log(Y) by one of the independent variables elementwise, say by x1, then I am able to get a good fit of $\log(Y)/x_1 = a x_i + b x_j + c x_k + d$ with good regression values.

Is there any way I can interpret this statistically? Any advice on how to improve would help; I am a novice when it comes to the deeper parts of statistics.

Please help with your suggestions.

My data is biological in nature and would be difficult to share directly.

All the X variables are properties of proteins.

Y is a measure of the protein's activity.

The 50 variables are the main effects. After including the many transformations, the total has grown to around 600 candidate predictors.

There are 36 rows of data in this case. Getting more can be difficult in biological studies, depending on the nature of the problem.

I used the standard model and the best-subsets method from automatic linear regression in SPSS to get this.

The same result has been verified via linear regression using the Enter method.

My doubt is mainly about this: if I divide Log(Y) by one of the independent variables elementwise, say by x1, then I am able to get a good fit of $\log(Y)/x_1 = a x_i + b x_j + c x_k + d$ with good regression values.

How can I interpret this?

sriram
  • Please edit the question to provide information about the nature of your Y value and the x1 value that you are dividing by. Also, please add to the question how many rows of data you are trying to fit with 50 predictors and why you care so much about the value of F. If you have developed this model with stepwise regression, as seems to be the case from this question, and you also tried several different types of transformations before you came up with this model, all your measures of the quality of model fit will be wrong. – EdM Dec 10 '22 at 20:30
  • Made the changes as requested, to the best of my knowledge. Please kindly help. – sriram Dec 11 '22 at 01:34

1 Answer


You have a big risk of overfitting your data. That would be developing a model that fits your current data perfectly, but in a way that depends on details of the specific data sample. Such a model typically won't extend well to other data samples. See this page and its links for some examples.

A rule of thumb for ordinary regression like this in biological and biomedical studies is that you are at risk of overfitting if you try to evaluate more than one predictor for every 10-20 cases. With only 36 cases here (rows of data), even trying to use 4 predictors runs a risk of overfitting. Yet with all of your different transformations and combinations of the original variables, you have tried up to 600. Although automated model selection is allowed by statistical software, it tends to lead to overfitting. This page goes into extensive detail; also see its links.
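
To see how easily automated selection inflates fit statistics at this sample size, here is a minimal sketch (in Python with numpy and scikit-learn, which are my assumptions for illustration; your analysis was in SPSS). It builds 50 pure-noise predictors for 36 rows, keeps the 3 that happen to correlate best with a pure-noise outcome, and still reports a noticeably inflated in-sample $R^2$:

```python
# Minimal illustration (assumed Python/numpy/scikit-learn, not the SPSS workflow
# above): with 36 rows and 50 candidate predictors that are pure noise, picking
# the 3 most correlated with a pure-noise outcome still yields an inflated
# in-sample R^2. Nothing real is being modeled here.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_rows, n_candidates = 36, 50                 # sample size and candidate count, as in the question
X = rng.normal(size=(n_rows, n_candidates))   # predictors: pure noise
y = rng.normal(size=n_rows)                   # outcome: unrelated pure noise

# "Automated selection": keep the 3 columns most correlated with y
corrs = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_candidates)])
best = np.argsort(corrs)[-3:]

r2 = LinearRegression().fit(X[:, best], y).score(X[:, best], y)
print(f"In-sample R^2 from 3 'selected' noise predictors: {r2:.2f}")
```

Repeating this with different random seeds keeps giving flattering $R^2$ values even though the true relationship is exactly zero; that is the sense in which fit statistics obtained after selection are misleading.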

This problem comes from at least two major sources. First, there simply aren't enough data points to fit a complex model reliably. Second, you have tried multiple transformations and combinations of both predictors and the outcome variable, and even proposed an alternate outcome variable in the form of Log(Y)/x1, after you saw the results. There is a place for exploratory data analysis, but once you have done all of that you have violated the assumptions needed to draw inference from F-statistics, $R^2$ values, and p-values.

That means that your attempts to improve measures like F-statistics by further transforming your data will probably just be giving you even more overfitting of a small data set, instead of a model that will describe the underlying population better. That's not good statistical practice.

There are ways to deal properly with the problem of too many predictors for too few cases. Frank Harrell's course notes and book discuss the general problem of regression modeling and ways to combine predictors intelligently without looking at outcomes. There are methods like LASSO that can select from among a set of predictors while adjusting regression coefficient values to minimize the associated overfitting. An Introduction to Statistical Learning illustrates that and other approaches to the problem.
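
As one concrete illustration of that approach (a sketch only, using Python and scikit-learn, which is my assumption; the same idea is available in R and elsewhere), LASSO with a cross-validated penalty shrinks coefficients and drops most of the candidate predictors, rather than searching for the best-looking subset:

```python
# Sketch of LASSO with cross-validated penalty selection (assumed Python and
# scikit-learn). The data below are simulated stand-ins, not your protein data.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(36, 50))             # stand-in for a 36 x 50 predictor matrix
y = 1.0 * X[:, 0] + rng.normal(size=36)   # hypothetical outcome (e.g. Log(Y)) driven by one predictor

# Standardize predictors, then let 5-fold cross-validation pick the penalty strength
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
kept = np.flatnonzero(lasso.coef_)
print(f"Chosen penalty alpha: {lasso.alpha_:.3f}")
print(f"Predictors with nonzero coefficients: {kept}")
```

The point is not the particular result but the workflow: the penalty is chosen by cross-validation rather than by how good the in-sample fit looks, which keeps the coefficient estimates honest when there are this few rows.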

Finally, don't forget to apply your understanding of the subject matter intelligently. For example, if your understanding of the subject matter (setting aside having seen these results) suggests that an outcome variable like Log(Y)/x1 makes sense, then there could be a strong argument for building your model on that outcome. But that decision is best made based on your understanding of what Y, Log(Y), and x1 mean for your data, not because that combination happened to fit this particular data sample.

EdM
  • How much data would be ideal for such a regression? What minimum would I require? Kindly help with your advice. – sriram Dec 11 '22 at 21:58
  • One more thing: in biological studies we may not be able to get much data; we may need to make do with at most 42 or 80 rows. How to work with that and still see patterns well enough to fit the model is another worry. – sriram Dec 11 '22 at 22:05
  • @sriram I'm familiar with these problems; I've been doing biomedical research for over 50 years. You have to adjust the complexity of the model to match the available data. Chapter 4 of either of the linked Harrell references discusses some approaches; section 4.4 describes rules of thumb for the number of predictors per case in different types of models. Chapter 6 of the linked "An Introduction to Statistical Learning" discusses alternate methods like LASSO, ridge regression, and principal components regression that reduce or combine predictors in ways that minimize overfitting. – EdM Dec 11 '22 at 22:32
  • @sriram you might also find this Technical Perspective in Molecular Biology of the Cell, 30: 1359-1368 (2019) to be helpful. It doesn't discuss overfitting directly, but it has much useful guidance. – EdM Dec 11 '22 at 22:35