2

I have a GLM model which has a quadratic term and looks like this.

glm(formula = tumor_count ~ exp_val + I(exp_val^2), 
        family = "poisson",
        data = x)

tumor_count is the number of tumor cells in a region of tissue and exp_val is average expression of a gene in that region of a tissue.

While the tumor_count is represented as the dependent variable (is a counts data 0, 1... and so on) and the independent variable exp_val is a numeric value (which could be fraction or a normalized value > 1 like counts per millions)

I am doing this for multiple genes.

I am getting the following error when I tried the ideas presented here glm in R - which pvalue represents the goodness of fit of entire model?

When I do this I get the following error :

pchisq(
  summary(count_glm_model$HOXA9)$deviance- 
    summary(count_glm_model$HOXA9)$null.deviance,
    summary(count_glm_model$HOXA9)$df.residual- 
    summary(count_glm_model$HOXA9)$df.null,
    lower.tail=FALSE)

[1] NaN Warning message: In pchisq(summary(count_glm_model$HOXA9)$deviance - summary(count_glm_model$HOXA9)$null.deviance, : NaNs produced

I was wondering what I am doing wrong and how can I get a p-value defining the fit of the model?

Saad Khan
  • 101

1 Answers1

2

This is primarily a coding problem, I think. You don't give any reproducible code (count_glm_model is not a count glm model; it must be some sort of list)

Here's code:

d<-data.frame(exp_val=runif(100),tumor_count=rpois(100,rgamma(100,1)))

count_glm_model<-glm(formula = tumor_count ~ exp_val + I(exp_val^2), family = "poisson", data = d) pchisq( summary(count_glm_model)$deviance- summary(count_glm_model)$null.deviance, summary(count_glm_model)$df.residual- summary(count_glm_model)$df.null, lower.tail=FALSE)

This is wrong because

> summary(count_glm_model)$deviance- summary(count_glm_model)$null.deviance
[1] -3.849819

and also

> summary(count_glm_model)$df.residual-summary(count_glm_model)$df.null
[1] -2

If you put the two subtractions the right way around you'd get

pchisq(
   summary(count_glm_model)$null.deviance- 
     summary(count_glm_model)$deviance,
   summary(count_glm_model)$df.null- 
     summary(count_glm_model)$df.residual,
 lower.tail=FALSE)

This is a test of the null hypothesis that none of the predictors is correlated with the outcome, which isn't really a goodness of fit test, but is the test at the answer you linked. If that's the test you want, that's one way to get it. However, the test is only valid if the outcome variable isn't overdispersed relative to a Poisson distribution, and it almost certainly will be. If it is you'll need some sort of test based on quasilikelihood or a sandwich estimator or something instead.

There's also a deviance-based test for goodness of fit in the sense of testing whether the data are a good fit to the assumed model

pchisq(
     summary(count_glm_model)$deviance,
     summary(count_glm_model)$df.residual,
 lower.tail=FALSE)

In this context, however, it's basically a test for whether the outcome variable is Poisson or, alternatively, overdispersed. It is likely to give similar answers for all your genes and so not be very helpful

Thomas Lumley
  • 38,062
  • I only wanted to calculate the p-value of goodness of fit for my model and was thus wondering what is the easiest way to go about it? – Saad Khan Jan 19 '24 at 19:16