0

First of all, forgive my moronosity (new word I made up).

I am running R on data from trees in burn areas. Simple, straightforward dataset. And yet, R for some reason hates chestnut oak. My species are NRO, WO, and CHO. It will not report on CHO, no matter what I try. What gives? Examples of what I see in the console:

data.frame':    72 obs. of  4 variables:
 $ site  : Factor w/ 24 levels "Arbutus Ridge 1 Low/Moderate",..: 1 1 1 2 2 2 3 3 3 4 ...
 $ burn  : Factor w/ 4 levels "Control","High",..: 2 2 2 4 4 4 3 3 3 4 ...
 $ sp    : Factor w/ 3 levels "CHO","NRO","WO": 1 2 3 1 2 3 1 2 3 1 ...
 $ seedba: num  NA NA NA 1.85 NA ...

Call:
lm(formula = seedba ~ burn + sp + burn * sp, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1287 -1.0779 -0.3294  0.1242 12.4723 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.7720     1.8144   0.426    0.675
burnHigh             0.1979     2.7715   0.071    0.944
burnLow              1.1491     2.7715   0.415    0.683
burnModerate         2.1360     3.1426   0.680    0.505
spNRO                1.1394     2.5659   0.444    0.662
spWO                -0.2964     2.5659  -0.116    0.909
burnHigh:spNRO      -1.4967     4.1901  -0.357    0.725
burnLow:spNRO        4.4422     3.6887   1.204    0.243
burnModerate:spNRO  -2.5252     5.1318  -0.492    0.628
burnHigh:spWO            NA         NA      NA       NA
burnLow:spWO        -1.3814     4.9133  -0.281    0.782
burnModerate:spWO   -2.6117     5.1318  -0.509    0.617

Residual standard error: 3.629 on 19 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.4196,    Adjusted R-squared:  0.1141 
F-statistic: 1.373 on 10 and 19 DF,  p-value: 0.2644

It still won't show CHO:

Call:
lm(formula = seedba ~ -1 + burn + sp + burn * sp, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1287 -1.0779 -0.3294  0.1242 12.4723 

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error t value Pr(>|t|)
burnControl          0.7720     1.8144   0.426    0.675
burnHigh             0.9700     2.0951   0.463    0.649
burnLow              1.9212     2.0951   0.917    0.371
burnModerate         2.9080     2.5659   1.133    0.271
spNRO                1.1394     2.5659   0.444    0.662
spWO                -0.2964     2.5659  -0.116    0.909
burnHigh:spNRO      -1.4967     4.1901  -0.357    0.725
burnLow:spNRO        4.4422     3.6887   1.204    0.243
burnModerate:spNRO  -2.5252     5.1318  -0.492    0.628
burnHigh:spWO            NA         NA      NA       NA
burnLow:spWO        -1.3814     4.9133  -0.281    0.782
burnModerate:spWO   -2.6117     5.1318  -0.509    0.617

Residual standard error: 3.629 on 19 degrees of freedom
  (42 observations deleted due to missingness)
Multiple R-squared:  0.5712,    Adjusted R-squared:  0.323 
F-statistic: 2.301 on 11 and 19 DF,  p-value: 0.05338
dimitriy
  • 35,430

1 Answers1

5

R automatically includes an intercept in regression. If you include an intercept and all categories of a dummy variable, then you have succumbed to the dummy variable trap. R is smart enough to notice when you have fallen to the dummy variable trap, so it will omit a category for you.

If you want to include all levels of a categorical predictor, then you will have to omit the intercept using the -1 or the +0 argument: lm(seedba ~ -1 + burn + sp + burn*sp, data=data). See ?formula for more information.

Sycorax
  • 90,934
  • Thanks very much. Unfortunately, when I run it with the above, it still won't show CHO. I edited my original question to include the output. – Matt Aldrovandi Sep 25 '18 at 20:27
  • 2
    If excluding the intercept still results in this behavior, it's possibly because there is still perfect multi-collinearity in your data. On the other hand, R handles NA values by deleting the corresponding rows from the analysis, and your printout says that some observations were removed due to missingness. Do instances of CHO have NA values? Also, the blob of text you added is hard to read. Please use code formatting. – Sycorax Sep 25 '18 at 20:36
  • Sorry about that, will do. Yes, I still have NA values. Should I switch them to something? Worried about messing up the data if I make them zeroes. – Matt Aldrovandi Sep 25 '18 at 21:27
  • Working with missing data is its own topic in statistics; we have a number of posts about [tag:missing-data] and [tag:data-imputation]. – Sycorax Sep 25 '18 at 21:35