Sorry if the question is elementary; I am a beginner in both R and statistics.

I am reading An Introduction to Machine Learning with R. We are looking at a dataset called Sonar. It contains 61 variables: the first 60 are numeric, and the last one is an unordered factor with levels M and R (meaning mine and rock). We want to create a model that predicts whether a given observation is a rock or a mine:

library("mlbench")
data(Sonar)
## 60/40 split
tr <- sample(nrow(Sonar), round(nrow(Sonar) * 0.6))
train <- Sonar[tr, ]
test <- Sonar[-tr, ]
model <- glm(Class ~ ., data = train, family = "binomial")
p <- predict(model, test, type = "response")
summary(p)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.7265  0.5079  1.0000  1.0000
cl <- ifelse(p > 0.5, "M", "R")

From the last line, it seems clear that each entry of p is the probability that the corresponding observation is a mine.

My Question: What, in the code, determines that the entries of p are the probabilities of mines, and not of rocks?

Ovi

1 Answer

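I believe R uses the first level of the factor as the reference level, which here is "M" (levels are sorted alphabetically by default). For a binomial glm with a factor response, the first level is treated as "failure" and the other level as "success", so the fitted probabilities refer to the non-reference level, here "R". You can change the reference level yourself with relevel(). Search for "R reference factor level" for more details.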
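
To check this for yourself, here is a minimal sketch using made-up toy data rather than the Sonar set (the variables x, y, y2, fit and fit2 are my own hypothetical names, not from the book). It looks at the level order, fits the model, and then refits after changing the reference level with relevel():

```r
# Toy data (hypothetical, not the Sonar set): one numeric predictor and a
# two-level factor response with the same labels as Sonar$Class.
set.seed(1)
x <- c(rnorm(50, mean = -1), rnorm(50, mean = 1))
y <- factor(rep(c("M", "R"), each = 50))

# The level order decides the coding: glm() treats the first level as
# "failure" and the remaining level as "success".
levels(y)

fit <- glm(y ~ x, family = "binomial")
p <- predict(fit, type = "response")

# The fitted probabilities are higher on average for the observations
# labelled with the second level, i.e. p estimates P(second level).
mean(p[y == "R"]) > mean(p[y == "M"])

# Make "R" the reference level; the fitted probabilities then refer to "M".
y2 <- relevel(y, ref = "R")
fit2 <- glm(y2 ~ x, family = "binomial")
p2 <- predict(fit2, type = "response")

# The two sets of fitted probabilities should be complements of each other.
all.equal(unname(p2), unname(1 - p))
```

On your own data, levels(train$Class) shows the order that glm() will use, so you can match the labels in your ifelse() call to it.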

SmallChess