I am using the Titanic data from Kaggle and trying to find the number of missing values in each column using a simple function.

I was able to find the number of missing values for each column using the code below:

length(which(is.na(titanic_data$PassengerId)))
length(which(is.na(titanic_data$Survived)))
length(which(is.na(titanic_data$Pclass)))
length(which(is.na(titanic_data$Name)))
length(which(is.na(titanic_data$Sex)))
length(which(is.na(titanic_data$Age)))
length(which(is.na(titanic_data$SibSp)))
length(which(is.na(titanic_data$Parch)))
length(which(is.na(titanic_data$Ticket)))
length(which(is.na(titanic_data$Fare)))
length(which(is.na(titanic_data$Cabin)))
length(which(is.na(titanic_data$Embarked)))

I did not want to repeat the code for each column, so I wrote the following function:

missing_val<- function(x,y){
  len <-length(which(is.na(x$y)))
  len
}

#create a list of all column names
cols<- colnames(titanic_data)
cols

#call the function
missing_val(titanic_data,cols)

I keep getting a single zero when I run the missing_val function, even though I know for a fact that there are missing values in the Cabin and Embarked columns.

What I am trying to get is something like 0,0,0,0,0,0,0,0,687,2, indicating that there are 687 missing values in the Cabin column and 2 missing values in the Embarked column.

What am I doing wrong here? Any hint would be appreciated. Thanks.

2 Answers

If I'm not mistaken, sapply is not vectorized. You can use colSums and is.na directly:

colSums(is.na(titanic_train))
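
To see what the colSums and is.na combination does, here is a minimal sketch on a toy data frame. The data frame `df` and its values are made up for illustration and only stand in for the Titanic data.

```r
# Hypothetical toy data frame (not the real Titanic data)
df <- data.frame(
  Age      = c(22, NA, 26, NA),
  Fare     = c(7.25, 71.28, NA, 8.05),
  Embarked = c("S", "C", NA, "S"),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix of the same shape;
# colSums() then counts the TRUE entries in each column
na_counts <- colSums(is.na(df))
print(na_counts)
```

The result is a named numeric vector with one count per column, so no loop or helper function is needed.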
rafaelc

You can do this with sapply:

library(titanic)
data(titanic_train)
sapply(titanic_train, function(x) sum(is.na(x)))
PassengerId    Survived      Pclass        Name         Sex         Age 
          0           0           0           0           0         177 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           0           0           0 
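
As for why the function in the question returns a single zero: `$` does not evaluate its right-hand side, so `x$y` looks for a column literally named "y" and returns NULL. The sketch below shows this on a made-up data frame (`toy` is hypothetical) and rewrites missing_val with `[[`, which does accept a column name held in a variable. The Cabin and Embarked counts of 0 in the output above suggest that this copy of the data stores blanks as empty strings rather than NA; that is an assumption about the data, so the last line counts both.

```r
# Hypothetical toy data frame; Cabin uses "" for blanks, Age uses NA
toy <- data.frame(
  Age   = c(22, NA, 26),
  Cabin = c("", "C85", ""),
  stringsAsFactors = FALSE
)

# `$` takes the name literally: there is no column called "y", so this is NULL
y <- "Age"
dollar_is_null <- is.null(toy$y)
# is.na(NULL) has length zero, which is where the single 0 comes from
original_result <- length(which(is.na(toy$y)))

# Rewritten helper: `[[` looks up the column named by the string in `col`
missing_val <- function(df, cols) {
  sapply(cols, function(col) sum(is.na(df[[col]])))
}
na_only <- missing_val(toy, colnames(toy))

# Count NA and empty strings together
na_or_blank <- sapply(toy, function(x) sum(is.na(x) | x == ""))

print(na_only)
print(na_or_blank)
```

Here `na_only` reports one missing Age and no missing Cabin, while `na_or_blank` also picks up the two blank Cabin entries.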
G5W