1

I am taking a class in data mining and I am working on a term project using the BRFSS dataset. I have a huge dataset with 405 columns and 12,000 rows. There are many columns which are completely empty. I was trying to remove empty columns using SAS, R or Excel but it doesn't work. Could you suggest a method to remove the empty columns or any tutorial that will help me cleaning up the data? There are a lot of missing cells too. I am using KNIME to train my data and it doesn't work if there are missing values. How can I handle the missing values?

user3897
  • 527
  • 1
    In SAS or R you need a few lines of code to remove empty columns, but how come it did not work in Excel? Have you tried right-clicking the header of the empty column and selecting delete? – GaBorgulya Apr 02 '11 at 00:13
  • SAS uses keep and drop which are about as simple as it gets. What exactly "doesn't work"? – whuber Apr 02 '11 at 04:39
  • @GaBorgulya: It is a huge dataset and it is not possible to manually identify columns which are empty. I was thinking of a way in which we can automatically remove empty columns and also generate a list of these columns for ananlysis. I am not sure how to do it automatically in excel and SAS. – user3897 Apr 02 '11 at 13:21
  • @whuber: I have to do it automatically since it is a huge dataset. SAS should go over the data automatically and remove the empty columns and also generate a list of those empty columns for analysis – user3897 Apr 02 '11 at 13:22
  • @user In other words, your question is how to identify empty columns, not how to remove them. – whuber Apr 02 '11 at 15:52

4 Answers4

3

Empty columns contain NAs only or ""s only, they have has no variability. This code removes all columns without variability (which is probably a plus in this case).

d=data.frame(r=seq(1, 5), a=rep('a', 5), n=rep(NA, 5), n1=c(NA, NA, 3, 3, 3))

homogenous = apply(d, 2, function(var) length(unique(var)) == 1)
d[, !homogenous]
GaBorgulya
  • 3,363
2

If they really are just completely blank columns then in R...

read.table( 'myBigFile', strip.white = TRUE)

might do what you want. You will have to set other arguments of the read.table() command as needed. It's best to use this when specifying the specific column delimiter you have.

John
  • 22,838
  • 9
  • 53
  • 87
1

Maybe just for completeness' sake:
df[, colSums( is.na(df) ) < nrow(df)] returns only columns with at least one nonmissing value.

And what I do prior to clearing columns without variability is this

 df <- df[,sapply(df,is.character)] = colwise(       # replace empty or white-space-only
     function (x) str_replace(x,"^\\s*$",NA))        # character columns with NA
     (df[,sapply(df,is.character)])                  # uses stringr for str_replace

 df <- df[, colwise(function(x){
         length (unique(x)) })(df)!=1]               # I like writing it like this for 
                                                     # conciseness, uses plyr for colwise
Ruben
  • 354
0

Hard for me to think of this as a huge dataset, but these things are relative ;)

To do this in excel (a version like 2007 or 2010 which has enough columns), you could insert two rows at the top. In the first row, just have consecutive integers: 1 in column A, 2 in column B, etc. You can do this with a function [I think it's col()] and then convert it to values.

In the second row, you need some logical expression which will tell you if the column is empty. This might be as simple as +max(a3:A12002)<>0 if all variables which are non-empty contain a nonzero value. Without knowing the characteristics of your data I can't be sure that will work. +max(a3:A12002)<>0 should be true if there is data in the column, false if it isn't. Now, use sort to sort columns based on row 2, true/false. Next, delete all the false columns. Next, sort on row 1 to put the data back in the original order. Finally, delete rows 1 and 2 and save the file (probably as .CSV).

I'm writing this from memory; should be right.

zbicyclist
  • 3,423
  • In terms of missing values, the simplest approach is to see how many complete cases you have, and if there's any particular bias likely to be introduced by dropping all cases which are not complete (listwise deletion in some statistics packages). Beyond that you really should repost; missing data is a big issue and a lot more complicated than editing columns. – zbicyclist Apr 02 '11 at 20:38