
I have three data sets that, when joined, have O(320) independent variables for a classification problem.

Principal component analysis (PCA) seems out of the question because the data is mostly factors, not continuous.

I'm at a loss as to how to proceed.

How do experienced analysts go about winnowing a large data set with hundreds of columns down to something manageable? How do you decide between variables? What calculations can you rely on to supplement your gut and experience? How do you avoid throwing away significant variables?

A large number of columns might not be a problem for R, given enough CPU and RAM, but coming up with a cogent story should include identifying what is truly significant. How to accomplish that?

Should I just toss all of it into a logistic regression and see what happens, without any forethought?

More detail in response to comments:

  1. Classification.
  2. Many more observations than columns.
  3. Yes, big oh notation meaning approximately.
  4. Linear model at first. Also interested in boosted models in addition to logistic regression.
  • What do you want to learn from your data? Are you building a regression or classification model? Do you have more observations than features? How many features do you have (I'm not sure what O(320) means -- does it mean approximately 320 features?) Do you think the problem can be well-approximated by a linear model? As an aside, I'm not sure that 320 features is particularly unwieldy in the present day... – Sycorax May 31 '16 at 14:10
  • Amendments above to clarify. Not daunting to a computer or expert, just me personally. – duffymo May 31 '16 at 14:18
  • You can't use PCA, but for categorical variables there is MCA (https://en.wikipedia.org/wiki/Multiple_correspondence_analysis), which is roughly the same thing, so you could reduce the number of variables that way – Riff May 31 '16 at 14:25
  • (1) Big-O notation does not mean approximately: it means asymptotically. Thus, $O(320)=O(1)$. (2) Around one hundred variants of this question have already been asked: see http://stats.stackexchange.com/search?q=large+number+categorical+variables+is%3Aquestion . Perhaps one or more of them already answers yours? If not, please indicate how your situation is different. – whuber May 31 '16 at 14:45
  • Correct. I'll do a search. – duffymo May 31 '16 at 14:52

1 Answer


The key thing to do in this case is to properly account for the feature selection as a part of the modeling process. This means that inside your CV loop, you fit the feature-selection step on the training folds only and then apply the selected features, unchanged, to the held-out fold. This is done to minimize information leakage between training and testing. Failure to do this can result in overly optimistic results and spurious findings.
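A minimal sketch of what that looks like in R, assuming a data frame of factor predictors X and a 0/1 outcome y (both names are placeholders), with a simple chi-squared screen standing in for whatever selection rule you actually prefer:

    # Feature screening is redone inside every fold, never on the full data
    set.seed(1)
    folds <- sample(rep(1:5, length.out = nrow(X)))
    acc   <- numeric(5)
    for (k in 1:5) {
      train <- folds != k
      # screen using ONLY the training fold
      pvals <- sapply(X[train, ], function(col) chisq.test(table(col, y[train]))$p.value)
      keep  <- names(sort(pvals))[1:20]                     # keep the 20 strongest columns
      d_tr  <- data.frame(y = y[train], X[train, keep, drop = FALSE])
      fit   <- glm(y ~ ., data = d_tr, family = binomial)
      # note: factor levels unseen in the training fold need handling in real data
      pred  <- predict(fit, newdata = X[!train, keep, drop = FALSE], type = "response")
      acc[k] <- mean((pred > 0.5) == y[!train])
    }
    mean(acc)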

Elastic net regression allows one to fit a linear model, perform feature selection and control for correlation among variables all at once. The R package glmnet supports GLMs (i.e. includes logistic regression as an option) with sparse matrices. This will be a good first-pass linear model of the data. However, be aware that the features selected by a linear model may have little to do with what's important in a nonlinear model (cf this answer of mine).
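A sketch of that first pass, again assuming a factor data frame X and binary outcome y; sparse.model.matrix one-hot encodes the factors so glmnet can work on a sparse design:

    library(glmnet)
    library(Matrix)
    Xs <- sparse.model.matrix(~ . - 1, data = X)              # one-hot encode the factors, sparsely
    cv <- cv.glmnet(Xs, y, family = "binomial", alpha = 0.5)  # alpha = 0.5 mixes L1/L2: elastic net
    coef(cv, s = "lambda.1se")                                # nonzero coefficients = selected columns

The penalty is tuned by cross-validation here, but this whole fit still belongs inside the outer CV loop described above if you want an honest performance estimate.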

Next steps are models that are particularly adept at handling sparse feature vectors. Factorization machines were conceived to solve this particular task, but they are naturally a bit more complicated than linear models (1).
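For reference, the second-order factorization machine from (1) models the response as

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j,$$

so every pairwise interaction between (one-hot) features is estimated through low-rank factors $v_i$ rather than getting its own free parameter, which is what keeps the model estimable when the design is wide and sparse.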

Linear SVMs are another option. Inner products of sparse vectors are cheap to compute. There are even packages like sofia that work in the primal SVM problem, which means that the complexity is driven by the number of features rather than the number of observations. This can be a nice property, but keep in mind that just because a tool is efficient, it's not necessarily the right one for the job.
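If you want to try a linear SVM from R, one option (not sofia itself) is the LiblineaR wrapper around LIBLINEAR; as I recall, type = 2 selects the L2-regularized L2-loss SVC solved in the primal, but check ?LiblineaR for the exact solver mapping in your version:

    library(LiblineaR)
    Xd   <- as.matrix(Xs)                  # convert if your version won't accept the sparse matrix
    fit  <- LiblineaR(data = Xd, target = y, type = 2, cost = 1)
    pred <- predict(fit, newx = Xd)$predictions
    mean(pred == y)                        # training accuracy only; cross-validate for a real estimate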

I'm often hesitant to assume that methods like PCA will solve my problems, because PCA is not "$y$-aware": the projection may not contain any useful information about the target of the modeling process. In some cases, it will even discard or destroy the only useful information! On the other hand, some people insist that its application is usually good enough for their needs: your mileage may vary. PLS (partial least squares) is a "$y$-aware" alternative that constructs its projections with the response in mind.
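A rough illustration with the pls package, treating the 0/1 outcome as numeric (a common quick-and-dirty way to apply PLS to a two-class problem; Xs is the one-hot matrix from the glmnet sketch):

    library(pls)
    df  <- data.frame(y = as.numeric(y), as.matrix(Xs))    # assumes y is already coded 0/1
    fit <- plsr(y ~ ., data = df, ncomp = 10, validation = "CV")
    summary(fit)             # cross-validated RMSEP by number of components
    scores(fit)[, 1:2]       # the first two y-aware components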

Instead of PCA for dimension reduction, non-negative matrix factorization might be useful. Suppose you use binary encodings on all of your categories. The resulting matrix is non-negative; non-negative matrix factorization will often be able to discover some latent "structure" to that data, where a small number of vectors coincide with particular patterns of association among your different categories. For example, the same people who enjoy the movies Ant-Man and Avengers probably also enjoy Iron Man; "superhero movies" would represent a latent factor that NMF could discover.
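A sketch with the NMF package, assuming Xs is the nonnegative one-hot matrix from above (the rank is a placeholder, and all-zero rows or columns should be dropped first):

    library(NMF)
    V   <- as.matrix(Xs)          # NMF::nmf wants a dense nonnegative matrix
    fit <- nmf(V, rank = 10)      # 10 latent factors; compare several ranks with nmf(V, 2:15)
    W   <- basis(fit)             # observations x factors: a reduced feature set
    H   <- coef(fit)              # factors x original columns: inspect for recurring "themes"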

Boosted trees are a workhorse model; some people consider them to be the "model to beat" if they're just starting some new project.
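A starting point with xgboost, which consumes the sparse one-hot matrix directly (y again assumed to be coded 0/1):

    library(xgboost)
    dtrain <- xgb.DMatrix(data = Xs, label = y)
    params <- list(objective = "binary:logistic", max_depth = 4, eta = 0.1)
    bst    <- xgb.train(params = params, data = dtrain, nrounds = 200)
    head(xgb.importance(model = bst))   # which one-hot columns the trees actually split on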

(1) Steffen Rendle, "Factorization Machines" (2010).

Sycorax