15

I need to train a linear classifier on my laptop with hundreds of thousands of data points and about ten thousand features. What are my options? What is the state of the art for this type of problem?

It seems like stochastic gradient descent is a promising direction, and my sense is that this is the state of the art:

"Pegasos: Primal Estimated sub-GrAdient SOlver for SVM" Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter." Mathematical Programming, Series B, 127(1):3-30, year: 2007 . "

Is this the consensus? Should I be looking in some other direction?
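
For concreteness, this is roughly the kind of Pegasos-style sub-gradient update I have in mind (a minimal NumPy sketch as I understand the paper; `lam`, `n_iters`, and the sampling scheme are placeholders I would still have to tune):

```python
import numpy as np

def pegasos_train(X, y, lam=1e-4, n_iters=100_000, seed=0):
    """Rough Pegasos-style SGD for a linear SVM.

    X: (n_samples, n_features) array, y: labels in {-1, +1}.
    lam (regularization strength) and n_iters are guesses to be tuned.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                  # pick one example at random
        eta = 1.0 / (lam * t)                # decaying step size
        margin = y[i] * X[i].dot(w)          # evaluated at the current w
        w *= (1.0 - eta * lam)               # shrink (regularization step)
        if margin < 1.0:                     # hinge-loss sub-gradient step
            w += eta * y[i] * X[i]
    return w
```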

carlosdc
  • 3,235
  • @Thom Blake: I've edited the question a little bit, with my thoughts. – carlosdc Feb 07 '12 at 20:31
  • @vzn tall fat data is the setup I just described. many datapoints and many features. – carlosdc Feb 10 '12 at 19:00
  • @vzn: yes, I think it is – carlosdc Feb 13 '12 at 15:51
  • 1
Have you considered using some dimension reduction methods? Thousands of features call for dimension reduction; see: http://en.wikipedia.org/wiki/Dimension_reduction – Dov Feb 07 '12 at 20:16
  • This question could be improved with research effort. Do you have any techniques in mind? – Tamzin Blake Feb 07 '12 at 20:27
  • Without knowing more about the data, any answer would be uninformed. Is it sparse? continuous? discrete? redundant features/objects? how many classes? For example, PCA on sparse data can sometimes be harmful. – cyborg Feb 08 '12 at 10:12
Could you explain what you mean by "tall, fat data" in your subject line, which is not in the question at all? –  Feb 10 '12 at 18:51
  • 2
Tall = many points? Fat = many features? Is this standard terminology anywhere, used in references somewhere? –  Feb 10 '12 at 19:05

4 Answers

7

First, I would like to ask how you know that a linear classifier is the best choice. Intuitively, for such a large space ($\mathbb{R}^{10000}$) it is possible that some other, non-linear classifier is a better choice.

I suggest you try several different classifiers and compare their prediction errors (I would try several regularized classification models).

If you run out of memory, reduce the dimensionality using PCA.
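
For example, a quick comparison along those lines might look like this (a sketch using scikit-learn, which is my own choice of library; the models, hyperparameters, and synthetic data are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Placeholder data; substitute your own (n_samples, n_features) matrix and labels.
X, y = make_classification(n_samples=5000, n_features=1000,
                           n_informative=50, random_state=0)

models = {
    "logistic regression (L2)": LogisticRegression(C=1.0, max_iter=1000),
    "linear SVM (L2)":          LinearSVC(C=1.0),
    "SGD, hinge loss":          SGDClassifier(loss="hinge", alpha=1e-4),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=3)      # 3-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```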

niko
  • 1,353
  • 2
Thanks. I was thinking more about how to handle the scale of the problem. I wanted to start with a linear model because it is simpler. I think you are suggesting a kernel-based method. Let me just point out that if I have 750000 datapoints, the kernel matrix will be of size 540 GB. So the answer cannot be: plug it into LIBSVM/SVMLight/etc., get a good C and gamma, and see what you get. – carlosdc Feb 07 '12 at 23:13
  • 2
(-1) The answer is a) partly so generic that it could be applied to any classification question, and b) it does not explain why PCA is recommended (over any other dimension reduction technique). – steffen Feb 08 '12 at 07:21
6

I think you should look at Online Learning methods. The perceptron and the kernel perceptron are extremely easy to code and work extremely well in practice, and there are a whole host of other online methods. Note that any online learning method can be converted into a batch learning algorithm, in which case they closely resemble stochastic gradient descent methods.
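
To give a flavour of how little code this takes, here is a rough sketch of an averaged perceptron in Python/NumPy (the weight averaging is one simple online-to-batch conversion; this is just an illustration, not DOGMA's code):

```python
import numpy as np

def averaged_perceptron(X, y, n_epochs=5):
    """Online perceptron with weight averaging (labels in {-1, +1})."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)                       # running sum for the averaged weights
    for _ in range(n_epochs):
        for i in np.random.permutation(n):    # stream the data in random order
            if y[i] * X[i].dot(w) <= 0:       # mistake: update the weights
                w += y[i] * X[i]
            w_sum += w
    return w_sum / (n_epochs * n)             # averaged ("batch") weight vector
```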

If you're using Matlab there's a really nice toolbox called DOGMA by Francesco Orabona, which contains a range of online learning algorithms, and you can evaluate a few different methods using that. I've used this in some of my research and found it to be very useful (note that as far as I remember it expects the data as [features x examples] so you might have to transpose it).

As others have mentioned, you might want to try dimensionality reduction. PCA might not be such a good option here, as you have to compute the covariance matrix which will be very costly. You could try looking at Random Projections. The theory is tough, but the principle is very simple. It's based on the Johnson-Lindenstrauss Lemma if you're interested, but the basic idea is that if you randomly project to a lower dimensional space, then $\ell_2$ distances between points are preserved up to some $\epsilon$. If you're using an RBF kernel, then $\ell_2$ distances are all you are interested in!
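
A minimal sketch of the random projection idea in Python/NumPy (the target dimension `k` is up to you, guided by the JL bound or your memory budget; scikit-learn also ships a ready-made version if you prefer):

```python
import numpy as np

def random_project(X, k, seed=0):
    """Project rows of X from d dimensions down to k with a Gaussian matrix.

    By the Johnson-Lindenstrauss lemma, pairwise l2 distances are preserved
    up to a (1 +/- eps) factor with high probability when k ~ O(log n / eps^2).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    R = rng.normal(size=(d, k)) / np.sqrt(k)   # scaling keeps squared distances unbiased
    return X @ R
```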

tdc
  • 7,569
3

You can also use PCA to reduce dimensions without computing the covariance matrix, by using the neural network equivalent of PCA.

Here is a paper that describes it (but I recommend doing your own search): http://users.ics.tkk.fi/oja/Oja1982.pdf, and here is a link to something that may be a working MATLAB implementation: http://www.cs.purdue.edu/homes/dgleich/projects/pca_neural_nets_website/index.html.
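
To give a rough idea of how this works without ever forming the covariance matrix, here is a toy sketch of Oja's rule for the leading principal component (my own illustration, not the code from either link; the learning rate and epoch count are arbitrary):

```python
import numpy as np

def oja_first_component(X, lr=0.01, n_epochs=10, seed=0):
    """Estimate the first principal component with Oja's rule.

    Assumes the rows of X are (approximately) mean-centred.
    Each update touches a single sample, so no covariance matrix is built.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            y = X[i].dot(w)                 # projection onto current estimate
            w += lr * y * (X[i] - y * w)    # Oja's update keeps ||w|| near 1
    return w / np.linalg.norm(w)
```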

jb.
  • 1,120
1

As jb suggested, I think it is better to use a dimension reduction method. Principal Component Analysis (PCA) is a popular choice. You can also try unsupervised feature learning techniques. More information about unsupervised feature learning can be found at http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

Upul
  • 647