
I have searched for this problem but haven't found a straightforward answer. I am working with about 1.4 million numeric values and a few train() functions. The problem is that the "svmRadial" and random forest models take a really long time on my i7-6500U CPU with 4 GB of RAM. Is there a way to speed up that process?

library(caret)

# 10-fold cross-validation, models compared by accuracy
control = trainControl(method = "cv", number = 10)
metric = "Accuracy"
fit.knn = train(class ~ ., data = dataset, method = "knn", metric = metric, trControl = control)
fit.rf = train(class ~ ., data = dataset, method = "rf", metric = metric, trControl = control)
Ka_Papa
  • There is automatic tuning for random forest in caret, which takes some extra time. Try the ranger package; there you can set the number of cores you use, and if you have a lot of cores this will be much faster. – PhilippPro Jan 05 '18 at 15:30
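
A minimal sketch of what this comment suggests, assuming the dataset and class names from the question; num.threads is ranger's thread-count argument, which caret's train() passes through to ranger::ranger():

library(caret)
library(ranger)

control = trainControl(method = "cv", number = 10)

# method = "ranger" swaps in the multi-threaded ranger backend for random forest;
# extra arguments such as num.threads are forwarded to ranger::ranger()
fit.ranger = train(class ~ ., data = dataset, method = "ranger",
                   metric = "Accuracy", trControl = control,
                   num.threads = 4)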

1 Answer


I am surprised by your setup: doing 10-fold cross validation with random forest or SVM on 1.4 million data points can take weeks, if not months, to run!

Here is the basic complexity picture for SVM. Note that the space complexity of the kernel matrix is $O(n^2)$; when $n$ is on the order of $10^6$, getting a kernelized SVM to work is almost impossible (it would require roughly 8000 GB of memory). See this discussion for details: Can support vector machine be used in large data?
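
A back-of-the-envelope check of that memory figure (assuming a dense kernel matrix stored as 8-byte doubles): with $n = 10^6$ the matrix has $n^2 = 10^{12}$ entries, i.e. about $8 \times 10^{12}$ bytes $\approx 8000$ GB. With your $n \approx 1.4 \times 10^6$ it is closer to $(1.4 \times 10^6)^2 \times 8 \approx 1.6 \times 10^{13}$ bytes, roughly 15 TB, against 4 GB of RAM.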


I would suggest doing small-scale experiments before you run 10-fold cross validation on a complicated model. In addition, before you try a model, try to estimate its time and space complexity. For example, if you are building a linear regression and solving the least-squares problem with a QR decomposition, work out how the number of data points $n$ and the number of features $p$ impact the complexity.
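
For the QR-based least-squares example, the fit costs roughly $O(np^2)$ time and $O(np)$ memory, so runtime grows about linearly in $n$ for fixed $p$. A hedged timing sketch on synthetic data (base R only; lm.fit() uses a QR decomposition internally) shows how quickly you can get such an estimate empirically:

# Time a QR-based least-squares fit at increasing n (fixed p) on synthetic data
p = 20
for (n in c(1e4, 1e5, 1e6)) {
  X = matrix(rnorm(n * p), n, p)
  y = rnorm(n)
  cat(n, "rows:", system.time(lm.fit(X, y))["elapsed"], "seconds\n")
}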

Here are some high level suggestions:

  • Step 1: sample your data, say 20% of it, and split that into a training set and a testing set (no cross validation), as in the sketch after this list.

  • Step 2: start with some simpler models, such as a decision tree or a linear model. (In fact, random forest and a neural network may be OK, but SVM is definitely not efficient on this amount of data.)
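
A minimal sketch of Steps 1 and 2, assuming the dataset and class names from the question and that class is a factor; the 20% sample and the rpart baseline are illustrative choices, not a prescription:

library(caret)
library(rpart)

set.seed(1)
# Step 1: take a stratified 20% sample, then split it 80/20 into train/test
idx = createDataPartition(dataset$class, p = 0.2, list = FALSE)
small = dataset[idx, ]
split = createDataPartition(small$class, p = 0.8, list = FALSE)
train_set = small[split, ]
test_set = small[-split, ]

# Step 2: a fast baseline model (a single decision tree) before anything expensive
fit.tree = rpart(class ~ ., data = train_set)
pred = predict(fit.tree, test_set, type = "class")
confusionMatrix(pred, test_set$class)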

Finally, try some diagnostic experiments, such as a learning curve, to see whether you are underfitting or overfitting, and then treat each case with the appropriate approach.
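
A hedged sketch of a manual learning curve, reusing the train_set/test_set names from the sketch above and the same rpart baseline: train on growing fractions of the training data and compare train vs. test accuracy.

library(rpart)

fracs = c(0.1, 0.25, 0.5, 1.0)
curve = data.frame(frac = fracs, train_acc = NA, test_acc = NA)

for (i in seq_along(fracs)) {
  n = floor(fracs[i] * nrow(train_set))
  sub = train_set[sample(nrow(train_set), n), ]
  fit = rpart(class ~ ., data = sub)
  curve$train_acc[i] = mean(predict(fit, sub, type = "class") == sub$class)
  curve$test_acc[i] = mean(predict(fit, test_set, type = "class") == test_set$class)
}
curve  # a persistent large train/test gap suggests overfitting; two low curves suggest underfitting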

Haitao Du
  • I am working with a machine learning script and a raster image; should I try aggregating the image instead of taking a sample? – Ka_Papa Dec 29 '17 at 11:28