Most Popular

1500 questions
26 votes, 2 answers

Why do we need to discard one dummy variable?

I have learned that, for creating a regression model, we have to take care of categorical variables by converting them into dummy variables. As an example, if, in our data set, there is a variable like location: Location…
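A minimal sketch of the collinearity issue this question is about, in plain Python. The category names and the `one_hot` helper are hypothetical and chosen only for illustration; the excerpt does not list the actual values of the location variable.

```python
def one_hot(values, drop_first=False):
    """Encode a list of category labels as rows of 0/1 dummy columns."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped level becomes the baseline that the others are compared to.
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical data: these labels are made up, not taken from the question.
locations = ["north", "south", "west", "south", "north"]

full_cols, full_rows = one_hot(locations)
reduced_cols, reduced_rows = one_hot(locations, drop_first=True)

# With every level kept, each row sums to 1, so the dummy columns add up to
# the intercept column and the design matrix is perfectly collinear.
row_sums = [sum(row) for row in full_rows]
print(full_cols, row_sums)
print(reduced_cols, reduced_rows)
```

Dropping one level removes that exact linear dependence while keeping the same information, because the dropped category is identified by all remaining dummies being 0.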
26 votes, 1 answer

Backpropagation in a CNN

I have the following CNN: I start with an input image of size 5x5. Then I apply a convolution using a 2x2 kernel and stride = 1, which produces a feature map of size 4x4. Then I apply 2x2 max-pooling with stride = 2, which reduces the feature map to size 2x2.…
koryakinp
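The layer sizes quoted in this question can be checked with the usual output-size formula. This is a sketch that assumes no padding, which the excerpt does not state explicitly but which matches its numbers; the helper name `output_size` is made up for illustration.

```python
def output_size(input_size, kernel_size, stride):
    """Spatial size after a convolution or pooling layer with no padding."""
    return (input_size - kernel_size) // stride + 1


# 5x5 input, 2x2 convolution kernel, stride 1
conv_out = output_size(5, kernel_size=2, stride=1)

# 2x2 max-pooling with stride 2 applied to the convolution output
pool_out = output_size(conv_out, kernel_size=2, stride=2)

print(conv_out, pool_out)
```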
26 votes, 6 answers

Make seaborn heatmap bigger

I created a corr() DataFrame from an original DataFrame. The corr() DataFrame came out 70 x 70, and it is impossible to visualize the heatmap... sns.heatmap(df). If I try to display corr = df.corr(), the table doesn't fit the screen and I can see all the…
redeemefy
26 votes, 7 answers

Sharing Jupyter notebooks within a team

I would like to set up a server which could support a data science team in the following way: be a central point for storing, versioning, sharing and possibly also executing Jupyter notebooks. Some desired properties: Different users can access the…
Dror Atariah
26 votes, 4 answers

Is Data Science the Same as Data Mining?

I am sure data science, as it will be discussed in this forum, has several synonyms or at least related fields where large data is analyzed. My particular question is in regard to Data Mining. I took a graduate class in Data Mining a few years back. …
demongolem
26 votes, 1 answer

RandomForestClassifier OOB scoring method

Does the random forest implementation in scikit-learn use mean accuracy as its scoring method to estimate generalization error with out-of-bag samples? This is not mentioned in the documentation, but the score() method reports the mean accuracy. I…
darXider
26 votes, 2 answers

How to fit pairwise ranking models in XGBoost?

As far as I know, to train learning-to-rank models, you need to have three things in the dataset: a label or relevance, a group or query id, and a feature vector. For example, the Microsoft Learning to Rank dataset uses this format (label, group id, and…
tokestermw
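The three ingredients named in this question (label, query id, feature vector) can be read from a single text record as sketched below. This assumes the SVMlight-style layout `label qid:<id> <index>:<value> ...`; the sample line and the helper name `parse_letor_line` are made up for illustration and are not taken from the question.

```python
def parse_letor_line(line):
    """Split one 'label qid:<id> <index>:<value> ...' record into its three parts."""
    body = line.split("#", 1)[0]  # drop a trailing comment, if any
    tokens = body.split()
    label = int(tokens[0])                 # relevance label
    query_id = tokens[1].split(":", 1)[1]  # group / query id
    features = {}
    for token in tokens[2:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)  # sparse feature vector
    return label, query_id, features


# Hypothetical record: relevance 2, query 10, three features.
sample = "2 qid:10 1:0.5 2:0.25 3:1.0"
label, query_id, features = parse_letor_line(sample)
print(label, query_id, features)
```

Rows that share a query id form one group, and a pairwise ranking objective only compares documents within the same group.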
25 votes, 2 answers

What kinds of learning problems are suitable for Support Vector Machines?

What are the hallmarks or properties that indicate that a certain learning problem can be tackled using support vector machines? In other words, what is it that, when you see a learning problem, makes you go "oh I should definitely use SVMs for…
Ragnar
25 votes, 4 answers

How to predict probabilities in xgboost using R?

The predict function below is returning negative values as well, so they cannot be probabilities. param <- list(max.depth = 5, eta = 0.01, objective="binary:logistic",subsample=0.9) bst <- xgboost(param, data = x_mat, label = y_mat,nround = 3000) pred_s <-…
GeorgeOfTheRF
25 votes, 2 answers

Can you explain the difference between SVC and LinearSVC in scikit-learn?

I've recently started learning to work with sklearn and have just come across this peculiar result. I used the digits dataset available in sklearn to try different models and estimation methods. When I tested a Support Vector Machine model on the…
metjush
25 votes, 3 answers

K-means incoherent behaviour choosing K with Elbow method, BIC, variance explained and silhouette

I'm trying to cluster some vectors with 90 features with K-means. Since this algorithm asks me for the number of clusters, I want to validate my choice with some nice math. I expect to have from 8 to 10 clusters. The features are Z-score scaled. Elbow…
marcodena
25 votes, 3 answers

How do you manage expectations at work?

With all the hoopla around Data Science and Machine Learning, and all the success stories around, there are a lot of both justified and overinflated expectations of Data Scientists and their predictive models. My question to practicing…
neuron
25 votes, 4 answers

What does the output of model.predict function from Keras mean?

I have built an LSTM model to predict duplicate questions on the Quora official dataset. The test labels are 0 or 1. 1 indicates the question pair is duplicate. After building the model using model.fit, I test the model using model.predict on the…
Dookoto_Sea
25 votes, 3 answers

How do you apply SMOTE on text classification?

Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique used for imbalanced dataset problems. So far I have an idea of how to apply it to generic, structured data. But is it possible to apply it to a text classification problem?…
catris25
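For context, the core step of SMOTE on numeric feature vectors is a random interpolation between a minority sample and one of its minority-class neighbors. The sketch below shows only that step: the neighbor is passed in directly instead of being found by a nearest-neighbor search, and the points and the helper name `smote_sample` are made up for illustration. Using it on text would presumably require turning documents into numeric vectors first, which is an assumption and not something the excerpt states.

```python
import random


def smote_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between a sample and a neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical minority-class points in a 2-feature space.
minority_point = [1.0, 2.0]
minority_neighbor = [3.0, 6.0]

synthetic = smote_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```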
25 votes, 3 answers

Should I use GPU or CPU for inference?

I'm running a deep learning neural network that has been trained on a GPU. I now want to deploy this to multiple hosts for inference. The question is: what conditions determine whether I should use GPUs or CPUs for inference? Adding more…
Dan