Most Popular
1500 questions
26 votes
2 answers
Why do we need to discard one dummy variable?
I have learned that, for creating a regression model, we have to take care of categorical variables by converting them into dummy variables. As an example, if, in our data set, there is a variable like location:
Location…
Mithun Sarker Shuvro
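A minimal sketch of the issue the question raises, in plain Python. The location values are hypothetical (the excerpt is truncated before its data), and `one_hot` is an illustrative helper, not a library function. With all k dummy columns, every row sums to 1, which duplicates the intercept column of a regression; dropping one level removes that redundancy.

```python
def one_hot(values, drop_first=False):
    """Encode category labels as rows of 0/1 dummy columns."""
    categories = sorted(set(values))
    if drop_first:
        # The dropped level becomes the baseline absorbed by the intercept.
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical values for a "location" variable.
locations = ["north", "south", "north", "east"]

full_cols, full_rows = one_hot(locations)
reduced_cols, reduced_rows = one_hot(locations, drop_first=True)

# With all k dummies, each row sums to 1 (perfectly collinear with an intercept).
row_sums = [sum(row) for row in full_rows]
```

In the reduced encoding, a row of all zeros stands for the dropped level, so no information is lost.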
26 votes
1 answer
Backpropagation in a CNN
I have the following CNN:
I start with an input image of size 5x5.
Then I apply convolution using a 2x2 kernel with stride = 1, which produces a feature map of size 4x4.
Then I apply 2x2 max-pooling with stride = 2, which reduces the feature map to size 2x2.…
koryakinp
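The sizes quoted in the question can be checked with the usual no-padding formula, and the backward pass through max-pooling can be sketched as routing each upstream gradient to the position that held the window's maximum. This is the standard technique, not necessarily what the answers to this question propose; the function names and the sample values are illustrative assumptions.

```python
def output_size(n, kernel, stride):
    """Spatial size after a convolution or pooling step with no padding."""
    return (n - kernel) // stride + 1


def maxpool_backward(feature_map, grad_out, size=2, stride=2):
    """Route each upstream gradient to the max position of its pooling window."""
    n = len(feature_map)
    grad_in = [[0.0] * n for _ in range(n)]
    for i, grad_row in enumerate(grad_out):
        for j, g in enumerate(grad_row):
            window = [
                (feature_map[r][c], r, c)
                for r in range(i * stride, i * stride + size)
                for c in range(j * stride, j * stride + size)
            ]
            # Ties are broken toward the larger (row, col) index in this sketch.
            _, r, c = max(window)
            grad_in[r][c] += g
    return grad_in


conv_out = output_size(5, kernel=2, stride=1)         # 5x5 input, 2x2 kernel
pool_out = output_size(conv_out, kernel=2, stride=2)  # 2x2 max-pooling

# Hypothetical 4x4 feature map and 2x2 upstream gradient.
fmap = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
grad = maxpool_backward(fmap, [[1.0, 2.0], [3.0, 4.0]])
```

All non-maximum positions receive zero gradient, which is why only four entries of the 4x4 gradient are non-zero here.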
26 votes
6 answers
Make seaborn heatmap bigger
I created a correlation DataFrame with corr() from an original DataFrame. It came out 70 x 70, and it is impossible to visualize the heatmap with sns.heatmap(df). If I try to display corr = df.corr(), the table doesn't fit the screen and I can see all the…
redeemefy
26 votes
7 answers
Sharing Jupyter notebooks within a team
I would like to set up a server that could support a data science team in the following way: be a central point for storing, versioning, sharing, and possibly also executing Jupyter notebooks.
Some desired properties:
Different users can access the…
Dror Atariah
26 votes
4 answers
Is Data Science the Same as Data Mining?
I am sure data science, as it will be discussed in this forum, has several synonyms or at least related fields where large data is analyzed.
My particular question concerns Data Mining. I took a graduate class in Data Mining a few years back.…
demongolem
26 votes
1 answer
RandomForestClassifier OOB scoring method
Does the random forest implementation in scikit-learn use mean accuracy as its scoring method to estimate generalization error with out-of-bag samples? This is not mentioned in the documentation, but the score() method reports the mean accuracy.
I…
darXider
26 votes
2 answers
How to fit pairwise ranking models in XGBoost?
As far as I know, to train learning to rank models, you need to have three things in the dataset:
label or relevance
group or query id
feature vector
For example, the Microsoft Learning to Rank dataset uses this format (label, group id, and…
tokestermw
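One way to picture the "group or query id" requirement: as far as I know, XGBoost's Python API takes group information as a list of consecutive group sizes (for example via `DMatrix.set_group`, used with the `rank:pairwise` objective) rather than as a per-row id. A minimal sketch of that conversion follows; the query ids are hypothetical and `group_sizes` is an illustrative helper, not part of XGBoost.

```python
from itertools import groupby


def group_sizes(query_ids):
    """Count consecutive rows per query id; rows must already be sorted by query."""
    return [len(list(rows)) for _, rows in groupby(query_ids)]


# Hypothetical per-row query ids, one entry per (label, feature vector) row.
query_ids = [1, 1, 1, 2, 2, 3]
sizes = group_sizes(query_ids)
```

The sizes must sum to the number of rows, and the rows of each query must be contiguous for the grouping to be meaningful.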
25 votes
2 answers
What kinds of learning problems are suitable for Support Vector Machines?
What are the hallmarks or properties that indicate that a certain learning problem can be tackled using support vector machines?
In other words, what is it that, when you see a learning problem, makes you go "oh I should definitely use SVMs for…
Ragnar
25 votes
4 answers
How to predict probabilities in xgboost using R?
The predict function below also returns negative values, so they cannot be probabilities.
param <- list(max.depth = 5, eta = 0.01, objective="binary:logistic",subsample=0.9)
bst <- xgboost(param, data = x_mat, label = y_mat,nround = 3000)
pred_s <-…
GeorgeOfTheRF
25 votes
2 answers
Can you explain the difference between SVC and LinearSVC in scikit-learn?
I've recently started learning to work with sklearn and have just come across this peculiar result.
I used the digits dataset available in sklearn to try different models and estimation methods.
When I tested a Support Vector Machine model on the…
metjush
25 votes
3 answers
K-means incoherent behaviour choosing K with Elbow method, BIC, variance explained and silhouette
I'm trying to cluster some vectors with 90 features using K-means. Since this algorithm asks me for the number of clusters, I want to validate my choice with some nice math.
I expect to have from 8 to 10 clusters. The features are Z-score scaled.
Elbow…
marcodena
25 votes
3 answers
How do you manage expectations at work?
With all the hoopla and success stories around Data Science and Machine Learning, there are many expectations of data scientists and their predictive models, some justified and some overinflated.
My question to practicing…
neuron
25 votes
4 answers
What does the output of model.predict function from Keras mean?
I have built an LSTM model to predict duplicate questions on the official Quora dataset. The test labels are 0 or 1, where 1 indicates that the question pair is a duplicate. After building the model using model.fit, I test the model using model.predict on the…
Dookoto_Sea
25 votes
3 answers
How do you apply SMOTE on text classification?
Synthetic Minority Oversampling Technique (SMOTE) is an oversampling technique used for imbalanced datasets. So far I have an idea of how to apply it to generic, structured data. But is it possible to apply it to a text classification problem?…
catris25
25 votes
3 answers
Should I use GPU or CPU for inference?
I'm running a deep learning neural network that has been trained on a GPU. I now want to deploy it to multiple hosts for inference. The question is: what conditions determine whether I should use GPUs or CPUs for inference?
Adding more…
Dan