Highest Voted Questions - Data Science Stack Exchange

10

votes

3 answers

XGBoost - Choice made by model

I am using XGBoost to predict a two-class target variable on insurance claims. I have a model (trained with cross-validation, hyperparameter tuning, etc.) that I run on another dataset. My question is: is there a way to know why a given claim has…

xgboost

asked Jun 14 '18 at 14:48

Fabrice BOUCHAREL

439
3
12

10

votes

4 answers

Is this a good practice of feature engineering?

I have a practical question about feature engineering. Say I want to predict house prices using logistic regression and use a bunch of features, including zip code. Then, by checking the feature importance, I realize zip is a pretty good…

asked Jun 13 '18 at 22:07

user3768495

927
1
7
8

10

votes

2 answers

NN embedding layer

Several neural network libraries such as TensorFlow and PyTorch offer an Embedding layer. Having implemented word2vec in the past, I understand the reasoning behind wanting a lower-dimensional representation. However, it would seem the embedding…

asked May 31 '18 at 15:30

cbake

101
1
3

10

votes

2 answers

How to train the same RNN over multiple series?

I have multiple separate time series and would like to train the same LSTM network on them. How should I do this? I can't just concatenate the time series (along time), because I am afraid the network will be confused by jumps at the points of…

asked May 30 '18 at 09:57

Dims

201
2
5

10

votes

5 answers

In which epoch should I stop the training to avoid overfitting?

I'm working on an age estimation project, trying to classify a given face into a predefined age range. For that purpose I'm training a deep NN using the Keras library. The accuracy for the training and the validation sets is shown in the graph…

asked May 29 '18 at 09:33

Yiannis Ath

188
1
1
10

10

votes

1 answer

Assumptions of linear regression

In simple terms, what are the assumptions of linear regression? I just want to know when I can apply a linear regression model to a dataset.

linear-regression

asked May 29 '18 at 04:27

Anvay Joshi

119
4

10

votes

1 answer

How to draw convolutional neural network diagrams?

I have to draw a CNN diagram similar to this: I tried all the tools mentioned in https://datascience.stackexchange.com/a/14900, but there is no easy way to do it. Is there any automated way to do it, or do I have to do it manually? In addition, is…

asked May 21 '18 at 15:55

Beginner

209
1
2
5

10

votes

2 answers

Debugging Neural Networks

I've built an artificial neural network in Python using the scipy.optimize.minimize (conjugate gradient) optimization function. I've implemented gradient checking, double-checked everything, etc., and I'm pretty certain it's working correctly. I've run…

asked Jun 11 '14 at 18:22

user3726050

109
4

10

votes

3 answers

AUC-ROC of a random classifier

Why is the area under the ROC curve for a random classifier equal to 0.5, and why does the curve have a diagonal shape? To me, a random classifier would have 25% each of TP, TN, FP, and FN, and would therefore be only a single point on the ROC curve.

classification

asked May 20 '18 at 06:12

Victor

281
1
3
5
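A minimal sketch for the AUC-ROC question above, assuming the "random classifier" outputs a continuous score that is independent of the true label (the helper name `roc_auc` and the sample size are illustrative, not from the question). Sweeping a threshold over such scores traces a curve rather than a single point, and the area under it can be computed as the probability that a random positive outscores a random negative:

```python
import random

def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties count as half a win. Hypothetical helper written for illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

rng = random.Random(0)  # fixed seed so the run is repeatable
labels = [rng.randint(0, 1) for _ in range(2000)]
scores = [rng.random() for _ in labels]  # scores carry no information about labels

auc = roc_auc(labels, scores)
print(round(auc, 3))  # expected to land close to 0.5
```

A classifier that emits only hard labels by coin flip does give a single ROC point, but that point has equal expected true-positive and false-positive rates, so it also sits on the diagonal.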

10

votes

2 answers

weighted cross entropy for imbalanced dataset - multiclass classification

I am trying to classify images into more than 100 classes of different sizes, ranging from 300 to 4000 (mean size 1500 with std 600). I am using a pretty standard CNN where the last layer outputs a vector of length number of classes, and using…

asked May 15 '18 at 12:35

Itaysason

139
1
1
6

10

votes

3 answers

Why do we use gradients instead of residuals in Gradient Boosting?

I have found mentions of two advantages in using gradients instead of actual residuals: 1) Using gradients will allow us to plug in any loss function (not just MSE) without having to change our base learners to make them compatible with the loss…

asked May 13 '18 at 20:25

eyio

101
1
3
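As a short worked step for the gradient boosting question above: for squared-error loss, the negative gradient with respect to the current prediction is exactly the residual, which is why the two views coincide for MSE and differ for other losses. The symbols are assumptions for illustration: y is the target, F is the current model prediction, and the factor 1/2 is a common convention.

```latex
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - F
```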

10

votes

2 answers

Multicollinearity in Decision Tree

Can anybody please explain the effect of multicollinearity on decision tree algorithms (classification and regression)? I have done some searching but was not able to find the right answer, as some say it affects them and others say it doesn't.

decision-trees

asked May 08 '18 at 18:43

deepguy

1,441
8
18
39

10

votes

2 answers

Validation showing huge fluctuations. What could be the cause?

I'm training a CNN for a 3-class image classification problem. My training loss decreased smoothly, which is the expected behaviour. However, my validation loss shows a lot of fluctuation. Is this something that I should be worried about, or should…

asked May 02 '18 at 11:25

Josh

487
4
8

10

votes

1 answer

Clustering customer data stored in ElasticSearch

I have a bunch of customer profiles stored in an Elasticsearch cluster. These profiles are now used for creation of target groups for our email subscriptions. Target groups are now formed manually using Elasticsearch faceted search capabilities…

asked May 14 '14 at 08:38

Konstantin V. Salikhov

634
7
18

10

votes

2 answers

How does the bounding box regressor work in Fast R-CNN?

In the Fast R-CNN paper (https://arxiv.org/abs/1504.08083) by Ross Girshick, the bounding box parameters are continuous variables. These values are predicted using a regression method. Unlike other neural network outputs, these values do not represent…

asked Apr 20 '18 at 07:25

Saptarshi Roy

439
2
4
11
