Highest Voted Questions - Data Science Stack Exchange

10

votes

5 answers

How to create a good list of stopwords

I am looking for some hints on how to curate a list of stopwords. Does someone know / can someone recommend a good method to extract stopword lists from the dataset itself for preprocessing and filtering? The Data: a huge amount of human text input…

asked May 24 '15 at 21:45

PlagTag

10

votes

1 answer

How does the forward method get called in this PyTorch conv net?

In this example network from the PyTorch tutorial: import torch import torch.nn as nn import torch.nn.functional as F class Net(nn.Module): def __init__(self): super(Net, self).__init__() # 1 input image channel, 6 output channels,…

asked Aug 30 '19 at 10:17

SheppLogan

10

votes

3 answers

Decision Trees - how does split for categorical features happen?

A decision tree, while performing recursive binary splitting, selects an independent variable (say $X_j$) and a threshold (say $t$) such that the predictor space is split into regions $\{X \mid X_j < t\}$ and $\{X \mid X_j \ge t\}$, and which leads to greatest…
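
The split search described in this excerpt can be sketched roughly as follows. This is only an illustration, not the asker's code or any library's implementation: the excerpt is cut off before naming the quantity being optimised, so the sum of squared errors for a numeric target is assumed here, and the helper names `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (feature_index, threshold, cost) with the lowest total SSE.

    For feature j and threshold t, the two regions are
    {x : x[j] < t} and {x : x[j] >= t}.
    The SSE criterion is an assumption made for this sketch.
    """
    best = None
    n_features = len(X[0])
    for j in range(n_features):
        # Candidate thresholds: every distinct observed value of feature j.
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] < t]
            right = [yi for row, yi in zip(X, y) if row[j] >= t]
            if not left or not right:
                continue  # skip splits that leave one region empty
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (j, t, cost)
    return best


# Toy data: the target changes exactly where feature 0 reaches 3.0.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [4.0, 3.0]]
y = [10.0, 10.0, 20.0, 20.0]
split = best_split(X, y)
```

On this toy data the search settles on feature 0 with threshold 3.0, since that split separates the two target values perfectly.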

asked Aug 08 '19 at 17:25

Supratim Haldar

10

votes

1 answer

Spark, optimally splitting a single RDD into two

I have a large dataset that I need to split into groups according to specific parameters. I want the job to process as efficiently as possible. I can envision two ways of doing so: Option 1 - Create map from original RDD and filter def…

asked May 01 '15 at 20:32

j.a.gartner

10

votes

2 answers

How does class_weight work in Decision Tree

The scikit-learn implementation of DecisionTreeClassifier has a parameter called class_weight. As per the documentation: Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. and The…

asked Jul 23 '19 at 14:29

Supratim Haldar

10

votes

1 answer

Multi-Head attention mechanism in transformer and need of feed forward neural network

After reading the paper "Attention Is All You Need", I have two questions: 1. What is the need for a multi-head attention mechanism? The paper says that: "Multi-head attention allows the model to jointly attend to information from different…

asked Jul 14 '19 at 14:37

Zephyr

10

votes

3 answers

Isolation forest sklearn contamination param

I am working on an unsupervised anomaly detection task on time series data using an isolation forest algorithm. I am developing it in Python, specifically with scikit-learn. I found a lot of examples on this, but what is not very clear is how to…

asked Jul 01 '19 at 19:58

Giordano

10

votes

2 answers

Is it valid to shuffle time-series data for a prediction task?

I have a time-series dataset that records some participants' daily features from wearable sensors and their daily mood status. The goal is to use one day's daily features and predict the next day's mood status for participants with machine learning…
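
The setup in this excerpt (one day's features paired with the next day's mood) can be sketched as below. It is a minimal illustration under stated assumptions rather than an answer from the thread: the data are made up, the function names `next_day_pairs` and `chronological_split` are hypothetical, and the unshuffled, time-ordered split is just one common choice.

```python
def next_day_pairs(features, moods):
    """Pair day t's features with day t+1's mood, keeping time order."""
    return [(features[t], moods[t + 1]) for t in range(len(features) - 1)]


def chronological_split(pairs, train_fraction=0.8):
    """Split without shuffling: earlier pairs train, later pairs test."""
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]


# Made-up example: six days of a single sensor feature and a mood score.
features = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
moods = [3, 4, 2, 5, 4, 3]

pairs = next_day_pairs(features, moods)
train, test = chronological_split(pairs)
```

Six days yield five pairs, because the last day has no following mood to predict. Every test pair here comes from later days than any training pair.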

asked Jun 21 '19 at 17:48

Han

10

votes

1 answer

Differences between class_weight and scale_pos_weight in LightGBM

I have a very imbalanced dataset with the ratio of the positive samples to the negative samples being 1:496. The scoring metric is the f1 score and my desired model is LightGBM. I am using the sklearn implementation of LightGBM. I have read the…

asked Jun 18 '19 at 20:36

Oluwafemi E. Ogundare

10

votes

5 answers

Why do decision trees need categorical variables to be encoded?

As per my intuition, decision trees should work better with categorical variables than with continuous variables. If this is the case, why is encoding needed on categorical variables? Can someone give me the intuition behind this?

asked May 16 '19 at 11:58

Mukesh K

10

votes

4 answers

How to replace NaN values for image data?

My data set has a total of 200 columns, where each column corresponds to the same pixel in all of my images. In total, I have 48,500 rows. The labels for the data range from 0-9. The data looks something like this: raw_0 raw_1 raw_2 raw_3 …

asked May 04 '19 at 10:23

chomprrr

10

votes

3 answers

Splitting train/test sets by an identifier?

I know sklearn has train_test_split() to split a train and test set. But I read that, even with setting a random seed, if your actual dataset is updated regularly, the random seed will reset with each updated dataset and take a different train/test…
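
One common way to split by an identifier, so that a record stays on the same side as the dataset grows, is to hash the identifier. The sketch below is an illustration and not taken from the question or its answers: it assumes each row has a stable identifier, uses CRC32 from the standard library as the hash, and the function name `in_test_set` is hypothetical.

```python
import zlib


def in_test_set(identifier, test_ratio=0.2):
    """Assign an identifier to the test set based on a hash of its value.

    The result depends only on the identifier and test_ratio, not on a
    random seed or on which other rows are present in the dataset.
    """
    checksum = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return checksum < test_ratio * 2**32


# Hypothetical identifiers; adding new ones never moves the existing ones.
ids = ["user-1", "user-2", "user-3", "user-4", "user-5"]
test_ids = [i for i in ids if in_test_set(i)]
train_ids = [i for i in ids if not in_test_set(i)]
```

With few rows the realised test share can differ noticeably from `test_ratio`; it only approaches the requested ratio as the number of identifiers grows.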

asked May 03 '19 at 22:42

Greg Rosen

10

votes

2 answers

Proper way of fighting negative outputs of regression algorithms where output must be positive all the way

Maybe this is a rather general question. I am trying to solve various regression tasks and I try various algorithms for them. For example, multivariate linear regression or an SVR. I know that the output can't be negative and I never have negative output…

asked Jan 30 '15 at 21:30

Maksim Khaitovich

10

votes

2 answers

How to check whether the distributions of the training set and testing set are similar

I have been taking part in a Kaggle competition and I find that the distributions of the training set and testing set can be different, so I am wondering how to check whether the distributions of the training set and testing set are similar. And…

asked Apr 18 '19 at 11:22

Bowen Peng

10

votes

4 answers

SGDClassifier: Online Learning/partial_fit with a previously unknown label

My training set contains about 50k entries, which I use for initial training. On a weekly basis, ~5k entries are added; but the same amount "disappears" (as it is user data which has to be deleted after some time). Therefore I use online learning…

asked Apr 02 '19 at 11:25

swalkner
