Most Popular
1500 questions
10 votes, 5 answers
How to create a good list of stopwords
I am looking for some hints on how to curate a list of stopwords. Does someone know / can someone recommend a good method to extract stopword lists from the dataset itself for preprocessing and filtering?
The Data:
a huge amount of human text input…
PlagTag
10 votes, 1 answer
How does the forward method get called in this PyTorch conv net?
In this example network from the PyTorch tutorial:
import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels,…
SheppLogan
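A simplified analogy may help with this question: `nn.Module` defines `__call__`, so calling a module instance like a function ends up invoking its `forward` method. The pure-Python sketch below mimics only that dispatch. `MiniModule` and `Doubler` are hypothetical names, not PyTorch classes, and the real `__call__` does more work (for example, running registered hooks).

```python
class MiniModule:
    """Hypothetical stand-in for nn.Module: only the __call__ -> forward dispatch."""

    def __call__(self, *args, **kwargs):
        # PyTorch additionally runs hooks around this call; omitted here.
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError("subclasses define forward()")


class Doubler(MiniModule):
    def forward(self, x):
        return 2 * x


net = Doubler()
result = net(21)  # the instance call is routed to Doubler.forward
```

This is why tutorial code writes `net(input)` and never calls `net.forward(input)` directly.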
10 votes, 3 answers
Decision Trees - how does split for categorical features happen?
A decision tree, while performing recursive binary splitting, selects an independent variable (say $X_j$) and a threshold (say $t$) such that the predictor space is split into the regions $\{X \mid X_j < t\}$ and $\{X \mid X_j \ge t\}$, and which leads to the greatest…
Supratim Haldar
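The threshold search described in the excerpt can be sketched for a single numeric feature as follows. The split criterion is cut off in the excerpt, so size-weighted Gini impurity is used here purely as an assumption; `gini` and `best_threshold` are hypothetical helper names.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (t, impurity) minimizing the size-weighted Gini impurity of
    the regions {x < t} and {x >= t} for a single feature."""
    n = len(values)
    distinct = sorted(set(values))
    best_t, best_score = None, float("inf")
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


x_j = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = ["a", "a", "a", "b", "b", "b"]
t, impurity = best_threshold(x_j, y)
```

A categorical feature has no natural ordering, so an implementation has to enumerate something other than thresholds (for example, subsets of categories); that part is not shown here.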
10 votes, 1 answer
Spark, optimally splitting a single RDD into two
I have a large dataset that I need to split into groups according to specific parameters. I want the job to process as efficiently as possible. I can envision two ways of doing so:
Option 1 - Create map from original RDD and filter
def…
j.a.gartner
10 votes, 2 answers
How does class_weight work in a Decision Tree?
The scikit-learn implementation of DecisionTreeClassifier has a parameter called class_weight.
As per the documentation:
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
and
The…
Supratim Haldar
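One way to picture the `{class_label: weight}` form quoted above is that each sample counts with the weight of its class when node impurity is computed. The sketch below illustrates that idea with a weighted Gini impurity. It is a simplified illustration under that assumption, not scikit-learn's code, and `weighted_gini` is a hypothetical helper.

```python
def weighted_gini(labels, class_weight=None):
    """Gini impurity where each sample counts with its class's weight.
    Classes missing from class_weight (or class_weight=None) get weight one."""
    class_weight = class_weight or {}
    totals = {}
    for label in labels:
        totals[label] = totals.get(label, 0.0) + class_weight.get(label, 1.0)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


node = [0] * 9 + [1]                      # 9 negatives, 1 positive
plain = weighted_gini(node)               # all weights equal to one
reweighted = weighted_gini(node, {1: 9})  # the positive counts nine times
```

With unit weights the node looks nearly pure; with the minority class weighted up it looks evenly mixed, so a split that isolates the minority sample becomes more attractive.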
10 votes, 1 answer
Multi-head attention mechanism in the Transformer and the need for a feed-forward neural network
After reading the paper "Attention Is All You Need", I have two questions:
1. What is the need for a multi-head attention mechanism?
The paper says that:
"Multi-head attention allows the model to jointly attend to
information from different…
Zephyr
10 votes, 3 answers
Isolation forest sklearn contamination param
I am working on an unsupervised anomaly detection task on time series data using an isolation forest algorithm.
I am developing it in Python, more specifically with scikit-learn.
I found a lot of examples on this, but what is not very clear is how to…
Giordano
10 votes, 2 answers
Is it valid to shuffle time-series data for a prediction task?
I have a time-series dataset that records some participants' daily features from wearable sensors and their daily mood status.
The goal is to use one day's daily features and predict the next day's mood status for participants with machine learning…
Han
10 votes, 1 answer
Differences between class_weight and scale_pos_weight in LightGBM
I have a very imbalanced dataset with the ratio of the positive samples to the negative samples being 1:496. The scoring metric is the f1 score and my desired model is LightGBM. I am using the sklearn implementation of LightGBM.
I have read the…
Oluwafemi E. Ogundare
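For the 1:496 ratio mentioned in the question, the arithmetic behind two common weighting choices can be sketched without LightGBM itself. The sample counts are hypothetical (chosen only to match the ratio). The first helper applies scikit-learn's "balanced" heuristic, `n_samples / (n_classes * count)`; the negative-to-positive ratio is a commonly used starting value for `scale_pos_weight`. Whether LightGBM treats the two settings identically internally is not claimed here.

```python
def balanced_class_weights(counts):
    """scikit-learn's 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Hypothetical counts matching the 1:496 ratio from the question.
counts = {0: 49_600, 1: 100}

weights = balanced_class_weights(counts)
pos_to_neg_weight = weights[1] / weights[0]  # relative weight of one positive
neg_pos_ratio = counts[0] / counts[1]        # common scale_pos_weight start
```

In this setup both numbers say the same thing: one positive sample is weighted like 496 negatives.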
10 votes, 5 answers
Why do decision trees need categorical variables to be encoded?
As per my intuition, decision trees should work better with categorical variables than with continuous variables. If this is the case, why is encoding needed on categorical variables? Can someone give me the intuition behind this?
Mukesh K
10 votes, 4 answers
How to replace NaN values for image data?
My data set has a total of 200 columns, where each column corresponds to the same pixel in all of my images. In total, I have 48,500 rows. The labels for the data range from 0-9.
The data looks something like this:
raw_0 raw_1 raw_2 raw_3 …
chomprrr
10 votes, 3 answers
Splitting train/test sets by an identifier?
I know sklearn has train_test_split() to split a train and test set. But I read that, even with setting a random seed, if your actual dataset is updated regularly, the random seed will reset with each updated dataset and take a different train/test…
Greg Rosen
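One way to keep a split stable as the dataset grows is to derive each row's assignment from a hash of its identifier instead of from a random draw. The sketch below assumes every row carries a stable identifier; `in_test_set`, `split_by_id`, and the `user_id` field are hypothetical names used only for illustration.

```python
import zlib


def in_test_set(identifier, test_ratio=0.2):
    """Assign a row to the test set from a hash of its identifier, so the
    assignment does not change when new rows are added to the dataset."""
    checksum = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return checksum < test_ratio * 2**32


def split_by_id(rows, id_key, test_ratio=0.2):
    """Partition rows into (train, test) using only each row's identifier."""
    train, test = [], []
    for row in rows:
        (test if in_test_set(row[id_key], test_ratio) else train).append(row)
    return train, test


rows = [{"user_id": i, "value": i * i} for i in range(1000)]
train, test = split_by_id(rows, "user_id")

# Appending new rows leaves the earlier assignments untouched.
more_rows = rows + [{"user_id": i, "value": 0} for i in range(1000, 1200)]
train2, test2 = split_by_id(more_rows, "user_id")
```

The test fraction is only approximately `test_ratio`, since it depends on how the hashes of the actual identifiers happen to fall.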
10 votes, 2 answers
Proper way of handling negative outputs of a regression algorithm where the output must always be positive
Maybe this is a rather general question. I am trying to solve various regression tasks and I try various algorithms for them. For example, multivariate linear regression or an SVR. I know that the output can't be negative and I never have negative output…
Maksim Khaitovich
10 votes, 2 answers
How to check whether the distributions of the training set and testing set are similar
I have been taking part in a Kaggle competition and have found that the distributions of the training set and the testing set differ, so I am wondering how to check whether the two distributions are similar.
And…
Bowen Peng
10 votes, 4 answers
SGDClassifier: Online Learning/partial_fit with a previously unknown label
My training set contains about 50k entries with which I do an initial learning. On a weekly basis, ~ 5k entries are added; but the same amount "disappears" (as it is user data which has to be deleted after some time).
Therefore I use online learning…
swalkner