Most Popular
1500 questions
16
votes
3 answers
Why are ensembles so unreasonably effective
It seems to have become axiomatic that an ensemble of learners leads to the best possible model results - and it is becoming far rarer, for example, for single models to win competitions such as Kaggle. Is there a theoretical explanation for why…
Robert de Graaf
- 899
- 5
- 17
16
votes
4 answers
Need help understanding xgboost's approximate split points proposal
Background:
In xgboost, the $t$-th iteration tries to fit a tree $f_t$ over all $n$ examples that minimizes the following objective:
$$\sum_{i=1}^n[g_if_t(x_i) + \frac{1}{2}h_if_t^2(x_i)]$$
where $g_i, h_i$ are first order and second order derivatives…
ihadanny
- 1,357
- 2
- 11
- 19
16
votes
4 answers
How to do postal addresses fuzzy matching?
I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.
So far I've found different solutions but I think that they are quite old and not very efficient. I'm sure some better methods exist, so…
Stéphanie C
- 281
- 1
- 2
- 5
16
votes
3 answers
Doc2vec(gensim) - How can I infer unseen sentences’ label?
https://radimrehurek.com/gensim/models/doc2vec.html
For example, if we have trained doc2vec with
"aaaaaAAAAAaaaaaa" - "label 1"
"bbbbbbBBBBBbbbb" - "label 2"
can we infer "aaaaAAAAaaaaAA" is label 1 using Doc2vec?
I know Doc2vec can train word…
Seongho
- 163
- 1
- 1
- 6
15
votes
4 answers
Pandas: how can I create multi-level columns
I have a pandas DataFrame which has the following columns:
n_0
n_1
p_0
p_1
e_0
e_1
I want to transform it to have columns and sub-columns:
0
n
p
e
1
n
p
e
I've searched in the documentation, and I'm completely lost on how…
Michael Hooreman
- 793
- 2
- 9
- 21
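For the multi-level columns question above, one possible approach is sketched below. It assumes every column name has the form `<letter>_<number>` as in the question; the sample values and the helper name `to_multilevel` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "n_0": [1, 2], "n_1": [3, 4],
    "p_0": [5, 6], "p_1": [7, 8],
    "e_0": [9, 10], "e_1": [11, 12],
})


def to_multilevel(frame):
    """Turn flat 'n_0'-style columns into (number, letter) column pairs."""
    out = frame.copy()
    # Split each name on its last underscore: "n_0" -> ("n", "0").
    pairs = [tuple(name.rsplit("_", 1)) for name in out.columns]
    # Put the numeric suffix on the top level and the letter below it.
    out.columns = pd.MultiIndex.from_tuples([(idx, key) for key, idx in pairs])
    # Stable sort on the top level only, so n, p, e keep their order
    # inside each group.
    order = sorted(range(len(out.columns)), key=lambda i: out.columns[i][0])
    return out.iloc[:, order]


multi = to_multilevel(df)
print(multi)
```

Note that the top-level labels are the strings `"0"` and `"1"` here, so a group is selected with `multi["0"]` rather than `multi[0]`.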
15
votes
2 answers
Analyzing A/B test results which are not normally distributed, using independent t-test
I have a set of results from an A/B test (one control group, one feature group) which do not fit a normal distribution.
In fact, the distribution more closely resembles the Landau distribution.
I believe the independent t-test requires that the…
teebszet
- 253
- 2
- 6
15
votes
2 answers
Clustering unique visitors by useragent, ip, session_id
Given website access data in the form session_id, ip, user_agent, and optionally timestamp, following the conditions below, how would you best cluster the sessions into unique visitors?
session_id: is an id given to every new visitor. It does not…
AdrianBR
- 367
- 2
- 10
15
votes
2 answers
What is the difference between active learning and reinforcement learning?
From Wikipedia:
Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs.
Reinforcement learning (RL)…
Moradnejad
- 265
- 1
- 2
- 6
15
votes
2 answers
How to understand ANOVA-F for feature selection in Python. Sklearn SelectKBest with f_classif
I am trying to understand what it really means to calculate an ANOVA F value for feature selection for a binary classification problem.
As I understand from the calculation of ANOVA from basic statistics, we should have at least 2 samples for which…
JuniorDataGuyAKL
- 153
- 1
- 1
- 6
15
votes
3 answers
Keras Sequential model returns loss 'nan'
I'm implementing a neural network with Keras, but the Sequential model returns nan as loss value.
I have a sigmoid activation function in the output layer to squeeze the output between 0 and 1, but maybe it doesn't work properly.
This is the code:
def…
pairon
- 405
- 1
- 4
- 15
15
votes
3 answers
How to calculate the mean of a dataframe column and find the top 10%
I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select…
the3rdNotch
- 253
- 1
- 2
- 7
15
votes
3 answers
Why does frequency encoding work?
Frequency encoding is a widely used technique in Kaggle competitions, and it often proves to be a very reasonable way of dealing with high-cardinality categorical features. I really don't understand why it works.
Does it work in very…
David Masip
- 6,051
- 2
- 24
- 61
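As a point of reference for the frequency encoding question above, here is a minimal sketch of what the technique computes. It does not address why it works; it only shows the transformation. The sample values and the helper name `frequency_encode` are hypothetical.

```python
import pandas as pd


def frequency_encode(series):
    """Replace each category with its relative frequency in `series`."""
    # value_counts(normalize=True) gives the share of rows per category.
    freqs = series.value_counts(normalize=True)
    # Map every row to the share of its own category.
    return series.map(freqs)


# Hypothetical categorical column.
cities = pd.Series(["a", "b", "a", "c", "a", "b"])
encoded = frequency_encode(cities)
print(encoded.tolist())
```

In this sketch the frequencies are computed on the same data they are applied to. A category that was not present when the frequencies were computed would map to `NaN`, so it would need separate handling.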
15
votes
3 answers
How to convert categorical data to numerical data in Pyspark
I am using an IPython notebook to work with pyspark applications. I have a CSV file with lots of categorical columns to determine whether the income falls under or over the 50k range. I would like to perform a classification algorithm taking all the…
SRS
- 1,065
- 5
- 11
- 22
15
votes
2 answers
Using attributes to classify/cluster user profiles
I have a dataset of users purchasing products from a website.
The attributes I have are the user ID, the user's region (state), the product's category ID, the product's keyword ID, the website's keyword ID, and the sales amount spent on the product.
The goal…
sylvia
- 303
- 1
- 2
- 8
15
votes
3 answers
How to remove outliers using box-plot?
I have data for a metric grouped by date. I have plotted the data; how do I now remove the values outside the range of the boxplot (outliers)?
All the ['AVG'] data is in a single column, and I need it for time series modelling.
Uday T
- 342
- 1
- 5
- 11
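For the box-plot question above, a minimal sketch of one common approach follows. It assumes the usual box-plot convention that whiskers extend 1.5 times the interquartile range beyond the quartiles. The column name `AVG` comes from the question; the sample values and the helper name `remove_boxplot_outliers` are hypothetical.

```python
import pandas as pd


def remove_boxplot_outliers(frame, column="AVG", whis=1.5):
    """Keep only rows whose value lies within the box-plot whisker range."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    # Assumed convention: whiskers reach whis * IQR beyond the quartiles.
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return frame[frame[column].between(lower, upper)]


# Hypothetical data with one obvious outlier.
data = pd.DataFrame({"AVG": [10, 11, 12, 11, 10, 12, 11, 100]})
cleaned = remove_boxplot_outliers(data)
print(cleaned)
```

Dropping rows leaves gaps in the date sequence, which may matter for time series modelling; whether to drop, mask, or interpolate the flagged values is a separate choice this sketch does not make.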