Highest Voted Questions - Data Science Stack Exchange

16

votes

3 answers

Why are ensembles so unreasonably effective

It seems to have become axiomatic that an ensemble of learners leads to the best possible model results - and it is becoming far rarer, for example, for single models to win competitions such as Kaggle. Is there a theoretical explanation for why…

asked May 25 '16 at 13:08

Robert de Graaf

899
5
17

16

votes

4 answers

Need help understanding xgboost's approximate split points proposal

Background: in xgboost, the $t$-th iteration tries to fit a tree $f_t$ over all $n$ examples that minimizes the following objective: $$\sum_{i=1}^n\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right]$$ where $g_i, h_i$ are first order and second order derivatives…

asked Apr 01 '16 at 14:35

ihadanny

1,357
2
11
19

16

votes

4 answers

How to do postal addresses fuzzy matching?

I would like to know how to match postal addresses when their formats differ or when one of them is misspelled. So far I've found different solutions, but I think that they are quite old and not very efficient. I'm sure some better methods exist, so…

asked Mar 21 '16 at 12:01

Stéphanie C

281
1
2
5

16

votes

3 answers

Doc2vec(gensim) - How can I infer unseen sentences’ label?

https://radimrehurek.com/gensim/models/doc2vec.html For example, if we have trained doc2vec with "aaaaaAAAAAaaaaaa" - "label 1" and "bbbbbbBBBBBbbbb" - "label 2", can we infer that "aaaaAAAAaaaaAA" is label 1 using Doc2vec? I know Doc2vec can train word…

gensim

asked Mar 09 '16 at 08:37

Seongho

163
1
1
6

15

votes

4 answers

Pandas: how can I create multi-level columns

I have a pandas DataFrame which has the following columns: n_0, n_1, p_0, p_1, e_0, e_1. I want to transform it to have columns and sub-columns: 0 (with n, p, e) and 1 (with n, p, e). I've searched in the documentation, and I'm completely lost on how…

pandas

asked Dec 21 '15 at 11:18

Michael Hooreman

793
2
9
21
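For the multi-level columns question above, one possible approach is sketched here. It assumes every flat column name has the form `<key>_<index>` (as in `n_0`), and the sample values are invented for illustration. It is a minimal sketch, not necessarily the accepted answer.

```python
import pandas as pd

# Hypothetical one-row frame with the flat column names from the question.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6]],
    columns=["n_0", "n_1", "p_0", "p_1", "e_0", "e_1"],
)

# Split "n_0" into ("n", "0") and build (index, key) tuples for a MultiIndex.
pairs = [name.split("_") for name in df.columns]
df.columns = pd.MultiIndex.from_tuples([(int(idx), key) for key, idx in pairs])

# Reorder so the n, p, e sub-columns sit together under each top-level label.
wanted = pd.MultiIndex.from_product([[0, 1], ["n", "p", "e"]])
df = df.reindex(columns=wanted)

print(df)
```

Selecting `df[0]` then returns the `n`, `p`, `e` sub-columns for the first group as an ordinary DataFrame.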

15

votes

2 answers

Analyzing A/B test results which are not normally distributed, using independent t-test

I have a set of results from an A/B test (one control group, one feature group) which do not fit a Normal Distribution. In fact the distribution resembles more closely the Landau Distribution. I believe the independent t-test requires that the…

asked Aug 04 '14 at 22:27

teebszet

253
2
6

15

votes

2 answers

Clustering unique visitors by useragent, ip, session_id

Given website access data in the form session_id, ip, user_agent, and optionally timestamp, following the conditions below, how would you best cluster the sessions into unique visitors? session_id: is an id given to every new visitor. It does not…

clustering

asked May 15 '14 at 09:04

AdrianBR

367
2
10

15

votes

2 answers

What is the difference between active learning and reinforcement learning?

From Wikipedia: Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs. Reinforcement learning (RL)…

asked Nov 13 '20 at 12:54

Moradnejad

265
1
2
6

15

votes

2 answers

How to understand ANOVA-F for feature selection in Python. Sklearn SelectKBest with f_classif

I am trying to understand what it really means to calculate an ANOVA F value for feature selection for a binary classification problem. As I understand from the calculation of ANOVA from basic statistics, we should have at least 2 samples for which…

asked May 19 '20 at 14:50

JuniorDataGuyAKL

153
1
1
6

15

votes

3 answers

Keras Sequential model returns loss 'nan'

I'm implementing a neural network with Keras, but the Sequential model returns nan as the loss value. I have a sigmoid activation function in the output layer to squeeze the output between 0 and 1, but maybe it doesn't work properly. This is the code: def…

asked Feb 19 '20 at 10:18

pairon

405
1
4
15

15

votes

3 answers

How to calculate the mean of a dataframe column and find the top 10%

I am very new to Scala and Spark, and am working on some self-made exercises using baseball statistics. I am using a case class to create an RDD and assign a schema to the data, and am then turning it into a DataFrame so I can use SparkSQL to select…

asked Jul 22 '15 at 14:16

the3rdNotch

253
1
2
7

15

votes

3 answers

Why does frequency encoding work?

Frequency encoding is a widely used technique in Kaggle competitions, and many times proves to be a very reasonable way of dealing with categorical features with high cardinality. I really don't understand why it works. Does it work in very…

asked Nov 25 '19 at 15:36

David Masip

6,051
2
24
61
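To make the frequency encoding question above concrete, here is a minimal sketch of the technique as it is commonly described: each category is replaced by its relative frequency in the data. The column name `city` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"city": ["a", "b", "a", "c", "a", "b"]})

# Relative frequency of each category across the whole column.
freq = df["city"].value_counts(normalize=True)

# Replace each category with its frequency; categories that occur
# equally often become indistinguishable after this step.
df["city_freq"] = df["city"].map(freq)

print(df)
```

In a real train/test setting the frequencies would normally be computed on the training split only and then mapped onto the test split.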

15

votes

3 answers

How to convert categorical data to numerical data in Pyspark

I am using an IPython notebook to work with pyspark applications. I have a CSV file with lots of categorical columns to determine whether the income falls under or over the 50k range. I would like to perform a classification algorithm taking all the…

asked Jun 29 '15 at 22:55

SRS

1,065
5
11
22

15

votes

2 answers

Using attributes to classify/cluster user profiles

I have a dataset of users purchasing products from a website. The attributes I have are the user id, the region (state) of the user, the category id of the product, the keyword ids of the product, the keyword ids of the website, and the amount spent on the product. The goal…

asked May 19 '15 at 23:34

sylvia

303
1
2
8

15

votes

3 answers

How to remove outliers using box-plot?

I have data for a metric grouped by date. I have plotted the data; how do I now remove the values outside the range of the boxplot (outliers)? All the ['AVG'] data is in a single column, and I need it for time series modelling.

asked Jul 01 '19 at 04:15

Uday T

342
1
5
11
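For the box-plot outlier question above, a minimal sketch follows. It assumes the conventional whisker rule of 1.5 times the interquartile range, which is also matplotlib's default for `boxplot`. The column name `AVG` comes from the question; the values are invented.

```python
import pandas as pd

# Hypothetical metric values; 95 is an obvious outlier.
df = pd.DataFrame({"AVG": [10, 12, 11, 13, 12, 95, 11, 10]})

# Quartiles and interquartile range of the column.
q1 = df["AVG"].quantile(0.25)
q3 = df["AVG"].quantile(0.75)
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows whose value lies within the whiskers (inclusive).
filtered = df[df["AVG"].between(lower, upper)]

print(filtered)
```

For time series modelling, dropping rows leaves gaps in the date index, so replacing the flagged values (for example by interpolation) may be preferable to removing them.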
