Highest Voted Questions - Data Science Stack Exchange

12

votes

2 answers

How to split train/test in recommender systems

I am working with the MovieLens10M dataset, predicting user ratings. If I want to fairly evaluate my algorithm, how should I split my training v. test data? By default, I believe the data is split into train v. test sets where 'test' contains…

asked Aug 17 '15 at 20:34

jamesmf

3,097
1
17
25

12

votes

1 answer

Why ML model produces different results despite random_state defined? And how to set global random seed for sklearn

I have been running a few ML models on the same set of data for a binary classification problem with a class proportion of 33:67. I had the same algorithms and the same set of hyperparameters during yesterday's and today's runs. Please note that I also have the…

asked Jan 12 '20 at 11:34

The Great

2,565
2
20
43

12

votes

1 answer

What are practical differences between kernel k-means and spectral clustering?

Lately I've been wondering about kernel k-means and spectral clustering algorithms and their differences. I know that spectral clustering is a broader term and that different settings can affect the way it works, but one popular variant is using…

asked Jan 09 '20 at 07:40

Kuba_

264
1
10

12

votes

5 answers

Retrieve dropped column names from `sklearn.impute.SimpleImputer`

The SimpleImputer class takes pandas dataframes and returns unlabeled numpy arrays. This means that SimpleImputer drops some features at will but has no way to communicate to the caller which features have been dropped. I've been trying to come…

asked Jan 06 '20 at 15:17

lurscher

223
2
5

12

votes

3 answers

Issue with IPython/Jupyter on Spark (Unrecognized alias)

I am working on setting up a set of VMs to experiment with Spark before I go out and spend money on building up a cluster with some hardware. Quick note: I am an academic with a background in applied machine learning and work quite a bit in…

asked Jul 23 '15 at 03:45

gcd

121
1
3

12

votes

2 answers

Shall I use the Euclidean Distance or the Cosine Similarity to compute the semantic similarity of two words?

I want to compute the semantic similarity of two words using their vector representations (obtained using e.g. word2vec, GloVe, etc.). Shall I use the Euclidean Distance or the Cosine Similarity? The GloVe website mentions both measures without…

asked Jul 20 '15 at 04:48

Franck Dernoncourt

5,690
10
40
76

12

votes

1 answer

MinHashing vs SimHashing

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here: https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/ could yield three clusters ({A}, {B,C,D} and {E}), for instance, if…

asked Jun 11 '15 at 21:21

cjauvin

451
3
7

12

votes

3 answers

Airline Fares - What analysis should be used to detect competitive price-setting behavior and price correlations?

I want to investigate the price-setting behavior of airlines, specifically how airlines react to competitors' pricing. Since my knowledge of more complex analysis is quite limited, I've mostly used basic methods to gather an overall view…

asked May 17 '15 at 20:12

s1x

221
1
8

12

votes

1 answer

Finding linear transformation under which distance matrices are similar

I have $n$ sets of vectors, where each set $S_i$ contains $k$ vectors in $\mathbb{R}^d$. I know there is some unknown linear transformation $W$ under which the distance matrix $D_i$ (a $k\times k$ matrix) is approximately "the same" (i.e. has a low…

asked Jul 22 '19 at 14:24

user1767774

33
6

12

votes

5 answers

How to scrape imdb webpage?

I am trying to learn web scraping using Python by myself as part of an effort to learn data analysis. I am trying to scrape an IMDb webpage. I am using the BeautifulSoup module. The following is the code I am using: r = requests.get(url) # where url is the…

asked Apr 15 '15 at 23:53

user62198

1,091
4
16
32

12

votes

1 answer

model.cuda() in pytorch

If I call model.cuda() in PyTorch, where model is a subclass of nn.Module, and I have four GPUs, how will it utilize the four GPUs, and how do I know which GPUs are being used?

pytorch

asked Jul 02 '19 at 12:20

william007

775
1
10
20

12

votes

1 answer

What is whole word masking in the recent BERT model?

I was checking the BERT GitHub page and noticed that there are new models built from a new training technique called "whole word masking". Here is a snippet describing it: In the original pre-processing code, we randomly select WordPiece tokens to…

asked Jun 15 '19 at 23:13

kee

223
2
6

12

votes

2 answers

Which comes first? Multiple Imputation, Splitting into train/test, or Standardization/Normalization

I am working on a multi-class classification problem, with ~65 features and ~150K instances. 30% of features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting…

asked Jun 03 '19 at 14:59

Sarah

611
2
5
17

12

votes

7 answers

What is an 'old name' of data scientist?

Terms like 'data science' and 'data scientist' are increasingly used these days. Many companies are hiring 'data scientists'. But I don't think it's a completely new job. Data have existed for a long time, and someone had to deal with them. I guess the…

bigdata

asked Feb 28 '15 at 22:10

user67275

263
1
3
15

12

votes

1 answer

what is the first input to the decoder in a transformer model?

The image is from url: Jay Alammar on transformers. K_encdec and V_encdec are calculated in a matrix multiplication with the encoder outputs and sent to the encoder-decoder attention layer of each decoder layer in the decoder. The previous output is…

asked May 11 '19 at 08:36

mLstudent33

594
1
5
19
