Most Popular

1500 questions
12 votes
2 answers

How to split train/test in recommender systems

I am working with the MovieLens10M dataset, predicting user ratings. If I want to fairly evaluate my algorithm, how should I split my training v. test data? By default, I believe the data is split into train v. test sets where 'test' contains…
jamesmf
12 votes
1 answer

Why does an ML model produce different results despite random_state being defined? And how to set a global random seed for sklearn

I have been running a few ML models on the same set of data for a binary classification problem with a class proportion of 33:67. I had the same algorithms and the same set of hyperparameters during yesterday's and today's runs. Please note that I also have the…
The Great
12 votes
1 answer

What are practical differences between kernel k-means and spectral clustering?

Lately I've been wondering about kernel k-means and spectral clustering algorithms and their differences. I know that spectral clustering is a broader term and different settings can affect the way it works, but one popular variant is using…
Kuba_
12 votes
5 answers

Retrieve dropped column names from `sklearn.impute.SimpleImputer`

The SimpleImputer class takes pandas dataframes and returns unlabeled numpy arrays, which means that the SimpleImputer drops some features at will but has no way to communicate to the caller which features have been dropped. I've been trying to come…
lurscher
12 votes
3 answers

Issue with IPython/Jupyter on Spark (Unrecognized alias)

I am working on setting up a set of VMs to experiment with Spark before I go out and spend money on building up a cluster with some hardware. Quick note: I am an academic with a background in applied machine learning and work quite a bit in…
gcd
12 votes
2 answers

Shall I use the Euclidean Distance or the Cosine Similarity to compute the semantic similarity of two words?

I want to compute the semantic similarity of two words using their vector representations (obtained using e.g. word2vec, GloVe, etc.). Shall I use the Euclidean Distance or the Cosine Similarity? The GloVe website mentions both measures without…
Franck Dernoncourt
12 votes
1 answer

MinHashing vs SimHashing

Suppose I have five sets I'd like to cluster. I understand that the SimHashing technique described here: https://moultano.wordpress.com/2010/01/21/simple-simhashing-3kbzhsxyg4467-6/ could yield three clusters ({A}, {B,C,D} and {E}), for instance, if…
cjauvin
12 votes
3 answers

Airline Fares - What analysis should be used to detect competitive price-setting behavior and price correlations?

I want to investigate the price-setting behavior of airlines, specifically how airlines react to competitors' pricing. As my knowledge of more complex analysis is quite limited, I've mostly used basic methods to gather an overall view…
s1x
12 votes
1 answer

Finding linear transformation under which distance matrices are similar

I have $n$ sets of vectors, where each set $S_i$ contains $k$ vectors in $\mathbb{R}^d$. I know there is some unknown linear transformation $W$ under which the distance matrix $D_i$ (a $k\times k$ matrix) is approximately "the same" (i.e. has a low…
12 votes
5 answers

How to scrape an IMDb webpage?

I am trying to learn web scraping with Python on my own as part of an effort to learn data analysis. I am trying to scrape an IMDb webpage using the BeautifulSoup module. The following is the code I am using: r = requests.get(url) # where url is the…
user62198
12 votes
1 answer

model.cuda() in pytorch

If I call model.cuda() in PyTorch, where model is a subclass of nn.Module, and I have four GPUs, how will it utilize the four GPUs, and how do I know which GPUs are being used?
william007
12 votes
1 answer

What is whole word masking in the recent BERT model?

I was checking the BERT GitHub page and noticed that there are new models built with a new training technique called "whole word masking". Here is a snippet describing it: In the original pre-processing code, we randomly select WordPiece tokens to…
kee
12 votes
2 answers

Which comes first? Multiple Imputation, Splitting into train/test, or Standardization/Normalization

I am working on a multi-class classification problem, with ~65 features and ~150K instances. 30% of features are categorical and the rest are numerical (continuous). I understand that standardization or normalization should be done after splitting…
Sarah
12 votes
7 answers

What is an 'old name' of data scientist?

Terms like 'data science' and 'data scientist' are increasingly used these days. Many companies are hiring 'data scientists'. But I don't think it's a completely new job. Data has existed for a long time and someone had to deal with it. I guess the…
user67275
12 votes
1 answer

What is the first input to the decoder in a transformer model?

The image is from Jay Alammar's post on transformers. K_encdec and V_encdec are calculated in a matrix multiplication with the encoder outputs and sent to the encoder-decoder attention layer of each decoder layer in the decoder. The previous output is…
mLstudent33