Most Popular

1500 questions
62
votes
1 answer

Bootstrap vs. jackknife

Both bootstrap and jackknife methods can be used to estimate the bias and standard error of an estimate, and the mechanisms of the two resampling methods are not hugely different: sampling with replacement vs. leaving out one observation at a time. However,…
Tu.2
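The two resampling mechanisms named in the excerpt above can be sketched side by side. This is a minimal illustration, not code from the question or its answer: the function names `jackknife_se` and `bootstrap_se`, the sample values, the seed, and the number of resamples are all assumptions chosen for the example.

```python
import random
import statistics


def jackknife_se(data, estimator):
    """Jackknife standard error: leave out one observation at a time."""
    n = len(data)
    # Recompute the estimate n times, each time dropping observation i.
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    loo_mean = sum(loo) / n
    return ((n - 1) / n * sum((v - loo_mean) ** 2 for v in loo)) ** 0.5


def bootstrap_se(data, estimator, n_resamples=2000, seed=0):
    """Bootstrap standard error: resample n points with replacement."""
    rng = random.Random(seed)
    n = len(data)
    reps = [estimator(rng.choices(data, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(reps)


# Hypothetical sample, used only to exercise the two functions.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 4.0]
print("jackknife SE of mean:", jackknife_se(sample, statistics.mean))
print("bootstrap SE of mean:", bootstrap_se(sample, statistics.mean))
```

For the sample mean, the jackknife value coincides with the textbook standard error s/sqrt(n), while the bootstrap value varies slightly with the seed and the number of resamples.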
62
votes
4 answers

How does LSTM prevent the vanishing gradient problem?

LSTM was invented specifically to avoid the vanishing gradient problem. It is supposed to do that with the Constant Error Carousel (CEC), which on the diagram below (from Greff et al.) corresponds to the loop around the cell. (source:…
62
votes
3 answers

Why does shrinkage work?

In order to solve problems of model selection, a number of methods (LASSO, ridge regression, etc.) will shrink the coefficients of predictor variables towards zero. I am looking for an intuitive explanation of why this improves predictive ability.…
62
votes
2 answers

Should I normalize word2vec's word vectors before using them?

After training word vectors with word2vec, is it better to normalize them before using them in downstream applications? I.e., what are the pros and cons of normalizing them?
Franck Dernoncourt
62
votes
10 answers

Measuring entropy/ information/ patterns of a 2d binary matrix

I want to measure the entropy/ information density/ pattern-likeness of a two-dimensional binary matrix. Let me show some pictures for clarification: This display should have a rather high entropy: A) This should have medium entropy: B) These…
Felix S
62
votes
3 answers

Why do Convolutional Neural Networks not use a Support Vector Machine to classify?

In recent years, Convolutional Neural Networks (CNNs) have become the state-of-the-art for object recognition in computer vision. Typically, a CNN consists of several convolutional layers, followed by two fully-connected layers. An intuition behind…
Karnivaurus
62
votes
4 answers

Why sigmoid function instead of anything else?

Why is the de facto standard sigmoid function, $\frac{1}{1+e^{-x}}$, so popular in (non-deep) neural networks and logistic regression? Why don't we use many of the other differentiable functions, with faster computation time or slower decay (so…
62
votes
4 answers

Recurrent vs Recursive Neural Networks: Which is better for NLP?

There are Recurrent Neural Networks and Recursive Neural Networks. Both are usually denoted by the same acronym: RNN. According to Wikipedia, Recurrent NN are in fact Recursive NN, but I don't really understand the explanation. Moreover, I don't…
62
votes
12 answers

Software needed to scrape data from graph

Does anybody have experience with software (preferably free, preferably open source) that will take an image of data plotted on Cartesian coordinates (a standard, everyday plot) and extract the coordinates of the points plotted on the…
62
votes
7 answers

Period detection of a generic time series

This post is the continuation of another post related to a generic method for outlier detection in time series. Basically, at this point I'm interested in a robust way to discover the periodicity/seasonality of a generic time series affected by a…
gianluca
61
votes
2 answers

What is the difference between a Normal and a Gaussian Distribution?

Is there a deep difference between a Normal and a Gaussian distribution? I've seen many papers use the terms without distinction, and I usually refer to them as the same thing too. However, my PI recently told me that a normal is the specific case of…
61
votes
10 answers

What does "Scientists rise up against statistical significance" mean? (Comment in Nature)

The title of the Comment in Nature Scientists rise up against statistical significance begins with: Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories call for an end to hyped claims and the dismissal of possibly…
uhoh
61
votes
4 answers

Box-Cox like transformation for independent variables?

Is there a Box-Cox-like transformation for independent variables? That is, a transformation that optimizes the $x$ variable so that y ~ f(x) gives a more reasonable fit for a linear model? If so, is there a function that performs this in R?
Tal Galili
61
votes
7 answers

Industry vs Kaggle challenges. Is collecting more observations and having access to more variables more important than fancy modelling?

I hope the title is self-explanatory. On Kaggle, most winners use stacking, sometimes with hundreds of base models, to squeeze out a few extra percent of MSE, accuracy... In general, in your experience, how important is fancy modelling such as stacking vs…
Tom
61
votes
11 answers

Brain teaser: How to generate 7 integers with equal probability using a biased coin that has a pr(head) = p?

This is a question I found on Glassdoor: How does one generate 7 integers with equal probability using a coin that has a $\mathbb{Pr}(\text{Head}) = p\in(0,1)$? Basically, you have a coin that may or may not be fair, and this is the only…
Amazonian
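One well-known way to approach the brain teaser above, sketched here under stated assumptions and not taken from the thread's answers: first turn the biased coin into a fair bit with the von Neumann trick (flip twice, keep only unequal pairs), then assemble three fair bits and reject the value 0 so that 1 through 7 remain equally likely. The helper names `make_biased_coin`, `fair_bit`, and `uniform_1_to_7`, as well as the choice p = 0.3, are hypothetical.

```python
import random


def make_biased_coin(p, seed=None):
    """Return a function that yields True (heads) with probability p."""
    rng = random.Random(seed)
    return lambda: rng.random() < p


def fair_bit(flip):
    """Von Neumann trick: flip twice, keep only HT (1) or TH (0)."""
    while True:
        a, b = flip(), flip()
        if a != b:  # HT and TH both occur with probability p * (1 - p)
            return 1 if a else 0


def uniform_1_to_7(flip):
    """Combine three fair bits into 0..7 and reject 0, leaving 1..7."""
    while True:
        value = (fair_bit(flip) << 2) | (fair_bit(flip) << 1) | fair_bit(flip)
        if value != 0:
            return value


# Empirical check with an assumed bias of p = 0.3.
flip = make_biased_coin(p=0.3, seed=42)
counts = {k: 0 for k in range(1, 8)}
for _ in range(70_000):
    counts[uniform_1_to_7(flip)] += 1
print(counts)
```

The sketch never needs to know p, which matches the constraint in the question; it only costs more flips when p is close to 0 or 1.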