Most Popular

1500 questions
127
votes
18 answers

Including the interaction but not the main effects in a model

Is it ever valid to include a two-way interaction in a model without including the main effects? If your hypothesis is only about the interaction, do you still need to include the main effects?
Glen
  • 7,250
126
votes
1 answer

What is an ablation study? And is there a systematic way to perform it?

What is an ablation study? And is there a systematic way to perform it? For example, I have $n$ predictors in a linear regression, which I will call my model. How would I perform an ablation study on it? What metrics should I use? A…
cgo
  • 9,107
125
votes
11 answers

Calculating optimal number of bins in a histogram

I'm interested in finding as optimal a method as I can for determining how many bins I should use in a histogram. My data should range from 30 to 350 objects at most, and in particular I'm trying to apply thresholding (like Otsu's method) where…
Tony Stark
  • 1,353
124
votes
3 answers

Does an unbalanced sample matter when doing logistic regression?

Okay, so I think I have a decent enough sample, taking into account the 20:1 rule of thumb: a fairly large sample (N=374) for a total of 7 candidate predictor variables. My problem is the following: whatever set of predictor variables I use, the…
Michiel
  • 1,343
124
votes
3 answers

Intuitive explanation of unit root

How would you explain intuitively what a unit root is, in the context of the unit root test? I'm thinking of an explanation much like the one I found in this question. The case with unit root is that I know (little, by the way) that the unit root…
Lucas Reis
  • 2,062
124
votes
7 answers

Why use gradient descent for linear regression, when a closed-form math solution is available?

I am taking the Machine Learning courses online and learnt about gradient descent for calculating the optimal values in the hypothesis h(x) = B0 + B1X. Why do we need to use gradient descent if we can easily find the values with the formula below?…
Purus
  • 1,343
122
votes
3 answers

tanh activation function vs sigmoid activation function

The tanh activation function is: $$\tanh(x) = 2 \cdot \sigma(2x) - 1$$ where $\sigma(x)$, the sigmoid function, is defined as: $$\sigma(x) = \frac{e^x}{1 + e^x}.$$ Questions: Does it really matter between using those…
satya
  • 1,373
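
The identity quoted in the tanh question above can be checked numerically. The following is only a minimal sketch using the Python standard library; the function names `sigmoid` and `tanh_via_sigmoid` are illustrative and not part of the original post.

```python
import math


def sigmoid(x: float) -> float:
    # sigma(x) = e^x / (1 + e^x), written in the equivalent form 1 / (1 + e^-x)
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    # tanh(x) = 2 * sigma(2x) - 1, the identity stated in the question
    return 2.0 * sigmoid(2.0 * x) - 1.0


# Compare against the library tanh at a few sample points.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(tanh_via_sigmoid(x), math.tanh(x), abs_tol=1e-12)
```

The check shows that tanh is a rescaled and shifted sigmoid, so the two differ in output range, (-1, 1) versus (0, 1), and in slope at the origin rather than in shape.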
121
votes
4 answers

Why does the Lasso provide Variable Selection?

I've been reading Elements of Statistical Learning, and I would like to know why the Lasso provides variable selection and ridge regression doesn't. Both methods minimize the residual sum of squares and have a constraint on the possible values of…
Shiwen
  • 1,422
121
votes
5 answers

How do you calculate precision and recall for multiclass classification using confusion matrix?

I wonder how to compute precision and recall using a confusion matrix for a multi-class classification problem. Specifically, an observation can only be assigned to its most probable class / label. I would like to compute: Precision = TP / (TP+FP)…
daiyue
  • 1,321
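
As a rough illustration of the precision and recall question above, the sketch below computes one-vs-rest precision and recall for each class from a confusion matrix. It assumes rows are true classes and columns are predicted classes, and it uses the standard definitions Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The function name and the example matrix are hypothetical and not taken from the original post.

```python
def per_class_precision_recall(cm):
    """Return a list of (precision, recall) pairs, one per class.

    Assumption: cm[i][j] counts observations whose true class is i
    and whose predicted class is j.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        # False positives: predicted as k (column k) but truly another class.
        fp = sum(cm[i][k] for i in range(n)) - tp
        # False negatives: truly k (row k) but predicted as another class.
        fn = sum(cm[k]) - tp
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        results.append((precision, recall))
    return results


# Made-up 3-class confusion matrix, for illustration only.
example_cm = [
    [5, 1, 0],
    [2, 6, 2],
    [0, 1, 8],
]
scores = per_class_precision_recall(example_cm)
```

Averaging the per-class values gives macro-averaged precision and recall. Returning 0.0 for a class with no predictions or no true members is a design choice of this sketch, not a universal convention.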
121
votes
21 answers

At each step of a limiting infinite process, put 10 balls in an urn and remove one at random. How many balls are left?

The question (slightly modified) goes as follows and if you have never encountered it before you can check it in example 6a, chapter 2, of Sheldon Ross' A First Course in Probability: Suppose that we possess an infinitely large urn and an infinite …
121
votes
5 answers

Comprehensive list of activation functions in neural networks with pros/cons

Are there any reference document(s) that give a comprehensive list of activation functions in neural networks along with their pros/cons (and ideally some pointers to publications where they were successful or not so successful)?
Franck Dernoncourt
  • 46,817
  • 33
  • 176
  • 288
120
votes
6 answers

Is it possible to train a neural network without backpropagation?

Many neural network books and tutorials spend a lot of time on the backpropagation algorithm, which is essentially a tool to compute the gradient. Let's assume we are building a model with ~10K parameters / weights. Is it possible to run the…
Haitao Du
  • 36,852
  • 25
  • 145
  • 242
120
votes
9 answers

How does the reparameterization trick for VAEs work and why is it important?

How does the reparameterization trick for variational autoencoders (VAE) work? Is there an intuitive and easy explanation without simplifying the underlying math? And why do we need the 'trick'?
David Dao
  • 2,824
119
votes
4 answers

Assessing approximate distribution of data based on a histogram

Suppose I want to see whether my data is exponential based on a histogram (i.e. skewed to the right). Depending on how I group or bin the data, I can get wildly different histograms. One set of histograms will make it seem that the data is…
119
votes
5 answers

Mean absolute error OR root mean squared error?

Why use Root Mean Squared Error (RMSE) instead of Mean Absolute Error (MAE)? I've been investigating the error generated in a calculation. I initially calculated the error as a Root Mean Normalised Squared Error. Looking a little closer, I see the…
user1665220
  • 1,325