Most Popular

1500 questions
72
votes
5 answers

What is so cool about de Finetti's representation theorem?

From Theory of Statistics by Mark J. Schervish (page 12): Although DeFinetti's representation theorem 1.49 is central to motivating parametric models, it is not actually used in their implementation. How is the theorem central to parametric…
gui11aume
  • 14,703
72
votes
7 answers

Why is tanh almost always better than sigmoid as an activation function?

In Andrew Ng's Neural Networks and Deep Learning course on Coursera, he says that using $\tanh$ is almost always preferable to using $\operatorname{sigmoid}$. The reason he gives is that the outputs using $\tanh$ centre around 0 rather than $\operatorname{sigmoid}$'s 0.5, and this…
Tom Hale
  • 2,561
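
A minimal sketch of the centring claim in the excerpt above, assuming NumPy and a hypothetical helper `sigmoid` defined here for illustration. It only checks that, on inputs symmetric around 0, tanh outputs average to 0 while sigmoid outputs average to 0.5, and that tanh is a rescaled, shifted sigmoid. It does not address whether one activation trains better than the other.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid; a helper defined here for illustration only."""
    return 1.0 / (1.0 + np.exp(-x))

# Inputs symmetric around 0 (an assumption made for this illustration).
x = np.linspace(-4.0, 4.0, 9)

# tanh is an odd function, so its outputs average to 0 on a symmetric grid.
tanh_mean = np.tanh(x).mean()

# sigmoid(x) + sigmoid(-x) = 1, so its outputs average to 0.5 on the same grid.
sigmoid_mean = sigmoid(x).mean()

# The two are related by tanh(x) = 2 * sigmoid(2x) - 1.
identity_gap = np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)).max()

print(tanh_mean, sigmoid_mean, identity_gap)
```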
72
votes
8 answers

How to simulate data that satisfy specific constraints such as having specific mean and standard deviation?

This question is motivated by my question on meta-analysis. But I imagine that it would also be useful in teaching contexts where you want to create a dataset that exactly mirrors an existing published dataset. I know how to generate random data…
Jeromy Anglim
  • 44,984
72
votes
7 answers

What is a "saturated" model?

What is meant when we say we have a saturated model?
72
votes
4 answers

What is the proper usage of scale_pos_weight in xgboost for imbalanced datasets?

I have a very imbalanced dataset. I'm trying to follow the tuning advice and use scale_pos_weight, but I'm not sure how I should tune it. I can see that RegLossObj.GetGradient does: if (info.labels[i] == 1.0f) w *= param_.scale_pos_weight so a gradient…
ihadanny
  • 3,300
72
votes
2 answers

How to interpret type I, type II, and type III ANOVA and MANOVA?

My primary question is how to interpret the output (coefficients, F, P) when conducting a Type I (sequential) ANOVA? My specific research problem is a bit more complex, so I will break my example into parts. First, if I am interested in the effect…
djhocking
  • 1,931
72
votes
4 answers

How do you calculate the probability density function of the maximum of a sample of IID uniform random variables?

Given the random variable $$Y = \max(X_1, X_2, \ldots, X_n)$$ where $X_i$ are IID uniform variables, how do I calculate the PDF of $Y$?
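
A hedged sketch related to the question above, assuming the $X_i$ are Uniform(0, 1) (the excerpt does not state the interval). Under that assumption, independence gives $P(Y \le y) = y^n$ on $[0, 1]$, so the density is $n y^{n-1}$. The function names are hypothetical, and the simulation is only a sanity check of the formula.

```python
import numpy as np

def max_uniform_cdf(y, n):
    """CDF of the maximum of n IID Uniform(0, 1) draws: y**n on [0, 1]."""
    y = np.clip(np.asarray(y, dtype=float), 0.0, 1.0)
    return y ** n

def max_uniform_pdf(y, n):
    """Density of the maximum: n * y**(n - 1) on [0, 1], zero elsewhere."""
    y = np.asarray(y, dtype=float)
    inside = (y >= 0.0) & (y <= 1.0)
    return np.where(inside, n * np.clip(y, 0.0, 1.0) ** (n - 1), 0.0)

# Monte Carlo check: compare the empirical CDF of simulated maxima
# with the closed form at a few grid points.
rng = np.random.default_rng(0)
n = 5
samples = rng.uniform(size=(100_000, n)).max(axis=1)
grid = np.array([0.25, 0.5, 0.75, 0.9])
empirical_cdf = np.array([(samples <= g).mean() for g in grid])
max_abs_error = np.abs(empirical_cdf - max_uniform_cdf(grid, n)).max()

print(max_abs_error)
```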
72
votes
2 answers

Derivation of closed form lasso solution

For the lasso problem $\min_\beta (Y-X\beta)^T(Y-X\beta)$ such that $\|\beta\|_1 \leq t$, I often see the soft-thresholding result $$ \beta_j^{\text{lasso}}= \mathrm{sgn}(\beta^{\text{LS}}_j)(|\beta_j^{\text{LS}}|-\gamma)^+ $$ for the orthonormal…
Gary
  • 1,601
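
The soft-thresholding formula quoted in the excerpt above can be sketched as follows. This is an illustration only: the function name `soft_threshold` and the example values of `beta_ls` and `gamma` are assumptions, and the code does not show how $\gamma$ relates to the constraint level $t$.

```python
import numpy as np

def soft_threshold(beta_ls, gamma):
    """Apply sgn(b) * (|b| - gamma)^+ elementwise to least-squares coefficients."""
    beta_ls = np.asarray(beta_ls, dtype=float)
    # (.)^+ is the positive part: values below zero are clipped to zero.
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - gamma, 0.0)

# Hypothetical least-squares coefficients, chosen only for illustration.
beta_ls = np.array([3.0, -0.5, 1.2, -2.0])
beta_lasso = soft_threshold(beta_ls, gamma=1.0)

print(beta_lasso)
```

Coefficients whose magnitude is at most `gamma` are set to zero, and the rest are shrunk toward zero by `gamma`.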
72
votes
4 answers

How to tune hyperparameters of xgboost trees?

I have class-imbalanced data and I want to tune the hyperparameters of the boosted trees using xgboost. Questions: Is there an equivalent of gridsearchcv or randomsearchcv for xgboost? If not, what is the recommended approach to tune the parameters…
72
votes
6 answers

Difference between "kernel" and "filter" in CNN

What is the difference between the terms "kernel" and "filter" in the context of convolutional neural networks?
ryguy
  • 941
  • 1
  • 7
  • 7
72
votes
12 answers

What does orthogonal mean in the context of statistics?

In other contexts, orthogonal means "at right angles" or "perpendicular". What does orthogonal mean in a statistical context? Thanks for any clarifications.
pmgjones
  • 5,773
  • 8
  • 38
  • 36
72
votes
3 answers

Why does ridge estimate become better than OLS by adding a constant to the diagonal?

I understand that the ridge regression estimate is the $\beta$ that minimizes the residual sum of squares plus a penalty on the size of $\beta$ $$\beta_\mathrm{ridge} = (\lambda I_D + X'X)^{-1}X'y = \operatorname{argmin}\big[ \text{RSS} + \lambda…
Heisenberg
  • 4,590
  • 4
  • 30
  • 62
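
A minimal sketch of the closed form $(\lambda I_D + X'X)^{-1}X'y$ quoted in the excerpt above, assuming NumPy and synthetic data generated purely for illustration. The function name `ridge_closed_form` and the chosen values of `lam` are hypothetical. Setting `lam=0.0` recovers the ordinary least-squares solution when $X'X$ is invertible.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (lam * I_D + X'X) beta = X'y without forming an explicit inverse."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# Synthetic data, generated only for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_ols = ridge_closed_form(X, y, lam=0.0)     # no penalty: least squares
beta_ridge = ridge_closed_form(X, y, lam=10.0)  # constant added to the diagonal

print(beta_ols, beta_ridge)
```

Adding a positive constant to the diagonal shrinks the coefficient vector: the norm of `beta_ridge` is smaller than the norm of `beta_ols`.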
72
votes
4 answers

Intuitive explanation of Fisher Information and Cramer-Rao bound

I am not comfortable with Fisher information: what it measures and how it is helpful. Also, its relationship with the Cramer-Rao bound is not apparent to me. Can someone please give an intuitive explanation of these concepts?
Infinity
  • 963
  • 1
  • 8
  • 7
71
votes
2 answers

Do we need a global test before post hoc tests?

I often hear that post hoc tests after an ANOVA can only be used if the ANOVA itself was significant. However, post hoc tests adjust $p$-values to keep the global type I error rate at 5%, don't they? So why do we need the global test first? If…
even
  • 2,347
  • 6
  • 19
  • 13
71
votes
2 answers

Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?

The coefficient of an explanatory variable in a multiple regression tells us the relationship of that explanatory variable with the dependent variable. All this, while 'controlling' for the other explanatory variables. How I have viewed it so…