Most Popular

1500 questions
43
votes
4 answers

What are the advantages of stacking multiple LSTMs?

What are the advantages of using multiple LSTMs, stacked side by side, in a deep network? I am using an LSTM to represent a sequence of inputs as a single input. So once I have that single representation, why would I pass it through…
wordSmith
  • 745
43
votes
1 answer

Relative variable importance for Boosting

I'm looking for an explanation of how relative variable importance is computed in Gradient Boosted Trees that is not overly general/simplistic like: The measures are based on the number of times a variable is selected for splitting, weighted by the…
Antoine
  • 6,159
43
votes
2 answers

Variance of product of dependent variables

What is the formula for variance of product of dependent variables? In the case of independent variables the formula is simple: $$ \operatorname{var}(XY) = E(X^2Y^2) - E(XY)^2 = \operatorname{var}(X) \operatorname{var}(Y) +…
Riga
  • 133
43
votes
6 answers

Testing for autocorrelation: Ljung-Box versus Breusch-Godfrey

I am used to seeing the Ljung-Box test used quite frequently for testing autocorrelation in raw data or in model residuals. I had nearly forgotten that there is another test for autocorrelation, namely the Breusch-Godfrey test. Question: what are the main…
Richard Hardy
  • 67,272
43
votes
3 answers

When should one use Coordinate descent vs. gradient descent?

I was wondering what the different use cases are for the two algorithms, Coordinate Descent and Gradient Descent. I know that coordinate descent has problems with non-smooth functions but it is used in popular algorithms like SVM and LASSO. Gradient…
Bar
  • 2,862
43
votes
5 answers

LDA vs word2vec

I am trying to understand the similarity between Latent Dirichlet Allocation and word2vec for calculating word similarity. As I understand it, LDA maps words to a vector of probabilities of latent topics, while word2vec maps them to a vector of…
Piotr Migdal
  • 5,776
43
votes
2 answers

Finding Quartiles in R

I'm working through a statistics textbook while learning R and I've run into a stumbling block on the following example: After looking at ?quantile I attempted to recreate this in R with the following: > nuclear <- c(7, 20, 16, 6, 58, 9, 20, 50,…
user60305
43
votes
3 answers

What are the measures for accuracy of multilabel data?

Consider a scenario where you are provided with a KnownLabel matrix and a PredictedLabel matrix. I would like to measure the goodness of the PredictedLabel matrix against the KnownLabel matrix. But the challenge here is that the KnownLabel matrix has few…
Learner
  • 4,457
43
votes
9 answers

When teaching statistics, use "normal" or "Gaussian"?

I use mostly "Gaussian distribution" in my book, but someone just suggested I switch to "normal distribution". Any consensus on which term to use for beginners? Of course the two terms are synonyms, so this is not a question about substance, but…
42
votes
2 answers

When is logistic regression solved in closed form?

Take $x \in \{0,1\}^d$ and $y \in \{0,1\}$ and suppose we model the task of predicting y given x using logistic regression. When can logistic regression coefficients be written in closed form? One example is when we use a saturated model. That is,…
Yaroslav Bulatov
  • 6,199
42
votes
3 answers

Distribution of scalar products of two random unit vectors in $D$ dimensions

If $\mathbf{x}$ and $\mathbf{y}$ are two independent random unit vectors in $\mathbb{R}^D$ (uniformly distributed on a unit sphere), what is the distribution of their scalar product (dot product) $\mathbf x \cdot \mathbf y$? I guess as $D$ grows the…
amoeba
  • 104,745
42
votes
4 answers

Justification of one-tailed hypothesis testing

I understand two-tailed hypothesis testing. You have $H_0 : \theta = \theta_0$ (vs. $H_1 = \neg H_0 : \theta \ne \theta_0$). The $p$-value is the probability that $\theta$ generates data at least as extreme as what was observed. I don't understand…
xyzzyrz
  • 3,161
42
votes
2 answers

Error "system is computationally singular" when running a glm

I'm using the robustbase package to run a glm estimation. However when I do it, I get the following error: Error in solve.default(crossprod(X, DiagB * X)/nobs, EEq) : system is computationally singular: reciprocal condition number =…
NK1
  • 603
42
votes
8 answers

Looking for a good and complete probability and statistics book

I never had the opportunity to take a stats course from a math faculty. I am looking for a probability theory and statistics book that is complete and self-sufficient. By complete I mean that it contains all the proofs and does not just state results.…
Julian Karch
  • 1,890
42
votes
4 answers

Good methods for density plots of non-negative variables in R?

plot(density(rexp(100))) Obviously all density to the left of zero represents bias. I'm looking to summarize some data for non-statisticians, and I want to avoid questions about why non-negative data has density to the left of zero. The plots are…
generic_user
  • 13,339