Most Popular

1500 questions
109
votes
13 answers

Understanding "variance" intuitively

What is the cleanest, easiest way to explain the concept of variance to someone? What does it intuitively mean? If one were to explain this to a child, how would one go about it? It's a concept that I have difficulty articulating - especially when…
PhD
  • 14,627
109
votes
12 answers

Explain "Curse of dimensionality" to a child

I have heard about the curse of dimensionality many times, but somehow I'm still unable to grasp the idea; it's all foggy. Can anyone explain this in the most intuitive way, as you would explain it to a child, so that I (and others as confused as I am)…
109
votes
1 answer

Conditional inference trees vs traditional decision trees

Can anyone explain the primary differences between conditional inference trees (ctree from the party package in R) and the more traditional decision tree algorithms (such as rpart in R)? What makes CI trees different? Strengths and…
B_Miner
  • 8,630
109
votes
13 answers

Simple algorithm for online outlier detection of a generic time series

I am working with a large number of time series. These time series are basically network measurements arriving every 10 minutes; some of them are periodic (e.g. the bandwidth), while others aren't (e.g. the amount of routing traffic). I would…
gianluca
  • 1,981
108
votes
5 answers

Does the variance of a sum equal the sum of the variances?

Is it (always) true that $$\mathrm{Var}\left(\sum\limits_{i=1}^m{X_i}\right) = \sum\limits_{i=1}^m{\mathrm{Var}(X_i)} \>?$$
Abe
  • 3,811
108
votes
4 answers

How to select kernel for SVM?

When using an SVM, we need to select a kernel. How should the kernel be selected? Are there any criteria for kernel selection?
xiaohan2012
  • 7,179
108
votes
6 answers

Principled way of collapsing categorical variables with many levels?

What techniques are available for collapsing (or pooling) many categories to a few, for the purpose of using them as an input (predictor) in a statistical model? Consider a variable like college student major (discipline chosen by an undergraduate…
shadowtalker
  • 12,551
108
votes
5 answers

Loadings vs eigenvectors in PCA: when to use one or another?

In principal component analysis (PCA), we get eigenvectors (unit vectors) and eigenvalues. Now, let us define loadings as $$\text{Loadings} = \text{Eigenvectors} \cdot \sqrt{\text{Eigenvalues}}.$$ I know that eigenvectors are just directions and…
user2696565
  • 1,389
108
votes
1 answer

Correlation between a nominal (IV) and a continuous (DV) variable

I have a nominal variable (different topics of conversation, coded as topic0=0, etc.) and a number of scale variables (DV), such as the length of a conversation. How can I derive correlations between the nominal and scale variables?
Paul Miller
  • 1,081
107
votes
32 answers

What book would you recommend for non-statistician scientists?

What book would you recommend for scientists who are not statisticians? Clear delivery is most appreciated, as is an explanation of the appropriate techniques and methods for typical tasks: time series analysis, presentation and aggregation of…
107
votes
15 answers

US Election results 2016: What went wrong with prediction models?

First it was Brexit, now the US election. Many model predictions were off by a wide margin. Are there lessons to be learned here? As late as 4 pm PST yesterday, the betting markets were still favoring Hillary 4 to 1. I take it that the betting…
horaceT
  • 3,352
107
votes
9 answers

Generate a random variable with a defined correlation to an existing variable(s)

For a simulation study I have to generate random variables that have a predefined (population) correlation with an existing variable $Y$. I looked into the R packages copula and CDVine, which can produce random multivariate distributions with a given…
Felix S
  • 4,700
107
votes
9 answers

Is there an intuitive explanation why multicollinearity is a problem in linear regression?

The wiki discusses the problems that arise when multicollinearity is an issue in linear regression. The basic problem is that multicollinearity results in unstable parameter estimates, which makes it very difficult to assess the effect of independent…
user28
106
votes
3 answers

What are examples where a "naive bootstrap" fails?

Suppose I have a set of sample data from an unknown or complex distribution, and I want to perform some inference on a statistic $T$ of the data. My default inclination is to just generate a bunch of bootstrap samples with replacement, and calculate…
raegtin
  • 9,930
106
votes
19 answers

How to annoy a statistical referee?

I recently asked a question regarding general principles around reviewing statistics in papers. What I would now like to ask is what particularly irritates you when reviewing a paper, i.e. what's the best way to really annoy a statistical…
csgillespie
  • 13,029