Most Popular

1500 questions
103
votes
7 answers

The Book of Why by Judea Pearl: Why is he bashing statistics?

I am reading The Book of Why by Judea Pearl, and it is getting under my skin. Specifically, it appears to me that he is unconditionally bashing "classical" statistics by putting up a straw man argument that statistics is never, ever able to…
January
  • 7,559
103
votes
8 answers

When is unbalanced data really a problem in Machine Learning?

We already had multiple questions about unbalanced data when using logistic regression, SVM, decision trees, bagging and a number of other similar questions, which makes it a very popular topic! Unfortunately, each of the questions seems to be…
Tim
  • 138,066
103
votes
9 answers

Are there any examples where Bayesian credible intervals are obviously inferior to frequentist confidence intervals?

A recent question on the difference between confidence and credible intervals led me to start re-reading Edwin Jaynes' article on that topic: Jaynes, E. T., 1976. 'Confidence Intervals vs Bayesian Intervals,' in Foundations of Probability Theory,…
Dikran Marsupial
  • 54,432
103
votes
9 answers

What's the difference between correlation and simple linear regression?

In particular, I am referring to the Pearson product-moment correlation coefficient.
102
votes
6 answers

Kendall Tau or Spearman's rho?

In which cases should one prefer one over the other? I found someone who claims an advantage for Kendall for pedagogical reasons; are there other reasons?
Tal Galili
  • 21,541
102
votes
9 answers

If mean is so sensitive, why use it in the first place?

It is a known fact that the median is resistant to outliers. If that is the case, when and why would we use the mean in the first place? One thing I can think of perhaps is to understand the presence of outliers, i.e., if the median is far from the mean,…
Legend
  • 4,532
101
votes
7 answers

Mutual information versus correlation

Why and when should we use mutual information over statistical correlation measures such as Pearson, Spearman, or Kendall's tau?
SaZa
  • 1,155
101
votes
5 answers

How to calculate Area Under the Curve (AUC), or the c-statistic, by hand

I am interested in calculating area under the curve (AUC), or the c-statistic, by hand for a binary logistic regression model. For example, in the validation dataset, I have the true value for the dependent variable, retention (1 = retained; 0 = not…
100
votes
1 answer

What correlation makes a matrix singular and what are implications of singularity or near-singularity?

I am doing some calculations on different matrices (mainly in logistic regression) and I commonly get the error "Matrix is singular", where I have to go back and remove the correlated variables. My question here is what would you consider a "highly"…
Error404
  • 1,421
100
votes
1 answer

What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?

The Mean Absolute Percentage Error (MAPE) is a common accuracy or error measure for time series or other predictions, $$ \text{MAPE} = \frac{100}{n}\sum_{t=1}^n\frac{|A_t-F_t|}{A_t}\%,$$ where $A_t$ are actuals and $F_t$ corresponding forecasts or…
Stephan Kolassa
  • 123,354
100
votes
6 answers

How to 'sum' a standard deviation?

I have a monthly average for a value and a standard deviation corresponding to that average. I am now computing the annual average as the sum of monthly averages; how can I represent the standard deviation for the summed average? For example…
klonq
  • 1,277
100
votes
5 answers

How to plot ROC curves in multiclass classification?

In other words, instead of having a two class problem I am dealing with 4 classes and still would like to assess performance using AUC.
CLOCK
100
votes
2 answers

How much do we know about p-hacking "in the wild"?

The phrase p-hacking (also: "data dredging", "snooping" or "fishing") refers to various kinds of statistical malpractice in which results become artificially statistically significant. There are many ways to procure a "more significant" result,…
Silverfish
  • 23,353
99
votes
3 answers

Shape of confidence interval for predicted values in linear regression

I have noticed that the confidence interval for predicted values in a linear regression tends to be narrow around the mean of the predictor and fat around the minimum and maximum values of the predictor. This can be seen in plots of these 4 linear…
luciano
  • 14,269
99
votes
4 answers

How to choose nlme or lme4 R library for mixed effects models?

I have fit a few mixed effects models (particularly longitudinal models) using lme4 in R but would like to really master the models and the code that goes with them. However, before diving in with both feet (and buying some books) I want to be sure…
Chris Beeley
  • 5,761