Most Popular
1500 questions
118
votes
7 answers
T-test for non normal when N>50?
Long ago I learnt that a normal distribution was necessary to use a two-sample t-test. Today a colleague told me that she learnt that for N>50 a normal distribution was not necessary. Is that true?
If true, is that because of the central limit theorem?
even
- 2,347
118
votes
2 answers
Why do we need to normalize data before principal component analysis (PCA)?
I'm doing principal component analysis on my dataset and my professor told me that I should normalize the data before doing the analysis. Why?
What would happen if I did PCA without normalization?
Why do we normalize data in general?
Could…
jjepsuomi
- 5,807
118
votes
10 answers
Validation Error less than training error?
I found two questions here and here about this issue, but there is no obvious answer or explanation yet. I face the same problem, where the validation error is less than the training error in my Convolutional Neural Network. What does that mean?
Bido
- 1,283
118
votes
4 answers
Why isn't Logistic Regression called Logistic Classification?
Since Logistic Regression is a statistical classification model dealing with categorical dependent variables, why isn't it called Logistic Classification? Shouldn't the "Regression" name be reserved for models dealing with continuous dependent…
Ismael Ghalimi
- 2,166
117
votes
17 answers
What is the role of the logarithm in Shannon's entropy?
Shannon's entropy is the negative of the sum of the probabilities of each outcome multiplied by the logarithm of probabilities for each outcome. What purpose does the logarithm serve in this equation?
An intuitive or visual answer (as opposed to a…
histelheim
- 2,993
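The definition quoted in the entropy question above (the negative sum of each outcome's probability times the logarithm of that probability) can be sketched in a few lines. This is only an illustration, not part of the original question: the function name `shannon_entropy`, the base-2 default, and the two coin distributions are assumptions chosen for the example.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Return -sum(p * log(p)) over the outcomes with p > 0.

    `probs` is assumed to be a sequence of probabilities summing to 1.
    Outcomes with p == 0 are skipped, following the usual convention
    that 0 * log(0) is treated as 0.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)


# Hypothetical distributions, chosen only to illustrate the formula.
fair_coin = shannon_entropy([0.5, 0.5])    # maximally uncertain two-outcome case
biased_coin = shannon_entropy([0.9, 0.1])  # more predictable, so lower entropy
certain = shannon_entropy([1.0])           # a single sure outcome carries no surprise

print(fair_coin, biased_coin, certain)
```

With base 2 the result is measured in bits; changing `base` only rescales the value by a constant factor.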
117
votes
5 answers
What skills are required to perform large scale statistical analyses?
Many statistical jobs ask for experience with large scale data. What are the sorts of statistical and computational skills that would be needed for working with large data sets? For example, how about building regression models given a data set with…
bit-question
- 2,817
116
votes
4 answers
What is the difference between zero-inflated and hurdle models?
I wonder if there is a clear-cut difference between the so-called zero-inflated distributions (models) and so-called hurdle-at-zero distributions (models)? The terms occur quite often in the literature and I suspect they are not the same, but would…
skulker
- 1,394
116
votes
8 answers
What is the benefit of breaking up a continuous predictor variable?
I'm wondering what the value is in taking a continuous predictor variable and breaking it up (e.g., into quintiles), before using it in a model.
It seems to me that by binning the variable we lose information.
Is this just so we can model…
Tom
- 1,771
116
votes
4 answers
What is rank deficiency, and how to deal with it?
Fitting a logistic regression using lme4 ends with
Error in mer_finalize(ans) : Downdated X'X is not positive definite.
A likely cause of this error is apparently rank deficiency. What is rank deficiency, and how should I address it?
Jack Tanner
- 4,842
116
votes
4 answers
Difference between standard error and standard deviation
I'm struggling to understand the difference between the standard error and the standard deviation. How are they different and why do you need to measure the standard error?
louis xie
- 1,333
116
votes
16 answers
If 900 out of 1000 people say a car is blue, what is the probability that it is blue?
This initially arose in connection with some work we are doing on a model to classify natural text, but I've simplified it... Perhaps too much.
You have a blue car (by some objective scientific measure - it is blue).
You show it to 1000 people.
900 say…
Pat Molloy
- 1,125
116
votes
20 answers
What misused statistical terms are worth correcting?
Statistics is everywhere; common usage of statistical terms is, however, often unclear.
The terms probability and odds are used interchangeably in lay English despite their well-defined and different mathematical expressions.
Not separating the term…
Antoni Parellada
- 26,280
116
votes
10 answers
ASA discusses limitations of $p$-values - what are the alternatives?
We already have multiple threads tagged as p-values that reveal lots of misunderstandings about them. Ten months ago we had a thread about a psychological journal that "banned" $p$-values; now the American Statistical Association (2016) says that with our…
Tim
- 138,066
116
votes
6 answers
What is the relation between k-means clustering and PCA?
It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed that it improves the clustering results in practice (noise reduction).
However I am interested in a comparative and…
mic
- 4,318
115
votes
12 answers
When should linear regression be called "machine learning"?
In a recent colloquium, the speaker's abstract claimed they were using machine learning. During the talk, the only thing related to machine learning was that they perform linear regression on their data. After calculating the best-fit coefficients…
jvriesem
- 1,507