Highest Voted Questions - Statistical Analysis Stack Exchange

118

votes

7 answers

T-test for non normal when N>50?

Long ago I learnt that normal distribution was necessary to use a two sample T-test. Today a colleague told me that she learnt that for N>50 normal distribution was not necessary. Is that true? If true is that because of the central limit theorem?

asked Apr 14 '11 at 21:55

even

2,347

118

votes

2 answers

Why do we need to normalize data before principal component analysis (PCA)?

I'm doing principal component analysis on my dataset and my professor told me that I should normalize the data before doing the analysis. Why? What would happen if I did PCA without normalization? Why do we normalize data in general? Could…

asked Sep 04 '13 at 08:12

jjepsuomi

5,807

118

votes

10 answers

Validation Error less than training error?

I found two questions here and here about this issue, but there is no obvious answer or explanation yet. I encounter the same problem, where the validation error is less than the training error in my convolutional neural network. What does that mean?

asked Dec 17 '15 at 22:04

Bido

1,283

118

votes

4 answers

Why isn't Logistic Regression called Logistic Classification?

Since Logistic Regression is a statistical classification model dealing with categorical dependent variables, why isn't it called Logistic Classification? Shouldn't the "Regression" name be reserved to models dealing with continuous dependent…

asked Dec 07 '14 at 18:44

Ismael Ghalimi

2,166

117

votes

17 answers

What is the role of the logarithm in Shannon's entropy?

Shannon's entropy is the negative of the sum of the probabilities of each outcome multiplied by the logarithm of probabilities for each outcome. What purpose does the logarithm serve in this equation? An intuitive or visual answer (as opposed to a…

asked Feb 19 '14 at 17:33

histelheim

2,993
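
The definition quoted in the entropy question above (the negative sum over outcomes of each probability times its logarithm) can be sketched in a few lines of Python. This is only an illustration of that formula, not an answer from the thread; the function name `shannon_entropy` and its `base` parameter are assumptions introduced here.

```python
import math


def shannon_entropy(probs, base=2.0):
    """Return -sum(p * log(p)) over the outcomes, in units set by `base`.

    `probs` is a sequence of outcome probabilities (hypothetical input
    format chosen for this sketch). Base 2 gives the result in bits.
    """
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: p * log(p) tends to 0 as p -> 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)


if __name__ == "__main__":
    print(shannon_entropy([0.5, 0.5]))   # fair coin
    print(shannon_entropy([0.9, 0.1]))   # biased coin, lower entropy
    print(shannon_entropy([0.25] * 4))   # four equally likely outcomes
```

With base 2, a fair coin has an entropy of 1 bit and four equally likely outcomes have 2 bits, which hints at one role of the logarithm: it makes the entropy of independent sources add up.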

117

votes

5 answers

What skills are required to perform large scale statistical analyses?

Many statistical jobs ask for experience with large scale data. What are the sorts of statistical and computational skills that would be needed for working with large data sets? For example, how about building regression models given a data set with…

asked Mar 02 '11 at 19:05

bit-question

2,817

116

votes

4 answers

What is the difference between zero-inflated and hurdle models?

I wonder if there is a clear-cut difference between the so-called zero-inflated distributions (models) and so-called hurdle-at-zero distributions (models)? The terms occur quite often in the literature and I suspect they are not the same, but would…

zero-inflation

asked Jan 07 '14 at 04:46

skulker

1,394

116

votes

8 answers

What is the benefit of breaking up a continuous predictor variable?

I'm wondering what the value is in taking a continuous predictor variable and breaking it up (e.g., into quintiles), before using it in a model. It seems to me that by binning the variable we lose information. Is this just so we can model…

asked Aug 31 '13 at 05:32

Tom

1,771

116

votes

4 answers

What is rank deficiency, and how to deal with it?

Fitting a logistic regression using lme4 ends with Error in mer_finalize(ans) : Downdated X'X is not positive definite. A likely cause of this error is apparently rank deficiency. What is rank deficiency, and how should I address it?

asked Aug 25 '12 at 06:30

Jack Tanner

4,842

116

votes

4 answers

Difference between standard error and standard deviation

I'm struggling to understand the difference between the standard error and the standard deviation. How are they different and why do you need to measure the standard error?

asked Jul 15 '12 at 10:21

louis xie

1,333

116

votes

16 answers

If 900 out of 1000 people say a car is blue, what is the probability that it is blue?

This initially arose in connection with some work we are doing on a model to classify natural text, but I've simplified it... Perhaps too much. You have a blue car (by some objective scientific measure - it is blue). You show it to 1000 people. 900 say…

probability

asked Aug 20 '17 at 19:57

Pat Molloy

1,125

116

votes

20 answers

What misused statistical terms are worth correcting?

Statistics is everywhere; common usage of statistical terms is, however, often unclear. The terms probability and odds are used interchangeably in lay English despite their well-defined and different mathematical expressions. Not separating the term…

terminology

asked Mar 21 '16 at 21:24

Antoni Parellada

26,280

116

votes

10 answers

ASA discusses limitations of $p$-values - what are the alternatives?

We already have multiple threads tagged as p-values that reveal lots of misunderstandings about them. Ten months ago we had a thread about a psychological journal that "banned" $p$-values; now the American Statistical Association (2016) says that with our…

asked Mar 08 '16 at 08:32

Tim

138,066

116

votes

6 answers

What is the relation between k-means clustering and PCA?

It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed that it improves the clustering results in practice (noise reduction). However I am interested in a comparative and…

asked Nov 23 '15 at 22:42

mic

4,318

115

votes

12 answers

When should linear regression be called "machine learning"?

In a recent colloquium, the speaker's abstract claimed they were using machine learning. During the talk, the only thing related to machine learning was that they perform linear regression on their data. After calculating the best-fit coefficients…

asked Mar 20 '17 at 22:10

jvriesem

1,507
