Highest Voted Questions - Statistical Analysis Stack Exchange

64

votes

5 answers

Why does collecting data until finding a significant result increase Type I error rate?

I was wondering exactly why collecting data until a significant result (e.g., $p \lt .05$) is obtained (i.e., p-hacking) increases the Type I error rate? I would also highly appreciate an R demonstration of this phenomenon.

asked Oct 26 '17 at 17:29

Reza

936
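The effect asked about in the question above can be checked by simulation. What follows is a minimal sketch in Python rather than the R the asker requested, and it rests on assumptions not found in the question: data drawn from a standard normal under a true null, a two-sided z-test with known standard deviation 1, a first look at 10 observations, and a cap of 100. The function names are hypothetical.

```python
import math
import random


def significant(total, n, z_crit=1.96):
    """Two-sided z-test of H0: mean = 0, assuming a known sd of 1."""
    # z = mean * sqrt(n) / sd = total / sqrt(n) when sd = 1
    return abs(total / math.sqrt(n)) > z_crit


def false_positive_rates(n_sims=2000, n_start=10, n_max=100, seed=1):
    """Compare 'test after every observation' with a single test at n_max.

    The null hypothesis is true in every simulated study, so every
    rejection counted here is a Type I error.
    """
    rng = random.Random(seed)
    peek_hits = 0   # studies that were significant at ANY look
    fixed_hits = 0  # studies significant at the single planned look
    for _ in range(n_sims):
        total = 0.0
        ever_significant = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            if n >= n_start and significant(total, n):
                ever_significant = True
        peek_hits += ever_significant
        fixed_hits += significant(total, n_max)
    return peek_hits / n_sims, fixed_hits / n_sims


if __name__ == "__main__":
    peek_rate, fixed_rate = false_positive_rates()
    print(f"stop at first significant look: {peek_rate:.3f}")
    print(f"single test at n = 100:         {fixed_rate:.3f}")
```

Each look is another chance for noise to cross the threshold, so the rate under repeated looks should come out well above the nominal 5%, while the single planned test should stay close to it.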

64

votes

5 answers

Training a decision tree against unbalanced data

I'm new to data mining and I'm trying to train a decision tree against a data set which is highly unbalanced. However, I'm having problems with poor predictive accuracy. The data consists of students studying courses, and the class variable is the…

asked May 08 '12 at 16:13

chrisb

905

64

votes

3 answers

Statistics and causal inference?

In his 1984 paper "Statistics and Causal Inference", Paul Holland raised one of the most fundamental questions in statistics: What can a statistical model say about causation? This led to his motto: NO CAUSATION WITHOUT MANIPULATION which…

causality

asked Aug 31 '10 at 19:13

Shane

12,461

64

votes

3 answers

Interpretation of log transformed predictor and/or response

I'm wondering if it makes a difference in interpretation whether only the dependent, both the dependent and independent, or only the independent variables are log transformed. Consider the case of log(DV) = Intercept + B1*IV + Error. I can…

asked Nov 16 '11 at 10:03

upabove

3,137

64

votes

17 answers

Machine learning cookbook / reference card / cheatsheet?

I find resources like the Probability and Statistics Cookbook and The R Reference Card for Data Mining incredibly useful. They obviously serve well as references but also help me to organize my thoughts on a subject and get the lay of the land. Q:…

asked Jun 27 '11 at 03:33

lowndrul

2,117

64

votes

3 answers

What is the objective function of PCA?

Principal component analysis can use matrix decomposition, but that is just a tool to get there. How would you find the principal components without the use of matrix algebra? What is the objective function (goal), and what are the constraints?

pca

asked May 02 '11 at 23:10

Neil McGuigan

9,872

63

votes

3 answers

What is quasi-binomial distribution (in the context of GLM)?

I'm hoping someone can provide an intuitive overview of what the quasibinomial distribution is and what it does. I'm particularly interested in these points: How quasibinomial differs from the binomial distribution. When the response variable is a…

asked Mar 28 '14 at 16:56

luciano

14,269

63

votes

6 answers

Is random forest a boosting algorithm?

Short definition of boosting: Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random…

asked Nov 19 '13 at 16:34

Atilla Ozgur

1,291

63

votes

3 answers

What is the intuition behind conditional Gaussian distributions?

Suppose that $\mathbf{X} \sim N_{2}(\mathbf{\mu}, \mathbf{\Sigma})$. Then the conditional distribution of $X_1$ given that $X_2 = x_2$ is multivariate normally distributed with mean: $$ E[P(X_1 | X_2 = x_2)] =…

asked Sep 27 '13 at 14:37

eroeijr

631

63

votes

4 answers

Confidence interval for Bernoulli sampling

I have a random sample of Bernoulli random variables $X_1 ... X_N$, where $X_i$ are i.i.d. r.v. and $P(X_i = 1) = p$, and $p$ is an unknown parameter. Obviously, one can find an estimate for $p$: $\hat{p}:=(X_1+\dots+X_N)/N$. My question is how can…

asked Nov 20 '10 at 12:05

Oleg
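One common way to build an interval for $p$ in the setting above is the Wilson score interval. The sketch below is only an illustration of that one approach, not necessarily what the asker had in mind; the function name `wilson_interval` and the default critical value of 1.96 (roughly 95% coverage) are assumptions.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a Bernoulli proportion.

    successes: number of observations equal to 1
    n:         total number of observations (must be positive)
    z:         normal critical value; 1.96 gives roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(50, 100)
    print(f"{low:.4f} to {high:.4f}")
```

Unlike the simpler interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/N}$, this one does not collapse to zero width when $\hat{p}$ is 0 or 1.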

63

votes

2 answers

How should one interpret the comparison of means from different sample sizes?

Take the case of book ratings on a website. Book A is rated by 10,000 people with an average rating of 4.25 and the variance $\sigma = 0.5$. Similarly Book B is rated by 100 people and has a rating of 4.5 with $\sigma = 0.25$. Now because of the…

asked Jun 29 '12 at 01:24

PhD

14,627

63

votes

4 answers

What is the root cause of the class imbalance problem?

I've been thinking a lot about the "class imbalance problem" in machine/statistical learning lately, and am being drawn ever deeper into a feeling that I just don't understand what is going on. First let me define (or attempt to define) my terms: The…

asked Nov 25 '16 at 19:02

Matthew Drury

35,629

63

votes

3 answers

Who created the first standard normal table?

I'm about to introduce the standard normal table in my introductory statistics class, and that got me wondering: who created the first standard normal table? How did they do it before computers came along? I shudder to think of someone brute-force…

asked Sep 04 '16 at 23:16

Daniel Smolkin

633

63

votes

2 answers

What is the difference between random-effects, fixed-effects, and marginal models?

I am trying to expand my knowledge of statistics. I come from a physical sciences background with a "recipe based" approach to statistical testing, where we say is it continuous, is it normally distributed -- OLS regression. In my reading I have…

asked Jan 26 '12 at 12:56

N26

1,955

63

votes

7 answers

Is it a good practice to always scale/normalize data for machine learning?

My understanding is that features with very different value ranges (for example, imagine one feature being the age of a person and another one being their salary in USD) negatively affect algorithms, because the feature with…

asked Jan 07 '16 at 04:09

Juan Antonio Gomez Moriano

1,329