Most Popular

1500 questions
64
votes
5 answers

Why does collecting data until finding a significant result increase Type I error rate?

Why exactly does collecting data until a significant result (e.g., $p \lt .05$) is obtained (i.e., p-hacking) increase the Type I error rate? I would also highly appreciate an R demonstration of this phenomenon.
Reza
  • 936
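
The effect asked about above can be illustrated with a small simulation. This is only a minimal sketch, written in Python rather than the R the question asks for, and every name and setting in it is an assumption made for illustration (the helper names `p_value_z`, `optional_stopping_rejects`, `fixed_n_rejects`, and `simulate`; the starting sample of 10; the cap of 100 observations; a z-test with known standard deviation 1). The null hypothesis is true in every simulated study, so any rejection is a Type I error.

```python
import math
import random


def p_value_z(sample_sum, n):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation of 1."""
    z = sample_sum / math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def optional_stopping_rejects(rng, start_n=10, max_n=100, alpha=0.05):
    """Test after every new observation and stop at the first p < alpha."""
    total = sum(rng.gauss(0, 1) for _ in range(start_n))
    n = start_n
    while True:
        if p_value_z(total, n) < alpha:
            return True   # "significant" result found, data collection stops
        if n >= max_n:
            return False  # gave up without ever reaching significance
        total += rng.gauss(0, 1)
        n += 1


def fixed_n_rejects(rng, n=100, alpha=0.05):
    """Single test at a sample size fixed in advance."""
    total = sum(rng.gauss(0, 1) for _ in range(n))
    return p_value_z(total, n) < alpha


def simulate(n_sims=2000, seed=1):
    """Return the rejection rates (optional stopping, fixed n) under a true null."""
    rng = random.Random(seed)
    peeking = sum(optional_stopping_rejects(rng) for _ in range(n_sims)) / n_sims
    fixed = sum(fixed_n_rejects(rng) for _ in range(n_sims)) / n_sims
    return peeking, fixed


if __name__ == "__main__":
    peeking_rate, fixed_rate = simulate()
    print(f"Type I error rate with optional stopping: {peeking_rate:.3f}")
    print(f"Type I error rate with fixed n:           {fixed_rate:.3f}")
```

Under these assumptions the fixed-sample test should reject close to 5% of the time, while the stop-when-significant rule rejects noticeably more often, because each extra look at the data is another chance for a true null to cross the threshold.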
64
votes
5 answers

Training a decision tree against unbalanced data

I'm new to data mining and I'm trying to train a decision tree against a data set which is highly unbalanced. However, I'm having problems with poor predictive accuracy. The data consists of students studying courses, and the class variable is the…
chrisb
  • 905
64
votes
3 answers

Statistics and causal inference?

In his 1984 paper "Statistics and Causal Inference", Paul Holland raised one of the most fundamental questions in statistics: What can a statistical model say about causation? This led to his motto: NO CAUSATION WITHOUT MANIPULATION which…
Shane
  • 12,461
64
votes
3 answers

Interpretation of log transformed predictor and/or response

I'm wondering if it makes a difference in interpretation whether only the dependent, both the dependent and independent, or only the independent variables are log transformed. Consider the case of log(DV) = Intercept + B1*IV + Error. I can…
upabove
  • 3,137
64
votes
17 answers

Machine learning cookbook / reference card / cheatsheet?

I find resources like the Probability and Statistics Cookbook and The R Reference Card for Data Mining incredibly useful. They obviously serve well as references but also help me to organize my thoughts on a subject and get the lay of the land. Q:…
lowndrul
  • 2,117
64
votes
3 answers

What is the objective function of PCA?

Principal component analysis can use matrix decomposition, but that is just a tool to get there. How would you find the principal components without the use of matrix algebra? What is the objective function (goal), and what are the constraints?
63
votes
3 answers

What is quasi-binomial distribution (in the context of GLM)?

I'm hoping someone can provide an intuitive overview of what the quasibinomial distribution is and what it does. I'm particularly interested in these points: How quasibinomial differs from the binomial distribution. When the response variable is a…
luciano
  • 14,269
63
votes
6 answers

Is random forest a boosting algorithm?

Short definition of boosting: Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random…
Atilla Ozgur
  • 1,291
63
votes
3 answers

What is the intuition behind conditional Gaussian distributions?

Suppose that $\mathbf{X} \sim N_{2}(\mathbf{\mu}, \mathbf{\Sigma})$. Then the conditional distribution of $X_1$ given that $X_2 = x_2$ is multivariate normally distributed with mean: $$ E[P(X_1 | X_2 = x_2)] =…
eroeijr
  • 631
63
votes
4 answers

Confidence interval for Bernoulli sampling

I have a random sample of Bernoulli random variables $X_1 ... X_N$, where $X_i$ are i.i.d. r.v. and $P(X_i = 1) = p$, and $p$ is an unknown parameter. Obviously, one can find an estimate for $p$: $\hat{p}:=(X_1+\dots+X_N)/N$. My question is how can…
Oleg
63
votes
2 answers

How should one interpret the comparison of means from different sample sizes?

Take the case of book ratings on a website. Book A is rated by 10,000 people with an average rating of 4.25 and the variance $\sigma = 0.5$. Similarly Book B is rated by 100 people and has a rating of 4.5 with $\sigma = 0.25$. Now because of the…
PhD
  • 14,627
63
votes
4 answers

What is the root cause of the class imbalance problem?

I've been thinking a lot about the "class imbalance problem" in machine/statistical learning lately, and am drawing ever deeper into a feeling that I just don't understand what is going on. First let me define (or attempt to define) my terms: The…
Matthew Drury
  • 35,629
63
votes
3 answers

Who created the first standard normal table?

I'm about to introduce the standard normal table in my introductory statistics class, and that got me wondering: who created the first standard normal table? How did they do it before computers came along? I shudder to think of someone brute-force…
63
votes
2 answers

What is the difference between random-effects, fixed-effects, and marginal models?

I am trying to expand my knowledge of statistics. I come from a physical sciences background with a "recipe based" approach to statistical testing, where we say is it continuous, is it normally distributed -- OLS regression. In my reading I have…
N26
  • 1,955
63
votes
7 answers

Is it a good practice to always scale/normalize data for machine learning?

My understanding is that when some features have different ranges in their values (for example, imagine one feature being the age of a person and another one being their salary in USD), this will negatively affect algorithms because the feature with…