Most Popular
1500 questions
64
votes
5 answers
Why does collecting data until finding a significant result increase Type I error rate?
I was wondering exactly why collecting data until a significant result (e.g., $p \lt .05$) is obtained (i.e., p-hacking) increases the Type I error rate?
I would also highly appreciate an R demonstration of this phenomenon.
Reza
- 936
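The optional-stopping question above can be illustrated with a small simulation. This is a minimal sketch in Python rather than the R the asker requested, and it assumes a simplified setting that is not in the original: the null hypothesis is true (data are standard normal), the variance is known so a z-test applies, and the analyst re-tests after every new observation from n = 10 up to n = 100. The function names and default parameters are hypothetical.

```python
import math
import random


def p_value(total, n):
    """Two-sided z-test p-value for H0: mean = 0, assuming known unit variance."""
    z = total / math.sqrt(n)  # equals sample mean * sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def optional_stopping_rejects(rng, n_start=10, n_max=100, alpha=0.05):
    """Add one observation at a time and stop as soon as p < alpha."""
    total, n = 0.0, 0
    while n < n_max:
        total += rng.gauss(0, 1)  # H0 is true: the real mean is 0
        n += 1
        if n >= n_start and p_value(total, n) < alpha:
            return True  # a false positive, since H0 holds
    return False


def fixed_n_rejects(rng, n=100, alpha=0.05):
    """Test exactly once, at a sample size fixed in advance."""
    total = sum(rng.gauss(0, 1) for _ in range(n))
    return p_value(total, n) < alpha


def simulate(n_sims=2000, seed=1):
    """Return (Type I rate with peeking, Type I rate with fixed n)."""
    rng = random.Random(seed)
    peek = sum(optional_stopping_rejects(rng) for _ in range(n_sims)) / n_sims
    fixed = sum(fixed_n_rejects(rng) for _ in range(n_sims)) / n_sims
    return peek, fixed


if __name__ == "__main__":
    peek_rate, fixed_rate = simulate()
    print(f"optional stopping: {peek_rate:.3f}, fixed n: {fixed_rate:.3f}")
```

Each look at the data is another chance for a true null to cross the threshold, so the rejection rate with peeking should come out well above the nominal 5%, while the single fixed-n test stays near it.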
64
votes
5 answers
Training a decision tree against unbalanced data
I'm new to data mining and I'm trying to train a decision tree against a data set which is highly unbalanced. However, I'm having problems with poor predictive accuracy.
The data consists of students studying courses, and the class variable is the…
chrisb
- 905
64
votes
3 answers
Statistics and causal inference?
In his 1984 paper "Statistics and Causal Inference", Paul Holland raised one of the most fundamental questions in statistics:
What can a statistical model say about causation?
This led to his motto:
NO CAUSATION WITHOUT MANIPULATION
which…
Shane
- 12,461
64
votes
3 answers
Interpretation of log transformed predictor and/or response
I'm wondering if it makes a difference in interpretation whether only the dependent, both the dependent and independent, or only the independent variables are log transformed.
Consider the case of
log(DV) = Intercept + B1*IV + Error
I can…
upabove
- 3,137
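For the log-level model quoted in the excerpt above, the standard reading can be worked out directly. This is a sketch using the excerpt's own names (Intercept, B1, IV, DV) and assuming natural logarithms and a fixed error term:

$$ \log(DV) = \text{Intercept} + B_1 \cdot IV + \text{Error} \quad\Longrightarrow\quad \frac{DV \text{ at } IV + 1}{DV \text{ at } IV} = e^{B_1} $$

Under these assumptions a one-unit increase in IV multiplies DV by $e^{B_1}$, which is roughly a $100 \cdot B_1$ percent change when $B_1$ is small.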
64
votes
17 answers
Machine learning cookbook / reference card / cheatsheet?
I find resources like the Probability and Statistics Cookbook and The R Reference Card for Data Mining incredibly useful. They obviously serve well as references but also help me to organize my thoughts on a subject and get the lay of the land.
Q:…
lowndrul
- 2,117
64
votes
3 answers
What is the objective function of PCA?
Principal component analysis can use matrix decomposition, but that is just a tool to get there.
How would you find the principal components without the use of matrix algebra?
What is the objective function (goal), and what are the constraints?
Neil McGuigan
- 9,872
63
votes
3 answers
What is quasi-binomial distribution (in the context of GLM)?
I'm hoping someone can provide an intuitive overview of what the quasibinomial distribution is and what it does. I'm particularly interested in these points:
How quasibinomial differs from the binomial distribution.
When the response variable is a…
luciano
- 14,269
63
votes
6 answers
Is random forest a boosting algorithm?
Short definition of boosting:
Can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random…
Atilla Ozgur
- 1,291
63
votes
3 answers
What is the intuition behind conditional Gaussian distributions?
Suppose that $\mathbf{X} \sim N_{2}(\mathbf{\mu}, \mathbf{\Sigma})$. Then the conditional distribution of $X_1$ given that $X_2 = x_2$ is multivariate normally distributed with mean:
$$ E[P(X_1 | X_2 = x_2)] =…
eroeijr
- 631
63
votes
4 answers
Confidence interval for Bernoulli sampling
I have a random sample of Bernoulli random variables $X_1, \dots, X_N$, where the $X_i$ are i.i.d. with $P(X_i = 1) = p$, and $p$ is an unknown parameter.
Obviously, one can find an estimate for $p$: $\hat{p}:=(X_1+\dots+X_N)/N$.
My question is how can…
Oleg
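Two common interval estimates for the Bernoulli parameter $p$ in the question above can be sketched as follows. This is a minimal illustration, not necessarily the approach the answers recommend: it assumes the Wald (normal-approximation) and Wilson score intervals are of interest, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a Bernoulli proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval, which stays within [0, 1] up to rounding."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    print(wald_interval(50, 100))
    print(wilson_interval(0, 10))
```

The Wald interval collapses to a single point when $\hat{p}$ is 0 or 1, which is one reason the Wilson interval is often preferred for small $N$ or extreme proportions.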
63
votes
2 answers
How should one interpret the comparison of means from different sample sizes?
Take the case of book ratings on a website. Book A is rated by 10,000 people with an average rating of 4.25 and a variance of $\sigma^2 = 0.5$. Similarly, Book B is rated by 100 people and has an average rating of 4.5 with $\sigma^2 = 0.25$.
Now because of the…
PhD
- 14,627
63
votes
4 answers
What is the root cause of the class imbalance problem?
I've been thinking a lot about the "class imbalance problem" in machine/statistical learning lately, and am being drawn ever deeper into a feeling that I just don't understand what is going on.
First let me define (or attempt to define) my terms:
The…
Matthew Drury
- 35,629
63
votes
3 answers
Who created the first standard normal table?
I'm about to introduce the standard normal table in my introductory statistics class, and that got me wondering: who created the first standard normal table? How did they do it before computers came along? I shudder to think of someone brute-force…
Daniel Smolkin
- 633
63
votes
2 answers
What is the difference between random-effects, fixed-effects, and marginal models?
I am trying to expand my knowledge of statistics. I come from a physical sciences background with a "recipe based" approach to statistical testing, where we ask: is it continuous, is it normally distributed? If so, use OLS regression.
In my reading I have…
N26
- 1,955
63
votes
7 answers
Is it a good practice to always scale/normalize data for machine learning?
My understanding is that when some features have different ranges in their values (for example, imagine one feature being the age of a person and another one being their salary in USD), this will negatively affect algorithms because the feature with…
Juan Antonio Gomez Moriano
- 1,329