Most Popular

1500 questions
55
votes
3 answers

Help me understand the quantile (inverse CDF) function

I am reading about the quantile function, but it is not clear to me. Could you provide a more intuitive explanation than the one provided below? Since the cdf $F$ is a monotonically increasing function, it has an inverse; let us denote this by…
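
As a rough illustration of the inverse relationship this excerpt describes, the sketch below uses the exponential distribution. It is picked here only because its CDF inverts in closed form. The function names and the `rate` parameter are illustrative assumptions, not part of the question.

```python
import math

def exp_cdf(x, rate=1.0):
    """CDF of an exponential distribution: F(x) = 1 - exp(-rate * x) for x >= 0."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def exp_quantile(p, rate=1.0):
    """Quantile (inverse CDF): the x at which F(x) = p, for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1.0 - p) / rate

# The median is the 0.5 quantile; feeding it back into the CDF recovers 0.5.
median = exp_quantile(0.5)
print(median, exp_cdf(median))
```

Read this way, the quantile function answers "below which value does a fraction p of the probability lie?", while the CDF answers the reverse question.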
55
votes
4 answers

How to interpret Mean Decrease in Accuracy and Mean Decrease GINI in Random Forest models

I'm having some difficulty understanding how to interpret variable importance output from the Random Forest package. Mean decrease in accuracy is usually described as "the decrease in model accuracy from permuting the values in each feature". Is…
FlacoT
  • 842
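
The quoted description ("the decrease in model accuracy from permuting the values in each feature") can be sketched roughly as follows. This is a minimal illustration, not the Random Forest package's own computation: it uses a fixed threshold rule as a stand-in for a fitted model, scores on the full toy data rather than out-of-bag samples, and every name and data value in it is hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (hypothetical): the label depends only on feature 0; feature 1 is noise.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def predict(row):
    # Stand-in for a fitted model: a fixed threshold rule on feature 0.
    return int(row[0] > 0.5)

importances = permutation_importance(predict, X, y)
print(importances)
```

Shuffling a feature the model relies on breaks its link to the label, so accuracy drops; shuffling a feature the model ignores leaves accuracy unchanged.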
55
votes
3 answers

What kind of information is Fisher information?

Suppose we have a random variable $X \sim f(x|\theta)$. If $\theta_0$ were the true parameter, then the likelihood function should be maximized and the derivative equal to zero. This is the basic principle behind the maximum likelihood estimator. As…
55
votes
8 answers

Book for reading before Elements of Statistical Learning?

Based on this post, I want to digest Elements of Statistical Learning. Fortunately it is available for free and I started reading it. I don't have enough knowledge to understand it. Can you recommend a book that is a better introduction to the…
B Seven
  • 2,913
55
votes
3 answers

Gradient Boosting for Linear Regression - why does it not work?

While learning about Gradient Boosting, I haven't heard about any constraints regarding the properties of a "weak classifier" that the method uses to build an ensemble model. However, I could not imagine an application of GB that uses linear…
Matek
  • 951
55
votes
3 answers

What does the term saturating nonlinearities mean?

I was reading the paper ImageNet Classification with Deep Convolutional Neural Networks, and in section 3, where they explain the architecture of their convolutional neural network, they describe how they preferred using: non-saturating nonlinearity…
55
votes
5 answers

Is minimizing squared error equivalent to minimizing absolute error? Why is squared error more popular than the latter?

When we conduct linear regression $y=ax+b$ to fit a bunch of data points $(x_1,y_1),(x_2,y_2),...,(x_n,y_n)$, the classic approach minimizes the squared error. I have long been puzzled by one question: will minimizing the squared error yield the…
Tony
  • 1,803
55
votes
1 answer

How large should the batch size be for stochastic gradient descent?

I understand that stochastic gradient descent may be used to optimize a neural network using backpropagation by updating each iteration with a different sample of the training dataset. How large should the batch size be?
Simon Kuang
  • 2,111
55
votes
5 answers

What is the difference between errors and residuals?

While these two ubiquitous terms are often used synonymously, there sometimes seems to be a distinction. Is there indeed a difference, or are they exactly synonymous?
Constantin
  • 1,367
  • 1
  • 12
  • 27
55
votes
5 answers

Bayesian equivalent of two sample t-test?

I'm not looking for a plug-and-play method like BEST in R, but rather a mathematical explanation of some Bayesian methods I can use to test the difference between the means of two samples.
John
  • 581
  • 1
  • 5
  • 3
55
votes
6 answers

How to determine the optimal threshold for a classifier and generate ROC curve?

Let's say we have an SVM classifier. How do we generate the ROC curve, theoretically (given that we obtain a TPR and an FPR at each threshold)? And how do we determine the optimal threshold for this SVM classifier?
RockTheStar
  • 12,907
  • 34
  • 71
  • 96
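
The threshold sweep this question alludes to might be sketched as below: each distinct score is tried as a cutoff, and the resulting FPR and TPR give one point on the curve. The scores and labels are made-up values standing in for SVM decision values, and choosing the cutoff by Youden's J (TPR minus FPR) is only one of several possible criteria, assumed here for illustration.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) triples, one per distinct score used as a cutoff.

    Assumes labels are 0/1 and that both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Predict positive when the score is at or above the threshold.
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points

def youden_threshold(points):
    """Pick the cutoff that maximizes TPR - FPR (Youden's J)."""
    return max(points, key=lambda p: p[2] - p[1])[0]

# Hypothetical decision values (e.g. signed distances from a hyperplane) and true labels.
scores = [2.1, 1.3, 0.4, -0.2, -0.9, -1.5]
labels = [1, 0, 1, 1, 0, 0]

points = roc_points(scores, labels)
best = youden_threshold(points)
print(points)
print(best)
```

Plotting FPR against TPR for these points traces the ROC curve. A different criterion, such as one weighting false positives and false negatives by their costs, could select a different cutoff from the same points.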
55
votes
6 answers

Why downsample?

Suppose I want to learn a classifier that predicts if an email is spam. And suppose only 1% of emails are spam. The easiest thing to do would be to learn the trivial classifier that says none of the emails are spam. This classifier would give us…
Jessica
  • 2,091
55
votes
3 answers

Why does frequentist hypothesis testing become biased towards rejecting the null hypothesis with sufficiently large samples?

I was just reading this article on the Bayes factor for a completely unrelated problem when I stumbled upon this passage: Hypothesis testing with Bayes factors is more robust than frequentist hypothesis testing, since the Bayesian form avoids model…
54
votes
3 answers

What is the distribution of the Euclidean distance between two normally distributed random variables?

Assume you are given two objects whose exact locations are unknown, but are distributed according to normal distributions with known parameters (e.g. $a \sim N(m, s)$ and $b \sim N(v, t)$). We can assume these are both bivariate normals, such that…
Nick
  • 3,537
54
votes
2 answers

PP-plots vs. QQ-plots

What is the difference between probability plots, PP-plots and QQ-plots when trying to analyse a fitted distribution to data?
kay
  • 671