Most Popular

1500 questions
39
votes
6 answers

Training a neural network for regression always predicts the mean

I am training a simple convolutional neural network for regression, where the task is to predict the (x,y) location of a box in an image. The output of the network has two nodes, one for x, and one for y. The rest of the network is a…
Karnivaurus
  • 7,019
39
votes
4 answers

Why does logistic regression become unstable when classes are well-separated?

Why does logistic regression become unstable when classes are well-separated? What does "well-separated classes" mean? I would really appreciate it if someone could explain with an example.
Jane Dow
  • 491
39
votes
11 answers

How to determine the confidence of a neural network prediction?

To illustrate my question, suppose that I have a training set where the input has a degree of noise but the output does not, for example: # Training data [1.02, 1.95, 2.01, 3.06] : [1.0] [2.03, 4.11, 5.92, 8.00] : [2.0] [10.01, 11.02, 11.96, 12.04]…
John
  • 491
39
votes
2 answers

Is there any algorithm combining classification and regression?

I'm wondering if there's any algorithm that could do classification and regression at the same time. For example, I'd like to let the algorithm learn a classifier, and at the same time within each label, it also learns a continuous target. Thus, for each…
Shudong
  • 571
39
votes
5 answers

Why use regularisation in polynomial regression instead of lowering the degree?

When doing regression, for example, two hyperparameters to choose are often the capacity of the function (e.g. the largest exponent of a polynomial), and the amount of regularisation. What I'm confused about is why not just choose a low capacity…
Karnivaurus
  • 7,019
39
votes
3 answers

Do we need a test set when using k-fold cross-validation?

I've been reading about k-fold validation, and I want to make sure I understand how it works. I know that for the holdout method, the data is split into three sets, and the test set is only used at the very end to assess the performance of the…
b_pcakes
  • 495
39
votes
3 answers

What does the Akaike Information Criterion (AIC) score of a model mean?

I have seen some questions here about what it means in layman's terms, but these are too layman for my purpose here. I am trying to mathematically understand what the AIC score means. But at the same time, I don't want a rigorous proof that…
caveman
  • 2,701
39
votes
3 answers

When is it appropriate to use an improper scoring rule?

Merkle & Steyvers (2013) write: To formally define a proper scoring rule, let $f$ be a probabilistic forecast of a Bernoulli trial $d$ with true success probability $p$. Proper scoring rules are metrics whose expected values are minimized if…
39
votes
7 answers

What are the most common biases humans make when collecting or interpreting data?

I am an econ/stat major. I am aware that economists have tried to modify their assumptions about human behavior and rationality by identifying situations in which people don't behave rationally. For example, suppose I offer you a 100% chance of a…
39
votes
7 answers

Why is the null hypothesis often sought to be rejected?

I hope I am making sense with the title. Often, the null hypothesis is formed with the intention of rejecting it. Is there a reason for this, or is it just a convention?
39
votes
7 answers

Should parsimony really still be the gold standard?

Just a thought: Parsimonious models have always been the default go-to in model selection, but to what degree is this approach outdated? I'm curious about how much our tendency toward parsimony is a relic of a time of abaci and slide rules (or, more…
39
votes
2 answers

How would PCA help with a k-means clustering analysis?

Background: I want to classify the residential areas of a city into groups based on their socioeconomic characteristics, including housing unit density, population density, green space area, housing price, number of schools / health centers / day…
enaJ
  • 595
39
votes
5 answers

Why are there two spellings of "heteroskedastic" or "heteroscedastic"?

I frequently see both the spellings "heteroskedastic" and "heteroscedastic", and similarly for "homoscedastic" and "homoskedastic". There seems to be no difference in meaning between the "c" and the "k" variants, simply an orthographic difference…
Silverfish
  • 23,353
39
votes
5 answers

What is a Highest Density Region (HDR)?

In Statistical Inference, problem 9.6b, a "Highest Density Region (HDR)" is mentioned. However, I didn't find the definition of this term in the book. One similar term is the Highest Posterior Density (HPD). But it doesn't fit in this context, since…
user3813057
  • 1,092
39
votes
4 answers

Information gain, mutual information and related measures

Andrew Moore defines information gain as: $IG(Y|X) = H(Y) - H(Y|X)$ where $H(Y|X)$ is the conditional entropy. However, Wikipedia calls the above quantity mutual information. Wikipedia on the other hand defines information gain as the…
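
For readers who want to see the quantity $H(Y) - H(Y|X)$ from the last excerpt computed on concrete numbers, here is a minimal sketch for discrete variables. The function names and the toy data are assumptions made up for illustration and do not come from the question. Entropies are in bits (log base 2).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(Y), in bits, of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(xs, ys):
    """H(Y|X): entropy of Y within each value of X, weighted by p(x)."""
    n = len(xs)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(xs, ys):
    """The quantity H(Y) - H(Y|X) from the excerpt above."""
    return entropy(ys) - conditional_entropy(xs, ys)


# Hypothetical toy data: in the first case X determines Y exactly,
# in the second case X carries no information about Y.
xs = ["a", "a", "b", "b"]
ys_dependent = [0, 0, 1, 1]
ys_independent = [0, 1, 0, 1]

gain_dependent = information_gain(xs, ys_dependent)
gain_independent = information_gain(xs, ys_independent)
print(gain_dependent, gain_independent)
```

When X fully determines Y, the conditional entropy is zero and the gain equals H(Y). When X and Y are independent, the gain is zero.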