Most Popular

1500 questions
54
votes
3 answers

Different ways to write interaction terms in lm?

I have a question about which is the best way to specify an interaction in a regression model. Consider the following data: d <- structure(list(r = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),…
54
votes
2 answers

Hierarchical clustering with mixed type data - what distance/similarity to use?

In my dataset we have both continuous and naturally discrete variables. I want to know whether we can do hierarchical clustering using both types of variables. And if so, which distance measure is appropriate?
Beta
54
votes
5 answers

Dynamic Time Warping Clustering

What would be the approach to use Dynamic Time Warping (DTW) to perform clustering of time series? I have read about DTW as a way to measure similarity between two time series that may be shifted in time. Can I use this method as a similarity…
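For readers unfamiliar with DTW, here is a minimal sketch of the classic dynamic-programming distance. It is not taken from the question or its answers; the function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW distance with |x - y| as local cost.

    Hypothetical helper for illustration only.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: step in a, step in b, step in both
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


series = [0, 1, 2, 1, 0]
shifted = [0, 0, 1, 2, 1, 0]  # same shape, delayed by one step
print(dtw_distance(series, shifted))
```

A matrix of such pairwise distances could, in principle, be passed to any clustering routine that accepts precomputed dissimilarities; whether that suits a given data set is a separate question.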
54
votes
4 answers

Class imbalance in Supervised Machine Learning

This is a general question, not specific to any method or data set. How do we deal with a class imbalance problem in supervised machine learning where about 90% of the labels in the dataset are 0 and about 10% are 1? How do we optimally…
NG_21
54
votes
3 answers

Why do we care so much about normally distributed error terms (and homoskedasticity) in linear regression when we don't have to?

I suppose I get frustrated every time I hear someone say that non-normality of residuals and/or heteroskedasticity violates OLS assumptions. To estimate parameters in an OLS model, neither of these assumptions is necessary by the Gauss-Markov…
54
votes
3 answers

Do we have a problem of "pity upvotes"?

I know, this may sound like it is off-topic, but hear me out. At Stack Overflow and here we get votes on posts; this is all stored in tabular form. E.g.: post id voter id vote type datetime ------- -------- --------- …
53
votes
4 answers

Approximate order statistics for normal random variables

Are there well-known formulas for the order statistics of certain random distributions? Particularly the first and last order statistics of a normal random variable, but a more general answer would also be appreciated. Edit: To clarify, I am…
Chris Taylor
53
votes
2 answers

Why is a Bayesian not allowed to look at the residuals?

In the article "Discussion: Should Ecologists Become Bayesians?" Brian Dennis gives a surprisingly balanced and positive view of Bayesian statistics when his aim seems to be to warn people about it. However, in one paragraph, without any citations…
Mankka
53
votes
4 answers

Why does the correlation coefficient between X and X-Y random variables tend to be 0.7?

Taken from Practical Statistics for Medical Research, where Douglas Altman writes on page 285: ...for any two quantities X and Y, X will be correlated with X-Y. Indeed, even if X and Y are samples of random numbers we would expect the…
nostock
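As a rough check of the 0.7 figure in the title above, suppose X and Y are independent with equal variance (an assumption the excerpt does not state). Then Cov(X, X-Y) = Var(X) and Var(X-Y) = 2 Var(X), so the correlation is 1/sqrt(2), about 0.71. The simulation below is a sketch under that assumption; the helper names `pearson` and `corr_x_with_difference` are made up for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def corr_x_with_difference(n=100_000, seed=0):
    """Correlate X with X - Y for independent standard normal draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]
    diffs = [x - y for x, y in zip(xs, ys)]
    return pearson(xs, diffs)


# Simulated value next to the theoretical 1/sqrt(2)
print(corr_x_with_difference(), 1 / math.sqrt(2))
```

If X and Y have different variances or are themselves correlated, the theoretical value changes, so 0.7 should not be read as universal.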
53
votes
3 answers

Are splines overfitting the data?

My problem: I recently met a statistician who informed me that splines are only useful for exploring data and are subject to overfitting, and thus not useful for prediction. He preferred exploring with simple polynomials ... As I’m a big fan of…
Max Gordon
53
votes
1 answer

Regression: Transforming Variables

When transforming variables, do you have to use the same transformation for all of them? For example, can I pick and choose differently transformed variables, as in: Let $x_1, x_2, x_3$ be age, length of employment, length of residence, and income. Y =…
Brandon Bertelsen
53
votes
4 answers

Is it possible to give variable sized images as input to a convolutional neural network?

Can we give images with variable size as input to a convolutional neural network for object detection? If possible, how can we do that? But if we try to crop the image, we will be losing some portion of the image, and if we try to resize, then the…
53
votes
8 answers

Excel as a statistics workbench

It seems that lots of people (including me) like to do exploratory data analysis in Excel. Some limitations, such as the number of rows allowed in a spreadsheet, are a pain but in most cases don't make it impossible to use Excel to play around with…
Carlos Accioly
53
votes
1 answer

Rank in R - descending order

I am looking to rank data so that, in some cases, the largest value has the rank of 1. I am relatively new to R, but I don't see how I can adjust this setting in the rank function. x <- c(23,45,12,67,34,89) rank(x) generates: [1] 2 4 1 5 3 6 when I…
Btibert3
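In R itself, the usual idiom for this is `rank(-x)`. The sketch below shows the same idea in plain Python, using the vector from the excerpt; the function `rank` and its `descending` flag are hypothetical, written here to mimic the average-rank tie handling of R's default.

```python
def rank(values, descending=False):
    """Return 1-based ranks; tied values share the average of their positions.

    Hypothetical helper for illustration; negating the values flips the order.
    """
    keyed = [-v if descending else v for v in values]
    order = sorted(range(len(keyed)), key=lambda i: keyed[i])
    ranks = [0.0] * len(keyed)
    i = 0
    while i < len(order):
        # find the end of the current run of tied values
        j = i
        while j + 1 < len(order) and keyed[order[j + 1]] == keyed[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


x = [23, 45, 12, 67, 34, 89]
print(rank(x))                   # ascending, as in the excerpt
print(rank(x, descending=True))  # largest value gets rank 1
```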
53
votes
6 answers

Why is softmax output not a good uncertainty measure for Deep Learning models?

I've been working with Convolutional Neural Networks (CNNs) for some time now, mostly on image data for semantic segmentation/instance segmentation. I've often visualized the softmax of the network output as a "heat map" to see how high per pixel…
Honeybear