Most Popular
1500 questions
54
votes
3 answers
Different ways to write interaction terms in lm?
I have a question about which is the best way to specify an interaction in a regression model. Consider the following data:
d <- structure(list(r = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),…
Manuel Ramón
- 2,115
54
votes
2 answers
Hierarchical clustering with mixed type data - what distance/similarity to use?
In my dataset we have both continuous and naturally discrete variables. I want to know whether we can do hierarchical clustering using both types of variables. And if yes, what distance measure is appropriate?
Beta
- 6,334
54
votes
5 answers
Dynamic Time Warping Clustering
What would be the approach to use Dynamic Time Warping (DTW) to perform clustering of time series?
I have read about DTW as a way to find similarity between two time series even when they are shifted in time. Can I use this method as a similarity…
Kobe-Wan Kenobi
- 2,857
54
votes
4 answers
Class imbalance in Supervised Machine Learning
This is a general question, not specific to any method or data set. How do we deal with a class imbalance problem in supervised machine learning where around 90% of the labels in the dataset are 0 and around 10% are 1? How do we optimally…
NG_21
- 1,556
54
votes
3 answers
Why do we care so much about normally distributed error terms (and homoskedasticity) in linear regression when we don't have to?
I suppose I get frustrated every time I hear someone say that non-normality of residuals and/or heteroskedasticity violates OLS assumptions. To estimate parameters in an OLS model, neither of these assumptions is necessary by the Gauss-Markov…
Zachary Blumenfeld
- 3,974
54
votes
3 answers
Do we have a problem of "pity upvotes"?
I know, this may sound like it is off-topic, but hear me out.
At Stack Overflow and here we get votes on posts; this is all stored in tabular form.
E.g.:
post id voter id vote type datetime
------- -------- --------- …
Sam Saffron
- 619
53
votes
4 answers
Approximate order statistics for normal random variables
Are there well-known formulas for the order statistics of certain random distributions? Particularly the first and last order statistics of a normal random variable, but a more general answer would also be appreciated.
Edit: To clarify, I am…
Chris Taylor
- 3,682
53
votes
2 answers
Why is a Bayesian not allowed to look at the residuals?
In the article "Discussion: Should Ecologists Become Bayesians?" Brian Dennis gives a surprisingly balanced and positive view of Bayesian statistics when his aim seems to be to warn people about it. However, in one paragraph, without any citations…
Mankka
- 633
53
votes
4 answers
Why does the correlation coefficient between X and X-Y random variables tend to be 0.7
Taken from Practical Statistics for Medical Research, where Douglas Altman writes on page 285:
...for any two quantities X and Y, X will be correlated with X-Y.
Indeed, even if X and Y are samples of random numbers we would expect
the…
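A short worked derivation may help show where a value near 0.7 comes from. It assumes that X and Y are uncorrelated and share a common variance $\sigma^2$; both conditions are assumptions made here for illustration, since the excerpt is truncated before it states its own.

$$
\begin{aligned}
\operatorname{Cov}(X,\, X - Y) &= \operatorname{Var}(X) - \operatorname{Cov}(X, Y) = \sigma^2, \\
\operatorname{Var}(X - Y) &= \operatorname{Var}(X) + \operatorname{Var}(Y) - 2\operatorname{Cov}(X, Y) = 2\sigma^2, \\
\operatorname{Corr}(X,\, X - Y) &= \frac{\sigma^2}{\sigma \cdot \sqrt{2}\,\sigma} = \frac{1}{\sqrt{2}} \approx 0.707 .
\end{aligned}
$$

If the variances differ or X and Y are correlated, the value moves away from $1/\sqrt{2}$.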
nostock
- 1,507
53
votes
3 answers
Are splines overfitting the data?
My problem: I recently met a statistician who informed me that splines are only useful for exploring data and are subject to overfitting, thus not useful in prediction. He preferred exploring with simple polynomials ... As I’m a big fan of…
Max Gordon
- 5,926
53
votes
1 answer
Regression: Transforming Variables
When transforming variables, do you have to use the same transformation for all of them? For example, can I pick and choose differently transformed variables, as in:
Let $x_1,x_2,x_3$ be age, length of employment, length of residence, and income.
Y =…
Brandon Bertelsen
- 7,232
53
votes
4 answers
Is it possible to give variable sized images as input to a convolutional neural network?
Can we give images with variable size as input to a convolutional neural network for object detection? If possible, how can we do that?
But if we try to crop the image, we will be losing some portion of the image and if we try to resize, then, the…
Ashna Eldho
- 631
53
votes
8 answers
Excel as a statistics workbench
It seems that lots of people (including me) like to do exploratory data analysis in Excel. Some limitations, such as the number of rows allowed in a spreadsheet, are a pain but in most cases don't make it impossible to use Excel to play around with…
Carlos Accioly
- 5,025
53
votes
1 answer
Rank in R - descending order
I am looking to rank data so that, in some cases, the larger value has the rank of 1. I am relatively new to R, but I don't see how I can adjust this setting in the rank function.
x <- c(23,45,12,67,34,89)
rank(x)
generates:
[1] 2 4 1 5 3 6
when I…
Btibert3
- 1,334
53
votes
6 answers
Why is softmax output not a good uncertainty measure for Deep Learning models?
I've been working with Convolutional Neural Networks (CNNs) for some time now, mostly on image data for semantic segmentation/instance segmentation. I've often visualized the softmax of the network output as a "heat map" to see how high per pixel…
Honeybear
- 659