Most Popular
1500 questions
38
votes
1 answer
Benefits of stratified vs random sampling for generating training data in classification
I would like to know if there are any/some advantages of using stratified sampling instead of random sampling, when splitting the original dataset into training and testing set for classification.
Also, does stratified sampling introduce more bias…
gc5
- 1,227
38
votes
3 answers
What algorithms need feature scaling, besides SVM?
I am working with many algorithms: RandomForest, DecisionTrees, NaiveBayes, SVM (kernel=linear and rbf), KNN, LDA and XGBoost. All of them were pretty fast except for SVM. That is when I got to know that it needs feature scaling to work faster. Then…
Aizzaac
- 1,179
38
votes
3 answers
How to interpret root mean squared error (RMSE) vs standard deviation?
Let's say I have a model that gives me projected values. I calculate RMSE of those values. And then the standard deviation of the actual values.
Does it make any sense to compare those two values (variances)? What I think is, if RMSE and…
jkim19
- 481
38
votes
7 answers
Are there algorithms for computing "running" linear or logistic regression parameters?
A paper "Accurately computing running variance" at http://www.johndcook.com/standard_deviation.html shows how to compute running mean, variance and standard deviations.
Are there algorithms where the parameters of a linear or logistic regression…
adrcuth
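For readers unfamiliar with the "running" statistics this question refers to, here is a minimal Python sketch of a one-pass (Welford-style) update for mean and variance. The class and method names are my own, chosen for illustration, and are not taken from the linked page; the regression case the question asks about is not covered here.

```python
import math


class RunningStats:
    """One-pass running mean and sample variance (hypothetical helper)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def push(self, x):
        """Update the statistics with one new observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from both the old and the updated mean.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance (divides by n - 1); 0.0 with fewer than 2 values."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def std(self):
        """Sample standard deviation."""
        return math.sqrt(self.variance())


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(value)
```

Each call to `push` costs constant time and memory, so the values never need to be stored.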
38
votes
5 answers
Estimating same model over multiple time series
I have a novice background in time series (some ARIMA estimation/forecasting) and am facing a problem I don't fully understand. Any help would be greatly appreciated.
I am analyzing multiple time series, all over the same time interval and all of…
sparc_spread
- 815
38
votes
3 answers
How to draw neat polygons around scatterplot regions in ggplot2
How do I add a neat polygon around a group of points on a scatterplot? I am using ggplot2 but am disappointed with the results of geom_polygon.
The dataset is over there, as a tab-delimited text file. The graph below shows two measures of attitudes…
Fr.
- 1,453
38
votes
5 answers
What are the dangers of violating the homoscedasticity assumption for linear regression?
As an example, consider the ChickWeight data set in R. The variance obviously grows over time, so if I use a simple linear regression like:
m <- lm(weight ~ Time*Diet, data=ChickWeight)
My questions:
Which aspects of the model will be…
Dan M.
- 940
38
votes
5 answers
A measure of "variance" from the covariance matrix?
If the data is 1d, the variance shows the extent to which the data points are different from each other. If the data is multi-dimensional, we'll get a covariance matrix.
Is there a measure that gives a single number of how the data points are…
dontloo
- 16,356
38
votes
5 answers
Interpreting negative cosine similarity
My question may be a silly one. So I shall apologize in advance.
I was trying to use the GloVe model pre-trained by the Stanford NLP group (link). However, I noticed that my similarity results included some negative numbers.
That immediately prompted me…
Patrick the Cat
- 606
38
votes
1 answer
What is the difference between generalized estimating equations and GLMM?
I'm running a GEE on 3-level unbalanced data, using a logit link. How does this differ (in terms of the conclusions I can draw and the meaning of the coefficients) from a GLM with mixed effects (GLMM) and logit link?
More detail: The observations…
user6666
38
votes
10 answers
What is your favorite layman's explanation for a difficult statistical concept?
I really enjoy hearing simple explanations to complex problems. What is your favorite analogy or anecdote that explains a difficult statistical concept?
My favorite is Murray's explanation of cointegration using a drunkard and her dog. Murray…
brotchie
- 701
38
votes
3 answers
How to tell the difference between linear and non-linear regression models?
I was reading the following link on nonlinear regression: SAS Non Linear. My understanding from reading the first section, "Nonlinear Regression vs. Linear Regression", was that the equation below is actually a linear regression. Is that correct? If…
mHelpMe
- 687
38
votes
2 answers
What's the difference between the variance and the mean squared error?
I'm surprised this hasn't been asked before, but I cannot find the question on stats.stackexchange.
This is the formula to calculate the variance of a normally distributed sample:
$$\frac{\sum(X - \bar{X}) ^2}{n-1}$$
This is the formula to calculate…
luciano
- 14,269
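As a concrete reading of the variance formula quoted in this question, here is a small Python sketch that divides the summed squared deviations by n - 1. The function name and the sample data are illustrative assumptions; the second formula in the excerpt is cut off, so it is not reproduced.

```python
import statistics


def sample_variance(xs):
    """Sample variance: sum of squared deviations from the mean over n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


data = [1.0, 2.0, 3.0, 4.0]  # made-up values for illustration
result = sample_variance(data)
reference = statistics.variance(data)  # standard-library equivalent
```

Here `statistics.variance` is only a cross-check, since it also uses the n - 1 denominator.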
38
votes
4 answers
What are the differences between sparse coding and autoencoder?
Sparse coding is defined as learning an over-complete set of basis vectors to represent input vectors (why do we want this?). What are the differences between sparse coding and an autoencoder? When would we use sparse coding, and when an autoencoder?
RockTheStar
- 12,907
38
votes
5 answers
How to visualize/understand what a neural network is doing?
Neural networks are often treated as "black boxes" due to their complex structure. This is not ideal, as it is often beneficial to have an intuitive grasp of how a model is working internally. What are methods of visualizing how a trained neural…
rm999
- 758