Most Popular
1500 questions
155
votes
7 answers
Clustering on the output of t-SNE
I've got an application where it'd be handy to cluster a noisy dataset before looking for subgroup effects within the clusters. I first looked at PCA, but it takes ~30 components to get to 90% of the variability, so clustering on just a couple of…
generic_user
- 13,339
154
votes
3 answers
Help me understand Bayesian prior and posterior distributions
In a group of students, there are 2 out of 18 that are left-handed. Find the posterior distribution of the proportion of left-handed students in the population, assuming an uninformative prior. Summarize the results. According to the literature, 5-20% of people are…
Bob
- 1,541
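The conjugate update behind this question can be sketched in a few lines. Assuming "uninformative prior" means a uniform Beta(1, 1) (one common reading; a Jeffreys Beta(0.5, 0.5) prior would give slightly different numbers), the posterior for the left-handed proportion follows directly from the Beta-Binomial conjugacy:

```python
# Conjugate Beta-Binomial update: prior Beta(a, b) plus k successes in
# n trials gives posterior Beta(a + k, b + n - k).
def beta_binomial_posterior(a, b, k, n):
    """Return the posterior (alpha, beta) parameters after observing k of n."""
    return a + k, b + (n - k)

# 2 left-handed students out of 18, with a uniform Beta(1, 1) prior.
post_a, post_b = beta_binomial_posterior(1, 1, 2, 18)   # Beta(3, 17)
posterior_mean = post_a / (post_a + post_b)             # (k + 1) / (n + 2) = 0.15
```

Note how the posterior mean 3/20 = 0.15 sits between the sample proportion 2/18 ≈ 0.11 and the prior mean 0.5, pulled only slightly by the weak prior.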
154
votes
8 answers
Why L1 norm for sparse models
I am reading books about linear regression. There are some sentences about the L1 and L2 norms. I know the formulas, but I don't understand why the L1 norm enforces sparsity in models. Can someone give a simple explanation?
Yongwei Xing
- 1,773
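One way to see the sparsity effect concretely: the coordinate-wise (proximal) update for an L1 penalty is soft-thresholding, which maps every coefficient in a whole interval around zero to exactly zero, while the corresponding L2 update only rescales and never zeroes anything out. A minimal sketch (function names are illustrative, not from any library):

```python
def l1_prox(w, lam):
    """Soft-thresholding: the exact minimizer of 0.5*(x - w)**2 + lam*|x|."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # everything in [-lam, lam] collapses to exactly zero

def l2_prox(w, lam):
    """Minimizer of 0.5*(x - w)**2 + 0.5*lam*x**2: pure shrinkage, never exactly zero."""
    return w / (1.0 + lam)

weights = [3.0, 0.4, -0.2, -2.5]
sparse = [l1_prox(w, 0.5) for w in weights]  # small weights become exactly 0
dense = [l2_prox(w, 0.5) for w in weights]   # all weights stay nonzero
```

The dead zone `[-lam, lam]` in `l1_prox` is precisely why lasso solutions have exact zeros and ridge solutions do not.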
151
votes
15 answers
Amazon interview question—probability of 2nd interview
I got this question during an interview with Amazon:
50% of all people who receive a first interview receive a second interview
95% of your friends that got a second interview felt they had a good first interview
75% of your friends that DID NOT…
Rick
- 1,481
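The excerpt cuts off the third figure, so the completion below is an assumption: if the truncated bullet reads "75% of your friends that did not receive a second interview felt they had a good first interview", the question reduces to one application of Bayes' rule. A sketch under that assumption:

```python
p_second = 0.50             # P(second interview)
p_good_given_second = 0.95  # P(felt good about first | got second)
p_good_given_no = 0.75      # ASSUMED completion: P(felt good | no second)

# Law of total probability, then Bayes' rule.
p_good = p_second * p_good_given_second + (1 - p_second) * p_good_given_no
p_second_given_good = p_second * p_good_given_second / p_good  # 0.475 / 0.85
```

Under this reading, feeling good about the first interview raises the chance of a second interview only from 50% to about 56%, which is the usual punchline of the puzzle.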
151
votes
6 answers
Batch gradient descent versus stochastic gradient descent
Suppose we have some training set $(x_{(i)}, y_{(i)})$ for $i = 1, \dots, m$. Also suppose we run some type of supervised learning algorithm on the training set. Hypotheses are represented as $h_{\theta}(x_{(i)}) = \theta_0+\theta_{1}x_{(i)1} +…
user20616
- 1,551
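The distinction the question is after can be shown in a toy least-squares setting: batch gradient descent averages the gradient over all m examples before each update, while stochastic gradient descent updates from a single randomly chosen example. A minimal sketch (pure Python, names illustrative):

```python
import random

def batch_step(theta, data, lr):
    """One batch GD step for least squares: gradient averaged over ALL examples."""
    m = len(data)
    grad = [0.0] * len(theta)
    for x, y in data:
        err = sum(t * xi for t, xi in zip(theta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / m
    return [t - lr * g for t, g in zip(theta, grad)]

def sgd_step(theta, data, lr):
    """One stochastic GD step: gradient from a SINGLE randomly chosen example."""
    x, y = random.choice(data)
    err = sum(t * xi for t, xi in zip(theta, x)) - y
    return [t - lr * err * xj for t, xj in zip(theta, x)]

# Toy data: y = 2*x with a bias feature x0 = 1, so the target is theta = [0, 2].
data = [([1.0, float(x)], 2.0 * x) for x in range(1, 5)]
theta = [0.0, 0.0]
for _ in range(1000):
    theta = batch_step(theta, data, 0.05)
```

Each `batch_step` costs a full pass over the data; `sgd_step` is m times cheaper per update but follows a noisy descent direction, which is the trade-off the question is about.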
151
votes
6 answers
Why are neural networks becoming deeper, but not wider?
In recent years, convolutional neural networks (or perhaps deep neural networks in general) have become deeper and deeper, with state-of-the-art networks going from 7 layers (AlexNet) to 1000 layers (Residual Nets) in the space of 4 years. The…
Karnivaurus
- 7,019
150
votes
3 answers
Removal of statistically significant intercept term increases $R^2$ in linear model
In a simple linear model with a single explanatory variable,
$\alpha_i = \beta_0 + \beta_1 \delta_i + \epsilon_i$
I find that removing the intercept term improves the fit greatly (value of $R^2$ goes from 0.3 to 0.9). However, the intercept term…
Ernest A
- 2,342
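The jump in $R^2$ here is usually an artifact of how $R^2$ is defined when the intercept is dropped: the baseline model changes from "predict the mean of y" to "predict zero", so the total sum of squares is taken about zero rather than about $\bar{y}$. A sketch of the two definitions (illustrative helper names):

```python
def r2_with_intercept(y, yhat):
    """R^2 with the usual baseline: deviations of y about its mean."""
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def r2_no_intercept(y, yhat):
    """R^2 as typically reported for no-intercept fits: baseline is zero."""
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))
    ss_tot = sum(yi ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Same fitted values, same residuals -- only the baseline differs.
y = [101.0, 102.0, 103.0]
yhat = [100.5, 102.0, 103.5]
r2_mean_baseline = r2_with_intercept(y, yhat)  # modest
r2_zero_baseline = r2_no_intercept(y, yhat)    # near 1 when y is far from 0
```

When y sits far from zero, almost any fit beats the zero baseline, so the two numbers are not comparable.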
150
votes
9 answers
Why does a time series have to be stationary?
I would like to understand the primary reasons for making data stationary.
I understand that a stationary time series is one whose mean and variance are constant over time. Can someone please explain why we have to make sure our data set is stationary…
alex
- 1,501
149
votes
21 answers
What's the difference between probability and statistics?
What's the difference between probability and statistics, and why are they studied together?
hslc
- 1,607
148
votes
9 answers
What is the difference between linear regression on y with x and x with y?
The Pearson correlation coefficient of x and y is the same, whether you compute pearson(x, y) or pearson(y, x). This suggests that doing a linear regression of y given x or x given y should be the same, but I don't think that's the case.
Can…
user9097
- 3,263
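The resolution of this question fits in a short calculation: the OLS slope of y on x is $r \, s_y / s_x$, while the slope of x on y is $r \, s_x / s_y$. Their product is $r^2$, so the two regression lines coincide only when $|r| = 1$, even though the correlation itself is symmetric. A sketch (pure Python, illustrative data):

```python
def slope(xs, ys):
    """OLS slope of ys regressed on xs (with an intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
b_yx = slope(x, y)       # regression of y on x
b_xy = slope(y, x)       # regression of x on y
r_squared = b_yx * b_xy  # product of the two slopes equals r^2
```

If regressing x on y simply inverted the first line, `b_xy` would equal `1 / b_yx`; it does not, because each regression minimizes errors in a different direction.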
148
votes
6 answers
Correlations with unordered categorical variables
I have a dataframe with many observations and many variables. Some of them are categorical (unordered) and the others are numerical.
I'm looking for associations between these variables. I've been able to compute correlation for numerical variables…
Clément F
- 1,797
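For pairs of unordered categorical variables, one common association measure is Cramér's V, built from the chi-squared statistic of the contingency table and scaled to lie in [0, 1]. A minimal sketch (pure Python, no p-values; the function name is illustrative):

```python
from math import sqrt

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    # Pearson chi-squared statistic against the independence expectation.
    chi2 = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(len(table))
        for j in range(len(col_tot))
    )
    k = min(len(table), len(col_tot)) - 1
    return sqrt(chi2 / (n * k))
```

V is 0 under exact independence and 1 under a perfect one-to-one association, which makes it a rough categorical analogue of |r|.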
146
votes
6 answers
Should one remove highly correlated variables before doing PCA?
I'm reading a paper where the author discards several variables due to high correlation with other variables before doing PCA. The total number of variables is around 20.
Does this give any benefits? It looks like an overhead to me as PCA should handle…
type2
- 1,571
145
votes
5 answers
What is the .632+ rule in bootstrapping?
Here @gung makes reference to the .632+ rule. A quick Google search doesn't yield an easy-to-understand answer as to what this rule means and for what purpose it is used. Would someone please elucidate the .632+ rule?
russellpierce
- 18,599
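The estimator in question blends the optimistic training (resubstitution) error with the pessimistic leave-one-out bootstrap error, with a weight that adapts to how badly the model overfits. A sketch following the Efron and Tibshirani (1997) definitions as I understand them (the helper name is illustrative):

```python
def err_632plus(err_train, err_oob, gamma):
    """The .632+ bootstrap error estimate.

    err_train: resubstitution (training-set) error
    err_oob:   leave-one-out bootstrap (out-of-bag) error
    gamma:     no-information error rate (error when labels are unrelated to inputs)
    """
    err_oob = min(err_oob, gamma)  # cap, as in the original proposal
    if gamma > err_train:
        # Relative overfitting rate: 0 = no overfitting, 1 = maximal.
        R = (err_oob - err_train) / (gamma - err_train)
    else:
        R = 0.0
    w = 0.632 / (1.0 - 0.368 * R)
    return (1.0 - w) * err_train + w * err_oob
```

With no overfitting (R = 0) this reduces to the plain .632 rule, `0.368 * err_train + 0.632 * err_oob`; with maximal overfitting (R = 1) it puts all the weight on the out-of-bag error.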
145
votes
19 answers
Books for self-studying time series analysis?
I started with Time Series Analysis by Hamilton, but I am hopelessly lost. The book is really too theoretical for me to learn from on my own.
Does anybody have a recommendation for a textbook on time series analysis that's suitable for self-study?
CuriousMind
- 2,253
144
votes
4 answers
Difference between neural net weight decay and learning rate
In the context of neural networks, what is the difference between the learning rate and weight decay?
Ryan Zotti
- 6,647
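The short version of the distinction can be written as a single update rule: the learning rate scales the whole step, while weight decay adds a term that shrinks every weight toward zero on each step, independent of the data. A sketch of the classical L2-coupled form (note that decoupled decay, as in AdamW, applies the shrinkage outside the gradient instead):

```python
def sgd_update(w, grad, lr, weight_decay):
    """One SGD step with classical L2 weight decay.

    lr scales the entire step; weight_decay adds a pull toward zero
    proportional to the current weight, on top of the data gradient.
    """
    return w - lr * (grad + weight_decay * w)
```

Setting `weight_decay = 0` recovers plain SGD; setting `grad = 0` shows the decay acting alone, shrinking the weight by a factor of `(1 - lr * weight_decay)` per step.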