Creating groups from an extremely positively skewed population (further explanation + image in text)

Question

Obligatory caveat: Stats neophyte, R neophyte, trying to learn more.

I have daily traffic data for 3k URLs for the entirety of 2016. There is a broad cyclical seasonal trend that the majority of them share, but the extent to which this seasonality is expressed differs. There is a tremendous positive skew to the distribution of traffic in the data (here the x axis shows the total amount of traffic).

[image: histogram showing positive skew of data]

I want to create six groups of 500 URLs for testing purposes and want to ensure that the variance in seasonality of each group is minimized so I can draw conclusions without running the test for a year, but I don't really know how to proceed. I initially tried to do so by randomly assigning each URL to one of six groups, but since the skew is so nuts I don't really know how to test to ensure that the seasonality of my resultant groups isn't significantly different. Here is a time series showing the variation in seasonality between each randomly generated group. The y axis is the % difference of the month relative to the month before it, which is why January isn't present.

[image: time series showing variation of seasonality]

To the eye, this looks like the same general trend is occurring, but there seems to be significant variation at certain points. Anyway, I'm hoping to do something reproducible so I don't want to just trust my gut.

I'm basically looking to apply the methodology found here for my own purposes. In this post, they indicated that they used t-tests to reinforce the assumption that the test groups aren't meaningfully different, but due to my data's skew, I don't think t-tests will tell me anything. I read about transforming the data using a log transformation, but even after that the distribution wasn't remotely normal [image: histogram of data after log transformation]. After that I read about Box-Cox transformations but that got so confusing I couldn't make heads or tails of it: trying to execute it in R assumed I already had a linear model, and so far as I can tell I only have one variable.

Anyway, I'm really banging my head against a wall at this point. I would seriously appreciate any pointers you have. I'd already poked around CV and found stuff like this that didn't really help for reasons explained above. I'm not looking for a bulletproof solution, just something that can reasonably reduce some of the seasonal ambiguity.

score 0 · Answer 1 · answered Jan 06 '17 at 00:38

I don't have enough rep to comment, but I think you're missing the link to the "methodology here" that you'd like to replicate. From my general viewpoint, it looks like you could at least start with a GLM (a Poisson family is a common choice for counts; note that R's glm() defaults to a Gaussian family unless you specify otherwise) and use your % difference variable as the response, not the highly skewed Population variable (though I'm not certain which that is). You can include Group as a predictive factor in addition to Month. There could be deeper patterns to be unveiled using temporal autocorrelation analyses or random effects; a GLM would be a starting point.
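One possible way to write down such a model, offered as a sketch rather than a prescription: let $y_{g,m}$ be the % difference for group $g$ in month $m$, and let $h$ be the link function (all symbols here are introduced for illustration and are not from the question):

$$h\big(\mathbb{E}[y_{g,m}]\big) = \beta_0 + \alpha_g + \gamma_m$$

Here $\alpha_g$ is the Group effect and $\gamma_m$ is the Month effect. If all $\alpha_g$ are close to zero, the groups' average month-over-month changes do not differ once Month is accounted for. With an identity link and Gaussian errors this reduces to an ordinary linear model, which may be the more natural choice for a response that can be negative.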

Thanks for the note--you're correct; I forgot to add the link but just did so. And thank you for the pointers. I'll try incorporating group as a predictive factor and will see about using the % difference as the response variable. — mowshowitz, Jan 06 '17 at 02:20

score 0 · Accepted Answer · answered Jun 21 '20 at 13:36

You are right that t-tests are designed for Gaussian data and may not be suited to your purpose. However, two remarks:

You have a lot of data (3k URLs, from what you said), so even though the data are not Gaussian, the sample mean is approximately Gaussian; the justification comes from the Central Limit Theorem. You can therefore still use the t-test: even though $X$ is not normal, $\overline{X}$ and $\widehat{\sigma}$ are approximately normal, and then $$\frac{\sqrt{n}\,(\overline{X}-\mu)}{\widehat{\sigma}}\simeq T(n-1) $$ where $X_1,\dots,X_n$ are the data in one group and $\mu$ is a benchmark you want to compare that group to. You can also adapt this to construct the two-sample test.
You can choose a model that is not Gaussian. For instance, you may want to model the counts with a Poisson distribution, as makai said, and then work out the test under this model. It would be an exact test in which the empirical mean is the test statistic, using the fact that a sum of independent Poisson random variables again follows a Poisson distribution. You can look into any lecture notes on statistical tests to learn how to do that, or refer to the R documentation of the Poisson test (poisson.test).
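As a rough illustration of the first remark, here is a minimal Python sketch of the two-sample version of the statistic, using only the standard library. The function names and the traffic numbers are made up for illustration, and the p-value uses a normal approximation to the t distribution, which is only reasonable for large groups such as the 500 URLs per group in the question.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_statistic(a, b):
    """Two-sample statistic: (mean(a) - mean(b)) / sqrt(s_a^2/n_a + s_b^2/n_b)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def two_sided_p_value(stat):
    """Two-sided p-value from the standard normal (large-sample approximation)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(stat)))


# Hypothetical per-URL traffic totals for two groups (made-up, heavily skewed).
group_a = [12, 15, 9, 240, 18, 11, 30, 14, 2200, 16]
group_b = [10, 17, 8, 260, 20, 13, 28, 12, 2100, 19]

stat = welch_statistic(group_a, group_b)
print(f"statistic = {stat:.3f}, p-value = {two_sided_p_value(stat):.3f}")
```

With only ten values per group the approximation is poor; the toy data are there just to make the sketch runnable. With extreme skew, it may be worth checking the result against a test that does not rely on the normal approximation.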

Remark: be careful. If you want to compare 6 groups, you can't just run a bunch of two-sample tests; you have to account for multiple testing. The easiest way to do that is the Bonferroni correction, but once again you may want to look at some lecture notes on statistical tests to learn what this means.
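To make the Bonferroni idea concrete, here is a small Python sketch, assuming you compare every pair of the six groups (15 comparisons) at an overall level of 0.05. The group labels and p-values are placeholders, not real results.

```python
from itertools import combinations


def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold alpha/m and a reject flag for each p-value."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]


groups = ["G1", "G2", "G3", "G4", "G5", "G6"]
pairs = list(combinations(groups, 2))  # 15 pairwise comparisons for 6 groups

# Hypothetical p-values, one per pair, in the same order as `pairs`.
p_values = [0.20, 0.001, 0.04, 0.65, 0.012, 0.33, 0.0029, 0.48,
            0.07, 0.91, 0.15, 0.026, 0.55, 0.0034, 0.74]

threshold, reject = bonferroni(p_values)
print(f"per-test threshold = {threshold:.5f}")
for pair, p, flag in zip(pairs, p_values, reject):
    if flag:
        print(pair, p, "differs at the corrected level")
```

Each individual test is held to 0.05 / 15 instead of 0.05, so a p-value such as 0.04 no longer counts as evidence of a difference. The correction is conservative, which is the price of its simplicity.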
