2

I have two groups, A & B. Let's say A are the users who wrote a review for place A, and B for place B.

For each user $u$ I have recorded his total number of reviews $r_u$. Since we're talking about number of reviews, which is a very skewed variable, I thought it would be better to use the natural logarithm $\log(r_u) = \hat{r_u}$.

Now, I suspect that the users who write for A are "fake" or at least "green" accounts with 1 or very few reviews, and I want to declare whether these users are statistically significant from those who write for B. From a visual inspection, it seems clearn that something is off (black trace = A; green trace = B).

enter image description here

Still, I would like to formalize this difference. I need to compare $R_A = \{\hat{r_u} \ \forall \ u \in A \}$ and $R_B = \{\hat{r_u} \ \forall \ u \in B \}$ somehow, but I am unsure on how to do it properly from a statistical point of view.

Some doubts and questions I have in mind (I am fully aware that it's not one specific question, I'm sorry about that):

  • is it appropriate to test for difference of means? I mean, should I really compare means to show what I want to show? What about a given percentile? Or maybe compare proportions of 1-review users in the two groups? What would you go for? Would you go for more than one?
  • if I compare means I'll go for an independent group t-test but I think it is clear that in my case the assumption of normality doesn't hold. Would you go with a non-parametric, e.g. the Mann-Whitney test?
  • what's a good sample size? and how much do samples' sizes matter? For example, now I have gathered $|R_A| \approx 400$ and $|R_B| \approx 700$ and I am not really sure what to do with that (because I could even gather more for both). Should I sample $\approx$ 400 records from $R_B$ to level out the groups?

Thank you for your help.

rusiano
  • 564
  • 2
  • 14
  • What do you want to know about the groups? If you’re interested in testing for different means, a test of means is appropriate. If you’re interested in something else, such a test is less appropriate. 2) Let’s first address what you even want to learn from your data. Then we can discuss specific approaches. 3) It depends on what you want to do. However, i cannot think of any circumstances where discarding data to balance the sizes will help you.
  • – Dave Nov 23 '22 at 12:43
  • @Dave Not knowing how to go about 1) and 2) that's exactly why I ask the opinion of some experts here. I present my (confuse) ideas hoping someone can guide me. – rusiano Nov 23 '22 at 13:15
  • What do you want to learn from your data? – Dave Nov 23 '22 at 13:23
  • @Dave I think it is stated: whether these two groups are different, specifically if there are significantly more 1-review users in group A than in group B. The problem, as you say, is that there is no unambiguous definition of "different": I could look at mean (which is the most "popular" choice), but why not at median, at another percentile or maybe at proportions? Or why not at multiple metrics? – rusiano Nov 23 '22 at 13:28