I am looking for an appropriate statistical test to compare two frequency distributions, where the data take the form of two arrays (or buckets) of counts.
For example, suppose I have two distributions, where A, B, and C are observed outcomes from a software logging system (such as whether customers clicked on button A, B, or C).
HISTORICAL:
     A       B       C
122319  295701  101195

ONE MONTH:
     A       B       C
  1734    3925    1823
My goal is to create an automated A/B testing system. For example, we've collected this data for the last 6 months (in the HISTORICAL data set). After we roll out a new algorithm, we can collect new results (in the ONE MONTH data set). If the two distributions are "significantly" different, we'd then know to take some action.
My specific questions:
- What's the proper statistical test for this problem, and how could I know when these distributions differ significantly? An answer using R or Python would be appreciated.
- What's the minimum number of samples I'd need for both HISTORICAL and ONE MONTH for the test to be valid?
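To make the question concrete, here is a minimal sketch of what I'm imagining in Python, assuming a chi-squared test of homogeneity on a 2x3 contingency table (with a fixed alpha of 0.05) is even the right approach, which is part of what I'm asking:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Observed click counts on buttons A, B, C
    historical = np.array([122319, 295701, 101195])  # 6 months before the change
    one_month = np.array([1734, 3925, 1823])         # 1 month after the change

    # 2 x 3 contingency table: rows = time period, columns = outcome A/B/C
    table = np.vstack([historical, one_month])

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
    # (rule of thumb I've seen: all expected counts should be at least ~5
    # for the chi-squared approximation to be reasonable)

    # Naive decision rule -- is a fixed threshold like this how people
    # actually automate this kind of check?
    if p < 0.05:
        print("Distributions differ significantly; take action")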
I've read several other questions related to chi-squared and Kolmogorov-Smirnov tests but don't know where to begin. Related questions:
- How to compare two samples of frequencies with categorical x values where one is subset of the other
- Assessing the significance of differences in distributions
Thank you for any help.