3

I am planning to conduct an A/B test with data obtained through a deep learning algorithm. Say I got a binary classification dataset through machine learning, with about 100k rows classified into yes or no. But the accuracy of the predictions in the classification step was 90%, which means there are likely still falsely labelled rows, i.e. false yeses and false nos.

If I run an A/B test with this dataset, say trying to find out which is more profitable, the old version or the new one, how do I deal statistically with the 10% chance of a false yes or no?

To get a more accurate result from the A/B test, there should be a way to mitigate, mathematically or statistically, the influence of the 10% of wrong predictions in the dataset.

Please let me know the name of such a method, or any relevant info, so that I can search and study it myself. An easy and thorough explanation would also be much appreciated.

ex)

<labelled dataset I got from a sentiment labelling model with 90% accuracy>

review     sentiment (accurate or not)   date
review1    positive (accurate)           01-01-2022
review2    positive (inaccurate)         01-01-2022
review4    positive (accurate)           01-02-2022
review5    positive (accurate)           01-02-2022

Using the above dataset, I want to split it into positive reviews on 01-01-2022 (control) and positive reviews on 01-02-2022 (experiment) and see if a new campaign changed users' sentiment. To put it simply, if I don't do anything about the false positives generated by the predicting model in the sentiment analysis step, the result would be "no change after the campaign", because the rate of positive reviews is the same on both 01-01-2022 and 01-02-2022, ignoring that review2 is in fact a negative one.

When a dataset like the above is so large that I cannot manually check whether each individual review is correctly labelled, and I just run an A/B test with it, those wrongly labelled reviews, like review2 in the example, will somehow affect the test result, I guess.

What should I do to mitigate the impact of those wrongly labelled reviews? Should I just avoid running A/B tests on datasets produced by deep learning classification models like this sentiment analysis one (since it's impossible to obtain 100% accuracy with any deep learning model)? Or is there a mathematical or statistical method to mitigate it?

Hope this makes clear what I am asking.

Dan K
  • 63
  • Just to get this right: You created a partitioning of a data set in two classes and now you want to use this partitioning as group split for an A/B-Test ? yes = A, no = B ? If I am mistaken, please explain. Otherwise ... what is the reasoning behind this approach ? – steffen Mar 28 '22 at 08:03
  • @mlwida no, I am not going to use the 90% accuracy dataset as group split, yes group, no group. It's going to be rather before/after split, and the frequency of yes/no will be a key metric that I am going to build a hypo. upon. Thanks. – Dan K Mar 28 '22 at 23:53
  • 1
    Thank you. I still don't get the link between A/B-Test and classification model. If the A/B-Test group split is performed independently of a true or predicted class label, then why do you ask "more accurate result from A/B test, there should be a way to mitigate the influence of the 10% of wrong predictions"? Maybe a dummy example set might help to demonstrate the expected effect of accuracy? – steffen Mar 29 '22 at 03:39
  • @mlwida oh, now I see what your point is. you mean, if I just do the split only with accurately labelled ones, there's no need to worry about, right? But, I am talking about the case with very large dataset, so large that you cannot manually check if an individual row is correctly labelled. Also, I will use the model with 90% accuracy to label each row, so I won't be able to know which specific rows are wrongly labelled. Thanks again! – Dan K Mar 29 '22 at 19:06
  • Sorry. I still don't get the link between label and split and what you mean with "more accurate result from A/B test" ? HOW is the accuracy of the A/B-Test-Result affected, a result which is supposed to be INDEPENDENT (?) of the label ? Again, a dummy data set and / or a more detailed explanation might help. You have asked a very specific question based on assumptions and thought processes I cannot reverse engineer from your question alone. – steffen Mar 30 '22 at 06:19
  • @mlwida Thanks for your patience in trying to help a newbie like me. I tried my best to elaborate what I am curious about above. Hope now it helps you understand my question. – Dan K Mar 30 '22 at 17:18
  • Thank you very much, this helped a lot. – steffen Mar 31 '22 at 08:38

2 Answers

3

About the general setup of the A/B-Test

First of all, an A/B-Test is by definition a randomized experiment, i.e. users / visitors are assigned certain treatments / groups at random. To measure the impact of the treatments, everything is kept the same except the variation of the treatment.

Regarding selection of users for the A/B-Test

When only a subset of users is selected, the impact of the treatment can only be measured for that subset. This is valid from a practical point of view if the treatment is only available to that subset.

Otherwise a regular A/B-Test for all users is performed, followed by an analysis to see how the subgroups of interest have been affected. This is called cohort analysis.

Applied to your case:

Is the new campaign / treatment only available to those with a positive sentiment? Does this campaign have the potential to change the profitability or sentiment of the users with a negative sentiment? Are these changes potentially interesting for the business? If you answer any of these questions with yes, or are not sure, I strongly recommend performing the test on all users, accompanied by a cohort analysis.

Regarding assignment of users to groups

If the assignment of users to groups does not happen at random, be aware that you introduce potential confounding factors. It is best to avoid this, or to make sure that no bias is introduced this way.

Applied to your case:

Are you sure that group assignment based on the review dates 01-01-2022 and 01-02-2022 is a random assignment? Differences like "national holiday / weekend vs. working day" or "sunny vs. cloudy weather" can influence the baseline sentiment of the users and hence affect the base probability of leaving a positive review.

Keep in mind that the result of such an A/B-Test is used to make general statements about the usefulness of the new campaign / treatment. Is the validity of this general statement affected if only users with certain review dates are compared?

I strongly recommend assigning treatments at random whenever possible.

About the uncertainty of the classification model

If the classification error rate is INDEPENDENT of the group assignment, then both groups are affected equally. The variance induced by these misclassifications is "captured" by the subsequent statistical test, e.g. the G-Test or t-test.
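
For illustration, a minimal sketch of such a test in Python (the counts below are invented; scipy's chi2_contingency with lambda_="log-likelihood" computes the G-test):

```python
# Minimal sketch: compare the observed (model-predicted) label counts of the
# two groups with a G-test. All counts are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [4200, 5800],   # control:   predicted positive, predicted negative
    [4600, 5400],   # treatment: predicted positive, predicted negative
])

# lambda_="log-likelihood" turns the chi-square test into a G-test
g_stat, p_value, dof, expected = chi2_contingency(observed, lambda_="log-likelihood")
print(f"G = {g_stat:.2f}, p = {p_value:.4g}")
```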

BUT

If the classification error rate of the model affects the measured metric (e.g. change of sentiment) and the error is too high compared to the effect size of this metric, then the effect may be "drowned in noise" and the statistical test used for evaluating the A/B-Test result will show no significant difference.

You can simulate such scenarios by using the Monte Carlo method. Despite its fancy name, it is nothing more than repeated random sampling from defined distributions, calculating the desired function / outcome and hence obtaining a distribution for said outcome.
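
A rough sketch of such a simulation (every number below, group size, true positive rates, error rate, is an assumption made purely for illustration): flip each label independently with the classifier's error rate and count how often a standard test on the noisy labels still detects the true difference.

```python
# Rough Monte Carlo sketch: how much does an independent 10% labelling error
# dilute a true shift in the positive-review rate? Every number here
# (group size, true rates, error rate) is an assumption for illustration.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_per_group = 10_000                  # assumed group size
p_control, p_treatment = 0.40, 0.44   # assumed true positive-review rates
error_rate = 0.10                     # classifier flips a label with this probability
n_sims = 2_000

detected = 0
for _ in range(n_sims):
    true_ctrl = rng.random(n_per_group) < p_control
    true_trt = rng.random(n_per_group) < p_treatment
    # independent label noise: each label is flipped with probability error_rate
    obs_ctrl = true_ctrl ^ (rng.random(n_per_group) < error_rate)
    obs_trt = true_trt ^ (rng.random(n_per_group) < error_rate)
    table = [[obs_ctrl.sum(), n_per_group - obs_ctrl.sum()],
             [obs_trt.sum(), n_per_group - obs_trt.sum()]]
    _, p_value, _, _ = chi2_contingency(table)
    detected += p_value < 0.05

print(f"true effect detected in {detected / n_sims:.1%} of simulations")
```

Varying the assumed effect size and error rate shows at which point the effect starts to drown in the classification noise.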

How do I know whether it is independent?

In your particular case I would assume independence if the classification error is roughly the same for both review dates, 01-01-2022 and 01-02-2022. This can be checked by a manual analysis of a sample. Again, a statistical test can be performed to see whether there is a significant difference. If you get rid of this group-determining factor (see the section about confounders above), you can assume independence.
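
A small sketch of that check, with invented counts for a manually verified sample from each date:

```python
# Sketch of the check: manually label a random sample from each date and test
# whether the model's error rate differs between dates. Counts are invented.
from scipy.stats import chi2_contingency

#                 correct, incorrect  (model prediction vs. manual label)
sample_counts = [[275, 25],    # sampled reviews from 01-01-2022
                 [270, 30]]    # sampled reviews from 01-02-2022

chi2, p_value, dof, expected = chi2_contingency(sample_counts)
print(f"p = {p_value:.3f}  (a large p gives no evidence the error rate depends on the date)")
```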

Is it possible to just "subtract" the misclassifications from the final result?

No, since we do not know which instances have been misclassified. But one can apply basic probability calculations (or Monte Carlo) to estimate the worst / best / average case effect of the classification error rate on the obtained results.
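
One concrete form of such a probability calculation, assuming the model's sensitivity and specificity are known (e.g. estimated from a manually labelled validation sample), is the Rogan-Gladen correction of the observed positive rate. All numbers in the sketch below are assumed:

```python
# Sketch of the Rogan-Gladen correction: estimate the true positive-review
# rate from the observed (model-labelled) rate, assuming sensitivity and
# specificity are known from a validation sample. All numbers are assumed.
def corrected_rate(observed_rate, sensitivity, specificity):
    """True positive rate implied by the observed rate and the error model."""
    return (observed_rate + specificity - 1) / (sensitivity + specificity - 1)

sens, spec = 0.91, 0.89                    # assumed, from a validation sample
obs_control, obs_treatment = 0.42, 0.46    # assumed observed positive rates

print(corrected_rate(obs_control, sens, spec))    # ~0.39
print(corrected_rate(obs_treatment, sens, spec))  # ~0.44
```

Applying the same correction to both groups before comparing them, and varying sensitivity and specificity over a plausible range, gives the worst / best / average case picture described above.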

steffen
  • 10,367
  • I have tried to give a rather broad answer, because there a multiple paths for you to choose (depending on your problem). I hope it gives you the tools or at least starting points to get a good grip of this topic. – steffen Apr 01 '22 at 10:07
a rather broad? NO WAY! This answer has all the things I wanted to know. How kind of you! It's just unbelievable and you have to know that it's my first time to copy and paste an answer from stack exchange. Thank you so much for the great great answer! Now, I know what to study more! – Dan K Apr 01 '22 at 18:00
  • @Dank Thank you for your kind words, glad I could help :). Sidenote: Please do not crosspost. I have flagged your question on datascience.SE, since we already have two answers here. No offense ! – steffen Apr 01 '22 at 20:18
  • Thanks a lot for that too! I deleted it just now. :) – Dan K Apr 02 '22 at 22:16
0

If you are using your algorithm in the wild, you will have a $\le 10\%$ Type II error rate, which will impact profitability. That means that you should compare against the status quo without any adjustments for misclassification.

For example, say variant B entails raising the price if the review sentiment is classified as positive, and variant A is not doing anything. You split your classified-as-positive customers into two groups at random, A and B, and raise the prices for B. Your treated group will consist of mostly true positives and some false positives. The true positives are fairly price-insensitive since they like the product, but the false positives are very price-sensitive since they don't like the product and you raised prices. This creates a tradeoff between the two revenue effects. If you remove or adjust for the second group in your experiment analysis, you could miss that tradeoff.
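
A toy back-of-the-envelope version of this tradeoff (all prices and purchase probabilities below are assumed, purely for illustration):

```python
# Toy calculation of the tradeoff (all prices and purchase probabilities
# are assumed): true positives barely react to the price increase, false
# positives react strongly, and dropping the false positives from the
# analysis hides part of the effect the deployed system will actually face.
false_positive_rate = 0.10
old_price, new_price = 10.0, 12.0
p_buy_true_pos = (0.80, 0.75)    # purchase prob. at (old price, new price)
p_buy_false_pos = (0.40, 0.10)   # much more price-sensitive

def expected_revenue(price, p_tp, p_fp, fp_rate):
    """Expected revenue per classified-as-positive customer."""
    return price * ((1 - fp_rate) * p_tp + fp_rate * p_fp)

rev_a = expected_revenue(old_price, p_buy_true_pos[0], p_buy_false_pos[0], false_positive_rate)
rev_b = expected_revenue(new_price, p_buy_true_pos[1], p_buy_false_pos[1], false_positive_rate)
rev_a_tp = expected_revenue(old_price, p_buy_true_pos[0], 0.0, 0.0)  # false positives dropped
rev_b_tp = expected_revenue(new_price, p_buy_true_pos[1], 0.0, 0.0)

print(f"all classified-as-positive: A = {rev_a:.2f}, B = {rev_b:.2f}")        # 7.60 vs 8.22
print(f"true positives only:        A = {rev_a_tp:.2f}, B = {rev_b_tp:.2f}")  # 8.00 vs 9.00
```

With these made-up numbers variant B wins either way, but dropping the false positives overstates the lift, i.e. the analysis misses part of the tradeoff that the deployed classifier will actually face.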

dimitriy
  • 35,430
  • So, in my example, where I just want to see if sentiment changed towards positive more, the additional comparison experiment like user experience research in web analysis should be done in the metric definition step? Sorry for my ignorance, but I still don't understand. Even if you successfully obtain the sensitivity difference between true positives and false positives, how can you make an adjustment to a large sentiment-labelled dataset with 100,000 rows? How do you make an adjustment after knowing the sensitivity difference? Can you please explain it in detail using my example? Thanks! – Dan K Mar 31 '22 at 13:21
  • 1
    Your question uses the improvement criterion of "which one's more profitable between old and new stuff". My suggestion is not to adjust at all for that. It sounds like you have something else in mind, but it's not clear to me what that is from your comments. I suggest you revise your question with all the details: the exact comparisons you want to make, the A/B test setup, the treatment, the metrics, and the summary statistics. – dimitriy Mar 31 '22 at 16:19