I am planning to run an A/B test on data that was labelled by a deep learning model. Say I have a binary classification dataset of about 100k rows, each classified as yes or no. However, the model's prediction accuracy was 90%, which means some rows are likely mislabelled: false yeses and false nos.
If I run an A/B test on this dataset, say to find out whether the old or the new version is more profitable, how do I statistically account for the roughly 10% chance that any given label is a false yes or a false no?
To get a more accurate result from the A/B test, there should be a way to mitigate the influence of the roughly 10% of wrong predictions in the dataset, mathematically or statistically.
Please let me know the name of such a method, or any relevant keywords, so that I can search and study it myself. An accessible but thorough explanation would also be very much appreciated.
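To make my concern concrete, here is a rough illustration (assuming, purely for illustration, that the 90% accuracy applies symmetrically to both classes, which a single accuracy figure does not guarantee). If $p$ is the true positive rate in a group and $\tilde p$ is the rate computed from the model's labels, then

$$
\tilde p = 0.9\,p + 0.1\,(1 - p) = 0.8\,p + 0.1 ,
$$

so any true difference between two groups gets shrunk in what I observe: $\tilde p_1 - \tilde p_2 = 0.8\,(p_1 - p_2)$. This is the kind of distortion I want to correct for.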
Example: a labelled dataset I got from a sentiment-labelling model with 90% accuracy
| review | predicted sentiment | (prediction accurate?) | date |
|---|---|---|---|
| review1 | positive | (accurate) | 01-01-2022 |
| review2 | positive | (inaccurate) | 01-01-2022 |
| review4 | positive | (accurate) | 01-02-2022 |
| review5 | positive | (accurate) | 01-02-2022 |
Using the above dataset, I want to split it into reviews from 01-01-2022 (control) and reviews from 01-02-2022 (experiment), compare the positive-review rates, and see whether a new campaign changed users' sentiment. Put simply: if I do nothing about the false positives produced by the sentiment model, the result would be "no change after the campaign", because the rate of positive reviews looks the same on both dates, even though review2 is in fact a negative one.
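For reference, this is roughly the comparison I have in mind. It is only a minimal sketch with made-up counts, using statsmodels' two-proportion z-test purely as a placeholder for whatever test is appropriate; the counts are built from the model's predicted labels, so around 10% of the underlying rows could be mislabelled:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical aggregate counts derived from the model's *predicted* labels.
# Roughly 10% of the underlying rows may be mislabelled, so these counts
# (and therefore the test result) inherit that error.
positives = [5400, 5500]   # predicted-positive reviews on 01-01-2022 vs 01-02-2022
totals = [10000, 10000]    # all reviews on each date

z_stat, p_value = proportions_ztest(count=positives, nobs=totals)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```

My worry is that whatever this test reports is about the predicted labels, not the true sentiments.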
When a dataset like this is so large that I cannot manually check whether each review is correctly labelled, and I just run an A/B test on it, those wrongly labelled reviews, like review2 in the example, will presumably affect the test result.
What should I do to mitigate the impact of those wrongly labelled reviews? Should I simply avoid running A/B tests on datasets labelled by any deep learning classifier, such as a sentiment model (since no deep learning model achieves 100% accuracy)? Or is there a mathematical or statistical method to mitigate it?
I hope this makes clear what I am asking.