How to use ML to confirm if Cuban cigars are overpriced?

Question

I want to use ML to verify whether Cuban cigars are overpriced. I want to use this website https://www.cigaraficionado.com/ratings/search?q=&brand= to get cigar data. The website provides a blind-testing score for every cigar, along with key cigar characteristics (e.g. length, type, price, origin).

My current thinking is that I should build a model to predict the cigar rating from the other variables. Then I can see whether the 'origin = Cuba' variable significantly determines the rating. If it is not a significant variable, then my understanding is that this shows that, with all other variables held equal (i.e. for two otherwise equivalent cigars, one from Cuba and one not), coming from Cuba does not make a cigar better...
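A minimal sketch of that idea, using ordinary least squares as one possible model. The data below is synthetic and only stands in for the scraped ratings; the column names (`length_in`, `price_usd`, `is_cuban`), the helper `fit_ols`, and the coefficients used to generate the ratings are all assumptions made for illustration, not values from the website.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares: returns coefficients and their standard errors."""
    X = np.column_stack([np.ones(len(y)), X])      # prepend an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]                  # residual degrees of freedom
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)          # coefficient covariance matrix
    return beta, np.sqrt(np.diag(cov))

# Synthetic stand-in data (assumption: NOT real Cigar Aficionado ratings).
rng = np.random.default_rng(0)
n = 500
length_in = rng.uniform(4.0, 7.5, n)               # hypothetical length in inches
price_usd = rng.uniform(5.0, 40.0, n)              # hypothetical price in dollars
is_cuban = rng.integers(0, 2, n).astype(float)     # 1.0 if origin = Cuba, else 0.0

# Ratings are generated with NO Cuba effect, so the fit should recover ~0.
TRUE_CUBA_EFFECT = 0.0
rating = (85.0 + 0.5 * length_in + 0.1 * price_usd
          + TRUE_CUBA_EFFECT * is_cuban + rng.normal(0.0, 2.0, n))

X = np.column_stack([length_in, price_usd, is_cuban])
beta, se = fit_ols(X, rating)

names = ["intercept", "length_in", "price_usd", "is_cuban"]
for name, b, s in zip(names, beta, se):
    # The ratio coef / se is the usual t-statistic for that coefficient.
    print(f"{name:>10}: coef={b:7.3f}  se={s:6.3f}  t={b / s:6.2f}")
```

The `is_cuban` coefficient is the estimated rating difference between a Cuban and a non-Cuban cigar with the same length and price. A small t-statistic for it would only mean the data does not show a clear effect, which is weaker than showing that no effect exists.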

Does this make sense? Is there a better / different way?

No difference at all in prices would be surprising if you control only for these variables, as other things drive prices (production costs, transportation costs, taxes, rarity, demand, which all vary from country to country). In addition, the whole population of cigar consumers might disagree with the ratings from this website. Cigar consumers might also care about other things than quality (for example, status), in which case your definition of overpriced would not take into account all the reasons why people buy cigars. — J-J-J, Sep 30 '23 at 11:36
I think you are right that the data is probably missing key variables. It's really just a fun personal project, so I am keen to proceed and see what I can do (and learn) with the data I have regardless. Regarding people's opinions, they are indeed fundamentally subjective, and to get around that I make the unrealistic assumption that the website provides a perfect, unbiased assessment. In reality that is wrong, of course, but I can't see how else to assess something like this. Given this, do you think what I am proposing makes sense, or is it flawed? — Rupert Hart, Sep 30 '23 at 15:46
If we're in an imaginary world where only these variables affect the price, it could have educational purposes (though you will certainly observe differences between countries, as this is a real world dataset), with conclusions that could only apply to this imaginary world. — J-J-J, Sep 30 '23 at 16:19
Also, you should consider whether the cigars listed on this website are your whole "population" of interest, or whether you are trying to infer something about a larger population of cigars. If you're trying to infer something, you should ask yourself whether the cigars listed on this website can be treated as a random sample. On the other hand, if this website's list is your whole population of interest, you can still use models and look at the coefficient estimates, but looking at p-values or confidence intervals would be redundant (in which case you can forget questions about statistical significance). — J-J-J, Sep 30 '23 at 16:21
(If it's for fun/learning about stats, of course you don't have to overthink all of this, and can simply consider these cigars a random sample from some larger population of interest. But if you later want to apply the same methodology to a "real world" project, these are the kinds of questions you should ask yourself before asking "what model to choose?"). — J-J-J, Sep 30 '23 at 16:34
Thanks for your help - I have some clarifying questions about some of your advice. Would conclusions from the dataset not apply to the real world because I have not accounted for all potentially important variables? And is there any way / possible situation whereby the results would apply to the real world? Or would the best case be 'this could be true but would need to incorporate more variables...'? — Rupert Hart, Sep 30 '23 at 18:11
The entire website is my population; I think it covers pretty much all the major cigars of interest (although honestly that's an optimistic assumption). Why would the p-values and confidence intervals not apply? Is it the same as above, i.e. missing variables? — Rupert Hart, Sep 30 '23 at 18:15
If your model intends to check if there's a smaller quality/price ratio for Cuba (controlling for relevant product features), but you also intend to conclude things beyond that (e.g. that someone in the supply chain is ripping off customers), the variables may not allow you to draw such conclusions. But if your intent is to find out if cigars from Cuba are not worth the price from a customer's point of view, your variables might be sufficient for that (assuming these variables are relevant to all customers, and that you did not omit other important variables relative to quality). — J-J-J, Sep 30 '23 at 18:56
About the population and p-values/confidence intervals, you may have a look at https://stats.stackexchange.com/questions/478142/do-we-need-hypothesis-testing-when-we-have-all-the-population if you're curious about that. Note that if you really have the whole population, a model is still useful for the kind of question you have; it's just that looking at the p-values/confidence intervals may make little sense in this case, given that their purpose is statistical inference (drawing conclusions about a population from a sample). Looking at the model effect sizes/coefficients is still useful, though. — J-J-J, Sep 30 '23 at 19:08
Ok brilliant, I'll take a look at the link, thanks for your help :) — Rupert Hart, Oct 01 '23 at 06:30
