What is the meaning of generating data from a probabilistic model such as a naive bayes classifier?

Question

I am studying probabilistic modeling, but I am stuck on the concept of generating data from a probabilistic model. Say I have built a naive Bayes classification model: what is the point of generating data from it? Generating data doesn't make sense to me. I hope somebody can help me understand.

What doesn't make sense about "generating data"? You should view this as simulating artificial data. There are many reasons you might want to do this; one is to check if these artificial outcomes are somehow similar to real data - i.e. to test the quality of your model. — Maurits M, Jul 17 '19 at 15:15
Thank you for the comment. What do you mean by "to test the quality of your model" by checking if these artificial outcomes are somehow similar to real data? If it is a long story, some links would be appreciated. — xabzakabecd, Jul 17 '19 at 16:40
A link with lots of references to fake data simulation, in general, is https://statmodeling.stat.columbia.edu/2019/03/23/yes-i-really-really-really-like-fake-data-simulation-and-i-cant-stop-talking-about-it/ .
A simple example of what I mean is as follows. Suppose you have some data and your probabilistic model is that it comes from a normal distribution with parameters $\mu$ and $\sigma^2$. After fitting the model (and thus finding the best estimates for $\mu$ and $\sigma^2$) you simulate 1000 new observations from the model and compare with your actual data, looking for any differences. — Maurits M, Jul 17 '19 at 20:03
Thank you, Maurits. Can I assume a probability distribution for a generative model like Naive Bayes? I have never thought about generating data from this probabilistic model, but I know I can generate data because each variable's value has a probability. The problem is that I don't know which probability distribution it comes from. In this case, can I just assume it has a normal distribution with a mean and a variance?
I am trying to learn it, forgive my ignorance if I sound like I am not making sense. — xabzakabecd, Jul 18 '19 at 08:32
So this depends on the structure you assume in the first place. See for instance https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Parameter_estimation_and_event_models for several examples of probability distributions used in Naive Bayes. — Maurits M, Jul 18 '19 at 10:47
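
The normal-distribution example from the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: the "observed" data here is itself simulated as a stand-in for real data, and the `summary` helper is a hypothetical name, not part of any library.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Stand-in for real data (assumption: drawn here from a normal with mean 5, sd 2).
observed = [random.gauss(5.0, 2.0) for _ in range(200)]

# Fit the model: estimate mu and sigma from the observed data.
mu_hat = statistics.mean(observed)
sigma_hat = statistics.stdev(observed)

# Generate 1000 new observations from the fitted model.
simulated = [random.gauss(mu_hat, sigma_hat) for _ in range(1000)]


def summary(xs):
    """Five-number summary used to compare two samples (hypothetical helper)."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    return {"min": min(xs), "q1": q1, "median": median, "q3": q3, "max": max(xs)}


# Compare real and simulated data; large discrepancies hint at a poor model.
print("observed ", summary(observed))
print("simulated", summary(simulated))
```

If the real data were, say, strongly skewed, the simulated summaries (or histograms) would look visibly different from the observed ones, which is the kind of mismatch this check is meant to reveal.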

score 3 · Answer 1 · answered Jul 17 '19 at 14:27

Naive Bayes is a generative model, which means we can generate data from it if we want to. In NB, we estimate the class prior $p(y)$ and the class-conditional distribution $p(\mathbf{x}|y)$, where $\mathbf{x}$ is our feature vector and $y$ is the class variable. To generate an example (say, in text classification), we first pick a $y$, indicating the class, and then pick word(s) according to the probability distribution $p(\mathbf{x}|y)$.
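
As a minimal sketch of that two-step procedure, assume a multinomial-style event model for text, where a class is drawn from a prior $p(y)$ and words are then drawn independently given the class. All numbers below (`class_prior`, `word_probs`) are made-up parameters standing in for what a fitted model would provide, and `generate_document` is a hypothetical helper name.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical parameters of an already-fitted Naive Bayes model.
class_prior = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"free": 0.5, "offer": 0.3, "meeting": 0.1, "report": 0.1},
    "ham": {"free": 0.1, "offer": 0.1, "meeting": 0.4, "report": 0.4},
}


def generate_document(n_words=5):
    # Step 1: pick the class y from the prior p(y).
    classes = list(class_prior)
    y = random.choices(classes, weights=[class_prior[c] for c in classes])[0]
    # Step 2: draw each word independently from p(word | y)
    # (this independence is the "naive" assumption).
    vocab = list(word_probs[y])
    weights = [word_probs[y][w] for w in vocab]
    words = random.choices(vocab, weights=weights, k=n_words)
    return y, words


for _ in range(3):
    print(generate_document())
```

With continuous features one would instead draw each feature from a per-class distribution (for example a Gaussian with class-specific mean and variance); the structure of the two steps stays the same.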

gunes

Thank you for the answer. Could I have an explained example of generating data? Even some links would be appreciated. – xabzakabecd Jul 17 '19 at 16:42

score 2 · Accepted Answer · answered Jul 18 '19 at 08:49

Having the ability to generate data from the model may be useful for many reasons, e.g.

- You can simulate data from the model to judge whether its representation of reality is reasonable, for example by conducting posterior predictive checks (comparing the distribution of the simulated data with the empirical data).
- If you can generate data from the model, you can learn about the distribution of the outcomes that are possible under the model. This is much richer information than a point estimate.
- If you want your model to make suggestions for users, producing a set of the most likely guesses is sometimes better than returning the single "best" prediction (think of machine translation or text autocomplete).
- You can simulate results from the model to check whether it is biased. For example, suppose you have a model that helps an HR department with recruitment. If you generate simulated results from the model and find that some minorities are underrepresented in them, this tells you that the model may be biased against those minorities.
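
To illustrate the point about the distribution of outcomes being richer than a point estimate, here is a small sketch. The scenario and numbers are purely hypothetical: daily demand is assumed to follow a fitted normal model with `mu_hat = 100` and `sigma_hat = 15`, and we are interested in the total over 7 days.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical fitted model: daily demand ~ Normal(mu_hat, sigma_hat).
mu_hat, sigma_hat = 100.0, 15.0
days = 7

# Point estimate of the weekly total: a single number.
point_estimate = days * mu_hat

# Simulated weekly totals: a whole distribution of possible outcomes.
totals = [
    sum(random.gauss(mu_hat, sigma_hat) for _ in range(days))
    for _ in range(5000)
]

# 19 cut points at 5%, 10%, ..., 95%; keep the outer two.
cuts = statistics.quantiles(totals, n=20)
low, high = cuts[0], cuts[-1]

print(f"point estimate: {point_estimate:.0f}")
print(f"about 90% of simulated totals fall between {low:.0f} and {high:.0f}")
```

The point estimate alone says nothing about how far off the actual total might be; the simulated totals give a range that can inform decisions such as how much buffer to plan for.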

See also the closely related thread "When to use simulations?".
