I am studying probabilistic modeling, but I am stuck on the concept of generating data from a probabilistic model. Say I have built a naive Bayes classification model: what is the point of generating data from it? Generating data doesn't make sense to me. I hope somebody can help me understand.
2 Answers
3
Naive Bayes is a generative model, which means we can generate data from it if we want to. In NB, we estimate $p(\mathbf{x}|y)$ (along with the class prior $p(y)$), where $\mathbf{x}$ is our feature vector and $y$ is the class variable. For example, in text classification, we first pick a $y$, indicating the class, and then pick word(s) according to the probability distribution $p(\mathbf{x}|y)$.
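To make the two-step sampling concrete, here is a minimal sketch in Python using only the standard library. The classes, vocabulary, and all probabilities are invented for illustration (they do not come from a fitted model), and `generate_document` is a hypothetical helper name.

```python
import random

# Hypothetical parameters of a naive Bayes text classifier (made-up numbers).
class_prior = {"spam": 0.4, "ham": 0.6}  # p(y)
word_probs = {                           # p(word | y)
    "spam": {"free": 0.5, "win": 0.3, "meeting": 0.1, "report": 0.1},
    "ham": {"free": 0.1, "win": 0.1, "meeting": 0.4, "report": 0.4},
}


def generate_document(n_words, rng):
    """Draw one (label, words) pair from the model."""
    # Step 1: pick the class y according to p(y).
    labels = list(class_prior)
    y = rng.choices(labels, weights=[class_prior[c] for c in labels])[0]
    # Step 2: given y, draw each word independently from p(word | y).
    # The independence of the words given the class is the "naive" assumption.
    vocab = list(word_probs[y])
    weights = [word_probs[y][w] for w in vocab]
    words = rng.choices(vocab, weights=weights, k=n_words)
    return y, words


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [generate_document(5, rng) for _ in range(3)]
for label, words in samples:
    print(label, words)
```

Each printed line is one synthetic "document" together with the class it was generated from. Documents generated under `spam` should mostly contain "free" and "win", which is a quick way to see what the model has learned about each class.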
gunes
- 57,205
Thank you for the answer. Could I have an explained example of generating data? Even some links would be appreciated. – xabzakabecd Jul 17 '19 at 16:42
2
Having the ability to generate data from the model may be useful for many reasons, e.g.:
- You can simulate data from the model to judge whether its representation of reality is reasonable, and to conduct posterior predictive checks (comparing the distribution of the simulated data with the empirical data).
- If you can generate data from the model, you can learn about the distribution of the outcomes that are possible under the model. This is much richer information than a point estimate.
- If you want your model to make suggestions to users, sometimes producing a set of the most likely guesses is better than returning the single "best" prediction (think of machine translation or text autocomplete).
- You can simulate results from the model to check whether it is biased. For example, suppose you have a model that helps an HR department with recruitment. If you generate simulated results from the model and find that some minorities are underrepresented in them, this tells you that the model may be biased against those minorities.
See also the closely related thread "When to use simulations?"
Tim
- 138,066
A simple example of what I mean is as follows. Suppose you have some data and your probabilistic model is that it comes from a normal distribution with parameters $\mu$ and $\sigma^2$. After fitting the model (and thus finding the best estimates for $\mu$ and $\sigma^2$) you simulate 1000 new observations from the model and compare with your actual data, looking for any differences. – Maurits M Jul 17 '19 at 20:03
I am trying to learn it, forgive my ignorance if I sound like I am not making sense. – xabzakabecd Jul 18 '19 at 08:32
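The normal-distribution check described in Maurits M's comment can be sketched as follows, using only the Python standard library. The "actual data" here is itself simulated with made-up parameters (mean 10, standard deviation 2) so that the example is self-contained, and `summary` is a hypothetical helper. With real data you would load your observations into `observed` instead.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

# Stand-in for the actual data: 200 draws with made-up parameters.
observed = [rng.gauss(10.0, 2.0) for _ in range(200)]

# Fit the model. For a normal distribution the maximum-likelihood
# estimates are the sample mean and the population standard deviation.
mu_hat = statistics.fmean(observed)
sigma_hat = statistics.pstdev(observed)

# Simulate 1000 new observations from the fitted model.
simulated = [rng.gauss(mu_hat, sigma_hat) for _ in range(1000)]


def summary(xs):
    """Return a few summary statistics used to compare two samples."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": statistics.fmean(xs),
        "sd": statistics.pstdev(xs),
        "min": min(xs),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(xs),
    }


# Print the summaries side by side and look for differences.
for name, xs in (("observed", observed), ("simulated", simulated)):
    stats = ", ".join(f"{k}={v:.2f}" for k, v in summary(xs).items())
    print(f"{name:>9}: {stats}")
```

Because the stand-in data really is normal, the two rows should look similar. If the real data were skewed or heavy-tailed, the quantiles and extremes of the simulated sample would not match the observed ones, which is the signal that the normal model is a poor description.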