I have a problem in which I need to visualize, in Python, the distribution of the number of successes from a large number of independent Bernoulli events, each with its own success probability (summing n ≈ 1000 variables). My background in statistics is predominantly in sampling and multivariate regression, so I feel a bit like a fish out of water on this one. I've come up with two approaches, but I could use some assistance.
Modeling Approach
My initial reaction to this problem was to take the column of success probabilities associated with the events in my dataset and use random number generation to "model" success or failure for each event. Then, after doing many "runs" of the model, take the simulated totals and plot their distribution.
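To make that concrete, here is a minimal sketch of what I have in mind; the probability array `p` and the run count `n_runs` are placeholders standing in for my actual data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Placeholder: one success probability per event (in practice, the column
# of probabilities from my dataset, n ≈ 1000 of them).
p = rng.uniform(0.01, 0.99, size=1000)

n_runs = 10_000  # number of simulated "runs"

# Each run draws a Bernoulli outcome per event and counts the successes.
draws = rng.random((n_runs, p.size)) < p   # shape (n_runs, n_events)
totals = draws.sum(axis=1)                 # number of successes per run

# Empirical distribution of the total number of successes.
plt.hist(totals, bins=50, density=True)
plt.xlabel("Number of successes")
plt.ylabel("Estimated probability")
plt.show()
```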
Three questions regarding this approach:
- Is this a valid approach to this problem?
- How would one quantify how good an approximation the simulation is?
- Can the error be quantified in a way that is akin to a standard error?
Poisson Binomial Distribution
My other reaction was to work with the probability distribution directly (the sum of independent, non-identically distributed Bernoulli variables follows a Poisson binomial distribution), but it seems this case gets complicated pretty quickly.
This problem seems well suited to the methods laid out in Probability distribution of "large" number of independent events and Success of Bernoulli trials with different probabilities.
However, I am a bit lost on the best way to implement their advice. @whuber proposes "discretization" of the distribution, and @Tim recommends either other approximation methods or Monte Carlo. The @whuber method seems easier to implement and may be good enough for my purposes, while the @Tim non-Monte Carlo method seems more mathematically rigorous but would be much harder for me to implement in Python. I don't know how to weigh that tradeoff. Any advice would be greatly appreciated.
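For what it's worth, the most direct baseline I could come up with is to build the exact Poisson binomial PMF by repeatedly convolving the two-point distributions of the individual events, and then compare it against a plain normal approximation with mean Σ pᵢ and variance Σ pᵢ(1 − pᵢ). I'm not sure whether this is what @whuber means by "discretization," so please treat it as my own naive sketch (the array `p` is again a placeholder for my real probabilities):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = rng.uniform(0.01, 0.99, size=1000)   # placeholder probabilities

# Exact PMF of the Poisson binomial distribution by iterated convolution:
# start with the distribution of "zero events" and fold in one event at a time.
pmf = np.array([1.0])
for pi in p:
    pmf = np.convolve(pmf, [1.0 - pi, pi])   # length grows by one per event

k = np.arange(pmf.size)                      # possible success counts 0..n

# Plain normal approximation with matching mean and variance.
mu = p.sum()
sigma = np.sqrt((p * (1.0 - p)).sum())
normal_pmf = stats.norm.pdf(k, loc=mu, scale=sigma)

# Largest pointwise discrepancy between the exact PMF and the approximation.
print("max |exact - normal| =", np.abs(pmf - normal_pmf).max())
```

The convolution loop is O(n²) in the number of events, which seems fast enough for n ≈ 1000, but I don't know how it compares in rigor or accuracy to the methods suggested in those answers.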
I suspect the answer will ultimately involve both of these approaches. I really appreciate anyone taking the time to help me figure this out.