Every explanation of Laplace smoothing for, e.g., spam filtering, includes something like the following:

> The solution is to never let any word probability be zero, by smoothing the counts upwards. Instead of starting each word count at 0, start it at 1. This way none of the probabilities will ever have a numerator of 0. This overestimates the word probability, so we also add 2 to the denominator. (We add 2 because we're implicitly keeping track of 2 things: the number of emails that contain that word, and the number that don't. The sum of those two things should be in the denominator, and the 2 accounts for starting both counters at 1.)
I don't understand why we add 2 to the denominator. Starting every word count at 1 seems equivalent to adding one email to the dataset that contains each word exactly once.
And, since the formula for calculating $P(w_1|S)$ is $\frac{\text{# of spam emails containing }w_1}{\text{total # of spam emails}}$, shouldn't we just add 1 to both the numerator and the denominator?
If possible, please explain with a very simple example/dataset with concrete numbers. My guess is that the explanation involves the probabilities not summing to 1 when you only add 1 to the denominator (instead of 2).
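To make my guess concrete, here is a toy sketch (numbers invented) comparing the two denominators for a single word in the spam class:

```python
# Hypothetical toy dataset: 3 spam emails, and the word appears in 2 of them.
spam_total = 3
contains = 2                           # spam emails containing the word
not_contains = spam_total - contains   # spam emails NOT containing the word

# My proposed "+1 to the denominator" version:
p_yes_v1 = (contains + 1) / (spam_total + 1)      # 3/4
p_no_v1 = (not_contains + 1) / (spam_total + 1)   # 2/4
print(p_yes_v1 + p_no_v1)  # 1.25 -- does not sum to 1

# The standard "+2 to the denominator" version (add 1 per outcome, 2 outcomes):
p_yes_v2 = (contains + 1) / (spam_total + 2)      # 3/5
p_no_v2 = (not_contains + 1) / (spam_total + 2)   # 2/5
print(p_yes_v2 + p_no_v2)  # 1.0 -- a valid distribution
```

If that is the right intuition, the +2 is just what keeps $P(w_1|S) + P(\neg w_1|S) = 1$ after both counters are bumped by 1.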