I am attempting to use a Naive Bayes classifier to classify text. To accomplish this I have created an Excel sheet with a binary distribution for three variables. The workbook can be found here. Assuming that my math is correct, my questions are:
- Can my training set be expanded as I classify new inputs? In other words, every time I verify a classification that the model has produced, I could add it to the training set, but then I might end up with an uneven number of examples for each class. Is this a problem?
- How can I incorporate a prior distribution into the equation? For example, what if I know from prior data that Class A is twice as likely as Class B?
- How can I incorporate tf–idf into the equation? I can analyze all the data sets beforehand and compute the frequency of each word in both the corpus and each document, but I am unsure how to incorporate this into the classifier.
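
To make the second question concrete, here is a minimal Python sketch of what I think incorporating a prior would look like for three binary variables. It is not a copy of the workbook: the function names, the toy data, and the Laplace smoothing (`alpha`) are assumptions made purely for illustration.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(feature=1 | class) from binary rows, with Laplace smoothing."""
    classes = sorted(set(y))
    n_features = len(X[0])
    likelihoods = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        likelihoods[c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return likelihoods

def posterior(x, likelihoods, priors):
    """Return normalized P(class | x): log prior plus summed log likelihoods."""
    log_scores = {}
    for c, probs in likelihoods.items():
        s = math.log(priors[c])
        for xj, p in zip(x, probs):
            s += math.log(p if xj else 1.0 - p)
        log_scores[c] = s
    # Subtract the max before exponentiating for numerical stability.
    m = max(log_scores.values())
    unnormalized = {c: math.exp(s - m) for c, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}

# Hypothetical toy data: three binary variables, two classes.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = ["A", "A", "B", "B"]
likelihoods = train_bernoulli_nb(X, y)

uniform = {"A": 0.5, "B": 0.5}      # no prior knowledge
priors = {"A": 2 / 3, "B": 1 / 3}   # assumed: Class A twice as likely as Class B

x_new = [1, 0, 0]
p_uniform = posterior(x_new, likelihoods, uniform)
p_prior = posterior(x_new, likelihoods, priors)
print(p_uniform)
print(p_prior)
```

In this sketch the prior is passed in explicitly rather than estimated from the class counts in the training data, so an uneven number of examples per class would not change it. Whether that is the right way to handle the first question is part of what I am asking.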
Thanks in advance for everyone's help.
AMAS