Today, when we were calculating P(M is Spam | F_i), the actual numbers of spam and ham messages (we used 4000 and 6000, if I remember right) would only matter if they were chosen a priori. If we instead obtained a completely random sample of messages and categorised them into ham and spam ourselves, the actual numbers of ham and spam messages would not distort our result for P(M is Spam | F_i), since the sample would already reflect the real proportion of spam. But if we select 4000 messages known to be spam and 6000 messages known to be ham, and then create the table (as seemed to be the case in our lecture), we would also need P(M is Spam) to be provided: the absolute (prior) probability of a random message being spam.
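For reference, here is how the prior enters, written out with Bayes' rule. This is the standard formula, not necessarily the exact notation from the lecture; "Spam" and "Ham" stand for the events "M is spam" and "M is ham".

$$
P(\text{Spam} \mid F_i) = \frac{P(F_i \mid \text{Spam})\,P(\text{Spam})}{P(F_i \mid \text{Spam})\,P(\text{Spam}) + P(F_i \mid \text{Ham})\,P(\text{Ham})}
$$

The table of counts gives P(F_i | Spam) and P(F_i | Ham). The prior P(Spam) has to come either from the class proportions in the sample (if the sample is random) or from somewhere else (if the class sizes were fixed in advance).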

What we had in class was correct. P(M is Spam | F_i) does not depend on the total number of mails, nor on the total number of spam messages.

Yes, as Yuchen explained to me after class, the probability we computed in class is correct, and changing the number of messages of each type doesn't change it, since it affects both probabilities in the same way.

Where this breaks down is that it assumes the prior probability of spam/non-spam is the same in the training data as in the real data, so changing the proportion of spam/non-spam in the training data amounts to assuming that the proportion in the real data changes accordingly. In cases where the training data isn't representative of the real distribution, we would need to adjust the probabilities to account for that.
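To make the adjustment concrete, here is a small Python sketch. The 4000/6000 split is the one mentioned above; the feature counts (800 and 300) and the "real" spam rate of 10% are made-up numbers purely for illustration, and `posterior_spam` is a hypothetical helper name, not something from the lecture.

```python
def posterior_spam(p_f_given_spam, p_f_given_ham, prior_spam):
    """Bayes' rule for P(Spam | F_i) with an explicit prior P(Spam)."""
    numerator = p_f_given_spam * prior_spam
    denominator = numerator + p_f_given_ham * (1.0 - prior_spam)
    return numerator / denominator


# Training set sizes from the thread.
n_spam, n_ham = 4000, 6000
# Hypothetical counts of messages containing feature F_i.
spam_with_f, ham_with_f = 800, 300

# Class-conditional likelihoods estimated from the table.
p_f_given_spam = spam_with_f / n_spam
p_f_given_ham = ham_with_f / n_ham

# Reading the posterior straight off the counts silently uses the
# training proportion 4000 / 10000 as the prior.
from_counts = spam_with_f / (spam_with_f + ham_with_f)
training_prior = n_spam / (n_spam + n_ham)
with_training_prior = posterior_spam(p_f_given_spam, p_f_given_ham, training_prior)

# Scaling both classes by the same factor leaves the result unchanged.
scaled = (2 * spam_with_f) / (2 * spam_with_f + 2 * ham_with_f)

# If real mail is, say, only 10% spam (assumed value), plug that prior in.
real_prior = 0.10
adjusted = posterior_spam(p_f_given_spam, p_f_given_ham, real_prior)

print(f"from counts:         {from_counts:.4f}")
print(f"with training prior: {with_training_prior:.4f}")
print(f"with real prior:     {adjusted:.4f}")
```

With these made-up numbers the first two values agree, and the third is noticeably lower, which is the adjustment described above: the likelihoods from the table stay the same and only the prior is swapped.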