I am trying to implement the Apriori algorithm. However, I am unsure what to do when the same item appears more than once in one basket. Say I have 2 transactions:
T1 = {A,A,C}
T2 = {A,X}
What is the support of A? Is it 3 or 2?
The support count of an itemset is always calculated with respect to the number of transactions that contain that itemset, not the number of times its items occur.
So ...
boring, but true.
The English Wikipedia page does not explain this well. My reference is "Data Mining: Concepts and Techniques" by Han and Kamber.
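As a minimal sketch of that definition (the function name `support_count` is made up for illustration and does not come from any library), each transaction can be converted to a set before testing containment, so a repeated item inside one basket is counted only once:

```python
def support_count(transactions, itemset):
    """Return the number of transactions that contain every item of `itemset`."""
    target = set(itemset)
    # set(t) collapses repeated items, so ["A", "A", "C"] is treated as {"A", "C"}.
    return sum(1 for t in transactions if target <= set(t))


# The two transactions from the question.
transactions = [["A", "A", "C"], ["A", "X"]]

count_a = support_count(transactions, ["A"])
relative_a = count_a / len(transactions)  # support as a fraction of all transactions
print(count_a, relative_a)
```

Under this definition A is contained in two transactions, so its support count is 2, regardless of how many times it appears in T1.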
Technical definitions aside, one can also apply logic here.
The point of the whole support calculation is to consider only items or itemsets that appear frequently enough in different transactions, so that one can be sure the resulting rules are based on actual patterns and did not appear due to chance (i.e. the strange behavior of just a few customers). This is important so that one can use the rules to make predictions about the likes and dislikes of future customers. If a pattern is based on only two customers, its applicability is ... questionable.
Here is your example, made more extreme to ease the illustration:
then
The Apriori algorithm is designed to be applied to a binary database, that is, a database where items are NOT allowed to appear more than once in each transaction.
If you look at the definition in the paper, a transaction is a subset of the set of items. Because it is a mathematical set, the same item cannot appear more than once in the same basket/transaction.
If you want to consider purchase quantities (so that the same item can appear multiple times in the same basket/transaction), you should look at high utility itemset mining algorithms such as EFIM or FHM (I am the author, by the way). These algorithms and others consider a more general version of the pattern mining problem, where the purchase quantities in transactions and the unit profits of items are both taken into account, in order to find the patterns that generate the highest profit. Note that in high utility itemset mining, if all the quantities and unit profits are set to 1, it becomes the traditional problem of frequent itemset mining.
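To make the difference concrete, here is a rough sketch of how the utility of an itemset is commonly defined: the sum, over the transactions containing the itemset, of quantity times unit profit for each of its items. This is not the EFIM or FHM algorithm itself, only the quantity being measured; the function name `utility` and the profit values are invented for illustration.

```python
def utility(transactions, unit_profit, itemset):
    """Sum quantity * unit profit of the itemset's items over the
    transactions that contain the whole itemset."""
    total = 0
    for quantities in transactions:  # each transaction maps item -> quantity
        if all(item in quantities for item in itemset):
            total += sum(quantities[item] * unit_profit[item] for item in itemset)
    return total


# The question's transactions with quantities: T1 = {A: 2, C: 1}, T2 = {A: 1, X: 1}.
transactions = [{"A": 2, "C": 1}, {"A": 1, "X": 1}]
unit_profit = {"A": 5, "C": 1, "X": 3}  # hypothetical profits, not from the source

u_a = utility(transactions, unit_profit, ["A"])  # 2*5 + 1*5

# Setting every quantity and every unit profit to 1 removes the extra information.
ones_db = [{item: 1 for item in t} for t in transactions]
ones_profit = {item: 1 for item in unit_profit}
u_a_ones = utility(ones_db, ones_profit, ["A"])

print(u_a, u_a_ones)
```

With all quantities and profits set to 1, the utility of the single item A equals its support count (2). For an itemset of k items, this definition gives k times the support count, so ranking itemsets of the same size by utility matches ranking them by support.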