How is the support in the Apriori algorithm calculated in the case of duplicates?

Question

I am trying to implement the Apriori algorithm. However, I am unsure what to do when the same item appears more than once in a single basket. Say I have 2 transactions:

T1 = {A,A,C}
T2 = {A,X}

What is the support of A? Is it 3 or 2?

I edited the question to make it more precise. Can you check whether I have accidentally changed the meaning? - steffen, Feb 24 '12 at 08:55

steffen · Accepted Answer · 2012-02-24T10:48:13.787

The support count of an itemset is always calculated with respect to the number of transactions that contain the itemset, not the number of times the items occur.

So ...

the absolute support of A, i.e. the number of transactions that contain A, is 2
the relative support of A, i.e. the fraction of transactions that contain A, is $\frac{2}{2}=1$

boring, but true.
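
As a minimal sketch of this definition (the function names `absolute_support` and `relative_support` are my own, not from any library), each basket can be turned into a set before counting, so that a repeated item within one basket is counted once:

```python
from typing import Iterable, List, Set


def absolute_support(itemset: Set[str], transactions: List[Iterable[str]]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    # Converting a basket to a set discards duplicates inside that basket.
    return sum(1 for basket in transactions if itemset <= set(basket))


def relative_support(itemset: Set[str], transactions: List[Iterable[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return absolute_support(itemset, transactions) / len(transactions)


# The two baskets from the question; A appears twice in T1.
transactions = [["A", "A", "C"], ["A", "X"]]

print(absolute_support({"A"}, transactions))  # 2
print(relative_support({"A"}, transactions))  # 1.0
```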

The English Wikipedia page does not explain this well. My reference is "Data Mining: Concepts and Techniques" by Han and Kamber.

Technical definitions aside, one can also apply logic here.

The point of the whole support calculation is to consider only items and itemsets that appear frequently enough in different transactions, so that one can be reasonably sure that the resulting rules are based on actual patterns and did not appear due to chance (i.e. the strange behavior of just a few customers). This is important so that one can use the rules to make predictions about the likes and dislikes of future customers. If a pattern is based on only two customers, its applicability is ... questionable.

Here is a more extreme version of your example, to make the point easier to see:

One customer bought A a thousand times in a single transaction, because he loves A so much
A second customer bought it once, just to try it out
Another 1000 customers visited the store, but no one else bought A

then

the absolute support is 2, not 1001
the relative support is $\frac{2}{1000+1+1} \approx 0.0020$ and not $\frac{1001}{1000+1+1} \approx 0.9990$
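
The same comparison can be checked with a few lines of Python. This is only a sketch: the item "B" bought by the other 1000 customers is a made-up placeholder, since the example only says that they did not buy A.

```python
# Build the 1002 baskets of the extreme example.
big_fan = ["A"] * 1000                  # one basket containing A a thousand times
curious = ["A"]                         # one basket containing A once
others = [["B"] for _ in range(1000)]   # hypothetical baskets without A
baskets = [big_fan, curious] + others

n = len(baskets)

# Support counts transactions that contain A ...
transaction_count = sum(1 for basket in baskets if "A" in basket)
# ... and not the total number of occurrences of A.
occurrence_count = sum(basket.count("A") for basket in baskets)

print(transaction_count, round(transaction_count / n, 4))  # 2 0.002
print(occurrence_count, round(occurrence_count / n, 4))    # 1001 0.999
```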

Phil · Answer 2 · 2017-07-04T01:17:00.950

The Apriori algorithm is designed to be applied to a binary database, that is, a database where items are NOT allowed to appear more than once in each transaction.

If you look at the definition in the paper, a transaction is a subset of the set of items. Because it is a mathematical set, the same item cannot appear more than once in the same basket/transaction.

If you want to consider purchase quantities (so that the same item can appear multiple times in the same basket/transaction), you should look at high utility itemset mining algorithms such as EFIM or FHM (I am the author, by the way). These algorithms and others consider a more general version of the pattern mining problem, where the purchase quantities in transactions and the unit profits of items are taken into account in order to find the patterns that generate the highest profit. Note that in high utility itemset mining, if all the quantities and unit profits are set to 1, it becomes the traditional problem of frequent itemset mining.
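
To illustrate the idea, here is a rough sketch of how the utility of an itemset is usually defined: the sum of quantity times unit profit over the transactions that contain the whole itemset. This is not the EFIM or FHM algorithm, only the measure they search over, and the function name, quantities, and profits below are invented for the example.

```python
from typing import Dict, List, Set


def utility(itemset: Set[str],
            transactions: List[Dict[str, int]],
            unit_profit: Dict[str, float]) -> float:
    """Sum quantity * unit profit of the itemset's items over the
    transactions that contain every item of the itemset."""
    total = 0
    for basket in transactions:  # basket maps item -> purchased quantity
        if all(item in basket for item in itemset):
            total += sum(basket[item] * unit_profit[item] for item in itemset)
    return total


# The baskets from the question, written as item -> quantity.
quantity_db = [{"A": 2, "C": 1}, {"A": 1, "X": 1}]
profits = {"A": 5, "C": 1, "X": 2}  # made-up unit profits

print(utility({"A"}, quantity_db, profits))  # 15

# With every quantity and every unit profit set to 1, the utility of a
# single item equals its support count.
binary_db = [{item: 1 for item in basket} for basket in quantity_db]
unit = {item: 1 for item in profits}
print(utility({"A"}, binary_db, unit))  # 2
```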
