Jeune has a fascination for building and solving things. His interests include Algorithms, Development Practices and Software Design. When not immersed in his passions, he spends time experimenting in the kitchen. Jose is a DZone MVB and is not an employee of DZone and has posted 10 posts at DZone. You can read more from them at their website. View Full User Profile

The Apriori Algorithm

  • submit to reddit

Here are just notes from my data mining class which I began to consolidate here in my blog as a way to assimilate the lessons.

The Apriori algorithm is a basic method for finding frequent itemsets. The latter is used to generate association rules with high confidence and high interest.

Here is my summary of it along with a running example. The following set of baskets will be used:

D = [ ('milk', 'cheese', 'tuna'),
      ('cheese', 'mushroom'),
      ('cheese', 'eggs'),
      ('milk', 'cheese', 'mushroom'),
      ('milk', 'eggs'),
      ('cheese', 'eggs'),
      ('milk', 'cheese', 'mushroom', 'tuna'), 
      ('milk', 'cheese', 'eggs') ]

Some definitions:

I – is the universal set of items. In the example above, the universal set would be {milk, cheese, tuna, mushroom, eggs}.

C_{k} – is like a k-combination of I. Like because items in this set should have frequent (k-1, k-2,…1)-itemsets.

Now, the apriori algorithm.

1. Generate C_{k} by cross joining the itemsets of L_{k-1} among themselves.

Cross joining two sets

A cross join between two k-item sets is a union of those two sets which results in a k+1 itemset. However, the join only happens if and only if the first k-1 items of both sets are equal.

For example:

[1,2] |x| [1,3] = [1,2,3]
[1,3] |x| [1,4] = [1,3,4]
[2,3] |x| [1,5] = no join

If k=1, simply list all 1 itemsets.

one_itemset = ['milk', 'cheese', 'eggs', 'mushroom', 'tuna']

2. Generate L_{k}, the frequent itsemsets by counting the number of times each element in C_{k} occurs in D. If an element in C_{k} \geq S or the support threshold, it is qualified to be a member of L_{k}. For k=1 and our example above for L_{1} is

c1 = {'cheese': 7, 
      'tuna': 2, 
      'eggs': 6, 
      'mushroom': 2, 
      'milk': 6}

3. Repeat the process for k \leq |I| until no frequent itemsets are found.

Published at DZone with permission of Jose Asuncion, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)