One has to have a container for the data, found frequent item sets and some helpers for generating the candidates.
Suppose first the setting where you really are going to do this thing for maybe twice in your life. First the data could be just read to an array (one element per row) of sorted sets. After this initialize candidates with all the elements. Then start looping the following: calculate the support of all candidates and add them to frequent item sets if needed. All the candidates and frequent item sets should also be sorted sets and they should be kept in order. You should also keep booking of the largest elements of all candidates and frequent sets. Check which pairs of k-size frequent item sets have an intersection of size k-1. Then try each k-size subset of these sets' union and see if they are frequent. If no candidates were generated, quit.
Use a trie to calculate candidates. When you have create the set of candidates for the next level, just walk the leaves of the trie from left to right and try to expand each of the nodes in order. There is a huge number of different optimizations to do and there actually have been competitions of creating the fastest apriori program (search for fimi & apriori).
This is very case-specific, but naturally new columns give rise to a large number of new candidate pairs. In tests with the 1st problem's data also the number of frequent pairs shot up with low thresholds.
Think about random data. New singleton's follow the binomial distribution's P(Bin(k, 0.5) >= threshold). As the subsets of frequent sets have to be frequent, we can think that all new frequent sets arise when an earlier frequent set receives a new member. The probability of a frequent set with support S and threshold T getting a new member can be calculated. In the set of S rows of the frequent set the probability of a variable X reaching the threshold T is P = P(Bin(S, 1/2) >= threshold). Then the probability of the set getting a new member is 1-(1-P)^k.
Note! The data file given to apriori program must have a blank last line. (It would have been nice to mention this in the non-existent documentation...).
Take a frequent set of size S and support P. The new support of the set will be something like Bin(P, (1-q)^S), which is a pretty small number, P*(1-q)^S in expectation. Also, the bigger the set, the more its support gets reduced (exponentially more). This means a serious reduce in the number of frequent sets in total.
Comparing runs with original data and flipped data:
|Set||Data supp||Flipped supp||Theory supp|
|21614, 23313, 26459||765||577||556|
You are at: CIS → T-61.5060 Exercises 2007, Solutions 3
Page maintained by firstname.lastname@example.org, last updated Monday, 15-Oct-2007 18:23:24 EEST