T-61.5060 Exercises 2007, Solutions 3

Laboratory of Computer and Information Science / Neural Networks Research Centre

Laboratory of Computer and Information Science > Teaching > T-61.5060 > Exercises 2007 > Solutions 3

Solutions for the 3rd exercises

Frequent sets for support at least 3:
1. {A}, {B}, {C}, {D}, {E}
2. {A, C}, {A, D}, {A, E}, {B, C}, {B, D}, {B, E}, {C, D}, {C, E}, {D, E}
3. Candidates are: {A, C, D}, {A, C, E}, {A, D, E}, {B, C, D}, {B, C, E}, {B, D, E}, {C, D, E}. Of these only {B, C, D} and {B, D, E} fail to reach the frequency threshold.
4. B won't be part in any of these because {B, C, D} is not frequent. Thus the only candidate is {A, C, D, E}, which passes the threshold.
One association rule is {D, E} => C. This has frequency of 4 and confidence of 1. Rule {A, C} => E has the same attributes. But these are not that interesting perhaps, because C itself is abundantly frequent. Others not so obvious are A => D and A => E (4, 0.8), {A, D} => E (3, 0.75), {A, C} => {D, E} (3, 0.75).

One has to have a container for the data, found frequent item sets and some helpers for generating the candidates.

Suppose first the setting where you really are going to do this thing for maybe twice in your life. First the data could be just read to an array (one element per row) of sorted sets. After this initialize candidates with all the elements. Then start looping the following: calculate the support of all candidates and add them to frequent item sets if needed. All the candidates and frequent item sets should also be sorted sets and they should be kept in order. You should also keep booking of the largest elements of all candidates and frequent sets. Check which pairs of k-size frequent item sets have an intersection of size k-1. Then try each k-size subset of these sets' union and see if they are frequent. If no candidates were generated, quit.

Fast implementation:
Use a trie to calculate candidates. When you have create the set of candidates for the next level, just walk the leaves of the trie from left to right and try to expand each of the nodes in order. There is a huge number of different optimizations to do and there actually have been competitions of creating the fastest apriori program (search for fimi & apriori).

Sizes of the collections grow (or decrease in size) very quickly with the threshold. Even a log log of the set counts was not enough to stop the growth.
- Plots of the growth
Interpreting the frequent item sets:

This is very case-specific, but naturally new columns give rise to a large number of new candidate pairs. In tests with the 1st problem's data also the number of frequent pairs shot up with low thresholds.

Think about random data. New singleton's follow the binomial distribution's P(Bin(k, 0.5) >= threshold). As the subsets of frequent sets have to be frequent, we can think that all new frequent sets arise when an earlier frequent set receives a new member. The probability of a frequent set with support S and threshold T getting a new member can be calculated. In the set of S rows of the frequent set the probability of a variable X reaching the threshold T is P = P(Bin(S, 1/2) >= threshold). Then the probability of the set getting a new member is 1-(1-P)^k.

Note! The data file given to apriori program must have a blank last line. (It would have been nice to mention this in the non-existent documentation...).

Take a frequent set of size S and support P. The new support of the set will be something like Bin(P, (1-q)^S), which is a pretty small number, P*(1-q)^S in expectation. Also, the bigger the set, the more its support gets reduced (exponentially more). This means a serious reduce in the number of frequent sets in total.

Comparing runs with original data and flipped data:

Set Data supp Flipped supp Theory supp

30447 2307 2065 2076
23313, 26459 1834 1515 1486

21614, 23313 2213 1808 1793

21614, 23313, 26459 765 577 556

You are at: CIS → T-61.5060 Exercises 2007, Solutions 3

Page maintained by t615060@james.hut.fi, last updated Monday, 15-Oct-2007 18:23:24 EEST