T-61.5060 Exercises 2007, Solutions 5

Laboratory of Computer and Information Science > Teaching > T-61.5060 > Exercises 2007 > Solutions 5

Solutions for the 5th exercises

We may suppose that we are using a window size that is equal to the episode size. Other cases are simple variants of this case. The algorithm keeps in memory the latest met occurrences of all event types and the starting point of the current candidate episode.

The algorithm iterates forward once through the event list. If it encounters an element not in the episode it sets the current starting point to an invalid value. If it encounters an element of the episode, it checks the latest occurrence of the element's type and the starting point of the current candidate. If the event count to the last occurrence is more than to the candidate start, then this event grows the candidate validly. We update the occurrence index and if the event count to the starting point is the size of the episode, we add the episode to some itemset. After that we move the starting point one step forward and move on.

If the distance to the last occurrence was at most the distance to the starting point, we just update the occurrence index and move the starting point one step forward.

Note that this algorithm has one very special property. The only time we need the inclusion relation of the episode is when we check whether the current event is in the episode. Thus if the belongs-to function can be implemented in constant time, the algorithm's complexity is O(N), where N is the number of events. This bound does not depend on the number of events in the episode.

This algorithm works only in cases when the episode is not allowed to contain duplicate events. This limitation can be easily overcome though with some modifications.

The belongs-to function can be implemented in constant time if the episode event types have some mutual easily calculable property, e.g. numbers between 124 and 131 or perfect squares.

Matlab script of the algorithm

First calculate the complement collection of the positive border: E, AD, BC. Now each member of the negative border must have a non-empty intersection with each of these. Therefore each border set will contain A or D, B or C and E. Obviously there are four possible sets: ABE, ACE, DBE, DCE. These sets define the negative border.

Writing math without MathML looks ugly as a swamp monster.

Here's the solution in PDF.

You should try looking at the collection of all possible sets as the Hasse diagram of a lattice, that has the partial order induced by the inclusion relation. Now try to see the patterns the frequent itemsets (or any collection F, where contains(f, F) => forall subsets x of f: contains(x, F)) create to the diagram. They are mountains as they have tops (the maximal elements) and everything under the mountain top belongs to the mountain.

Everything that is not included in the mountains is the sky. The sky consists of the transversals of the mountains' complements. The minimal transversals are in the valleys. As is seen, one can define the same landscape by defining the mountains or the sky, one is enough. Even more, it is enough to define the mountain tops or the valleys, that is, the positive or the negative border.

The first two claims are logically quite obvious. The first claim merely says that if we increase our requirements, the result won't be any worse. Any superset of S stays constant at most when S stays constant, which implies that A stays constant.

The second claim is totally similar. It says that accepting looser situations won't help us a bit. If R is a subset of T, then R stays constant at least when T stays constant. But A will still change in some of those situations so the claim is true.

Once again think of the lattice of all attribute sets. This time the sets of T construct the mountains and the sets of S the sky as is easily interpreted from the problem definition. Now when looking at the general framework we may conclude that the minimal elements of S form the minimal transversals of the complements of T. Therefore the maximal elements of T are the positive and the minimal elements of S are the negative border of T.