Association Rule Mining

A brute-force approach for mining association rules is to compute the support and confidence for every possible rule. This approach is prohibitively expensive because there are exponentially many rules that can be extracted from a data set. More specifically, the total number of possible rules extracted from a data set that contains d items is: $$R = 3^d - 2^{d+1} + 1$$

Even for the small datasets (d = 6) this approach will require us to compute the support and confidence for $3^6 - 2^7 + 1 = 602$ rules. More than 80% of the rules are discarded after applying minsum = 20% and minconf = 50%, thus making most of the computations become wasted. To avoid performing needless computations, it would be useful to prune the rules early without having to compute their support and confidence values.

A common strategy adopted by many association rule mining algorithm is to decompose the problem into major subtasks:

  1. Frequent Itemset Generation:- its objective is to find all the itemsets that satisfy the minsup threshold. These itemsets are called frequent itemsets.
  2. Rule Generation:- its objective is to extract all the high-confidence rules from the frequent itemsets found in the previous step. These rules are called strong rules.

Frequent Itemset Generation

A lattice structure can be used to enumerate the list of all possible itemsets. The following diagram shows an itemset lattic for $I = \{a, b, c, d, e\}$.

In general a dataset that contains $k$ items can potentially generate upto $2^k - 1$ frequent itemsets, excluding the null set. Because $k$ can be very large in many practical applications, the search space of itemsets that need to explored is exponentially large.

A bruteforce approach for finding frequent itemsets is to determine the support count for every candidate itemset in the lattice structure. To do this, we need to compare each candidate against every transaction. If the candidate is contained in a transaction its support count will be incremented. Such an approach can be very expensive because it requires $O(N\ M\ w)$ comparisons, where $N$ is the number of transactions, $M = 2^k - 1$ is the number of candidate itemsets, and w is the maximum transaction width.

There are several ways to reduce the computational complexity of frequent itemset generations.

  1. Reduce the number of candidate itemsets (M):- The apriori principle is an effective way to eliminate some of the candidates without counting their support values.
  2. Reduce the number of comparisons:- Instead of matching each candidate against every transaction, we can reduce the number of comparisons by using more advanced data structures, wither to store the candidate itemsets or to compress the data set.

The Apriori Principle

“If an itemset is frequent, then all of its subsets must also be frequent.”

To understand the above principle, consider the itemset lattic shown below

Suppose $\{c, d, e\}$ is the frequent itemset. Clearly any transaction that contains $\{c, d, e\}$ must also contain its subsets, $\{c, d\}$, $\{c, e\}$, $\{d, e\}$, $\{c\}$, $\{d\}$,$\{e\}$. As a result, if $\{c, d, e\}$ is frequent, then all subsets of $\{c, d, e\}$ must also be frequent. Conversely, if an itemset such as $\{a, b\}$ is infrequent, then all of it’s supersets must be infrequent too. Hence the entire subgraph containing the supersets of $\{a, b\}$ can be pruned.

This strategy of trimming the exponential search space based on the support measure is known as support-based pruning. Such a pruning strategy is made possible by a key property of the support measure, namely, that the support for an itemset never exceeds the support for its subset. This property is known as the anti-monotone property of the support measure.

Monotonicity property

Let $I$ be a set of items, and $J = 2^I$ be the power set of $I$. A measure $f$ is monotone (o upward closed) if $$\forall X, Y \in J: (X \subseteq Y) \rightarrow f(X) \leq f(Y)$$ which means that if $X$ is a subset of $Y$, then $f(X)$ must not exceed $f(Y)$. On the other hand, $f$ is anti-monotone (or downward closed) if $$\forall X, Y \in J: (X \subseteq Y) \rightarrow f(Y) \leq f(X)$$

which means that if $X$ is a subset of $Y$, then $f(Y)$ must not exceed $f(X)$

Frequent itemset generation in the apriori algorithm

Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to sytematically control the exponential growth of the candidate itemset. Following figure illustrates the high-level illustration of the frequent itemset generation part of the Apriori algorithm.

we assume that the support threshold is 60%, which is equivalent to a minimum support count equal to 3. Initially, every item is considered as a candidate 1-itemset. After counting their supports, the candidate itemsets $\{Cola\}$ and $\{Eggs\}$ are discarded, because they appear in fewer than three transactions. In the next iteration, candidate 2-itemsets are generated using only the frequent 1-itemsets because the Apriori principle ensures that all supersets of the infrequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is ${4 \choose 2} = 6$. Two of these six candidates, $\{Beer, Bread\}$ and $\{Beer, Milk\}$, are subsequently found to be infrequent after computing their support values. The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemset. Without support based pruning, there are ${6 \choose 3} = 20$ candidate 3-itemsets that can be formed using six items. With the apriori principle, we only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this property is $\{Bread, Diapers, Milk\}$.

Apriori Algorithm

The pseudocode for the frequent itemset generation part of the Apriori algorithm is shown below

Let $C_{k}$ denotes the set of candidate $k$-itemsets and $F_{k}$ denote the set of frequent $k$-itemsets:

  • The algorithm initially makes a single pass over the data set to determine the support of each item. Upon completion of this step, the set of all frequent 1-itemsets, F_{1}, will be known.
  • Next, the algorithm will iteratively generate new candidates k-itemsets using the frequent $(k-1)$-itemsets found in the previous iteration.
  • To count the support of the candidates, the algorithm needs to make an additional pass over the data set. The subset function is used to determine all the candidate itemsets in $C_{k}$ that are contained in each transaction $t$.
  • After counting their supports, the algorithm eliminates all candidate itemsets whose support counds are less than minsup
  • The algorithm terminates when there are no new frequent itemsets generated.

The frequent itemset generation part of the Apriori algorithm has two important characteristics.

  • First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time, from frequent 1-itemsets to the maximum size of frequent itemsets.
  • Second, it employs a generate-and-test strategy for finding frequent itemsets.

Candidate generation and pruning

The apriori-gen function shown in step 5 of the algorithm generates candidate itemsets by performing the following two operations:

  1. Candidate Generation: This operation generates new candidate $k$-itemsets based on the frequent $(k − 1)$-itemsets found in the previous iteration.

  2. Candidate Pruning: This operation eliminates some of the candidate $k$-itemsets using the support-based pruning strategy.

In principle, there are many ways to generate candidate itemsets, The following is a list of requirements for an effective candidate generation procedure:

  1. It should avoid generating too many unneccesary candidates. A candidate itemset is unnecessary if at least one of its subsets is infrequent. Such a candidate is guaranteed to be infrequent to the anti-monotone property of support.
  2. It must ensure that the candidates set is complete, i.e., no frequent itemsets are left out by the candidate generation procedure. To ensure completeness, the set of candidate itemsets must subsume the set of all frequent itemsets.
  3. It should not generate the same candidate itemset more than once.

Brute-Force Method

The brute-force method considers every k-itemset as a potential candidate and then applies the candidate pruning step to remove any unnecessary candidates. The number of candidate itemsets generated at level $k$ is equal to $d \choose k$, where $d$ is the total number of items. Although candidate generation is rather trivial, candidate pruning becomes extremely expensive because a large number of itemsets must be examined. Given that the ammount of computations needed for each candidate is $O(k)$, the overall complexity of this method is $$O \left( \sum_{k=1}^{d} k * {d \choose k} \right) = O \left( d \times 2^{d-1} \right)$$

The following diagram illustrates the procedure

$F_{k-1} \times F_{1}$ Method

An alternate method for candidate generation is to extend each frequent $(k-1)$-itemset with other frequent items, The following figure illustrates how a frequent 2-itemset such as $\{Beer, Diapers\}$ can be augmented with a frequent item such as $Bread$ to produce a candidate 3-itemset $\{Beer, Diapers, Bread\}$.

This method will produce $O(|F_{k-1}| \times |F_{1}|)$ candidate k-itemset, where $|F_{j}|$ is the number of frequent $j$-itemsets. This approach however, doesn’t prevent the same candidate item from being generated more than once. One way to avoid generating duplicates candidates is by ensuring that the items in each frequent subset are kept sorted in their lexicographical order.

While this procedure is a substantial improvement over the brute-force method, it can still produce a large number of unnecessary candidates.

$F_{k-1} \times F_{k-1}$ Method

The candidate generation procedure in the apriori-gen function merges a pair of frequent $(k-1)$-itemsets only if their first $k-2$ items are identical. Let $A = \{a_{1}, a_{2}, a_{3}, …, a_{k-1}\}$ and $B = \{b_{1}, b_{2}, …, b_{k-1}\}$ be a pair of frequent $(k-1)$-itemsets. A and B is merged if they satisfy the following conditions: $$a_{i} = b_{i} (for\ i = 1,2,3,…,k-1) and a_{k-1} \neq b_{k-1}$$ In the following figure the frequent itemsets $\{Bread, Diapers\}$ and $\{Bread, Milk\}$ are merged to form a candidate 3-itemset $\{Bread, Diapers, Milk\}$

The algorithm doesn’t have to merge $\{Beer, Diapers\}$ with $\{Diapers, Milk\}$ because the first item in both itemsets is different. Indeed, if $\{Beer, Diapers, Milk\}$ is a visible candidate, it would have been obtained by merging $\{Beer, Diapers\}$ with $\{Beer, Milk\}$ instead.

However, because each candidate is obtained by merging a pair of frequent $(k−1)$-itemsets, an additional candidate pruning step is needed to ensure that the remaining $k − 2$ subsets of the candidate are frequent.

Support Counting

Support counting is the process of determining the frequency of occurance of every candidate itemset that survives the candidate pruning step of the apriori-gen function. One approach is to compare each transaction against every candidate itemset and to update the support counts of candidates contained in the transaction. This approach is computationally expensive when the number of transactions and candidate itemsets are large. An alternative approach is to enumerate the itemsets contained in each transaction and use them to update the support counts of their respective candidate itemsets.

Support Counting using hashtree

In the Apriori algorithm, candidate itemsets are partitioned into different buckets and stored in a hash tree. During support counting, itemsets contained in each transaction are also hashed into their appropriate buckets. That way, instead of comparing each itemset in the transaction with every candidate itemset, it is matched only against candidate itemsets that belong to the same bucket.

Rule Generation

An association rule can be extracted by partitioning the itemset $Y$ into two non-empty subsets, $X$ and $Y - X$, such that $X \rightarrow Y - X$ satisfies the confidence threshold. Computing the confidence of an association rule does not require additional scans of the transaction data set. Consider the rule $\{1,2\} \rightarrow \{3\}$, which is generated from the frequent subset $\{1,2,3\}$. The confidence of this rule is $\frac{\sigma(\{1,2,3\})}{\sigma(\{1,2\})}$. Because $\{1,2,3\}$ is frequent, the anti-monotone property of support ensures that $\{1,2\}$ must be frequent, too. Since the support counts for both itemsets were already found during frequent itemset generation, there is no need to read the entire data set again.

Confidence based pruning

Unline the support measure, confidence doesn’t have any monotone property. Nevertheless, if we compare rules generated from the same frequent itemset Y , the following theorem holds for the confidence measure.

If a rule $X \rightarrow Y - X$ doesn’t satisfy the confidence threshold, then any rule $X’ \rightarrow Y - X'$, where $X'$ is a subset of $X$, must not satisfy the confidence threshold as well

Rule generation in Apriori Algorithm

The Apriori algorithm uses level-wise approach for generating association rules, where each level corresponds to the number of items that belong to the rule consequent. Initially, all the high confidence rules that have only one item in the rule consequent are extracted. These rules are then used to generate new candidates. For example, if $\{acd\} \rightarrow \{b\}$ and $\{abd\} \rightarrow \{c\}$ are high confidence rules, then the candidate rule $\{ad\} \rightarrow \{bc\}$ is generated by merging the consequent of both rules. Following figure shows a lattice structure for the assosiation rules generated from the frequent itemset $\{a,b,c,d\}$

If any node in the lattice has low confidence, then according to theorem mentioned above the entire subgraph spanned by the node can be pruned immediatly. Suppose the confidence for $\{bcd\} \rightarrow {a}$ is low. All the rules containing item a in its consequent, including $\{cd\} \rightarrow \{ab\}$, $\{bd\} \rightarrow \{ac\}$, $\{bc\} \rightarrow \{ad\}$, and $\{d\} \rightarrow \{abc\}$ can be discarded.