Association Basic Concepts

Binary representation

Market basket data can be represented in a binary format, as shown in the table below, where each row corresponds to a transaction and each column corresponds to an item.

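| TID | Bread | Milk | Diapers | Beer | Eggs | Cola |
|-----|-------|------|---------|------|------|------|
| 1   | 1     | 1    | 0       | 0    | 0    | 0    |
| 2   | 1     | 0    | 1       | 1    | 1    | 0    |
| 3   | 0     | 1    | 1       | 1    | 0    | 1    |
| 4   | 1     | 1    | 1       | 1    | 0    | 0    |
| 5   | 1     | 1    | 1       | 0    | 0    | 1    |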

An item can be treated as a binary attribute whose value is one if the item is present in the transaction and zero otherwise. Because the presence of an item in a transaction is often considered more important than its absence, an item is an asymmetric binary variable.
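As a rough illustration of this encoding, the sketch below rebuilds the binary rows from item lists. The baskets are read off the table above; the names `ITEMS`, `transactions`, and `to_binary_row` are hypothetical helpers introduced only for this example.

```python
# Item order follows the column order of the table above.
ITEMS = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Cola"]

# Transactions keyed by TID, read off the rows of the table.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diapers", "Beer", "Eggs"},
    3: {"Milk", "Diapers", "Beer", "Cola"},
    4: {"Bread", "Milk", "Diapers", "Beer"},
    5: {"Bread", "Milk", "Diapers", "Cola"},
}


def to_binary_row(basket, items=ITEMS):
    """Return one 0/1 flag per item: 1 if the item is in the basket, else 0."""
    return [1 if item in basket else 0 for item in items]


# Binary representation of every transaction, keyed by TID.
binary_table = {tid: to_binary_row(basket) for tid, basket in transactions.items()}

for tid, row in binary_table.items():
    print(tid, row)
```

Only the ones carry information in this representation, which is why each basket can be stored as a set of the items it contains.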

Itemset and support count

Let $I = \{i_{1}, i_{2}, i_{3}, \ldots, i_{d}\}$ be the set of all items in a market basket data set and $T=\{t_{1}, t_{2}, t_{3}, \ldots, t_{n}\}$ be the set of all transactions. Each transaction $t_{i}$ contains a subset of items chosen from $I$. In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains $k$ items, it is called a $k$-itemset. The null (or empty) itemset is an itemset that does not contain any items. The transaction width is defined as the number of items present in a transaction. A transaction $t_{j}$ is said to contain an itemset $X$ if $X$ is a subset of $t_{j}$. An important property of an itemset is its support count, which refers to the number of transactions that contain a particular itemset. Mathematically, the support count, $\sigma(X)$, for an itemset $X$ can be stated as follows: $$\sigma(X) = |\{t_{i} \mid X \subseteq t_{i},\ t_{i} \in T\}|$$ where the symbol $|\cdot|$ denotes the number of elements in a set.
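The support count definition translates almost directly into code. The following is a minimal sketch, assuming the five transactions from the binary table are stored as Python sets; `support_count` and `widths` are hypothetical names chosen for this example.

```python
# The five transactions from the binary table, stored as sets of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset, transactions):
    """sigma(X): number of transactions t with X a subset of t."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


# Transaction width: the number of items present in each transaction.
widths = [len(t) for t in transactions]

print(support_count({"Milk", "Diapers"}, transactions))  # 3 (transactions 3, 4, 5)
print(support_count(set(), transactions))                # 5 (null itemset is in every transaction)
print(widths)                                            # [2, 4, 4, 4, 4]
```

The `<=` operator on sets is the subset test, so the generator counts exactly the transactions that contain the itemset.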

Association Rule

An association rule is an implication expression of the form $X \rightarrow Y$, where $X$ and $Y$ are disjoint itemsets, i.e., $X \cap Y = \emptyset$. The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in $Y$ appear in transactions that contain $X$. The formal definitions of these metrics are $$\text{Support, } s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}$$ $$\text{Confidence, } c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}$$ where $N = |T|$ is the total number of transactions.
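As a worked example, consider the rule $\{Milk, Diapers\} \rightarrow \{Beer\}$ on the five transactions from the binary table. The sketch below computes both metrics; the function names `support` and `confidence` are hypothetical and chosen only for illustration.

```python
# The five transactions from the binary table, stored as sets of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """s(X -> Y) = sigma(X u Y) / N."""
    return support_count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X); assumes sigma(X) > 0."""
    return support_count(x | y, transactions) / support_count(x, transactions)


x = frozenset({"Milk", "Diapers"})
y = frozenset({"Beer"})

print(support(x, y, transactions))     # 2 of 5 transactions contain all three items: 0.4
print(confidence(x, y, transactions))  # 2 of the 3 transactions with Milk and Diapers: 2/3
```

Here $\sigma(\{Milk, Diapers, Beer\}) = 2$ and $\sigma(\{Milk, Diapers\}) = 3$, so the rule has support $2/5$ and confidence $2/3$.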

Why use Support and Confidence?

Support is an important measure because a rule that has very low support may occur simply by chance. A low-support rule is also likely to be uninteresting from a business perspective because it may not be profitable to promote items that customers seldom buy together. For these reasons, support is often used to eliminate uninteresting rules. Confidence, on the other hand, measures the reliability of the inference made by a rule. For a given rule $X \rightarrow Y$, the higher the confidence, the more likely it is for $Y$ to be present in transactions that contain $X$. Confidence also provides an estimate of the conditional probability of $Y$ given $X$.

Association analysis results should be interpreted with caution. The inference made by an association rule does not necessarily imply causality. Instead, it suggests a strong co-occurrence relationship between items in the antecedent and consequent of the rule. Causality, on the other hand, requires knowledge about the cause and effect attributes in the data and typically involves relationships occurring over time.

Formulation of association rule mining problem

The association rule mining problem can be formally stated as follows:

“Given a set of transactions $T$, find all the rules having support $\geq$ minsup and confidence $\geq$ minconf, where minsup and minconf are the corresponding support and confidence thresholds.”
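To make the problem statement concrete, here is a brute-force sketch that enumerates every candidate rule and keeps those meeting both thresholds. It is an illustration of the formulation only, not an efficient mining algorithm: the number of candidate itemsets grows exponentially with the number of items. The function name `mine_rules` and the threshold values used below are assumptions made for this example.

```python
from itertools import combinations

# The five transactions from the binary table, stored as sets of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)


def mine_rules(transactions, minsup, minconf):
    """Brute force: return (X, Y, support, confidence) for every rule X -> Y
    with support >= minsup and confidence >= minconf."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    # A rule needs a non-empty antecedent and consequent, so start at 2-itemsets.
    for k in range(2, len(items) + 1):
        for candidate in combinations(items, k):
            itemset = frozenset(candidate)
            count_xy = support_count(itemset, transactions)
            support = count_xy / n
            if count_xy == 0 or support < minsup:
                continue
            # Split the itemset into every antecedent X and consequent Y.
            for r in range(1, k):
                for antecedent in combinations(candidate, r):
                    x = frozenset(antecedent)
                    y = itemset - x
                    confidence = count_xy / support_count(x, transactions)
                    if confidence >= minconf:
                        rules.append((x, y, support, confidence))
    return rules


# Hypothetical thresholds chosen for illustration.
rules = mine_rules(transactions, minsup=0.4, minconf=0.6)
for x, y, s, c in rules:
    print(sorted(x), "->", sorted(y), f"support={s:.2f}", f"confidence={c:.2f}")
```

Every rule derived from the same itemset shares the same support, which is why the support check is done once per itemset before any antecedent is considered.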
