Measures of similiarity and dissimiliarity

Similiarity and dissimiliarity are important because they are used to by a number of data mining techniques, such as clustering, nearest neighbour classification, and anamoly detection. In many cases, the initial data set is no longer needed once these similiarities or dissimiliarities have been computed. Such approaches can be viewed as transforming the data to a similiarity (dissimilarity) space and then performing the analysis. For convenience the term proximity is used to refer to similiarity or dissimilarity.

Definitions

  • Similiarity:- Informally similiarity between two objects is a numerical measure of the degree to which two objects are alike. Consequently, similiarities are higher for pairs of objects that are alike.
  • Dissimiliarity:- The dissimiliarity between two objects is a numerical measure of the degree to which two objects are different. Dissimiliarities are lower for similiar pairs of objects. Frequently, the term distance is used as a symbol for dissimiliarities.

Dissimiliarities between data objects

we will discuss first distances which are similiarities with certain properties

Distances

Euclidean distance:- The Euclidean distance, $d$, between two points $x$ and $y$ (in one, three or more dimensional space) is given by the following formula $$d(x,y) = \sqrt{\sum_{k=1}^{n} (x_{k} - y_{k})^2}$$ where $n$ is the number of dimensions and $x_{k}$ and $y_{k}$ are, respectively, the $k^{th}$ attributes of $x$ and $y$.

Minkowski Distance metric:- The Euclidean distance measure discussed above is generalized by the Minkowski distance metric which is given as $$d(x,y) = ( \sum_{k=1}^{n}|x_{k} - y_{k}|^r )^{1/r}$$ where r is a parameter. Following are the most common examples of minkowski distances.

  • r = 1 :- City block(Manhattan, taxicab, $L_{1}$) distances. A common example is the hamming distance, which is the number of bits between two objects that have only binary attributes.
  • r = 2 :- Euclidean distance ($L_{2}$ norm)
  • r = $\infty$ :- Supremum ($L_{max}$ or $L_{\infty}$) distance. This is the maximum difference between any two attributes of the objects. More informaly, the $L_{\infty}$ distance is defined as $$d(x,y) = \lim_{r \to \infty} (\sum_{k=1}^{n}|x_{k} - y_{k}|^r)^{1/r}$$

The euclidean, manhattan and supremum are defined for all values of n and specify different ways of combining the differences in each dimension (attribute) into an overall distance.

Consider the following example in which 4 points are given with their x and y co-ordinates.

Pointxy
p102
p220
p331
p451

Now it’s euclidean distance matrix will be

p1p2p3p4
p10.02.83.25.1
p22.80.01.43.2
p33.21.40.02.0
p45.13.22.00.0

$L_{1}$ distance matrix will be

p1p2p3p4
p10.04.04.06.0
p24.00.02.04.0
p34.02.00.02.0
p46.04.02.00.0

$L_{\infty}$ distance matrix will be

p1p2p3p4
p10.02.03.05.0
p22.00.01.03.0
p33.01.00.02.0
p45.03.02.00.0

Properties of distances

Distance discussed above have some common properties:-

  1. Positivity
    • $d(x,y) \geq 0$ for all $x$ and $y$
    • $d(x,y) = 0$ only if $x$ = $y$
  2. Symmetry
    $d(x,y) = d(y,x)$ for all x and y
  3. Triangle Inequality
    $d(x,z) \leq d(x,y) + d(y,z)$ for all points x, y and z.

Measures that satisfy all three properties are known as metrics. There are manu dissimiliarities which do not satisfies these properties, two examples are:-

  1. Set difference
  2. Time

Similiarities between data objects

For similiarities, the triangle inequality (or the analogus property) typically doesn’t hold, but symmetry and positivity typically do. To be explicit, is $s(x,y)$ is the similiarity between two points $x$ and $y$, then the typical properties of similiarities are as follows:-

  1. $s(x, y) = 1$ only if $x = y$. $(0 \leq s \leq 1)$
  2. $s(x, y) = s(y, x)$ $\forall$ $x$ and $y$. (Symmetry)

Similiarity measure for binary data

Similarity measures between objects that contain only binary attributes are called similiarity coefficients, and typically have values between 0 and 1. A value of 1 indicates that the two objects are completely similiar.

Let $x$ and $y$ be two objects that consist of n binary attributes. The comparison for two such objects, i.e., two binary vectors, lead to the following four quantities.

  • $f_{00} =$ the number of attributes where $x$ is 0 and $y$ is 0
  • $f_{01} =$ the number of attributes where $x$ is 0 and $y$ is 1
  • $f_{10} =$ the number of attributes where $x$ is 1 and $y$ is 0
  • $f_{11} =$ the number of attributes where $x$ is 1 and $y$ is 1

Simple matching Coefficient one commonly used similiarity coefficient is the simple matching coefficient (S.M.C.), which is defined as $$SMC = \frac{number\ of\ matching\ attributes}{number\ of\ attributes} = \frac{f_{11}+f_{00}}{f_{00}+f_{01}+f_{10}+f_{11}}$$

Previous
Next