T-61.5020 Statistical Natural Language Processing
Answers 2 -- Similarity measures
Version 1.0


Euclidean distance ($ L_2$ norm)

Euclidean distance between the vectors $ \mathbf x =[x_1 ~x_2 \dots x_n]$ and $ \mathbf y =[y_1 ~y_2 \dots y_n]$ is defined as

$\displaystyle Euc(\mathbf x, \mathbf y)= \sqrt{\sum_{i=1}^n (x_i-y_i)^2}$ (1)

The distance between Tintus and Koskisen korvalääke is calculated as an example:
$\displaystyle Euc(Ti,Ko)$ $\displaystyle =$ $\displaystyle \sqrt{(0-10)^2+(0-6)^2+(5-2)^2+(1-1)^2 +(4-0)^2}$  
  $\displaystyle =$ $\displaystyle 12.7$  
$\displaystyle Euc(Ko,Te)$ $\displaystyle =$ $\displaystyle 9.9$  
$\displaystyle Euc(Ti,Te)$ $\displaystyle =$ $\displaystyle 5.1$  

$ L_1$ norm

The distance according to the $ L_1$ norm is defined as

$\displaystyle L_1(\mathbf x, \mathbf y)= \sum_{i=1}^n \vert x_i-y_i \vert$ (2)

So the distances are:
$\displaystyle L_1(Ti,Ko)$ $\displaystyle =$ $\displaystyle \vert-10\vert+\vert-6\vert+\vert 5-2\vert+\vert 1-1\vert +\vert 4-0\vert$  
  $\displaystyle =$ $\displaystyle 23.0$  
$\displaystyle L_1(Ko,Te)$ $\displaystyle =$ $\displaystyle 17.0$  
$\displaystyle L_1(Ti,Te)$ $\displaystyle =$ $\displaystyle 10.0$  


The cosine measure is a little different case. It can be defined as

$\displaystyle \cos(\mathbf x, \mathbf y) = \frac{\sum_{i=1}^n x_iy_i}{\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}}$ (3)

Let's calculate the distances:
$\displaystyle cos(Ti,Ko)$ $\displaystyle =$ $\displaystyle \frac{0\cdot10+0\cdot6+5\cdot2+1\cdot1 +4\cdot0}
  $\displaystyle =$ $\displaystyle 0.14$  
$\displaystyle cos(Ko,Te)$ $\displaystyle =$ $\displaystyle 0.55$  
$\displaystyle cos(Ti,Te)$ $\displaystyle =$ $\displaystyle 0.70$  

Here a larger value corresponds to a larger similarity, so the distances are in the same order as before.

Information radius

For the information radius we formulate the maximum likelihood estimates for that the next word is generated by a source $ l_i$ (Tintus, Korvalääke, Termiitti) is $ w_i$. This is done by dividing the each element of the table by the sum of its row (Table 1).

Table 1: ML estimates for the word probabilities
  fresh acidic sweet fruity soft
Tintus 0 0 0.50 0.10 0.40
Korvalääke 0.53 0.32 0.11 0.05 0
Termiitti 0.07 0.29 0.21 0.21 0.21

Last we define that

$\displaystyle 0 \log\frac0x = 0, ~ \forall x \in \Re.$    

The information radius is given by the formula
$\displaystyle Irad(p,q)$ $\displaystyle =$ $\displaystyle D(p\vert\vert\frac{p+q}2) + D(q\vert\vert\frac{p+q}2)$  
  $\displaystyle =$ $\displaystyle \sum_i p_i \log\frac{p_i}{\frac{p_i+q_i}2} + \sum_i q_i

Let's calculate it for the given sources:
$\displaystyle Irad(Ti,Ko)$ $\displaystyle =$ $\displaystyle 0\cdot\log\frac{2\cdot0}{0.53} +
0\cdot\log\frac{2\cdot0}{0.32} +
0.50\cdot\log\frac{2\cdot0.50}{0.61} +
    $\displaystyle +0.40\cdot\log\frac{2\cdot0.40}{0.40} +
0.53\cdot\log\frac{2\cdot0.53}{0.53} +
    $\displaystyle +
0.11\cdot\log\frac{2\cdot0.11}{0.61} +
+ 0\cdot\log\frac{2\cdot0}{0.40}$  
  $\displaystyle =$ $\displaystyle 1.5$  
$\displaystyle Irad(Ko,Te)$ $\displaystyle =$ $\displaystyle 0.6$  
$\displaystyle Irad(Ti,Te)$ $\displaystyle =$ $\displaystyle 0.5$  

We see that all the measures set the medicines to a similar order: Tintus and Temiitti are the most similar ones, Tintus and Korvalääke are the most different.

Kullback-Leibler divergence

From the definition of the KL divergence we can directly see some of its problems:

$\displaystyle D(p\vert\vert q)=\displaystyle \sum_i p_i \log \frac{p_i}{q_i}$    

First, the KL divergence is not symmetric, so we should each time decide which one of the two drugs is the reference drug $ p$. The second problem is that if the compared distribution has a zero probability in some dimension where the reference distribution has a non-zero probability, the divergence goes to infinity.


Kullback-Leibler divergence

The definition of the Kullback-Leibler divergence was

$\displaystyle D(p\vert\vert q)=\displaystyle \sum_i p_i \log \frac{p_i}{q_i}$    

Let's find a distribution that minimizes the KL divergence. We add a Lagrange coefficient $ \lambda_1$ to make sure that $ p$ shall be a correct probability distribution (i.e. $ \sum_i p_i=1$) and $ \lambda_2$ for $ q$.

$\displaystyle E=D(p\vert\vert q)+\lambda(1-\sum_ip_i)=\displaystyle \sum_i p_i \log \frac{p_i}{q_i} + \lambda_1(1-\sum_ip_i) + \lambda_2(1-\sum_iq_i)$    

Let's set the partial derivative with respect to the $ p_i$ to zero:
$\displaystyle \frac{\partial E}{\partial p_i}$ $\displaystyle =$ $\displaystyle p_i\cdot \frac{1}{\frac{p_i}{q_i}}
\cdot \frac{1}{q_i} + \log \frac{p_i}{q_i} - \lambda_1$  
  $\displaystyle =$ $\displaystyle \log p_i - \log q_i + 1 - \lambda_1 = 0$  

Now we solve $ p_i$:
$\displaystyle p_i$ $\displaystyle =$ $\displaystyle q_i \cdot e^{\lambda_1-1}$  

Let's calculate the partial derivative with respect to $ \lambda_1$:
$\displaystyle \frac{\partial E}{\partial \lambda_1}$ $\displaystyle =$ $\displaystyle 1-\sum_i p_i=0$  
$\displaystyle \Rightarrow \sum_i p_i$ $\displaystyle =$ $\displaystyle 1$  

A similar condition is obtained for $ q_i$ when derivating with respect to $ \lambda_2$ (which was exactly the purpose of the multipliers). The last condition is obtained by derivating with respect to $ q_i$:
$\displaystyle \frac{\partial E}{\partial q_i}$ $\displaystyle =$ $\displaystyle p_i \cdot \frac{1}{\frac{p_i}{q_i}}
\cdot p_i \cdot (-\frac{1}{q_i^2})-\lambda_2
= - \frac{p_i}{q_i} - \lambda_2 = 0$  
$\displaystyle \Leftrightarrow p_i$ $\displaystyle =$ $\displaystyle - \lambda_2 q_i$  

Because both $ q$ and $ p$ should sum up to one, we get:
$\displaystyle 1$ $\displaystyle =$ $\displaystyle \sum_i p_i = \sum_i (-\lambda_2 q_i) = -\lambda_2 \sum_i q_i =
$\displaystyle \Rightarrow p_i$ $\displaystyle =$ $\displaystyle -\lambda_2 q_i = q_i$  

Considering the second order derivates we can make sure that this is really the minimum and not maximum:
$\displaystyle \frac{\partial^2 E}{\partial p_i \partial p_i}$ $\displaystyle =$ $\displaystyle \frac1{p_i}>0$  
$\displaystyle \frac{\partial^2 E}{\partial q_i \partial q_i}$ $\displaystyle =$ $\displaystyle \frac{p_i}{q_i^2}>0$  
$\displaystyle \frac{\partial^2 E}{\partial p_i \partial p_j} =
\frac{\partial^2 E}{\partial q_i \partial q_j}$ $\displaystyle =$ 0  

If we set $ q_i = p_i$ to the formula of KL divergence we get the divergence of zero. So KL divergence is zero if and only if the distributions $ q$ and $ p$ are equal, otherwise greater than zero.

Information radius

The definition of the information radius is

$\displaystyle IRad(p,q)=D(p\vert\vert\frac{p+q}2) + D(q\vert\vert\frac{p+q}2)$    

We just calculated that the KL divergence is zero if the distributions are same, and larger than zero if not. In the case of the information radius, the zero divergence is also obtained if and only if $ q_i = p_i$:

$\displaystyle IRad(p,q)=\sum_i p_i \log\frac{p_i}{\frac{p_i+p_i}2} + \sum_i p_i \log\frac{p_i}{\frac{p_i+p_i}2} = 0$    

So the condition is the same as before.

$ L_1$ norm

Definition of the $ L_1$ norm is

$\displaystyle L_1(p,q)=\sum_i \vert p_i - q_i \vert$    

Clearly the smallest value is zero, which comes only if $ q_i = p_i$.

To conclude, we notice that all the measures give zero distance with the same condition: The distributions must be equal.


Kullback-Leibler -divergence

Let's look at the definition once more:

$\displaystyle D(p\vert\vert q)=\displaystyle \sum_i p_i \log \frac{p_i}{q_i}$    

We can see that if $ q_i=0$ when $ p_i \ne 0$ we get the distance $ \infty$.

Information radius

Let's write the definition of information radius open:

$\displaystyle IRad(p,q)=D(p\vert\vert\frac{p+q}2) + D(q\vert\vert\frac{p+q}2) = \sum_i p_i \log \frac{2p_i}{p_i+q_i} +\sum_i q_i \log \frac{2q_i}{p_i+q_i}$    

With intuition we might guess that a suitable distribution would be one where the distributions are in completely separate areas:
$\displaystyle \textrm{if } p_i > 0 \Rightarrow q_i=0$      
$\displaystyle \textrm{if } q_i > 0 \Rightarrow p_i=0$      

Let's insert these to the equation:
$\displaystyle IRad(p,q)$ $\displaystyle =$ $\displaystyle \sum_i p_i \log \frac{2p_i}{p_i} +\sum_i q_i \log
  $\displaystyle =$ $\displaystyle \log 2 \sum_i p_i + \log2 \sum_i q_i = 2 \log 2$  

We knew that this was the largest distance. To prove that it really is, and that the guessed conditions are required to get it, would be somewhat more diffcult.

$ L_1$ norm

The definition for the $ L_1$ norm was

$\displaystyle L_1(p,q)=\sum_i \vert p_i - q_i \vert$    

With intuition we could say that the answer is the same as with information radius, but let's try to prove it more mathematically. We separate the elementary events $ I$ to two sets. In set $ j \in I$ were have the cases where $ p_j > q_j$ and in set $ k \in I$ the cases where $ q_k >p_k$. Using these,
$\displaystyle L_1(p,q)$ $\displaystyle =$ $\displaystyle \sum_j (p_j - q_j ) + \sum_k (q_k-p_k)$  
  $\displaystyle =$ $\displaystyle \sum_j p_j - \sum_k p_k + \sum_k q_k - \sum_j q_j$  

As the probabilities are positive and sum up to one, the largest distance is get when
$\displaystyle \textrm{if } p_i > 0 \Rightarrow q_i=0$      
$\displaystyle \textrm{if } q_i > 0 \Rightarrow p_i=0$      

so the distance is
$\displaystyle L_1(p,q)$ $\displaystyle =$ $\displaystyle \sum_i p_i + \sum_i q_i =2$  


For both information radius and $ L_1$ norm, the same conditions for the distributions are required to get the largest distance. The KL divergence, however, goes to infinity already when the distribution $ q$ is zero somewhere where the reference distribution $ p$ is not.