T-61.263 Advanced course in neural computing

Solutions for exercise 9


    1. We start with the notion that a neuron $ j$ flips from state $ x_j$ to $ -x_j$ at temperature $ T$ with probability

      $\displaystyle P(x_j \rightarrow -x_j) = \frac{1}{1+\exp(-\Delta E_j/T)}$ (1)

      where $ \Delta E_j$ is the energy difference resulting from such a flip. The energy function of the Boltzmann machine is defined by

      $\displaystyle E = -\frac{1}{2} {\sum_i \sum_j}_{i \neq j} w_{ji} x_i x_j \;.
$

      Hence the energy change produced by neuron $ j$ flipping from state $ x_j$ to $ -x_j$ is

      \begin{multline}
\Delta E_j = (\text{energy with neuron $j$\ in state $x_j$}) -...
... \sum_i w_{ji} x_i \\
= -2 x_j \sum_i w_{ji} x_i
= -2 x_j v_j
\end{multline}

      where $ v_j$ is the induced local field of neuron $ j$.

    2. In light of the result in Eq.(2), we may rewrite Eq.(1) as

      $\displaystyle P(x_j \rightarrow -x_j) = \frac{1}{1 + \exp(2x_j v_j/T)} \;.
$

      This means that for an initial state $ x_j = -1$, the probability that neuron $ j$ is flipped into state $ +1$ is

      $\displaystyle \frac{1}{1+\exp(-2v_j/T)} \;.$ (2)

    3. For an initial state of $ x_j = +1$, the probability that neuron $ j$ is flipped into state $ -1$ is

      $\displaystyle \frac{1}{1+\exp(+2v_j/T)} = 1-\frac{1}{1+\exp(-2v_j/T)} \;.$ (3)

      The flipping probability in Eq.(4) and the one in Eq.(3) are in perfect agreement with the following probabilistic rule:

      $\displaystyle x_j = \left\{ \begin{array}{l}
+1 \; \text{with probability}\; P(v_j) \\
-1 \; \text{with probability}\; 1-P(v_j) \\
\end{array} \right.
$

      where $ P(v_j)$ is itself defined by

      $\displaystyle P(v_j) = \frac{1}{1+\exp(-2v_j/T)} \;.
$

  1. The Boltzmann machine and sigmoid belief network share a common feature: they are both stochastic machines with their theory rooted in statistical mechanics.

    They differ from each other in the following respects:

  2. Writing the system of $ N$ simultaneous equations (Haykin, Eq. 12.22) in matrix form:

    $\displaystyle \mathbf{J}^{\mu} = \mathbf{C}(\mu) + \gamma \mathbf{P}(\mu) \mathbf{J}^{\mu}$ (4)

    where

    $\displaystyle \mathbf{J}^{\mu} = [J^{\mu}(1), J^{\mu}(2), \ldots, J^{\mu}(N)]^T
$

    $\displaystyle \mathbf{c}(\mu) = [c(1,\mu), c(2,\mu), \ldots, c(N,\mu)]^T
$

    $\displaystyle \mathbf{P}(\mu) = \left[ \begin{array}{cccc}
p_{11}(\mu) & p_{12}...
...
p_{N1}(\mu) & p_{N2}(\mu) & \ldots & p_{NN}(\mu) \\
\end{array} \right] \;.
$

    Rearranging terms in Eq.(5):

    $\displaystyle (\mathbf{I} - \gamma \mathbf{P}(\mu)) \mathbf{J}^{\mu} = \mathbf{c}(\mu)
$

    where $ \mathbf{I}$ is the $ N$-by-$ N$ identity matrix. For the solution $ \mathbf{J}^{\mu}$ to be unique we require that the $ N$-by-$ N$ matrix $ (\mathbf{I}-\gamma\mathbf{P}(\mu))$ have an inverse matrix for all possible values of the discount factor $ \gamma$.

  3. An important property of dynamic programming is the monotonicity property described by

    $\displaystyle J^{\mu_{n+1}} \le J^{\mu_n} \;.
$

    This property follows from the fact that if the terminal cost $ g_k$ for $ K$ stages is changed to a uniformly larger cost $ \bar g_k$, that is,

    $\displaystyle \bar g_k(x_k) \ge g_k(x_k) \;$for all$\displaystyle \; x_k
$

    then the last stage cost-to-go function $ J_{K-1}(x_{K-1})$ will be uniformly increased. In more general terms, we may state the following. Given two cost-to-go functions $ J_{K+1}$ and $ \bar J_{K+1}$ with

    $\displaystyle \bar J_{K+1}(x_{K+1}) \ge J_{K+1}(x_{K+1})\;$for all$\displaystyle \; x_{K+1}
$

    we find that for all $ x_K$ and $ \mu_K$ the following relation holds:

    $\displaystyle E[ g_K(x_K,\mu_K,x_{K+1}) + J_{K+1}(x_{K+1})] \le E[g_K(x_K,\mu_K,x_{K+1}) + \bar J_{K+1}(x_{K+1})] \;.
$

    This relation merely restates the monotonicity property of the dynamic programming algorithm.



Jaakko Peltonen
2003-11-21