Hopfield networks
Introduction
Network with $N$ neurons:
- no self-loops
- symmetric $w_{ij} = w_{ji}$
Each neuron has a value $y_i = \pm 1$ and, at any timestep, receives a field: \begin{equation} \label{eq:field} \sum_{j \neq i} w_{ji}y_j + b_i \end{equation}
Behavior:
- If the sign of the field matches the neuron's own sign, it doesn't respond
- If the sign of the field opposes its own sign, it “flips” to match the sign of the field
Convergence
Say a neuron $i$ has value $y_i^-$ before some timestep and $y_i^+$ after. Look at: \begin{equation*} y_i^+ \left( \sum_{j \neq i} w_{ji}y_j + b_i \right) - y_i^- \left( \sum_{j \neq i} w_{ji}y_j + b_i \right) \end{equation*} This quantity is either $0$ (no flip) or positive (a flip sets $y_i^+$ to the sign of the field, so the first term equals the field's magnitude), so every flip of a neuron is guaranteed to increase $y_i \left( \sum_{j \neq i} w_{ji}y_j + b_i \right)$. Sum this quantity over all nodes to obtain \begin{equation*} D(y_1, y_2, \dots, y_N) = \sum_i y_i \left( \sum_{j \neq i} w_{ji}y_j + b_i \right) = \sum_{i, j \neq i} w_{ij}y_i y_j + \sum_i b_i y_i \end{equation*} Every flip of a neuron results in an increase of the global quantity $D$.
$D$ is bounded above and takes only finitely many values, so any sequence of flips must converge in a finite number of steps. Introduce the energy $E$: \begin{equation*} E = -D(y_1, y_2, \dots, y_N) = - \sum_{i, j \neq i} w_{ij}y_i y_j - \sum_i b_i y_i \end{equation*} Typically we will not use a bias: it is equivalent to having a single extra neuron that is pegged to $+1$. We work with $E = -\frac{1}{2}y^TWy$ (the factor $\frac{1}{2}$ only compensates for counting each pair $(i,j)$ twice and does not change the minima).
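A minimal sketch of these dynamics in code, assuming $\pm 1$ states, symmetric weights with zero diagonal, and no bias (function names are my own, not from the notes):

```python
import numpy as np

def energy(W, y):
    """E = -1/2 * y^T W y (no bias term)."""
    return -0.5 * y @ W @ y

def hopfield_evolve(W, y, max_sweeps=100, rng=None):
    """Asynchronously flip any neuron whose sign disagrees with its field,
    until no neuron wants to flip (a local minimum of the energy)."""
    rng = np.random.default_rng() if rng is None else rng
    y = y.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(y)):     # random update order
            field = W[i] @ y                  # w_ii = 0, so no self-term
            if field != 0 and np.sign(field) != y[i]:
                y[i] = np.sign(field)         # flip: this lowers E
                changed = True
        if not changed:                       # stable configuration reached
            return y
    return y
```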
Content-addressable memory
The system evolves towards one of its stable configurations (local energy minima):
- spin glass analogy
- any small jitter from such a stable configuration returns it to the stable configuration.
How to store specific patterns?
Design weights $w_{ij}$ s.t. the energy has a local minimum at the desired pattern $y = \{ y_i \}$. For a single pattern, the Hebbian learning rule $w_{ij} = y_i y_j$ makes $y$ stable.
Hebbian learning rule for storing $N_p$ patterns $y^p = \{ y^p_i \}$: \begin{equation} \label{eq:hebbian_weights} w_{ij} = \sum_{p} y_i^p y_j^p \end{equation} From \ref{eq:field}, a pattern $y^p$ is stable $\iff$: \begin{equation*} y_i^p \sum_j w_{ij}y_j^p > 0, \quad \forall i \end{equation*} Using \ref{eq:hebbian_weights}, dividing through by $N$, and separating out the $k = p$ term (which contributes $\approx 1$ since $(y_i^p)^2 = (y_j^p)^2 = 1$), the above becomes: \begin{equation*} y_i^p \frac{1}{N} \sum_j \sum_{k \neq p} y_i^k y_j^k y_j^p > -1, \quad \forall i \end{equation*}
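A concrete sketch of the Hebbian storage rule, assuming the patterns are stored as rows of a $\pm 1$ matrix (the $1/N$ scaling only rescales the field and is a common convention, not something the condition above depends on):

```python
import numpy as np

def hebbian_weights(patterns):
    """patterns: (N_p, N) array of +/-1 patterns.
    Returns W with w_ij proportional to sum_p y_i^p y_j^p and no self-loops."""
    N = patterns.shape[1]
    W = (patterns.T @ patterns) / N   # Hebbian outer-product rule
    np.fill_diagonal(W, 0.0)          # no self-loops
    return W
```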
How to store $K$ orthogonal patterns?
$y_p$ is stored if $\text{sign}{(Wy_p)} = y_p$; we want this to hold for all targets $y_p$ and for no other patterns.
Hebbian Learning
Described above: with no self-loops, $W = YY^T - N_pI$; since $y_i^2 = 1$, the diagonal term only shifts the energy by a constant, so we can work with $W = YY^T$.
Geometric Interpretation
We are optimizing a quadratic energy function over the lattice of $\pm 1$ vectors. Notice that the $\text{sign}{(Wy_p)} = y_p$ condition clearly holds if we choose $W$ such that $y_p$ is an eigenvector with a positive eigenvalue. This is what Hebbian learning does in our case (we assumed orthogonal target patterns).
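A quick numerical check of this claim (my own example, using rows of a small Hadamard matrix as orthogonal $\pm 1$ patterns):

```python
import numpy as np

# Two orthogonal +/-1 patterns of length 4: rows of a 4x4 Hadamard matrix
H2 = np.array([[1, 1], [1, -1]])
H4 = np.kron(H2, H2)
Y = H4[:2]                          # targets, shape (N_p, N)

W = Y.T @ Y                         # Hebbian weights (patterns stored as rows)
for y_p in Y:
    # each target is an eigenvector of W with eigenvalue N = 4,
    # so sign(W y_p) = y_p holds
    assert np.array_equal(np.sign(W @ y_p), y_p)
```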
It doesn't fully work out in the Hebbian learning case, because the Hebbian rule gives all the target patterns the same eigenvalue ($\lambda = N$ for $W = YY^T$, since $y_p^Ty_p = N$). So not just the target patterns are eigenvectors of $W$ with a positive eigenvalue, but any vector in the subspace spanned by the targets $\{y_p\}$. This is a problem: the network effectively stores any pattern in that subspace, so there are many more spurious memories than useful information.
This failure gave us an insight though: we need to worry more about not storing spurious memories.
Optimization Approach
We do so by taking an optimization-based approach and choosing a good objective: \begin{equation*} \hat{W} = \arg\min_{W} \left( \sum_{y\in Y_p} E(y) - \sum_{y {\not\in} Y_p} E(y) \right) \end{equation*} We added the second term to push non-target patterns up! Since $E(y) = -\frac{1}{2}y^TWy$, gradient descent on this objective gives the update (absorbing constant factors into $\eta$): \begin{equation*} W \leftarrow W + \eta \left( \sum_{y\in Y_p} yy^T - \sum_{y {\not\in} Y_p} yy^T \right) \end{equation*} This isn't quite it either:
- the second term is a sum over exponentially many terms
- we only need to raise spurious minima, not all patterns that aren't targets
- even better: raise the neighborhood of a target to make that pattern a valley!
This suggests the following procedure (sketched in code below):
- initialize $W$
- until convergence:
  - sample a target pattern $y_p$
  - initialize the network at $y_p$ and let it evolve for a few steps (2-4), ending up at $y_d$
  - update $W \leftarrow W + \eta (y_py_p^T - y_dy_d^T)$
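A rough code sketch of this loop, under the same assumptions as before ($\pm 1$ states, no bias; function and variable names are my own):

```python
import numpy as np

def train_hopfield(patterns, epochs=500, eta=0.01, evolve_steps=3, seed=0):
    """patterns: (N_p, N) array of +/-1 targets.
    Lower the energy at each target and raise it at the nearby state y_d the
    network actually falls into, so the target becomes a valley."""
    rng = np.random.default_rng(seed)
    N = patterns.shape[1]
    W = np.zeros((N, N))
    for _ in range(epochs):
        y_p = patterns[rng.integers(len(patterns))]   # sample a target pattern
        y = y_p.astype(float)
        for _ in range(evolve_steps):                 # evolve for a few steps
            for i in rng.permutation(N):
                y[i] = 1.0 if W[i] @ y >= 0 else -1.0
        W += eta * (np.outer(y_p, y_p) - np.outer(y, y))
        np.fill_diagonal(W, 0.0)                      # keep no self-loops
    return W
```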
Restricted Boltzmann machine
Stochastic Hopfield Networks
Let $z_i = \frac{1}{T}\sum_j w_{ij}y_j$. Then the value of $y_i$ is drawn according to the logistic function: \begin{equation*} p(y_i = 1) = \sigma(z_i) \end{equation*} The probability of a state $S$ is given by the Boltzmann distribution: \begin{equation*} p(S) \propto \exp\left(-\frac{E(S)}{T}\right) \end{equation*}
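A minimal sketch of one sweep of these stochastic updates, assuming $\pm 1$ states and no bias (`gibbs_sweep` is my own name for it):

```python
import numpy as np

def gibbs_sweep(W, y, T=1.0, rng=None):
    """One sweep of stochastic updates: each neuron is set to +1 with
    probability sigma(z_i), where z_i = (1/T) * sum_j w_ij * y_j."""
    rng = np.random.default_rng() if rng is None else rng
    y = y.copy()
    for i in rng.permutation(len(y)):
        z = (W[i] @ y) / T
        p = 1.0 / (1.0 + np.exp(-z))          # sigma(z_i)
        y[i] = 1 if rng.random() < p else -1
    return y
```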
Inference
Running inference is quite different from deterministic Hopfield networks: we need to simulate multiple states (i.e. draw from $P(S)$). We randomly initialize the network, let it evolve for $L$ iterations, and look at the final few. Then we compute \begin{equation*} y = \frac{1}{M}\sum_{t=L-M+1}^{L} y_t \end{equation*} Finally, treating this as a probability (after mapping $\pm 1$ values to $\{0, 1\}$), we assign bit $i$ to 1 $\iff y_i > 0.5$.
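A sketch of this inference procedure, reusing the `gibbs_sweep` sketch above (mapping the $\pm 1$ values to $\{0,1\}$ before thresholding is my reading of the averaging step):

```python
import numpy as np

def stochastic_inference(W, y0, L=500, M=100, T=1.0, seed=0):
    """Evolve the network for L sweeps, average the last M states, and
    threshold: bit i is set to +1 iff its averaged 'probability' exceeds 0.5."""
    rng = np.random.default_rng(seed)
    y = y0.copy()
    tail = []
    for t in range(L):
        y = gibbs_sweep(W, y, T=T, rng=rng)
        if t >= L - M:
            tail.append((y + 1) / 2)          # map {-1,+1} -> {0,1}
    avg = np.mean(tail, axis=0)
    return np.where(avg > 0.5, 1, -1)
```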
Training
Training is very similar to deterministic Hopfield networks. We want to do maximum likelihood training with (the average is over the training patterns $s$): \begin{equation*} \mathbb{E}[\log{p}] = \frac{1}{N} \sum_s \log{p(s)} = \frac{1}{N} \sum_s \left( \sum_{i, j} w_{ij}s_is_j \right) - \log \left( \sum_{s'} \exp\left(\sum_{i,j} w_{ij}s_i's_j'\right) \right) \end{equation*} The gradient of this objective with respect to $w_{ij}$ is \begin{equation*} \frac{1}{N}\sum_s s_i s_j - \sum_{s'} \frac{\exp\left(\sum_{k,l} w_{kl}s_k's_l'\right)}{\sum_{s''}\exp\left(\sum_{k,l} w_{kl}s_k''s_l''\right)}\, s_i's_j' \end{equation*}
- The first term is the expected value of $s_is_j$ over the target patterns.
- The second term is $\sum_{s'}p(s')s_i's_j'$ and is intractable to compute exactly. We estimate it by sampling: $\frac{1}{M} \sum_{s' \in S_{\text{simulated}}}s_i's_j'$, where $S_{\text{simulated}}$ is a set of $M$ states simulated from the model (see the sketch below).
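A sketch of this gradient estimate in code: the positive phase averages over the training patterns, the negative phase over model samples (which could be produced with the `gibbs_sweep` sketch above); names are my own:

```python
import numpy as np

def weight_gradient(data_states, simulated_states):
    """data_states: (num_patterns, N) array of +/-1 training states.
    simulated_states: (M, N) array of +/-1 states sampled from the model.
    Returns the estimated gradient <s_i s_j>_data - <s_i s_j>_model."""
    pos = (data_states.T @ data_states) / len(data_states)                 # data correlations
    neg = (simulated_states.T @ simulated_states) / len(simulated_states)  # model correlations
    grad = pos - neg
    np.fill_diagonal(grad, 0.0)   # keep no self-loops
    return grad

# Gradient ascent on the log-likelihood: W <- W + eta * weight_gradient(...)
```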
Boltzmann machine
Introduce hidden units to increase the network's capacity, and augment the training dataset to contain target states $S = (U, H)$, where the hidden patterns $H$ are drawn from the conditional distribution given the visible pattern $U$. Do this by clamping the visible neurons to $U$ and simulating a few hidden states $H$. This is time consuming because we need to run the simulations from above to compute the second term of the gradient update rule, as well as this new simulation step to expand the training set.
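A sketch of this hidden-state simulation step, assuming the state vector is laid out as visible units followed by hidden units (the layout and the name `sample_hidden` are my own assumptions):

```python
import numpy as np

def sample_hidden(W, visible, n_hidden, sweeps=5, T=1.0, seed=0):
    """Clamp the visible units to the given pattern and run stochastic updates
    over the hidden units only; returns one sampled hidden configuration H."""
    rng = np.random.default_rng(seed)
    s = np.concatenate([visible, rng.choice([-1, 1], size=n_hidden)])
    hidden_idx = np.arange(len(visible), len(s))   # indices of hidden units
    for _ in range(sweeps):
        for i in rng.permutation(hidden_idx):
            z = (W[i] @ s) / T
            s[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-z)) else -1
    return s[hidden_idx]
```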