Interactive Simulated Annealing
Posted on April 7, 2025

The Problem
There exist many problems where finding the local, and in some cases global, minima of a function is desirable - most commonly in optimization problems such as the traveling salesman problem (TSP). Computing these minima exactly, however, often requires traversing the entire state-space, which can be extremely difficult (read: NP-hard) to do efficiently. For some classes of problem, though, an approximate solution suffices. How might we find one?
The Metropolis Algorithm
Let us take a break from finding global minima and focus on a closely related problem. Say we wish for a method to sample from some arbitrary probability distribution p(x). In many cases we can evaluate p(x) (possibly only up to a normalizing constant), but we do not know how to draw samples x from it. For a (trivial) example, let \[p(x) \propto \frac{3}{8} e^{-x^2} + \frac{1}{4} e^{-(x - 4)^2} + \frac{1}{4} e^{-(x + 2)^2}\]
This is the thinking behind the Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller & Teller, 1953), originally proposed in 1953, though many other variations exist.
The idea is to define a Markov chain over possible values of x whose unique stationary distribution, reached asymptotically, is p(x).
We start with the detailed balance condition representing this requirement:
\[ p(x' | x)\ p(x) = p(x | x')\ p(x') \]
\[ \frac{p(x' | x)}{p(x | x')} = \frac{p(x')}{p(x)} \]
We then split this into a proposal step and an acceptance/rejection step, where q(x′|x) represents the conditional probability of proposing x′ given the current state x, and A(x′, x) is the probability of accepting said proposal:
\[ p(x' | x) = q(x' | x)\ A(x', x) \]
The standard Metropolis choice, which maximizes the acceptance rate while respecting detailed balance, is
\[ A(x',x)=\min\left( 1, \frac{p(x')}{p(x)} \times \frac{q(x | x')}{q(x' | x)} \right) \]
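As a quick sanity check, we can verify that this choice of A satisfies detailed balance. Multiplying through,
\[ p(x)\, q(x' | x)\, A(x', x) = \min\bigl( p(x)\, q(x' | x),\ p(x')\, q(x | x') \bigr) \]
and the right-hand side is symmetric under swapping x and x′, so it also equals p(x′) q(x | x′) A(x, x′) - which is exactly the detailed balance condition.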
As an algorithm, it runs in a couple of relatively simple steps (a minimal code sketch follows the list):
- Pick some initial state x0 (typically random) and let t = 0;
- Generate a random candidate x′ according to q(x′|xt);
- Calculate the acceptance probability A(x′, xt);
- Accept the candidate if random(0, 1) ≤ A(x′, xt) and set xt+1 = x′, otherwise reject it (xt+1 = xt);
- Increment t and repeat from the candidate-generation step.
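To make that concrete, here is a minimal sketch in Python targeting the example p(x) from above. The Gaussian proposal and its step size are illustrative assumptions of mine, not part of the algorithm itself; because this proposal is symmetric (q(x′|x) = q(x|x′)), the q ratio cancels out of A.

```python
import math
import random

def p(x):
    # The example target density from above. An unnormalized density
    # is fine: only the ratio p(x') / p(x) appears in A.
    return (3/8 * math.exp(-x**2)
            + 1/4 * math.exp(-(x - 4)**2)
            + 1/4 * math.exp(-(x + 2)**2))

def metropolis(p, x0=0.0, n_samples=10_000, step=1.0):
    x = x0
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal: q(x'|x) = q(x|x'), so A
        # reduces to min(1, p(x') / p(x)).
        x_new = x + random.gauss(0, step)
        a = min(1.0, p(x_new) / p(x))
        if random.random() <= a:
            x = x_new
        # The current state is recorded whether or not we accepted.
        samples.append(x)
    return samples

samples = metropolis(p)
print(f"mean of samples: {sum(samples) / len(samples):.3f}")
```

A histogram of the returned samples should gradually reproduce the three bumps of p(x).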
[Interactive demo: a live Metropolis sampler showing the current position, the fraction of accepted moves, and the number of samples collected.]
You can play with this yourself here - observe how the sampled distribution gradually builds up the shape of p(x).
Simulated Annealing
It should now hopefully be clearer how we might approach a minimization problem like the one presented at the beginning - we finally have the context to pursue simulated annealing. Simulated annealing was initially devised by analogy to the physical process of (you guessed it) annealing metals! As such, it revolves around the idea that heating and cooling affect both the temperature and the free energy of a given system.
We can derive it by modifying our Metropolis algorithm in two key ways:
- introduce a temperature T along with a cooling schedule which decreases T, typically exponentially (for example, multiplying T by a fixed constant α ∈ (0, 1) at each step); other cooling schedules exist, though they may not be optimal;
- introduce an energy function E(x) which we wish to minimize, in place of the probability function, with p(x) ∝ e−E(x)/T - this aligns with the Boltzmann distribution.
Then \[ A(x',x)= \min \left( 1,\frac{e^{-E(x')/T}}{e^{-E(x)/T}} \right) = \min\left(1, e^{-(E(x') - E(x))/T}\right) \]
which is exactly the acceptance probability for simulated annealing. Here temperature serves as an analog for risk - at high temperatures, the algorithm is more likely to accept an uphill move to a higher-energy state. As the temperature decreases, the algorithm becomes less and less willing to take an uphill move in the “hope” of finding a new global minimum beyond it, reducing to plain hill-climbing as T → 0. It also turns out that this is a relatively good (barring some objections) way to minimize some arbitrary function - which is really nice if we happen to have a problem which is difficult to solve exactly.
The algorithm for simulated annealing is quite similar to what we’ve already seen with the Metropolis algorithm (shortened for simplicity here; a code sketch follows the list):
- Pick some initial state x ← x0 (typically random);
- For k = 0 through kmax:
  - T ← schedule(k);
  - x′ ← neighbor(x);
  - if A(x′, x) ≥ random(0, 1), set x ← x′;
- Return x.
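To make this concrete, here is a minimal sketch in Python. The bumpy one-dimensional energy function, the exponential schedule (T0 = 1, α = 0.99), and the Gaussian neighbor function are all illustrative assumptions of mine - problem-specific choices can be slotted in instead.

```python
import math
import random

def energy(x):
    # Illustrative energy function: a parabola with many local minima.
    return x**2 + 10 * math.sin(3 * x)

def neighbor(x, step=0.5):
    # Illustrative neighbor function: a small Gaussian perturbation.
    return x + random.gauss(0, step)

def schedule(k, t0=1.0, alpha=0.99):
    # Exponential cooling: T decays by a factor of alpha each step.
    return t0 * alpha**k

def simulated_annealing(x0=0.0, k_max=5_000):
    x = x0
    for k in range(k_max):
        t = schedule(k)
        x_new = neighbor(x)
        d_e = energy(x_new) - energy(x)
        # Accept downhill moves unconditionally; accept uphill moves
        # with probability e^(-dE/T). Checking d_e <= 0 first also
        # avoids overflowing exp() once T becomes very small.
        if d_e <= 0 or random.random() < math.exp(-d_e / t):
            x = x_new
    return x

x = simulated_annealing()
print(f"x = {x:.3f}, E(x) = {energy(x):.3f}")
```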
[Interactive demo: a live simulated annealing run showing the current position, the temperature, and the fraction of accepted moves.]
We can also apply this approach quite effectively to similar types of problem where we aim to minimize some total energy function - the TSP from the introduction being a classic example, as sketched below.
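For the TSP, the energy can be the total tour length, and a neighbor can be the same tour with one segment reversed (a 2-opt style move). This fragment is my own illustration, with made-up random cities, showing how those pieces would plug into the loop above:

```python
import math
import random

def tour_length(tour, cities):
    # Energy: total length of the closed tour.
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def reverse_segment(tour):
    # Neighbor: reverse a randomly chosen segment (a 2-opt style move).
    i, j = sorted(random.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

# Made-up instance: ten random cities in the unit square.
cities = [(random.random(), random.random()) for _ in range(10)]
tour = list(range(len(cities)))
print(f"initial tour length: {tour_length(tour, cities):.3f}")
```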
In conclusion, I hope that you learnt something!
Note that, to keep this introduction accessible, I have completely neglected a whole ton of rigour about the requirements on, and the types of, systems that can be represented this way - and why. I would recommend reading the original paper, as well as X, if you are further interested.
References
https://blog.djnavarro.net/posts/2023-04-12_metropolis-hastings/
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087–1092.