Probability distributions are inherently exponentially-sized objects; we need to make assumptions about their structure to work with them.
Naive Bayes assumption: conditional independence among the variables
Three major parts in this course:
Representation: how to specify a (tractable) model?
Inference: given a probabilistic model, how do we determine the marginal or conditional probabilities of certain events?
marginal inference: what is the probability of a given variable in the model?
maximum a posteriori (MAP) inference: most likely assignment of variables given the data.
Learning: fitting a model to a dataset. Learning and inference are inherently linked, as inference is a key subroutine that we repeatedly call within learning algorithms.
Applications:
Image generation (see Radford et al.). Sampling new images from a learned probability distribution $\hat{x} \sim p(x)$ that assigns high probability to images that resemble the ones in the training set.
In-painting: given a patch, sample from $p(\text{image} \vert \text{patch})$ to complete the image.
Image denoising: given an image corrupted by noise, model the posterior distribution $p(\text{original image} \vert \text{noisy image})$ and sample/use exact inference to predict the original image.
Language models:
generation: sample from a probability distribution over sequences of words or characters.
machine translation: $p(\text{target language sentence}\vert \text{source language sentence})$
Audio models:
upsampling or super-resolution: increase the resolution of an audio signal by inferring signal values at intermediate points. Sample from / perform inference on $p(\text{intermediate values}\vert \text{observed values})$
speech synthesis
speech recognition
Sidenotes
Bayesian networks are directed graphical models in which each factor depends only on a small number of ancestor variables:
$p(x_i\vert x_{i-1}, \dots, x_1) = p(x_i \vert x_{A_i})$
where $A_i$ is a subset of $\{x_1, \dots, x_{i-1}\}$.
When the variables are discrete, we may think of the possible values of $p(x_i \vert x_{A_i})$ as probability tables. If each variable takes $d$ values and has at most $k$ ancestors, the entire table will contain at most $\mathcal{O}(d^{k+1})$ entries. With one table per variable, the entire probability distribution can be compactly described with only $\mathcal{O}(nd^{k+1})$ parameters, compared to $\mathcal{O}(d^{n})$ for the naive approach.
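As a quick sanity check on this counting argument, a minimal sketch (the values $n=10$, $d=3$, $k=2$ are hypothetical, chosen only for illustration):

```python
# Parameter counts for a discrete Bayesian network vs. the naive joint table.
# Hypothetical sizes: n variables, d values each, at most k parents per node.
n, d, k = 10, 3, 2

naive_params = d ** n          # one entry per joint assignment
bn_params = n * d ** (k + 1)   # one CPD table of size d^(k+1) per variable

print(naive_params, bn_params)  # 59049 270
```

Even for this tiny example, the factored representation is two orders of magnitude smaller.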
Edges indicate dependency relationships between a node $x_i$ and its ancestors $A_i$.
A Bayesian network is a directed graph $G = (V,E)$ together with:
A probability $p$ factorizes over a DAG (directed acyclic graph) $G$ if it can be decomposed into a product of factors, as specified by $G$.
We can show by counter-example that when $G$ contains cycles, its associated probability may not sum to one.
Let $I(p)$ be the set of all independencies that hold for a joint distribution $p$. E.g., if $p(x,y) = p(x)p(y)$, then $x\perp y \in I(p)$
Independencies can be recovered from the graph by looking at three types of structures:
Common parent:
if $G$ is of the form $A \leftarrow B \rightarrow C$, and $B$ is observed, then $A \perp C \vert B$ ($A$ is independent of $C$ given $B$).
However, if $B$ is unobserved then $A \not\perp C$.
Intuitively, $B$ contains all the information that determines the outcomes of $A$ and $C$; once it is observed, nothing else affects their outcomes: knowing the value of $A$ tells us nothing further about $C$, and vice versa.
Cascade:
if $G$ equals $A \rightarrow B \rightarrow C$ and $B$ is observed, then $A \perp C \vert B$.
However, if $B$ is unobserved, then $A \not\perp C$.
V-structure:
if $G$ is $A\rightarrow C \leftarrow B$, then knowing $C$ couples $A$ and $B$. I.e., $A \perp B$ if $C$ is unobserved, but $A \not\perp B\vert C$ if $C$ is observed.
E.g. suppose $C$ is true if the lawn is wet and false otherwise. $A$ (it rained) and $B$ (the sprinkler turned on) are two explanations for it being wet. If we know that $C$ is true (grass is wet) and $B$ is false (the sprinkler didn't go on), then $p(A) = 1$ ($A$ is the only remaining explanation). Hence, $A$ and $B$ are not independent given $C$.
We can extend these structures by applying them recursively to any larger Bayesian net. This leads to a notion called d-separation (where d stands for directed).
Let $Q$, $W$ and $O$ be three sets of nodes in a Bayesian Network $G$. We say that $Q$ and $W$ are d-separated given $O$ (i.e. the nodes in $O$ are observed) if $Q$ and $W$ are not connected by an active path. An undirected path in $G$ is called active given observed variables $O$ if for every consecutive triple of variables $X, Y, Z$ on the path, one of the following holds:
$X\leftarrow Y \leftarrow Z$ and $Y$ is unobserved $Y\notin O$
$X\rightarrow Y \rightarrow Z$ and $Y$ is unobserved $Y\notin O$
$X\leftarrow Y \rightarrow Z$ and $Y$ is unobserved $Y\notin O$
$X\rightarrow Y \leftarrow Z$ and $Y$ or any of its descendants are observed.
(i.e. there is pairwise dependency between consecutive variables on the path)
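The active-path definition above can be checked mechanically. A minimal sketch that enumerates undirected simple paths and tests every consecutive triple exactly as defined; fine for tiny graphs, though real implementations use the linear-time "Bayes ball" traversal instead (all graphs below are made-up examples):

```python
from itertools import permutations

# Edges point parent -> child. A path is active given observed variables O
# if every consecutive triple satisfies one of the four cases above.
def descendants(edges, y):
    out, stack = set(), [y]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in out:
                out.add(b); stack.append(b)
    return out

def is_active(path, edges, observed):
    for x, y, z in zip(path, path[1:], path[2:]):
        if (x, y) in edges and (y, z) in edges:      # cascade x -> y -> z
            if y in observed: return False
        elif (y, x) in edges and (y, z) in edges:    # common parent x <- y -> z
            if y in observed: return False
        elif (x, y) in edges and (z, y) in edges:    # v-structure x -> y <- z
            if y not in observed and not (descendants(edges, y) & observed):
                return False
        else:                                        # cascade x <- y <- z
            if y in observed: return False
    return True

def d_separated(a, b, observed, nodes, edges):
    skeleton = {frozenset(e) for e in edges}
    for r in range(len(nodes)):                      # all simple paths a..b
        for mid in permutations(set(nodes) - {a, b}, r):
            path = (a, *mid, b)
            if all(frozenset(p) in skeleton for p in zip(path, path[1:])):
                if is_active(path, edges, observed):
                    return False
    return True

# Cascade A -> B -> C: observing B blocks the only path.
edges = {("A", "B"), ("B", "C")}
print(d_separated("A", "C", {"B"}, ["A", "B", "C"], edges))  # True
print(d_separated("A", "C", set(), ["A", "B", "C"], edges))  # False
```

The v-structure behaves the opposite way: with edges $A\rightarrow C \leftarrow B$, the pair $A, B$ is d-separated when $C$ is unobserved and d-connected when $C$ is observed.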
In the following example, $X_1$ and $X_6$ are d-separated given $X_2, X_3$ (you cannot infer $X_6$ from $X_1$ given $X_2, X_3$):
However, in the next example, $X_2, X_3$ are not d-separated given $X_1, X_6$. There is an active path which passes through the V-structure created when $X_6$ is observed:
Let $I(G)=\{(X\perp Y\vert Z): X, Y \text{ are d-separated given } Z\}$ be the set of independencies implied by d-separation in $G$.
I-map: If $p$ factorizes over $G$, then $I(G) \subseteq I(p)$. We say that $G$ is an I-map (independence map) for $p$.
The intuition is that if $X,Y$ and $Y,Z$ are mutually dependent, so are $X,Z$. Thus, we can look at adjacent nodes and propagate dependencies.
In other words, all the independencies encoded in $G$ are sound: variables that are d-separated in $G$ are truly independent in $p$. However, the converse is not true: a distribution may factorize over $G$, yet have independencies that are not captured in $G$.
Note that if $p(x,y)=p(x)p(y)$ then this distribution still factorizes over the graph $y \rightarrow x$ since we can always write it as $p(x,y)=p(x\vert y)p(y)$ where $p(x\vert y)$ does not actually vary with $y$. However, we can construct a graph that matches the structure of $p$ by simply removing that unnecessary edge.
Can directed graphs express all the independencies of any distribution $p$? More formally, given a distribution $p$, can we construct a graph $G$ such that $I(G) = I(p)$?
First, note that it is easy to construct a $G$ such that $I(G) \subseteq I(p)$: a fully connected DAG $G$ is an I-map for any distribution, since $I(G) = \emptyset$ (no two variables are d-separated in $G$, because the edge connecting them is itself an active path).
A more interesting question is whether we can find a minimal I-map. We may start with a fully connected $G$ and remove edges until $G$ is no longer an I-map (i.e. until $G$ encodes independencies that are not in $p$, so that $p$ no longer factorizes over $G$). One pruning method consists in following the natural topological ordering of the graph and removing node ancestors until this is no longer possible (see structure learning at the end of the course).
Does any probability distribution $p$ always admit a perfect map $G$ for which $I(p) = I(G)$? Unfortunately, the answer is no.
For example, consider the following distribution $p$ over three variables $X,Y,Z$ (noisy-xor example):
A related question is whether perfect maps are unique when they exist. Again, this is not the case, as $X \rightarrow Y$ and $X \leftarrow Y$ encode the same independencies, yet form different graphs. Two Bayes nets $G_1, G_2$ are I-equivalent if they encode the same independencies: $I(G_1) = I(G_2)$.
When are two Bayesian nets I-equivalent?
Each of the graphs below has the same skeleton (if we drop the directionality of the arrows, we obtain the same undirected graph).
(a), (b) and (c) are symmetric ($X\perp Y\vert Z$ but $X\not\perp Y$). They encode exactly the same independencies, and the directionality does not matter as long as we don't turn them into a V-structure (d). The V-structure is the only one that encodes the dependency $X \not\perp Y\vert Z$.
General result on I-equivalence: If $G, G'$ have the same skeleton and the same V-structures, then $I(G) = I(G')$.
Intuitively, two graphs are I-equivalent if the d-separation between variables is the same. We can flip the directionality of any edge, unless it forms a v-structure, and the d-connectivity of the graph will be unchanged. See the textbook of Koller and Friedman for a full proof in Theorem 3.7 (page 77).
In Bayesian networks, unless we want to introduce false independencies among the variables, we must fall back to a less compact representation (with additional, unnecessary edges). This leads to extra parameters in the model and makes it more difficult for the model to learn them.
Markov Random Fields (MRFs) are based on undirected graphs. They can compactly represent independence assumptions that directed models cannot.
Unlike in the directed case, we are not saying anything about how one variable is generated from another set of variables (as a conditional probability distribution would). We simply indicate a level of coupling between dependent variables in the graph. This requires less prior knowledge, as we no longer have to specify a full generative story of how certain variables are constructed from others. We simply identify dependent variables and define the strength of their interactions. This defines an energy landscape over the space of possible assignments, and we convert this energy to a probability via a normalization constant.
A Markov Random Field is a probability distribution $p$ over variables $x_1, \dots, x_n$ defined by an undirected graph $G$.
$p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c\in C}\phi_c(x_c)$
Note: we do not need to specify a factor for each clique.
Example
we are modeling preferences among $A, B, C, D$. $(A,B), (B,C), (C,D), (D,A)$ are friends and friends have similar voting preferences.
$\tilde{p}(A,B,C,D) = \phi(A,B)\phi(B,C)\phi(C,D)\phi(D,A)$
where $\phi(X,Y) = 10\text{ if }X=Y=1, 5\text{ if }X=Y=0, 1 \text{ otherwise}$.
The final probability is:
$p(A,B,C,D)=\frac{1}{Z}\tilde{p}(A,B,C,D)$
where $Z = \sum_{A,B,C,D}\tilde{p}(A,B,C,D)$
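The voting example above is small enough to evaluate by brute force; a minimal sketch that computes $\tilde{p}$, the partition function $Z$, and the normalized probability:

```python
from itertools import product

# Brute-force evaluation of the voting MRF: four binary variables in a
# cycle A - B - C - D - A, with the pairwise factor defined above.
def phi(x, y):
    if x == y == 1: return 10.0
    if x == y == 0: return 5.0
    return 1.0

def p_tilde(a, b, c, d):
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

Z = sum(p_tilde(*x) for x in product([0, 1], repeat=4))

def p(a, b, c, d):
    return p_tilde(a, b, c, d) / Z

print(Z)              # 11327.0
print(p(1, 1, 1, 1))  # 10000 / 11327, about 0.883: all voting "1" dominates
```

Note how $Z$ requires a sum over all $2^4$ joint assignments, which is exactly what makes the partition function expensive in larger models.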
In the previous example, we had a distribution over $A,B,C,D$ that satisfied $A\perp C\vert \{B, D\}$ and $B\perp D\vert \{A, C\}$ (since only friends directly influence a person's vote). We can check by counter-example that these independencies cannot be perfectly represented by a Bayesian network. However, the MRF turns out to be a perfect map for this distribution.
Advantages:
Drawbacks:
Bayesian networks are a special case of MRFs with a very specific type of clique factor (one that corresponds to a conditional probability distribution and implies a directed acyclic structure in the graph), and a normalizing constant of one.
A Bayesian network can always be converted into an undirected network with normalization constant one by adding edges between all parents of a given node and removing the directionality of all edges.
The converse is also possible, but may be intractable and may produce a very large directed graph (e.g. fully connected).
A general rule of thumb is to use Bayesian networks whenever possible and only switch to MRFs if there is no natural way to model the problem with a directed graph (like the voting example).
Variables $x$ and $y$ are dependent if they are connected by a path of unobserved variables. However, if $x$'s neighbors are all observed, then $x$ is independent of all the other variables (since they influence $x$ only via its neighbors, referred to as the Markov blanket of $x$).
If a set of observed variables forms a cut-set between two halves of the graph, then variables in one half are independent from ones in the other.
We define the Markov blanket $U$ of a variable $X$ as the minimal set of nodes such that $X$ is independent from the rest of the graph if $U$ is observed. This holds for both directed and undirected models. For undirected models, the Markov blanket is simply equal to a node's neighborhood.
Just as in the directed case, $I(G) \subseteq I(p)$, but the converse is not necessarily true. E.g.:
Special case of Markov Random Fields when they are applied to model a conditional probability distribution $p(y\vert x)$ where $x \in \mathcal{X}, y\in\mathcal{Y}$ are vector-valued variables. This common setting in supervised learning is also known as structured prediction.
A CRF is a Markov network over variables $\mathcal{X} \cup \mathcal{Y}$ which specifies a conditional distribution:
$\mathbb{P}(y\vert x) = \frac{1}{Z} \prod_{c\in C} \phi_c (x_c, y_c)$
with partition function:
$Z(x) = \sum_{y\in\mathcal{Y}} \prod_{c\in C} \phi_c (x_c, y_c)$
The partition function depends on $x$. $p(y\vert x)$ encodes a different probability function for each $x$. Therefore, a conditional random field results in an instantiation of a new Markov Random Field for each input $x$.
Recognize a word from a sequence of black-and-white character images $x_i\in[0, 1]^{d\times d}$ (pixel matrices of size $d\times d$). The output is a sequence of alphabet letters $y_i \in \{a, b, \dots, z\}$.
We could train a classifier to separately predict each $y_i$ from its $x_i$. However, since the letters together form a word, the predictions across different positions ought to inform each other.
$p(y\vert x)$ is a CRF with two types of factors:
We can jointly infer the structured label $y$ using MAP inference:
$\arg \max_y \phi(x_1, y_1)\prod_{i=2}^{n} \phi(y_{i-1}, y_i) \phi(x_i, y_i)$
In most practical applications, we assume that the factors $\phi_c(x_c, y_c)$ are of the form:
$\phi_c(x_c, y_c) = \exp(w_c^\top f_c(x_c, y_c))$
where $f_c(x_c, y_c)$ is an arbitrary set of features describing the compatibility between $x_c$ and $y_c$.
In the above example:
CRF features can be arbitrarily complex. We can define a model with factors $\phi(x, y_i) = \exp (w_i^\top f(x, y_i))$ that depend on the entire input $x$. This does not affect computational performance, since $x$ is observed (it does not change the scope of the factors, only their values).
An MRF models the joint distribution $p(x,y)$ and thus needs to fit both $p(y\vert x)$ and $p(x)$. However, if we only want to predict $y$ given $x$, modeling $p(x)$ is unnecessary, and doing so can be disadvantageous:
CRFs forgo this assumption and often perform better on prediction tasks.
Bipartite graph where one group is the variables (round nodes) in the distribution being modeled and the other group is the factors defined on these variables (square nodes). Useful to see factor dependencies between variables explicitly, and facilitates computation.
NP-hard: whether inference is tractable depends on the structure of the graph. If problem is intractable, we use approximate inference methods.
Variable elimination is an exact inference algorithm. Let $x_i$ be discrete variables taking $k$ possible values each (also extends to continuous distributions).
Chain Bayesian network: $p(x_1, \dots, x_n) = p(x_1)\prod_{i=2}^n p(x_i \vert x_{i-1})$
How to compute marginal probability $p(x_n)$ ?
Naive way is summing probability over all $k^{n-1}$ assignments:
$p(x_n)=\sum_{x_1} \dots \sum_{x_{n-1}}p(x_1, \dots, x_n)$
We can do much better by leveraging the factorization of the probability distribution:
$p(x_n) = \sum_{x_1} \dots \sum_{x_{n-1}} p(x_1) \prod_{i=2}^n p(x_i \vert x_{i-1})$
$=\sum_{x_{n-1}} p(x_n\vert x_{n-1}) \sum_{x_{n-2}} p(x_{n-1}\vert x_{n-2}) \cdots \sum_{x_1} p(x_2\vert x_1)\, p(x_1)$
$=\sum_{x_{n-1}} p(x_n\vert x_{n-1}) \sum_{x_{n-2}} p(x_{n-1}\vert x_{n-2}) \cdots \underbrace{\sum_{x_1} p(x_2\vert x_1)\, p(x_1)}_{\tau(x_2)\,=\,p(x_2)}$
Evaluating the sums inside-out, each step computes a new factor $\tau(x_{i+1}) = \sum_{x_i} p(x_{i+1}\vert x_i)\,\tau(x_i)$, until $\tau(x_n) = p(x_n)$.
Each $\tau$ has complexity $\mathcal{O}(k^2)$ and we calculate $n-1$ of them, thus the overall complexity is $\mathcal{O}(nk^2)$ (much better than $\mathcal{O}(k^n)$).
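The derivation above can be checked numerically. A sketch comparing the naive $\mathcal{O}(k^{n-1})$ summation against the $\mathcal{O}(nk^2)$ forward recursion on a randomly generated chain (the sizes $n=5$, $k=3$ are arbitrary):

```python
import random
from itertools import product

# Chain marginal p(x_n) two ways: naive summation over all assignments
# vs. the forward recursion tau(x_{i+1}) = sum_{x_i} p(x_{i+1}|x_i) tau(x_i).
random.seed(0)
n, k = 5, 3

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

p1 = normalize([random.random() for _ in range(k)])
# cond[i][a][b] = p(x_{i+2} = b | x_{i+1} = a) for the i-th edge of the chain
cond = [[normalize([random.random() for _ in range(k)]) for _ in range(k)]
        for _ in range(n - 1)]

def joint(xs):
    pr = p1[xs[0]]
    for i in range(1, n):
        pr *= cond[i - 1][xs[i - 1]][xs[i]]
    return pr

naive = [sum(joint(xs) for xs in product(range(k), repeat=n) if xs[-1] == v)
         for v in range(k)]

tau = p1
for i in range(n - 1):  # eliminate x_1, then x_2, ... inside-out
    tau = [sum(tau[a] * cond[i][a][b] for a in range(k)) for b in range(k)]

print(all(abs(x - y) < 1e-12 for x, y in zip(naive, tau)))  # True
```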
Graphical model as product of factors: $p(x_1, \dots, x_n) = \prod_{c\in C} \phi_c(x_c)$
Variable elimination algorithm (instance of dynamic programming) performs two factor operations:
Marginalizing $B$ from $\phi(A,B,C)$. For $(a_1, c_1)$: $0.25+0.08=0.33$.
Variable elimination requires an ordering over the variables according to which variables will be eliminated (e.g. ordering implied by the DAG, see example).
Let the ordering $O$ be fixed. According to $O$, for each variable $X_i$:
Multiply all factors $\Phi_i$ containing $X_i$
Marginalize out $X_i$ to obtain new factor $\tau$
replace factors $\Phi_i$ with $\tau$
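The two factor operations used by the algorithm can be sketched on binary variables, with factors stored as dictionaries keyed by assignment tuples (an illustrative representation, not a library API):

```python
from itertools import product

# Factor product and marginalization, the two primitives of variable
# elimination. Each factor is (list of variable names, {assignment: value}).
def factor_product(vars1, f1, vars2, f2):
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for asg in product([0, 1], repeat=len(out_vars)):
        env = dict(zip(out_vars, asg))
        out[asg] = (f1[tuple(env[v] for v in vars1)]
                    * f2[tuple(env[v] for v in vars2)])
    return out_vars, out

def marginalize(vars_, f, var):
    keep = [v for v in vars_ if v != var]
    out = {}
    for asg, val in f.items():
        key = tuple(a for v, a in zip(vars_, asg) if v != var)
        out[key] = out.get(key, 0.0) + val
    return keep, out

# Multiply phi(A,B) and phi(B,C), then eliminate B to get tau(A,C).
fab = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.7}
fbc = {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.4, (1, 1): 0.9}
pv, pf = factor_product(["A", "B"], fab, ["B", "C"], fbc)
mv, mf = marginalize(pv, pf, "B")
print(mv)  # ['A', 'C']; tau(A=0, C=0) = 0.5*0.3 + 0.2*0.4 = 0.23
```

The numbers in the tables are made up; the point is that eliminating a variable replaces several factors with one smaller factor $\tau$.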
Running time of Variable Elimination is $\mathcal{O}(nk^{M+1})$ where $M$ is the maximum size (number of variables) of any factor $\tau$ formed during the elimination process and $n$ is the number of variables.
Choosing optimal VE ordering is NP-hard. In practice, we resort to heuristics:
When computing marginals, VE produces many intermediate factors $\tau$ as a side-product. These factors are the same as the ones that we need to answer other marginal queries. By caching them, we can answer new marginal queries at no additional cost.
VE and JT algorithm are two flavors of dynamic programming: top-down DP vs bottom-up table filling.
JT first executes two runs of VE to initialize a particular data structure: bottom-up to get root probabilities $\tau(x)=\sum_y \phi(x,y)$; and top-down to get leaf probabilities $\tau(y)=\sum_x \phi(x,y)$. It can then answer marginal queries in $\mathcal{O}(1)$ time.
Two variants: belief propagation (BP) (applies to tree-structured graphs) and full junction tree method (general networks).
Consider running VE algorithm on a tree to compute marginal $p(x_i)$. Optimal ordering: rooting the tree at $x_i$ and iterating in post-order (leaves to root s.t. nodes are visited after their children). This ordering is optimal because the largest clique formed during VE has size 2.
At each step, eliminate $x_j$ by computing $\tau_k(x_k) = \sum_{x_j} \phi(x_k, x_j)\tau_j(x_j)$ (where $x_k$ is parent of $x_j$). $\tau_j(x_j)$ can be thought of as a message sent from $x_j$ to $x_k$ that summarizes all the information from the subtree rooted at $x_j$.
Say, after computing $p(x_i)$ we want to compute $p(x_k)$; we would run VE again with $x_k$ as root. However, we already computed the messages received by $x_k$ when $x_i$ was root (since there is only one path connecting two nodes in a tree).
(assumes tree structure)
Two variants: sum-product message passing (for marginal inference) and max-product message passing (for MAP inference)
$m_{i\rightarrow j}(x_j) = \sum_{x_i} \phi(x_i)\,\phi(x_i, x_j) \prod_{l\in N(i)\setminus j} m_{l\rightarrow i}(x_i)$
(i.e. sum over all possible values taken by $x_i$)
After computing all messages, we may answer any marginal query over $x_i$ in constant time:
$p(x_i) \propto \phi(x_i) \prod_{l\in N(i)} m_{l\rightarrow i}(x_i)$
Recall that a factor graph is a bipartite graph with edges going between variables and factors.
Two types of messages:
$\nu_{var(i)\rightarrow fac(s)} (x_i) = \prod_{t\in N(i)\setminus s} \mu_{fac(t)\rightarrow var(i)} (x_i)$
$\mu_{fac(s)\rightarrow var(i)}(x_i) = \sum_{x_{N(s)\setminus i}} f_s(x_{N(s)})\prod_{j\in N(s)\setminus i} \nu_{var(j)\rightarrow fac(s)}(x_j)$
As long as there is a factor (or variable) ready to transmit to a variable (or factor), send the message as defined above. Each edge ends up carrying exactly two messages: one from variable to factor and one from factor to variable.
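Sum-product message passing is easiest to see on a tiny chain, where the result can be verified against brute-force marginalization (all potentials below are made-up numbers, and messages include the unary potential of the sending variable):

```python
from itertools import product

# Sum-product on a 3-node chain MRF x1 - x2 - x3 with binary variables.
phi = {1: [1.0, 2.0], 2: [1.0, 1.0], 3: [3.0, 1.0]}   # unary potentials
pair = {(1, 2): [[2.0, 1.0], [1.0, 2.0]],             # pairwise potentials
        (2, 3): [[2.0, 1.0], [1.0, 2.0]]}

def edge(i, j, xi, xj):
    return pair[(i, j)][xi][xj] if (i, j) in pair else pair[(j, i)][xj][xi]

def message(i, j, incoming):
    # m_{i->j}(x_j) = sum_{x_i} phi_i(x_i) phi_ij(x_i,x_j) * incoming(x_i)
    return [sum(phi[i][xi] * edge(i, j, xi, xj) * incoming[xi]
                for xi in (0, 1)) for xj in (0, 1)]

m12 = message(1, 2, [1.0, 1.0])   # leaves send first
m32 = message(3, 2, [1.0, 1.0])
belief2 = [phi[2][x] * m12[x] * m32[x] for x in (0, 1)]
marg2 = [b / sum(belief2) for b in belief2]

# Brute-force check of p(x2).
def joint(x1, x2, x3):
    return (phi[1][x1] * phi[2][x2] * phi[3][x3]
            * edge(1, 2, x1, x2) * edge(2, 3, x2, x3))

Z = sum(joint(*xs) for xs in product((0, 1), repeat=3))
brute2 = [sum(joint(x1, v, x3) for x1 in (0, 1) for x3 in (0, 1)) / Z
          for v in (0, 1)]
print(all(abs(a - b) < 1e-12 for a, b in zip(marg2, brute2)))  # True
```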
Since MRF factors are non-negative, the max operator distributes over products, just like the sum. (In general, max only distributes over products of non-negative factors: inside a max, two factors with large negative values can multiply into a large positive value, even though each factor's own maximum may be a small positive value.)
E.g., for a chain MRF:
$\tilde{p}^* = \max_{x_1}\dots \max_{x_n}\phi(x_1)\prod_{i=2}^n \phi(x_i, x_{i-1})$
$= \max_{x_n} \max_{x_{n-1}} \phi(x_n, x_{n-1}) \max_{x_{n-2}} \phi(x_{n-1}, x_{n-2})\dots \max_{x_1} \phi(x_2, x_1)\phi(x_1)$
Since both problems decompose the same way we can reuse the same approach as for marginal inference (also applies to factor trees).
If we also want the $\arg \max$, we can keep back-pointers during the optimization procedure.
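On a chain, max-product with back-pointers is exactly the Viterbi recursion; a sketch on a 3-variable chain with made-up potentials:

```python
# Max-product (Viterbi) on a chain MRF, keeping back-pointers so the
# argmax assignment can be recovered after the forward pass.
phi1 = [1.0, 3.0]                  # unary factor on x1
pair = [[[4.0, 1.0], [1.0, 4.0]],  # phi(x1, x2)
        [[1.0, 2.0], [3.0, 1.0]]]  # phi(x2, x3)

best, back = phi1, []
for table in pair:
    # scores[b][a]: value of the best prefix ending with (prev=a, cur=b)
    scores = [[best[a] * table[a][b] for a in (0, 1)] for b in (0, 1)]
    back.append([max((0, 1), key=lambda a: scores[b][a]) for b in (0, 1)])
    best = [max(scores[b]) for b in (0, 1)]

x = [max((0, 1), key=lambda v: best[v])]  # argmax over the last variable
for ptr in reversed(back):                # follow back-pointers
    x.append(ptr[x[-1]])
x.reverse()
print(x, max(best))  # [1, 1, 0] 36.0
```

The forward pass has the same structure as sum-product, with each $\sum$ replaced by a $\max$; the back-pointers cost only $\mathcal{O}(nk)$ extra memory.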
Suppose we have undirected graphical model $G$ (if directed, take the moralized graph). A junction tree $T=(C, E_T)$ over $G=(\chi, E_G)$ is a tree whose nodes $c\in C$ are associated with subsets $x_c \subseteq \chi$ of the graph vertices (i.e. sets of variables). Must satisfy:
Example of MRF with graph $G$ and junction tree $T$:
Trivial junction tree: one node containing all the variables in $G$ (useless, as it reduces to the brute-force marginalization algorithm).
Optimal trees make the clusters as small and modular as possible; unfortunately, finding one is NP-hard.
Special case: when $G$ itself is a tree, define a cluster for each edge.
Example of invalid junction tree that does not satisfy running intersection property (green cluster should contain intersection of red and purple):
Let us define potential $\psi_c (x_c)$ of each cluster $c$ as the product of all the factors $\phi$ in $G$ that have been assigned to $c$.
By family preservation property, this is well defined and we may assume distribution in the form:
$p(x_1, \dots, x_n)=\frac{1}{Z} \prod_{c\in C} \psi_c (x_c)$
At each step of the algorithm, we choose a pair of adjacent clusters $c^{(i)}, c^{(j)}$ in $T$ and compute message whose scope is the sepset $S_{ij}$ between the two clusters:
$m_{i\rightarrow j} (S_{ij})=\sum_{x_c \setminus S_{ij}} \psi_c (x_c) \prod_{l\in N(i)\setminus j} m_{l\rightarrow i} (S_{li})$
$x_c$ are variables in the cluster $c^{(i)}$.
We choose $c^{(i)}, c^{(j)}$ only if $c^{(i)}$ has received messages from all of its neighbors except $c^{(j)}$. Terminates in $2\lvert E_T \rvert$ steps just as in belief propagation (bottom-up then top-down to get all messages).
We then define belief of each cluster based on all the messages that it receives:
$\beta_c(x_c) = \psi_c(x_c) \prod_{l\in N(i)} m_{l\rightarrow i} (S_{li})$, where $i$ is the index of the cluster $c = c^{(i)}$.
(often referred to as Shafer-Shenoy updates)
Beliefs will be proportional to the marginal probabilities over their scopes: $\beta_c(x_c) \propto p(x_c)$.
We answer queries of the form $\tilde{p}(x)$ with $x\in x_c$ by marginalizing out the remaining variables in the belief:
$\tilde{p}(x) = \sum_{x_c \setminus x} \beta_c (x_c)$
(requires brute force sum over all variables in $x_c$)
Normalize by partition function $Z=\sum_{x_c}\beta_c (x_c)$ (sum of all the beliefs in a cluster).
Running time is exponential in the size of the largest cluster (because we need to marginalize out variables from the cluster; must be done using brute force).
Running VE with a certain ordering is equivalent to performing message passing on the junction tree.
We may prove correctness of the JT algorithm through an induction argument on the number of factors $\psi$. The key property that makes this argument possible is the running intersection property: it ensures that it is safe to eliminate a variable that appears in a leaf cluster but not in that cluster's sepset, since such a variable cannot occur anywhere except that one cluster.
The caching approach used for belief propagation extends to junction trees.
Technique for performing inference on complex (non-tree) graphs.
Running time of JT is exponential in size of the largest cluster. In some cases, we can give quick approximate solution.
Suppose we are given an MRF with pairwise potentials. The main idea is to disregard loops and perform message passing anyway.
$m_{i\rightarrow j}^{t+1}(x_j) = \sum_{x_i} \phi(x_i)\phi(x_i, x_j) \prod_{l\in N(i)\setminus j} m_{l\rightarrow i}^t (x_i)$
Keep performing these updates for a fixed number of steps or until the messages converge. Messages are typically initialized with the uniform distribution.
$\max_x \log p(x) = \max_x \sum_c \log \phi_c (x_c) - \log Z$
MAP inference contains many intractable problems as special cases.
E.g., 3-SAT: for each clause $c=(x\lor y \lor \lnot z)$, introduce a factor $\theta_c (x,y,z)=1$ if $c$ is satisfied and $0$ otherwise. The 3-SAT instance is satisfiable iff the value of the MAP assignment equals the number of clauses.
We may use a similar construction to prove that marginal inference is NP-hard: add an additional variable $X$ with $X=1$ when all clauses are satisfied and $X=0$ otherwise. Its marginal probability will be $> 0$ iff the 3-SAT instance is satisfiable.
Example: image segmentation (input $x\in[0,1]^{d\times d}$; predict a label $y\in\{0,1\}^{d\times d}$ indicating whether each pixel belongs to the object we want to recover). Intuitively, neighboring pixels should have similar values. This prior knowledge can be modeled via an Ising model (see box 4.C p. 128 in Koller and Friedman).
See Koller and Friedman 13.6
Efficient MAP inference algorithm for certain Potts models over binary-valued variables. Returns optimal solution in polynomial time regardless of structural complexity of the underlying graph.
A graph cut of an undirected graph is a partition of its nodes into two disjoint sets $V_s$, $V_t$. Let each edge be associated with a non-negative cost. The cost of a graph cut is the sum of the costs of the edges that cross between the two partitions:
$\text{cost}(V_s, V_t) =\sum_{v_1\in V_s, v_2 \in V_t}\text{cost} (v_1, v_2)$
The min-cut problem is to find the partition that minimizes the cost of the graph cut. See algorithms textbooks for details.
See Metric MRFs model box 4.D p. 127 in Koller and Friedman
MRF over binary variables with pairwise factors in which edge energies (i.e., negative log-edge factors) take the form:
$E_{uv}(x_u, x_v) = \begin{cases} 0 & \text{if } x_u = x_v \\ \lambda_{uv} & \text{if } x_u \ne x_v \end{cases}$
where $\lambda_{uv} \geq 0$ is the cost of an edge mismatch.
Each node has a unary potential described by an energy function $E_u(x_u)$, normalized by subtracting $\min_{x_u} E_u(x_u)$ so that $E_u \geq 0$ with either $E_u(1)=0$ or $E_u(0)=0$.
$p(x) = \frac{1}{Z}\exp\bigg(-\bigg[ \sum_u E_u(x_u) + \sum_{(u,v)\in \mathcal{E}}E_{uv}(x_u, x_v)\bigg]\bigg)$
Motivation comes from image segmentation: reduce discordance between adjacent variables.
Formulate as a min-cut problem in augmented graph: add special source and sink nodes $s, t$
The cost of the cut (unary potentials and edge contributions) is precisely the energy of the assignment.
Segmentation task in a 2x2 MRF as a graph cut problem:
Refer to Koller and Friedman textbook for more general models with submodular edge potentials.
Graph cuts are only applicable to restricted classes of MRFs.
Linear programming (a.k.a. linear optimization) refers to problems of the form:
$\min_x c\cdot x \text{ subject to }Ax\leq b$
with $x, c \in\mathbb{R}^{n}$, $b\in\mathbb{R}^m$ and $A\in\mathbb{R}^{m\times n}$.
#### Integer Linear Programming (ILP)
Extension of linear programming which additionally requires $x\in\{0,1\}^n$
NP-hard in general
Nonetheless, many heuristics exist such as rounding:
relax the constraint to $0\leq x\leq 1$, then round the solution
works surprisingly well in practice and has theoretical guarantees for some classes of ILPs
Introduce two types of indicator variables:
MAP objective becomes:
$\max_{\mu} \sum_{i\in V}\sum_{x_i} \theta_i(x_i)\mu_i(x_i) + \sum_{i,j\in E}\sum_{x_i, x_j}\theta_{ij}(x_i, x_j)\mu_{ij}(x_i, x_j)$
(the function $\theta$ represents $\log \phi$)
With constraints:
$\mu_i(x_i)\in \{0, 1\} \;\; \forall i, x_i$
$\sum_{x_i}\mu_i(x_i)=1 \;\; \forall i$
$\mu_{ij}(x_i, x_j)\in \{0, 1\} \;\; \forall (i,j) \in E,\, x_i, x_j$
$\sum_{x_i, x_j}\mu_{ij}(x_i, x_j) = 1 \;\; \forall (i,j) \in E$
Assignments must also be consistent:
$\sum_{x_i} \mu_{ij}(x_i, x_j) = \mu_j(x_j) \;\; \forall (i,j) \in E,\, x_j$
$\sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i) \;\; \forall (i,j) \in E,\, x_i$
(we have to optimize over the edges too because an edge between two variables at given states contributes to the cost)
Suppose we want to solve a MAP problem of the form:
$\max_{x} \sum_{i\in V}\theta_i(x_i) + \sum_{f\in F}\theta_f(x_f)$
where $F$ denotes a set of arbitrary factors (e.g. the edge potentials of a pairwise MRF)
Consider decoupled objective:
$\sum_{i\in V}\max_{x_i} \theta_i(x_i) + \sum_{f\in F}\max_{x^f} \theta_f (x^f)$
We should encourage consistency between the two terms:
$x_i^f - x_i = 0 \;\; \forall f, \forall i\in f$ (i.e. the variable assignments chosen for a factor should agree with those chosen for the corresponding nodes)
Lagrangian of the constrained problem ($x^f$ and $x$ are assignments of the variable):
$L(\delta, x^f, x)=\sum_{i\in V}\theta_i (x_i) + \sum_{f\in F}\theta_f(x^f) + \sum_{f\in F}\sum_{i\in f}\sum_{x_i'} \underbrace{\delta_{f_i}(x_i')}_{\text{Lagrange multiplier}}\bigg( \mathbb{I}_{x_i' = x_i} - \mathbb{I}_{x_i' = x_i^f}\bigg)$
For any $\delta$: $L(\delta) := \max_{x^f, x} L(\delta, x^f, x)\geq p^*$ (i.e., for every choice of the multipliers $\delta$, the dual function $L(\delta)$ is an upper bound on the optimal value $p^*$ of the objective).
We can reparameterize the Lagrangian as:
$L(\delta) = \sum_{i \in V} \max_{x_i} \left( \theta_i (x_i) + \sum_{f:i \in f} \delta_{fi}(x_i) \right) + \sum_{f \in F} \max_{x^f} \left( \theta_f (x^f) - \sum_{i \in f} \delta_{fi}(x_i^f) \right)$
$= \sum_{i \in V} \max_{x_i} \bar \theta_{i}^\delta (x_i) + \sum_{f \in F} \max_{x^f} \bar \theta_{f}^\delta (x^f)$
Suppose we can find dual variables $\bar \delta$ such that the local maximizers of $\bar \theta_{i}^{\bar \delta} (x_i)$ and $\bar \theta_{f}^{\bar \delta} (x^f)$ agree; in other words, we can find a $\bar x$ such that $\bar x_i \in \arg\max_{x_i} \bar \theta_{i}^{\bar \delta} (x_i)$ and $\bar x^f \in \arg\max_{x^f} \bar \theta_{f}^{\bar \delta} (x^f)$. Then we have that:
$L(\bar \delta) = \sum_{i \in V} \bar \theta_{i}^{\bar\delta} (\bar x_i) + \sum_{f \in F} \bar \theta_{f}^{\bar\delta} (\bar x^f) = \sum_{i \in V} \theta_i (\bar x_i) + \sum_{f \in F} \theta_f (\bar x^f).$
The second equality follows because the terms involving the Lagrange multipliers cancel out when $\bar x$ and $\bar x^f$ agree.
On the other hand, by definition of $p^*$:
$\sum_{i \in V} \theta_i (\bar x_i) + \sum_{f \in F} \theta_f (\bar x^f) \leq p^* \leq L(\bar\delta)$
which implies that $L(\bar\delta) = p^*$.
Thus:
$L(\delta)$ is continuous and convex (a pointwise max of affine functions, see EE227C), so we may minimize it using subgradient descent or block coordinate descent (faster). The objective is not strongly convex, so the minimizer is not necessarily unique.
As shown above, if a solution $x,x^f$ agrees for some $\delta$, it is optimal.
If each $\bar \theta_i^{\delta^*}$ has a unique maximizer, the problem is decodable. If some variables do not have a unique maximizer, we fix the variables that can be uniquely decoded to their optimal values and use exact inference to find the values of the remaining variables (NP-hard in general, but usually not a big problem in practice).
How can we decouple variables from each other? Isn't that impossible due to the edge costs?
Start with an arbitrary assignment and perform "moves" on the joint assignment that locally increase the probability. There are no guarantees, but prior knowledge can be used to design effective moves.
Exhaustive search over the space of assignments, while pruning branches that can be provably shown not to contain a MAP assignment (similar to backtracking). The LP relaxation or its dual can be used to obtain upper bounds and prune subtrees.
Read more about this
Sampling methods (e.g. Metropolis-Hastings) to sample from:
$p_t(x)\propto \exp (\frac{1}{t}\sum_{c\in C}\theta_c(x_c))$
$t$ is called the temperature:
The idea of simulated annealing is to run a sampling algorithm starting with a high $t$ and to gradually decrease it. If the cooling rate is sufficiently slow (which requires lots of tuning), we are guaranteed to find the mode of the distribution.
Interesting classes of models may not admit exact polynomial-time solutions at all.
Two families of approximate algorithms:
Bayesian network with multinomial variables. Samples variables in topological order:
linear $\mathcal{O}(n)$ time by taking exactly 1 multinomial sample from each CPD
Assume a multinomial distribution with $k$ outcomes and associated probabilities $\theta_1, \dots, \theta_k$. Subdivide the unit interval into $k$ regions of size $\theta_i$, $1\leq i \leq k$, and sample uniformly from $[0, 1]$:
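The unit-interval subdivision can be sketched as follows, using binary search over the cumulative sums (the probabilities `[0.2, 0.5, 0.3]` are made-up):

```python
import random
from bisect import bisect

# Sample one multinomial outcome: the uniform draw lands in region i
# with probability theta_i, located via binary search on cumulative sums.
def sample_multinomial(thetas, rng=random):
    cum, total = [], 0.0
    for t in thetas:
        total += t
        cum.append(total)
    return bisect(cum, rng.random() * total)

random.seed(0)
counts = [0, 0, 0]
for _ in range(100_000):
    counts[sample_multinomial([0.2, 0.5, 0.3])] += 1
print([c / 100_000 for c in counts])  # roughly [0.2, 0.5, 0.3]
```

With precomputed cumulative sums, each draw costs $\mathcal{O}(\log k)$.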
Forward sampling can also be performed on undirected models if the model can be represented by a clique tree with a small number of variables per node:
Name refers to famous casino in Monaco. Term was coined as codeword by physicists working on the atomic bomb as part of the Manhattan project.
Constructs solutions based on a large number of samples. E.g. consider approximating the following expectation:
$\mathbb{E}_{x\sim p}[f(x)]\approx I_T = \frac{1}{T}\sum_{t=1}^T f(x^t)$
Since samples are i.i.d., MC estimate is unbiased and variance is inversely proportional to $T$.
Special case of Monte Carlo integration. Compute area of a region $R$ by sampling in a larger region with known area and recording the fraction of samples that falls within $R$.
E.g., Bayesian network over set of variables $X=Z\cup E$. We use rejection sampling to compute marginal probabilities $p(E=e)$:
$p(E=e)=\sum_z p(Z=z, E=e)=\sum_x p(x)\mathbb{I}(E=e)=\mathbb{E}_{x\sim p}\big[\mathbb{I}(E=e)\big]$
and then take the Monte Carlo approximation.
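A sketch of this procedure on a hypothetical two-variable network $Z \to E$ with hand-picked CPDs (the structure and the numbers are assumptions for illustration only):

```python
import random

def forward_sample():
    # Hypothetical network Z -> E with
    # p(Z=1) = 0.3, p(E=1|Z=1) = 0.9, p(E=1|Z=0) = 0.2.
    z = 1 if random.random() < 0.3 else 0
    e = 1 if random.random() < (0.9 if z == 1 else 0.2) else 0
    return z, e

def estimate_marginal(n_samples=100000):
    """Monte Carlo estimate of p(E=1) = E_p[ I(E=1) ] via forward sampling."""
    hits = sum(1 for _ in range(n_samples) if forward_sample()[1] == 1)
    return hits / n_samples
```

For this toy model the exact marginal is $0.3 \cdot 0.9 + 0.7 \cdot 0.2 = 0.41$, which the estimate approaches at rate $\mathcal{O}(1/\sqrt{n})$.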
See Art Owen's lecture notes about importance sampling
Suppose we want to compute $\mu = \mathbb{E}[f(X)]$ where $f(x)$ is nearly zero outside a region $A$ for which $P(X\in A)$ is small. A plain Monte Carlo sample would be very wasteful and could fail to have even one point inside the region $A$.
Let $q$ be a known probability density function (importance sampling distribution):
$\mu = \int f(x)p(x)dx = \int \frac{f(x)p(x)}{q(x)}q(x)dx=\mathbb{E}_q\bigg(\frac{f(X)p(X)}{q(X)}\bigg)$
The importance sampling estimate of $\mu$ is:
$\hat{\mu}_q = \frac{1}{n} \sum_{i=1}^n \frac{f(X_i)p(X_i)}{q(X_i)}$
Assuming we can compute $fp/q$
We have $\mathbb{E}_q(\hat{\mu}_q) = \mu$ (unbiased estimate) and: $\text{Var}_q(\hat{\mu}_q)=\text{Var}(\frac{1}{n} \sum fp/q) = \frac{\sigma_q^2}{n}$
where $\sigma_q^2 = \int \frac{(f(x)p(x))^2}{q(x)} dx - \mu^2 = \int \frac{(f(x)p(x) - \mu q(x))^2}{q(x)} dx$
Therefore, the numerator is small when $q$ is nearly proportional to $fp$. Small values of $q$ greatly magnify whatever lack of proportionality appears in the numerator. It is good for q to have spikes in the same places that $fp$ does.
Having $q$ follow a uniform distribution collapses to plain Monte Carlo.
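A sketch under illustrative assumptions: $p = \mathcal{N}(0,1)$, $f(x) = \mathbb{I}(x > 4)$ (a rare event), and proposal $q = \mathcal{N}(4,1)$ centered where $fp$ has its mass:

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(n=100000):
    """Estimate mu = E_p[f(X)] by weighting samples from q with p(x)/q(x)."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(4.0, 1.0)       # sample from the proposal q
        if x > 4.0:                      # f(x) = I(x > 4)
            total += normal_pdf(x) / normal_pdf(x, 4.0, 1.0)
    return total / n
```

Since $P(X > 4) \approx 3.2 \times 10^{-5}$, a plain Monte Carlo sample of the same size would contain only a handful of points in the region, while the importance sampling estimate achieves a relative error of a few percent.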
When estimating a ratio of two probabilities (e.g. a conditional probability), if we sample each probability separately, errors compound and variance can be high. If we use the same samples to evaluate numerator and denominator, the estimator is biased but asymptotically unbiased, and we avoid the issue of compounding errors.
See notes on website
Developed during the Manhattan Project. One of the 10 most important algorithms of the 20th century.
Used to perform marginal and MAP inference as opposed to computing expectations.
We construct a Markov chain whose states are joint assignments of the variables and whose stationary distribution equals the model probability $p$.
Run Markov chain from initial state for $B$ burn-in steps (number of steps needed to converge to stationary distribution, see mixing time)
Run Markov chain for $N$ sampling steps
We produce Monte Carlo estimates of marginal probabilities. We then take the sample with highest probability to perform MAP inference.
Two algorithms:
For distributions that have narrow modes, the algorithm will sample from a given mode for a long time with high probability. Therefore, convergence will be slow.
See Markov Chain Monte Carlo without detailed balance
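A minimal Metropolis-Hastings sketch over a finite state space with a symmetric uniform proposal (the unnormalized target is an arbitrary table chosen for illustration):

```python
import random

def metropolis_hastings(p_tilde, n_states, n_burnin=1000, n_samples=20000):
    """Metropolis-Hastings on {0, ..., n_states - 1} with a symmetric
    uniform proposal; p_tilde is the unnormalized target probability."""
    x = random.randrange(n_states)
    samples = []
    for step in range(n_burnin + n_samples):
        x_new = random.randrange(n_states)
        # Symmetric proposal: accept with probability min(1, p_tilde(x')/p_tilde(x)).
        if random.random() < p_tilde(x_new) / p_tilde(x):
            x = x_new
        if step >= n_burnin:             # discard the burn-in prefix
            samples.append(x)
    return samples
```

Empirical frequencies of the retained samples converge to the normalized target; marginal estimates and the most frequent (or highest-probability) sample for MAP follow directly.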
Inference as an optimization problem: given intractable distribution $p$ and class of tractable distributions $\mathcal{Q}$, find $q\in \mathcal{Q}$ that is most similar to $p$.
Unlike sampling-based methods:
Need to choose approximation family $\mathcal{Q}$ and optimization objective $J(q)$ that captures similarity between $q$ and $p$. Information theory provides us with Kullback-Leibler (KL) divergence.
For two distributions with discrete support:
$KL(q \Vert p)=\sum_{x} q(x) \log \frac{q(x)}{p(x)}$
with properties:
$KL(q\Vert p) \geq 0 \forall q,p$
$KL(q\Vert p) = 0$ iff $q=p$
It is however asymmetric: $KL(q\Vert p) \ne KL(p \Vert q)$. That is why it is called a divergence and not a distance.
Use the unnormalized distribution $\tilde{p}$ instead of $p$, since evaluating $KL(q\Vert p)$ is intractable due to the normalization constant.
Important property:
$J(q) = \sum_x q(x)\log \frac{q(x)}{\tilde{p}(x)} = \sum_x q(x) \log \frac{q(x)}{p(x)} - \log Z(\theta) = KL(q\Vert p) - \log Z(\theta)$
Therefore $\log Z(\theta) = \underbrace{KL(q\Vert p)}_{\geq 0} - J(q)\geq -J(q)$
$-J(q)$ is called the variational lower bound or evidence lower bound (ELBO) and often written in the form:
$\log Z(\theta) \geq \mathbb{E}_{q(x)}[\log \tilde{p}(x) - \log q(x)]$
E.g.: If we are trying to compute marginal probability $p(x\vert D) = p(x,D)/p(D)$, minimizing $J(q)$ amounts to maximizing a lower bound on log-likelihood $\log p(D)$ of the observed data.
By maximizing the evidence lower bound, we are minimizing $KL(q\Vert p)$ by squeezing it between $-J(q)$ and $\log Z(\theta)$.
Computationally, computing $KL(p\Vert q)$ involves an expectation with respect to $p$ which is typically intractable to evaluate
$KL(q\Vert p)$ is called the I-projection (information projection) and is infinite when $p(x)=0$ and $q(x) > 0$. Therefore, if $p(x)=0$ we must have $q(x)=0$. We say that $KL(q\Vert p)$ is zero-forcing for $q$ and it under-estimates the support of $p$. It is called the exclusive KL divergence.
$KL(p\Vert q)$ is called the M-projection (moment projection) and is infinite if $q(x)=0$ and $p(x)>0$, thus if $p(x)>0$ we must have $q(x)>0$. $KL(p\Vert q)$ is zero-avoiding for $q$ and it over-estimates the support of $p$. It is called the inclusive KL divergence.
E.g.: fitting unimodal approximating distribution (red) to multimodal (blue). $KL(p\Vert q)$ leads to a) and $KL(q \Vert p)$ leads to b) and c).
How to choose approximating family $\mathcal{Q}$?
One of most widely used classes is set of fully-factored $q(x)=q_1(x_1)q_2(x_2)\dots q_n(x_n)$. Each $q_i$ is a categorical distribution over a one-dimensional discrete variable. This is called mean-field inference and consists in solving:
$\min_{q_1, \dots, q_n} J(q)$
via coordinate descent over the $q_j$.
For one coordinate, the optimization has a closed form solution:
$\log q_j(x_j) \leftarrow \mathbb{E}_{q_{-j}}[\log \tilde{p}(x)] + \text{normalization constant}$
To compute $\tilde{p}(x)$ we only need the factors belonging to the Markov blanket of $x_j$. If variables are discrete with $K$ possible values and there are $F$ factors and $N$ variables in the Markov blanket of $x_j$ then computing expectation takes $\mathcal{O}(KFK^N)$ time (for each value of $x_j$ we sum over all $K^N$ assignments of the $N$ variables and, in each case, sum over the $F$ factors.)
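The coordinate update can be sketched on a hypothetical two-variable binary MRF with a single pairwise factor (the model and its parameters are illustrative):

```python
import math

def mean_field_two_vars(log_phi, n_iters=50):
    """Mean-field q(x1,x2) = q1(x1) q2(x2) for a two-variable binary MRF
    with one pairwise factor: p_tilde(x1,x2) = exp(log_phi[x1][x2]).

    Coordinate update: log q1(x1) <- E_{q2}[log_phi(x1, x2)] + const.
    """
    q1, q2 = [0.5, 0.5], [0.5, 0.5]
    for _ in range(n_iters):
        # Update q1 holding q2 fixed, then normalize.
        unnorm = [math.exp(sum(q2[b] * log_phi[a][b] for b in range(2)))
                  for a in range(2)]
        z = sum(unnorm)
        q1 = [u / z for u in unnorm]
        # Symmetric update for q2 holding q1 fixed.
        unnorm = [math.exp(sum(q1[a] * log_phi[a][b] for a in range(2)))
                  for b in range(2)]
        z = sum(unnorm)
        q2 = [u / z for u in unnorm]
    return q1, q2
```

With a factor favoring $(1,1)$, e.g. `log_phi = [[0, 0], [0, 2]]`, coordinate descent converges to a fixed point where both marginals concentrate on $1$, illustrating the zero-forcing behavior of $KL(q\Vert p)$.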
Two different learning tasks:
Possible use cases:
Want to construct $p$ as close as possible to $p^*$ in order to perform density estimation. Use KL divergence:
$KL(p^* \Vert p) = \sum_x p^*(x) \log \frac{p^*(x)}{p(x)}=-H(p^*)-\mathbb{E}_{x\sim p^*}[\log p(x)]$
Hence minimizing KL divergence is equivalent to maximizing expected log-likelihood $\mathbb{E}_{x\sim p^*}[\log p(x)]$ ($p$ must assign high probability to instances sampled from $p^*$)
Since we do not know $p^*$, we use a Monte-Carlo estimate of the expected log-likelihood and maximum likelihood learning is defined as:
$\max_{p\in \underbrace{M}_{\text{family of models}}} \frac{1}{\lvert D \rvert} \sum_{x\in D} \log p(x)$
Minimize expected loss or risk: $\mathbb{E}_{x \sim p^*} L(x,p)$
Loss that corresponds to maximum likelihood is the log loss: $- \log p(x)$
For CRFs we use conditional log likelihood $- \log p(y\vert x)$
For prediction, we can use classification error $\mathbb{E}_{(x,y)\sim p^*}[\mathbb{I}\{\exists y' \ne y: p(y'\vert x)\geq p(y\vert x)\}]$ (probability of predicting the wrong assignment).
A better choice might be the Hamming loss (fraction of variables whose MAP assignment differs from the ground truth).
Work on generalizing the hinge loss (from SVMs) to CRFs leads to structured support vector machines.
Given Bayesian network $p(x)=\prod_{i=1}^n \theta_{x_i \vert x_{\text{parents}(i)}}$ and i.i.d. samples $D=\{x^{(1)}, \dots,x^{(m)}\}$ ($\theta$ parameters are the conditional probabilities)
Likelihood is $L(\theta, D) = \prod_{i=1}^n \prod_{j=1}^m \theta_{x_i^{(j)} \vert x_{\text{parents}(i)}^{(j)}}$
Taking the log and combining identical terms: $\log L(\theta, D) = \sum_{i=1}^n \sum_{x_{\text{parents}(i)}} \sum_{x_i} \lvert (x_i, x_{\text{parents}(i)})\rvert \cdot \log \theta_{x_i \vert x_{\text{parents}(i)}}$
Thus: $\theta_{x_i \vert x_{\text{parents}(i)}}^* = \frac{\lvert (x_i, x_{\text{parents}(i)})\rvert}{\lvert x_{\text{parents}(i)}\rvert}$
Maximum-likelihood estimate has closed-form solution: simplicity of learning is one of the most convenient features of Bayesian networks.
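The counting estimate can be sketched as follows (the function name and the data layout, a list of dicts, are illustrative choices):

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """Closed-form maximum-likelihood CPD for one node of a Bayesian
    network: theta_{x_i | x_pa(i)} = count(x_i, x_pa(i)) / count(x_pa(i)).

    data: list of dicts mapping variable name -> value.
    """
    joint = Counter((tuple(d[p] for p in parents), d[child]) for d in data)
    parent_counts = Counter(tuple(d[p] for p in parents) for d in data)
    return {(pa, x): c / parent_counts[pa] for (pa, x), c in joint.items()}
```

Each CPD is fit independently from the counts of its node and parents, with no iterative optimization needed.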
$p(x_1, \dots, x_n) = \frac{1}{Z}\prod_{c\in C}\phi_c (x_c) = \frac{1}{Z}\exp \big( \sum_{c\in C} \sum_{x_c'} \mathbb{I}\{x_c' = x_c\} \log \phi_c (x_c') \big) = \frac{\exp(\theta^T f(x))}{Z(\theta)}$