Stanford CS228 Probabilistic Graphical Modeling

Source

Introduction

Joint probability distributions are inherently exponentially-sized objects; we need to make assumptions about their structure to represent and manipulate them tractably.

Naive Bayes assumption: the variables are conditionally independent of each other given the class.

Three major parts in this course:

  • Representation: how to specify a (tractable) model?

  • Inference: given a probabilistic model, how to determine the marginal or conditional probabilities of certain events.

    • marginal inference: what is the probability of a given variable in the model?

    • maximum a posteriori (MAP) inference: most likely assignment of variables given the data.

  • Learning: fitting a model to a dataset. Learning and inference are inherently linked, as inference is a key subroutine that we repeatedly call within learning algorithms.

Applications:

  • Image generation (see Radford et al.): sampling new images \hat{x} \sim p(x) from a learned probability distribution that assigns high probability to images resembling those in the training set.

  • In-painting: given a patch, sample from p(\text{image} \vert \text{patch}) to complete the image.

  • Image denoising: given an image corrupted by noise, model the posterior distribution p(\text{original image} \vert \text{noisy image}) and sample from it / use exact inference to predict the original image.

  • Language models:

    • generation: sample from a probability distribution over sequences of words or characters.

    • machine translation: p(target language sentencesource language sentence)p(\text{target language sentence}\vert \text{source language sentence})

  • Audio models:

    • upsampling or super-resolution: increase resolution of an audio signal by calculating signal values at intermediate points. Sampling/perform inference on p(intermediate valuesobserved values)p(\text{intermediate values}\vert \text{observed values})

    • speech synthesis

    • speech recognition

Sidenotes

  • Judea Pearl was awarded the 2011 Turing award (Nobel prize of computer science) for founding the field of probabilistic graphical modeling.
  • For a philosophical discussion of why one should use probability theory as opposed to something else: see the Dutch book argument for probabilism.

Representation: Bayesian networks

Bayesian networks are directed graphical models in which each factor depends only on a small number of ancestor variables:

p(x_i \vert x_{i-1}, \dots, x_1) = p(x_i \vert x_{A_i})

where A_i is a subset of \{x_1, \dots, x_{i-1}\}.

When the variables are discrete, we may think of p(x_i \vert x_{A_i}) as a probability table. If each variable takes d values and has at most k ancestors, each table contains at most \mathcal{O}(d^{k+1}) entries. With one table per variable, the entire probability distribution can be compactly described with only \mathcal{O}(nd^{k+1}) parameters, compared to \mathcal{O}(d^{n}) with a naive approach.
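For a concrete sense of the savings (an illustrative example, not from the course): with n = 30 binary variables (d = 2) and at most k = 3 ancestors per node, the tables hold at most 30 \cdot 2^{4} = 480 entries, whereas the full joint distribution over the same variables has 2^{30} - 1 \approx 10^9 free parameters.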

Edges indicate dependency relationships between a node xix_i and its ancestors AiA_i.

Formal definition

A Bayesian network is a directed graph G=(V,E)G = (V,E) together with:

  • a random variable xix_i for each node iVi \in V
  • one conditional probability distribution p(xixAi)p(x_i \vert x_{A_i}) per node

A probability pp factorizes over a DAG (directed acyclic graph) GG if it can be decomposed into a product of factors, as specified by GG.

We can show by counter-example that when GG contains cycles, its associated probability may not sum to one.

Dependencies

Let I(p)I(p) be the set of all independencies that hold for a joint distribution pp. E.g., if p(x,y)=p(x)p(y)p(x,y) = p(x)p(y), then xyI(p)x\perp y \in I(p)

Independencies can be recovered from the graph by looking at three types of structures:

  • Common parent:

    • if G is of the form A \leftarrow B \rightarrow C, and B is observed, then A \perp C \vert B.

    • However, if B is unobserved, then A \not\perp C.

    • Intuitively, B contains all the information that determines the outcomes of A and C; once it is observed, nothing else affects their outcomes (the value of A carries no additional information about C, and vice versa).

  • Cascade:

    • if G equals A \rightarrow B \rightarrow C and B is observed, then A \perp C \vert B.

    • However, if B is unobserved, then A \not\perp C.

  • V-structure:

    • if G is A \rightarrow C \leftarrow B, then knowing C couples A and B. I.e. A \perp B if C is unobserved, but A \not\perp B \vert C (C is observed).

    • E.g. suppose C is true if the lawn is wet and false otherwise. A (it rained) and B (the sprinkler turned on) are two explanations for it being wet. If we know that C is true (grass is wet) and B is false (the sprinkler didn't turn on), then A must be true, since it is the only remaining explanation. Hence, A and B are not independent given C (see the enumeration after the d-separation rules below).

    We can extend these structures by applying them recursively to any larger Bayesian net. This leads to a notion called d-separation (where d stands for directed).

    Let Q, W and O be three sets of nodes in a Bayesian network G. We say that Q and W are d-separated given O (i.e. the nodes in O are observed) if Q and W are not connected by an active path. An undirected path in G is called active given observed variables O if for every consecutive triple of variables X, Y, Z on the path, one of the following holds:

    • X \leftarrow Y \leftarrow Z and Y is unobserved (Y \notin O)

    • X \rightarrow Y \rightarrow Z and Y is unobserved (Y \notin O)

    • X \leftarrow Y \rightarrow Z and Y is unobserved (Y \notin O)

    • X \rightarrow Y \leftarrow Z and Y or any of its descendants are observed.

    (i.e. there is pairwise dependency between consecutive variables on the path)
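To make the V-structure (explaining away) example concrete, here is a small enumeration, a sketch assuming A (rain) and B (sprinkler) are independent fair coins and C (wet grass) is simply A or B:

```python
import itertools

# Sketch of the sprinkler example (assumed CPDs): A, B ~ Ber(0.5) independent,
# and C = A or B deterministically.
p = {}
for a, b in itertools.product([0, 1], repeat=2):
    c = int(a or b)
    p[(a, b, c)] = 0.25  # p(A=a) * p(B=b); C is deterministic given A, B

def prob(pred):
    return sum(v for k, v in p.items() if pred(*k))

p_a = prob(lambda a, b, c: a == 1)                       # P(A=1) = 0.5
p_a_given_c = (prob(lambda a, b, c: a == 1 and c == 1)
               / prob(lambda a, b, c: c == 1))            # P(A=1 | C=1) = 2/3
p_a_given_c_notb = (prob(lambda a, b, c: a == 1 and c == 1 and b == 0)
                    / prob(lambda a, b, c: c == 1 and b == 0))  # P(A=1 | C=1, B=0) = 1

print(p_a, p_a_given_c, p_a_given_c_notb)  # observing B changes our belief about A once C is observed
```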

In the following example, X1X_1 and X6X_6 are d-separated given X2,X3X_2, X_3 (you cannot infer X6X_6 from X1X_1 given X2,X3X_2, X_3):

d-separated graph

However, in the next example, X_2, X_3 are not d-separated given X_1, X_6. There is an active path which passes through the V-structure created when X_6 is observed:

not d-separated graph

Let I(G) = \{(X \perp Y \vert Z) : X, Y \text{ are d-separated given } Z\} be the set of independencies implied by d-separation in G.

I-map: if p factorizes over G, then I(G) \subseteq I(p). We say that G is an I-map (independence map) for p.

The intuition is that if X,YX,Y and Y,ZY,Z are mutually dependent, so are X,ZX,Z. Thus, we can look at adjacent nodes and propagate dependencies.

In other words, all the independencies encoded in GG are sound: variables that are d-separated in GG are truly independent in pp. However, the converse is not true: a distribution may factorize over GG, yet have independencies that are not captured in GG.

Note that if p(x,y)=p(x)p(y)p(x,y)=p(x)p(y) then this distribution still factorizes over the graph yxy \rightarrow x since we can always write it as p(x,y)=p(xy)p(y)p(x,y)=p(x\vert y)p(y) where p(xy)p(x\vert y) does not actually vary with yy. However, we can construct a graph that matches the structure of pp by simply removing that unnecessary edge.

Representational power of directed graphs

Can directed graphs express all the independencies of any distribution p? More formally, given a distribution p, can we construct a graph G such that I(G) = I(p)?

First, note that it is easy to construct a G such that I(G) \subseteq I(p): a fully connected DAG G is an I-map for any distribution, since I(G) = \emptyset (no two variables are d-separated in G; the edge connecting them is itself an active path).

A more interesting question is whether we can find a minimal I-map. We may start with a fully connected G and remove edges until G is no longer an I-map (i.e. until G encodes independencies that are not in p, so that p no longer factorizes over G). One pruning method consists in following the natural topological ordering of the graph and removing node ancestors until this is no longer possible (see structure learning at the end of the course).

Does any probability distribution p always admit a perfect map G for which I(p) = I(G)? Unfortunately, the answer is no.

For example, consider the following distribution p over three variables X, Y, Z (the noisy-xor example):

  • we sample X, Y \sim \text{Ber}(0.5) from a Bernoulli distribution, and we set Z = X \text{ xor } Y (Z = 1 only when exactly one of X or Y equals one).
  • \{X \perp Y, Z \perp Y, X \perp Z\} \subseteq I(p) but Z \perp \{Y, X\} \notin I(p) (we can deduce Z from the other two). Thus, X \rightarrow Z \leftarrow Y is an I-map for p (every independence we can read off G holds in p), but Z \perp Y and X \perp Z are not in I(G) (each edge automatically creates an active path). The only way to capture all three independencies would be to leave all three nodes disconnected, but that graph is not an I-map (since Z \perp \{X, Y\} would also be implied). None of the 3-node graph structures that we discussed perfectly describes I(p), and hence this distribution doesn't have a perfect map.
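A quick enumeration (a sketch of the construction above) confirms the pairwise independencies and the joint dependence:

```python
import itertools

# Sketch: verify the noisy-xor independencies by enumeration.
# X, Y ~ Ber(0.5) independent, Z = X xor Y.
joint = {}
for x, y in itertools.product([0, 1], repeat=2):
    joint[(x, y, x ^ y)] = 0.25

def marg(pred):
    return sum(v for k, v in joint.items() if pred(*k))

# Pairwise marginal independence, e.g. between X and Z:
for xv, zv in itertools.product([0, 1], repeat=2):
    assert abs(marg(lambda x, y, z: x == xv and z == zv)
               - marg(lambda x, y, z: x == xv) * marg(lambda x, y, z: z == zv)) < 1e-12

# But Z is a deterministic function of (X, Y), so Z is not independent of {X, Y}:
print(marg(lambda x, y, z: z == 0 and x == 0 and y == 0))                        # 0.25
print(marg(lambda x, y, z: z == 0) * marg(lambda x, y, z: x == 0 and y == 0))    # 0.125
```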

A related question is whether perfect maps are unique when they exist. Again, this is not the case, as X \rightarrow Y and X \leftarrow Y encode the same independencies, yet form different graphs. Two Bayesian networks G_1, G_2 are I-equivalent if they encode the same independencies: I(G_1) = I(G_2).

When are two Bayesian nets I-equivalent?

Each of the graphs below have the same skeleton (if we drop the directionality of the arrows, we obtain the same undirected graph)

3 nodes Bayes nets

Graphs a, b and c are symmetric (X \perp Y \vert Z but X \not\perp Y). They encode exactly the same independencies, and the directionality does not matter as long as we don't turn them into a V-structure (d). The V-structure is the only one that describes the dependency X \not\perp Y \vert Z.

General result on I-equivalence: If G,GG, G' have the same skeleton and the same V-structures, then I(G)=I(G)I(G) = I(G').

Intuitively, two graphs are I-equivalent if the d-separation between variables is the same. We can flip the directionality of any edge, unless it forms a v-structure, and the d-connectivity of the graph will be unchanged. See the textbook of Koller and Friedman for a full proof in Theorem 3.7 (page 77).

Representation: Markov random fields

Some independence structures cannot be captured by any Bayesian network: unless we are willing to introduce false independencies among the variables, we must fall back to a less compact representation (with additional, unnecessary edges). This leads to extra parameters in the model and makes it more difficult for the model to learn them.

Markov Random Fields (MRFs) are based on undirected graphs. They can compactly represent independence assumptions that directed models cannot.

Unlike in the directed case, we are not saying anything about how one variable is generated from another set of variables (as a conditional probability distribution would). We simply indicate a level of coupling between dependent variables in the graph. This requires less prior knowledge, as we no longer have to specify a full generative story of how certain variables are constructed from others. We simply identify dependent variables and define the strength of their interactions. This defines an energy landscape over the space of possible assignments, and we convert this energy to a probability via the normalization constant.

Formal definition

A Markov Random Field is a probability distribution p over variables x_1, \dots, x_n defined by an undirected graph G:

p(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c)

  • C denotes the set of cliques. A clique is a fully connected subgraph, i.e. any two distinct vertices in the clique are adjacent. It can be a single node, an edge, a triangle, etc.
  • Each factor \phi_c is a non-negative function over the variables in a clique.
  • The partition function Z = \sum_{x_1, \dots, x_n} \prod_{c \in C} \phi_c(x_c) is a normalizing constant.

Note: we do not need to specify a factor for each clique.

Example

We are modeling voting preferences among persons A, B, C, D. The pairs (A,B), (B,C), (C,D), (D,A) are friends, and friends have similar voting preferences:

\tilde{p}(A,B,C,D) = \phi(A,B)\phi(B,C)\phi(C,D)\phi(D,A)

where \phi(X,Y) = 10 if X = Y = 1, 5 if X = Y = 0, and 1 otherwise.

The final probability is:

p(A,B,C,D) = \frac{1}{Z}\tilde{p}(A,B,C,D)

where Z = \sum_{A,B,C,D}\tilde{p}(A,B,C,D).
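For this small example, Z and p can be computed by brute force; a sketch using the factor values above:

```python
import itertools

# Sketch of the voting example: compute the partition function Z and the
# normalized probability by enumerating all assignments of A, B, C, D.
def phi(x, y):
    if x == y == 1:
        return 10.0
    if x == y == 0:
        return 5.0
    return 1.0

def p_tilde(a, b, c, d):
    return phi(a, b) * phi(b, c) * phi(c, d) * phi(d, a)

Z = sum(p_tilde(*xs) for xs in itertools.product([0, 1], repeat=4))

def p(a, b, c, d):
    return p_tilde(a, b, c, d) / Z

print(Z)               # normalizing constant
print(p(1, 1, 1, 1))   # probability of everyone voting 1
```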

Comparison to Bayesian networks

In the previous example, we had a distribution over A,B,C,DA,B,C,D that satisfied AC{B,D}A\perp C\vert \{B, D\} and BD{A,C}B\perp D\vert \{A, C\} (since only friends directly influence a person's vote). We can check by counter-example that these independencies cannot be perfectly represented by a Bayesian network. However, the MRF turns out to be a perfect map for this distribution.

Advantages:

  • can be applied to a wider range of problems in which there is no natural directionality in variable dependencies
  • can succinctly express certain dependencies that Bayesian nets cannot (converse is also true)

Drawbacks:

  • computing the normalization constant ZZ requires summing over a potentially exponential number of assignments. NP-hard in the general case; thus many undirected models are intractable and require approximation techniques.
  • difficult to interpret
  • much easier to generate data from a Bayesian network.

Bayesian networks are a special case of MRFs with a very specific type of clique factor (one that corresponds to a conditional probability distribution and implies a directed acyclic structure in the graph), and a normalizing constant of one.

Moralization

A Bayesian network can always be converted into an undirected network with normalization constant one by adding side edges to all parents of a given node and removing their directionality.

The converse is also possible, but may be intractable and may produce a very large directed graph (e.g. fully connected).

A general rule of thumb is to use Bayesian networks whenever possible and only switch to MRFs if there is no natural way to model the problem with a directed graph (like the voting example).

Independencies in Markov Random Fields

Variables xx and yy are dependent if they are connected by a path of unobserved variables. However, if xx's neighbors are all observed, then xx is independent of all the other variables (since they influence xx only via its neighbors, referred to as the Markov blanket of xx).

If a set of observed variables forms a cut-set between two halves of the graph, then variables in one half are independent from ones in the other.

cut-set

We define the Markov blanket UU of a variable XX as the minimal set of nodes such that XX is independent from the rest of the graph if UU is observed. This holds for both directed and undirected models. For undirected models, the Markov blanket is simply equal to a node's neighborhood.

Just as in the directed case, I(G) \subseteq I(p), but the converse is not necessarily true. E.g.:

mrf-bn-comparison

Conditional Random Fields

Special case of Markov Random Fields when they are applied to model a conditional probability distribution p(yx)p(y\vert x) where xX,yYx \in \mathcal{X}, y\in\mathcal{Y} are vector-valued variables. This common setting in supervised learning is also known as structured prediction.

Formal definition

A CRF is a Markov network over variables XY\mathcal{X} \cup \mathcal{Y} which specifies a conditional distribution:

\mathbb{P}(y \vert x) = \frac{1}{Z(x)} \prod_{c \in C} \phi_c(x_c, y_c)

with partition function:

Z(x) = \sum_{y \in \mathcal{Y}} \prod_{c \in C} \phi_c(x_c, y_c)

The partition function depends on x, so p(y \vert x) encodes a different probability distribution for each x. Therefore, a conditional random field can be seen as instantiating a new Markov Random Field for each input x.

Example

Recognize a word from a sequence of black-and-white character images x_i \in [0, 1]^{d \times d} (pixel matrices of size d \times d). The output is a sequence of alphabet letters y_i \in \{a, b, \dots, z\}.

We could train a classifier to separately predict each yiy_i from its xix_i. However, since the letters together form a word, the predictions across different positions ought to inform each other.

p(yx)p(y\vert x) is a CRF with two types of factors:

  • image factors \phi(x_i, y_i) for i = 1, \dots, n, which assign higher values to y_i that are consistent with an input x_i. These can be seen as p(y_i \vert x_i), e.g. given by standard (softmax) logistic regression.
  • pairwise factors \phi(y_i, y_{i+1}) for i = 1, \dots, n-1. These can be seen as empirical frequencies of letter co-occurrences obtained from a large corpus of text.

We can jointly infer the structured label yy using MAP inference:

argmaxyϕ(x1,y1)i=2nϕ(yi1,yi)ϕ(xi,yi)\arg \max_y \phi(x_1, y_1)\prod_{i=2}^{n} \phi(y_{i-1}, y_i) \phi(x_i, y_i)

CRF features

In most practical applications, we assume that the factors ϕc(xc,yc)\phi_c(x_c, y_c) are of the form:

ϕc(xc,yc)=exp(wcfc(xc,yc))\phi_c(x_c, y_c) = \exp(w_c^\top f_c(x_c, y_c))

where fc(xc,yc)f_c(x_c, y_c) is an arbitrary set of features describing the compatibility between xcx_c and ycy_c.

In the above example:

  • f(xi,yi)f(x_i, y_i) may be the probability of letter yiy_i produced by logistic regression or a neural network evaluated on pixels xix_i.
  • f(yi,yi+1)=I(yi=l1,yi+1=l2)f(y_i, y_{i+1}) = \mathbb{I}(y_i = l_1, y_{i+1} = l_2) where l1,l2l_1, l_2 are two letters of the alphabet. The CRF would then learn weights ww that would assign more weight to more common probability of consecutive letters (l1,l2)(l_1, l_2) (useful when yiy_i given xix_i is ambiguous).

CRF features can be arbitrarily complex. We can define a model with factors ϕ(x,yi)=exp(wif(x,yi))\phi(x, y_i) = \exp (w_i^\top f(x, y_i)) that depend on the entire input xx. Does not affect computational performance since xx is observed (does not change the scope of the factors, just their values).

An MRF models the joint distribution p(x,y) and therefore needs to fit two distributions, p(y \vert x) and p(x). However, if we only want to predict y given x, modeling p(x) is unnecessary and even disadvantageous:

  • statistically: there may not be enough data to fit both p(y \vert x) and p(x); since the models have shared parameters (the same graph), fitting one may not result in the best parameters for the other
  • computationally: we also need to make simplifying assumptions so that p(x) remains tractable.

CRFs forgo this assumption and often perform better on prediction tasks.

Factor Graphs

Bipartite graph where one group is the variables (round nodes) in the distribution being modeled and the other group is the factors defined on these variables (square nodes). Useful to see factor dependencies between variables explicitly, and facilitates computation.

factor graph

Inference: Variable elimination (VE)

  • Marginal inference: probability of a given variable p(y=1)=x1xnp(y=1,x1,,xn)p(y=1) = \sum_{x_1} \dots \sum_{x_n} p(y=1, x_1, \dots, x_n)
  • Maximum a posteriori (MAP) inference: most likely assignment to the variables in the model possibly conditioned on evidence maxx1xnp(y=1,x1,,xn)\max_{x_1 \dots x_n} p(y=1, x_1, \dots, x_n)

Inference is NP-hard in general; whether it is tractable depends on the structure of the graph. If the problem is intractable, we use approximate inference methods.

Variable elimination is an exact inference algorithm. Let xix_i be discrete variables taking kk possible values each (also extends to continuous distributions).

Example

Chain Bayesian network: p(x_1, \dots, x_n) = p(x_1)\prod_{i=2}^n p(x_i \vert x_{i-1})

How to compute the marginal probability p(x_n)?

The naive way is to sum the probability over all k^{n-1} assignments:

p(x_n) = \sum_{x_1} \dots \sum_{x_{n-1}} p(x_1, \dots, x_n)

We can do much better by leveraging the factorization of the probability distribution:

p(x_n) = \sum_{x_1} \dots \sum_{x_{n-1}} p(x_1) \prod_{i=2}^n p(x_i \vert x_{i-1})

= \sum_{x_{n-1}} p(x_n \vert x_{n-1}) \sum_{x_{n-2}} p(x_{n-1} \vert x_{n-2}) \dots \sum_{x_1} p(x_2 \vert x_1) p(x_1)

We evaluate this expression from the inside out, caching each intermediate factor:

\tau(x_2) = \sum_{x_1} p(x_2 \vert x_1) p(x_1), \qquad \tau(x_{i+1}) = \sum_{x_i} p(x_{i+1} \vert x_i) \tau(x_i), \qquad p(x_n) = \tau(x_n)

Each \tau has complexity \mathcal{O}(k^2) and we calculate \mathcal{O}(n) of them, so the overall complexity is \mathcal{O}(nk^2) (much better than \mathcal{O}(k^n)).
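A numpy sketch of this recursion (the chain length, number of states, and CPDs below are made up):

```python
import numpy as np

# Sketch: marginal p(x_n) for a chain Bayesian network via variable elimination.
# Each tau update is a (k x k) matrix-vector product, so the total cost is O(n k^2).
rng = np.random.default_rng(0)
n, k = 10, 4

p_x1 = rng.dirichlet(np.ones(k))                                   # p(x_1)
cpds = [rng.dirichlet(np.ones(k), size=k) for _ in range(n - 1)]   # cpds[i][a, b] = p(x_{i+2}=b | x_{i+1}=a)

tau = p_x1
for cpd in cpds:
    tau = cpd.T @ tau     # tau(x_{i+1}) = sum_{x_i} p(x_{i+1} | x_i) tau(x_i)

print(tau, tau.sum())     # marginal p(x_n); sums to 1
```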

Formal Definition

Graphical model as product of factors: p(x1,,xn)=cCϕc(xc)p(x_1, \dots, x_n) = \prod_{c\in C} \phi_c(x_c)

Variable elimination algorithm (instance of dynamic programming) performs two factor operations:

  • product ϕ3(xc)=ϕ1(xc(1))×ϕ2(xc(2))\phi_3(x_c) = \phi_1(x_c^{(1)})\times \phi_2(x_c^{(2)}). xc(i)x_c^{(i)} denotes an assignment to the variables in the scope of ϕi\phi_i e.g.: ϕ3(a,b,c)=ϕ(a,b)×ϕ2(b,c)\phi_3(a,b,c) = \phi(a,b)\times\phi_2(b,c)
  • marginalization: τ(x)=yϕ(x,y)\tau(x) = \sum_y \phi(x,y). τ\tau is the marginalized factor and does not necessarily correspond to a probability distribution even if ϕ\phi was a CPD.

Marginalizing BB from ϕ(A,B,C)\phi(A,B,C). For (a1,c1)(a_1, c_1): 0.25+0.08=0.330.25+0.08=0.33.

marginalization

Variable elimination requires an ordering over the variables according to which variables will be eliminated (e.g. ordering implied by the DAG, see example).

  • the ordering affects the running time (as the variables become more coupled)
  • finding the best ordering is NP-hard

Let the ordering OO be fixed. According to OO, for each variable XiX_i:

  1. Multiply all factors Φi\Phi_i containing XiX_i

  2. Marginalize out XiX_i to obtain new factor τ\tau

  3. replace factors Φi\Phi_i with τ\tau

Running time of Variable Elimination is O(nkM+1)\mathcal{O}(nk^{M+1}) where MM is the maximum size (number of variables) of any factor τ\tau formed during the elimination process and nn is the number of variables.

Choosing optimal VE ordering is NP-hard. In practice, we resort to heuristics:

  • min-neighbors: choose variable with fewest dependent variables
  • min-weight: choose variables to minimize the product of cardinalities of its dependent variables
  • min-fill: choose vertices to minimize the size of the factor that will be added to the graph

Junction Tree (JT) algorithm

When computing marginals, VE produces many intermediate factors τ\tau as a side-product. These factors are the same as the ones that we need to answer other marginal queries. By caching them, we can answer new marginal queries at no additional cost.

VE and JT algorithm are two flavors of dynamic programming: top-down DP vs bottom-up table filling.

JT first executes two runs of VE to initialize a particular data structure: bottom-up to get root probabilities τ(x)=yϕ(x,y)\tau(x)=\sum_y \phi(x,y); and top-down to get leaf probabilities τ(y)=xϕ(x,y)\tau(y)=\sum_x \phi(x,y). It can then answer marginal queries in O(1)\mathcal{O}(1) time.

Two variants: belief propagation (BP) (applies to tree-structured graphs) and full junction tree method (general networks).

Belief propagation

Variable elimination as message passing

Consider running VE algorithm on a tree to compute marginal p(xi)p(x_i). Optimal ordering: rooting the tree at xix_i and iterating in post-order (leaves to root s.t. nodes are visited after their children). This ordering is optimal because the largest clique formed during VE has size 2.

At each step, eliminate xjx_j by computing τk(xk)=xjϕ(xk,xj)τj(xj)\tau_k(x_k) = \sum_{x_j} \phi(x_k, x_j)\tau_j(x_j) (where xkx_k is parent of xjx_j). τj(xj)\tau_j(x_j) can be thought of as a message sent from xjx_j to xkx_k that summarizes all the information from the subtree rooted at xjx_j.

Say, after computing p(xi)p(x_i) we want to compute p(xk)p(x_k); we would run VE again with xkx_k as root. However, we already computed the messages received by xkx_k when xix_i was root (since there is only one path connecting two nodes in a tree).

Message passing algorithm

(assumes tree structure)

  • a node xix_i sends a message to a neighbor xjx_j whenever it has received messages from all nodes besides xjx_j.
  • There will always be a node with a message to send, unless all the messages have been sent out.
  • Since each edge can receive one message in both directions, all messages will have been sent out after 2E2\lvert E \rvert steps (to get all marginals, we need all incoming messages for every node: bottom-up and then top-down).
  • Messages are defined as intermediate factors in the VE algorithm

Two variants: sum-product message passing (for marginal inference) and max-product message passing (for MAP inference)

Sum-product message passing

m_{i \rightarrow j}(x_j) = \sum_{x_i} \phi(x_i, x_j) \prod_{l \in N(i) \setminus j} m_{l \rightarrow i}(x_i)

(i.e. sum over all possible values taken by x_i)

After computing all messages, we may answer any marginal query over x_i in constant time:

p(x_i) \propto \phi(x_i) \prod_{l \in N(i)} m_{l \rightarrow i}(x_i)

Sum-product message passing for factor trees

Recall that a factor graph is a bipartite graph with edges going between variables and factors.

Two types of messages:

  • variable-to-factor messages ν\nu
  • factor-to-variable messages μ\mu

factor graph messages

\nu_{var(i) \rightarrow fac(s)}(x_i) = \prod_{t \in N(i) \setminus s} \mu_{fac(t) \rightarrow var(i)}(x_i)

\mu_{fac(s) \rightarrow var(i)}(x_i) = \sum_{x_{N(s) \setminus i}} f_s(x_{N(s)}) \prod_{j \in N(s) \setminus i} \nu_{var(j) \rightarrow fac(s)}(x_j)

As long as there is a factor (or variable) ready to transmit to a variable (or factor), send message as defined above. Therefore, an edge receives exactly two messages (from variable to factor and factor to variable)

Max-product message passing

Since MRF factors are non-negative, the max operator distributes over products, just like the sum (in general, max only distributes over products of non-negative factors: if factors may be negative, the product of two very negative values can be large and positive, even though each factor's own maximum is small).

E.g., for a chain MRF:

p~=maxx1maxxnϕ(x1)i=2nϕ(xi,xi1)\tilde{p}^* = \max_{x_1}\dots \max_{x_n}\phi(x_1)\prod_{i=2}^n \phi(x_i, x_{i-1})

=maxxnmaxxn1ϕ(xn,xn1)maxxn2ϕ(xn1,xn2)maxx1ϕ(x2,x1)ϕ(x1)= \max_{x_n} \max_{x_{n-1}} \phi(x_n, x_{n-1}) \max_{x_{n-2}} \phi(x_{n-1}, x_{n-2})\dots \max_{x_1} \phi(x_2, x_1)\phi(x_1)

Since both problems decompose the same way we can reuse the same approach as for marginal inference (also applies to factor trees).

If we also want the argmax\arg \max, we can keep back-pointers during the optimization procedure.
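A minimal max-product (Viterbi-style) sketch for a chain MRF with back-pointers; it assumes unary potentials at every node as well as pairwise potentials, and the values below are made up:

```python
import numpy as np

# Sketch: max-product with back-pointers on a chain MRF.
def map_chain(node_pots, pair_pots):
    n, k = node_pots.shape
    m = np.log(node_pots[0])            # running max-product message (log space)
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        scores = m[:, None] + np.log(pair_pots[i - 1]) + np.log(node_pots[i])[None, :]
        back[i] = scores.argmax(axis=0)  # best previous state for each current state
        m = scores.max(axis=0)
    # Trace the back-pointers to recover the argmax assignment.
    x = [int(m.argmax())]
    for i in range(n - 1, 0, -1):
        x.append(int(back[i][x[-1]]))
    return list(reversed(x))

node_pots = np.array([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]])          # assumed unary potentials
pair_pots = np.array([[[2.0, 1.0], [1.0, 2.0]]] * 2)                # rewards agreement between neighbors
print(map_chain(node_pots, pair_pots))
```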

Junction tree algorithm

  • If the graph is not a tree, the message-passing scheme above no longer applies directly, and exact inference may be intractable.
  • However, we can partition the graph into a tree of clusters that are amenable to the variable elimination algorithm.
  • Then we simply perform message-passing on this tree.
  • Within a cluster, variables could be highly coupled but interactions among clusters will have a tree structure.
  • yields tractable global solutions if the local problems can be solved exactly.

Suppose we have an undirected graphical model G (if the model is directed, take its moralized graph). A junction tree T = (C, E_T) over G = (\chi, E_G) is a tree whose nodes c \in C are associated with subsets x_c \subseteq \chi of the graph vertices (i.e. sets of variables). A junction tree must satisfy:

  • family preservation: for each factor \phi there is a cluster c such that \text{scope}[\phi] \subseteq x_c (the scope being the variables the factor takes as input)
  • running intersection: for every pair of clusters c^{(i)}, c^{(j)}, every cluster on the path between c^{(i)} and c^{(j)} contains x_{c^{(i)}} \cap x_{c^{(j)}}

Example of MRF with graph GG and junction tree TT:

  • MRF potentials (i.e. product of the factors) are denoted using different colors
  • circles are nodes of the junction tree
  • rectangles are sepsets (separation sets) (= sets of variables shared by neighboring clusters)

junctiontree

Trivial junction tree: one node containing all the variables in GG (useless because brute force marginalization algorithm)

Optimal trees make the clusters as small and modular as possible: NP hard to find one.

Special case: when GG itself is a tree, define a cluster for each edge.

jt-over-tree

Example of invalid junction tree that does not satisfy running intersection property (green cluster should contain intersection of red and purple):

badjunctiontree

Algorithm

Let us define potential ψc(xc)\psi_c (x_c) of each cluster cc as the product of all the factors ϕ\phi in GG that have been assigned to cc.

By family preservation property, this is well defined and we may assume distribution in the form:

p(x1,,xn)=1ZcCψc(xc)p(x_1, \dots, x_n)=\frac{1}{Z} \prod_{c\in C} \psi_c (x_c)

At each step of the algorithm, we choose a pair of adjacent clusters c(i),c(j)c^{(i)}, c^{(j)} in TT and compute message whose scope is the sepset SijS_{ij} between the two clusters:

mij(Sij)=xcSijψc(xc)lN(i)jmli(Sli)m_{i\rightarrow j} (S_{ij})=\sum_{x_c \setminus S_{ij}} \psi_c (x_c) \prod_{l\in N(i)\setminus j} m_{l\rightarrow i} (S_{li})

xcx_c are variables in the cluster c(i)c^{(i)}.

We choose c(i),c(j)c^{(i)}, c^{(j)} only if c(i)c^{(i)} has received messages from all of its neighbors except c(j)c^{(j)}. Terminates in 2ET2\lvert E_T \rvert steps just as in belief propagation (bottom-up then top-down to get all messages).

We then define belief of each cluster based on all the messages that it receives:

βc(xc)=ψc(xc)lN(i)mli(Sli)\beta_c(x_c) = \psi_c(x_c) \prod_{l\in N(i)} m_{l\rightarrow i} (S_{li})

(often referred to as Shafer-Shenoy updates)

Beliefs will be proportional to the marginal probabilities over their scopes: βc(xc)p(xc)\beta_c(x_c) \propto p(x_c).

We answer queries of the form p~(x),xxc\tilde{p}(x), x\in x_c by marginalizing out the variable in its belief:

p~(x)=xcxβc(xc)\tilde{p}(x) = \sum_{x_c \setminus x} \beta_c (x_c)

(requires brute force sum over all variables in xcx_c)

Normalize by partition function Z=xcβc(xc)Z=\sum_{x_c}\beta_c (x_c) (sum of all the beliefs in a cluster).

Running time is exponential in the size of the largest cluster (because we need to marginalize out variables from the cluster; must be done using brute force).

Variable elimination over a junction tree

Running VE with a certain ordering is equivalent to performing message passing on the junction tree.

We may prove correctness of the JT algorithm through an induction argument on the number of factors ψ\psi. The key property that makes this argument possible is the running intersection property (assures that it's safe to eliminate a variable from a leaf cluster that is not found in that cluster's sepset since it cannot occur anywhere except that one cluster)

The caching approach used for belief propagation extends to junction trees.

Finding a good junction tree

  • by hand: model has regular structure for which there is an obvious solution (e.g. when model is a grid, clusters are pairs of adjacent rows)
  • using variable elimination: running the VE elimination algorithm implicitly generates a junction tree over the variables. Thus, it is possible to use the heuristics discussed earlier

Loopy belief propagation

Technique for performing inference on complex (non-tree) graphs.

Running time of JT is exponential in size of the largest cluster. In some cases, we can give quick approximate solution.

Suppose we are given a MRF with pairwise potentials. Main idea is to disregard loops and perform message passing anyway.

mijt+1(xj)=xiϕ(xi)ϕ(xi,xj)lN(i)jmlit(xi)m_{i\rightarrow j}^{t+1}(x_j) = \sum_{x_i} \phi(x_i)\phi(x_i, x_j) \prod_{l\in N(i)\setminus j} m_{l\rightarrow i}^t (x_i)

Keep performing these updates for fixed number of steps or until convergence of messages. All messages are typically initialized with uniform distribution.

  • Performs surprisingly well in practice
  • Provably converges on trees and on graphs with at most one cycle
  • However it may not converge and beliefs may not necessarily be equal to true marginals
  • special case of variational inference algorithm

MAP Inference

\max_x \log p(x) = \max_x \sum_c \log \phi_c(x_c) - \log Z

  • The intractable partition function Z doesn't depend on x and can therefore be ignored for MAP inference.
  • Marginal inference sums over all assignments, one of which is the MAP assignment.
  • We could replace the summation with maximization; however, more efficient methods exist.

Many intractable problems can be expressed as special cases of MAP inference.

E.g.: 3-sat. For each clause c = (x \lor y \lor \lnot z), define a factor \theta_c(x, y, z) = 1 if c is satisfied and 0 otherwise. The 3-sat instance is satisfiable iff the value of the MAP assignment equals the number of clauses.

We may use a similar construction to prove that marginal inference is NP-hard: add an additional variable X with X = 1 when all clauses are satisfied and X = 0 otherwise. Its marginal probability p(X = 1) will be > 0 iff the 3-sat instance is satisfiable.

Example: image segmentation (input x \in [0,1]^{d \times d}; predict labels y \in \{0,1\}^{d \times d} indicating whether each pixel belongs to the object we want to recover). Intuitively, neighboring pixels should have similar values. This prior knowledge can be modeled via an Ising model (see box 4.C, p. 128 in Koller and Friedman).

Graph cuts

See Koller and Friedman 13.6

Efficient MAP inference algorithm for certain Potts models over binary-valued variables. Returns optimal solution in polynomial time regardless of structural complexity of the underlying graph.

A graph cut of an undirected graph is a partition of its nodes into two disjoint sets V_s, V_t. Let each edge be associated with a non-negative cost. The cost of a graph cut is the sum of the costs of the edges that cross between the two partitions:

\text{cost}(V_s, V_t) = \sum_{v_1 \in V_s, v_2 \in V_t} \text{cost}(v_1, v_2)

min-cut problem is finding the partition that minimizes the cost of the graph cut. See algorithms textbooks for details.

Reduction of MAP inference on a particular class of MRFs to the min-cut problem

See Metric MRFs model box 4.D p. 127 in Koller and Friedman

Consider an MRF over binary variables with pairwise factors in which the edge energies (i.e., negative log-edge factors) take the form:

E_{uv}(x_u, x_v) = 0 \text{ if } x_u = x_v, \quad \lambda_{uv} \text{ if } x_u \ne x_v

where \lambda_{uv} is the cost of an edge mismatch.

Each node has a unary potential described by an energy function E_u(x_u), normalized by subtracting \min_{x_u} E_u(x_u) so that E_u \geq 0 with either E_u(1) = 0 or E_u(0) = 0.

p(x) = \frac{1}{Z}\exp\bigg[-\sum_u E_u(x_u) - \sum_{(u,v) \in \mathcal{E}} E_{uv}(x_u, x_v)\bigg]

Motivation comes from image segmentation: reduce discordance between adjacent variables.

Formulate as a min-cut problem in augmented graph: add special source and sink nodes s,ts, t

  • s represents the object (assignment of 0) and t the background (assignment of 1)
  • remember that either E_u(1) = 0 or E_u(0) = 0, since we normalized
  • node s is connected to the nodes u with E_u(0) = 0 by an edge with weight E_u(1)
  • node t is connected to the nodes v with E_v(1) = 0 by an edge with weight E_v(0)
  • thus every node is connected to either the source or the sink
  • an edge (u, v) only contributes to the cost of the cut if u and v are on opposite sides, and its contribution is then \lambda_{uv}

The cost of the cut (unary potentials and edge contributions) is precisely the energy of the assignment.

Segmentation task in a 2x2 MRF as a graph cut problem:

mincut
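A tiny sketch of the construction for two pixels u and v, using networkx's min-cut routine; the energies below are assumed for illustration:

```python
import networkx as nx

# Sketch: 2-node version of the augmented graph above.
# Assumed energies: E_u(0)=0, E_u(1)=3 (u prefers 0); E_v(1)=0, E_v(0)=2 (v prefers 1); lambda_uv = 1.
G = nx.DiGraph()
G.add_edge('s', 'u', capacity=3)   # s -- u with weight E_u(1), since E_u(0) = 0
G.add_edge('v', 't', capacity=2)   # v -- t with weight E_v(0), since E_v(1) = 0
G.add_edge('u', 'v', capacity=1)   # pairwise mismatch cost lambda_uv (both directions)
G.add_edge('v', 'u', capacity=1)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, 's', 't')
print(cut_value)      # 1: the energy of the MAP assignment (u=0, v=1)
print(source_side)    # nodes on the source side get label 0, the rest label 1
```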

Refer to Koller and Friedman textbook for more general models with submodular edge potentials.

Linear programming approach

Graph-cut only applicable in restricted classes of MRFs.

Linear Programming (LP)

Linear programming (a.k.a. linear optimization) refers to problems of the form:

\min_x c \cdot x \text{ subject to } Ax \leq b

with x, c \in \mathbb{R}^{n}, b \in \mathbb{R}^m and A \in \mathbb{R}^{m \times n}.

  • Has been extensively studied since the 1930s
  • Major breakthrough of applied mathematics in the 1980s was the development of polynomial-time algorithms for linear programming
  • Practical tools like CPLEX that can solve very large LP instances (100,000 variables or more) in reasonable time.

Integer Linear Programming (ILP)

  • extension of linear programming which also requires x{0,1}nx\in\{0,1\}^n

  • NP-hard in general

  • Nonetheless, many heuristics exist such as rounding:

    • relaxed constraint 0x10\leq x\leq 1 then round solution

    • works surprisingly well in practice and has theoretical guarantees for some classes of ILPs

Formulating MAP inference as ILP

Introduce two types of indicator variables:

  • μi(xi)\mu_i(x_i) for each iVi\in V and state xix_i
  • μij(xi,xj)\mu_{ij}(x_i, x_j) for each edge (i,j)E(i,j)\in \mathcal{E} and pair of states xi,xjx_i, x_j

MAP objective becomes:

maxμiVxiθi(xi)μi(xi)+i,jExi,xjθij(xi,xj)μij(xi,xj)\max_{\mu} \sum_{i\in V}\sum_{x_i} \theta_i(x_i)\mu_i(x_i) + \sum_{i,j\in E}\sum_{x_i, x_j}\theta_{ij}(x_i, x_j)\mu_{ij}(x_i, x_j)

(the function \theta represents \log \phi)

With constraints:

\mu_i(x_i) \in \{0, 1\} \;\; \forall i, x_i

\sum_{x_i} \mu_i(x_i) = 1 \;\; \forall i

\mu_{ij}(x_i, x_j) \in \{0, 1\} \;\; \forall (i,j) \in E,\; x_i, x_j

\sum_{x_i, x_j} \mu_{ij}(x_i, x_j) = 1 \;\; \forall (i,j) \in E

Assignments must also be consistent:

\sum_{x_i} \mu_{ij}(x_i, x_j) = \mu_j(x_j) \;\; \forall (i,j) \in E,\; x_j

\sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i) \;\; \forall (i,j) \in E,\; x_i

(we have to optimize over the edges too because an edge between two variables at given states contributes to the cost)

  • the solution of this integer linear program equals the MAP assignment
  • it is NP-hard, but we can transform it into an LP via relaxation
  • in general, this method gives only approximate solutions
  • for tree-structured graphs, the relaxation is guaranteed to return integer solutions, which are optimal (see the Koller and Friedman textbook for a proof)
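As a sketch of what the relaxation looks like in practice, here is the relaxed LP for a single binary edge (i, j) with made-up log-potentials, solved with scipy; because this graph is a tree (one edge), the solution comes out integral:

```python
import numpy as np
from scipy.optimize import linprog

# Sketch: LP relaxation of MAP inference for one binary edge (i, j); assumed theta values.
theta_i = np.array([0.0, 1.0])                    # theta_i(x_i)
theta_j = np.array([0.5, 0.0])                    # theta_j(x_j)
theta_ij = np.array([[2.0, 0.0], [0.0, 2.0]])     # theta_ij(x_i, x_j), rewards agreement

# Variable order: mu_i(0), mu_i(1), mu_j(0), mu_j(1), mu_ij(00), mu_ij(01), mu_ij(10), mu_ij(11)
c = -np.concatenate([theta_i, theta_j, theta_ij.ravel()])   # linprog minimizes, so negate

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],     # sum_{x_i} mu_i(x_i) = 1
    [0, 0, 1, 1, 0, 0, 0, 0],     # sum_{x_j} mu_j(x_j) = 1
    [-1, 0, 0, 0, 1, 1, 0, 0],    # sum_{x_j} mu_ij(0, x_j) = mu_i(0)
    [0, -1, 0, 0, 0, 0, 1, 1],    # sum_{x_j} mu_ij(1, x_j) = mu_i(1)
    [0, 0, -1, 0, 1, 0, 1, 0],    # sum_{x_i} mu_ij(x_i, 0) = mu_j(0)
    [0, 0, 0, -1, 0, 1, 0, 1],    # sum_{x_i} mu_ij(x_i, 1) = mu_j(1)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
print(res.x)        # integral mu here, i.e. the MAP assignment (x_i = x_j = 1)
print(-res.fun)     # value of the MAP objective
```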

Dual decomposition

Suppose MRF of the form:

maxxiVθi(xi)+fFθf(xf)\max_{x} \sum_{i\in V}\theta_i(x_i) + \sum_{f\in F}\theta_f(x_f)

Where FF denote arbitrary factors (e.g. edge potentials in pairwise MRF)

Consider decoupled objective:

iVmaxxiθi(xi)+fFmaxxfθf(xf)\sum_{i\in V}\max_{x_i} \theta_i(x_i) + \sum_{f\in F}\max_{x^f} \theta_f (x^f)

The decoupled objective is an upper bound on the original one; to make it tight, we should encourage consistency between the two sets of maximizers:

x_i^f - x_i = 0 \;\; \forall f, \forall i \in f

(i.e. the variable assignments chosen for a factor should be the same as those chosen for the corresponding nodes)

Lagrangian of the constrained problem (xfx^f and xx are assignments of the variable):

L(\delta, x^f, x) = \sum_{i \in V}\theta_i(x_i) + \sum_{f \in F}\theta_f(x^f) + \sum_{f \in F}\sum_{i \in f}\sum_{x_i'} \underbrace{\delta_{fi}(x_i')}_{\text{Lagrange multiplier}}\big( \mathbb{I}_{x_i' = x_i} - \mathbb{I}_{x_i' = x_i^f}\big)

\forall \delta: L(\delta) := \max_{x^f, x} L(\delta, x^f, x) \geq p^* (for any choice of \delta, L(\delta) upper-bounds the optimal value p^* of the original objective; minimizing over \delta tightens the bound)

We can reparameterize the Lagrangian as:

L(\delta) = \sum_{i \in V} \max_{x_i} \left( \theta_i(x_i) + \sum_{f: i \in f} \delta_{fi}(x_i) \right) + \sum_{f \in F} \max_{x^f} \left( \theta_f(x^f) - \sum_{i \in f} \delta_{fi}(x_i^f) \right)

= \sum_{i \in V} \max_{x_i} \bar\theta_{i}^{\delta}(x_i) + \sum_{f \in F} \max_{x^f} \bar\theta_{f}^{\delta}(x^f)

Suppose we can find dual variables δˉ\bar \delta such that the local maximizers of θˉiδˉ(xi)\bar \theta_{i}^{\bar \delta} (x_i) and θˉfδˉ(xf)\bar \theta_{f}^{\bar \delta} (x^f) agree; in other words, we can find a xˉ\bar x such that xˉiargmaxxiθˉiδˉ(xi)\bar x_i \in \arg\max_{x_i} \bar \theta_{i}^{\bar \delta} (x_i) and xˉfargmaxxfθˉfδˉ(xf)\bar x^f \in \arg\max_{x^f} \bar \theta_{f}^{\bar \delta} (x^f). Then we have that:

L(δˉ)=iVθˉiδˉ(xˉi)+fFθˉfδˉ(xˉf)=iVθi(xˉi)+fFθf(xˉf).L(\bar \delta) = \sum_{i \in V} \bar \theta_{i}^{\bar\delta} (\bar x_i) + \sum_{f \in F} \bar \theta_{f}^{\bar\delta} (\bar x^f) = \sum_{i \in V} \theta_i (\bar x_i) + \sum_{f \in F} \theta_f (\bar x^f).

The second equality follows because the terms involving the Lagrange multipliers cancel out when x and x^f agree.

On the other hand, by definition of pp^*:

iVθi(xˉi)+fFθf(xˉf)pL(δˉ)\sum_{i \in V} \theta_i (\bar x_i) + \sum_{f \in F} \theta_f (\bar x^f) \leq p^* \leq L(\bar\delta)

which implies that L(δˉ)=pL(\bar\delta) = p^*.

Thus:

  • bound given by Lagrangian can be made tight for the right choice of δ\delta
  • we can compute pp^* by finding a δ\delta at which the local sub-problems agree with each other

Minimizing the objective

L(\delta) is continuous and convex (a point-wise max of affine functions, see EE227C), so we may minimize it using subgradient descent or block coordinate descent (faster). The objective is not strongly convex, however, so block coordinate descent is not guaranteed to reach the global minimum.

Recovering MAP assignment

As shown above, if a solution x,xfx,x^f agrees for some δ\delta, it is optimal.

If each \bar\theta_i^{\delta^*} has a unique maximizer, the problem is decodable: we simply read off the maximizing values. If some variables do not have a unique maximizer, we fix the variables that can be uniquely decoded to their optimal values and use exact inference to find the values of the remaining variables (NP-hard in general, but usually not a big problem in practice).

How can we decouple variables from each other? Isn't that impossible due to the edges costs?

Local search

Start with arbitrary assignment and perform "moves" on the joint assignment that locally increases the probability. No guarantees but prior knowledge makes effective moves.

Branch and bound

Exhaustive search over the space of assignments, while pruning branches that can be provably shown not to contain a MAP assignment (like backtracking ?). LP relaxation or its dual can be used to obtain upper bounds and prune trees.

Simulated annealing

Read more about this

Sampling methods (e.g. Metropolis-Hastings) to sample form:

pt(x)exp(1tcCθc(xc))p_t(x)\propto \exp (\frac{1}{t}\sum_{c\in C}\theta_c(x_c))

tt is called the temperature:

  • As tt\to \infty, ptp_t is close to the uniform distribution, which is easy to sample from
  • As t0t\to 0, ptp_t places more weight on argmaxxcCθc(xc)\arg \max_x \sum_{c\in C}\theta_c(x_c) (quantity we want to recover). However, since the distribution is highly peaked, it is difficult to sample from.

Idea of simulated annealing is to run sampling algorithm starting with high tt and gradually decrease it. If "cooling rate" is sufficiently slow (requires lots of tuning), we are guaranteed to find the mode of the distribution.

Sampling methods

Interesting classes of models may not admit exact polynomial-time solutions at all.

Two families of approximate algorithms:

  • variational methods (take their name from calculus of variations = optimizing functions that take other functions as arguments): formulate inference as an optimization problem
  • sampling methods: main way of performing approximate inference over the past 15 years before variational methods emerged as viable and superior alternatives

Forward (aka ancestral) sampling

Bayesian network with multinomial variables. Samples variables in topological order:

  • start by sampling variables with no parents
  • then sample from the next generation by conditioning on the values sampled before
  • proceed until all nn variables have been sampled

linear O(n)\mathcal{O}(n) time by taking exactly 1 multinomial sample from each CPD

Assume a multinomial distribution with k outcomes and associated probabilities \theta_1, \dots, \theta_k. Subdivide the unit interval into k regions of size \theta_i, 1 \leq i \leq k, and sample uniformly from [0, 1]:

multinomial-sampling
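A short numpy sketch of this trick, with assumed probabilities \theta:

```python
import numpy as np

# Sketch of the unit-interval trick: subdivide [0, 1] by the cumulative probabilities
# and locate each uniform draw with binary search.
theta = np.array([0.2, 0.5, 0.3])            # assumed multinomial parameters
cdf = np.cumsum(theta)

u = np.random.rand(10000)
samples = np.searchsorted(cdf, u)            # index of the region each u falls into
print(np.bincount(samples) / len(samples))   # empirical frequencies, close to theta
```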

Forward sampling can also be performed on undirected models if the model can be represented by a clique tree with a small number of variables per node:

  • calibrate the clique tree; this gives us the marginal distribution over each node, and we choose one node to be the root
  • marginalize over the variables in the root node to get the marginal for a single variable
  • sample the remaining variables in the node one at a time, each time incorporating the newly sampled values as evidence, e.g. x_3 \sim p(X_3 = x_3 \vert X_1 = x_1, X_2 = x_2)
  • move down the tree to sample from the other nodes, sending updated messages containing the values of the sampled variables.

Monte Carlo estimation

Name refers to famous casino in Monaco. Term was coined as codeword by physicists working on the atomic bomb as part of the Manhattan project.

Constructs solutions based on large number of samples. E.g. consider the following integral:

Exp[f(x)]IT=1Tt=1Tf(xt)\mathbb{E}_{x\sim p}[f(x)]\approx I_T = \frac{1}{T}\sum_{t=1}^T f(x^t)

Since samples are i.i.d., MC estimate is unbiased and variance is inversely proportional to TT.

Rejection sampling

Special case of Monte Carlo integration. Compute area of a region RR by sampling in a larger region with known area and recording the fraction of samples that falls within RR.

E.g., Bayesian network over set of variables X=ZEX=Z\cup E. We use rejection sampling to compute marginal probabilities p(E=e)p(E=e):

p(E=e)=zp(Z=z,E=e)=xp(x)I(E=e)=Exp[I(E=e)]p(E=e)=\sum_z p(Z=z, E=e)=\sum_x p(x)\mathbb{I}(E=e)=\mathbb{E}_{x\sim p}\big[\mathbb{I}(E=e)\big]

and then take the Monte Carlo approximation.

Importance sampling

See Art Owen's lecture notes about importance sampling

Suppose we want to compute μ=E[f(X)]\mu = \mathbb{E}[f(X)] where f(x)f(x) is nearly zero outside a region AA for which P(XA)P(X\in A) is small. A plain Monte Carlo sample would be very wasteful and could fail to have even one point inside the region AA.

Let qq be a known probability density function (importance sampling distribution):

\mu = \int f(x)p(x)dx = \int \frac{f(x)p(x)}{q(x)} q(x)dx = \mathbb{E}_q\bigg(\frac{f(X)p(X)}{q(X)}\bigg)

The importance sampling estimate of \mu is:

\hat{\mu}_q = \frac{1}{n} \sum_{i=1}^n \frac{f(X_i)p(X_i)}{q(X_i)}, \quad X_i \sim q

assuming we can compute fp/q.

We have \mathbb{E}_q(\hat{\mu}_q) = \mu (unbiased estimate) and \text{Var}_q(\hat{\mu}_q) = \frac{\sigma_q^2}{n},

where \sigma_q^2 = \int \frac{(f(x)p(x))^2}{q(x)}dx - \mu^2 = \int \frac{(f(x)p(x) - \mu q(x))^2}{q(x)}dx

Therefore, the numerator is small when qq is nearly proportional to fpfp. Small values of qq greatly magnify whatever lack of proportionality appears in the numerator. It is good for q to have spikes in the same places that fpfp does.

Taking q = p recovers plain Monte Carlo.
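A sketch of the classic rare-event setting, estimating P(X > 4) for X \sim \mathcal{N}(0,1) with an assumed proposal q = \mathcal{N}(4,1):

```python
import numpy as np
from scipy.stats import norm

# Sketch: rare-event estimation with importance sampling versus plain Monte Carlo.
n = 100_000
rng = np.random.default_rng(0)

# Plain Monte Carlo: most runs see no sample at all beyond 4.
x = rng.standard_normal(n)
print((x > 4).mean())

# Importance sampling: sample from q = N(4, 1), reweight by p(x)/q(x).
xq = rng.normal(loc=4.0, scale=1.0, size=n)
w = norm.pdf(xq) / norm.pdf(xq, loc=4.0, scale=1.0)
print(np.mean((xq > 4) * w))     # close to norm.sf(4), about 3.2e-5
```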

Normalized importance sampling

When estimating a ratio of two probabilities (e.g. a conditional probability), sampling each probability separately lets the errors compound, and the variance can be high. If we use the same samples to evaluate both the numerator and the denominator, the estimator is biased (though asymptotically unbiased), and we avoid the issue of compounding errors.

See notes on website

Markov Chain Monte Carlo

Developed during the Manhattan project. Considered one of the 10 most important algorithms of the 20th century.

Used to perform marginal and MAP inference as opposed to computing expectations.

We construct a Markov chain whose states are joint assignments of the variables and whose stationary distribution equals the model probability pp.

  1. Run Markov chain from initial state for BB burn-in steps (number of steps needed to converge to stationary distribution, see mixing time)

  2. Run Markov chain for NN sampling steps

We produce Monte Carlo estimates of marginal probabilities. We then take the sample with highest probability to perform MAP inference.

Two algorithms:

  • Metropolis-Hastings
  • Gibbs sampling (special case of Metropolis-Hastings)

For distributions that have narrow modes, the algorithm will sample from a given mode for a long time with high probability. Therefore, convergence will be slow.
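A minimal Metropolis-Hastings sketch on an assumed bimodal target, illustrating how a chain with a local proposal tends to stay in one narrow mode:

```python
import numpy as np

# Sketch: Metropolis-Hastings with a Gaussian random-walk proposal on an assumed
# bimodal unnormalized density; well-separated narrow modes mix slowly.
def p_tilde(x):
    return np.exp(-(x - 3) ** 2 / 0.1) + np.exp(-(x + 3) ** 2 / 0.1)

rng = np.random.default_rng(0)
x, samples = 0.0, []
for _ in range(50_000):
    prop = x + rng.normal(scale=0.5)                # symmetric proposal
    if rng.random() < p_tilde(prop) / p_tilde(x):   # accept with probability min(1, ratio)
        x = prop
    samples.append(x)

# Fraction of time spent near the positive mode: typically close to 0 or 1,
# because the chain rarely crosses the low-probability region between modes.
print(np.mean(np.array(samples) > 0))
```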

See IEOR263A lecture notes

See Markov Chain Monte Carlo without detailed balance

Variational Inference

Inference as an optimization problem: given intractable distribution pp and class of tractable distributions Q\mathcal{Q}, find qQq\in \mathcal{Q} that is most similar to pp.

Unlike sampling-based methods:

  • variational approaches will not find globally optimal solution
  • but we always know if they have converged and even have bounds on their accuracy
  • scale better and more amenable to techniques like stochastic gradient optimization, parallelization over multiple processors, acceleration using GPUs

Kullback-Leibler divergence

Need to choose approximation family Q\mathcal{Q} and optimization objective J(q)J(q) that captures similarity between qq and pp. Information theory provides us with Kullback-Leibler (KL) divergence.

For two distributions with discrete support:

KL(q \Vert p) = \sum_{x} q(x) \log \frac{q(x)}{p(x)}

with properties:

KL(q \Vert p) \geq 0 \;\; \forall q, p

KL(q \Vert p) = 0 \;\text{ iff }\; q = p

It is however asymmetric: KL(qp)KL(pq)KL(q\Vert p) \ne KL(p \Vert q). That is why it is called a divergence and not a distance.
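A two-line sketch of the discrete KL divergence and its asymmetry (the distributions below are arbitrary):

```python
import numpy as np

# Sketch: KL divergence between two discrete distributions; note the asymmetry.
def kl(q, p):
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(np.where(q > 0, q * np.log(q / p), 0.0)))

q = [0.5, 0.5]
p = [0.9, 0.1]
print(kl(q, p), kl(p, q))   # roughly 0.51 vs 0.37: KL(q||p) != KL(p||q)
```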

Variational lower bound

Use unnormalized distribution p~\tilde{p} instead of pp because evaluating KL(qp)KL(q\rVert p ) is intractable because of normalization constant.

Important property:

J(q) = \sum_x q(x) \log \frac{q(x)}{\tilde{p}(x)} = \sum_x q(x) \log \frac{q(x)}{p(x)} - \log Z(\theta) = KL(q \Vert p) - \log Z(\theta)

Therefore \log Z(\theta) = \underbrace{KL(q \Vert p)}_{\geq 0} - J(q) \geq -J(q)

-J(q) is called the variational lower bound or evidence lower bound (ELBO), and is often written in the form:

\log Z(\theta) \geq \mathbb{E}_{q(x)}[\log \tilde{p}(x) - \log q(x)]

E.g.: If we are trying to compute marginal probability p(xD)=p(x,D)/p(D)p(x\vert D) = p(x,D)/p(D), minimizing J(q)J(q) amounts to maximizing a lower bound on log-likelihood logp(D)\log p(D) of the observed data.

By maximizing evidence-lower bound, we are minimizing KL(qp)KL(q\rVert p) by squeezing it between J(q)-J(q) and logZ(θ)\log Z(\theta).

KL(qp)KL(q\Vert p) or KL(pq)KL(p\Vert q)?

Computationally, computing KL(pq)KL(p\Vert q) involves an expectation with respect to pp which is typically intractable to evaluate

KL(q \Vert p) is called the I-projection (information projection) and is infinite when p(x) = 0 and q(x) > 0. Therefore, if p(x) = 0 we must have q(x) = 0. We say that KL(q \Vert p) is zero-forcing for q and that it under-estimates the support of p. It is called the exclusive KL divergence.

KL(p \Vert q) is called the M-projection (moment projection) and is infinite if q(x) = 0 and p(x) > 0; thus if p(x) > 0 we must have q(x) > 0. KL(p \Vert q) is zero-avoiding for q and it over-estimates the support of p. It is called the inclusive KL divergence.

E.g.: fitting unimodal approximating distribution (red) to multimodal (blue). KL(pq)KL(p\Vert q) leads to a) and KL(qp)KL(q \Vert p) leads to b) and c).

kldiv

Mean-field inference

How to choose approximating family Q\mathcal{Q}?

One of most widely used classes is set of fully-factored q(x)=q1(x1)q2(x2)qn(xn)q(x)=q_1(x_1)q_2(x_2)\dots q_n(x_n). Each qiq_i is a categorical distribution over a one-dimensional discrete variable. This is called mean-field inference and consists in solving:

minq1,,qnJ(q)\min_{q_1, \dots, q_n} J(q)

via coordinate descent over the qjq_j.

For one coordinate, the optimization has a closed form solution:

\log q_j(x_j) \leftarrow \mathbb{E}_{q_{-j}}[\log \tilde{p}(x)] + \text{normalization constant}

To compute \tilde{p}(x) we only need the factors belonging to the Markov blanket of x_j. If the variables are discrete with K possible values, and there are F factors and N variables in the Markov blanket of x_j, then computing the expectation takes \mathcal{O}(K F K^N) time (for each value of x_j we sum over all K^N assignments of the N variables and, in each case, sum over the F factors).
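A sketch of these coordinate updates for a small pairwise MRF with assumed log-potentials; each update only involves factors in the Markov blanket of the variable being updated:

```python
import numpy as np

# Sketch: mean-field coordinate ascent on a 3-variable chain MRF with assumed
# unary log-potentials theta and a shared pairwise log-potential on each edge.
k, edges = 2, [(0, 1), (1, 2)]
theta = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]])
theta_pair = np.array([[1.0, -1.0], [-1.0, 1.0]])     # rewards agreement

q = np.full((3, k), 1.0 / k)                           # fully factored initialization
for _ in range(50):
    for j in range(3):
        logits = theta[j].copy()
        for (a, b) in edges:
            if j == a:
                logits += q[b] @ theta_pair.T          # E_{q_b}[theta_pair(x_j, x_b)]
            elif j == b:
                logits += q[a] @ theta_pair            # E_{q_a}[theta_pair(x_a, x_j)]
        q[j] = np.exp(logits - logits.max())           # normalize the update
        q[j] /= q[j].sum()

print(q)   # approximate marginals q_j(x_j)
```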

Learning in directed models

Two different learning tasks:

  • parameter learning, graph structure is known and we want to estimate the factors
  • structure learning, estimate the graph

Possible use cases:

  • density estimation: want full distribution so that we can compute conditional probabilities
  • specific prediction task
  • structure or knowledge discovery (interested in the model itself)

Maximum likelihood

Want to construct pp as close as possible to pp^* in order to perform density estimation. Use KL divergence:

KL(pp)=xp(x)logp(x)p(x)=H(p)Exp[logp(x)]KL(p^* \Vert p) = \sum_x p^*(x) \log \frac{p^*(x)}{p(x)}=-H(p^*)-\mathbb{E}_{x\sim p^*}[\log p(x)]

Hence minimizing KL divergence is equivalent to maximizing expected log-likelihood Exp[logp(x)]\mathbb{E}_{x\sim p^*}[\log p(x)] (pp must assign high probability to instances sampled from pp^*)

Since we do not know pp^*, we use a Monte-Carlo estimate of the expected log-likelihood and maximum likelihood learning is defined as:

maxpMfamily of models1DxDlogp(x)\max_{p\in \underbrace{M}_{\text{family of models}}} \frac{1}{\lvert D \rvert} \sum_{x\in D} \log p(x)

Likelihood, Loss and Risk

Minimize expected loss or risk: ExpL(x,p)\mathbb{E}_{x \sim p^*} L(x,p)

Loss that corresponds to maximum likelihood is the log loss: logp(x)- \log p(x)

For CRFs we use conditional log likelihood logp(yx)- \log p(y\vert x)

For prediction, we can use classification error E(x,y)p[I{yy:p(yx)p(yx)}]\mathbb{E}_{(x,y)\sim p^*}[\mathbb{I}\{\exists y' \ne y: p(y'\vert x)\geq p(y\vert x)\}] (probability of predicting the wrong assignment).

A better choice might be the hamming loss (the fraction of variables whose MAP assignment differs from the ground truth).

There is work on generalizing the hinge loss (from SVMs) to CRFs, which leads to structured support vector machines.

Maximum likelihood learning in Bayesian networks

Given a Bayesian network p(x) = \prod_{i=1}^n \theta_{x_i \vert x_{\text{parents}(i)}} and i.i.d. samples D = \{x^{(1)}, \dots, x^{(m)}\} (the parameters \theta are the conditional probabilities):

The likelihood is L(\theta, D) = \prod_{i=1}^n \prod_{j=1}^m \theta_{x_i^{(j)} \vert x_{\text{parents}(i)}^{(j)}}

Taking the log and grouping identical values: \log L(\theta, D) = \sum_{i=1}^n \sum_{x_{\text{parents}(i)}} \sum_{x_i} \lvert (x_i, x_{\text{parents}(i)}) \rvert \cdot \log \theta_{x_i \vert x_{\text{parents}(i)}}, where \lvert (x_i, x_{\text{parents}(i)}) \rvert denotes the number of samples in which that joint assignment occurs.

Thus: \theta_{x_i \vert x_{\text{parents}(i)}}^* = \frac{\lvert (x_i, x_{\text{parents}(i)}) \rvert}{\lvert x_{\text{parents}(i)} \rvert}

Maximum-likelihood estimate has closed-form solution: simplicity of learning is one of the most convenient features of Bayesian networks.
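A sketch of this counting estimate for a single CPD, with assumed binary data:

```python
from collections import Counter

# Sketch: closed-form MLE for p(x_i | x_parent) by counting co-occurrences.
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0)]   # assumed (x_parent, x_i) samples

joint = Counter(data)                      # |(x_i, x_parent)|
parent = Counter(xp for xp, _ in data)     # |x_parent|

theta = {(xp, xi): joint[(xp, xi)] / parent[xp] for (xp, xi) in joint}
print(theta)   # theta[(x_parent, x_i)] is the estimate of p(x_i | x_parent)
```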

Learning in undirected models

  • higher expressivity, thus more difficult to deal with
  • maximum likelihood learning can be reduced to repeatedly performing inference

Learning Markov Random Fields

$p(x_1, \dots, x_n) = \frac{1}{Z}\prod_{c\in C}\phi_c (x_c) = \frac{1}{Z}\exp \big( \sum_{c\in C} \sum_{x_c'} \mathbb{I}\{x_c' = x_c\} \log \phi_c (x_c') \big) = \frac{\exp(\theta^T f(x))}{Z(\theta)}$

where $f(x)$ is a vector of indicator functions and $\theta$ is the set of all model parameters, defined by the $\log \phi_c (x_c')$.

The partition function $Z(\theta)$ equals one for Bayesian networks, but MRFs do not make this assumption.

Exponential families

See Stanford CS229 course

  • exponential families are log-concave in their natural parameters $\theta$
  • $f(x)$ is called the vector of sufficient statistics. E.g. if $p$ is Gaussian, $f(x)$ contains the mean and variance
  • exponential families make the fewest unnecessary assumptions about the data distribution. Formally, the distribution maximizing the entropy $H(p)$ under the constraint $\mathbb{E}_p [\phi(x)] = \alpha$ is in the exponential family.
  • admit conjugate priors which makes them applicable in variational inference

Maximum likelihood learning of MRFs

Since we are working with an exponential family, the log-likelihood is concave.

The log-likelihood is:

$\frac{1}{\lvert D \rvert} \log p(D; \theta) = \frac{1}{\lvert D\rvert}\sum_{x\in D} \theta^T f(x) - \log Z(\theta) = \frac{1}{\lvert D\rvert}\sum_{x\in D} \theta^T f(x) - \log \sum_x \exp(\theta^T f(x))$

The first term decomposes over the data points $x$; the second does not, and it is hard to evaluate.

Computing the gradient of $\log Z(\theta)$ requires inference with respect to $p$, which is intractable in general:

$\nabla_\theta \log Z(\theta) = \nabla_\theta Z(\theta) / Z(\theta) = \sum_x f(x) \exp (\theta^T f(x)) / Z(\theta) = \sum_x f(x)p(x) = \mathbb{E}_{x\sim p}[f(x)]$

Moreover, the Hessian is:

$\nabla_\theta^2 \log Z(\theta) = \mathbb{E}_{x\sim p} [f(x)f(x)^T] - \mathbb{E}_{x\sim p}[f(x)]\mathbb{E}_{x\sim p}[f(x)^T] = \text{cov}[f(x)]$

Covariance matrices are always positive semi-definite, which is why $\log Z(\theta)$ is convex.

Usually non-convexity is what makes optimization intractable but in this case it is the computation of the gradient.

Approximate learning techniques

Maximum-likelihood learning reduces to repeatedly using inference to compute the gradient and then updating the model weights with a gradient step.

  • Gibbs sampling from the model distribution at each step of the gradient update; then approximate the gradient using Monte-Carlo (see the sketch after this list)
  • persistent contrastive divergence (used to train Restricted Boltzmann Machines; see link), which re-uses the same Markov chain between iterations (since the model has changed very little)
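A rough sketch of this scheme, assuming a hypothetical `features` callable that computes $f(x)$ and a hypothetical `gibbs_sampler` that draws approximate samples from the current model $p(x; \theta)$:

```python
import numpy as np

# Approximate maximum-likelihood learning for p(x) ∝ exp(theta^T f(x)):
# the intractable model expectation E_p[f(x)] in the gradient is replaced
# by a Monte-Carlo average over (approximate) Gibbs samples.
def mle_gradient_ascent(data, features, gibbs_sampler, dim,
                        lr=0.1, steps=100, n_samples=50):
    theta = np.zeros(dim)
    data_expectation = np.mean([features(x) for x in data], axis=0)   # fixed data term
    for _ in range(steps):
        samples = gibbs_sampler(theta, n_samples)                     # x ~ p(x; theta), approximately
        model_expectation = np.mean([features(x) for x in samples], axis=0)
        theta += lr * (data_expectation - model_expectation)          # gradient of the avg log-likelihood
    return theta
```

Persistent contrastive divergence corresponds to keeping the Gibbs chain alive across iterations instead of restarting it at every step.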

Pseudo-likelihood

For each example $x$, the pseudo-likelihood makes the approximation $\log p(x; \theta) \approx \sum_i \log p(x_i\vert x_{N(i)}; \theta)$, where $x_i$ is the $i$-th variable in $x$ and $N(i)$ is the Markov blanket of $i$ (i.e. its neighbors). Since each term involves a single variable, we only need to sum over the values of one variable to get its partition function (tractable).

The pseudo-likelihood objective assumes that $x_i$ depends mainly on its neighbors in the graph.

The pseudo-likelihood estimator is consistent: it recovers the same parameters as the true likelihood as the number of data points increases.
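A minimal sketch of the pseudo-log-likelihood of one example for a pairwise discrete MRF with $K$ states; the tables `log_unary[i]` and `log_pair[(i, j)]` (rows indexed by $x_i$) and the `neighbors` map are hypothetical inputs:

```python
import numpy as np

# Pseudo-log-likelihood: sum_i log p(x_i | x_{N(i)}); each local partition function
# only requires a sum over the K values of a single variable.
def pseudo_log_likelihood(x, log_unary, log_pair, neighbors):
    total = 0.0
    for i, xi in enumerate(x):
        scores = log_unary[i].copy()          # unnormalized log p(x_i = k | x_{N(i)}) for every k
        for j in neighbors[i]:
            scores += log_pair[(i, j)][:, x[j]]
        log_Z_i = scores.max() + np.log(np.exp(scores - scores.max()).sum())
        total += scores[xi] - log_Z_i
    return total
```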

Moment matching

Recall the log-likelihood of an MRF: $\frac{1}{\lvert D\rvert} \log p(D; \theta) = \frac{1}{\lvert D\rvert}\sum_{x\in D} \theta^T f(x) - \log Z(\theta)$

Taking the gradient (see the intro of the chapter for details): $\frac{1}{\lvert D\rvert}\sum_{x\in D}f(x) - \mathbb{E}_{x\sim p}[f(x)]$

This is precisely the difference between the expectations of the sufficient statistics under the empirical (data) distribution and under the model distribution.

$f$ is a vector of indicator functions for the variables of a clique: one entry equals $\mathbb{I}(x_c = \bar{x_c})$ for some $x_c, \bar{x_c}$.

The log-likelihood objective forces model marginals to match empirical marginals. This property is called moment matching.

When minimizing the inclusive KL-divergence $KL(p\Vert q)$, the minimizer $q^*$ will match the moments of the sufficient statistics to the corresponding moments of $p$.

The MLE estimate is the minimizer of $KL(\hat{p} \Vert q)$, where $\hat{p}$ is the empirical distribution of the data. In variational inference, minimization over $q$ in a smaller set of distributions $\mathcal{Q}$ is known as M-projection ("moment projection").

Learning in conditional random fields

$p(y\vert x) = \frac{1}{Z(x, \varphi)} \prod_{c\in C}\phi_c (y_c, x; \varphi)$

where $Z(x, \varphi) = \sum_{y_1, \dots, y_n} \prod_{c\in C}\phi_c(y_c, x; \varphi)$

We can reparameterize it as we did for MRFs: $p(y\vert x) = \frac{\exp (\theta^T f(x,y))}{Z(x, \theta)}$

The log-likelihood given a dataset $D$ is: $\frac{1}{\lvert D\rvert}\log p(D; \theta)=\frac{1}{\lvert D\rvert}\sum_{x,y\in D} \theta^T f(x,y) - \frac{1}{\lvert D\rvert}\sum_{x\in D} \log Z(x, \theta)$

There is a different partition function for each data point, since it depends on $x$.

The gradient is now $\frac{1}{\lvert D\rvert}\sum_{x,y\in D} f(x,y) - \frac{1}{\lvert D\rvert}\sum_{x\in D}\mathbb{E}_{y\sim p(y\vert x)}[f(x,y)]$

And the Hessian is the covariance matrix $\text{cov}_{y\sim p(y\vert x)}[f(x,y)]$

The conditional log-likelihood is still concave, but computing the gradient now requires one inference step per training data point, so gradient ascent is more expensive than for MRFs.

One should try to limit the number of variables or make sure that the model's graph is not too densely connected.

A popular objective for training CRFs is the max-margin loss, a generalization of the objective for training SVMs. Models trained using this loss are called structured support vector machines or max-margin networks. This loss is more widely used in practice, as it only requires MAP inference rather than general (e.g. marginal) inference.

Learning in latent variable models (LVMs)

  • LVMs enable us to leverage prior knowledge. E.g. a language model of news articles: we know our set of news articles is a mixture of $K$ distinct distributions (one for each topic). Let $x$ be an article and $t$ a topic (unobserved variable); we may build a more accurate model $p(x\vert t)p(t)$. We now learn a separate $p(x\vert t)$ for each topic rather than trying to model everything with one $p(x)$. However, since $t$ is unobserved, we cannot use the previous learning methods.
  • LVMs also increase the expressive power of the model.

Formally, a latent variable model is a probability distribution $p$ over two sets of variables $x, z$:

$p(x,z;\theta)$

where $x$ is observed and $z$ is unobserved.

Example: Gaussian Mixture Models

see CS289 lecture notes on GMMs

Marginal likelihood training

Maximize the marginal log-likelihood of the data:

$\log p(D) = \sum_{x\in D} \log p(x) = \sum_{x\in D} \log \bigg( \sum_z p(x\vert z)p(z) \bigg)$

Whereas a single exponential family distribution has concave log-likelihood, the log of a weighted mixture of such distributions is no longer concave or convex.

Learning latent variable models

Expectation-Maximization algorithm

  • See CS289 lecture notes EM-algorithm
  • The class website notes use the fact that an exponential family is entirely described by its sufficient statistics to derive the optimal assignment of $\mu$ and $\Sigma$ in the GMM example.

EM as variational inference

Why does EM converge?

Consider the posterior inference problem for $p(z\mid x)$. We apply our variational inference framework by taking $p(x,z)$ to be the unnormalized distribution; in that case, $p(x)$ will be the normalization constant (maximizing it maximizes the likelihood).

Recall that variational inference maximizes the evidence lower bound (ELBO):

$\mathcal{L}(p,q) = \mathbb{E}_{q(z)} \left[ \log p(x,z; \theta) - \log q(z) \right]$

over distributions $q$. The ELBO satisfies the equation

$\log p(x; \theta) = KL(q(z) \Vert p(z\vert x; \theta)) + \mathcal{L}(p,q).$

Hence, $\mathcal{L}(p,q)$ is maximized when $q=p(z \mid x)$; in that case the KL term becomes zero and the lower bound is tight: $\log p(x; \theta) = \mathcal{L}(p,q).$

The EM algorithm can be seen as iteratively optimizing the ELBO over $q$ (at the E step) and over $\theta$ (at the M step).

Starting at some $\theta_t$, we compute the posterior $p(z\mid x; \theta_t)$ at the E step. We evaluate the ELBO for $q = p(z\mid x; \theta_t)$; this makes the ELBO tight:

$\log p(x; \theta_t) = \mathbb{E}_{p(z\vert x; \theta_t)} \log p(x,z; \theta_t) - \mathbb{E}_{p(z\vert x; \theta_t)} \log p(z\vert x; \theta_t)$

Next, we optimize the ELBO over $p$ (i.e. over $\theta$), holding $q$ fixed. We solve the problem

$\theta_{t+1} = \arg\max_\theta \mathbb{E}_{p(z\vert x; \theta_t)} \log p(x,z; \theta) - \mathbb{E}_{p(z\vert x; \theta_t)} \log p(z\vert x; \theta_t)$

This is precisely the optimization problem solved at the M step of EM (in the above equation, the second term is an additive constant independent of $\theta$).

Solving this problem increases the ELBO. However, since we fixed $q$ to $p(z\mid x; \theta_t)$, the ELBO evaluated at the new $\theta_{t+1}$ is no longer tight. But since the ELBO was equal to $\log p(x;\theta_t)$ before the optimization, we know that the true log-likelihood $\log p(x;\theta_{t+1})$ must have increased.

  • Every step increases the marginal likelihood $\log p(x;\theta_t)$, which is what we wanted to show.
  • Since the marginal likelihood is upper-bounded by its true global maximum, EM must eventually converge (however, the objective is non-convex, so there is no guarantee of finding the global optimum; the result depends heavily on the initial $\theta_0$). A concrete sketch of EM for a Gaussian mixture is given after this list.
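For concreteness, a minimal sketch of EM for a one-dimensional Gaussian mixture; the specific parameterization and initialization below are illustrative, not taken from the notes:

```python
import numpy as np

# EM for a K-component 1-D Gaussian mixture: the E step computes the posterior
# p(z | x; theta_t) (responsibilities), the M step maximizes the expected
# complete-data log-likelihood in closed form. x: 1-D numpy array of observations.
def em_gmm_1d(x, K, iters=50):
    n = len(x)
    rng = np.random.default_rng(0)
    pi = np.full(K, 1.0 / K)               # mixing weights p(z)
    mu = rng.choice(x, K, replace=False)   # component means
    var = np.full(K, np.var(x))            # component variances
    for _ in range(iters):
        # E step: responsibilities r[i, k] = p(z = k | x_i)
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M step: closed-form updates of pi, mu, var
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var
```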

Bayesian learning

Example of limits of maximum likelihood estimation:

  • we're not taking into account the confidence of our estimate given the sample size
  • no prior distribution over the parameters: e.g. out-of-vocabulary words will be assigned zero probability.

In Bayesian learning, both the observed variables $X$ and the parameters $\theta$ are random variables. We take uncertainty over the parameters into account.

The prior distribution $p(\theta)$ encodes our initial beliefs. The choice of prior is subjective.

Useful when we want to provide uncertainty estimates about model parameters or when we encounter out-of-sample data.

Conjugate priors

For some choices of the prior $p(\theta)$, the posterior $p(\theta \vert \mathcal{D})$ can be computed in closed form.

Suppose:

$P(\theta) = Beta(\theta\vert \underbrace{\alpha_H, \alpha_T}_{\text{controls shape of the Beta distribution}})$

Then:

$P(\theta\vert\mathcal{D}) = Beta(\theta\vert \alpha_H + N_H, \alpha_T + N_T)$

where $N_H$ and $N_T$ are the numbers of observed heads and tails. There is no need to compute $p(\theta\vert \mathcal{D}) = p(\mathcal{D} \vert \theta)p(\theta)/ p(\mathcal{D})$ directly, where the integral $p(\mathcal{D}) = \int_\theta p(\mathcal{D} \vert \theta) p(\theta)d\theta$ is intractable.

Beta distribution

$p(\theta) = Beta(\theta \vert \alpha_H, \alpha_T) = \frac{\theta^{\alpha_H- 1}(1-\theta)^{\alpha_T - 1}}{B(\alpha_H, \alpha_T)}$

where $B$ is a normalization constant.

If $\alpha_H > \alpha_T$, we believe heads are more likely:

$\mathbb{E}[\theta] = \frac{\alpha_H}{\alpha_H + \alpha_T}$

As $\alpha_H$ or $\alpha_T$ increase, the variance decreases, i.e. we are more certain about the value of $\theta$:

$Var(\theta) = \frac{\alpha_H\alpha_T}{(\alpha_H + \alpha_T)^2(\alpha_H+\alpha_T+1)}$

(Figure: Beta distribution for different values of $\alpha_H$ and $\alpha_T$.)

The Beta distribution family is a conjugate prior to the Bernoulli distribution family, i.e. if $p(\theta)$ is a Beta distribution and $p(X\vert \theta)$ is a Bernoulli distribution, then $p(\theta \vert \mathcal{D}=(X_1, \dots, X_N))$ is still a Beta distribution.
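A tiny sketch of the conjugate update described above; the function name and default prior pseudo-counts are illustrative:

```python
# Beta-Bernoulli conjugate update: observations are 1 (heads) or 0 (tails),
# alpha_h and alpha_t are the pseudo-counts of the Beta prior.
def beta_posterior(observations, alpha_h=2.0, alpha_t=2.0):
    n_h = sum(observations)
    n_t = len(observations) - n_h
    a, b = alpha_h + n_h, alpha_t + n_t            # posterior is Beta(theta | a, b)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, var

# e.g. beta_posterior([1, 1, 0, 1]) -> Beta(5, 3) posterior with mean 0.625
```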

Categorical Distribution

The parameter of the categorical distribution is:

$\theta = (\theta_1, \dots, \theta_K) := (P(X = 1), \dots, P(X=K))$

where $\theta$ sums to 1.

The Dirichlet distribution is the conjugate prior for the categorical distribution. The Dirichlet distribution is defined by a parameter $\alpha = (\alpha_1, \dots, \alpha_K)$ and its pdf is:

$P(\theta) = \text{Dirichlet}(\theta\vert \alpha) = \frac{1}{\underbrace{B(\alpha)}_{\text{normalization constant}}} \prod_{k=1}^K \theta_k^{\alpha_k-1}$

Used in topic modeling (e.g. latent Dirichlet allocation).

Limits of conjugate priors

  • restricts the kind of priors we can use
  • for more complex distributions, posterior can still be NP hard to compute

Practitioners should compare conjugate-prior approaches with other tools such as MCMC or variational inference.

Structure learning for Bayesian networks

Learn structure of Directed Acyclic Graph from data. Two approaches: score-based and constraint-based.

Score-based approach

Search over space of DAGs to maximize score metric.

Score metric: penalized log-likelihood

$\text{score}(G:D) = LL(G:D) - \phi(\lvert D\rvert)\lVert G\rVert$

  • $LL(G:D)$: log-likelihood of the data under graph structure $G$
  • the parameters in $G$ are estimated by MLE, and the log-likelihood score is calculated based on those parameters
  • if we only considered the log-likelihood, we would end up with a complete graph (overfitting). The second term penalizes over-complicated structures.
  • $\lvert D\rvert$ is the number of samples and $\lVert G\rVert$ is the number of parameters in $G$
  • For AIC, $\phi(t) = 1$; for BIC, $\phi(t) = \log(t)/2$ (for BIC, the influence of model complexity decreases as the number of samples grows, allowing the log-likelihood to dominate the score). A sketch of the BIC score for a fixed graph is given after this list.
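A sketch of the BIC-penalized score for a fixed DAG over discrete variables; `data` (an array with one row per sample), `parents` (each node's parent columns), and `card` (each node's cardinality) are hypothetical inputs:

```python
import numpy as np
from collections import Counter

# BIC score: maximum-likelihood log-likelihood minus (log m / 2) * number of free parameters.
def bic_score(data, parents, card):
    m, n = data.shape
    score = 0.0
    for i in range(n):
        joint, marg = Counter(), Counter()
        for row in data:
            pa = tuple(row[p] for p in parents[i])
            joint[(pa, row[i])] += 1
            marg[pa] += 1
        # log-likelihood term: sum of count * log(theta_MLE) over observed configurations
        score += sum(c * np.log(c / marg[pa]) for (pa, _), c in joint.items())
        # penalty term: number of free parameters of this CPT
        n_params = (card[i] - 1) * int(np.prod([card[p] for p in parents[i]]))
        score -= 0.5 * np.log(m) * n_params
    return score
```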

Score metric: Bayesian Dirichlet score

$P(D\vert G) = \int P(D\vert G, \Theta_G)\underbrace{P(\Theta_G\vert G)}_{\text{prior}} d\Theta_G$

When the prior is a Dirichlet distribution, this integral has a closed form:

$P(D\vert G) = \prod_{i} \prod_{\pi_i} \left[ \frac{\Gamma(\sum_j N'_{i,\pi_i,j})}{\Gamma(\sum_j (N'_{i,\pi_i,j} + N_{i,\pi_i,j}))} \prod_{j}\frac{\Gamma(N'_{i,\pi_i,j} + N_{i,\pi_i,j})}{\Gamma(N'_{i,\pi_i,j})}\right]$

  • $\pi_i$: a parent configuration of variable $i$
  • $N_{i, \pi_i, j}$: count of variable $i$ taking value $j$ with parent configuration $\pi_i$
  • $N'$: the corresponding counts in the prior (Dirichlet pseudo-counts)

With a prior $P(G)$ over graph structures, the BD score is:

$\log P(D, G) = \log P(D\vert G) + \log P(G)$

Overfitting is implicitly penalized via integral over parameter space.

Chow-Liu Algorithm

Finds maximum-likelihood tree where each node has at most one parent. We can simply use maximum likelihood as a score (already penalizing complexity by restricting ourselves to tree structures).

The likelihood decomposes into mutual information and entropy terms (entropy: $-\sum p \log p$).

$\log p(D\vert \theta, G) = \lvert D\rvert \sum_i MI_{\hat{p}} (X_i, X_{pa(i)}) - \lvert D\rvert \sum_i H_{\hat{p}}(X_i)$

The entropy term is independent of the tree structure (so we can drop it from the maximization). Mutual information is symmetric, therefore edge orientation does not matter.

3 steps:

  1. compute the mutual information for all pairs of variables $X, U$ (edge weights):

    $\text{MI}(X,U) = \sum_{x,u} \hat{p}(x,u)\log \left[ \frac{\hat{p}(x,u)}{\hat{p}(x)\hat{p}(u)} \right]$

    where $\hat{p}(x,u) = \frac{\text{counts}(x,u)}{\text{number of data points}}$

    complexity: $O(n^2)$

  2. find a maximum-weight spanning tree (a tree that connects all vertices) using Kruskal's or Prim's algorithm (complexity: $O(n^2)$)

  3. pick a node to be the root variable and assign edge directions going outward from this node (a sketch combining these steps is given after this list).
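A compact sketch of the algorithm for discrete data; it uses a Prim-style greedy construction of the maximum-weight spanning tree, and all names are illustrative:

```python
import numpy as np
from collections import Counter

# Chow-Liu: empirical pairwise mutual informations -> maximum-weight spanning tree ->
# edges oriented away from an arbitrary root (node 0). `data` has shape (m, n).
def chow_liu(data):
    m, n = data.shape

    def mutual_information(a, b):
        pa, pb = Counter(data[:, a]), Counter(data[:, b])
        pab = Counter(zip(data[:, a], data[:, b]))
        return sum((c / m) * np.log((c / m) / ((pa[x] / m) * (pb[u] / m)))
                   for (x, u), c in pab.items())

    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_information(i, j)

    # grow the maximum-weight spanning tree from node 0 (Prim-style greedy)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append((i, j))       # edge oriented away from the root: i is the parent of j
        in_tree.add(j)
    return edges
```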

Search algorithms

  • local search: start with an empty/complete graph. At each step, perform a single operation on the graph structure (add/remove/reverse an edge, while preserving acyclicity). If the score increases, adopt the change; otherwise, try another operation.

  • greedy search (K3 algorithm): assume a topological order of the graph in advance (for every directed edge uv from vertex u to vertex v, u comes before v in the ordering). Restrict the parent set to variables with a higher order. While searching for the parent set of each variable, add the parent that increases the score most, until no improvement can be made. (Doesn't that create a complete graph? No: it is restricted by the topological order.) When the specified topological order is a poor one, this results in a bad graph structure (low graph score).

The search space is highly non-convex and both algorithms might get stuck in sub-optimal regions.

Constraint-based approach

Employ independence tests to identify a set of edge constraints for the graph, and then find the best DAG that satisfies these constraints.

E.g.: distinguish a V-structure from a fork structure by performing an independence test for the two side variables conditional on the variable in the middle. This requires many data samples to guarantee testing power.

Recent advances

  • order search (OS): searches over topological orders and the graph space at the same time. It swaps the order of two adjacent variables at each step and employs the K3 algorithm as a sub-routine.
  • ILP: encodes the graph structure, scoring, and acyclicity constraints into an integer linear program and uses a solver. This approach requires a bound on the maximum number of parents of any node; otherwise, the number of constraints explodes and the problem becomes intractable.

Variational auto-encoder

Deep learning technique for learning latent representations.

Directed latent-variable model: $p(x,z)=p(x\vert z)p(z)$

Deep generative model with $m$ layers: $p(x, z_1, \dots, z_m) = p(x \vert z_1)p(z_1\vert z_2)\dots p(z_{m-1}\vert z_m)p(z_m)$

Objectives:

  • learning the parameters $\theta$ of $p$
  • approximate posterior inference over $z$ (given an image $x$, what are its latent factors?)
  • approximate marginal inference over $x$ (given an image $x$ with missing parts, how do we fill them in?)

Assumptions:

  • intractable posterior probability $p(z\vert x)$
  • the dataset is too large to fit in memory

Standard approaches?

  • EM: we would need to compute the posterior $p(z\vert x)$ (intractable) and use an online version of EM to perform the M-step.
  • mean field: requires us to compute expectations whose time complexity scales exponentially with the size of the Markov blanket of the target variable. For $z$, if at least one component of $x$ depends on each component of $z$, this introduces a V-structure ($x$ explains away differences among the $z$'s), and the Markov blanket of some $z_i$ contains all the other $z$-variables (intractable).
  • sampling-based methods: authors (Kingma and Welling) found that they don't scale well on large datasets

Auto-encoding variational Bayes

  • seminal paper by Kingma and Welling
  • the variational auto-encoder is one instantiation of this algorithm

ELBO:

$\mathcal{L}(p_\theta, q_\phi) = \mathbb{E}_{q_\phi (z\vert x)}[\log p_\theta (x,z) - \log q_\phi (z\vert x)]$

satisfies the equation:

$\log p_\theta (x) = KL(q_\phi(z\vert x) \Vert p(z\vert x)) + \mathcal{L}(p_\theta, q_\phi)$

We are conditioning $q(z)$ on $x$. Could we optimize over $q(z \vert x)$ using mean field? No, the assumption that $q$ is fully factored is too strong.

Approach: black-box variational inference

  • gradient descent over $\phi$ (the only assumption is that $q_\phi$ is differentiable, as opposed to coordinate descent)
  • simultaneously perform learning via gradient descent on both $\phi$ (keeping the ELBO tight around $\log p(x)$) and $\theta$ (pushing up the lower bound, and hence $\log p(x)$). Similar to the EM algorithm.

We need to compute the gradient:

$\nabla_{\theta, \phi} \mathbb{E}_{q_\phi(z)}[\log p_\theta (x,z) - \log q_\phi(z)]$

For the gradient over $\theta$, we can swap the gradient with the expectation and estimate $\mathbb{E}_{q_\phi(z)}[\nabla_\theta \log p_\theta (x,z)]$ via Monte Carlo.

The gradient w.r.t. $\phi$ can be obtained via the score function estimator (see Appendix B of Neural Variational Inference and Learning in Belief Networks):

$\nabla_\phi \mathbb{E}_{q_\phi(z)}[\log p_\theta (x,z) - \log q_\phi(z)] = \mathbb{E}_{q_\phi(z)}[(\log p_\theta (x,z) - \log q_\phi(z))\nabla_\phi \log q_\phi (z)]$

(which we can evaluate using Monte-Carlo)

However, the score function estimator has high variance. The VAE paper proposes an alternative estimator that is much better behaved.
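As a toy illustration (not from the paper), here is a Monte-Carlo score-function estimate of $\nabla_\mu \mathbb{E}_{z\sim \mathcal{N}(\mu,1)}[f(z)]$ for a one-dimensional Gaussian $q$; the scatter of repeated estimates hints at the variance problem:

```python
import numpy as np

rng = np.random.default_rng(0)

# Score function (REINFORCE) estimator for q_mu(z) = N(mu, 1):
# grad_mu E_q[f(z)] ≈ mean of f(z_s) * d/dmu log q(z_s), with d/dmu log N(z; mu, 1) = z - mu.
def score_function_gradient(f, mu, n_samples=100_000):
    z = rng.normal(mu, 1.0, size=n_samples)
    return np.mean(f(z) * (z - mu))

# Sanity check: for f(z) = z**2 we have E[z^2] = mu^2 + 1, so the true gradient is 2*mu.
# score_function_gradient(lambda z: z ** 2, mu=1.5) is roughly 3.0, but noisy without baselines.
```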

SGVB estimator

$\mathcal{L}(p_\theta, q_\phi) = \mathbb{E}_q[\log p_\theta (x,z) - \log q_\phi(z\vert x)]$

$= \mathbb{E}_q[\log p_\theta (x\vert z)p_\theta(z)] - \mathbb{E}_q [\log q_\phi(z\vert x)]$

$= \mathbb{E}_q[\log p_\theta (x\vert z)] - KL(q_\phi (z\vert x) \Vert p(z))$

Thus the ELBO becomes:

$\log p_\theta (x) \geq \mathbb{E}_{q_\phi(z\vert x)} [\log p_\theta (x\vert z)] - KL(q_\phi (z\vert x) \Vert p(z))$

Interpretation:

  • calling $q$ the encoder, both terms take a sample $z\sim q(z\vert x)$, which we interpret as a code describing $x$
  • the first term is the log-likelihood of the observed $x$ given the code $z$; $p$ is called the decoder network and this term is called the reconstruction error
  • the second term is the divergence between $q(z\vert x)$ and the prior $p(z)$, which is fixed to be a unit Normal. It encourages the codes $z$ to look Gaussian and acts as a regularization term that prevents $q(z\vert x)$ from encoding a simple identity mapping.
  • reminiscent of auto-encoder neural networks (which learn $\bar{x} = f(g(x))$ by minimizing the reconstruction loss $\lVert \bar{x} - x \rVert$), hence the name.

Main contribution of the paper is a low-variance gradient estimator based on the reparameterization trick:

$\nabla_\phi \mathbb{E}_{z \sim q_\phi(z\mid x)}\left[ f(x,z) \right]= \nabla_\phi \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[ f(x, g_\phi(\epsilon, x)) \right] = \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[ \nabla_\phi f(x, g_\phi(\epsilon, x)) \right]$

where the noise variable $\epsilon$ is sampled from a simple distribution $p(\epsilon)$ (e.g. a standard normal) and the deterministic transformation $g_\phi(\epsilon, x)$ maps the random noise into the more complex distribution $q_\phi(z\vert x)$.
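A minimal numpy sketch of the Gaussian case: the sample $z$ is written as a deterministic transform of unit noise, and the KL term against the unit-Normal prior has a closed form. The helper names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterization: z ~ N(mu, diag(sigma)^2) written as z = mu + sigma * eps with
# eps ~ N(0, I), so gradients can flow through mu and log_sigma.
def reparameterize(mu, log_sigma):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

# Closed-form KL(q(z|x) || p(z)) between a diagonal Gaussian and the unit Gaussian prior
# (the regularization term of the objective above).
def gaussian_kl(mu, log_sigma):
    return 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1.0 - 2 * log_sigma)

# Usage (illustrative): with encoder outputs mu, log_sigma for one data point,
# z = reparameterize(mu, log_sigma); kl = gaussian_kl(mu, log_sigma)
```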

This approach has much lower variance (see appendix of this paper by Rezende et al.) than the score function estimator.

$q$ and $p$ are parameterized by neural networks (a bridge between a classical machine learning method, approximate Bayesian inference, and modern deep learning):

$q(z\vert x) = \mathcal{N}(z; \vec{\mu}(x), \text{diag}(\vec{\sigma}(x))^2)$

$p(x\vert z) = \mathcal{N}(x; \vec{\mu}(z), \text{diag}(\vec{\sigma}(z))^2)$

$p(z) = \mathcal{N}(z; 0, I)$

where the $\vec{\mu}$ and $\vec{\sigma}$ functions are neural nets (two dense hidden layers of 500 units each).

Experimental results:

  • Monte-Carlo EM and hybrid Monte-Carlo are quite accurate, but don’t scale well to large datasets.
  • Wake-sleep is a variational inference algorithm that scales much better; however it does not use the exact gradient of the ELBO (it uses an approximation), and hence it is not as accurate as AEVB.

See original paper

Papers that use VAE:

Further readings