Mixture Models, Latent Variables and the EM Algorithm

Compressing and restricting density estimates. Mixtures of limited numbers of distributions. Mixture models as probabilistic clustering; finally an answer to "how many clusters?" The EM algorithm as an iterative way of maximizing likelihood with latent variables. Analogy to k-means. More theory of the EM algorithm. Applications: density mixtures, signal processing/state estimation, mixtures of regressions, mixtures of experts; topic models and probabilistic latent semantic analysis. A glance at non-parametric mixture models.

1. Mixture Models, Latent Variables and the EM Algorithm 36-350, Data Mining, Fall 2009 30 November 2009 Contents 1 From Kernel Density Estimates to Mixture Models 1 1.1 Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Probabilistic Clustering 4 3 Estimating Parametric Mixture Models 4 3.1 More about the EM Algorithm . . . . . . . . . . . . . . . . . . . 6 3.2 Further Reading on and Applications of EM . . . . . . . . . . . . 9 3.3 Topic Models and Probabilistic LSA . . . . . . . . . . . . . . . . 9 4 Non-parametric Mixture Modeling 10 5 R 10 6 Exercises 10 Reading: Principles of Data Mining, sections 9.2, 8.4 and 9.6, in that order. 1 From Kernel Density Estimates to Mixture Models Last time, we looked at kernel density estimation, where we approximate the true distribution by sticking a small copy of a kernel pdf at each observed data point and adding them up. With enough data, this comes arbitrarily close to any (reasonable) probability density, but it does have some drawbacks. Statistically, it labors under the curse of dimensionality. Computationally, we have to remember all of the data points, which is a lot. We saw similar problems when we looked at fully non-parametric regression, and then saw that both could be ameliorated by using things like additive models, which impose more 1

2.constraints than, say, unrestricted kernel smoothing. Can we do something like that with density estimation? Additive modeling for densities is not as common as it is for regression — it’s harder to think of times when it would be natural and well-defined1 — but we can do something with a similar flavor, which is to use a mixture model. One way to motivate this is to realize that, in many problems, lots of our data points are very similar to each other, and the differences between them are really just matters of chance or noise. Rather than remembering all of those noisy details, why not collapse those data points, and just remember their common distribution? More formally, we say that a distribution f is a mixture of K component distributions f1 , f2 , . . . fK if K f (x) = λk fk (x) (1) k=1 with the λk being the mixing weights, λk > 0, k λk = 1. Eq. 1 is a complete stochastic model, so it gives us a recipe for generating new data points: first pick a distribution, with probabilities given by the mixing weights, and then generate one observation according to that distribution. Symbolically, Z ∼ Mult(λ1 , λ2 , . . . λK ) (2) X|Z ∼ fZ (3) where I’ve introduced the discrete random variable Z which says which compo- nent X is drawn from. I haven’t said what kind of distribution the fk s are. In principle, we could make these completely arbitrary, and we’d still have a perfectly good mixture model. In practice, a lot of effort is given over to parametric mixture mod- els, where the fk are all from the same parametric family, but with different parameters — for instance they might all be Gaussians with different centers and variances, or all Poisson distributions with different means, or all power laws with different exponents. (It’s not strictly necessary that they all be of the same kind.) We’ll write the parameter, or parameter vector, of the k th component as θk , so the model becomes K f (x) = λk f (x; θk ) (4) k=1 The over-all parameter vector of the mixture model is thus θ = (λ1 , λ2 , . . . λK , θ1 , θ2 , . . . θK ). Let’s consider two extremes. When K = 1, we have a simple parametric distribution, of the usual sort, and density estimation reduces to estimating the 1 Remember that the integral of a probability density over all space must be 1, while the integral of a regression function P doesn’t have to be anything in particular. If we had an additive density, f (x) = j fj (xj ), ensuring normalization is going to be very tricky; we’d P R need j fj (xj )dx1 dx2 dxp = 1. It would be easier to ensure normalization while making the log-density additive, but that assumes the features are independent of each other. 2

3.parameters, by maximum likelihood or whatever else we feel like. On the other hand when K = n, the number of observations, we have gone back towards kernel density estimation. If K is fixed as n grows, we still have a parametric model, and avoid the curse of dimensionality, but a mixture of (say) ten Gaus- sians is more flexible than a single Gaussian — thought it may still be the case that the true distribution just can’t be written as a ten-Gaussian mixture. 1.1 Identifiability Before we set about trying to estimate our probability models, we need to make sure that they are identifiable — that if we have distinct representations of the model, they make distinct observational claims. It is easy to let there be too many parameters, or the wrong choice of parameters, and lose identifiability. If there are distinct representations which are observationally equivalent, we either need to change our model, change our representation, or fix on a unique representation by some convention. • With additive regression, E [Y |X = x] = α + j fj (xj ), we can add arbi- trary constants so long as they cancel out. That is, we get the same pre- dictions from α + c0 + jfj (xj ) + cj when c0 = − j cj . This is another model of the same form, α + j fj (xj ), so it’s not identifiable. We dealt with this by imposing the convention that α = E [Y ] and E [fj (Xj )] = 0 — we picked out a favorite, convenient representation from the infinite collection of equivalent representations. • Linear regression becomes unidentifiable with collinear features. Collinear- ity is a good reason to not use linear regression (i.e., we change the model.) • Factor analysis is unidentifiable because of the rotation problem. Some people respond by trying to fix on a particular representation, others just ignore it. Two kinds of identification problems are common for mixture models; one is trivial and the other is fundamental. The trivial one is that we can always swap the labels of any two components with no effect on anything observable at all — if we decide that component number 1 is now component number 7 and vice versa, that doesn’t change the distribution of X at all. This label degeneracy can be annoying, especially for some estimation algorithms, but that’s the worst of it. A more fundamental lack of identifiability happens when mixing two distri- butions from a parametric family just gives us a third distribution from the same family. For example, suppose we have a single binary feature, say an indicator for whether someone will pay back a credit card. We might think there are two kinds of customers, with high- and low- risk of not paying, and try to represent this as a mixture of binomial distribution. If we try this, we’ll see that we’ve gotten a single binomial distribution with an intermediate risk of repayment. A mixture of binomials is always just another binomial. In fact, a mixture of multinomials is always just another multinomial. 3

4.2 Probabilistic Clustering Back when we talked about clustering, I suggested that, ideally, knowing which cluster an observation came from would tell us as much as possible about the distribution of features. This means that differences in features between points from the same clusters should just be matters of chance. This view exactly corresponds to mixture models like Eq. 1; the hidden variable Z I introduced above in just the cluster label. One of the very nice things about probabilistic clustering is that Eq. 1 ac- tually claims something about what the data looks like; it says that it follows a certain distribution. We can check whether it does, and we can check whether new data follows this distribution. If it does, great; if not, if the predictions sys- tematically fail, then the model is wrong. We can compare different probabilistic clusterings by how well they predict (say under cross-validation). Contrast this with k-means or hierarchical clustering: they make no predictions, and so we have no way of telling if they are right or wrong. Consequently, comparing different non-probabilistic clusterings is a lot harder! In particular, probabilistic clustering gives us a sensible way of answering the question “how many clusters?” The best number of clusters to use is the number which will best generalize to future data. If we don’t want to wait around to get new data, we can approximate generalization performance by cross-validation, or by any other adaptive model selection procedure. 3 Estimating Parametric Mixture Models From baby stats., we remember that it’s generally a good idea to estimate distributions using maximum likelihood, when we can. How could we do that here? Remember that the likelihood is the probability (or probability density) of observing our data, as a function of the parameters. Assuming independent samples, that would be n f (xi ; θ) (5) i=1 for observations x1 , x2 , . . . xn . As always, we’ll use the logarithm to turn multi- plication into addition: n (θ) = log f (xi ; θ) (6) i=1 n K = log λk f (xi ; θk ) (7) i=1 k=1 4

5.Let’s try taking the derivative of this with respect to one parameter, say θj . n ∂ 1 ∂f (xi ; θj ) = K λj (8) ∂θj i=1 k=1 λk f (xi ; θk ) ∂θj n λj f (xi ; θj ) 1 ∂f (xi ; θj ) = K (9) i=1 k=1 λk f (xi ; θk ) f (xi ; θj ) ∂θj n λj f (xi ; θj ) ∂ log f (xi ; θj ) = K (10) i=1 k=1 λk f (xi ; θk ) ∂θj If we just had an ordinary parametric model, on the other hand, the derivative of the log-likelihood would be n ∂ log f (xi ; θj ) (11) i=1 ∂θj So maximizing the likelihood for a mixture model is like doing a weighted like- lihood maximization, where the weight of xi depends on cluster, being λj f (xi ; θj ) wij = K (12) k=1 λk f (xi ; θk ) The problem is that these weights depend on the parameters we are trying to estimate! Let’s look at these weights wij a bit more. Remember that λj is the proba- bility that the hidden class variable Z is j, so the numerator in the weights is the joint probability of getting Z = j and X = xi . The denominator is the marginal probability of getting X = xi , so the ratio is the conditional probability of Z = j given X = xi , λj f (xi ; θj ) wij = K = p(Z = j|X = xi ; θ) (13) k=1 λk f (xi ; θk ) If we try to estimate the mixture model, then, we’re doing weighted maximum likelihood, with weights given by the posterior cluster probabilities. These, to repeat, depend on the parameters we are trying to estimate, so there seems to be a vicious circle. But, as the saying goes, one man’s vicious circle is another man’s successive approximation procedure. We have in fact seen something very much like this particular vicious circle already in cluster, in the form of the k-means algorithm. There, we want to assign each point to the cluster with the closest center (like calculating p(Z = j|X = xi ; θ) given θ), but a cluster’s center depends on which points belong to it (like finding θ by maximizing the weighted likelihood). In k-means, we solved the difficulty by alternating between assigning points to clusters and finding cluster centers. You could imagine doing something similar for mixture models: start with an initial guess about the component 5

6.distributions; find out which component each point is most likely to have come from; re-estimate the components using only the points assigned to it, etc., until things converge. This corresponds to taking all the weights wij to be either 0 or 1; it also corresponds to a “hard” clustering procedure like k-means. However, it does not maximize the likelihood, since we’ve seen that to do so we need fractional weights. What’s called the EM algorithm is simply the obvious refinement of the hard assignment strategy. 1. Start with guesses about the mixture components θ1 , θ2 , . . . θK and the mixing weights λ1 , . . . λK . 2. Until nothing changes very much: (a) Using the current parameter guesses, calculate the weights wij (E- step) (b) Using the current weights, maximize the weighted likelihood to get new parameter estimates (M-step) 3. Return the final parameter estimates (including mixing proportions) and cluster probabilities The M in “M-step” and “EM” stands for “maximization”, which is pretty transparent. The E stands for “expectation”, because it gives us the condi- tional probabilities of different values of Z, and probabilities are expectations of indicator functions. (In fact in some early applications, Z was binary, so one really was computing the expectation of Z.) The whole thing is also called the “expectation-maximization” algorithm. 3.1 More about the EM Algorithm The EM algorithm turns out to be a general way of maximizing the likelihood when some variables are unobserved, and hence useful for other things besides mixture models. So in this section, where I try to explain why it works, I am going to be a bit more general abstract. (Also, it will actually cut down on notation.) I’ll pack the whole sequence of observations x1 , x2 , . . . xn into a single variable d (for “data”), and likewise the whole sequence of z1 , z2 , . . . zn into h (for “hidden”). What we want to do is maximize (θ) = log p(d; θ) = log p(d, h; θ) (14) h This is generally hard, because even if p(d, h; θ) has a nice parametric form, that is lost when we sum up over all possible values of h (as we saw above). The essential trick of the EM algorithm is to maximize not the log likelihood, but a lower bound on the log-likelihood, which is more tractable; we’ll see that this lower bound is sometimes tight, i.e., coincides with the actual log-likelihood, and in particular does so at the global optimum. 6

7. 0.5 0.0 log(x) -0.5 0.5 1.0 1.5 2.0 x curve(log(x),from=0.4,to=2.1) segments(0.5,log(0.5),2,log(2),lty=2) Figure 1: The logarithm is a concave function, i.e., the curve connecting any two points lies above the straight line doing so. Thus the average of logarithms is less than the logarithm of the average. We can introduce an arbitrary2 distribution on h, call it q(h), and we’ll (θ) = log p(d, h; θ) (15) h q(h) = log p(d, h; θ) (16) q(h) h p(d, h; θ) = log q(h) (17) q(h) h So far so trivial. Now we need a geometric fact about the logarithm function, which is that its curve is concave: if we take any two points on the curve and connect them by a straight line, the curve lies above the line (Figure 1). Algebraically, this means that w log t1 + (1 − w) log t2 ≤ log wt1 + (1 − w)t2 (18) for any 0 ≤ w ≤ 1, and any points t1 , t2 > 0. Nor does this just hold for two points: for any r points t1 , t2 , . . . tr > 0, and any set of non-negative weights 2 Well, almost arbitrary; it shouldn’t give probability zero to value of h which has positive probability for all θ. 7

8. r i=1 wr = 1, r r wi log ti ≤ log wi ti (19) i=1 i=1 In words: the log of the average is at least the average of the logs. This is called Jensen’s inequality. So p(d, h; θ) p(d, h; θ) log q(h) ≥ q(h) log (20) q(h) q(h) h h ≡ J(q, θ) (21) We are bothering with this because we hope that it will be easier to maximize this lower bound on the likelihood than the actual likelihood, and the lower bound is reasonably tight. As to tightness, suppose that q(h) = p(h|d; θ). Then p(d, h; θ) p(d, h; θ) p(d, h; θ) = = = p(d; θ) (22) q(h) p(h|d; θ) p(h, d; θ)/p(d; θ) no matter what h is. So with that choice of q, J(q, θ) = (θ) and the lower bound is tight. Also, since J(q, θ) ≤ (θ), this choice of q maximizes J for fixed θ. Here’s how the EM algorithm goes in this formulation. 1. Start with an initial guess θ(0) about the components and mixing weights. 2. Until nothing changes very much (a) E-step: q (t) = argmaxq J(q, θ(t) ) (b) M-step: θ(t+1) = argmaxθ J(q (t) , θ) 3. Return final estimates of θ and q The E and M steps are now nice and symmetric; both are about maximizing J. It’s easy to see that, after the E step, J(q (t) , θ(t) ) ≥ J(q (t−1) , θ(t) ) (23) and that, after the M step, J(q (t) , θ(t+1) ) ≥ J(q (t) , θ(t) ) (24) Putting these two inequalities together, J(q (t+1) , θ(t+1) ) ≥ J(q (t) , θ(t) ) (25) (t+1) (t) (θ ) ≥ (θ ) (26) So each EM iteration can only improve the likelihood, guaranteeing convergence to a local maximum. Since it only guarantees a local maximum, it’s a good idea 8

9.to try a few different initial values of θ(0) and take the best. All of which, again, is very like k-means. We saw above that the maximization in the E step is just computing the posterior probability p(h|d; θ). What about the maximization in the M step? p(d, h; θ) q(h) log = q(h) log p(d, h; θ) − q(h) log q(h) (27) q(h) h h h The second sum doesn’t depend on θ at all, so it’s irrelevant for maximizing, giving us back the optimization problem from the last section. This confirms that using the lower bound from Jensen’s inequality hasn’t yielded a different algorithm! 3.2 Further Reading on and Applications of EM Like the textbook’s, my presentation of the EM algorithm draws heavily on Neal and Hinton (1998). Because it’s so general, the EM algorithm is applied to lots of problems with missing data or latent variables. Traditional estimation methods for factor analysis, for example, can be replaced with EM. (Arguably, some of the older methods were versions of EM.) A common problem in time-series analysis and signal processing is that of “filtering” or “state estimation”: there’s an unknown signal St , which we want to know, but all we get to observe is some noisy, corrupted measurement, Xt = h(St ) + ηt . (A historically important example of a “state” to be estimated from noisy measurements is “Where is our rocket and which way is it headed?” — see McGee and Schmidt, 1985.) This is solved by the EM algorithm, with the signal as the hidden variable; Fraser (2008) gives a really good introduction to such models and how they use EM. Instead of just doing mixtures of densities, one can also do mixtures of predictive models, say mixtures of regressions, or mixtures of classifiers. The hidden variable Z here controls which regression function to use. A general form of this is what’s known as a mixture-of-experts model (Jordan and Jacobs, 1994; Jacobs, 1997) — each predictive model is an “expert”, and there can be a quite complicated set of hidden variables determining which expert to use when. The EM algorithm is so useful and general that it has in fact been re-invented multiple times. The name “EM algorithm” comes from the statistics of mixture models in the late 1970s; in the time series literature it’s been known since the 1960s as the “Baum-Welch” algorithm. 3.3 Topic Models and Probabilistic LSA Mixture models over words provide an alternative to latent semantic indexing for document analysis. Instead of finding the principal components of the bag- of-words vectors, the idea is as follows. There are a certain number of topics which documents in the corpus can be about; each topic corresponds to a dis- tribution over words. The distribution of words in a document is a mixture of 9

10.the topic distributions. That is, one can generate a bag of words by first picking a topic according to a multinomial distribution (topic i occurs with probability λi ), and then picking a word from that topic’s distribution. The distribution of topics varies from document to document, and this is what’s used, rather than projections on to the principal components, to summarize the document. This idea was, so far as I can tell, introduced by Hofmann (1999), who estimated ev- erything by EM. Latent Dirichlet allocation, due to Blei and collaborators (Blei et al., 2003) is an important variation which smoothes the topic distribu- tions; there is a CRAN package called lda. Blei and Lafferty (2009) is a good recent review paper of the area. 4 Non-parametric Mixture Modeling We could replace the M step of EM by some other way of estimating the dis- tribution of each mixture component. This could be a fast-but-crude estimate of parameters (say a method-of-moments estimator if that’s simpler than the MLE), or it could even be a non-parametric density estimator of the type we talked about last time. (Similarly for mixtures of regressions, etc.) Issues of dimensionality re-surface now, as well as convergence: because we’re not, in general, increasing J at each step, it’s harder to be sure that the algorithm will in fact converge. This is an active area of research. 5 R There are several R packages which implement mixture models. The mclust package (http://www.stat.washington.edu/mclust/) is pretty much stan- dard for Gaussian mixtures. One of the most recent and powerful is mixtools (Benaglia et al., 2009), which, in addition to classic mixtures of parametric densities, handles mixtures of regressions and some kinds of non-parametric mixtures. The FlexMix package (Leisch, 2004) is (as the name implies) very good at flexibly handling complicated situations, though you have to do some programming to take advantage of this. 6 Exercises Not to hand in. 1. Work through the E- step and M- step for a mixture of two Poisson distri- butions, without looking at the textbook. Then check your answer against section 8.4 2. Code up the EM algorithm for a mixture of K Gaussians, following section 8.4 in the textbook. Simulate data from K = 3 Gaussians. How well does your code assign data-points to components if you give it the actual 10

11. Gaussian parameters as your initial guess? If you give it other initial parameters? 3. Could you use mixture models to design a recommendation engine? References Benaglia, Tatiana, Didier Chauveau, David R. Hunter and Derek S. Young (2009). “mixtools: An R Package for Analyzing Mixture Models.” Journal of Statistical Software, 32. URL http://www.jstatsoft.org/v32/i06. Blei, David M. and John D. Lafferty (2009). “Topic Models.” In Text Min- ing: Theory and Applications (A. Srivastava and M. Sahami, eds.). London: Taylor and Francis. URL http://www.cs.princeton.edu/~blei/papers/ BleiLafferty2009.pdf. Blei, David M., Andrew Y. Ng and Michael I. Jordan (2003). “Latent Dirichlet Allocation.” Journal of Machine Learning Research, 3: 993–1022. URL http://jmlr.csail.mit.edu/papers/v3/blei03a.html. Fraser, Andrew M. (2008). Hidden Markov Models and Dynamical Systems. Philadelphia: SIAM Press. URL http://www.siam.org/books/ot107/. Hofmann, Thomas (1999). “Probabilistic Latent Semantic Analysis.” In Un- certainty in Artificial Intelligence: Proceedings of the Fiftheenth Conference [UAI 1999] (Kathryn Laskey and Henri Prade, eds.), pp. 289–296. San Fran- cisco: Morgan Kaufmann. URL http://www.cs.brown.edu/~th/papers/ Hofmann-UAI99.pdf. Jacobs, Robert A. (1997). “Bias/Variance Analyses of Mixtures-of-Experts Ar- chitectures.” Neural Computation, 9: 369–383. Jordan, Michael I. and Robert A. Jacobs (1994). “Hierarchical Mixtures of Experts and the EM Algorithm.” Neural Computation, 6: 181–214. Leisch, Friedrich (2004). “FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R.” Journal of Statistical Software, 11. URL http://www.jstatsoft.org/v11/i08. McGee, Leonard A. and Stanley F. Schmidt (1985). Discovery of the Kalman Filter as a Practical Tool for Aerospace and Industry. Tech. Rep. 86847, NASA Technical Memorandum. URL http://ntrs.nasa.gov/archive/ nasa/casi.ntrs.nasa.gov/19860003843_1986003843.pdf. Neal, Radford M. and Geoffrey E. Hinton (1998). “A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants.” In Learning in Graphical Models (Michael I. Jordan, ed.), pp. 355–368. Dor- drecht: Kluwer Academic. URL http://www.cs.toronto.edu/~radford/ em.abstract.html. 11