## Andrew Blake, Microsoft Research and Bill Freeman, MIT, ICCV 2003 and Andrew W. Moore, Carnegie Mellon University
**Introduction to Expectation Maximization**

Assembled and extended by Longin Jan Latecki, Temple University, latecki@temple.edu, based on slides by Andrew Blake, Microsoft Research, and Bill Freeman, MIT, ICCV 2003, and Andrew W. Moore, Carnegie Mellon University.

**Learning and vision: Generative Methods**

• Machine learning is an important direction in computer vision.
• Our goal for this class:
  • Give overviews of useful techniques.
  • Show how these methods can be used in vision.
  • Provide references and pointers.

**What is the goal of vision?**

If you are asking, "Are there any faces in this image?", then you would probably want to use discriminative methods. If you are asking, "Find a 3-d model that describes the runner", then you would use generative methods.

**Modeling**

So we want to look at high-dimensional visual data and fit models to it, forming summaries of it that let us understand what we see.

**Posterior probability**

For data points $y = \{y_1, \dots, y_N\}$ and Gaussian parameters $\theta = (\mu, \sigma)$ (mean and std. dev.), Bayes' rule relates the posterior probability to the likelihood function, the prior probability, and the evidence:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

How do we find the parameters of the best-fitting Gaussian?
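The Bayes-rule decomposition above can be sketched numerically on a grid. This is a minimal illustration, not part of the original slides: it assumes, for the sake of example, a 1-d Gaussian likelihood with known std. dev. and a (hypothetical) weak Gaussian prior on the mean, and recovers the posterior over candidate means.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=50)    # data points (true mean 2.0 is a made-up example value)
sigma = 1.0                                    # std. dev., assumed known

# Grid of candidate means mu
mu = np.linspace(-5.0, 5.0, 2001)

# Prior p(mu): Gaussian centred at 0 with std. dev. 10 (a weak, hypothetical choice)
log_prior = -0.5 * (mu / 10.0) ** 2

# Likelihood p(y | mu) = prod_n N(y_n; mu, sigma^2), computed in log form
log_lik = (-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2).sum(axis=0)

# Bayes rule: posterior ∝ likelihood × prior; the evidence is just the normaliser
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * (mu[1] - mu[0])           # normalise to a density on the grid

mu_map = mu[np.argmax(post)]                   # posterior mode; near the sample mean for a weak prior
```

With a prior this weak, the posterior mode essentially answers the slide's question: it sits at (almost exactly) the sample mean, which is the maximum likelihood estimate derived next.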
**Derivation of MLE for Gaussians**

Maximum likelihood parameter estimation answers this by maximising the likelihood of the observed data. Observation density:

$$p(y_n \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_n - \mu)^2}{2\sigma^2}\right)$$

Log likelihood:

$$L(\mu, \sigma) = \sum_{n=1}^{N} \log p(y_n \mid \mu, \sigma)$$

Maximisation: set $\partial L / \partial \mu = 0$ and $\partial L / \partial \sigma = 0$ and solve.

**Basic Maximum Likelihood Estimate (MLE) of a Gaussian distribution**

Mean: $\hat\mu = \frac{1}{N}\sum_n y_n$. Variance: $\hat\sigma^2 = \frac{1}{N}\sum_n (y_n - \hat\mu)^2$. For vector-valued data, the variance is replaced by the covariance matrix $\hat\Sigma = \frac{1}{N}\sum_n (y_n - \hat\mu)(y_n - \hat\mu)^\top$.

**Maximum likelihood estimation for the slope of a single line**

For a line through the origin, $y_n = a x_n + \text{noise}$, the data likelihood for point $n$ is $p(y_n \mid a) \propto \exp\!\big(-(y_n - a x_n)^2 / 2\sigma^2\big)$. Setting the derivative of the log likelihood to zero gives the regression formula for the maximum likelihood estimate:

$$\hat a = \frac{\sum_n x_n y_n}{\sum_n x_n^2}$$

**Model fitting example 3: Fitting two lines to observed data**

[Figure: (x, y) scatter of data drawn from two lines.]

**MLE for fitting a line pair**

(a form of mixture distribution for $y$)

**Fitting two lines: on the one hand…**

If we knew which points went with which line, we'd be back at the single-line fitting problem, twice.

**Fitting two lines, on the other hand…**

We could figure out the probability that any point came from either line if we just knew the two equations for the two lines.

**Expectation Maximization (EM): a solution to chicken-and-egg problems**

[Figure sequence: EM iterations on the two-line example, ending "Converged!"]

**MLE with hidden/latent variables: Expectation Maximisation**

General problem: data $y$, parameters $\theta$, hidden variables $z$. For MLE, we want to maximise the log likelihood

$$\log p(y \mid \theta) = \log \sum_z p(y, z \mid \theta),$$

but the sum over $z$ inside the log gives a complicated expression for the ML solution.

**The EM algorithm**

We don't know the values of the labels $z_i$, but let's use their expected values under the posterior with the current parameter values $\theta^{\text{old}}$.
That gives us the "expectation step" (E-step):

$$Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{z \sim p(z \mid y,\, \theta^{\text{old}})}\big[\log p(y, z \mid \theta)\big]$$

Now let's maximise this $Q$ function, an expected log likelihood, over the parameter values, giving the "maximization step" (M-step):

$$\theta^{\text{new}} = \arg\max_{\theta} Q(\theta, \theta^{\text{old}})$$

Each iteration increases the total log likelihood $\log p(y \mid \theta)$.

**Expectation Maximisation applied to fitting the two lines**

The hidden variables associate each data point with line 1 or line 2; their probabilities of association under the current line parameters are exactly what the E-step computes, and maximising the resulting $Q$ re-fits each line.

**EM fitting to two lines**

E-step: compute each point's probability of association with each line. M-step: the regression becomes a weighted regression, with each point weighted by its association probability. Then repeat.

**Experiments: EM fitting to two lines**

(from a tutorial by Yair Weiss, http://www.cs.huji.ac.il/~yweiss/tutorials.html)

[Figure: line weights for line 1 and line 2 over iterations 1, 2, 3.]

**Applications of EM in computer vision**

• Image segmentation
• Motion estimation combined with perceptual grouping
• Polygonal approximation of edges

**Next… back to Density Estimation**

What if we want to do density estimation with multimodal or clumpy data?

**The GMM assumption**

• There are k components; the i-th component is called $w_i$.
• Component $w_i$ has an associated mean vector $\mu_i$.
• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$.
• Each datapoint is generated according to the following recipe:
  • Pick a component at random; choose component $i$ with probability $P(w_i)$.
  • Datapoint $\sim N(\mu_i, \sigma^2 I)$.

**The General GMM assumption**

The same recipe, except each component $w_i$ has its own covariance matrix $\Sigma_i$, and Datapoint $\sim N(\mu_i, \Sigma_i)$.

**Unsupervised Learning: not as hard as it looks**

(The diagrams on these slides show 2-d unlabeled data ($x$ vectors) distributed in 2-d space; the top one has three very clear Gaussian centers.)

**Computing likelihoods in the unsupervised case**

We have $x_1, x_2, \dots, x_N$. We know $P(w_1), \dots, P(w_k)$, and we know $\sigma$. For any $x$, $P(x \mid w_i, \mu_1, \dots, \mu_k)$ is the probability that an observation from class $w_i$ would have value $x$ given the class means $\mu_1, \dots, \mu_k$. Can we write an expression for that? Can we define $P(x \mid \mu_1, \dots, \mu_k)$? Can we define $P(x_1, x_2, \dots, x_N \mid \mu_1, \dots, \mu_k)$? (Yes, if we assume the $x_i$'s were drawn independently.)

**Unsupervised Learning: Mediumly Good News**

We now have a procedure such that if you give me a guess at $\mu_1, \dots, \mu_k$, I can tell you the probability of the unlabeled data given those $\mu$'s. Suppose the $x$'s are 1-dimensional and there are two classes, $w_1$ and $w_2$, with $P(w_1) = 1/3$, $P(w_2) = 2/3$, and $\sigma = 1$. There are 25 unlabeled datapoints (from Duda and Hart): $x_1 = 0.608$, $x_2 = -1.590$, $x_3 = 0.235$, $x_4 = 3.949$, …, $x_{25} = -0.712$.

**Duda & Hart's Example**

Graph of $\log P(x_1, \dots, x_{25} \mid \mu_1, \mu_2)$ against $\mu_1$ and $\mu_2$: the maximum likelihood is at $(\mu_1 = -2.13, \mu_2 = 1.668)$, with a local optimum, very close to the global one, at $(\mu_1 = 2.085, \mu_2 = -1.257)$*. (*Corresponds to switching $w_1$ with $w_2$.)

We can graph the probability density function of the data given our $\mu_1$ and $\mu_2$ estimates, and also the true density from which the data were randomly generated.

• They are close. Good.
• The second solution tries to put the "2/3" hump where the "1/3" hump should go, and vice versa.
• In this example, unsupervised learning is almost as good as supervised. If $x_1, \dots, x_{25}$ are given the classes that were used to generate them, the supervised result is $(\mu_1 = -2.176, \mu_2 = 1.684)$; unsupervised got $(\mu_1 = -2.13, \mu_2 = 1.668)$.

**Finding the max likelihood $\mu_1, \mu_2, \dots, \mu_k$**

We can compute $P(\text{data} \mid \mu_1, \dots, \mu_k)$. How do we find the $\mu_i$'s that maximise it?

• The normal maximum likelihood trick: set $\partial \log P(\dots) / \partial \mu_i = 0$ and solve for the $\mu_i$'s. Here you get non-linear, non-analytically-solvable equations.
• Use gradient descent: slow but doable.
• Use a much faster, cuter, and recently very popular method…

**The E.M. Algorithm**

DETOUR:
• We'll get back to unsupervised learning soon, but now we'll look at an even simpler case with hidden information.
• The EM algorithm can do trivial things, such as the contents of the next few slides.
• It is an excellent way of doing our unsupervised learning problem, as we'll see.
• It has many, many other uses, including inference of Hidden Markov Models.

**Silly Example**

Let events be "grades in a class":

$w_1$ = gets an A, $P(A) = 1/2$
$w_2$ = gets a B, $P(B) = \mu$
$w_3$ = gets a C, $P(C) = 2\mu$
$w_4$ = gets a D, $P(D) = 1/2 - 3\mu$

(Note $0 \le \mu \le 1/6$.) Assume we want to estimate $\mu$ from data. In a given class there were $a$ A's, $b$ B's, $c$ C's, and $d$ D's. What is the maximum likelihood estimate of $\mu$ given $a, b, c, d$?
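The estimate being asked for can be checked numerically before deriving it: the sketch below maximises the log likelihood of the grade counts over a grid of $\mu$ values and compares it with the closed-form answer obtained by setting the derivative to zero. The counts a, b, c, d are hypothetical example values.

```python
import numpy as np

# Hypothetical grade counts: a A's, b B's, c C's, d D's
a, b, c, d = 14, 6, 9, 10

# Class probabilities: P(A)=1/2, P(B)=mu, P(C)=2mu, P(D)=1/2-3mu, with 0 <= mu <= 1/6
mu = np.linspace(1e-6, 1 / 6 - 1e-6, 100001)
log_lik = (a * np.log(0.5) + b * np.log(mu)
           + c * np.log(2 * mu) + d * np.log(0.5 - 3 * mu))

mu_hat = mu[np.argmax(log_lik)]               # brute-force maximiser on the grid

# Closed form from d(log lik)/d(mu) = b/mu + c/mu - 3d/(1/2 - 3mu) = 0
mu_closed = (b + c) / (6 * (b + c + d))
```

The grid maximiser and the closed form agree to the grid resolution, which is the point of the exercise: with fully observed counts the MLE is trivial.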
**Trivial Statistics**

$P(A) = 1/2$, $P(B) = \mu$, $P(C) = 2\mu$, $P(D) = 1/2 - 3\mu$.

$$P(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^{a} \mu^{b} (2\mu)^{c} \left(\tfrac{1}{2} - 3\mu\right)^{d}$$

$$\log P(a, b, c, d \mid \mu) = \log K + a \log \tfrac{1}{2} + b \log \mu + c \log 2\mu + d \log\!\left(\tfrac{1}{2} - 3\mu\right)$$

Setting $\partial \log P / \partial \mu = b/\mu + c/\mu - 3d/(\tfrac{1}{2} - 3\mu) = 0$ gives the maximum likelihood estimate

$$\hat\mu = \frac{b + c}{6(b + c + d)}.$$

Boring, but true!

**Same Problem with Hidden Information**

Remember: $P(A) = 1/2$, $P(B) = \mu$, $P(C) = 2\mu$, $P(D) = 1/2 - 3\mu$. Someone tells us only that the number of high grades (A's + B's) is $h$, the number of C's is $c$, and the number of D's is $d$. What is the maximum likelihood estimate of $\mu$ now?

We can answer this question circularly:

EXPECTATION: If we knew the value of $\mu$, we could compute the expected values of $a$ and $b$, since the ratio $a : b$ should be the same as the ratio $1/2 : \mu$.

MAXIMIZATION: If we knew the expected values of $a$ and $b$, we could compute the maximum likelihood value of $\mu$.
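The circular expectation/maximization argument above can be sketched directly as an EM loop. The observed counts h, c, d are hypothetical example values; the E-step splits $h$ between the expected numbers of A's and B's in the ratio $1/2 : \mu$, and the M-step plugs the expected $b$ into the fully-observed estimate $(b + c) / (6(b + c + d))$.

```python
# EM for the grades problem: we observe only h = a + b (high grades), c, and d.
h, c, d = 20, 9, 10   # hypothetical observed counts
mu = 0.05             # any starting guess in (0, 1/6)

for _ in range(200):
    # E-step: expected number of B's, since a : b = 1/2 : mu and a + b = h
    b_exp = h * mu / (0.5 + mu)
    # M-step: maximum likelihood mu given the expected counts
    mu_new = (b_exp + c) / (6 * (b_exp + c + d))
    if abs(mu_new - mu) < 1e-12:
        break
    mu = mu_new
```

Each pass plays the E-step and M-step off against each other, and the iterates settle to a fixed point that satisfies both equations simultaneously, which is EM's resolution of the chicken-and-egg problem.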