Dirichlet processes and Applications

An intuitive guide to the working of Dirichlet distributions and processes, and their applications.

Published in: Data & Analytics

  1. Dirichlet Processes and Applications. Saurav Jha, Machine Learning Engineer. Copyright © 2018 FactSet Research Systems Inc. All rights reserved. Confidential: Do not forward.
  2. Table of Contents
     1. Probability 101: Mass & Density Functions
     2. Probability 102: Simplex and its geometrical meaning
     3. Dirichlet Distribution
     4. Dirichlet Process
     5. A demo
     6. An application
  3. Probability 101: Probability Mass Function (PMF) vs. Probability Density Function (PDF), and the Probability Simplex
     • PMF: gives the probability that a discrete random variable is exactly equal to some value.
     • PDF: gives the probability density of a continuous random variable; integrating it over a range gives the probability that the variable falls in that range.
     • In the continuous setting: $\int_a^b f(x)\,dx$ = probability that the outcome is between $a$ and $b$, i.e., the units of $f(x)$ are probability per unit length ($dx$): how dense the probability is near $x$.
     • In the discrete setting: $f(x) = \Pr(X = x)$, i.e., the units of $f(x)$ are plain probability: the mass of $X$ at the point $x$.
     • Probability simplex: the set of all PMFs on a sample space with $n$ outcomes, $S = \{ x \in \mathbb{R}^n : x_i \ge 0,\ \sum_{i=1}^{n} x_i = 1 \}$.
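The difference between density and mass can be checked numerically. The following is a minimal sketch using only the Python standard library; the standard normal density, the fair six-sided die, and the helper names (`normal_pdf`, `integrate`, `in_simplex`) are illustrative assumptions, not part of the slides.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution: probability per unit length near x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

def in_simplex(x, tol=1e-9):
    """True if x is a PMF: non-negative entries that sum to 1."""
    return all(v >= -tol for v in x) and abs(sum(x) - 1.0) <= tol

# Continuous case: a probability is an integral of the density over a range.
prob_within_one_sigma = integrate(normal_pdf, -1.0, 1.0)

# A density is not a probability, so it may exceed 1.
peak_density_narrow = normal_pdf(0.0, sigma=0.1)

# Discrete case: each entry of the PMF is already a probability.
die_pmf = [1.0 / 6.0] * 6
```

Here `prob_within_one_sigma` approximates the probability of landing within one standard deviation of the mean, while `in_simplex(die_pmf)` confirms that the die's PMF is a point of the probability simplex.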
  4. Probability 102: k-Simplex, geometrical meaning
     • A k-simplex is a k-dimensional polytope (a geometric object with flat sides) formed as the convex hull of its k+1 vertices.
     • Let $u_0, u_1, \dots, u_k \in \mathbb{R}^k$ be k+1 affinely independent points; the simplex they determine is the set $C = \{ \theta_0 u_0 + \dots + \theta_k u_k \mid \sum_{i=0}^{k} \theta_i = 1,\ \theta_i \ge 0\ \forall i \}$.
     • Example: treat $u_0, u_1, u_2$ as three disjoint possible events whose probabilities sum to 1, i.e., $p_0 + p_1 + p_2 = 1$ with $0 \le p_i \le 1$.
     • Consider the three probabilities as a point $(p_0, p_1, p_2)$ in Euclidean space.
     • The set of all such points is a filled triangle with corners (1,0,0), (0,1,0) and (0,0,1).
     • While this set lies in a 3-dimensional space, the object it forms is 2-dimensional; in general, PMFs of length k lie in a k-dimensional space but form a (k-1)-dimensional simplex.
     • Each point $p$ in the simplex is itself a PMF: each component lies in [0, 1] and the components sum to 1.
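The convex-hull definition of the set C can be sketched in a few lines. This is an illustrative sketch, assuming vertices are given as plain tuples; the function name `convex_combination` and the example weights are hypothetical.

```python
def convex_combination(vertices, theta, tol=1e-9):
    """Return theta_0*u_0 + ... + theta_k*u_k for valid simplex weights theta."""
    if len(vertices) != len(theta):
        raise ValueError("need exactly one weight per vertex")
    if any(t < -tol for t in theta) or abs(sum(theta) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(vertices[0])
    return tuple(
        sum(t * v[d] for t, v in zip(theta, vertices)) for d in range(dim)
    )

# Corners of the probability simplex for three events (a triangle in 3-D space).
corners = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
theta = (0.2, 0.3, 0.5)

# With these corners, the point inside the triangle is the PMF itself.
point = convex_combination(corners, theta)

# The same weights applied to a triangle in the plane (a 2-simplex in R^2).
planar_point = convex_combination([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], theta)
```

Because the corners are the standard basis vectors, `point` reproduces the weights, which is why every point of the probability simplex can be read directly as a PMF.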
  5. Dirichlet distribution: "a distribution of distributions"
     • Let $Q = [Q_1, Q_2, \dots, Q_k]$ be a random PMF, i.e., $Q_i \ge 0$ for $i = 1, 2, \dots, k$ and $\sum_{i=1}^{k} Q_i = 1$.
     • Let $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_k]$, with $\alpha_i > 0$ for each $i$, and let $\alpha_0 = \sum_{i=1}^{k} \alpha_i$.
     • Then $Q$ has a Dirichlet distribution with parameter $\alpha$, denoted $Q \sim \mathrm{Dir}(\alpha)$, if its density on the simplex is $P(Q_1, Q_2, \dots, Q_k) = \frac{\Gamma(\alpha_0)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} Q_i^{\alpha_i - 1}$.
     • It is a probability distribution whose samples lie in the (k-1)-dimensional probability simplex $\Delta_k$, i.e., a distribution over PMFs of length k.
     • It ranges over possible parameter vectors of a multinomial distribution and is the conjugate prior of the multinomial distribution.
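A Dirichlet draw and its density can be sketched with the standard library alone. The sketch below assumes the well-known construction of a Dirichlet sample as independent Gamma(α_i, 1) draws divided by their sum, and the standard Dirichlet density; the function names and the parameter vector `[2.0, 3.0, 5.0]` are illustrative choices.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one PMF from Dir(alpha) by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def dirichlet_pdf(q, alpha):
    """Dirichlet density at an interior point q of the simplex (all q_i > 0)."""
    alpha0 = sum(alpha)
    log_norm = math.lgamma(alpha0) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(x) for a, x in zip(alpha, q))
    return math.exp(log_norm + log_kernel)

rng = random.Random(0)
alpha = [2.0, 3.0, 5.0]  # hypothetical parameters, chosen only for illustration

# One sample is a full PMF of length 3.
sample = sample_dirichlet(alpha, rng)

# Averaging many samples approaches alpha_i / alpha_0 in each component.
draws = [sample_dirichlet(alpha, rng) for _ in range(5000)]
empirical_mean = [sum(d[i] for d in draws) / len(draws) for i in range(3)]

# Dir(1, 1, 1) is uniform over the triangle, so its density is constant.
uniform_density = dirichlet_pdf([1 / 3, 1 / 3, 1 / 3], [1.0, 1.0, 1.0])
```

Working in log space with `math.lgamma` keeps the normalizing constant stable for large parameters, where the Gamma function itself would overflow.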
  6. Dirichlet distribution: an example use case
     • $X$ = vector of counts from $n$ draws of a random variable with 3 possible outcomes, e.g., $X = [4, 4, 2]$ (so $n = 10$).
     • PMF of $X$ = multinomial distribution: $\frac{n!}{n_1!\, n_2!\, n_3!}\, p_1^{n_1} p_2^{n_2} p_3^{n_3}$.
     • Q) What if $p_1, p_2, p_3$ are unknown, i.e., there is no certainty about the distribution of the categorical variable?
     • Solution: use a Dirichlet distribution with parameters $\alpha_1, \alpha_2, \alpha_3$ to first draw $P \sim \mathrm{Dir}(\alpha)$, and then draw $X \sim \mathrm{Multi}(n, P)$.
     • This introduces one level of indirection in the model for $X$: instead of stating which $P$ generated $X$, use the parameters $\alpha_1, \alpha_2, \alpha_3$ to describe likely probability distributions, and then draw samples $X$ according to the random $P$.
     • Since $P$ is sampled directly from the probability simplex, every draw is a valid PMF; the mean of the Dirichlet is $E[P_i] = \alpha_i / \alpha_0$, where $\alpha_0 = \alpha_1 + \alpha_2 + \alpha_3$.
     • Adding the Dirichlet distribution amounts to introducing prior beliefs about which $X$ is likely to occur, i.e., the random PMF $P$ has a Dirichlet distribution with parameter $\alpha$. [1]
     • Analogy 1: if the Dirichlet distribution is a bag full of dice, then a sample from it is one specific die.
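The two-stage model (first draw P from the Dirichlet, then draw X from the multinomial) can be sketched as follows. This is a minimal standard-library sketch; the prior parameters `[2.0, 2.0, 1.0]`, the seed, and the fixed probabilities used for the PMF check are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def multinomial_pmf(counts, p):
    """n! / (n_1! ... n_k!) * p_1^n_1 * ... * p_k^n_k for the given counts."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)  # stays an exact integer at every step
    prob = float(coeff)
    for c, p_i in zip(counts, p):
        prob *= p_i ** c
    return prob

def sample_dirichlet(alpha, rng):
    """Draw one PMF from Dir(alpha) by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_counts(n, p, rng):
    """Draw n categorical outcomes with probabilities p and tally them."""
    tally = Counter(rng.choices(range(len(p)), weights=p, k=n))
    return [tally[i] for i in range(len(p))]

rng = random.Random(42)
alpha = [2.0, 2.0, 1.0]           # hypothetical prior beliefs about p1, p2, p3
p = sample_dirichlet(alpha, rng)  # stage 1: P ~ Dir(alpha)
x = sample_counts(10, p, rng)     # stage 2: X ~ Multi(10, P)

# Probability of the slide's counts [4, 4, 2] if p were known to be (0.4, 0.4, 0.2).
pmf_442 = multinomial_pmf([4, 4, 2], [0.4, 0.4, 0.2])
```

Rerunning stage 1 gives a different `p` each time, which is the level of indirection the slide describes: the parameters `alpha` never produce counts directly, only distributions that then produce counts.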
  7. Dirichlet Process: Dirichlet processes to the rescue!
     • Limitation of the Dirichlet distribution: it assumes a finite set of events.
     • In the dice analogy, each die must have a finite number of faces.
     • A Dirichlet process enables working with an infinite set of events, and hence modelling probability distributions over infinite sample spaces.
     • Analogy 2:
         • Ask pedestrians on the street to choose their favorite color out of {V, I, B, G, Y, O, R}.
         • Based on the answers, model each person as a PMF over the 7 colors.
         • Each person's PMF is a realization of a draw from a Dirichlet distribution over 7 colors.
     • What if the choices are no longer restricted to 7 colors?
     • Modelling an individual's PMF (now infinite-dimensional) requires a distribution over distributions on an infinite sample space.
     • One solution: a Dirichlet process.
  8. Dirichlet Process: definition
     • Assign elements A, B, C, … to an unknown number of categories using the following algorithm:
         • Input: H (a probability distribution, a.k.a. the base distribution) and α (a positive real number, a.k.a. the concentration parameter).
         • For n = 1: draw the first element from H; it starts the first category.
         • For n > 1, with A denoting the nth element:
             • Assign A to a new category (with a fresh value drawn from H) with probability $\alpha / (\alpha + n - 1)$.
             • Assign A to an existing category x with probability $n_x / (\alpha + n - 1)$, where $n_x$ = number of elements already assigned to x.
     • Used when modelling data that tends to repeat previous values in a "rich get richer" fashion.
     • Can also be described as a Chinese Restaurant Process.
     • Applications: morphological segmentation in NLP; modelling mutation rates of genes in evolutionary biology.
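The assignment rule above can be simulated directly. The following is a minimal sketch of the Chinese Restaurant Process view, tracking only category labels and counts (values drawn from H are omitted for brevity); the function name `crp_assignments`, the seed, and the settings n = 100, α = 1.0 are illustrative assumptions.

```python
import random

def crp_assignments(n, alpha, rng):
    """Assign n elements to categories using the rich-get-richer rule.

    Returns (assignments, counts): assignments[i] is the category label of
    element i, and counts[x] is the number of elements in category x.
    """
    counts = []
    assignments = []
    for _ in range(n):
        # Unnormalized weights: n_x for each existing category, alpha for a
        # new one. With n-1 elements already placed, the normalizer is
        # alpha + n - 1, matching the probabilities on the slide.
        weights = counts + [alpha]
        x = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if x == len(counts):
            counts.append(1)  # open a new category
        else:
            counts[x] += 1    # join an existing, already popular category
        assignments.append(x)
    return assignments, counts

rng = random.Random(0)
assignments, counts = crp_assignments(100, 1.0, rng)
num_categories = len(counts)
```

The first element always opens category 0, because the only available weight is α. Raising α makes new categories more likely at every step; lowering it concentrates the elements in a few large categories.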
  9. A demo [2]
  10. An application: learning of hierarchical morphology paradigms [3]
     • A paradigm = a pair (StemList, SuffixList) in which each Stem+Suffix string is a valid word.
     • Paradigms can be modelled as a hierarchical structure.
     • Morphologically similar words are close to each other in the structure.
     • Similarity metric = number of common morphemes.
     • Notation: w = word, s = stem, m = suffix.
     • Assumption: stems and suffixes are generated independently of each other.
     • Probability of a word: $p(w = s + m) = p(s)\, p(m)$.
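A paradigm and the independence assumption can be illustrated with a toy example. The stems, suffixes, and probabilities below are hypothetical values made up for this sketch, not data from [3].

```python
from itertools import product

def paradigm_words(stems, suffixes):
    """All Stem+Suffix strings licensed by a paradigm (StemList, SuffixList)."""
    return [s + m for s, m in product(stems, suffixes)]

def word_prob(stem, suffix, p_stem, p_suffix):
    """p(w = s + m) = p(s) * p(m), assuming stems and suffixes are independent."""
    return p_stem[stem] * p_suffix[suffix]

# Hypothetical paradigm and morpheme probabilities, for illustration only.
stems = ["walk", "talk"]
suffixes = ["s", "ed", "ing"]
p_stem = {"walk": 0.6, "talk": 0.4}
p_suffix = {"s": 0.5, "ed": 0.3, "ing": 0.2}

words = paradigm_words(stems, suffixes)
p_walking = word_prob("walk", "ing", p_stem, p_suffix)
```

Two words from the same paradigm, such as "walking" and "talking", share the suffix "ing", so they score 1 on the common-morpheme similarity metric.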
  11. An application: learning of hierarchical morphology paradigms [3]
     • Two Dirichlet processes generate stems and suffixes independently.
     • $\beta_s$ = concentration parameter of the stem DP; it controls the number of stem types generated by the DP (the suffix DP has its own concentration parameter).
     • If β is small, new stem/suffix types are less likely to be generated.
     • If β is large, new stem/suffix types are more likely to be generated, yielding a more uniform distribution.
     • The authors choose β < 1, to yield a more skewed distribution with sparse stems and suffixes.
     • P = base distribution, specifying the prior probability distribution over morpheme lengths.
     • The joint probability of the stems then follows from the DP; see [3] for the expression.
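The effect of β can be sketched with the standard Dirichlet process predictive rule, in which the next morpheme equals m with probability (n_m + β·P(m)) / (N + β) after N morphemes have been seen. This is a sketch under stated assumptions, not necessarily the exact expression used in [3]: the base distribution `base_prob` (geometric length, uniform letters over a 26-letter alphabet) and the example stems are hypothetical.

```python
import math
from collections import Counter

def base_prob(morpheme, alphabet_size=26, stop=0.5):
    """Hypothetical base distribution P: geometric length, uniform letters."""
    n = len(morpheme)
    return stop * (1.0 - stop) ** (n - 1) * (1.0 / alphabet_size) ** n

def dp_log_joint(morphemes, beta, base=base_prob):
    """Log joint probability of a morpheme sequence under a DP(beta, base).

    Multiplies the predictive probabilities (n_m + beta * P(m)) / (N + beta),
    where n_m counts earlier occurrences of m among the N morphemes seen so far.
    """
    counts = Counter()
    log_p = 0.0
    for seen, m in enumerate(morphemes):
        log_p += math.log((counts[m] + beta * base(m)) / (seen + beta))
        counts[m] += 1
    return log_p

# A reused stem is far more probable than a second new stem.
reuse = dp_log_joint(["walk", "walk"], beta=0.5)
fresh = dp_log_joint(["walk", "talk"], beta=0.5)

# Small beta favors reuse (skewed, sparse types); large beta favors new types.
reuse_small_beta = dp_log_joint(["walk", "walk"], beta=0.1)
reuse_large_beta = dp_log_joint(["walk", "walk"], beta=10.0)
```

The joint probability does not depend on the order in which the stems are listed, which is the exchangeability property that makes the sequential product a valid joint distribution.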
  12. References
     1. Frigyik, Bela A., et al. "Introduction to the Dirichlet Distribution and Related Processes." (2010).
     2. http://phyletica.org/dirichlet-process/
     3. Can, Burcu, and Suresh Manandhar. "Probabilistic Hierarchical Clustering of Morphological Paradigms." EACL (2012).
  13. Thank you!