
What is Latent Dirichlet Allocation?

LDA automatically discovers topics contained in a collection of documents. A document is a mixture of topics and each topic is a mixture of words.

For example, given a small collection of sentences (documents), LDA might discover the mixture of topics that makes up each document and the mixture of words that makes up each topic. LDA discovers both the topic mixtures and the word mixtures automatically.

The LDA Model

LDA is a generative model in which we assume the observed data (documents consisting of words) is generated by a latent probabilistic model. In other words, each document is about each topic with some probability and each topic contains each word with some probability.

The LDA model assumes each document was generated by the following steps:

1. Choose a topic mixture for the document (drawn from a Dirichlet distribution over a fixed set of topics).
2. For each word slot in the document:
   - Pick a topic from the document's topic mixture.
   - Pick a word from that topic's word mixture.

LDA then works backwards: it discovers topic mixtures and word mixtures which could have generated the observed documents by the above steps.
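As a rough illustration of that generative story (this is a minimal sketch, not the author's code; it assumes numpy and small, made-up sizes for the number of topics, vocabulary, and documents):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_topics, vocab_size, n_docs, doc_length = 3, 20, 5, 50
alpha = np.full(n_topics, 0.5)    # Dirichlet prior over topic mixtures
beta = np.full(vocab_size, 0.1)   # Dirichlet prior over word mixtures

# Each topic is a distribution over the vocabulary.
topic_word = rng.dirichlet(beta, size=n_topics)

documents = []
for _ in range(n_docs):
    # Each document gets its own mixture of topics.
    doc_topic = rng.dirichlet(alpha)
    words = []
    for _ in range(doc_length):
        z = rng.choice(n_topics, p=doc_topic)        # pick a topic
        w = rng.choice(vocab_size, p=topic_word[z])  # pick a word from that topic
        words.append(w)
    documents.append(words)
```

Fitting LDA means inferring `doc_topic` and `topic_word` from the `documents` alone.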

Aside: Some Statistics

Above I talked about sampling from various distributions, but what is the actual nature of these distributions?

Determining the bias of a coin

We can start with a simpler mechanism than words in documents: coins. To determine the probability a coin comes up heads you’d probably flip the coin a bunch of times and count the number of times it came up heads vs the total number of flips.

Doing this experiment, however, isn't enough, because a single point estimate doesn't tell you how confident to be in it. For example, if you flip the coin 10 times and get 7 heads, you might estimate the true bias of the coin to be 0.7, but it could plausibly be 0.6 or 0.8.

So what you really want is a probability distribution over the likelihood the coin comes up heads, and you'd like this distribution to become more and more concentrated as you observe more flips (data).
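A quick simulation makes the point (a sketch assuming numpy; the true bias of 0.7 is a made-up value): repeating the same 10-flip experiment on the same coin gives noticeably different point estimates each time.

```python
import numpy as np

rng = np.random.default_rng(1)
true_bias = 0.7  # hypothetical true probability of heads

# Repeat the "flip 10 times and count heads" experiment a few times.
for _ in range(5):
    flips = rng.random(10) < true_bias
    print(f"{flips.sum()} heads out of 10 -> estimate {flips.mean():.1f}")
```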

Written mathematically, we want to find \( f(\phi \mid h, N = h + t) \), or the probability density function which describes the probability the coin comes up heads, \(\phi\), given we’ve observed \(h\) heads and \(t\) tails.

To determine this function, let's first define some notation which will be used below:

- \(\phi\): the probability the coin comes up heads
- \(h\): the number of heads observed
- \(t\): the number of tails observed
- \(N = h + t\): the total number of flips

The posterior probability density function can be written using Bayes' rule: \begin{align} f(\phi \mid h, N = h + t) &= \frac{P(h \mid \phi, N = h + t) P(\phi)}{P(h \mid N = h + t)} \\ & = \frac{P(h \mid \phi, N = h + t) P(\phi)}{\int_0^1 P(h \mid \phi^\prime, N = h + t) P(\phi^\prime) \mathrm{d}\phi^\prime} \end{align}

Now, before observing any flips, we might assume that all values of \(\phi\) are equally likely, so \(P(\phi)\) is uniform and \(P(\phi) = 1\). Rewriting the above equation,

\begin{align} f(\phi \mid h, N = h + t) &= \frac{P(h \mid \phi, N = h + t)}{\int_0^1 P(h \mid \phi^\prime, N = h + t) \mathrm{d}\phi^\prime} \end{align}

OK, now what can we say about \(P(h \mid \phi, N = h + t)\), the probability of getting \(h\) heads in \(N\) flips given a particular \(\phi\)? Well, it turns out the number of heads is a binomial random variable with parameters \(\phi\) and \(N\).

The binomial distribution describes the probability of achieving a particular number of successes in \(N\) independent yes / no experiments each of which yields success with probability \(\phi\).
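As a small sanity check of that quantity (a sketch using only the Python standard library; the numbers 7, 10, and 0.5 are made up for illustration):

```python
from math import comb

# P(h heads in N flips | phi): the binomial probability mass function.
def binom_pmf(h, N, phi):
    return comb(N, h) * phi**h * (1 - phi)**(N - h)

print(binom_pmf(7, 10, 0.5))  # probability of 7 heads in 10 fair-coin flips
```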

Using the probability mass function of the binomial distribution,

\begin{align} P(h \mid \phi, N = h + t) &= {N \choose h} \phi^{h} (1-\phi)^{t} \end{align}

We can substitute this value back into the posterior PDF equation above:

\begin{align} f(\phi \mid h, N = h + t) &= \frac{ {N \choose h} \phi^{h} (1-\phi)^{t}}{\int_0^1 {N \choose h} (\phi^\prime)^{h} (1-\phi^\prime)^{t} \mathrm{d}\phi^\prime} \\ & = \frac{ \phi^{h} (1-\phi)^{t}}{\int_0^1 (\phi^\prime)^{h} (1-\phi^\prime)^{t} \mathrm{d}\phi^\prime} \end{align}

In this result, the denominator acts as a constant normalization factor which makes \(\int_0^1 f(\phi \mid h, N = h + t) \mathrm{d}\phi = 1\) so it is a true probability density function. This form also turns out to be a beta distribution with parameters \((h+1)\) and \((t+1)\).
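We can check this numerically (a sketch assuming numpy and scipy, with made-up counts \(h = 7\), \(t = 3\)): normalizing \(\phi^{h} (1-\phi)^{t}\) by its integral over \([0, 1]\) gives the same value as the Beta(\(h+1\), \(t+1\)) density.

```python
from scipy.stats import beta
from scipy.integrate import quad

h, t = 7, 3

# Normalize phi^h * (1 - phi)^t over [0, 1] by numerical integration ...
norm, _ = quad(lambda p: p**h * (1 - p)**t, 0, 1)

phi = 0.6
posterior = phi**h * (1 - phi)**t / norm

# ... and compare against the Beta(h+1, t+1) density at the same point.
print(posterior, beta.pdf(phi, h + 1, t + 1))  # the two values agree
```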

The beta distribution is the conjugate prior for the likelihood function of the binomial distribution. A conjugate prior is a distribution whose posterior distribution is in the same family as the prior. Essentially this is a fancy way of saying that after observing more coin flips we just need to add the respective number of heads and tails to the parameters of the above beta distribution to get an updated estimate of our probability density function.
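Written as a tiny sketch (the function and parameter names here are mine, not from the article), the conjugate update is just count bookkeeping:

```python
# Start from a uniform prior, Beta(1, 1), i.e. zero heads and zero tails observed.
a, b = 1, 1

# Observing flips simply adds the new counts to the Beta parameters.
def update(a, b, heads, tails):
    return a + heads, b + tails

a, b = update(a, b, heads=7, tails=3)   # posterior is Beta(8, 4)
a, b = update(a, b, heads=12, tails=8)  # after more flips: Beta(20, 12)
```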

On to Dirichlet

The Dirichlet distribution is the multi-dimensional generalization of the Beta distribution. For a variable that can take on several discrete values, it describes a probability distribution over the likelihoods of each of those values. For example, you might sample from a Dirichlet distribution to get an estimate for the weights of the sides of a die (or the probabilities with which each of a set of topics occurs in a document…). A sample drawn from a Dirichlet distribution gives you the parameters of a multinomial distribution, analogous to how a sample from a Beta distribution gives the parameter of a binomial distribution.
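For instance (a sketch assuming numpy; the symmetric prior and 100 rolls are arbitrary choices), drawing die weights from a Dirichlet and then rolling with those weights:

```python
import numpy as np

rng = np.random.default_rng(2)

# A symmetric Dirichlet over the six sides of a die.
alpha = np.ones(6)
die_weights = rng.dirichlet(alpha)   # one sample: a probability vector summing to 1
print(die_weights)

# Those weights parameterize a multinomial, e.g. counts from 100 rolls.
rolls = rng.multinomial(100, die_weights)
print(rolls)
```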


