LDA automatically discovers topics contained in a collection of documents. A document is a mixture of topics and each topic is a mixture of words.
For example, given the sentences below (documents) LDA might discover the topic mixture:
And the word mixtures:
LDA discovers both the topic mixture and the word mixture.
LDA is a generative model in which we assume the observed data (documents consisting of words) is generated by a latent probabilistic model. In other words, each document is about each topic with some probability and each topic contains each word with some probability.
The LDA model assumes each document is generated by the following steps:
1. Choose the number of words the document will contain.
2. Sample a topic mixture for the document (a distribution over topics) from a Dirichlet distribution.
3. For each word in the document, first sample a topic from the document's topic mixture, then sample a word from that topic's distribution over words (each topic's word distribution is itself drawn from a Dirichlet distribution).
LDA discovers a model which could have used the above steps to generate the documents.
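To make the generative story concrete, here is a minimal sketch in Python (using numpy) of how a single document could be produced by these steps. The toy vocabulary, the number of topics, and the Dirichlet concentration parameters are made-up illustrative values, not anything prescribed by LDA itself.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["cat", "dog", "fish", "run", "jump", "swim"]  # toy vocabulary
num_topics = 2
doc_length = 8

# Each topic is a distribution over words, drawn from a Dirichlet prior.
topic_word = rng.dirichlet(alpha=np.ones(len(vocab)) * 0.5, size=num_topics)

# Each document is a distribution over topics, also drawn from a Dirichlet prior.
doc_topic = rng.dirichlet(alpha=np.ones(num_topics) * 0.5)

# Generate the document word by word: pick a topic, then pick a word from that topic.
words = []
for _ in range(doc_length):
    topic = rng.choice(num_topics, p=doc_topic)
    words.append(rng.choice(vocab, p=topic_word[topic]))

print("topic mixture:", doc_topic)
print("document:", " ".join(words))
```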
Above I talked about sampling from various distributions, but what is the actual nature of these distributions?
We can start with a simpler mechanism than words in documents: coins. To determine the probability a coin comes up heads you’d probably flip the coin a bunch of times and count the number of times it came up heads vs the total number of flips.
Doing this experiment, however, isn't enough, because any finite number of flips leaves you uncertain about the true bias. For example, if you flip the coin 10 times and get 7 heads you might estimate the true bias of the coin to be 0.7, but it could just as easily be 0.6 or 0.8.
So what you really want is a probability distribution over the probability the coin comes up heads, and you'd like this distribution to become more and more concentrated as you observe flips (data).
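As a quick illustration of this uncertainty (my own example, assuming numpy is installed), the sketch below simulates many runs of 10 flips of a coin whose true bias is 0.7 and shows how much the simple estimate \(h / N\) bounces around from run to run.

```python
import numpy as np

rng = np.random.default_rng(1)

true_bias = 0.7       # the (unknown, here assumed) probability of heads
flips_per_run = 10
runs = 10_000

# For each run, count heads in 10 flips and estimate the bias as heads / flips.
heads = rng.binomial(n=flips_per_run, p=true_bias, size=runs)
estimates = heads / flips_per_run

print("mean estimate:", estimates.mean())
print("spread (std) of estimates:", estimates.std())
print("fraction of runs that estimated exactly 0.7:", (heads == 7).mean())
```

With only 10 flips the estimates have a standard deviation of roughly 0.14, so 0.6 and 0.8 really are entirely plausible values of the bias.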
Written mathematically, we want to find \( f(\phi \mid h, N = h + t) \), or the probability density function which describes the probability the coin comes up heads, \(\phi\), given we’ve observed \(h\) heads and \(t\) tails.
To determine this function, let's first define some notation: \(\phi\) is the probability the coin comes up heads (the coin's bias), \(h\) is the number of heads observed, \(t\) is the number of tails observed, and \(N = h + t\) is the total number of flips.
The posterior probability density function can be written using Bayes rule: \begin{align} f(\phi \mid h, N = h + t) &= \frac{P(h \mid \phi, N = h + t) P(\phi)}{P(h \mid N = h + t)} \\ & = \frac{P(h \mid \phi, N = h + t) P(\phi)}{\int_0^1 P(h \mid \phi^\prime, N = h + t) P(\phi^\prime) \mathrm{d}\phi^\prime} \end{align}
Now, suppose we assume a priori that all values of \(\phi\) are equally likely, so \(P(\phi)\) is uniform over \([0, 1]\) and \(P(\phi) = 1\). Rewriting the above equation,
\begin{align} f(\phi \mid h, N = h + t) &= \frac{P(h \mid \phi, N = h + t)}{\int_0^1 P(h \mid \phi^\prime, N = h + t) \mathrm{d}\phi^\prime} \end{align}
Ok, now what can we say about \(P(h \mid \phi, N = h + t)\), the probability of getting \(h\) heads in \(N\) flips given a particular \(\phi\)? Well it turns out this outcome is a binomial random variable with parameters \(\phi\) and \(N\).
The binomial distribution describes the probability of achieving a particular number of successes in \(N\) independent yes / no experiments each of which yields success with probability \(\phi\).
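As a quick sanity check (my own example, assuming scipy is installed), we can evaluate this probability for a concrete case, say \(\phi = 0.7\), \(N = 10\), \(h = 7\), both with scipy's binomial PMF and with the formula written out by hand:

```python
from math import comb

from scipy.stats import binom

phi = 0.7   # assumed bias of the coin
N = 10      # total number of flips
h = 7       # heads observed

# Probability of exactly h heads in N flips, via scipy's binomial PMF...
p_scipy = binom.pmf(h, N, phi)

# ...and via C(N, h) * phi^h * (1 - phi)^(N - h) computed directly.
p_manual = comb(N, h) * phi**h * (1 - phi) ** (N - h)

print(p_scipy, p_manual)  # both approximately 0.2668
```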
Using the probability mass function of the binomial distribution, \begin{align} P(h \mid \phi, N = h + t) &= {N \choose h} \phi^{h} (1-\phi)^{t} \end{align}
We can substitute this value back into the posterior PDF equation above:
\begin{align} f(\phi \mid h, N = h + t) &= \frac{ {N \choose h} \phi^{h} (1-\phi)^{t}}{\int_0^1 {N \choose h} (\phi^\prime)^{h} (1-\phi^\prime)^{t} \mathrm{d}\phi^\prime} \\ & = \frac{ \phi^{h} (1-\phi)^{t}}{\int_0^1 (\phi^\prime)^{h} (1-\phi^\prime)^{t} \mathrm{d}\phi^\prime} \end{align}
In this result, the denominator acts as a constant normalization factor which makes \(\int_0^1 f(\phi \mid h, N = h + t) \mathrm{d}\phi = 1\) so it is a true probability density function. This form also turns out to be a beta distribution with parameters \((h+1)\) and \((t+1)\).
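To see this concretely, here is a small check (my own sketch, assuming numpy and scipy are available) that normalizes \(\phi^{h} (1-\phi)^{t}\) numerically and compares the result against scipy's Beta\((h+1, t+1)\) density:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

h, t = 7, 3  # 7 heads and 3 tails observed


def unnormalized(phi):
    # Unnormalized posterior under a uniform prior: phi^h * (1 - phi)^t.
    return phi**h * (1 - phi) ** t


# Normalization constant: the integral in the denominator above.
Z, _ = quad(unnormalized, 0, 1)

phi_grid = np.linspace(0.01, 0.99, 50)
posterior = unnormalized(phi_grid) / Z
beta_pdf = beta.pdf(phi_grid, h + 1, t + 1)

print(np.allclose(posterior, beta_pdf))  # True: the posterior is Beta(h+1, t+1)
```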
The beta distribution is the conjugate prior for the binomial likelihood function. A conjugate prior is a prior distribution which, when combined with its likelihood, yields a posterior in the same family as the prior. Essentially this is a fancy way of saying that after observing more coin flips we just need to add the new counts of heads and tails to the parameters of the beta distribution above to get an updated probability density function.
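For example, here is a sketch of that update rule (again my own, assuming scipy; the extra flip counts are made up): starting from the posterior after 7 heads and 3 tails, observing 20 more heads and 10 more tails simply adds to the beta parameters.

```python
from scipy.stats import beta

# Posterior after the first 10 flips (7 heads, 3 tails), starting from a uniform prior.
a, b = 7 + 1, 3 + 1

# Observe 30 more flips: 20 heads and 10 tails. Conjugacy means we just add the counts.
new_heads, new_tails = 20, 10
a, b = a + new_heads, b + new_tails

posterior = beta(a, b)
print("posterior mean:", posterior.mean())  # ~0.67
print("posterior std:", posterior.std())    # narrower than after only 10 flips
```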
The Dirichlet distribution is the multi-dimensional generalization of the beta distribution. Where the beta distribution describes uncertainty about a single probability, the Dirichlet describes uncertainty about a whole vector of probabilities that must sum to one. For example, you might sample from a Dirichlet distribution to get an estimate of the weights of the sides of a die (or the probabilities with which each of a set of topics occurs in a document…). A sample drawn from a Dirichlet distribution gives you the parameters of a multinomial distribution, analogous to how a sample from a beta distribution gives you the parameter of a binomial distribution.
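A small sketch of this relationship (my own example, using numpy; the die and the concentration parameters are illustrative only): each draw from a Dirichlet is itself a probability vector, which can then parameterize a multinomial, just as a beta draw parameterizes a binomial.

```python
import numpy as np

rng = np.random.default_rng(2)

# One draw from a symmetric Dirichlet over 6 outcomes: a candidate set of die weights.
die_weights = rng.dirichlet(alpha=np.ones(6))
print("die weights:", die_weights, "sum:", die_weights.sum())  # the weights sum to 1

# Those weights parameterize a multinomial: roll the (possibly unfair) die 100 times.
rolls = rng.multinomial(100, die_weights)
print("counts per face:", rolls)

# The 2-dimensional case reduces to the beta/binomial pair.
phi, _ = rng.dirichlet(alpha=[1.0, 1.0])  # equivalent to a Beta(1, 1) draw
heads = rng.binomial(10, phi)
print("phi:", phi, "heads in 10 flips:", heads)
```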