
GoogleTechTalks: Modeling Science: Dynamic Topic Models of Scholarly

NerdPollution edited this page Sep 28, 2014 · 1 revision

Topic Models:

  • automatically discover topics from a collection of documents
  • automatically label unlabeled images
  • model connections between topics (groups of words)

LDA (Latent Dirichlet allocation):

  • Treat data as observations that arise from a generative probabilistic process that includes hidden variables
    ⋅⋅⋅ For documents, the hidden variables reflect the thematic structure of the collection

  • General Idea:
    ⋅⋅⋅ Cast the intuition about the data into a generative probabilistic process:
    ⋅⋅⋅ Each document is a random mixture of corpus-wide topics
    ⋅⋅⋅ Each word is drawn from one of these topics

  • Algorithm variables:
    ⋅⋅⋅ K: number of topics
    ⋅⋅⋅ beta_k: for each topic k, a distribution over all the words in the vocabulary
    ⋅⋅⋅ D: the documents in the collection
    ⋅⋅⋅ theta_d: for each document d, its mixture proportions over the K topics
    ⋅⋅⋅ N: the number of words in a document
    ⋅⋅⋅ z: for each word, the topic it is assigned to
    ⋅⋅⋅ w: the observed word, drawn from the assigned topic's distribution beta_z

  • Algorithm (informal) steps:
    ⋅⋅⋅ 1. Draw each topic beta_i ~ Dir(eta), for i in {1, ..., K}
    ⋅⋅⋅ 2. For each document:
    ⋅⋅⋅⋅⋅⋅ 1. Draw topic proportions theta ~ Dir(alpha)
    ⋅⋅⋅⋅⋅⋅ 2. For each word, draw a topic assignment z from theta, then draw the word from that topic's distribution beta_z
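The generative process above can be sketched directly in code. This is a minimal toy simulation, not an inference implementation; the sizes (K, V, D, N) and the hyperparameters alpha and eta are made-up illustration values, not numbers from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: K topics, a V-word vocabulary,
# D documents of N words each.
K, V, D, N = 3, 10, 5, 8
alpha, eta = 0.1, 0.1  # Dirichlet hyperparameters (assumed values)

# Step 1: draw each topic beta_i ~ Dir(eta),
# a distribution over the whole vocabulary.
beta = rng.dirichlet(np.full(V, eta), size=K)  # shape (K, V)

docs = []
for d in range(D):
    # Step 2a: draw the document's topic proportions theta ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha))   # shape (K,)
    words = []
    for n in range(N):
        # Step 2b: each word first chooses a topic z from theta...
        z = rng.choice(K, p=theta)
        # ...then the word w is drawn from that topic's
        # distribution beta_z over the vocabulary.
        w = rng.choice(V, p=beta[z])
        words.append(w)
    docs.append(words)
```

Running this yields D lists of N word ids; real LDA works in the opposite direction, inferring beta and theta from observed documents.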


Vocabulary:

  • Generative model: a model for randomly generating observable data, typically given some hidden parameters. It specifies a joint probability distribution over observations and label sequences.
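As a sketch of this definition, here is a hypothetical two-component generative model (not from the talk): a hidden label z is drawn from a prior, then an observation x is drawn given z, so the joint distribution factors as p(x, z) = p(z) p(x | z). The prior and the Gaussian means below are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(1)

p_z = np.array([0.4, 0.6])     # prior over the hidden label z
means = np.array([-2.0, 2.0])  # label-dependent observation model p(x | z)

def sample():
    z = rng.choice(2, p=p_z)       # draw the hidden label from p(z)
    x = rng.normal(means[z], 1.0)  # draw the observation from p(x | z)
    return x, z

# Repeated sampling generates (observation, label) pairs
# from the joint distribution p(x, z).
samples = [sample() for _ in range(1000)]
```

LDA is a generative model in exactly this sense: the hidden variables are the topics and per-document topic proportions, and the observations are the words.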
