Topic Modeling with Newspaper Archives

Marijn Koolen

Koninklijke Bibliotheek, The Hague, 19/01/2015

Programme

  1. What are topic models?
  2. Why use topic models?
  3. How do topic models work?
  4. Working with topic models

Part 1: What Are Topic Models?

a general overview

Topic Models

Assumptions

Statistical Modeling

Probability and Intuition

Wikipedia: Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words.
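
The intuition in the quote can be sketched as a tiny generative model. This is only an illustration: the two topics, their word probabilities and the 90%/10% mixture below are made-up values, not taken from any real corpus or from a trained model. Each word in the document is produced by first picking a topic according to the document's topic proportions and then picking a word from that topic.

```python
import random

# Hypothetical word distributions for two topics (each sums to 1).
# "the" and "is" are equally likely under both topics.
TOPICS = {
    "dogs": {"dog": 0.3, "bone": 0.2, "the": 0.3, "is": 0.2},
    "cats": {"cat": 0.3, "meow": 0.2, "the": 0.3, "is": 0.2},
}

# A document that is 90% about dogs and 10% about cats.
MIXTURE = {"dogs": 0.9, "cats": 0.1}


def word_probabilities(mixture, topics):
    """P(word) = sum over topics of P(topic) * P(word | topic)."""
    probs = {}
    for topic, topic_weight in mixture.items():
        for word, word_weight in topics[topic].items():
            probs[word] = probs.get(word, 0.0) + topic_weight * word_weight
    return probs


def generate_document(mixture, topics, length, seed=0):
    """Sample words one at a time: pick a topic, then a word from it."""
    rng = random.Random(seed)
    words = []
    for _ in range(length):
        topic = rng.choices(list(mixture), weights=list(mixture.values()))[0]
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return words


# Expected share of dog words versus cat words in the document.
probs = word_probabilities(MIXTURE, TOPICS)
dog_share = probs["dog"] + probs["bone"]
cat_share = probs["cat"] + probs["meow"]
print(dog_share / cat_share)

# Counts in one sampled document; the ratio fluctuates around the expectation.
doc = generate_document(MIXTURE, TOPICS, length=20000)
dog_count = sum(word in ("dog", "bone") for word in doc)
cat_count = sum(word in ("cat", "meow") for word in doc)
print(dog_count, cat_count)
```

With these numbers the expected share of dog words is 0.45 and of cat words 0.05, so the expected ratio is 9 to 1, matching the quote. A topic modeling tool works in the opposite direction: it only sees the words and has to infer the topics and the proportions.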

Topic Modeling Software

Part 2: Why Use Topic Models?

relevance for research

Suggestive Patterns

Topic Modeling in Digital Humanities

Sub-Topic Modeling

Experimenting with STM

Topic Modeling in Research

KB Newspaper Archive

Examples with KB Newspaper Archive

Researcher’s Responsibility

Familiarity with Corpus

Lies, Damned Lies and Statistics

Mixing Languages

Beyond Text

Part 3: How Do Topic Models Work?

the technical details

Two Parts of Technicalities

Computational Thinking

Transformations

1. Preprocessing Text

2. Indexing Text

2a. Text as Vectors
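
One common way to treat text as vectors is the bag-of-words representation: every document becomes a list of word counts over a shared vocabulary, and word order is ignored. The sketch below assumes whitespace tokenization and two made-up documents; the names `documents`, `vocabulary` and `to_vector` are illustrative only.

```python
from collections import Counter

# Two made-up documents, already lowercased and stripped of punctuation.
documents = {
    "doc1": "the dog chews the bone",
    "doc2": "the cat ignores the dog",
}

# The shared vocabulary fixes the meaning of each vector position.
vocabulary = sorted({word for text in documents.values() for word in text.split()})


def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


vectors = {doc_id: to_vector(text, vocabulary) for doc_id, text in documents.items()}

print(vocabulary)
for doc_id, vector in vectors.items():
    print(doc_id, vector)
```

Both vectors have the same length, so documents can be compared position by position even though the original texts differ.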

2b. Inverted Index
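
An inverted index turns the document-to-words view around: for each word it records the documents in which that word occurs. A minimal sketch, again assuming whitespace tokenization and made-up documents (`build_inverted_index` is a hypothetical helper, not part of any particular tool):

```python
from collections import defaultdict

# Two made-up documents, already lowercased and stripped of punctuation.
documents = {
    "doc1": "the dog chews the bone",
    "doc2": "the cat ignores the dog",
}


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


index = build_inverted_index(documents)
print(sorted(index["dog"]))   # ['doc1', 'doc2']
print(sorted(index["bone"]))  # ['doc1']
```

Looking up a word is then a single dictionary access rather than a scan through every document.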

Parsing & Units of Data

Stopwords
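
Stopwords are very frequent words such as "the" and "is" that occur in nearly every document and so say little about its topic; they are usually filtered out before modeling. The sketch below uses a tiny, made-up stopword list purely for illustration; real lists are much longer and language-specific.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "is", "a", "of", "and"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = "The dog is the owner of a bone".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['dog', 'owner', 'bone']
```

The comparison is done on the lowercased token, so "The" at the start of a sentence is removed as well.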

3. Modeling

Semantic Relatedness

Frequency vs. Importance

Mallet & LDA

Considerations 1/2

Considerations 2/2

Part 4: Working with Topic Models

hands-on experience

Exercises

  1. Using Mallet
    • making topic models
  2. Comparing models
    • understanding impact of parameters
  3. Analysing probability distributions
    • exploring topics in collections

Exercise 1: Using Mallet

Exercise 2: Comparing Models

Exercise 3: Analyse Probabilities

Further Reading

Wrap Up