Historical Sources and Topic Modeling
Marijn Koolen
24 April 2018
Overview
- Topic Modelling
- Data-driven discovery of topics in text
- What are they? Why use them? How do they work?
Topic Models
- Representing topics in a collection of documents
- Use statistics to find topics represented by groups of words
- Document is a mix of topics
- Topic is a mix of words
- Documents and words can be directly observed
- topics are ‘hidden’ or ‘latent’
Illustration 1

Illustration 2

Assumptions
- Two documents with the same topics will have overlap in words
- not literally true, but probabilistically true
- Single document can consist of many topics
- Three elements: words, topics, documents
- topics are formed by a selection of words
- documents are formed by a selection of topics
Statistical Modeling
- Given a collection of documents (text or otherwise), the modeling process does two things:
- create word probability distribution for topics
- create topic probability distribution for documents
- Both are purely based on frequency and co-occurrence of words
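A minimal sketch of how the two distributions fit together, assuming they have already been estimated. The topic names, words and probabilities below are hypothetical toy values, not output from a real model. The probability of a word in a document is the topic-weighted sum of the word's probability in each topic.

```python
# Hypothetical toy distributions, chosen only for illustration.
# Word probability distribution per topic (each row sums to 1).
topic_word = {
    "war": {"soldier": 0.5, "front": 0.4, "price": 0.1},
    "economy": {"soldier": 0.1, "front": 0.1, "price": 0.8},
}
# Topic probability distribution for one document (sums to 1).
doc_topic = {"war": 0.75, "economy": 0.25}


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        p_topic * topic_word[topic].get(word, 0.0)
        for topic, p_topic in doc_topic.items()
    )


vocabulary = ["soldier", "front", "price"]
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}
print(doc_word)
```

Because the document is mostly about the "war" topic, "soldier" and "front" end up more probable than "price", even though "price" dominates the "economy" topic.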
Suggestive Patterns
- Overcomes problems of keyword search
- search with whole dictionary
- but words weighted by topical importance
- Gives insight into the topical nature of the collection
- Advantages:
- Great for finding suggestive patterns
- Disadvantages:
- Topics can be hard to interpret
Topic Modelling in Digital Humanities
- Extremely popular, especially in Historical Sciences
KB Newspaper Archive
- Over 80 million articles
- organised via formal metadata
- date, newspaper title, article type
- How is it organised in terms of topics?
- Sampled collection:
- 100,000 articles matching: groote oorlog OR wereldoorlog OR europeesche oorlog OR 1914-1918
- Constrained to period 1918-1940
Topics in Newspapers
- News articles on World War One, published during the Interbellum
Interpreting Topics

Two Parts of Technicalities
- Lots of statistics
- we’ll only scratch the surface!
- Lots of transformations
- most of these steps are easy to understand
- but also important to understand
- need to be aware of them to gain control
- Do experiments to get feel for what’s meaningful
- Four major steps:
- Text > preprocessing > Words
- Words > indexing > Numbers
- Numbers > modeling > Topics
- Topics > analysing > Compositions
- We’ll encounter specific transformations on the way
1. Preprocessing Text
- What do the newspaper articles look like?
2. Indexing Text
- From words to vectors to indexes
2a. Text as Vectors
- text is linear sequence of words
- can be represented as
- or as a vector of words
- ‘Easy’ to see which texts have overlap in words
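A minimal sketch of texts as word vectors, assuming simple lowercasing and whitespace splitting as the tokeniser. The two example sentences and the helper name `to_vector` are made up for illustration.

```python
from collections import Counter


def to_vector(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())


# Hypothetical example texts.
doc_a = to_vector("the war ended and the soldiers returned")
doc_b = to_vector("the soldiers remembered the war")

# Words that occur in both vectors show the overlap between the texts.
shared = set(doc_a) & set(doc_b)
print(sorted(shared))
```

Each vector maps a word to its count, so comparing two texts reduces to comparing their keys and values.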
2b. Inverted Index
- Term-document index:
- each word lists the texts it appears in
- index becomes rows of text vectors
- inverted index
- interesting aside: search engines use this for quick lookup!
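The inverted index can be sketched in a few lines, assuming whitespace tokenisation. The document ids, texts and the function name `build_inverted_index` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map each word to the set of document ids it appears in."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Hypothetical mini-collection.
texts = {
    "d1": "war and peace",
    "d2": "the war ended",
    "d3": "peace treaty signed",
}
index = build_inverted_index(texts)
print(sorted(index["war"]))
```

Looking up a word is a single dictionary access, which is why this structure suits quick lookup.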
Parsing & Units of Data
- Usually words as units
- can be anything, but features need high enough frequency of units
- trigrams and longer phrases often too sparse
- Bag of words:
- ignores word order, syntax, sentence or paragraph boundaries
- same with other kinds of data (colours, objects, melodies)
Stopwords
- Function words and other frequent words carry little meaning in modelled topics
- But which words are stopwords?
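One possible way to approach the question, sketched under an assumption: treat words that occur in a large share of the documents as candidate stopwords. The 80% threshold (`min_share`), the function name and the example texts are illustrative choices, not a recommendation from the slides. In practice a curated list is often used as well.

```python
from collections import Counter


def candidate_stopwords(docs, min_share=0.8):
    """Return words that appear in at least min_share of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(doc.lower().split()))
    return {w for w, df in doc_freq.items() if df / len(docs) >= min_share}


# Hypothetical example texts (already lowercased).
docs = [
    "the war in europe",
    "the price of bread",
    "the end of the war",
    "peace in the east",
]
stop = candidate_stopwords(docs)
filtered = [[w for w in doc.split() if w not in stop] for doc in docs]
print(stop, filtered[0])
```

Changing the threshold changes which words are removed, so it is worth experimenting with before modeling.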
3. Modeling
- Topics are latent
- Reduce high-dimensional term vector space to low-dimensional 'latent' topic space
- Topics represented by prob. dist. over words
- Texts represented by prob. dist. over topics
- Established models:
- LSA: Latent Semantic Analysis
- LDA: Latent Dirichlet Allocation
- Two words co-occurring in a text
- signal that they are related
- document frequency determines strength of signal
- co-occurrence index
- Easy to see which words are related
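A minimal sketch of a co-occurrence index, assuming that two words co-occur when they appear in the same text, and counting each pair once per text. The example texts and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count in how many documents each unordered word pair co-occurs."""
    counts = Counter()
    for doc in docs:
        # Sort the unique words so each pair has one canonical order.
        words = sorted(set(doc.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Hypothetical example texts.
docs = ["war front soldier", "war soldier peace", "bread price market"]
counts = cooccurrence_counts(docs)
print(counts[("soldier", "war")])
```

Pairs with a high count relative to how common the individual words are give the strongest signal that the words are related.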
Frequency vs. Importance
- How can statistics help identify important words?
- TF * IDF indicates importance of term relative to the document
- TF: Term Frequency
- terms occurring more frequently in a document are more important
- IDF: Inverse Document Frequency
- terms in fewer documents are more specific
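A minimal sketch of TF * IDF, assuming one common variant: raw term count as TF and log(N / document frequency) as IDF. Other weighting schemes exist, and the example texts and function name are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf * idf} dictionary per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents each word appears in.
    doc_freq = Counter()
    for words in tokenised:
        doc_freq.update(set(words))
    scores = []
    for words in tokenised:
        tf = Counter(words)  # raw term frequency within this document
        scores.append(
            {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
        )
    return scores


# Hypothetical example texts.
docs = ["the war the front", "the price of bread", "the war ended"]
scores = tf_idf(docs)
print(scores[0])
```

In the first text, "the" is the most frequent word but scores zero because it occurs in every document, while "front" scores highest because it occurs in only one.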
4. Analysing Compositions
- LDAvis is a wonderful tool to visually explore generated topics
- Topic Modeling Tool gives handy output for analysing compositions
Topic Modelling Summary
- Very interesting technique for humanities
- suggestive patterns good for interpretation
- but easy to see connections that aren’t there
- many examples of do’s and don’ts
- inexact science, but there are best practices
Wrap Up
- Topic modelling:
- data-driven, less biased by own knowledge/experience
- meaning from statistical structure, ‘unstructured data’
- Assignment: topic modelling in Tijdschrift voor Geschiedenis