Historical Sources and Topic Modeling

Marijn Koolen

24 April 2018

Overview

Topic Modelling
- Data-driven discovery of topics in text
- What are they? Why use them? How do they work?

Topic Models

Representing topics in a collection of documents
- Use statistics to find topics represented by groups of words
- Document is a mix of topics
- Topic is a mix of words
Documents and words can be directly observed
- topics are ‘hidden’ or ‘latent’

Illustration 1

source: http://www.4four.us/wordpress/wp-content/uploads/2010/10/result.png

Illustration 2

source: http://dig-eh.org/files/2014/07/circ_disciplines-700x386.png

Assumptions

Two documents with the same topics will have overlap in words
- not literally true, but probabilistically true
Single document can consist of many topics
- but to different degrees
Three elements: words, topics, documents
- topics are formed by a selection of words
- documents are formed by a selection of topics
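
A minimal sketch of the three elements, using made-up numbers: the topic names, words and probabilities below are illustrative assumptions, not the output of any model. It shows how a document's word probabilities follow from its mix of topics.

```python
# Hypothetical topics: each is a probability distribution over words.
topics = {
    "war": {"soldier": 0.5, "front": 0.4, "treaty": 0.1},
    "politics": {"treaty": 0.6, "minister": 0.4},
}

# Hypothetical document: a probability distribution over topics.
doc_topics = {"war": 0.75, "politics": 0.25}


def word_probability(word, doc_topics, topics):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )


vocabulary = sorted({w for dist in topics.values() for w in dist})
doc_words = {w: word_probability(w, doc_topics, topics) for w in vocabulary}
for word, prob in doc_words.items():
    print(f"{word}: {prob:.3f}")
```

A word such as "treaty" gets probability from both topics, which is why two documents with the same topics tend to overlap in words without being identical.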

Statistical Modeling

Given a collection of documents (text or otherwise), the modeling process does two things:
1. create word probability distribution for topics
2. create topic probability distribution for documents
Both are purely based on frequency and co-occurrence of words

Topic Modelling Software

R LDA package and LDAvis for visual exploration
Mallet:
- popular in Digital Humanities community
- also does classification, information extraction
topic-modeling-tool: GUI for Mallet, TMT with RegEx
Other options:
- Gensim (Python library)
- Stanford Topic Modeling Toolbox

Suggestive Patterns

Overcomes problems of keyword search
- search with whole dictionary
- but words weighted by topical importance
Gives insight into the topical nature of a collection
Advantages:
- Great for finding suggestive patterns
Disadvantages:
- Topics can be hard to interpret

Topic Modelling in Digital Humanities

Extremely popular, especially in Historical Sciences

KB Newspaper Archive

Over 80 million articles
- organised via formal metadata
- date, newspaper title, article type
How is it organised in terms of topics?
Sampled collection:
- 100,000 articles matching: groote oorlog OR wereldoorlog OR europeesche oorlog OR 1914-1918
- Constrained to period 1918-1940

Topics in Newspapers

News articles on World War One, published during the Interbellum
- timeline with three topics:
- socialism, neutrality, secret documents

Interpreting Topics

Two Parts of Technicalities

Lots of statistics
- we’ll only scratch the surface!
Lots of transformations
- most of these steps are easy to understand
- but also important to understand
- need to be aware of them to gain control
Do experiments to get a feel for what’s meaningful

Transformations

Four major steps:
1. Text > preprocessing > Words
2. Words > indexing > Numbers
3. Numbers > modeling > Topics
4. Topics > analysing > Compositions
We’ll encounter specific transformations on the way
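
A minimal sketch of steps 1 and 2, assuming a very simple tokeniser (lowercase, keep runs of letters) and a made-up sample sentence. Real preprocessing of OCR'd newspaper text would need more care.

```python
import re

text = "De oorlog is voorbij. De vrede is getekend."  # hypothetical sample

# Step 1: preprocessing, text -> words (lowercase, keep runs of letters).
words = re.findall(r"[^\W\d_]+", text.lower())

# Step 2: indexing, words -> numbers (each distinct word gets an integer id).
vocabulary = {}
for word in words:
    vocabulary.setdefault(word, len(vocabulary))
word_ids = [vocabulary[word] for word in words]

print(words)
print(word_ids)
```

Repeated words map to the same id, so the model in step 3 only ever sees numbers.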

1. Preprocessing Text

What do the newspaper articles look like?
- scanned page
- after OCR

2. Indexing Text

From words to vectors to indexes

2a. Text as Vectors

A text is a linear sequence of words
It can be represented as
- text
- a list of words
or as a vector of words
- one position per vocabulary word
‘Easy’ to see which texts have overlap in words
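
A minimal sketch of texts as count vectors over a shared vocabulary. The two token lists are made up for illustration.

```python
from collections import Counter

# Hypothetical mini-collection of two tokenised texts.
text_a = ["war", "peace", "treaty", "war"]
text_b = ["peace", "treaty", "minister"]

# A shared vocabulary fixes the meaning of each vector position.
vocabulary = sorted(set(text_a) | set(text_b))


def to_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vec_a = to_vector(text_a, vocabulary)
vec_b = to_vector(text_b, vocabulary)

# Words present in both texts: positions where both counts are non-zero.
overlap = [w for w, a, b in zip(vocabulary, vec_a, vec_b) if a and b]
print(vocabulary, vec_a, vec_b, overlap)
```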

2b. Inverted Index

Term-document index:
- each word lists the texts it appears in
- index becomes rows of text vectors
- inverted index
- interesting aside: search engines use this for quick lookup!
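
A minimal sketch of an inverted index built from a few made-up texts. The document ids and the `lookup` helper are hypothetical names chosen for this example.

```python
from collections import defaultdict

# Hypothetical tokenised texts, keyed by a document id.
texts = {
    "doc1": ["war", "peace", "treaty"],
    "doc2": ["peace", "minister"],
    "doc3": ["war", "front"],
}

# Inverted index: for each word, the set of documents it appears in.
inverted_index = defaultdict(set)
for doc_id, tokens in texts.items():
    for token in tokens:
        inverted_index[token].add(doc_id)


def lookup(word):
    """Return the sorted ids of documents containing the word."""
    return sorted(inverted_index.get(word, set()))


print(lookup("war"))
print(lookup("peace"))
```

Looking up a word is a single dictionary access, independent of how many texts the collection contains.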

Parsing & Units of Data

Usually words as units
- units can be anything, but each unit must occur frequently enough to be useful as a feature
- trigrams and longer phrases often too sparse
Bag of words:
- ignores word order, syntax, sentence or paragraph boundaries
- same with other kinds of data (colours, objects, melodies)
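
A minimal sketch of why longer units get sparse, using a made-up token sequence. It counts, for unigrams, bigrams and trigrams, how many distinct units there are and how many of them occur more than once.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical token sequence.
tokens = "the war ended and the war began and the peace held".split()

for n in (1, 2, 3):
    counts = Counter(ngrams(tokens, n))
    repeated = sum(1 for c in counts.values() if c > 1)
    print(n, len(counts), repeated)
```

In this toy sequence no trigram repeats, so trigrams would give the model nothing to count.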

Stopwords

Function words and other frequent words carry little meaning in modelled topics
- dominate the inverted index
- remove them to focus on meaningful terms
But which words are stopwords?
- standard list: Snowball Dutch stopword list
- domain dependent: make your own
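
A minimal sketch of stopword removal. The `STOPWORDS` set is a small illustrative selection of Dutch function words, not the full Snowball list, and the sentence is made up.

```python
# Small illustrative set of Dutch function words; a real stopword list
# is much longer and may need domain-specific additions.
STOPWORDS = {"de", "het", "een", "van", "en", "in", "is"}

# Hypothetical tokenised sentence.
tokens = ["de", "oorlog", "is", "het", "einde", "van", "een", "tijdperk"]

content_words = [t for t in tokens if t not in STOPWORDS]
print(content_words)
```

Making your own list amounts to adding domain-specific frequent words to the set.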

3. Modeling

Topics are latent
- Reduce high-dimensional term vector space to low-dimensional 'latent' topic space
- Topics represented by prob. dist. over words
- Texts represented by prob. dist. over topics
Established models:
- LSA: Latent Semantic Analysis
- LDA: Latent Dirichlet Allocation

Semantic Relatedness

Two words co-occurring in a text
- signal that they are related
- document frequency determines strength of signal
- co-occurrence index
Easy to see which words are related
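
A minimal sketch of a co-occurrence index over made-up texts: for each pair of words, it counts the number of texts that contain both.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tokenised texts.
texts = [
    ["war", "front", "soldier"],
    ["war", "front", "treaty"],
    ["treaty", "minister"],
]

# For each pair of distinct words, count the texts containing both.
cooccurrence = Counter()
for tokens in texts:
    for pair in combinations(sorted(set(tokens)), 2):
        cooccurrence[pair] += 1

print(cooccurrence.most_common(3))
```

Pairs are stored in alphabetical order, so ("front", "war") and ("war", "front") share one counter.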

Frequency vs. Importance

How can statistics help identify important words?
TF * IDF indicates importance of term relative to the document
TF: Term Frequency
- terms occurring more frequently in a document are more important
IDF: Inverse Document Frequency
- terms in fewer documents are more specific
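
A minimal sketch of TF * IDF on made-up texts, assuming one common variant: raw term count times log(N / df). Other weightings exist, and the `tf_idf` helper is a hypothetical name for this example.

```python
import math
from collections import Counter

# Hypothetical tokenised texts.
texts = [
    ["war", "war", "front"],
    ["war", "treaty"],
    ["war", "minister", "treaty"],
]

# Document frequency: number of texts each term appears in.
doc_freq = Counter(term for tokens in texts for term in set(tokens))


def tf_idf(term, tokens):
    """Raw term count times log(N / df); assumes the term is in the collection."""
    tf = tokens.count(term)
    idf = math.log(len(texts) / doc_freq[term])
    return tf * idf


for term in ("war", "front"):
    print(term, round(tf_idf(term, texts[0]), 3))
```

"war" is the most frequent term in the first text but appears in every text, so its IDF is zero. "front" appears only once but is specific to that text, so it scores higher.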

4. Analysing Compositions

LDAvis is a wonderful tool to visually explore generated topics
- tutorial for visualising topics in film reviews
Topic Modeling Tool gives handy output for analysing compositions

Topic Modelling Summary

Very interesting technique for humanities
- suggestive patterns good for interpretation
- but easy to see connections that aren’t there
- many examples of do’s and don’ts
- inexact science, but there are best practices

Wrap Up

Topic modelling:
- data-driven, less biased by own knowledge/experience
- derives meaning from the statistical structure of ‘unstructured data’
Assignment: topic modelling in Tijdschrift voor Geschiedenis