Topic Modeling and Semantic Search
Marijn Koolen
Zoeken
Week 2, 8 February 2016
Overview
- Semantic Search
- Semantics in Data and Search Queries
- Challenges of Big Data
- Topic Modelling
- What are they?
- Why use them?
- How do they work?
Administrative Stuff
- Assignment is on Blackboard
- requires downloading Topic Modeling Tool
- Experiment with 2 types of search
- Questions about paper? Ideas?
- Search has changed enormously over last 10-20 years
- Geographic, Mobile
- Graph search
- More use of explicit semantics
Meaning and Interpretation
Semantic Search
- Search with explicit meaning
- Data and queries have explicit meaning
- Entity search: e.g. bicycle repair shop Amsterdam
- 40-50% of web queries are entity queries (Lin et al., WWW 2012)
- List search: e.g. US states voting on Super Tuesday
- Question-answering: What is the biggest landlocked country?
Schema.org
- Schema.org is a collaborative community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.
- Sponsored by Bing, Google, Yahoo! and Yandex
- As of July 2015, over 31% of sites use Schema.org to mark up web pages (in a sample of 10 billion web pages)
Semantic Data
- Semantic Web
- close to Tim Berners-Lee’s original idea of the web
- Linked Data: explicit relationships among data
- URI: identifies a resource
- e.g. URL, URN (urn:isbn:0-486-27557-4)
- RDF: identifies a relationship
Semantic Web
Resource Description Framework
- RDF links data points
- provides meaning via structure
- URIs identify the resources involved
- RDF triple (subject, predicate, object):
- dbpedia:Darth_Vader dbpedia-owl:affiliation dbpedia:Sith
Semantic Queries?
- Adding semantics to data is great for understanding and filtering search results
- But what about semantics in user’s questions/queries?
- Linguistic issues
- ambiguity: jaguar, Michael Jackson
- subjectivity: good pizza restaurant
- Complex needs
- categories: countries in africa
- computational: 54 EURO in GBP
- Question-answering systems partly solve this
Queries with Structure
- Data structure is semantic
- SQL: Structured Query Language (cf. advanced search)
- Typically used in databases
UPDATE uva_staff SET contract='permanent' WHERE contract='temp'
- SPARQL: SQL for Linked Data
SPARQL Example 1/4
SELECT ?affiliation WHERE { dbr:Darth_Vader dbp:affiliation ?affiliation }
SPARQL Example 2/4
SELECT ?person WHERE { ?person dbp:affiliation dbr:Sith }
SPARQL Example 3/4
SELECT ?person ?affiliation WHERE { ?person dbp:affiliation ?affiliation }
SPARQL Example 4/4
SELECT count(?person) WHERE { ?person dbp:affiliation dbr:Sith }
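The four SPARQL queries above all do the same thing: match a triple pattern in which some positions are fixed and others are variables. A minimal sketch of that idea in Python, with an invented in-memory triple list (the dbr:/dbp: data here is illustrative, not fetched from DBpedia):

```python
# Toy triple store: (subject, predicate, object) tuples.
TRIPLES = [
    ("dbr:Darth_Vader", "dbp:affiliation", "dbr:Sith"),
    ("dbr:Palpatine",   "dbp:affiliation", "dbr:Sith"),
    ("dbr:Yoda",        "dbp:affiliation", "dbr:Jedi_Order"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts like a ?variable."""
    return [(ts, tp, to) for (ts, tp, to) in TRIPLES
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# SELECT ?affiliation WHERE { dbr:Darth_Vader dbp:affiliation ?affiliation }
print([o for (_, _, o) in match(s="dbr:Darth_Vader", p="dbp:affiliation")])

# SELECT count(?person) WHERE { ?person dbp:affiliation dbr:Sith }
print(len(match(p="dbp:affiliation", o="dbr:Sith")))
```

A real endpoint evaluates the same patterns against billions of triples, with joins across multiple patterns.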
Comparison
- What are advantages of searching with explicit meaning?
- Are there disadvantages?
Search and Big Data
- Big data sets are difficult to search through
- especially if they’re unstructured
- Difficult to get an overview/summary
- how many blogs are positive/negative about X?
- Recall is hard to measure in web search (but often also irrelevant)
- Difficult to do analysis
- classifying the web is close to impossible (remember Yahoo! Directories?)
Semantics and Big Data
- Big Data often chaotic
- explicit semantics often created manually
- sometimes (semi-)automatic: e.g. DBpedia
- problems: incomplete, incorrect, inconsistent
- Source criticism!
Inconsistency
- Terminology varies
- especially across disciplines/domains
- linking different thesauri is a huge amount of work
- inconsistency also in Wikipedia (therefore also in DBpedia)
- Semantic search suffers from inconsistency
Alternatives
- Wolfram Alpha
- questions in natural language
- is this as explicit as SPARQL?
Part 2a: Topic Models
What are topic models?
- Search has changed completely over last 10-20 years
- Geographic, Mobile
- Graph search
- More use of explicit semantics
- Many search tasks rely heavily on structure
- What about search across ‘unstructured’ data?
Topic Modelling
- Search engines look for keywords
- Much more next week
- E.g. ‘Michael Jackson’ or ‘World War I’
- How can we search for the topic ‘WWI’?
- Search engine doesn’t know about topics
- One solution: topic modelling
Statistical Structure in Language
- Full-text search often treats text as a bag of words
- What meaning is left if we turn a text into a bag of words?
- Frequency of words reveals topic
- Meaning from statistics
- semantic search from statistical perspective
- less precise, less explicit
Topic Models
- Representing topics in collection of documents
- Use statistics to find topics represented by groups of words
- Document is a mix of topics
- Topic is a mix of words
- Documents and words can be directly observed
Illustration 1
Illustration 2
Assumptions
- Two documents with the same topics will have overlap in words
- not literally true, but probabilistically true
- Single document can consist of many topics
- Three elements: words, topics, documents
- topics are formed by a selection of words
- documents are formed by a selection of topics
Statistical Modeling
- Given a collection of documents (text or otherwise), the modeling process does two things:
- create word probability distribution for topics
- create topic probability distribution for documents
- Both are purely based on frequency and co-occurrence of words
Part 2b: Why Use Topic Models?
relevance for research
Suggestive Patterns
- Overcomes problems of keyword search
- search with whole dictionary
- but words weighted by topical importance
- Gives insight in topical nature of collection
- Advantages:
- Great for finding suggestive patterns
- Disadvantages:
- Topics can be hard to interpret
Topic Modelling in Digital Humanities
- Extremely popular, especially in Historical Sciences
KB Newspaper Archive
- Over 80 million articles
- organised via formal metadata
- date, newspaper title, article type
- How organised in terms of topics?
- Sampled collection:
- 100,000 articles matching: groote oorlog OR wereldoorlog OR europeesche oorlog OR 1914-1918
- Constrained to period 1918-1940
Examples with KB Newspaper Archive
- Topics modelled by Mallet:
Topics in Newspapers
- News articles on World War One, published during the interwar period (Interbellum)
Interpreting Topics
Part 2c: How Do Topic Models Work?
the technical details
Two Parts of Technicalities
- Lots of statistics
- we’ll only scratch the surface!
- Lots of transformations
- most of these steps are easy to understand
- but also important to understand
- need to be aware of them to gain control
- Do experiments to get feel for what’s meaningful
- Four major steps:
- Text > preprocessing > Words
- Words > indexing > Numbers
- Numbers > modeling > Topics
- Topics > analysing > Compositions
- We’ll encounter specific transformations along the way
1. Preprocessing Text
- What do the newspaper articles look like?
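A minimal preprocessing sketch of the kind applied before modelling: lowercase, strip punctuation and digits, tokenise, and drop stopwords. The stopword list below is a tiny illustrative sample, not a real tool's list:

```python
import re

STOPWORDS = {"de", "het", "een", "en", "van", "the", "a", "and", "of"}

def preprocess(text):
    """Lowercase, keep only alphabetic tokens, drop stopwords and short words."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

print(preprocess("De Groote Oorlog, 1914-1918: the war of the trenches."))
# -> ['groote', 'oorlog', 'war', 'trenches']
```

Note that dates like 1914-1918 disappear here; whether that is desirable is itself a preprocessing decision.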
2. Indexing Text
- From words to vectors to indexes
2a. Text as Vectors
- text is a linear sequence of words
- can be represented as that sequence, or as a vector of word counts
- ‘easy’ to see which texts have overlapping words
2b. Inverted Index
- Term-document index:
- word lists which texts it appears in
- index becomes rows of text vectors
- inverted index
- interesting aside: search engines use this for quick lookups!
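The term-document structure above can be built in a few lines: for each word, record the set of documents it appears in. Toy documents, for illustration:

```python
from collections import defaultdict

docs = {
    0: "the war in the trenches",
    1: "a song about the war",
    2: "a melody and a song",
}

# Inverted index: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

print(sorted(index["war"]))   # documents containing 'war' -> [0, 1]
print(sorted(index["song"]))  # -> [1, 2]
```

Intersecting these sets is how a search engine answers a multi-word query without scanning every document.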
Parsing & Units of Data
- Usually words as units
- can be anything, but units need a high enough frequency to work as features
- trigrams and longer phrases often too sparse
- Bag of words:
- ignores word order, syntax, sentence or paragraph boundaries
- same with other kinds of data (colours, objects, melodies)
Stopwords
- Function words and other frequent words carry little meaning in modelled topics
- But which words are stopwords?
3. Modeling
- Topics are latent
- Reduce high-dimensional term vector space to low-dimensional 'latent' topic space
- Topics represented by prob. dist. over words
- Texts represented by prob. dist. over topics
- Established models:
- LSA: Latent Semantic Analysis
- LDA: Latent Dirichlet Allocation
- Two words co-occurring in a text
- signal that they are related
- document frequency determines strength of signal
- co-occurrence index
- Easy to see which words are related
Frequency vs. Importance
- How can statistics help identify important words?
- TF * IDF indicates the importance of a term in a document, relative to the collection
- TF: Term Frequency
- terms that occur more frequently in a document are more important
- IDF: Inverse Document Frequency
- terms that occur in fewer documents are more specific
LDA
- LDA = Latent Dirichlet Allocation
- Diri-what?
- generative model
- iterative sampling to establish topics, word-topic dist. and topic-document dist.
- After enough iterations, the distributions stabilise
- done: topics are modelled
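For intuition, the iterative sampling can be sketched as a heavily simplified collapsed Gibbs sampler: every word token gets a topic, and we repeatedly resample each token's topic in proportion to (how often the topic occurs in its document) × (how often the word occurs in the topic). The hyperparameters alpha/beta and the toy corpus are invented for illustration; real tools like Mallet are far more refined:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=1):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    nd = [[0] * n_topics for _ in docs]           # topic counts per document
    nw = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    nz = [0] * n_topics                           # total words per topic
    assign = []
    for di, doc in enumerate(docs):               # random initial assignment
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            nd[di][z] += 1; nw[z][w] += 1; nz[z] += 1
        assign.append(zs)
    for _ in range(n_iter):                       # resample each token's topic
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                z = assign[di][wi]
                nd[di][z] -= 1; nw[z][w] -= 1; nz[z] -= 1
                weights = [(nd[di][k] + alpha) * (nw[k][w] + beta) / (nz[k] + V * beta)
                           for k in range(n_topics)]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[di][wi] = z
                nd[di][z] += 1; nw[z][w] += 1; nz[z] += 1
    # topic-document distributions (each row sums to 1)
    theta = [[(nd[di][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for di, doc in enumerate(docs)]
    return theta, nw

docs = [["war", "trench", "war", "soldier"],
        ["song", "melody", "song", "concert"],
        ["war", "soldier", "trench", "treaty"]]
theta, nw = lda_gibbs(docs, n_topics=2)
print(theta[0])  # topic mixture of the first document
```

The two outputs match the two distributions named above: theta is the topic distribution per document, and nw holds the word counts per topic.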
4. Analysing Compositions
Topic Models in Literary Studies
Considerations 1/2
- Number of documents:
- at least 1000, preferably more
- Selection of documents:
- type dimension: which article types? which newspaper(s)?
- topical dimension: filtered by keywords?
- temporal dimension: beware language change!
- geographical dimension: beware language differences!
Considerations 2/2
- Number of topics
- depends on number of documents
- below 10,000 documents: 20-100 topics
- 10,000 and more: 100-500 topics
- Generic and domain-specific stopwords
- Units of analysis: words, n-grams, phrases
Narrative Topic Models
- Tunes & Tales project
- Modelling Oral Transmission
- Folktales and Songs
- Folksong families (stemmata, similarity)
- Topic models to categorise tales according to narrative elements (Propp)
Familiarity with Corpus
- understanding a topic requires understanding the corpus
- modelling topics in British history texts is hard if you know little about British history
- inexact science: checking usefulness of topics requires corpus knowledge and creativity
Lies, Damned Lies and Statistics
- Many pretty pictures based on topic modelling
- Word distributions often seem incoherent
- yet most informative topics often perform badly
- limited use as evidence, great for discovery (Ramsay)
- example of labelling topics
Guidelines
- Corpus size: >1000 documents
- Number of topics:
- 20-50 for small corpora (<10k docs), 50-200 for medium (<100k docs), 200-500 for larger
- no clear criteria to determine number of topics
- models: LSI, pLSI, LDA, pLSI-LDA, ...
- Most used is LDA
- can generalise to unseen documents
Other Considerations
- Preprocessing
- removal of stopwords, hapaxes (efficiency), punctuation
- Document lengths
- very large texts have many topics?
- large texts can be chunked
- docs of equal length help comparison
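Chunking long texts into equal-length documents is a one-liner; the chunk size (500 words here) is an arbitrary illustrative value and itself a modelling choice:

```python
def chunk(words, size=500):
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

pieces = chunk(["w%d" % i for i in range(1200)], size=500)
print([len(p) for p in pieces])  # -> [500, 500, 200]
```

Each chunk is then fed to the modeller as a separate document, which keeps document lengths comparable.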
Mixing Languages
- E.g. non-English texts in mostly English corpus
- models language instead of topics
- Three topics modelled on 64,000 song lyrics:
- baby like come oh yeah let know gonna m go never get one na re hey love ll wanna man
- get like baby know let go ll got gonna love back girl feel away want oh gotta time take hey
- na que de y like la m get el te re tu en mi ang yo un ya sa es
Beyond Text
- In fact, you can use other data points than words
"This is a case where I'm really being saved by the restrictive feature space of data. If I were interpreting these MALLET results as text, I might notice it, for example, but start to tell a just-so story about how transatlantic shipping and Pacific whaling really are connected. (Which they are; but so is everything else.) The absurdity of doing that with geographic data like this is pretty clear; but interpretive leaps are extraordinarily easy to make with texts."
Sub-Topic Modelling
- Tangherlini & Leonard use sub-topic modelling (STM)
- Use sub-corpus topics to 'trawl' in larger corpus
- generate topics on sub-corpus
- score docs in larger corpus
- more focus and control on topics
Experimenting with STM
- Tangherlini & Leonard (2013) ran 3 experiments
- model topics on Darwin's books, look for topics in Danish literature
- model topics on a genre subset, look for unknown works of that genre
- model topics on folklore, look for influences in other literature
Control over Topics
- How can you control topic models?
- sub-topic modelling
- manual vocabulary
- filtering topics
- Should you control topic models?
Topic Modelling Summary
- Very interesting technique for humanities
- suggestive patterns good for interpretation
- but easy to see connections that aren’t there
- many examples of do’s and don’ts
- inexact science, but there are best practices
Next
- 12/02: Deadline assignment 2 (see Blackboard)
- 15/02: Search Engines