Topic Modeling and Semantic Search

Marijn Koolen

Zoeken

Week 2, 8 February 2016

Overview

  1. Semantic Search
    • Semantics in Data and Search Queries
    • Challenges of Big Data
  2. Topic Modelling
    • What are they?
    • Why use them?
    • How do they work?

Administrative Stuff

Part 1: Semantic Search

Meaning and Interpretation

Knowledge Bases

Schema.org

Semantic Data

Semantic Web

Resource Description Framework

Semantic Queries?

Difficult Information Needs

Queries with Structure

UPDATE uva_staff SET contract = 'permanent' WHERE contract = 'temp'

SPARQL Example 1/4

SELECT ?affiliation WHERE { dbr:Darth_Vader dbp:affiliation ?affiliation }

SPARQL Example 2/4

SELECT ?person WHERE { ?person dbp:affiliation dbr:Sith }

SPARQL Example 3/4

SELECT ?person ?affiliation WHERE { ?person dbp:affiliation ?affiliation }

SPARQL Example 4/4

SELECT (COUNT(?person) AS ?count) WHERE { ?person dbp:affiliation dbr:Sith }
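
The four SPARQL examples above each match a single triple pattern against a set of subject-predicate-object statements. A minimal Python sketch of that matching follows. It assumes a small hand-made triple list (illustrative only, not real DBpedia data) and a hypothetical helper `match`; an actual SPARQL endpoint does considerably more.

```python
# Illustrative (subject, predicate, object) triples; NOT real DBpedia data.
TRIPLES = [
    ("dbr:Darth_Vader", "dbp:affiliation", "dbr:Sith"),
    ("dbr:Darth_Vader", "dbp:affiliation", "dbr:Galactic_Empire"),
    ("dbr:Palpatine", "dbp:affiliation", "dbr:Sith"),
    ("dbr:Yoda", "dbp:affiliation", "dbr:Jedi"),
]


def match(pattern, triples=TRIPLES):
    """Return one dict of variable bindings per triple that fits the pattern.

    Terms starting with '?' are variables; every other term must match exactly.
    (Hypothetical helper for illustration only.)
    """
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value both times.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No break: every term of the pattern matched this triple.
            results.append(bindings)
    return results


# Example 1/4: what is Darth Vader affiliated with?
affiliations = [
    b["?affiliation"]
    for b in match(("dbr:Darth_Vader", "dbp:affiliation", "?affiliation"))
]

# Example 2/4: who is affiliated with the Sith?
sith = [b["?person"] for b in match(("?person", "dbp:affiliation", "dbr:Sith"))]

# Example 3/4: every person-affiliation pair.
pairs = match(("?person", "dbp:affiliation", "?affiliation"))

# Example 4/4: how many are affiliated with the Sith?
sith_count = len(sith)
```

Fixed terms act as filters and variables act as output columns, which is why the same pattern can answer "what is X affiliated with?" and "who is affiliated with Y?".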

Comparison

Search and Big Data

Semantics and Big Data

Inconsistency

Alternatives

Part 2a: Topic Models

What are topic models?

Topic Modelling

Statistical Structure in Language

Topic Models

Illustration 1

Illustration 2

Assumptions

Statistical Modeling

Topic Modelling Software

Part 2b: Why Use Topic Models?

Relevance for research

Suggestive Patterns

Topic Modelling in Digital Humanities

KB Newspaper Archive

Examples with KB Newspaper Archive

Topics in Newspapers

Interpreting Topics

Part 2c: How Do Topic Models Work?

The technical details

Two Parts of Technicalities

Transformations

1. Preprocessing Text

2. Indexing Text

2a. Text as Vectors
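
One common way to treat text as vectors is the bag-of-words representation: each document becomes a list of term counts over a shared vocabulary. The sketch below is a minimal illustration under assumptions: the toy documents, the whitespace tokenizer and the helper names `tokenize` and `to_vector` are all made up for this example.

```python
from collections import Counter

# Toy documents, invented for illustration.
docs = ["the cat sat on the mat", "the dog sat on the log"]


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# The vocabulary fixes the meaning of each vector position.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})


def to_vector(text):
    """Map a text to its term-count vector over the shared vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


vectors = [to_vector(doc) for doc in docs]
```

Word order is discarded: only how often each vocabulary term occurs in a document survives.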

2b. Inverted Index
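
An inverted index maps each term to the set of documents that contain it, so a query only touches the postings of its own terms. A minimal sketch, assuming toy documents and hypothetical helpers `build_inverted_index` and `search` (boolean AND only, no ranking):

```python
from collections import defaultdict

# Toy collection keyed by document id, invented for illustration.
docs = {1: "the cat sat on the mat", 2: "the dog sat on the log"}


def build_inverted_index(documents):
    """Map every term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)


def search(query):
    """Return sorted ids of documents containing ALL query terms."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

For example, `search("cat sat")` intersects the postings for "cat" and "sat" rather than scanning every document.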

Parsing & Units of Data

Stopwords
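
Stopword removal drops very frequent function words before indexing or modelling, since they would otherwise dominate the counts. A minimal sketch, assuming a tiny hand-picked stopword list (real lists are much longer and language-specific) and a hypothetical helper `remove_stopwords`:

```python
# Tiny illustrative stopword list; an assumption, not a standard list.
STOPWORDS = {"the", "on", "a", "an", "of", "and", "in", "to", "is"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop tokens in the stopword list."""
    return [tok for tok in text.lower().split() if tok not in stopwords]


content_words = remove_stopwords("The cat sat on the mat")
```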

3. Modeling

Semantic Relatedness

Frequency vs. Importance
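
A frequent word is not necessarily an important one. One widely used weighting that captures this is tf-idf: a term's count in a document multiplied by the logarithm of how rare the term is across the collection. Whether the lecture uses exactly this formula is an assumption; the sketch below uses the plain variant tf × log(N / df) on toy, already tokenized documents.

```python
import math
from collections import Counter

# Toy tokenized documents, invented for illustration.
docs = [
    ["the", "ship", "sailed", "the", "sea"],
    ["the", "whale", "swam", "the", "sea"],
    ["the", "train", "left", "the", "station"],
]


def tf_idf(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


weights = tf_idf(docs)
```

Here "the" is the most frequent word in every document but receives weight 0 because it occurs in all of them, while "ship", which occurs in only one document, outweighs "sea", which occurs in two.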

LDA
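
LDA (Latent Dirichlet Allocation) treats each document as a mixture of topics and each topic as a distribution over words. The sketch below fits it with collapsed Gibbs sampling, one standard inference method. It is a teaching sketch under assumptions, not what MALLET or any other package does internally: the function name `lda_gibbs`, the toy corpus and the hyperparameter values are all illustrative.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (theta, topic_word): per-document topic proportions and
    per-topic word counts. alpha and beta are symmetric Dirichlet priors.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                          # tokens per topic
    assignments = []                                        # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: (topic's share in doc) * (word's share in topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word


# Toy corpus, invented for illustration.
toy_docs = [
    ["ship", "sea", "whale", "sea", "ship"],
    ["whale", "ship", "sea", "harbour"],
    ["train", "station", "rail", "train"],
    ["rail", "station", "train", "ticket"],
]
theta, topic_word = lda_gibbs(toy_docs, num_topics=2)
```

Each row of `theta` is one document's composition over the topics, and the highest counts in each `topic_word` dict give the words usually listed to describe a topic. Results depend on the random seed and the hyperparameters, so different runs can yield different topics.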

4. Analysing Compositions

Topic Models in Literary Studies

Considerations 1/2

Considerations 2/2

Narrative Topic Models

Familiarity with Corpus

Lies, Damned Lies and Statistics

Guidelines

Other Considerations

Mixing Languages

Beyond Text

"This is a case where I'm really being saved by the restrictive feature space of data. If I were interpreting these MALLET results as text, I might notice it, for example, but start to tell a just-so story about how transatlantic shipping and Pacific whaling really are connected. (Which they are; but so is everything else.) The absurdity of doing that with geographic data like this is pretty clear; but interpretive leaps are extraordinarily easy to make with texts."

Sub-Topic Modelling

Experimenting with STM

Control over Topics

Topic Modelling Summary

Next