Topic Modeling and Semantic Search
Marijn Koolen
Zoeken
Week 2, 8 February 2016
Overview
- Semantic Search
- Semantics in Data and Search Queries
- Challenges of Big Data
- Topic Modelling
- What are they?
- Why use them?
- How do they work?
Administrative Stuff
- Assignment is on Blackboard
- requires downloading Topic Modeling Tool
- Experiment with 2 types of search
- Questions about paper? Ideas?
- Search has changed enormously over last 10-20 years
- Geographic, Mobile
- Graph search
- More use of explicit semantics
Meaning and Interpretation
Semantic Search
- Search with explicit meaning
- Data and queries have explicit meaning
- Entity search: e.g. bicycle repair shop Amsterdam
- 40-50% of web queries are entity queries (Lin et al., WWW 2012)
- List search: e.g. US states voting on Super Tuesday
- Question-answering: What is the biggest landlocked country?
Schema.org
- Schema.org is a collaborative community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond.
- Sponsored by Bing, Google, Yahoo! and Yandex
- As of July 2015, over 31% of sites use Schema.org to mark up web pages (in a sample of 10 billion web pages)
Semantic Data
- Semantic Web
- close to Tim Berners-Lee’s original idea of the web
- Linked Data: explicit relationships among data
- URI: identifies a resource
- e.g. URL, URN (urn:isbn:0-486-27557-4)
- RDF: identifies a relationship
Semantic Web
Resource Description Framework
- RDF links data points
- provides meaning via structure
- URIs identify the resources involved
- RDF triple (subject, predicate, object):
- dbpedia:Darth_Vader dbpedia-owl:affiliation dbpedia:Sith
Semantic Queries?
- Adding semantics to data is great for understanding and filtering search results
- But what about semantics in user’s questions/queries?
- Linguistic issues
- ambiguity: jaguar, Michael Jackson
- subjectivity: good pizza restaurant
- Complex needs
- categories: countries in africa
- computational: 54 EURO in GBP
- Question-answering systems partly solve this
Queries with Structure
- Data structure is semantic
- SQL: Structured Query Language (cf. advanced search)
- Typically used in databases
UPDATE uva_staff SET contract='permanent' WHERE contract='temp'
- SPARQL: SQL for Linked Data
SPARQL Example 1/4
SELECT ?affiliation WHERE { dbr:Darth_Vader dbp:affiliation ?affiliation }
SPARQL Example 2/4
SELECT ?person WHERE { ?person dbp:affiliation dbr:Sith }
SPARQL Example 3/4
SELECT ?person ?affiliation WHERE { ?person dbp:affiliation ?affiliation }
SPARQL Example 4/4
SELECT count(?person) WHERE { ?person dbp:affiliation dbr:Sith }
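The four SPARQL queries above all do the same thing: match a triple pattern in which some positions are fixed and others are variables. A minimal sketch of that idea in Python, with an invented in-memory triple list (the dbr:/dbp: data here is illustrative, not fetched from DBpedia):

```python
# Toy triple store: (subject, predicate, object) tuples.
TRIPLES = [
    ("dbr:Darth_Vader", "dbp:affiliation", "dbr:Sith"),
    ("dbr:Palpatine",   "dbp:affiliation", "dbr:Sith"),
    ("dbr:Yoda",        "dbp:affiliation", "dbr:Jedi_Order"),
]

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts like a ?variable."""
    return [(ts, tp, to) for (ts, tp, to) in TRIPLES
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# SELECT ?affiliation WHERE { dbr:Darth_Vader dbp:affiliation ?affiliation }
print([o for (_, _, o) in match(s="dbr:Darth_Vader", p="dbp:affiliation")])

# SELECT count(?person) WHERE { ?person dbp:affiliation dbr:Sith }
print(len(match(p="dbp:affiliation", o="dbr:Sith")))
```

A real endpoint evaluates the same patterns against billions of triples, with joins across multiple patterns.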
Comparison
- What are advantages of searching with explicit meaning?
- Are there disadvantages?
Search and Big Data
- Big data sets are difficult to search through
- especially if they’re unstructured
- Difficult to get an overview/summary
- how many blogs are positive/negative about X?
- Recall is hard to measure in web search (but often also irrelevant)
- Difficult to do analysis
- classifying the web is close to impossible (remember Yahoo! Directories?)
Semantics and Big Data
- Big Data often chaotic
- explicit semantics often created manually
- sometimes (semi-)automatic: e.g. DBpedia
- problems: incomplete, incorrect, inconsistent
- Source criticism!
Inconsistency
- Terminology varies
- especially across disciplines/domains
- linking different thesauri is a huge amount of work
- inconsistency also in Wikipedia (therefore also in DBpedia)
- Semantic search suffers from inconsistency
Alternatives
- Wolfram Alpha
- questions in natural language
- is this as explicit as SPARQL?
Part 2a: Topic Models
What are topic models?
- Search has changed completely over last 10-20 years
- Geographic, Mobile
- Graph search
- More use of explicit semantics
- Many search tasks rely heavily on structure
- What about search across ‘unstructured’ data?
Topic Modelling
- Search engines look for keywords
- Much more next week
- E.g. ‘Michael Jackson’ or ‘World War I’
- How can we search for the topic ‘WWI’?
- Search engine doesn’t know about topics
- One solution: topic modelling
Statistical Structure in Language
- Full-text search often treats text as a bag of words
- What meaning is left if we turn a text into a bag of words?
- Frequency of words reveals topic
- Meaning from statistics
- semantic search from statistical perspective
- less precise, less explicit
Topic Models
- Representing topics in collection of documents
- Use statistics to find topics represented by groups of words
- Document is a mix of topics
- Topic is a mix of words
- Documents and words can be directly observed
Illustration 1
Illustration 2
Assumptions
- Two documents with the same topics will have overlap in words
- not literally true, but probabilistically true
- Single document can consist of many topics
- Three elements: words, topics, documents
- topics are formed by a selection of words
- documents are formed by a selection of topics
Statistical Modeling
- Given a collection of documents (text or otherwise), the modeling process does two things:
- create word probability distribution for topics
- create topic probability distribution for documents
- Both are purely based on frequency and co-occurrence of words
Part 2b: Why Use Topic Models?
relevance for research
Suggestive Patterns
- Overcomes problems of keyword search
- search with whole dictionary
- but words weighted by topical importance
- Gives insight in topical nature of collection
- Advantages:
- Great for finding suggestive patterns
- Disadvantages:
- Topics can be hard to interpret
Topic Modelling in Digital Humanities
- Extremely popular, especially in Historical Sciences
KB Newspaper Archive
- Over 80 million articles
- organised via formal metadata
- date, newspaper title, article type
- How organised in terms of topics?
- Sampled collection:
- 100,000 articles matching: groote oorlog OR wereldoorlog OR europeesche oorlog OR 1914-1918
- Constrained to period 1918-1940
Examples with KB Newspaper Archive
- Topics modelled by Mallet:
Topics in Newspapers
- News articles on World War One, published during the interwar period (Interbellum)
Interpreting Topics
Part 2c: How Do Topic Models Work?
the technical details
Two Parts of Technicalities
- Lots of statistics
- we’ll only scratch the surface!
- Lots of transformations
- most of these steps are easy to understand
- but also important to understand
- need to be aware of them to gain control
- Do experiments to get feel for what’s meaningful
- Four major steps:
- Text > preprocessing > Words
- Words > indexing > Numbers
- Numbers > modeling > Topics
- Topics > analysing > Compositions
- We’ll encounter specific transformations along the way
1. Preprocessing Text
- What do the newspaper articles look like?
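A minimal preprocessing sketch of the kind applied before modelling: lowercase, strip punctuation and digits, tokenise, and drop stopwords. The stopword list below is a tiny illustrative sample, not a real tool's list:

```python
import re

STOPWORDS = {"de", "het", "een", "en", "van", "the", "a", "and", "of"}

def preprocess(text):
    """Lowercase, keep only alphabetic tokens, drop stopwords and short words."""
    tokens = re.findall(r"[a-zà-ÿ]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

print(preprocess("De Groote Oorlog, 1914-1918: the war of the trenches."))
# -> ['groote', 'oorlog', 'war', 'trenches']
```

Note that dates like 1914-1918 disappear here; whether that is desirable is itself a preprocessing decision.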
2. Indexing Text
- From words to vectors to indexes
2a. Text as Vectors
- text is a linear sequence of words
- can be represented as that sequence, or as a vector of word counts
- ‘easy’ to see which texts have overlapping words
2b. Inverted Index
- Term-document index:
- word lists which texts it appears in
- index becomes rows of text vectors
- inverted index
- interesting aside: search engines use this for quick lookups!
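The term-document structure above can be built in a few lines: for each word, record the set of documents it appears in. Toy documents, for illustration:

```python
from collections import defaultdict

docs = {
    0: "the war in the trenches",
    1: "a song about the war",
    2: "a melody and a song",
}

# Inverted index: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

print(sorted(index["war"]))   # documents containing 'war' -> [0, 1]
print(sorted(index["song"]))  # -> [1, 2]
```

Intersecting these sets is how a search engine answers a multi-word query without scanning every document.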
Parsing & Units of Data
- Usually words as units
- can be anything, but units need a high enough frequency to work as features
- trigrams and longer phrases often too sparse
- Bag of words:
- ignores word order, syntax, sentence or paragraph boundaries
- same with other kinds of data (colours, objects, melodies)
Stopwords
- Function words and other frequent words carry little meaning in modelled topics
- But which words are stopwords?
3. Modeling
- Topics are latent
- Reduce high-dimensional term vector space to low-dimensional 'latent' topic space
- Topics represented by prob. dist. over words
- Texts represented by prob. dist. over topics
- Established models:
- LSA: Latent Semantic Analysis
- LDA: Latent Dirichlet Allocation
- Two words co-occurring in a text
- signal that they are related
- document frequency determines strength of signal
- co-occurrence index
- Easy to see which words are related
Frequency vs. Importance
- How can statistics help identify important words?
- TF * IDF indicates the importance of a term in a document, relative to the collection
- TF: Term Frequency
- terms that occur more frequently in a document are more important
- IDF: Inverse Document Frequency
- terms that occur in fewer documents are more specific
LDA
- LDA = Latent Dirichlet Allocation
- Diri-what?
- generative model
- iterative sampling to establish topics, word-topic dist. and topic-document dist.
- After enough iterations, the distributions stabilise
- done: topics are modelled
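For intuition, the iterative sampling can be sketched as a heavily simplified collapsed Gibbs sampler: every word token gets a topic, and we repeatedly resample each token's topic in proportion to (how often the topic occurs in its document) × (how often the word occurs in the topic). The hyperparameters alpha/beta and the toy corpus are invented for illustration; real tools like Mallet are far more refined:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=1):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    nd = [[0] * n_topics for _ in docs]           # topic counts per document
    nw = [defaultdict(int) for _ in range(n_topics)]  # word counts per topic
    nz = [0] * n_topics                           # total words per topic
    assign = []
    for di, doc in enumerate(docs):               # random initial assignment
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            nd[di][z] += 1; nw[z][w] += 1; nz[z] += 1
        assign.append(zs)
    for _ in range(n_iter):                       # resample each token's topic
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                z = assign[di][wi]
                nd[di][z] -= 1; nw[z][w] -= 1; nz[z] -= 1
                weights = [(nd[di][k] + alpha) * (nw[k][w] + beta) / (nz[k] + V * beta)
                           for k in range(n_topics)]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[di][wi] = z
                nd[di][z] += 1; nw[z][w] += 1; nz[z] += 1
    # topic-document distributions (each row sums to 1)
    theta = [[(nd[di][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for di, doc in enumerate(docs)]
    return theta, nw

docs = [["war", "trench", "war", "soldier"],
        ["song", "melody", "song", "concert"],
        ["war", "soldier", "trench", "treaty"]]
theta, nw = lda_gibbs(docs, n_topics=2)
print(theta[0])  # topic mixture of the first document
```

The two outputs match the two distributions named above: theta is the topic distribution per document, and nw holds the word counts per topic.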
4. Analysing Compositions
Topic Models in Literary Studies
Considerations 1/2
- Number of documents:
- at least 1000, preferably more
- Selection of documents:
- type dimension: which article types? which newspaper(s)?
- topical dimension: filtered by keywords?
- temporal dimension: beware language change!
- geographical dimension: beware language differences!
Considerations 2/2
- Number of topics
- depends on number of documents
- below 10,000 documents: 20-100 topics
- 10,000 and more: 100-500 topics
- Generic and domain-specific stopwords
- Units of analysis: words, n-grams, phrases
Narrative Topic Models
- Tunes & Tales project
- Modelling Oral Transmission
- Folktales and Songs
- Folksong families (stemmata, similarity)
- Topic models to categorise tales according to narrative elements (Propp)
Familiarity with Corpus
- understanding a topic requires understanding the corpus
- modelling topics in British history texts is hard if you know little about British history
- inexact science: checking usefulness of topics requires corpus knowledge and creativity
Lies, Damned Lies and Statistics
- Many pretty pictures based on topic modelling
- Word distributions often seem incoherent
- yet most informative topics often perform badly
- limited use as evidence, great for discovery (Ramsay)
- example of labelling topics
Guidelines
- Corpus size: >1000 documents
- Number of topics:
- 20-50 for small corpora (<10k docs), 50-200 for medium (<100k docs), 200-500 for larger
- no clear criteria to determine number of topics
- models: LSI, pLSI, LDA, pLSI-LDA, ...
- Most used is LDA
- can generalise to unseen documents
Other Considerations
- Preprocessing
- removal of stopwords, hapaxes (efficiency), punctuation
- Document lengths
- very large texts have many topics?
- large texts can be chunked
- docs of equal length help comparison
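Chunking long texts into equal-length documents is a one-liner; the chunk size (500 words here) is an arbitrary illustrative value and itself a modelling choice:

```python
def chunk(words, size=500):
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

pieces = chunk(["w%d" % i for i in range(1200)], size=500)
print([len(p) for p in pieces])  # -> [500, 500, 200]
```

Each chunk is then fed to the modeller as a separate document, which keeps document lengths comparable.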
Mixing Languages
- E.g. non-English texts in mostly English corpus
- models language instead of topics
- Three topics modelled on 64,000 song lyrics:
- baby like come oh yeah let know gonna m go never get one na re hey love ll wanna man
- get like baby know let go ll got gonna love back girl feel away want oh gotta time take hey
- na que de y like la m get el te re tu en mi ang yo un ya sa es
Beyond Text
- In fact, you can use other data points than words
"This is a case where I'm really being saved by the restrictive feature space of data. If I were interpreting these MALLET results as text, I might notice it, for example, but start to tell a just-so story about how transatlantic shipping and Pacific whaling really are connected. (Which they are; but so is everything else.) The absurdity of doing that with geographic data like this is pretty clear; but interpretive leaps are extraordinarily easy to make with texts."
Sub-Topic Modelling
- Tangherlini & Leonard use sub-topic modelling (STM)
- Use sub-corpus topics to 'trawl' in larger corpus
- generate topics on sub-corpus
- score docs in larger corpus
- more focus and control on topics
Experimenting with STM
- Tangherlini & Leonard (2013) ran 3 experiments
- model topics on Darwin's books, look for topics in Danish literature
- model topics on a genre subset, look for unknown works of that genre
- model topics on folklore, look for influences in other literature
Control over Topics
- How can you control topic models?
- sub-topic modelling
- manual vocabulary
- filtering topics
- Should you control topic models?
Topic Modelling Summary
- Very interesting technique for humanities
- suggestive patterns good for interpretation
- but easy to see connections that aren’t there
- many examples of do’s and don’ts
- inexact science, but there are best practices
Next
- 12/02: Deadline assignment 2 (see Blackboard)
- 15/02: Search Engines