Marijn Koolen

Topic Modeling with Newspaper Archives workshop

National Library of the Netherlands, The Hague, 19/01/2015

Slide presentation

Exercises

Exercise 1: Using Mallet

In this exercise, you'll use Mallet from the command line to import some newspaper articles into the appropriate format and create a simple topic model.

Steps:

Create a new folder on your hard drive within the directory where you installed Mallet. Call this folder kb_mallet (if you give it another name, replace kb_mallet in the rest of the instructions with your chosen directory name). This is where all the data for this workshop will be stored.
Download the following two files to your computer and save them in the kb_mallet folder: WOI.article.100.import and WOI.article.1000.import
Open a terminal/console/command prompt and change directory (with cd) to the directory where you stored Mallet.

Check that you are in the right folder by typing bin/mallet (or bin\mallet for Windows users). If you are in the right place, Mallet responds with a list of its commands.

Type the following command to transform the file you just downloaded into a new file with Mallet-prepared data:

bin/mallet import-file --input kb_mallet/WOI.article.100.import --output kb_mallet/WOI.article.100.mallet --keep-sequence
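If you later want to prepare your own articles for this step: by default, Mallet's import-file command reads one document per line, with a name, a label and the text separated by whitespace. The Python sketch below writes a line in that layout. The function name and the sample article are hypothetical, and it is an assumption that the workshop files follow this default layout.

```python
import re


def to_import_line(name, label, text):
    """Format one document as a single line: name, label, text."""
    # Whitespace separates the fields, so the name and label must not
    # contain any; replace inner whitespace with underscores.
    name = re.sub(r"\s+", "_", name.strip())
    label = re.sub(r"\s+", "_", label.strip())
    # The text has to stay on one line: collapse line breaks and runs of spaces.
    text = " ".join(text.split())
    return f"{name}\t{label}\t{text}"


# Hypothetical sample article, for illustration only.
print(to_import_line("article 1", "nl", "De  oorlog\nis begonnen"))
```

Writing one such line per article to a text file should give you something you can pass to --input.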

Now you can train your first topic model. To do that type the following:

bin/mallet train-topics --input kb_mallet/WOI.article.100.mallet --output-model kb_mallet/WOI.article.100.model

You'll see the topics printed on screen. Many of them consist mostly of stopwords. This is not very useful, so let's clean up the data before training.
Download the following two stopword files and save them in kb_mallet: stopwords.dutch.snowball.txt and stopwords.dutch.extended.txt

To re-train the models without stopwords we need to re-import first. Type the following two commands in order:

bin/mallet import-file --input kb_mallet/WOI.article.100.import --output kb_mallet/WOI.article.100.mallet --keep-sequence --stoplist-file kb_mallet/stopwords.dutch.snowball.txt

bin/mallet train-topics --input kb_mallet/WOI.article.100.mallet --output-model kb_mallet/WOI.article.100.model

This is looking better, but there are still lots of single-character words in our topics. The Snowball stopword list removes only a limited set of words. Now try the extended list *stopwords.dutch.extended.txt*:

As before, we need to re-import before we can re-train. Type the following two commands in order:

bin/mallet import-file --input kb_mallet/WOI.article.100.import --output kb_mallet/WOI.article.100.mallet --keep-sequence --stoplist-file kb_mallet/stopwords.dutch.extended.txt

bin/mallet train-topics --input kb_mallet/WOI.article.100.mallet --output-model kb_mallet/WOI.article.100.model

This is starting to look useful. Let's save the key topic words so we can compare them across different parameter settings:

bin/mallet train-topics --input kb_mallet/WOI.article.100.mallet --output-model kb_mallet/WOI.article.100.model --output-topic-keys kb_mallet/WOI.article.100.topics.txt
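If you prefer to inspect the saved topics with a script rather than a text editor, the following Python sketch may help. It assumes each line of the topic-keys file holds three tab-separated fields (topic number, a topic weight, and the key words separated by spaces); check this against your own file, since the layout can vary between Mallet versions. The function name and sample lines are hypothetical.

```python
def parse_topic_keys(lines):
    """Return {topic_number: (weight, [key words])} from topic-keys lines."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        number, weight, words = line.split("\t", 2)
        # Assumption: on some systems the weight may be written with a
        # decimal comma instead of a point, so accept both.
        topics[int(number)] = (float(weight.replace(",", ".")), words.split())
    return topics


# Hypothetical sample lines, for illustration only.
sample = ["0\t0.5\toorlog leger front\n", "1\t0,5\tprijs markt brood\n"]
for number, (weight, words) in parse_topic_keys(sample).items():
    print(number, weight, " ".join(words))
```

To read your own file, pass an open file object, for example open("kb_mallet/WOI.article.100.topics.txt", encoding="utf-8"), instead of the sample list.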

Exercise 2: Comparing Models

In this exercise you'll compare topic models generated from different numbers of articles and with different numbers of topics. The point here is to get a feel for how these factors affect the models.

Steps:

In the previous exercise you created a topic model using 100 newspaper articles. Now make another one with 1000 articles:

bin/mallet import-file --input kb_mallet/WOI.article.1000.import --output kb_mallet/WOI.article.1000.mallet --keep-sequence --stoplist-file kb_mallet/stopwords.dutch.extended.txt

bin/mallet train-topics --input kb_mallet/WOI.article.1000.mallet --output-model kb_mallet/WOI.article.1000.model --output-topic-keys kb_mallet/WOI.article.1000.topics.txt

Open both topic files (kb_mallet/WOI.article.100.topics.txt and kb_mallet/WOI.article.1000.topics.txt) in a text editor and compare them. They contain different topics. Look for coherent, interpretable or meaningful topics. Which is better: the model generated from 100 articles or the one from 1000 articles?

NOTE: Comparing two sets of topics to determine which set is better, more meaningful or more promising requires some creativity and flexibility on your part. Maybe you can't see any reason to prefer one over the other. Maybe they both have certain topics that you can't make head or tail of. If you do think one is better than the other, try to spell out for yourself what makes it better. Are there more topics that are easy to interpret? Are there more interesting or suggestive topics?
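A script cannot judge which topics are meaningful, but it can point you to topics in the two models that share key words, which gives you pairs worth reading side by side. The Python sketch below uses Jaccard similarity (shared words divided by all words in either topic); this is one possible measure, not something Mallet provides. The function names and sample topics are hypothetical.

```python
def jaccard(words_a, words_b):
    """Share of words two topics have in common (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find the most similar topic in topics_b."""
    matches = {}
    for id_a, words_a in topics_a.items():
        id_b = max(topics_b, key=lambda other: jaccard(words_a, topics_b[other]))
        matches[id_a] = (id_b, jaccard(words_a, topics_b[id_b]))
    return matches


# Hypothetical key words standing in for the 100-article and 1000-article models.
small = {0: ["oorlog", "leger", "front"], 1: ["prijs", "markt", "brood"]}
large = {0: ["markt", "prijs", "handel"], 1: ["leger", "front", "soldaten"]}
for id_a, (id_b, score) in best_matches(small, large).items():
    print(f"topic {id_a} in the small model ~ topic {id_b} in the large model ({score:.2f})")
```

A low best score suggests that the topic has no clear counterpart in the other model.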

Now train topics again, but this time create 100 topics:

bin/mallet train-topics --input kb_mallet/WOI.article.1000.mallet --output-model kb_mallet/WOI.article.1000.model --output-topic-keys kb_mallet/WOI.article.1000.topics.100.txt --num-topics 100
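If you want to try several topic counts without retyping the command, a small Python sketch like the one below can build the command lines for you. It only uses the options shown in this handout; the function name and the extra topic count of 50 are hypothetical choices for illustration.

```python
def train_command(num_topics, articles=1000, folder="kb_mallet"):
    """Build the train-topics command for a given number of topics."""
    base = f"{folder}/WOI.article.{articles}"
    return [
        "bin/mallet", "train-topics",
        "--input", f"{base}.mallet",
        "--output-model", f"{base}.model",
        # Put the topic count in the file name so runs do not overwrite each other.
        "--output-topic-keys", f"{base}.topics.{num_topics}.txt",
        "--num-topics", str(num_topics),
    ]


# Print the commands so you can check them before running anything.
for n in (10, 50, 100):
    print(" ".join(train_command(n)))
```

To actually run a command from Python, you could pass the list to subprocess.run(command, check=True) while in the Mallet directory. Windows users would need bin\mallet as the first element.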

Open the new topic file (kb_mallet/WOI.article.1000.topics.100.txt) in a text editor. Are these 100 different topics or is there a lot of overlap? Are these as coherent as the ones modelled with only 10 topics (Mallet's default)?
Finally, download the file WOI.article.100000 and look at the topics. Try to come up with meaningful labels for some of the topics that look interesting. In the next exercise you can use these to visualise the occurrence of these topics over time and across newspapers. Make a note of the topic number when you label them so you can look up those topics in the files in the next exercise.
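To get a rough impression of how much the 100 topics overlap, you can count in how many topics each key word appears. The Python sketch below does this for topics given as lists of key words; the function name and sample topics are hypothetical, and word sharing is only a crude indicator of overlap.

```python
from collections import Counter


def shared_words(topics, min_topics=2):
    """Return (word, topic count) pairs for words found in several topics."""
    counts = Counter()
    for words in topics.values():
        counts.update(set(words))  # count each word once per topic
    shared = [(word, n) for word, n in counts.items() if n >= min_topics]
    # Most widely shared words first, ties in alphabetical order.
    return sorted(shared, key=lambda item: (-item[1], item[0]))


# Hypothetical key words, for illustration only.
topics = {
    0: ["oorlog", "leger", "duitsche"],
    1: ["oorlog", "front", "duitsche"],
    2: ["oorlog", "markt", "prijs"],
}
for word, n in shared_words(topics):
    print(f"{word} appears in {n} topics")
```

Words that show up in many topics are candidates for your stopword list, or a sign that the number of topics is too high for this collection.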

Exercise 3: Analysing Topic Compositions

In the final exercise, you will analyse the topical composition of documents.

Steps:

Re-train the topics with an additional parameter that saves the topic compositions in a separate file:

bin/mallet train-topics --input kb_mallet/WOI.article.1000.mallet --output-model kb_mallet/WOI.article.1000.model --output-topic-keys kb_mallet/WOI.article.1000.topics.txt --output-doc-topics kb_mallet/WOI.article.1000.composition.csv

Import the file *kb_mallet/WOI.article.1000.composition.csv* into a spreadsheet programme. The composition file contains the probability with which each topic contributes to each document. Each line describes one document: first the document number and name, then the most strongly contributing topic and its probability, then all other topics in order of their contribution. If you sort the spreadsheet on the column containing the number of the biggest contributing topic (third column), you can see which articles focus on specific topics. You can make a bar graph showing for how many of the 1000 newspaper articles each topic is the main topic.
Go to this online folder topic_tables and download some of the document-topic compositions of the topics of your interest. These are csv files that you can *import* into a spreadsheet programme like Excel or Google Spreadsheet. The files consist of three columns: the name of the newspaper, the year and month in which the articles appeared, and the fraction of the text in those articles that is made up of the selected topic.
You can sort the data on newspaper and select the data of an individual newspaper to visualise the occurrence of the selected topic over time in that newspaper. You can also sort the data on date and compare the data across newspapers.
Select an appropriate graph style to visualise the selected data.