Modeling Topics in Historic Newspapers

Ghent Center for Digital Humanities, Ghent, Belgium, 25/02/2015

Instructor: Marijn Koolen

Slide presentation

Exercises

Exercise 1: Using Mallet

In this exercise, you'll use Mallet from the command line to import some newspaper articles into the appropriate format and create a simple topic model.

Steps:

  1. Create a new folder on your hard drive within the directory where you installed Mallet. Call this folder mallet_ghent (if you give it another name, replace mallet_ghent in the rest of the instructions with your chosen directory name). This is where all the data for this workshop will be stored.
  2. Download the following two files to your computer and save them in the mallet_ghent folder: Hervorming.100.articles.import and Hervorming.1000.articles.import. These are the top 100 and top 1000 articles returned by searching Delpher using the query ((social* OR sociaal*) AND hervorming*), restricted to the period 1800-1900.
  3. Open a terminal/console/command prompt and change directory (with cd) to the directory where you stored Mallet.
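    • check that you are in the right folder by typing bin/mallet (or bin\mallet for Windows users); Mallet should respond with a list of its commands
    • the commands below use the Windows form; on Linux or macOS, replace each backslash with a forward slash
  4. Type the following command to transform the file you just downloaded into a new file with Mallet-prepared data: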
    bin\mallet import-file --input mallet_ghent\Hervorming.100.articles.import --output mallet_ghent\Hervorming.100.articles.mallet --keep-sequence
  5. Now you can train your first topic model. To do that type the following:
    bin\mallet train-topics --input mallet_ghent\Hervorming.100.articles.mallet --output-model mallet_ghent\Hervorming.100.articles.model
  6. You'll see the topics printed on screen. Many of them consist mostly of stopwords, which is not very useful. Let's clean up the data before training.
  7. Download the following two stopword files and save them in the mallet_ghent folder: stopwords.dutch.snowball.txt stopwords.dutch.extended.txt
  8. To re-train the models without stopwords we need to re-import first. Type the following two commands in order:
    bin\mallet import-file --input mallet_ghent\Hervorming.100.articles.import --output mallet_ghent\Hervorming.100.articles.mallet --keep-sequence --stoplist-file mallet_ghent\stopwords.dutch.snowball.txt
    bin\mallet train-topics --input mallet_ghent\Hervorming.100.articles.mallet --output-model mallet_ghent\Hervorming.100.articles.model
  9. This is looking better, but there are still lots of single-character words in our topics. The Snowball stopword list removes only a limited set of words. Now try the extended list *stopwords.dutch.extended.txt*.
  10. As before, re-import with the new stoplist first and then re-train. Type the following two commands in order:
    bin\mallet import-file --input mallet_ghent\Hervorming.100.articles.import --output mallet_ghent\Hervorming.100.articles.mallet --keep-sequence --stoplist-file mallet_ghent\stopwords.dutch.extended.txt
    bin\mallet train-topics --input mallet_ghent\Hervorming.100.articles.mallet --output-model mallet_ghent\Hervorming.100.articles.model
  11. This is starting to look useful. Let's save the key topic words so we can compare them across different parameter settings:
    bin\mallet train-topics --input mallet_ghent\Hervorming.100.articles.mallet --output-model mallet_ghent\Hervorming.100.articles.model --output-topic-keys mallet_ghent\Hervorming.100.articles.topics.txt
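
The commands in this exercise differ only in the corpus name and the optional stoplist, so they can be generated rather than retyped. The following is a minimal Python sketch and not part of the workshop materials. The helper name mallet_commands is hypothetical, and it uses only the Mallet options that appear in the steps above. It prints the commands instead of running them, so you can paste them into your terminal.

```python
import os


def mallet_commands(corpus, folder="mallet_ghent", stoplist=None, num_topics=None):
    """Build the import-file and train-topics commands used in this exercise.

    Hypothetical helper: it only assembles argument lists from the options
    shown in the steps above and does not call Mallet itself.
    """
    # os.path.join gives bin\mallet on Windows and bin/mallet elsewhere.
    mallet = os.path.join("bin", "mallet")
    base = os.path.join(folder, corpus)

    import_cmd = [mallet, "import-file",
                  "--input", base + ".import",
                  "--output", base + ".mallet",
                  "--keep-sequence"]
    if stoplist is not None:
        import_cmd += ["--stoplist-file", os.path.join(folder, stoplist)]

    train_cmd = [mallet, "train-topics",
                 "--input", base + ".mallet",
                 "--output-model", base + ".model",
                 "--output-topic-keys", base + ".topics.txt"]
    if num_topics is not None:
        train_cmd += ["--num-topics", str(num_topics)]

    return import_cmd, train_cmd


if __name__ == "__main__":
    # Reproduces the two commands of step 10 for the 100-article file.
    for cmd in mallet_commands("Hervorming.100.articles",
                               stoplist="stopwords.dutch.extended.txt"):
        print(" ".join(cmd))
```

Run the script from the directory where you installed Mallet, so that the relative paths match the ones used in the steps.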

Exercise 2: Comparing Models

In this exercise you'll compare topic models generated from different numbers of articles and with different numbers of topics. The point here is to get a feel for how these factors affect the models.

Steps:

  1. In the previous exercise you created a topic model using 100 newspaper articles. Now make another one with 1000 articles:
      bin\mallet import-file --input mallet_ghent\Hervorming.1000.articles.import --output mallet_ghent\Hervorming.1000.articles.mallet --keep-sequence --stoplist-file mallet_ghent\stopwords.dutch.extended.txt
      bin\mallet train-topics --input mallet_ghent\Hervorming.1000.articles.mallet --output-model mallet_ghent\Hervorming.1000.articles.model --output-topic-keys mallet_ghent\Hervorming.1000.articles.topics.txt
  2. Open both topic files (mallet_ghent\Hervorming.100.articles.topics.txt and mallet_ghent\Hervorming.1000.articles.topics.txt) in a text editor and compare them. They contain different topics. Look for coherent, interpretable or meaningful topics. Which model is better: the one generated from 100 articles or the one from 1000 articles?
    • NOTE: Comparing two sets of topics to determine which is better, more meaningful or more promising requires some creativity and flexibility on your part. Maybe you can't see any reason to prefer one over the other. Maybe both have topics that you can't make heads or tails of. If you do think one is better than the other, try to spell out for yourself what makes it better. Does it have more topics that are easy to interpret? Does it have more interesting or suggestive topics?
  3. Now train topics again, but this time create 100 topics:
      bin\mallet train-topics --input mallet_ghent\Hervorming.1000.articles.mallet --output-model mallet_ghent\Hervorming.1000.articles.model --output-topic-keys mallet_ghent\Hervorming.1000.articles.topics.100.txt --num-topics 100
  4. Open the new topic file (mallet_ghent\Hervorming.1000.articles.topics.100.txt) in a text editor. Are these 100 different topics or is there a lot of overlap? Are they as coherent as the ones from the earlier model with only 10 topics (Mallet's default)?
  5. Finally, download the file Hervorming.18500.articles and look at the topics. Try to come up with meaningful labels for some of the topics that look interesting. In the next exercise you can use these to visualise the occurrence of these topics over time and across newspapers. Make a note of the topic number when you label a topic, so you can look it up in the files in the next exercise.
      There is no one right way of interpreting topic models. Adding a meaningful topic label to a set of words can be difficult. Not all sets of words are equally easy to interpret. Some might make sense to you, others not at all. There is no need to label all the topics. Skip the ones you don't like or can't make sense of.
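
One way to make the overlap question in step 4 more concrete is to measure how many key words two topics share. The sketch below is not part of the workshop materials. It assumes that each line of a topic-keys file holds a topic number, a weight and the space-separated key words, separated by tabs; check your own file before relying on this. The function names, the 0.3 threshold and the sample lines are made up for illustration.

```python
from itertools import combinations


def parse_topic_keys(lines):
    """Return {topic number: set of key words}.

    Assumes tab-separated lines of the form: number, weight, key words.
    """
    topics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        number, _weight, words = line.split("\t", 2)
        topics[int(number)] = set(words.split())
    return topics


def jaccard(a, b):
    """Share of words two topics have in common (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_topics(topics, threshold=0.3):
    """List topic pairs whose key-word overlap is at least `threshold`."""
    pairs = []
    for (i, words_i), (j, words_j) in combinations(sorted(topics.items()), 2):
        score = jaccard(words_i, words_j)
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


if __name__ == "__main__":
    # Made-up sample lines, not real workshop output.
    sample = [
        "0\t0.5\thervorming sociale wet kamer regeering",
        "1\t0.5\thervorming sociale wet arbeiders loon",
        "2\t0.5\tkerk paus bisschop geloof rome",
    ]
    print(overlapping_topics(parse_topic_keys(sample)))
```

To use it on your own output, pass the lines of a topics file instead of the sample, for example by opening the file with open(path, encoding="utf-8"). A high overlap score only says that two topics share key words; whether they are really the same topic remains a matter of interpretation.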

Exercise 3: Analysing Topic Compositions

In the final exercise, you will analyse the topical composition of documents.

Steps:

  1. Re-train the topics with an additional parameter that saves the topic compositions in a separate file:
      bin\mallet train-topics --input mallet_ghent\Hervorming.1000.articles.mallet --output-model mallet_ghent\Hervorming.1000.articles.model --output-topic-keys mallet_ghent\Hervorming.1000.articles.topics.txt --output-doc-topics mallet_ghent\Hervorming.1000.articles.composition.csv
  2. Import the file *mallet_ghent\Hervorming.1000.articles.composition.csv* into a spreadsheet programme. The composition file contains the probability with which each topic contributes to each document. Each line contains the data for one document: first the document number and name, then the most strongly contributing topic and its probability, then all other topics in order of their contribution. If you sort the spreadsheet on the column containing the number of the biggest contributing topic (third column), you can see which articles focus on specific topics. You can also make a bar graph showing, for each topic, how many of the 1000 newspaper articles have it as their main topic.
  3. Go to the online folder topic_output and look at the document-topic compositions for the topics that interest you. This output is based on 18,500 articles retrieved with the social reform query mentioned above, modelled with 200 topics. Here, however, each document combines all retrieved articles from a single newspaper for a single month.
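
The sorting and counting described in step 2 can also be done with a short script instead of a spreadsheet. This sketch is not from the workshop materials. It assumes the layout described above: tab-separated lines with the document number, the document name and then topic/probability pairs in descending order of contribution, with an optional header line starting with #. Other Mallet versions may write a different layout, so inspect your file first. The function name and the sample rows are invented.

```python
from collections import Counter


def dominant_topic_counts(lines):
    """Count how often each topic is the largest contributor to a document.

    Assumes tab-separated columns: document number, document name, then
    topic/probability pairs sorted from largest to smallest contribution.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and header lines
            continue
        fields = line.split("\t")
        top_topic = int(fields[2])  # third column: biggest contributing topic
        counts[top_topic] += 1
    return counts


if __name__ == "__main__":
    # Invented sample rows, not real workshop output.
    sample = [
        "#doc\tname\ttopic\tproportion",
        "0\tarticle_a\t3\t0.61\t1\t0.22",
        "1\tarticle_b\t1\t0.55\t3\t0.30",
        "2\tarticle_c\t3\t0.48\t0\t0.27",
    ]
    # Print a simple text bar chart: one # per document.
    for topic, n in sorted(dominant_topic_counts(sample).items()):
        print(f"topic {topic}: {'#' * n} ({n})")
```

The printed bars are a rough text version of the bar graph suggested in step 2. For the real composition file, pass the lines of the file in place of the sample rows.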