Unsupervised ML: Topic Models for Classification

We want to know what specific issue areas, or TOPICS, our judgments cover. "Topics" can mean different things here. Think about any text: depending on the level of abstraction, a text can have different "topics".

A judicial decision, depending on the level of abstraction, can be about aboriginal rights, about the consideration of aboriginal concerns in a specific context such as sentencing, or about the factual circumstances of the case.

We can use these levels of abstraction purposefully. When we are interested in abstract content, we can choose a small number of topics that allows us to classify documents into content groups: this is a decision about aboriginal rights, this is a decision about immigration, and so on.

When we are interested in more specific content, we can choose a higher number of topics that allows us to see which issues a document talks about: 60% of case A concerns the facts relating to burglaries, 20% relates to the criminal code and 20% relates to sentencing. Hence we can use topic models purposefully for different content analysis tasks.

Here, we focus on the first task: classification. All our cases have already been classified manually. We now want to know whether a topic modelling algorithm can guess these topics correctly. 

We begin by installing and loading the packages topicmodels and plyr.


# Install the packages once if needed: install.packages(c("topicmodels", "plyr"))

# Load packages.

library(topicmodels)
library(plyr)

Topic models are unsupervised machine learning algorithms. The only information we have to give to the computer is the number of topics we suspect are in a set of documents. Based on the distribution of words in our documents, the computer then makes a statistically informed "guess" at what these topics are.

The algorithm will not return a label for each topic, but it will return a list of words most associated with each topic. By reading this list of words, we can assign labels to each topic (and check whether the grouping is sensible).

In light of the above considerations, a LOW number of topics helps create sensible categories, while a HIGH number of topics will provide more specific categories.

Let’s first try to classify our decisions into 3 baskets.

# So we set the number of topics to 3.

k <- 3

# We can run our topic model on our dtm with 3 topics.

topic_model <- LDA(dtm, k)

We now want to see the top-10 words associated with each topic to assign labels to each of the topics.

# We create a new object terms containing the top 10 words for each topic.
terms <- terms(topic_model, 10)
 
terms
 ##
      Topic 1    Topic 2       Topic 3
 [1,] "health"   "immigration" "aboriginal"
 [2,] "canada"   "officer"     "court"
 [3,] "court"    "applicant"   "canada"
 [4,] "officer"  "canada"      "federal"
 [5,] "safety"   "applicants"  "rights"
 [6,] "nthe"     "decision"    "plaintiffs"
 [7,] "medical"  "application" "applicants"
 [8,] "services" "minister"    "general"
 [9,] "federal"  "citizenship" "attorney"
[10,] "apotex"   "evidence"    "minister"

Note: Your output may look different, since topic modelling is based on a probabilistic algorithm. But if you run it multiple times, it should produce results that allow you to classify texts into three themes.
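If you want your runs to be reproducible, one option is to fix the random seed when fitting the model. This is only a sketch: the seed value 1234 is arbitrary, and it assumes the default VEM estimation method of topicmodels.

# Optional: fix the seed so that repeated runs return the same topics.
# The value 1234 is an arbitrary choice.

topic_model <- LDA(dtm, k, control = list(seed = 1234))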

In our case, the three topics emerge clearly from the most frequent words. Our judgments deal with 1) health, 2) immigration and 3) aboriginal concerns.

We may thus want to assign these names as column headers to our terms.

# Consult the 10 most distinctive words for each topic. Based on these words, assign [1] health, [2] immigration, [3] aboriginal to the corresponding topic. # !!! IMPORTANT: ADAPT THIS ORDERING TO YOUR RESULTS !!!

topic_label <- c("health","immigration","aboriginal") 
colnames(terms) <- topic_label
 
terms
 ##
      health     immigration   aboriginal
 [1,] "health"   "immigration" "aboriginal"
 [2,] "canada"   "officer"     "court"
 [3,] "court"    "applicant"   "canada"
 [4,] "officer"  "canada"      "federal"
 [5,] "safety"   "applicants"  "rights"
 [6,] "nthe"     "decision"    "plaintiffs"
 [7,] "medical"  "application" "applicants"
 [8,] "services" "minister"    "general"
 [9,] "federal"  "citizenship" "attorney"
[10,] "apotex"   "evidence"    "minister"
 

We can now use that classification to determine which is the most prominent topic per case.

Since we already have the correct, human-assigned label, we will combine the assignment with our classification prediction to compare results.

# We create a new object topics containing the most likely topic for each document and build a data frame that compares the actual with the predicted classification.
topics <- topics(topic_model)
 
topics <- mapvalues(topics, from=1:3, to=topic_label)
topics <- as.data.frame(cbind(as.character(cases$issue),topics))
 
topics <- cbind(cases$case,topics)
 
colnames(topics) <- c("case","issue", "prediction")
 
head(topics, 5)
 ##
case issue prediction
1 Sheldon v. Canada (Health) health health
2 Swarath v. Canada health health
3 Adewusi v. Canada (Citizenship and Immigration) immigration immigration
4 Cohen v. Canada (Minister of Citizenship and Immigration) immigration health
5 Lee v. Canada (Minister of Citizenship and Immigration) immigration health

We can also determine the probability that a topic is present in a given document. On the one hand, this gives us a sense of what percentage of a document covers which topic. On the other hand, it helps with error correction. The algorithm assigns a topic as the main topic when it is the most prevalent topic in a document. But if, according to the algorithm, 51% of a document covers health and 49% covers immigration, then a human could arguably classify it as either one. If the document has been manually classified as immigration, the algorithm may not be completely wrong. Again, it is important to carefully study the results to assess how much confidence one can have in computer-generated categories. The same can be said for human-assigned categories.

# We determine the probability of whether a topic is in a given document.

topics$prob_topic <- posterior(topic_model)$topics
 
head(topics,5)
 ##
case issue prediction prob_topic.1 prob_topic.2 prob_topic.3
1 Sheldon v. Canada (Health) health health 8.458806e-01 3.005826e-05 1.540893e-01
2 Swarath v. Canada health health 8.470014e-01 2.726386e-05 1.529714e-01
3 Adewusi v. Canada (Citizenship and Immigration) immigration immigration 4.717039e-01 5.282504e-01 4.572935e-05
4 Cohen v. Canada (Minister of Citizenship and Immigration) immigration health 8.283561e-01 1.716054e-01 3.845195e-05
5 Lee v. Canada (Minister of Citizenship and Immigration) immigration health 8.185887e-01 1.813808e-01 3.051184e-05
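
As a follow-up to the error-correction point above, we can sort cases by how close the two most probable topics are. This is only a sketch using base R; the object name margin is our own choice.

# Sketch: for each document, compute the gap between the most probable and the
# second most probable topic, then show the most ambiguous cases first.

prob <- posterior(topic_model)$topics

margin <- apply(prob, 1, function(p) {
  p_sorted <- sort(p, decreasing = TRUE)
  p_sorted[1] - p_sorted[2]
})

head(topics[order(margin), c("case", "issue", "prediction")], 5)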
 

Finally, we can formally quantify how good our automated classification was by comparing it to the manual classification. To do that, we simply compute the percentage of cases whose class was guessed correctly.


# Here we check how many times the preassigned "issue" label is identical to our prediction. If the prediction is correct, we count it as a hit. We start the hit count at 0 and add one every time the assignment is correct. We then divide the hits by the total number of guesses.

hits <- 0
for (row in 1:nrow(topics)) {
  if (topics$issue[row] == topics$prediction[row]) hits <- hits + 1
}

correctness <- hits/length(topics$issue)

correctness
 ##
[1] 0.8983051

Pretty good guessing! The unsupervised algorithm got roughly 9 out of 10 classifications right. (Note: Your number may differ, since the algorithm is probabilistic.)
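The same proportion can also be computed without a loop, using a vectorized comparison (a one-line alternative to the loop above, not part of the original walkthrough):

# Vectorized alternative: the mean of a logical vector is the share of TRUE values.

correctness <- mean(as.character(topics$issue) == as.character(topics$prediction))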

 

Extension: We used the topic model to classify our data. But by increasing the number of topics, we can also investigate more granular content. Rerun the analysis with a higher k. Rather than assigning each document to a single topic, we now want to check what percentage of each topic is in a given document. In that sense, the posterior (the share of each topic per document) is no longer a measure of how confident we are in a unique classification, but describes the allocation of topics within each document.
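A possible starting point for this extension is sketched below. The choice of k = 10 and the object names are arbitrary and should be adapted to your corpus.

# Sketch for the extension: fit a model with more topics, then inspect topic shares per document.
# The value k = 10 is an arbitrary choice.

k <- 10
granular_model <- LDA(dtm, k)

# Top words per topic, to assign labels.
terms(granular_model, 10)

# Share of each topic per document (each row sums to 1).
topic_shares <- posterior(granular_model)$topics
round(head(topic_shares, 5), 2)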

Last update May 11, 2020.