Lesson 7

Automated Content Analysis Through Machine Learning


Machine learning is responsible for the success of recent artificial intelligence (AI) applications and is fuelling a wide range of advances, from translation to self-driving cars. It can also be used in legal data science. Machine learning algorithms look for relationships in data and tend to get better the more data they have. For legal researchers, supervised and unsupervised learning are the two most relevant types of machine learning.

Supervised machine learning

In supervised machine learning, a human “teaches” a computer a task, which it can then perform autonomously. For example, you could read 100 Supreme Court decisions and classify them based on outcomes. Subsequently, you feed the text and the associated outcome to a machine learning algorithm. The algorithm “learns” the relationship between input (text of decision) and output (outcome of decision) and stores this relationship in the form of a model. It can then use this model to autonomously assign an outcome when provided with the text of a decision. To get a sense of how good the model is, you should evaluate it against a “gold standard”: manually classified data that were not used to train the model. If the results are mostly correct, the model works well. If it misclassifies many cases, you may have to hand-code more cases so that the algorithm can learn the relationship between text and outcome better.

Unsupervised machine learning

Unsupervised machine learning algorithms autonomously mine a dataset for patterns without prior human guidance. We will be looking at topic modelling as one example of such an algorithm. Say you do not know the content of a corpus in advance, but you guess that there are five topics in the corpus. You could then run a topic modelling algorithm that assumes that there are five topics in the corpus, that different words have different probabilities of appearing within a topic and that documents vary in the proportion of topics they talk about. The algorithm then sets out to “find” these five different topics and returns word lists associated with each topic. By interpreting these word lists you then infer the content of these five topics. Sometimes unsupervised machine learning algorithms work like magic and reveal patterns that make intuitive sense. Sometimes they find relationships that make no sense at all. Sometimes it comes down to luck: unsupervised machine learning algorithms tend to be probabilistic and, depending on where they start and how they “guess”, their findings can be more or less meaningful.
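The assumptions behind a topic model can be made concrete by running them forwards. The following is a minimal sketch in base R of the generative story described above: each topic is a probability distribution over words, and each document mixes topics in its own proportions. The vocabulary, the two topics and all probabilities are invented for illustration. A real topic model works in the opposite direction and estimates these quantities from the observed words.

```r
set.seed(1)  # fixed seed so the simulated document is reproducible

# Hypothetical vocabulary and two hypothetical topics (each row sums to 1).
vocab <- c("visa", "refugee", "officer", "hospital", "medical", "drug")
topic_word <- rbind(
  immigration = c(0.40, 0.30, 0.20, 0.04, 0.03, 0.03),
  health      = c(0.03, 0.03, 0.14, 0.30, 0.30, 0.20)
)
colnames(topic_word) <- vocab

# A hypothetical document that is 70% immigration and 30% health.
doc_topic <- c(immigration = 0.7, health = 0.3)

# Generate 200 words: first draw a topic for each word,
# then draw the word itself from that topic's word distribution.
n_words <- 200
word_topics <- sample(names(doc_topic), n_words, replace = TRUE, prob = doc_topic)
words <- vapply(
  word_topics,
  function(tp) sample(vocab, 1, prob = topic_word[tp, ]),
  character(1)
)

# Word counts for the simulated document: one row of a document-term matrix.
word_counts <- table(factor(words, levels = vocab))
print(word_counts)
```

Changing the seed changes the simulated document, which mirrors why a fitted topic model can also return somewhat different results from run to run.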

Validate, validate, validate

Machine learning algorithms are highly appealing because they can be used to quickly explore the content of a corpus or to classify texts by subject. But legal researchers should take care not to trust results blindly, but to validate them carefully. Aside from quantitative checks, this includes actually reading and reviewing some of the processed documents to assess whether results make sense. Moreover, researchers should not expect computers to do what humans cannot. Some corpora may be so diverse that they cannot be meaningfully grouped; others allow for multiple equally valid groupings. Just as two lawyers may validly divide the same collection of judicial decisions differently, computer-generated output should not be taken as “truth”, but as one proposal of how to group the available data.

When researchers should use machine learning

Unsupervised machine learning: It is often used at early stages of the research to explore new datasets, because an unsupervised algorithm, in contrast to human-trained supervised machine learning algorithms, can find patterns that the researcher did not actively look for. Unsupervised algorithms thus work well even when categories are unknown. In fact, many researchers will be disappointed when they use unsupervised algorithms to look for known patterns. Instead of automatically categorizing treaties by their clauses, for example, a topic model is likely to pick up language typically used by specific states and classify treaties by signatories rather than by content. Where categories of interest are known, a rules-based dictionary mapping or a supervised machine learning approach is more suitable.

Supervised machine learning: Whenever the researcher is confronted with repetitive tasks at a high volume, supervised machine learning is a useful tool. This is particularly true when the relationship between input and output data is complex and cannot be reduced to a small set of logical rules and where a dictionary-based content mapping is thus not available. But it only makes sense to embark on machine learning when the volume of data is high. If the classification only involves a few dozen or a hundred cases, it may be quicker to classify data by hand.
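Where the categories are known and can be captured by a handful of keywords, a rules-based dictionary mapping may be all that is needed. The sketch below is only an illustration in base R: the keyword lists and the three sample sentences are invented, and a real dictionary would have to be developed and validated against the corpus.

```r
# Hypothetical dictionary: one regular expression of keywords per category.
dictionary <- c(
  health      = "health|medical|hospital",
  immigration = "immigration|refugee|visa",
  aboriginal  = "aboriginal|treaty rights|first nation"
)

# Invented example texts.
texts <- c(
  "The applicant sought judicial review of the visa officer's refusal.",
  "The hospital failed to disclose the medical records.",
  "The First Nation asserted its treaty rights over the land."
)

# Count keyword matches of one pattern in every text.
count_matches <- function(pattern, x) {
  hits <- gregexpr(pattern, tolower(x))
  vapply(hits, function(h) sum(h > 0), integer(1))  # gregexpr returns -1 if no match
}

# Matrix of counts: one row per text, one column per category.
match_counts <- sapply(dictionary, count_matches, x = texts)

# Assign each text to the category with the most keyword matches.
# Note: a text without any match would fall into the first category here,
# so a real application should handle that case explicitly.
dictionary_label <- colnames(match_counts)[max.col(match_counts, ties.method = "first")]
print(data.frame(text = texts, label = dictionary_label))
```

Once the relationship between text and category can no longer be expressed through such short keyword lists, the supervised approach described above becomes the more attractive option.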

What we do in this lesson

In this lesson, we talk about two types of machine learning for the purposes of classifying the content of text. In the next lesson, we use similar algorithms for prediction.

1. Unsupervised Machine Learning
2. Supervised Machine Learning

R Script

Loading and Pre-Processing Texts

Before we can embark on any type of machine learning, we have to load and pre-process our text.

Today we work with Federal Court decisions that have been classified into three different issue categories: (1) health, (2) aboriginal and (3) immigration.

# Load text data.

setwd("~/Google Drive/Teaching/Canada/Legal Data Science/English Course/Sample judgments")
cases <- read.csv("Sample Canadian Cases.csv", header = TRUE)

# Load package for text processing.

library(tm)

Now we can create a corpus. 

# Create a corpus from the text.

corpus <- VCorpus(VectorSource(cases$text))

Next we pre-process our text.

# We get rid of variation that we don't consider conceptually meaningful.

corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
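To see what these steps do, it can help to apply rough base R equivalents to a single sentence. This is only a sketch: the regular expressions approximate the tm functions rather than reproduce them exactly, the sentence is invented, and the three-word stop list is a hypothetical stand-in for stopwords("english").

```r
x <- "The Court held, in 2015, that the Applicant's claim  FAILED."

x <- gsub("[[:punct:]]+", "", x)   # roughly what removePunctuation does
x <- tolower(x)                    # same as content_transformer(tolower)
x <- gsub("[[:space:]]+", " ", x)  # roughly what stripWhitespace does
x <- gsub("[[:digit:]]+", "", x)   # roughly what removeNumbers does

# Hypothetical three-word stop list; \\b restricts matches to whole words.
stop_pattern <- "\\b(the|in|that)\\b"
x <- gsub(stop_pattern, "", x, perl = TRUE)  # roughly what removeWords does

# Removing numbers and words leaves gaps behind,
# so this sketch collapses whitespace once more at the end.
cleaned <- trimws(gsub("[[:space:]]+", " ", x))
print(cleaned)
```

The final clean-up step is an addition of this sketch. Leftover spaces do not affect the document-term-matrix, because empty strings are not counted as terms.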

# Finally, we again create a document-term-matrix.

dtm <- DocumentTermMatrix(corpus, control = list(bounds=list(global = c(2, Inf))))
dtm <- as.matrix(dtm)
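A document-term-matrix simply counts how often each term occurs in each document. The following base R sketch builds one by hand for three invented mini-documents and then mimics what the bounds setting above does: with global = c(2, Inf), only terms that occur in at least two documents are kept. All names in the sketch are hypothetical.

```r
# Three invented, already pre-processed mini-documents.
docs <- c(
  d1 = "health officer health",
  d2 = "immigration officer",
  d3 = "aboriginal rights court"
)

# Split each document into words and collect the vocabulary.
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document: rows are documents, columns are terms.
toy_dtm <- t(vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
))
rownames(toy_dtm) <- names(docs)
colnames(toy_dtm) <- vocab
print(toy_dtm)

# Document frequency: in how many documents does each term appear?
doc_freq <- colSums(toy_dtm > 0)

# Keep only terms that appear in at least 2 documents.
toy_dtm_trimmed <- toy_dtm[, doc_freq >= 2, drop = FALSE]
print(toy_dtm_trimmed)
```

Dropping terms that occur in a single document removes many typos and one-off names, and it shrinks the matrix considerably.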

Unsupervised ML: Topic Models for Classification

We want to know what specific issue areas, or TOPICS, our judgments cover. Topics, here, can mean different things. Think about any text: depending on the level of abstraction, a text can have different "topics".

A judicial decision, depending on the level of abstraction, can be about aboriginal rights, the consideration of aboriginal concerns in a specific context such as sentencing, or the factual circumstances of the case.

We can use these levels of abstraction purposefully. When we are interested in abstract content, we can choose fewer topics that allow us to classify documents in content groups: this is a decision about aboriginal rights, this is a decision about immigration etc.

When we are interested in more specific content, we can choose a higher number of topics, which allows us to know what issues a document talks about: 60% of case A concerns the facts relating to burglaries, 30% relates to the criminal code and 10% relates to sentencing. Hence we can use topic models purposefully for different content analysis tasks.

Here, we focus on the first task: classification. All our cases have already been manually classified. We now want to know whether a topic modelling algorithm can guess these topics correctly.

We begin by installing and loading the package topicmodels.

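# Load package.

# install.packages("topicmodels")  # run once if the package is not yet installed
library(topicmodels)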

Topic models are an unsupervised machine learning algorithm. The only information we have to give to the computer is the number of topics we suspect to be in a set of documents. Based on the distribution of words in our documents, the computer then makes a statistically informed "guess" what these topics are.

The algorithm will not return a label for each topic, but it will return a list of words most associated with each topic. By reading this list of words, we can assign labels to each topic (and check whether the topics make sense).

In light of the above considerations, a LOW number of topics helps to classify; a HIGH number of topics provides more specific content.

Let's first try to classify our decisions into 3 baskets.

# So we set the number of topics to 3.

k <- 3

# We can run our topic model on our dtm with 3 topics.

topic_model <- LDA(dtm,k)

We now want to see the top-10 words associated with each topic to assign labels to each of the topics.

# We create a new object terms with the top words for each topic as input.
terms <- terms(topic_model, 10)
      Topic 1    Topic 2       Topic 3
 [1,] "health"   "immigration" "aboriginal"
 [2,] "canada"   "officer"     "court"
 [3,] "court"    "applicant"   "canada"
 [4,] "officer"  "canada"      "federal"
 [5,] "safety"   "applicants"  "rights"
 [6,] "nthe"     "decision"    "plaintiffs"
 [7,] "medical"  "application" "applicants"
 [8,] "services" "minister"    "general"
 [9,] "federal"  "citizenship" "attorney"
[10,] "apotex"   "evidence"    "minister"


Note: Your output may look different, since topic modelling is based on a probabilistic algorithm. But if you run it multiple times, it should produce results that allow you to classify texts in three themes.

In our case, the three topics emerge clearly from the most frequent words. Our judgments deal with 1) health, 2) immigration and 3) aboriginal concerns.

We may thus want to assign these names as column headers to our terms.

# Consult the top 10 words for each topic. Based on these words, assign the labels health, immigration and aboriginal to the corresponding topics. In the output above, topic 1 is health, topic 2 is immigration and topic 3 is aboriginal.
# !!! IMPORTANT: ADAPT THAT ORDERING TO YOUR RESULTS !!!

topic_label <- c("health","immigration","aboriginal") 
colnames(terms) <- topic_label
      health     immigration   aboriginal
 [1,] "health"   "immigration" "aboriginal"
 [2,] "canada"   "officer"     "court"
 [3,] "court"    "applicant"   "canada"
 [4,] "officer"  "canada"      "federal"
 [5,] "safety"   "applicants"  "rights"
 [6,] "nthe"     "decision"    "plaintiffs"
 [7,] "medical"  "application" "applicants"
 [8,] "services" "minister"    "general"
 [9,] "federal"  "citizenship" "attorney"
[10,] "apotex"   "evidence"    "minister"

We can now use that classification to determine which is the most prominent topic per case.

Since we already have the correct, human-assigned label, we will combine the assignment with our classification prediction to compare results.

# We create a new object topics with the most likely topic for each document, replace the topic numbers with our labels, and build a table that compares actual with predicted classifications. mapvalues() comes from the plyr package.
library(plyr)
topics <- topics(topic_model)
topics <- mapvalues(topics, from=1:3, to=topic_label)
topics <- as.data.frame(cbind(as.character(cases$issue),topics))
topics <- cbind(cases$case,topics)
colnames(topics) <- c("case","issue", "prediction")
head(topics, 5)
  case                                                       issue       prediction
1 Sheldon v. Canada (Health)                                 health      health
2 Swarath v. Canada                                          health      health
3 Adewusi v. Canada (Citizenship and Immigration)            immigration immigration
4 Cohen v. Canada (Minister of Citizenship and Immigration)  immigration health
5 Lee v. Canada (Minister of Citizenship and Immigration)    immigration health


We can also determine the probability that each topic is present in a given document. On the one hand, this gives us a sense of what percentage of a document covers which topic. On the other hand, it also helps with error correction. The algorithm assigns a topic as the main topic when it is the most prevalent topic in a document. But if, according to the algorithm, 51% of a document covers health and 49% covers immigration, then a human could arguably classify it as either. If the document was classified as immigration by a human, the algorithm may not be completely wrong. Again, it is important to carefully study results to assess how much confidence one can have in computer-generated (and sometimes also human-assigned) categories.

# We determine the probability of whether a topic is in a given document.

topics$prob_topic <- posterior(topic_model)$topics
  case                                                       issue       prediction  prob_topic.1 prob_topic.2 prob_topic.3
1 Sheldon v. Canada (Health)                                 health      health      8.458806e-01 3.005826e-05 1.540893e-01
2 Swarath v. Canada                                          health      health      8.470014e-01 2.726386e-05 1.529714e-01
3 Adewusi v. Canada (Citizenship and Immigration)            immigration immigration 4.717039e-01 5.282504e-01 4.572935e-05
4 Cohen v. Canada (Minister of Citizenship and Immigration)  immigration health      8.283561e-01 1.716054e-01 3.845195e-05
5 Lee v. Canada (Minister of Citizenship and Immigration)    immigration health      8.185887e-01 1.813808e-01 3.051184e-05
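The posterior shares can also be used to flag close calls for manual review. The base R sketch below re-enters the five rows shown above, rounded to four decimals, and assumes the same topic ordering (topic 1 = health, topic 2 = immigration, topic 3 = aboriginal). The 0.10 threshold is an arbitrary, hypothetical choice.

```r
# Posterior topic shares for the five cases shown above (rounded to four decimals).
prob_topic <- rbind(
  c(0.8459, 0.0000, 0.1541),
  c(0.8470, 0.0000, 0.1530),
  c(0.4717, 0.5283, 0.0000),
  c(0.8284, 0.1716, 0.0000),
  c(0.8186, 0.1814, 0.0000)
)
colnames(prob_topic) <- c("health", "immigration", "aboriginal")

# Most prevalent topic per document.
top_topic <- colnames(prob_topic)[max.col(prob_topic, ties.method = "first")]

# Gap between the largest and the second-largest share;
# a small gap signals that the assignment was a close call.
margin <- apply(prob_topic, 1, function(p) {
  s <- sort(p, decreasing = TRUE)
  s[[1]] - s[[2]]
})

# Hypothetical rule: flag documents whose top two topics are within 0.10 of each other.
needs_review <- margin < 0.10
print(data.frame(top_topic, margin = round(margin, 4), needs_review))
```

With these numbers only the third case is flagged. Note that the two misclassified immigration cases (rows 4 and 5) are not close calls at all, so a small margin is a useful warning sign but no substitute for reading the documents.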

Finally, we can formally quantify how good our automated classification was by comparing it to the manual classification. To do that we simply check the share of cases that were classified correctly.

# Here we check how many times the preassigned "issue" label is identical to our prediction. If the prediction is correct, we count it as a hit. We start the hit count with 0 and add one every time the assignment was correct. We can then divide the hits by the total number of guesses.

hits <- 0
for (row in 1:nrow(topics)) {
  # as.character() makes the comparison safe even if the columns are stored as factors.
  if (as.character(topics$issue[row]) == as.character(topics$prediction[row])) {
    hits <- hits + 1
  }
}

correctness <- hits/length(topics$issue)
correctness

[1] 0.8983051

Pretty good guessing! The unsupervised algorithm got roughly 9 out of 10 classifications right. (Note: Your number may be different, since the algorithm is probabilistic.)
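A single accuracy figure hides which categories get confused with each other. As a minimal sketch in base R, the same comparison can be written without a loop and extended with a confusion matrix. The eight labels below are invented stand-ins for topics$issue and topics$prediction.

```r
# Invented labels for eight cases.
issue      <- c("health", "health", "immigration", "immigration",
                "immigration", "aboriginal", "aboriginal", "health")
prediction <- c("health", "health", "immigration", "health",
                "health", "aboriginal", "aboriginal", "health")

# Vectorised accuracy: the mean of a logical vector is the share of TRUE values.
accuracy <- mean(issue == prediction)

# Confusion matrix: rows are human labels, columns are predicted labels.
categories <- c("aboriginal", "health", "immigration")
confusion <- table(
  actual    = factor(issue, levels = categories),
  predicted = factor(prediction, levels = categories)
)
print(accuracy)
print(confusion)
```

In this toy example every error is an immigration case predicted as health, which is the kind of pattern worth following up by reading the affected decisions.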


Extension: We used the topic model to classify our data. But by increasing the number of topics, we can also investigate more specific content. Rerun the analysis with a higher k. Rather than assigning each document to a single topic, we now want to check what share of a given document is devoted to each topic. In that case, the posterior (the share of each topic per document) is not a measure of how confident we are in a single classification, but a description of how topics are distributed within each document.

Supervised ML: Naive Bayes Classifier

Supervised machine learning, in contrast to unsupervised machine learning (like topic models), takes human classification as baseline. The computer is trained on already labelled data to label unlabelled data automatically.

We work again with our court decisions and, again, we want to automatically classify these decisions by subject matter. We will train the computer on a sub-sample of the decisions to then label decisions out of sample.

# Load packages. The naiveBayes() function used below comes from the e1071 package.

library(e1071)

It is very computationally intensive and not necessary to work with the entire DTM as input. A simpler way to prepare our estimation is to reduce our DTM to lower dimensionality.

Remember, our DTM has as many rows as there are documents and as many columns as there are terms in the corpus. So DTMs can have thousands of columns. We want to compress this large matrix into a simpler matrix with just 2 dimensions. We can do that by again creating a distance matrix representation of the DTM and then using statistical dimensionality reduction to reduce this matrix to two dimensions.

# For that we again create a distance matrix.

distance_matrix <- as.matrix(dist(dtm, method="binary"))

# We then scale the distance matrix to say 2 dimensions.

compressed_dtm <- cmdscale(distance_matrix, k = 2)
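# We add the issue areas to our dataframe. We use data.frame() rather than cbind(), because cbind() would coerce the labels and the numeric coordinates to a single type.

compressed_dtm <- data.frame(issue = factor(cases$issue), compressed_dtm)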
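To build some intuition for this compression step, the base R sketch below applies the same two functions, dist() with the binary method and cmdscale(), to a tiny invented document-term-matrix. Documents a and b share vocabulary, as do c and d, so the two pairs should end up in different regions of the two-dimensional space.

```r
# Invented 4 x 5 document-term matrix.
toy_dtm <- rbind(
  a = c(2, 1, 0, 0, 0),
  b = c(1, 3, 1, 0, 0),
  c = c(0, 0, 0, 2, 1),
  d = c(0, 0, 1, 1, 4)
)
colnames(toy_dtm) <- c("health", "medical", "court", "visa", "refugee")

# Binary distance: among the terms present in at least one of two documents,
# the share that is present in only one of them.
# Counts are ignored; only presence or absence matters.
toy_dist <- dist(toy_dtm, method = "binary")

# Classical multidimensional scaling places each document in two dimensions
# so that distances between points approximate the original distances.
toy_coords <- cmdscale(toy_dist, k = 2)

print(round(as.matrix(toy_dist), 2))
print(toy_coords)
```

Documents a and b differ in one of three terms (distance 1/3), while a and c share no terms at all (distance 1). Two coordinates per document are a heavy simplification of the full DTM, so some information is inevitably lost.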

Next we want to create two subsets of our dataframe: [1] a training and [2] a test set.

# For that we generate 15 random row numbers that we will use to build our sets.

sample_rows <- sample(1:length(cases$issue), 15)
# On that basis, we create a test and a training set.

dtm_training <- compressed_dtm[-sample_rows,]
dtm_test <- compressed_dtm[sample_rows,]
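Because sample() draws different rows on every run, the test set, and with it the reported accuracy, changes from run to run. A minimal sketch of how the split can be made repeatable with set.seed() follows. The seed value and the corpus size of 59 are hypothetical placeholders.

```r
set.seed(2024)  # arbitrary seed; any fixed number makes the draw repeatable

n_cases <- 59  # hypothetical corpus size
test_rows <- sample(seq_len(n_cases), 15)              # 15 rows held out for testing
training_rows <- setdiff(seq_len(n_cases), test_rows)  # all remaining rows

# Sanity checks: the two sets do not overlap and together cover every case.
print(length(intersect(test_rows, training_rows)) == 0)
print(length(test_rows) + length(training_rows) == n_cases)
```

Keeping the test rows strictly separate from the training rows is what makes the later accuracy figure an out-of-sample measure.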

We then train our model on the training data.

# We use a simple machine learning algorithm - Naive Bayes - to train our model.

model <- naiveBayes(dtm_training[,-1], dtm_training[,1])

Next, we use our model to predict out of sample. We use our test data to predict their classification.

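# We apply the model using the predict() function. We drop the first column of the test set, which holds the human-assigned label, so that only the two compressed dimensions are used as input.

prediction <- predict(model, newdata = dtm_test[,-1])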
prediction <- as.data.frame(prediction)

Finally, we compare our actual row assignment to our prediction.

prediction <- cbind(cases$issue[sample_rows],prediction)
colnames(prediction) <- c("issue","prediction")
issue prediction
1 aboriginal health
2 immigration immigration
3 immigration immigration
4 immigration immigration
5 immigration immigration
6 immigration immigration
7 immigration immigration
8 immigration immigration
9 immigration immigration
10 immigration immigration
11 immigration immigration
12 immigration immigration
13 aboriginal health
14 immigration immigration
15 immigration immigration

Last but not least, we again want to assess how well our algorithm has performed.

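# We follow the same approach as before to calculate the number of correct assignments.

hits <- 0
for (row in 1:nrow(prediction)) {
  # as.character() makes the comparison safe even if the columns are stored as factors.
  if (as.character(prediction$issue[row]) == as.character(prediction$prediction[row])) {
    hits <- hits + 1
  }
}
correctness <- hits/length(prediction$issue)
correctness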

 ## [1] 0.8666667

Again, not bad. About 87% of labels have been correctly assigned. For some applications this may be good enough. For other tasks, we may need a correctness closer to 100%. In that case, we could pursue two strategies. First, we could train on more data so that the model learns the relationship between text and label more accurately. Second, we could try another machine learning algorithm that achieves a higher accuracy. But it is often difficult if not impossible to reach perfection.

Another point to keep in mind is that the machine learning assignment does not need to be the final call. It can instead be used as a proposed assignment that a human subsequently validates by hand to ensure accuracy.


Sample of Canadian Court Decisions. [Download]
