Supervised ML: Naive Bayes Classifier - Data Science for Lawyers

Supervised machine learning, in contrast to unsupervised machine learning (like topic models), uses human-determined categories as a baseline. The computer is trained on data that has already been labelled and then categorizes unlabelled data automatically.

Again, we will be working with court decisions that we want the algorithm to classify automatically based on their subject matter. We will train the computer on a subsample of the decisions and then use it to classify decisions that are not yet labelled.

# Load packages.

library(e1071)

Feeding the entire DTM into the model is computationally intensive and often unnecessary. An easier way to prepare our estimate is to reduce the dimensions of our DTM.

Remember, the length of our DTM is the number of documents and the width is the number of terms in the corpus. As such, DTMs can have thousands of columns. We want to compress this large matrix into a simpler matrix with just 2 dimensions. We can do that in two steps: we first create a distance matrix that records how dissimilar each pair of documents is, and we then use multidimensional scaling to place each document at a point in two dimensions so that these distances are preserved as closely as possible.

# For that we again create a distance matrix.

distance_matrix <- as.matrix(dist(dtm, method="binary"))

# We then scale the distance matrix down to, say, 2 dimensions.

compressed_dtm <- cmdscale(distance_matrix, k = 2)
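
To make these two steps concrete, here is a minimal sketch on a made-up matrix of four documents and five terms. The matrix `toy_dtm` and its values are assumptions for illustration only; `dist()` and `cmdscale()` are the same base R functions used above.

```r
# Hypothetical toy DTM: 4 documents (rows) x 5 terms (columns).
toy_dtm <- matrix(
  c(1, 1, 0, 0, 0,
    1, 1, 1, 0, 0,
    0, 0, 0, 1, 1,
    0, 0, 1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), paste0("term", 1:5))
)

# Pairwise document distances: one row and one column per document.
toy_dist <- as.matrix(dist(toy_dtm, method = "binary"))
dim(toy_dist)        # 4 rows, 4 columns

# Classical multidimensional scaling: 2 coordinates per document.
toy_compressed <- cmdscale(toy_dist, k = 2)
dim(toy_compressed)  # 4 rows, 2 columns
```

However many terms the corpus contains, each document ends up described by just two numbers, which is what we pass to the classifier.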

# We add the issue areas to our dataframe.

compressed_dtm <- as.data.frame(compressed_dtm)

compressed_dtm <- cbind(cases$issue, compressed_dtm)

Next we want to create two subsets of our dataframe: [1] a training and [2] a test set.

# For that we randomly draw 15 row numbers that we will use to build our sets.

sample_rows <- sample(1:length(cases$issue), 15)

# On that basis, we create a test and a training set.

dtm_training <- compressed_dtm[-sample_rows,]

dtm_test <- compressed_dtm[sample_rows,]
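
Because `sample()` draws different rows on every run, the split, and with it any accuracy computed from it, will vary between runs. A minimal sketch of a reproducible split, assuming a made-up data frame `toy_data` and an arbitrary seed value, could look like this:

```r
# Hypothetical data frame standing in for compressed_dtm (20 rows).
toy_data <- data.frame(
  issue = rep(c("immigration", "health"), each = 10),
  V1 = (1:20) / 10,
  V2 = (20:1) / 10,
  stringsAsFactors = FALSE
)

# Fixing the seed makes sample() return the same rows on every run.
set.seed(42)  # arbitrary value, chosen only for illustration
test_rows <- sample(seq_len(nrow(toy_data)), 5)

toy_training <- toy_data[-test_rows, ]  # all rows except the sampled ones
toy_test <- toy_data[test_rows, ]       # only the sampled rows

# The two sets do not overlap and together cover every row.
nrow(toy_training)  # 15
nrow(toy_test)      # 5
```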

We then train our model on the training data.

# We use a simple machine learning algorithm - Naive Bayes - to train our model.

model <- naiveBayes(dtm_training[,-1], as.factor(dtm_training[,1]))
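
For readers curious about what the model estimates: with numeric predictors, `naiveBayes()` in `e1071` assumes that each predictor follows a normal distribution within each class and that the predictors are independent of each other. The following base R sketch mimics that idea by hand on made-up data. The data, the helper name `toy_nb_predict`, and the simplifications are assumptions for illustration, not the package's actual implementation.

```r
# Hypothetical training data: two numeric features and a class label.
toy_train <- data.frame(
  issue = c("health", "health", "health",
            "immigration", "immigration", "immigration"),
  V1 = c(-0.9, -1.1, -1.0, 1.0, 0.8, 1.2),
  V2 = c(0.2, 0.0, 0.1, -0.1, -0.3, -0.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: score one new point under each class, pick the best.
toy_nb_predict <- function(train, new_point) {
  classes <- unique(train$issue)
  scores <- sapply(classes, function(cl) {
    rows <- train[train$issue == cl, ]
    # Prior: share of this class in the training data.
    prior <- nrow(rows) / nrow(train)
    # "Naive" step: treat the features as independent and multiply
    # their normal densities, estimated from the class mean and sd.
    lik_v1 <- dnorm(new_point[["V1"]], mean(rows$V1), sd(rows$V1))
    lik_v2 <- dnorm(new_point[["V2"]], mean(rows$V2), sd(rows$V2))
    prior * lik_v1 * lik_v2
  })
  classes[which.max(scores)]
}

toy_nb_predict(toy_train, c(V1 = 0.9, V2 = -0.2))  # "immigration"
toy_nb_predict(toy_train, c(V1 = -1.0, V2 = 0.1))  # "health"
```

The class with the highest score wins, so a new decision is assigned to the issue area whose training examples it most resembles.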

Next, we use our model to make predictions for data it was not trained on. We let it predict the classification of the decisions in our test set.

# We apply the model using the predict() function.

prediction <- predict(model, newdata = dtm_test)

prediction <- as.data.frame(prediction)

Finally, we compare the actual issue labels to our predictions.


prediction <- cbind(cases$issue[sample_rows],prediction)

colnames(prediction) <- c("issue","prediction")

prediction

##          issue  prediction
## 1   aboriginal      health
## 2  immigration immigration
## 3  immigration immigration
## 4  immigration immigration
## 5  immigration immigration
## 6  immigration immigration
## 7  immigration immigration
## 8  immigration immigration
## 9  immigration immigration
## 10 immigration immigration
## 11 immigration immigration
## 12 immigration immigration
## 13  aboriginal      health
## 14 immigration immigration
## 15 immigration immigration

Last but not least, we again want to assess how well our algorithm has performed.

# We follow the same approach as before to calculate the number of correct assignments.

hits <- 0

for (row in 1:nrow(prediction)) {
  if (prediction$issue[row] == prediction$prediction[row]) {
    hits <- hits + 1
  }
}

correctness <- hits / length(prediction$issue)

correctness

## [1] 0.8666667

Again, not bad. About 87% of labels were assigned correctly. For some applications this may be good enough. For other tasks, we may need a correctness closer to 100%. In that case, there are two strategies. First, we could train the model on more data so that it learns the relationship between text and outcome more accurately. Second, we could try another machine learning algorithm to see if it is more accurate. It is often difficult, if not impossible, to reach perfection.

It is important to remember that assignments made by machine learning algorithms do not need to be the final decision. They can instead serve as proposals that are subsequently validated by humans.
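
Before choosing a strategy, it can help to see which categories the model confuses, not only how often it is right. Base R's `table()` cross-tabulates actual and predicted labels. The sketch below re-enters the 15 rows printed earlier by hand as the data frame `toy_prediction` (a name introduced here for illustration):

```r
# The 15 actual/predicted pairs from the printed output, typed in by hand.
toy_prediction <- data.frame(
  issue = c("aboriginal", rep("immigration", 11), "aboriginal",
            rep("immigration", 2)),
  prediction = c("health", rep("immigration", 11), "health",
                 rep("immigration", 2)),
  stringsAsFactors = FALSE
)

# Rows are actual labels, columns are predicted labels.
confusion <- table(actual = toy_prediction$issue,
                   predicted = toy_prediction$prediction)
confusion

# Share of rows where the prediction matches the actual label.
mean(toy_prediction$issue == toy_prediction$prediction)  # 13/15, about 0.867
```

In this run both errors are aboriginal decisions predicted as health, which suggests the model's weakness lies in the rarer categories and not in the dominant immigration category.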

Last update February 16, 2021.