Lesson 8

Prediction

Introduction

On a technical level, prediction is not very different from the classification through supervised machine learning we did in Lesson 7. On a conceptual level, however, it is a world apart. 

Prediction is speculating about the future

When we classify a text, we essentially summarize its content under a label. There is nothing speculative or inter-temporal about a classification. When we predict, however, we use the past to speculate about the future. Since we have an imperfect understanding of what determines future events, prediction is fraught with uncertainties. Hence, we must be extra careful when interpreting and relying on predictive results.

The danger of dumb predictions

Machine learning algorithms make it easy to generate predictions. Indeed, predicting is easy. Anyone with access to some sample code can predict an outcome. That gives rise to what I call “dumb predictions”: predictions that lack a causal theory and a deeper understanding of the input data. 

While predictions are extremely useful in practice, researchers and lawyers relying on them should make sure these predictions are “smart predictions”. Smart predictions are rooted in a sound causal theory that connects causes to outcomes as best as we can. Smart predictions also require studying the input data to detect missing variables, biases and other limitations that make predictions less reliable.

In short, prediction is easy.  Predicting smartly is hard. 

What we do in this lesson

In this lesson, my main point is to show you how easy it is to predict. We will do what is common in practice: use an existing dataset and try different machine learning algorithms to see which one performs best. But keep in mind that the deeper challenge lies not in predicting, but in predicting smartly.

  • 1. Loading WJBrennan Voting Data
  • 2. Prediction Using Naive Bayes
  • 3. Prediction Using Support Vector Machines
  • 4. Prediction Using K-Nearest Neighbour

R Script

Loading WJBrennan Voting Data
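We begin by loading our data. We will predict the decisions of a 20th-century judge of the US Supreme Court, Judge Brennan, who served on the Court for 33 years. The data is taken from the United States Supreme Court Database.
# Load voting data. Adjust the path to the folder where you saved the dataset.
setwd("~/Google Drive/Teaching/Canada/Legal Data Science/English Course/Sample judgments")
# stringsAsFactors = TRUE stores text columns as factors (the default before R 4.0).
# svm() in particular needs a factor outcome to perform classification.
voting <- read.csv("WJBrennan_voting.csv", header = TRUE, stringsAsFactors = TRUE)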

Take a look at the data using the head() function. You will notice that the values are codes, which can only be understood with the codebook of the United States Supreme Court Database. I have chosen this example deliberately to underscore that prediction does not require a deeper understanding of the underlying data to produce results. In our case, we can be confident that the United States Supreme Court Database was carefully designed to track meaningful variables and was diligently produced. But the very ease of producing predictions creates a risk of predictions that are not meaningful. The first step in a prediction task is thus to carefully reflect on the causal determinants of the outcome you are trying to predict and to assess the quality of the data to check for biases, mistakes, and omissions.
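A minimal sketch of what such a first inspection could look like. Because the CSV file is not reproduced here, the sketch uses a small invented stand-in data frame; the column names term, issueArea and vote are assumptions for illustration and need not match the actual columns of WJBrennan_voting.csv.

```r
# Hypothetical stand-in for the voting data; all values and column names are invented.
voting_toy <- data.frame(
  term      = c(1975, 1976, 1979, 1981, 1983, 1985),
  issueArea = c(1, 2, 8, 1, 3, 2),
  vote      = factor(c("majority", "majority", "dissent",
                       "majority", "dissent", "majority"))
)

head(voting_toy)             # first rows of the data
str(voting_toy)              # type of each column
colSums(is.na(voting_toy))   # number of missing values per column

# Share of each outcome: a first check for imbalance between the categories
vote_share <- prop.table(table(voting_toy$vote))
vote_share
```

Running the same calls on the real voting object shows which columns are coded numerically, whether values are missing, and how unevenly the two outcomes are distributed.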
Prediction Using Naive Bayes

Let's go back in time to 1980. Judge Brennan had been on the Supreme Court since 1956 and remained a judge until 1990. Let's try to predict his votes in the 1980s based on his voting history.

In the dataset, note that the voting code works as follows:

[1] means that he voted with the majority.

[2] means he dissented.

We will use three different machine learning algorithms on the same data to predict his voting choices in order to see which algorithm performs best. We start again with Naive Bayes.


# Load packages.


library(e1071)

We begin by dividing our dataset in two parts: one pre-1980 and one post-1980. We use the pre-1980 to train our model and post-1980 to predict.

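# Creating training and test set.
# Rows 1 to 3368 are the pre-1980 votes, rows 3369 to 4746 the post-1980 votes.

voting_pre1980 <- voting[c(1:3368),c(2:10)]

voting_post1980 <- voting[c(3369:4746),c(2:10)]

We first train our model on the training data.

# The ninth column of the subset holds the outcome (vote); the other eight are the predictors.
model <- naiveBayes(voting_pre1980[,-9], voting_pre1980[,9])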
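The split above relies on hard-coded row numbers, which only works if the rows are sorted by date. A more robust alternative, sketched here under assumptions, is to split on a year variable. The data frame and the column name term below are invented for illustration and are not taken from the actual dataset.

```r
# Hypothetical stand-in data; `term` (year of the Court term) is an assumed column name.
voting_toy <- data.frame(
  term = c(1977, 1978, 1979, 1980, 1981, 1982),
  vote = factor(c("majority", "dissent", "majority",
                  "majority", "majority", "dissent"))
)

cutoff <- 1980
train_toy <- voting_toy[voting_toy$term <  cutoff, ]  # used to train the model
test_toy  <- voting_toy[voting_toy$term >= cutoff, ]  # held back for prediction

# Sanity checks: training data strictly precedes test data, and no rows are lost
stopifnot(max(train_toy$term) < min(test_toy$term))
stopifnot(nrow(train_toy) + nrow(test_toy) == nrow(voting_toy))
```

Splitting on the year makes the temporal logic explicit: the model only ever sees votes cast before the period it is asked to predict.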

Next, we predict out of sample based on our test data.


prediction <- predict(model, newdata = voting_post1980[,-9])
 
prediction_Bayes <- as.data.frame(prediction)

Finally, we compare the actual votes to our predictions.

 


prediction_Bayes <- cbind(voting_post1980$vote,prediction_Bayes)
 
colnames(prediction_Bayes) <- c("vote","prediction")
head(prediction_Bayes)
##       vote prediction
## 1 majority   majority
## 2 majority   majority
## 3 majority    dissent
## 4 majority   majority
## 5 majority   majority
## 6 majority   majority
 

So how well did our prediction perform? To assess the quality of our prediction, we determine the share of correct predictions.

# We again calculate the number of correct assignments.

hits <- 0
for (row in 1:nrow(prediction_Bayes)) {
  if (prediction_Bayes$vote[row] == prediction_Bayes$prediction[row]) {
    hits <- hits + 1
  }
}

 
correctness_Bayes <- hits/length(prediction_Bayes$vote)

 
correctness_Bayes
 ## [1] 0.6748911


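In about 67% of the votes, our prediction proved correct. This is far from perfect, but better than a 50:50 guess. Keep in mind, however, that a 50:50 guess is a low bar: Judge Brennan voted with the majority in 919 of the 1,378 post-1980 votes, so simply predicting “majority” every time would also be right about 67% of the time.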

To further evaluate the performance of the algorithm, we can take a look at the confusion matrix. If the algorithm had predicted all values correctly, all actual decisions (rows) would match the predicted decisions (columns) and the lower left and upper right cells would be 0.

# Compare the results in a confusion matrix

table(prediction_Bayes$vote,prediction_Bayes$prediction)
##           dissent majority
## dissent       184      275
## majority      173      746
 

We see that the algorithm got it wrong both ways. Some dissents were mistakenly predicted as majority votes and some majority votes were mistakenly predicted as dissents.
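The confusion matrix can also be used to recompute the correctness score and to compare it with a simple baseline. The sketch below re-enters the four counts reported above by hand; the object names cm_bayes, accuracy_bayes and baseline are introduced here for illustration only.

```r
# Counts copied from the Naive Bayes confusion matrix above
# (rows = actual votes, columns = predicted votes).
cm_bayes <- matrix(c(184, 173, 275, 746), nrow = 2,
                   dimnames = list(actual    = c("dissent", "majority"),
                                   predicted = c("dissent", "majority")))

# Correct predictions sit on the diagonal of the matrix
accuracy_bayes <- sum(diag(cm_bayes)) / sum(cm_bayes)

# Baseline: always predict the more frequent actual outcome
baseline <- max(rowSums(cm_bayes)) / sum(cm_bayes)

round(c(naive_bayes = accuracy_bayes, baseline = baseline), 3)
```

Comparing a model against such a baseline is a quick way to see how much the algorithm adds beyond knowing which outcome is more common.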

Prediction Using Support Vector Machines

We now repeat the same exercise but with another algorithm: Support Vector Machines.

Again, we start by training our model on the training data to then predict out of sample.

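# Training the model.

# A first SVM with default settings (radial kernel).
model <- svm(voting_pre1980[,-9], voting_pre1980[,9])

# A second SVM with a polynomial kernel. This call overwrites the model above,
# so the predictions below come from this second model.
model <- svm(voting_pre1980[,-9], voting_pre1980[,9], kernel = "polynomial", degree = 18, cost = 3)

# Predicting out of sample.

prediction <- predict(model, voting_post1980[,-9])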
 
prediction_SVM <- as.data.frame(prediction)

Finally, to evaluate the performance of our algorithm, we again compare the actual votes to our predictions and calculate the percentage of accurately predicted results.

prediction_SVM <- cbind(voting_post1980$vote,prediction_SVM)
 
colnames(prediction_SVM) <- c("vote","prediction")
head(prediction_SVM)
##          vote prediction
## 3369 majority   majority
## 3370 majority   majority
## 3371 majority   majority
## 3372 majority   majority
## 3373 majority   majority
## 3374 majority   majority
 
# We again calculate the number of correct assignments.

hits <- 0
for (row in 1:nrow(prediction_SVM)) {
  if (prediction_SVM$vote[row] == prediction_SVM$prediction[row]) {
    hits <- hits + 1
  }
}
 
correctness_SVM <- hits/length(prediction_SVM$vote)
correctness_SVM
 ## [1] 0.6669086


The performance of the SVM algorithm with close to 67% correctness is comparable to the performance of the Naive Bayes. But take a look at the confusion matrix!

table(prediction_SVM$vote,prediction_SVM$prediction)
##           dissent majority
## dissent         0      459
## majority        0      919
 

You notice that the SVM predicted ALL voting outcomes as majority votes and NONE as dissents. The reason is that Judge Brennan voted with the majority more often than he dissented. This creates an imbalance in the data. Some machine learning algorithms are affected by that imbalance and then predict exclusively the more common category.
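One common way to counter such an imbalance is to give the rarer category more weight during training. The sketch below computes inverse-frequency weights from invented training labels and, if the e1071 package is installed, passes them to svm() through its class.weights argument. The labels, features and object names are hypothetical, and this is not the specification used in this lesson; whether weighting helps on the real data would have to be tested.

```r
# Invented training labels with a 70:30 imbalance (not the real data)
train_labels <- factor(c(rep("majority", 70), rep("dissent", 30)))

# Inverse-frequency weights: the rarer class receives the larger weight
counts <- table(train_labels)
class_weights <- as.numeric(sum(counts) / (length(counts) * counts))
names(class_weights) <- names(counts)
class_weights

# Optional: fit a weighted SVM on invented features if e1071 is available
if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(1)
  x_toy <- matrix(rnorm(200), ncol = 2)  # 100 rows of random, invented features
  fit <- e1071::svm(x_toy, train_labels, class.weights = class_weights)
  print(table(predict(fit, x_toy)))
}
```

The weights make errors on dissents costlier than errors on majority votes, which discourages the model from ignoring the rarer outcome altogether.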

Another lesson to learn from this is to always look at the confusion matrix to assess what the algorithm got wrong.

Prediction Using K-Nearest Neighbour

Finally, we repeat the same exercise with a last algorithm: K-Nearest Neighbour.

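Unlike the previous two algorithms, the knn() function does not fit a separate model first. It takes the training data, the test data and the training labels in a single call and directly returns the predictions for the test data.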

library(class)

# We first train our model on the training data and apply it to the test data.


model.knn <- knn(voting_pre1980[,-9], voting_post1980[,-9], voting_pre1980[,9], k = 3, prob=TRUE)
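The value k = 3 used above is one specification among many. A rough sketch of how different values of k could be compared is given below. It runs on invented data, not on the voting data, and all object names are hypothetical. Note that knn() breaks ties at random, so a seed is set to make the run repeatable.

```r
library(class)  # knn(); the class package ships with standard R installations

set.seed(42)    # knn() breaks ties at random

# Invented data: two numeric features and a binary outcome
n <- 200
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + rnorm(n) > 0, "majority", "dissent"))

train_idx <- 1:150    # rows used as training data
test_idx  <- 151:200  # rows held back for evaluation

# Share of correct predictions for several values of k
k_values <- c(1, 3, 5, 9, 15)
acc_by_k <- sapply(k_values, function(k) {
  pred <- knn(x[train_idx, ], x[test_idx, ], y[train_idx], k = k)
  mean(pred == y[test_idx])
})
names(acc_by_k) <- paste0("k=", k_values)
acc_by_k
```

If k is chosen this way, the comparison should ideally be run on a separate validation portion of the pre-1980 data, so that the post-1980 test data remains a fair out-of-sample check.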

We then evaluate the performance of our algorithm by comparing the actual votes to our predictions and calculating the percentage of accurately predicted results.

# We create a dataframe with our prediction.


prediction_KNN <- as.data.frame(model.knn)

# Finally, we compare our actual row assignment to our prediction.


prediction_KNN <- cbind(voting_post1980$vote,prediction_KNN)
 
colnames(prediction_KNN) <- c("vote","prediction")
 
head(prediction_KNN)
##       vote prediction
## 1 majority    dissent
## 2 majority   majority
## 3 majority   majority
## 4 majority   majority
## 5 majority   majority
## 6 majority   majority
 
# We again calculate the number of correct assignments.


hits <- 0
for (row in 1:nrow(prediction_KNN)) {
  if (prediction_KNN$vote[row] == prediction_KNN$prediction[row]) {
    hits <- hits + 1
  }
}

# Divide by the number of KNN predictions (not the SVM ones).
correctness_KNN <- hits/length(prediction_KNN$vote)

 
correctness_KNN
 ## [1] 0.6879536


The K-Nearest Neighbour algorithm performs best so far, with 69% correctness, although the improvement is modest. Let's also take a look at the confusion matrix.

# Compare the results in a confusion matrix


table(prediction_KNN$vote,prediction_KNN$prediction)
##           dissent majority
## dissent       139      320
## majority      110      809
 

Like Naive Bayes, the K-Nearest Neighbour algorithm predicts both outcomes rather than only the more common one. For this particular task, we would thus likely choose this algorithm given its slightly higher correctness.

Much of the work in prediction (and supervised machine learning generally) is about achieving the highest possible accuracy of prediction by trying different algorithms and specifications. These efforts are useful and important. But correctness scores (and related success measures such as recall, precision, F-scores, area-under-the-curve, etc.) should not become the sole target. In the end, predictions have to make sense. They have to be grounded in reliable theories of how the world works and should be informed by the quality of the data.
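To make some of these measures concrete, the sketch below computes precision, recall and the F1 score for the “dissent” category from the K-Nearest Neighbour confusion matrix reported above. The counts are re-entered by hand, and the object names are introduced here for illustration only.

```r
# Counts copied from the K-Nearest Neighbour confusion matrix above
# (rows = actual votes, columns = predicted votes).
cm_knn <- matrix(c(139, 110, 320, 809), nrow = 2,
                 dimnames = list(actual    = c("dissent", "majority"),
                                 predicted = c("dissent", "majority")))

# Treat "dissent" as the category of interest
tp <- cm_knn["dissent", "dissent"]    # dissents correctly predicted
fp <- cm_knn["majority", "dissent"]   # majority votes predicted as dissents
fn <- cm_knn["dissent", "majority"]   # dissents predicted as majority votes

precision <- tp / (tp + fp)  # share of predicted dissents that were real dissents
recall    <- tp / (tp + fn)  # share of real dissents that were predicted
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
```

The recall figure shows that only about 30% of the actual dissents were caught (139 of 459), a weakness that the overall correctness score of 69% does not reveal.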

So while researchers and lawyers should eagerly apply the tools introduced in this lesson, they should carefully think about what they are doing and carefully reflect on the data they use to ensure that predictions are smart and not dumb.

Dataset

Sample of US Supreme Court Data. [Download]
