Prediction Using K-Nearest Neighbour

Finally, we repeat the same exercise with one last algorithm: K-Nearest Neighbour.

Again, we use our training data to predict out of sample. Note that knn() does this in a single call: rather than returning a fitted model object, it classifies the test observations directly from the training data.

library(class)

# We classify the test data based on the training data. Column 9
# holds the vote outcome; the remaining columns are the predictors.


model.knn <- knn(train = voting_pre1980[,-9], test = voting_post1980[,-9],
                 cl = voting_pre1980[,9], k = 3, prob = TRUE)
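
A brief aside: the choice of k = 3 is arbitrary. As a minimal sketch (assuming, as knn() requires, that all predictor columns are numeric), knn.cv() from the same class package runs leave-one-out cross-validation on the training data, which lets us compare candidate values of k before predicting out of sample.

# Compare candidate values of k via leave-one-out cross-validation
# on the training data (the candidate values here are illustrative).


for (k in c(1, 3, 5, 7, 9)) {
  cv.pred <- knn.cv(voting_pre1980[,-9], voting_pre1980[,9], k = k)
  cat("k =", k, "LOOCV accuracy:", mean(cv.pred == voting_pre1980[,9]), "\n")
}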

We then evaluate the performance of our algorithm by comparing the actual votes to our predictions and calculating the percentage of accurately predicted results.

# We create a dataframe with our prediction.


prediction_KNN <- as.data.frame(model.knn)
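
Because we set prob = TRUE in the knn() call, the proportion of neighbours that voted for the winning class is attached to the result as an attribute. This can serve as a rough confidence measure for each classification.

# Proportion of the k neighbours that voted for the winning class.


head(attr(model.knn, "prob"))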

# Finally, we compare the actual votes to our predictions.


prediction_KNN <- cbind(voting_post1980$vote,prediction_KNN)
 
colnames(prediction_KNN) <- c("vote","prediction")
 
head(prediction_KNN)
##       vote prediction
## 1 majority    dissent
## 2 majority   majority
## 3 majority   majority
## 4 majority   majority
## 5 majority   majority
## 6 majority   majority
 
# We again calculate the number of correct assignments.


hits <- 0
for (row in 1:nrow(prediction_KNN)) {
  if (prediction_KNN$vote[row] == prediction_KNN$prediction[row]) {
    hits <- hits + 1
  }
}
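
As an aside, the loop above can be replaced by a single vectorised line: comparing the two columns yields a logical vector, and summing it counts the matches.

# Vectorised equivalent of the loop above.


hits <- sum(prediction_KNN$vote == prediction_KNN$prediction)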

 
correctness_KNN <- hits/nrow(prediction_KNN)

 
correctness_KNN
 ## [1] 0.6879536


With about 69% correctness, the K-Nearest Neighbour algorithm performs best of all the algorithms we have seen so far, although the improvement is modest. Let’s also take a look at the confusion matrix.

# Compare the results in a confusion matrix


table(prediction_KNN$vote,prediction_KNN$prediction)
##            dissent majority
##   dissent      139      320
##   majority     110      809

Like the Naive Bayes classifier, the K-Nearest Neighbour algorithm produces balanced predictions across both classes. For this particular task, we would thus likely choose this algorithm because it has the best overall performance.

Much of the work in prediction (and supervised machine learning generally) is about achieving the highest possible predictive accuracy by trying different algorithms and specifications. These efforts are useful and important. But correctness scores (and related success measures such as recall, precision, F-scores, area under the curve, etc.) should not become the sole target. In the end, predictions have to make sense: they have to be grounded in reliable theories of how the world works and should be informed by the quality of the data.
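
To illustrate, these measures can be computed directly from the confusion matrix above. A minimal sketch for our data, treating "dissent" as the positive class:

# Precision, recall, and F1 for the "dissent" class. Rows of the
# confusion matrix are actual votes, columns are predictions.


conf <- table(prediction_KNN$vote, prediction_KNN$prediction)
precision <- conf["dissent", "dissent"] / sum(conf[, "dissent"])
recall    <- conf["dissent", "dissent"] / sum(conf["dissent", ])
f1        <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F1 = f1)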

So while researchers and lawyers should eagerly apply the tools introduced in this lecture, they should also think about what they are doing and carefully reflect on the data they use. This will ensure that they make smart predictions rather than dumb ones.

Last update May 11, 2020.
