Distance Measures I - Data Science for Lawyers

When the textual similarity of documents is discussed, it is often done by referring to their “distance”. The intuition is that similar documents will be “closer” and have a smaller distance, while less similar documents will be “farther” and have a greater distance. Think of the word frequency counts of the DTM as the coordinates of each document in a vector space. Where these counts are identical, the two documents will occupy the same coordinates; if some counts differ and others are the same, the two documents will be further apart, but not too distant; yet if the counts are very different and many terms exist in document A but not B, the coordinates of these two documents will be very far apart.

Statistical distance measures quantify this difference. They calculate how far apart two documents are based on their term frequency counts.
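To make the idea of counts as coordinates concrete, here is a minimal sketch. The three documents and their term counts are invented for illustration, and Euclidean distance is used only because it matches the geometric picture most directly; it is not the measure we use below.

```r
# Hypothetical term counts for three documents over the same three terms.
counts <- rbind(
  doc_a = c(investment = 4, dispute = 2, tariff = 0),
  doc_b = c(investment = 4, dispute = 2, tariff = 0),  # identical to doc_a
  doc_c = c(investment = 0, dispute = 1, tariff = 5)   # very different counts
)

# Euclidean distance treats each row as a point in "term space".
count_distances <- as.matrix(dist(counts, method = "euclidean"))

# Identical counts give a distance of 0; very different counts give a large one.
print(count_distances)
```

Documents doc_a and doc_b share the same coordinates, so their distance is 0, while doc_c sits far away from both.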

There are many different distance measures. We are now using a unigram representation of the texts. In that context, cosine similarity is often used to measure how close documents are. Here, however, we opt for a simpler binary distance measure that ignores how often a word occurs and only checks whether it is present. The distance between two documents is the share of words that appear in only one of them, out of all words that appear in at least one. This distance measure is known as Jaccard distance. Jaccard distances have the useful property of always lying between 0 (no distance, perfect similarity) and 1 (no similarity).
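The calculation can be checked by hand on a small example. The sketch below assumes a made-up two-document matrix (toy_dtm and its terms are hypothetical) and compares a manual Jaccard calculation with base R's dist() using method = "binary".

```r
# Hypothetical document-term matrix: 2 documents, 5 terms.
toy_dtm <- matrix(
  c(3, 1, 0, 2, 0,
    1, 0, 4, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc_a", "doc_b"),
                  c("treaty", "investor", "tax", "state", "tariff"))
)

# Reduce counts to presence/absence: frequency no longer matters.
present <- toy_dtm > 0

# Terms found in both documents, and terms found in at least one of them.
shared <- sum(present["doc_a", ] & present["doc_b", ])
either <- sum(present["doc_a", ] | present["doc_b", ])

# Jaccard distance: 1 minus the share of terms the documents have in common.
jaccard_manual <- 1 - shared / either

# The same value computed by dist() with method = "binary".
jaccard_dist <- as.matrix(dist(toy_dtm, method = "binary"))["doc_a", "doc_b"]
```

Here the documents share two terms ("treaty" and "state") out of four that occur in at least one of them, so both calculations give 0.5. The term "tariff" occurs in neither document and does not affect the result.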

We thus create a distance matrix using our DTM as input.

# Create binary distance matrix checking whether or not a word appears in a document.

distance_matrix <- as.matrix(dist(dtm, method="binary"))

We can study the distance matrix in its own right. For instance, take a look at the min() and max() values of the matrix to determine which treaties are closest and furthest apart. Ignore the diagonal when doing so: it holds the distance of each document to itself, which is always 0. In other words, the matrix tells us directly how similar any two documents are.
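One possible way to find the closest and most distant pairs is sketched below. It assumes a small invented matrix (toy_treaties) in place of the real DTM. The diagonal is blanked out because each document has a distance of 0 to itself, and the upper triangle is blanked out because it merely mirrors the lower one.

```r
# Hypothetical stand-in for the DTM: 4 documents, 6 terms (presence/absence).
toy_treaties <- matrix(
  c(1, 1, 1, 0, 0, 0,
    1, 1, 0, 0, 0, 0,
    0, 0, 1, 1, 1, 0,
    0, 0, 0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("treaty_", 1:4), paste0("term_", 1:6))
)
toy_distances <- as.matrix(dist(toy_treaties, method = "binary"))

# Keep only the lower triangle so each pair of documents is counted once.
pairs_only <- toy_distances
pairs_only[upper.tri(pairs_only, diag = TRUE)] <- NA

# Row and column positions of the smallest and largest remaining distances.
closest  <- which(pairs_only == min(pairs_only, na.rm = TRUE), arr.ind = TRUE)
farthest <- which(pairs_only == max(pairs_only, na.rm = TRUE), arr.ind = TRUE)

# Translate positions into document names (one row per pair).
closest_pairs  <- cbind(rownames(pairs_only)[closest[, "row"]],
                        colnames(pairs_only)[closest[, "col"]])
farthest_pairs <- cbind(rownames(pairs_only)[farthest[, "row"]],
                        colnames(pairs_only)[farthest[, "col"]])
```

Note that several pairs can tie for the largest distance: any two documents without a single word in common have a Jaccard distance of exactly 1.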

Often, however, it is easier to visualize a distance matrix. But before we do so, let’s look at a more sophisticated way of calculating distance.

Last update May 11, 2020.