In law, context often matters, so a word-frequency representation may lack the detail needed to assess the similarity between documents.
Instead, we can represent text through its constituent characters. The phrase “legal data”, for example, split into overlapping 5-character components, would be “legal”, “egal ”, “gal d”, “al da”, “l dat” and “ data”. By calculating the Jaccard distance between the sets of 5-character components of two texts, we arrive at a more fine-grained comparison that takes word order into account.
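To make the idea concrete, here is a minimal sketch in base R of how a string can be split into 5-character components and how a Jaccard distance between two such sets can be computed. The helper names `char_qgrams` and `jaccard_distance` are hypothetical and chosen only for illustration; they are not part of any package.

```r
# Hypothetical helper: split a string into its unique, overlapping
# q-character components (q-grams).
char_qgrams <- function(text, q = 5) {
  n <- nchar(text)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  unique(substring(text, starts, starts + q - 1))
}

# Hypothetical helper: Jaccard distance between two sets,
# i.e. 1 - (size of intersection / size of union).
jaccard_distance <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

a <- char_qgrams("legal data")
b <- char_qgrams("data legal")

print(a)                       # the six components listed above
print(jaccard_distance(a, b))  # 1 - 1/11, roughly 0.91
```

The two phrases contain exactly the same words, so a unigram comparison would treat them as identical. Their 5-character components share only “legal”, which is why the distance here is close to 1.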
Note, however, that this is more computationally intensive, because each text yields far more distinct 5-character components than distinct words. It will thus take considerably longer than calculating Jaccard distances between unigram representations of text.
We use the package “stringdist” to compute the Jaccard distances between treaties represented as 5-character components.
# Load packages
library(stringdist)
# Create a matrix of pairwise Jaccard distances based on 5-character components
distance_matrix_5gram <- stringdistmatrix(treaty_texts$text, treaty_texts$text, method = "jaccard", q = 5)
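Because the comparison is computationally intensive, it may help to know that `stringdistmatrix` can also be called with a single vector, in which case it returns a `dist` object holding only one triangle of the symmetric matrix. The sketch below uses a small made-up vector, `toy_texts`, as a stand-in for the treaty texts; whether this form is noticeably faster on the full corpus is an assumption that would need to be checked.

```r
library(stringdist)

# Made-up stand-in for treaty_texts$text (illustration only)
toy_texts <- c("legal data", "data legal", "legal data")

# One-argument form: pairwise distances within a single vector,
# returned as a 'dist' object (one triangle only)
toy_dist <- stringdistmatrix(toy_texts, method = "jaccard", q = 5)

# Convert to a full square matrix when one is needed downstream
toy_matrix <- as.matrix(toy_dist)
print(round(toy_matrix, 2))
```

In the resulting matrix, identical texts have a distance of 0, while the two reordered phrases are far apart even though they contain the same words.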