Distance Measures II

In law, context often matters, and as such a word frequency representation may not have enough detail to assess the similarity between documents. 

Instead, we can represent text through its constituent characters. The term “legal data”, represented through its 5-character components, for example, would be “legal”, “egal ”, “gal d”, “al da”, “l dat” and “ data”. By calculating the Jaccard distance between these 5-character components, we can arrive at a more fine-grained comparison of text that takes word order into account. 
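To make the idea concrete, the following is a minimal sketch in base R of how 5-character components and their Jaccard distance could be computed by hand. The helper names `char_qgrams` and `jaccard_distance` are hypothetical and introduced only for illustration; they are not part of any package, and the sketch assumes the Jaccard distance is taken over the sets of unique components.

```r
# Split a string into its overlapping q-character components (unique set)
char_qgrams <- function(text, q = 5) {
  n <- nchar(text)
  if (n < q) return(character(0))
  starts <- seq_len(n - q + 1)
  unique(substring(text, starts, starts + q - 1))
}

# Jaccard distance: 1 minus (size of intersection / size of union)
jaccard_distance <- function(a, b, q = 5) {
  grams_a <- char_qgrams(a, q)
  grams_b <- char_qgrams(b, q)
  union_size <- length(union(grams_a, grams_b))
  if (union_size == 0) return(0)  # two strings shorter than q: treat as identical
  1 - length(intersect(grams_a, grams_b)) / union_size
}

# The six components listed above
char_qgrams("legal data")

# Same two words in reverse order: only "legal" is shared
jaccard_distance("legal data", "data legal")
```

The two strings in the last call contain exactly the same words, so a word frequency representation would treat them as identical. At the level of 5-character components they share only “legal” out of 11 distinct components, which gives a distance of 1 - 1/11.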

Note, however, that this is more computationally intensive. It will thus take considerably longer than calculating Jaccard distances between unigram representations of text. 

We use the package “stringdist” to compute the Jaccard distances between treaties represented as 5-character components.

# Load packages

library(stringdist)

# Create a distance matrix

distance_matrix_5gram <- stringdistmatrix(treaty_texts$text,
                                          treaty_texts$text,
                                          method = "jaccard",
                                          q = 5)
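The call above can be tried out on a small example before running it on a full corpus. The sketch below assumes a toy `treaty_texts` data frame: the three texts and the `name` column are invented for illustration, and only the `text` column comes from the code above. It also shows the single-argument form of `stringdistmatrix`, which returns a `dist` object holding each pair only once and may be worth considering given the computational cost.

```r
library(stringdist)

# Hypothetical stand-in for treaty_texts (texts and names are made up)
treaty_texts <- data.frame(
  name = c("treaty_a", "treaty_b", "treaty_c"),
  text = c("the parties shall protect investments",
           "the parties shall promote investments",
           "this agreement enters into force upon signature"),
  stringsAsFactors = FALSE
)

# Full square matrix, as in the code above
distance_matrix_5gram <- stringdistmatrix(treaty_texts$text,
                                          treaty_texts$text,
                                          method = "jaccard",
                                          q = 5)

# Label rows and columns so that pairs are easy to read off
dimnames(distance_matrix_5gram) <- list(treaty_texts$name, treaty_texts$name)
print(round(distance_matrix_5gram, 2))

# Single-argument form: returns a 'dist' object (each pair stored once)
distance_dist_5gram <- stringdistmatrix(treaty_texts$text,
                                        method = "jaccard",
                                        q = 5)
```

In this toy example the first two texts differ by a single word and should end up much closer to each other than either is to the third. A `dist` object can be passed directly to functions such as `hclust`.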

Last update May 11, 2020.
