Introduction - Data Science for Lawyers

Similarity measures are powerful tools for investigating legal content. Law, whether in contracts, statutes, treaties or cases, is rarely built from scratch. Instead, legal language draws on precedent, follows established meanings, abides by formalistic drafting conventions, and is often modeled on boilerplate texts. Similarity measures exploit these patterns. They quantify the degree to which a contract or treaty follows a model and help identify unique passages; they help identify court decisions that address similar legal issues, track legal innovation, or reveal sub-groupings of documents within a large corpus. In short, similarity measures are versatile tools for automatically investigating legal texts.

An Example: Mapping Investment Treaties

Similarity analysis brought me to legal data science in the first place. Together with Dmitriy Skougarevskiy, an economist, I began investigating the universe of bilateral investment treaties through simple similarity measures. We showcase our results on the interactive website www.mappinginvestmenttreaties.com.

What struck me was how much a simple measure, namely how different the texts of agreements are, could reveal. We found that the investment treaty universe is marked by power asymmetries, with developed states acting as rule-makers and developing states acting as rule-takers. We also traced changes in the underlying investment policy programs of states across the world and quantified the degree of language innovation in the recently concluded Trans-Pacific Partnership Agreement.

What we do in this lesson

In this lesson, we provide you with all the necessary tools to conduct similarity analyses in the same spirit.

1. Preparing Metadata
2. Creating a Document-Term Matrix
3. Creating a Similarity Matrix
4. Visualizing Similarity Through Heatmaps
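
To give a rough feel for how steps 2 and 3 fit together, here is a minimal Python sketch. It is an illustration under stated assumptions, not the lesson's own code: the three "treaties" are invented one-sentence texts, the function names (tokenize, document_term_matrix, cosine_similarity, similarity_matrix) are hypothetical, and cosine similarity on raw word counts is just one of several possible similarity measures. Only the Python standard library is used; the heatmap from step 4 is not shown.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: invented clauses, not real treaty text.
docs = {
    "treaty_a": "Each party shall accord fair and equitable treatment to investments.",
    "treaty_b": "Each party shall accord fair and equitable treatment to investors.",
    "treaty_c": "Disputes shall be submitted to arbitration.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, rows), where rows maps each document name
    to a list of term counts aligned with the vocabulary."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    rows = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, rows


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(rows):
    """Pairwise cosine similarity between all documents, as a nested dict."""
    names = list(rows)
    return {a: {b: cosine_similarity(rows[a], rows[b]) for b in names} for a in names}


vocabulary, rows = document_term_matrix(docs)
sim = similarity_matrix(rows)

# Print the similarity matrix as a small text table.
names = list(sim)
print(" " * 10 + "".join(f"{n:>10}" for n in names))
for a in names:
    print(f"{a:<10}" + "".join(f"{sim[a][b]:>10.2f}" for b in names))
```

In this toy corpus the two near-identical clauses score far higher against each other than either does against the arbitration clause, which is the kind of pattern a heatmap makes visible at a glance across hundreds of documents.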

Useful Resources

Wolfgang Alschner and Dmitriy Skougarevskiy, “Mapping the Universe of International Investment Agreements”, Journal of International Economic Law, Vol. 19, No. 3, 2016, pp. 561-588. [SSRN Version]

Wolfgang Alschner, “Sense and Similarity: Automating Legal Text Comparison”, in: Whalen (ed.) Computational Legal Studies: The Promise and Challenge of Data-Driven Legal Research. [SSRN Version]

Last updated May 11, 2020.