Introduction - Data Science for Lawyers

Regular expressions are a relatively low-tech but highly effective information retrieval technique. Most of you already use keywords to find and retrieve information from full texts. Regular expressions, or regexes for short, are like keyword searches, but better: rather than looking only for words, regexes look for patterns.
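To make the difference between a keyword and a pattern concrete, here is a minimal sketch. It is written in Python purely for illustration (the lesson itself works in R, where the same pattern syntax largely carries over), and the sample text and the date pattern are assumptions invented for this example, not a complete date matcher.

```python
import re

# Sample text invented for illustration.
text = (
    "The hearing took place on 12 January 2017. "
    "A follow-up date was set for 8 May 2020."
)

# Keyword search: finds only the literal string we already know to look for.
keyword_hits = text.count("January")

# Pattern search: one or two digits, a capitalized word, then four digits.
# This describes the *shape* of a date rather than any particular date.
date_pattern = re.compile(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b")
pattern_hits = date_pattern.findall(text)

print(keyword_hits)
print(pattern_hits)
```

The keyword search finds only the one date whose month we typed in, while the pattern picks up both dates without our knowing either in advance.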

Fortunately for lawyers and legal researchers, much in law is based on patterns: consistent document identification, standardized citations, formalized text structures and so forth.

Uses of regular expressions

In legal data science, regexes serve two basic purposes.

1) Document Segmentation: For some applications we want to work with parts of a document rather than the entire text. For example, we might segment a contract or treaty into its constituent articles. Regexes help with that.
2) Information Retrieval: In other contexts we use regexes to identify and extract the information we are interested in. For example, we could extract all the dates, email addresses or citations in a document. Again, regexes help us accomplish this.
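Both purposes can be sketched in a few lines. The example below is a rough illustration in Python (an assumption made for this sketch; the lesson's own code is in R). The treaty excerpt, the email address and all three patterns are hypothetical and deliberately simple; real documents usually need more careful patterns. Splitting on a zero-width lookahead as done here assumes Python 3.7 or later.

```python
import re

# Hypothetical treaty excerpt, invented for illustration.
treaty = (
    "Article 1 Each Party shall notify the other by 1 March 2019. "
    "Article 2 Notices go to registry@example.org. "
    "Article 3 This Treaty enters into force on 15 June 2019."
)

# 1) Document segmentation: split just before each "Article <number>" heading.
# The lookahead (?=...) matches a position, so the heading stays with its article.
parts = re.split(r"(?=Article \d+)", treaty)
articles = [part.strip() for part in parts if part.strip()]

# 2) Information retrieval: extract dates and email addresses.
dates = re.findall(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b", treaty)
emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", treaty)

for article in articles:
    print(article)
print(dates)
print(emails)
```

Segmentation gives us a list with one entry per article, which we can then analyze separately; retrieval gives us lists of the dates and email addresses found anywhere in the text.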

What we will do in this lesson

1. What is a Regex?
2. Integrating Regexes into R Code
3. Using Regexes for Text Segmentation
4. Using Regexes for Information Retrieval

Useful Resources

V. David Zvenyach, Coding For Lawyers, Chapter 1: Regular Expressions.
David Colarusso, Pattern Recognition: Regular Expressions and You, Lawyerist, 12 January 2017.

Last update May 8, 2020.