A relatively low-tech but highly effective information retrieval technique is the regular expression. Most of you already use keywords to find and retrieve information from full texts. Regular expressions, or regexes for short, are like keyword searches, but more powerful: rather than looking only for fixed words, regexes look for patterns.
Fortunately for lawyers and legal researchers, much in law is based on patterns: consistent document identification, standardized citations, formalized text structures and so forth.
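To make the difference between a keyword and a pattern concrete, here is a minimal sketch in base R. The sample sentences and case numbers are invented for illustration, and the pattern assumes a citation shaped like "Case C-<digits>/<digits>"; real citation formats vary.

```r
# Hypothetical sample sentences, invented for illustration only.
text <- c(
  "See Case C-123/45 for the original ruling.",
  "The approach was later confirmed in Case C-678/90.",
  "No case number appears in this sentence."
)

# Keyword search: fixed = TRUE treats the search term as a literal string,
# so only the sentence containing exactly this citation is found.
has_keyword <- grepl("Case C-123/45", text, fixed = TRUE)

# Regex search: match anything shaped like "Case C-<digits>/<digits>",
# so every sentence containing a citation of that form is found.
has_pattern <- grepl("Case C-[0-9]+/[0-9]+", text)

print(has_keyword)
print(has_pattern)
```

The keyword search finds only the one citation we typed in, while the pattern finds both citations without our having to know their numbers in advance.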
Uses of regular expressions
In legal data science, regexes serve two basic purposes.
- 1) Document Segmentation: For some applications we want to work with parts of a document rather than the entire text. For example, it is often useful to segment contracts or treaties into their constituent articles. Regexes help with that.
- 2) Information Retrieval: In other contexts we use regexes to identify and extract the information we are interested in. For example, we could extract all the dates, email addresses or citations in a document. Again, regexes help us accomplish this.
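Both purposes can be sketched in a few lines of base R. The treaty excerpt below is invented for illustration, and both patterns rest on simplifying assumptions: that every article begins with the word "Article" followed by a number, and that dates are written as day, month name, year.

```r
# Hypothetical treaty excerpt, invented for illustration only.
treaty <- paste(
  "Article 1 This Agreement enters into force on 12 March 2015.",
  "Article 2 Either Party may terminate it after 1 January 2020.",
  "Article 3 Notices shall be sent in writing."
)

# 1) Document segmentation: split at the whitespace that precedes each
# "Article <number>" heading. The lookahead (?=...) keeps the heading in
# the resulting piece and requires perl = TRUE.
articles <- strsplit(treaty, "\\s+(?=Article [0-9]+)", perl = TRUE)[[1]]

# 2) Information retrieval: extract every date written as
# "<day> <Month> <year>", e.g. "12 March 2015".
date_pattern <- "[0-9]{1,2} [A-Z][a-z]+ [0-9]{4}"
dates <- regmatches(treaty, gregexpr(date_pattern, treaty))[[1]]

print(articles)
print(dates)
```

Here `articles` holds one string per article and `dates` holds the two dates found in the text. Real documents are messier than this excerpt, so patterns like these usually need to be adjusted and checked against the actual texts.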
What we will do in this lesson
1. What is a Regex?
2. Integrating Regexes into R Code
3. Using Regexes for Text Segmentation
4. Using Regexes for Information Retrieval
- V. David Zvenyach, Coding For Lawyers, Chapter 1: Regular Expressions.
- David Colarusso, Pattern Recognition: Regular Expressions and You, Lawyerist, 12 January 2017.