Lesson 3

Regular Expressions

Introduction

A relatively low-tech but highly effective information retrieval technique is the regular expression. Most of you use keywords to find and retrieve information from full texts. Regular expressions, or regexes for short, are like keyword searches, but better. Rather than looking only for exact words, regexes look for patterns.

Fortunately for lawyers and legal researchers, much in law is based on patterns: consistent document identification, standardized citations, formalized text structures and so forth.

Uses of regular expressions

In legal data science, regexes serve two basic purposes.

  • 1) Document Segmentation: For some applications we want to work with parts of a document rather than the entire text. It is thus useful to segment contracts or treaties into their constituent articles, or cases into the arguments of the parties and those of the court. Regexes help with that.
  • 2) Information Retrieval: In other contexts we use regexes to identify and extract the information we are interested in. We may want to get all dates, email addresses or citations from a document. Again, regexes can help with that.
What we will do in this lesson

1. What is a Regex?
2. Integrating Regexes into R Code
3. Using Regexes for Text Segmentation
4. Using Regexes for Information Retrieval

Useful Resources

R Script

What is a Regex?
Imagine you have a text and want to extract all years that are mentioned in that document. A keyword search is not an option, because you do not know in advance which years appear. So what to do?

Regexes offer a solution, because they look for patterns rather than exact matches. These patterns can be combined to look for a diverse range of elements in a text.

The above example, all years mentioned in a document, can be reframed as looking for a pattern of four consecutive digits. The regex "\\d", for instance, matches every single digit in a document. The regex "\\d{4}" matches every run of four digits, which is an easy way to find all years (though it will also pick up any other four-digit number).

pattern <- "\\d{4}"

There are many other patterns you can look for:
  • - "\\w" matches any single word character (a letter, digit or underscore)
  • - "\\w+" matches runs of one or more word characters, i.e. whole words
  • - "[A-Z]\\w+" matches capitalized words
  • - "\\s" matches any whitespace character
  • - ...

There are lots of regexes. Mastery of regexes requires practice. A great website to test regexes, find support and try out different code is https://regexr.com.
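
To get a feel for the patterns listed above, it can help to run them against a short string. The following is a minimal sketch: the sentence in demo_text is invented for illustration, and find_all() is a hypothetical helper name, not a built-in R function.

```r
# Toy sentence, invented for illustration
demo_text <- "The Treaty of Rome was signed in 1957 by 6 states."

# Hypothetical helper: return every match of a pattern in a single string
find_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text))[[1]]
}

find_all("\\d", demo_text)          # every single digit
find_all("\\d{4}", demo_text)       # runs of four digits: "1957"
find_all("[A-Z]\\w+", demo_text)    # capitalized words: "The" "Treaty" "Rome"
length(find_all("\\s", demo_text))  # number of whitespace characters: 10
```

Note that "[A-Z]\\w+" also returns "The", simply because it starts the sentence. Patterns match form, not meaning.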
Integrating Regexes into R Code
R comes with different regex functions. Some find matched patterns, others replace them. Check out the different commands.
?grep
## Description - grep, grepl, regexpr, gregexpr and regexec search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
Let's work with an example. Say you want to extract the years from the following sentence:
# Example text:
sample_text <- "World War II lasted from 1939 to 1945."
In R, the backslash that introduces a regex shorthand must itself be escaped inside a string, so you write "\\" rather than "\". As mentioned previously, d stands for digit and {4} indicates the length of the sequence.

# So your regex would be "\\d{4}"


pattern <- "\\d{4}"

To extract a pattern, you first identify it through gregexpr().
pattern_matching <- gregexpr(pattern, sample_text)

Second, you extract the matched text through regmatches().
regmatches(sample_text, pattern_matching)[[1]]
 ## [1] "1939" "1945"
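
The same two-step approach also works on several strings at once, because gregexpr() and regmatches() both accept character vectors. The sketch below uses three invented sentences (the variable names are assumptions for illustration) and adds grepl(), which only reports whether a match exists.

```r
# Three invented example sentences
sentences <- c(
  "World War II lasted from 1939 to 1945.",
  "The Universal Declaration was adopted in 1948.",
  "This sentence mentions no year."
)

# grepl() returns TRUE/FALSE for each element: is there any match at all?
has_year <- grepl("\\d{4}", sentences)

# gregexpr() + regmatches() return a list with one element per sentence
year_matches <- regmatches(sentences, gregexpr("\\d{4}", sentences))

has_year               # TRUE TRUE FALSE
lengths(year_matches)  # 2 1 0
```

Sentences without a match yield an empty character vector, so the resulting list always has the same length as the input.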

Using Regexes for Text Segmentation

Aside from finding information, regular expressions are really useful to segment text either through substitution or splitting.

The command gsub() uses regexes to substitute elements.

gsub(pattern, "[enter year here]", sample_text)
 ## [1] "World War II lasted from [enter year here] to [enter year here]."

You can also split strings on that pattern using the strsplit() function.

strsplit(sample_text, pattern)
 ## [[1]]
 ## [1] "World War II lasted from " " to " "."
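
A related point worth knowing: gsub() replaces every match, while its sibling sub() replaces only the first one. A small sketch, reusing the example sentence from above (the replacement string "[year]" is arbitrary):

```r
sample_text <- "World War II lasted from 1939 to 1945."

# sub() stops after the first match
first_only <- sub("\\d{4}", "[year]", sample_text)

# gsub() replaces all matches
all_years <- gsub("\\d{4}", "[year]", sample_text)

first_only  # "World War II lasted from [year] to 1945."
all_years   # "World War II lasted from [year] to [year]."
```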

Such splitting based on patterns is useful when we want to split a contract, statute or treaty into its subcomponents. As an illustration, let's work again with the Universal Declaration of Human Rights.

# Let's repeat the part of lesson 2 and load the Universal Declaration of Human Rights.
library(pdftools)
 


human_rights  <- pdf_text("https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf")
# Recall that the pdf_text() function returns an object separated by page. We want to turn this into a single text object.

human_rights <- paste(human_rights, collapse = " ") 

Fortunately, legal documents are often quite uniform. We can exploit that feature to segment them.

Take a look at the Universal Declaration text. You will notice that it is segmented into articles with article headers. These article headers follow the pattern of "Article" + space + one or more digits. The regex "Article\\s\\d+" should thus allow us to split the Declaration into articles effectively. If you try this, though, you will see that it misses the first article, because unlike the others, it is formatted with a Roman numeral as "Article I". So a comprehensive regex looks for the word "Article" + space + one or more digits OR "Article" + space + a capital letter. We thus use the regex "Article\\s\\d+|Article\\s[A-Z]".
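
Before running a pattern on the full document, it is often easier to check it on a short string. The sketch below uses an invented toy text (not the actual Declaration) to compare the digits-only regex with the combined one. All variable names are assumptions for illustration.

```r
# Invented toy text that mimics the header formatting described above
toy_text <- "Article I All are equal. Article 2 No distinction. Article 3 Right to life."

digits_only <- "Article\\s\\d+"
combined    <- "Article\\s\\d+|Article\\s[A-Z]"

found_digits   <- regmatches(toy_text, gregexpr(digits_only, toy_text))[[1]]
found_combined <- regmatches(toy_text, gregexpr(combined, toy_text))[[1]]

found_digits    # "Article 2" "Article 3"
found_combined  # "Article I" "Article 2" "Article 3"
```

Keep in mind that "Article\\s[A-Z]" matches "Article" followed by any capital letter, so in other documents it could also catch running text in which "Article" is followed by a capitalized word.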

We can then follow a three-step structure: identifying, extracting and splitting based on the regex pattern.

# (1) identify our pattern in the text

article_matcher <- gregexpr("Article\\s\\d+|Article\\s[A-Z]", human_rights)

# (2) extract the pattern

article_headers <- regmatches(human_rights, article_matcher)[[1]]
# (3) and then split the text at the pattern

article_text <- strsplit(human_rights,"Article\\s\\d+|Article\\s[A-Z]")[[1]]

Finally, we want to create a dataframe that has the article header in column 1 and its text in column 2. Note, however, that the text before the first article header (the preamble) also becomes an element of article_text, so article_text has one more element than article_headers.

# Hence, we have to add another header (preamble) before we can match the headers and text in a dataframe.

article_headers <- c("Preamble",article_headers)
 
# At last, we can combine the headers and text in a dataframe.

article_table <- data.frame(article_headers,article_text)

 head(article_table)
 ##
  article_headers article_text
1 Preamble Universal Declaration of Human Rights\nPreamble\nWhereas recognition of the inherent dignity and of the ...
2 Article I \nAll human beings are born free and equal in dignity and rights. They are\nendowed with reason and cons...
3 Article 2 \nEveryone is entitled to all the rights and freedoms set forth in this Declaration,\nwithout distinction...
4 Article 3 \nEveryone has the right to life, liberty and security of person.\n
5 Article 4 \nNo one shall be held in slavery or servitude; slavery and the slave trade shall be\nprohibited in all their forms.\n
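
The whole workflow can be tried end to end on a small string without downloading anything. This is a minimal sketch under stated assumptions: toy_treaty is an invented text, the toy_ variable names are hypothetical, and trimws() is added only to tidy the surrounding whitespace.

```r
# Invented miniature "treaty" with a preamble and two articles
toy_treaty <- "Preamble text. Article I First rule. Article 2 Second rule."
article_pattern <- "Article\\s\\d+|Article\\s[A-Z]"

# (1) + (2) identify and extract the headers
toy_headers <- regmatches(toy_treaty, gregexpr(article_pattern, toy_treaty))[[1]]

# (3) split the text at the headers
toy_text <- strsplit(toy_treaty, article_pattern)[[1]]

# Sanity check: the preamble makes the text one element longer than the headers
stopifnot(length(toy_text) == length(toy_headers) + 1)

# Combine into a dataframe, trimming leading and trailing whitespace
toy_table <- data.frame(
  header = c("Preamble", toy_headers),
  text = trimws(toy_text),
  stringsAsFactors = FALSE
)
toy_table
```

The length check is a cheap safeguard: if the pattern misses a header, the two vectors no longer line up and data.frame() would either fail or pair headers with the wrong text.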
 

Using Regexes for Information Retrieval

The greatest research promise of regexes, however, probably lies in information retrieval. If there is a consistent pattern in your text, it is very likely that you can write a regex to extract it.

Examples include:

1) Numbers (years, telephone numbers, page numbers, ...)
2) Names (Names of persons, laws, entities,...)
3) Citations (court decisions, academic works, ...)

In the Universal Declaration, we might be interested in extracting all named entities. Names are typically capitalized and consist of two or more words, so a simple starting point is to look for pairs of capitalized words.


# The leading space is part of the pattern, so each match starts with a space
pattern_matching <- gregexpr(" [A-Z][a-z]+\\s[A-Z][a-z]+", human_rights)
 

regmatches(human_rights, pattern_matching)[[1]]

 ##
[1] " Human Rights" " United Nations" " Member States" " United Nations" " General Assembly"
[6] " Universal Declaration" " Human Rights" " Member States" " United Nations" " United Nations"
[11] " United Nations"

Importantly, the results from regexes are only as good as the clarity and consistency of the underlying pattern. Here we get some false positives: "Human Rights", for example, is not an entity of its own but a fragment of the title "Universal Declaration of Human Rights".
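
One way to refine such a pattern is to allow names of more than two words. The sketch below is an illustration on an invented sentence, not a tested recipe for the Declaration: it wraps the second word in a group and repeats it with +, so that a run of capitalized words is returned as one match.

```r
# Invented example sentence
toy_sentence <- "Today the United Nations General Assembly met with Member States."

two_words   <- "[A-Z][a-z]+\\s[A-Z][a-z]+"      # exactly two capitalized words
two_or_more <- "[A-Z][a-z]+(\\s[A-Z][a-z]+)+"   # two or more capitalized words

pairs <- regmatches(toy_sentence, gregexpr(two_words, toy_sentence))[[1]]
runs  <- regmatches(toy_sentence, gregexpr(two_or_more, toy_sentence))[[1]]

pairs  # "United Nations" "General Assembly" "Member States"
runs   # "United Nations General Assembly" "Member States"
```

Whether the longer match is the better one depends on the research question. Here "United Nations General Assembly" is arguably a single entity, but the same pattern would also merge two unrelated names that happen to stand next to each other.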

Regexes can also be used for fuzzy searches. Say you are interested in all words connected with "human". So you write a regex that captures all compound terms that start with human:


pattern_matching <- gregexpr(" human\\s[a-z]+", human_rights)
 

unique(regmatches(human_rights, pattern_matching)[[1]])

 ## [1] " human family"  " human rights"  " human beings"  " human person"  " human dignity"
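
Instead of only listing the unique compounds, you may also want to count them and to catch capitalized variants such as "Human rights" at the start of a sentence. A minimal sketch on an invented text, using the ignore.case argument of gregexpr() and base R's table():

```r
# Invented example text
toy_text <- "Human rights protect human dignity. All human beings have human rights."

# ignore.case = TRUE lets the pattern match "Human" as well as "human"
matches <- regmatches(
  toy_text,
  gregexpr("human\\s[a-z]+", toy_text, ignore.case = TRUE)
)[[1]]

# Normalize the case, then count how often each compound occurs
counts <- table(tolower(matches))
counts
# "human rights" occurs twice; "human beings" and "human dignity" once each
```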

Dataset

Sample of cases from the Canadian Supreme Court in txt format. [Download]

Exercises

Go back to the Canadian Supreme Court data.

Imagine you work for a tech startup and have been asked to extract some metadata (data about the case) from the full text of Supreme Court judgments. Attempt to create a set of regular expressions to extract metadata from these cases relating to:

  • a) the official case citation
  • b) the date of the decision
  • c) the names of the disputants

Hint: Start with a sample case before applying your code to the entire dataset.
