Lesson 3

Introduction

Regular expressions are a relatively low-tech but highly effective information retrieval technique. Most of you already use keywords to find and retrieve information from full texts. Regular expressions, or regexes for short, are like keyword searches, but better. Rather than looking only for exact words, regexes look for patterns.

Fortunately for lawyers and legal researchers, much in law is based on patterns: consistent document identification, standardized citations, formalized text structures and so forth.

Uses of regular expressions

In legal data science, regexes serve two basic purposes.

1) Document Segmentation: For some applications we want to work with parts of a document rather than the entire text. For example, it is often useful to segment contracts or treaties into their constituent articles. Regexes help with that.
2) Information Retrieval: In other contexts we use regexes to identify and extract information we are interested in. For example, we could extract all the dates, email addresses or citations in a document. Again, regexes help us accomplish this.

What we will do in this lesson

1. What is a Regex?
2. Integrating Regexes into R Code
3. Using Regexes for Text Segmentation
4. Using Regexes for Information Retrieval

Useful Resources

V. David Zvenyach, Coding For Lawyers, Chapter 1: Regular Expressions.
David Colarusso, Pattern Recognition: Regular Expressions and You, Lawyerist, 12 January 2017.

R Script

Agenda

1. What is a Regex?
2. Integrating Regexes into R Code
3. Using Regex for Text Segmentation
4. Using Regexes for Information Retrieval

What is a Regex?

Imagine you have a text and want to extract all the years mentioned in it. A keyword search is not an option, because you would have to search for every possible year separately. So what is there to do?

Regexes offer a solution, because they look for patterns rather than exact matches. It is possible to mix and match these patterns, which helps us extract a wide variety of elements from a text.

The above example, all years mentioned in a document, can be reframed as looking for a pattern of four consecutive digits. The regex "\\d", for instance, matches any single digit. The regex "\\d{4}" matches any run of four consecutive digits. This is an easy strategy for finding the years listed in a document, although it will also pick up other four-digit numbers.

pattern <- "\\d{4}"
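One caveat is that "\\d{4}" also matches the first four digits of a longer number. A minimal sketch of one way to tighten the pattern, assuming years appear as separate tokens, is to wrap it in word boundaries ("\\b"). The sentence and the helper name extract_all are made up for illustration.

```r
# Made-up sentence: one year plus one longer number
text <- "The 1948 declaration was file number 1234567."

# Hypothetical helper: return all matches of a pattern in a single string
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

loose  <- extract_all("\\d{4}", text)        # any four consecutive digits
strict <- extract_all("\\b\\d{4}\\b", text)  # exactly four digits between word boundaries

print(loose)   # also contains the start of the longer number
print(strict)  # only the stand-alone four-digit number
```

The stricter pattern still cannot tell a year from any other stand-alone four-digit number, so the results are worth checking by eye.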

There are many other patterns, too. You can look for:
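- "\\w" matches a single word character (a letter, a digit or an underscore)
- "\\w+" matches whole words, that is, runs of one or more word characters
- "[A-Z]\\w+" matches words that begin with a capital letter
- "\\s" matches a single whitespace character (space, tab or line break)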
- ...
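To see what these building blocks pick up, here is a small sketch on a made-up sentence. The helper name extract_all is hypothetical, and perl = TRUE is an assumption chosen so that the character range [A-Z] behaves the same across locales.

```r
# Made-up sentence for illustration
text <- "The General Assembly met in 1948."

# Hypothetical helper: return all matches of a pattern in a single string
extract_all <- function(pattern, x) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

words       <- extract_all("\\w+", text)        # runs of word characters, digits included
capitalized <- extract_all("[A-Z]\\w+", text)   # words starting with a capital letter
n_spaces    <- length(extract_all("\\s", text)) # number of whitespace characters

print(words)
print(capitalized)
print(n_spaces)
```

Note that "1948" is returned by "\\w+", because digits count as word characters.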

There are lots of regexes. Mastery of regexes requires practice. A great website to test regexes, find support, and try out different code is https://regexr.com.

Integrating Regexes into R Code

R comes with different regex functions. Some find matched patterns, others replace them. Check out the different commands.

?grep

## Description - grep, grepl, regexpr, gregexpr and regexec search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
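The difference in "format and amount of detail" is easiest to see side by side. The sketch below runs several of these functions on a made-up character vector; the vector and the variable names are illustrative only.

```r
# Three made-up strings: one year, no year, two years
docs    <- c("Signed in 1948.", "No year here.", "Revised 1966 and 1976.")
pattern <- "\\d{4}"

which_match <- grep(pattern, docs)                # indices of elements containing a match
matching    <- grep(pattern, docs, value = TRUE)  # the matching elements themselves
has_match   <- grepl(pattern, docs)               # one TRUE/FALSE per element
first_pos   <- regexpr(pattern, docs)             # start of the first match, -1 if none
all_pos     <- gregexpr(pattern, docs)            # list with the starts of all matches

print(which_match)
print(has_match)
print(as.integer(first_pos))
print(as.integer(all_pos[[3]]))
```

grep() and grepl() only say whether an element matches, while regexpr() and gregexpr() also say where, which is what regmatches() needs to pull out the matched text.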

Let's work with an example. Say you want to extract the years from the following sentence:

# Example text:

sample_text <- "World War II lasted from 1939 to 1945."

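In R, a regex is written as a character string. Because the backslash is itself a special character in R strings, it has to be doubled: "\\d" in R code is the regex \d. As mentioned previously, d stands for digit and {4} indicates the length of the sequence.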

# So your regex would be "\\d{4}"

pattern <- "\\d{4}"
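If the doubled backslash looks odd, the following sketch shows that it only affects how the string is typed. The raw-string syntax in the second half is an assumption about your setup: it requires R 4.0.0 or later.

```r
pattern <- "\\d{4}"

# The doubled backslash is only how the string is typed in R code.
# The string itself holds a single backslash, so it is five characters long.
n_chars <- nchar(pattern)
print(n_chars)
writeLines(pattern)   # prints the regex as the regex engine receives it

# Assumption: R 4.0.0 or later, where raw strings avoid the doubling
raw_pattern <- r"(\d{4})"
print(identical(pattern, raw_pattern))
```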

To extract a pattern, you first locate its matches with gregexpr(), which records where in the text each match starts and how long it is.

pattern_matching <- gregexpr(pattern, sample_text)

Second, you extract the matched text with regmatches().

regmatches(sample_text, pattern_matching)[[1]]

 ## [1] "1939" "1945"

Using Regex for Text Segmentation

Aside from finding information, regular expressions help segment text through either substitution or splitting.

The command gsub() uses regexes to substitute elements.

gsub(pattern, "[enter year here]", sample_text)

 ## [1] "World War II lasted from [enter year here] to [enter year here]."
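Two variations are worth knowing, sketched below on the same sentence: sub() replaces only the first match, and a capture group in parentheses lets the replacement reuse the matched text through "\\1". The placeholder "[year]" and the angle-bracket tags are arbitrary choices for illustration.

```r
sample_text <- "World War II lasted from 1939 to 1945."
pattern     <- "\\d{4}"

first_only <- sub(pattern, "[year]", sample_text)   # replaces the first match only
all_years  <- gsub(pattern, "[year]", sample_text)  # replaces every match

# Parentheses capture the match; "\\1" inserts it back into the replacement
tagged <- gsub("(\\d{4})", "<\\1>", sample_text)

print(first_only)
print(all_years)
print(tagged)
```

Tagging rather than overwriting is handy when you want to mark up a text without losing the original values.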

You can also split strings on that pattern using the strsplit() function.

strsplit(sample_text, pattern)

 ## [[1]]
 ## [1] "World War II lasted from " " to "                      "."
 
Using patterns to split a text is useful to break down a contract, statute or treaty into its various subcomponents. As an illustration, let's once again work with the Universal Declaration of Human Rights.

# Let's repeat the part of lesson 2 and load the Universal Declaration of Human Rights.

library(pdftools)

human_rights  <- pdf_text("https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf")

# Recall that the pdf_text() function returns an object separated by page. We want to turn this into a single text object.

human_rights <- paste(human_rights, collapse = " ") 
 
Fortunately, legal documents are often quite uniform. We can exploit that feature to segment them into subcomponents.
Take a look at the Universal Declaration. You will notice that it is divided into articles, each with its own header. These headers follow the pattern "Article" + space + one or more digits. As a result, the regex "Article\\s\\d+" should allow us to split the Declaration into articles. If you try this, though, you will see that it misses the first article because, unlike the others, its header is formatted with a Roman numeral as "Article I". So a comprehensive regex looks for the word "Article" + space + one or more digits OR "Article" + space + a capital letter. We thus use the regex "Article\\s\\d+|Article\\s[A-Z]".

We can then follow a three-step process of identifying, extracting and splitting based on this pattern.
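Before running this on the full Declaration, it can help to compare the candidate patterns on a short made-up text. The sketch below does that; the toy string is invented, and the grouped pattern "Article\\s(\\d+|[A-Z])" is simply a more compact way of writing the same alternation.

```r
# Made-up miniature "treaty" with one Roman-numeral header
toy <- "Preamble text. Article I First rule. Article 2 Second rule. Article 3 Third rule."

digits_only <- regmatches(toy, gregexpr("Article\\s\\d+", toy))[[1]]
with_roman  <- regmatches(toy, gregexpr("Article\\s\\d+|Article\\s[A-Z]", toy))[[1]]
grouped     <- regmatches(toy, gregexpr("Article\\s(\\d+|[A-Z])", toy))[[1]]

print(digits_only)  # misses the Roman-numeral header
print(with_roman)   # finds all three headers
print(identical(with_roman, grouped))
```

Keep in mind that "[A-Z]" matches a single capital letter only. It would capture just the first letter of a longer Roman numeral, and it would also fire wherever the word "Article" is followed by any capitalized word in running text.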

# (1) identify our pattern in the text

article_matcher <- gregexpr("Article\\s\\d+|Article\\s[A-Z]", human_rights)

# (2) extract the pattern

article_headers <- regmatches(human_rights, article_matcher)[[1]]

# (3) and then split the text at the pattern

article_text <- strsplit(human_rights,"Article\\s\\d+|Article\\s[A-Z]")[[1]]
 
Finally, we want to create a dataframe that has the number of the article in column 1 and its text in column 2. Note, however, that the text before the first header (the title and preamble) forms a segment of its own, so article_text has one more element than article_headers.

# Hence, we have to add another header (preamble) before we can match the headers and text in a dataframe.

article_headers <- c("Preamble",article_headers)

# At last, we can combine the headers and text in a dataframe.

article_table <- data.frame(article_headers,article_text)

head(article_table)

 ##   article_headers  article_text
 ## 1 Preamble         Universal Declaration of Human Rights\nPreamble\nWhereas recognition of the inherent dignity and of the ...
 ## 2 Article I        \nAll human beings are born free and equal in dignity and rights. They are\nendowed with reason and cons...
 ## 3 Article 2        \nEveryone is entitled to all the rights and freedoms set forth in this Declaration,\nwithout distinction...
 ## 4 Article 3        \nEveryone has the right to life, liberty and security of person.\n
 ## 5 Article 4        \nNo one shall be held in slavery or servitude; slavery and the slave trade shall be\nprohibited in all their forms.\n
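The steps above can be bundled into one reusable function. This is a minimal sketch, not the lesson's own code: the function name segment_articles, the default pattern and the toy text are all assumptions made for illustration, and trimws() is added to strip the whitespace left around each segment.

```r
# Hypothetical helper: split a text at its article headers and return a data frame
segment_articles <- function(text, pattern = "Article\\s(\\d+|[A-Z])") {
  headers <- regmatches(text, gregexpr(pattern, text))[[1]]  # the headers themselves
  pieces  <- strsplit(text, pattern)[[1]]                    # the text between headers

  # The segment before the first header has no header of its own
  headers <- c("Preamble", headers)

  # Fails loudly if headers and segments do not line up,
  # for instance when the text ends directly after a header
  stopifnot(length(headers) == length(pieces))

  data.frame(article_headers = headers,
             article_text    = trimws(pieces),
             stringsAsFactors = FALSE)
}

# Made-up miniature "treaty"
toy <- "Opening words. Article I First rule. Article 2 Second rule."
toy_table <- segment_articles(toy)
print(toy_table)
```

Checking the two lengths before building the dataframe is a cheap safeguard: data.frame() would otherwise stop with a less helpful error, or silently recycle values in some cases.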

Using Regexes for Information Retrieval

The research area in which regexes have the most promise is probably information retrieval. If there is a consistent pattern in your text, it is very likely that you can write a regex to extract it.

Examples include:
1) Numbers (years, telephone numbers, page numbers, ...)
2) Names (names of persons, laws, entities, ...)
3) Citations (court decisions, academic works, ...)

In the Universal Declaration, we might want to extract all proper names. Names are typically capitalized and often consist of two or more words. The regex below looks for two consecutive capitalized words.

# The pattern starts with a space, which is why every match in the output below begins with one.
pattern_matching <- gregexpr(" [A-Z][a-z]+\\s[A-Z][a-z]+", human_rights)

regmatches(human_rights, pattern_matching)[[1]]

 ##[1] " Human Rights" " United Nations" " Member States" " United Nations" " General Assembly"
[6] " Universal Declaration" " Human Rights" " Member States" " United Nations" " United Nations"
[11] " United Nations"
 
Importantly, the results from regexes are only as good as the clarity and consistency of the underlying pattern. Here we get some false positives: "Human Rights", for instance, is not a name in its own right but a fragment of the title "Universal Declaration of Human Rights".
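One way to reduce such fragments is to let the pattern continue across further capitalized words. The sketch below is an illustration under stated assumptions: the sentence is made up, and treating a lowercase "of" as the only permitted connector is an arbitrary choice that will not suit every text.

```r
# Made-up sentence for illustration
text <- "Whereas the General Assembly proclaims this Universal Declaration of Human Rights."

# Exactly two capitalized words
two_words <- regmatches(
  text, gregexpr("[A-Z][a-z]+\\s[A-Z][a-z]+", text, perl = TRUE))[[1]]

# Two or more capitalized words, optionally joined by "of" (an assumption)
multi_word <- regmatches(
  text, gregexpr("[A-Z][a-z]+(\\s(of\\s)?[A-Z][a-z]+)+", text, perl = TRUE))[[1]]

print(two_words)   # the long title is cut into two fragments
print(multi_word)  # the long title is returned in one piece
```

The extended pattern trades one kind of error for another: it keeps long titles together, but it will also glue together any run of capitalized words, such as a capitalized sentence opener followed by a name.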
Regexes can also be used for fuzzy searches. If you are interested in all terms connected with "human", you can write a regex that captures every two-word phrase that starts with the word "human":

pattern_matching <- gregexpr(" human\\s[a-z]+", human_rights)

unique(regmatches(human_rights, pattern_matching)[[1]])

 ##[1] " human family"  " human rights"  " human beings"  " human person"  " human dignity"
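A natural next step is to count how often each phrase occurs. The sketch below works on a made-up passage and makes two assumptions that differ from the code above: it uses a word boundary ("\\b") instead of a leading space, so that "Inhuman" is skipped, and it sets ignore.case = TRUE so that sentence-initial "Human" is found as well.

```r
# Made-up passage for illustration
text <- "Human rights protect human dignity. Inhuman treatment violates human rights."

# \\b keeps "Inhuman" out; ignore.case = TRUE lets "Human" in
hits_pos <- gregexpr("\\bhuman\\s[a-z]+", text, ignore.case = TRUE, perl = TRUE)
hits     <- tolower(regmatches(text, hits_pos)[[1]])

# Frequency of each phrase, most frequent first
counts <- sort(table(hits), decreasing = TRUE)
print(counts)
```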


Dataset

Sample of cases from the Supreme Court of Canada in txt format. [Download]

Exercises

Go back to the Supreme Court of Canada data.

Imagine you work for a tech startup and have been asked to extract some metadata (data about the case) from the full text of Supreme Court judgments. Attempt to create a set of regular expressions to extract metadata from these cases relating to:

a) the official case citation
b) the date of the decision
c) the names of the disputants

Hint: Start with a sample case before applying your code to the entire dataset.
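The workflow in the hint might look like the sketch below. Everything in it is invented for illustration: the two files are made-up stand-ins written to a temporary folder, the helper name extract_date is hypothetical, and the date format (YYYY-MM-DD) is an assumption that the real judgments may well not follow. Inspect a real case first and adapt the pattern to what you find.

```r
# Invented stand-ins for case files; real judgments will be formatted differently
case_dir <- file.path(tempdir(), "scc_sample")
dir.create(case_dir, showWarnings = FALSE)
writeLines("Made-up case A. Heard: 2001-03-15.", file.path(case_dir, "case_a.txt"))
writeLines("Made-up case B. Heard: 1999-11-02.", file.path(case_dir, "case_b.txt"))

# Hypothetical helper: first YYYY-MM-DD date in a text, or NA if there is none
extract_date <- function(text) {
  hit <- regmatches(text, regexpr("\\d{4}-\\d{2}-\\d{2}", text))
  if (length(hit) == 0) NA_character_ else hit
}

files <- list.files(case_dir, pattern = "\\.txt$", full.names = TRUE)

# Step 1: develop and check the regex on a single case
one_case <- paste(readLines(files[1]), collapse = " ")
one_date <- extract_date(one_case)
print(one_date)

# Step 2: apply the same function to every file
all_dates <- vapply(files, function(f) {
  extract_date(paste(readLines(f), collapse = " "))
}, character(1))
print(unname(all_dates))
```

Returning NA when nothing matches makes it easy to spot the cases where the pattern needs more work.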
