Using Regexes for Information Retrieval - Data Science for Lawyers

The research area in which regexes have the most promise is probably information retrieval. If there is a consistent pattern in your text, it is very likely that you can write a regex to extract it.

Examples include:

1) Numbers (years, telephone numbers, page numbers, …)
2) Names (persons, laws, entities, …)
3) Citations (court decisions, academic works, …)
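To make the list above concrete, here is a minimal sketch that pulls years and a U.S. Reports citation out of a short string. The sample sentence, the variable names, and both patterns are illustrative assumptions rather than part of the lesson's data; real citation formats vary and would need broader patterns.

```r
# Hypothetical sample text (an assumption for illustration only).
sample_text <- "The Declaration was adopted in 1948. Compare Brown v. Board of Education, 347 U.S. 483 (1954), at page 12."

# Years: four digits starting with 1 or 20, delimited by word boundaries (\\b)
# so that page numbers and reporter volumes are not picked up.
year_matches <- gregexpr("\\b(1[0-9]{3}|20[0-9]{2})\\b", sample_text, perl = TRUE)
years <- regmatches(sample_text, year_matches)[[1]]
print(years)
# [1] "1948" "1954"

# Citations in the form "<volume> U.S. <page>"; the dots are escaped
# because an unescaped dot matches any character.
citation_matches <- gregexpr("[0-9]+\\sU\\.S\\.\\s[0-9]+", sample_text, perl = TRUE)
citations <- regmatches(sample_text, citation_matches)[[1]]
print(citations)
# [1] "347 U.S. 483"
```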

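In the Universal Declaration, we might want to extract all proper names. A simple heuristic is that names are typically capitalized and consist of two or more words. The pattern below looks for two consecutive capitalized words separated by a single whitespace character.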


pattern_matching <- gregexpr("[A-Z][a-z]+\\s[A-Z][a-z]+", human_rights)


regmatches(human_rights, pattern_matching)[[1]]

## [1] " Human Rights" " United Nations" " Member States" " United Nations" " General Assembly"
## [6] " Universal Declaration" " Human Rights" " Member States" " United Nations" " United Nations"
## [11] " United Nations"

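Importantly, the results from regexes are only as good as the clarity and consistency of the underlying pattern. Here we get some false positives: “Human Rights” is not a separate name but part of the title “Universal Declaration of Human Rights”. Because the pattern only accepts capitalized words, the lowercase “of” splits the title into two matches.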
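One possible way to reduce this kind of false positive is to let the pattern skip over lowercase connectors between capitalized words. The sketch below is only an assumption about how one might do this: the sample sentence is made up, and the connector list ("of", "the") is a guess that would need tuning for a real corpus.

```r
# Hypothetical sample sentence (not the lesson's human_rights object).
sample_text <- "In 1948, the General Assembly proclaimed the Universal Declaration of Human Rights as a common standard for the United Nations."

# Original heuristic: exactly two capitalized words in a row.
basic_pattern <- "[A-Z][a-z]+\\s[A-Z][a-z]+"
basic <- regmatches(sample_text, gregexpr(basic_pattern, sample_text, perl = TRUE))[[1]]
print(basic)
# [1] "General Assembly"      "Universal Declaration" "Human Rights"
# [4] "United Nations"

# Extended heuristic: a capitalized word followed by one or more further
# capitalized words, each optionally preceded by "of" or "the".
extended_pattern <- "[A-Z][a-z]+(\\s(of\\s|the\\s)*[A-Z][a-z]+)+"
extended <- regmatches(sample_text, gregexpr(extended_pattern, sample_text, perl = TRUE))[[1]]
print(extended)
# [1] "General Assembly"                      "Universal Declaration of Human Rights"
# [3] "United Nations"
```

The extended pattern keeps the full title together, but it remains a heuristic: a capitalized word at the start of a sentence followed by a name would still be swept into the match.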

Regexes can also be used for fuzzy searches. If you are interested in all terms connected with “human”, you can write a regex that captures every two-word expression beginning with the word “human”:


pattern_matching <- gregexpr(" human\\s[a-z]+", human_rights)


unique(regmatches(human_rights, pattern_matching)[[1]])

## [1] " human family"  " human rights"  " human beings"  " human person"  " human dignity"
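As a variation on the search above, the following sketch counts how often each "human ..." expression occurs and also catches capitalized occurrences. The sample text and variable names are assumptions for illustration. It uses a word boundary (`\\b`) instead of a leading space, so the matches carry no leading blank and a match at the very start of the text is not missed.

```r
# Hypothetical sample text (an assumption, not the Declaration itself).
sample_text <- "All human beings are born free. Human rights protect human dignity, and human rights serve humanity."

# ignore.case = TRUE matches both "human" and "Human";
# the required whitespace after "human" excludes words such as "humanity".
pattern_matching <- gregexpr("\\bhuman\\s[a-z]+", sample_text,
                             ignore.case = TRUE, perl = TRUE)

# Lowercase the matches so that "Human rights" and "human rights" are counted together.
matches <- tolower(regmatches(sample_text, pattern_matching)[[1]])

# Frequency table, most frequent expression first.
counts <- sort(table(matches), decreasing = TRUE)
print(counts)
```

In this sample, "human rights" is counted twice, while "human beings" and "human dignity" each appear once.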

Last update May 8, 2020.