Using Regex for Text Segmentation

Aside from finding information, regular expressions help segment text through either substitution or splitting.

The gsub() function uses a regex to substitute every matching element in a string.

gsub(pattern, "[enter year here]", sample_text)
 ## [1] "World War II lasted from [enter year here] to [enter year here]."

You can also split strings on that pattern using the strsplit() function.

strsplit(sample_text, pattern)
 ## [[1]]
 ## [1] "World War II lasted from " " to " "."
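To run the two calls above on their own, here is a minimal self-contained sketch. The values of sample_text and pattern are assumptions, chosen so that they reproduce the outputs shown above; their actual definitions come from earlier in the lesson. The sketch also contrasts gsub() with sub(), which replaces only the first match.

```r
# Assumed definitions (not given in this section): a sentence containing
# two four-digit years, and a regex that matches exactly four digits.
sample_text <- "World War II lasted from 1939 to 1945."
pattern <- "\\d{4}"

# gsub() replaces every match of the pattern.
all_replaced <- gsub(pattern, "[enter year here]", sample_text)

# sub() replaces only the first match.
first_replaced <- sub(pattern, "[enter year here]", sample_text)

# strsplit() returns a list with one element per input string;
# [[1]] extracts the character vector of pieces for our single string.
pieces <- strsplit(sample_text, pattern)[[1]]

print(all_replaced)
print(first_replaced)
print(pieces)
```

Note that the matched text itself (here, the years) is discarded by strsplit(), which is why the matches have to be extracted separately when they are needed later.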

 

Using patterns to split a text is useful for breaking down a contract, statute or treaty into its subcomponents. As an illustration, let’s once again work with the Universal Declaration of Human Rights.

# Let's repeat part of lesson 2 and load the Universal Declaration of Human Rights.
library(pdftools)
 


human_rights  <- pdf_text("https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf")
# Recall that the pdf_text() function returns an object separated by page. We want to turn this into a single text object.

human_rights <- paste(human_rights, collapse = " ") 
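The effect of collapse can be checked without downloading anything. The sketch below uses a hypothetical three-element vector, pages, as a stand-in for the per-page output of pdf_text(); the page contents are invented for illustration.

```r
# Hypothetical stand-in for the output of pdf_text(): one string per page.
pages <- c("Text of page one.", "Text of page two.", "Text of page three.")

# Before collapsing, there is one element per page.
print(length(pages))

# collapse = " " joins all elements into a single string,
# with a space inserted between consecutive pages.
single_text <- paste(pages, collapse = " ")

print(length(single_text))
print(single_text)
```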

 

Fortunately, legal documents are often quite uniform. We can exploit that feature to segment them into subcomponents.

Take a look at the Universal Declaration. You will notice that it is divided into articles, each with its own header. These headers follow the pattern “Article” + space + one or more digits. As a result, the regex “Article\\s\\d+” should allow us to split the Declaration into articles. If you try this, though, you will see that it misses the first article because, unlike the others, its header is formatted with a Roman numeral as “Article I”. A comprehensive regex therefore looks for the word “Article” + space + one or more digits OR “Article” + space + a capital letter. We thus use the regex “Article\\s\\d+|Article\\s[A-Z]”.
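The difference between the two regexes can be seen on a small example. The sketch below uses toy_text, a made-up string that imitates the layout of the Declaration (it is not the real text), and compares what each pattern finds.

```r
# Made-up miniature document: a preamble, one header with a Roman numeral,
# and one header with a digit.
toy_text <- "Preamble text.\nArticle I\nFirst article.\nArticle 2\nSecond article."

# Digits only: the header "Article I" is not matched.
digits_only <- regmatches(toy_text, gregexpr("Article\\s\\d+", toy_text))[[1]]

# Digits OR a single capital letter: both headers are matched.
combined <- regmatches(
  toy_text,
  gregexpr("Article\\s\\d+|Article\\s[A-Z]", toy_text)
)[[1]]

# An equivalent, more compact way to write the same alternation.
compact <- regmatches(
  toy_text,
  gregexpr("Article\\s(\\d+|[A-Z])", toy_text)
)[[1]]

print(digits_only)
print(combined)
print(identical(combined, compact))
```

Keep in mind that [A-Z] matches any single capital letter, so on other documents this pattern may also pick up phrases in the running text where “Article” happens to be followed by a capitalized word.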

We can then follow a three-step process of identifying, extracting and splitting based on the pattern we pass to the regex functions.

# (1) identify our pattern in the text

article_matcher <- gregexpr("Article\\s\\d+|Article\\s[A-Z]", human_rights)

# (2) extract the pattern

article_headers <- regmatches(human_rights, article_matcher)[[1]]
# (3) and then split the text at the pattern

article_text <- strsplit(human_rights,"Article\\s\\d+|Article\\s[A-Z]")[[1]]

 

Finally, we want to create a dataframe that has the article header in column 1 and the article text in column 2. Note, however, that since treaties have preambles, the text before the first header becomes a segment of its own, so article_text will contain one more element than article_headers.

# Hence, we have to add another header (preamble) before we can match the headers and text in a dataframe.

article_headers <- c("Preamble",article_headers)
 
# At last, we can combine the headers and text in a dataframe.

article_table <- data.frame(article_headers,article_text)

head(article_table)
 ##   article_headers article_text
1 Preamble Universal Declaration of Human Rights\nPreamble\nWhereas recognition of the inherent dignity and of the …
2 Article I \nAll human beings are born free and equal in dignity and rights. They are\nendowed with reason and cons…
3 Article 2 \nEveryone is entitled to all the rights and freedoms set forth in this Declaration,\nwithout distinction…
4 Article 3 \nEveryone has the right to life, liberty and security of person.\n
5 Article 4 \nNo one shall be held in slavery or servitude; slavery and the slave trade shall be\nprohibited in all their forms.\n
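The steps above can be bundled into a small reusable function. The following is only a sketch: split_into_articles() is a hypothetical helper that is not part of the lesson, and toy_text is a made-up stand-in for the Declaration. It additionally trims surrounding whitespace with trimws() and checks that headers and text segments line up before building the dataframe.

```r
# Hypothetical helper (not part of the lesson): split a text into a preamble
# and its articles, returning one row per segment.
split_into_articles <- function(text,
                                header_regex = "Article\\s\\d+|Article\\s[A-Z]") {
  # (1) + (2) identify and extract the headers
  headers <- regmatches(text, gregexpr(header_regex, text))[[1]]
  # (3) split the text at the headers
  segments <- strsplit(text, header_regex)[[1]]
  # Everything before the first header is treated as the preamble.
  headers <- c("Preamble", headers)
  # Guard against a mismatch, e.g. when the text ends directly on a header
  # and strsplit() drops the empty final piece.
  stopifnot(length(headers) == length(segments))
  data.frame(
    article_headers = headers,
    article_text = trimws(segments),  # remove leading/trailing whitespace
    stringsAsFactors = FALSE
  )
}

# Made-up miniature document used for illustration only.
toy_text <- "Preamble text.\nArticle I\nFirst article.\nArticle 2\nSecond article."
toy_table <- split_into_articles(toy_text)
print(toy_table)
```

Under the assumption that human_rights has been loaded as shown earlier, split_into_articles(human_rights) should produce a table with the same headers as article_table, with the leading and trailing whitespace removed from each text segment.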
 

Last update May 8, 2020.
