Creation of a Text Corpus and Text Pre-processing

In this lesson, we return to our Supreme Court of Canada data and convert the judgments’ text into data. To do that, we download and load the tm package (one of several R packages that turn text into data) and create a corpus object.

# Load package.

library(tm)

# Create a corpus from the text.

corpus <- VCorpus(VectorSource(scc$text))
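If you want to see what a corpus object looks like before working with the full dataset, the following minimal sketch builds one from a toy character vector. The vector `toy_text` is a hypothetical stand-in for `scc$text`, and the sentences in it are invented for illustration.

```r
library(tm)

# Hypothetical stand-in for scc$text: one judgment per element.
toy_text <- c(
  "The appeal is dismissed with costs.",
  "The Court allows the appeal, inter alia, on Charter grounds."
)

# Each element of the vector becomes one document in the corpus.
toy_corpus <- VCorpus(VectorSource(toy_text))

# Number of documents in the corpus.
length(toy_corpus)

# Text of the first document.
content(toy_corpus[[1]])
```

The same two calls, `length()` and `content()`, can be used on the real corpus to check that every judgment was read in as expected.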

We should then do some processing on our corpus. There is some variation in our text that, for most tasks, we are not interested in. For instance, if we are interested in understanding the content of our corpus, it will not matter for our analysis whether a word is capitalised or not.

At the same time, pre-processing text is not a value-neutral exercise, and it may affect your results. You thus need to be mindful of what you do.

There are two ways to think about pre-processing choices. First, you could run the same analysis with different pre-processing combinations to make sure that your results are robust and do not vary strongly due to your pre-processing decisions. Denny and Spirling provide some useful guidance on this. Second, you could approach pre-processing from a more theoretical perspective and justify your pre-processing choices on conceptual grounds: “is the variation in the text that I am interested in for this particular task meaningfully distorted by this pre-processing step?”
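The first approach can be sketched on a small scale. The example below is only an illustration under stated assumptions: `vocab_size` is a hypothetical helper, the two sentences are invented, and vocabulary size stands in for whatever result your actual analysis produces. It applies different combinations of pre-processing steps to the same text and compares the outcome.

```r
library(tm)

# Invented toy documents.
toy_text <- c("The Court allowed the appeal.",
              "The court dismissed the appeal!")

# Hypothetical helper: apply a list of transformations to the text and
# return the number of distinct whitespace-separated tokens.
vocab_size <- function(texts, steps) {
  corp <- VCorpus(VectorSource(texts))
  for (step in steps) corp <- tm_map(corp, step)
  tokens <- unlist(strsplit(sapply(corp, as.character), "\\s+"))
  length(unique(tokens[tokens != ""]))
}

# Three pre-processing combinations to compare.
specs <- list(
  none        = list(),
  lower       = list(content_transformer(tolower)),
  lower_punct = list(content_transformer(tolower), removePunctuation)
)

sizes <- sapply(specs, vocab_size, texts = toy_text)
sizes
```

If a substantive result changes a lot between combinations, that is a signal to justify the chosen steps explicitly rather than treat them as defaults.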

# Here, we focus on getting rid of variation that we don't consider conceptually meaningful for investigating content variation in our SCC texts.

# Removing punctuation.
corpus <- tm_map(corpus, removePunctuation)

# Lowercasing letters.
corpus <- tm_map(corpus, content_transformer(tolower))

# Collapsing runs of white space into a single space.
corpus <- tm_map(corpus, stripWhitespace)

# Removing numbers.
corpus <- tm_map(corpus, removeNumbers)
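To see what these four steps do, it can help to run them on a single sentence and look at the result. The sketch below uses an invented citation-style sentence; the object name `toy` is an assumption for illustration.

```r
library(tm)

# Invented sentence with capitals, punctuation, numbers and extra spaces.
toy <- VCorpus(VectorSource(
  "In R. v. Oakes, [1986] 1 S.C.R. 103, the Court   held that s. 8 was invalid."
))

# Same steps, in the same order, as applied to the full corpus.
toy <- tm_map(toy, removePunctuation)
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, stripWhitespace)
toy <- tm_map(toy, removeNumbers)

out <- as.character(toy[[1]])
out
```

Because the numbers are removed after white space has been collapsed, the output may contain new gaps where the numbers used to be. If that matters for your task, one option is to run `stripWhitespace` once more as the final step.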

A more controversial aspect is the removal of stopwords. Stopwords include “and”, “or”, “over”, “our” etc. Take a look.

stopwords("english")

  [1] "i"          "me"         "my"         "myself"     "we"         "our"        "ours"       "ourselves" 
  [9] "you"        "your"       "yours"      "yourself"   "yourselves" "he"         "him"        "his"       
 [17] "himself"    "she"        "her"        "hers"       "herself"    "it"         "its"        "itself"    
 [25] "they"       "them"       "their"      "theirs"     "themselves" "what"       "which"      "who"       
 [33] "whom"       "this"       "that"       "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"      "have"       "has"        "had"       
 [49] "having"     "do"         "does"       "did"        "doing"      "would"      "should"     "could"     
 [57] "ought"      "i'm"        "you're"     "he's"       "she's"      "it's"       "we're"      "they're"   
 [65] "i've"       "you've"     "we've"      "they've"    "i'd"        "you'd"      "he'd"       "she'd"     
 [73] "we'd"       "they'd"     "i'll"       "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"     "haven't"    "hadn't"     "doesn't"   
 [89] "don't"      "didn't"     "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"      "cannot"    
 [97] "couldn't"   "mustn't"    "let's"      "that's"     "who's"      "what's"     "here's"     "there's"   
[105] "when's"     "where's"    "why's"      "how's"      "a"          "an"         "the"        "and"       
[113] "but"        "if"         "or"         "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"      "against"    "between"    "into"      
[129] "through"    "during"     "before"     "after"      "above"      "below"      "to"         "from"      
[137] "up"         "down"       "in"         "out"        "on"         "off"        "over"       "under"     
[145] "again"      "further"    "then"       "once"       "here"       "there"      "when"       "where"     
[153] "why"        "how"        "all"        "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"         "nor"        "not"        "only"      
[169] "own"        "same"       "so"         "than"       "too"        "very"

In some legal contexts these words can matter a lot. Further, some words that we as lawyers may consider “noise” or unwanted stopwords, such as purely stylistic legal Latin like “inter alia”, are not included in that stopword list. Hence, as part of the exercise for this lesson, you will need to create a legal stopword list. For now, though, we just eliminate the standard English stopwords.

corpus <- tm_map(corpus, removeWords, stopwords("english"))
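As a starting point for the exercise, a custom list can be combined with the built-in one and passed to `removeWords` in the same way. This is a minimal sketch: the entries in `legal_stopwords` are hypothetical examples chosen for illustration, not a recommended list, and the sentence is invented.

```r
library(tm)

# Hypothetical legal stopwords, for illustration only.
legal_stopwords <- c("inter alia", "mutatis mutandis", "herein", "thereof")

toy <- VCorpus(VectorSource(
  "the tribunal found inter alia that the clause herein was void"
))

# Remove the built-in English stopwords and the custom ones together.
toy <- tm_map(toy, removeWords, c(stopwords("english"), legal_stopwords))

# Collapse the gaps left behind by the removed words.
toy <- tm_map(toy, stripWhitespace)

out <- as.character(toy[[1]])
out
```

Two things are worth checking on your own data. Multi-word entries such as “inter alia” only match when the words appear exactly as written, separated by a single space. And because punctuation was removed earlier, contractions like “don't” are likely to appear in the corpus as “dont”, which the apostrophe forms in `stopwords("english")` will not match.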

One other common pre-processing step is stemming, whereby words are reduced to their root or stem. It must be applied cautiously to legal texts because, in a legal context, words with the same stem can have very different meanings. For example, some stemmers reduce the words “arbitrary” and “arbitrator” to the same “arbitr” stem. As a result, stemming risks eliminating information that is useful to lawyers.
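Before stemming a whole corpus, it is worth checking how a stemmer treats the legal terms you care about. The sketch below assumes the SnowballC package is installed (tm's `stemDocument` relies on it) and uses a handful of words chosen for illustration. Which words end up sharing a stem depends on the stemmer, so the output should be inspected rather than assumed.

```r
library(SnowballC)

# Legally distinct words with similar spellings.
words <- c("arbitrary", "arbitrator", "arbitration", "arbitrate")

# Stem each word with the Snowball English stemmer.
stems <- wordStem(words, language = "english")

# Side-by-side view of each word and its stem.
stem_table <- data.frame(word = words, stem = stems)
stem_table

# Words that share a stem would be merged into one term after stemming.
split(stem_table$word, stem_table$stem)
```

If terms that you need to keep apart fall into the same group, that is a reason to skip stemming for this corpus or to protect those terms before stemming.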

Last update May 11, 2020.