We now turn to the first of several lessons that treat text as data.
Treating text as data has one important advantage: scalability. While it takes time to read through cases, contracts and treaties, text-as-data techniques allow lawyers to process large amounts of legal material in seconds.
Unfortunately, any conversion from text to data comes at the cost of losing semantic information. The meaning of words is partially derived from their context, and part of that semantic context is inevitably lost when text is treated as data. Two consequences flow from this.
First, reading text and treating text as data are not substitutes; they are complements. Text-as-data analysis provides a bird's-eye perspective on the law by processing lots of it, but not in detail. Reading text, on the other hand, provides detail and interpretive context that text-as-data analysis cannot match, but it is also time-consuming and thus necessarily restricted to a comparatively small number of selected documents.
Second, all text-as-data methods work with imperfect representations of text. The benchmark for evaluating text-as-data analysis, then, is not how accurately it represents the meaning of a text, but whether it yields useful insights about a text. As we will see, usefulness often does not require semantic perfection.
Because text-as-data methods are evaluated based on their usefulness rather than how accurately they represent semantic nuances, the selection of text-as-data methods depends on the task at hand. In this lesson, we will primarily work with term-frequency representations of text: we count the number of times a word appears in a document. As we will see, this simple representation of text is surprisingly useful for many tasks, but not for all. Sometimes we are only interested in particular words; sometimes, the grammatical form or the semantic context of words is crucial. There are many ways to represent and work with text. Which method is best depends on what exactly you want to do. We will cover a few different applications in this and the next lesson to guide you.
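As a minimal illustration of what a term-frequency representation looks like, here is a base-R sketch on two invented sentences (no packages needed; the documents are made up for illustration):

```r
# Term-frequency sketch: count how often each word appears across toy documents.
docs <- c("the court allowed the appeal", "the court dismissed the appeal")
tokens <- unlist(strsplit(docs, " "))
sort(table(tokens), decreasing = TRUE)
# "the" occurs 4 times; "appeal" and "court" occur twice each.
```

Everything that follows in this lesson is, at bottom, a more robust version of this counting step.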
We will represent text as word frequencies to visualize and analyze the content and sentiment of legal texts.
Grimmer, Justin, and Brandon M. Stewart. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3): 267–297, 2013.
# Remove characters that cannot be converted (e.g. stray non-ASCII bytes).
scc$text <- iconv(scc$text, from = "UTF-8", to = "ASCII", sub = "")
# When working with French texts, for instance, use the following encoding:
scc$text <- iconv(scc$text, from = "UTF-8", to = "LATIN1")
In this lesson, we return to our Canadian Supreme Court data and turn the judgments' text into data. To do that, we download and activate the tm package (one of several R packages for turning text into data) and create a corpus object.
# Load package.
library(tm)
# Create a corpus from the text.
corpus <- VCorpus(VectorSource(scc$text))
We should then do some pre-processing on our corpus. There is some variation in our text that, for most tasks, we are not interested in. For instance, if we are interested in understanding the content of our corpus, it will not matter for our analysis whether a word is capitalised or not.
At the same time, pre-processing text is not a value-neutral exercise, and it may affect your results. You thus need to be mindful of what you do.
There are two ways to think about pre-processing choices. First, you could run the same analysis with different pre-processing combinations to make sure that your results are robust and do not vary strongly due to your pre-processing decisions. Denny and Spirling provide some useful guidance on this. Second, you could approach pre-processing from a more theoretical perspective and justify your pre-processing choices on conceptual grounds: "is the variation in the text that I am interested in for this particular task meaningfully distorted by this pre-processing step?"
# Here, we focus on getting rid of variation that we don't consider conceptually meaningful for investigating content variation in our SCC texts.
# Removing punctuation.
corpus <- tm_map(corpus, removePunctuation)
# Lowercasing letters.
corpus <- tm_map(corpus, content_transformer(tolower))
# Collapsing extra white space.
corpus <- tm_map(corpus, stripWhitespace)
# Removing numbers.
corpus <- tm_map(corpus, removeNumbers)
A more controversial step is the removal of stopwords, i.e. common function words such as "and", "or", "over" and "our". Take a look at tm's built-in English list.
stopwords("english")
## "i" "me" "my" "myself" "we" "our" "ours" "ourselves"
## "you" "your" "yours" "yourself" "yourselves" "he" "him" "his"
## "himself" "she" "her" "hers" "herself" "it" "its" "itself"
## "they" "them" "their" "theirs" "themselves" "what" "which" "who"
## "whom" "this" "that" "these" "those" "am" "is" "are"
## "was" "were" "be" "been" "being" "have" "has" "had"
## "having" "do" "does" "did" "doing" "would" "should" "could"
## "ought" "i'm" "you're" "he's" "she's" "it's" "we're" "they're"
## "i've" "you've" "we've" "they've" "i'd" "you'd" "he'd" "she'd"
## "we'd" "they'd" "i'll" "you'll" "he'll" "she'll" "we'll" "they'll"
## "isn't" "aren't" "wasn't" "weren't" "hasn't" "haven't" "hadn't" "doesn't"
## "don't" "didn't" "won't" "wouldn't" "shan't" "shouldn't" "can't" "cannot"
## "couldn't" "mustn't" "let's" "that's" "who's" "what's" "here's" "there's"
## "when's" "where's" "why's" "how's" "a" "an" "the" "and"
## "but" "if" "or" "because" "as" "until" "while" "of"
## "at" "by" "for" "with" "about" "against" "between" "into"
## "through" "during" "before" "after" "above" "below" "to" "from"
## "up" "down" "in" "out" "on" "off" "over" "under"
## "again" "further" "then" "once" "here" "there" "when" "where"
## "why" "how" "all" "any" "both" "each" "few" "more"
## "most" "other" "some" "such" "no" "nor" "not" "only"
## "own" "same" "so" "than" "too" "very"
In some legal contexts, these words can matter a lot. Moreover, some terms that we as lawyers may consider "noise", such as purely stylistic legal Latin like "inter alia", are not included in that stopword list. As part of the exercise for this lesson, you will therefore create a legal stopword list. For now, though, we just eliminate the standard English stopwords.
corpus <- tm_map(corpus, removeWords, stopwords("english"))
One other common pre-processing step, stemming, whereby words are reduced to their root or stem, is something I generally do not recommend for legal text. The reason is that, in a legal context, words with the same stem can have very different meanings. For example, some stemmers reduce both "arbitrary" and "arbitrator" to the same stem, "arbitr". Stemming thus risks eliminating information that lawyers typically consider useful.
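To see how crude stemming can conflate legally distinct terms, here is a deliberately naive suffix-stripping sketch in base R. It is not a real stemmer (actual stemmers such as the Porter algorithm apply more careful rules), but it makes the risk concrete:

```r
# A deliberately naive suffix-stripper, for illustration only.
naive_stem <- function(x) sub("(ary|ator|ation)$", "", x)
naive_stem(c("arbitrary", "arbitrator", "arbitration"))
# All three collapse to "arbitr", erasing legally meaningful distinctions.
```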
# Create a term-document matrix.
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)
# We can, for example, sort our matrix by word frequency in the corpus.
frequent_words <- sort(rowSums(tdm), decreasing=TRUE)
# Here are our most frequent words.
head(frequent_words, 10)
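On a toy matrix with invented counts, the rowSums step works like this:

```r
# Toy term-document matrix: rows are terms, columns are documents.
tdm_toy <- matrix(c(2, 0, 1,
                    1, 3, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("appeal", "court"), c("doc1", "doc2", "doc3")))
sort(rowSums(tdm_toy), decreasing = TRUE)
# "court" appears 4 times in the corpus, "appeal" 3 times.
```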
One of the advantages of R is that we can easily visualize our results. For example, we can visualize word frequency through wordclouds. For that, we need to download and activate the wordcloud package.
# Load package.
library(wordcloud)
To make the wordcloud easily readable, we reduce the information we want to display.
# It is easier to work with wordcloud if we transform our frequency vector into a dataframe.
frequent_words <- as.data.frame(as.table(frequent_words))
colnames(frequent_words) <- c("word", "freq")
# In order not to overload the word cloud, we limit our analysis to the 50 most frequent words in the corpus.
most_frequent_words <- head(frequent_words,50)
Finally, we can proceed to the visualization.
# You can adjust the color to your liking.
wordcloud(most_frequent_words$word, most_frequent_words$freq, colors = brewer.pal(8, "Dark2"))
# This is our bigram function. R allows you to write your own functions, which is what we do here. We can then apply our function within the script just like R's built-in functions.
bigrams <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
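To see what this produces, here is the same idea in base R on a toy token vector (the tm/NLP version above does this for each document in the corpus):

```r
# Base-R sketch of bigram construction: pair each token with its successor.
toks <- c("the", "appeal", "should", "be", "allowed")
paste(head(toks, -1), tail(toks, -1))
# "the appeal" "appeal should" "should be" "be allowed"
```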
# Creating a term-document matrix with bigrams.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigrams))
tdm <- as.matrix(tdm)
# Sort the TDM and find the 12 most frequent bigrams.
head(sort(rowSums(tdm), decreasing=TRUE), 12)
Dictionary methods relate an outside list of terms to our corpus. This can be useful for a range of tasks. We may want to automatically check for the presence or absence of certain terms as part of a content analysis. Or we may want to select specific documents in which particular signalling terms from our dictionary appear.
In this lesson, we use dictionary methods for two tasks. First, we want to know whether a Supreme Court case constituted a successful appeal or not and use a mapping of signalling terms to automate this assignment. Second, we use a sentiment dictionary to check whether the judges couch their decisions in a positive or negative tone.
Here, we tackle the first of these tasks: classifying the outcome of decisions. For that, we come up with one signalling phrase for each outcome. Of course, rather than a single signalling phrase per outcome, we could use several.
# Load library.
library(stringr)
success_formular <- "should be allowed"
reject_formular <- "should be dismissed"
We then count how often each signalling phrase appears in the text of every case.
success <- str_count(scc$text, success_formular)
reject <- str_count(scc$text, reject_formular)
Since each of these phrases can appear any number of times, we convert the counts into a simple yes/no indicator: "yes" if the phrase occurs at least once, "no" otherwise.
success <- ifelse(success > 0, "yes", "no")
reject <- ifelse(reject > 0, "yes", "no")
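On toy counts, the recoding from phrase counts to a yes/no flag looks like this:

```r
# Any positive count becomes "yes"; zero becomes "no".
counts <- c(0, 1, 3)
ifelse(counts > 0, "yes", "no")
# "no" "yes" "yes"
```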
Finally, we add the success and reject columns to our dataset.
scc$success <- success
scc$reject <- reject
Take a look at the dataset using the View(scc) command. You will notice that some SCC cases are classified as both success and reject. How can that be? If you open the SCC judgment  1 S.C.R. 61.txt, you will see that the Supreme Court allowed part of the appeal and dismissed another part. So our assignment of success and reject turned out to be correct. In general, always double-check your results for accuracy.
Because legal texts often follow specific drafting rules and conventions, such rule-based term mapping can work surprisingly well for mapping the contents of legal documents. But simple word-mapping rules do not work for all tasks. Where no uniform signalling terms exist, researchers are better off resorting to the more flexible machine-learning approaches that we will discuss in the section on Classification and Prediction.
Another dictionary approach uses outside word lists to investigate the sentiment of text. Essentially, that means counting words with positive and negative connotations in a document to assess its sentiment or tonality.
Political scientists have used sentiment analysis, for example, to investigate whether judges change their tone when dealing with particularly sensitive cases. This is what we will do here as well. The same approach, with a different dictionary, could also be used to look at other characteristics of legal texts, such as their prescriptivity, flexibility or use of legalese or outdated terms.
Let's start by downloading and activating the SentimentAnalysis package and by taking a look at its inbuilt sentiment dictionaries.
# Load package.
library(SentimentAnalysis)
# The package comes with existing word lists of positive and negative words.
data(DictionaryGI)
head(DictionaryGI$positive)
head(DictionaryGI$negative)
##  "abide" "ability" "able" "abound" "absolve" "absorbent"
##  "abandon" "abandonment" "abate" "abdicate" "abhor" "abject"
Next, we want to check how many of these positive and negative words appear in each text of our corpus. Do judges tend to use more positive or more negative language when framing their decisions?
# We count how many positive and negative words appear in each text.
tdm <- TermDocumentMatrix(corpus)
text_sentiment <- analyzeSentiment(tdm)
When positive words outnumber negative ones, we classify a text as positive and vice versa. This is of course only an approximation but may nevertheless be helpful to group texts.
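The underlying logic can be sketched in base R with toy word lists (the actual package uses much larger dictionaries; words and document here are invented):

```r
# Dictionary-based sentiment scoring: positive hits minus negative hits.
positive <- c("grant", "fair", "agree")
negative <- c("dismiss", "deny", "unfair")
doc <- c("the", "court", "will", "grant", "a", "fair", "remedy",
         "and", "dismiss", "the", "cross-appeal")
score <- sum(doc %in% positive) - sum(doc %in% negative)
score                                   # 2 positive hits - 1 negative hit = 1
if (score > 0) "positive" else "negative"
```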
## negative positive
As it turns out, all judgments use more positive than negative words.
You probably noticed that among the most frequent words are terms that tell us little about a decision’s content. For instance, it is unsurprising that “court” or “canlii” appear often. Take the list of most frequent words as a basis for creating a legal stopword list.
1. Write your stopwords into the first column of a csv file. (Hint: Use the word frequency table to help you identify unwanted words)
2. Then upload the file and run your analysis again. What do you find?
# This sample code will help you.
# A distinct name avoids shadowing tm's stopwords() function.
custom_stopwords <- read.csv("stopwords.csv", header = FALSE)
custom_stopwords <- as.character(custom_stopwords$V1)
custom_stopwords <- c(custom_stopwords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, custom_stopwords)
1. Go back to the sentiment counts for each judgment. We saw that all decisions contain more positive words than negative ones. But do some contain more positive or negative words than others? Specifically, you may want to add your sentiment counts to the table with the success/reject classifications. Do decisions use more negative words when they dismiss an appeal?
2. What else, apart from sentiment, could you look for in these judgments? For instance, you may want to check how much Latin legalese Canadian judges use. By adapting the sample code for stopwords from above, you can create custom dictionaries to count the frequency of terms you are interested in.
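A base-R sketch of such a custom term counter (the Latin terms and example texts are illustrative):

```r
# Count occurrences of each dictionary term in each text.
count_term <- function(term, texts) {
  lengths(regmatches(texts, gregexpr(term, texts, fixed = TRUE)))
}
legalese <- c("inter alia", "prima facie")
texts <- c("the court held, inter alia, that there was prima facie evidence",
           "the appeal was dismissed")
sapply(legalese, count_term, texts = texts)
# Rows are texts, columns are terms; the second text contains neither phrase.
```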