Loading and Pre-Processing Texts - Data Science for Lawyers

Before we can do any kind of machine learning, we have to load and pre-process our texts.

Today we work with Federal Court decisions that have been classified into three different issue categories: (1) health, (2) aboriginal and (3) immigration.


# Load text data.


setwd("~/Google Drive/Teaching/Canada/Legal Data Science/English Course/Sample judgments")

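# Keep the text column as character strings (before R 4.0, read.csv converted strings to factors by default).

cases <- read.csv("Sample Canadian Cases.csv", header = TRUE, stringsAsFactors = FALSE)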
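Before building a corpus, it can help to check that the data loaded as expected. The following is a minimal sketch on a made-up data frame that stands in for the CSV: only the `text` column is confirmed by the code in this lesson, while the `issue` column name and the two example sentences are assumptions for illustration.

```r
# Hypothetical stand-in for the loaded CSV; `issue` is an assumed column name.
toy_cases <- data.frame(
  text  = c("The applicant seeks judicial review.", "The band council appealed."),
  issue = c("immigration", "aboriginal"),
  stringsAsFactors = FALSE
)

# Column names and types at a glance.
str(toy_cases)

# The text column should be character, not factor.
stopifnot(is.character(toy_cases$text))

# Number of missing or empty texts.
n_empty <- sum(is.na(toy_cases$text) | toy_cases$text == "")
n_empty

# Number of documents per category.
table(toy_cases$issue)
```

The same checks can be run on `cases` once the real column names are known.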

# Load package for text processing.

library(tm)

Now we can create a corpus.

# Create a corpus from the text.


corpus <- VCorpus(VectorSource(cases$text))
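To see what a corpus holds, it may be useful to look at one on a small scale. This sketch builds a corpus from two made-up sentences (the `toy_` names and texts are illustrative, not part of the course data) and reads a document back out.

```r
library(tm)

# Two invented texts standing in for cases$text.
toy_texts <- c("The applicant seeks judicial review.",
               "The band council appealed.")

toy_corpus <- VCorpus(VectorSource(toy_texts))

# Number of documents in the corpus.
length(toy_corpus)

# Text of the first document, recovered as a character string.
first_doc <- as.character(toy_corpus[[1]])
first_doc
```

Each element of the corpus is one document, so `length(corpus)` on the real data should equal the number of rows in `cases`.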

Next we pre-process our text.

# We get rid of variation that we don't consider conceptually meaningful.
# Stop words are removed before punctuation so that contractions are still matched.

corpus <- tm_map(corpus, content_transformer(tolower))

corpus <- tm_map(corpus, removeWords, stopwords("english"))

corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, removeNumbers)

corpus <- tm_map(corpus, stripWhitespace)
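The order of these steps can matter. The English stop-word list in `tm` contains contractions such as "don't", so if apostrophes are stripped before stop words are matched, "dont" is left behind. A minimal sketch on a made-up sentence, applying the same `tm` functions directly to a character string:

```r
library(tm)

x <- "we don't accept the appeal"

# Punctuation first: the apostrophe is gone before stop words are matched,
# so "dont" no longer matches the stop word "don't".
punct_first <- removeWords(removePunctuation(x), stopwords("english"))

# Stop words first: the contraction is matched and removed as a whole.
stop_first <- removePunctuation(removeWords(x, stopwords("english")))

grepl("dont", punct_first)  # is "dont" still in the text?
grepl("dont", stop_first)
```

Whether leftover tokens like "dont" matter depends on the task; with a minimum document frequency they may survive into the document-term matrix.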

# Finally, we again create a document-term matrix, keeping only terms that appear in at least two documents.

dtm <- DocumentTermMatrix(corpus, control = list(bounds=list(global = c(2, Inf))))

dtm <- as.matrix(dtm)
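The `bounds` setting is easiest to understand on a small example. In this sketch, three made-up documents (not the course data) are turned into a document-term matrix with the same `global = c(2, Inf)` bound, so that only terms occurring in at least two documents are kept.

```r
library(tm)

# Three invented documents; "appeal" occurs in all three, "visa" in two,
# and every other term in only one.
toy_texts <- c("appeal visa refugee",
               "appeal health insurance",
               "appeal visa treaty")

toy_corpus <- VCorpus(VectorSource(toy_texts))

toy_dtm <- DocumentTermMatrix(
  toy_corpus,
  control = list(bounds = list(global = c(2, Inf)))
)

m <- as.matrix(toy_dtm)

dim(m)                               # documents x retained terms
colnames(m)                          # the retained terms
sort(colSums(m), decreasing = TRUE)  # total count of each retained term
```

Rows correspond to documents and columns to terms, so `dim(dtm)` and `colSums(dtm)` on the real matrix give a quick picture of vocabulary size and the most frequent terms.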

Last update May 11, 2020.