Before we can embark on any kind of machine learning, we have to load and pre-process our text.
Today we work with Federal Court decisions that have been classified into three issue categories: (1) health, (2) Aboriginal and (3) immigration.
# Load text data.
setwd("~/Google Drive/Teaching/Canada/Legal Data Science/English Course/Sample judgments")
cases <- read.csv("Sample Canadian Cases.csv", header = TRUE)
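Before going further, it is worth confirming that the data loaded as expected. A quick sketch, assuming the CSV contains a `text` column with the decision text and an `issue` column with the category label (the `issue` column name is an assumption, not confirmed by the file itself):

```r
# Inspect the structure of the loaded data frame.
str(cases)
# Tabulate the (assumed) issue labels to see how cases are distributed
# across the three categories.
table(cases$issue)
```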
# Load package for text processing.
library(tm)
Now we can create a corpus.
# Create a corpus from the text.
corpus <- VCorpus(VectorSource(cases$text))
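To verify that the corpus was built correctly, we can print its summary and peek at the first document:

```r
# The corpus should contain one document per case.
print(corpus)
# Look at the opening lines of the first document to make sure
# the text came through intact.
head(as.character(corpus[[1]]))
```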
Next we pre-process our text.
# We get rid of variation that we don't consider conceptually meaningful.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Finally, we create a document-term matrix.
dtm <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(2, Inf))))
dtm <- as.matrix(dtm)
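As a final sanity check, we can look at the dimensions of the matrix and a small corner of it. The `global = c(2, Inf)` bound above means every retained term appears in at least two documents:

```r
# Rows are documents, columns are terms appearing in at least two cases.
dim(dtm)
# Inspect a small corner of the matrix: counts of the first
# five terms in the first five documents.
dtm[1:5, 1:5]
```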