Loading and Pre-Processing Texts

Before we can embark on any type of machine learning, we have to load and pre-process our texts.

Today we work with Federal Court decisions that have been classified into three different issue categories: (1) health, (2) aboriginal and (3) immigration.


# Load text data.


setwd("~/Google Drive/Teaching/Canada/Legal Data Science/English Course/Sample judgments")
 
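# stringsAsFactors = FALSE keeps the judgment text as character strings in R versions before 4.0.
cases <- read.csv("Sample Canadian Cases.csv", header = TRUE, stringsAsFactors = FALSE)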
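Before going further, it can help to confirm that the file loaded as expected. The following is a minimal sketch on a made-up miniature file rather than the course data: the `text` column matches the column used in this lesson, while `category`, `toy_path` and `toy_cases` are assumed names chosen only for illustration.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical data).
toy_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    category = c("health", "aboriginal", "immigration"),  # assumed column name
    text = c("First decision text.",
             "Second decision text.",
             "Third decision text."),
    stringsAsFactors = FALSE
  ),
  toy_path,
  row.names = FALSE
)

# Read it back the same way the course file is read.
toy_cases <- read.csv(toy_path, header = TRUE, stringsAsFactors = FALSE)

str(toy_cases)              # column names and types
table(toy_cases$category)   # number of decisions per category
anyNA(toy_cases$text)       # TRUE would signal missing judgment text
```

The same three checks can be run on `cases` once the column names of the real file are known.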

# Load package for text processing.

library(tm)

Now we can create a corpus. 

# Create a corpus from the text.


corpus <- VCorpus(VectorSource(cases$text))
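A corpus holds one document per element of the input vector. The sketch below builds a corpus from two made-up strings (not course data) and shows a few ways to look inside it; `toy_text` and `toy_corpus` are illustrative names.

```r
library(tm)

# Two invented documents standing in for judgments.
toy_text <- c("First made-up decision.", "Second made-up decision.")
toy_corpus <- VCorpus(VectorSource(toy_text))

length(toy_corpus)          # number of documents in the corpus
content(toy_corpus[[1]])    # text of the first document
meta(toy_corpus[[1]])       # document-level metadata attached by tm
```

Running the same calls on `corpus` is a quick way to verify that every judgment made it into the corpus.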

Next we pre-process our text.

# We get rid of variation that we don't consider conceptually meaningful.


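# Lowercase first and remove stop words before punctuation,
# so that contractions such as "don't" still match the stop word list.
corpus <- tm_map(corpus, content_transformer(tolower))
 
corpus <- tm_map(corpus, removeWords, stopwords("english"))
 
corpus <- tm_map(corpus, removePunctuation)
 
corpus <- tm_map(corpus, removeNumbers)
 
# Collapse the extra spaces left behind by the removals above.
corpus <- tm_map(corpus, stripWhitespace)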
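To see what these transformations do, the following minimal sketch applies them to a single made-up sentence (not taken from the course data). It lowercases first and removes stop words before punctuation so that contracted stop words are still matched, and it collapses whitespace last; the names `toy` and `cleaned` are illustrative.

```r
library(tm)

# One invented sentence standing in for a judgment.
toy <- VCorpus(VectorSource(
  "The Applicant's claim, filed in 2019, was DISMISSED by the Court."
))

toy <- tm_map(toy, content_transformer(tolower))        # uniform case
toy <- tm_map(toy, removeWords, stopwords("english"))   # drop common function words
toy <- tm_map(toy, removePunctuation)                   # drop commas, periods, apostrophes
toy <- tm_map(toy, removeNumbers)                       # drop digits such as the year
toy <- tm_map(toy, stripWhitespace)                     # collapse repeated spaces

cleaned <- content(toy[[1]])
print(cleaned)
```

Comparing the printed string with the input shows which distinctions the pipeline discards: case, punctuation, digits and frequent function words.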

# Finally, we again create a document-term matrix, keeping only terms that occur in at least two documents.

dtm <- DocumentTermMatrix(corpus, control = list(bounds=list(global = c(2, Inf))))
 
dtm <- as.matrix(dtm)
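The `bounds` option filters terms by document frequency: with `global = c(2, Inf)`, a term is kept only if it appears in at least two documents. The sketch below illustrates this on three made-up documents (not course data); `toy_docs`, `toy_corpus`, `toy_dtm` and `toy_mat` are illustrative names.

```r
library(tm)

# Three invented documents. "care", "visa" and "claim" each occur in only one.
toy_docs <- c("health care appeal",
              "immigration appeal visa",
              "health immigration claim")
toy_corpus <- VCorpus(VectorSource(toy_docs))

# Same control setting as in the lesson: drop terms found in fewer than 2 documents.
toy_dtm <- DocumentTermMatrix(
  toy_corpus,
  control = list(bounds = list(global = c(2, Inf)))
)
toy_mat <- as.matrix(toy_dtm)

dim(toy_mat)                                  # documents x retained terms
colnames(toy_mat)                             # the terms that survived the filter
sort(colSums(toy_mat), decreasing = TRUE)     # total count per retained term
```

On the real data, `dim(dtm)` gives a quick sense of how many terms remain after filtering, which matters because the dense matrix produced by `as.matrix` grows with both the number of documents and the number of terms.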

Last update May 11, 2020.
