We can now turn our treaty texts into data to prepare for the similarity analysis. As in Lesson 5, we want to represent the texts as word frequency counts. This time, however, our focus is not on the terms but on the documents. We therefore create a Document-Term Matrix, which is simply the transpose of a Term-Document Matrix.
We again start by creating a corpus object.
# Load package for text processing.
library(tm)

# We start by creating a corpus from the text.
corpus <- VCorpus(VectorSource(treaty_texts$text))
We then pre-process our treaty texts.
# Again, we get rid of variation that we don't consider conceptually meaningful.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
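Before moving on, it can be helpful to inspect one of the processed documents and confirm that punctuation, numbers, and stop words are really gone (a quick sanity check, assuming the corpus contains at least one text):

```r
# Look at the first processed document to verify the cleaning steps.
as.character(corpus[[1]])
```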
We now want to use our corpus to create a Document-Term Matrix (DTM). In that matrix, every row is a document (here, a treaty) and every column is a term. The values are the frequencies of each term in a document.
# Create a document term matrix.
dtm <- DocumentTermMatrix(corpus)
While the above DTM can be used for further analysis, we often want to refine that matrix. Imagine, for instance, that you want to investigate the content of 100 documents. You are not interested in outliers but want to get a general sense of what is in these documents. In that case you may want to exclude all terms that appear in only one or two documents. You can do that by raising the lower bound of the global entry inside the bounds option, which is passed via the control = list() argument. By default this lower bound is 1, i.e. all terms that appear in at least 1 document are kept.
# For example, we restrict our analysis to all terms appearing in 2 or more documents.
dtm <- DocumentTermMatrix(corpus, control = list(bounds = list(global = c(2, Inf))))
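To see what the restriction actually removed, one could compare the vocabulary size with and without the bounds setting (a sketch; the exact numbers will depend on your corpus):

```r
# Compare the number of terms before and after filtering.
dtm_all <- DocumentTermMatrix(corpus)
ncol(dtm_all)  # all terms
ncol(dtm)      # only terms appearing in 2 or more documents
```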
# Finally, we convert the dtm from its sparse representation to a regular numerical matrix.
dtm <- as.matrix(dtm)
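The matrix is now ready for the similarity analysis. As a preview, the similarity of two treaties can be computed from their rows; here is a minimal sketch of cosine similarity between the first two documents (a manual computation for illustration; the analysis itself may use a dedicated package instead):

```r
# Cosine similarity: the dot product of the two row vectors,
# divided by the product of their Euclidean norms.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity between the first two treaties (1 = identical word profiles).
cosine_sim(dtm[1, ], dtm[2, ])
```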