Above we worked with single words — unigrams. But we can do the same analysis with sequences of words — ngrams. Here we use a sequence of 2 words — bigrams.
# This is our bigram function. R lets you write your own functions, which is what we do here. We can then apply our function within the script just like R's built-in functions.
# ngrams() and words() come from the NLP package, which is loaded with tm.
bigrams <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
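To see what this tokenizer produces, here is a minimal sketch applying the same ngrams()-and-paste steps to a hand-made token vector (the sample words are invented for illustration; in the real pipeline, words() extracts the tokens from each document):

```r
library(NLP)  # provides ngrams()

# A hand-made token vector standing in for the output of words()
toks <- c("the", "court", "of", "appeal")

# ngrams(toks, 2) returns a list of length-2 character vectors;
# paste(collapse = " ") joins each pair into a single string.
unlist(lapply(ngrams(toks, 2), paste, collapse = " "))
# returns "the court" "court of" "of appeal"
```

Each input of n tokens yields n - 1 overlapping bigrams.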
We now create another term-document matrix with one small change: Each row is now a pair of words.
# Creating a term-document matrix with bigrams.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigrams))
tdm <- as.matrix(tdm)
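One practical caveat: bigram matrices grow quickly, since a corpus contains far more distinct word pairs than distinct words. If as.matrix() runs out of memory on a larger corpus, the sparse matrix can be trimmed first. A sketch using tm's removeSparseTerms(), where the 0.99 threshold is an arbitrary choice for illustration:

```r
# Drop bigrams absent from more than 99% of documents before
# converting to a dense matrix (threshold chosen for illustration).
tdm_sparse <- TermDocumentMatrix(corpus, control = list(tokenize = bigrams))
tdm_sparse <- removeSparseTerms(tdm_sparse, 0.99)
tdm <- as.matrix(tdm_sparse)
```

A lower threshold keeps fewer, more common bigrams; a higher one keeps more of the rare pairs.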
Again, we can use our term-document matrix to identify the most frequent pairs of words.
# Sort the TDM to find the 12 most frequent bigrams.
head(sort(rowSums(tdm), decreasing=TRUE), 12)
##              r v          scc scr       scc canlii       canlii scr     court appeal       canlii scc
##              729              631              606              603              597              587
## attorney general         de facto constitution act        jury roll      trial judge british columbia
##              491              398              296              291              280              261
Last update May 11, 2020.