Above we worked with single words, or unigrams. We can run the same analysis on sequences of words, or n-grams. Here we use sequences of two words: bigrams.
# This is our bigram tokenizer. R lets you write your own functions, which you can then apply within a script just like R's built-in functions. Note that ngrams() and words() come from the NLP package, which is loaded along with tm.
bigrams <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
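To see what this tokenizer produces, here is a minimal base-R sketch of the same idea (no NLP package needed, so you can try it on a plain string): split the text on whitespace, then paste each adjacent pair of words.

```r
# Base-R sketch of a bigram tokenizer: split on whitespace,
# then paste each word with the word that follows it.
bigrams_base <- function(x) {
  w <- unlist(strsplit(x, "\\s+"))
  if (length(w) < 2) return(character(0))
  paste(w[-length(w)], w[-1])
}

bigrams_base("the quick brown fox")
# "the quick" "quick brown" "brown fox"
```

The NLP-based version above does the same thing, but uses the corpus-aware words() tokenizer rather than a simple whitespace split.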
We now create another term-document matrix with one small change: each row is now a pair of words rather than a single word.
# Creating a term-document matrix with bigrams.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigrams))
tdm <- as.matrix(tdm)
Again, we can use our Term-Document matrix to identify the most frequent pairs of words.
# Sort the TDM to find the 12 most frequent bigrams.
head(sort(rowSums(tdm), decreasing=TRUE), 12)
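The sort-and-head pattern works on any term-document matrix in matrix form. Here is a toy illustration with a small hand-made matrix (the bigram row names are invented for the example), so you can see the shape of the result without building a corpus first.

```r
# Toy matrix: rows are bigrams, columns are documents.
# rowSums() gives each bigram's total count across documents;
# sort(..., decreasing = TRUE) then ranks bigrams by frequency.
m <- matrix(c(3, 1, 0,
              2, 2, 1,
              0, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("new york", "machine learning", "data set"),
                            c("doc1", "doc2", "doc3")))

head(sort(rowSums(m), decreasing = TRUE), 2)
# a named vector: "machine learning" (5), then "new york" (4)
```

The result is a named numeric vector, so the bigram labels come along with the counts for free.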