Working with Bigrams

Above we worked with single words (unigrams), but we can run the same analysis on sequences of words (n-grams). Here we use sequences of two words: bigrams.

# This is our bigram function. R allows you to write your own functions, which is what we do here. We can then apply our function within the script just like R's built-in functions.


bigrams <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
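
The words() and ngrams() functions used inside bigrams() come from the NLP package, which is loaded along with tm. As a minimal illustration of what the two steps do (the short token vector below is made up for demonstration and is not taken from our corpus):

# Example only: a short, made-up token vector. library(NLP) is not strictly
# needed if tm is already loaded, since tm loads NLP.
library(NLP)

toks <- c("the", "court", "of", "appeal")

# ngrams() returns a list of overlapping two-token chunks ...
ngrams(toks, 2)

# ... and paste(collapse = " ") joins each chunk back into a two-word string.
sapply(ngrams(toks, 2), paste, collapse = " ")

## [1] "the court" "court of"  "of appeal"

Inside the corpus, words() first splits each document into its tokens, and the same two steps then turn those tokens into bigrams.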

We now create another term-document matrix with one small change: each row is now a pair of words.

# Creating a term-document matrix with bigrams.


tdm <- TermDocumentMatrix(corpus, control = list(tokenize = bigrams))

tdm <- as.matrix(tdm)
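
As a quick sanity check (not part of the original analysis), the row names of the new matrix should now be two-word terms rather than single words:

# The rows of the bigram matrix should be two-word terms.
head(rownames(tdm))

# Number of distinct bigrams and number of documents.
dim(tdm)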

Again, we can use our term-document matrix to identify the most frequent pairs of words.

# Sort the TDM and find the 12 most frequent bigrams.
head(sort(rowSums(tdm), decreasing = TRUE), 12)

##              r v          scc scr       scc canlii       canlii scr     court appeal       canlii scc
##              729              631              606              603              597              587
## attorney general         de facto constitution act        jury roll      trial judge british columbia
##              491              398              296              291              280              261
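
If you want to reuse these counts later (for a table or a plot, say), here is a small sketch that stores them in a data frame; top_bigrams and bigram_df are just illustrative names.

# Keep the 12 most frequent bigrams and their counts in a data frame.
top_bigrams <- head(sort(rowSums(tdm), decreasing = TRUE), 12)

bigram_df <- data.frame(bigram = names(top_bigrams),
                        count  = as.integer(top_bigrams),
                        row.names = NULL)
bigram_df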

Last update May 11, 2020.
