Webscraping - Data Science for Lawyers

Webscraping is technically challenging, sometimes impossible, and often prohibited, so this section is only a primer. Some websites provide special interfaces for accessing their data, so-called application programming interfaces (APIs). Rather than doing webscraping yourself, you should ask a computer scientist for help. With that in mind, I still want to give you a taste of what is possible in R. To do webscraping in R, we need to install and load the package “rvest” – note the pun.

# Activate package (if it is not installed yet, run install.packages("rvest") once beforehand)

library("rvest")

In this instance, we are interested in scraping the texts of labor agreements Canada has signed with third parties. Look for Canadian labor agreements online.

# Once we have found the website, copy its url.

url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"

We can then read the website into R using the read_html() command.

website <- read_html(url)
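read_html() also accepts a string of HTML directly, which is handy for experimenting without contacting a live website. The following is a minimal sketch under that assumption; the toy_html snippet is invented for illustration and is not taken from the Canadian site.

```r
library("rvest")

# Hypothetical HTML snippet, invented for illustration only
toy_html <- "<html><body><h1>Agreements</h1><p>Two treaties are listed.</p></body></html>"

# read_html() parses the string into a document object that rvest can search
toy_page <- read_html(toy_html)

# Pull out the text of the heading to confirm that parsing worked
toy_heading <- toy_page %>% html_nodes("h1") %>% html_text()
toy_heading
```

The object returned for a real URL has the same structure, so anything that works on the toy page can be tried on the website object afterwards.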

Websites are written in HTML, a markup language whose tags we can use to locate what we are looking for. Not all websites can be easily scraped. The architecture of a website determines how (and whether) it can be scraped.

Open our target website in its source code (most web browsers have an option that allows you to view a website in source code). Go through that source code. Where do we find the list of labor treaties?

Our target website contains several lists, whose items are marked with the <li> tag in HTML. The labor treaties are located in one of these lists. Within the list items, <a> tags mark links to other pages. We now scrape the link text inside the <a> tags within the list items and manually identify the position of the labor agreements.

# Websites change over time. We are looking for the agreements starting with the North American agreement (the labour side agreement to NAFTA) up to the Honduras-Canada agreement. At the time of this writing, they are at positions 23 to 30 (23:30) of the scraped link names.

treaties_names <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_text()

treaties_names <- treaties_names[23:30]

treaties_names

## [1] "North American Agreement on Labour Cooperation"    "Canada-Chile Agreement on Labour Cooperation"
## [3] "Canada-Costa Rica Agreement on Labour Cooperation" "Canada-Peru Agreement on Labour Cooperation"
## [5] "Canada-Colombia Agreement on Labour Cooperation"   "Canada-Jordan Agreement on Labour Cooperation"
## [7] "Canada-Panama Agreement on Labour Cooperation"     "Canada-Honduras Agreement on Labour Cooperation"
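Because fixed positions such as 23:30 break as soon as the page changes, one alternative is to select entries by their text. The sketch below assumes that the wanted links can be recognized by the phrase "Agreement on Labour Cooperation", which all eight names above share; whether other links on the live page also contain that phrase is not checked here. The toy_html list is invented for illustration.

```r
library("rvest")

# Hypothetical list page, invented for illustration only
toy_html <- "<ul>
  <li><a href='/en/contact.html'>Contact us</a></li>
  <li><a href='/en/chile.html'>Canada-Chile Agreement on Labour Cooperation</a></li>
  <li><a href='/en/peru.html'>Canada-Peru Agreement on Labour Cooperation</a></li>
</ul>"

toy_page <- read_html(toy_html)

# Text of every link that sits inside a list item
toy_names <- toy_page %>% html_nodes("li") %>% html_nodes("a") %>% html_text()

# Keep only the entries whose text contains the phrase, wherever they sit in the list
is_agreement <- grepl("Agreement on Labour Cooperation", toy_names, fixed = TRUE)
toy_agreements <- toy_names[is_agreement]
toy_agreements
```

The same logical vector can be reused to subset the matching hyperlinks, so names and links stay aligned.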

We also want to capture the hyperlink associated with each treaty so that we can go to the corresponding page with the full text and scrape the text.

# The hyperlink associated with each treaty is stored in the "href" attribute of its <a> tag.

treaties_links <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_attr("href")

treaties_links <- treaties_links[23:30]
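The names and the links are read from the same <a> tags, which is why the same positions select matching pairs. A small sketch on an invented snippet (toy_html and its URLs are hypothetical) shows html_text() and html_attr() side by side, including what happens when a tag has no "href" attribute.

```r
library("rvest")

# Hypothetical snippet, invented for illustration only
toy_html <- "<ul>
  <li><a href='/en/chile.html'>Chile</a></li>
  <li><a href='https://example.org/honduras.html'>Honduras</a></li>
  <li><a>No link target</a></li>
</ul>"

# Select the <a> tags once, then read two different things from them
toy_anchors <- read_html(toy_html) %>% html_nodes("li") %>% html_nodes("a")

toy_text <- html_text(toy_anchors)          # the visible link text
toy_href <- html_attr(toy_anchors, "href")  # the link target; NA if the attribute is missing

# Both vectors have one entry per <a> tag, in the same order
data.frame(toy_text, toy_href)
```

A missing value in the link column is a useful warning sign that an entry is not a real hyperlink.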

Next, we want to go to the website behind each of these hyperlinks and scrape the text of the agreements.

# Links 1 to 7 are relative, so we complete them by putting "https://www.canada.ca" in front. To loop over the elements of the treaties_links object, we use the lapply() function. It does the same job as the for-loop we got to know before, but more compactly. In words, we ask R to go through the elements of treaties_links and apply function(x) to each of them. In our case, function(x) calls paste() to put "https://www.canada.ca" in front of the link. The 8th link is already complete, so we restore it in its original form afterwards.



treaties_links_full <- lapply(treaties_links, function(x) paste("https://www.canada.ca", x, sep = ""))

treaties_links_full[8] <- treaties_links[8]
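Hard-coding that the 8th link is the complete one only works while the page stays as it is. A possible alternative, sketched here on invented links (toy_links is hypothetical), is to test each link and add the prefix only where it does not already start with "http://" or "https://".

```r
base_url <- "https://www.canada.ca"

# Hypothetical links, invented for illustration: two relative, one complete
toy_links <- c("/en/chile.html", "/en/peru.html", "https://example.org/honduras.html")

# TRUE for links that already start with http:// or https://
is_complete <- grepl("^https?://", toy_links)

# Leave complete links untouched, put the base URL in front of the others
toy_links_full <- ifelse(is_complete, toy_links, paste0(base_url, toy_links))
toy_links_full
```

This returns a character vector rather than a list; lapply() accepts either, so the next step works the same way.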

Now that we have the proper hyperlinks, we can loop over these treaty URLs and read in the page behind each one.

# We use the read_html() function to download and parse the page behind each link URL. The treaty text itself is extracted in the next step.

treaty_texts <- lapply(treaties_links_full, function(x) (read_html(x)))
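If one of the links is broken, read_html() stops with an error and the whole loop fails. One way to guard against this, sketched below, is to wrap the call in tryCatch(). The helper name read_page_safely is hypothetical, and the example deliberately uses a file name that does not exist so that no website is contacted.

```r
library("rvest")

# Hypothetical helper: returns the parsed page, or NULL if reading fails
read_page_safely <- function(x) {
  tryCatch(
    read_html(x),
    error = function(e) {
      message("Could not read: ", x)
      NULL
    }
  )
}

# Illustration with a local file name that does not exist
toy_result <- read_page_safely("this-file-does-not-exist.html")
is.null(toy_result)
```

Passed to lapply() in place of read_html(), such a helper would leave NULL entries for failed pages, which then need to be removed before the text is extracted.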

The texts of the treaties themselves are contained in paragraph tags (<p>). Here, the results we obtain will not be perfect. The code will capture some additional text that does not belong to the treaties. If we wanted to be exact, we would need to trim the data down to just the treaty texts before using it. For our present purposes, this information is good enough.



treaty_texts <- lapply(treaty_texts, function(x) (x %>% html_nodes('body') %>% html_nodes('p') %>% html_text()))

treaty_texts <- lapply(treaty_texts, function(x) (unlist(x)))

treaty_texts <- lapply(treaty_texts, function(x) paste((x), collapse=' '))

treaty_texts <- unlist(treaty_texts)
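One way to trim away text that does not belong to a treaty is to search only inside the part of the page that holds the main content. The sketch below assumes that such a container exists and can be identified; the id "treaty-text" and the whole toy_html page are invented for illustration, and the real pages would have to be inspected in their source code to find a suitable selector.

```r
library("rvest")

# Hypothetical page, invented for illustration only
toy_html <- "<html><body>
  <p>Skip to main content</p>
  <div id='treaty-text'>
    <p>Article 1: The Parties affirm their obligations.</p>
    <p>Article 2: Each Party shall respect its labour law.</p>
  </div>
  <div id='footer'><p>Date modified</p></div>
</body></html>"

toy_page <- read_html(toy_html)

# All paragraphs in the body: includes navigation and footer text
all_text <- toy_page %>% html_nodes("body") %>% html_nodes("p") %>% html_text() %>% paste(collapse = " ")

# Only the paragraphs inside the (assumed) content container
main_text <- toy_page %>% html_nodes("#treaty-text") %>% html_nodes("p") %>% html_text() %>% paste(collapse = " ")

main_text
```

Comparing all_text with main_text shows how much surrounding material the broader selector picks up.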

Finally, we can combine the text of each agreement with the name we extracted earlier to create a dataframe.

# Storing our webscraped texts in a dataframe allows us to conduct analysis with it later on.

treaty_dataset <- data.frame(treaties_names, treaty_texts, stringsAsFactors = FALSE)
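Before analyzing the scraped texts, it is worth checking that every row actually contains text. A minimal sketch on invented data (the toy_ objects and the column name n_characters are hypothetical) counts the characters per text; a count of zero would suggest that scraping failed for that page.

```r
# Hypothetical names and texts, invented for illustration only
toy_names <- c("Treaty A", "Treaty B")
toy_texts <- c("The Parties agree to cooperate.", "Each Party shall enforce its labour law.")

# stringsAsFactors = FALSE keeps the texts as plain character strings
toy_dataset <- data.frame(toy_names, toy_texts, stringsAsFactors = FALSE)

# Number of characters in each text as a quick sanity check
toy_dataset$n_characters <- nchar(toy_dataset$toy_texts)

toy_dataset[, c("toy_names", "n_characters")]
```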

Last update May 8, 2020.