The life-cycle of any legal data science project starts with getting your data into R.
The great thing about R is that it allows you to upload data from your local machine, and scrape the web for data (that is, processing websites and downloading their content). Webscraping is an art and skill of its own, so today’s lesson will only scratch the surface.
Some websites prohibit webscraping in their terms of service. As such, you should always double check to see if you are allowed to webscrape a page and, if in doubt, it is best to contact the website’s owner. Websites may also make their data available through other means, such as APIs.
As we get into more complicated coding activities, you may encounter errors, i.e. your code just does not work. There are many reasons for this, ranging from typos in your function to forgetting to activate a package. Don't be discouraged though. Often errors are easy to fix. R will give you an error message indicating the source of the error. The most important thing is that you learn from your error messages: they will help you identify the source of the problem and fix it. As we discussed in the first lesson, there is also plenty of help online that will enable you to resolve the issue.
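For instance, calling a function with the wrong type of input produces an error message that points directly at the problem. A small illustration:

```r
# Applying a mathematical function to text triggers an informative error:
log("ten")
# Error in log("ten") : non-numeric argument to mathematical function
```

Reading the message carefully usually tells you which function failed and why.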
There are many ways to upload data into R. Today we will just consider 4 different methods. We will also teach you how to interact with your working directory and how to install packages.
1. Setting a Working Directory
2. Installing Packages
3. Loading and saving csv files
4. Loading text files
5. Reading and uploading pdfs
6. Working with XML data
# Here I created a very simple table in Excel with three countries and some sample data and saved it as myfile.csv. I then load it into R.
myfile <- read.csv("myfile.csv", header = TRUE, sep = ",", row.names = 1)
(The resulting data frame has the three countries as row names and the columns Data 1, Data 2, and Data 3.)
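This section covers loading and saving csv files; saving works symmetrically via write.csv(). A minimal sketch (the output filename here is just an example):

```r
# Save the data frame back to disk; row.names = TRUE keeps the country names.
write.csv(myfile, "myfile_copy.csv", row.names = TRUE)
```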
# Change the below folder path to the path of your target folder. Important: don't forget the little asterisk *. It indicates that you want to import all files in that folder. Also make sure you only store the files you want to import in that folder.
folder <- "~/Google Drive/Teaching/Canada/Legal Data Science/2019/Data/Supreme Court Cases/*"
library(readtext)
scc_texts <- readtext(folder)
Many legal documents are in fact in .pdf and not in .txt format. To upload those we will use the package "pdftools".
# Activate package
library(pdftools)
It is important to note that only pdfs with embedded digital texts can be uploaded. Scanned images of a text first need to undergo OCR - Optical Character Recognition. Today, we will work with the Universal Declaration of Human Rights as an example.
# Download a .pdf version of the Universal Declaration of Human Rights directly into R from the internet.
human_rights <- pdf_text("https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf")
The pdf_text() function converts each page into an element of a character vector. The Universal Declaration is 8 pages long. It has thus been converted into a character vector with 8 elements.
# If we want to look at page 5, simply specify the number of that page.
human_rights[5]
##  " 1. Men and women of full age, without any limitation due to race, nationality\r\n or religion, have the right to marry and to found a family. They are entitled\r\n to equal rights as to marriage, during marriage and at its dissolution.\r\n 2. Marriage shall be entered into only with the free and full consent of the\r\n intending spouses.\r\n 3. The family is the natural and fundamental group unit of society and is\r\n entitled to protection by society and the State.\r\nArticle 17\r\n 1. Everyone has the right to own property alone as well as in association with\r\n others.\r\n 2. No one shall be arbitrarily deprived of his property.\r\nArticle 18\r\nEveryone has the right to freedom of thought, conscience and religion; this right\r\nincludes freedom to change his religion or belief, and freedom, either alone or in\r\ncommunity with others and in public or private, to manifest his religion or belief in\r\nteaching, practice, worship and observance.\r\nArticle 19\r\nEveryone has the right to freedom of opinion and expression; this right includes\r\nfreedom to hold opinions without interference and to seek, receive and impart\r\ninformation and ideas through any media and regardless of frontiers.\r\nArticle 20\r\n 1. Everyone has the right to freedom of peaceful assembly and association.\r\n 2. No one may be compelled to belong to an association.\r\nArticle 21\r\n"
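If you prefer the declaration as one continuous text rather than one element per page, you can collapse the pages and clean up the line breaks. A sketch:

```r
# Collapse the 8 pages into a single string and replace line breaks with spaces.
human_rights_full <- paste(human_rights, collapse = " ")
human_rights_full <- gsub("\r\n", " ", human_rights_full)
```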
Webscraping is technically challenging, sometimes impossible, and often prohibited, so this section is only a primer. Some websites offer special interfaces, so-called application programming interfaces (APIs), through which you can access their data. Rather than webscraping yourself, you can also ask a computer scientist for help. With that in mind, I still want to give you a taste of what is possible in R. To do webscraping in R, we need to install and load the package "rvest" - note the pun.
# Activate package
library(rvest)
In this instance, we are interested in scraping the texts of labor agreements Canada has signed with third parties. Look for Canadian labor agreements online.
# Once we have found the website, copy its url.
url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"
We can then read the website into R using the read_html() command.

website <- read_html(url)
Websites are written in the html language. That is a mark up language that we can use to locate what we are looking for. Not all websites can be easily scraped. The website architecture determines how (and whether at all) it is possible to scrape it.
Open the source code of our target website (most web browsers have an option that allows you to view a page's source code). Go through that source code: where do we find the list of labor treaties?
Our target website contains several lists (<li>) in html. The labor treaties are located in one of these lists. Within these lists, items carry <a> tags that link to other pages. We now scrape the names behind the <a> tags within the list items and manually identify the location of the labor agreements.
# Websites change over time. We are looking for the agreements starting with NAFTA up to the Honduras-Canada agreement. At the time of this writing, they are list numbers 23:30.
treaties_names <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_text()
treaties_names <- treaties_names[23:30]
We also want to capture the hyperlink associated with each treaty so that we can go to the corresponding page with the full text and scrape the text.
# The hyperlink associated with each treaty is stored in the "href" attribute.
treaties_links <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_attr("href")
Next, we want to go to the website behind each of these hyperlinks and scrape the text of the agreements.
# Links 1 to 7 are not complete, so we add "https://www.canada.ca". Since we are dealing with a list, we need to loop over its elements. To loop over the elements in the treaties_links object, we use the lapply() function. lapply() is similar to the for-loop we got to know before, but more concise. In words, we ask R to go through the elements of treaties_links and apply a function(x) to each element of that list. In our case, function(x) uses paste0() to add "https://www.canada.ca" in front of every link that is not already complete (the 8th one already starts with "https").
treaties_links_full <- lapply(treaties_links, function(x) if (grepl("^http", x)) x else paste0("https://www.canada.ca", x))
Now that we have the proper hyperlinks, we can loop over these treaty link urls and extract the full text for each treaty.
# We use the read_html() function to read the page behind each treaty link url.
treaty_texts <- lapply(treaties_links_full, function(x) (read_html(x)))
The texts of the treaties themselves are contained in the paragraph (<p>) tag. The results we obtain here will not be perfect: the code also captures some additional text that does not belong to the treaties. If we wanted to be exact, we would need to trim that data down to just the treaty texts before using it. For our present purposes, this information is good enough.
treaty_texts <- lapply(treaty_texts, function(x) (x %>% html_nodes('body') %>% html_nodes('p') %>% html_text()))
treaty_texts <- lapply(treaty_texts, function(x) (unlist(x)))
treaty_texts <- lapply(treaty_texts, function(x) paste((x), collapse=' '))
treaty_texts <- unlist(treaty_texts)
Finally, we can combine the text of each agreement with the name we extracted earlier to create a dataframe.
# Storing our webscraped texts in a dataframe allows us to conduct analysis with it later on.
treaty_dataset <- data.frame(treaties_names,treaty_texts)
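Since scraping is slow and websites change over time, it can be worth saving the scraped data frame to disk right away (the filename is just an example):

```r
# Save the scraped treaties so the scraping does not have to be repeated.
write.csv(treaty_dataset, "treaty_dataset.csv", row.names = FALSE)
```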
XML, similar to HTML, is another way of storing text information, while also providing the possibility to annotate the text. The combination of text and annotation tags makes XML one of the best data formats for legal data analysis.
XML data can store meta-data, that is data about the text, directly in the document. For instance, the XML versions of Canadian laws and regulations contain information about the entry into force of the law or regulation, the date of its last amendment, its enabling statutory authority if applicable, and the like. Court data in XML typically provides information on the identity of litigants, the date of the judgment, the identity of the judge, and other useful information.
XML data also provides for annotated full text data. XML data of Canadian regulations, for instance, distinguishes article headers from article text, "definition" clauses from substantive clauses, and identifies cross-references to other laws. This facilitates the segmentation of the text and makes it easier to extract information.
The only downside of the XML format is that it needs to be parsed (similar to what we did for HTML) in order to extract such information. The code below illustrates how one can parse the XML of NAFTA - the North American Free Trade Agreement - an international agreement contained in the Text of Trade Agreements (ToTA) dataset.
# Load libraries
library(xml2)
We start by reading the NAFTA XML into R. If you were to work with the entire ToTA set, instead of just one agreement, you would write a for-loop that repeats the code below for all the urls that lead to ToTA texts.
# Download XML of NAFTA, which is PTA number 112 in ToTA.
tota_xml <- read_xml("https://raw.githubusercontent.com/mappingtreaties/tota/master/xml/pta_112.xml", options = c("RECOVER", "NOCDATA", "NOENT", "NOBLANKS", "BIG_LINES"), encoding = "UTF-8")
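As mentioned above, extending this to the full ToTA dataset means repeating the download for every agreement. A sketch, assuming the files follow the pta_<number>.xml naming pattern; the range of valid numbers used here is an assumption you would need to check against the ToTA repository:

```r
# Hypothetical loop over the first 10 ToTA agreements; adjust the range as needed.
tota_urls <- sprintf("https://raw.githubusercontent.com/mappingtreaties/tota/master/xml/pta_%d.xml", 1:10)
tota_xmls <- lapply(tota_urls, function(x) read_xml(x, options = c("RECOVER", "NOCDATA", "NOENT", "NOBLANKS", "BIG_LINES"), encoding = "UTF-8"))
```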
Next we parse the meta-data of the NAFTA XML.
# Extract agreement name
tota_name <- as.character(xml_find_all(tota_xml, "/treaty/meta/name/text()"))
##  "North American Free Trade Agreement (NAFTA)"
# Extract agreement type
tota_type <- as.character(xml_find_all(tota_xml, "/treaty/meta/type/text()"))
print(tota_type)
##  "Free Trade Agreement & Economic Integration Agreement"
# Extract agreement parties
tota_parties <- as.character(xml_find_all(tota_xml, "/treaty/meta/parties_original/partyisocode/text()"))
tota_parties <- paste0(tota_parties, collapse = ",")
##  "CAN,MEX,USA"
# Extract agreement year of signature
tota_date_signature <- as.character(xml_find_all(tota_xml, "/treaty/meta/date_signed/text()"))
##  "1992-12-17"
Finally, we can extract the full text of the NAFTA agreement from the XML.
# Extract full text
full_text <- xml_text(xml_find_all(tota_xml, "/treaty/body"))
If you want to continue working with this information, it would make sense to combine the meta-data and full text into a row of a dataframe, as we did for the webscraped data.
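A minimal sketch of that final step, using the objects created above (the column names are just suggestions):

```r
# Combine the extracted meta-data and full text into a one-row data frame.
nafta_dataset <- data.frame(name = tota_name,
                            type = tota_type,
                            parties = tota_parties,
                            date_signed = tota_date_signature,
                            text = full_text)
```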