The life-cycle of any legal data science project starts with getting your data into R.
The great thing about R is that it allows you not only to load data from your local machine, but also to scrape the web for data (that is, processing websites and downloading their content). Webscraping is an art and skill of its own, so we will only be able to scratch its surface in this lesson.
Some websites specifically prohibit webscraping in their terms of service. Always double check whether you are actually allowed to webscrape a page and, if in doubt, best contact the website owner. Websites may also make their data available through other means, such as APIs.
As we get into more complicated coding activities, you may encounter errors, i.e. your code is just not working. This may be due to many reasons from typos in your function to having forgotten to activate a package. Don’t be discouraged. Often errors can be fixed easily. R will give you an error message indicating the source of the error. The most important thing is that you learn from your error messages. They will help you identify the source of the problem and will help you fix it. As we discussed in the first lesson, there also is plenty of help online that will enable you to resolve the issue.
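A very common error early on is calling a function from a package that has not been activated yet. The short sketch below provokes such an error on purpose (the function name is made up) and uses tryCatch() to capture the message R would print:

```r
# Calling a function R does not know (here: a made-up name) triggers an
# error. tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(some_undefined_function(1),
                error = function(e) conditionMessage(e))
msg
```

Reading the message ("could not find function ...") immediately points you to the fix: activate the package that provides the function with library().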
There are many ways to load data into R. Today we will consider just four different methods. In addition, we will teach you to interact with your working directory and to install packages.
1. Setting a Working Directory
2. Installing Packages
3. Loading and saving csv files
4. Loading text files
5. Reading and loading pdfs
6. Working with XML data
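The first two items take only a couple of lines. A minimal sketch (the folder path is a placeholder; here we use a temporary folder so the example runs on any machine):

```r
# Inspect the current working directory
old_wd <- getwd()

# Point R to a folder of your choice with setwd(); tempdir() is just a
# stand-in path so the example runs anywhere
setwd(tempdir())
getwd()

# Restore the previous working directory
setwd(old_wd)

# Packages are installed once with install.packages() and then activated
# in every new session with library()
# install.packages("readtext")
# library(readtext)
```

Remember the two-step logic: install.packages() downloads a package once, while library() must be run in every new R session.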
# Here I created a very simple table in Excel with three countries and some sample data and saved it as myfile.csv. I then load it into R.
myfile <- read.csv("myfile.csv", header = TRUE, sep = ",", row.names = 1)
## (output: a table with columns Data 1, Data 2, Data 3 and one row per country)
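To check that the import worked, it helps to inspect the result. Below is a self-contained sketch that builds a small table like the one described above (the country names and values are invented), saves it as a csv file and reads it back in:

```r
# Build a small example table (invented data), save it and re-import it
myfile <- data.frame(Data.1 = c(1, 2, 3),
                     Data.2 = c(4, 5, 6),
                     Data.3 = c(7, 8, 9),
                     row.names = c("Canada", "Mexico", "France"))
write.csv(myfile, "myfile.csv")
myfile <- read.csv("myfile.csv", header = TRUE, sep = ",", row.names = 1)

# Inspect the result: first rows and structure
head(myfile)
str(myfile)
```

head() shows the first rows and str() the structure of the object, so you can quickly spot import problems such as misread separators or column types.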
# Activate the readtext package and change the below folder path to the path of your target folder. Important: don't forget the little asterisk *. It indicates that you want to import all files in that folder. Also make sure you only store the files you want to import in that folder.
library(readtext)
folder <- "~/Google Drive/Teaching/Canada/Legal Data Science/2019/Data/Supreme Court Cases/*"
scc_texts <- readtext(folder)
# Activate package
library(pdftools)
# Download a .pdf version of the Universal Declaration of Human Rights directly into R from the internet.
human_rights <- pdf_text("https://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf")
# If we want to look at page 5, we simply specify the number of that page. pdf_text() returns a character vector with one element per page.
human_rights[5]
##  " 1. Men and women of full age, without any limitation due to race, nationality\r\n or religion, have the right to marry and to found a family. They are entitled\r\n to equal rights as to marriage, during marriage and at its dissolution.\r\n 2. Marriage shall be entered into only with the free and full consent of the\r\n intending spouses.\r\n 3. The family is the natural and fundamental group unit of society and is\r\n entitled to protection by society and the State.\r\nArticle 17\r\n 1. Everyone has the right to own property alone as well as in association with\r\n others.\r\n 2. No one shall be arbitrarily deprived of his property.\r\nArticle 18\r\nEveryone has the right to freedom of thought, conscience and religion; this right\r\nincludes freedom to change his religion or belief, and freedom, either alone or in\r\ncommunity with others and in public or private, to manifest his religion or belief in\r\nteaching, practice, worship and observance.\r\nArticle 19\r\nEveryone has the right to freedom of opinion and expression; this right includes\r\nfreedom to hold opinions without interference and to seek, receive and impart\r\ninformation and ideas through any media and regardless of frontiers.\r\nArticle 20\r\n 1. Everyone has the right to freedom of peaceful assembly and association.\r\n 2. No one may be compelled to belong to an association.\r\nArticle 21\r\n"
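The extracted text still contains the Windows-style line breaks ("\r\n") you see above. A small clean-up step, sketched here on a short snippet, removes them:

```r
# Replace line breaks with spaces and collapse repeated whitespace
# (a sketch; adjust the clean-up to your needs)
snippet <- "Article 19\r\nEveryone has the right to freedom of opinion and expression;\r\n this right includes\r\nfreedom to hold opinions without interference."
snippet_clean <- gsub("\\s+", " ", gsub("\r\n", " ", snippet, fixed = TRUE))
snippet_clean
```

The same two gsub() calls can be applied to every page of the human_rights object at once, since gsub() is vectorised.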
# Activate package
library(rvest)
# Once we have found the website, copy its url and read the page into R.
url <- "https://www.canada.ca/en/employment-social-development/services/labour-relations/international/agreements.html"
website <- read_html(url)
# Websites change over time. We are looking for the agreements starting with NAFTA up to the Honduras-Canada agreement. At the time of this writing, they are list elements 23 to 30.
treaties_names <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_text()
treaties_names <- treaties_names[23:30]
# The hyperlink associated with each treaty is stored in the "href" attribute.
treaties_links <- website %>% html_nodes("li") %>% html_nodes("a") %>% html_attr("href")
# Links 1 to 7 are relative, so we need to add "https://www.canada.ca" in front of them. Since we are dealing with another list, we again need a loop. This time we use the lapply() function, which is similar to the for-loop we got to know before, but more concise. In words, we ask R to go over the elements in treaties_links and perform the function (x) on each element of that list. In our case, function(x) uses the paste() function to add "https://www.canada.ca" to a link. The 8th link is already complete, so we only add the prefix to links that do not yet start with "http".
treaties_links_full <- lapply(treaties_links, function(x) if (startsWith(x, "http")) x else paste("https://www.canada.ca", x, sep = ""))
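To see what lapply() buys us, compare it to an explicit for-loop on a toy example (the links are made up for illustration):

```r
links <- c("/en/treaty-a.html", "/en/treaty-b.html")

# for-loop version: build the result element by element
links_full_loop <- character(length(links))
for (i in seq_along(links)) {
  links_full_loop[i] <- paste("https://www.canada.ca", links[i], sep = "")
}

# lapply() version: apply the same function to every element in one call
links_full <- unlist(lapply(links, function(x) paste("https://www.canada.ca", x, sep = "")))
```

Both versions produce identical results; lapply() simply removes the bookkeeping of index variables and pre-allocated output vectors.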
# We use the read_html() function to download the page behind each link, and then extract the full text of each treaty from the paragraph (p) nodes.
treaty_texts <- lapply(treaties_links_full, function(x) (read_html(x)))
treaty_texts <- lapply(treaty_texts, function(x) (x %>% html_nodes('body') %>% html_nodes('p') %>% html_text()))
treaty_texts <- lapply(treaty_texts, function(x) (unlist(x)))
treaty_texts <- lapply(treaty_texts, function(x) paste((x), collapse=' '))
treaty_texts <- unlist(treaty_texts)
# Storing our webscraped texts in a dataframe allows us to conduct analysis with it later on.
treaty_dataset <- data.frame(treaties_names,treaty_texts)
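To keep the scraped texts for later sessions, you can save the dataframe to disk with write.csv(). Below is a sketch with toy data standing in for the scraped treaties_names and treaty_texts vectors (the file name is illustrative):

```r
treaties_names <- c("Treaty A", "Treaty B")   # toy stand-ins for the scraped names
treaty_texts <- c("Full text of treaty A ...", "Full text of treaty B ...")
treaty_dataset <- data.frame(treaties_names, treaty_texts, stringsAsFactors = FALSE)

# Save to disk and re-import in a later session
write.csv(treaty_dataset, "treaty_dataset.csv", row.names = FALSE)
treaty_dataset2 <- read.csv("treaty_dataset.csv", stringsAsFactors = FALSE)
```

Setting stringsAsFactors = FALSE keeps the texts as plain character vectors, which is what most text analysis functions expect.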
XML, similar to HTML, is another way of storing text information while also providing the possibility to annotate the text. The combination of text and annotation tags makes XML one of the best data formats for legal data analysis.
On the one hand, it allows storing meta-data, that is, data about the text, directly in the document. For instance, the XML versions of Canadian laws and regulations contain information about the entry into force of the law or regulation, the date of its last amendment, its enabling statutory authority if applicable, and the like. Court data in XML typically provides information on the identity of litigants, the date of the judgment, the identity of the judge, etc.
On the other hand, it provides for annotated full text data. XML data of Canadian regulations, for instance, distinguishes article headers from article text, "definition" clauses from substantive clauses, and identifies cross-references to other laws. This allows easy text segmentation and information extraction.
The only downside of XML is that it needs to be parsed (similar to what we did for HTML) in order to extract such information. The code below illustrates how one can parse the XML of NAFTA - the North American Free Trade Agreement - an international agreement contained in the Text of Trade Agreements (ToTA) dataset.
# Load libraries
library(xml2)
We start by reading the NAFTA XML into R. If you were to work with the entire ToTA set, instead of just one agreement, you would write a for-loop that repeats the below code for all urls to ToTA texts.
# Download XML of NAFTA, which is PTA number 112 in ToTA.
tota_xml <- read_xml("https://raw.githubusercontent.com/mappingtreaties/tota/master/xml/pta_112.xml", options = c("RECOVER", "NOCDATA", "NOENT", "NOBLANKS", "BIG_LINES"), encoding = "UTF-8")
Next we parse the meta-data of the NAFTA XML.
# Extract agreement name
tota_name <- as.character(xml_find_all(tota_xml, "/treaty/meta/name/text()"))
##  "North American Free Trade Agreement (NAFTA)"
# Extract agreement type
tota_type <- as.character(xml_find_all(tota_xml, "/treaty/meta/type/text()"))
print(tota_type)
##  "Free Trade Agreement & Economic Integration Agreement"
# Extract agreement parties
tota_parties <- as.character(xml_find_all(tota_xml, "/treaty/meta/parties_original/partyisocode/text()"))
tota_parties <- paste0(tota_parties, collapse = ",")
##  "CAN,MEX,USA"
# Extract agreement year of signature
tota_date_signature <- as.character(xml_find_all(tota_xml, "/treaty/meta/date_signed/text()"))
##  "1992-12-17"
Finally, we can extract the full text of the NAFTA agreement from the XML.
# Extract full text
full_text <- xml_text(xml_find_all(tota_xml, "/treaty/body"))
If you want to continue working with this information, it would make sense to combine the meta-data and full text into a row of a dataframe, like we did for the webscraped treaties.
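A sketch of such a combination, using the values printed above (the full text is replaced by a placeholder string, since the real one is extracted from the XML):

```r
tota_name <- "North American Free Trade Agreement (NAFTA)"
tota_type <- "Free Trade Agreement & Economic Integration Agreement"
tota_parties <- "CAN,MEX,USA"
tota_date_signature <- "1992-12-17"
full_text <- "placeholder for the full treaty text"

# One row per treaty: meta-data columns plus the full text
tota_dataset <- data.frame(name = tota_name,
                           type = tota_type,
                           parties = tota_parties,
                           signed = tota_date_signature,
                           text = full_text,
                           stringsAsFactors = FALSE)
```

In a real analysis the tota_* variables already hold the parsed values, so only the data.frame() call is needed; looping it over all ToTA agreements yields one row per treaty.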