Preparing Metadata

To make sense of similarity patterns in legal documents, it is essential to start by extracting metadata. Metadata is data that describes a document; it often includes dates, sources, and information about the content.

For today’s lesson, we will be working with Canadian labor agreements. Please download them and load them into R.

# Load readtext package.

library(readtext)

# Set the path to the folder that contains the treaties (adjust it to your own machine).

folder <- "~/Google Drive/Teaching/Canada/Legal Data Science/Labor Agreements/EN/*"

# Read the texts from that target folder into R.

treaty_texts <- readtext(folder)

To enrich our subsequent analysis, we want to extract metadata from the treaties. Take a look at the treaty file names. They already contain vital metadata, such as the parties and the year of signature. We can thus use regular expressions (Lesson 3) to extract that information from the file names.
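To see how a regular expression can pull both pieces out of a single file name, here is a minimal sketch. It assumes the file names follow the pattern "CDA-<Partner>_<Year>.txt"; the name "CDA-Chile_1997.txt" and the object names example_name, example_partner and example_year are made up for illustration, so check them against your own files.

```r
# Hypothetical file name following the assumed pattern "CDA-<Partner>_<Year>.txt".
example_name <- "CDA-Chile_1997.txt"

# Two capture groups: the partner (between "CDA-" and "_") and a four-digit year.
pattern <- "^CDA-(.+)_([0-9]{4})\\.txt$"

# regexec() locates the full match and the groups; regmatches() returns them as text.
parts <- regmatches(example_name, regexec(pattern, example_name))[[1]]

example_partner <- parts[2]            # first capture group
example_year <- as.numeric(parts[3])   # second capture group
```

The first element of parts is the full match, which is why the partner and year sit in positions 2 and 3.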

# We start by creating two empty objects to which we will add the meta information.

treaty_partner <- character()
treaty_year <- numeric()

Next we loop through all treaty names and extract partner state and year.

# We extract partner and year from each file name and then add it to the empty objects before proceeding to the next file name.

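for (treaty_name in treaty_texts$doc_id) {
  # Strip the ".txt" extension and the "CDA-" prefix. The dot is escaped so that it matches a literal period.
  treaty_name <- gsub("\\.txt|CDA-", "", treaty_name)
  # Split the remaining name at the underscore into partner and year.
  name_parts <- strsplit(treaty_name, "_")[[1]]
  partner <- name_parts[1]
  treaty_partner <- c(treaty_partner, partner)
  year <- as.numeric(name_parts[2])
  treaty_year <- c(treaty_year, year)
}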
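The same extraction can also be written without a loop, because gsub() and strsplit() work on whole vectors. The following is only a sketch of that alternative, not the lesson's method: the vector doc_ids holds made-up file names standing in for treaty_texts$doc_id, and the names clean_names, name_parts, partners and years are hypothetical.

```r
# Hypothetical file names standing in for treaty_texts$doc_id.
doc_ids <- c("CDA-Chile_1997.txt", "CDA-Peru_2008.txt", "CDA-Jordan_2009.txt")

# Strip the extension and the "CDA-" prefix from every name at once.
clean_names <- gsub("\\.txt|CDA-", "", doc_ids)

# Split each cleaned name at the underscore; the result is a list of character vectors.
name_parts <- strsplit(clean_names, "_")

# Take the first piece (partner) and the second piece (year) of every list element.
partners <- vapply(name_parts, function(p) p[1], character(1))
years <- as.numeric(vapply(name_parts, function(p) p[2], character(1)))
```

For a few dozen treaties the loop and the vectorized version behave the same; the vectorized form mainly avoids growing the vectors one element at a time.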

Then we attach the two vectors that we filled with metadata to the original dataset.

treaty_texts <- cbind(treaty_texts,treaty_partner)
treaty_texts <- cbind(treaty_texts,treaty_year)

Now we have a data frame that contains not only the full text but also helpful metadata. For example, we can now organize the treaties by the year they were signed.

# Order by treaty year.
treaty_texts <- treaty_texts[order(treaty_texts$treaty_year),] 

Last update May 11, 2020.
