Working with XML Data - Data Science for Lawyers

Like HTML, XML is a way of storing text that also lets you annotate that text. The combination of text and annotation tags makes XML one of the best data formats for legal data analysis.

XML data can store meta-data, that is, data about the text, directly in the document. For instance, the XML versions of Canadian laws and regulations contain information about the entry into force of the law or regulation, the date of its last amendment, its enabling statutory authority if applicable, and the like. Court data in XML typically provides information on the identity of the litigants, the date of the judgment, the identity of the judge, and other useful information.

XML data also provides annotated full-text data. XML data of Canadian regulations, for instance, distinguishes article headers from article text and “definition” clauses from substantive clauses, and it identifies cross-references to other laws. This facilitates the segmentation of the text and makes it easier to extract information.
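To make this concrete, here is a minimal sketch on a hand-written XML snippet. The tag names (`regulation`, `meta`, `in_force`, `heading`, `text`, `xref`) and the contents are invented for illustration and are not the actual schema of the Canadian regulations; only the `xml2` functions are real.

```r
library("xml2")

# Hypothetical, hand-written XML snippet (not a real schema or a real law)
toy_xml <- read_xml(paste0(
  "<regulation>",
  "<meta><in_force>2001-01-01</in_force></meta>",
  "<article>",
  "<heading>Definitions</heading>",
  "<text>The term vessel has the meaning given in the <xref>Shipping Act</xref>.</text>",
  "</article>",
  "</regulation>"
))

# Meta-data: date of entry into force
in_force <- xml_text(xml_find_first(toy_xml, "/regulation/meta/in_force"))

# Annotated text: headings are tagged separately from the article text
headings <- xml_text(xml_find_all(toy_xml, "//article/heading"))

# Cross-references to other laws are tagged inside the article text
xrefs <- xml_text(xml_find_all(toy_xml, "//xref"))
```

Because each kind of information sits under its own tag, one XPath expression per tag is enough to pull it out.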

The only downside of the XML format is that it needs to be parsed (similar to what we did for HTML) in order to extract such information. The code below illustrates how one can parse the XML of NAFTA – the North American Free Trade Agreement – an international agreement contained in the Text of Trade Agreements (ToTA) dataset.


# Load libraries
library("xml2")
library("rvest")

We start by reading the NAFTA XML into R. If you were to work with the entire ToTA set instead of just one agreement, you would write a for-loop that repeats the code below for all the URLs that lead to ToTA texts.


# Download XML of NAFTA, which is PTA number 112 in ToTA.



tota_xml <- read_xml("https://raw.githubusercontent.com/mappingtreaties/tota/master/xml/pta_112.xml", options = c("RECOVER", "NOCDATA", "NOENT", "NOBLANKS", "BIG_LINES"), encoding = "UTF-8")
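For the multi-agreement case mentioned above, a loop might look like the following sketch. It assumes that the other ToTA files follow the same `pta_<number>.xml` naming pattern as the NAFTA file; `tota_url()` and `read_tota()` are hypothetical helper names, not part of any package.

```r
library("xml2")

# Assumption: every ToTA file is named like pta_112.xml in the same folder
tota_url <- function(id) {
  sprintf("https://raw.githubusercontent.com/mappingtreaties/tota/master/xml/pta_%d.xml", id)
}

# Hypothetical helper: download and parse several agreements in a for-loop
read_tota <- function(ids) {
  docs <- vector("list", length(ids))
  names(docs) <- as.character(ids)
  for (i in seq_along(ids)) {
    # tryCatch keeps the loop going if one file is missing or malformed
    doc <- tryCatch(
      read_xml(tota_url(ids[i]), encoding = "UTF-8"),
      error = function(e) NULL
    )
    # Failed downloads simply leave a NULL entry in the list
    if (!is.null(doc)) docs[[i]] <- doc
  }
  docs
}
```

Calling `read_tota(112)` would return a list with the parsed NAFTA document; this needs an internet connection, so the sketch only defines the functions.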

Next we parse the meta-data of the NAFTA XML.


# Extract agreement name


tota_name <- as.character(xml_find_all(tota_xml, "/treaty/meta/name/text()"))


print(tota_name)

 ## [1] "North American Free Trade Agreement (NAFTA)"


# Extract agreement type


tota_type <- as.character(xml_find_all(tota_xml, "/treaty/meta/type/text()"))
print(tota_type)

 ## [1] "Free Trade Agreement & Economic Integration Agreement"


# Extract agreement parties


tota_parties <- as.character(xml_find_all(tota_xml, "/treaty/meta/parties_original/partyisocode/text()"))


tota_parties <- paste0(tota_parties, collapse = ",")


print(tota_parties)

 ## [1] "CAN,MEX,USA"


# Extract agreement date of signature


tota_date_signature <- as.character(xml_find_all(tota_xml, "/treaty/meta/date_signed/text()"))


print(tota_date_signature)

 ## [1] "1992-12-17"
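If only the year of signature is needed, the date string can be converted with base R. This is a small sketch that uses the value printed above as a literal; the variable names are illustrative.

```r
# Date of signature as printed above
date_signed <- "1992-12-17"

# Convert to a Date object and keep only the year
signature_year <- format(as.Date(date_signed), "%Y")
```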

Finally, we can extract the full text of the NAFTA agreement from the XML.


# Extract full text


full_text <- xml_text(xml_find_all(tota_xml, "/treaty/body"))

If you want to continue working with this information, it would make sense to combine the meta-data and full text into one row of a data frame, like we did for the web scraping.
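One possible way to build such a row is sketched below. To keep the example self-contained, it uses a miniature stand-in document with the same element paths as the XPath expressions above and a placeholder body, rather than the real NAFTA file. The helper name `treaty_to_row()` and the column names are assumptions made for illustration.

```r
library("xml2")

# Miniature stand-in for a ToTA file; the body text is a placeholder
mini_xml <- read_xml(paste0(
  "<treaty><meta>",
  "<name>North American Free Trade Agreement (NAFTA)</name>",
  "<type>Free Trade Agreement &amp; Economic Integration Agreement</type>",
  "<parties_original>",
  "<partyisocode>CAN</partyisocode>",
  "<partyisocode>MEX</partyisocode>",
  "<partyisocode>USA</partyisocode>",
  "</parties_original>",
  "<date_signed>1992-12-17</date_signed>",
  "</meta><body>Placeholder treaty text.</body></treaty>"
))

# Hypothetical helper: turn one parsed agreement into a one-row data frame
treaty_to_row <- function(doc) {
  # Text of the first node that matches an XPath expression
  first_text <- function(xpath) xml_text(xml_find_first(doc, xpath))
  parties <- xml_text(
    xml_find_all(doc, "/treaty/meta/parties_original/partyisocode")
  )
  data.frame(
    name = first_text("/treaty/meta/name"),
    type = first_text("/treaty/meta/type"),
    parties = paste0(parties, collapse = ","),
    date_signed = first_text("/treaty/meta/date_signed"),
    full_text = first_text("/treaty/body"),
    stringsAsFactors = FALSE
  )
}

treaty_row <- treaty_to_row(mini_xml)
```

Rows produced this way for several agreements can be stacked with `rbind()` to get one data frame for the whole set.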

Last update May 8, 2020.