Lesson 5

Dictionary Analysis

Introduction

We now turn to the first of several lessons that treat text as data.

Benefits and shortcomings of text-as-data analysis

Treating text as data has one important advantage: scalability. Reading through cases, contracts and treaties takes time, whereas text-as-data techniques allow lawyers to process large volumes of legal material in seconds.

Unfortunately, any conversion from text to data comes at the cost of losing semantic information. The meaning of words is partially derived from their context and part of that semantic context is inevitably lost when text is treated as data. Two consequences flow from this.

First, reading text and treating text as data are not substitutes; they are complements. A text-as-data analysis provides a bird’s-eye view of the law by processing large amounts of information, but it does not provide a detailed understanding. Reading text, on the other hand, provides detail and interpretive context that text-as-data analysis cannot match, but it is time-consuming and thus necessarily restricted to a comparatively small number of selected documents.

Second, all text-as-data methods work with imperfect representations of text. The utility of a text-as-data analysis should not be evaluated by how accurately it reflects the text, but rather by how useful its insights about the text are. As we will see, usefulness often does not require semantic perfection.

Matching text-as-data methods with specific tasks

Since text-as-data methods are evaluated based on their usefulness and not on how accurately they represent semantic nuances, the selection of a text-as-data method depends on the task at hand. In this lesson, we will primarily work with term-frequency representations of text: we count the number of times a word appears in a document. As we will see, this simple representation of text is surprisingly useful for many tasks, but not for all. Sometimes we are only interested in particular words; sometimes the grammatical form or the semantic context of words is crucial. There are many ways to represent and work with text. Which method is best depends on what exactly you want to do. We will cover a few different applications in this and the next lesson to guide you.
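To make the term-frequency idea concrete, here is a minimal sketch in Python. It is an illustration only, not the lesson's own code: the function name `term_frequencies`, the simple letters-only tokenization and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Count how often each word occurs in a text.

    Assumption: a "word" is any run of the letters a-z after lowercasing.
    Real pre-processing is usually more careful than this.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Invented sample sentence, used only for illustration.
doc = "The treaty binds the parties. The parties shall respect the treaty."
tf = term_frequencies(doc)

print(tf["treaty"])   # number of times "treaty" appears
print(tf["court"])    # words that never appear have a count of 0
```

Note that the word order of the sentence is discarded: the counts record that "treaty" and "parties" occur, but not who binds whom. This is the loss of semantic context described above.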

What we will do in this lesson

We will represent text as word frequencies to visualize and analyze the content and sentiment of legal texts.

  1. Creation of a Text Corpus and Text Pre-processing
  2. Creating a Term-document Matrix
  3. Visualizing Word Frequency
  4. Working with Bigrams
  5. Dictionary approach I: Term mapping
  6. Dictionary approach II: Sentiment Analysis
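As a rough preview of how some of these steps fit together, the sketch below builds a tiny term-document matrix and applies a dictionary to it. Everything in it is hypothetical: the two toy documents, their names and the word lists `positive` and `negative` are invented for illustration, and a real analysis would rely on a proper corpus and a published dictionary.

```python
import re
from collections import Counter

# Toy corpus: document names and texts are invented for illustration.
corpus = {
    "award_a": "The tribunal finds a serious breach and a clear violation of the treaty.",
    "award_b": "The tribunal finds the measure fair and reasonable. No breach occurred.",
}

# Hypothetical sentiment dictionary (not a published lexicon).
positive = {"fair", "reasonable"}
negative = {"breach", "violation", "serious"}

def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

# Word counts per document.
counts = {name: Counter(tokenize(text)) for name, text in corpus.items()}

# Term-document matrix: one row per term, one column per document.
vocabulary = sorted(set().union(*counts.values()))
tdm = {term: [counts[name][term] for name in corpus] for term in vocabulary}

def sentiment(tf):
    """Positive dictionary hits minus negative dictionary hits."""
    pos = sum(tf[word] for word in positive)
    neg = sum(tf[word] for word in negative)
    return pos - neg

scores = {name: sentiment(tf) for name, tf in counts.items()}
print(tdm["breach"])  # counts of "breach" in award_a and award_b
print(scores)
```

The second document says "No breach occurred", yet "breach" still counts as a negative hit because the dictionary looks at words in isolation. The score is a useful rough signal, not a faithful reading of the text.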

Useful resources

Grimmer, Justin, and Brandon M. Stewart. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis, 2013.
