Introduction - Data Science for Lawyers

Machine learning is responsible for the success of recent applications of artificial intelligence (AI) technology and is fuelling a wide range of advances from translation to self-driving cars. Machine learning can also be used in legal data science. Machine learning algorithms look for relationships in data and tend to improve as they process more data. For legal researchers, supervised and unsupervised learning are the two most relevant types of machine learning.

Supervised machine learning

In supervised machine learning, a human “teaches” a computer a task, which it can then perform autonomously. For example, you could read 100 Supreme Court decisions and classify them based on their outcomes. You then feed the text of each decision and its associated outcome to a machine learning algorithm. The algorithm “learns” the relationship between input (text of the decision) and output (outcome of the decision) and stores this relationship in the form of a model. It can then use this model to autonomously assign an outcome when provided with the text of a new decision. To get a sense of how good the model is, you should evaluate it against a “gold standard”: manually classified decisions that were not used to train the model. If the results are mostly correct, the model works well. If it misclassifies many cases, you may have to hand-code more cases so that the algorithm can learn the relationship between text and outcome better.
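The workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended implementation: the eight "decisions" are invented toy data, the classifier is a bare-bones naive Bayes model written with the standard library only, and the function names `train_naive_bayes` and `predict` are hypothetical. The last two hand-coded decisions are held back as the gold standard and are never shown to the algorithm during training.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-coded decisions: (text, outcome). Real projects need far more data.
labelled = [
    ("appeal allowed conviction quashed", "allowed"),
    ("appeal allowed new trial ordered", "allowed"),
    ("appeal allowed judgment set aside", "allowed"),
    ("appeal dismissed conviction upheld", "dismissed"),
    ("appeal dismissed with costs", "dismissed"),
    ("appeal dismissed judgment affirmed", "dismissed"),
    ("conviction quashed appeal allowed", "allowed"),      # held out for evaluation
    ("judgment affirmed appeal dismissed", "dismissed"),   # held out for evaluation
]

# Train on the first six decisions; keep the last two as the gold standard.
train, test = labelled[:6], labelled[6:]


def train_naive_bayes(examples):
    """Count how often each word occurs under each outcome."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the outcome with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(train)
predictions = [predict(model, text) for text, _ in test]
# Accuracy: share of held-out decisions where the model matches the human coding.
accuracy = sum(p == gold for p, (_, gold) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

With so little data the accuracy figure means very little; the point is only the separation between the decisions used for training and those used for evaluation.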

Unsupervised machine learning

Unsupervised machine learning algorithms autonomously mine a dataset for patterns without prior human guidance. We will be looking at topic modeling as one example of such an algorithm. Say you do not know the content of a corpus in advance, but you guess that there are five topics in the corpus. You could then run a topic modeling algorithm that assumes that there are five topics in the corpus, that different words have different probabilities of appearing within a topic and that documents vary in the proportion of topics they talk about. The algorithm then sets out to “find” these five different topics and returns word lists associated with each topic. By interpreting these word lists, you then infer the content of these five topics. Sometimes unsupervised machine learning algorithms work like magic and reveal patterns that make intuitive sense. Sometimes they find relationships that make no sense at all. And sometimes it comes down to luck: unsupervised machine learning algorithms tend to be probabilistic and, depending on where they start and how they “guess”, their findings can be more or less meaningful.
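To make these assumptions concrete, here is a rough sketch of a topic model in plain Python. It is a deliberately small, LDA-style collapsed Gibbs sampler written for illustration; the toy corpus, the function name `fit_topics`, and the parameter values (`alpha`, `beta`, `n_iter`, `seed`) are all assumptions, and it uses two topics rather than five because the corpus is tiny. In practice you would most likely rely on an established topic modeling library rather than code like this.

```python
import random
from collections import Counter

# Hypothetical toy corpus; a real corpus would hold full documents.
docs = [
    "tariff customs import duty tariff",
    "import customs duty quota",
    "investor arbitration tribunal award",
    "tribunal arbitration investor claim",
    "tariff quota import customs",
    "award claim tribunal investor",
]


def fit_topics(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for an LDA-style topic model."""
    rng = random.Random(seed)
    tokens = [doc.split() for doc in docs]
    vocab = sorted({w for doc in tokens for w in doc})
    # Start from a random topic "guess" for every word occurrence.
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in tokens]
    doc_topic = [Counter(a) for a in assign]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for doc, a in zip(tokens, assign):
        for w, t in zip(doc, a):
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                t = assign[d][i]
                # Remove the current assignment, then resample it.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight = (topic share in this document) x (word share in topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * len(vocab))
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    # Word list per topic: the most frequent words currently assigned to it.
    top_words = [[w for w, c in tw.most_common(3) if c > 0] for tw in topic_word]
    # Estimated proportion of each topic within each document.
    shares = [
        [(dt[k] + alpha) / (len(doc) + alpha * n_topics) for k in range(n_topics)]
        for dt, doc in zip(doc_topic, tokens)
    ]
    return top_words, shares


top_words, shares = fit_topics(docs, n_topics=2)
for k, words in enumerate(top_words):
    print(f"Topic {k}: {', '.join(words)}")
```

Because the sampler starts from random guesses, running it with a different `seed` may yield different word lists, which is one reason the output always needs human interpretation.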

Validate, validate, validate

Machine learning algorithms are highly appealing because they can be used to quickly explore the content of a corpus or to classify texts by subject. However, legal researchers should not blindly trust the results, but instead carefully validate them. Aside from quantitative checks, this includes actually reading and reviewing some of the processed documents to assess whether the results make sense. Moreover, researchers should not expect computers to do tasks that humans cannot. Some corpora may be so diverse that they cannot be meaningfully grouped; others allow for multiple equally valid groupings. Just as two lawyers may validly divide the same collection of judicial decisions differently, a computer-generated output should not be taken as “truth”, but as one possible way to group the data.
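One simple way to organise such a review is to draw a random sample of processed documents, read them, and record how often you agree with the machine. The sketch below assumes a hypothetical dictionary `machine_labels` that maps document identifiers to the labels an algorithm assigned; the helper names and the simulated human coding are illustrative only.

```python
import random

# Hypothetical model output: document id -> label assigned by the algorithm.
machine_labels = {
    f"case_{i:03d}": ("allowed" if i % 3 else "dismissed") for i in range(1, 101)
}


def draw_review_sample(labels, n, seed=42):
    """Pick n document ids at random so a human can read and re-code them."""
    rng = random.Random(seed)
    return rng.sample(sorted(labels), n)


def agreement_rate(labels, human_labels):
    """Share of reviewed documents where the human agrees with the machine."""
    agree = sum(labels[doc_id] == label for doc_id, label in human_labels.items())
    return agree / len(human_labels)


sample_ids = draw_review_sample(machine_labels, n=10)

# In practice, a researcher fills this in after reading each sampled document.
# Here the human coding is simulated: the reviewer disagrees on one document.
human_labels = {doc_id: machine_labels[doc_id] for doc_id in sample_ids}
first = sample_ids[0]
human_labels[first] = "dismissed" if machine_labels[first] == "allowed" else "allowed"

print(agreement_rate(machine_labels, human_labels))
```

Fixing the seed keeps the sample reproducible, so a colleague can review exactly the same documents.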

When researchers should use machine learning

Unsupervised machine learning: It is often used at early stages of research to explore new datasets, because an unsupervised algorithm, in contrast to a human-trained supervised algorithm, can find patterns that the researcher did not actively look for. It thus works well even when categories are unknown. In fact, many researchers will be disappointed when they use unsupervised algorithms to look for known patterns. Instead of automatically categorizing treaties by their clauses, for example, a topic model is likely to pick up language typically used by specific states and classify treaties by signatories instead of by their content. Where the categories of interest are known, a rules-based dictionary mapping or a supervised machine learning approach is more suitable.

Supervised machine learning: Whenever the researcher is confronted with repetitive tasks at a high volume, supervised machine learning is a useful tool. This is particularly true when the relationship between input and output data is complex. If the relationship can be broken down into a small set of logical rules, a dictionary-based content mapping method might be more appropriate. It also only makes sense to use machine learning when there is a high volume of data. If you are only classifying a few dozen or a hundred cases, it may be quicker to classify the data by hand.
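For comparison, a dictionary-based mapping needs no training at all. The sketch below is a simplified illustration: the categories, the keyword lists, and the function name `classify_by_dictionary` are hypothetical, and the matching is deliberately crude (lower-cased whole words, no handling of punctuation or word forms).

```python
# Hypothetical keyword dictionary: category -> terms that signal it.
category_terms = {
    "trade": {"tariff", "customs", "quota"},
    "investment": {"investor", "arbitration", "expropriation"},
}


def classify_by_dictionary(text, dictionary):
    """Return every category whose terms appear in the text, sorted by name."""
    words = set(text.lower().split())
    # A category matches if the text shares at least one word with its term list.
    return sorted(cat for cat, terms in dictionary.items() if words & terms)


print(classify_by_dictionary("The tariff schedule sets customs duties", category_terms))
```

When a handful of rules like these captures the categories you care about, the result is transparent and easy to check, which is why this route can be preferable to a trained model.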

What we do in this lesson

In this lesson, we talk about two types of machine learning for the purposes of classifying the content of texts. In the next lesson, we use similar algorithms for prediction.

1. Unsupervised Machine Learning
2. Supervised Machine Learning

Last update May 11, 2020.