Data, Prediction and Law (Legal Studies 123) is a new “data-enabled” course developed as part of UC Berkeley’s Data Science major and minor. The course lets you explore different data sources that scholars and government officials use to make generalizations and predictions in the realm of law, and it reinforces the skills you need to do the same.
You’ll be introduced to critiques of predictive techniques in law, and you’ll apply the statistical and Python programming skills from Foundations of Data Science to examine a traditional social-science dataset (the American National Election Study), “big data” related to law (the San Francisco Police Incident Report dataset) and legal text data (Old Bailey Online). Note: You should complete Foundations of Data Science, or have equivalent preparation in Python and statistics, before enrolling in this course.
Below is an example of one of the in-class lab exercises, in which students explore changes in English criminal law through the wealth of data available through the Old Bailey Online. Students work through more than 20 labs during the course. Here, we asked whether data-analytic techniques (simple dictionary methods of text analysis) could help us find evidence of the changes to English law that occurred in the years around 1830 (a huge reduction in capital offenses, the State assuming an important role in prosecution and so on).
UC Berkeley undergraduates Gibson Chu and Keeley Takimoto developed this lab exercise.
Section 3: Moral Foundations Theory
Another approach is to create specialized dictionaries containing specific words of interest and use them to analyze sentiment from a particular angle (i.e., a dictionary method). One set of researchers did just that from the perspective of Moral Foundations Theory. We will now use their dictionary to see whether it tells us more about the moral tone of Old Bailey transcripts than general polarity does. You'll do something similar for your homework. We will be using a provided moral foundations dictionary.
import json

with open('data/haidt_dict.json') as json_data:
    mft_dict = json.load(json_data)
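Before running the full analysis, the core idea of a dictionary method can be sketched in a few lines on made-up data (the token list and word list below are illustrative, not from the lab data):

```python
# Toy sketch of a dictionary method: count tokens that appear in a word
# list, then normalize by transcript length. All data here is made up.
tokens = ["the", "judge", "showed", "care", "and", "respect"]
care_words = {"care", "harm", "protect"}  # stand-in for one foundation's word list

# count tokens matching the word list, then convert to a percentage
matches = sum(tok in care_words for tok in tokens)
pct = 100 * matches / len(tokens)  # one match out of six tokens
```

This is all the lab's longer code cells do, scaled up to thousands of transcripts and five word lists.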
Moral Foundations Theory posits that there are five (with an occasional sixth) innate, universal psychological foundations of morality, and that those foundations shape human cultures and institutions (including legal ones). The keys of the dictionary correspond to the five foundations.
#look at the keys of the dictionary provided
keys = mft_dict.keys()
list(keys)
['authority/subversion',
'care/harm',
'fairness/cheating',
'loyalty/betrayal',
'sanctity/degradation']
And the values of the dictionary are lists of words associated with each foundation.
mft_dict[list(keys)[0]] #one example of the values provided for the first key
Calculating Percentages
In this approach, we'll use the frequency of Moral Foundations–related words as a measure of how the transcripts talk about morality and see if there's a difference between pre- and post-1827 trends.
As a first step, we need to know the total number of words in each transcript.
EXERCISE: Add a column to old_bailey with the number of words corresponding to each transcript.
# create a new column called 'total_words'
old_bailey['total_words'] = old_bailey['tokens'].apply(len)  # apply len to each token list to count its words
old_bailey.head()
| trial_id | year | transcript | tokens | stemmed_tokens | polarity | total_words |
|---|---|---|---|---|---|---|
| t18170115-1 | 1822 | PETER JOHNSON was indicted for being at large,... | [peter, johnson, was, indicted, for, being, at... | [peter, johnson, was, indict, for, be, at, lar... | -0.128571 | 45 |
| t18170115-2 | 1822 | BENJAMIN HEARNE was indicted for burglariously... | [benjamin, hearne, was, indicted, for, burglar... | [benjamin, hearn, was, indict, for, burglari, ... | 0.075000 | 382 |
| t18170115-3 | 1822 | JOHN DAVIS and JAMES LEMON , were indicted for... | [john, davis, and, james, lemon, were, indicte... | [john, davi, and, jame, lemon, were, indict, f... | -0.027721 | 981 |
| t18170115-4 | 1822 | RICHARD WILTSHIRE and SUSAN PAR-SONS , were in... | [richard, wiltshire, and, susan, par, sons, we... | [richard, wiltshir, and, susan, par, son, were... | 0.074495 | 372 |
| t18170115-5 | 1822 | MARY JOHNSTON was indicted for burglariously b... | [mary, johnston, was, indicted, for, burglario... | [mari, johnston, was, indict, for, burglari, b... | -0.033333 | 963 |
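The apply(len) step can be sanity-checked on a tiny made-up frame (the data below is illustrative, not the lab's):

```python
import pandas as pd

# Tiny made-up frame with a column of token lists, mirroring 'tokens' above.
df = pd.DataFrame({"tokens": [["a", "b"], ["x", "y", "z"]]})

# apply(len) calls len on each list, giving the word count per row
df["total_words"] = df["tokens"].apply(len)  # [2, 3]
```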
Next, we need to calculate the number of matches to entries in our dictionary for each foundation for each speech.
Run the next cell to add five new columns to old_bailey, one per foundation, showing the number of word matches. The cell will likely take some time to run (no more than a minute). Note that by now you have the skills to write all the code in the next cell; we're just giving it to you because it's long and fiddly, and writing nested for-loops is not the focus of this lab. Make sure you know what it does before you move on, though.
import numpy as np

# Will take a bit of time to run due to the large size.
# do the following for each foundation
for foundation in mft_dict.keys():
    # create a new, empty array of counts, one entry per transcript
    num_match_words = np.zeros(len(old_bailey))
    stems = mft_dict[foundation]
    # do the following for each foundation word
    for stem in stems:
        # count matches to this stem in each transcript
        wd_count = np.array([sum([wd == stem for wd in transcript]) for transcript in old_bailey['stemmed_tokens']])
        # add the number of matches to the running total
        num_match_words += wd_count
    # create a new column for this foundation with the number of related words per transcript
    old_bailey[foundation] = num_match_words
old_bailey.head()
| trial_id | year | transcript | tokens | stemmed_tokens | polarity | total_words | authority/subversion | care/harm | fairness/cheating | loyalty/betrayal | sanctity/degradation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| t18170115-1 | 1822 | PETER JOHNSON was indicted for being at large,... | [peter, johnson, was, indicted, for, being, at... | [peter, johnson, was, indict, for, be, at, lar... | -0.128571 | 45 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| t18170115-2 | 1822 | BENJAMIN HEARNE was indicted for burglariously... | [benjamin, hearne, was, indicted, for, burglar... | [benjamin, hearn, was, indict, for, burglari, ... | 0.075000 | 382 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| t18170115-3 | 1822 | JOHN DAVIS and JAMES LEMON , were indicted for... | [john, davis, and, james, lemon, were, indicte... | [john, davi, and, jame, lemon, were, indict, f... | -0.027721 | 981 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| t18170115-4 | 1822 | RICHARD WILTSHIRE and SUSAN PAR-SONS , were in... | [richard, wiltshire, and, susan, par, sons, we... | [richard, wiltshir, and, susan, par, son, were... | 0.074495 | 372 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| t18170115-5 | 1822 | MARY JOHNSTON was indicted for burglariously b... | [mary, johnston, was, indicted, for, burglario... | [mari, johnston, was, indict, for, burglari, b... | -0.033333 | 963 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
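The nested loop above rescans every transcript once per stem. An equivalent but faster approach, sketched here on made-up data (the toy frame and dictionary below are illustrative, not the lab's), builds one collections.Counter per transcript and then sums counts per foundation:

```python
from collections import Counter

import pandas as pd

# Made-up stand-ins for old_bailey['stemmed_tokens'] and mft_dict.
toy = pd.DataFrame({"stemmed_tokens": [["care", "harm", "law"], ["law", "order"]]})
toy_dict = {"care/harm": ["care", "harm"]}

# One Counter per transcript, computed once.
counts = toy["stemmed_tokens"].apply(Counter)
for foundation, stems in toy_dict.items():
    # sum this foundation's stem counts for each transcript
    toy[foundation] = counts.apply(lambda c: sum(c[s] for s in stems))
# toy["care/harm"] is now [2, 0]
```

Each transcript is token-counted once instead of once per dictionary word, which matters when the dictionary has hundreds of stems.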
EXERCISE: The columns for each foundation currently contain the number of words related to that foundation for each of the trials. Calculate the percentage of foundation words per trial by dividing the number of matched words by the number of total words and multiplying by 100.
# convert each foundation column from raw counts to percentages
for foundation in mft_dict.keys():
    old_bailey[foundation] = 100 * (old_bailey[foundation] / old_bailey['total_words'])

old_bailey.head()
| trial_id | year | transcript | tokens | stemmed_tokens | polarity | total_words | authority/subversion | care/harm | fairness/cheating | loyalty/betrayal | sanctity/degradation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| t18170115-1 | 1822 | PETER JOHNSON was indicted for being at large,... | [peter, johnson, was, indicted, for, being, at... | [peter, johnson, was, indict, for, be, at, lar... | -0.128571 | 45 | 2.222222 | 0.000000 | 0.0 | 0.0 | 0.0 |
| t18170115-2 | 1822 | BENJAMIN HEARNE was indicted for burglariously... | [benjamin, hearne, was, indicted, for, burglar... | [benjamin, hearn, was, indict, for, burglari, ... | 0.075000 | 382 | 0.000000 | 0.261780 | 0.0 | 0.0 | 0.0 |
| t18170115-3 | 1822 | JOHN DAVIS and JAMES LEMON , were indicted for... | [john, davis, and, james, lemon, were, indicte... | [john, davi, and, jame, lemon, were, indict, f... | -0.027721 | 981 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 |
| t18170115-4 | 1822 | RICHARD WILTSHIRE and SUSAN PAR-SONS , were in... | [richard, wiltshire, and, susan, par, sons, we... | [richard, wiltshir, and, susan, par, son, were... | 0.074495 | 372 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 |
| t18170115-5 | 1822 | MARY JOHNSTON was indicted for burglariously b... | [mary, johnston, was, indicted, for, burglario... | [mari, johnston, was, indict, for, burglari, b... | -0.033333 | 963 | 0.103842 | 0.000000 | 0.0 | 0.0 | 0.0 |
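The percentage loop can also be written as one vectorized step; this sketch uses made-up columns, but the same .div(..., axis=0) pattern would work on the real frame:

```python
import pandas as pd

# Made-up counts and word totals standing in for the real columns.
toy = pd.DataFrame({"care/harm": [1.0, 0.0], "total_words": [50, 200]})

cols = ["care/harm"]  # in the lab this would be the five foundation columns
# divide every listed column by total_words row-wise, then scale to percent
toy[cols] = 100 * toy[cols].div(toy["total_words"], axis=0)
# toy["care/harm"] is now [2.0, 0.0]
```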
Let's compare the average percentage of foundation words per transcript for the two dates 1822 and 1832.
EXERCISE: Create a dataframe that only has columns for the five foundations plus the year. Then, use the pandas dataframe function groupby to group rows by the year, and call the mean function on the groupby output to get the averages for each foundation.
# the names of the columns we want to keep
mft_columns = ['authority/subversion', 'care/harm', 'fairness/cheating', 'loyalty/betrayal',
'sanctity/degradation', 'year']
# create a data frame with only the above columns included
mft_df = old_bailey.loc[:, mft_columns]
# group the rows of mft_df by year, then take the mean
foundation_avgs = mft_df.groupby('year').mean()
foundation_avgs
| year | authority/subversion | care/harm | fairness/cheating | loyalty/betrayal | sanctity/degradation |
|---|---|---|---|---|---|
| 1822 | 0.078470 | 0.205310 | 0.014769 | 0.013650 | 0.027381 |
| 1832 | 0.146665 | 0.099239 | 0.013073 | 0.012185 | 0.042233 |
Next, create a bar graph. The simplest way is to call .plot.barh() on your dataframe of the averages.
Also try calling .transpose() on your averages dataframe, then making a bar graph of that. The transpose function flips the rows and columns and can make it easier to compare the percentages.
# create a bar graph
foundation_avgs.plot.barh()
[Horizontal bar chart: mean foundation-word percentage per foundation, with one group of bars for 1822 and one for 1832]
#transpose and then make another bar graph
foundation_avgs.transpose().plot.barh()
[Horizontal bar chart of the transposed averages: one group of bars per foundation, with a 1822 bar and a 1832 bar in each group]
QUESTION: What do you see from the bar graphs you created?
Why would this be a good approach to answering the question of how talk about morality changed between these two periods? What are some limitations of this approach? (Hint: look at the values on the graphs you created. Remember: These are percentages, not proportions.)
This may not be the best way to measure how shared norms of right and wrong changed, since the interests of the English legal system are an intervening variable, and, as the hint suggests, the percentage of words matching the MFT categories is vanishingly small (the largest average is about 0.2 percent). What is interesting is the change over time in a couple of the categories: care/harm and authority/subversion. It looks like public order became much more of a focus over the ten years from 1822 to 1832.