Inside Look Into Data, Prediction and Law Course

Professor Jon Marshall provides example lab exercise

Data, Prediction and Law (Legal Studies 123) is a new “data-enabled” course developed as part of UC Berkeley’s Data Science major and minor. The course lets you explore the different data sources that scholars and government officials use to make generalizations and predictions in the realm of law, and it reinforces the skills you need to do that kind of work yourself.

You’ll be introduced to critiques of predictive techniques in law and apply the statistical and Python programming skills from Foundations of Data Science to examine a traditional social-science dataset (the American National Election Study), “big data” related to law (the San Francisco Police Incident Report dataset) and legal text data (the Old Bailey Online). Note: You should complete Foundations of Data Science or have equivalent preparation in Python and statistics before enrolling in this course.

Below is an example of one of the in-class lab exercises, in which students explore changes in English criminal law through the wealth of data available in the Old Bailey Online. Students in this class work through more than 20 labs during the course. Here, we were asking whether data-analytic techniques (simple dictionary methods of text analysis) could help us find evidence of the changes to English law that occurred in the years around 1830 (a huge reduction in capital offenses, the State assuming an important role in prosecution and so on).

UC Berkeley undergraduates Gibson Chu and Keeley Takimoto developed this lab exercise.

Section 3: Moral Foundations Theory

Another approach is to create specialized dictionaries containing specific words of interest and use them to analyze sentiment from a particular angle (i.e., a dictionary method). One set of researchers did just that from the perspective of Moral Foundations Theory. We will now use their dictionary to see whether it tells us more about the moral tone of Old Bailey transcripts than general polarity does. You will do something similar for your homework. We will be using a provided moral foundations dictionary.

import json  # load the provided moral foundations dictionary

with open('data/haidt_dict.json') as json_data:
    mft_dict = json.load(json_data)
Moral Foundations Theory posits that there are five (with an occasional sixth) innate, universal psychological foundations of morality, and that those foundations shape human cultures and institutions (including legal ones). The keys of the dictionary correspond to the five foundations.
#look at the keys of the dictionary provided
keys = mft_dict.keys()
list(keys)
['authority/subversion',
'care/harm',
'fairness/cheating',
'loyalty/betrayal',
'sanctity/degradation']

And the values of the dictionary are lists of words associated with each foundation.
mft_dict[list(keys)[0]] #one example of the values provided for the first key
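
One quick way to get a feel for the dictionary (not part of the original lab) is to count how many word stems are associated with each foundation; a minimal sketch, assuming mft_dict maps each foundation name to a list of stems:

# count the number of word stems provided for each foundation
{foundation: len(stems) for foundation, stems in mft_dict.items()}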

Calculating Percentages

In this approach, we'll use the frequency of Moral Foundations–related words as a measure of how the transcripts talk about morality, and see whether there's a difference between pre- and post-1827 trends. Concretely, for each transcript we'll compute the percentage of its words that match each foundation; a 45-word transcript with one authority/subversion match, for example, would score 100 × 1/45 ≈ 2.2 percent for that foundation.

As a first step, we need to know the total number of words in each transcript.

EXERCISE: Add a column to old_bailey with the number of words corresponding to each transcript.

# create a new column called 'total_words'
old_bailey['total_words'] = old_bailey['tokens'].apply(len)  # apply len to each list of tokens to count the words in that transcript
old_bailey.head()

 

 

 

trial_id | year | transcript | tokens | stemmed_tokens | polarity | total_words
t18170115-1 | 1822 | PETER JOHNSON was indicted for being at large,... | [peter, johnson, was, indicted, for, being, at... | [peter, johnson, was, indict, for, be, at, lar... | -0.128571 | 45
t18170115-2 | 1822 | BENJAMIN HEARNE was indicted for burglariously... | [benjamin, hearne, was, indicted, for, burglar... | [benjamin, hearn, was, indict, for, burglari, ... | 0.075000 | 382
t18170115-3 | 1822 | JOHN DAVIS and JAMES LEMON , were indicted for... | [john, davis, and, james, lemon, were, indicte... | [john, davi, and, jame, lemon, were, indict, f... | -0.027721 | 981
t18170115-4 | 1822 | RICHARD WILTSHIRE and SUSAN PAR-SONS , were in... | [richard, wiltshire, and, susan, par, sons, we... | [richard, wiltshir, and, susan, par, son, were... | 0.074495 | 372
t18170115-5 | 1822 | MARY JOHNSTON was indicted for burglariously b... | [mary, johnston, was, indicted, for, burglario... | [mari, johnston, was, indict, for, burglari, b... | -0.033333 | 963
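
If you want to sanity-check the new column (this step isn't in the original lab), pandas' describe method gives a quick summary of the word counts:

# summary statistics for the number of words per transcript
old_bailey['total_words'].describe()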

 

Next, we need to calculate, for each transcript, the number of matches to the entries in our dictionary for each foundation.

Run the next cell to add five new columns to old_bailey, one per foundation, showing the number of word matches. This cell will also likely take some time to run (no more than a minute). Note that by now you have the skills to write all the code in the next cell; we're just giving it to you because it's long and fiddly, and writing nested for-loops is not the focus of this lab. Make sure you know what it does before you move on, though.

import numpy as np  # used for the running counts (imported earlier in the full notebook)

# Will take a bit of time to run due to the large size.

# do the following for each foundation
for foundation in mft_dict.keys():
    # start with a count of zero matching words for every transcript
    num_match_words = np.zeros(len(old_bailey))
    stems = mft_dict[foundation]

    # do the following for each word stem associated with the foundation
    for stem in stems:
        # count how many times this stem appears in each transcript
        wd_count = np.array([sum([wd == stem for wd in transcript]) for transcript in old_bailey['stemmed_tokens']])
        # add the counts to the running totals
        num_match_words += wd_count

    # create a new column for the foundation with the number of related words per transcript
    old_bailey[foundation] = num_match_words

old_bailey.head()

 

trial_id | year | transcript | tokens | stemmed_tokens | polarity | total_words | authority/subversion | care/harm | fairness/cheating | loyalty/betrayal | sanctity/degradation
t18170115-1 | 1822 | PETER JOHNSON was indicted for being at large,... | [peter, johnson, was, indicted, for, being, at... | [peter, johnson, was, indict, for, be, at, lar... | -0.128571 | 45 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0
t18170115-2 | 1822 | BENJAMIN HEARNE was indicted for burglariously... | [benjamin, hearne, was, indicted, for, burglar... | [benjamin, hearn, was, indict, for, burglari, ... | 0.075000 | 382 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0
t18170115-3 | 1822 | JOHN DAVIS and JAMES LEMON , were indicted for... | [john, davis, and, james, lemon, were, indicte... | [john, davi, and, jame, lemon, were, indict, f... | -0.027721 | 981 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
t18170115-4 | 1822 | RICHARD WILTSHIRE and SUSAN PAR-SONS , were in... | [richard, wiltshire, and, susan, par, sons, we... | [richard, wiltshir, and, susan, par, son, were... | 0.074495 | 372 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
t18170115-5 | 1822 | MARY JOHNSTON was indicted for burglariously b... | [mary, johnston, was, indicted, for, burglario... | [mari, johnston, was, indict, for, burglari, b... | -0.033333 | 963 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0
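
As an aside (not part of the lab), the same counts can be computed faster by tallying each transcript's stems once with collections.Counter and then summing the tallies over each foundation's stem list. A minimal sketch, assuming old_bailey['stemmed_tokens'] holds lists of strings:

from collections import Counter

# tally every transcript's stems once
stem_counts = old_bailey['stemmed_tokens'].apply(Counter)

# for each foundation, sum the tallies of its stems in each transcript
for foundation, stems in mft_dict.items():
    old_bailey[foundation] = stem_counts.apply(lambda counts: sum(counts[stem] for stem in stems))

This avoids re-scanning every transcript once per stem, which is what makes the nested-loop version slow.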

 

EXERCISE: The columns for each foundation currently contain the number of words related to that foundation for each of the trials. Calculate the percentage of foundation words per trial by dividing the number of matched words by the number of total words and multiplying by 100.

# do this for each foundation column
for foundation in mft_dict.keys():
    # overwrite the raw counts with percentages of each transcript's total words
    old_bailey[foundation] = 100 * (old_bailey[foundation] / old_bailey['total_words'])
old_bailey.head()

 

trial_id | year | transcript | tokens | stemmed_tokens | polarity | total_words | authority/subversion | care/harm | fairness/cheating | loyalty/betrayal | sanctity/degradation
t18170115-1 | 1822 | PETER JOHNSON was indicted for being at large,... | [peter, johnson, was, indicted, for, being, at... | [peter, johnson, was, indict, for, be, at, lar... | -0.1285 | 45 | 2.222222 | 0.00000 | 0.0 | 0.0 | 0.0
t18170115-2 | 1822 | BENJAMIN HEARNE was indicted for burglariously... | [benjamin, hearne, was, indicted, for, burglar... | [benjamin, hearn, was, indict, for, burglari, ... | 0.0750 | 382 | 0.000000 | 0.26178 | 0.0 | 0.0 | 0.0
t18170115-3 | 1822 | JOHN DAVIS and JAMES LEMON , were indicted for... | [john, davis, and, james, lemon, were, indicte... | [john, davi, and, jame, lemon, were, indict, f... | -0.0277 | 981 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0
t18170115-4 | 1822 | RICHARD WILTSHIRE and SUSAN PAR-SONS , were in... | [richard, wiltshire, and, susan, par, sons, we... | [richard, wiltshir, and, susan, par, son, were... | 0.074495 | 372 | 0.000000 | 0.00000 | 0.0 | 0.0 | 0.0
t18170115-5 | 1822 | MARY JOHNSTON was indicted for burglariously b... | [mary, johnston, was, indicted, for, burglario... | [mari, johnston, was, indict, for, burglari, b... | -0.033333 | 963 | 0.103842 | 0.00000 | 0.0 | 0.0 | 0.0

 

Let's compare the average percentage of foundation words per transcript for the two years in the data, 1822 and 1832.

EXERCISE: Create a dataframe that only has columns for the five foundations plus the year. Then, use the pandas dataframe function groupby to group rows by the year, and call the mean function on the groupby output to get the averages for each foundation.

# the names of the columns we want to keep
mft_columns = ['authority/subversion', 'care/harm', 'fairness/cheating', 'loyalty/betrayal',
               'sanctity/degradation', 'year']

# create a data frame with only the above columns included
mft_df = old_bailey.loc[:, mft_columns]
# group the rows of mft_df by year, then take the mean
foundation_avgs = mft_df.groupby('year').mean()

foundation_avgs

year | authority/subversion | care/harm | fairness/cheating | loyalty/betrayal | sanctity/degradation
1822 | 0.078470 | 0.205310 | 0.014769 | 0.013650 | 0.027381
1832 | 0.146665 | 0.099239 | 0.013073 | 0.012185 | 0.042233

 

Next, create a bar graph. The simplest way is to call .plot.barh() on your dataframe of the averages.

Also try calling .transpose() on your averages dataframe, then making a bar graph of that. The transpose function flips the rows and columns and can make it easier to compare the percentages.

# create a bar graph
foundation_avgs.plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x1a1e236320>

Horizontal bar graph of the foundation averages, with one group of bars per year

 

#transpose and then make another bar graph
foundation_avgs.transpose().plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x1a1dff59b0>
 

Horizontal bar graph of the transposed averages, with one group of bars per foundation

 

QUESTION: What do you see from the bar graphs you created?

Why would this be a good approach to answering the question of how talk about morality changed between these two periods? What are some limitations of this approach? (Hint: look at the values on the graphs you created. Remember: these are percentages, not proportions.)

It may not be the best way to talk about how shared norms of right and wrong changed, since the interests of the English legal system are an intervening variable, and as the hint notes, the percentage of words in a transcript that fall into the MFT categories is vanishingly small (around 0.2% at most). What is interesting is the change over time in a couple of the categories, namely care/harm and authority/subversion. It looks like public order became much more of a focus over the ten years from 1822 to 1832.
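
To put numbers on that shift (this isn't part of the original notebook), you could subtract the 1822 averages from the 1832 averages; a minimal sketch, assuming the groupby left the integer years 1822 and 1832 in the index of foundation_avgs:

# change in average foundation-word percentage from 1822 to 1832
foundation_avgs.loc[1832] - foundation_avgs.loc[1822]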