ICBAI Paper
Application of Machine Learning to Predict Outcome of US Court of Appeals
Krishna Mohan
Thomson Reuters
krishna.mohan3@tr.com
Nitin Hosurkar
Arrow Electronics
nitin.hosurkar@gmail.com
Pradeepta Mishra
Ma Foi
Pradeepta.mishra1@gmail.com
Abstract:
In 2004, Theodore Ruger and colleagues (Ruger et al., 2004) made a bold claim that data analytics models
can predict the outcomes of the US Supreme Court better than experts in the legal domain. Historical data
was used to develop decision trees that predicted whether the US Supreme Court would confirm or reverse
the lower court ruling – a binary outcome. This excited many in the legal community, while many others
received it with cautious skepticism.
In this project, we aim to take the prediction of court rulings further and look at the next level down in the
US court system hierarchy, namely the US Court of Appeals. Unlike the Supreme Court study, which has a
binomial outcome, the US Court of Appeals has 12 possible outcomes. Therefore, the methods,
techniques and interpretation required to develop a predictive model are very different and challenging.
Data covering a 7-year period, sourced from the public domain, were cleansed and their dimensionality
reduced using Chi-Square analysis and the Boruta package in R. Classification techniques used include
Random Forest, Neural Network, XGBoost and Ensemble. Prediction accuracy of the models ranges from
36% to 98%, requiring identification of parameters that ensure robustness of the models. Although there
are no benchmarks available in the legal domain against which to compare our accuracy levels, the results
are highly encouraging. By applying the models to similar data collected from other courts and over longer
durations, there is an opportunity to make them more robust and reliable. With rapid digitization, we see
opportunities to apply similar techniques in India in the near future.
Keywords:
Legal predictive analytics, multinomial classification, judicial analytics, random forest, neural network
Introduction (Section 1):
Every year, tens of thousands of cases work their way through the US judicial system. Very often, the
parties involved turn to higher courts in the hope of a ruling in their favor, based on expert advice from
lawyers drawing on prior experience and intuition. The clients’ tangible and intangible stakes also cloud
their decision to pursue the case further. Only later do they realize that an out-of-court settlement would
have resulted in the best outcome for all parties involved, including the courts.
From the lawyers’ standpoint, it is not enough to research previous rulings in the strategic
and tactical preparation of their cases. Rather, it is important to understand the factors or
variables that courts rely on to arrive at their decisions.
Courts tend to document several parameters related to their functioning, such as parties involved,
hearing dates, nature of the case, rulings from earlier courts, and laws applied. By applying data
analytics techniques, this data can be leveraged to predict the outcome of future cases based on
these parameters. Such a data-based approach is far more objective than the intuitive,
experience-based speculation that has been the norm. Both clients and lawyers can make
decisions with far greater confidence. With a reduction in frivolous and outlier cases, courts would
be able to save precious resources, which could be repurposed for gaining efficiencies within
the system.
In this study, we have attempted to predict the outcome of the US Court of Appeals. The outcome
can assume 12 possible ruling values. This multi-class output therefore challenged us to go
well beyond Logistic Regression to techniques such as Random Forest, Neural Network, XGBoost
and Ensemble.
This paper defines the problem statement in Section 2. Related literature and previous work in
this area are discussed in Section 3. Data sources are identified in Section 4. We then look into
the nature of the data and its engineering in Sections 5 and 6, respectively. Section 7 covers the
selection criteria for model-building techniques. The results are discussed in Section 8 and the
overall conclusions are drawn in Section 9.
Problem Statement (Section 2):
Develop models that can predict the outcome or treatment of a case by the US Court of Appeals
based on historical data, using basic case characteristics, participants, nature of the case, judges
and votes. Today, experience and intuition are used to make such predictions. This project
involves data exploration, data engineering and building appropriate predictive models using
techniques such as Random Forest, Neural Networks, XGBoost and Ensemble.
Literature Review (Section 3):
Prof. Frank B. Cross of the University of Texas, Austin studied the decision-making process in the US
Court of Appeals (Cross, 2003). He explains that there are four primary theoretical models that determine
the outcome of the cases the court handles. The first is the Legal Model, wherein the decision is made
strictly in accordance with the law. The second is the Political Model, in which the ideology of judges
may be a factor. Third is the Strategic Model, in which decisions are adapted to the preferences of the US
Supreme Court. The fourth is the Litigant-driven Model, in which the strategic decisions of the parties
involved can drive the outcome of a case.
Prof. Cross concludes that legal and political factors are statistically significant determinants of decisions,
while strategic and litigant-driven factors have no significance. This leaves a litigant with little
ammunition or tools to influence the outcome in his or her favor. It is possible that litigants simply did not
have the tools to formulate a strategy to the point where the litigant-driven factors also became
significant. This may primarily be due to over-reliance on a lawyer’s intuition, experience and
expertise. A more objective approach for a litigant would be to use data as an instrument for strategizing.
In this paper, we focus on building one such strategy tool. Being able to predict the outcome of a case in
the US Court of Appeals becomes an important input for a litigant to better determine his or her options
or strategy. While similar work has been done in the past on predicting whether the US Supreme
Court would confirm or overturn the ruling of a lower court – a binary outcome – in this project we try to
predict multiple outcomes in the US Court of Appeals.
Data Sources (Section 4):
The Judicial Research Initiative (JuRI) at the University of South Carolina, Columbia took up the Appeals
Court Database Project to create an extensive dataset that would facilitate empirical analysis of the
judges’ votes and the overall rulings of the Appellate Court. Data on a broad range of variables of
theoretical significance to public law scholars were coded and published. The 1997-2002 database
(JuRI_data, 2003) and codebook (JuRI_Codebook, 2003) effort was led by Dr. Ashlyn K. Kuersten of
Western Michigan University and Susan B. Haire of the University of Georgia.
Data source links relevant to this project are provided below:
Website: http://artsandsciences.sc.edu/poli/juri/appct.htm
Codebook: http://artsandsciences.sc.edu/poli/juri/KH_update_codebook.pdf
Data (stata format): http://www.cas.sc.edu/poli/juri/KH_update_stata.zip
Data and Variables (Section 5):
The raw data file, in CSV format, consists of 2,160 rows and 244 columns. Almost all the variables are
categorical. Variables with more than 15% missing values were removed; all retained variables had less
than 5% missing values. The data also contained 5-digit nominal values, in which each digit represented a
categorical value with as many as 12 sub-category levels. Such composite data was decomposed into
separate fields, which were renamed for better understanding. Many of the categorical variables required
dummy coding, vastly enlarging the size of our dataset.
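The decomposition and dummy coding described above can be sketched as follows (in Python for brevity; the authors worked in R, and the field names and code values here are hypothetical, not taken from the actual codebook):

```python
# Sketch: decompose a 5-digit composite nominal code into per-digit
# categorical fields, then dummy-code (one-hot encode) a categorical
# variable. Field names and levels are illustrative.

def decompose_composite(code, field_names):
    """Split a 5-digit nominal code into one categorical field per digit."""
    assert len(code) == len(field_names)
    return {name: int(digit) for name, digit in zip(field_names, code)}

def dummy_code(values, levels):
    """One-hot encode a list of categorical values over known levels."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]

# Hypothetical field names for the five digit positions.
fields = ["GENERAL_ISSUE", "SPECIFIC_ISSUE", "SUB_ISSUE", "DOCTRINE", "DIRECTION"]
print(decompose_composite("30217", fields))

encoded = dummy_code(["affirmed", "reversed", "affirmed"],
                     levels=["affirmed", "reversed", "vacated"])
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Each categorical variable with k levels expands into k indicator columns, which is why the dataset grows so quickly after this step.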
The dependent or outcome variable for our study is ‘Treatment’. According to the Codebook, Treatment
can assume one of 12 possible values, coded as follows: 0=stay petition or motion granted,
1=affirmed, 2=reversed, 3=reversed and remanded, 4=vacated and remanded, 5=affirmed in part and
reversed in part, 6=affirmed in part, reversed in part and remanded, 7=vacated, 8=petition denied or
appeal dismissed, 9=certification to another court, 10=not ascertained, 11=affirmed, vacated and
remanded.
As illustrated in Fig. 1, 5 of the Treatment values constitute nearly 90% of the outcomes. After careful
study and consideration of their commonalities and distinct features, the number of Treatment outcomes
was consolidated to 7, as shown in Fig. 1a. For easier understanding, the nominal values were replaced
with outcome descriptions.
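The recoding itself amounts to a simple lookup. The 12 original codes below are from the Codebook; the consolidation map is hypothetical, since the exact grouping depends on the commonalities discussed above (sketched in Python for illustration):

```python
# Codebook values for the Treatment outcome variable (codes 0-11).
TREATMENT = {
    0: "stay petition or motion granted", 1: "affirmed", 2: "reversed",
    3: "reversed and remanded", 4: "vacated and remanded",
    5: "affirmed in part and reversed in part",
    6: "affirmed in part, reversed in part and remanded", 7: "vacated",
    8: "petition denied or appeal dismissed",
    9: "certification to another court", 10: "not ascertained",
    11: "affirmed, vacated and remanded",
}

# Hypothetical consolidation of 12 codes into 7 outcome labels; the
# grouping below is illustrative, not the paper's actual grouping.
CONSOLIDATE = {
    0: "other", 1: "affirmed", 2: "reversed", 3: "reversed",
    4: "vacated", 5: "mixed", 6: "mixed", 7: "vacated",
    8: "dismissed", 9: "other", 10: "not ascertained", 11: "mixed",
}

def recode(treatment_code):
    """Map a raw Treatment code to its consolidated outcome label."""
    return CONSOLIDATE[treatment_code]

print(recode(3))  # reversed
```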
Fig 1: Distribution of Treatment outcome BEFORE consolidation
Fig 1a: Distribution of Treatment outcome AFTER consolidation
Data Engineering (Section 6):
Given that the original dataset had 244 columns, which were vastly expanded after decomposing
composite data and converting categorical variables to dummy variables, it was necessary to organize
them (see Fig. 2) in a manner that was easier to comprehend and that supported further analysis such as
dimensionality reduction.
Fig 2: Data organization BEFORE Dimension Reduction
Chi-Square Analysis was performed on the categorical variables to identify predictor variables that
significantly affected the case outcome. The Chi-squared results were additionally corroborated by
performing feature selection using the “Boruta” package in R.
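The Chi-Square screening can be sketched as below (pure Python for illustration; in practice this was done in R, where `chisq.test` also supplies the p-value used to judge significance):

```python
# Sketch: Chi-Square statistic of independence between a categorical
# predictor and the outcome, computed from a contingency table.
def chi_square_stat(table):
    """table: list of rows of observed counts (predictor level x outcome)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: predictor level vs. affirmed / not affirmed.
table = [[30, 10],
         [20, 40]]
print(round(chi_square_stat(table), 3))  # 16.667
```

A large statistic (relative to the chi-squared distribution for the table's degrees of freedom) indicates the predictor and the outcome are not independent, so the predictor is worth keeping.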
The plot from the Boruta package in Fig. 3 shows the variables (on the x-axis) plotted against their
Importance (on the y-axis). The variables marked in GREEN are the most important features selected by
the package.
Although Chi-Square analysis and Boruta helped us arrive at the most significant predictor variables
to be considered for model building, the list was not final yet. Based on domain knowledge, we decided to
make the following changes:
The field PRIOR_COURT is nothing but a description of ORIGIN_NUMBER. Therefore, we
will retain PRIOR_COURT and drop ORIGIN_NUMBER.
Once we know the CIRCUIT_COURT, it is not necessary to use the States under its jurisdiction.
Therefore, we will retain CIRCUIT_COURT and drop CIRCUIT_STATES.
Replace STATE_VAL with STATE so that we know which State is being referred to. Similarly,
we replaced DISTRICT_VAL with DISTRICT.
Neither Chi-Square nor Boruta selected Judges as a significant predictor variable. However, we do
believe NUM_JUDGES should be included in the model.
After completing Feature Engineering steps described above, the significant variables identified were
organized as shown in Fig. 4.
Fig. 3: Boruta Package output graph – Most significant variables
Fig. 4: Data organization AFTER Dimension Reduction
Model Selection (Section 7):
Being a multi-class classification problem, there was intuitively an inclination to use Multinomial
Logistic Regression. On closer inspection, it was noticed that under the surface Multinomial Logistic
Regression still functions as a binomial model: the outcome is re-categorized as A versus B, C, D and so
on for each possible outcome. This results in an inevitable loss of information and can lead to misleading
conclusions. Therefore, it was decided to park Multinomial Logistic Regression while more suitable
predictive models were explored using the following selection parameters:
Size of data – is it large enough to adequately train the model?
Dimensionality – with 264 columns, do we keep all of them or only the significant variables?
Would these algorithms be able to effectively handle independent categorical variables?
What precautions need to be taken to avoid over-fitting?
Do we have enough machine power namely, speed/performance/memory to run these
complicated algorithms?
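The one-versus-rest decomposition that makes Multinomial Logistic Regression effectively binomial can be sketched as follows (Python, for illustration):

```python
# Sketch: the "A versus everything else" relabelling that underlies
# one-vs-rest multinomial logistic regression. Each class gets its own
# binary problem, discarding the distinctions among the other classes.
def one_vs_rest_labels(outcomes, target):
    """Binarize a multi-class outcome: 1 for `target`, 0 for all others."""
    return [1 if o == target else 0 for o in outcomes]

outcomes = ["affirmed", "reversed", "vacated", "affirmed"]
for target in sorted(set(outcomes)):
    print(target, one_vs_rest_labels(outcomes, target))
```

Note how, in the "affirmed" sub-problem, "reversed" and "vacated" are collapsed into a single 0 class; that collapse is the loss of information referred to above.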
Eventually, the classification models considered were Random Forest, Neural Network and XGBoost.
Upon developing Random Forest models, we found a tendency towards over-fitting, with accuracy rates
as high as 99%. With its randomized selection of rows and columns, Random Forest is normally expected
to be resistant to over-fitting; it is possible that our relatively small dataset of 2,160 rows was a significant
contributor to this outcome. It was decided that further exploration, including the use of a larger dataset,
is required before publishing conclusions on the performance of the Random Forest model.
In this paper, we will be focusing primarily on Neural Network and XGBoost.
Neural Network: Using the caret package in R, the data was split into Training and Test datasets. For
building the Neural Network model, we used the nnet package in R. Since our data predominantly
consisted of categorical variables, softmax was set to TRUE. The softmax function is a gradient-log
normalizer of the categorical probability distribution and is used in various probabilistic multiclass
classification methods (Softmax function, 2016). Similarly, entropy was also set to TRUE.
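For reference, the softmax function itself is a one-liner; a minimal Python sketch (the model actually used nnet's built-in implementation in R):

```python
import math

def softmax(scores):
    """Normalize raw scores into a categorical probability distribution.
    Subtracting the max first keeps exp() numerically stable."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # three probabilities summing to 1
```

The predicted class is simply the outcome with the highest probability.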
Starting with the full Training dataset, as outlined in Fig. 6, steps were taken to progressively improve the
model accuracy by:
Trimming the data to select only the most significant variables
Balancing the trimmed Training data to have adequate representation of variables
Oversampling the outcome (Treatment) variable to ensure the nnet Neural Network model has
enough of an opportunity to learn the characteristics of each outcome. This learning is important
for the model to correctly classify the outcome of such cases.
Finally, the model developed using the Training dataset was applied to the Test dataset. The
results obtained were used for further analysis.
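The oversampling step above can be sketched as simple random oversampling (Python for illustration; the exact resampling scheme used in R may differ):

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every outcome class
    is as frequent as the majority class (random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        idx = [i for i, lbl in enumerate(labels) if lbl == cls]
        for _ in range(target - n):
            i = rng.choice(idx)          # pick a minority-class row to copy
            out_rows.append(rows[i])
            out_labels.append(cls)
    return out_rows, out_labels

rows = [[1], [2], [3], [4], [5]]
labels = ["affirmed", "affirmed", "affirmed", "reversed", "vacated"]
_, balanced = oversample(rows, labels)
print(Counter(balanced))  # every class now appears 3 times
```

Only the Training split is oversampled; the Test split must keep its natural class distribution so that accuracy estimates stay honest.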
Fig 6: Neural Network Model Tuning
XGBoost: As with the Neural Network model, Training and Test datasets were created using the caret
package in R; the xgboost package in R was then used to build the model. The model-tuning approach
used for the Neural Network was also applied to XGBoost, as depicted in Fig 7. Within the parameters
used to develop the XGBoost model, the objective was specified as “multi:softprob” for multiclass
classification.
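The parameter set can be sketched as a plain dictionary (shown in Python; apart from the objective, the values below are illustrative defaults, not the exact tuning the authors used):

```python
# Illustrative XGBoost parameter set for multiclass classification.
# "multi:softprob" returns a probability per class; num_class must
# match the number of consolidated Treatment outcomes (7 here).
params = {
    "objective": "multi:softprob",
    "num_class": 7,
    "eta": 0.1,                 # learning rate (illustrative)
    "max_depth": 6,             # tree depth (illustrative)
    "eval_metric": "mlogloss",  # multiclass log-loss
}
print(params["objective"])  # multi:softprob
```

With "multi:softprob" the booster outputs a full probability vector per case, from which the predicted Treatment is the class with the highest probability.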
Fig 7: XGBoost Model Tuning
Results and Discussion (Section 8):
The Confusion Matrices for the two multinomial classification models, given in Fig. 8 and Fig. 9, support
the following observations:
The most prominent observation across both models is the impact that oversampling had on
their performance. It prepared the models to handle all types of possible outcomes.
Both machine learning techniques performed very well when the outcome is
‘Affirmed’. This is mainly because the high proportion of this outcome gave the models enough of
an opportunity to learn over several iterations.
At the same time, outcomes such as ‘Reversed’ and ‘Vacated’ were predicted as ‘Affirmed’ – a
nearly diametrically opposite classification.
This behavior leads us to think that there is a fine line between a case being classified as
Affirmed versus Reversed or Vacated. It could be influenced by one or two critical variables
which, if identified, could significantly simplify the models. We would like to pursue this in our
future efforts.
Studying the most significant variables indicated by both Neural Network and XG Boost, we
were able to make the following observations:
o The Appeals court (there are 13 Appeals Courts in USA) that is currently hearing the case
significantly affects the outcome. Understanding which Appeals Court is more likely to
rule in an Appellant’s favor would be valuable in working out the strategy for the
Appellant.
o If the previous court was unable to decide on the case and the outcome was ‘Not
Ascertained’, the Appeals Court is likely to give a more decisive ruling.
o Nature of the Appellant also plays a significant role in the outcome of the Appeals Court. If
the Appellant happens to be a ‘Natural Citizen’, this carries greater significance for the
Appeals Court outcome.
o In a panel of Judges, the Directionality of the 3rd Judge has a significant effect on the
overall outcome of the case.
o Amongst these, the Judge’s assertion of the broadest interpretation of First Amendment
protections, including Freedom of Speech, Religion and the Right to Protest Peacefully, is
highly significant.
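The kind of analysis summarized in Fig. 8 and Fig. 9 can be sketched as follows (pure Python; the labels and predictions below are invented for illustration, not results from the paper):

```python
from collections import defaultdict

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs; off-diagonal entries such as
    ('reversed', 'affirmed') are the misclassifications discussed above."""
    matrix = defaultdict(int)
    for a, p in zip(actual, predicted):
        matrix[(a, p)] += 1
    return dict(matrix)

def accuracy(actual, predicted):
    """Fraction of cases where the predicted outcome matches the actual."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual    = ["affirmed", "reversed", "affirmed", "vacated"]
predicted = ["affirmed", "affirmed", "affirmed", "vacated"]
print(confusion_matrix(actual, predicted))
print(accuracy(actual, predicted))  # 0.75
```

Reading down a predicted-class column of such a matrix shows which actual outcomes get pulled into it, which is how the ‘Reversed’-predicted-as-‘Affirmed’ pattern above is detected.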
Fig 8: Neural Network - Confusion Matrix
Fig 9: XG Boost – Confusion Matrix
Conclusion (Section 9):
We have made an effort to develop multinomial predictive models that can predict the outcome of cases
handled by the US Court of Appeals. These models would enable litigants and lawyers to take decisions
more objectively, using historical data rather than experience and intuition. After extensive cleansing of
the data and organizing it for better understanding, classification techniques such as Neural Network and
XGBoost were used.
The biggest limitation was the size of the available data: a total of 2,160 rows. The machine learning
techniques we applied, such as Neural Network and XGBoost, had restricted opportunities to refine their
weights for all possible outcomes. To address this shortcoming, oversampled data was used, which
significantly improved model performance.
Overall, the results obtained from these models were very encouraging. The models’ accuracy,
resource usage and consistency validated our intention to demonstrate the use of analytics in the legal
domain.
As part of future studies, we plan to determine the characteristics of each outcome using Decision Trees
and also simplify the models to use fewer variables.
In the light of initiatives such as Digital India, we expect large amounts of legal data to become available
in the coming years. Analytics can bring efficiencies into the Indian legal system and help reduce the
backlog of more than 3 crore cases pending court decisions.
References (Section 10):
Cross, F. B. (2003, December). Decision Making in the US Courts of Appeals. Retrieved from California
Law Review:
http://scholarship.law.berkeley.edu/cgi/viewcontent.cgi?article=1351&context=californialawreview
JuRI. (n.d.). Retrieved from http://artsandsciences.sc.edu/poli/juri/appct.htm
JuRI_Codebook. (2003). KH_codebook. Retrieved from Arts and Sciences, SC:
http://artsandsciences.sc.edu/poli/juri/KH_update_codebook.pdf
JuRI_data. (2003). www.cas.sc.edu. Retrieved from KH_update:
http://www.cas.sc.edu/poli/juri/KH_update_stata.zip
Softmax function. (2016, October 9). Retrieved from Wikipedia:
https://en.wikipedia.org/wiki/Softmax_function
Ruger, T. W., Kim, P. T., Martin, A. D., & Quinn, K. M. (2004). The Supreme Court Forecasting Project:
Legal and Political Science Approaches to Predicting Supreme Court Decisionmaking. Retrieved from
Berkeley Law:
http://scholarship.law.berkeley.edu/cgi/viewcontent.cgi?article=1018&context=facpubs