A clustering-based approach to detect probable outcomes of lawsuits

Undergraduate thesis/final project
Escola de Informática Aplicada - UNIRIO
Author: Daniel Lemes Gribel <daniel.gribel@uniriotec.br>
Committee:
Leonardo G. Azevedo ¹,² (supervisor)
Maíra A. C. Gatti ² (supervisor)
Adriana C. de F. Alvim ¹
Sean W. M. Siqueira ¹
¹ UNIRIO, ² IBM Research
December 19, 2014

The project idea
IBM Research, 2013: inspired by a Social Media Simulator
(SMSim project) developed to predict Twitter users' behavior.
First idea: model judges' behavior and then predict lawsuit
outcomes through multi-agent simulation, as in SMSim.
New proposal: develop an approach to suggest possible
outcomes for a given lawsuit based on modelling, similarity
detection and clustering.

Project contributions
Results showed that, by analysing past data, it was possible to
identify the most likely outcome and to estimate its degree of
uncertainty.

Problem statement
Large amount of unstructured data coming from the numerous
lawsuits ⇒ Large amount of hidden or unknown information
★ How do we know which similar lawsuits can serve as a
reference for a new lawsuit?
★ How do we estimate the time needed to reach a decision?
★ How do we estimate the likelihood of each possible
outcome?

The STF and its responsibilities
The Brazilian Supreme Court (STF) is a body of the Brazilian
Judiciary System, responsible for safeguarding and interpreting
the Constitution. The STF decides matters related to the
Constitution or cases in which there is doubt or controversy
regarding legal actions ².
² STF. Institucional. 2011. Available from internet: http://www.stf.jus.br/portal/cms/verTexto.asp?servico=sobreStfConhecaStfInstitucional

STF judgement configuration
The STF currently consists of 11 judges, who sit in its Panels as
well as in its Plenary.
1. Monocratic: decision taken by a single judge.
2. Collegial: one of the judges acts as rapporteur, each judge
votes individually, and the majority decision prevails.
a. First Panel (Primeira Turma): 5 judges.
b. Second Panel (Segunda Turma): 5 judges.
c. Plenary: 11 judges (currently, there is an open position).

Law classes
There are several lawsuit classes in the Brazilian judicial system:
Habeas Corpus, Interlocutory Appeal, Extraordinary Appeal, etc.
In this work, only lawsuits belonging to the Appeal class are
considered *.
* The choice of the Appeal class was supported by conversations with a professor and a student of
the Law School at Fundação Getúlio Vargas (FGV).

Law classes
Appeal: “the instrument to cause a review of a decision by the
same judicial authority, or other hierarchically higher, in order to
obtain their reform or modification” ³
● Appeals account for more than 50% of the ~1.5M lawsuits
judged by the STF, which is important in terms of the
heterogeneity of the data.
● Appeals have similar dynamics in their life cycles, which is
important in terms of pattern detection.
³ Moacyr Amaral Santos, professor, lawyer and minister of the Supreme Court.

Mental modelling
1. Look for an appeal lawsuit page on the STF website and
identify its metadata: lawsuit id, period (start and end date),
state of origin, rapporteur, author, defendant, type (area of
Law) and subjects associated with the lawsuit.
2. Identify the summary and the claim of the lawsuit, found in
a document called “Acórdão”.
3. Extract decisions and votes from “Acórdão”.
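The properties listed in the steps above could be held in a simple record type. This is a minimal sketch using Python dataclasses. The class and field names are hypothetical and are not taken from the thesis code. The decision fields follow the properties named later in the Datasets slide.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class Decision:
    """One decision extracted from an "Acórdão" (field names are assumptions)."""
    decision_id: str
    decision_type: str                 # type of decision
    decided_on: date
    votes: List[Tuple[str, str]]       # (judge name, vote) pairs
    result: str                        # resultant decision


@dataclass
class Lawsuit:
    """Metadata of one appeal lawsuit (field names are assumptions)."""
    lawsuit_id: str
    start_date: date
    end_date: date
    state_of_origin: str
    rapporteur: str
    author: str
    defendant: str
    law_type: str                      # area of Law
    subjects: List[str] = field(default_factory=list)
    summary: str = ""
    claim: str = ""
    decisions: List[Decision] = field(default_factory=list)
```

Keeping the votes as (judge name, vote) pairs mirrors the tuple format used in the decisions dataset, so a record can be filled directly from one row.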

Classification and clustering
Clustering goals ⁴:
1. Development of a typology or classification.
2. Investigation of conceptual schemes for grouping entities.
3. Hypothesis generation through data exploration.
4. Hypothesis testing, or the attempt to determine if types
defined through other procedures are in fact present in a
dataset.
⁴ ALDENDERFER, M. S.; BLASHFIELD, R. K. Cluster Analysis. Beverly Hills: Sage, 1984.

Classification and clustering
Adapted from WOOYOUNG, K. Parallel Clustering Algorithms: Survey. Available from internet:
http://www.solver.com/hierarchical-clustering-intro

Hierarchical clustering
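[Figure: dendrogram over five elements A, B, C, D, E. Agglomerative
clustering merges bottom-up (A,B and D,E, then C,D,E, then A,B,C,D,E);
divisive clustering splits top-down. A tree cut selects the clusters.]
Adapted from Frontline Solvers. Cluster Analysis. Available from internet: http://www.solver.com/hierarchical-clustering-intro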

Hierarchical clustering
+ Advantages:
● Does not require a pre-defined
number of clusters.
● Accepts any valid distance
measure.
● Less influenced by cluster
shapes and less sensitive to
clusters with different
densities.
- Disadvantages:
● Complexity: in general at least
O(n²), which makes these
algorithms too slow for large datasets.

Ward’s algorithm
In Ward’s minimum variance criterion, a special case of
Ward’s general method, the objective is to minimize the
total within-cluster variance.
As a general result, Ward’s minimum variance method leads to
compact and spherical clusters.
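As an illustration of the criterion, the cost of merging two clusters A and B under Ward's method is the increase in the within-cluster sum of squares, which equals |A||B| / (|A| + |B|) times the squared distance between the two centroids. The sketch below assumes points are numeric vectors, which is a simplification (the lawsuits in this work are compared through similarity coefficients). The function names are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]


def centroid(cluster: List[Point]) -> List[float]:
    """Component-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(p[d] for p in cluster) / n for d in range(len(cluster[0]))]


def sse(cluster: List[Point]) -> float:
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(cluster)
    return sum(sum((p[d] - c[d]) ** 2 for d in range(len(c))) for p in cluster)


def ward_cost(a: List[Point], b: List[Point]) -> float:
    """Increase in total within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist
```

At each step, Ward's algorithm merges the pair of clusters with the smallest `ward_cost`, which is why it tends to keep clusters compact.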

Single-linkage algorithm
In single-linkage clustering, the
distance between two clusters is defined by
the two elements (one in each cluster) that
are closest to each other.
The shortest of these links causes the fusion of the two
clusters whose elements are involved.

Complete-linkage algorithm
In complete-linkage clustering, the
distance between two clusters is defined by
the two elements (one in each cluster) that
are farthest away from each other.
The shortest of these links causes the fusion of the two
clusters whose elements are involved.
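The two linkage rules differ only in how the distance between clusters is measured. The sketch below is a naive agglomerative loop over a precomputed distance matrix, written for clarity rather than speed. It is not the implementation used in this work, and the names `single_link`, `complete_link` and `agglomerate` are hypothetical.

```python
from typing import Callable, List

Matrix = List[List[float]]
Linkage = Callable[[Matrix, List[int], List[int]], float]


def single_link(dist: Matrix, a: List[int], b: List[int]) -> float:
    """Distance between the two closest elements, one from each cluster."""
    return min(dist[i][j] for i in a for j in b)


def complete_link(dist: Matrix, a: List[int], b: List[int]) -> float:
    """Distance between the two farthest elements, one from each cluster."""
    return max(dist[i][j] for i in a for j in b)


def agglomerate(dist: Matrix, k: int, linkage: Linkage) -> List[List[int]]:
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None  # (linkage distance, index x, index y)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(dist, clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # fuse the two clusters
        del clusters[y]
    return clusters
```

Stopping at `k` clusters plays the role of the tree cut. On elongated data, single linkage tends to chain neighbouring points into one long cluster, while complete linkage favours groups of similar diameter.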

Similarity calculation
From the modelled dataset, calculate the similarities between
lawsuits:
1. Each pair of lawsuits receives a similarity coefficient for
each property.
2. Then, a mean (resultant) matrix is obtained by averaging the
property matrices.
Output: Similarity matrix
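The two steps above can be sketched as follows. The sketch assumes that each property value is represented as a set of tokens and that the property matrices are averaged without weights. Both are assumptions, as are the function names.

```python
from typing import List, Set

Matrix = List[List[float]]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(union)


def property_matrix(values: List[Set[str]]) -> Matrix:
    """Pairwise similarity of all lawsuits with respect to one property."""
    return [[jaccard(x, y) for y in values] for x in values]


def mean_matrix(matrices: List[Matrix]) -> Matrix:
    """Element-wise unweighted mean of the property matrices."""
    n = len(matrices[0])
    m = len(matrices)
    return [[sum(mat[i][j] for mat in matrices) / m for j in range(n)]
            for i in range(n)]
```

A weighted mean would be a natural variation if some properties should count more than others.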

Similarity calculation
Similarity metric - Jaccard index:
Mean similarity:
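For reference, the Jaccard index of two sets A and B has the standard definition given first below. The mean similarity is written here as an unweighted average over P property matrices S^(p), which is an assumption. The symbols P, S^(p) and the indices i, j are introduced here for illustration.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
\bar{S}_{ij} = \frac{1}{P} \sum_{p=1}^{P} S^{(p)}_{ij}
```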

Lawsuits clustering
From the similarities observed, run the hierarchical clustering
algorithm.
Output: lawsuits classified into clusters.

Lawsuit instance assignment
From the detected clusters, calculate the similarities between
the new lawsuit instance and the other lawsuits already
classified.
Output: new instance assigned to the most similar cluster.
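One way to read "most similar cluster" is the cluster whose members have the highest average similarity to the new instance. The sketch below follows that reading, which is an assumption. The function name and arguments are hypothetical.

```python
from typing import List


def assign_to_cluster(sim_to_known: List[float],
                      clusters: List[List[int]]) -> int:
    """Return the index of the cluster with the highest mean similarity.

    sim_to_known[i] is the similarity between the new lawsuit and the
    already classified lawsuit i; clusters holds lawsuit indices.
    """
    def mean_sim(members: List[int]) -> float:
        return sum(sim_to_known[i] for i in members) / len(members)

    return max(range(len(clusters)), key=lambda c: mean_sim(clusters[c]))
```

Other criteria, such as the single most similar member, would assign differently when a cluster contains one very close lawsuit and several distant ones.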

Decisions compilation
Considering a list of judges that will decide the lawsuit:
1. Collect their past votes observed in the cluster.
2. Compute the degree of agreement between them.
For each judge j_x, compare his/her decisions with each decision taken by
the other judges in the input, lawsuit by lawsuit.
The ratio (number of common votes) / (number of common decisions)
determines the degree of agreement for each judge.
Output: the likely outcome – a number between 0 and 1,
indicating the probable decision.
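The degree of agreement could be computed as sketched below. The sketch assumes that a "common decision" is a decision in which both judges voted and a "common vote" is a common decision in which they voted the same way. The data layout and names are hypothetical, and the step that turns agreement into the final outcome score is not covered here.

```python
from typing import Dict, List

# decision id -> {judge name -> vote}, restricted to the cluster's decisions
Votes = Dict[str, Dict[str, str]]


def agreement(votes: Votes, judge: str, panel: List[str]) -> float:
    """Share of common decisions in which `judge` voted like the other judge."""
    same = 0    # common votes
    shared = 0  # common decisions
    for ballot in votes.values():
        if judge not in ballot:
            continue
        for other in panel:
            if other != judge and other in ballot:
                shared += 1
                if ballot[other] == ballot[judge]:
                    same += 1
    # assumption: a judge with no common decisions gets agreement 0
    return same / shared if shared else 0.0
```

Each (decision, other judge) pair counts once, so a judge who sat with many colleagues contributes more comparisons than one who mostly decided alone.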

Datasets
lawsuit_16.csv: 16 lawsuits
decision_16.csv: 24 decisions
Lawsuits: lawsuit id, start/end date of lawsuit, state of origin,
rapporteur, defendant, author, type, subjects, summary and
claim.
Decisions: associated lawsuit id, decision id, type of decision,
date, votes tuple <judge name, vote> and resultant decision.

Similarity analysis
[Figure: similarity matrices for the Rapporteur and Summary properties.
Scale: from completely different to completely similar.]

Similarity analysis
[Figure: panels "Mean similarity" and "Mean similarity (Pearson correlation)".
Scale label: completely similar.]

Clustering analysis
[Figure: clustering analysis. Scale label: completely similar.]

Performance of the agglomerative algorithms

Prediction results
Reveals an… optimization problem!
● The correct choice of the number k of clusters is not trivial, depending on the distribution of
points in a dataset and on the desired clustering resolution.
● Possible approach: define a search space, overestimate k, and then develop optimization
heuristics to determine a new stopping point (k₂) when the algorithm finds a good solution.
● A stopping point, in this case, could be when the algorithm finds a cluster that is similar
enough to the instance being tested and struggles to improve on the best rate found.
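One possible reading of this heuristic is sketched below. It starts from an overestimated k, lowers it step by step, and stops once the best similarity found is above a threshold and has not improved for a few steps. The function `best_similarity` (which would re-cluster with k clusters and return the similarity between the tested instance and its closest cluster) and the parameters `threshold` and `patience` are hypothetical.

```python
from typing import Callable


def search_k(best_similarity: Callable[[int], float], k_max: int,
             threshold: float, patience: int = 2) -> int:
    """Scan k from k_max down to 1 and return the stopping point k2.

    The scan stops early once the best score seen is at least `threshold`
    and has not improved during the last `patience` values of k.
    """
    best_k, best_score, stale = k_max, float("-inf"), 0
    for k in range(k_max, 0, -1):
        score = best_similarity(k)
        if score > best_score:
            best_k, best_score, stale = k, score, 0
        else:
            stale += 1
        if best_score >= threshold and stale >= patience:
            break
    return best_k
```

The early stop avoids evaluating every k in the search space, at the risk of missing a better k beyond the plateau.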

Main contributions
● By analysing past data, it is possible to find similar cases
that were already judged.
● Results showed that it was possible to identify the most likely
outcome and to estimate the degree of uncertainty of the
outcome.
● Prediction results were satisfactory: lawsuit instances were
correctly assigned to clusters and similarity comparison
revealed a good coefficient between lawsuits.

Future work
● Use more sophisticated machine learning techniques.
● Investigate a more efficient clustering method than the
hierarchical clustering - consider optimization issues.
● Discriminate decisions by type.
● Develop a better mechanism to find the weights of lawsuit
properties.
● Have a training and a testing dataset. Then, use evaluation
metrics to check if predictions match real outcomes.
● Investigate stochastic simulation approaches.

Code and datasets are available in a Git repository at bitbucket.org.
Contact daniel.gribel@uniriotec.br for access!
Thank you! Questions?
