NLP applied to French legal
Demonstration of a significant bias of some French court
of appeal judges in decisions about the rights of asylum.
Feb 17th, 2016
● Anthony Sypniewski @ Google (NYC)
○ Software Engineer, Data Scientist
○ Supra Legem: Dev the website (back end, front end)
● Michaël Benesty @ Deloitte (Paris) [this is me]
○ Tax law associate, former CPA/Financial auditor
○ XGBoost + FeatureHashing R packages co-author, DMLC member
○ Supra Legem: Dev ETL & machine learning
None of the team members' employers is linked in any way to this personal project.
Opinions expressed here are the team's only.
1. Brief overview of the French legal system
2. The intuition behind word2vec
3. How we designed our fancy neural network
4. Presentation of the dataset and our bias analysis result
French legal system… in 1 schema
ORDINARY COURTS (CIVIL LAW, CRIMINAL LAW)
ADMINISTRATIVE COURTS
Cour de cassation : chambers
Labour Commercial 3 Civil chambers Criminal
Cour d’appel : chambers
Labour Commercial Civil Criminal
Tribunal de Grande
Conseil de Prud’hommes Tribunal d’Instance
Juge de proximite
Cour administrative d’appel
A simple asylum
When an asylum seeker (or any
other undocumented person)
receives a deportation order
(dark red boxes), one of the
options is to ask an
administrative court judge to
cancel it. Those judges' decisions
are the ones we will analyze in the
For the most intrepid, a readable version of this
schema is available on the Senate website
Basic intuition behind word2vec : feed forward (1/3)
Famous algorithm which assigns similar vectors to similar words from a corpus (what does similar mean?). Below is a
simplified theoretical version of word2vec, corresponding to Continuous Bag of Words (C-BOW) with 1 context word.
Task is to predict a word from its context.
Ex.: The court of appeal judge is Mr. Toto.
Based on the distributional hypothesis (Harris, 1954): you shall know a word by the company it keeps (Firth, 1957)
● Input layer: 1 hot encoded context word (index in dictionary)
● W: context dense word matrix (whole dic)
● Hidden layer: context word dense vector
● W’: output dense word matrix (whole dic)
● Output layer: P(output | context word)
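The layers listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the deck's actual code: the toy vocabulary, the vector size and the random weights are assumptions made for the example.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a 1-D score vector."""
    z = np.exp(u - np.max(u))
    return z / z.sum()

def cbow_forward(context_index, W, W_prime):
    """Forward pass of C-BOW with a single context word.

    W       : (vocab_size, dim) context word matrix
    W_prime : (dim, vocab_size) output word matrix
    """
    h = W[context_index]   # hidden layer: the row of W selected by the 1-hot input
    u = h @ W_prime        # one score per word of the dictionary
    y = softmax(u)         # P(output word | context word)
    return h, y

# Toy setup: hypothetical vocabulary and random weights, for illustration only.
rng = np.random.default_rng(0)
vocab = ["the", "court", "of", "appeal", "judge", "is", "mr", "toto"]
dim = 4
W = rng.normal(scale=0.1, size=(len(vocab), dim))
W_prime = rng.normal(scale=0.1, size=(dim, len(vocab)))

h, y = cbow_forward(vocab.index("judge"), W, W_prime)
```

Multiplying W by a 1-hot vector only selects one row, which is why the hidden layer is simply the dense vector of the context word.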
Basic intuition behind word2vec : feed forward (2/3)
Loss : minimize E with
Back propagation :
can be interpreted as the prediction error
Continue back propagation
Output vector update :
Continue back propagation
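The steps named on this slide can be written out as follows. This is a sketch in the notation of word2vec Parameter Learning Explained (Xin Rong), which the deck cites as its source; the symbols (scores u_j, outputs y_j, 1-hot target t_j, learning rate \eta, hidden vector h, output vector v'_{w_j}, context vector v_{w_I}) come from that paper and are an assumption about what the slide displayed.

```latex
% Loss: negative log-likelihood of the word to predict (index j^*)
E = -\log p(w_O \mid w_I) = -u_{j^*} + \log \sum_{j'=1}^{V} \exp(u_{j'})

% Back propagation: derivative with respect to the score of word j
\frac{\partial E}{\partial u_j} = y_j - t_j := e_j
\qquad \text{($e_j$ can be interpreted as the prediction error)}

% Continue back propagation to the output matrix W'
\frac{\partial E}{\partial w'_{ij}} = e_j \, h_i

% Output vector update
v'^{(\text{new})}_{w_j} = v'^{(\text{old})}_{w_j} - \eta \, e_j \, h

% Continue back propagation to the hidden layer, then update the context vector
\frac{\partial E}{\partial h_i} = \sum_{j=1}^{V} e_j \, w'_{ij} := EH_i
\qquad
v^{(\text{new})}_{w_I} = v^{(\text{old})}_{w_I} - \eta \, EH^{T}
```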
Basic intuition behind word2vec : feed forward (3/3)
Possible intuitions* from previous slide (during back propagation):
● e is the prediction error;
● The output word vector (w’) tends to look like its context vector (w);
● The context word vector is adjusted by the sum of the output vectors (w’) weighted by the prediction error
(e). Therefore, if a low probability is given to the word to predict, the adjustment makes the context
vector look more like that word's output vector: the context vector moves closer to it (cosine
distance). The reverse holds when a high probability is given to a wrong word.
● Combined, words sharing the same distribution of contexts end up with similar vectors
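The intuitions above can be checked on a toy example. The sketch below applies plain gradient descent on the full softmax for one (context word, word to predict) pair; the sizes, the learning rate, the word indices and the random weights are assumptions made for illustration.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))
    return z / z.sum()

def sgd_step(context_index, target_index, W, W_prime, lr=0.5):
    """One back-propagation step; W and W_prime are updated in place."""
    h = W[context_index].copy()        # context word vector
    y = softmax(h @ W_prime)           # predicted distribution over the dictionary
    e = y.copy()
    e[target_index] -= 1.0             # prediction error: e_j = y_j - t_j
    EH = W_prime @ e                   # sum of output vectors weighted by e
    W_prime -= lr * np.outer(h, e)     # each output vector moves by -lr * e_j * h
    W[context_index] -= lr * EH        # context vector moves by -lr * EH

# Hypothetical toy setup.
rng = np.random.default_rng(0)
vocab_size, dim = 8, 4
W = rng.normal(scale=0.1, size=(vocab_size, dim))
W_prime = rng.normal(scale=0.1, size=(dim, vocab_size))
context, target = 4, 7                 # arbitrary word indices

p_before = softmax(W[context] @ W_prime)[target]
for _ in range(20):
    sgd_step(context, target, W, W_prime)
p_after = softmax(W[context] @ W_prime)[target]
```

Because the error of the word to predict is negative (its probability is below 1), the update pulls the context vector toward that word's output vector and pushes it away from the others in proportion to the probability they received.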
Some interesting readings about Word2Vec may include:
● word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method (2014,
Yoav Goldberg, Omer Levy)
● Neural word embedding as implicit matrix factorization (2014, Yoav Goldberg, Omer Levy)
*: this is a bad and partial summary from word2vec Parameter Learning Explained (2016, Xin Rong)
Recurrent Neural Network (RNN) structure is not that far
from a shallow neural network
● U matrix is initialized with vectors learned by word2vec
● x is just a 1 hot encoded vector of the word index in the whole corpus dictionary
● The model is basically stored in W
● W is shared among steps -> RNN is like deep learning with shared weights.
● Because of noise on long sequences, vanishing / exploding gradient issue is exacerbated
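A minimal sketch of the recurrence described above, assuming the common formulation h_t = tanh(U x_t + W h_{t-1}). The sizes and random weights are placeholders; in the deck's setting, U would hold the word2vec vectors.

```python
import numpy as np

def rnn_forward(word_indices, U, W):
    """Vanilla RNN over a sequence of word indices.

    U : (vocab_size, hidden) word matrix; U[i] equals U applied to the 1-hot vector of word i
    W : (hidden, hidden) recurrent matrix, shared by every time step
    """
    h = np.zeros(W.shape[0])
    states = []
    for idx in word_indices:
        h = np.tanh(U[idx] + W @ h)    # the same W is reused at each step
        states.append(h)
    return np.stack(states)

# Hypothetical toy setup.
rng = np.random.default_rng(0)
vocab_size, hidden = 10, 5
U = rng.normal(scale=0.1, size=(vocab_size, hidden))
W = rng.normal(scale=0.1, size=(hidden, hidden))

states = rnn_forward([0, 3, 1, 2], U, W)
```

Unrolled over a sequence of length T, this is a T-layer network whose layers all share W, which is why gradients are multiplied by the same matrix repeatedly and can vanish or explode.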
Gated Recurrent Unit (GRU) improves long dependencies
GRU has 2 gates:
● Reset gate r: determines how to combine the new input with the previous memory
● Update gate z: how much of the previous memory to keep around
Unlike RNN, GRU can choose what to learn from the present and what to
throw away from the past.
● Important features are protected against being overwritten
● During back propagation, there are shortcut paths that bypass
multiple temporal steps -> avoid vanishing gradients as a result of
passing through multiple bounded nonlinearities
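The two gates can be sketched as a single GRU step in NumPy. This follows one common convention, in which z multiplies the previous state; the parameter names, sizes and random weights are assumptions made for the example, not the deck's implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU step. p is a dict of weight matrices and biases (hypothetical names)."""
    z = sigmoid(p["Uz"] @ x + p["Wz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Ur"] @ x + p["Wr"] @ h_prev + p["br"])   # reset gate
    # Candidate state: the reset gate decides how much of the past enters it.
    h_tilde = np.tanh(p["Uh"] @ x + p["Wh"] @ (r * h_prev) + p["bh"])
    # Update gate: share of the previous memory kept as-is.
    return z * h_prev + (1.0 - z) * h_tilde

def init_params(in_dim, hidden, rng):
    p = {}
    for gate in "zrh":
        p["U" + gate] = rng.normal(scale=0.1, size=(hidden, in_dim))
        p["W" + gate] = rng.normal(scale=0.1, size=(hidden, hidden))
        p["b" + gate] = np.zeros(hidden)
    return p

rng = np.random.default_rng(0)
params = init_params(in_dim=4, hidden=3, rng=rng)
x = rng.normal(size=4)
h_prev = rng.normal(scale=0.5, size=3)

h_new = gru_step(x, h_prev, params)

# With a very large update-gate bias, z is close to 1 and the memory is copied through.
params_keep = dict(params, bz=np.full(3, 50.0))
h_kept = gru_step(x, h_prev, params_keep)
```

When z is close to 1 the previous state passes through unchanged, which is the shortcut path that lets gradients skip several time steps.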
Generalizing the model by multi-input and multi task learning
Why multi input?
● Splitting the text in different parts for long
documents (in a way which makes sense)
○ Possible to have parameters specific to
each part (if parts are very different)
○ Split text to avoid long distance
● Provide non time-series data to help
understanding text document
○ Categorical data related to document
Why multi task?
● Better model generalization as each task
has its own bias
○ Kind of ensemble approach, but during the
● One gradient descent process per task: it
is like* having a bigger dataset
*: OK, this is not the same thing, but still, multitasking may
provide a better improvement than just increasing the number of epochs
What the final model looks like
Softmax Task 1
Softmax Task 2
Softmax Task 3
● Task 1: learn category of the applicant
● Task 2: learn category of the defendant
● Task 3: learn category of the decision
● Text 1: claims from the applicant
● Text 2: solution of the decision
● Context : categorical information about the
decision (court, judge name, theme, …)
Text 1 and Text 2 are extracted from the full decision by a separate learned model.
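The overall shape of such a multi-input, multi-task model can be sketched as follows. This is a simplified stand-in and not the deck's network: the text encoder here is a mean of word vectors where a recurrent encoder such as a GRU would normally sit, the per-task losses are simply summed, and all names, sizes and weights are assumptions. Only the category counts (6, 6 and 11) come from the deck.

```python
import numpy as np

def softmax(u):
    z = np.exp(u - np.max(u))
    return z / z.sum()

def encode_text(word_indices, embeddings):
    """Stand-in text encoder: mean of the word vectors (assumption, not a GRU)."""
    return embeddings[word_indices].mean(axis=0)

def forward(text1, text2, context_id, params):
    """Three inputs feed one shared representation, read by three softmax heads."""
    shared = np.concatenate([
        encode_text(text1, params["emb"]),     # Text 1: claims from the applicant
        encode_text(text2, params["emb"]),     # Text 2: solution of the decision
        params["ctx_emb"][context_id],         # Context: categorical information
    ])
    return {task: softmax(shared @ head) for task, head in params["heads"].items()}

def multitask_loss(probs, labels):
    """Sum of the cross-entropy losses of the tasks (assumed combination rule)."""
    return sum(-np.log(probs[task][labels[task]]) for task in probs)

rng = np.random.default_rng(0)
vocab_size, dim, n_contexts = 50, 8, 5
n_classes = {"applicant": 6, "defendant": 6, "decision": 11}
params = {
    "emb": rng.normal(scale=0.1, size=(vocab_size, dim)),
    "ctx_emb": rng.normal(scale=0.1, size=(n_contexts, dim)),
    "heads": {t: rng.normal(scale=0.1, size=(3 * dim, k)) for t, k in n_classes.items()},
}

probs = forward([1, 4, 7], [2, 3], context_id=0, params=params)
loss = multitask_loss(probs, {"applicant": 0, "defendant": 3, "decision": 10})
```

Since every head reads the same shared representation, the gradient of each task updates the shared parameters, which is where the regularizing effect of multi-task learning comes from.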
The dataset is probably quite easy for a
model to learn: the vocabulary is
stable, there is no irony or double
negation, and text formulas and
structure are reused across decisions.
The fancy model helps too. A simple GRU on
task 2 only, with default values (from
Keras) and no use of word2vec, gave
slightly more than 80% accuracy.
Task 3 is the easiest task as the
structure of Text 2 is very stable. On
the other hand, it has 11 categories,
which is more than Tasks 1 & 2.
Learning on: training: 70%, validation: 10%, test: 20%
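[Results table, mostly lost in extraction. Rows: Task 1 (multiclass, 6 categories), Task 2 (multiclass, 6 categories), Task 3 (multiclass, 11 categories). Surviving cells: "0.945" and "Not performed".]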
Binarization means that tasks 1 and 2 have been recast as a
binary classification: the applicant (or defendant) may be a private
person or the administration. There were originally 5 categories of
administration, and they have been merged into one.
The same model is used in both cases. Therefore all classification
errors between the 5 administration categories disappear
mechanically. This entirely explains the accuracy improvement.
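The mechanical effect described above can be reproduced on a toy example. The label names and predictions below are invented for illustration; only the "1 private person category + 5 administration categories" structure comes from the deck.

```python
def accuracy(y_true, y_pred):
    """Share of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binarize(label):
    """Merge the 5 administration categories into a single one."""
    return "private person" if label == "private person" else "administration"

# Hypothetical labels and predictions from one and the same model.
y_true = ["private person", "admin_1", "admin_2", "admin_3",
          "admin_4", "admin_5", "private person", "admin_2"]
y_pred = ["private person", "admin_1", "admin_3", "admin_3",
          "admin_5", "admin_5", "admin_1", "admin_2"]

multiclass_acc = accuracy(y_true, y_pred)                  # 0.625
binary_acc = accuracy([binarize(t) for t in y_true],
                      [binarize(p) for p in y_pred])       # 0.875
```

The two confusions between administration categories stop counting as errors once the labels are merged, while the private person / administration confusion remains.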
Legal decisions from administrative courts of appeal are *partially* available in open data
● Open data decisions are provided by @Etalab there
● Supralegem.fr website covers [2000-2015], it represents 250K decisions
● On 2012-2015
○ ⅓ to ⅔ of the decisions issued by each appeal court per year are available
○ Court of appeal judge names are provided* for > 98% of the decisions (before 2012 < 95%)
● Analysis periods:
○ per judge: [2012-2015] because too many judge names are missing before 2012
○ per court: [2009-2015] because data are missing for some courts before 2009
● Important questions about open data decisions:
○ How are they selected?
○ Who selects them?
○ Why are some of the decisions not distributed?
*: judge names are included in the distributed open data and are not learned.
rejection rate per
Decision selection criteria:
● from a court of appeal
● marked as asylum category
● contain “quitter le territoire” &
“étranger” & “asile”
● applicant: natural person,
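The criteria listed above amount to a simple filter. In the sketch below, the record layout (keys "court_type", "category", "text", "applicant") and the label values are hypothetical; only the criteria themselves come from the slide.

```python
REQUIRED_TERMS = ("quitter le territoire", "étranger", "asile")

def is_selected(decision):
    """Apply the selection criteria to one decision (a dict with hypothetical keys)."""
    text = decision["text"].lower()
    return (
        decision["court_type"] == "court of appeal"
        and decision["category"] == "asylum"
        and all(term in text for term in REQUIRED_TERMS)   # the three terms must all appear
        and decision["applicant"] == "natural person"
    )

# Invented records, for illustration only.
decisions = [
    {"court_type": "court of appeal", "category": "asylum", "applicant": "natural person",
     "text": "L'étranger doit quitter le territoire ; sa demande d'asile est rejetée."},
    {"court_type": "court of appeal", "category": "tax", "applicant": "natural person",
     "text": "Litige relatif à l'impôt sur le revenu."},
]

selected = [d for d in decisions if is_selected(d)]
```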
4 courts have an increasing rejection rate, 3 are
stable, 1 is decreasing. Seems to match the
Is there a bias with
● Rejection rates of deportation
order per judge/year
● Selected 3 highest & lowest rates
from top 20% in quantity of
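The computation outlined above can be sketched with the standard library. The record layout (keys "judge", "year", "rejected") and the toy data are assumptions; the 20% threshold comes from the slide.

```python
import math
from collections import defaultdict

def rejection_rates(decisions):
    """Return {(judge, year): (rejection rate in %, number of decisions)}."""
    counts = defaultdict(lambda: [0, 0])          # [rejected, total]
    for d in decisions:
        key = (d["judge"], d["year"])
        counts[key][0] += int(d["rejected"])
        counts[key][1] += 1
    return {key: (100.0 * rej / total, total) for key, (rej, total) in counts.items()}

def busiest_judges(decisions, share=0.2):
    """Judges in the top `share` by number of decisions, most active first."""
    totals = defaultdict(int)
    for d in decisions:
        totals[d["judge"]] += 1
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:max(1, math.ceil(len(ranked) * share))]

# Invented toy data: (judge, year, rejected).
rows = [("A", 2014, True), ("A", 2014, True), ("A", 2014, False), ("A", 2015, True),
        ("B", 2014, False), ("B", 2014, True),
        ("C", 2015, True), ("D", 2015, False), ("E", 2015, True)]
decisions = [{"judge": j, "year": y, "rejected": r} for j, y, r in rows]

rates = rejection_rates(decisions)
top = busiest_judges(decisions)
```

Restricting the comparison to the most active judges keeps rates computed on a handful of decisions from dominating the ranking.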
Case documents are never public but would be
required to have a deep analysis of this apparent
Not shown here: judges from the same court may
have very different rejection rates in asylum. However
there may be some (unknown) good reasons to
explain these gaps.
On the right, tweets from a judge's assistant about
the practice of some judges of systematically
pushing to refuse to cancel deportation orders
(before hearing the case). “OQTF” means Obligation de
Quitter le Territoire Français. Storify link
*: from administrative courts of appeal
Judge             Court        2012  2013  2014  2015  Decisions
Guerrive          Marseille      78    41    43    60        453
Cherrier          Marseille      NA    NA    67    63        233
Krulic            Paris          NA    NA    60    73        198
Tandonnet Turot   Paris          90    97    98  100*        227
Pellissier        Nancy          NA    93    92    96        304
Helmholtz         Versailles     NA    93    92    91        201
*: Tandonnet Turot 2015 rate corresponds to few decisions and is not
The (im)possible legal consequences of a bias
● Article 6.1 of the European Convention on Human Rights (ECHR)
In the determination of his civil rights and obligations or of any criminal charge against him,
everyone is entitled to a fair and public hearing within a reasonable time by an independent and
impartial tribunal established by law…
● ECHR, Remli v. France, 23/04/96: France was condemned for subjective partiality (racism of a juror)
● Article L721-1 of the administrative justice code: if one has doubts about the impartiality of a judge, one
can ask for that judge's recusal
● Interesting report from the French private law supreme court (Cour de cassation) about judge impartiality
Truth is that there is (very) little chance that a French judge recognizes a bias from statistics related to
The result of this work is provided to
everyone on a dedicated website.
Tags learned are provided as filters.
A viz of the results is generated on
each search to show patterns, if any.
As of Feb 17th, 2016, 2015 is not
complete, but this should be fixed in