NLP applied to French legal decisions

Demonstration of a significant bias of some French court of appeal judges in decisions about the right of asylum.

www.supralegem.fr

Presented at the ML Paris meetup, Feb 17th, 2016


  1. NLP applied to French legal decisions
     Demonstration of a significant bias of some French court of appeal judges in decisions about the right of asylum.
     Feb 17th, 2016. www.supralegem.fr
  2. Team
     ● Anthony Sypniewski @ Google (NYC)
       ○ Software Engineer, Data Scientist
       ○ Supra Legem: developed the website (back end and front end)
     ● Michaël Benesty @ Deloitte (Paris) [this is me]
       ○ Tax law associate, former CPA / financial auditor
       ○ Co-author of the XGBoost and FeatureHashing R packages, DMLC member
       ○ Supra Legem: developed the ETL and machine learning
     None of the team members' employers is linked in any way to this personal project. Opinions expressed here are the team's only.
  3. Plan
     1. Brief overview of the French legal system
     2. The intuition behind word2vec
     3. How we designed our fancy neural network
     4. Presentation of the dataset and our bias analysis results
  4. French legal system… in 1 schema
     [Schema: two court orders. Ordinary courts, covering civil and criminal law: first degree (Conseil de Prud'hommes, Tribunal de Commerce, Tribunal d'Instance, Tribunal de Grande Instance, Juge de proximité, Tribunal de Police, Tribunal Correctionnel, Cour d'assises); second degree (Cour d'appel with labour, commercial, civil and criminal chambers; Cour d'assises); supreme court (Cour de cassation with labour, commercial, 3 civil and criminal chambers). Administrative courts: Tribunal administratif; Cour administrative d'appel; Conseil d'Etat, litigation division.]
  5. A simple asylum seeker journey (or not)
     When an asylum seeker (or any other undocumented person) receives a deportation order (dark red boxes), one of the options is to ask an administrative court judge to cancel it. Those judge decisions are the ones we analyze in the next slides.
     For the most intrepid, a readable version of this schema is available on the Senate website.
  6. Basic intuition behind word2vec: feed forward (1/3)
     A famous algorithm which assigns similar vectors to similar words from a corpus (what does "similar" mean?). Below is a simplified, theoretical version of word2vec corresponding to 1 context word with Continuous Bag of Words (C-BOW). The task is to predict a word from its context. Ex.: The court of appeal judge is Mr. Toto.
     Based on the distributional hypothesis (Harris, 1954): you shall know a word by the company it keeps (Firth, 1957).
     ● Input layer: one-hot encoded context word (index in the dictionary)
     ● W: context dense word matrix (whole dictionary)
     ● Hidden layer: context word dense vector
     ● W': output dense word matrix (whole dictionary)
     ● Output layer: P(output | context word)
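To make this concrete, here is a minimal NumPy sketch of the simplified one-context-word C-BOW forward pass described above; dictionary size, embedding size, and variable names are illustrative assumptions, not values from the talk:

```python
# A minimal sketch of the simplified C-BOW above: one context word,
# full softmax over the dictionary. All sizes are illustrative.
import numpy as np

V, N = 10_000, 100                            # vocabulary size, embedding size
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, N))       # context word matrix (whole dictionary)
W_out = rng.normal(scale=0.01, size=(N, V))   # output word matrix (whole dictionary)

def forward(context_idx):
    h = W[context_idx]            # one-hot input times W reduces to a row lookup
    u = h @ W_out                 # one score per word in the dictionary
    y = np.exp(u - u.max())       # stabilized softmax
    return h, u, y / y.sum()      # P(output word | context word)

h, u, y = forward(context_idx=42)
```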
  7. Basic intuition behind word2vec: feed forward (2/3)
     Objective: maximize $p(w_O \mid w_I) = y_{j^*} = \frac{\exp(u_{j^*})}{\sum_{j=1}^{V} \exp(u_j)}$ with $u_j = \mathbf{v}'^{\top}_{w_j} \mathbf{h}$ and $\mathbf{h} = \mathbf{v}_{w_I}$.
     Loss: minimize $E = -\log p(w_O \mid w_I) = -u_{j^*} + \log \sum_{j=1}^{V} \exp(u_j)$ with back propagation.
     With $e_j = \frac{\partial E}{\partial u_j} = y_j - t_j$ (where $t_j = 1$ iff $j = j^*$): $e_j$ can be interpreted as the prediction error.
     Output vector update: $\mathbf{v}'^{(\text{new})}_{w_j} = \mathbf{v}'^{(\text{old})}_{w_j} - \eta \, e_j \, \mathbf{h}$ for $j = 1, \dots, V$.
     Continuing back propagation, the context (input) vector update is: $\mathbf{v}^{(\text{new})}_{w_I} = \mathbf{v}^{(\text{old})}_{w_I} - \eta \sum_{j=1}^{V} e_j \, \mathbf{v}'_{w_j}$.
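A possible NumPy continuation of the previous sketch, implementing these updates (`forward`, `W` and `W_out` come from the sketch above; eta is an illustrative learning rate):

```python
# One training step matching the updates above: compute e = y - t, adjust the
# output vectors, then the context vector (using the pre-update output vectors).
eta, context_idx, target_idx = 0.025, 42, 7

h, u, y = forward(context_idx)
t = np.zeros(V)
t[target_idx] = 1.0               # true distribution: the word to predict
e = y - t                         # prediction error e_j = y_j - t_j

EH = W_out @ e                    # sum_j e_j * v'_j, with the old output vectors
W_out -= eta * np.outer(h, e)     # output vector update: v'_j -= eta * e_j * h
W[context_idx] -= eta * EH        # context (input) vector update
```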
  8. Basic intuition behind word2vec: feed forward (3/3)
     Possible intuitions* from the previous slide (during back propagation):
     ● e is the prediction error;
     ● the output word vector (in W') tends to look like its context vector (the hidden layer);
     ● the context word vector is adjusted by the sum of the output vectors (in W') weighted by the prediction error (e). Therefore, if a low probability was given to the word to predict, the adjustment will make the context vector look more like that output vector; in some way the context vector gets closer to it (cosine distance). The reverse holds for a high predicted probability on a wrong word;
     ● combined, words sharing the same distribution of contexts end up with similar vectors.
     Some interesting readings about word2vec:
     ● word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method (Goldberg & Levy, 2014)
     ● Neural word embedding as implicit matrix factorization (Levy & Goldberg, 2014)
     *: this is a rough and partial summary of word2vec Parameter Learning Explained (Xin Rong, 2016)
  9. Recurrent Neural Network (RNN): the structure is not that far from a shallow neural network
     ● The U matrix is initialized with the vectors learned by word2vec
     ● x is just a one-hot encoded vector of the word index in the whole corpus dictionary
     ● The model is basically stored in W
     ● W is shared among steps -> an RNN is like deep learning with shared weights
     ● Because of noise on long sequences, the vanishing / exploding gradient issue is exacerbated
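A minimal sketch of one vanilla RNN step under these definitions; all shapes and names are illustrative:

```python
# A single vanilla RNN step, to make the shared-weights point concrete.
import numpy as np

V, H = 10_000, 100                           # vocabulary size, hidden size
rng = np.random.default_rng(0)
U = rng.normal(scale=0.01, size=(V, H))      # input matrix, word2vec-initialized rows
W_rec = rng.normal(scale=0.01, size=(H, H))  # recurrent weights, shared across steps

def rnn_step(word_idx, s_prev):
    # one-hot x times U reduces to a row lookup; W_rec is reused at every step
    return np.tanh(U[word_idx] + s_prev @ W_rec)

s = np.zeros(H)
for word_idx in [12, 7, 42, 3]:              # a toy 4-word sequence
    s = rnn_step(word_idx, s)
```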
  10. Gated Recurrent Unit (GRU) improves long dependencies
     A GRU has 2 gates:
     ● Reset gate r: determines how to combine the new input with the previous memory
     ● Update gate z: how much of the previous memory to keep around
     Unlike a plain RNN, a GRU can choose what to learn from the present and what to throw away from the past. Advantages:
     ● Important features are protected against being overwritten
     ● During back propagation, there are shortcut paths that bypass multiple temporal steps -> this avoids the vanishing gradients that result from passing through multiple bounded nonlinearities
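A sketch of one GRU step with the two gates; gate conventions vary slightly across papers, and this follows the slide's reading (z scales how much previous memory is kept). Shapes are illustrative:

```python
# A GRU step showing the two gates at work.
import numpy as np

D, H = 100, 128                              # input size, hidden size
rng = np.random.default_rng(0)
Wz, Wr, Wc = (rng.normal(scale=0.01, size=(D, H)) for _ in range(3))
Uz, Ur, Uc = (rng.normal(scale=0.01, size=(H, H)) for _ in range(3))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, s_prev):
    z = sigmoid(x @ Wz + s_prev @ Uz)              # update gate: memory to keep
    r = sigmoid(x @ Wr + s_prev @ Ur)              # reset gate: mix input and memory
    s_cand = np.tanh(x @ Wc + (r * s_prev) @ Uc)   # candidate state
    return z * s_prev + (1.0 - z) * s_cand         # z near 1 is a shortcut past this step

s = gru_step(rng.normal(size=D), np.zeros(H))
```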
  11. Generalizing the model by multi-input and multi-task learning
     Why multi-input?
     ● Splitting the text of long documents into different parts (in a way which makes sense)
       ○ Possible to have parameters specific to each part (if parts are very different)
       ○ Splitting the text avoids the long-distance dependency issue
     ● Providing non time-series data helps understand the text document
       ○ Categorical data related to the document context
     Why multi-task?
     ● Better model generalization, as each task has its own bias
       ○ A kind of ensemble approach, but during the learning process
     ● One gradient descent process per task: it is like* having a bigger dataset
     *: OK, this is not the same thing, but still, multi-tasking may provide a better improvement than just increasing the number of epochs.
  12. What the final model looks like
     ● Text 1: claims of the applicant -> bidirectional GRU
     ● Text 2: solution of the decision -> bidirectional GRU
     ● Context: categorical information about the decision (court, judge name, thema, …) -> (Dense + PReLU + Dropout) * 3
     The branches go through merge + dropout layers, then into three softmax outputs:
     ● Task 1: learn the category of the applicant
     ● Task 2: learn the category of the defendant
     ● Task 3: learn the category of the decision solution
     Text 1 and Text 2 are extracted from the full decision by another learning step.
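As a rough illustration, here is a sketch of such a two-text, one-context, three-task model in the Keras functional API (the talk mentions Keras); layer sizes, vocabulary size, and the exact merge layout are assumptions, not the Supra Legem values:

```python
from tensorflow.keras import Model
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, GRU,
                                     Dense, Dropout, PReLU, Concatenate)

VOCAB, EMB, CTX = 50_000, 100, 30          # illustrative sizes

def text_branch(name):
    inp = Input(shape=(None,), name=name)  # padded sequences of word indices
    x = Embedding(VOCAB, EMB)(inp)         # could be initialized with word2vec
    x = Bidirectional(GRU(128))(x)         # bidirectional GRU encoder
    return inp, x

text1_in, text1 = text_branch("text1")     # claims of the applicant
text2_in, text2 = text_branch("text2")     # solution of the decision

ctx_in = Input(shape=(CTX,), name="context")   # categorical decision metadata
ctx = ctx_in
for _ in range(3):                             # (Dense + PReLU + Dropout) * 3
    ctx = Dropout(0.5)(PReLU()(Dense(64)(ctx)))

merged = Dropout(0.5)(Concatenate()([text1, text2, ctx]))  # merge + dropout

task1 = Dense(6, activation="softmax", name="applicant")(merged)   # task 1
task2 = Dense(6, activation="softmax", name="defendant")(merged)   # task 2
task3 = Dense(11, activation="softmax", name="solution")(merged)   # task 3

model = Model([text1_in, text2_in, ctx_in], [task1, task2, task3])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```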
  13. Results
     The dataset is probably quite easy for a model: the vocabulary is stable, there is no irony or double negation, and textual formulas and structure are reused across decisions. The fancy model helps too: a simple GRU on task 2 only, with default values (from Keras) and no use of word2vec, gave slightly more than 80% accuracy.
     Task 3 is the easiest task, as the structure of Text 2 is very stable. On the other hand, it has 11 categories, more than Tasks 1 & 2.
     Split: training 70%, validation 10%, test 20%.

     Task | Accuracy | Accuracy after binarization
     Task 1 (multiclass, 6 cat) | 0.971 | 0.978
     Task 2 (multiclass, 6 cat) | 0.918 | 0.946
     Task 3 (multiclass, 11 cat) | 0.945 | not performed

     Binarization means that tasks 1 and 2 have been recast to a binary classification: the applicant (or defendant) may be a private person or the administration. Basically, there were 5 categories of administration and they have been merged into one. The same model is used in both cases; therefore, all classification errors between the 5 administration categories disappear mechanically. This entirely explains the accuracy improvement.
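The mechanical effect of binarization can be reproduced on toy labels; in this sketch the label ids (0 = private person, 1-5 = the five administration categories) are hypothetical:

```python
import numpy as np

def binarize(labels):
    # 0 = private person, 1..5 = administration categories (hypothetical ids)
    return np.where(np.asarray(labels) == 0, 0, 1)

def acc(y, p):
    return float(np.mean(np.asarray(y) == np.asarray(p)))

y_true, y_pred = [0, 3, 4, 2], [0, 4, 4, 1]      # toy predictions
print(acc(y_true, y_pred))                        # 0.5: confuses admin subtypes
print(acc(binarize(y_true), binarize(y_pred)))    # 1.0: those errors vanish
```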
  14. Legal decisions from administrative courts of appeal are *partially* available in open data
     ● Open data decisions are provided by @Etalab
     ● The Supralegem.fr website covers [2000-2015], which represents 250K decisions
     ● On 2012-2015:
       ○ ⅓ to ⅔ of the decisions issued by each appeal court per year are available
       ○ Court of appeal judge names are provided* for > 98% of the decisions (before 2012: < 95%)
     ● Analysis periods:
       ○ per judge: [2012-2015], because before that too many judge names are missing
       ○ per court: [2009-2015], because before that data are missing for some courts
     ● Important questions about open data decisions:
       ○ How are they selected?
       ○ Who selects them?
       ○ Why is a part of the decisions not distributed?
     *: judge names are included in the distributed open data and are not learned.
  15. Basic statistics about the asylum rejection rate per court
     Decision selection criteria:
     ● from a court of appeal
     ● marked as asylum category
     ● contains "quitter le territoire" & "étranger" & "asile"
     ● applicant: natural person; defendant: administration
     4 courts show an increasing rejection rate, 3 are stable, 1 is decreasing. This seems to match the political context.
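These criteria translate naturally into a dataframe filter; here is a sketch with a hypothetical file and hypothetical column names:

```python
import pandas as pd

df = pd.read_csv("decisions.csv")     # hypothetical export of the parsed decisions

text = df["text"].str.lower()
mask = (
    (df["court_type"] == "cour administrative d'appel")  # hypothetical columns
    & (df["category"] == "asile")
    & text.str.contains("quitter le territoire")
    & text.str.contains("étranger")
    & text.str.contains("asile")
    & (df["applicant_type"] == "natural person")
    & (df["defendant_type"] == "administration")
)
asylum = df[mask]
```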
  16. Is there a bias with some judges* regarding asylum?
     ● Rejection rates of deportation orders per judge/year
     ● Selected the 3 highest & 3 lowest rates among the top 20% of judges by quantity of decisions [2012-2015]

     Judge | Adm. court of appeal | % 2012 | % 2013 | % 2014 | % 2015 | Decision quantity [12-15]
     Guerrive | Marseille | 78 | 41 | 43 | 60 | 453
     Cherrier | Marseille | NA | NA | 67 | 63 | 233
     Krulic | Paris | NA | NA | 60 | 73 | 198
     Tandonnet Turot | Paris | 90 | 97 | 98 | 100** | 227
     Pellissier | Nancy | NA | 93 | 92 | 96 | 304
     Helmholtz | Versailles | NA | 93 | 92 | 91 | 201

     Case documents are never public but would be required for a deep analysis of this apparent bias. Not shown here: judges from the same court may have very different rejection rates in asylum cases. However, there may be some (unknown) good reasons to explain these gaps.
     On the right, tweets from a judge's assistant about the practice of some judges of systematically pushing to refuse canceling deportation orders (before hearing the case). "OQTF" means Ordre de Quitter le Territoire Français. Storify link
     *: from administrative courts of appeal
     **: the Tandonnet Turot 2015 rate corresponds to few decisions and is not significant.
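The per-judge rates above boil down to a groupby over such a filtered set; a sketch continuing from the previous one, with hypothetical `judge`, `date` and `rejected` columns (the top-20% volume cut shown here is one possible reading of the selection rule):

```python
# Rejection rate per judge and year, keeping only high-volume judges.
rates = (
    asylum.assign(year=pd.to_datetime(asylum["date"]).dt.year)
          .groupby(["judge", "year"])["rejected"]
          .agg(rate="mean", n="size")
)
volume = rates.groupby("judge")["n"].sum()
busy = volume[volume >= volume.quantile(0.8)].index   # top 20% by decision volume
print(rates.loc[busy].sort_values("rate"))
```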
  17. The (im)possible legal consequences of a bias
     ● Article 6.1 of the European Convention on Human Rights (ECHR): "In the determination of his civil rights and obligations or of any criminal charge against him, everyone is entitled to a fair and public hearing within a reasonable time by an independent and impartial tribunal established by law…"
     ● ECtHR, Remli v. France, 23/04/96: France was found in violation for subjective partiality (racism of a juror)
     ● Article L721-1 of the administrative justice code: if one has doubts about the impartiality of a judge, one can ask for his or her recusal
     ● An interesting report from the French private-law supreme court about judge impartiality
     The truth is that there is (very) little chance that a French judge would recognize a colleague's bias on the basis of statistics.
  18. SupraLegem.fr
     The result of this work is provided to everyone on a dedicated website. The learned tags are provided as filters. A visualization of the results is generated for each search to show patterns, if any.
     As of Feb 17th, 2016, 2015 is not complete, but this should be fixed in the coming days.
  19. Thanks! Contact us: contact@supralegem.fr / www.supralegem.fr
