I, Robot, Esquire: Information Extraction and Summarization in Legal Documents
Pundits constantly predict the demise of many types of knowledge workers at the hands of intelligent machines, and few professionals perform more textual document review than lawyers. In this session, I’ll share work that eBrevia has been doing to apply research from the fields of ML and NLP to summarize and extract information from legal contracts to help accelerate corporate mergers and acquisitions. I will look at the unique characteristics of the legal industry, examine some supervised and semi-supervised training strategies and classification models, and discuss the limitations of these techniques and the essential role lawyers will continue to play.
Jacob Mundt – Chief Technology Officer, eBrevia at MLconf ATL
1. I, Robot, Esquire
Information Extraction and Summarization in Legal Documents
jmundt@ebrevia.com | (203) 870-3000 Proprietary & Confidential
Jacob Mundt – MLConf ATL
2. Who we are
Commercializing machine learning technology developed at Columbia University to make legal document review more efficient, accurate and cost effective.
One of four national winners in Startup America DEMO Competition
One of CIO.com’s top ten enterprise products at DEMO Fall 2012
Most Promising Software Product of the Year award from Connecticut Technology Council
Completed Connecticut Innovations’ TechStart Fund Program
3. Management Team
Ned Gannon, CEO: Large law firm experience; tech startup experience; sales & business development experience; Harvard Law
Adam Nguyen, COO: Founder of Ivy Link (20+ staff); Chief of Staff of 350-person real estate private equity firm; Harvard Law; law firm & in-house experience
Jake Mundt, CTO: Led R&D team at tech company extracting data in medical industry; Columbia Masters; NLP researcher
4. The Future of Law
“In contrast, in looking 25 years ahead from now, I argue that it would be absurd to expect lawyers and courts to carry on operating as they do now.”
– Richard Susskind, Tomorrow’s Lawyers
“Well, if droids could think, there'd be none of us here, would there?”
– Obi-Wan Kenobi
5. I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
6. Corporate Mergers and Due Diligence
Business due diligence
Legal due diligence
Closing
7. Corporate mergers and due diligence
8. Legal Due Diligence Process
Extract Summarize Analyze Advise
Teams of junior attorneys billed out at $300-$500/hour poring over hundreds of contracts in virtual data rooms to summarize their content and identify red flags.
9. Legal Due Diligence Summary
Here come the spreadsheets – summarize ALL the contracts:
– leases
– executive employment agreements
– supplier agreements
– loan/credit agreements
Extract key data points
Also extract any clauses that discuss particular provisions
10. The Stone Age
On-site data room with reams of documents, organized by seller
Buyer’s agents travel to evaluate the target, under constant supervision
11. State of the Art – Virtual Data Rooms
Digitized, but not machine readable
Some simple OCR and searching capability
Commercial systems like IntraLinks have advanced capabilities, but mostly focused on security and auditability.
12. The Future is Here
Misses stems, synonyms, plural forms
False positives: some common words also have special meanings in context.
Impossible to find dates, parties, dollar amounts, or any other generic quantities
We can do better
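As a rough illustration of the shortcomings listed on this slide, the sketch below contrasts a literal substring search with a pattern-based pass for generic quantities. The sample sentence, the function names and the two regular expressions are assumptions made up for this example; they are not eBrevia's implementation and cover only the simplest date and dollar formats.

```python
import re

# Hypothetical sample text, invented for illustration.
TEXT = (
    "The Tenant shall pay $12,500.00 on January 15, 2014. "
    "All assignments require the Landlord's consent."
)

def keyword_hit(text: str, keyword: str) -> bool:
    """Literal, case-insensitive substring search, like a basic find box."""
    return keyword.lower() in text.lower()

# Simple patterns for generic quantities that no single keyword can match.
DOLLAR = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)

def find_quantities(text: str) -> dict:
    """Return every dollar amount and long-form date found in the text."""
    return {"amounts": DOLLAR.findall(text), "dates": DATE.findall(text)}
```

Searching for "assigned" finds nothing here even though the sentence is about assignments, while the patterns recover the amount and the date without knowing their values in advance.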
13. I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
14. Can we use ML and NLP?
Actually many sub-problems:
– Classify entire document type: discover contracts amongst heterogeneous corpus
– Duplicate detection
– Group documents that were based on a common form agreement
– Automatically flagging questionable docs for further review
– Automatic provision extraction
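Two of the sub-problems above, duplicate detection and grouping documents that share a common form, can be approached with a similarity score over overlapping word windows. The sketch below uses word shingles and Jaccard similarity, a standard technique chosen here as an assumption; the slides do not say which method eBrevia uses, and the function names are hypothetical.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def similarity(doc_a: str, doc_b: str, k: int = 3) -> float:
    """Shingle-based similarity between two documents, in [0, 1]."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))
```

A score near 1.0 suggests a duplicate, and a moderately high score suggests two agreements drafted from the same form. Any cut-off would have to be tuned on real documents.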
15. Why this is Easy
Precise, formal writing
Extremely structured
Lots of clause reuse
16. Why this is Hard
Precise, formal writing
Extremely structured
Lots of clause reuse
Obfuscation
High demands on recall
Deep chains of defined term references
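To make "deep chains of defined term references" concrete, the sketch below follows the defined terms mentioned inside a definition, then the terms mentioned inside those, and so on. The matching is naive substring matching, and the function names and the idea of a ready-made definitions dictionary are assumptions for illustration. Real contracts need far more careful term detection.

```python
def referenced_terms(term: str, definitions: dict) -> list:
    """Defined terms that appear literally inside the definition of `term`."""
    text = definitions.get(term, "")
    return [t for t in definitions if t != term and t in text]

def expand(term: str, definitions: dict) -> list:
    """Return every defined term reachable from `term`, in the order found.

    A `seen` set guards against circular definitions.
    """
    order, stack, seen = [], [term], {term}
    while stack:
        current = stack.pop()
        for ref in referenced_terms(current, definitions):
            if ref not in seen:
                seen.add(ref)
                order.append(ref)
                stack.append(ref)
    return order

# Hypothetical definitions section, invented for this example.
DEFINITIONS = {
    "Change of Control": "any transaction in which an Acquiring Person obtains a Controlling Interest",
    "Acquiring Person": "any party other than an Affiliate",
    "Controlling Interest": "more than 50% of the Voting Securities",
    "Affiliate": "an entity under common control with the Company",
    "Voting Securities": "securities entitled to vote for directors",
}
```

Here a reader who wants to understand "Change of Control" has to read four further definitions, two of which are only reachable through another defined term.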
17. Detecting “Evil” Clauses?
Lawyers actually prefer to make the calls on exactly what to include, and how to advise the client
Just find the source material, and let the lawyer decide.
Determine relevance, don’t make value judgments
“Learning to detect spyware using end user license agreements”, Lavesson, et al. (2009)
Illustration of Saint Wolfgang and the Devil with the Devil's Contract, by Michael Pacher.
18. I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
19. eBrevia’s Approach
Not all provisions are the same!
Topic modeling
Information Extraction (IE)
Rule based approach
• Find sentences discussing “change of control”
• Find restrictions concerning confidential information
• The contract runs from TIMEX to TIMEX.
• The monthly rent will start at $X, and increase by no more than Y% annually.
• Find every borrower’s FICO score
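The rent example above is a slot-filling problem: the sentence shape is known and the task is to recover X and Y. A minimal sketch, assuming the sentence follows the slide's template almost word for word, is shown below. The pattern, the function name and the slot names are hypothetical, and a single regular expression like this would not survive the variation found in real leases.

```python
import re

# Pattern mirroring the template sentence on the slide (an assumption).
RENT = re.compile(
    r"monthly rent will start at \$(?P<amount>\d+(?:,\d{3})*(?:\.\d{2})?)"
    r",? and increase by no more than (?P<cap>\d+(?:\.\d+)?)% annually",
    re.IGNORECASE,
)

def extract_rent_terms(sentence: str):
    """Fill the starting-rent and annual-cap slots, or return None if no match."""
    m = RENT.search(sentence)
    if m is None:
        return None
    return {
        "starting_rent": float(m.group("amount").replace(",", "")),
        "max_annual_increase_pct": float(m.group("cap")),
    }
```

For example, `extract_rent_terms("The monthly rent will start at $2,500, and increase by no more than 3% annually.")` fills both slots, while a sentence that does not match the template returns `None`.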
20. Text analysis pipeline
OCR
Sentence Segmentation
NLP Processing (POS, NER, Parsing)
Document Structure tagging
General Candidate detection
Rule Based detection
Topic classifier
Candidate detection for IE
Information Extraction and slot filling
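One simple way to wire stages like these together is to treat each stage as a function that takes a document record and returns it with more annotations. The sketch below shows that shape with two toy stages. The stage implementations, the trigger words and the dictionary layout are placeholders invented here; the slide does not describe how the real stages are connected.

```python
import re
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

def segment_sentences(doc: Dict) -> Dict:
    """Naive sentence splitter standing in for a real segmenter."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [s.strip() for s in parts if s.strip()]
    return doc

def detect_candidates(doc: Dict) -> Dict:
    """Keep sentences mentioning a trigger word (placeholder for candidate detection)."""
    triggers = ("shall", "terminate", "control")
    doc["candidates"] = [
        s for s in doc["sentences"] if any(t in s.lower() for t in triggers)
    ]
    return doc

def run_pipeline(text: str, stages: List[Stage]) -> Dict:
    """Run each stage in order, passing the annotated document along."""
    doc: Dict = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc
```

Keeping every stage behind the same signature makes it easy to swap a placeholder for a real OCR, tagging or classification step later.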
21. Classifier Features
Basic textual analysis features
– Words
– N-grams
– Positional and morphological features
– Named entities
Syntactic features
– Parts of speech
– Parse tree and heads
Structural features
– First level classifier pass for determining document structure
– Especially important on scanned documents where these features aren’t readily available
Examples from the slide:
– The/O buyer/O Acme/ORG Inc./ORG
– indemnify
– Client shall indemnify: N V V
– Section III: Miscellaneous / 1. Lorem ipsum dolor / a. sit amet, consectetur
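The basic textual features listed on this slide can be sketched as a function that turns one tokenized sentence into a sparse feature dictionary. The feature names, the three-character suffix as a morphological cue and the position scaling are all assumptions for illustration. Named-entity, part-of-speech and parse features would come from an NLP toolkit and are left out.

```python
def extract_features(tokens: list, position: int, total_sentences: int) -> dict:
    """Build word, bigram, suffix and positional features for one sentence."""
    features = {}
    lowered = [t.lower() for t in tokens]
    for w in lowered:
        features[f"word={w}"] = 1
        features[f"suffix3={w[-3:]}"] = 1  # crude morphological cue
    for a, b in zip(lowered, lowered[1:]):
        features[f"bigram={a}_{b}"] = 1
    # Where the sentence sits in the document: 0.0 = first, 1.0 = last.
    features["relative_position"] = position / max(total_sentences - 1, 1)
    # Capitalized tokens after the first often signal defined terms or names.
    features["has_capitalized_term"] = int(any(t[:1].isupper() for t in tokens[1:]))
    return features
```

A dictionary like this can be passed to most linear classifiers after vectorizing.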
22. Hunting for Training Data
All your customers’ data is confidential
– Redacted contracts
– Mine the SEC
Expense of lawyer-labeled training data
– Bootstrapping
– Co-training with different feature sets
– Active learning
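Active learning reduces labeling cost by asking the lawyer to label only the examples the current model is least sure about. The sketch below shows uncertainty sampling for a binary relevant/irrelevant classifier. It is one common selection strategy, picked here as an assumption, and the function name and input format are hypothetical.

```python
def select_for_labeling(probabilities: dict, k: int) -> list:
    """Return the ids of the k examples whose predicted probability of
    relevance is closest to 0.5, i.e. where the model is least certain.

    probabilities: {example_id: P(relevant)} from the current model.
    """
    ranked = sorted(probabilities, key=lambda ex_id: abs(probabilities[ex_id] - 0.5))
    return ranked[:k]
```

With scores `{"s1": 0.97, "s2": 0.52, "s3": 0.10, "s4": 0.45}` and `k=2`, the function picks `s2` and `s4`, leaving the confident predictions unlabeled.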
23. Hacks and Special Cases
Very useful, but boring
Formatting fixes specific to legal documents
– ALL CAPS
– Handling of amendments
– Handwritten signature blocks
Hand-crafted rules are very good for high-precision heuristics; customers expect the software not to miss “easy” provisions.
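As an example of the ALL CAPS formatting fix, the sketch below lowercases a sentence when most of its letters are uppercase, so that case-sensitive features see roughly the same tokens as in ordinary text. The 0.8 threshold and the function name are assumptions. Note that this simple version also discards the capitalization of defined terms and proper nouns inside the sentence.

```python
def normalize_caps(sentence: str, threshold: float = 0.8) -> str:
    """Convert a mostly-uppercase sentence to sentence case; leave others alone."""
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        return sentence
    upper_ratio = sum(c.isupper() for c in letters) / len(letters)
    if upper_ratio >= threshold:
        # str.capitalize() uppercases the first character and lowercases the rest.
        return sentence.capitalize()
    return sentence
```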
24. I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
25. The Audacity of Keywords
Seemingly reliable keywords aren’t
Phrase                      Likelihood that candidate phrase is relevant    Likelihood candidate phrase is irrelevant
“Change [of|in] Control”    48.4%                                           51.6%
“13(d) and 14(d)”           98.7%                                           1.3%
A simple keyword-based search with an obvious keyword wouldn’t even get us to 50% precision! Conversely, a human would never have discovered this reliable trigram heuristic.
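Figures like the ones in this table can be estimated from labeled candidates by counting, for each phrase, how often a candidate containing it was marked relevant. The sketch below shows that calculation. The function name and input format are assumptions, and the data used in the usage note is a toy example, not the data behind the slide.

```python
from collections import Counter

def phrase_precision(labeled_candidates) -> dict:
    """labeled_candidates: iterable of (phrase, is_relevant) pairs.

    Returns {phrase: fraction of candidates with that phrase marked relevant}.
    """
    hits, totals = Counter(), Counter()
    for phrase, is_relevant in labeled_candidates:
        totals[phrase] += 1
        if is_relevant:
            hits[phrase] += 1
    return {p: hits[p] / totals[p] for p in totals}
```

With four invented "change of control" candidates of which two are relevant, and three invented "13(d) and 14(d)" candidates that are all relevant, the function returns 0.5 and 1.0 respectively.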
26. The Tyranny of Paper
Lawyers still have a lot of paper – over 50% of the documents uploaded to our system are scans.
OCR on poor quality scans works poorly for keyword searching but decently with ML, with properly constructed features.
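The slide does not say which features hold up under OCR errors. One plausible choice, offered here purely as an assumption, is character n-grams: a single misread character breaks an exact keyword match but leaves most of a word's character trigrams intact. The sketch below compares two strings by trigram overlap using the Dice coefficient; the function names are hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of the text, ignoring case and non-alphanumerics."""
    cleaned = "".join(c.lower() for c in text if c.isalnum())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}

def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over character n-grams, in [0, 1]."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

If OCR renders "indemnify" as "indernnify" (an "m" read as "rn"), a keyword search for "indemnify" fails, yet the two strings still share more than half of their trigrams, while an unrelated word such as "termination" shares none.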
27. Welcoming our Robot Lawyer Overlords
“[eBrevia’s software] cuts down significantly on time by performing 50-60% of the work up front and then you work from there.”
– NY law firm partner
“Your product is a great fit for our firm’s approach to practicing law.”
– Partner, national law firm
29. User Interface Notes
Highlight in original, formatted document
Cross-referencing, editing, and corrections
30. User Interface Notes
Additional critical features
– Quick Correction
– Level of confidence indications (similar to Google voice transcription)
– Good generic text search features to make human review easy
31. I, Robot, Esquire - Overview
Motivation
Can we use ML and NLP?
eBrevia Solution – Deep Dive
Challenges and Lessons Learned
Future directions
32. Current Research and Future Directions
Coreference resolution: intra- and inter-document. Useful for doc references and entity references.
Machine learning for document cross-referencing and definition resolution.
Automatic summarization of longer provisions to provide quick overviews.
Understanding the lineage of a document – where its various pieces came from, and how they were changed.
33. Feedback Learning from Lawyers
Some lawyers are just bad
Noise is NOT random
– They fall for the same “trap”
– They’re often bad in the same way
So we can’t use noise-tolerant learning algorithms to deal with this.
Consensus models, model user reputation/ability
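A consensus model that accounts for reviewer ability can be as simple as a weighted vote, where each reviewer's label counts in proportion to an estimated reliability. The sketch below is a minimal version of that idea. The function name, the default weight and the assumption that reliability scores already exist (for instance from agreement with a small expert-checked set) are all hypothetical, and a weighted vote can still be fooled when reliable reviewers share the same mistake.

```python
from collections import defaultdict

def weighted_consensus(votes: dict, reliability: dict, default_weight: float = 0.5) -> str:
    """Return the label with the largest total reviewer weight.

    votes: {reviewer: label}; reliability: {reviewer: weight in [0, 1]}.
    Reviewers without a reliability estimate get `default_weight`.
    """
    if not votes:
        raise ValueError("no votes to combine")
    totals = defaultdict(float)
    for reviewer, label in votes.items():
        totals[label] += reliability.get(reviewer, default_weight)
    return max(totals, key=totals.get)
```

With one reviewer weighted 0.95 voting "relevant" and two reviewers weighted 0.4 each voting "irrelevant", the consensus is "relevant" even though a plain majority vote would say otherwise.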
34. Current Research and Future Directions
Other upcoming applications for eBrevia’s technology:
– Contract management
– Document drafting
– Lease abstraction
– Financial/Compliance
– Consumer applications
35. Thank You – Contact Info
jmundt@ebrevia.com | (203) 870-3000