Describes the Kairntech approach to real-world NLP/AI requirements, with an emphasis on the quick and efficient creation and curation of training data sets.
Stefan Geissler, Kairntech - SDC Nice, Apr 2019
1. IC-SDV 2019
April 9-10, 2019
Nice, France
Addressing requirements for
real-world deployments of ML & NLP
Stefan Geißler, Kairntech
2. Agenda
Looking back: the NLP landscape has changed dramatically
Algorithms → Data!
Supporting dataset creation: the Kairntech Sherpa
Kairntech: who we are
Conclusion
3. Looking back: the NLP landscape has changed
2000:
Very few open source components
Lexicons, taggers, morphology, parsers: mostly proprietary, complex to install and maintain, with limited coverage
« Make or buy »
High level of manual effort in creating and maintaining lexical knowledge bases and rule systems
5. 2019: A tipping point in ML & NLP?
« 2018 was the 'ImageNet moment' for deep learning in NLP » (S. Ruder)
In image processing in 2012, a deep learning network won a public contest by a large margin. In 2018 we saw exciting NLP models implementing transfer learning: ELMo, ULMFiT, BERT
« ML engineering in NLP will truly blossom in 2019 » (E. Ameisen)
Focus on tools beyond model building! Link NLP/AI to production use! What does it mean to build data-driven products and services?
« Enough papers: Let's build AI now! » (A. Ng, 2017)
« AI is the new electricity! »
6. Example: Named Entity Recognition
Cf. https://www.researchgate.net/publication/329933780_A_Survey_on_Deep_Learning_for_Named_Entity_Recognition/download
Many / most of these approaches are available with code
7. NLP: A commodity?
Named entity recognition in four steps:
$ pip install spacy
$ python -m spacy download en
$ cat > testspacy.py
import spacy

# Load the English model downloaded above
nlp = spacy.load('en')

# Run the full pipeline on a sample sentence
doc = nlp("Angela Merkel will meet Emmanuel Macron at the summit in Amsterdam")

# Print every named entity spaCy found
for entity in doc.ents:
    print(entity.text)
CTRL-D
$ python testspacy.py
Angela Merkel
Emmanuel Macron
Amsterdam
8. Algorithms are a commodity
Even the top-scoring system from the list above is available on GitHub:
https://github.com/zalandoresearch/flair
For the record:
The survey does not list DeLFT (https://github.com/kermitt2/delft), implemented by the Kairntech chief ML expert, which
• Scores at exactly 93.09% on CoNLL-2003, too
• Creates models that are very compact (~5 MB vs. >150 MB)
• Loads a model in ~2 sec at initialization
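For illustration, a minimal sketch of tagging a sentence with flair's pretrained NER model (API as of the 2019 releases; model name and output format may differ in later versions):

from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pretrained 4-class CoNLL-2003 NER tagger
tagger = SequenceTagger.load('ner')

sentence = Sentence('Angela Merkel will meet Emmanuel Macron in Amsterdam')
tagger.predict(sentence)

# Print the recognized entity spans with their labels
for span in sentence.get_spans('ner'):
    print(span)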
10. Pain points
Off-the-shelf NLP models often don't work for specific needs
Implementation is slowed down by the need to build specific training datasets
AI/NLP services often require integration of business glossaries & knowledge graphs
Absence of maintenance leads to quality degradation
11. Frequent requirements in real-world projects
In many commercial scenarios around entity extraction, an entity not only has to be recognized but also typed:
A DATE in a contract may be the date when the contract becomes effective, when it was signed, or when it will be terminated
A PERSON in a legal opinion may be the defendant, the lawyer, the judge, the witness, …
A DISEASE in a clinical study may be the core therapeutic area or a peripheral, occasional adverse event
This goes beyond what public named entity recognition modules offer
Typically, no training corpora exist for these typing decisions. They must be established within a project. (A sketch of such a typing step follows below.)
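One way to frame the typing step (a sketch only, not the Kairntech implementation, assuming scikit-learn): classify the textual context around an already-recognized entity into a role label:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: words around a DATE mention, labeled with the date's role
contexts = [
    "this agreement becomes effective on",
    "signed in duplicate on",
    "shall terminate no later than",
]
roles = ["EFFECTIVE_DATE", "SIGNATURE_DATE", "TERMINATION_DATE"]

# Bag-of-ngrams over the context window, fed into a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(contexts, roles)

print(clf.predict(["the contract was signed on"]))

In practice this requires exactly the project-specific training corpora the slide mentions; the toy data above merely stands in for them.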
12. You don't have to take my word for it.
Let's listen to what the experts say:
Algorithms are a commodity; data is gold
Peter Norvig:
“We [at Google] don't have
better algorithms than anyone
else; we just have more data!”
“More data beats clever
algorithms.”
Angela Merkel:
“Data is the new oil of the 21st
century!“
13. So: We need data, not only algorithms
Charts copied from https://hackernoon.com/%EF%B8%8F-big-challenge-in-deep-learning-training-data-31a88b97b282
14. Requirements
What will be more important for the success of your project?
Driving the training accuracy from, say, 92.4% to 93.6% on a pre-defined data set?
or
ML components that allow high quality with small training sets and moderate annotation and training time?
15. Example
The CoNLL-2003 data set used in many academic NER experiments contains >100,000 entities
Assume 30 sec per entity: 100,000 entities × 30 sec ≈ 833 hours ≈ 100 person-days of pure annotation time (with one single annotator)
Unrealistic in most commercial project settings.
Commercial projects have requirements that are different from academic research!
16. On dataset preparation: Requirements
Web-based (no install), intuitive GUI, usable by domain experts
Limit manual annotation effort: Active Learning
Collaboration (work in teams, measure inter-annotator agreement; see the sketch below)
Not just NER annotation: entity typing, document categorization, …
Must facilitate deployment to production
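A standard metric for inter-annotator agreement is Cohen's kappa; a minimal sketch (assuming scikit-learn, which the slides do not mention):

from sklearn.metrics import cohen_kappa_score

# Labels that two annotators assigned to the same ten items
annotator_a = ["PER", "LOC", "PER", "ORG", "LOC", "PER", "ORG", "LOC", "PER", "ORG"]
annotator_b = ["PER", "LOC", "ORG", "ORG", "LOC", "PER", "ORG", "PER", "PER", "ORG"]

# 1.0 = perfect agreement; 0.0 = no better than chance
print(cohen_kappa_score(annotator_a, annotator_b))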
17. Why another tool?
WebAnno:
Scientific focus: « Annotate corpora to allow the study of linguistic phenomena »
Sentence-based, losing all layout information
spaCy/Prodigy:
Focus on local/lexical named entity recognition. The underlying model by default considers a narrow window of n (n=4) words to the left and right.
Brat:
Interface only. Integration with model building, semi-automatic suggestions, deployment?
18. Kairntech Sherpa
[Architecture diagram: raw or pre-annotated corpora (text, audio, …) enter the Sherpa annotation environment; the user curates annotations, which are synchronized into an ML model; the model sends automatic annotation suggestions back to the user. Around the loop: search, collaboration, manual & assisted annotation, quality metrics, datasets and ML models.]
19. Active Learning?
Reduce the effort of manually annotating data by presenting the user with the data in an informed order:
Ask the user for feedback on the samples that promise the highest benefit: the samples the model is least certain about*
(*) Diagrams from datacamp.com
Active Learning applied to NLP tasks has been shown to reduce the amount of required training data dramatically:
7% of the samples, selected under an AL regime, yield the same quality as naive selection of the full set (cf. Laws 2012: https://d-nb.info/1030521204/34)
In a project that would mean 1 day of annotation instead of 14 days (a sketch of the selection step follows below)
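The core of this "least certain first" idea, uncertainty sampling, fits in a few lines (a sketch assuming scikit-learn-style classifiers; not the Sherpa implementation):

import numpy as np

def least_certain(model, unlabeled_X, k=10):
    """Indices of the k samples the model is least certain about."""
    probs = model.predict_proba(unlabeled_X)  # any classifier with predict_proba
    certainty = probs.max(axis=1)             # confidence in the top class
    return np.argsort(certainty)[:k]          # lowest confidence first

# Loop: fit on the labeled pool, annotate least_certain(...) samples, repeat.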
20. Benefits of AL?
Growing accuracy on a (simple) ML task as the number of samples grows
Naive selection (« random », orange line) grows slowly
Informed selection (« QBC », query by committee, red line) grows much faster
AL promises to reduce the effort required for manual annotation (a QBC sketch follows below)
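Query by committee in brief: train several models on the labeled data and ask the annotator about the samples they disagree on most. A sketch (again assuming scikit-learn-style classifiers; a hypothetical helper, not from the slides):

import numpy as np

def most_disputed(committee, unlabeled_X, k=10):
    """Indices of the k samples with the highest vote entropy across the committee."""
    votes = np.array([m.predict(unlabeled_X) for m in committee])
    entropies = []
    for column in votes.T:                    # one column of votes per sample
        _, counts = np.unique(column, return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log(p)).sum())
    return np.argsort(entropies)[-k:]         # most disputed samples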
21. A non-expert workflow for dataset creation
Ask the application for suggestions
(De-)validate and retrain
Once satisfied, export/deploy
(A schematic version of this loop follows below.)
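Schematically, with hypothetical function names (the slides do not describe a concrete API):

# Hypothetical loop; none of these names are a real Sherpa API.
while not satisfied():
    suggestions = get_suggestions(model, documents)  # ask the application
    corrections = review(suggestions)                # user (de-)validates
    annotations.update(corrections)
    model = retrain(annotations)                     # retrain on curated data
export(model)                                        # once satisfied, deploy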
22. About Kairntech
Kairntech: The company
Created in Dec 2018, 10 partners
France (Paris & Grenoble/Meylan), Germany (Heidelberg)
Kairntech: The team
Background in software engineering, machine learning, sales, management
15+ years of experience in NLP development and deployment at Xerox, IBM, TEMIS. Development of components currently in production at CERN, NASA, EPO, …
23. Kairntech: Our profile
Industrialize the creation of document sets (training corpora) by offering an environment for data preparation by domain experts that is easy and efficient to use
Transform data sets into document analysis services, adding value to enterprise knowledge repositories (e.g. knowledge graphs)
Industrial deployment and maintenance of these services.
25. Conclusions
So much data!
But very little of it is labelled and useful for supervised learning
So many pretrained models!
But most of the time they do not quite do what you need in your project
So many algorithms!
But a library alone will not let you implement the solution you need
Kairntech is there to support you!
26. Thank you for your attention !
Stefan.Geissler@kairntech.com
Editor's Notes
Attention: Numbers are not always comparable!
Are the models trained with or without the validation set?
Are the numbers the best of a set of n experiments? Or the average of n experiments?
We have spent some effort redoing the experiments reported in the literature, and there are often slight variations. This does not mean that there is dishonesty involved! But it does mean that when results are within a few tenths of a percent, the question "which approach is best" becomes blurry.
https://blog.floydhub.com/ten-trends-in-deep-learning-nlp/
What does it mean for me?
Can this research be applied to everyday applications? Or is the underlying technology still evolving so rapidly that it is not worth investing time developing an approach which may be considered obsolete with the next research paper?