This document discusses practical aspects of natural language processing (NLP) work. It contrasts research work, which involves setting goals, devising algorithms, training models, and testing accuracy, with development work, which focuses on implementing algorithms as scalable APIs. The document emphasizes that obtaining data is crucial for NLP and describes sources for structured, semi-structured, and unstructured data. It recommends Lisp as a language that supports the interactivity, flexibility, and tree processing needed for NLP research and development work.
5. Research vs Development
“Trick for productionizing research: read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that.”
--Jay Kreps
https://twitter.com/jaykreps/status/219977241839411200
6. NLP practice
R - research work:
set a goal →
devise an algorithm →
train the algorithm →
test its accuracy
D - development work:
implement the algorithm as an API with
sufficient performance and scaling characteristics
7. Research
1. Set a goal
Business goal:
* Develop a spellchecker that is the best available / good enough / better than Word's, etc.
* Develop a set of grammar rules that will catch errors according to MLA Style
* Develop a thesaurus that will produce synonyms relevant to context
8. Translate it to a measurable goal
* On a test corpus of 10,000 sentences with common errors, achieve fewer false negatives (and false positives) than other spellcheckers / Word's spellchecker / etc.
* On a corpus of example sentences with each kind of error (and similar sentences without that kind of error), find all sentences with errors and flag no errors in the correct sentences
* On a test corpus of 1,000 sentences, suggest synonyms for all meaningful words that human linguists judge relevant in 90% of cases
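The measurable goals above come down to counting false positives and false negatives on a labeled corpus. A minimal sketch of such a scorer, assuming a hypothetical `checker` callable and a toy corpus of (sentence, has_error) pairs (none of these names come from the slides):

```python
# Sketch: scoring a spellchecker against a labeled test corpus.
# `score`, `naive_checker`, and the corpus format are illustrative
# assumptions; any checker that flags a sentence as erroneous fits.

def score(checker, corpus):
    """Return (false positives, false negatives) over the corpus.

    corpus: list of (sentence, has_error) pairs.
    """
    fp = fn = 0
    for sentence, has_error in corpus:
        flagged = checker(sentence)
        if flagged and not has_error:
            fp += 1  # false positive: correct sentence flagged
        elif not flagged and has_error:
            fn += 1  # false negative: real error missed
    return fp, fn

# Toy corpus and a trivial checker that only knows the typo "teh"
corpus = [
    ("I saw teh cat", True),
    ("I saw the cat", False),
    ("Their going home", True),  # missed by the naive checker
]
naive_checker = lambda s: "teh" in s.split()
print(score(naive_checker, corpus))  # → (0, 1)
```

The same harness works for the grammar-rule goal: only the labeling of the corpus changes.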
9-10. Research
1. Set a goal
2. Devise an algorithm
3. Train & improve the algorithm
http://nlp-class.org
11-12. 4. Test its performance
ML: one corpus, divided into training, development, and test sets
Often — different corpora:
* for training some part of the algorithm
* for testing the whole system
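The single-corpus split above can be sketched in a few lines; the fractions and the fixed seed here are illustrative choices, not something the slides prescribe:

```python
import random

def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a corpus and split it into training/development/test sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

corpus = [f"sentence {i}" for i in range(1000)]
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))  # → 800 100 100
```

The development set is used to tune the algorithm during training; the test set is held out until the final accuracy check.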
18. Pre/post-processing
What ultimately matters is not raw performance, but acceptance by users (much harder to measure, and it depends on the domain).
The real world is messier than any lab set-up.
19. Examples of pre-processing
For spellcheck:
* some people use words separated by slashes, like: spell/grammar check
* handling of abbreviations
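Both pre-processing cases above can be handled in one tokenization pass. A minimal sketch, assuming a tiny hypothetical abbreviation lexicon (a real system would use a much larger one):

```python
import re

# Hypothetical stand-in for a real abbreviation lexicon; these tokens
# must not be split on their internal punctuation.
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}

def preprocess(text):
    """Split slash-joined words like 'spell/grammar' into separate
    tokens, while leaving known abbreviations intact."""
    tokens = []
    for token in text.split():
        if token in ABBREVIATIONS:
            tokens.append(token)
        elif re.fullmatch(r"\w+(/\w+)+", token):
            tokens.extend(token.split("/"))  # spell/grammar → spell, grammar
        else:
            tokens.append(token)
    return tokens

print(preprocess("run a spell/grammar check, e.g. on this"))
# → ['run', 'a', 'spell', 'grammar', 'check,', 'e.g.', 'on', 'this']
```

Without this pass, the spellchecker would see "spell/grammar" as one unknown word and flag a false positive.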
20. Data
“Data is the next Intel Inside.”
--Tim O'Reilly, What is Web 2.0
http://oreilly.com/web2/archive/what-is-web-20.html?page=3
22. Where to get data?
Well-known sources:
* Penn Treebank
* WordNet
* BNC (British National Corpus)
* Web 1T Google N-gram Corpus
* Linguistic Data Consortium (http://www.ldc.upenn.edu/)
23. More data
Also well-known sources, but with a twist:
* Wikipedia & Wiktionary, DBpedia
* Open web: Common Crawl
* Public APIs of some services: Twitter, Wordnik
25. Crowd-sourced data
Jonathan Zittrain,
The Future of the Internet
http://goo.gl/hs4qB
26. And remember...
“Data is ten times more powerful than algorithms.”
--Peter Norvig, The Unreasonable Effectiveness of Data
http://youtu.be/yvDCzhbjYWs
31. Specific NLP
requirements
* Good support for statistics
& number-crunching
– Statistical AI
* Good support for working
with trees & symbols
– Symbolic AI
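The symbolic-AI requirement — working with trees and symbols — is the kind of thing s-expression languages like Lisp make natural. A sketch of the same idea using nested tuples (the parse tree and `count_label` are illustrative, not from the slides):

```python
# A toy parse tree as nested tuples, s-expression style: each node is
# (label, child, child, ...); leaves are plain strings.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VBD", "sat"),
               ("PP", ("IN", "on"),
                      ("NP", ("DT", "the"), ("NN", "mat")))))

def count_label(node, label):
    """Count subtrees whose root label matches `label`."""
    if isinstance(node, str):       # reached a leaf word
        return 0
    head, *children = node
    return int(head == label) + sum(count_label(c, label) for c in children)

print(count_label(tree, "NP"))  # → 2
```

Queries like this — find, count, or rewrite subtrees — are the bread and butter of symbolic NLP work, which is why the deck values good tree support in the implementation language.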
36. Take-aways
* As they say, in theory research and practice are the same, but in practice...
* Data is key. It comes in three forms: structured, semi-structured, and unstructured. Collect it, and build tools to work with it easily and efficiently
* Choose a good language for R&D: interactive & malleable, with as few barriers as possible