This document discusses practical aspects of natural language processing (NLP) work. It contrasts research work, which involves setting goals, devising algorithms, training models, and testing accuracy, with development work, which focuses on implementing algorithms as scalable APIs. The document emphasizes that obtaining data is crucial for NLP and describes sources for structured, semi-structured, and unstructured data. It recommends Lisp as a language that supports the interactivity, flexibility, and tree processing needed for NLP research and development work.
5. Research vs Development
“Trick for productionizing research: read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that.”
--Jay Kreps
https://twitter.com/jaykreps/status/219977241839411200
6. NLP practice
R - research work:
set a goal →
devise an algorithm →
train the algorithm →
test its accuracy
D - development work:
implement the algorithm as an API with
sufficient performance and scaling characteristics
7. Research
1. Set a goal
Business goal:
* Develop a spellchecker that is the best available / good enough / better than Word's, etc.
* Develop a set of grammar rules that will catch errors according to MLA Style
* Develop a thesaurus that will produce synonyms relevant to context
8. Translate it to a measurable goal
* On a test corpus of 10,000 sentences with common errors, achieve fewer false negatives (and false positives) than other spellcheckers / Word's spellchecker / etc.
* On a corpus of example sentences with each kind of error (and similar sentences without that kind of error), find all sentences with errors and flag no errors in the correct sentences
* On a test corpus of 1,000 sentences, suggest synonyms for all meaningful words that human linguists judge relevant in 90% of cases
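The measurable goals above come down to counting false positives and false negatives on a labeled corpus. A minimal sketch of such a scorer, assuming a hypothetical `checker` callable and a toy corpus of (sentence, has_error) pairs (none of these names come from the slides):

```python
# Sketch: scoring a spellchecker against a labeled test corpus.
# `score`, `naive_checker`, and the corpus format are illustrative
# assumptions; any checker that flags a sentence as erroneous fits.

def score(checker, corpus):
    """Return (false positives, false negatives) over the corpus.

    corpus: list of (sentence, has_error) pairs.
    """
    fp = fn = 0
    for sentence, has_error in corpus:
        flagged = checker(sentence)
        if flagged and not has_error:
            fp += 1  # false positive: correct sentence flagged
        elif not flagged and has_error:
            fn += 1  # false negative: real error missed
    return fp, fn

# Toy corpus and a trivial checker that only knows the typo "teh"
corpus = [
    ("I saw teh cat", True),
    ("I saw the cat", False),
    ("Their going home", True),  # missed by the naive checker
]
naive_checker = lambda s: "teh" in s.split()
print(score(naive_checker, corpus))  # → (0, 1)
```

The same harness works for the grammar-rule goal: only the labeling of the corpus changes.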
9-10. Research
1. Set a goal
2. Devise an algorithm
3. Train & improve the algorithm
http://nlp-class.org
11-12. 4. Test its performance
ML: one corpus, divided into training, development, and test sets
Often — different corpora:
* for training some part of the algorithm
* for testing the whole system
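The single-corpus split above can be sketched in a few lines; the fractions and the fixed seed here are illustrative choices, not something the slides prescribe:

```python
import random

def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a corpus and split it into training/development/test sets."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test

corpus = [f"sentence {i}" for i in range(1000)]
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))  # → 800 100 100
```

The development set is used to tune the algorithm during training; the test set is held out until the final accuracy check.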
18. Pre/post-processing
What ultimately matters is not raw performance, but acceptance by users (much harder to measure, and it depends on the domain).
The real world is messier than any lab set-up.
19. Examples of pre-processing
For spellcheck:
* some people use words separated by slashes, like: spell/grammar check
* handling of abbreviations
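Both pre-processing cases above can be handled in one tokenization pass. A minimal sketch, assuming a tiny hypothetical abbreviation lexicon (a real system would use a much larger one):

```python
import re

# Hypothetical stand-in for a real abbreviation lexicon; these tokens
# must not be split on their internal punctuation.
ABBREVIATIONS = {"e.g.", "i.e.", "etc."}

def preprocess(text):
    """Split slash-joined words like 'spell/grammar' into separate
    tokens, while leaving known abbreviations intact."""
    tokens = []
    for token in text.split():
        if token in ABBREVIATIONS:
            tokens.append(token)
        elif re.fullmatch(r"\w+(/\w+)+", token):
            tokens.extend(token.split("/"))  # spell/grammar → spell, grammar
        else:
            tokens.append(token)
    return tokens

print(preprocess("run a spell/grammar check, e.g. on this"))
# → ['run', 'a', 'spell', 'grammar', 'check,', 'e.g.', 'on', 'this']
```

Without this pass, the spellchecker would see "spell/grammar" as one unknown word and flag a false positive.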
20. Data
“Data is the next Intel Inside.”
--Tim O'Reilly, What is Web 2.0
http://oreilly.com/web2/archive/what-is-web-20.html?page=3
22. Where to get data?
Well-known sources:
* Penn Treebank
* WordNet
* BNC (British National Corpus)
* Web 1T Google N-gram Corpus
* Linguistic Data Consortium (http://www.ldc.upenn.edu/)
23. More data
Also well-known sources, but with a twist:
* Wikipedia & Wiktionary, DBpedia
* Open web: Common Crawl
* Public APIs of some services: Twitter, Wordnik
25. Crowd-sourced data
Jonathan Zittrain,
The Future of the Internet
http://goo.gl/hs4qB
26. And remember...
“Data is ten times more powerful than algorithms.”
--Peter Norvig, The Unreasonable Effectiveness of Data
http://youtu.be/yvDCzhbjYWs
31. Specific NLP
requirements
* Good support for statistics
& number-crunching
– Statistical AI
* Good support for working
with trees & symbols
– Symbolic AI
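The symbolic-AI requirement — working with trees and symbols — is the kind of thing s-expression languages like Lisp make natural. A sketch of the same idea using nested tuples (the parse tree and `count_label` are illustrative, not from the slides):

```python
# A toy parse tree as nested tuples, s-expression style: each node is
# (label, child, child, ...); leaves are plain strings.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "cat")),
        ("VP", ("VBD", "sat"),
               ("PP", ("IN", "on"),
                      ("NP", ("DT", "the"), ("NN", "mat")))))

def count_label(node, label):
    """Count subtrees whose root label matches `label`."""
    if isinstance(node, str):       # reached a leaf word
        return 0
    head, *children = node
    return int(head == label) + sum(count_label(c, label) for c in children)

print(count_label(tree, "NP"))  # → 2
```

Queries like this — find, count, or rewrite subtrees — are the bread and butter of symbolic NLP work, which is why the deck values good tree support in the implementation language.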
36. Take-aways
* As they say, in theory research and practice are the same, but in practice...
* Data is key. It comes in three forms: structured, semi-structured, and unstructured. Collect it, and build tools to work with it easily and efficiently
* Choose a good language for R&D: interactive & malleable, with as few barriers as possible