Practical Aspects
   of NLP Work

    Vsevolod Dyomkin
        Grammarly

TAAC'2012, Kyiv, Ukraine
Topics
* Practical vs Theoretical
  NLP work
* Working with Data for NLP
* NLP Tools
A bit about Grammarly

[Image: xkcd comic] (c) xkcd
An example of what
   we deal with
Research vs Development
“Trick for productionizing research:
read current 3-5 pubs and note the
stupid simple thing they all claim to
beat, implement that.”

           --Jay Kreps
           https://twitter.com/jaykreps/status/219977241839411200
NLP practice
R - research work:
set a goal →
devise an algorithm →
train the algorithm →
test its accuracy

D - development work:
implement the algorithm as an API with
sufficient performance and scaling characteristics
Research
1. Set a goal
Business goal:

* Develop a spellchecker that is the best/good
enough/better than Word's, etc.

* Develop a set of grammar rules that will
catch errors according to MLA Style

* Develop a thesaurus that will produce
synonyms relevant to the context
Translate it into a measurable goal
* On a test corpus of 10,000 sentences with
common errors, achieve a smaller number of
false negatives (and false positives) than
other spellcheckers (Word's spellchecker, etc.)

* On a corpus of example sentences with
each kind of error (and similar sentences
without this kind of error), find all
sentences with errors and do not find
errors in correct sentences

* On a test corpus of 1,000 sentences,
suggest synonyms for all meaningful words
that will be considered relevant by human
linguists in 90% of the cases
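
A measurable goal of this kind comes down to counting false negatives and false positives on a labeled test corpus. The following is a minimal sketch, assuming that both the gold annotations and the checker's output are given as sets of token positions per sentence. The names `Counts`, `evaluate`, `precision` and `recall` and the toy data are illustrative and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int = 0  # real errors that were flagged
    fp: int = 0  # correct tokens that were flagged (false positives)
    fn: int = 0  # real errors that were missed (false negatives)


def evaluate(gold, predicted):
    """Compare per-sentence sets of flagged token positions."""
    counts = Counts()
    for gold_errors, flagged in zip(gold, predicted):
        counts.tp += len(gold_errors & flagged)
        counts.fp += len(flagged - gold_errors)
        counts.fn += len(gold_errors - flagged)
    return counts


def precision(c):
    """Share of flagged tokens that were real errors."""
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c):
    """Share of real errors that were flagged."""
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


# Toy data (assumed): three sentences, errors given as token indices.
gold = [{1}, set(), {0, 4}]
predicted = [{1}, {2}, {4}]
counts = evaluate(gold, predicted)
print(counts, precision(counts), recall(counts))
```

Running the same function over the output of two different checkers on one corpus gives directly comparable counts, which is what the goal "fewer false negatives than X" needs.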
Research
1. Set a goal
2. Devise an algorithm
3. Train & improve the
   algorithm

http://nlp-class.org
4. Test its performance
ML: one corpus, divided into
training, development, test

Often — different corpora:
* for training some part
  of the algorithm
* for testing the whole
  system
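
The single-corpus case can be sketched as a shuffled three-way split. This is an illustration only: the 80/10/10 proportions, the fixed seed and the function name `split_corpus` are assumptions, not something the talk prescribes.

```python
import random


def split_corpus(sentences, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut into training, development and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


# Toy corpus (assumed) standing in for real sentences.
corpus = [f"sentence {i}" for i in range(1000)]
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))
```

The development part is used for tuning while the algorithm is being improved; the test part is held back until the final measurement.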
Theoretical maxima
Theoretical maxima are rarely
achievable. Why?

* because you need their data

* domains might differ
Pre/post-processing
What ultimately matters is
not crude performance, but...

Acceptance to users (much
harder to measure & depends
on domain).

The real world is messier than
any lab set-up.
Examples of
    pre-processing
For spellcheck:

* some people tend to write
  words separated by
  slashes, like:
    spell/grammar check

* handling of abbreviations
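
Both pre-processing cases can be illustrated with a small tokenizer that runs before the spellchecker. This is only a sketch: the abbreviation list is a made-up stand-in, `preprocess` is a hypothetical name, and real input (URLs, dates, fractions) would need more care than a plain split on "/".

```python
import re

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr."}


def preprocess(text):
    """Return the tokens that should be passed on to the spellchecker."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            continue  # known abbreviation: do not report it as a misspelling
        # Split alternatives like "spell/grammar" into separate words.
        for part in chunk.split("/"):
            word = re.sub(r"^\W+|\W+$", "", part)  # trim surrounding punctuation
            if word:
                tokens.append(word)
    return tokens


print(preprocess("Run a spell/grammar check, e.g. daily."))
```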
Data
“Data is the next Intel Inside.”
        --Tim O'Reilly, What is Web 2.0
        http://oreilly.com/web2/archive/what-is-web-20.html?page=3
Categorization of Data

* Structured — small
* Semi-structured — medium
* Unstructured — big
Where to get data?
Well-known sources:
* Penn Treebank
* WordNet
* BNC
* Web1T Google N-gram Corpus
* Linguistic Data Consortium
  (http://www.ldc.upenn.edu/)
More data
Also well-known sources, but
with a twist:

* Wikipedia & Wiktionary,
  DBpedia
* OpenWeb Common Crawl
* Public APIs of some
  services: Twitter, Wordnik
Academic resources
*   Stanford
*   CoNLL
*   Oxford (http://www.ota.ox.ac.uk/)
*   CMU, MIT,...
*   LingPipe, OpenNLP, NLTK,...
Crowd-sourced data
     Jonathan Zittrain,
     The Future of the Internet
     http://goo.gl/hs4qB
And remember...
“Data is ten times more
powerful than algorithms.”
       --Peter Norvig
       The Unreasonable
       Effectiveness of Data
       http://youtu.be/yvDCzhbjYWs
Tools
Levels of NLP tools
High-level: user services

Middle-level: NLP algorithms

Low-level: data-crunching
Choosing a language
Requirement types:
* Research
* NLP-specific
* Production
Research
       requirements
*   Interactivity
*   Mathematical basis
*   Expressiveness
*   Agility / malleability
*   Advanced tools
Specific NLP
     requirements
* Good support for statistics
  & number-crunching
  – Statistical AI

* Good support for working
  with trees & symbols
  – Symbolic AI
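
To make the "trees & symbols" requirement concrete, here is a small sketch of reading a bracketed parse (in the style used by treebanks) into nested lists and walking it. It is written in Python for illustration, with hypothetical function names `parse_tree` and `leaves`. It assumes well-formed input and does no error handling.

```python
def parse_tree(text):
    """Read a bracketed parse such as "(S (NP I) (VP run))" into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a leaf: a node label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # skip the closing ")"
        return node

    return read()


def leaves(tree):
    """Collect the words of the sentence, left to right."""
    if isinstance(tree, str):
        return [tree]
    # tree[0] is the node label; the remaining items are its children.
    return [word for child in tree[1:] for word in leaves(child)]


tree = parse_tree("(S (NP (PRP I)) (VP (VBP like) (NP (NNS trees))))")
print(tree)
print(leaves(tree))
```

In a language with native tree support the reading step largely disappears, because the bracketed form is already the language's own data notation.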
Production
       requirements
*   Scalability
*   Maintainability
*   Integrability
*   ...
Choose Lisp

[Image: xkcd comic] (c) xkcd
Lisp FTW
* Truly interactive
  environment
* Very flexible => DSLs
* Native tree support
* Fast and solid

- No OpenNLP/NLTK
Heterogeneous systems
“Java way” vs. “Unix way”

Create language-agnostic
systems that can easily
communicate!
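
One way to read the "Unix way" is as small filters that exchange plain text, so that each stage can be written in a different language. The sketch below assumes a line-per-record JSON format with a "text" field. The format, the field names and the whitespace tokenizer are all placeholders chosen for illustration, not something the talk specifies.

```python
import io
import json


def run_filter(infile, outfile):
    """Read one JSON object per line, add a token list, write it back out."""
    for line in infile:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)
        # Placeholder processing step: whitespace tokenization.
        record["tokens"] = record["text"].split()
        outfile.write(json.dumps(record) + "\n")


# In-memory demonstration; in a pipeline these would be stdin and stdout.
source = io.StringIO('{"text": "spell check this"}\n')
sink = io.StringIO()
run_filter(source, sink)
print(sink.getvalue(), end="")
```

Calling `run_filter(sys.stdin, sys.stdout)` turns the same function into a command-line stage, and the next stage only has to agree on the line format, not on the implementation language.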
Take-aways
* As they say, in theory research and
  practice are the same, but in
  practice...

* Data is key. There are 3 types of it.
  Collect it, build tools to work
  with it easily and efficiently

* Choose a good language for R&D:
  interactive & malleable, with as few
  barriers as possible
Thanks!

Vsevolod Dyomkin
    @vseloved
