Practical NLP with Lisp

   Vsevolod Dyomkin
       Grammarly
Topics

*   Overview of NLP practice
*   Getting Data
*   Using Lisp: pros & cons
*   A couple of examples
A bit about Grammarly




        (c) xkcd
An example of what
   we deal with
NLP practice
R - research work:
set a goal →
devise an algorithm →
train the algorithm →
test its accuracy

D - development work:
implement the algorithm as an API with
sufficient performance and scaling
characteristics
Research
1. Set a goal
Business goal:

* Develop the best / a good-enough / a
better-than-Word (etc.) spellchecker

* Develop a set of grammar rules that will
catch errors according to MLA Style

* Develop a thesaurus that will produce
synonyms relevant to context
Translate it into a measurable goal
* On a test corpus of 10000 sentences with
common errors, achieve a smaller number of FNs
(and FPs) than other spellcheckers/Word
spellchecker/etc.

* On a corpus of examples of sentences with
each kind of error (and similar sentences
without this kind of error) find all
sentences with errors and do not find
errors in correct sentences

* On a test corpus of 1000 sentences
suggest synonyms for all meaningful words
that will be considered relevant by human
linguists in 90% of the cases
A Note on
       Terminology
FN and FP instead of
precision (P), recall (R)

FN = 1-R
FP = 1-P or ???
f1 = 2*P*R/(P+R) =
2*(1-FN-FP+FN*FP)/(2-(FN+FP))
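
As a minimal sketch of how these quantities relate, the snippet below computes precision, recall and F1 from raw counts of true positives, false positives and false negatives. The function name `prf1` and the example counts are illustrative assumptions, not figures from the talk. F1 here is the standard harmonic mean, 2*P*R/(P+R).

```python
def prf1(tp, fp, fn):
    """Return (precision, recall, f1) computed from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 errors caught, 20 false alarms, 40 errors missed.
p, r, f1 = prf1(tp=80, fp=20, fn=40)

# The slide's shorthand, expressed as rates.
fn_rate = 1 - r  # "FN"
fp_rate = 1 - p  # "FP" (one possible reading; the slide itself is unsure)
```

Tracking FN and FP directly keeps attention on the two kinds of mistakes users actually see: missed errors and false alarms.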
Research contd.
2. Devise an algorithm
3. Train & improve the
algorithm

http://nlp-class.org
4. Test its performance
ML: one corpus, divided into
training, development, test

Often — different corpora:
* for training some part (not
whole) of the algorithm
* for testing the whole
system
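
The single-corpus case can be sketched as follows. This is only an illustration: the 80/10/10 proportions, the fixed seed and the function name `split_corpus` are assumptions, not something the talk prescribes.

```python
import random


def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the corpus and split it into (training, development, test)."""
    items = list(sentences)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test


# Placeholder corpus; real data would come from one of the sources listed later.
corpus = [f"sentence {i}" for i in range(1000)]
train, dev, test = split_corpus(corpus)
```

The development part is used for tuning while the algorithm is being improved; the test part is kept aside for the final measurement.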
Theoretical maxima
Theoretical maxima are rarely
achievable. Why?

* Because you need their
data. (And data is key)

* Domains might differ
Pre/post-processing
What ultimately matters is
not crude performance, but...

Acceptance by users (much
harder to measure & depends
on domain).

The real world is messier than
any lab set-up.
Examples of
    pre-processing
For spellcheck:

* some people tend to join
words with slashes,
like: spell/grammar check

* handling of abbreviations
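
A minimal sketch of the slash case, assuming a simple whitespace tokenizer placed in front of the spellchecker. The pattern and the function name `spellcheck_tokens` are hypothetical, and abbreviation handling is left out.

```python
import re

# Two or more alphabetic words joined by slashes, e.g. "spell/grammar".
SLASHED = re.compile(r"[A-Za-z]+(?:/[A-Za-z]+)+")


def spellcheck_tokens(text):
    """Split on whitespace, then expand slash-joined words into separate tokens."""
    tokens = []
    for raw in text.split():
        # Trim surrounding punctuation so "check." is looked up as "check".
        word = raw.strip(".,;:!?()\"'")
        if SLASHED.fullmatch(word):
            tokens.extend(word.split("/"))
        elif word:
            tokens.append(word)
    return tokens


tokens = spellcheck_tokens("Run a spell/grammar check.")
```

Tokens that merely contain a slash, such as URLs or dates like 10/12, do not match the pattern and are passed through unchanged.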
Where to get data?
Well-known sources:
* Penn Treebank
* WordNet
* Web1T Google N-gram Corpus
* Linguistic Data Consortium
  (http://www.ldc.upenn.edu/)
More data
Also well-known sources, but
with a twist:
* Wikipedia & Wiktionary,
DBpedia
* OpenWeb Common Crawl
(updated: 2010)
* Public APIs of some
services: Twitter, Wordnik
Obscure corpora
Academic resources:
* Stanford
* CoNLL
* Oxford (http://www.ota.ox.ac.uk/)
* CMU, MIT,...
* LingPipe, OpenNLP, NLTK,...
Human-powered?


http://goo.gl/hs4qB
Beyond corpora?

* Bootstrapping
* Seeding
And remember...
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig, “The Unreasonable
Effectiveness of Data.”
http://youtu.be/yvDCzhbjYWs
Using Lisp for NLP




      (c) xkcd
Why Lisp?
Lisp is a carefully crafted
tool for:

*   Engineers
*   Practical researchers
*   Artists
*   Entrepreneurs
Some examples
*   Piano.aero
*   ITA Software
*   Secure Outcomes
*   Impromptu

* Land of Lisp
http://youtu.be/HM1Zb3xmvMc
Research
       requirements
*   Interactivity
*   Mathematical basis
*   Expressiveness
*   Agility / Malleability
*   Advanced tools
Specific NLP
     requirements
* Good support for statistics
& number-crunching (matrices)
– Statistical AI

* Good support for working
with trees & symbols
– Symbolic AI
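
To illustrate the "trees & symbols" side, here is a small sketch that reads a bracketed parse tree (the notation used by treebanks) into nested lists. It is written in Python only to keep the example self-contained; in Lisp the standard reader already handles this notation. The function names and the sample sentence are assumptions, and the input is assumed to be well-formed.

```python
def read_tree(text):
    """Parse a bracketed parse tree into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = []
            while tokens[pos] != ")":
                node.append(parse())
            pos += 1  # skip the closing parenthesis
            return node
        return token

    return parse()


def leaves(tree):
    """Return the words of a tree, left to right (element 0 is the label)."""
    words = []
    for child in tree[1:]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = read_tree("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
words = leaves(tree)
```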
Production
       requirements
*   Scalability
*   Maintainability
*   Integrability
*   ...
...eventually

* Speed
* Speed
* Speed
Heterogeneous
        systems
You have to split the system
and communicate:

“Java” way vs. “Unix” way

* Sockets, Redis, ZeroMQ, etc
for communication
* JSON, SEXPs, etc for data
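
One possible shape of the "Unix" way is sketched below: two components exchange newline-delimited JSON over a socket. Everything here is an assumption made for illustration, including the message fields and helper names. `socket.socketpair()` stands in for two separate processes, for example a Lisp service and a client written in another language.

```python
import json
import socket


def send_message(sock, payload):
    """Send one JSON object as a newline-terminated line."""
    sock.sendall(json.dumps(payload).encode("utf-8") + b"\n")


def recv_message(sock):
    """Read bytes up to the next newline and decode them as JSON."""
    buf = bytearray()
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1)  # byte-at-a-time: simple, not fast
        if not chunk:
            raise ConnectionError("peer closed before a full message arrived")
        buf.extend(chunk)
    return json.loads(buf.decode("utf-8"))


# Both ends live in one process here purely to keep the sketch runnable.
client, server = socket.socketpair()
send_message(client, {"task": "tokenize", "text": "spell/grammar check"})
request = recv_message(server)
send_message(server, {"tokens": request["text"].split()})
reply = recv_message(client)
client.close()
server.close()
```

Because `json.dumps` escapes newlines inside strings, a bare newline can safely mark the end of each message.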
Lisp drawbacks
There's no OpenNLP or SciPy, and
generally there are fewer
libraries.

But...
*   github: eslick/cl-langutils
*   github: mathematical-systems/clml
*   github: tpapp/lla
*   github: blindglobe/common-lisp-stat
*   … and http://quicklisp.org
But #2
Porter stemmer:
http://tartarus.org/~martin/PorterStemmer
& http://www.cliki.net/PorterStemmer

or Soundex:
http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/0.html

are irrelevant with good data
More drawbacks

Lisp is a fringe language

   Not a specialized language
  (like R, J or Octave)
Example #1


API interaction
Example #2
Lisp FTW
* truly interactive
environment
* very flexible => DSLs
* native tree support
* fast and solid
Take-aways
* Take nlp-class

* Data is key: collect it, and build tools
to work with it easily and efficiently

* A good language for R&D should be
first of all interactive & malleable,
with as few barriers as possible

* ... it also helps if you don't need to
port your code for production

* Lisp is one of the good examples
Thanks!

Vsevolod Dyomkin
    @vseloved

Practical NLP with Lisp

  • 1. Practical NLP with Lisp Vsevolod Dyomkin Grammarly
  • 2. Topics * Overview of NLP practice * Getting Data * Using Lisp: pros & cons * A couple of examples
  • 3. A bit about Grammarly (c) xkcd
  • 4. An example of what we deal with
  • 6. NLP practice R - research work: set a goal → devise an algorithm → train the algorithm → test its accuracy D - development work: implement the algorithm as an API with sufficient performance and scaling characteristics
  • 7. Research 1. Set a goal Business goal: * Develop the best/a good-enough/better-than-Word/etc. spellchecker * Develop a set of grammar rules that will catch errors according to MLA Style * Develop a thesaurus that will produce synonyms relevant to context
  • 8. Translate it to a measurable goal * On a test corpus of 10000 sentences with common errors, achieve a smaller number of FNs (and FPs) than other spellcheckers/Word spellchecker/etc. * On a corpus of example sentences with each kind of error (and similar sentences without that kind of error), find all sentences with errors and do not find errors in correct sentences * On a test corpus of 1000 sentences, suggest synonyms for all meaningful words that will be considered relevant by human linguists in 90% of the cases
  • 9. A Note on Terminology FN and FP (as rates) instead of precision (P), recall (R) FN = 1-R FP = 1-P or ??? f1 = 2*P*R/(P+R) = 2*(1-FN-FP+FN*FP)/(2-(FN+FP))
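The relationship between counts, precision/recall and the FN/FP rates used on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names are hypothetical, F1 is taken to be the standard harmonic mean 2PR/(P+R), and the FP rate is assumed to be 1-P (the slide itself leaves that open).

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts of true positives,
    false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from_error_rates(fn_rate, fp_rate):
    """F1 expressed through the rates FN = 1 - R and FP = 1 - P
    (the FP = 1 - P reading is an assumption)."""
    p, r = 1 - fp_rate, 1 - fn_rate
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, `precision_recall_f1(tp=8, fp=2, fn=2)` gives precision and recall of 0.8 each, and `f1_from_error_rates(0.2, 0.2)` arrives at the same F1 from the error rates.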
  • 11. Research contd. 2. Devise an algorithm 3. Train & improve the algorithm http://nlp-class.org
  • 13. 4. Test its performance ML: one corpus, divided into training, development, test Often, different corpora: * for training some part (not whole) of the algorithm * for testing the whole system
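The single-corpus case from this slide (one corpus divided into training, development and test parts) might look roughly like the following. It is only a sketch: the function name, the 80/10/10 default proportions and the fixed seed are assumptions made for the example, not figures from the talk.

```python
import random


def split_corpus(sentences, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a corpus reproducibly and split it into
    (train, dev, test) lists without overlap."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    n_test = round(len(items) * test_fraction)
    n_dev = round(len(items) * dev_fraction)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test
```

Keeping the split deterministic matters in practice: if the test part changes between runs, accuracy numbers from different experiments are no longer comparable.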
  • 16. Theoretical maxima Theoretical maxima are rarely achievable. Why? * Because you need their data. (And data is key) * Domains might differ
  • 19. Pre/post-processing What ultimately matters is not crude performance, but... Acceptance by users (much harder to measure & depends on domain). The real world is messier than any lab set-up.
  • 20. Examples of pre-processing For spellcheck: * some people write words separated by slashes, like: spell/grammar check * handling of abbreviations
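A pre-processing step for the slash case on this slide could be sketched as below: slash-joined words are expanded so that each one reaches the spellchecker separately, while a small exception list protects slash abbreviations. The function name, the regular expression and the contents of `KNOWN_ABBREVIATIONS` are all hypothetical choices for illustration.

```python
import re

# Two or more alphabetic words joined by slashes, e.g. "spell/grammar".
SLASHED = re.compile(r"[A-Za-z]+(?:/[A-Za-z]+)+")

# Hypothetical exception list: slash abbreviations that must stay intact.
KNOWN_ABBREVIATIONS = {"w/o", "n/a", "c/o"}


def split_slashed_words(text):
    """Return the words of `text`, expanding slash-joined alternatives
    into separate words and leaving known abbreviations untouched."""
    words = []
    for token in text.split():
        core = token.strip(".,;:!?()\"'")  # drop surrounding punctuation
        if not core:
            continue
        if core.lower() in KNOWN_ABBREVIATIONS:
            words.append(core)
        elif SLASHED.fullmatch(core):
            words.extend(core.split("/"))
        else:
            words.append(core)
    return words
```

With this sketch, `split_slashed_words("Run a spell/grammar check.")` yields the five separate words, while `w/o` in a sentence is passed through unchanged. Tokens such as URLs do not match the pattern and are also left alone.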
  • 21. Where to get data? Well-known sources: * Penn Treebank * WordNet * Web1T Google N-gram Corpus * Linguistic Data Consortium (http://www.ldc.upenn.edu/)
  • 22. More data Also well-known sources, but with a twist: * Wikipedia & Wiktionary, DBpedia * OpenWeb Common Crawl (updated: 2010) * Public APIs of some services: Twitter, Wordnik
  • 23. Obscure corpora Academic resources: * Stanford * CoNLL * Oxford (http://www.ota.ox.ac.uk/) * CMU, MIT,... * LingPipe, OpenNLP, NLTK,...
  • 26. And remember... “Data is ten times more powerful than algorithms.” -- Peter Norvig, “The Unreasonable Effectiveness of Data.” http://youtu.be/yvDCzhbjYWs
  • 27. Using Lisp for NLP (c) xkcd
  • 28. Why Lisp? Lisp is a carefully crafted tool for: * Engineers * Practical researchers * Artists * Entrepreneurs
  • 29. Some examples * Piano.aero * ITA Software * Secure Outcomes * Impromptu * Land of Lisp http://youtu.be/HM1Zb3xmvMc
  • 30. Research requirements * Interactivity * Mathematical basis * Expressiveness * Agility Malleability * Advanced tools
  • 31. Specific NLP requirements * Good support for statistics & number-crunching (matrices) – Statistical AI * Good support for working with trees & symbols – Symbolic AI
  • 32. Production requirements * Scalability * Maintainability * Integrability * ...
  • 36. Heterogeneous systems You have to split the system and communicate: “Java” way vs. “Unix” way * Sockets, Redis, ZeroMQ, etc for communication * JSON, SEXPs, etc for data
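To make the data-format choice on this slide concrete, here is a small sketch of the non-Lisp side of such a split system: the same parse tree is serialized once as JSON and once as an S-expression that a Lisp reader can consume. The `to_sexp` helper and the tree layout (nested lists of strings) are assumptions for the example; transport over sockets, Redis or ZeroMQ is left out.

```python
import json


def to_sexp(node):
    """Render nested lists/tuples of strings and numbers as an
    S-expression string; strings are emitted double-quoted."""
    if isinstance(node, (list, tuple)):
        return "(" + " ".join(to_sexp(child) for child in node) + ")"
    if isinstance(node, str):
        escaped = node.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    if isinstance(node, (int, float)) and not isinstance(node, bool):
        return str(node)
    raise TypeError("unsupported node type: %r" % type(node))


# A toy parse tree: the first element of each list is the node label.
tree = ["S", ["NP", "the", "cat"], ["VP", "sleeps"]]
as_json = json.dumps(tree)
as_sexp = to_sexp(tree)
```

The JSON form is convenient for services written in other languages, while the S-expression form can be read on the Lisp side without a separate parser.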
  • 38. Lisp drawbacks There's no OpenNLP or SciPy & generally there are fewer libraries. But... * github: eslick/cl-langutils * github: mathematical-systems/clml * github: tpapp/lla * github: blindglobe/common-lisp-stat * … and http://quicklisp.org
  • 39. But #2 Porter stemmer: http://tartarus.org/~martin/PorterStemmer & http://www.cliki.net/PorterStemmer or Soundex: http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/lang/lisp/code/0.html are irrelevant with good data
  • 40. More drawbacks Lisp is a fringe language Not a specialized language (like R, J or Octave)
  • 43. Lisp FTW * truly interactive environment * very flexible => DSLs * native tree support * fast and solid
  • 44. Take-aways * Take nlp-class * Data is key, collect it, build tools to work with it easily and efficiently * A good language for R&D should be first of all interactive & malleable, with as few barriers as possible * ... it also helps if you don't need to port your code for production * Lisp is one of the good examples