SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
The pope is catholic.
language as an interfacelanguage as data
introduction
(@spencermountain)
introduction
problem
problem
problem
london in the rain london in the
in the rain
london in
in the
the rain
london
in
the
rain
4-gram: 3-gram: 2-gram: 1-gram:
10 requests per keystroke
4-words -
5-words : 15 6-words : 21 7-words : 28 8-words : 36
problem
london in the rain
london in the
in the rain
london in
in the
the rain
London
in
the
rain
in
the
london in the
in the rain
london in
the rain
Stopwords
blacklist:
Edge gram
filter:
Redundancy
check:
#1
#2
#3
in the
“rain”
“london”
“london in the rain”
problem
● NLTK - excellent, huge, python
● Stanford parser - excellent, huge, java
Alchemy,
TextRazor,
OpenCalais,
Embedly,
Zemanta
Or an offsite API?● Freeling - excellent, huge, C++
● Illinois tagger - excellent, huge, java
When all you’ve got is a jackhammer..
niche
Can it be hacked?
tldr: yes.
niche
How big is a language?
Shakespeare - 35,000
niche
Zipfs law
The top 10 words account for 25% of language.
The top 100 words account for 50% of language.
The top 50,000 words account for 95% of language.
niche
602 kb
uncompressed
50,000
different words
An average person will ever hear
~4 lookups in binary search
niche
first, let’s kill the
nouns 70%
process
180 kb
uncompressed
Noun Verb Adjective Adverb
Tomato
Tomatoes
Toronto
Torontonian
Speak
Spoke
Speaking
will speak
have spoken
had spoken
...
nice
nicer
nicest
quickly
quicklier
quickliest
“awesome”“awesomeify”
improveify your vocabularies
“quickly”“quick”
n/2.3
Each word
*not handsome *not truly*not is*not economics
“tomatoey”“tomato”
“speak”“speaker”
“aggressive”“agressiveness”
“civil”“civilize”
niche
then, let’s conjugate
our verbs
process
110 kb
uncompressed
process
jQuery
256kb
d3js
330kb
react
653kb
110 kb
uncompressed
lodash
503kb
the whole
english
language
110kb
Ok, let’s roll our own POS tagger..
(what could go rong?)
process
1) ☑ Lexicon
2) ◻ Suffix regexes
3) ◻ Sentence
Suffix rules
process
1) ☑ Lexicon
2) ☑ Suffix regexes
3) ◻ Sentence
Grammar rules - markov
She could walk the walk .
before: Verb - Det - Verb
after: Verb - Det - Noun
process
1) ☑ Lexicon
2) ☑ Suffix regexes
3) ☑ Sentence
“Unreasonable effectiveness” of rule-based taggers-
● a 1,000 word lexicon - 45% precision
● fallback to [Noun] - 70% precision
● a little regex - 74% precision
● a little grammar in it - 81% precision
process
t.text(“keep on rocking in the free world”)
t.negate()
//“don’t keep on rocking in the free world.”
outcome
t.text(“it is a cool library”)
t.toValleyGirl()
//“so, it is like, a cool library.”
outcome
We gave the monkeys the
bananas,
..because they were ripe. ..because they were hungry.
outcome
We gave the monkeys the bananas
[Pr] [Verb] [Dt] [Noun] [Dt] [Noun]
list of letters
POS-tagging
Dependency parser We give [Noun] [Noun]
[act / transfer / voluntary] [genus / monkey]
Knowledge engine
[plant / banana]
outcome
#TODOFML
● Mutable/Immutable API
● Speed, performance testing
● Romantic-language verb conjugations
● ‘bl.ocks.org’ of demos and docs
outcome
npm install --wooyeah
@spencermountain
Slack group, mailing list, github, Toronto/coffee

Contenu connexe

Similaire à nlp_compromise

Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Apache OpenNLP
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
tswr
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured Text
DataWorks Summit
 

Similaire à nlp_compromise (20)

Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...Big Data Spain 2017  - Deriving Actionable Insights from High Volume Media St...
Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media St...
 
About programming languages
About programming languagesAbout programming languages
About programming languages
 
Text generation and_advanced_topics
Text generation and_advanced_topicsText generation and_advanced_topics
Text generation and_advanced_topics
 
Collatinus: Lemmatizer and morphological analyzer for Latin texts
Collatinus: Lemmatizer and morphological analyzer for Latin textsCollatinus: Lemmatizer and morphological analyzer for Latin texts
Collatinus: Lemmatizer and morphological analyzer for Latin texts
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Sequencing run grief counseling: counting kmers at MG-RAST
Sequencing run grief counseling: counting kmers at MG-RASTSequencing run grief counseling: counting kmers at MG-RAST
Sequencing run grief counseling: counting kmers at MG-RAST
 
An overview of JavaScript and Node.js
An overview of JavaScript and Node.jsAn overview of JavaScript and Node.js
An overview of JavaScript and Node.js
 
Intuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyondIntuition & Use-Cases of Embeddings in NLP & beyond
Intuition & Use-Cases of Embeddings in NLP & beyond
 
CSCE181 Big ideas in NLP
CSCE181 Big ideas in NLPCSCE181 Big ideas in NLP
CSCE181 Big ideas in NLP
 
Hello, Joe. Hello, Mike; Hello, Robert.
Hello, Joe. Hello, Mike; Hello, Robert.Hello, Joe. Hello, Mike; Hello, Robert.
Hello, Joe. Hello, Mike; Hello, Robert.
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Here be dragons
Here be dragonsHere be dragons
Here be dragons
 
Distributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in PythonDistributed Natural Language Processing Systems in Python
Distributed Natural Language Processing Systems in Python
 
Large Scale Text Processing
Large Scale Text ProcessingLarge Scale Text Processing
Large Scale Text Processing
 
Large Scale Processing of Unstructured Text
Large Scale Processing of Unstructured TextLarge Scale Processing of Unstructured Text
Large Scale Processing of Unstructured Text
 
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
word2vec, LDA, and introducing a new hybrid algorithm: lda2vecword2vec, LDA, and introducing a new hybrid algorithm: lda2vec
word2vec, LDA, and introducing a new hybrid algorithm: lda2vec
 
Introduction to functional programming (In Arabic)
Introduction to functional programming (In Arabic)Introduction to functional programming (In Arabic)
Introduction to functional programming (In Arabic)
 
Learning for sequences - Adam Mathias
Learning for sequences  - Adam MathiasLearning for sequences  - Adam Mathias
Learning for sequences - Adam Mathias
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 

Plus de TorontoNodeJS (7)

Node.js API pitfalls
Node.js API pitfallsNode.js API pitfalls
Node.js API pitfalls
 
Safely Build, Publish & Maintain ES2015, ES2016 Packages Today
Safely Build, Publish & Maintain ES2015, ES2016 Packages TodaySafely Build, Publish & Maintain ES2015, ES2016 Packages Today
Safely Build, Publish & Maintain ES2015, ES2016 Packages Today
 
KoNote
KoNoteKoNote
KoNote
 
Understanding the Single Thread Event Loop
Understanding the Single Thread Event LoopUnderstanding the Single Thread Event Loop
Understanding the Single Thread Event Loop
 
Avoiding callback hell with promises
Avoiding callback hell with promisesAvoiding callback hell with promises
Avoiding callback hell with promises
 
Node as an API shim
Node as an API shimNode as an API shim
Node as an API shim
 
Building your own slack bot on the AWS stack
Building your own slack bot on the AWS stackBuilding your own slack bot on the AWS stack
Building your own slack bot on the AWS stack
 

Dernier

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Dernier (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

nlp_compromise

  • 1. The pope is catholic. language as an interfacelanguage as data introduction
  • 6. london in the rain london in the in the rain london in in the the rain london in the rain 4-gram: 3-gram: 2-gram: 1-gram: 10 requests per keystroke 4-words - 5-words : 15 6-words : 21 7-words : 28 8-words : 36 problem
  • 7. london in the rain london in the in the rain london in in the the rain London in the rain in the london in the in the rain london in the rain Stopwords blacklist: Edge gram filter: Redundancy check: #1 #2 #3 in the “rain” “london” “london in the rain” problem
  • 8. ● NLTK - excellent, huge, python ● Stanford parser - excellent, huge, java Alchemy, TextRazor, OpenCalais, Embedly, Zemanta Or an offsite API?● Freeling - excellent, huge, C++ ● Illinois tagger - excellent, huge, java When all you’ve got is a jackhammer.. niche
  • 9. Can it be hacked? tldr: yes. niche
  • 10. How big is a language? Shakespeare - 35,000 niche
  • 11. Zipfs law The top 10 words account for 25% of language. The top 100 words account for 50% of language. The top 50,000 words account for 95% of language. niche
  • 12. 602 kb uncompressed 50,000 different words An average person will ever hear ~4 lookups in binary search niche
  • 13. first, let’s kill the nouns 70% process 180 kb uncompressed
  • 14. Noun Verb Adjective Adverb Tomato Tomatoes Toronto Torontonian Speak Spoke Speaking will speak have spoken had spoken ... nice nicer nicest quickly quicklier quickliest “awesome”“awesomeify” improveify your vocabularies “quickly”“quick” n/2.3 Each word *not handsome *not truly*not is*not economics “tomatoey”“tomato” “speak”“speaker” “aggressive”“agressiveness” “civil”“civilize” niche
  • 15. then, let’s conjugate our verbs process 110 kb uncompressed
  • 17. Ok, let’s roll our own POS tagger.. (what could go rong?) process 1) ☑ Lexicon 2) ◻ Suffix regexes 3) ◻ Sentence
  • 18. Suffix rules process 1) ☑ Lexicon 2) ☑ Suffix regexes 3) ◻ Sentence
  • 19. Grammar rules - markov She could walk the walk . before: Verb - Det - Verb after: Verb - Det - Noun process 1) ☑ Lexicon 2) ☑ Suffix regexes 3) ☑ Sentence
  • 20. “Unreasonable effectiveness” of rule-based taggers- ● a 1,000 word lexicon - 45% precision ● fallback to [Noun] - 70% precision ● a little regex - 74% precision ● a little grammar in it - 81% precision process
  • 21. t.text(“keep on rocking in the free world”) t.negate() //“don’t keep on rocking in the free world.” outcome
  • 22. t.text(“it is a cool library”) t.toValleyGirl() //“so, it is like, a cool library.” outcome
  • 23. We gave the monkeys the bananas, ..because they were ripe. ..because they were hungry. outcome
  • 24. We gave the monkeys the bananas [Pr] [Verb] [Dt] [Noun] [Dt] [Noun] list of letters POS-tagging Dependency parser We give [Noun] [Noun] [act / transfer / voluntary] [genus / monkey] Knowledge engine [plant / banana] outcome
  • 25. #TODOFML ● Mutable/Immutable API ● Speed, performance testing ● Romantic-language verb conjugations ● ‘bl.ocks.org’ of demos and docs outcome
  • 26. npm install --wooyeah @spencermountain Slack group, mailing list, github, Toronto/coffee