SlideShare une entreprise Scribd logo
1  sur  48
Microsoft New England Research and Development Center, June 22, 2011 Natural Language Processing and Machine Learning Using Python Shankar Ambady|
Example Files Hosted on Github https://github.com/shanbady/NLTK-Boston-Python-Meetup
What is “Natural Language Processing”? Where is this stuff used? The Machine learning paradox A look at a few key terms Quick start – creating NLP apps in Python
What is Natural Language Processing?	 ,[object Object]
 The goal is to enable machines to understand human language and extract meaning from text.
 It is a field of study which falls under the category of machine learning and more specifically computational linguistics.
The “Natural Language Toolkit” is a python module that provides a variety of functionality that will aide us in processing text.,[object Object]
Paradoxes in Machine Learning
Context Little sister: What’s your name? Me: Uhh….Shankar..? Sister: Can you spell it? Me: yes. S-H-A-N-K-A…..
Sister: WRONG! It’s spelled “I-T”
Language translation is a complicated matter! Go to:  http://babel.mrfeinberg.com/ The problem with communication is the illusion that it has occurred
The problem with communication is the illusion that it has occurred Das Problem mit Kommunikation ist die Illusion, dass es aufgetreten ist The problem with communication is the illusion that it arose Das Problem mit Kommunikation ist die Illusion, dass es entstand The problem with communication is the illusion that it developed Das Problem mit Kommunikation ist die Illusion, die sie entwickelte The problem with communication is the illusion, which developed it
The problem with communication is the illusion that it has occurred The problem with communication is the illusion, which developed it EPIC FAIL
Police policepolice. The above statement is a complete sentence states that:   “police officers police other police officers”. She yelled “police police police” “someone called for police”.
Key Terms
The NLP Pipeline
Setting up NLTK Source downloads available for mac and linux as well as installable packages for windows. Currently only available for Python 2.5 – 2.6 http://www.nltk.org/download `easy_install nltk` Prerequisites NumPy SciPy
First steps NLTK comes with packages of corpora that are required for many modules.  Open a python interpreter: importnltk nltk.download()  If you do not want to use the downloader with a gui (requires TKInter module) Run: python -m nltk.downloader <name of package or “all”>
You may individually select packages or download them in bulk.
Let’s dive into some code!
Part of Speech Tagging fromnltkimportpos_tag,word_tokenize sentence1='this is a demo that will show you how to detects parts of speech with little effort using NLTK!' tokenized_sent=word_tokenize(sentence1) printpos_tag(tokenized_sent) [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'), ('that', 'WDT'), ('will', 'MD'), ('show', 'VB'), ('you', 'PRP'), ('how', 'WRB'), ('to', 'TO'), ('detects', 'NNS'), ('parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('with', 'IN'), ('little', 'JJ'), ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'),('!', '.')]
Penn Bank Part-of-Speech Tags Source: http://www.ai.mit.edu/courses/6.863/tagdef.html
NLTK Text nltk.clean_html(rawhtml) from nltk.corpus import brown from nltk import Text brown_words = brown.words(categories='humor') brownText = Text(brown_words) brownText.collocations() brownText.count("car") brownText.concordance("oil") brownText.dispersion_plot(['car', 'document', 'funny', 'oil']) brownText.similar('humor')
Find similar terms (word definitions) using Wordnet importnltk fromnltk.corpusimportwordnetaswn synsets=wn.synsets('phone') print[str(syns.definition)forsynsinsynsets]  'electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds‘ '(phonetics) an individual sound unit of speech without concern as to whether or not it is a phoneme of some language‘ 'electro-acoustic transducer for converting electric signals into sounds; it is held over or inserted into the ear‘ 'get or try to get into communication (with someone) by telephone'
from nltk.corpus import wordnet as wn synsets = wn.synsets('phone') print [str(  syns.definition ) for syns in synsets] “syns.definition” can be modified to output hypernyms , meronyms, holonyms etc:
Fun things to Try
Feeling lonely? Eliza is there to talk to you all day! What human could ever do that for you?? fromnltk.chatimporteliza eliza.eliza_chat() ……starts the chatbot Therapist --------- Talk to the program by typing in plain English, using normal upper- and lower-case letters and punctuation.  Enter "quit" when done. ======================================================================== Hello.  How are you feeling today?
Englisch to German to Englisch to German…… fromnltk.bookimport* babelize_shell() Babel> the internet is a series of tubes Babel> german Babel> run 0> the internet is a series of tubes 1> das Internet ist eine Reihe SchlSuche 2> the Internet is a number of hoses 3> das Internet ist einige SchlSuche 4> the Internet is some hoses Babel>
Let’s build something even cooler
Lets write a spam filter! 	A program that analyzes legitimate emails “Ham” as well as “spam” and learns the features that are associated with each. 	Once trained, we should be able to run this program on incoming mail and have it reliably label each one with the appropriate category.
What you will need NLTK (of course) as well as the “stopwords” corpus A good dataset of emails; Both spam and ham Patience and a cup of coffee  	(these programs tend to take a while to complete)
Finding Great Data: The Enron Emails A dataset of 200,000+ emails made publicly available in 2003 after the Enron scandal. Contains both spam and actual corporate ham mail. For this reason it is one of the most popular datasets used for testing and developing anti-spam software. The dataset we will use is located at the following url: http://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/preprocessed/ It contains a list of archived files that contain plaintext emails in two folders , Spam and Ham.
Extract one of the archives from the site into your working directory.  Create a python script, lets call it “spambot.py”. Your working directory should contain the “spambot” script and the folders “spam” and “ham”. “Spambot.py” fromnltkimportword_tokenize,WordNetLemmatizer,NaiveBayesClassifierbr />,classify,MaxentClassifier fromnltk.corpusimportstopwords importrandom importos,glob,re
“Spambot.py” (continued) wordlemmatizer = WordNetLemmatizer() commonwords = stopwords.words('english')  hamtexts=[] spamtexts=[] forinfileinglob.glob(os.path.join('ham/','*.txt')): text_file=open(infile,"r") hamtexts.extend(text_file.read()) text_file.close() forinfileinglob.glob(os.path.join('spam/','*.txt')): text_file=open(infile,"r") spamtexts.extend(text_file.read()) text_file.close() load common English words into list start globbing the files into the appropriate lists
“Spambot.py” (continued) mixedemails=([(email,'spam')foremailinspamtexts] mixedemails+=   [(email,'ham')foremailinhamtexts]) random.shuffle(mixedemails) label each item with the appropriate label and store them as a list of tuples From this list of random but labeled emails, we will defined a “feature extractor” which outputs a feature set that our program can use to statistically compare spam and ham.  lets give them a nice shuffle
“Spambot.py” (continued) defemail_features(sent): features={} wordtokens=[wordlemmatizer.lemmatize(word.lower())forwordinword_tokenize(sent)] forwordinwordtokens: ifwordnotincommonwords: features[word]=True returnfeatures featuresets=[(email_features(n),g)for(n,g)inmixedemails] Normalize words If the word is not a stop-word then lets consider it a “feature” Let’s run each email through the feature extractor  and collect it in a “featureset” list
[object Object]
To use features that are non-binary such as number values, you must convert it to a binary feature. This process is called “binning”.
If the feature is the number 12 the feature is: (“11<x<13”, True),[object Object]
“Spambot.py” (continued) print classifier.labels() This will output the labels that our classifier will use to tag new data ['ham', 'spam'] The purpose of create a “training set” and a “test set” is to test the accuracy of our classifier on a separate sample from the same data source. printclassify.accuracy(classifier,test_set) 0.75623
classifier.show_most_informative_features(20) Spam Ham
“Spambot.py” (continued) WhileTrue: featset=email_features(raw_input("Enter text to classify: ")) printclassifier.classify(featset) We can now directly input new email and have it classified as either Spam or Ham
A few notes: ,[object Object]
 The threshold value that determines the sample size of the feature set will need to be refined until it reaches its maximum accuracy. This will need to be adjusted if training data is added, changed or removed.,[object Object]
Try classifying your own emails using this trained classifier and you will notice a sharp decline in accuracy.,[object Object]
Two Labels Fresh rotten Flixster, Rotten Tomatoes, the Certified Fresh Logo are trademarks or registered trademarks of Flixster, Inc. in the United States and other countries
easy_installtweetstream importtweetstream words=["green lantern"] withtweetstream.TrackStream(“yourusername",“yourpassword",words)asstream: fortweetinstream: tweettext=tweet.get('text','') tweetuser=tweet['user']['screen_name'].encode('utf-8') featset=review_features(tweettext) printtweetuser,': ',tweettext printtweetuser,“ thinks Green Lantern is ",classifier.classify(featset),br />"----------------------------"

Contenu connexe

Tendances

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easyGopi Krishnan Nambiar
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)Sumit Raj
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHugJimmy Lai
 
Most Asked Python Interview Questions
Most Asked Python Interview QuestionsMost Asked Python Interview Questions
Most Asked Python Interview QuestionsShubham Shrimant
 
Document Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitDocument Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitBen Healey
 
Python interview questions
Python interview questionsPython interview questions
Python interview questionsPragati Singh
 
KiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialKiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialAlyona Medelyan
 
Python interview questions
Python interview questionsPython interview questions
Python interview questionsPragati Singh
 
Introduction to Python Pandas for Data Analytics
Introduction to Python Pandas for Data AnalyticsIntroduction to Python Pandas for Data Analytics
Introduction to Python Pandas for Data AnalyticsPhoenix
 
NLP Deep Learning with Tensorflow
NLP Deep Learning with TensorflowNLP Deep Learning with Tensorflow
NLP Deep Learning with Tensorflowseungwoo kim
 
Python Foundation – A programmer's introduction to Python concepts & style
Python Foundation – A programmer's introduction to Python concepts & stylePython Foundation – A programmer's introduction to Python concepts & style
Python Foundation – A programmer's introduction to Python concepts & styleKevlin Henney
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Pythonanntp
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to PythonNowell Strite
 
GDG Helwan Introduction to python
GDG Helwan Introduction to pythonGDG Helwan Introduction to python
GDG Helwan Introduction to pythonMohamed Hegazy
 
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYAChapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYAMaulik Borsaniya
 

Tendances (20)

Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Natural Language Processing made easy
Natural Language Processing made easyNatural Language Processing made easy
Natural Language Processing made easy
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
 
Most Asked Python Interview Questions
Most Asked Python Interview QuestionsMost Asked Python Interview Questions
Most Asked Python Interview Questions
 
NLTK
NLTKNLTK
NLTK
 
Document Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language ToolkitDocument Classification using the Python Natural Language Toolkit
Document Classification using the Python Natural Language Toolkit
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
KiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorialKiwiPyCon 2014 - NLP with Python tutorial
KiwiPyCon 2014 - NLP with Python tutorial
 
Python interview questions
Python interview questionsPython interview questions
Python interview questions
 
Introduction to Python Pandas for Data Analytics
Introduction to Python Pandas for Data AnalyticsIntroduction to Python Pandas for Data Analytics
Introduction to Python Pandas for Data Analytics
 
NLP Deep Learning with Tensorflow
NLP Deep Learning with TensorflowNLP Deep Learning with Tensorflow
NLP Deep Learning with Tensorflow
 
Python Foundation – A programmer's introduction to Python concepts & style
Python Foundation – A programmer's introduction to Python concepts & stylePython Foundation – A programmer's introduction to Python concepts & style
Python Foundation – A programmer's introduction to Python concepts & style
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Python
 
Introduction to Python
Introduction to PythonIntroduction to Python
Introduction to Python
 
GDG Helwan Introduction to python
GDG Helwan Introduction to pythonGDG Helwan Introduction to python
GDG Helwan Introduction to python
 
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYAChapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
Chapter 1 - INTRODUCTION TO PYTHON -MAULIK BORSANIYA
 
Python - Lesson 1
Python - Lesson 1Python - Lesson 1
Python - Lesson 1
 
Python Presentation
Python PresentationPython Presentation
Python Presentation
 

En vedette

TheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkeyTheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkeyMorgan Elizabeth Romano
 
Python an-intro - odp
Python an-intro - odpPython an-intro - odp
Python an-intro - odpArulalan T
 
Manual de trabajos con quimicos peligrosos
Manual de trabajos con quimicos peligrososManual de trabajos con quimicos peligrosos
Manual de trabajos con quimicos peligrososJorge Andres Godoy Marin
 
Diplomatic list. - Free Online Library
Diplomatic list. - Free Online LibraryDiplomatic list. - Free Online Library
Diplomatic list. - Free Online Libraryaboundingconcei84
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Prakash Pimpale
 
Centros de contacto: las demandas y requerimientos del mercado
Centros de contacto: las demandas y requerimientos del mercadoCentros de contacto: las demandas y requerimientos del mercado
Centros de contacto: las demandas y requerimientos del mercadoMundo Contact
 
Intro to Python Programming Language
Intro to Python Programming LanguageIntro to Python Programming Language
Intro to Python Programming LanguageDipankar Achinta
 
Unutturulmak istenen Devrimci Atatürk! (2)
Unutturulmak istenen Devrimci Atatürk! (2)Unutturulmak istenen Devrimci Atatürk! (2)
Unutturulmak istenen Devrimci Atatürk! (2)Olgaç Demirkol
 

En vedette (8)

TheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkeyTheAssociationOfAtheismOnLegalPersonalityInTurkey
TheAssociationOfAtheismOnLegalPersonalityInTurkey
 
Python an-intro - odp
Python an-intro - odpPython an-intro - odp
Python an-intro - odp
 
Manual de trabajos con quimicos peligrosos
Manual de trabajos con quimicos peligrososManual de trabajos con quimicos peligrosos
Manual de trabajos con quimicos peligrosos
 
Diplomatic list. - Free Online Library
Diplomatic list. - Free Online LibraryDiplomatic list. - Free Online Library
Diplomatic list. - Free Online Library
 
Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics Natural Language Toolkit (NLTK), Basics
Natural Language Toolkit (NLTK), Basics
 
Centros de contacto: las demandas y requerimientos del mercado
Centros de contacto: las demandas y requerimientos del mercadoCentros de contacto: las demandas y requerimientos del mercado
Centros de contacto: las demandas y requerimientos del mercado
 
Intro to Python Programming Language
Intro to Python Programming LanguageIntro to Python Programming Language
Intro to Python Programming Language
 
Unutturulmak istenen Devrimci Atatürk! (2)
Unutturulmak istenen Devrimci Atatürk! (2)Unutturulmak istenen Devrimci Atatürk! (2)
Unutturulmak istenen Devrimci Atatürk! (2)
 

Similaire à Nltk - Boston Text Analytics

summer training report on python
summer training report on pythonsummer training report on python
summer training report on pythonShubham Yadav
 
python presentation
python presentationpython presentation
python presentationVaibhavMawal
 
Pythonlearn-01-Intro.pptx
Pythonlearn-01-Intro.pptxPythonlearn-01-Intro.pptx
Pythonlearn-01-Intro.pptxMrHackerxD
 
Hasktut
HasktutHasktut
Hasktutkv33
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to pythonRanjith kumar
 
Python_Introduction_Good_PPT.pptx
Python_Introduction_Good_PPT.pptxPython_Introduction_Good_PPT.pptx
Python_Introduction_Good_PPT.pptxlemonchoos
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answersRojaPriya
 
WEB PROGRAMMING UNIT VIII BY BHAVSINGH MALOTH
WEB PROGRAMMING UNIT VIII BY BHAVSINGH MALOTHWEB PROGRAMMING UNIT VIII BY BHAVSINGH MALOTH
WEB PROGRAMMING UNIT VIII BY BHAVSINGH MALOTHBhavsingh Maloth
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In PythonMarwan Osman
 
Mastering python lesson1
Mastering python lesson1Mastering python lesson1
Mastering python lesson1Ruth Marvin
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answerskavinilavuG
 
AI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this pptAI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this pptpavankalyanadroittec
 
REPORT ON AUDIT COURSE PYTHON BY SANA 2.pdf
REPORT ON AUDIT COURSE PYTHON BY SANA 2.pdfREPORT ON AUDIT COURSE PYTHON BY SANA 2.pdf
REPORT ON AUDIT COURSE PYTHON BY SANA 2.pdfSana Khan
 

Similaire à Nltk - Boston Text Analytics (20)

How To Tame Python
How To Tame PythonHow To Tame Python
How To Tame Python
 
PYTHON PPT.pptx
PYTHON PPT.pptxPYTHON PPT.pptx
PYTHON PPT.pptx
 
summer training report on python
summer training report on pythonsummer training report on python
summer training report on python
 
python presentation
python presentationpython presentation
python presentation
 
Pythonlearn-01-Intro.pptx
Pythonlearn-01-Intro.pptxPythonlearn-01-Intro.pptx
Pythonlearn-01-Intro.pptx
 
Hasktut
HasktutHasktut
Hasktut
 
biopython, doctest and makefiles
biopython, doctest and makefilesbiopython, doctest and makefiles
biopython, doctest and makefiles
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Python_Introduction_Good_PPT.pptx
Python_Introduction_Good_PPT.pptxPython_Introduction_Good_PPT.pptx
Python_Introduction_Good_PPT.pptx
 
MODULE 1.pptx
MODULE 1.pptxMODULE 1.pptx
MODULE 1.pptx
 
Python programming
Python programmingPython programming
Python programming
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answers
 
Python for dummies
Python for dummiesPython for dummies
Python for dummies
 
WEB PROGRAMMING UNIT VIII BY BHAVSINGH MALOTH
WEB PROGRAMMING UNIT VIII BY BHAVSINGH MALOTHWEB PROGRAMMING UNIT VIII BY BHAVSINGH MALOTH
WEB PROGRAMMING UNIT VIII BY BHAVSINGH MALOTH
 
Programming Under Linux In Python
Programming Under Linux In PythonProgramming Under Linux In Python
Programming Under Linux In Python
 
Mastering python lesson1
Mastering python lesson1Mastering python lesson1
Mastering python lesson1
 
Python interview questions and answers
Python interview questions and answersPython interview questions and answers
Python interview questions and answers
 
Python Programming.pptx
Python Programming.pptxPython Programming.pptx
Python Programming.pptx
 
AI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this pptAI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
AI UNIT 3 - SRCAS JOC.pptx enjoy this ppt
 
REPORT ON AUDIT COURSE PYTHON BY SANA 2.pdf
REPORT ON AUDIT COURSE PYTHON BY SANA 2.pdfREPORT ON AUDIT COURSE PYTHON BY SANA 2.pdf
REPORT ON AUDIT COURSE PYTHON BY SANA 2.pdf
 

Dernier

Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Nltk - Boston Text Analytics

  • 1. Microsoft New England Research and Development Center, June 22, 2011 Natural Language Processing and Machine Learning Using Python Shankar Ambady|
  • 2. Example Files Hosted on Github https://github.com/shanbady/NLTK-Boston-Python-Meetup
  • 3. What is “Natural Language Processing”? Where is this stuff used? The Machine learning paradox A look at a few key terms Quick start – creating NLP apps in Python
  • 4.
  • 5. The goal is to enable machines to understand human language and extract meaning from text.
  • 6. It is a field of study which falls under the category of machine learning and more specifically computational linguistics.
  • 7.
  • 9. Context Little sister: What’s your name? Me: Uhh….Shankar..? Sister: Can you spell it? Me: yes. S-H-A-N-K-A…..
  • 10. Sister: WRONG! It’s spelled “I-T”
  • 11. Language translation is a complicated matter! Go to: http://babel.mrfeinberg.com/ The problem with communication is the illusion that it has occurred
  • 12. The problem with communication is the illusion that it has occurred Das Problem mit Kommunikation ist die Illusion, dass es aufgetreten ist The problem with communication is the illusion that it arose Das Problem mit Kommunikation ist die Illusion, dass es entstand The problem with communication is the illusion that it developed Das Problem mit Kommunikation ist die Illusion, die sie entwickelte The problem with communication is the illusion, which developed it
  • 13. The problem with communication is the illusion that it has occurred The problem with communication is the illusion, which developed it EPIC FAIL
  • 14. Police policepolice. The above statement is a complete sentence states that: “police officers police other police officers”. She yelled “police police police” “someone called for police”.
  • 17.
  • 18. Setting up NLTK Source downloads available for mac and linux as well as installable packages for windows. Currently only available for Python 2.5 – 2.6 http://www.nltk.org/download `easy_install nltk` Prerequisites NumPy SciPy
  • 19. First steps NLTK comes with packages of corpora that are required for many modules. Open a python interpreter: importnltk nltk.download() If you do not want to use the downloader with a gui (requires TKInter module) Run: python -m nltk.downloader <name of package or “all”>
  • 20. You may individually select packages or download them in bulk.
  • 21. Let’s dive into some code!
  • 22. Part of Speech Tagging fromnltkimportpos_tag,word_tokenize sentence1='this is a demo that will show you how to detects parts of speech with little effort using NLTK!' tokenized_sent=word_tokenize(sentence1) printpos_tag(tokenized_sent) [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('demo', 'NN'), ('that', 'WDT'), ('will', 'MD'), ('show', 'VB'), ('you', 'PRP'), ('how', 'WRB'), ('to', 'TO'), ('detects', 'NNS'), ('parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('with', 'IN'), ('little', 'JJ'), ('effort', 'NN'), ('using', 'VBG'), ('NLTK', 'NNP'),('!', '.')]
  • 23. Penn Bank Part-of-Speech Tags Source: http://www.ai.mit.edu/courses/6.863/tagdef.html
  • 24. NLTK Text nltk.clean_html(rawhtml) from nltk.corpus import brown from nltk import Text brown_words = brown.words(categories='humor') brownText = Text(brown_words) brownText.collocations() brownText.count("car") brownText.concordance("oil") brownText.dispersion_plot(['car', 'document', 'funny', 'oil']) brownText.similar('humor')
  • 25. Find similar terms (word definitions) using Wordnet importnltk fromnltk.corpusimportwordnetaswn synsets=wn.synsets('phone') print[str(syns.definition)forsynsinsynsets] 'electronic equipment that converts sound into electrical signals that can be transmitted over distances and then converts received signals back into sounds‘ '(phonetics) an individual sound unit of speech without concern as to whether or not it is a phoneme of some language‘ 'electro-acoustic transducer for converting electric signals into sounds; it is held over or inserted into the ear‘ 'get or try to get into communication (with someone) by telephone'
  • 26. from nltk.corpus import wordnet as wn synsets = wn.synsets('phone') print [str( syns.definition ) for syns in synsets] “syns.definition” can be modified to output hypernyms , meronyms, holonyms etc:
  • 28. Feeling lonely? Eliza is there to talk to you all day! What human could ever do that for you?? fromnltk.chatimporteliza eliza.eliza_chat() ……starts the chatbot Therapist --------- Talk to the program by typing in plain English, using normal upper- and lower-case letters and punctuation. Enter "quit" when done. ======================================================================== Hello. How are you feeling today?
  • 29. Englisch to German to Englisch to German…… fromnltk.bookimport* babelize_shell() Babel> the internet is a series of tubes Babel> german Babel> run 0> the internet is a series of tubes 1> das Internet ist eine Reihe SchlSuche 2> the Internet is a number of hoses 3> das Internet ist einige SchlSuche 4> the Internet is some hoses Babel>
  • 30. Let’s build something even cooler
  • 31. Lets write a spam filter! A program that analyzes legitimate emails “Ham” as well as “spam” and learns the features that are associated with each. Once trained, we should be able to run this program on incoming mail and have it reliably label each one with the appropriate category.
  • 32. What you will need NLTK (of course) as well as the “stopwords” corpus A good dataset of emails; Both spam and ham Patience and a cup of coffee (these programs tend to take a while to complete)
  • 33. Finding Great Data: The Enron Emails A dataset of 200,000+ emails made publicly available in 2003 after the Enron scandal. Contains both spam and actual corporate ham mail. For this reason it is one of the most popular datasets used for testing and developing anti-spam software. The dataset we will use is located at the following url: http://labs-repos.iit.demokritos.gr/skel/i-config/downloads/enron-spam/preprocessed/ It contains a list of archived files that contain plaintext emails in two folders , Spam and Ham.
  • 34. Extract one of the archives from the site into your working directory. Create a python script, lets call it “spambot.py”. Your working directory should contain the “spambot” script and the folders “spam” and “ham”. “Spambot.py” fromnltkimportword_tokenize,WordNetLemmatizer,NaiveBayesClassifierbr />,classify,MaxentClassifier fromnltk.corpusimportstopwords importrandom importos,glob,re
  • 35. “Spambot.py” (continued) wordlemmatizer = WordNetLemmatizer() commonwords = stopwords.words('english') hamtexts=[] spamtexts=[] forinfileinglob.glob(os.path.join('ham/','*.txt')): text_file=open(infile,"r") hamtexts.extend(text_file.read()) text_file.close() forinfileinglob.glob(os.path.join('spam/','*.txt')): text_file=open(infile,"r") spamtexts.extend(text_file.read()) text_file.close() load common English words into list start globbing the files into the appropriate lists
  • 36. “Spambot.py” (continued) mixedemails=([(email,'spam')foremailinspamtexts] mixedemails+= [(email,'ham')foremailinhamtexts]) random.shuffle(mixedemails) label each item with the appropriate label and store them as a list of tuples From this list of random but labeled emails, we will defined a “feature extractor” which outputs a feature set that our program can use to statistically compare spam and ham. lets give them a nice shuffle
  • 37. “Spambot.py” (continued) defemail_features(sent): features={} wordtokens=[wordlemmatizer.lemmatize(word.lower())forwordinword_tokenize(sent)] forwordinwordtokens: ifwordnotincommonwords: features[word]=True returnfeatures featuresets=[(email_features(n),g)for(n,g)inmixedemails] Normalize words If the word is not a stop-word then lets consider it a “feature” Let’s run each email through the feature extractor and collect it in a “featureset” list
  • 38.
  • 39. To use features that are non-binary such as number values, you must convert it to a binary feature. This process is called “binning”.
  • 40.
  • 41. “Spambot.py” (continued) print classifier.labels() This will output the labels that our classifier will use to tag new data ['ham', 'spam'] The purpose of create a “training set” and a “test set” is to test the accuracy of our classifier on a separate sample from the same data source. printclassify.accuracy(classifier,test_set) 0.75623
  • 43. “Spambot.py” (continued) WhileTrue: featset=email_features(raw_input("Enter text to classify: ")) printclassifier.classify(featset) We can now directly input new email and have it classified as either Spam or Ham
  • 44.
  • 45.
  • 46.
  • 47. Two Labels Fresh rotten Flixster, Rotten Tomatoes, the Certified Fresh Logo are trademarks or registered trademarks of Flixster, Inc. in the United States and other countries
  • 48. easy_installtweetstream importtweetstream words=["green lantern"] withtweetstream.TrackStream(“yourusername",“yourpassword",words)asstream: fortweetinstream: tweettext=tweet.get('text','') tweetuser=tweet['user']['screen_name'].encode('utf-8') featset=review_features(tweettext) printtweetuser,': ',tweettext printtweetuser,“ thinks Green Lantern is ",classifier.classify(featset),br />"----------------------------"
  • 49.
  • 50. Further Resources: I will be uploading this presentation to my site: http://www.shankarambady.com “Natural Language Processing with Python” by Steven Bird, Ewan Klein, and Edward Loper http://www.nltk.org/book API reference : http://nltk.googlecode.com/svn/trunk/doc/api/index.html Great NLTK blog: http://streamhacker.com/
  • 51. Thank you for watching! Special thanks to: John Verostek and Microsoft!

Notes de l'éditeur

  1. You may also do #print [str(syns.examples) for syns in synsets] for usage examples of each definition