Handling Text Data
INAFU6513 Lecture 7b
Lab 7: your 5-7 things
Get familiar with text processing
Get familiar with text data
Read text data
Classify text data
Analyse text data
Text processing
● Information retrieval
○ Search
○ Named entity recognition
● Learning
○ Classification
○ Clustering
○ Topic identification / topic following
○ Sentiment analysis
○ Network analysis (words, people etc)
Reading Text Data
Text Data Sources
● Messages (tweets, emails, SMS messages...)
● Document text (reports, blogposts, website text…)
● Audio (via speech-to-text processing)
● Images (via OCR)
Get your raw text data
# 'with' closes the file automatically, even if reading fails
with open('sipatext.txt', 'r') as fsipa:
    sipatext = fsipa.read()
print(sipatext)
Counting: Bags of Words
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform([sipatext])
print('{}'.format(word_counts))
print('{}'.format(count_vect.vocabulary_))
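To see what the vectorizer is counting, here is a minimal pure-Python sketch of a bag of words. It only approximates CountVectorizer's default tokenization (lowercase, tokens of two or more word characters); the sample sentence and the helper name bag_of_words are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word appears, ignoring word order."""
    # Lowercase, then keep runs of 2+ word characters (an approximation
    # of CountVectorizer's default tokenization).
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)


# Hypothetical sample text, not the SIPA data used in the slides.
sample = "The school of public affairs is a school of policy"
counts = bag_of_words(sample)
print(counts.most_common(3))
```

Note that the one-letter word "a" is dropped by this tokenization, so it never reaches the counts.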
Counting sets of words: N-Grams
● Pairs (or triples, 4s etc) of words
● Also: pairs etc of characters, e.g. ['mor', 'ore', 're ', 'e t', ' th', 'tha', 'han']
● Know your Ns:
○ ‘Unigram’ == 1-gram
○ ‘Bigram’ == 2-gram
○ ‘Trigram’ == 3-gram
count_vectn = CountVectorizer(ngram_range=(2, 2))
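As a rough illustration of both kinds of n-gram, the sketch below builds word bigrams and character trigrams by hand. The function names word_ngrams and char_ngrams are hypothetical helpers, not part of scikit-learn; the character example reproduces the trigram list shown above for the text 'more than'.

```python
def word_ngrams(text, n):
    """Return the n-word sequences in text, as tuples."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """Return the n-character substrings of text, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(word_ngrams("more than a feeling", 2))
print(char_ngrams("more than", 3))
```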
Stopwords
count_vect2 = CountVectorizer(stop_words='english')
word_counts2 = count_vect2.fit_transform([sipatext])
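The effect of stopword removal can be sketched without scikit-learn. The stop list below is a tiny, hypothetical one chosen for this example; the built-in 'english' list is much longer.

```python
# A tiny, hypothetical stop list for illustration only.
STOPWORDS = {"the", "a", "and", "of", "is"}


def remove_stopwords(words, stopwords=STOPWORDS):
    """Keep only the words that are not in the stop list."""
    return [w for w in words if w.lower() not in stopwords]


words = "The school of public affairs is a school of policy".split()
print(remove_stopwords(words))
```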
Term Frequencies
● TF: Term Frequency:
○ word count / (number of words in this document)
○ “How important (0 to 1) is this word to this document?”
● IDF: Inverse Document Frequency
○ 1 / (number of documents this word appears in)
○ “How rare is this word in this corpus?” (words that appear in many documents score low)
● TFIDF:
○ TF * IDF
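The definitions above can be worked through by hand. This is a minimal sketch that uses exactly the simplified formulas on this slide; the three-document corpus is invented. Libraries such as scikit-learn use a log-scaled, smoothed IDF instead, so their numbers will differ from these.

```python
def tf(word, doc):
    """Term frequency: word count / number of words in this document."""
    return doc.count(word) / len(doc)


def idf(word, corpus):
    """Inverse document frequency as on the slide: 1 / document count."""
    n_containing = sum(1 for doc in corpus if word in doc)
    return 1 / n_containing if n_containing else 0.0


def tfidf(word, doc, corpus):
    """TF * IDF for one word in one document."""
    return tf(word, doc) * idf(word, corpus)


# Hypothetical three-document corpus, already split into words.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
doc = corpus[0]
for word in ["the", "cat", "mat"]:
    print(word, round(tfidf(word, doc, corpus), 3))
```

In the first document 'the' is the most frequent word, but because it appears in every document its score falls below that of 'mat', which appears only here.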
Machine Learning with Text Data
Classifying Text
Words are a valid input to machine learning algorithms
In this example, we’re using:
● Newsgroup emails as samples (‘rows’ in our input)
● Words in each email as features (‘columns’)
● Newsgroup ids as targets
The 20newsgroups dataset
from sklearn.datasets import fetch_20newsgroups
cats = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=cats)
twenty_test = fetch_20newsgroups(subset='test', categories=cats)
Example email
Convert words to TFIDF scores
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
Fit your model to the data
from sklearn.naive_bayes import MultinomialNB
nb_classifier = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
Test your model
docs_test = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_test)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = nb_classifier.predict(X_new_tfidf)
for doc, category in zip(docs_test, predicted):
    print('{} => {}'.format(doc, twenty_train.target_names[category]))
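The count, TFIDF and classifier steps can also be chained so that new documents go through the same transformations automatically. The sketch below uses scikit-learn's Pipeline for this; the four-document training set and its two labels are invented so that the example runs without downloading the 20newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training set; labels: 0 = religion, 1 = graphics.
train_docs = [
    "god is love and faith",
    "church prayer and faith in god",
    "opengl renders graphics on the gpu",
    "the gpu graphics card renders images fast",
]
train_labels = [0, 0, 1, 1]
label_names = ["religion", "graphics"]

# Each step's output feeds the next: counts -> TFIDF scores -> classifier.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("nb", MultinomialNB()),
])
text_clf.fit(train_docs, train_labels)

new_docs = ["faith in god", "fast gpu graphics"]
predicted = text_clf.predict(new_docs)
for doc, label in zip(new_docs, predicted):
    print("{} => {}".format(doc, label_names[label]))
```

With a pipeline there is only one object to fit and one to call for predictions, which makes it harder to forget a transformation step on the test data.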
Text Clustering
We can also ‘cluster’ documents
● The ‘distance’ function is based on the words they have in common
Common machine learning algorithms for text clustering include:
● Latent Semantic Analysis
● Latent Dirichlet Allocation
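One simple way to turn 'words in common' into a distance is the Jaccard distance between the sets of words in two documents. This is only an illustrative choice with invented documents; Latent Semantic Analysis and Latent Dirichlet Allocation work differently, by learning topics rather than comparing raw word sets.

```python
def jaccard_distance(doc_a, doc_b):
    """1 - (shared words / all words): 0 = same vocabulary, 1 = nothing shared."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1 - len(words_a & words_b) / len(words_a | words_b)


# Hypothetical documents: two about graphics, one about religion.
docs = [
    "the gpu renders graphics",
    "graphics on the gpu",
    "faith and prayer",
]
print(jaccard_distance(docs[0], docs[1]))
print(jaccard_distance(docs[0], docs[2]))
```

A clustering algorithm would then group documents whose pairwise distances are small, here the two graphics documents.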
Text Analysis
Word collocation
● Create a graph (network visualisation) of words that appear together in
documents
● Use network analysis (later session) to show which pairs of words are
important in your documents
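Before drawing a graph, the word pairs have to be counted. The sketch below counts, for each pair of words, how many documents contain both; the documents are invented and the helper name cooccurrence_counts is hypothetical. The resulting counts could serve as edge weights in a network library.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each pair of words, how many documents contain both."""
    pair_counts = Counter()
    for doc in docs:
        # Sorting puts each pair in one canonical (alphabetical) order.
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    return pair_counts


# Hypothetical short documents.
docs = [
    "migrant crisis in europe",
    "europe responds to migrant crisis",
    "crisis talks continue",
]
edges = cooccurrence_counts(docs)
print(edges[("crisis", "migrant")])
print(edges[("crisis", "talks")])
```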
Sentiment analysis
● Mark documents (e.g. tweets) as having positive or negative sentiment
● Using machine learning
○ Training set: sentences, with 'positive'/'negative' for each sentence
● Using a sentiment dictionary
○ Positive or negative ‘score’ for each emotive word
○ Sentiment dictionaries can be used as 'seeds' for machine learning algorithms
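The dictionary approach can be sketched in a few lines. The word scores below are a hypothetical mini-dictionary invented for this example, and the 'neutral' label for a score of zero is an added assumption, not something from the slide.

```python
# Hypothetical mini sentiment dictionary: word -> score.
SENTIMENT = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}


def sentiment_score(text, lexicon=SENTIMENT):
    """Sum the scores of the emotive words found in text."""
    return sum(lexicon.get(w, 0) for w in text.lower().split())


def sentiment_label(text):
    """Map the total score to a label; zero is treated as neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweets = ["I love this great course", "awful weather and bad traffic", "it is tuesday"]
for tweet in tweets:
    print(tweet, "=>", sentiment_label(tweet))
```

A word-by-word sum like this ignores context, so a phrase such as 'not good' would still score as positive.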
Named Entity Recognition
● Find the names of people, organisations, locations etc in text
● Can use these to create social graphs (networks showing how people etc
connect to each other) and find ‘hubs’, ‘connectors’ etc
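Once names have been extracted, building the social graph is mostly bookkeeping. The sketch below assumes the entity recognition step has already run and starts from an invented list of names per document; a 'hub' is taken here to be simply the name linked to the most other names.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical NER output: the names found in each document.
entities_per_doc = [
    ["Alice", "Bob"],
    ["Alice", "Carol"],
    ["Alice", "Dave", "Bob"],
]

# Link every pair of names that appear in the same document.
neighbours = defaultdict(set)
for names in entities_per_doc:
    for a, b in combinations(set(names), 2):
        neighbours[a].add(b)
        neighbours[b].add(a)

# The hub is the name connected to the most other names.
hub = max(neighbours, key=lambda name: len(neighbours[name]))
print(hub, sorted(neighbours[hub]))
```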
Natural Language Processing
Natural Language Processing
● Understanding the grammar and meaning of text
● Useful for, e.g. translation between languages
● Python library: NLTK
Getting started with NLTK
import nltk
nltk.download()
Get text ready for NLTK processing
from nltk import word_tokenize
from nltk.text import Text
# 'with' closes the file automatically, even if reading fails
with open('example_data/sipatext.txt', 'r') as fsipa:
    sipatext = fsipa.read()
sipawords = word_tokenize(sipatext)
textlist = Text(sipawords)
NLTK: concordance
textlist.concordance('school')
textlist.similar('school')
textlist.common_contexts(['school', 'university'])
NLTK: word dispersion plots
from nltk.book import *
text2.dispersion_plot(['Elinor', 'Willoughby', 'Sophia'])
NLTK: Word Meanings
from nltk.corpus import wordnet as wn
word = 'class'
synset = wn.synsets(word)
print('Synset: {}\n'.format(synset))
for i in range(len(synset)):
    print('Meaning {}: {} {}'.format(i, synset[i].lemma_names(), synset[i].definition()))
NLTK: Synsets
NLTK: converting words into logic
from nltk import load_parser
parser = load_parser('grammars/book_grammars/simple-sem.fcfg', trace=0)
sentence = 'Angus gives a bone to every dog'
tokens = sentence.split()
for tree in parser.parse(tokens):
    print(tree.label()['SEM'])
Exercises
Exercises
Try the code in the 7.x series notebooks


Editor's notes

  1. Topic following: includes tracking things like hate speech (iHub Nairobi has done a lot of work on this topic). Verification: the Pheme project (http://www.pheme.eu/) is working on automatically tracking the veracity of stories.
  2. For speech recognition in Python, try https://pypi.python.org/pypi/SpeechRecognition/ or speech (http://code.activestate.com/recipes/579115-recognizing-speech-speech-to-text-with-the-python-/). We’re looking at two pieces of data today: the Wikipedia entry for SIPA, and a set of tweets about the #migrantcrisis, grabbed from the Twitter API by using notebook 3.1.
  3. Scikit-learn has some powerful text processing functions, including this one to separate text into words
  4. word n-grams; character n-grams
  5. Stopwords are common words (“the”, “a”, “and”) that don’t add to meaning, and might confuse outputs
  6. From http://stevenloria.com/finding-important-words-in-a-document-using-tf-idf/: If a word appears frequently in a document, it's important. Give the word a high score. But if a word appears in many documents, it's not a unique identifier. Give the word a low score.
  7. Aka computational linguistics
  8. More than you ever wanted to know about parsing sentences is at http://www.nltk.org/howto/featgram.html (see that page for details). simple-sem is a simple grammar, just for teaching; its whole specification is at https://github.com/nltk/nltk_teach/blob/master/examples/grammars/book_grammars/simple-sem.fcfg