SlideShare une entreprise Scribd logo
1  sur  16
Télécharger pour lire hors ligne
Deep Learning
and Text Mining
Will Stanton
Ski Hackathon Kickoff Ceremony, Feb 28, 2015
We have a problem
● At Return Path, we process billions of emails
a year, from tons of senders
● We want to tag and cluster senders
○ Industry verticals (e-commerce, apparel, travel, etc.)
○ Type of customers they sell to (luxury, soccer moms,
etc.)
○ Business model (daily deals, flash sales, etc.)
● It’s too much to do by hand!
What to do?
● Standard approaches aren’t great
○ Bag of words classification model (document-term matrix, LSA, LDA)
■ Have to manually label lots of cases first
■ Difficult with lots of data (especially LDA)
○ Bag of words clustering
■ Can’t easily put one company into multiple categories (ie. more
general tagging)
■ Needs lots of tuning
● How about deep learning neural networks?
○ Very trendy. Let’s try it!
Neural Networks
Input
Layer First
Hidden
Layer
Second
Hidden
Layer
Output
Layer
Inputs
x = (x1, x2, x3)
Output
y
● Machine learning algorithms
modeled after the way the human
brain works
● Learn patterns and structure by
passing training data through
“neurons”
● Useful for classification,
regression, feature extraction, etc.
Deep Learning
● Neural networks with lots of hidden layers
(hundreds)
● State of the art for machine translation, facial
recognition, text classification, speech
recognition
○ Tasks with real deep structure, that humans do
automatically but computers struggle with
○ Should be good for company tagging!
Distributed Representations
Pixels
EdgesCollection of faces
Shapes
Typical facial types
(features)
● Human brain uses distributed representations
● We can use deep learning to do the same thing with
words (letters -> words -> phrases -> sentences -> …)
Deep Learning Challenges
● Computationally difficult to train (ie. slow)
○ Each hidden layer means more parameters
○ Each feature means more parameters
● Real human-generated text has a near-
infinite number of features and data
○ ie. slow would be a problem
● Solution: use word2vec
word2vec
● Published by scientists at Google in 2013
● Python implementation in 2014
○ gensim library
● Learns distributed vector representations
of words (“word to vec”) using a neural net
○ NOTE for hardcore experts: word2vec does not strictly or necessarily train a deep neural
net, but it uses deep learning technology (distributed representations, backpropagation,
stochastic gradient descent, etc.) and is based on a series of deep learning papers
What is the output?
● Distributed vector representations of words
○ each word is encoded as a vector of floats
○ vecqueen
= (0.2, -0.3, .7, 0, … , .3)
○ vecwoman
= (0.1, -0.2, .6, 0.1, … , .2)
○ length of the vectors = dimension of the word
representation
○ key concept of word2vec: words with similar
vectors have a similar meaning (context)
word2vec Features
● Very fast and scalable
○ Google trained it on 100’s of billions of words
● Uncovers deep latent structure of word
relationships
○ Can solve analogies like King::Man as Queen::? or
Paris::France as Berlin::?
○ Can solve “one of these things is not like another”
○ Can be used for machine translation or automated
sentence completion
How does it work?
● Feed the algorithm (lots of) sentences
○ totally unsupervised learning
● word2vec trains a neural net that encodes
the context of words within sentences
○ “Skip-grams”: what is the probability that the word
“queen” appears 1 word after “woman”, 2 words
after, etc.
word2vec at Return Path
● At Return Path, we implemented word2vec
on data from our Consumer Data Stream
○ billions of email subject lines from millions of users
○ fed 30 million unique subject lines (300m words) and
sending domains into word2vec (using Python)
Lots of
subject
lines
word2vec word vectors insights
Grouping companies with word2vec
● Find daily deals sites like Groupon
[word for (word, score) in model.most_similar('groupon.com', topn = 100) if
'.com' in word]
['grouponmail.com.au', 'specialicious.com', 'livingsocial.com', 'deem.com',
'hitthedeals.com', 'grabone-mail-ie.com', 'grabone-mail.com', 'kobonaty.com',
'deals.com.au', 'coupflip.com', 'ouffer.com', 'wagjag.com']
● Find apparel sites like Gap
[word for (word, score) in model.most_similar('gap.com', topn = 100) if '.com'
in word]
['modcloth.com', 'bananarepublic.com', 'shopjustice.com', 'thelimited.com',
'jcrew.com', 'gymboree.com', 'abercrombie-email.com', 'express.com',
'hollister-email.com', 'abercrombiekids-email.com', 'thredup.com',
'neimanmarcusemail.com']
More word2vec applications
● Find relationships between products
● model.most_similar(positive=['iphone', 'galaxy'], negative=['apple']) =
‘samsung’
● ie. iphone::apple as galaxy::? samsung!
● Distinguish different companies
● model.doesnt_match(['sheraton','westin','aloft','walmart']) = ‘walmart’
● ie. Wal Mart does not match Sheraton, Westin, and Aloft hotels
● Other possibilities
○ Find different companies with similar marketing copy
○ Automatically construct high-performing subject lines
○ Many more...
Try it yourself
● C implementation exists, but I recommend
Python
○ gensim library: https://radimrehurek.com/gensim/
○ tutorial:http://radimrehurek.
com/gensim/models/word2vec.html
○ webapp to try it out as part of tutorial
○ Pretrained Google News and Freebase models:
https://code.google.com/p/word2vec/
○ Only takes 10 lines of code to get started!
Thanks for listening!
● Many thanks to:
○ Data Science Association and Level 3
○ Michael Walker for organizing
● Slides posted on http://will-stanton.com/
● Email me at will@will-stanton.com
● Return Path is hiring! Voted #2 best
midsized company to work for in the country
http://careers.returnpath.com/

Contenu connexe

Tendances

Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Information Retrieval with Deep Learning
Information Retrieval with Deep LearningInformation Retrieval with Deep Learning
Information Retrieval with Deep LearningAdam Gibson
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Recurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas MikolovRecurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas MikolovBhaskar Mitra
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Daniele Di Mitri
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersRoelof Pieters
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Roelof Pieters
 
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)Márton Miháltz
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结君 廖
 
Learning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionaryLearning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionaryRoelof Pieters
 
Talk from NVidia Developer Connect
Talk from NVidia Developer ConnectTalk from NVidia Developer Connect
Talk from NVidia Developer ConnectAnuj Gupta
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Anoop Deoras
 
Practical Deep Learning for NLP
Practical Deep Learning for NLP Practical Deep Learning for NLP
Practical Deep Learning for NLP Textkernel
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLPSatyam Saxena
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationMarina Santini
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMDivya Gera
 
Natural language processing techniques transition from machine learning to de...
Natural language processing techniques transition from machine learning to de...Natural language processing techniques transition from machine learning to de...
Natural language processing techniques transition from machine learning to de...Divya Gera
 

Tendances (19)

Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Information Retrieval with Deep Learning
Information Retrieval with Deep LearningInformation Retrieval with Deep Learning
Information Retrieval with Deep Learning
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Recurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas MikolovRecurrent networks and beyond by Tomas Mikolov
Recurrent networks and beyond by Tomas Mikolov
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Deep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ersDeep Learning, an interactive introduction for NLP-ers
Deep Learning, an interactive introduction for NLP-ers
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
Deep Learning Architectures for NLP (Hungarian NLP Meetup 2016-09-07)
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
Learning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionaryLearning to understand phrases by embedding the dictionary
Learning to understand phrases by embedding the dictionary
 
Talk from NVidia Developer Connect
Talk from NVidia Developer ConnectTalk from NVidia Developer Connect
Talk from NVidia Developer Connect
 
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019Tutorial on Deep Learning in Recommender System, Lars summer school 2019
Tutorial on Deep Learning in Recommender System, Lars summer school 2019
 
Practical Deep Learning for NLP
Practical Deep Learning for NLP Practical Deep Learning for NLP
Practical Deep Learning for NLP
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
Lecture: Word Sense Disambiguation
Lecture: Word Sense DisambiguationLecture: Word Sense Disambiguation
Lecture: Word Sense Disambiguation
 
Word embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTMWord embeddings, RNN, GRU and LSTM
Word embeddings, RNN, GRU and LSTM
 
Natural language processing techniques transition from machine learning to de...
Natural language processing techniques transition from machine learning to de...Natural language processing techniques transition from machine learning to de...
Natural language processing techniques transition from machine learning to de...
 

En vedette

Deep learning for text analytics
Deep learning for text analyticsDeep learning for text analytics
Deep learning for text analyticsErik Tromp
 
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.ioDataconomy Media
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And ClusteringDatamining Tools
 
Thinking about nlp
Thinking about nlpThinking about nlp
Thinking about nlpPan Xiaotong
 
NLP@Work Conference: email persuasion
NLP@Work Conference: email persuasionNLP@Work Conference: email persuasion
NLP@Work Conference: email persuasionevolutionpd
 
Functional Reactive Programming in Clojurescript
Functional Reactive Programming in ClojurescriptFunctional Reactive Programming in Clojurescript
Functional Reactive Programming in ClojurescriptLeonardo Borges
 
AI Reality: Where are we now? Data for Good? - Bill Boorman
AI Reality: Where are we now? Data for Good? - Bill  BoormanAI Reality: Where are we now? Data for Good? - Bill  Boorman
AI Reality: Where are we now? Data for Good? - Bill BoormanTextkernel
 
Using Deep Learning And NLP To Predict Performance From Resumes
Using Deep Learning And NLP To Predict Performance From ResumesUsing Deep Learning And NLP To Predict Performance From Resumes
Using Deep Learning And NLP To Predict Performance From ResumesBenjamin Taylor
 
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~antibayesian 俺がS式だ
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)Sumit Raj
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Pythonanntp
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 
RでTwitterテキストマイニング
RでTwitterテキストマイニングRでTwitterテキストマイニング
RでTwitterテキストマイニングYudai Shinbo
 
RではじめるTwitter解析
RではじめるTwitter解析RではじめるTwitter解析
RではじめるTwitter解析Takeshi Arabiki
 
Online algorithms in Machine Learning
Online algorithms in Machine LearningOnline algorithms in Machine Learning
Online algorithms in Machine LearningAmrinder Arora
 
ElasticSearch for data mining
ElasticSearch for data mining ElasticSearch for data mining
ElasticSearch for data mining William Simms
 
SVM&R with Yaruo!!
SVM&R with Yaruo!!SVM&R with Yaruo!!
SVM&R with Yaruo!!guest8ee130
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
 

En vedette (20)

Deep learning for text analytics
Deep learning for text analyticsDeep learning for text analytics
Deep learning for text analytics
 
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io
"TextMining with ElasticSearch", Saskia Vola, CEO at textminers.io
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Thinking about nlp
Thinking about nlpThinking about nlp
Thinking about nlp
 
NLP
NLPNLP
NLP
 
NLP@Work Conference: email persuasion
NLP@Work Conference: email persuasionNLP@Work Conference: email persuasion
NLP@Work Conference: email persuasion
 
Functional Reactive Programming in Clojurescript
Functional Reactive Programming in ClojurescriptFunctional Reactive Programming in Clojurescript
Functional Reactive Programming in Clojurescript
 
AI Reality: Where are we now? Data for Good? - Bill Boorman
AI Reality: Where are we now? Data for Good? - Bill  BoormanAI Reality: Where are we now? Data for Good? - Bill  Boorman
AI Reality: Where are we now? Data for Good? - Bill Boorman
 
Using Deep Learning And NLP To Predict Performance From Resumes
Using Deep Learning And NLP To Predict Performance From ResumesUsing Deep Learning And NLP To Predict Performance From Resumes
Using Deep Learning And NLP To Predict Performance From Resumes
 
Monads in Clojure
Monads in ClojureMonads in Clojure
Monads in Clojure
 
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~
あんちべのすべらない話~俺のツイートがこんなにウケないはずがない~
 
Natural language processing (Python)
Natural language processing (Python)Natural language processing (Python)
Natural language processing (Python)
 
Natural Language Processing and Python
Natural Language Processing and PythonNatural Language Processing and Python
Natural Language Processing and Python
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
RでTwitterテキストマイニング
RでTwitterテキストマイニングRでTwitterテキストマイニング
RでTwitterテキストマイニング
 
RではじめるTwitter解析
RではじめるTwitter解析RではじめるTwitter解析
RではじめるTwitter解析
 
Online algorithms in Machine Learning
Online algorithms in Machine LearningOnline algorithms in Machine Learning
Online algorithms in Machine Learning
 
ElasticSearch for data mining
ElasticSearch for data mining ElasticSearch for data mining
ElasticSearch for data mining
 
SVM&R with Yaruo!!
SVM&R with Yaruo!!SVM&R with Yaruo!!
SVM&R with Yaruo!!
 
PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning" PyData 2015 Keynote: "A Systems View of Machine Learning"
PyData 2015 Keynote: "A Systems View of Machine Learning"
 

Similaire à Deep Learning and Text Mining

Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Chris Fregly
 
Cloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxCloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxSahithiGurlinka
 
data-science-pdf-16588.pdf
data-science-pdf-16588.pdfdata-science-pdf-16588.pdf
data-science-pdf-16588.pdfvkharish18
 
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptxOWAISSALAUDDINKHAN
 
Cloud Study Jam[1st OCT] gdscgtbit.pptx
Cloud Study Jam[1st OCT] gdscgtbit.pptxCloud Study Jam[1st OCT] gdscgtbit.pptx
Cloud Study Jam[1st OCT] gdscgtbit.pptxGDSCGTBIT
 
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...MongoDB
 
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)AWS re:Invent 2016: Deep Learning in Alexa (MAC202)
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)Amazon Web Services
 
Introducción a NLP (Natural Language Processing) en Azure
Introducción a NLP (Natural Language Processing) en AzureIntroducción a NLP (Natural Language Processing) en Azure
Introducción a NLP (Natural Language Processing) en AzurePlain Concepts
 
National Wildlife Federation- OMS- Dreamcore 2011
National Wildlife Federation- OMS- Dreamcore 2011National Wildlife Federation- OMS- Dreamcore 2011
National Wildlife Federation- OMS- Dreamcore 2011nonlinear creations
 
MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀GDSCNiT
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
 
Brownfield Domain Driven Design
Brownfield Domain Driven DesignBrownfield Domain Driven Design
Brownfield Domain Driven DesignNicolò Pignatelli
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Turi, Inc.
 
Knowledge graphs + Chatbots with Neo4j
Knowledge graphs + Chatbots with Neo4jKnowledge graphs + Chatbots with Neo4j
Knowledge graphs + Chatbots with Neo4jChristophe Willemsen
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015lbishal
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用台灣資料科學年會
 
Labeling all the Things with the WDI Skill Labeler
Labeling all the Things with the WDI Skill Labeler Labeling all the Things with the WDI Skill Labeler
Labeling all the Things with the WDI Skill Labeler Kwame Porter Robinson
 
Building an An AI Startup with MongoDB at x.ai
Building an An AI Startup with MongoDB at x.aiBuilding an An AI Startup with MongoDB at x.ai
Building an An AI Startup with MongoDB at x.aiMongoDB
 

Similaire à Deep Learning and Text Mining (20)

Word2 vec
Word2 vecWord2 vec
Word2 vec
 
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
Spark SQL Catalyst Optimizer, Custom Expressions, UDFs - Advanced Spark and T...
 
Cloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxCloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptx
 
data-science-pdf-16588.pdf
data-science-pdf-16588.pdfdata-science-pdf-16588.pdf
data-science-pdf-16588.pdf
 
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx
[GDSC-GNIOT] Google Cloud Study Jams Day 2- Cloud AI GenAI Overview.pptx
 
Cloud Study Jam[1st OCT] gdscgtbit.pptx
Cloud Study Jam[1st OCT] gdscgtbit.pptxCloud Study Jam[1st OCT] gdscgtbit.pptx
Cloud Study Jam[1st OCT] gdscgtbit.pptx
 
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...
MongoDB Days Silicon Valley: Building an Artificial Intelligence Startup with...
 
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)AWS re:Invent 2016: Deep Learning in Alexa (MAC202)
AWS re:Invent 2016: Deep Learning in Alexa (MAC202)
 
Introducción a NLP (Natural Language Processing) en Azure
Introducción a NLP (Natural Language Processing) en AzureIntroducción a NLP (Natural Language Processing) en Azure
Introducción a NLP (Natural Language Processing) en Azure
 
National Wildlife Federation- OMS- Dreamcore 2011
National Wildlife Federation- OMS- Dreamcore 2011National Wildlife Federation- OMS- Dreamcore 2011
National Wildlife Federation- OMS- Dreamcore 2011
 
MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀
 
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
 
Brownfield Domain Driven Design
Brownfield Domain Driven DesignBrownfield Domain Driven Design
Brownfield Domain Driven Design
 
MongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDBMongoDB World 2019: Fast Machine Learning Development with MongoDB
MongoDB World 2019: Fast Machine Learning Development with MongoDB
 
Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015Strata London - Deep Learning 05-2015
Strata London - Deep Learning 05-2015
 
Knowledge graphs + Chatbots with Neo4j
Knowledge graphs + Chatbots with Neo4jKnowledge graphs + Chatbots with Neo4j
Knowledge graphs + Chatbots with Neo4j
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用
 
Labeling all the Things with the WDI Skill Labeler
Labeling all the Things with the WDI Skill Labeler Labeling all the Things with the WDI Skill Labeler
Labeling all the Things with the WDI Skill Labeler
 
Building an An AI Startup with MongoDB at x.ai
Building an An AI Startup with MongoDB at x.aiBuilding an An AI Startup with MongoDB at x.ai
Building an An AI Startup with MongoDB at x.ai
 

Dernier

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Dernier (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Deep Learning and Text Mining

  • 1. Deep Learning and Text Mining Will Stanton Ski Hackathon Kickoff Ceremony, Feb 28, 2015
  • 2. We have a problem ● At Return Path, we process billions of emails a year, from tons of senders ● We want to tag and cluster senders ○ Industry verticals (e-commerce, apparel, travel, etc.) ○ Type of customers they sell to (luxury, soccer moms, etc.) ○ Business model (daily deals, flash sales, etc.) ● It’s too much to do by hand!
  • 3. What to do? ● Standard approaches aren’t great ○ Bag of words classification model (document-term matrix, LSA, LDA) ■ Have to manually label lots of cases first ■ Difficult with lots of data (especially LDA) ○ Bag of words clustering ■ Can’t easily put one company into multiple categories (ie. more general tagging) ■ Needs lots of tuning ● How about deep learning neural networks? ○ Very trendy. Let’s try it!
  • 4. Neural Networks Input Layer First Hidden Layer Second Hidden Layer Output Layer Inputs x = (x1, x2, x3) Output y ● Machine learning algorithms modeled after the way the human brain works ● Learn patterns and structure by passing training data through “neurons” ● Useful for classification, regression, feature extraction, etc.
  • 5. Deep Learning ● Neural networks with lots of hidden layers (hundreds) ● State of the art for machine translation, facial recognition, text classification, speech recognition ○ Tasks with real deep structure, that humans do automatically but computers struggle with ○ Should be good for company tagging!
  • 6. Distributed Representations Pixels EdgesCollection of faces Shapes Typical facial types (features) ● Human brain uses distributed representations ● We can use deep learning to do the same thing with words (letters -> words -> phrases -> sentences -> …)
  • 7. Deep Learning Challenges ● Computationally difficult to train (ie. slow) ○ Each hidden layer means more parameters ○ Each feature means more parameters ● Real human-generated text has a near- infinite number of features and data ○ ie. slow would be a problem ● Solution: use word2vec
  • 8. word2vec ● Published by scientists at Google in 2013 ● Python implementation in 2014 ○ gensim library ● Learns distributed vector representations of words (“word to vec”) using a neural net ○ NOTE for hardcore experts: word2vec does not strictly or necessarily train a deep neural net, but it uses deep learning technology (distributed representations, backpropagation, stochastic gradient descent, etc.) and is based on a series of deep learning papers
  • 9. What is the output? ● Distributed vector representations of words ○ each word is encoded as a vector of floats ○ vecqueen = (0.2, -0.3, .7, 0, … , .3) ○ vecwoman = (0.1, -0.2, .6, 0.1, … , .2) ○ length of the vectors = dimension of the word representation ○ key concept of word2vec: words with similar vectors have a similar meaning (context)
  • 10. word2vec Features ● Very fast and scalable ○ Google trained it on 100’s of billions of words ● Uncovers deep latent structure of word relationships ○ Can solve analogies like King::Man as Queen::? or Paris::France as Berlin::? ○ Can solve “one of these things is not like another” ○ Can be used for machine translation or automated sentence completion
  • 11. How does it work? ● Feed the algorithm (lots of) sentences ○ totally unsupervised learning ● word2vec trains a neural net that encodes the context of words within sentences ○ “Skip-grams”: what is the probability that the word “queen” appears 1 word after “woman”, 2 words after, etc.
  • 12. word2vec at Return Path ● At Return Path, we implemented word2vec on data from our Consumer Data Stream ○ billions of email subject lines from millions of users ○ fed 30 million unique subject lines (300m words) and sending domains into word2vec (using Python) Lots of subject lines word2vec word vectors insights
  • 13. Grouping companies with word2vec ● Find daily deals sites like Groupon [word for (word, score) in model.most_similar('groupon.com', topn = 100) if '.com' in word] ['grouponmail.com.au', 'specialicious.com', 'livingsocial.com', 'deem.com', 'hitthedeals.com', 'grabone-mail-ie.com', 'grabone-mail.com', 'kobonaty.com', 'deals.com.au', 'coupflip.com', 'ouffer.com', 'wagjag.com'] ● Find apparel sites like Gap [word for (word, score) in model.most_similar('gap.com', topn = 100) if '.com' in word] ['modcloth.com', 'bananarepublic.com', 'shopjustice.com', 'thelimited.com', 'jcrew.com', 'gymboree.com', 'abercrombie-email.com', 'express.com', 'hollister-email.com', 'abercrombiekids-email.com', 'thredup.com', 'neimanmarcusemail.com']
  • 14. More word2vec applications ● Find relationships between products ● model.most_similar(positive=['iphone', 'galaxy'], negative=['apple']) = ‘samsung’ ● ie. iphone::apple as galaxy::? samsung! ● Distinguish different companies ● model.doesnt_match(['sheraton','westin','aloft','walmart']) = ‘walmart’ ● ie. Wal Mart does not match Sheraton, Westin, and Aloft hotels ● Other possibilities ○ Find different companies with similar marketing copy ○ Automatically construct high-performing subject lines ○ Many more...
  • 15. Try it yourself ● C implementation exists, but I recommend Python ○ gensim library: https://radimrehurek.com/gensim/ ○ tutorial:http://radimrehurek. com/gensim/models/word2vec.html ○ webapp to try it out as part of tutorial ○ Pretrained Google News and Freebase models: https://code.google.com/p/word2vec/ ○ Only takes 10 lines of code to get started!
  • 16. Thanks for listening! ● Many thanks to: ○ Data Science Association and Level 3 ○ Michael Walker for organizing ● Slides posted on http://will-stanton.com/ ● Email me at will@will-stanton.com ● Return Path is hiring! Voted #2 best midsized company to work for in the country http://careers.returnpath.com/