SlideShare une entreprise Scribd logo
1  sur  19
Télécharger pour lire hors ligne
Natural Language Processing with Mahout
Casey Stella
June 26, 2013
1 of 19
Table of Contents
Preliminaries
Mahout → An Overview
NLP with Mahout → An Example
Ingesting Data
Topic Models
Questions & Bibliography
2 of 19
Introduction
• I’m a Systems Architect at Hortonworks
• Prior to this, I’ve spent my time and had a lot of fun
◦ Doing data mining on medical data at Explorys using the Hadoop
ecosystem
◦ Doing signal processing on seismic data at Ion Geophysical using
MapReduce
◦ Being a graduate student in the Math department at Texas A&M in
algorithmic complexity theory
• I’m going to talk about Natural Language Processing in the
Hadoop ecosystem.
• I’m going to go over Apache Mahout in general and then focus on
Topic Models.
3 of 19
Natural Language Processing
Peter Norvig, the Director of Research at Google, said in the Amazon
book review for the book “Statistical Natural Language Processing”1
by Manning and Schuetze
If someone told me I had to make a million bucks in one year,
and I could only refer to one book to do it, I’d grab a copy of
this book and start a web text-processing company.
1
http://www.amazon.com/review/R3GSYXSKRU8V174 of 19
Apache Mahout
• Apache Mahout is a
◦ Library of stand-alone scalable and distributed machine learning
algorithms
◦ Library of high performance math and primitive collections useful in
machine learning
◦ Library of primitive distributed statistical and linear algebraic
operations useful in machine learning
• The distributed algorithms are able to be run on Hadoop via a set
of stand-alone helper utilities as well as providing an API.
5 of 19
Selection of Available Algorithms
Type Algorithm
Linear Algebra Stochastic Gradient Descent
Linear Algebra Stochastic Singular Value Decomposition
Classification Random Forests
Classification Naïve Bayesian
Classification Hidden Markov Models
Clustering Normal and Fuzzy K-Means
Clustering Expectation Maximization
Clustering Dirichlet Process Clustering
Clustering Latent Dirichlet Allocation
Clustering Spectral Clustering
Clustering MinHash Clustering
Pattern Mining Parallel FP Growth
6 of 19
Ingesting a Corpus of Documents
• Mahout provides a number of utilities to allow one to ingest data
into Hadoop in the format expected by the ML algorithms
• The basic pattern is
◦ Convert the documents to SequenceFiles via the seqdirectory
command
◦ Convert those sequence files to a set of sparse vectors using
seq2sparse by computing a dictionary, assigning integers for words,
computing feature weights and creating a vector for each document
using the word-integer mapping and feature-weight.
◦ Convert the keys associated with the sparse vectors to incrementing
integers using the rowid command
7 of 19
Converting a Sequence File to a set of Vectors
• Create a sparse set of vectors using the mahout utility seq2sparse.
• The seq2sparse command allows you to specify:
-wt The weighting method used: tf or tfidf
--minSupport The minimum number of times a term has
to occur to exist in the document
--norm An integer k > 0 indicating the Lk metric to
be used to normalize the vectors.
-a The analyzer to use.
8 of 19
Topic Models
• Topic modeling is intended to find a set of broad themes or “topics”
from a corpus of documents.
• Documents contain multiple topics and, indeed, can be considered
a “mixture” of topics.
• Probabalistic topic modeling algorithms attempt to determine the
set of topics and mixture of topics per-document in an
unsupervised way.
• Consider a collection of newspaper articles, topics may be “sports”,
“politics”, etc.
9 of 19
High Level: Latent Dirichlet Allocation
• Topics are determined by looking at how often words appear
together in the same document
• Each document is associated a probability distribution over the set
of topics
• Latent Dirichlet Allocation (LDA) is a statistical topic model which
learns
◦ what the topics are
◦ which documents employ said topics and at what distribution
10 of 19
Latent Dirichlet Allocation: A Parable
Tim is the owner of an independent record shop and is interested in
finding the natural genres of music by considering natural groupings
based on what people buy. So, Tim logs the records people buy and
who buys them. Tim does not know the genres and he doesn’t know
the different genres each customer likes.
Tim chooses that he wants to learn K genres and let a set of records
define a given genre. He can then assign a label to the genre by
eyeballing the records in the genres.
• Words correspond to records
• Documents correspond to people
• Topics correspond to genres and are represented by {records}.
11 of 19
Latent Dirichlet Allocation: A Parable
Tim starts by making a guess as to why records are bought by certain
people. For example, he assumes that customers who buy record A
have interest in the same genre and therefore record A must be a
representative of that genre. Of course, this assumption is very likely
to be incorrect, so he needs to improve in the face of better data.
12 of 19
Latent Dirichlet Allocation: A Parable
He comes up with the following scheme:
• Pick a record and a customer who bought that record.
• Guess why the record was bought by the customer.
• Other records that the customer bought are likely of the same
genre. In other words, the more records that are bought by the
same customer, the more likely that those records are part of the
same genre.
• Make a new guess as to why the customer bought that record,
choosing a genre with some probability according to how likely Tim
thinks it is.
13 of 19
Latent Dirichlet Allocation: A Parable
Tim goes through each customer purchase over and over again. His
guesses keep getting better because he starts to notice patterns (i.e.
people who buy the same records are likely interested in the same
genres). Eventually he feels like he’s refined his model enough and is
ready to draw conclusions:
• For each genre, you can count the records assigned to that genre to
figure out what records are associated with the genre.
• By looking at the records in the genre, you can give the genre a
label.
• For each customer D and genre T, you can compute the
proportions of records who were bought by D because they liked
genre T. These give you a representation of customer D. For
example, you might learn that records bought by Jim consist of
10% “Easy Listening”, “20% Rap”, and 70% “Country & Western”.
14 of 19
LDA in Mahout
• Original implementation followed the original implementation
proposed by Blei et al. [2003].
◦ The problem, in part, the amount of information sent out of the
mappers scaled with the product of the number of terms in the
vocabulary and number of topics.
◦ On a 1 billion non-zero entry corpus, for 200 topics, original
implementation sent 2.5 TB of data from the mappers per iteration.
◦ Recently (as of 0.6 [MAHOUT-897]) moved to Collapsed Variational
Bayes by Asuncion et al. [2009].
◦ About 15x faster than original implementation
15 of 19
LDA in Mahout
• Input data is expected to be a sparse vector
• Ingestion Pipeline from a document directory
◦ seqdirectory → Transforms docs to a sequence file containing a
document per entry
◦ seq2sparse → Transforms to a sequence file of sparse vectors as well
as a dictionary of terms
◦ rowid → Transforms the keys of the sparse vectors into integers
• LDA expects the matrix as input as well as the dictionary
16 of 19
LDA in Mahout
• The cvb tool will run the LDA algorithm
• Input: sequence file of SparseVectors of word counts weighted by
term frequency
• Output: Topic model
• Parameters:
-k The number of topics
-nt The number of unique features defined by the input document
vectors
-maxIter The maximum number of iterations.
-mipd The maximum number of iterations per document
-a Smoothing for the document topic distribution; should be about
50
k , with k being the number of topics.
-e Smoothing for the term topic distribution
17 of 19
LDA in Mahout → Topics
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
bush sharon sharon sharon sharon
election administration disengagement bush disengagement
palestinians bush administration administration us
hand us election settlement bush
year endorse state plan west
fence leader bush palestinians election
roadmap iraq hand disengagement solution
arafat year conflict solution hand
have election year election day
conflict transfer us fence year
18 of 19
Questions & Bibliography
Thanks for your attention! Questions?
• Code and scripts for this talk at
http://github.com/cestella/NLPWithMahout
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
BIBLIOGRAPHY
A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing
and inference for topic models. In Proceedings of the Twenty-Fifth
Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages
27–34, Arlington, Virginia, United States, 2009. AUAI Press. ISBN
978-0-9749039-5-8.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J.
Mach. Learn. Res., 3:993–1022, Mar. 2003. ISSN 1532-4435.19 of 19

Contenu connexe

En vedette

評BanにおけるJubatus活用事例
評BanにおけるJubatus活用事例評BanにおけるJubatus活用事例
評BanにおけるJubatus活用事例JubatusOfficial
 
Jubatusをベースにしたオーディエンスの分析エンジンの紹介
Jubatusをベースにしたオーディエンスの分析エンジンの紹介Jubatusをベースにしたオーディエンスの分析エンジンの紹介
Jubatusをベースにしたオーディエンスの分析エンジンの紹介JubatusOfficial
 
標的型メール対策製品でのJubatus活用事例
標的型メール対策製品でのJubatus活用事例標的型メール対策製品でのJubatus活用事例
標的型メール対策製品でのJubatus活用事例JubatusOfficial
 
Jubatus 0.6.0 新機能紹介
Jubatus 0.6.0 新機能紹介Jubatus 0.6.0 新機能紹介
Jubatus 0.6.0 新機能紹介JubatusOfficial
 
Jubatus Casual Talks #2: 大量映像・画像のための異常値検知とクラス分類
Jubatus Casual Talks #2: 大量映像・画像のための異常値検知とクラス分類Jubatus Casual Talks #2: 大量映像・画像のための異常値検知とクラス分類
Jubatus Casual Talks #2: 大量映像・画像のための異常値検知とクラス分類Hirotaka Ogawa
 
Jubatusで始める機械学習
Jubatusで始める機械学習Jubatusで始める機械学習
Jubatusで始める機械学習JubatusOfficial
 
Jubatus Casual Talks #2 Jubatus開発者入門
Jubatus Casual Talks #2 Jubatus開発者入門Jubatus Casual Talks #2 Jubatus開発者入門
Jubatus Casual Talks #2 Jubatus開発者入門Shuzo Kashihara
 
世界征服を目指すJubatusだからこそ期待する5つのポイント
世界征服を目指すJubatusだからこそ期待する5つのポイント世界征服を目指すJubatusだからこそ期待する5つのポイント
世界征服を目指すJubatusだからこそ期待する5つのポイントNTT DATA OSS Professional Services
 
Jubatus Casual Talks #2 : 0.5.0の新機能(クラスタリング)の紹介
Jubatus Casual Talks #2 : 0.5.0の新機能(クラスタリング)の紹介Jubatus Casual Talks #2 : 0.5.0の新機能(クラスタリング)の紹介
Jubatus Casual Talks #2 : 0.5.0の新機能(クラスタリング)の紹介瑛 村下
 
センサデータ解析におけるJubatus活用事例
センサデータ解析におけるJubatus活用事例センサデータ解析におけるJubatus活用事例
センサデータ解析におけるJubatus活用事例JubatusOfficial
 
Jubatus分類器の活用テクニック
Jubatus分類器の活用テクニックJubatus分類器の活用テクニック
Jubatus分類器の活用テクニックJubatusOfficial
 
Jubatus使ってみた 作ってみたJubatus
Jubatus使ってみた 作ってみたJubatusJubatus使ってみた 作ってみたJubatus
Jubatus使ってみた 作ってみたJubatusJubatusOfficial
 
Jubatus Casual Talks #2 異常検知入門
Jubatus Casual Talks #2 異常検知入門Jubatus Casual Talks #2 異常検知入門
Jubatus Casual Talks #2 異常検知入門Shohei Hido
 
Large scale topic modeling
Large scale topic modelingLarge scale topic modeling
Large scale topic modelingSameer Wadkar
 
Natural Language Processing (NLP), Search and Wearable Technology
Natural Language Processing (NLP), Search and Wearable TechnologyNatural Language Processing (NLP), Search and Wearable Technology
Natural Language Processing (NLP), Search and Wearable Technologypixelbuilders
 
jubaanomalyでキーストローク認証
jubaanomalyでキーストローク認証jubaanomalyでキーストローク認証
jubaanomalyでキーストローク認証odasatoshi
 

En vedette (20)

評BanにおけるJubatus活用事例
評BanにおけるJubatus活用事例評BanにおけるJubatus活用事例
評BanにおけるJubatus活用事例
 
Jubatus on Mavericks
Jubatus on MavericksJubatus on Mavericks
Jubatus on Mavericks
 
Jubatusをベースにしたオーディエンスの分析エンジンの紹介
Jubatusをベースにしたオーディエンスの分析エンジンの紹介Jubatusをベースにしたオーディエンスの分析エンジンの紹介
Jubatusをベースにしたオーディエンスの分析エンジンの紹介
 
標的型メール対策製品でのJubatus活用事例
標的型メール対策製品でのJubatus活用事例標的型メール対策製品でのJubatus活用事例
標的型メール対策製品でのJubatus活用事例
 
Jubatus 0.6.0 新機能紹介
Jubatus 0.6.0 新機能紹介Jubatus 0.6.0 新機能紹介
Jubatus 0.6.0 新機能紹介
 
Jubatus Casual Talks #2: 大量映像・画像のための異常値検知とクラス分類
Jubatus Casual Talks #2: 大量映像・画像のための異常値検知とクラス分類Jubatus Casual Talks #2: 大量映像・画像のための異常値検知とクラス分類
Jubatus Casual Talks #2: 大量映像・画像のための異常値検知とクラス分類
 
Jubatusで始める機械学習
Jubatusで始める機械学習Jubatusで始める機械学習
Jubatusで始める機械学習
 
Jubatus Casual Talks #2 Jubatus開発者入門
Jubatus Casual Talks #2 Jubatus開発者入門Jubatus Casual Talks #2 Jubatus開発者入門
Jubatus Casual Talks #2 Jubatus開発者入門
 
世界征服を目指すJubatusだからこそ期待する5つのポイント
世界征服を目指すJubatusだからこそ期待する5つのポイント世界征服を目指すJubatusだからこそ期待する5つのポイント
世界征服を目指すJubatusだからこそ期待する5つのポイント
 
Jubatus Casual Talks #2 : 0.5.0の新機能(クラスタリング)の紹介
Jubatus Casual Talks #2 : 0.5.0の新機能(クラスタリング)の紹介Jubatus Casual Talks #2 : 0.5.0の新機能(クラスタリング)の紹介
Jubatus Casual Talks #2 : 0.5.0の新機能(クラスタリング)の紹介
 
センサデータ解析におけるJubatus活用事例
センサデータ解析におけるJubatus活用事例センサデータ解析におけるJubatus活用事例
センサデータ解析におけるJubatus活用事例
 
Jubatus分類器の活用テクニック
Jubatus分類器の活用テクニックJubatus分類器の活用テクニック
Jubatus分類器の活用テクニック
 
Jubatus casulatalks2
Jubatus casulatalks2Jubatus casulatalks2
Jubatus casulatalks2
 
A use case of online machine learning using Jubatus
A use case of online machine learning using JubatusA use case of online machine learning using Jubatus
A use case of online machine learning using Jubatus
 
Jubatus使ってみた 作ってみたJubatus
Jubatus使ってみた 作ってみたJubatusJubatus使ってみた 作ってみたJubatus
Jubatus使ってみた 作ってみたJubatus
 
Jubatus Casual Talks #2 異常検知入門
Jubatus Casual Talks #2 異常検知入門Jubatus Casual Talks #2 異常検知入門
Jubatus Casual Talks #2 異常検知入門
 
Large scale topic modeling
Large scale topic modelingLarge scale topic modeling
Large scale topic modeling
 
Natural Language Processing (NLP), Search and Wearable Technology
Natural Language Processing (NLP), Search and Wearable TechnologyNatural Language Processing (NLP), Search and Wearable Technology
Natural Language Processing (NLP), Search and Wearable Technology
 
Big data concept
Big data conceptBig data concept
Big data concept
 
jubaanomalyでキーストローク認証
jubaanomalyでキーストローク認証jubaanomalyでキーストローク認証
jubaanomalyでキーストローク認証
 

Plus de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-pyJamie (Taka) Wang
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 

Dernier (20)

201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 

Mahout and Scalable Natural Language Processing

  • 1. Natural Language Processing with Mahout Casey Stella June 26, 2013 1 of 19
  • 2. Table of Contents Preliminaries Mahout → An Overview NLP with Mahout → An Example Ingesting Data Topic Models Questions & Bibliography 2 of 19
  • 3. Introduction • I’m a Systems Architect at Hortonworks • Prior to this, I’ve spent my time and had a lot of fun ◦ Doing data mining on medical data at Explorys using the Hadoop ecosystem ◦ Doing signal processing on seismic data at Ion Geophysical using MapReduce ◦ Being a graduate student in the Math department at Texas A&M in algorithmic complexity theory • I’m going to talk about Natural Language Processing in the Hadoop ecosystem. • I’m going to go over Apache Mahout in general and then focus on Topic Models. 3 of 19
  • 4. Natural Language Processing Peter Norvig, the Director of Research at Google, said in the Amazon book review for the book “Statistical Natural Language Processing”1 by Manning and Schuetze If someone told me I had to make a million bucks in one year, and I could only refer to one book to do it, I’d grab a copy of this book and start a web text-processing company. 1 http://www.amazon.com/review/R3GSYXSKRU8V174 of 19
  • 5. Apache Mahout • Apache Mahout is a ◦ Library of stand-alone scalable and distributed machine learning algorithms ◦ Library of high performance math and primitive collections useful in machine learning ◦ Library of primitive distributed statistical and linear algebraic operations useful in machine learning • The distributed algorithms are able to be run on Hadoop via a set of stand-alone helper utilities as well as providing an API. 5 of 19
  • 6. Selection of Available Algorithms Type Algorithm Linear Algebra Stochastic Gradient Descent Linear Algebra Stochastic Singular Value Decomposition Classification Random Forests Classification Naïve Bayesian Classification Hidden Markov Models Clustering Normal and Fuzzy K-Means Clustering Expectation Maximization Clustering Dirichlet Process Clustering Clustering Latent Dirichlet Allocation Clustering Spectral Clustering Clustering MinHash Clustering Pattern Mining Parallel FP Growth 6 of 19
  • 7. Ingesting a Corpus of Documents • Mahout provides a number of utilities to allow one to ingest data into Hadoop in the format expected by the ML algorithms • The basic pattern is ◦ Convert the documents to SequenceFiles via the seqdirectory command ◦ Convert those sequence files to a set of sparse vectors using seq2sparse by computing a dictionary, assigning integers for words, computing feature weights and creating a vector for each document using the word-integer mapping and feature-weight. ◦ Convert the keys associated with the sparse vectors to incrementing integers using the rowid command 7 of 19
  • 8. Converting a Sequence File to a set of Vectors • Create a sparse set of vectors using the mahout utility seq2sparse. • The seq2sparse command allows you to specify: -wt The weighting method used: tf or tfidf --minSupport The minimum number of times a term has to occur to exist in the document --norm An integer k > 0 indicating the Lk metric to be used to normalize the vectors. -a The analyzer to use. 8 of 19
  • 9. Topic Models • Topic modeling is intended to find a set of broad themes or “topics” from a corpus of documents. • Documents contain multiple topics and, indeed, can be considered a “mixture” of topics. • Probabalistic topic modeling algorithms attempt to determine the set of topics and mixture of topics per-document in an unsupervised way. • Consider a collection of newspaper articles, topics may be “sports”, “politics”, etc. 9 of 19
  • 10. High Level: Latent Dirichlet Allocation • Topics are determined by looking at how often words appear together in the same document • Each document is associated a probability distribution over the set of topics • Latent Dirichlet Allocation (LDA) is a statistical topic model which learns ◦ what the topics are ◦ which documents employ said topics and at what distribution 10 of 19
  • 11. Latent Dirichlet Allocation: A Parable Tim is the owner of an independent record shop and is interested in finding the natural genres of music by considering natural groupings based on what people buy. So, Tim logs the records people buy and who buys them. Tim does not know the genres and he doesn’t know the different genres each customer likes. Tim chooses that he wants to learn K genres and let a set of records define a given genre. He can then assign a label to the genre by eyeballing the records in the genres. • Words correspond to records • Documents correspond to people • Topics correspond to genres and are represented by {records}. 11 of 19
  • 12. Latent Dirichlet Allocation: A Parable Tim starts by making a guess as to why records are bought by certain people. For example, he assumes that customers who buy record A have interest in the same genre and therefore record A must be a representative of that genre. Of course, this assumption is very likely to be incorrect, so he needs to improve in the face of better data. 12 of 19
  • 13. Latent Dirichlet Allocation: A Parable He comes up with the following scheme: • Pick a record and a customer who bought that record. • Guess why the record was bought by the customer. • Other records that the customer bought are likely of the same genre. In other words, the more records that are bought by the same customer, the more likely that those records are part of the same genre. • Make a new guess as to why the customer bought that record, choosing a genre with some probability according to how likely Tim thinks it is. 13 of 19
  • 14. Latent Dirichlet Allocation: A Parable Tim goes through each customer purchase over and over again. His guesses keep getting better because he starts to notice patterns (i.e. people who buy the same records are likely interested in the same genres). Eventually he feels like he’s refined his model enough and is ready to draw conclusions: • For each genre, you can count the records assigned to that genre to figure out what records are associated with the genre. • By looking at the records in the genre, you can give the genre a label. • For each customer D and genre T, you can compute the proportions of records who were bought by D because they liked genre T. These give you a representation of customer D. For example, you might learn that records bought by Jim consist of 10% “Easy Listening”, “20% Rap”, and 70% “Country & Western”. 14 of 19
  • 15. LDA in Mahout • Original implementation followed the original implementation proposed by Blei et al. [2003]. ◦ The problem, in part, the amount of information sent out of the mappers scaled with the product of the number of terms in the vocabulary and number of topics. ◦ On a 1 billion non-zero entry corpus, for 200 topics, original implementation sent 2.5 TB of data from the mappers per iteration. ◦ Recently (as of 0.6 [MAHOUT-897]) moved to Collapsed Variational Bayes by Asuncion et al. [2009]. ◦ About 15x faster than original implementation 15 of 19
  • 16. LDA in Mahout • Input data is expected to be a sparse vector • Ingestion Pipeline from a document directory ◦ seqdirectory → Transforms docs to a sequence file containing a document per entry ◦ seq2sparse → Transforms to a sequence file of sparse vectors as well as a dictionary of terms ◦ rowid → Transforms the keys of the sparse vectors into integers • LDA expects the matrix as input as well as the dictionary 16 of 19
  • 17. LDA in Mahout • The cvb tool will run the LDA algorithm • Input: sequence file of SparseVectors of word counts weighted by term frequency • Output: Topic model • Parameters: -k The number of topics -nt The number of unique features defined by the input document vectors -maxIter The maximum number of iterations. -mipd The maximum number of iterations per document -a Smoothing for the document topic distribution; should be about 50 k , with k being the number of topics. -e Smoothing for the term topic distribution 17 of 19
  • 18. LDA in Mahout → Topics Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 bush sharon sharon sharon sharon election administration disengagement bush disengagement palestinians bush administration administration us hand us election settlement bush year endorse state plan west fence leader bush palestinians election roadmap iraq hand disengagement solution arafat year conflict solution hand have election year election day conflict transfer us fence year 18 of 19
  • 19. Questions & Bibliography Thanks for your attention! Questions? • Code and scripts for this talk at http://github.com/cestella/NLPWithMahout • Find me at http://caseystella.com • Twitter handle: @casey_stella • Email address: cstella@hortonworks.com BIBLIOGRAPHY A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh. On smoothing and inference for topic models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 27–34, Arlington, Virginia, United States, 2009. AUAI Press. ISBN 978-0-9749039-5-8. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003. ISSN 1532-4435.19 of 19