Numerical Recipes in Hadoop
  Jake Mannix
  linkedin/in/jakemannix, twitter/pbrane
  jake.mannix@gmail.com, jmannix@apache.org
  Principal SDE, LinkedIn
  Committer, Apache Mahout, Zoie, Bobo-Browse, Decomposer
  Author, Lucene in Depth (Manning MM/DD/2010)
A Mathematician’s Apology
  What mathematical structure describes all of these?
  Full-text search: Score documents matching “query string”
  Collaborative filtering recommendation: Users who liked {those} also liked {these}
  (Social/web)-graph proximity: People/pages “close” to {this} are {these}
Matrix Multiplication!
Full-text Search
  Vector Space Model of IR
  Corpus as term-document matrix
  Query as bag-of-words vector
  Full-text search is just:
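The formula this slide ends on is not preserved in the transcript. A minimal sketch of the idea, assuming a tiny made-up corpus stored as a term-document matrix (rows are terms, columns are documents) with raw term counts as weights. The names `docs`, `vocab`, and `score_documents` are illustrative and do not come from the deck:

```python
# Hypothetical three-document corpus; weights are raw term counts.
docs = [
    "hadoop matrix multiplication",
    "matrix decomposition in hadoop",
    "recommenders for movies",
]

# Vocabulary: one row of the term-document matrix per distinct term.
vocab = sorted({term for doc in docs for term in doc.split()})
row_of = {term: i for i, term in enumerate(vocab)}

# Term-document matrix A: A[t][d] = count of term t in document d.
A = [[0] * len(docs) for _ in vocab]
for d, doc in enumerate(docs):
    for term in doc.split():
        A[row_of[term]][d] += 1


def score_documents(query):
    """Score every document as (A transposed) times the bag-of-words query vector."""
    q = [0] * len(vocab)
    for term in query.split():
        if term in row_of:  # terms outside the vocabulary contribute nothing
            q[row_of[term]] += 1
    return [sum(A[t][d] * q[t] for t in range(len(vocab))) for d in range(len(docs))]


scores = score_documents("hadoop matrix")
```

Real systems would use tf-idf or similar weights instead of raw counts; the matrix-vector structure stays the same.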
Collaborative Filtering
  User preference matrix (and item-item similarity matrix)
  Input user as vector of preferences
  (simple) Item-based CF recommendations are:
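A minimal sketch of one common reading of this slide, assuming the user preference matrix A has one row per user and one column per item, the item-item similarity matrix is the plain product of A's transpose with A, and recommendations for an input user are that similarity matrix times the user's preference vector. The slide's own symbols are not preserved; the data and names below are made up:

```python
# Hypothetical preferences: rows are users, columns are items (0 = no preference).
A = [
    [5, 3, 0],
    [4, 0, 1],
    [0, 2, 5],
]
n_users, n_items = len(A), len(A[0])

# Item-item similarity S = A^T A: S[i][j] sums the co-preference of items i and j.
S = [
    [sum(A[u][i] * A[u][j] for u in range(n_users)) for j in range(n_items)]
    for i in range(n_items)
]


def recommend(user_prefs):
    """Return S times the user's preference vector: one score per item."""
    return [sum(S[i][j] * user_prefs[j] for j in range(n_items)) for i in range(n_items)]


recs = recommend([1, 0, 0])  # a user who only expressed a preference for item 0
```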
Graph Proximity
  Adjacency matrix:
  2nd degree adjacency matrix:
  Input all of a user’s “friends” or page links:
  (weighted) distance measure of 1st – 3rd degree connections is then:
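A minimal sketch of this idea, assuming an unweighted, undirected graph, higher-degree adjacency obtained by repeated multiplication with the adjacency matrix, and hypothetical weights `w1`, `w2`, `w3` for the three degrees. The slide's actual formula and weights are not preserved:

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Hypothetical path graph 0 - 1 - 2 - 3, as a symmetric adjacency matrix.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]


def proximity(friends, w1=1.0, w2=0.5, w3=0.25):
    """Weighted walk counts for 1st, 2nd and 3rd degree connections.

    `friends` is the vector of first-degree connections. Multiplying by A
    once gives second-degree walks, twice gives third-degree walks, so the
    dense matrix powers are never materialized.
    """
    second = matvec(A, friends)
    third = matvec(A, second)
    return [w1 * a + w2 * b + w3 * c for a, b, c in zip(friends, second, third)]


# Node 0 has a single friend, node 1.
scores = proximity([0, 1, 0, 0])
```

Note that these are walk counts, so a node can score on its own second-degree entry (node 0 reaches itself via node 1).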
Dictionary Applications                  Linear Algebra
How does this help?
  In Search:
    Latent Semantic Indexing (LSI)
    probabilistic LSI
    Latent Dirichlet Allocation
  In Recommenders:
    Singular Value Decomposition
    Layered Restricted Boltzmann Machines (Deep Belief Networks)
  In Graphs:
    PageRank
    Spectral Decomposition / Spectral Clustering
Often use “Dimensional Reduction”
  To alleviate the sparse Big Data problem of “the curse of dimensionality”
  Used to improve recall and relevance
  In general: smooth the metric on your data set
New applications with Matrices
  If Search is finding doc-vector by:
  and users query with data represented: Q =
  Giving implicit feedback based on click-through per session: C =
… continued
  Then a product of C and Q has the form (docs-by-terms) for search!
  Approach has been used by Ted Dunning at Veoh (and probably others)
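A minimal sketch of one reading of these two slides, assuming Q is a sessions-by-terms matrix of query term counts and C is a sessions-by-docs matrix of clicks. Under that assumption the transpose of C times Q is docs-by-terms. The orientation of the slide's own matrices is not preserved, and the data is made up:

```python
# Hypothetical logs: 3 sessions, 3 query terms, 2 documents.

# Q[s][t] = how often term t appeared in the queries of session s.
Q = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
]

# C[s][d] = 1 if document d was clicked during session s.
C = [
    [1, 0],
    [0, 1],
    [1, 1],
]

n_sessions, n_terms, n_docs = len(Q), len(Q[0]), len(C[0])

# D = C^T Q is docs-by-terms: D[d][t] counts how often term t was queried
# in sessions that clicked document d, so each row can be indexed and
# searched like an ordinary document vector.
D = [
    [sum(C[s][d] * Q[s][t] for s in range(n_sessions)) for t in range(n_terms)]
    for d in range(n_docs)
]
```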
Linear Algebra performance tricks
  Naïve item-based recommendations:
    Calculate item similarity matrix:
    Calculate item recs:
  Express in one step:
  In matrix notation:
  Re-writing as:
  Here each user “v” has a vector of preferences, and each item “i” has a vector of preferences.
  The result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
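The slide's formulas are not preserved. A minimal sketch of one identity that matches its description, assuming a preference matrix A with users as rows: the naive route computes S = A^T A and then R = A S, while the rewritten route sums, over every nonzero entry A[v][i], the outer product of item i's vector with user v's vector scaled by A[v][i]. The data and helper names are made up:

```python
# Hypothetical preferences: rows are users, columns are items.
A = [
    [1, 2, 0],
    [0, 1, 3],
]
n_users, n_items = len(A), len(A[0])


def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Swap rows and columns."""
    return [list(col) for col in zip(*X)]


# Naive route: item similarity S = A^T A, then recommendations R = A S.
naive = matmul(A, matmul(transpose(A), A))

# Outer-product route: for every nonzero entry A[v][i], add
# A[v][i] * outer(item vector i, user vector v) to the running sum.
summed = [[0] * n_items for _ in range(n_users)]
for v in range(n_users):
    user_vec = A[v]                                   # preferences of user v
    for i in range(n_items):
        if A[v][i] == 0:
            continue                                  # absent entries cost nothing
        item_vec = [A[u][i] for u in range(n_users)]  # preferences for item i
        for u in range(n_users):
            for j in range(n_items):
                summed[u][j] += A[v][i] * item_vec[u] * user_vec[j]
```

Because each term depends on a single entry plus one row and one column, the sum can be split across workers and combined, which is what makes the rewrite attractive for MapReduce.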
Item Recommender via Hadoop
Apache Mahout
  Apache Mahout currently on release 0.3
  http://lucene.apache.org/mahout
  Will be a “Top Level Project” soon (before 0.4) (http://mahout.apache.org)
  “Scalable Machine Learning with commercially friendly licensing”
Mahout Features
  Recommenders
    absorbed the Taste project
  Classification (Naïve Bayes, C-Bayes, more)
  Clustering (Canopy, fuzzy-K-means, Dirichlet, etc…)
  Fast non-distributed linear mathematics
    absorbed the classic CERN Colt project
  Distributed Matrices and decomposition
    absorbed the Decomposer project
  mahout shell-script analogous to $HADOOP_HOME/bin/hadoop
    $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100
    $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300
    etc…
  Taste web-app for real-time recommendations
DistributedRowMatrix
  Wrapper around a SequenceFile<IntWritable,VectorWritable>
  Distributed methods like:
    Matrix transpose();
    Matrix times(Matrix other);
    Vector times(Vector v);
    Vector timesSquared(Vector v);
  To get SVD: pass into DistributedLanczosSolver:
    LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
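A minimal sketch of why a method like timesSquared is useful, assuming it returns A^T (A v), which can be accumulated one row at a time without ever forming A^T A. The Python functions below are stand-ins written for illustration, not Mahout's implementation, and the power iteration is a simpler substitute for Lanczos that recovers only the dominant singular value:

```python
import math

# Hypothetical sparse rows keyed by row id, mimicking the (row id, vector)
# pairs a row-oriented distributed matrix would stream through a mapper.
rows = {
    0: {0: 3.0},
    1: {1: 2.0},
    2: {},
}
n_cols = 2


def times_squared(v):
    """Return A^T (A v) in one pass over the rows.

    Each row r contributes (r . v) * r, so rows can be processed
    independently and the partial results summed.
    """
    result = [0.0] * n_cols
    for row in rows.values():
        dot = sum(value * v[col] for col, value in row.items())
        for col, value in row.items():
            result[col] += dot * value
    return result


def top_singular_value(iterations=50):
    """Power iteration on A^T A using only times_squared."""
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        w = times_squared(v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient v . (A^T A v) estimates the top eigenvalue of
    # A^T A, whose square root is the top singular value of A.
    eigenvalue = sum(a * b for a, b in zip(v, times_squared(v)))
    return math.sqrt(eigenvalue)


sigma = top_singular_value()
```

Every iteration is one pass over the data, which is the same cost pattern the later slides describe for the Lanczos-based solver.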
Questions?
  Contact:
    jake.mannix@gmail.com
    jmannix@apache.org
    http://twitter.com/pbrane
    http://www.decomposer.org/blog
    http://www.linkedin.com/in/jakemannix
Appendix
  There are lots of ways to deal with sparse Big Data, and many (not all) need to deal with the dimensionality of the feature-space growing beyond reasonable limits. Techniques to deal with this depend heavily on your data…
  That having been said, there are some general techniques.
Dealing with Curse of Dimensionality
  Sparseness means fast, but overlap is too small
  Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem?
  If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc…) towards each other while keeping “dissimilar” vectors far apart…
Solution A: Matrix decomposition
  Singular Value Decomposition (truncated)
    “best” approximation to your matrix
    Used in Latent Semantic Indexing (LSI)
    For graphs: spectral decomposition
    Collaborative filtering (Netflix leaderboard)
  Issues:
    very computation intensive
    no parallelized open-source packages (see Apache Mahout)
    Makes things too dense
SVD: continued
  Hadoop impl. in Mahout (Lanczos)
    O(N*d*k) for rank-k SVD on N docs, d elements each
  Density can be dealt with by doing Canopy Clustering offline
  But only extracting linear feature mixes
  Also, still very computation intensive and I/O intensive (k passes over data set). Are there better dimensional reduction methods?
Solution B: Stochastic Decomposition
  co-occurrence-based kernel + online Random Projection + SVD
Co-occurrence-based kernel
  Extract bigram phrases / pairs of items rated by the same person (using Log-Likelihood Ratio test to pick the best)
  “Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”}
  {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3,…}
  Dim(features) goes from 10^5 to 10^8+ (yikes!)
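A minimal sketch of scoring a candidate pair with a log-likelihood ratio test, assuming the standard G^2 statistic on a 2x2 contingency table of co-occurrence counts. The counts and the function name are made up, and the deck does not say how the scores are thresholded:

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    k11: events where A and B occur together
    k12: events with A but not B
    k21: events with B but not A
    k22: events with neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            k = observed[r][c]
            if k > 0:  # the limit of k * ln(k) as k -> 0 is 0
                expected = rows[r] * cols[c] / total
                g2 += k * math.log(k / expected)
    return 2.0 * g2


# Independent pair: observed counts match the expected counts exactly.
independent = log_likelihood_ratio(10, 10, 10, 10)

# Strongly associated pair, such as two tokens that nearly always co-occur.
associated = log_likelihood_ratio(100, 5, 5, 10000)
```

Pairs with a high score would be kept as new combined features; everything else stays as single tokens or single items.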
Online Random Projection
  Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix
  Or project each nGram down to a random (but sparse) 10^3-dim vector:
    V = {123876244 => 1.3}    (tf-IDF of “disney”)
    V’ = c*{h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1}    (c = 1.3 / sqrt(3))
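A minimal sketch of the sparse hashed projection in this slide's example, assuming an MD5-based stand-in for the hash function h (the deck does not say which hash it uses), a target dimension of 10^3, and three chained probes per feature. Probes that collide in the same bucket are simply summed:

```python
import hashlib
import math

DIM = 1000     # target dimension, "merely" 10^3
PROBES = 3     # chained hashes per feature: h(i), h(h(i)), h(h(h(i)))


def h(i):
    """Deterministic stand-in hash from an integer to a bucket in [0, DIM)."""
    digest = hashlib.md5(str(i).encode("utf-8")).hexdigest()
    return int(digest, 16) % DIM


def project(sparse_vector):
    """Project {feature id: weight} down to a sparse DIM-dimensional vector."""
    projected = {}
    for feature, weight in sparse_vector.items():
        # Scaling by 1/sqrt(PROBES) keeps the vector's norm equal to the
        # original weight whenever the probes land in distinct buckets.
        c = weight / math.sqrt(PROBES)
        bucket = feature
        for _ in range(PROBES):
            bucket = h(bucket)
            projected[bucket] = projected.get(bucket, 0.0) + c
    return projected


# The slide's example: feature 123876244 with tf-idf weight 1.3.
v_prime = project({123876244: 1.3})
```

Because the projection of each feature depends only on its id, vectors can be projected online, one at a time, without storing a projection matrix.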
Outer-product and Sum
  Take the 10^3-dim projected vectors and outer-product with themselves, result is 10^3 x 10^3-dim matrix
  All results go to single Reducer, where you compute…
SVD
  SVD-them quickly (they fit in memory)
  Over and over again (as new data comes in)
  Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity).
  SVD-projected vectors can be assigned immediately to nearest clusters if desired
References
  Randomized matrix decomposition review: http://arxiv.org/abs/0909.4061
  Sparse hashing/projection: John Langford et al., “Vowpal Wabbit”, http://hunch.net/~vw/

Seattle Scalability Mahout

  • 1. Numerical Recipes in Hadoop Jake Mannix linkedin/in/jakemannix twitter/pbrane jake.mannix@gmail.com jmannix@apache.org Principal SDE, LinkedIn Committer, Apache Mahout, Zoie, Bobo-Browse, Decomposer Author, Lucene in Depth (Manning MM/DD/2010)
  • 2. A Mathematician’s Apology What mathematical structure describes all of these? Full-text search: Score documents matching “query string” Collaborative filtering recommendation: Users who liked {those} also liked {these} (Social/web)-graph proximity: People/pages “close” to {this} are {these}
  • 4. Full-text Search: Vector Space Model of IR. Corpus as term-document matrix; query as bag-of-words vector. Full-text search is then just the corpus matrix multiplied by the query vector.
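As a rough sketch of slide 4's idea, assuming the corpus is stored as one sparse row of term weights per document and the query as a sparse term-weight vector, scoring is a single matrix-vector product. The names `corpus`, `query`, and `score_documents` are illustrative and do not come from the talk.

```python
def score_documents(corpus, query):
    """Multiply the (documents x terms) matrix by the query vector.

    Each document's score is the dot product of its sparse term-weight
    row with the query's term weights.
    """
    return [
        sum(weight * query.get(term, 0.0) for term, weight in doc.items())
        for doc in corpus
    ]


# Hypothetical toy corpus: one sparse row {term: weight} per document.
corpus = [
    {"hadoop": 2.0, "matrix": 1.0},
    {"matrix": 3.0, "search": 1.0},
    {"lucene": 1.0, "search": 2.0},
]
query = {"matrix": 1.0, "search": 1.0}  # bag-of-words query vector

scores = score_documents(corpus, query)
```

Ranking the documents is then a sort on `scores`; real systems would use tf-IDF or similar weights rather than these made-up numbers.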
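  • 5. Collaborative Filtering: User preference matrix (and item-item similarity matrix, built from the preference matrix and its transpose). Input user as vector of preferences. (Simple) item-based CF recommendations are then the similarity matrix multiplied by that preference vector.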
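A minimal sketch of slide 5, assuming the item-item similarity matrix is the plain co-occurrence product of the transposed preference matrix with itself (other similarity measures are possible, and the transcript does not preserve the exact formula). The matrix values and helper names are invented for illustration.

```python
def transpose(m):
    """Return the transpose of a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Dense matrix product a * b."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def matvec(m, v):
    """Dense matrix-vector product m * v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Hypothetical preference matrix: rows are users, columns are items.
prefs = [
    [5, 3, 0],
    [4, 0, 1],
    [0, 2, 5],
]

# Assumed similarity: items are similar when the same users rate both.
similarity = matmul(transpose(prefs), prefs)

# A new user who has expressed a preference for items 0 and 2.
user = [1, 0, 1]
recs = matvec(similarity, user)
```

The entries of `recs` are unnormalized scores per item; a real recommender would typically normalize them and drop items the user has already rated.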
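  • 6. Graph Proximity: Adjacency matrix; 2nd-degree adjacency matrix (the adjacency matrix squared). Input: a vector of all of a user’s “friends” or page links. A (weighted) distance measure of 1st – 3rd degree connections is then a weighted sum of powers of the adjacency matrix applied to that vector.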
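A sketch of slide 6 under stated assumptions: the proximity score is taken to be a weighted sum of the friend vector pushed through successive powers of the adjacency matrix. The weights 1.0, 0.5 and 0.25 and the four-node path graph are arbitrary placeholders, not values from the talk.

```python
def matvec(m, v):
    """Dense matrix-vector product m * v."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def proximity(adjacency, friends, weights=(1.0, 0.5, 0.25)):
    """Weighted 1st- to 3rd-degree proximity scores.

    `friends` is the user's 1st-degree connection vector; each further
    multiplication by `adjacency` reaches one degree farther out.
    """
    scores = [0.0] * len(friends)
    reach = list(friends)
    for weight in weights:
        scores = [s + weight * r for s, r in zip(scores, reach)]
        reach = matvec(adjacency, reach)  # step one more hop outward
    return scores


# Hypothetical undirected path graph 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
friends_of_user0 = [0, 1, 0, 0]  # user 0 is connected only to node 1

scores = proximity(adjacency, friends_of_user0)
```

Note that walks can return to the starting user, so user 0 also receives a score here; an application would usually mask out the user and existing friends before ranking.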
  • 7. Dictionary: Applications <-> Linear Algebra
  • 8. How does this help? In Search: Latent Semantic Indexing (LSI), probabilistic LSI, Latent Dirichlet Allocation. In Recommenders: Singular Value Decomposition, Layered Restricted Boltzmann Machines (Deep Belief Networks). In Graphs: PageRank, Spectral Decomposition / Spectral Clustering.
  • 9. Often use “Dimensional Reduction” to alleviate the sparse Big Data problem of “the curse of dimensionality”. Used to improve recall and relevance; in general, to smooth the metric on your data set.
  • 10. New applications with Matrices: If search is finding a doc-vector by matrix multiplication, and users' queries are represented as a matrix Q, with implicit feedback based on click-through per session represented as a matrix C,
  • 11. … continued: then a product of C and Q has the form (docs-by-terms) for search! Approach has been used by Ted Dunning at Veoh (and probably others).
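One plausible reading of slides 10 and 11, stated as an assumption because the matrices themselves are not in the transcript: Q has one row per session and one column per query term, C has one row per session and one column per document, and the docs-by-terms matrix is C transposed times Q. The function name and data below are hypothetical.

```python
def clicks_to_doc_terms(clicks, queries):
    """Compute C^T * Q by accumulating one session at a time.

    clicks:  sessions x docs  (1 if the doc was clicked in that session)
    queries: sessions x terms (term counts of the session's query)
    Returns a docs x terms matrix: each document collects the terms of
    the queries that led to a click on it.
    """
    n_docs, n_terms = len(clicks[0]), len(queries[0])
    out = [[0] * n_terms for _ in range(n_docs)]
    for click_row, query_row in zip(clicks, queries):
        for doc, clicked in enumerate(click_row):
            if clicked:
                for term, count in enumerate(query_row):
                    out[doc][term] += clicked * count
    return out


# Hypothetical data: 3 sessions, 3 terms, 2 documents.
queries = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
clicks = [
    [1, 0],
    [1, 1],
    [0, 1],
]

doc_terms = clicks_to_doc_terms(clicks, queries)
```

Under this reading, the resulting matrix can be searched exactly like an ordinary term-document index, except that a document's "text" is the vocabulary of the queries that found it.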
  • 12. Linear Algebra performance tricks: Naïve item-based recommendations: calculate the item similarity matrix, then calculate the item recs. Expressed in one step, in matrix notation, this can be rewritten in terms of the vector of preferences for user “v” and the vector of preferences of item “i”: the result is the matrix sum of the outer (tensor) products of these vectors, scaled by the entry they intersect at.
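One plausible reading of slide 12, offered as a sketch since the formulas are not in the transcript: with A the users-by-items preference matrix, recommendations for all users at once are A (A^T A), and the same matrix can be accumulated as a sum over the nonzero entries A[v][i] of the outer product of item i's column with user v's row, scaled by A[v][i]. All names and values are illustrative.

```python
def transpose(m):
    """Return the transpose of a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Dense matrix product a * b."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def recs_naive(a):
    """Two steps: build the item similarity matrix, then multiply."""
    similarity = matmul(transpose(a), a)
    return matmul(a, similarity)


def recs_outer(a):
    """One pass over nonzero entries, summing scaled outer products."""
    n_users, n_items = len(a), len(a[0])
    item_cols = transpose(a)
    out = [[0] * n_items for _ in range(n_users)]
    for v, user_row in enumerate(a):
        for i, entry in enumerate(user_row):
            if entry == 0:
                continue  # sparse data: most entries contribute nothing
            item_col = item_cols[i]
            for r in range(n_users):
                for c in range(n_items):
                    out[r][c] += entry * item_col[r] * user_row[c]
    return out


# Hypothetical preference matrix: rows are users, columns are items.
prefs = [
    [5, 3, 0],
    [4, 0, 1],
    [0, 2, 5],
]
```

The second form never materializes the similarity matrix, and each nonzero entry can be handled independently, which is the property that makes it attractive for a MapReduce-style job.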
  • 14. Apache Mahout: currently on release 0.3 (http://lucene.apache.org/mahout). Will be a “Top Level Project” soon, before 0.4 (http://mahout.apache.org). “Scalable Machine Learning with commercially friendly licensing”
  • 15. Mahout Features: Recommenders (absorbed the Taste project); Classification (Naïve Bayes, C-Bayes, more); Clustering (Canopy, fuzzy-K-means, Dirichlet, etc…); Fast non-distributed linear mathematics (absorbed the classic CERN Colt project); Distributed Matrices and decomposition (absorbed the Decomposer project); mahout shell-script analogous to $HADOOP_HOME/bin/hadoop: $MAHOUT_HOME/bin/mahout kmeans -i "in" -o "out" -k 100, $MAHOUT_HOME/bin/mahout svd -i "in" -o "out" -k 300, etc…; Taste web-app for real-time recommendations
  • 16. DistributedRowMatrix: Wrapper around a SequenceFile<IntWritable,VectorWritable>. Distributed methods like: Matrix transpose(); Matrix times(Matrix other); Vector times(Vector v); Vector timesSquared(Vector v); To get SVD, pass into DistributedLanczosSolver: LanczosSolver.solve(Matrix input, Matrix eigenVectors, List<Double> eigenValues, int rank);
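A plain-Python sketch (not Mahout code) of why a row-partitioned matrix suits this kind of work, assuming `timesSquared(v)` denotes A^T (A v). Under that assumption the product can be accumulated one row at a time, so the rows can live in separate file splits and never need to be joined. The function `times_squared` and the sample rows are hypothetical.

```python
def times_squared(rows, v):
    """Compute A^T (A v) in a single pass over the rows of A.

    rows: iterable of sparse rows, each a dict {column: value}
    v:    dense vector as a list, one entry per column
    Each row contributes (row . v) * row to the result, so rows can be
    processed independently and their contributions summed.
    """
    result = [0.0] * len(v)
    for row in rows:
        dot = sum(value * v[col] for col, value in row.items())
        for col, value in row.items():
            result[col] += dot * value
    return result


# Hypothetical 2 x 3 sparse matrix, stored row by row.
rows = [
    {0: 1.0, 1: 2.0},
    {1: 1.0, 2: 3.0},
]
v = [1.0, 1.0, 1.0]

product = times_squared(rows, v)
```

Repeated application of this operation to a vector is the core step of Lanczos-style eigensolvers, which is presumably why it appears next to the SVD entry point on the slide.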
  • 17. Questions? Contact: jake.mannix@gmail.com jmannix@apache.org http://twitter.com/pbrane http://www.decomposer.org/blog http://www.linkedin.com/in/jakemannix
  • 18. Appendix: There are lots of ways to deal with sparse Big Data, and many (not all) need to deal with the dimensionality of the feature-space growing beyond reasonable limits. Techniques to deal with this depend heavily on your data… That having been said, there are some general techniques.
  • 19. Dealing with Curse of Dimensionality: Sparseness means fast, but overlap is too small. Can we reduce the dimensionality (from “all possible text tokens” or “all userIds”) while keeping the nice aspects of the search problem? If possible, collapse “similar” vectors (synonymous terms, userIds with high overlap, etc…) towards each other while keeping “dissimilar” vectors far apart…
  • 20. Solution A: Matrix decomposition. Singular Value Decomposition (truncated): “best” approximation to your matrix. Used in Latent Semantic Indexing (LSI); for graphs: spectral decomposition; collaborative filtering (Netflix leaderboard). Issues: very computation-intensive; no parallelized open-source packages (see Apache Mahout); makes things too dense.
  • 21. SVD: continued. Hadoop impl. in Mahout (Lanczos): O(N*d*k) for rank-k SVD on N docs, d elements each. Density can be dealt with by doing Canopy Clustering offline. But it only extracts linear feature mixes. Also, it is still very computation-intensive and I/O-intensive (k passes over the data set). Are there better dimensional reduction methods?
  • 22. Solution B: Stochastic Decomposition: co-occurrence-based kernel + online Random Projection + SVD
  • 23. Co-occurrence-based kernel: Extract bigram phrases / pairs of items rated by the same person (using Log-Likelihood Ratio test to pick the best). “Disney on Ice was Amazing!” -> {“disney”, “disney on ice”, “ice”, “was”, “amazing”}; {item1:4, item2:5, item5:3, item9:1} -> {item1:4, (items1+2):4.5, item2:5, item5:3,…}. Dim(features) goes from 10^5 to 10^8+ (yikes!)
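A small sketch of the item-pair expansion on slide 23. It assumes the candidate pairs were already selected (for example by the log-likelihood ratio test the slide mentions, which is not implemented here) and that a pair feature's value is the mean of its two ratings, as the slide's 4.5 example suggests. The function name and the pair list are hypothetical.

```python
def add_pair_features(ratings, significant_pairs):
    """Expand a user's rating vector with co-occurrence pair features.

    ratings:           {item: rating} for one user
    significant_pairs: (item_a, item_b) tuples judged worth keeping;
                       assumed to be precomputed elsewhere
    A pair feature is added only when the user rated both items.
    """
    expanded = dict(ratings)
    for a, b in significant_pairs:
        if a in ratings and b in ratings:
            expanded[(a, b)] = (ratings[a] + ratings[b]) / 2
    return expanded


ratings = {"item1": 4, "item2": 5, "item5": 3, "item9": 1}
# Hypothetical output of a pair-selection step; item7 is unrated here.
significant_pairs = [("item1", "item2"), ("item2", "item7")]

expanded = add_pair_features(ratings, significant_pairs)
```

Because the number of possible pairs grows quadratically with the number of items, the feature space explodes in the way the slide describes, which motivates the projection step that follows.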
  • 24. Online Random Projection: Randomly project kernelized text vectors down to “merely” 10^3 dimensions with a Gaussian matrix. Or project each nGram down to a random (but sparse) 10^3-dim vector: V = {123876244 => 1.3} (tf-IDF of “disney”); V’ = c*{h(i) => 1, h(h(i)) => 1, h(h(h(i))) => 1} (c = 1.3 / sqrt(3))
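A sketch of the sparse hashed projection on slide 24, assuming `h` can be any deterministic hash into the reduced index range. MD5 modulo 1000 is an arbitrary stand-in chosen only so the example is reproducible; the talk does not say which hash was used.

```python
import hashlib
import math

DIM = 1000  # target dimensionality, the slide's "merely" 10^3


def h(index):
    """Deterministic stand-in hash from any integer into [0, DIM)."""
    digest = hashlib.md5(str(index).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DIM


def project(sparse_vec, n_hashes=3):
    """Map each feature to n_hashes slots: h(i), h(h(i)), h(h(h(i))).

    Each slot receives weight / sqrt(n_hashes), so a feature whose slots
    do not collide keeps its original length after projection.
    """
    out = {}
    for index, weight in sparse_vec.items():
        c = weight / math.sqrt(n_hashes)
        slot = index
        for _ in range(n_hashes):
            slot = h(slot)
            out[slot] = out.get(slot, 0.0) + c
    return out


v = {123876244: 1.3}  # tf-IDF weight of one token, as on the slide
v_projected = project(v)
```

No random matrix is ever stored: the projection is defined entirely by the hash function, so new features can be projected online as they first appear.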
  • 26. SVD: SVD them quickly (they fit in memory), over and over again (as new data comes in). Use the most recent SVD to project your (already randomly projected) text still further (now encoding “semantic” similarity). SVD-projected vectors can be assigned immediately to nearest clusters if desired.
  • 27. References: Randomized matrix decomposition review (http://arxiv.org/abs/0909.4061). Sparse hashing/projection: John Langford et al., “Vowpal Wabbit” (http://hunch.net/~vw/)

Editor's notes

  1. And the usual references for LSI and Spectral Decomposition