SlideShare une entreprise Scribd logo
1  sur  36
Télécharger pour lire hors ligne
Machine Learning with Apache Mahout
Classification, Clustering and Recommendation




  3/3/2011                                     Michaël Figuière
Machine Learning
Machine Learning

                                           Machine Learning is a
                                            subset of Artificial
                                                   Intelligence
         Artificial Intelligence




                                   Machine Learning
NoSQL, Search and Machine Learning

                                NoSQL, Search and
                                  Machine Learning
                                  greatly complete
                   Machine
                   Learning           each other !




        NoSQL                   Search
Machine Learning algorithms

• Recommentations

         Advice user with recommended items



• Classification
         Automatically classify documents based on a given set of
         examples

• Clustering
         Automatically discover groups within a set of documents



• Patterns mining, evolutionary algorithms, ...
Recommendation - User based
                              Amazon suggests
                               articles bought
                                      by similar
                                    customers
Recommendation - Item based

                                 On the article page
                              Amazon leverages item
                              based recommendation
Similarities between users

                                       Here we observes
                                       that users 1 and 2
                                      have similar tastes
              1           2




         A    B       C   D   E   F




                  1
Recommendation use cases

• Advice user with items on e-commerce websites

         And increase revenue



• Advice user with feature he may be interested in on a Web application
         As most features are usually unknown



• Filter and adapt scoring of results of a search engine
         Based on similar users clicks, ...
Classification



                Mails classified as
                spams by GMail
Classification use cases

• Automatically attach tags to documents

        Based on existing manual tagging, wikipedia, ...



• Extract suspicious documents
        Spam, corrupted documents, ...
Clustering

 Trendy topics
 discovered by
 Google News
Clustering with K-Means



        A
                    B



                C




                                D
                        E
                            F
Clustering with K-Means


                                    Cluster centers
        A
                    B               with random
                                    initial position
                C




                                D
                        E
                            F
Clustering with K-Means

                                    Data are attached
                                       to the nearest
        A
                    B                 cluster center


                C




                                D
                        E
                            F
Clustering with K-Means

                                    Cluster centers are
                                      moved in order to
        A                             minimize the sum
                    B
                                           of distances

                C




                                D
                        E
                            F
Clustering with K-Means

                                       The data point C is
                                     then attached to the
        A                           first center as it has
                    B
                                       become the nearest

                C




                                D
                        E
                            F
Clustering use cases

• Finds key topics in a set of documents

         News feeds, business documents, ...



• Finds some typical behaviors within a set of users
         Visit frequency, buying habits, ...
Apache Mahout
In few words

• Implementation of machine learning algorithms in Java

         Continuously growing collection of algorithms



• Most of them come in a MapReduce implementation for Hadoop

         Scalable to huge datasets



• Still quite young but growing fast
         Started in early 2009



• Intended to be for Machine Learning what Lucene is for Information Retrieval
Documentation
Recommendation example

DataModel model = new FileDataModel(new File("data.csv"));


UserSimilarity simil =
   new PearsonCorrelationSimilarity(model);


UserNeighborhood neighborhood =
   new NearestNUserNeighborhood(2, similarity, model);


Recommender recommender =
   new GenericUserBasedRecommender(model, neighborhood, simil);


List<RecommendedItem> recommendations =
   recommender.recommend(1, 1);


The code for a basic recommendation is pretty straightforward !
Classification with Mahout



       Training       Training
      examples       algorithm           Model



                                 Copy



      New data        Model             Decision
Clustering with Mahout




                    Clustering    List of
     Documents
                    algorithm    clusters
Relevance evaluation


          Entire dataset




      Data used for
                           Data used to evaluate
         training
                           relevance of an algorithm
                           and its settings
A search engine use case
A Search Engine


                  Search
A Search Engine


            MyCustomer   Search
A Search Engine


                      MyCustomer                               Search




   Document     Non Disclosure Agreement                                          12 days ago
                   ... MyCustomer agrees not to disclose any part of ...



   Document     2010 Sales Report                                                 1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        2 days ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline



                    Tika


       PDF
                  Text
                            Analyzer
                Extractor
                                                Search
                                                 Index
                            Analyzer
      Phone
       Call



                                       Lucene
A more complex Search Engine


                      MyCustomer                               Search

                    Sales                   Juridic                   Accounting




   Document     2010 Sales Report                                                 1 month ago
                ... MyCustomer: 12 M€ with 3 deals ...



                Phone Call                                                        2 days ago
   Phone Call   Customer: MyCustomer           Time: 9:55am     Duration: 13min
                Description: Invoice not received for order #2354E
Indexing Pipeline with Mahout



            Tika       Mahout


 PDF
             Text
                       Classifier   Analyzer
           Extractor
                                                       Search
                                                        Index
                       Classifier   Analyzer
 Phone
  Call



                                              Lucene
Query pipeline


                     Lucene

            Query

                     Analyzer
                                Search
                                 Index


           Results
Query pipeline with Mahout


                        Lucene

           Query

                         Analyzer
                                       Search
                                        Index
                         Custom
                         Analyzer
                         Scoring
           Results


                       Using Mahout
                     recommendations
Conclusion

• Machine learning brings a lot of valuable features for enterprises

         Revenue increasing, better productivity, user adoption, ...



• Mahout is growing fast and is becoming a great choice for Java apps

        With easy integration to business applications



• Business people are not used to that kind of use cases
         Collaboration with technical folks is mandatory
Questions / Answers




                       ?
                      blog.xebia.fr
                      @mfiguiere

Contenu connexe

Similaire à Machine Learning Classification, Clustering and Recommendation with Apache Mahout

LAK13 linkedup tutorial_evaluation_framework
LAK13 linkedup tutorial_evaluation_frameworkLAK13 linkedup tutorial_evaluation_framework
LAK13 linkedup tutorial_evaluation_frameworkHendrik Drachsler
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...SL Corporation
 
Analyzing Multi-Structured Data
Analyzing Multi-Structured DataAnalyzing Multi-Structured Data
Analyzing Multi-Structured DataDataWorks Summit
 
HABIB FIGA GUYE {BULE HORA UNIVERSITY}(habibifiga@gmail.com
HABIB FIGA GUYE {BULE HORA UNIVERSITY}(habibifiga@gmail.comHABIB FIGA GUYE {BULE HORA UNIVERSITY}(habibifiga@gmail.com
HABIB FIGA GUYE {BULE HORA UNIVERSITY}(habibifiga@gmail.comHABIB FIGA GUYE
 
Keynote: Harnessing the power of Elasticsearch for simplified search
Keynote: Harnessing the power of Elasticsearch for simplified searchKeynote: Harnessing the power of Elasticsearch for simplified search
Keynote: Harnessing the power of Elasticsearch for simplified searchElasticsearch
 
FSI202 Machine Learning in Capital Markets
FSI202 Machine Learning in Capital MarketsFSI202 Machine Learning in Capital Markets
FSI202 Machine Learning in Capital MarketsAmazon Web Services
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...Egyptian Engineers Association
 
التقنيات المستخدمة لتطوير المكتبات
التقنيات المستخدمة لتطوير المكتباتالتقنيات المستخدمة لتطوير المكتبات
التقنيات المستخدمة لتطوير المكتباتMohammed El Rafie Tarabay
 
Machine Learning in e commerce - Reboot
Machine Learning in e commerce - RebootMachine Learning in e commerce - Reboot
Machine Learning in e commerce - RebootMarion DE SOUSA
 
C19013010 the tutorial to build shared ai services session 1
C19013010  the tutorial to build shared ai services session 1C19013010  the tutorial to build shared ai services session 1
C19013010 the tutorial to build shared ai services session 1Bill Liu
 
Digital Trails Dave King 1 5 10 Part 2 D3
Digital Trails   Dave King   1 5 10   Part 2   D3Digital Trails   Dave King   1 5 10   Part 2   D3
Digital Trails Dave King 1 5 10 Part 2 D3Dave King
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NETDev Raj Gautam
 
Driving Customer Loyalty with Azure Machine Learning
Driving Customer Loyalty with Azure Machine LearningDriving Customer Loyalty with Azure Machine Learning
Driving Customer Loyalty with Azure Machine LearningCCG
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Futurefeiwin
 
Big Data and the Next Best Offer
Big Data and the Next Best OfferBig Data and the Next Best Offer
Big Data and the Next Best OfferMichel Bruley
 
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...Dr. Haxel Consult
 
Building Generative AI-infused apps: what's possible and how to start
Building Generative AI-infused apps: what's possible and how to startBuilding Generative AI-infused apps: what's possible and how to start
Building Generative AI-infused apps: what's possible and how to startMaxim Salnikov
 
A wrapper for QuantLib and reference data
A wrapper for QuantLib and reference dataA wrapper for QuantLib and reference data
A wrapper for QuantLib and reference dataJun Hong
 

Similaire à Machine Learning Classification, Clustering and Recommendation with Apache Mahout (20)

LAK13 linkedup tutorial_evaluation_framework
LAK13 linkedup tutorial_evaluation_frameworkLAK13 linkedup tutorial_evaluation_framework
LAK13 linkedup tutorial_evaluation_framework
 
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
Overcoming the Top Four Challenges to Real‐Time Performance in Large‐Scale, D...
 
Analyzing Multi-Structured Data
Analyzing Multi-Structured DataAnalyzing Multi-Structured Data
Analyzing Multi-Structured Data
 
My Resume
My ResumeMy Resume
My Resume
 
HABIB FIGA GUYE {BULE HORA UNIVERSITY}(habibifiga@gmail.com
HABIB FIGA GUYE {BULE HORA UNIVERSITY}(habibifiga@gmail.comHABIB FIGA GUYE {BULE HORA UNIVERSITY}(habibifiga@gmail.com
HABIB FIGA GUYE {BULE HORA UNIVERSITY}(habibifiga@gmail.com
 
Keynote: Harnessing the power of Elasticsearch for simplified search
Keynote: Harnessing the power of Elasticsearch for simplified searchKeynote: Harnessing the power of Elasticsearch for simplified search
Keynote: Harnessing the power of Elasticsearch for simplified search
 
FSI202 Machine Learning in Capital Markets
FSI202 Machine Learning in Capital MarketsFSI202 Machine Learning in Capital Markets
FSI202 Machine Learning in Capital Markets
 
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
لموعد الإثنين 03 يناير 2022 143 مبادرة #تواصل_تطوير المحاضرة ال 143 من المباد...
 
التقنيات المستخدمة لتطوير المكتبات
التقنيات المستخدمة لتطوير المكتباتالتقنيات المستخدمة لتطوير المكتبات
التقنيات المستخدمة لتطوير المكتبات
 
Machine Learning in e commerce - Reboot
Machine Learning in e commerce - RebootMachine Learning in e commerce - Reboot
Machine Learning in e commerce - Reboot
 
C19013010 the tutorial to build shared ai services session 1
C19013010  the tutorial to build shared ai services session 1C19013010  the tutorial to build shared ai services session 1
C19013010 the tutorial to build shared ai services session 1
 
Digital Trails Dave King 1 5 10 Part 2 D3
Digital Trails   Dave King   1 5 10   Part 2   D3Digital Trails   Dave King   1 5 10   Part 2   D3
Digital Trails Dave King 1 5 10 Part 2 D3
 
Machine Learning With ML.NET
Machine Learning With ML.NETMachine Learning With ML.NET
Machine Learning With ML.NET
 
Driving Customer Loyalty with Azure Machine Learning
Driving Customer Loyalty with Azure Machine LearningDriving Customer Loyalty with Azure Machine Learning
Driving Customer Loyalty with Azure Machine Learning
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Future
 
Big Data and the Next Best Offer
Big Data and the Next Best OfferBig Data and the Next Best Offer
Big Data and the Next Best Offer
 
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
AI-SDV 2020: Bringing AI to SME projects: Addressing customer needs with a fl...
 
Building Generative AI-infused apps: what's possible and how to start
Building Generative AI-infused apps: what's possible and how to startBuilding Generative AI-infused apps: what's possible and how to start
Building Generative AI-infused apps: what's possible and how to start
 
A wrapper for QuantLib and reference data
A wrapper for QuantLib and reference dataA wrapper for QuantLib and reference data
A wrapper for QuantLib and reference data
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
 

Plus de Michaël Figuière

EclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache CassandraEclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache CassandraMichaël Figuière
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersMichaël Figuière
 
YaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersYaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersMichaël Figuière
 
Geneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersGeneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersMichaël Figuière
 
Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Michaël Figuière
 
NYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthNYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthMichaël Figuière
 
Paris Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversParis Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversMichaël Figuière
 
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraMichaël Figuière
 
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraNoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraMichaël Figuière
 
GTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLGTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLMichaël Figuière
 
Duchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutDuchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutMichaël Figuière
 
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...Michaël Figuière
 
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraBreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraMichaël Figuière
 
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesBreizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesMichaël Figuière
 
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Michaël Figuière
 
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesLorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesMichaël Figuière
 
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesTours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesMichaël Figuière
 
Paris JUG (sept 2010) - NoSQL : Des concepts à la réalité
Paris JUG (sept 2010) - NoSQL : Des concepts à la réalitéParis JUG (sept 2010) - NoSQL : Des concepts à la réalité
Paris JUG (sept 2010) - NoSQL : Des concepts à la réalitéMichaël Figuière
 
Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world
Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real worldXebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world
Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real worldMichaël Figuière
 

Plus de Michaël Figuière (20)

EclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache CassandraEclipseCon - Building an IDE for Apache Cassandra
EclipseCon - Building an IDE for Apache Cassandra
 
Paris Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for DevelopersParis Cassandra Meetup - Cassandra for Developers
Paris Cassandra Meetup - Cassandra for Developers
 
YaJug - Cassandra for Java Developers
YaJug - Cassandra for Java DevelopersYaJug - Cassandra for Java Developers
YaJug - Cassandra for Java Developers
 
Geneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java DevelopersGeneva JUG - Cassandra for Java Developers
Geneva JUG - Cassandra for Java Developers
 
ChtiJUG - Cassandra 2.0
ChtiJUG - Cassandra 2.0ChtiJUG - Cassandra 2.0
ChtiJUG - Cassandra 2.0
 
Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!Cassandra summit 2013 - DataStax Java Driver Unleashed!
Cassandra summit 2013 - DataStax Java Driver Unleashed!
 
NYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in DepthNYC* Tech Day - New Cassandra Drivers in Depth
NYC* Tech Day - New Cassandra Drivers in Depth
 
Paris Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra DriversParis Cassandra Meetup - Overview of New Cassandra Drivers
Paris Cassandra Meetup - Overview of New Cassandra Drivers
 
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with CassandraApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
ApacheCon Europe 2012 - Real Time Big Data in practice with Cassandra
 
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with CassandraNoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
NoSQL Matters 2012 - Real Time Big Data in practice with Cassandra
 
GTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQLGTUG Nantes (Dec 2011) - BigTable et NoSQL
GTUG Nantes (Dec 2011) - BigTable et NoSQL
 
Duchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache MahoutDuchess France (Nov 2011) - Atelier Apache Mahout
Duchess France (Nov 2011) - Atelier Apache Mahout
 
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
JUG Summer Camp (Sep 2011) - Les applications et architectures d’entreprise d...
 
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec CassandraBreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
BreizhCamp (Jun 2011) - Haute disponibilité et élasticité avec Cassandra
 
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux EntreprisesBreizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
Breizh JUG (mar 2011) - NoSQL : Des Grands du Web aux Entreprises
 
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
Xebia Knowledge Exchange (jan 2011) - Trends in Enterprise Applications Archi...
 
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprisesLorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
Lorraine JUG (dec 2010) - NoSQL, des grands du Web aux entreprises
 
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprisesTours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
Tours JUG (oct 2010) - NoSQL, des grands du Web aux entreprises
 
Paris JUG (sept 2010) - NoSQL : Des concepts à la réalité
Paris JUG (sept 2010) - NoSQL : Des concepts à la réalitéParis JUG (sept 2010) - NoSQL : Des concepts à la réalité
Paris JUG (sept 2010) - NoSQL : Des concepts à la réalité
 
Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world
Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real worldXebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world
Xebia Knowledge Exchange (mars 2010) - Lucene : From theory to real world
 

Dernier

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 

Dernier (20)

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 

Machine Learning Classification, Clustering and Recommendation with Apache Mahout

  • 1. Machine Learning with Apache Mahout Classification, Clustering and Recommendation 3/3/2011 Michaël Figuière
  • 3. Machine Learning Machine Learning is a subset of Artificial Intelligence Artificial Intelligence Machine Learning
  • 4. NoSQL, Search and Machine Learning NoSQL, Search and Machine Learning greatly complete Machine Learning each other ! NoSQL Search
  • 5. Machine Learning algorithms • Recommentations Advice user with recommended items • Classification Automatically classify documents based on a given set of examples • Clustering Automatically discover groups within a set of documents • Patterns mining, evolutionary algorithms, ...
  • 6. Recommendation - User based Amazon suggests articles bought by similar customers
  • 7. Recommendation - Item based On the article page Amazon leverages item based recommendation
  • 8. Similarities between users Here we observes that users 1 and 2 have similar tastes 1 2 A B C D E F 1
  • 9. Recommendation use cases • Advice user with items on e-commerce websites And increase revenue • Advice user with feature he may be interested in on a Web application As most features are usually unknown • Filter and adapt scoring of results of a search engine Based on similar users clicks, ...
  • 10. Classification Mails classified as spams by GMail
  • 11. Classification use cases • Automatically attach tags to documents Based on existing manual tagging, wikipedia, ... • Extract suspicious documents Spam, corrupted documents, ...
  • 12. Clustering Trendy topics discovered by Google News
  • 13. Clustering with K-Means A B C D E F
  • 14. Clustering with K-Means Cluster centers A B with random initial position C D E F
  • 15. Clustering with K-Means Data are attached to the nearest A B cluster center C D E F
  • 16. Clustering with K-Means Cluster centers are moved in order to A minimize the sum B of distances C D E F
  • 17. Clustering with K-Means The data point C is then attached to the A first center as it has B become the nearest C D E F
  • 18. Clustering use cases • Finds key topics in a set of documents News feeds, business documents, ... • Finds some typical behaviors within a set of users Visit frequency, buying habits, ...
  • 20. In few words • Implementation of machine learning algorithms in Java Continuously growing collection of algorithms • Most of them come in a MapReduce implementation for Hadoop Scalable to huge datasets • Still quite young but growing fast Started in early 2009 • Intended to be for Machine Learning what Lucene is for Information Retrieval
  • 22. Recommendation example DataModel model = new FileDataModel(new File("data.csv")); UserSimilarity simil = new PearsonCorrelationSimilarity(model); UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model); Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, simil); List<RecommendedItem> recommendations = recommender.recommend(1, 1); The code for a basic recommendation is pretty straightforward !
  • 23. Classification with Mahout Training Training examples algorithm Model Copy New data Model Decision
  • 24. Clustering with Mahout Clustering List of Documents algorithm clusters
  • 25. Relevance evaluation Entire dataset Data used for Data used to evaluate training relevance of an algorithm and its settings
  • 26. A search engine use case
  • 27. A Search Engine Search
  • 28. A Search Engine MyCustomer Search
  • 29. A Search Engine MyCustomer Search Document Non Disclosure Agreement 12 days ago ... MyCustomer agrees not to disclose any part of ... Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 2 days ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 30. Indexing Pipeline Tika PDF Text Analyzer Extractor Search Index Analyzer Phone Call Lucene
  • 31. A more complex Search Engine MyCustomer Search Sales Juridic Accounting Document 2010 Sales Report 1 month ago ... MyCustomer: 12 M€ with 3 deals ... Phone Call 2 days ago Phone Call Customer: MyCustomer Time: 9:55am Duration: 13min Description: Invoice not received for order #2354E
  • 32. Indexing Pipeline with Mahout Tika Mahout PDF Text Classifier Analyzer Extractor Search Index Classifier Analyzer Phone Call Lucene
  • 33. Query pipeline Lucene Query Analyzer Search Index Results
  • 34. Query pipeline with Mahout Lucene Query Analyzer Search Index Custom Analyzer Scoring Results Using Mahout recommendations
  • 35. Conclusion • Machine learning brings a lot of valuable features for enterprises Revenue increasing, better productivity, user adoption, ... • Mahout is growing fast and is becoming a great choice for Java apps With easy integration to business applications • Business people are not used to that kind of use cases Collaboration with technical folks is mandatory
  • 36. Questions / Answers ? blog.xebia.fr @mfiguiere