Big Data Warehousing Meetup

Today’s Topic: Building a Relevance
Engine using Hadoop, Mahout & Pig




                                      Sponsored By:
WELCOME!
  Joe Caserta
  Founder & President, Caserta Concepts
Agenda
7:00     Networking
         Grab a slice of pizza and a drink...



7:15     Joe Caserta                              Welcome
         President, Caserta Concepts              About the Meetup and about Caserta Concepts
         Author, Data Warehouse ETL Toolkit


7:30     Erik Laurence                            Big Data Facts and Figures
         VP Marketing, Caserta Concepts           Interesting observations from the world of Big Data



7:45     Elliott Cordo                            Relevance
         Principal Consultant, Caserta Concepts   Building a Big Data recommendation engine with Mahout



8:15     Grant Ingersoll                          Machine Learning
         Chief Scientist, Lucidworks              Powering large scale data driven real time apps with
         Mahout co-founder                        Apache Solr and Mahout
         Lucene/Solr committer

8:45 -   More Networking
9:00     Tell us what you’re up to…
About BDW Meetup
• Big Data is a complex, rapidly
 changing landscape

• We want to share our stories and
 hear about yours

• Great networking opportunity for
 like-minded data nerds

• Opportunities to collaborate on
 exciting projects
About Caserta Concepts
 Focused Expertise
 •   Big Data Analytics
 •   Data Warehousing
 •   Business Intelligence
 •   Strategic Data Ecosystems

 Industries Served
 •   Financial Services
 •   Healthcare / Insurance
 •   Retail / eCommerce
 •   Digital Media / Marketing
 •   K-12 / Higher Education

     Founded in 2001

     • President: Joe Caserta, industry thought leader,
       consultant, educator and co-author, The Data
       Warehouse ETL Toolkit (Wiley, 2004)
Client Portfolio
Finance
& Insurance




Retail/eCommerce
& Manufacturing




Education
& Services
Expertise & Offerings
 Strategic Roadmap/
 Assessment/Consulting


 Big Data
 Analytics




 Data Warehousing/
 ETL/Data Integration


 BI/Visualization/
 Analytics



 Master Data Management
Big Data at Caserta Concepts
Caserta Concepts is a blend of the best designers in traditional
enterprise data with the best new designers in Big Data.

 Traditional Data
   • Tools
       • RDBMS
       • DQ
       • MDM
       • BI
       • ETL
       • Analytics
   • Traditional Data
       • Transactions

 Big Data
   • Tools
       • Hadoop
       • Mahout
       • Relevance Engine
       • Analytics
   • New Data
       • Social
       • Machine
       • Deep History
       • Unstructured



                      Immutable Data Concepts
              • Transformation   • Profiling
              • Conforming       • Processing Efficiency/Speed


Contacts

     Joe Caserta
     President & Founder, Caserta Concepts
     P: (855) 755-2246 x227
     E: joe@casertaconcepts.com


     Erik Laurence
     VP Marketing, Caserta Concepts
     P: (855) 755-2246 x528
     E: erik@casertaconcepts.com

     Elliott Cordo
     Principal Consultant, Caserta Concepts
     P: (855) 755-2246 x267
     E: elliott@casertaconcepts.com

     info@casertaconcepts.com
     1 (855) 755-2246
     www.casertaconcepts.com
BIG DATA FACTS AND FIGURES
   Erik Laurence
   VP Marketing, Caserta Concepts
What is Really Meant by Big Data?
• The 4 Vs of Big Data
  • Volume
    • More data than ever before
    • Most of world’s data is unstructured,
      semi-structured or multi-structured
      [Pie chart: 10% structured, 90% un/semi/multi-structured]
  • Variety
    • More sources than ever before
    • Social, web logs, machine logs, documents, geotags, video, …
  • Velocity
    • Some data only has value for a short period of time
    • Relevance engines, financial fraud sensors, early warning sensors, etc.
  • Vitality
    • Agility is required in analytics
    • Adapt quickly to changing business needs
Enterprise Involvement with Big Data
[Pie chart: enterprise involvement with Big Data]
  • Beyond Pilot Stage: 6%
  • Engaged in Pilot: 18%
  • Not Yet Involved: 76%

• Awareness of Big Data is high among enterprises, but three-quarters are still
  wondering, “What is this all about?”
• The answer across all businesses: “We don't know what the business case
  is.”



                                                            Source: WSJ November 29, 2012
Business Cases Have Been Identified
“The use of data and analytics … is going to be a basis of competition
going forward for individual firms, for sectors and even for countries.
Those companies that are able to use data effectively are more likely to
win in the marketplace.”
         - Michael Chui, McKinsey Global Institute

In just one field—personal location data—$100 billion of value can be
created globally for service providers through use of data.

Benefits for consumers could be six times that.




    Source: (WSJ 11/29/12)
Big Data Played A Role in the Election
“This was the first presidential
election campaign where all of the
data that was coming into the
campaign was successfully
collected and centralized.

“The Obama campaign did a
successful job with that; the
Romney campaign did not.”

  - John Aristotle Phillips, Chief Executive of
  Aristotle International (WSJ 11/29/12)

Obama campaign hired an analytics department five
times as large as that of the 2008 operation.
Big Data Example in Obama Campaign
• $40k-a-head dinner in June at Sarah Jessica
  Parker’s home in NYC
• 7 different versions of the email solicitation for the
  event
  • Some mentioned a 2nd fundraiser that night, a Mariah
    Carey concert
  • Some said Ms. Parker is a mother
  • Some said Vogue editor Anna Wintour would be at the
    dinner
• Who got which email depended on big data
  • Profile info about each prospect
  • How they react to different messages
• Campaign created a single massive system to join
 info from Democratic voter files to
  • pollsters, fundraisers, field workers and consumer
    databases, social-media, and mobile contacts

  Sources: WSJ, Time Magazine
Hadoop Market: Growing & Evolving
• Big data outranks virtualization as
 #1 trend driving spending initiatives
  • Barclays CIO Survey, April 2012


• Overall market at $100B
  • Hadoop 2nd only to RDBMS in
    potential


• Estimates put market growth at
 >40% CAGR
  • IDC expects Big Data tech and
    services market to grow to $16.9B in
    2015
  • According to JPMC 50% of Big Data
    market will be influenced by Hadoop
Hadoop Cost Effective for Archiving
• Hadoop is orders of magnitude cheaper than traditional
 archival methods

• Annual cost of 1 TB of archival storage for a credit card
 company




        Tape                SAN                     Hadoop
       $30,000             $3,000                    $300
Hadoop is Fast
• Sears' process to analyze loyalty club
 marketing campaigns took six weeks on
 mainframe, Teradata, and SAS servers
  • In retail, that’s half the season!


• New process on Hadoop is done weekly
  • For online and mobile, daily analysis is done


• What’s more, old models used 10% of data, new models use all
 the data



• Source: Information Week (October 31, 2012)
BUILDING A RECOMMENDATION ENGINE
   Elliott Cordo
   Principal Consultant, Caserta Concepts
Recommendations
• Your customers expect them
   • Good recommendations make life easier
   • Help them find information, products, and services they might not
     have thought of


• What makes a good recommendation?
  • Relevant but not obvious
  • Sense of “surprise”
Where can recommendation
engines be found?
• Applications can be found in a wide variety of
 industries:
  • Travel
  • Service Industry
  • Music/Online radio
  • TV and Video
  • Online Publications
  • Retail
   ...and countless others
Our Use Case: Online Magazine
Goals:
• Serve customers recommendations based on what their
  peers are reading.
• Recommendations must be relevant to the article they
  are currently viewing.
Technical Details
Core Platform:
• Cloudera Hadoop Cluster
• Mahout Machine Learning Library
• Apache Pig


Additional Technology:
• Talend Big Data Edition (ETL to/from relational)
• Datameer (Analysis and Visualization)
How we did it
Solution leverages three main algorithms:
• Mahout K-Means – identifying groups of similar articles
• Mahout Item-Based Recommender - recommendations
  based on peer behavior
• Raw Popularity – custom Pig script: “people who read this
  article also read...”
K-Means
• Treats items as coordinates
• Places a number of random
  “centroids” and assigns each
  item to its nearest centroid
• Moves each centroid to the
  average location of its items
• Process repeats until the
  assignments stop changing

We used the major attributes of the articles to create
coordinate points:
Author, Topic, Section, Region, Media, etc.

                                *Diagram from Collective Intelligence by Toby Segaran
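The loop described on this slide can be sketched in a few lines of plain Python. This is an illustration only, not the Mahout job used in the project: the `one_hot` encoder, the choice of existing points as starting centroids, and the field names in the example are all assumptions made here to keep the sketch self-contained.

```python
import random

def one_hot(articles, fields):
    """Turn categorical attributes (assumed encoding) into 0/1 coordinate vectors."""
    vocab = sorted({(f, a[f]) for a in articles for f in fields})
    index = {key: i for i, key in enumerate(vocab)}
    vectors = []
    for a in articles:
        v = [0.0] * len(vocab)
        for f in fields:
            v[index[(f, a[f])]] = 1.0
        vectors.append(v)
    return vectors

def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignments, centroids) for a list of equal-length vectors."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen points rather than random coordinates.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Hypothetical articles described by two of the attributes listed above.
articles = [
    {"Topic": "politics", "Section": "news"},
    {"Topic": "politics", "Section": "news"},
    {"Topic": "sports", "Section": "sport"},
    {"Topic": "sports", "Section": "sport"},
]
assignments, centroids = kmeans(one_hot(articles, ["Topic", "Section"]), k=2)
```

Articles with identical attributes map to the same coordinates, so they always end up in the same cluster.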
Item-Based Recommender
• Build an item-item matrix determining relationships
  between pairs of items (usage)
• Using the matrix and the data on the current user, infer
  their tastes


• We use a dataset containing Customer, Article and
  Rating
   • Since no rating was available, we used a 1 to 5
      scale based on age (a ramped 6-month decay)
• In the output a 0 to 5 scale is calculated, 5 being the
  most highly recommended for this customer
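A rough sketch of the idea, in plain Python rather than Mahout. The linear shape of the decay, the meaning of `age_days` (days since the customer read the article), the cosine similarity measure, and the weighted-average scoring are assumptions chosen for illustration; the slide does not specify them.

```python
from collections import defaultdict
from math import sqrt

def rating_from_age(age_days, horizon_days=180):
    """Assumed ramp: 5.0 for a read today, falling linearly to 1.0 at six months."""
    frac = min(max(age_days, 0), horizon_days) / horizon_days
    return 5.0 - 4.0 * frac

def item_similarities(ratings):
    """ratings: {customer: {article: rating}} -> {(a, b): cosine similarity}."""
    dot = defaultdict(float)
    norm = defaultdict(float)
    for prefs in ratings.values():
        for a, ra in prefs.items():
            norm[a] += ra * ra
            for b, rb in prefs.items():
                if a != b:
                    dot[(a, b)] += ra * rb
    return {(a, b): d / (sqrt(norm[a]) * sqrt(norm[b])) for (a, b), d in dot.items()}

def recommend(ratings, sims, customer, top_n=5):
    """Score unseen articles by a similarity-weighted average of the customer's ratings."""
    seen = ratings[customer]
    num = defaultdict(float)
    den = defaultdict(float)
    for (a, b), s in sims.items():
        if a in seen and b not in seen:
            num[b] += s * seen[a]
            den[b] += s
    scores = {b: num[b] / den[b] for b in num if den[b] > 0}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical usage data: customer -> article -> rating derived from age.
ratings = {
    "c1": {"a1": 5.0, "a2": 4.0},
    "c2": {"a1": 4.0, "a2": 5.0, "a3": 3.0},
    "c3": {"a2": 2.0, "a3": 5.0},
}
recs = recommend(ratings, item_similarities(ratings), "c1")
```

Because each score is a weighted average of existing ratings, it stays within the range of the input ratings.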
Popularity
• Self-join the usage dataset on Customer_ID to pair each article
  with the other articles its readers also read:
  Also_Read_Data = JOIN Readers1 BY Customer_ID,
                        Readers2 BY Customer_ID USING 'merge';
• Group by Article and “Also Read Article”
• Sort descending by the number of distinct peer
  customers
• Limit 25 (most popular “Also Read Article”)
• In the output a 0 to 5 scale is calculated, 5 being the most
  popular for a given article
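The same steps can be mirrored in plain Python to make the logic easier to follow. This is a sketch, not the Pig script from the project; in particular, scaling each count against the top count for the article is an assumed way of producing the 0 to 5 score.

```python
from collections import defaultdict

def also_read(reads, limit=25):
    """reads: iterable of (customer_id, article).
    Returns {article: [(also_read_article, distinct_customers, score_0_to_5), ...]}."""
    by_customer = defaultdict(set)
    for customer, article in reads:
        by_customer[customer].add(article)

    # Self-join on customer: every ordered pair of articles read by the same customer.
    peers = defaultdict(set)
    for customer, articles in by_customer.items():
        for a in articles:
            for b in articles:
                if a != b:
                    peers[(a, b)].add(customer)

    # Group by (article, also-read article) and count distinct peer customers.
    grouped = defaultdict(list)
    for (a, b), customers in peers.items():
        grouped[a].append((b, len(customers)))

    result = {}
    for a, pairs in grouped.items():
        pairs.sort(key=lambda p: (-p[1], p[0]))  # most peers first, ties by name
        top = pairs[:limit]
        best = top[0][1]
        # Assumed scaling: the most popular also-read article scores 5.0.
        result[a] = [(b, n, 5.0 * n / best) for b, n in top]
    return result

# Hypothetical usage rows.
reads = [
    ("c1", "a1"), ("c1", "a2"),
    ("c2", "a1"), ("c2", "a2"), ("c2", "a3"),
    ("c3", "a1"), ("c3", "a3"),
    ("c4", "a1"), ("c4", "a2"),
]
popular = also_read(reads)
```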
Delivering Recommendations
Customer views an article online and we are passed their
Customer ID and the Article they are viewing

We then do the following:
1. K-Means – get all items in the same cluster and calculate
   Euclidean Distance. Reverse and scale 0-5.
2. Item-Based – get all peer recommendations for this customer
3. Popularity – get all popular recommendations for this article
4. Join the three data sets together, add the final rankings and
   bring back the most highly rated articles.

Items recommended by more than 1
algorithm are the most highly rated

[Venn diagram: Item-Based (peers are reading), K-Means (similar) and
Popularity (most popular) overlap; the intersection is labeled
"Best Recommendations"]
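The four delivery steps can be sketched as follows. The function names, the unweighted sum used to "add the final rankings", and the way distance is reversed onto a 0 to 5 scale are assumptions for illustration; the slides do not give the exact formulas.

```python
from collections import defaultdict
from math import sqrt

def similarity_scores(current_vec, cluster_vecs):
    """Step 1: Euclidean distance to each article in the same cluster,
    reversed and scaled so the closest article scores 5 and the farthest 0."""
    dists = {
        article: sqrt(sum((x - y) ** 2 for x, y in zip(current_vec, vec)))
        for article, vec in cluster_vecs.items()
    }
    worst = max(dists.values(), default=0.0)
    if worst == 0:
        return {article: 5.0 for article in dists}
    return {article: 5.0 * (1 - d / worst) for article, d in dists.items()}

def blend(*score_sets, top_n=5):
    """Step 4: outer-join the score sets on article and add the scores, so an
    article recommended by more than one algorithm rises to the top."""
    total = defaultdict(float)
    for scores in score_sets:
        for article, score in scores.items():
            total[article] += score
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical scores from the three algorithms, each on a 0 to 5 scale.
kmeans_scores = similarity_scores([0.0, 0.0], {"a": [0.0, 0.0], "b": [3.0, 4.0]})
item_based_scores = {"a": 3.0, "c": 4.0}
popularity_scores = {"c": 2.0}
best = blend(kmeans_scores, item_based_scores, popularity_scores)
```

Weighting the three inputs differently would only require multiplying each score set by a factor before summing.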
Improvements/Ideas
• Conditionally swap algorithms: Peer recommendations
  can be unwieldy for new users
• Allow users to rate how relevant this recommendation is ->
  retrain the model
• Play with the weighting of current algorithms, evaluate
  others
• Hybrid search platform: Replace or supplement K-Means
  with a search platform
MACHINE LEARNING
   Grant Ingersoll
   President, Lucidworks
   Mahout co-founder
   Lucene/Solr committer
NETWORKING

Big Data Warehousing: Building a Relevance Engine using Hadoop, Mahout, and Pig

  • 1. Big Data Warehousing Meetup Today’s Topic: Building a Relevance Engine using Hadoop, Mahout & Pig Sponsored By:
  • 2. WELCOME! Joe Caserta Founder & President, Caserta Concepts
  • 3. Agenda • 7:00 Networking: Grab a slice of pizza and a drink... • 7:15 Welcome: About the Meetup and about Caserta Concepts (Joe Caserta, President, Caserta Concepts; Author, Data Warehouse ETL Toolkit) • 7:30 Big Data Facts and Figures: Interesting observations from the world of Big Data (Erik Laurence, VP Marketing, Caserta Concepts) • 7:45 Relevance: Building a Big Data recommendation engine with Mahout (Elliott Cordo, Principal Consultant, Caserta Concepts) • 8:15 Machine Learning: Powering large scale data driven real time apps with Apache Solr and Mahout (Grant Ingersoll, Chief Scientist, Lucidworks; Mahout co-founder; Lucene/Solr committer) • 8:45 - 9:00 More Networking: Tell us what you’re up to…
  • 4. About BDW Meetup • Big Data is a complex, rapidly changing landscape • We want to share our stories and hear about yours • Great networking opportunity for like-minded data nerds • Opportunities to collaborate on exciting projects
  • 5. About Caserta Concepts • Focused Expertise: Big Data Analytics, Data Warehousing, Business Intelligence, Strategic Data Ecosystems • Industries Served: Financial Services, Healthcare / Insurance, Retail / eCommerce, Digital Media / Marketing, K-12 / Higher Education • Founded in 2001 • President: Joe Caserta, industry thought leader, consultant, educator and co-author, The Data Warehouse ETL Toolkit (Wiley, 2004)
  • 6. Client Portfolio • Finance & Insurance • Retail/eCommerce & Manufacturing • Education & Services
  • 7. Expertise & Offerings • Strategic Roadmap/Assessment/Consulting • Big Data Analytics • Data Warehousing/ETL/Data Integration • BI/Visualization/Analytics • Master Data Management
  • 8. Big Data at Caserta Concepts Caserta Concepts is a blend of the best designers in traditional enterprise data with the best new designers in Big Data. • Traditional Data • Tools: RDBMS, DQ, MDM, BI, ETL, Analytics • Traditional Data: Transactions • Big Data • Tools: Hadoop, Mahout, Relevance Engine, Analytics • New Data: Social, Machine, Deep History, Unstructured • Immutable Data Concepts • Transformation • Profiling • Conforming • Processing Efficiency/Speed
  • 9. Contacts • Joe Caserta, President & Founder, Caserta Concepts, P: (855) 755-2246 x227, E: joe@casertaconcepts.com • Erik Laurence, VP Marketing, Caserta Concepts, P: (855) 755-2246 x528, E: erik@casertaconcepts.com • Elliott Cordo, Principal Consultant, Caserta Concepts, P: (855) 755-2246 x267, E: elliott@casertaconcepts.com • General: info@casertaconcepts.com, 1(855) 755-2246, www.casertaconcepts.com
  • 10. BIG DATA FACTS AND FIGURES Erik Laurence VP Marketing, Caserta Concepts
  • 11. What is Really Meant by Big Data? • The 4 Vs of Big Data • Volume • More data than ever before • Most of world’s data is unstructured, semi-structured or multi-structured (chart: 10% Structured, 90% Un/Semi/Multi-Structured) • Variety • More sources than ever before • Social, web logs, machine logs, documents, geotags, video, … • Velocity • Some data only has value for a short period of time • Relevance engines, financial fraud sensors, early warning sensors, etc. • Vitality • Agility is required in analytics • Adapt quickly to changing business needs
  • 12. Enterprise Involvement with Big Data • 6% Beyond Pilot Stage • 18% Engaged in Pilot • 76% Not Yet Involved • Awareness of Big Data high among enterprises, but three-quarters still wondering, “What is this all about?” • Answer across all businesses, “We don't know what the business case is.” Source: WSJ November 29, 2012
  • 13. Business Cases Have Been Identified “The use of data and analytics …is going to be a basis of competition going forward for individual firms, for sectors and even for countries. Those companies that are able to use data effectively are more likely to win in the marketplace.” - Michael Chui, McKinsey Global Institute In just one field, personal location data, $100 billion of value can be created globally for service providers through use of data. Benefits for consumers could be six times that. Source: (WSJ 11/29/12)
  • 14. Big Data Played A Role in the Election “This was the first presidential election campaign where all of the data that was coming into the campaign was successfully collected and centralized. The Obama campaign did a successful job with that; the Obama campaign hired an analytics department five times as large as that of the 2008 operation. Romney campaign did not.” - John Aristotle Phillips, Chief Executive of Aristotle International (WSJ 11/29/12)
  • 15. Big Data Example in Obama Campaign • $40k-a-head dinner in June at Sarah Jessica Parker’s home in NYC • 7 different versions of the email solicitation for the event • Some mentioned a 2nd fundraiser that night, a Mariah Carey concert • Some said Ms. Parker is a mother • Some said Vogue editor Anna Wintour would be at the dinner • Who got which email depended on big data • Profile info about each prospect • How they react to different messages • Campaign created a single massive system to join info from Democratic voter files to pollsters, fundraisers, field workers and consumer databases, social-media, and mobile contacts Sources: WSJ, Time Magazine
  • 16. Hadoop Market: Growing & Evolving • Big data outranks virtualization as #1 trend driving spending initiatives • Barclays CIO Survey, April 2012 • Overall market at $100B • Hadoop 2nd only to RDBMS in potential • Estimates put market growth at > 40% CAGR • IDC expects Big Data tech and services market to grow to $16.9B in 2015 • According to JPMC 50% of Big Data market will be influenced by Hadoop
  • 17. Hadoop Cost Effective for Archiving • Hadoop is orders of magnitude cheaper than traditional archival methods • Annual cost of 1 TB of archival storage for a credit card company • Tape: $30,000 • SAN: $3,000 • Hadoop: $300
  • 18. Hadoop is Fast • Sears' process to analyze loyalty club marketing campaigns took six weeks on mainframe, Teradata, and SAS servers • In retail, that’s half the season! • New process on Hadoop is done weekly • For online and mobile, daily analysis is done • What’s more, old models used 10% of data, new models use all the data • Source: Information Week (October 31, 2012)
  • 19. BUILDING A RECOMMENDATION ENGINE Elliott Cordo Principal Consultant, Caserta Concepts
  • 20. Recommendations • Your customers expect them • Good recommendations make life easier • Help them find information, products, and services they might not have thought of • What makes a good recommendation? • Relevant but not obvious • Sense of “surprise”
  • 21. Where can recommendation engines be found? • They appear in a wide variety of industries and applications: • Travel • Service Industry • Music/Online radio • TV and Video • Online Publications • Retail ...and countless others
  • 22. Our Use Case: Online Magazine Goals: • Serve customers recommendations based on what their peers are reading. • Recommendation must have context to the article they are currently viewing.
  • 23. Technical Details Core Platform: • Cloudera Hadoop Cluster • Mahout Machine Learning Library • Apache Pig Additional Technology: • Talend Big Data Edition (ETL to/from relational) • Datameer (Analysis and Visualization)
  • 24. How we did it Solution leverages three main algorithms: • Mahout K-Means – identifying groups of similar articles • Mahout Item-Based Recommender – recommendations based on peer behavior • Raw Popularity – custom Pig script “people who read this article also read..”
  • 25. K-Means • Treats items as coordinates • Places a number of random “centroids” and assigns the nearest items • Moves the centroids around based on average location • Process repeats until the assignments stop changing We used the major attributes of the articles to create coordinate points: Author, Topic, Section, Region, Media, etc. *Diagram from Programming Collective Intelligence by Toby Segaran
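The loop described on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not Mahout's implementation: the one-hot encoding of article attributes, the sample articles, the helper names (`one_hot`, `sq_dist`, `kmeans`) and the seeded random start are all assumptions made for the example.

```python
import random

def one_hot(articles, fields):
    """Turn categorical attributes into 0/1 coordinates, one axis per distinct (field, value)."""
    axes = sorted({(f, a[f]) for a in articles for f in fields})
    return [[1.0 if a[f] == v else 0.0 for f, v in axes] for a in articles]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Start from k randomly chosen points as centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical articles described by a few categorical attributes.
articles = [
    {"id": "a1", "author": "smith", "topic": "politics", "section": "news"},
    {"id": "a2", "author": "jones", "topic": "politics", "section": "news"},
    {"id": "a3", "author": "lee", "topic": "sports", "section": "scores"},
    {"id": "a4", "author": "kim", "topic": "sports", "section": "scores"},
]
points = one_hot(articles, ["author", "topic", "section"])
assignment, centroids = kmeans(points, k=2)
print(dict(zip((a["id"] for a in articles), assignment)))
```

With this toy data the two politics articles land in one cluster and the two sports articles in the other; which cluster gets label 0 depends on the random start.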
  • 26. Item-Based Recommender • Build an item-item matrix determining relationships between pairs of items (usage) • Using the matrix, and the data on the current user, infer his taste • We use a dataset containing Customer, Article and Rating • Since no rating was available we used a 1 to 5 scale based on age (a ramped 6 month decay) • In the output a 0 to 5 scale is calculated, 5 being the most highly recommended for this customer
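A rough sketch of the two ideas on this slide, deriving a rating from age and scoring unread articles from item-to-item relationships, might look like the following. It is not Mahout's recommender: the linear ramp in `rating_from_age` (5 for a read today, 1 at 180 days or older), the use of cosine similarity, the similarity-weighted average, and the sample data are all assumptions chosen for illustration, and the slide does not say exactly how the decay was ramped.

```python
from collections import defaultdict
from math import sqrt

def rating_from_age(age_days, window_days=180):
    """Hypothetical ramp: 5.0 for a read today, falling linearly to 1.0 at window_days and beyond."""
    frac = min(max(age_days, 0), window_days) / window_days
    return 5.0 - 4.0 * frac

def item_similarities(ratings):
    """ratings: {customer: {article: rating}} -> {(a, b): cosine similarity} for co-read pairs."""
    by_item = defaultdict(dict)
    for cust, items in ratings.items():
        for art, r in items.items():
            by_item[art][cust] = r
    norms = {a: sqrt(sum(r * r for r in v.values())) for a, v in by_item.items()}
    sims = {}
    for a in by_item:
        for b in by_item:
            if a == b:
                continue
            common = by_item[a].keys() & by_item[b].keys()
            dot = sum(by_item[a][c] * by_item[b][c] for c in common)
            if dot > 0:
                sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims

def recommend(customer, ratings, sims):
    """Score unread articles as a similarity-weighted average of the customer's own ratings."""
    seen = ratings[customer]
    candidates = {b for (a, b) in sims if a in seen and b not in seen}
    scores = {}
    for b in candidates:
        pairs = [(sims[(a, b)], r) for a, r in seen.items() if (a, b) in sims]
        scores[b] = sum(s * r for s, r in pairs) / sum(s for s, _ in pairs)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical usage data: (customer, article, age of the read in days).
reads = [
    ("alice", "a1", 0), ("alice", "a2", 90),
    ("bob", "a1", 10), ("bob", "a3", 0),
    ("carol", "a2", 30), ("carol", "a3", 180),
]
ratings = defaultdict(dict)
for cust, art, age in reads:
    ratings[cust][art] = rating_from_age(age)

sims = item_similarities(ratings)
print(recommend("alice", ratings, sims))
```

Because the prediction is an average of ratings between 1 and 5, scores here stay within that range; the 0 to 5 output scale mentioned on the slide is not reproduced by this sketch.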
  • 27. Popularity • Self join usage dataset on Customer_ID, pairing each article with the other articles read by the same customer: Also_Read_Data = join Readers1 by Customer_ID, Readers2 by Customer_ID using 'merge'; • Group based on Article, “Also Read Article” • Sort descending based on the number of distinct peer customers • Limit 25 (most popular “Also Read Article”) • In the output a 0 to 5 scale is calculated, 5 being the most popular for a given article
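The same steps can be mirrored in Python to make the data flow easier to follow. This is only a sketch of the logic, not the Pig script used in the project: the function name `also_read_popularity`, the sample reads, and the choice to scale by dividing each count by the top count for that article are assumptions.

```python
from collections import defaultdict

def also_read_popularity(reads, limit=25):
    """reads: iterable of (customer_id, article).
    Returns {article: [(also_read_article, score_0_to_5), ...]}, most popular first."""
    by_customer = defaultdict(set)
    for cust, art in reads:
        by_customer[cust].add(art)

    # Self join on customer: every ordered pair of distinct articles read by the same customer.
    peers = defaultdict(set)  # (article, also_read) -> distinct customers who read both
    for cust, arts in by_customer.items():
        for a in arts:
            for b in arts:
                if a != b:
                    peers[(a, b)].add(cust)

    # Group by article and count distinct peer customers per "also read" article.
    grouped = defaultdict(list)
    for (a, b), custs in peers.items():
        grouped[a].append((b, len(custs)))

    result = {}
    for a, pairs in grouped.items():
        # Sort descending by peer count (ties broken by article id), then keep the top `limit`.
        pairs.sort(key=lambda p: (-p[1], p[0]))
        top = pairs[:limit]
        best = top[0][1]
        # Assumed scaling: the most popular "also read" article scores 5.0.
        result[a] = [(b, 5.0 * n / best) for b, n in top]
    return result

# Hypothetical usage data.
reads = [
    ("c1", "a1"), ("c1", "a2"),
    ("c2", "a1"), ("c2", "a2"), ("c2", "a3"),
    ("c3", "a1"), ("c3", "a3"),
    ("c4", "a1"), ("c4", "a2"),
]
popularity = also_read_popularity(reads)
print(popularity["a1"])
```

For article a1, three customers also read a2 and two also read a3, so a2 scores 5.0 and a3 scores two thirds of that.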
  • 28. Delivering Recommendations Customer views an article online and we are passed their Customer ID and the Article they are viewing. We then do the following: 1. K-Means – get all items in the same cluster and calculate Euclidean Distance. Reverse and scale 0-5. 2. Item-Based – get all peer recommendations for this customer 3. Popularity – get all popular recommendations for this article 4. Join the three data sets together, add the final rankings and bring back the most highly rated articles. Diagram labels: Item-Based: Peers are reading • K-Means: Similar • Popularity: Most popular
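The four steps above can be sketched as a small blending routine. The slide does not give the exact formulas, so everything here is an assumption for illustration: `distance_to_score` reverses distances linearly so the nearest item scores 5 and the farthest scores 0, and `blend` treats "add the final rankings" as a plain sum over a full outer join of the three score sets.

```python
def distance_to_score(distances):
    """Reverse and scale Euclidean distances to 0-5: nearest item near 5, farthest item 0."""
    if not distances:
        return {}
    farthest = max(distances.values())
    if farthest == 0:
        return {item: 5.0 for item in distances}
    return {item: 5.0 * (1 - d / farthest) for item, d in distances.items()}

def blend(kmeans_scores, item_scores, popularity_scores, top_n=5):
    """Join the three score sets on article id, sum the scores, return the top_n articles."""
    totals = {}
    for scores in (kmeans_scores, item_scores, popularity_scores):
        for article, score in scores.items():
            totals[article] = totals.get(article, 0.0) + score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical inputs for one customer viewing one article.
cluster_distances = {"a2": 1.0, "a3": 2.0, "a4": 4.0}  # same-cluster items, distance to viewed article
item_based = {"a3": 4.0, "a5": 3.0}                    # peer recommendations for this customer
popularity = {"a3": 5.0, "a2": 2.0}                    # "also read" scores for this article

ranked = blend(distance_to_score(cluster_distances), item_based, popularity)
print(ranked)
```

Summing means an article suggested by several algorithms naturally rises to the top, which matches the idea that items recommended by more than one algorithm are the most highly rated.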
  • 29. Items recommended by more than 1 algorithm are the most highly rated. Diagram labels: Item-Based: Peers are reading • K-Means: Similar • Popularity: Most popular • Best Recommendations
  • 30. Improvements/Ideas • Conditionally swap algorithms: Peer recommendations can be unwieldy for new users • Allow users to rate how relevant this recommendation is -> retrain the model • Play with the weighting of current algorithms, evaluate others • Hybrid search platform: Replace or supplement K-Means with Search platform
  • 31. MACHINE LEARNING Grant Ingersoll President, Lucidworks Mahout co-founder Lucene/Solr committer