Spotting Hadoop in the wild
                         Practical use cases from Last.fm and Massive Media


                                             @klbostee



Thursday 12 January 12
• “Data scientist is a job title for an
                         employee who analyses data, particularly
                         large amounts of it, to help a business gain a
                         competitive edge” —WhatIs.com
                    • “Someone who can obtain, scrub, explore,
                         model and interpret data, blending
                         hacking, statistics and machine
                         learning” —Hilary Mason, bit.ly


• 2007: Started using Hadoop as PhD student
                    • 2009: Data & Scalability Engineer at Last.fm
                    • 2011: Data Scientist at Massive Media
                    • Created Dumbo, a Python API for Hadoop
                    • Contributed some code to Hadoop itself
                    • Organized several HUGUK meetups
What are those yellow things?




Core principles


                    • Distributed
                    • Fault tolerant
                    • Sequential reads and writes
                    • Data locality

Pars pro toto

                    [Figure: the Hadoop ecosystem stack, showing Pig, Hive, HBase, ZooKeeper, MapReduce and HDFS]

                    Hadoop itself is basically the kernel that
                    provides a file system and task scheduler


Hadoop file system

                    [Figure: File A and File B drawn as sequences of blocks stored across three DataNodes, with labels contrasting a Linux block and a Hadoop block]

                    No random writes!
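The block picture can be made concrete with a little arithmetic. The sketch below is a toy model, not HDFS code: the 64 MB block size, the node names, and the round-robin placement are all assumptions chosen for illustration. Real HDFS also replicates each block and applies its own placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size (64 MB), purely illustrative
DATANODES = ["datanode-1", "datanode-2", "datanode-3"]  # hypothetical node names


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of `file_size` bytes occupies."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(block_sizes, datanodes=DATANODES):
    """Assign blocks to nodes round-robin (a toy policy, not the HDFS one)."""
    return [(datanodes[i % len(datanodes)], size)
            for i, size in enumerate(block_sizes)]


file_a = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
placement = place_blocks(file_a)
```

Under these assumptions the 200 MB file becomes three full blocks plus one 8 MB remainder, and the blocks end up spread over all three nodes.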




Hadoop task scheduler

                    [Figure: Job A and Job B are each split into tasks that run on three TaskTrackers, each TaskTracker drawn on top of a DataNode]




Some practical tips


                    • Install a distribution
                    • Use compression
                    • Consider increasing your block size
                    • Watch out for small files
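To see why block size and small files matter, it can help to count blocks. This is a rough sketch under stated assumptions: the file sizes and the two block sizes are made up, and the helper `count_blocks` is hypothetical. Every block is an object the file system has to track, and by default a MapReduce job usually schedules roughly one map task per block.

```python
MB = 1024 * 1024


def count_blocks(file_sizes, block_size):
    """Total number of blocks needed to store files of the given sizes."""
    # -(-a // b) is integer ceiling division
    return sum(-(-size // block_size) for size in file_sizes)


many_small_files = [1 * MB] * 10000   # 10,000 files of 1 MB each
one_large_file = [10000 * MB]         # the same data in a single file

small_at_64 = count_blocks(many_small_files, 64 * MB)
large_at_64 = count_blocks(one_large_file, 64 * MB)
large_at_128 = count_blocks(one_large_file, 128 * MB)
```

With these numbers the same amount of data needs 10,000 blocks as small files, 157 blocks as one file with 64 MB blocks, and 79 blocks with 128 MB blocks.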

HBase

                    [Figure: the same ecosystem stack: Pig, Hive, HBase, ZooKeeper, MapReduce and HDFS]

                    HBase is a database on top of HDFS that
                    can easily be accessed from MapReduce


Data model

                                        Column family A         Column family B
                    Row keys (sorted)   Column X    Column Y    Column U    Column V
                    ...                 ...         ...         ...         ...

                    • Configurable number of versions per cell
                    • Each cell version has a timestamp
                    • TTL can be specified per column family
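The data model can be sketched as nested dictionaries. This is a toy illustration, not the HBase client API: the class `ToyTable` and its method names are hypothetical, timestamps are passed in by hand, and TTL is left out to keep the example short.

```python
from collections import defaultdict


class ToyTable:
    """Rows sorted by key; each cell keeps a bounded list of timestamped versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # row key -> (family, column) -> list of (timestamp, value), newest first
        self.rows = defaultdict(dict)

    def put(self, row, family, column, value, timestamp):
        versions = self.rows[row].setdefault((family, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, family, column):
        """Return the newest value of a cell, or None if the cell is empty."""
        versions = self.rows.get(row, {}).get((family, column), [])
        return versions[0][1] if versions else None

    def scan(self):
        """Yield rows in sorted row-key order."""
        for row in sorted(self.rows):
            yield row, self.rows[row]


table = ToyTable(max_versions=2)
table.put("row-b", "A", "X", "v1", timestamp=1)
table.put("row-b", "A", "X", "v2", timestamp=2)
table.put("row-b", "A", "X", "v3", timestamp=3)
table.put("row-a", "B", "U", "u1", timestamp=1)
```

Here the cell `A:X` of `row-b` keeps only its two newest versions, and a scan returns `row-a` before `row-b` regardless of insertion order.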


Random becomes sequential

                    [Figure: incoming KeyValues are appended to the Commit log (sequential write) and collected in the sorted Memstore, whose contents are then written to HDFS (sequential write)]

                    High write throughput!
                    + efficient scans
                    + free empty cells
                    + no fragmentation
                    + ...
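The write path in this diagram can be mimicked in a few lines. This is a minimal sketch under stated assumptions, not HBase code: `ToyRegionStore` is a hypothetical name, the flush threshold is counted in entries rather than bytes, Python lists stand in for the log and for files in HDFS, and the buffer is only sorted at flush time whereas the diagram shows a memstore that is kept sorted.

```python
class ToyRegionStore:
    """Random writes turned into two kinds of sequential writes."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []     # append-only, stands in for the sequential log write
        self.memstore = {}       # in-memory buffer of recent writes
        self.flushed_files = []  # each flush produces one immutable, sorted file
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.commit_log.append((key, value))  # sequential write number one
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered entries out in sorted key order, then clear the buffer."""
        if self.memstore:
            self.flushed_files.append(sorted(self.memstore.items()))  # sequential write number two
            self.memstore = {}


store = ToyRegionStore(flush_threshold=3)
for key in ["row9", "row2", "row5", "row1"]:
    store.put(key, "value-for-" + key)
```

Keys arrive in arbitrary order, but the flushed file holds `row2`, `row5`, `row9` in sorted order, while `row1` still sits in the memstore.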



Horizontal scaling

                    [Figure: the sorted row key space is split into Regions, which are distributed over three RegionServers]

                    • Each region has its own commit log and memstores
                    • Moving regions is easy since the data is all in HDFS
                    • Strong consistency as each region is served only once
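Because row keys are sorted, finding the region for a key amounts to a binary search over the region start keys. The sketch below only illustrates that idea: the split points, the server names, and the region-to-server assignment are hypothetical, and this is not how the HBase client is implemented.

```python
import bisect

# Hypothetical split points: region i holds keys in [start_keys[i], start_keys[i + 1])
REGION_START_KEYS = ["", "g", "p"]
REGION_SERVERS = ["regionserver-1", "regionserver-2", "regionserver-3"]


def find_region(row_key, start_keys=REGION_START_KEYS):
    """Return the index of the region whose key range contains `row_key`."""
    return bisect.bisect_right(start_keys, row_key) - 1


def server_for(row_key):
    """Each region is served by exactly one server in this toy assignment."""
    return REGION_SERVERS[find_region(row_key) % len(REGION_SERVERS)]
```

For example, `find_region("apple")` falls in the first region, `find_region("g")` in the second, and `find_region("zebra")` in the third.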

Some practical tips

                    • Restrict the number of regions per server
                    • Restrict the number of column families
                    • Use compression
                    • Increase file descriptor limits on nodes
                    • Use a large enough buffer when scanning

Look, a herd of Hadoops!




• “Last.fm lets you effortlessly keep a record
                         of what you listen to from any player. Based
                         on your taste, Last.fm recommends you
                         more music and concerts” —Last.fm
                    • Over 60 billion tracks scrobbled since 2003
                    • Started using Hadoop in 2006, before Yahoo

• “Massive Media is the social media
                         company behind the successful digital
                         brands Netlog.com and Twoo.com.
                         We enable members to meet nearby
                         people instantly” —MassiveMedia.eu
                    • Over 80 million users on web and mobile
                    • Using Hadoop for about a year now
Hadoop adoption

                                                     Last.fm   Massive Media
                    1. Business intelligence            √            √
                    2. Testing and experimentation      √            √
                    3. Fraud and abuse detection        √            √
                    4. Product features                 √            √
                    5. PR and marketing                 √



Business intelligence

Testing and experimentation

Fraud and abuse detection

Product features

PR and marketing




Let’s dive into the first use case!




Goals and requirements

                    • Timeseries graphs of 1000 or so metrics
                    • Segmented over about 10 dimensions
                    1. Scale with very large number of events
                    2. History for graphs must be long enough
                    3. Accessing the graphs must be instantaneous
                    4. Possibility to analyse in detail when needed


Attempt #1

                    • Log table in MySQL
                    • Generate graphs from this table on-the-fly
                    1. Large number of events      √
                    2. Long enough history         ✗
                    3. Instantaneous access        ✗
                    4. Analyse in detail           √

Attempt #2

                    • Counters in MySQL table
                    • Update counters on every event
                    1. Large number of events      ✗
                    2. Long enough history         √
                    3. Instantaneous access        √
                    4. Analyse in detail           ✗

Attempt #3

                    • Put log files in HDFS through syslog-ng
                    • MapReduce on logs and write to HBase
                    1. Large number of events      √
                    2. Long enough history         √
                    3. Instantaneous access        √
                    4. Analyse in detail           √
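The "MapReduce on logs" step of Attempt #3 can be pictured as a counting job. The sketch below runs the map and reduce phases locally in plain Python, without Hadoop. The tab-separated log format, the field names, and the hourly granularity are all assumptions made for illustration, and the final write to HBase is left out.

```python
from collections import defaultdict


def mapper(line):
    """Emit one count per event, keyed by metric, segments and hour."""
    timestamp, metric, language, country = line.rstrip("\n").split("\t")
    hour = timestamp[:13]  # "2012-01-12T10:15:00" -> "2012-01-12T10"
    yield (metric, language, country, hour), 1


def reducer(key, values):
    """Sum all counts that share the same key."""
    yield key, sum(values)


def run_job(lines):
    """Local stand-in for the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results


log_lines = [
    "2012-01-12T10:15:00\tsignup\ten\tGB",
    "2012-01-12T10:48:30\tsignup\ten\tGB",
    "2012-01-12T11:02:10\tsignup\tnl\tBE",
]
counts = run_job(log_lines)
```

The raw log lines stay available for detailed analysis, while the aggregated counts are small enough to serve graphs quickly.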

Architecture

                    [Figure: data flows from Syslog-ng into HDFS, through MapReduce, and into HBase; side branches are labelled "Realtime processing" and "Ad-hoc results"]


HBase schema

                    • Separate table for each time granularity
                    • Global segmentations in row keys
                         • <language>||<country>||...|||<timestamp>
                         • * for “not specified”
                         • trailing *s are omitted
                    • Further segmentations in column keys
                         • e.g. payments_via_paypal, payments_via_sms
                    • Related metrics in same column family
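A row key following this pattern could be assembled as below. This is a sketch of one possible reading of the slide, not the actual production code: the helper `build_row_key` is hypothetical, and it assumes `||` separates segments, `|||` precedes the timestamp, and an unspecified segment is passed as `None`.

```python
def build_row_key(segments, timestamp):
    """Build a row key from ordered global segment values and a timestamp.

    `segments` lists the values in a fixed dimension order (language, country, ...);
    None means "not specified" and becomes "*"; trailing "*"s are dropped.
    """
    parts = ["*" if value is None else str(value) for value in segments]
    while parts and parts[-1] == "*":
        parts.pop()
    return "||".join(parts) + "|||" + str(timestamp)


key_full = build_row_key(["en", "BE"], 1326326400)
key_trailing = build_row_key(["en", None], 1326326400)
key_leading = build_row_key([None, "BE"], 1326326400)
```

Under these assumptions the three keys come out as `en||BE|||1326326400`, `en|||1326326400`, and `*||BE|||1326326400`.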
Questions?



IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 


Spotting Hadoop in the wild

  • 1. Spotting Hadoop in the wild Practical use cases from Last.fm and Massive Media @klbostee Thursday 12 January 12
  • 2. • “Data scientist is a job title for an employee who analyses data, particularly large amounts of it, to help a business gain a competitive edge” —WhatIs.com • “Someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning” —Hilary Mason, bit.ly
  • 3. • 2007: Started using Hadoop as PhD student • 2009: Data & Scalability Engineer at Last.fm • 2011: Data Scientist at Massive Media
  • 4. • 2007: Started using Hadoop as PhD student • 2009: Data & Scalability Engineer at Last.fm • 2011: Data Scientist at Massive Media • Created Dumbo, a Python API for Hadoop • Contributed some code to Hadoop itself • Organized several HUGUK meetups
  • 5. What are those yellow things?
  • 6. Core principles • Distributed • Fault tolerant • Sequential reads and writes • Data locality
  • 7. Pars pro toto [diagram: Pig, Hive, HBase, ZooKeeper, MapReduce, HDFS] Hadoop itself is basically the kernel that provides a file system and task scheduler
  • 8. Hadoop file system [diagram: three DataNodes]
  • 9. Hadoop file system [diagram: File A stored as blocks across three DataNodes]
  • 10. Hadoop file system [diagram: File A and File B stored as blocks across three DataNodes]
  • 11. Hadoop file system [diagram: File A and File B stored as blocks across three DataNodes; legend: Linux block, Hadoop block]
  • 12. Hadoop file system [diagram: File A and File B stored as blocks across three DataNodes; legend: Linux block, Hadoop block] No random writes!
  • 13. Hadoop task scheduler [diagram: three TaskTrackers, each paired with a DataNode]
  • 14. Hadoop task scheduler [diagram: tasks of Job A on three TaskTrackers, each paired with a DataNode]
  • 15. Hadoop task scheduler [diagram: tasks of Job A and Job B on three TaskTrackers, each paired with a DataNode]
  • 16. Some practical tips • Install a distribution • Use compression • Consider increasing your block size • Watch out for small files
  • 17. HBase [diagram: Pig, Hive, HBase, ZooKeeper, MapReduce, HDFS] HBase is a database on top of HDFS that can easily be accessed from MapReduce
  • 18. Data model [table: row keys; Column family A with Column X and Column Y; Column family B with Column U and Column V]
  • 19. Data model [table: sorted row keys; Column family A with Column X and Column Y; Column family B with Column U and Column V]
  • 20. Data model [table: sorted row keys; Column family A with Column X and Column Y; Column family B with Column U and Column V] • Configurable number of versions per cell • Each cell version has a timestamp • TTL can be specified per column family
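The three data-model properties on slide 20 can be illustrated with a small in-memory toy. This is only a sketch in plain Python, not the HBase API: the class `ColumnFamily` and its parameters `max_versions` and `ttl_seconds` are hypothetical names chosen for illustration.

```python
import time
from collections import defaultdict


class ColumnFamily:
    """Toy in-memory model of one column family (not the HBase API)."""

    def __init__(self, max_versions=3, ttl_seconds=None):
        self.max_versions = max_versions  # configurable number of versions per cell
        self.ttl_seconds = ttl_seconds    # TTL applies to the whole column family
        # (row key, column) -> list of (timestamp, value), newest first
        self.cells = defaultdict(list)

    def put(self, row, column, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        versions = self.cells[(row, column)]
        versions.append((ts, value))
        # Keep versions ordered newest first, then drop the oldest extras.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, column, now=None):
        """Return the newest version that has not outlived the TTL, or None."""
        now = time.time() if now is None else now
        for ts, value in self.cells.get((row, column), []):
            if self.ttl_seconds is None or now - ts <= self.ttl_seconds:
                return value
        return None
```

For example, with `max_versions=2` a third `put` on the same cell silently discards the oldest version, and a `get` issued after `ttl_seconds` have passed returns nothing.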
  • 21. Random becomes sequential [diagram: KeyValues in the commit log and in the sorted memstore, with HDFS below]
  • 22. Random becomes sequential [diagram: a new KeyValue arrives; commit log, sorted memstore, HDFS]
  • 23. Random becomes sequential [diagram: the new KeyValue is appended to the commit log (sequential write); sorted memstore, HDFS]
  • 24. Random becomes sequential [diagram: the new KeyValue is appended to the commit log (sequential write) and placed in the sorted memstore; HDFS]
  • 25. Random becomes sequential [diagram: commit log append (sequential write); sorted memstore written out to HDFS (sequential write)]
  • 26. Random becomes sequential [diagram: commit log append (sequential write); sorted memstore written out to HDFS (sequential write)] High write throughput!
  • 27. Random becomes sequential [diagram: commit log append (sequential write); sorted memstore written out to HDFS (sequential write)] High write throughput! + efficient scans + free empty cells + no fragmentation + ...
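The write path sketched on slides 21 to 27 can be mimicked in a few lines of plain Python. This is a toy under stated assumptions, not HBase code: `ToyStore`, `flush_threshold` and the in-memory lists standing in for the commit log and for files in HDFS are all hypothetical.

```python
import bisect


class ToyStore:
    """Toy write path: append-only log, sorted memstore, sequential flushes."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []   # append-only, stands in for the commit log
        self.memstore = []     # list of (key, value) kept sorted by key
        self.store_files = []  # immutable sorted files, oldest first (stand-in for HDFS)

    def put(self, key, value):
        self.commit_log.append((key, value))  # sequential write no. 1
        i = bisect.bisect_left(self.memstore, (key,))
        if i < len(self.memstore) and self.memstore[i][0] == key:
            self.memstore[i] = (key, value)   # overwrite in memory only
        else:
            self.memstore.insert(i, (key, value))
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Sequential write no. 2: the memstore is already sorted,
        # so it is written out front to back in one pass.
        self.store_files.append(tuple(self.memstore))
        self.memstore = []
        self.commit_log = []  # flushed entries no longer need the log

    def get(self, key):
        # Newest data first: memstore, then store files from newest to oldest.
        for source in [self.memstore] + list(reversed(self.store_files)):
            i = bisect.bisect_left(source, (key,))
            if i < len(source) and source[i][0] == key:
                return source[i][1]
        return None
```

Random puts never touch existing files: they only append to the log and update memory, which is where the write throughput comes from. Because every flushed file is sorted, range scans read it sequentially as well.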
  • 28. Horizontal scaling [diagram: sorted row keys]
  • 29. Horizontal scaling [diagram: sorted row keys]
  • 30. Horizontal scaling [diagram: sorted row keys; one Region on one RegionServer]
  • 31. Horizontal scaling [diagram: sorted row keys split into Regions, which are spread over three RegionServers]
  • 32. Horizontal scaling [diagram: sorted row keys split into Regions, which are spread over three RegionServers] • Each region has its own commit log and memstores • Moving regions is easy since the data is all in HDFS • Strong consistency as each region is served only once
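Because row keys are sorted, finding the region for a key is a binary search over region start keys. The sketch below assumes a hypothetical `RegionMap` with made-up split keys and a round-robin assignment of regions to servers; real region placement is more involved.

```python
import bisect


class RegionMap:
    """Toy lookup from row key to region and RegionServer."""

    def __init__(self, split_keys, servers):
        # Region i covers [start_keys[i], start_keys[i + 1]);
        # the first region starts at the empty key.
        self.start_keys = [""] + sorted(split_keys)
        # Assumption: regions are handed out round-robin. Each region is
        # assigned to exactly one server, so a row has a single owner.
        self.assignment = [servers[i % len(servers)]
                           for i in range(len(self.start_keys))]

    def region_for(self, row_key):
        # Index of the last region whose start key is <= row_key.
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def server_for(self, row_key):
        return self.assignment[self.region_for(row_key)]

    def move_region(self, region, server):
        # Only the assignment changes; in the slides' model the data stays in HDFS.
        self.assignment[region] = server
```

With split keys `["g", "p"]`, the key `"hello"` falls in region 1, and moving that region to another server changes nothing but one entry in `assignment`.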
  • 33. Some practical tips • Restrict the number of regions per server • Restrict the number of column families • Use compression • Increase file descriptor limits on nodes • Use a large enough buffer when scanning
  • 34. Look, a herd of Hadoops!
  • 35. • “Last.fm lets you effortlessly keep a record of what you listen to from any player. Based on your taste, Last.fm recommends you more music and concerts” —Last.fm • Over 60 billion tracks scrobbled since 2003 • Started using Hadoop in 2006, before Yahoo
  • 36. • “Massive Media is the social media company behind the successful digital brands Netlog.com and Twoo.com. We enable members to meet nearby people instantly” —MassiveMedia.eu • Over 80 million users on web and mobile • Using Hadoop for about a year now
  • 37. Hadoop adoption 1. Business intelligence 2. Testing and experimentation 3. Fraud and abuse detection 4. Product features 5. PR and marketing
  • 38. Hadoop adoption [column: Last.fm] 1. Business intelligence √ 2. Testing and experimentation √ 3. Fraud and abuse detection √ 4. Product features √ 5. PR and marketing √
  • 39. Hadoop adoption [columns: Last.fm, Massive Media] 1. Business intelligence √ √ 2. Testing and experimentation √ √ 3. Fraud and abuse detection √ √ 4. Product features √ √ 5. PR and marketing √ (Last.fm only)
  • 42. Fraud and abuse detection
  • 43. Fraud and abuse detection
  • 45. PR and marketing
  • 46. Let’s dive into the first use case!
  • 47. Goals and requirements • Timeseries graphs of 1000 or so metrics • Segmented over about 10 dimensions
  • 48. Goals and requirements • Timeseries graphs of 1000 or so metrics • Segmented over about 10 dimensions 1. Scale with very large number of events 2. History for graphs must be long enough 3. Accessing the graphs must be instantaneous 4. Possibility to analyse in detail when needed
  • 49. Attempt #1 • Log table in MySQL • Generate graphs from this table on-the-fly
  • 50. Attempt #1 • Log table in MySQL • Generate graphs from this table on-the-fly 1. Large number of events √ 2. Long enough history ✗ 3. Instantaneous access ✗ 4. Analyse in detail √
  • 51. Attempt #2 • Counters in MySQL table • Update counters on every event
  • 52. Attempt #2 • Counters in MySQL table • Update counters on every event 1. Large number of events ✗ 2. Long enough history √ 3. Instantaneous access √ 4. Analyse in detail ✗
  • 53. Attempt #3 • Put log files in HDFS through syslog-ng • MapReduce on logs and write to HBase
  • 54. Attempt #3 • Put log files in HDFS through syslog-ng • MapReduce on logs and write to HBase 1. Large number of events √ 2. Long enough history √ 3. Instantaneous access √ 4. Analyse in detail √
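The "MapReduce on logs" step of Attempt #3 boils down to counting events per metric, segment and time bucket. The sketch below shows that shape in plain Python with a local stand-in for the shuffle. The log line format, the hourly bucket and the function names are assumptions for illustration; the slides do not show the actual job, and writing the results to HBase is left out.

```python
from itertools import groupby


def mapper(line):
    # Hypothetical log format: "<unix timestamp> <metric> <language> <country>"
    timestamp, metric, language, country = line.split()
    hour = int(timestamp) // 3600 * 3600  # truncate to the start of the hour
    yield (metric, language, country, hour), 1


def reducer(key, values):
    # One output counter per (metric, language, country, hour).
    yield key, sum(values)


def run_job(lines):
    """Local stand-in for the shuffle: sort mapper output by key, then reduce."""
    mapped = sorted(kv for line in lines for kv in mapper(line))
    results = {}
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        for out_key, total in reducer(key, (value for _, value in group)):
            results[out_key] = total
    return results
```

The precomputed counters are what makes graph access fast, while the raw logs stay available in HDFS for detailed analysis, which is how this attempt meets all four requirements.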
  • 55. Architecture [diagram: Syslog-ng, HDFS, MapReduce, HBase]
  • 56. Architecture [diagram: Syslog-ng, HDFS, MapReduce, HBase; added label: Realtime processing]
  • 57. Architecture [diagram: Syslog-ng, HDFS, MapReduce, HBase; added labels: Realtime processing, Ad-hoc results]
  • 58. HBase schema • Separate table for each time granularity • Global segmentations in row keys: <language>||<country>||...|||<timestamp>, with * for “not specified” and trailing *s omitted • Further segmentations in column keys, e.g. payments_via_paypal, payments_via_sms • Related metrics in same column family
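A row key following the convention on slide 58 could be assembled as below. This is a sketch only: the helper `make_row_key` is hypothetical, the separators are copied literally from the slide (including the longer one before the timestamp), and the timestamp format is an assumption.

```python
def make_row_key(segments, timestamp, sep="||", ts_sep="|||"):
    """Build '<language>||<country>||...|||<timestamp>' from segment values.

    A segment of None means "not specified" and becomes '*';
    trailing '*' segments are dropped, as on the slide.
    """
    parts = ["*" if value is None else value for value in segments]
    while parts and parts[-1] == "*":
        parts.pop()
    return sep.join(parts) + ts_sep + str(timestamp)
```

For instance, `make_row_key(["en", "GB", None], 1326326400)` drops the unspecified third segment, while `make_row_key([None, "BE"], 1326326400)` keeps the leading `*` because a specified segment follows it.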