Introduction to Analytics
and Big Data - Hadoop

The University of British Columbia
Computer Science Alumni/Industry Lecture Series
Geoff Fawkes
November, 2013

© 2013 Geoff Fawkes. All Rights Reserved.

1 / 450
Who am I?
- Director Engineering, Teradata
- HSBC, Pivotal/Aptean, Newbridge/Alcatel, etc.: various engineering roles
- Technology executive, mentor, software engineer
  - B.Sc. Comp Sci (UBC), MBA Executive (SFU)
- Interruptive (disruptive?) personality
  - Please ask questions of me and each other as we go along
  - I don’t have all the answers – you do!
- Credits: Rob Pegler, SNIA Education
  - Storage Networking Industry Association, 2012
- Who’s paying attention - 450 slides page count?
  - Not that “big” - about 50

Big Data and Hadoop
- History
- Data Challenges
- Why Hadoop?

Customer Challenges: The Data Deluge

Big Data is Different than Business Intelligence

Questions From Business Will Vary

Web 2.0 is “Data Driven”

The World of Data-Driven Applications

Attributes of Big Data

Top Ten Common Big Data Problems

Industries Are Embracing Big Data

Why Hadoop?

Storage and Memory B/W Lagging CPU

Commodity Hardware Economics

What is Hadoop?
- Hadoop Adoption
- HDFS
- MapReduce
- Examples
- Ecosystem Projects

Hadoop Adoption in the Industry

What is Hadoop?

HDFS 101 – The Data Set System

HDFS Organization and Replication
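
The slide's diagram did not survive extraction, so the following is only a toy sketch of the general idea behind HDFS organization: a file is cut into fixed-size blocks, and each block is copied to several DataNodes. The block size, the replication factor, the node names, and the round-robin placement are all assumptions made for illustration; a real cluster configures these values and uses rack-aware placement rather than round-robin.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in a real cluster)
REPLICATION = 3                 # assumed replication factor


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanode, ...]} using simple round-robin placement.

    Hypothetical helper: real HDFS placement is rack-aware, not round-robin.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for b in range(num_blocks):
        # Each replica of a block goes to a different node.
        placement[b] = [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
layout = place_blocks(300 * 1024 * 1024, nodes)
for block, replicas in layout.items():
    print(block, replicas)
```

With these assumed values a 300 MB file occupies three blocks, and losing any single node still leaves two copies of every block.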

Hadoop Server Roles - Multiple

Hadoop Cluster

HDFS File Write Operation - Instance

HDFS File Read Operation - Instance

HDFS File Operation R/W Replication

MapReduce 101 – Functional Programming Meets Distributed Processing

What is MapReduce?

Key MapReduce Terminology

MapReduce Basic Concepts

Example 1: MapReduce Operation

Example 2: Sample Dataset

MapReduce Paradigm – UNIX Cmd

Example 3: Count Words

Ex. 3: Lifecycle of a MapReduce Job
Map function

Reduce function

Run this program as a
MapReduce job
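
The code shown on this slide was an image and is not recoverable, so here is a minimal single-process sketch of what a word-count map function and reduce function might look like. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, and `run_job` only imitates what the framework does between the two phases (grouping values by key); it is not the Hadoop API.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Imitate the framework: map every line, group by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # Sorting the keys mirrors the sorted order in which a reducer sees them.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


result = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(result)
```

In a real job the map calls run in parallel on different input splits and the grouping step moves data across the network; only the two small functions are written by the programmer.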

Ex. 3: Lifecycle of a MapReduce Job
[Timeline figure: Input Splits, then Map Wave 1, Map Wave 2, Reduce Wave 1, Reduce Wave 2 along the Time axis]

How are the number of splits, number of map and reduce
tasks, memory allocation to tasks, etc., determined?
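
As a rough way to reason about that question, the arithmetic can be sketched as follows. It assumes one map task per input split, a split size equal to the block size, and a fixed number of task slots available at once; `plan_job` and all the numbers are hypothetical, and real schedulers are more involved than this.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def plan_job(input_bytes, split_bytes, map_slots, reduce_tasks, reduce_slots):
    """Estimate splits, tasks, and waves under the simplifying assumptions above."""
    splits = math.ceil(input_bytes / split_bytes)
    return {
        "input_splits": splits,
        "map_tasks": splits,  # assumption: one map task per split
        "map_waves": math.ceil(splits / map_slots),
        "reduce_waves": math.ceil(reduce_tasks / reduce_slots),
    }


# Hypothetical job: 10 GB of input, 128 MB splits, 40 map slots,
# 20 reduce tasks requested by the job, 10 reduce slots.
plan = plan_job(10 * GB, 128 * MB, map_slots=40, reduce_tasks=20, reduce_slots=10)
print(plan)
```

With these made-up numbers the job needs 80 map tasks, which run in two waves, followed by two reduce waves, matching the shape of the timeline on the slide.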
MapReduce Job Configuration Parms
- 190+ parameters in Hadoop
- Set manually, or defaults are used
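
The "set manually or defaults are used" behaviour can be pictured as a simple overlay of job-level settings on top of defaults. The sketch below is an illustration, not Hadoop's own configuration code: `effective_config` is a hypothetical helper, only three of the many parameters are shown, and the default values are assumptions that vary between Hadoop versions.

```python
# Three real Hadoop 2.x property names; the values are assumed defaults
# used for illustration and differ between versions and distributions.
DEFAULTS = {
    "mapreduce.job.reduces": 1,
    "mapreduce.task.io.sort.mb": 100,
    "mapreduce.map.memory.mb": 1024,
}


def effective_config(overrides=None, defaults=DEFAULTS):
    """Return the settings a job would run with: defaults, then manual overrides."""
    config = dict(defaults)          # copy so the defaults stay untouched
    config.update(overrides or {})   # anything set manually wins
    return config


job_conf = effective_config({"mapreduce.job.reduces": 20})
print(job_conf)
```

Only the reducer count was set by hand here; the other two parameters silently keep their defaults, which is why untuned jobs can behave very differently from tuned ones.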

Putting it all Together: MapReduce + HDFS

Hadoop Ecosystem Projects

- Interactive SQL Query & Modeling
- Data flow for tedious MapReduce Jobs
- Columnar NoSQL Store

Compare: Hadoop, SQL, Massively Parallel Processing (MPP)

Compare: RDBMS and MapReduce

Hadoop Use Cases
- Set Top Cable TV Boxes
- Pay Per View Advertising
- Bank Risk Modelling
- Product Sentiment Analysis

Example 1: Set Top Cable TV Boxes

Example 2: Pay Per View Advertising

Example 3: Bank Risk Modelling

Example 4: Product Sentiment Analysis

More Reading?
- World Economic Forum: “Personal Data: The Emergence of a New Asset Class” (2011)
- McKinsey Global Institute: “Big Data: The next frontier for innovation, competition, and productivity”
- “Big Data: Harnessing a game-changing asset”
- IDC: 2011 Digital Universe Study: “Extracting Value from Chaos”
- The Economist: “Data, Data Everywhere”
- “Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field”
- O’Reilly: “What is Data Science?”
- O’Reilly: “Building Data Science Teams”
- O’Reilly: “Data for the public good”
- Obama Administration: “Big Data Research and Development Initiative”

Introduction to Analytics
and Big Data – Hadoop
Q&A
Geoff Fawkes
http://www.linkedin.com/pub/geoff-fawkes/1/269/202
@gfawkes
November, 2013

Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Intro to Big Data and Hadoop - UBC CS Lecture Series - G. Fawkes

  • 1. Introduction to Analytics and Big Data - Hadoop The University of British Columbia Computer Science Alumni/Industry Lecture Series Geoff Fawkes November, 2013 © 2013 Geoff Fawkes. All Rights Reserved. 1 / 450
  • 2. Who am I? Director of Engineering, Teradata; HSBC, Pivotal/Aptean, Newbridge/Alcatel, etc., in various engineering roles; technology executive, mentor, software engineer; B.Sc. Comp Sci (UBC), Executive MBA (SFU); interruptive (disruptive?) personality. Please ask questions of me and each other as we go along. I don’t have all the answers – you do! Credits: Rob Pegler, SNIA Education (Storage Networking Industry Association), 2012. Who’s paying attention - 450 slides page count? Not that “big” - - about 50.
  • 3. Big Data and Hadoop: History; Data Challenges; Why Hadoop?
  • 4. Customer Challenges: The Data Deluge
  • 5. Big Data is Different than Business Intelligence
  • 6. Questions From Business Will Vary
  • 7. Web 2.0 is “Data Driven”
  • 8. The World of Data-Driven Applications
  • 9. Attributes of Big Data
  • 10. Top Ten Common Big Data Problems
  • 11. Industries Are Embracing Big Data
  • 12. Why Hadoop?
  • 13. Why Hadoop?
  • 14. Storage and Memory B/W Lagging CPU
  • 15. Commodity Hardware Economics
  • 16. What is Hadoop? Hadoop Adoption; HDFS; MapReduce; Examples; Ecosystem Projects
  • 17. Hadoop Adoption in the Industry
  • 18. What is Hadoop?
  • 19. What is Hadoop?
  • 20. HDFS 101 – The Data Set System
  • 21. HDFS Organization and Replication
  • 22. Hadoop Server Roles - Multiple
  • 23. Hadoop Cluster
  • 24. HDFS File Write Operation - Instance
  • 25. HDFS File Read Operation - Instance
  • 26. HDFS File Operation R/W Replication
  • 27. MapReduce 101 – Functional Programming Meets Distributed Processing
  • 28. What is MapReduce?
  • 29. Key MapReduce Terminology
  • 30. MapReduce Basic Concepts
  • 31. Example 1: MapReduce Operation
  • 32. Example 2: Sample Dataset
  • 33. MapReduce Paradigm – UNIX Cmd
  • 34. Example 3: Count Words
  • 35. Ex. 3: Lifecycle of a MapReduce Job: Map function; Reduce function; Run this program as a MapReduce job
  • 36. Ex. 3: Lifecycle of a MapReduce Job: Map function; Reduce function; Run this program as a MapReduce job
  • 37. Ex. 3: Lifecycle of a MapReduce Job: Time; Input Splits; Map Wave 1; Map Wave 2; Reduce Wave 1; Reduce Wave 2. How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
  • 38. MapReduce Job Configuration Parms: 190+ parameters in Hadoop; set manually or defaults are used
  • 39. Putting it all Together: MapReduce + HDFS
  • 40. Hadoop Ecosystem Projects: Interactive SQL Query & Modeling; Data flow for tedious MapReduce Jobs; Columnar NoSQL Store
  • 41. Compare: Hadoop, SQL, Massively Parallel Processing (MPP)
  • 42. Compare: RDBMS and MapReduce
  • 43. Hadoop Use Cases: Set Top Cable TV Boxes; Pay Per View Advertising; Bank Risk Modelling; Product Sentiment Analysis
  • 44. Example 1: Set Top Cable TV Boxes
  • 45. Example 2: Pay Per View Advertising
  • 46. Example 3: Bank Risk Modelling
  • 47. Example 4: Product Sentiment Analysis
  • 48. More Reading? World Economic Forum: “Personal Data: The Emergence of a New Asset Class”, 2011; McKinsey Global Institute: “Big Data: The next frontier for innovation, competition, and productivity”; “Big Data: Harnessing a game-changing asset”; IDC: 2011 Digital Universe Study: “Extracting Value from Chaos”; The Economist: “Data, Data Everywhere”; “Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field”; O’Reilly: “What is Data Science?”; O’Reilly: “Building Data Science Teams”; O’Reilly: “Data for the public good”; Obama Administration: “Big Data Research and Development Initiative”
  • 49. Introduction to Analytics and Big Data – Hadoop: Q&A. Geoff Fawkes, http://www.linkedin.com/pub/geoff-fawkes/1/269/202, @gfawkes, November 2013

Editor's notes

  1. Housekeeping: keep your mobile devices on, turn the ringer volume up really loud, tweet, check in on Foursquare, update your Facebook as I speak. We now live in a multitasking world, so I’m OK with interruptions. Ask questions. If I don’t have the answer, someone else may, and you can drop me an email afterwards. How many pages?!
  2. Introductory presentation for new hires at Teradata. A mixture of business and engineering concepts. It only scratches the surface; references are at the end of the presentation.
  3. A zettabyte is 10 to the power of 21 bytes.
  4. Teradata used Tableau
  5. Baidu is the Chinese-language counterpart of Google. The quote is from William Gibson, author and poet, who coined the term “cyberspace” in his 1982 short story “Burning Chrome” and popularized it in his 1984 novel Neuromancer. He also predicted the rise and popularity of reality TV.
  6. Structured data: a defined format, such as an XML document or database tables. Semi-structured data: there may be a schema, but it is often ignored, e.g. a spreadsheet, in which cells/fields can store any type of data. Unstructured data: no particular internal structure, e.g. plain text, an image file, a Twitter feed. 80% of Big Data is unstructured.
  7. If Gartner says so, it must be right ;>) Motivations for Hadoop: a huge dependency on the network and huge bandwidth demands; scaling up and down is not a smooth process; partial failures are difficult to handle; a lot of processing power is spent on transporting data; data synchronization is required during exchange. As a developer you should not have to handle these issues in your application. These are the problems that Hadoop solves, leaving you to focus on business logic.
  8. Basic I/O problem: while the storage capacity of hard drives has increased, access speed (the rate at which data can be read) has not kept up. E.g. 1 TB drives are normal, but at a transfer rate of about 100 MB/s it would take more than 2.5 hours to read all the data on the drive.
  9. The world continues to move towards commodity hardware.
  10. Commercial companies focused on developing and supporting Hadoop: Hortonworks, Cloudera, Amazon Web Services (AWS)
  11. In simpler terms, Hadoop is a framework that lets several machines work together to analyze large sets of data. The Hadoop framework takes care of reliability and data motion. MapReduce divides an application’s processing of data into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. HDFS stores the data on the compute nodes, providing very high aggregate bandwidth across the cluster. Node failures are handled automatically by the framework, through parallelism, heartbeats, checksums and replication.
  12. The Hadoop platform consists of the Hadoop kernel (implemented in Java), MapReduce (any programming language can be used) and HDFS (Hadoop Distributed File System). HDFS can be accessed natively through a Java API for applications to use (a C language wrapper is also available). ext3, the third extended file system commonly used by the Linux kernel, is supported. XFS is a journaling file system supporting 64-bit and parallel I/O.
  13. Blocks: a disk block is 512 bytes, a file system block is 3 KB, and an HDFS block is 64 MB by default (128 MB is also common). An HDFS block is much larger than a disk block to minimize the cost of disk seeks. HDFS files are write-once: once written and closed, they cannot be changed. A typical single file in HDFS is gigabytes to terabytes in size.
  14. Terminology. A set of machines is a Hadoop cluster, using a master-slave architecture. Each Hadoop instance has a single NameNode and a cluster of DataNodes. The NameNode is the software that maintains the file system structure and metadata for the DataNodes. A DataNode is the software that stores and retrieves blocks of data. There can be up to 4,000 slave DataNodes per NameNode. The JobTracker, on the NameNode side, schedules and tracks MapReduce task execution. The TaskTracker, on each DataNode, runs the MapReduce tasks that read and write data. The NameNode does not require a lot of disk space but requires a lot of RAM (it is the brains of the instance). A DataNode does not require a lot of RAM but requires a lot of disk space. Failover: the transition from the active NameNode to a secondary/standby NameNode, driven by a failover controller and coordinated through ZooKeeper.
  15. HDFS is designed to run on commodity hardware: low-cost servers running Linux/Apache. The philosophy of the cluster design is to bring computing as close as possible to the data. All HDFS communication protocols are layered on top of TCP/IP. The NameNode and DataNodes can be located anywhere.
  16. A single instance is a single HDFS cluster.
  17. A single instance is a single HDFS cluster.
  18. Blocks: a disk block is 512 bytes, a file system block is 3 KB, and an HDFS block is 64 MB by default (128 MB is also common). Hardware failure and data corruption are the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. Because there is a huge number of components and each has a non-trivial probability of failure, some component of HDFS is always non-functional (dead). By default, each block is replicated 3 times (an application can change this in configuration). Replica placement is heavily studied for optimization: HDFS’s policy is to put one replica on a node in the local rack and distribute the other replicas to other nodes and other racks, with the goal of reducing seek times and encouraging cluster rebalancing. Separate from file operations, the NameNode periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. If the NameNode itself crashes, its state has to be restored from a backup on disk. ZooKeeper provides NameNode failover coordination, giving high availability through active/passive NameNodes.
  19. The UNIX analogy is a large distributed pipeline.
  20. Map Server/Function 1, Map Server/Function 2 and Map Server/Function 3 each process in parallel. In MapReduce, every input is viewed as a key-value pair, e.g. Key = Sentence 1, Value = “John has a red car, which has no radio”. Step 1: each sentence is given to a Map, and each word is counted in a wave. In this example, there are 3 Map jobs. Step 2: shuffle and sort simply moves the words to server locations, so that all the values for each unique key are brought together. Step 3: the words on each server are aggregated and reduced. In this example Reduce is performed across two waves. The final output is at the lower right.
  21. As a developer you have to start thinking about your data storage problem in a distributed way, instead of in a monolithic way.
  22. Step 1: the data is broken into file splits of 64 MB (or 128 MB) and the blocks are moved to different DataNodes. Step 2: once all the blocks are moved, the Hadoop framework passes your program to each DataNode. Step 3: the JobTracker then starts scheduling the programs on individual DataNodes. Step 4: once all the DataNodes are done, the output (yellow) is written back.
  23. Also built on top of Hadoop are the helper applications. Hive – interactive SQL query and modeling using a data warehouse view of HDFS; it projects a table structure onto the dataset and then manipulates it with HiveQL. Pig – data flow for tedious MapReduce jobs; a language for expressing data analysis and infrastructure processes. HBase – columnar NoSQL store for billions of rows. HCatalog – table and schema management. ZooKeeper – NameNode-to-backup failover coordination. Ambari – management tool.
  24. Download commercial implementations: Hortonworks (Sandbox is a single node download), Cloudera, Amazon services
  25. The question is not “Why should I care about Big Data?” but rather “How can I get closer to Big Data and start taking advantage of it?” Thanks to Peter Smith and Michel Ng for organizing. If you have a topic you would like to present on, see Peter and contribute your expertise to the tech ecosystem in Vancouver. Send me questions via LinkedIn and a copy will be posted to my profile. Hootsuite, QuickMobile and a few others in Vancouver are looking for analytics developers; have a look.
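The disk-read arithmetic in note 8 can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes), a steady sequential rate of 100 MB/s, and, purely for illustration, a hypothetical spread of the same data over 100 drives read in parallel, which is the motivation for distributing storage.

```python
def full_read_hours(capacity_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to read a given amount of data sequentially at a steady rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

# One 1 TB drive read end to end at 100 MB/s.
one_drive = full_read_hours(1 * TB, 100 * MB)

# Hypothetical: the same 1 TB spread evenly over 100 drives read in parallel,
# so each drive only has to deliver one hundredth of the data.
hundred_drives = full_read_hours(1 * TB / 100, 100 * MB)

print(f"one drive: {one_drive:.2f} hours")
print(f"100 drives in parallel: {hundred_drives * 60:.1f} minutes")
```

The single-drive figure comes out a little under three hours, consistent with the "more than 2.5 hours" estimate, while the parallel read finishes in a couple of minutes.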
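The block sizes in notes 13 and 18 can be turned into a small worked example. The sketch below assumes the 64 MB default block size and replication factor of 3 quoted in the notes, and uses binary units (1 MB = 2^20 bytes); the function name `hdfs_footprint` and the 1 GB sample file are made up for illustration and are not part of any Hadoop API.

```python
def hdfs_footprint(file_bytes: int, block_bytes: int = 64 * 2**20, replication: int = 3):
    """Return (blocks, stored block copies, raw bytes) for one file.

    The last block of a file only occupies as much space as it holds,
    so raw storage is file size times replication, not blocks times block size.
    """
    blocks = -(-file_bytes // block_bytes)  # integer ceiling division
    block_copies = blocks * replication
    raw_bytes = file_bytes * replication
    return blocks, block_copies, raw_bytes


one_gb = 2**30  # hypothetical 1 GB sample file
blocks, copies, raw = hdfs_footprint(one_gb)
disk_blocks = one_gb // 512  # the same file counted in 512-byte disk blocks

print(blocks, copies, raw // 2**30, disk_blocks)
```

For the sample file this gives 16 HDFS blocks, against roughly two million 512-byte disk blocks, which is why large blocks keep the NameNode's metadata small and make seeks a minor part of each read.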
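Note 18 describes the NameNode using Heartbeats to decide which DataNodes are alive, and replication to survive the loss of a node. The following is a toy model of that idea, not Hadoop's implementation: the class `HeartbeatMonitor`, the function `under_replicated`, the 30-second timeout, and the node and block names are all hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node (toy stand-in for a NameNode)."""
    timeout_s: float  # hypothetical timeout, not a Hadoop default
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> set:
        # A node is presumed dead once its last heartbeat is older than the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout_s}


def under_replicated(block_map: dict, dead: set, target: int = 3) -> dict:
    """Map each block with fewer than `target` live replicas to its live count."""
    result = {}
    for block, nodes in block_map.items():
        live = len(nodes - dead)
        if live < target:
            result[block] = live
    return result


monitor = HeartbeatMonitor(timeout_s=30)
for node in ("dn1", "dn2", "dn3", "dn4"):
    monitor.heartbeat(node, now=0)
for node in ("dn1", "dn2", "dn4"):  # dn3 goes silent after t=0
    monitor.heartbeat(node, now=25)

dead = monitor.dead_nodes(now=40)
block_map = {"blk_1": {"dn1", "dn2", "dn3"}, "blk_2": {"dn1", "dn2", "dn4"}}
needs_copy = under_replicated(block_map, dead)
print(dead, needs_copy)
```

In this run only the silent node is flagged, and only the block that had a replica on it needs another copy made from one of its two surviving replicas.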
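The three steps in note 20 (map, shuffle and sort, reduce) can be sketched in plain Python running in a single process. This is an illustration of the programming model only, assuming nothing about the Hadoop API: the function names are invented here, the first sentence is the one quoted in the note, and the other two input sentences are made up.

```python
from collections import defaultdict


def map_words(key, sentence):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in sentence.lower().replace(",", " ").split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: bring all values for the same key together, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(word, counts):
    """Reduce step: aggregate all the values collected for one key."""
    return word, sum(counts)


def word_count(records):
    mapped = [pair for key, value in records for pair in map_words(key, value)]
    return dict(reduce_counts(word, ones) for word, ones in shuffle(mapped).items())


records = [
    ("Sentence 1", "John has a red car, which has no radio"),
    ("Sentence 2", "Mary has a blue car"),   # made-up input
    ("Sentence 3", "The radio is red"),      # made-up input
]
counts = word_count(records)
print(counts)
```

On a cluster each call to `map_words` could run on a different server and each key group could be reduced on a different server; the single-process version keeps the same data flow so the stages are easy to follow.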