Integrating Hadoop into your existing data warehouse and business intelligence environment. Speakers: Jeff Hammerbacher (Cloudera) and Anil Madan (eBay).
Recording of the November 17, 2010 webinar: https://www1.gotomeeting.com/register/515000760
4. Presentation Outline
! 1. The standard model
! 2. The 3 stages of Hadoop adoption
! 3. Cloudera partnerships
! 4. Analytics at eBay
! Questions and Discussion
5. 1. The Standard Model
Data Warehousing and Business Intelligence
13. Stage 1
Copy or Archive
[Diagram: Application Database and Application Requests feed the Data Warehouse through ETL, serving Business Intelligence and Analytics; Hadoop sits alongside the warehouse as the copy/archive target.]
14. Stage 1
Add Unstructured Data
[Diagram: the same ETL-to-warehouse pipeline, with unstructured data now flowing into Hadoop as well.]
15. Stage 1
Consolidate Multiple Data Warehouses
[Diagram: two Application Database → ETL → Data Warehouse pipelines, with Hadoop as the single consolidation point between them.]
16. Stage 2
On the Critical Path
17. Stage 2
Structure and Store
[Diagram: Application Database and Application Requests now flow into Hadoop, which structures the data before loading the Data Warehouse for Business Intelligence and Analytics; the separate ETL tier is gone.]
18. Stage 3
Ad Hoc Query Support
22. Cloudera Partnerships
Cloud, Hardware, and OS
! Processor: AMD, Intel
! Server: Acer, HP, Supermicro
! OS: Canonical
! Cloud: VMware vCloud
! CDH runs on AWS and Rackspace Cloud as well
27. eBay’s Data Scale
• eBay manages …
  • Over 90 million active users worldwide
  • Over 220 million items for sale
  • Over 10 billion URL requests per day
• … in a dynamic environment
  • Tens of new features each week
  • Roughly 10% of items are listed or ended every day
• Collect Everything
  • eBay processes 40TB of new, incremental data per day
  • eBay analyzes 40PB of data per day
  • Store every historical item and purchase
eBay has one of the largest EDW systems and is building one of the world’s largest Hadoop clusters.
30. Data Sourcing Patterns
Source: Click Stream (Session, Event, Session Container)
• Preparation Format: Session/Event data streamed as Gzip/Binary, prepared as LZO/Text. Session Container (a join of Session and its corresponding Event data) prepared as SequenceFiles.
• Pattern / Learning: For Session/Event data, build an LZO index and use LzoTextInputFormat for splits (first sketch below). For the Session Container, use a secondary sort with a reduce-side join (second sketch below).
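The LZO split pattern deserves a concrete illustration. Below is a minimal sketch assuming the third-party hadoop-lzo library (which provides LzoIndexer and LzoTextInputFormat) is installed on the cluster; the paths and job name are illustrative, not eBay's actual pipeline:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.hadoop.compression.lzo.LzoIndexer;
import com.hadoop.mapreduce.LzoTextInputFormat;

public class SessionEventJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // One-time step: write a .index file next to each .lzo file so the
    // input format can find block boundaries to split on.
    new LzoIndexer(conf).index(new Path("/data/clickstream/events"));

    Job job = new Job(conf, "session-event");
    job.setJarByClass(SessionEventJob.class);
    // With an index present, LzoTextInputFormat produces one split per LZO
    // block; without it, each compressed file becomes a single split.
    job.setInputFormatClass(LzoTextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/data/clickstream/events"));
    FileOutputFormat.setOutputPath(job, new Path("/data/clickstream/events-out"));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Mapper/reducer setup elided.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}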
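The Session Container join is the classic secondary-sort pattern. A sketch, assuming (purely for illustration) that map output keys are Text values of the form <guid>\t<tag>, with tag 0 for the Session record and tag 1 for its Events:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SessionEventJoin {

  /** Route every key with the same session guid to the same reducer. */
  public static class GuidPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      String guid = key.toString().split("\t", 2)[0];
      return (guid.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /** Group on the guid alone, so a single reduce() call sees the Session
      record (tag 0, sorted first) followed by all of its Event records. */
  public static class GuidGroupingComparator extends WritableComparator {
    public GuidGroupingComparator() {
      super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      String guidA = a.toString().split("\t", 2)[0];
      String guidB = b.toString().split("\t", 2)[0];
      return guidA.compareTo(guidB);
    }
  }

  // Wiring:
  //   job.setPartitionerClass(GuidPartitioner.class);
  //   job.setGroupingComparatorClass(GuidGroupingComparator.class);
  // The default sort comparator already orders "<guid>\t0" before "<guid>\t1".
}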
Source: EDW (Item, Transaction, User, Feedback, Bids)
• Preparation Format: Incremental feed streamed and maintained as GZIP/Text; the data set is smaller, so keep it in the original format. Prepare a snapshot as a SequenceFile.
• Pattern / Learning: Rebuild the daily snapshot from the previous snapshot plus the day's incremental data (see the sketch below). Build a Hive table on the snapshot data by creating an external Hive table that points to the SequenceFile.
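A minimal sketch of that rebuild, assuming (as an illustration, not eBay's actual schema) that both the previous snapshot and the incremental feed are mapped to (item id, record) pairs whose first tab-separated field is a modification timestamp:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/** Keeps only the newest version of each record, merging the previous
    snapshot with the incremental day's data into a fresh snapshot. */
public class SnapshotMergeReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text itemId, Iterable<Text> records, Context ctx)
      throws IOException, InterruptedException {
    long newestTs = Long.MIN_VALUE;
    String newest = null;
    for (Text record : records) {
      // First tab-separated field is assumed to be an epoch timestamp.
      String line = record.toString();
      long ts = Long.parseLong(line.substring(0, line.indexOf('\t')));
      if (ts > newestTs) {
        newestTs = ts;
        newest = line;
      }
    }
    ctx.write(itemId, new Text(newest));
  }
}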
Source: HBase
• Preparation Format: a) Leverage TotalOrderPartitioner with RandomSampler to identify partition ranges for reducers (see the sketch below). b) Create HBase regions as HFiles. c) Update the RegionServers using the Ruby script loadtable.rb.
• Learning: a) Incremental data is not temporal/sparse, hence not suitable as versions in a column-oriented DB. b) HBase insert vs. append performance: 120K vs. 12K rows per second. c) HFile flush durability issues (HBASE-1923).
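A sketch of step a), using the sampler and partitioner that ship with Hadoop; the partition-file path and sampling parameters are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class HFilePartitioning {
  /** Configure a job so each reducer receives one contiguous, globally
      sorted key range, i.e. the key range of one future HBase region. */
  public static void configureTotalOrder(Job job) throws Exception {
    Path partitionFile = new Path("/tmp/hbase-load-partitions");
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);

    // Sample roughly 0.1% of input keys (at most 10,000 keys from at most
    // 100 splits) and write reducer boundary keys to the partition file.
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.001, 10000, 100);
    InputSampler.writePartitionFile(job, sampler);

    job.setPartitionerClass(TotalOrderPartitioner.class);
  }
}

Total ordering matters here because each reducer's output becomes one HFile, and an HFile can only seed a region if it covers a disjoint, sorted slice of the key space.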
31. Hadoop Ecosystem
Stack layers:
• Hadoop Core (HDFS, Common)
• MapReduce (Java, Streaming, Pipes, Scala)
• Data Access (HBase, Pig, Hive)
• Tools & Libraries (HUE, UC4, Oozie, Mobius, Mahout)
• Monitoring & Alerting (Ganglia, Nagios)
• MapReduce
  Sourcing data primarily in Java
  Applications using Perl, Scala, Python…
• Data Access Frameworks
  Pig – data pipelines
  Hive – ad hoc queries
  MQL – Mobius Query Language
• Monitoring & Alerting
  Ganglia, Nagios, Cloudera Enterprise
• Tools & Libraries
  HUE/Mobius – lifecycle of user jobs
  UC4 – scheduling
  Oozie – user workflow and data pipelines
  Mahout – data mining