Visual Mapping of Clickstream Data

Visual Mapping of Clickstream Data: Introduction and Demonstration
Cedric Carbone, Ciaran Dynes
Talend

© Talend 2014
Ciaran Dynes, VP Products
Cedric Carbone, CTO

Agenda
• Clickstream live demo
• Moving from hand-code to code generation
• Performance benchmark
• Optimization of code generation

Hortonworks Clickstream demo
http://hortonworks.com/hadoop-tutorial/how-to-visualize-website-clickstream-data/

Trying to get from this…

…to this
Big Data – “pure Hadoop”
Design visually in Map/Reduce and optimize before deploying on Hadoop

Demo overview
• Demo flow overview:
1. Load raw Omniture web log files to HDFS
   • Can discuss the ‘schema on read’ principle: how it allows any data type to be easily loaded to a ‘data lake’, where it is then available for analytical processing
   • http://ibmdatamag.com/2013/05/why-is-schema-on-read-so-useful/
2. Define a Map/Reduce process to transform the data
   • Identical skills to any graphical ETL tool
   • Look up customer and product data to enrich the results
   • Results written back to HDFS
3. Federate the results to a visualisation tool of your choice
   • Excel
   • Analytics tools such as Tableau, QlikView, etc.
   • Google Charts
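
The transform step of the flow above can be illustrated with a small, plain-Python sketch. It is not the code Talend generates: the three-column tab-separated layout, the `PRODUCT_CATEGORY` lookup table and the function names are all assumptions made for illustration. It shows the shape of the job: parse raw lines at read time, enrich each record through a lookup, then aggregate.

```python
from collections import Counter

# Hypothetical lookup table (URL -> product category). It stands in for the
# customer/product reference data joined during the Map/Reduce step.
PRODUCT_CATEGORY = {
    "/shoes/runner": "shoes",
    "/shoes/boot": "shoes",
    "/bags/tote": "bags",
}


def map_record(line):
    """Map step: parse one tab-separated log line and emit (category, 1)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        # Schema on read: the raw file was loaded as-is, so malformed rows
        # are only detected (and skipped) when the data is read.
        return None
    _timestamp, _ip, url = fields[:3]
    return PRODUCT_CATEGORY.get(url, "unknown"), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each category."""
    totals = Counter()
    for category, count in pairs:
        totals[category] += count
    return dict(totals)


def run(lines):
    """Run the map step over every line, then reduce the valid pairs."""
    mapped = (map_record(line) for line in lines)
    return reduce_counts(pair for pair in mapped if pair is not None)


# Made-up sample input (assumed layout: timestamp, IP address, URL).
RAW_LOG = [
    "2014-03-01 10:00:00\t10.0.0.1\t/shoes/runner",
    "2014-03-01 10:00:05\t10.0.0.2\t/bags/tote",
    "2014-03-01 10:00:09\t10.0.0.1\t/shoes/boot",
    "broken line",
    "2014-03-01 10:01:00\t10.0.0.3\t/hats/cap",
]
```

Calling `run(RAW_LOG)` yields per-category click counts that a dashboard tool could then chart.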

Big Data Clickstream Analysis
[Diagram: Web logs → TALEND (Load to HDFS) → HADOOP (HDFS, Map/Reduce, Hive) → TALEND (Federate to analytics) → Clickstream Dashboard; the flow is labelled TALEND BIG DATA (Integration)]

Native Map/Reduce Jobs
• Create classic ETL patterns using native Map/Reduce
- Only data management solution on the market to generate native Map/Reduce code
• No need for expensive big data coding skills
• Zero pre-installation on the Hadoop cluster
• Hadoop is the “engine” for data processing

PERFORMANCE OF CODE GENERATION

MapReduce 2.0, YARN, Storm, Spark
• YARN: ensures predictable performance & QoS for all apps
• Enables apps to run “IN” Hadoop rather than “ON”
• In Labs: streaming with Apache Storm
• In Labs: mini-batch and in-memory with Apache Spark
[Diagram: Applications Run Natively IN Hadoop. Source: Hortonworks]
- Applications: BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase), OTHER (Search)
- YARN (Cluster Resource Management)
- HDFS2 (Redundant, Reliable Storage)

Talend: Tap – Transform – Deliver
[Diagram: Talend layered around the Hadoop stack of HDFS2 (Redundant, Reliable Storage), YARN (Cluster Resource Management) and the applications BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Spark), GRAPH (Giraph), NoSQL (MongoDB), EVENTS (Falcon), ONLINE (HBase), OTHER (Search)]
• TAP (Ingestion): Sqoop, Flume, HDFS API, HBase API, Hive, 800+
• TRANSFORM (Data Refinement): Profile, Parse, Map, CDC, Cleanse, Standardize, Machine Learning, Match
• DELIVER (as an API): ActiveMQ, Karaf, Camel, CXF, Kafka, Storm
• Meta, Security, MDM, iPaaS, Govern, HA

© Talend 2013
Talend Labs Benchmark Environment
• Context: 9-node cluster, replication factor 3
- DELL R210-II, 1 Xeon® E3 1230 v2, 4 cores, 16 GB RAM
- Map slots: 2 slots / node
- Reduce slots: 2 slots / node
• Total processing capabilities:
- 9*2 map slots: 18 maps
- 9*2 reduce slots: 18 reduces
• Data volume: 1, 10, 100 GB

TPCH Benchmark
• The Apache Pig and Hive communities are using TPC-H benchmarks
- https://issues.apache.org/jira/browse/PIG-2397
- https://issues.apache.org/jira/browse/HIVE-600
• We are currently running the same tests in our labs
- Hand-coded Pig script vs. Talend-generated Pig code
- Hand-coded Pig script vs. Talend-generated Map/Reduce code
- Hive QL produced by the community vs. Hive ELT capabilities
• Partial results already available for Pig
- Very good results

Optimizing job configuration?
• By default, Talend follows Hadoop recommendations regarding the number of reducers usable for the job execution.
• The rule is that 99% of the total reducers available can be used
- http://wiki.apache.org/hadoop/HowManyMapsAndReduces
- For the Talend benchmark, the default max reducers (rounded down) is:
• 3 nodes: 5 (3*2 = 6; 6 * 99% = 5.94, rounded down to 5)
• 6 nodes: 11 (6*2 = 12; 12 * 99% = 11.88, rounded down to 11)
• 9 nodes: 17 (9*2 = 18; 18 * 99% = 17.82, rounded down to 17)
- Another customer benchmark, default max reducers:
• 700 * 99% = 693 nodes (assumption with half Dell and half HP servers)
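
The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch of the rule as stated here (99% of the available reduce slots, rounded down); the function name and parameters are hypothetical and do not correspond to any Talend or Hadoop setting.

```python
def default_max_reducers(nodes, reduce_slots_per_node=2, percent=99):
    """Return the reducer cap: `percent` of all reduce slots, rounded down.

    Integer arithmetic is used so the result never depends on
    floating-point rounding.
    """
    total_slots = nodes * reduce_slots_per_node
    return (total_slots * percent) // 100


# The three cluster sizes quoted for the Talend benchmark.
for nodes in (3, 6, 9):
    print(nodes, "nodes ->", default_max_reducers(nodes), "reducers")
```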

TPCH Results: Pig Hand Coded vs Pig Generated
• 19 tests with results similar to or better than Pig hand-coded scripts

• Code is already optimized and automatically applied
Talend code is faster

PERFORMANCE IMPROVEMENTS

• 3 tests will benefit from a new COGROUP feature
[Chart annotation: Requires CoGroup]

Example: How Sort works for Hadoop
Talend has implemented the TeraSort algorithm for Hadoop
1. A first Map/Reduce job is generated to analyze the data ranges
- Each mapper reads its data and analyzes the critical values of its buckets
- The reducer produces quartile files for all the data to sort
2. A second Map/Reduce job is started
- Each mapper simply sends the key to sort to the reducer
- A custom partitioner is created to send the data to the best bucket, depending on the quartile file previously created
- Each reducer outputs the data sorted by bucket
• Research: tSort: GraySort, MinuteSort
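
The two-job approach can be sketched in plain Python. This is a single-process illustration of sample-based range partitioning in the TeraSort style, not Talend's generated code: the function names, the sampling strategy and the default of four buckets (so that the split points are quartiles of the sample) are assumptions.

```python
import bisect
import random


def sample_split_points(partitions, num_buckets, sample_size=100, seed=0):
    """Job 1: sample keys from every input partition and derive split points."""
    rng = random.Random(seed)
    sample = []
    for part in partitions:
        sample.extend(rng.sample(part, min(sample_size, len(part))))
    sample.sort()
    if not sample:
        return []
    # num_buckets - 1 boundaries, evenly spaced through the sorted sample.
    return [sample[(i * len(sample)) // num_buckets]
            for i in range(1, num_buckets)]


def range_partition_sort(partitions, split_points):
    """Job 2: route each key to its bucket, then sort every bucket."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for part in partitions:
        for key in part:
            # "Custom partitioner": pick the bucket whose range holds the key.
            buckets[bisect.bisect_right(split_points, key)].append(key)
    # Each "reducer" sorts only its own bucket.
    return [sorted(bucket) for bucket in buckets]


def tera_style_sort(partitions, num_buckets=4):
    """Run both jobs and concatenate the buckets in bucket order."""
    splits = sample_split_points(partitions, num_buckets)
    buckets = range_partition_sort(partitions, splits)
    return [key for bucket in buckets for key in bucket]
```

Because every key in bucket i is smaller than or equal to every key in bucket i+1, concatenating the independently sorted buckets gives a globally sorted result without a final merge step.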

How to Get the Sandbox
• Videos on the Jumpstart
- How to launch: http://youtu.be/J3Ppr9Cs9wA
- Clickstream video: http://youtu.be/OBYYFLmdCXg
• To get the Sandbox
- http://www.talend.com/contact

Step-by-Step Directions
• Completely Self-contained Demo VM Sandbox
• Key Scenarios like Clickstream Analysis

Come try the Sandbox
Hortonworks Dev Café & Talend

Talend Platform for Big Data v5.4
[Diagram: Talend Platform for Big Data]
• RUNTIME PLATFORM (Java, Hadoop, SQL, etc.)
• TALEND UNIFIED PLATFORM: Studio, Repository, Deployment, Execution, Monitoring
• DATA INTEGRATION: Data Access, ETL / ELT, Version Control, Business Rules, Change Data Capture, Scheduler, Parallel Processing, High Availability
• BIG DATA QUALITY: Hive Data Profiling, Drill-down to Values, DQ Portal and Monitoring, Data Stewardship, Report Design, Address Validation, Custom Analysis, M/R Parsing and Matching
• BIG DATA: Hadoop 2.0, MapReduce ETL/ELT, HCatalog/meta-data, Pig, Sqoop, Hive, Hadoop Job Scheduler, Google BigQuery, NoSQL Support, HDFS

NonStop HBase – Making HBase Continuously Available for Enterprise Deployment
Dr. Konstantin Boudnik – Director, Advanced Technologies, WANdisco
Brett Rudenstein – Senior Product Manager, WANdisco

WANdisco: continuous availability company
• WANdisco := Wide Area Network Distributed Computing
• We solve availability problems for enterprises. If you can’t afford 99.999%, we’ll help
• Publicly traded on the London Stock Exchange since mid-2012 (LSE:WAND)
• Apache Software Foundation sponsor; actively contributing to Hadoop, SVN, and others
• US-patented active-active replication technology
• Located on three continents
• Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
• Subversion, Git, Hadoop HDFS, HBase at 200+ customer sites

Traditionally everybody relies on backups

HA is (mostly) a glorified backup
• Redundancy of critical elements
- Standby servers
- Backup network links
- Off-site copies of critical data
- RAID mirroring
• Baseline:
- Create and synchronize replicas
- Clients switch over in case of failure
- Extra hardware lying idle, spinning “just in case”

A Typical Architecture (HDFS HA)

WANdisco Active-Active Architecture
• 100% uptime with WANdisco’s patented replication technology
- Zero downtime / zero data loss
- Enables maintenance without downtime
• Automatic recovery of failed servers; automatic rebalancing as workload increases
[Diagram label: HDFS Data]

Multi-threaded Server Software:
Multiple threads processing client requests in a loop
[Diagram: a server process in which threads 1, 2 and 3 each repeat the loop: get client request (e.g. hbase put), acquire lock, make change to state (db), release lock, send return value to client; the resulting OPs from the three threads are interleaved]
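
The loop on this slide can be sketched with Python's standard `threading` and `queue` modules. It is a toy illustration, not HBase code: the request format (a key and an increment) and the `serve` function are assumptions.

```python
import queue
import threading


def serve(requests, num_threads=3):
    """Process (key, increment) requests with several worker threads."""
    state = {}                      # the shared "db"
    state_lock = threading.Lock()   # guards every change to the state
    replies = []                    # stands in for values sent back to clients
    replies_lock = threading.Lock()

    inbox = queue.Queue()
    for request in requests:
        inbox.put(request)

    def worker():
        while True:
            try:
                key, increment = inbox.get_nowait()   # get client request
            except queue.Empty:
                return
            with state_lock:                          # acquire lock
                state[key] = state.get(key, 0) + increment
                result = state[key]
            # The lock is released on leaving the `with` block.
            with replies_lock:
                replies.append((key, result))         # send return value

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return state, replies
```

The final state is deterministic here because the operations commute, but the order in which the threads apply them is not. That ordering problem is what makes replicating such a server to several machines hard.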

Ways to achieve single server redundancy

Using a TCP Connection to send data to three replicated servers (Load Balancer)
[Diagram: a client sends a stream of OPs through a load balancer to the server processes on server1, server2 and server3; server1 and server2 are shown with four OPs each, server3 with two]

HBase WAL replication
• The state machine (HRegion contents, HMaster metadata, etc.) is modified first
• The modification log (HBase WAL) is sent to highly available shared storage
• Standby server(s) read the edit log and serve as warm standbys, ready to take over should the active server fail
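
A toy model of this log-shipping scheme, written in Python, makes the replication lag visible. The class names are hypothetical and nothing here is HBase API; the shared storage is a simple in-memory list.

```python
class SharedLog:
    """Stands in for the highly available shared storage holding the WAL."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        self.entries.append(entry)

    def read_from(self, offset):
        return self.entries[offset:]


class ActiveServer:
    """The single active server: changes state, then ships the edit."""

    def __init__(self, log):
        self.state = {}
        self.log = log

    def put(self, key, value):
        self.state[key] = value        # state machine is modified first
        self.log.append((key, value))  # then the edit goes to shared storage


class StandbyServer:
    """A warm standby that replays edits it has not applied yet."""

    def __init__(self, log):
        self.state = {}
        self.log = log
        self.applied = 0               # number of log entries already replayed

    def catch_up(self):
        for key, value in self.log.read_from(self.applied):
            self.state[key] = value
            self.applied += 1
```

Until `catch_up()` runs, the standby's state trails the active server's, which is one reason failover in this design takes time.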

HBase WAL replication
[Diagram: server1, the single active server, runs the server process that handles the OPs and writes WAL entries to shared storage; server2, the standby server, reads them from the shared storage]

HBase WAL tailing, WAL Snapshots etc.
• Only one active region server is possible
• Failover takes time
• Failover is error prone
• RegionServer failover isn’t seamless for clients

Implementing multiple active masters
with Paxos coordination
(not about leader election)

Three replicated servers
[Diagram: several clients send OPs to server1, server2 and server3; each server process is paired with its own Distributed Coordination Engine (DConE), and the engines coordinate with one another using Paxos]

HBase Continuous Availability
(multiple active masters)

HBase Single Points of Failure
• Single HBase Master
- Service interruption after Master failure
• HBase client
- Client session doesn’t fail over after a RegionServer failure
• HBase Region Server: downtime
- 30 secs ≤ MTTR ≤ 200 secs
• Region major compaction (not a failure, but…)
- (un)-scheduled downtime of a region for compaction

HBase Region Server
& Master Replication

NonStopRegionServer:
[Diagram: an HBase client talks to NonStopRegionServer 1 and NonStopRegionServer 2; each wraps an HRegionServer and its client service (e.g. multi), and the two are linked through DConE]
1. Client calls HRegionServer multi
2. NonStopRegionServer intercepts
3. NonStopRegionServer makes a Paxos proposal using the DConE library
4. Proposal comes back as an agreement on all NonStopRegionServers
5. NonStopRegionServer calls super.multi on all nodes. State changes are recorded
6. NonStopRegionServer 1 alone sends the response back to the client
HMaster is similar
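
The six steps above can be mimicked in a small Python sketch. It is not WANdisco's implementation: the `Coordinator` class is a hypothetical, single-process stand-in for DConE that simply hands out sequence numbers, whereas a real coordination engine would reach the same global order with a consensus protocol such as Paxos. The `Replica` class and its methods are likewise assumptions.

```python
class Coordinator:
    """Stand-in for the coordination engine (NOT Paxos).

    It orders proposals and delivers each one, as an agreement, to every
    registered replica in the same order.
    """

    def __init__(self):
        self.replicas = []
        self.next_seq = 0

    def register(self, replica):
        self.replicas.append(replica)

    def propose(self, origin, puts):
        seq = self.next_seq
        self.next_seq += 1
        # Step 4: the agreement reaches every replica.
        results = {r.name: r.on_agreement(seq, puts) for r in self.replicas}
        # Step 6: only the originating replica's result goes to the client.
        return results[origin.name]


class Replica:
    """Plays the role of one NonStopRegionServer."""

    def __init__(self, name, coordinator):
        self.name = name
        self.state = {}
        self.applied = []   # sequence numbers, in the order they were applied
        self.coordinator = coordinator
        coordinator.register(self)

    def multi(self, puts):
        """Steps 1-3: the client call is intercepted and proposed."""
        return self.coordinator.propose(self, puts)

    def on_agreement(self, seq, puts):
        """Step 5: apply the agreed operation locally."""
        for key, value in puts:
            self.state[key] = value
        self.applied.append(seq)
        return len(puts)
```

A client may send writes to either replica. Since both apply the same operations in the same agreed order, their states stay identical without shared files or log tailing.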

HBase RegionServer replication using WANdisco DConE
• Shared nothing architecture
• HFiles, WALs etc. are not shared
• Replica count is tuned
• Snapshots of HFiles do not need to be created
• Messy details of WAL tailing are not necessary:
- WAL might not be needed at all (!)
• Not an eventual consistency model
• Does not serve up stale data
Thank you
Konstantin Boudnik
cos@wandisco.com
@c0sin
