Big data – can it deliver speed and accuracy v1

BIG DATA ANALYTICS
Can it deliver speed and accuracy?
Risk & Compliance Engineering, Paypal
Gurinder S. Grewal
This deck contains generic architecture information, and does not
reflect the exact details of current or planned systems.
June 2013

ABOUT PAYPAL
• 123MM active users
• 190 markets, 25 currencies
• $300,000 payments processed/minute
• 2B+ events/day
• 12 TB new data added per day
• 500K+ real time queries per second
• < 100ms average response time
we are talking, a lot of data …big data!

WHAT IS BIG DATA?
transactions
interactions
observations
petabytes of data
diverse analytics
variety of data structures
hadoop
large number of characteristics
large map/reduce cluster
terradata

GROWING COMPLEXITY AND EXPECTATIONS
Emerging technologies in the modern world are opening up possibilities
for sophisticated analytics.
Data infrastructure is growing, so are the expectations - make decisions
fast and with higher accuracy!
FraudSophistication
DataComplexity
time
Simple rules, black/white lists
Linear Modes, aggregated variables
Location, Time Analysis
lowhigh
lowhigh
Inline Histories Analysis
Consistency
Networks..

Time taken to make a decision
DECISIONS MUST BE QUICK
• A gang of cyber-criminals stole $45 million in a matter of hours
• More than 36,000 transactions were made worldwide and about $40
million was stolen in 6 hours
Source: http://www.huffingtonpost.com/2013/05/09/atm-fraud_n_3248331.html
BusinessValue
80
low
high
Prevention
Fast
Detection
High Fraud
Loss
Fraudloss
low
high

DECISIONS MUST BE ACCURATE
11:01AM
11:05AM
11:06AM
• Credit card used from three distance locations in short time
Result based on realtime analysis: Block the card, not decided?
• According to past purchasing behavior
• Card holder lives in US - wife paid bill online from home PC
• Card holder’s kid studies in Europe - used card to purchase books
• Card holder travels to Japan - paid for lunch
Result based on historical analysis: It’s a legit usage

DO WE HAVE CONFLICTING REQUIREMENTS?
speed
• analyze data incoming at high velocity in split second
• consume data in timely manner to make decisions
accuracy
• utilize powerful analytics techniques (text mining, predictive analysis)
• processing large variety and volume of data (details)
cost
• can’t spend a dollar to save a penny – pick a right tool for right job

TIERED BIG DATA STRATEGY
real time
e.g. filters
near real time
e.g. correlations
offline
e.g. behavioral analysis
cost, speed
data volume, accuracy
effective decision = fn(accuracy, speed, cost)
data age
secondshoursyears
Data in-motion
Data in-use

BIG DATA - COMPUTATION STRATEGY
Offline
(map-reduce, batch)
Offline variablesOnline variables
Near Real-time
(complex event
processing)
Realtime
(in-flow processing)
• fast, very stringent availability and performance SLA’s
• computations are simple and eventually accurate
• computations are transient, short lived (user sessions)
• event-driven, incremental processing
• high efficiency and scalability
• data for short time windows (hours)
• optimized for throughput
• computations are slow and accurate
• data captured as events for historical analysis

Hadoop Technology Stack
BIG DATA IN USE - OFFLINE ECOSYSTEM
HDFS HBase
Map Reduce Framework
Data Storage
Data Processing Data Integration
ETL
Flume, Sqoop
Programming Languages
Pig Hive QL
Scheduling, Coordination
Zookeeper
Oozie
UI Framework/SDK
Hue Hue SDK
Structured Data
Unstructured Data
MPP DW RDBMS

BIG DATA IN MOTION – ONLINE ECOSYSTEM
Complex Event Processing
correlations
filtering
aggregations
pattern matching In-memory data store
Message Bus Offline
Decision Service
Events stream
CEP enables continuous analytics on data in motion
• Solution for velocity of big data
• Well suited for detection, decisioning, alerting and taking actions
• Relies on in-memory data grid for ability to provide low latency
Monitoring

BIG DATA MOVEMENT
Offline
Data movement between offline and online is the key and biggest challenge
• ETL jobs require custom coding, biggest bottleneck
• Data transfer very expensive, slow across networks, multiple data centers
• Online data stores are not optimized for parallel or bulk loads
Slows down data store during ETL operation
Negatively impacts online applications availability
Data
Cloud Offline

BIG DATA MOVEMENT EVOLUTION
Offline
In-memory data store
Offline
NoSQL
(persistent backing store)
In-memory data store
Two-tier architecture
Data Cloud
Data Cloud
Initial state
• 500GB in 16 hours
Optimization – Phase 1
• 2 TB in 16 hours
• Split data files prepared offline
• Maximize data load parallelism
• Maximum data compression
• Optimize data format
• Validation before data movement
Scale – Phase 2
• 10 TB in 6 hours
• Add persistent NoSQL behind in-memory store
• Blast bulk load into NoSQL store
• Batch process will warm the cache
• Lazy warm-up as needed, while serving r/w
• Refresh cache contents via time based evictions
Batch
Multi-tier architecture

Confidential and Proprietary14
USE CASE: GRAPH BASED DECISIONING
Map/Reduce Graph
builder
In-memory graph store
Online Graph Server
Daily
incremental
updates
Continuous
graph
updates and
rollup
• Generate graph and associated complex variables on Hadoop on daily basis
• Move the incremental changes to online in-memory graph store
• Based on event stream, keep graph, offline variables up-to-date
• In-memory store provides fast read only access to Decision services
Decision Service
Avg. read time: 2ms
95th percentile: 6ms
Events stream
offline online

Confidential and Proprietary15
• Hadoop is best for offline processing of variety and volume data – not for real time
• CEP is a solution for online, big data in motion (velocity), complements Hadoop
• Harness true power of big data by combining offline and online data
• Data integration is a key – careful planning and optimization is needed
• Online data stores are not optimized for highly parallel writes, bulk loads
• Big data can solve complex problems while delivering speed and accuracy
CONCLUSION

Big data – can it deliver speed and accuracy v1

Big data – can it deliver speed and accuracy v1

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (11)

Similaire à Big data – can it deliver speed and accuracy v1

Similaire à Big data – can it deliver speed and accuracy v1 (20)

Dernier

Dernier (20)

Big data – can it deliver speed and accuracy v1

Notes de l'éditeur