In this Meetup, Yaar Reuveni (Team Leader) and Nir Hedvat (Software Engineer) from the LivePerson Data Platform R&D team talk about the journey from the early days of the data platform in production, with high friction and low awareness of issues, to a mature, measurable data platform that is visible and trustworthy.
2. Yaar Reuveni & Nir Hedvat
Becoming a Proactive Data Platform
3. Yaar Reuveni
• 6 years at LivePerson
– 1 year in Reporting & BI
– 3 years in Data Platform
– 2 years as Data Platform team lead
• I love to travel
4. Nir Hedvat
• Software Engineer, B.Sc.
• 3 years as a C++ Developer at IBM Rational Rhapsody™
• 1.5 years at LivePerson
• Cloud and Parallel Computing enthusiast
• Love Math and Powerlifting
5. Agenda
• Our Scale & Operation
• Evolution in becoming proactive
i. Hope & Low awareness
ii. Storming & Troubleshooting
iii. Fortifying
iv. Internalization & Comprehension
v. Being Proactive
• Showcases
• Implementation
6. Our Scale
• 2 M Daily chats
• 100 M Daily monitored visitor sessions
• 20 B Events per day
• 2 TB Raw data per day
• 2 PB Total in Hadoop clusters
• Hundreds of producers × event types × consumers
8. Stage 1: Hope & Low awareness
We built it and it’s awesome
[Pipeline diagram: Online producer and Offline producer → local files → DSPT Jobs → Raw Data]
* DSPT - Data single point of truth
9. Stage 1: Hope & Low awareness
We’ve got customers
[Diagram: consumers of the platform - Dashboards, Data Science, Apps, Reporting, Data Access, Ad-Hoc Queries]
10. Stage 2: Storming & Troubleshooting
You’ve got NOC & SCS on speed dial
Issues arise:
• Data loss
• Data delays
• Partial data arriving outside the expected time frame
• Missing/faulty calculations for consumers
• One producer does not send for over a week
11. Stage 2: Storming & Troubleshooting
You’ve got NOC & SCS on speed dial
Common issue types and their sources:
• Hadoop ops
• Production ops
• Events schema
• New data producers
• High rate of new features (LE2.0)
• Data stuck in pipeline
• Bugs
14. Stage 3: Fortifying
Every interruption drives a new protection
• Monitors on jobs, failures, success rate
• Monitors on service status
• Simple data freshness checks, e.g. measuring the newest event (see the sketch below)
• Measure latency of specific parts of the pipeline
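A minimal sketch of such a "newest event" freshness check, assuming the raw events sit as Parquet on HDFS with an epoch-millis creation_time column; path, column name, and the 30-minute threshold are assumptions:

// Minimal freshness check: alert if the newest event is older than a threshold.
// Path, column name and threshold are assumptions, not the actual setup.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().appName("freshness-check").getOrCreate()
import spark.implicits._

val newestMillis = spark.read.parquet("hdfs:///data/raw/events")
  .agg(max($"creation_time")).as[Long].first()        // newest event's creation time

val lagMinutes = (System.currentTimeMillis() - newestMillis) / 60000
if (lagMinutes > 30)                                   // arbitrary 30-minute threshold
  println(s"ALERT: newest event is $lagMinutes minutes old")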
15. Stage 4: Internalization & Comprehension
Auditing requirements
• Measure principles:
– Loss
• How much?
• Which customer?
• What Type?
• Where in the pipeline?
– Freshness
• Percentiles
• Trends
– Statistics
• Event type count
• Events per LP customer
• Trends
17. Stage 4: Internalization & Comprehension
Mechanism
1. Enrich events with audit metadata: each data event carries a Common Header plus an Audit Header
2. Every X minutes, send a control event (audit aggregation), which itself carries a Common Header and an Audit Header
18. Stage 4: Internalization & Comprehension
Mechanism
[Diagram: Old Data Flow - data events carry only a Common Header; Audited Data Flow - data events carry a Common Header and an Audit Header, with periodic Control Events (audit aggregations) interleaved, each with its own Common Header and Audit Header]
19. Stage 4: Internalization & Comprehension
How to measure loss?
• Tag all events going through our API with an auditing header:
<host_name>:<bulk_id>:<sequence_id>
Where:
• host_name - the logical identification of the producer server
• bulk_id - an arbitrary unique number that identifies a bulk (changes every X minutes)
• sequence_id - an auto-incremented, persisted number used to identify missing bulks
• Every X minutes, send an audit control event:
{
  eventType: AuditControlEvent,
  Bulks: [{bulk_id: "srv-xyz:111:97", data_tier: "shark producer", total_count: 785},
          {bulk_id: "srv-xyz:112:98", data_tier: "shark producer", total_count: 1715}]
}
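A minimal sketch, in Scala, of how a producer could maintain this header and emit the periodic control event; the class and method names are hypothetical, not LivePerson's actual implementation:

// Hypothetical producer-side auditing sketch: every outgoing event is tagged
// with <host_name>:<bulk_id>:<sequence_id>, and every X minutes a control
// event summarizes how many events were sent in the closed bulk.
case class AuditHeader(hostName: String, bulkId: Long, sequenceId: Long) {
  override def toString: String = s"$hostName:$bulkId:$sequenceId"
}

class Auditor(hostName: String) {
  private var sequenceId = 0L                          // persisted in a real producer
  private var bulkId     = System.currentTimeMillis()  // arbitrary unique id per bulk
  private var bulkCount  = 0L

  /** Tag one outgoing event, returning the header to embed in it. */
  def tag(): AuditHeader = synchronized {
    bulkCount += 1
    AuditHeader(hostName, bulkId, sequenceId)
  }

  /** Called every X minutes: close the current bulk and emit a control event. */
  def rollBulk(dataTier: String): String = synchronized {
    val control =
      s"""{"eventType": "AuditControlEvent",
         | "Bulks": [{"bulk_id": "$hostName:$bulkId:$sequenceId",
         |            "data_tier": "$dataTier", "total_count": $bulkCount}]}""".stripMargin
    bulkId = System.currentTimeMillis()                // start a new bulk
    sequenceId += 1                                    // gaps in sequence_id reveal missing bulks
    bulkCount = 0L
    control
  }
}

In this sketch, rollBulk would be driven by a timer, and the returned control event would be sent down the same pipeline as the data events.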
20. Stage 4: Internalization & Comprehension
What’s next?
• Immediate gain: enables loss research directly on the raw data
Next:
• Count events per auditing bulk
• Load the counts into a DB for dashboarding:

Audit metadata      | Data Tier | Insertion time | Events count
srv-xyz:1a2b3c:25   | Producer  | 08:34          | 750
srv-xyz:1a2b3c:25   | HDFS      | 09:05          | 405
srv-xyz:1a2b3c:25   | HDFS      | 10:13          | 250

In this example, assuming we look at the table after 11:34 and treat anything arriving more than 3 hours late as loss: server srv-xyz created 750 events in bulk 1a2b3c, but only 405 + 250 = 655 of them reached HDFS within 3 hours, so we can detect a loss of 95 events from this server.
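A sketch of how the table above could be turned into per-bulk loss numbers with Spark; the table and column names follow the dashboard example and are assumptions (the 3-hour window filter is omitted for brevity):

// Compare producer-side counts against what actually reached HDFS, per audit bulk.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("loss-detection").getOrCreate()
import spark.implicits._

// audit_counts(audit_metadata, data_tier, insertion_time, events_count)
val audit = spark.read.table("audit_counts")

val produced = audit.filter($"data_tier" === "Producer")
  .groupBy($"audit_metadata").agg(sum($"events_count").as("produced"))

val arrived = audit.filter($"data_tier" === "HDFS")
  .groupBy($"audit_metadata").agg(sum($"events_count").as("arrived"))

val loss = produced.join(arrived, Seq("audit_metadata"), "left")
  .na.fill(0L, Seq("arrived"))
  .withColumn("lost", $"produced" - $"arrived")
  .filter($"lost" > 0)

loss.show()   // e.g. srv-xyz:1a2b3c:25 -> produced 750, arrived 655, lost 95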
21. Stage 4: Internalization & Comprehension
How to measure freshness?
• Run incrementally on the raw data
• Group events by
– Total
– Event type
– LP customer
• Per event calculate
Insertion time - creation time
• Per group:
– Total count
– Min, max & average
– Count into time buckets (0-30; 30-60; 60-120; 120-∞)
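A sketch of the freshness calculation above as a Spark aggregation, grouped here by event type; column names, epoch-millis timestamps, and minute-based buckets are assumptions:

// Freshness per event type: latency = insertion time - creation time,
// then total count, min/max/avg and time-bucket counts per group.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("freshness").getOrCreate()
import spark.implicits._

val events = spark.read.table("raw_events")
  .withColumn("latency_min", ($"insertion_time" - $"creation_time") / lit(60000.0))

val freshness = events.groupBy($"event_type").agg(
  count(lit(1)).as("total"),
  min($"latency_min").as("min_latency"),
  max($"latency_min").as("max_latency"),
  avg($"latency_min").as("avg_latency"),
  sum(when($"latency_min" <  30, 1).otherwise(0)).as("bucket_0_30"),
  sum(when($"latency_min" >= 30  && $"latency_min" < 60,  1).otherwise(0)).as("bucket_30_60"),
  sum(when($"latency_min" >= 60  && $"latency_min" < 120, 1).otherwise(0)).as("bucket_60_120"),
  sum(when($"latency_min" >= 120, 1).otherwise(0)).as("bucket_120_plus"))

The same aggregation can be repeated for the total and per LP customer by changing the groupBy keys.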
29. Showcase II
Deployment issue
• Constant loss
• Only in one farm
• Depends on traffic
• Only a specific producer type
• From all of its nodes
30. Showcase III
Consumer jobs issues
• Our auditing detected a loss in Alpha
• Data stuck in a job failure dir
• Functional monitoring missed it
• We streamed the stuck data
36. Implementation
Audit Aggregator
• Load data from HDFS
• Aggregate events according to audit metadata
• Save aggregated audit data to MySQL
• Spark implementation
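A sketch of the Audit Aggregator job following the bullets above; the input path, output table, and connection details are hypothetical:

// Load raw events from HDFS, count them per audit header and data tier,
// and write the aggregated audit data to MySQL for dashboarding.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions._
import java.util.Properties

object AuditAggregator {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("audit-aggregator").getOrCreate()
    import spark.implicits._

    // Raw events already enriched with the audit header (stage 4 mechanism)
    val events = spark.read.parquet("hdfs:///data/raw/events/dt=2016-01-01")

    // One row per audit header and data tier, with the event count
    val aggregated = events
      .groupBy($"audit_metadata", $"data_tier")
      .agg(count(lit(1)).as("events_count"))
      .withColumn("insertion_time", current_timestamp())

    val props = new Properties()
    props.setProperty("user", "audit")
    props.setProperty("password", "***")
    aggregated.write.mode(SaveMode.Append)
      .jdbc("jdbc:mysql://db-host:3306/audit", "audit_counts", props)
  }
}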
38. Audit Aggregator job
Overcoming Pitfalls
• Our jobs work incrementally or manually
• Offset management is done via ZooKeeper
• A failure during the saving stage leads to a lost offset
• Solution: save the data and the offset on the same stream (see the sketch below)
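One way to realize "save the data and the offset on the same stream" is a single MySQL transaction that writes the aggregated counts and the new offset together, so a failure cannot leave the offset out of sync with the data it describes; the table and column names are hypothetical, and this is only one possible reading of the slide:

// Write audit counts and the processed offset atomically: both commit, or neither.
import java.sql.DriverManager

def saveWithOffset(rows: Seq[(String, String, Long)], jobName: String, offset: Long): Unit = {
  val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/audit", "audit", "***")
  try {
    conn.setAutoCommit(false)                          // single transaction
    val insert = conn.prepareStatement(
      "INSERT INTO audit_counts (audit_metadata, data_tier, events_count) VALUES (?, ?, ?)")
    for ((meta, tier, cnt) <- rows) {
      insert.setString(1, meta); insert.setString(2, tier); insert.setLong(3, cnt)
      insert.addBatch()
    }
    insert.executeBatch()

    val upsert = conn.prepareStatement(
      "REPLACE INTO job_offsets (job_name, last_offset) VALUES (?, ?)")
    upsert.setString(1, jobName); upsert.setLong(2, offset)
    upsert.executeUpdate()

    conn.commit()                                      // data + offset saved together
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}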
46. Hadoop Platform
Overcoming Pitfalls
• Our data model is built over Avro
• Avro comes with schema evolution
• Avro data is stored along with its schema
• High model-modification rate
• LOBs' schema changes are synchronized Producer → Consumer
47. Hadoop Platform
Overcoming Pitfalls
• An MR/Spark job is compiled against a specific schema revision when using SpecificRecord
• Using GenericRecord removes the burden of recompiling each time the schema changes
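A minimal sketch of reading Avro data with GenericRecord (the file and field names are hypothetical): fields are looked up by name at runtime, so the job does not need to be recompiled for every schema revision the way SpecificRecord bindings do:

// Read an Avro container file with GenericRecord; the writer schema is taken
// from the file itself, so no generated classes are required.
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import java.io.File

val reader = new DataFileReader[GenericRecord](
  new File("event.avro"), new GenericDatumReader[GenericRecord]())

while (reader.hasNext) {
  val record: GenericRecord = reader.next()
  // Fields are accessed by name; fields added in newer schema revisions
  // simply do not need to be referenced here.
  val eventType = record.get("eventType")
  println(eventType)
}
reader.close()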