In this talk, Jeff Magnusson, Manager of Data Platform Architecture at Netflix, discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at Samsung R&D.
While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open-sourced Lipstick addresses this problem. Jeff covers the architecture, implementation, and future of Lipstick, as well as use cases at Netflix, such as using Lipstick to speed up development and improve the efficiency of new and existing scripts.
Jeff manages the Data Platform Architecture group at Netflix, where he is helping to build a service-oriented architecture that enables easy access to large-scale, cloud-based analytical processing and analysis of data across the organization. Prior to Netflix, he received his PhD from the University of Florida, focusing on database system implementation.
2. Motivation
Data should be accessible, easy to discover, and easy to process for everyone.
3. Big Data Users at Netflix
Users: Analysts and Engineers
Desires:
• Self Service
• Easy
• Rich Toolset (Analysts) / Rich APIs (Engineers)
A Single Platform / Data Architecture that Serves Both Groups
4. Netflix Data Warehouse - Storage
• S3 is the source of truth
– Decouples storage from processing.
– Persistent data; multiple/transient Hadoop clusters
• Data sources
– Event data from cloud services via Ursula/Honu
– Dimension data from Cassandra via Aegisthus
• ~100 billion events processed / day
• Petabytes of data persisted and available to queries on S3.
5. Netflix Data Platform - Processing
• Long-running clusters
– SLA and ad hoc
• Supplemental nightly bonus clusters
– For high-priority ETL jobs
• 2,000+ instances in aggregate across the clusters
7. Netflix Data Platform – Primitive Service Layer
• Primitive, decoupled services
– Building blocks for more complicated tools/services/apps
• Serves 1000s of MapReduce jobs / day
• 100+ jobs concurrently
8. Netflix Data Platform – Tools
• Sting (Adhoc Visualization)
• Looper (Backloading)
• Forklift (Data Movement)
• Ignite (A/B Test Analytics)
• Lipstick (Workflow Visualization)
• Spock (Data Auditing)
Heavily utilize services in the primitive layer.
Follow the same design philosophy as primitive apps:
• RESTful API
• Decoupled JavaScript interfaces
9. Pig and Hive at Netflix
• Hive
– AdHoc queries
– Lightweight aggregation
• Pig
– Complex Dataflows / ETL
– Data movement “glue” between complex operations
10. What is Pig?
• A data flow language
• Simple to learn
– Very few reserved words
– Comparable to a SQL logical query plan
• Easy to extend and optimize
• Extendable via UDFs written in multiple languages
– Java, Python, Ruby, Groovy, JavaScript
11. Sample Pig Script* (Word Count)
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
* http://en.wikipedia.org/wiki/Pig_(programming_tool)#Example
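For readers who do not know Pig Latin, the same data flow can be mirrored step by step in plain Python. This is only an illustrative sketch, not how Pig executes the script: the function name `word_count` and the in-memory list input are assumptions, and whitespace splitting stands in for Pig's TOKENIZE, which also splits on a few punctuation delimiters.

```python
import re
from collections import Counter

def word_count(lines):
    """Mirror the Pig word-count flow: tokenize, filter, group, count, order."""
    # FOREACH ... GENERATE FLATTEN(TOKENIZE(line)): one word per row
    words = [w for line in lines for w in line.split()]
    # FILTER ... MATCHES: keep only words made entirely of word characters
    filtered = [w for w in words if re.fullmatch(r"\w+", w)]
    # GROUP BY word, then COUNT the entries in each group
    counts = Counter(filtered)
    # ORDER BY count DESC (ties broken alphabetically here for a stable result)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(word_count(["a rose is a rose", "is it -- a rose?"]))
```

Each Python statement corresponds to one alias in the Pig script, which is roughly the one-operator-per-line view that a visualization of the script presents.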
13. Pig…
• Data flows are easy & flexible to express in text
– Facilitates code reuse via UDFs and macros
– Allows logical grouping of operations vs grouping by order of execution.
– But errors are easy to make and overlook.
• Scripts can quickly get complicated
• Visualization quickly draws attention to:
– Common errors
– Execution order / logical flow
– Optimization opportunities
20. Lipstick for Fast Development
• During development:
– Keep track of data flow
– Spot common errors
• Omitted (hanging) operators
• Data type issues
– Easily estimate and optimize complexity
• Number of MR jobs generated
• Map only vs full Map/Reduce jobs
• Opportunities to rejigger logic to:
– Combine multiple jobs into a single job
– Manipulate execution order to achieve better parallelism (e.g. less blocking)
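To make the "omitted (hanging) operators" error concrete, here is a minimal sketch of how such operators could be detected in a data flow graph. It is not Lipstick's implementation: the function `find_hanging_aliases`, the dictionary representation of the flow, and the `long_words` alias are all hypothetical.

```python
def find_hanging_aliases(inputs, stored):
    """Return aliases whose output never reaches a STORE.

    inputs: dict mapping each alias to the list of aliases it reads from.
    stored: aliases that are written out with STORE.
    """
    # Walk backwards from every stored alias, marking everything it depends on.
    reachable = set()
    stack = list(stored)
    while stack:
        alias = stack.pop()
        if alias in reachable:
            continue
        reachable.add(alias)
        stack.extend(inputs.get(alias, []))
    # Anything left unmarked is defined but contributes to no output.
    return sorted(set(inputs) - reachable)

flow = {
    "input_lines": [],
    "words": ["input_lines"],
    "filtered_words": ["words"],
    "word_groups": ["filtered_words"],
    "word_count": ["word_groups"],
    "long_words": ["words"],  # hypothetical alias that nothing consumes
}
print(find_hanging_aliases(flow, stored=["word_count"]))
```

In a rendered graph the same condition shows up visually as a node with no outgoing edge that is not a STORE, which is why it is easy to spot.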
21. Lipstick for Job Monitoring
• During execution:
– Graphically monitor execution status from a single console
– Spot optimization opportunities
• Map vs reduce side joins
• Data skew
• Better parallelism settings
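One simple way to quantify the data skew mentioned above is to compare the busiest reducer with a typical one. The sketch below assumes per-reducer input record counts are available; the function name `skew_ratio` and the sample numbers are made up for illustration and do not come from Lipstick.

```python
from statistics import median

def skew_ratio(records_per_reducer):
    """Ratio of the busiest reducer's input to the median reducer's input."""
    mid = median(records_per_reducer)
    if mid == 0:
        # Avoid dividing by zero when most reducers received nothing.
        return float("inf") if max(records_per_reducer) > 0 else 1.0
    return max(records_per_reducer) / mid

counts = [1000, 1100, 950, 1050, 9000]  # hypothetical counts for five reducers
print(skew_ratio(counts))
```

A ratio near 1 suggests an even distribution; a large ratio means one reducer is doing most of the work, so the job's runtime is dominated by that single task.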
22. Lipstick for Support
• Empowers users to support themselves
– Better operational visibility
• What is my script currently doing?
• Why is my script slow?
– Examine intermediate output of jobs
– All execution information in one place
• Facilitates communication between infrastructure / support teams and end users
– Lipstick link contains all information needed to provide support.
24. Lipstick Architecture - Console
• Implements PigProgressNotificationListener interface
• Listens for:
1. New statements to be registered (unoptimized plan)
2. Script launched event (optimized, physical, M/R plan)
3. MR Job completion/failure event
4. Heartbeat progress (during execution)
• Converts Pig plans and progress into Lipstick objects
• Communicates with Lipstick Server
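The four notification types listed above can be pictured as callbacks on a listener object. The following is a language-neutral sketch in Python, not the Java `PigProgressNotificationListener` interface itself: the class name and every method name here are hypothetical.

```python
class ConsoleListenerSketch:
    """Collects the four kinds of notifications into one status dict."""

    def __init__(self):
        self.status = {"plans": {}, "jobs": {}, "progress": 0}

    def on_plan_registered(self, unoptimized_plan):
        # 1. New statements registered: keep the unoptimized logical plan.
        self.status["plans"]["unoptimized"] = unoptimized_plan

    def on_script_launched(self, optimized, physical, mapreduce):
        # 2. Script launched: the remaining plans become available.
        self.status["plans"].update(
            optimized=optimized, physical=physical, mapreduce=mapreduce
        )

    def on_job_finished(self, job_id, succeeded):
        # 3. An MR job completed or failed.
        self.status["jobs"][job_id] = "completed" if succeeded else "failed"

    def on_heartbeat(self, percent):
        # 4. Periodic progress update during execution.
        self.status["progress"] = percent

listener = ConsoleListenerSketch()
listener.on_plan_registered("unoptimized plan placeholder")
listener.on_heartbeat(40)
listener.on_job_finished("job_1", succeeded=True)
print(listener.status)
```

The accumulated status object is the kind of payload the console would then send on to the server.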
25. Pig Compilation Plans
Pig Script
→ Unoptimized Logical Plan (~1:1 logical operator / line of Pig)
→ Optimized Logical Plan
→ Physical Plan
→ MapReduce Plan (grouping of Physical Operators into map or reduce jobs)
Lipstick associates Logical Operators with MapReduce jobs by inferring relationships between Logical and Physical Operations.
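The slides do not describe how the relationships are inferred, so the following is only one plausible sketch of the idea. It assumes that each physical operator records the alias of the logical operator it was compiled from; the function `map_aliases_to_jobs`, the plan layout, and the job ids are hypothetical.

```python
def map_aliases_to_jobs(mr_plan):
    """Map each logical alias to the set of MapReduce jobs that execute it.

    mr_plan: dict of job id -> list of physical operators, each a dict with
    an 'alias' key naming the logical operator it came from (an assumption).
    """
    alias_to_jobs = {}
    for job_id, operators in mr_plan.items():
        for op in operators:
            alias_to_jobs.setdefault(op["alias"], set()).add(job_id)
    return alias_to_jobs

mr_plan = {
    "job_1": [{"alias": "words"}, {"alias": "word_groups"}, {"alias": "word_count"}],
    "job_2": [{"alias": "ordered_word_count"}],
}
print(map_aliases_to_jobs(mr_plan))
```

With a mapping like this, the status of a MapReduce job can be drawn directly onto the logical operators that the script author wrote.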
26. Lipstick Architecture - Server
• Simple REST interface
• It’s a Grails app!
• Pig client posts plans and puts progress
• JavaScript client
– Gets plans and progress
– Searches jobs by job name and user name
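The post/put/get/search split above can be sketched as an in-memory store. This is not the Grails application and defines no real endpoints: the class `PlanStoreSketch`, its method names, and the substring-based search are assumptions made for illustration.

```python
import itertools

class PlanStoreSketch:
    """In-memory stand-in for the server's resource semantics."""

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def post_plan(self, job_name, user_name, plan):
        # POST: register a new job with its plans and return the new id.
        job_id = next(self._ids)
        self._jobs[job_id] = {
            "job_name": job_name, "user_name": user_name,
            "plan": plan, "progress": None,
        }
        return job_id

    def put_progress(self, job_id, progress):
        # PUT: overwrite the latest progress of an existing job.
        self._jobs[job_id]["progress"] = progress

    def get(self, job_id):
        # GET: return the stored plans and progress.
        return self._jobs[job_id]

    def search(self, job_name=None, user_name=None):
        # Return ids of jobs whose names contain the given substrings.
        return [
            job_id for job_id, job in self._jobs.items()
            if (job_name is None or job_name in job["job_name"])
            and (user_name is None or user_name in job["user_name"])
        ]

store = PlanStoreSketch()
job_id = store.post_plan("word_count", "alice", plan={"nodes": []})
store.put_progress(job_id, {"percent": 50})
print(store.get(job_id)["progress"], store.search(user_name="alice"))
```

Posting the plan once and then repeatedly putting progress keeps the large, static plan separate from the small, frequently changing status.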
27. Lipstick Architecture – JS Client
• Displays and annotates graphs with status / progress
• Completely decoupled from Server
• Event based design
• Periodically polls Server for job progress
• Usability is a key focus
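The combination of periodic polling and an event-based design can be sketched as a loop that fetches progress and fires a callback only when something changes. The real client is written in JavaScript; this Python sketch, including the names `poll_progress`, `fetch`, and `on_update` and the `done` field, is hypothetical.

```python
import time

def poll_progress(fetch, on_update, interval_seconds=5.0, sleep=time.sleep):
    """Call fetch() until it reports completion, emitting an event per change."""
    last = None
    while True:
        progress = fetch()
        if progress != last:
            # Only notify listeners when the status actually changed.
            on_update(progress)
            last = progress
        if progress["done"]:
            return progress
        sleep(interval_seconds)

# Simulated server responses; no real network call or waiting is involved.
responses = iter([
    {"percent": 10, "done": False},
    {"percent": 10, "done": False},
    {"percent": 100, "done": True},
])
seen = []
final = poll_progress(lambda: next(responses), seen.append, sleep=lambda s: None)
print(seen, final)
```

Because the loop depends only on the `fetch` callable, the client stays decoupled from how the server is implemented.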
28. Solving Problems with Lipstick - Common Problem #1
My job has stalled.
36. Future of Lipstick
• Annotate common errors and inefficiencies on the graph
– Skew / map side join opportunities / scalar issues
– E.g. Warnings / error dashboard
• Provide better details of runtime performance
– Timings annotated on graph
– Min / median / max mapper and reducer times
– Map / reduce completion over time
• Search through execution history
– Examine trends in runtime and data volumes
– History of failure / success
• Search jobs for commonalities
– Common datasets loaded / saved
– Better grasp data lineage
– Common uses of UDFs and macros
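As an illustration of searching jobs for commonalities, the sketch below counts how many jobs load each dataset. Since this is listed as future work, nothing here reflects an existing Lipstick feature: the function `common_datasets`, the history layout, and the paths are invented for the example.

```python
from collections import Counter

def common_datasets(history, min_jobs=2):
    """Return (path, job count) pairs for datasets loaded by several jobs.

    history: list of dicts with 'job' and 'loads' (paths read by that job).
    """
    usage = Counter()
    for run in history:
        usage.update(set(run["loads"]))  # count each path once per job
    shared = [(path, n) for path, n in usage.items() if n >= min_jobs]
    # Most widely used first, ties broken by path for a stable result.
    return sorted(shared, key=lambda item: (-item[1], item[0]))

history = [  # hypothetical execution history
    {"job": "etl_a", "loads": ["/data/events", "/data/users"]},
    {"job": "etl_b", "loads": ["/data/events"]},
    {"job": "report", "loads": ["/data/events", "/data/users", "/data/titles"]},
]
print(common_datasets(history))
```

The same counting approach would extend to datasets saved, or to UDFs and macros used, given a history that records them.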
39. Wrapping up
• Lipstick is part of Netflix OSS.
• Clone it on GitHub at http://github.com/Netflix/Lipstick
• Check out the quickstart guide
– https://github.com/Netflix/Lipstick/wiki/Getting-Started#1-quick-start
– Get started playing with Lipstick in under 5 minutes!
• We happily welcome your feedback and contributions!