Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)


Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users. From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in-memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS. Clone it on GitHub, or learn more from our tech blog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.


Watching Pigs Fly with the
Netflix Hadoop Toolkit
Hadoop Summit 2013
San Jose, CA

Data should be accessible, easy to discover, and
easy to process for everyone.
Our Motivation

Hadoop Platform as a Service
Data Platform

Data Platform as a Service
Franklin
(Metadata API)
Sting
(Adhoc Visualization)
Forklift
(Data Movement)
Looper
(Backloading)
Ignite
(A/B Test Analytics)
Spock
(Data Auditing)
Genie
(Hadoop PaaS)
Lipstick
(Pig Workflow
Visualization)
Event Service
(Orchestration)
Hadoop
S3
Other Processing

But, what makes good recommendations?
Similarity
Personalization

We’re Sorry
COLORS!
Box art is colorful…

Hadoop Platform as a Service
S3 Cassandra Teradata Redshift RDS

Data Platform as a Service
Franklin
(Metadata API)
S3 Cassandra Teradata Redshift RDS

Data Platform as a Service
Franklin
(Metadata API)

Whether your dataset is large or small, being
able to visualize it makes it easier to explain.

Data Platform as a Service
Franklin
(Metadata API)
Sting
(Adhoc Visualization)

Sting
• Allows users to cache the results of a genie job
in memory
• Sub second response to OLAP style operations
(slicing, dicing, aggregations).
• Adhoc / recurring schedule
• Easy to use!

Hemlock
Grove
House of
Cards
Arrested
Development

# of subscribers X # of titles
= ???,000,…,000 (big data)
Big Data

Lipstick
• Allows users to visualize their data flow
• Allows users to see common errors
• Allows users to easily monitor their jobs
• Empowers users to support themselves
• Facilitates communication between
infrastructure team and users

Logical Operator
(reduce side)
Logical Operator
(map side)
Map/Reduce Job
Intermediate Row Count
Records
Loaded

Unoptimized/Optimized
Logical Plan Toggle
Dangling
Operator

I didn’t get the data I was expecting
Common Problem #2

I don’t understand why my job failed.
Common Problem #3

Failed Job
(light red background)
Successful Job
(light blue background)

Wrapping up
• Demos at the Netflix booth in the exhibit hall
(see more Lipstick, Sting, and Genie).
• Lipstick is part of Netflix OSS.
• Clone it on github at
http://github.com/Netflix/Lipstick
• We welcome feedback and contributions!

• Charles Smith: charsmith@netflix.com
• Jeff Magnusson: jmagnusson@netflix.com
Thank you!
Jobs: http://jobs.netflix.com
Netflix OSS: http://netflix.github.io
Tech Blog: http://techblog.netflix.com/


Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)

1. Watching Pigs Fly with the Netflix Hadoop Toolkit Hadoop Summit 2013 San Jose, CA

2. Data should be accessible, easy to discover, and easy to process for everyone. Our Motivation

3. Our Users Analysts Engineers

4. Hadoop Platform as a Service

5. Hadoop Platform as a Service S3

6. Hadoop Platform as a Service Data Platform

7. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization) Forklift (Data Movement) Looper (Backloading) Ignite (A/B Test Analytics) Spock (Data Auditing) Genie (Hadoop PaaS) Lipstick (Pig Workflow Visualization) Event Service (Orchestration) Hadoop S3 Other Processing

8. Let’s solve a problem using the data!

9. Build a recommender.

10. But, what makes good recommendations? Similarity Personalization

11. COLORS!

12. COLORS! Box art is colorful…

13. We’re Sorry COLORS! Box art is colorful…

14. Where can I find the data?

15. Hadoop Platform as a Service S3

16. Hadoop Platform as a Service S3 Cassandra Teradata Redshift RDS

17. Data Platform as a Service Franklin (Metadata API) S3 Cassandra Teradata Redshift RDS

18. Data Platform as a Service Franklin (Metadata API)

19. Create a dataset for box art and color.

20. Whether your dataset is large or small, being able to visualize it makes it easier to explain.

21. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization)

22. Sting • Allows users to cache the results of a genie job in memory • Sub second response to OLAP style operations (slicing, dicing, aggregations). • Adhoc / recurring schedule • Easy to use!

23. Hive Query Schema

24. % Content Consumed / Hour

25. Hemlock Grove House of Cards Arrested Development

26. Similarity

27.

28.

29. House of Cards Macbeth

30.

31.

32. Toddlers & Tiaras Star Trek: Voyager

33. Personalization

34. # of subscribers X # of titles = ???,000,…,000 (big data) Big Data

35. Netflix Apache Pig

36.

37. Data Platform as a Service Franklin (Metadata API) Sting (Adhoc Visualization)

38. Lipstick • Allows users to visualize their data flow • Allows users to see common errors • Allows users to easily monitor their jobs • Empowers users to support themselves • Facilitates communication between infrastructure team and users

39. Lipstick

40. Overall Job Progress

41. Logical Plan Overall Job Progress

42. Logical Operator (reduce side) Logical Operator (map side) Map/Reduce Job Intermediate Row Count Records Loaded

43. Hadoop Counters

44. My Job has stalled. Common Problem #1

45.

46. Unoptimized/Optimized Logical Plan Toggle Dangling Operator

47. I didn’t get the data I was expecting Common Problem #2

48.

49.

50. I don’t understand why my job failed. Common Problem #3

51. Failed Job (light red background) Successful Job (light blue background)

52.

53. Wrapping up • Demos at the Netflix booth in the exhibit hall (see more Lipstick, Sting, and Genie). • Lipstick is part of Netflix OSS. • Clone it on github at http://github.com/Netflix/Lipstick • We welcome feedback and contributions!

54. • Charles Smith: charsmith@netflix.com • Jeff Magnusson: jmagnusson@netflix.com Thank you! Jobs: http://jobs.netflix.com Netflix OSS: http://netflix.github.io Tech Blog: http://techblog.netflix.com/

Editor's notes

We want to talk today about parts of our big data architecture. …………. We would like to talk about what we are doing to make the data more accessible to the users of the platform.
Like a lot of other companies, we are experiencing an explosion of data. That is good, since we are a data-driven company, but if the volume of data makes it harder to find what is useful or harder to process, the value of our data decreases. Alternatively, if we decide to only consume data that was useful in the past, we won’t continue to find new ways to provide value to our customers. Our goal as a team is to make data available so that anyone at Netflix can use it for interesting new work.
We all know data is being created faster than ever before. For Netflix, besides the obvious things that grow over time, like what people are watching, what they are rating, and what they comment on, we have a whole range of additional data: interactions with our websites, interactions with devices, and things like social media. We have done a lot of interesting work with that data. Even so, the fact of the matter is that we aren’t quite sure what data is going to be useful in the future. So, since storage is cheap, we can err on the side of collecting more data than we may ever be able to utilize. A lot of work has been done on processing that data, but these tools are all relatively new and often require a lot of engineering knowledge to realize the full value of the platform.
So the problem is that we have a large volume of data and a large group of smart people who could use that data to help the company. But if they don’t know about or can’t find the data that is available, or if it is hard to process the data, then it will be a long time before we realize the value.
----- Meeting Notes (6/12/13 18:11) -----
But this isn't a problem that is specific to Pig. While we've spent a lot of time building systems that can process vast quantities of data, as with all new technologies they tend to be accessible at first only to a group of people in the know, most likely the engineers who built the system. We don't want to be gatekeepers of the data.
The way that we are going to get the most value out of our data is to have a broader audience. We've found that it's ubiquitous across all facets of the Hadoop user experience. While Hadoop has made it possible to process enormous quantities of data, tooling hasn't progressed to the point of making possible easy….
S3 is a big place
So we built a tool called Lipstick that piggybacks on top of our Pig scripts, allowing users to get a graphical view of their data flows and monitor their Pig scripts as they run.
Jeff and I fall solidly on the engineering side of the spectrum, and as such the technology that goes into our platform is always interesting. But at the end of the day our tools are only truly useful if they allow more effective use of our data. So we thought that, to talk about our architecture, it makes a lot more sense to approach the problem as a user who just wants to use the data.
Look, Netflix does a lot of things with our data to support the business. But at the end of the day we want to connect our customers with the movies and shows they love. So we thought, what better way to talk about Netflix’s data than to talk a little about building a recommendation system using pieces of our platform. So we are going to have something of a mini-Hack Day if you will.
----- Meeting Notes (6/17/13 20:59) -----
Connecting users with movies they love.
So very quickly let’s talk about how we will build the recommender. There are two types of recommendations that Netflix usually gives you. One is similarity. Similarity can be thought of as a measure of distance between two movies, where the closer two movies are, the more similar they are. The other is personalization. Personalization takes a lot of different forms and is often very complicated, but one way to think of personalization is as a distance between a person and movies, where the closer a movie is to a person, the more likely it is that he or she will like the movie. So what we want to do is come up with a vector space in which we can calculate distance between movies. And once we have done that we will try to project our customers into that space so we can measure distance between customers and movies.
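The two kinds of distance described here can be sketched in a few lines of Python. This is a minimal illustration, not the method used at Netflix: the vectors are made up, the function names are hypothetical, and placing a customer at the mean of the titles they watched is an assumption (the talk does not say how customers are projected into the space).

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(title, vectors):
    """Similarity: return the other title whose vector is closest to `title`."""
    target = vectors[title]
    others = (name for name in vectors if name != title)
    return min(others, key=lambda name: distance(target, vectors[name]))

def project_customer(watched, vectors):
    """Assumed projection: the mean of the vectors of the watched titles."""
    dims = len(next(iter(vectors.values())))
    return [sum(vectors[t][i] for t in watched) / len(watched) for i in range(dims)]

def recommend(watched, vectors):
    """Personalization: rank unwatched titles by distance from the customer."""
    position = project_customer(watched, vectors)
    unwatched = [name for name in vectors if name not in watched]
    return sorted(unwatched, key=lambda name: distance(position, vectors[name]))

# Made-up three-dimensional vectors, for illustration only.
vectors = {
    "A": [0.9, 0.1, 0.0],
    "B": [0.8, 0.2, 0.0],
    "C": [0.0, 0.1, 0.9],
    "D": [0.1, 0.0, 0.9],
}
print(most_similar("A", vectors))
print(recommend(["A"], vectors))
```

Both questions reduce to the same distance function; only the reference point changes, from a title to a customer.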
S3 is a big place
Abstraction between name of data and location. Location of datasets can change over time…
Abstraction between name of data and location. Location of datasets can change over time…
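The idea of separating a dataset's name from its location can be shown with a toy registry. This sketch is not Franklin's API, which the slides do not describe; the class, its methods, the dataset name, and the S3 paths are all hypothetical.

```python
class DatasetCatalog:
    """Toy registry mapping logical dataset names to physical locations."""

    def __init__(self):
        self._locations = {}

    def register(self, name, location):
        """Record (or update) where the named dataset currently lives."""
        self._locations[name] = location

    def resolve(self, name):
        """Look up the current location for a logical dataset name."""
        try:
            return self._locations[name]
        except KeyError:
            raise KeyError(f"unknown dataset: {name}") from None

catalog = DatasetCatalog()
catalog.register("title_box_art_colors", "s3://example-bucket/box_art/v1/")
# The data moves; consumers keep asking for the same name.
catalog.register("title_box_art_colors", "s3://example-bucket/box_art/v2/")
print(catalog.resolve("title_box_art_colors"))
```

Because consumers only ever hold the name, a dataset can be relocated without touching the scripts that read it.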
It turns out that we didn’t yet have a dataset in Franklin with the box art, but we did have lists of titles that I could use to make sense of the box art images. So I needed to create one. What I decided to do was convert that into a new dataset that I could use. To do that I downloaded box art for each title and converted it to websafe colors. I did this so that rather than having a hundred different pixels of slightly different colors of orange, I would have three. The space of 216 websafe colors is much easier to work in.
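The conversion to websafe colors can be sketched as follows. It assumes the image has already been decoded into (r, g, b) tuples with channels in the range 0 to 255, and it relies on the fact that the 216 websafe colors are the combinations of the six channel values 0, 51, 102, 153, 204, and 255. The function names and sample pixels are hypothetical, and this is not the code used for the talk.

```python
from collections import Counter

WEBSAFE_STEP = 51  # websafe channel values are 0, 51, 102, 153, 204, 255

def to_websafe(pixel):
    """Snap an (r, g, b) pixel with 0-255 channels to the nearest websafe color."""
    return tuple(
        ((c + WEBSAFE_STEP // 2) // WEBSAFE_STEP) * WEBSAFE_STEP for c in pixel
    )

def color_histogram(pixels):
    """Return the fraction of pixels that fall on each websafe color."""
    pixels = list(pixels)
    if not pixels:
        return {}
    counts = Counter(to_websafe(p) for p in pixels)
    return {color: n / len(pixels) for color, n in counts.items()}

# Four hypothetical pixels: three slightly different oranges and one near-black.
pixels = [(250, 130, 10), (255, 150, 0), (245, 140, 20), (5, 5, 5)]
print(color_histogram(pixels))
```

In this sample the three oranges collapse into a single bucket, so each title ends up as a short vector of color fractions rather than a long list of raw pixel values.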
After I created the dataset, what I really wanted to do was look at how different titles compare to each other. Now, I can do this myself and create some sample graphs, but what would be a lot more useful is if I could share the data with the other people working with me, and they could easily explore it so they can have an idea of what I am doing.
We found that it was a common need for our users to visualize our large datasets. So we created a lightweight visualization tool named Sting that makes it easy to explore and socialize the results of Hive queries around the organization.
----- Meeting Notes (6/17/13 19:58) -----
lightweight data viz framework
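The slides describe Sting as caching the results of a job in memory and answering OLAP-style operations (slicing, dicing, aggregations) against that cache. A rough sketch of the idea, with a hypothetical class and made-up rows, and no claim about how Sting itself is implemented:

```python
from collections import defaultdict

class CachedResult:
    """Rows of a finished query held in memory for repeated exploration."""

    def __init__(self, rows):
        self.rows = list(rows)

    def slice(self, **equals):
        """Keep only rows whose columns match every given value."""
        return CachedResult(
            row for row in self.rows
            if all(row[col] == val for col, val in equals.items())
        )

    def aggregate(self, group_by, measure):
        """Sum `measure` for each distinct value of the `group_by` column."""
        totals = defaultdict(int)
        for row in self.rows:
            totals[row[group_by]] += row[measure]
        return dict(totals)

# Made-up rows standing in for a query result.
result = CachedResult([
    {"title": "X", "color": "black", "pixels": 40},
    {"title": "X", "color": "red", "pixels": 10},
    {"title": "Y", "color": "black", "pixels": 25},
    {"title": "Y", "color": "orange", "pixels": 30},
])
print(result.aggregate("title", "pixels"))
print(result.slice(color="black").aggregate("title", "pixels"))
```

Once the rows are in memory, each new slice or aggregation is a pass over the cached data rather than a new cluster job.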
Insert more real screen shot here…
What we are looking at here is Sting filtered on three titles. Each bar is the stacked histogram of the title. So you can see that Hemlock Grove is about 40% black and then it has mostly gray and some shades of red. House of Cards is mostly black and gray with some blues and reds, and Arrested Development looks mostly orange. And after a bit of playing around and comparing colors, it seemed, though not perfect, that I could do a straight distance calculation in this vector space and get decent results.
So let’s look at how it worked out.
Here you can see House of Cards is a mix of blacks and greys, like I pointed out, and there is some red in there (blood on the hands, although you probably can’t see it).
And its closest title is already a winner. Visually we can see similar colors. And for those of you with knowledge of both titles, you probably think this is so good that I am cheating.
But looking at the titles in Sting we can see visually that what our system is telling us looks right. We would expect these titles to be close.
One of the more polarizing Star Treks, so it has a bunch of purple and various reds and blues and black.
At Netflix, we make heavy use of both Pig and Hive. Hive is typically used for adhoc analysis, while Pig is used in scheduled workflows.
The scripts can be very complicated, compiling to many map/reduce steps and performing complex data transformations along the way. We’ve been happy with our choice of Pig in that it provides an abstraction to easily express complicated map/reduce logic along with some facilities for code reuse (UDFs, macros). When workflows get sufficiently complicated, however, Pig is almost so abstract that it becomes hard to follow the data flow logic and imagine how it will translate to map/reduce.
So we built a tool called Lipstick that piggybacks on top of our Pig scripts, allowing users to get a graphical view of their data flows and monitor their Pig scripts as they run.
Some key features….