Using Scalding for Data-Driven
Product Development
Sasha Ovsankin
LinkedIn
http://linkedin.com/in/sashao
• Studied Mathematical Physics at
Moscow University
• Software Engineering background
• Work at LinkedIn on Email Experience
• Publish open source at
https://github.com/SashaOv
• Publish music at SoundCloud
/home
Scalding is a must-have tool in your Hadoop development arsenal.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples.
/linkedin/hadoop/overview
[Diagram labels: Online Apps; Databases; NoSQL Data Stores; Hadoop; HDFS; Hadoop Flows; Tracking/logging; Analytics; Data Products; Messaging; Message delivery]
/linkedin/hadoop/practices
• All online data end up in HDFS
– Mostly encoded in Avro
• Production Process
– CI/Automatic Build
• More info forthcoming
– Production Review
– Operations and Monitoring
• More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
• More info at http://lnkd.in/big-data-ecosystem
/linkedin/hadoop/dev-tools
• PIG
• Java MR
• Scalding
• Plus many others, not covered today
/hadoop/dev-tools/PIG
• Relatively mature tool
– first official release 2008
• Easy to learn
• Availability of experienced people
• Extendable via UDF
/hadoop/dev-tools/Java
• Java MR
– Maximum flexibility with Hadoop API
– Verbose
• Cascading
– Retain (some) Java flexibility
– Less verbose
/hadoop/dev-tools/Scalding
http://github.com/twitter/scalding
• Scala-based DSL
• Built on Cascading, a stable and mature framework
• Uses API similar to Scala collections:
import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )
}
• Succinct and powerful
• High level of abstraction
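To make the "similar to Scala collections" point concrete, here is a minimal sketch of the same word count written against plain in-memory Scala collections, with no Hadoop or Scalding involved. The names `WordCountLocal` and `wordCount` are made up for this illustration and are not part of Scalding.

```scala
// Local, in-memory analogue of the word count job, using only the Scala
// standard library. WordCountLocal and wordCount are hypothetical names.
object WordCountLocal {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+""").toSeq) // plays the role of .flatMap('line -> 'word)
      .filter(_.nonEmpty)                           // drop the empty token a leading space produces
      .groupBy(identity)                            // plays the role of .groupBy('word)
      .map { case (word, occurrences) => word -> occurrences.size } // plays the role of _.size
}
```

The shape of the pipeline (flatMap, then groupBy, then size) is the same in both versions; the Scalding job differs mainly in where the data comes from and where it is written.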
…/tools/comparison
                                  | PIG               | Java/Scala
Debugging: stack traces           | No*               | Yes
Code reuse                        | Macros, jobs      | Classes, packages, modules, frameworks…
Custom data structures/algorithms | UDF               | Native
Packaging                         | Fat jars          | Thin jars
Avro support                      | Partial           | Native
Unit testing                      | PigUnit (in Java) | Standard unit testing frameworks: JUnit/TestNG/MRUnit, Scalding tests

          | PIG    | Java MR | Scalding
LOC count | Small* | Large   | Small
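As an illustration of the "Custom data structures/algorithms: Native" row, the sketch below defines an ordinary Scala case class and uses it directly as the value being aggregated. Everything here (`EmailStats`, `Stats`, `statsByCampaign`, the campaign names) is hypothetical and runs on in-memory collections; it is not taken from the LinkedIn code base and only shows the style of code that needs a UDF in PIG but is plain code in Scala.

```scala
// Hypothetical example: aggregate email events with a user-defined type.
object EmailStats {
  // A custom value type with its own combine operation.
  final case class Stats(sent: Long, opened: Long) {
    def +(other: Stats): Stats = Stats(sent + other.sent, opened + other.opened)
    def openRate: Double = if (sent == 0) 0.0 else opened.toDouble / sent
  }

  // Each event is (campaign, wasOpened); the result has one Stats per campaign.
  def statsByCampaign(events: Seq[(String, Boolean)]): Map[String, Stats] =
    events
      .groupBy { case (campaign, _) => campaign }
      .map { case (campaign, rows) =>
        val perEvent = rows.map { case (_, opened) => Stats(1L, if (opened) 1L else 0L) }
        campaign -> perEvent.reduce((a, b) => a + b)
      }
}
```

Because `Stats` is a normal class, it can be covered by standard unit tests without a cluster.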
…/tools/buyers-guide
If you need…                                                                | Then use…
Quick-and-dirty simple scripts, existing UDFs                               | PIG, Hive
Complex flows, full access to Avro, debugging, unit testing, productization | Scalding
Full flexibility of Hadoop API but not too complex processing               | Java MR
/linkedin/email-experience
• Goal
– Improve messaging users’ experience
• Plan
– Track
– Experiment
– Optimize
– Personalize
• Implementation
– Generate messages offline
– Apply sophisticated relevance algorithms
– Shorten the release cycle to facilitate fast iteration
/linkedin/email-experience/overview
[Diagram labels: Content sources (PIG); HDFS; Content sources (Scalding); Content sources (Crunch); Targeting, Relevance (Scalding, Java); Email/Message production (Java MR); Framework (Java); Online Delivery System]
…/email-experience/why-scalding
• Scala + Map Reduce = match made in heaven
scala> (1 to 1000) map { x => x * x } reduce { _ + _ }
res20: Int = 333833500
• Stack traces (yeah!)
• Native Avro support
• Integrates well with CI/build system
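The "match made in heaven" claim can be sketched with a toy, in-memory version of the three MapReduce phases built from Scala collection methods. `LocalMapReduce` and `mapReduce` are hypothetical names for this illustration; a real Hadoop job distributes these phases across machines, which this sketch does not attempt.

```scala
// Toy single-process model of map -> shuffle -> reduce.
// LocalMapReduce and mapReduce are hypothetical names, not a Hadoop or Scalding API.
object LocalMapReduce {
  def mapReduce[A, K, V](input: Seq[A])(mapper: A => Seq[(K, V)])(reducer: (V, V) => V): Map[K, V] =
    input
      .flatMap(mapper)                  // map phase: each record emits (key, value) pairs
      .groupBy { case (key, _) => key } // shuffle: bring all values for a key together
      .map { case (key, pairs) =>       // reduce phase: fold the values for each key
        key -> pairs.map(_._2).reduce(reducer)
      }
}
```

For example, `LocalMapReduce.mapReduce[Int, String, Int](1 to 1000)(n => Seq("total" -> n * n))(_ + _)` yields `Map("total" -> 333833500)`, the same total as the REPL result above.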
…/email-experience/code
…/email-experience/code/2
/linkedin/…/scalding/status
• Started >1 year ago
• Thousands of production LOC written in Scalding by
our team
– Pretty happy with readability and maintainability
• ~10 flows are currently in production, and counting
• Currently ~12 people are coding in Scalding
• Created Scalding user group
• Growing interest
• Learning:
– Scala[Scalding] < Scala[ _ ]
/linkedin/…/scalding/users
• Data science
• Enterprise services
• Email experience
• Content
/linkedin/…/scalding/what-to-improve
• Better Scala language IDE tools
• One-click development (-> demo)
• Monitoring and troubleshooting
– Counters – implemented in 0.9
– Better troubleshooting of the ser/de process
• Better tools for tuning of jobs
– setting the number of mappers and reducers
• Best practices
/home
Scalding is a must-have tool in your Hadoop development arsenal.
– Hadoop ecosystem at LinkedIn
– Hadoop development tools
– Scalding: why and how
– What we do with Scalding, code examples.
/linkedin/join-us
• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help
them become more productive and successful
• We are looking for amazing people interested in
Data Science and Software Engineering
Questions?

How LinkedIn Uses Scalding for Data Driven Product Development