How AdMobius uses Cascading in
AdTech Stack
Jyotirmoy Sundi
Sr Data Engineer at Lotame
(AdMobius was acquired by Lotame in March 2014)
What does AdMobius do

AdMobius is a Mobile Audience Management
Platform (MAMP). It helps advertisers identify
mobile audiences by demographics and interests
through standard, custom, and private segments
and reach them at scale.
Target effectively across all platforms and multiple devices
Laptop
Mobile
iPod
iPad
Wearables
Topics

Device graph building and scoring device links

Cascading Taps for Hive, MySQL, HBase

Modularized Testing

Optimal Config Setups

Running in YARN

Conclusion
AdMobius Stack
Cascading | Hive | HBase | Giraph
Hadoop | (Experimental Spark)
Rackspace
YARN | MR1
Custom Workflows

Why Cascading
− Easy custom aggregators.
• In the existing MR framework it was very difficult
to write a series of complex aggregation steps and
run them at scale before making sure of their
correctness. You can do that in Hive with UDFs or
UDAFs, but we found it much easier in Cascading.
− Easy for Java developers to understand,
• visualize, and write complicated workflows through
the concepts of pipes, taps, and tuples.
Workflow for audience profile scoring
Driven
https://driven.cascading.io/index.html#/apps/D818DD
Audience Profiling

Cascading is used to do
− complex aggregations
− create the device multi-dimensional vectors
− device pair scoring based on the vectors
− rule engine based filters

Size
− Total number of mobile devices ~ 2.7B
− ~500M devices in Giraph computation.
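A minimal plain-Java sketch of the "score device pairs from their vectors, then filter by rule" step listed above. It does not use Cascading. The class name DevicePairScoring, the use of cosine similarity, and the single cutoff rule are all assumptions made for illustration; the slides do not say which scoring function or rules AdMobius used.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: cosine similarity stands in for the real scoring function,
// which is not described on the slides.
final class DevicePairScoring {

    /** Cosine similarity between two equal-length feature vectors. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an all-zero vector matches nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Scores every unordered pair (i, j) with i < j and keeps pairs at or above the cutoff. */
    static List<String> scorePairs(String[] ids, double[][] vectors, double cutoff) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            for (int j = i + 1; j < ids.length; j++) {
                double score = cosine(vectors[i], vectors[j]);
                if (score >= cutoff) { // the "rule" here is a single threshold
                    kept.add(ids[i] + "\t" + ids[j]);
                }
            }
        }
        return kept;
    }
}
```

Scoring all pairs is quadratic in the group size, so a sketch like this only makes sense per group of candidate devices, not across the whole device population.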
Example: Parallel aggregation of values across multiple fields.
Aggregations

No need to know group modes like in UDAF

Buffer

use for more complex grouping
operations

output multiple tuples per group

Aggregator (simple aggregations, prebuilt
aggregators like SumBy, CountBy)
// Graph, Node and scoring(...) are not shown on the slide.
public class MinGraphScoring extends BaseOperation implements Buffer {
    @Override
    public void operate(FlowProcess flowProcess, BufferCall bufferCall) {
        Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
        Graph g = new Graph();
        while (arguments.hasNext()) {
            TupleEntry tpe = arguments.next();
            // use Kryo serialization
            ByteBuffer b = ByteBuffer.wrap((byte[]) tpe.getObject("field1"));
            g.put(b);
        }
        Node[] nodes = g.nodes;
        // For each pair of nodes i, j: emit one scored tuple
        for (int i = 0; i < nodes.length; i++) {
            for (int j = i + 1; j < nodes.length; j++) {
                double minmaxscore = scoring(g, i, j);
                Tuple t1 = new Tuple(nodes[i].id, nodes[j].id, minmaxscore);
                bufferCall.getOutputCollector().add(t1);
            }
        }
    }
}
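// IDList is the aggregation context (a nested class not shown on the slide).
public class PotentialMatchAggregator extends
        BaseOperation<PotentialMatchAggregator.IDList> implements
        Aggregator<PotentialMatchAggregator.IDList> {
    @Override
    public void start(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        // Called once per group: create a fresh context
        IDList idList = new IDList();
        aggregatorCall.setContext(idList);
    }
    @Override
    public void aggregate(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        // Called once per tuple in the group
        TupleEntry arguments = aggregatorCall.getArguments();
        IDList idList = aggregatorCall.getContext();
        // amid and match are read from arguments (extraction not shown on the slide)
        idList.updateDev(amid, match);
    }
    @Override
    public void complete(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        // Called once after the last tuple of the group
        IDList idList = aggregatorCall.getContext();
        …...
    }
}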
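The start / aggregate / complete lifecycle shown above can be imitated in plain Java to see what each callback is responsible for. This is only a sketch: GroupAggregator, IdListAggregator and run are hypothetical names, not Cascading types, and the grouping is done with an in-memory map instead of a sorted reduce phase.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of an aggregator lifecycle; none of these types come from Cascading.
final class AggregatorLifecycle {

    interface GroupAggregator<C, R> {
        C start();                               // create a fresh context for a group
        void aggregate(C context, String value); // fold one value into the context
        R complete(C context);                   // turn the context into the group's result
    }

    /** Collects the values of a group into a comma-separated ID list. */
    static final class IdListAggregator implements GroupAggregator<List<String>, String> {
        public List<String> start() { return new ArrayList<>(); }
        public void aggregate(List<String> context, String value) { context.add(value); }
        public String complete(List<String> context) { return String.join(",", context); }
    }

    /** Groups rows of {key, value} by key and drives the aggregator once per group. */
    static <C, R> Map<String, R> run(List<String[]> rows, GroupAggregator<C, R> agg) {
        Map<String, C> contexts = new LinkedHashMap<>();
        for (String[] row : rows) {
            C context = contexts.computeIfAbsent(row[0], k -> agg.start());
            agg.aggregate(context, row[1]);
        }
        Map<String, R> results = new LinkedHashMap<>();
        for (Map.Entry<String, C> e : contexts.entrySet()) {
            results.put(e.getKey(), agg.complete(e.getValue()));
        }
        return results;
    }
}
```

Unlike this sketch, a reduce-side aggregator sees one group at a time, so only one context needs to be alive at any moment.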
Joins

CoGroup:

use when neither pipe fits into memory

HashJoin:

use when one of the pipes fits into memory
Pipe jointermsPipe = new HashJoin(
    termsPipe, new Fields("term_token"),
    dictionary, new Fields("word"),
    new Fields("app", "term_token", "score", "d_count", "index", "word"),
    new InnerJoin());

CustomJoins and BloomJoin
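To make the memory trade-off concrete, here is a plain-Java sketch of the idea behind a hash join: the small side is loaded into a hash map and the large side is streamed past it. It is not the Cascading implementation. HashJoinSketch and the row layouts ({app, term_token} and {word, index}) are assumptions that loosely follow the field names in the HashJoin example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory hash join; only the dictionary side is held in memory.
final class HashJoinSketch {

    /**
     * Inner join of terms rows {app, term_token} with dictionary rows {word, index}
     * on term_token == word. Returns tab-separated app, term_token, index.
     */
    static List<String> innerJoin(List<String[]> terms, List<String[]> dictionary) {
        // Build phase: index the small side by its join key.
        Map<String, List<String[]>> byWord = new HashMap<>();
        for (String[] d : dictionary) {
            byWord.computeIfAbsent(d[0], k -> new ArrayList<>()).add(d);
        }
        // Probe phase: stream the large side and look up matches.
        List<String> joined = new ArrayList<>();
        for (String[] t : terms) {
            List<String[]> matches = byWord.get(t[1]);
            if (matches == null) {
                continue; // inner join: unmatched rows are dropped
            }
            for (String[] d : matches) {
                joined.add(t[0] + "\t" + t[1] + "\t" + d[1]);
            }
        }
        return joined;
    }
}
```

When neither side fits in memory the build phase is not possible, which is the case the slide assigns to CoGroup.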
Custom Src/Sink Taps

Cascading has good support for reading from and writing to different kinds of
data sources. Slight tuning or changes might be required, but most of the
code already exists.
− Hive (with different file formats), HBase, MySQL
− http://www.cascading.org/extensions/
− Set proper config parameters while reading from a source tap,
for example while reading from an HBase tap:
String tableName = "device_ids";
String[] familyNames = new String[] { "id:type1", "id:type2",
"id:type3",..."id:typen" };
Scan scan = new Scan();
scan.setCacheBlocks(false);
scan.setCaching(10000);
scan.setBatch(10000);
Hive Src Taps
ExampleWorkflow.java
Tap dmTap = new HiveTableTap(HiveTableTap.SchemeType.SEQUENCE_FILE, admoFPbase, admoFPBasePartitions, dmFullFilter);
HiveTableTap.java
public class HiveTableTap extends GlobHfs {
    static Scheme getScheme(SchemeType st) {
        if (st.equals(SchemeType.SEQUENCE_FILE))
            return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
        else if (st.equals(SchemeType.TEXT_TSV))
            return new TextDelimited();
        else
            return null;
    }
    …..
}
Hive Sink Taps
ExampleWorkflow.java
Tap srcDstIdsSinkTap = new Hfs(
    new AdmobiusWritableSequenceFile(new Fields("value"), (Class<? extends Writable>) Text.class),
    "/tmp/srcDstIdsSinkTap", SinkMode.REPLACE);
HiveTableTap.java
public class HiveTableTap extends GlobHfs {
    static Scheme getScheme(SchemeType st) {
        if (st.equals(SchemeType.SEQUENCE_FILE))
            return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
        else if (st.equals(SchemeType.TEXT_TSV))
            return new TextDelimited();
        else
            return null;
    }
    …..
}
conf.setOutputFormat( SequenceFileOutputFormat.class );
valueValue = (Writable) (new Text(tupleEntry.getObject( 0 ).toString().getBytes()));
Hive table
CREATE TABLE CASCADING_HIVE_INTER
(
admo_id string,
segments string
)
PARTITIONED BY ( batch_id STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
Good Practices

Use Checkpointing optimally

Use subassemblies instead of rewriting logic.
For further control pass additional parameters
to subassemblies.

Use Compression and SequenceFile() in sink
taps to chain multiple cascading workflows.

Use Failure Traps to filter faulty records.

Avoid workflows that are too small or too long.
Chain them in Oozie or a similar workflow
management engine.
− Example: workflows with 10-20 MR jobs are good
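The failure-trap practice above amounts to diverting records that make an operation throw, instead of failing the whole run. The plain-Java sketch below shows that idea only; it does not use Cascading's trap mechanism. FailureTrapSketch, Result and parseScores are hypothetical names, and parsing a numeric score is an assumed example of an operation that can fail on faulty input.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: faulty records are set aside rather than aborting the job.
final class FailureTrapSketch {

    /** Holds successfully parsed values and the raw records that were trapped. */
    static final class Result {
        final List<Double> good = new ArrayList<>();
        final List<String> trapped = new ArrayList<>();
    }

    /** Parses each record as a score; records that fail to parse go to the trap. */
    static Result parseScores(List<String> records) {
        Result result = new Result();
        for (String record : records) {
            try {
                result.good.add(Double.parseDouble(record.trim()));
            } catch (NumberFormatException e) {
                result.trapped.add(record); // keep the raw record for later inspection
            }
        }
        return result;
    }
}
```

Keeping the raw trapped records makes it possible to inspect or reprocess them after the main workflow has finished.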
Some Properties for Optimal Performance
Problems with improper configuration
1. Compression parameters : Without them, jobs run slowly and
may sometimes take double the time. Set the correct
compression type based on the cluster configuration.
2. mapred.reduce.tasks : It must be set manually
depending on the size of your job. Keeping it too low
slows down the reduce phase.
3. Small file issue : The input splits read by mappers
can be too small, eventually bringing up more mappers
than required.
4. Any custom configuration parameters : Set them
here and use getProperty to access them anywhere in the
data workflow.
properties.setProperty("min_cutoff_score", "0.7");
FlowConnector flowConnector = new HadoopFlowConnector(properties);
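A small sketch of the set-then-read pattern for custom parameters, using only java.util.Properties. WorkflowConfig, build and minCutoffScore are hypothetical names, the reducer count of 200 is an arbitrary illustrative value, and the fallback of 0.0 is an assumption. How the properties reach a running operation inside a flow is not shown here.

```java
import java.util.Properties;

// Illustrative only: shows writing and reading a custom parameter as a string property.
final class WorkflowConfig {

    /** Builds the properties that would be handed to the flow connector. */
    static Properties build() {
        Properties properties = new Properties();
        properties.setProperty("min_cutoff_score", "0.7");
        properties.setProperty("mapred.reduce.tasks", "200"); // 200 is an arbitrary example value
        return properties;
    }

    /** Reads the custom cutoff, falling back to 0.0 (an assumed default) when unset. */
    static double minCutoffScore(Properties properties) {
        return Double.parseDouble(properties.getProperty("min_cutoff_score", "0.0"));
    }
}
```

Property values are strings, so numeric parameters need to be parsed where they are read.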
Running in YARN

YARN deployment is smooth with Cascading 2.5
− Make sure the config properties are set as per
YARN, as they are different from MR1.
− While running in workflow engines like Oozie,
make sure
• mapred.job.classpath.files and mapred.cache.files
are set with all dependency files in colon-separated
format
Cascading DSLs in other languages
Scalding (Scala)
PyCascading (Python)
cascading.jruby (JRuby)
Cascalog (Clojure)

Thank you for your time

Q & A

Cascading talk at Etsy (http://www.meetup.com/cascading/events/169390262/)

  • 1. How AdMobius uses Cascading in AdTech Stack. Jyotirmoy Sundi, Sr. Data Engineer at Lotame (AdMobius was acquired by Lotame in March 2014)
  • 2. What does AdMobius do: AdMobius is a Mobile Audience Management Platform (MAMP). It helps advertisers identify mobile audiences by demographics and interests through standard, custom, and private segments, and reach them at scale.
  • 3. Target effectively across all platforms on multiple devices: laptop, mobile, iPod, iPad, wearables
  • 4. Topics: device graph building and scoring device links; Cascading Taps for Hive, MySQL, HBase; modularized testing; optimal config setups; running in YARN; conclusion
  • 5. AdMobius Stack: Cascading | Hive | HBase | Giraph; Hadoop | (Experimental Spark); Rackspace; YARN | MR1; Custom Workflows
  • 6. Why Cascading − Easy custom aggregators. • In the existing MR framework it was very difficult to write a series of complex aggregations and run them at scale before verifying their correctness. You can do that in Hive with UDFs or UDAFs, but we found it much easier in Cascading. − Easy for Java developers to understand. • Visualize and write complicated workflows through the concepts of pipes, taps, and tuples.
  • 7. Workflow for audience profile scoring
  • 9.
  • 10. Audience Profiling. Cascading is used for: − complex aggregations − creating the device multi-dimensional vectors − device pair scoring based on the vectors − rule-engine-based filters. Size: − total number of mobile devices ~2.7B − ~500M devices in Giraph computation.
  • 11. Example: Parallel aggregation of values across multiple fields.
  • 12. Aggregations: no need to know group modes as in a UDAF. Buffer: use for more complex grouping operations; can output multiple tuples per group. Aggregator: simple aggregations (prebuilt aggregators like SumBy, CountBy).
  • 13. public class MinGraphScoring extends BaseOperation implements Buffer {
      @Override
      public void operate(FlowProcess flowProcess, BufferCall bufferCall) {
        Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
        Graph g = new Graph();
        // Collect every tuple of the current group into one graph.
        while (arguments.hasNext()) {
          TupleEntry tpe = arguments.next();
          // use Kryo serialization
          ByteBuffer b = ByteBuffer.wrap((byte[]) tpe.getObject("field1"));
          g.put(b);
        }
        Node[] nodes = g.nodes;
        // For each pair of nodes i, j (written here as unordered pairs, i < j; the slide only says "each pair"),
        // emit one scored tuple, so a single group can produce many output tuples.
        for (int i = 0; i < nodes.length; i++) {
          for (int j = i + 1; j < nodes.length; j++) {
            double minmaxscore = scoring(g, i, j);
            Tuple t1 = new Tuple(nodes[i].id, nodes[j].id, minmaxscore);
            bufferCall.getOutputCollector().add(t1);
          }
        }
      }
    }
  • 14. public class PotentialMatchAggregator extends BaseOperation<PotentialMatchAggregator.IDList>
        implements Aggregator<PotentialMatchAggregator.IDList> {
      // Called once per group: create the per-group context.
      public void start(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        IDList idList = new IDList();
        aggregatorCall.setContext(idList);
      }
      // Called once per tuple in the group: fold the tuple into the context.
      public void aggregate(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        TupleEntry arguments = aggregatorCall.getArguments();
        IDList idList = aggregatorCall.getContext();
        idList.updateDev(amid, match); // amid and match are not defined on the slide; presumably read from arguments
      }
      // Called once at the end of the group: emit the result.
      public void complete(FlowProcess flowProcess, AggregatorCall<IDList> aggregatorCall) {
        IDList idList = aggregatorCall.getContext();
        …
      }
    }
  • 15. Joins. CoGroup: use when the two pipes can't fit into memory. HashJoin: use when one of the pipes fits into memory:
      Pipe jointermsPipe = new HashJoin(
          termsPipe, new Fields("term_token"),
          dictionary, new Fields("word"),
          new Fields("app", "term_token", "score", "d_count", "index", "word"),
          new InnerJoin());
    Also available: custom joins and BloomJoin.
  • 16. Custom Src/Sink Taps. Cascading has good support for reading from and writing to different kinds of data sources. Slight tuning or changes might be required, but most of the code already exists. − Hive (with different file formats), HBase, MySQL − http://www.cascading.org/extensions/ − Set proper config parameters when reading from a source tap. For example, when reading from an HBase tap:
      String tableName = "device_ids";
      String[] familyNames = new String[] { "id:type1", "id:type2", "id:type3", /* ... */ "id:typen" };
      Scan scan = new Scan();
      scan.setCacheBlocks(false);
      scan.setCaching(10000);
      scan.setBatch(10000);
  • 17. Hive Src Taps. ExampleWorkflow.java:
      Tap dmTap = new HiveTableTap(HiveTableTap.SchemeType.SEQUENCE_FILE, admoFPbase, admoFPBasePartitions, dmFullFilter);
    HiveTableTap.java:
      public class HiveTableTap extends GlobHfs {
        // Pick the Cascading scheme that matches the Hive table's storage format.
        static Scheme getScheme(SchemeType st) {
          if (st.equals(SchemeType.SEQUENCE_FILE))
            return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
          else if (st.equals(SchemeType.TEXT_TSV))
            return new TextDelimited();
          else
            return null;
        }
        …
      }
  • 18. Hive Sink Taps. ExampleWorkflow.java:
      Tap srcDstIdsSinkTap = new Hfs(
          new AdmobiusWritableSequenceFile(new Fields("value"), (Class<? extends Writable>) Text.class),
          "/tmp/srcDstIdsSinkTap", SinkMode.REPLACE);
    HiveTableTap.java:
      public class HiveTableTap extends GlobHfs {
        // Pick the Cascading scheme that matches the Hive table's storage format.
        static Scheme getScheme(SchemeType st) {
          if (st.equals(SchemeType.SEQUENCE_FILE))
            return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
          else if (st.equals(SchemeType.TEXT_TSV))
            return new TextDelimited();
          else
            return null;
        }
        …
      }
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      valueValue = (Writable) (new Text(tupleEntry.getObject(0).toString().getBytes()));
  • 19. Hive table:
      CREATE TABLE CASCADING_HIVE_INTER (
        admo_id STRING,
        segments STRING
      )
      PARTITIONED BY (batch_id STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS SEQUENCEFILE;
  • 20. Good Practices. Use checkpointing optimally. Use subassemblies instead of rewriting logic; for further control, pass additional parameters to the subassemblies. Use compression and SequenceFile() in sink taps to chain multiple Cascading workflows. Use failure traps to filter faulty records. Avoid creating workflows that are too small or too long; chain them in Oozie or a similar workflow management engine. − Example: workflows with 10-20 MR jobs are good.
  • 21. Some Properties for Optimal Performance
  • 22. Problems with improper configuration. 1. Compression parameters: if they are not set, jobs run slowly and can sometimes take double the time. Set the correct compression type based on the cluster configuration. 2. mapred.reduce.tasks: it must be set manually depending on the size of your job. Keeping it too low slows down the reducers. 3. Small-file issue: the input splits read by mappers can be too small, which brings up more mappers than required. 4. Any custom configuration parameters: set them here and use getProperty to access them anywhere in the data workflow:
      properties.setProperty("min_cutoff_score", "0.7");
      FlowConnector flowConnector = new HadoopFlowConnector(properties);
  • 23. Running in YARN. YARN deployment is smooth with Cascading 2.5. − Make sure the config properties are set for YARN, as they differ from MR1. − When running in workflow engines like Oozie, make sure that • mapred.job.classpath.files and mapred.cache.files are set with all dependency files in colon-separated format.
  • 24. Cascading DSLs in other languages: Scalding (Scala), PyCascading (Python), cascading.jruby (JRuby), Cascalog (Clojure)
  • 25. Thank you for your time. Q & A
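
The HashJoin rule of thumb on slide 15 (use it when one pipe fits into memory) is easier to see with the join written out by hand. The following is a minimal plain-Java sketch of an inner hash join, not Cascading's implementation. The class name `HashJoinSketch`, the method `innerJoin`, the column layout, and the sample rows are all hypothetical; only the field names `term_token` and `word` echo the slide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoinSketch {

    /**
     * Inner-joins term rows {app, term_token} with dictionary rows {word, index}
     * on term_token == word. Only the dictionary side is held in memory.
     */
    public static List<String[]> innerJoin(List<String[]> terms, List<String[]> dictionary) {
        // Build phase: index the small side by its join key (column 0 = word).
        Map<String, List<String[]>> byWord = new HashMap<>();
        for (String[] d : dictionary) {
            byWord.computeIfAbsent(d[0], k -> new ArrayList<>()).add(d);
        }

        // Probe phase: stream the large side and look up each join key (column 1 = term_token).
        List<String[]> joined = new ArrayList<>();
        for (String[] t : terms) {
            List<String[]> matches = byWord.get(t[1]);
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (String[] d : matches) {
                joined.add(new String[] { t[0], t[1], d[0], d[1] });
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, for illustration only.
        List<String[]> terms = Arrays.asList(
            new String[] { "app1", "game" },
            new String[] { "app2", "news" },
            new String[] { "app3", "chess" });
        List<String[]> dictionary = Arrays.asList(
            new String[] { "game", "0" },
            new String[] { "news", "1" });

        for (String[] row : innerJoin(terms, dictionary)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

The map built in the first loop is the part that has to fit in memory, which is why the slide reserves HashJoin for the case where one side is small. When neither side fits, a grouping-based join such as CoGroup is the option the slide points to.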
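
Point 4 on slide 22 (set custom parameters on the properties object and read them back with getProperty) can be illustrated with `java.util.Properties` alone. This is a rough sketch under stated assumptions rather than the talk's code: the class `CutoffConfigSketch`, its methods, the `0.0` default, and the inclusive `>=` comparison are hypothetical. Only the key `min_cutoff_score` and the value `0.7` come from the slide.

```java
import java.util.Properties;

public class CutoffConfigSketch {

    // Key taken from the slide; everything else in this class is illustrative.
    static final String MIN_CUTOFF_KEY = "min_cutoff_score";

    /** Reads the cutoff from the properties, falling back to 0.0 (an assumed default). */
    public static double minCutoff(Properties properties) {
        return Double.parseDouble(properties.getProperty(MIN_CUTOFF_KEY, "0.0"));
    }

    /** Returns true when a score meets the configured cutoff (inclusive, by assumption). */
    public static boolean passes(double score, Properties properties) {
        return score >= minCutoff(properties);
    }

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.setProperty(MIN_CUTOFF_KEY, "0.7"); // same call as on the slide

        System.out.println(passes(0.82, properties));
        System.out.println(passes(0.41, properties));
    }
}
```

Keeping the key in one constant and parsing it in one place means a mistyped key or a malformed value fails in a single, easy-to-find spot instead of somewhere deep in a workflow step.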