Cascading talk at Etsy (http://www.meetup.com/cascading/events/169390262/)
1. How AdMobius Uses Cascading in an AdTech Stack
Jyotirmoy Sundi
Sr. Data Engineer at Lotame
(AdMobius was acquired by Lotame in March 2014)
2. What does AdMobius do
AdMobius is a Mobile Audience Management
Platform (MAMP). It helps advertisers identify
mobile audiences by demographics and interests
through standard, custom, and private segments,
and reach them at scale.
6. Why Cascading
− Easy custom aggregators.
• In the existing MR framework it was very difficult
to write a series of complex aggregation logic and
run it at scale while being sure of its
correctness. You can do that in Hive with UDFs or
UDAFs, but we found it much easier in Cascading.
− Easy for Java developers to understand.
• Visualize and write complicated workflows through
the concepts of pipes, taps, and tuples.
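The pipes/taps/tuples model above can be sketched as a minimal word-count flow. This is a hedged illustration, not code from the talk: the file paths and field names are made up, and it assumes the Cascading 2.x Hadoop-mode API.

```java
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountFlow {
    public static void main(String[] args) {
        // Taps bind the flow to physical data (paths here are hypothetical)
        Tap source = new Hfs(new TextDelimited(new Fields("word"), "\t"), "/tmp/words.tsv");
        Tap sink = new Hfs(new TextDelimited(), "/tmp/word_counts", SinkMode.REPLACE);

        // Pipes describe the logical transformation over tuples
        Pipe pipe = new Pipe("wordcount");
        pipe = new GroupBy(pipe, new Fields("word"));
        pipe = new Every(pipe, new Count(new Fields("count")));

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(pipe, source)
            .addTailSink(pipe, sink);

        new HadoopFlowConnector().connect(flowDef).complete();
    }
}
```

The tap is the only part that knows about storage; swapping HDFS for HBase or Hive changes the tap, not the pipe assembly.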
10. Audience Profiling
Cascading is used to:
− perform complex aggregations
− create the multi-dimensional device vectors
− score device pairs based on those vectors
− apply rule-engine-based filters
Size
− Total number of mobile devices: ~2.7B
− ~500M devices in the Giraph computation.
12. Aggregations
No need to know group modes as in a UDAF.
Buffer
− used for more complex grouping operations
− can output multiple tuples per group
Aggregator
− simple aggregations, with prebuilt aggregators
like SumBy and CountBy
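The prebuilt aggregators mentioned above can be sketched as follows. This is an illustrative fragment, not code from the talk: the pipe and field names are hypothetical, and it assumes the Cascading 2.x `AggregateBy` subassemblies.

```java
import cascading.pipe.Pipe;
import cascading.pipe.assembly.CountBy;
import cascading.pipe.assembly.SumBy;
import cascading.tuple.Fields;

// Both subassemblies do map-side partial aggregation before the final
// reduce, so they are cheaper than a GroupBy + Every for simple cases.
Pipe events = new Pipe("events");

// count tuples per device
Pipe counts = new CountBy(events, new Fields("device_id"),
                          new Fields("event_count"));

// sum a numeric field per device
Pipe totals = new SumBy(events, new Fields("device_id"), new Fields("score"),
                        new Fields("total_score"), double.class);
```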
13.
public class MinGraphScoring extends BaseOperation implements Buffer {
  @Override
  public void operate(FlowProcess flowProcess, BufferCall bufferCall) {
    Iterator<TupleEntry> arguments = bufferCall.getArgumentsIterator();
    Graph g = new Graph();
    while (arguments.hasNext()) {
      TupleEntry tpe = arguments.next();
      // deserialized with Kryo serialization
      ByteBuffer b = ByteBuffer.wrap((byte[]) tpe.getObject("field1"));
      g.put(b);
    }
    Node[] nodes = g.nodes;
    // for each pair of nodes (i, j)
    for (int i = 0; i < nodes.length; i++) {
      for (int j = i + 1; j < nodes.length; j++) {
        double minmaxscore = scoring(g, i, j);
        Tuple t1 = new Tuple(nodes[i].id, nodes[j].id, minmaxscore);
        bufferCall.getOutputCollector().add(t1);
      }
    }
  }
}
15. Joins
CoGroup:
− when the two pipes can't fit into memory
HashJoin:
− when one of the pipes fits into memory
Pipe jointermsPipe = new HashJoin(
    termsPipe, new Fields("term_token"),
    dictionary, new Fields("word"),
    new Fields("app", "term_token", "score", "d_count", "index", "word"),
    new InnerJoin());
Custom joins and BloomJoin
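For comparison, the same join expressed as a CoGroup, a sketch reusing the pipe and field names from the HashJoin example above:

```java
import cascading.pipe.CoGroup;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.InnerJoin;
import cascading.tuple.Fields;

// CoGroup groups and spills to disk, so neither side needs to fit in
// memory; the declared fields must be unique across both input pipes.
Pipe cogrouped = new CoGroup(
    termsPipe, new Fields("term_token"),
    dictionary, new Fields("word"),
    new Fields("app", "term_token", "score", "d_count", "index", "word"),
    new InnerJoin());
```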
16. Custom Src/Sink Taps
Cascading has good support for reading from and writing to different
kinds of data sources. Slight tuning or changes might be required, but
most of the code already exists.
− Hive (with different file formats), HBase, MySQL
− http://www.cascading.org/extensions/
− Set the proper config parameters while reading from a source tap.
For example, while reading from an HBase tap:
String tableName = "device_ids";
String[] familyNames = new String[] { "id:type1", "id:type2",
"id:type3",..."id:typen" };
Scan scan = new Scan();
scan.setCacheBlocks(false);
scan.setCaching(10000);
scan.setBatch(10000);
17. Hive Src Taps
ExampleWorkflow.java
Tap dmTap = new HiveTableTap(HiveTableTap.SchemeType.SEQUENCE_FILE, admoFPbase, admoFPBasePartitions, dmFullFilter);
HiveTableTap.java
public class HiveTableTap extends GlobHfs {
  static Scheme getScheme(SchemeType st) {
    if (st.equals(SchemeType.SEQUENCE_FILE))
      return new AdmobiusWritableSequenceFile(new Fields("value"), BytesWritable.class);
    else if (st.equals(SchemeType.TEXT_TSV))
      return new TextDelimited();
    else
      return null;
  }
  …..
}
18. Hive Sink Taps
ExampleWorkflow.java
Tap srcDstIdsSinkTap = new Hfs(
    new AdmobiusWritableSequenceFile(new Fields("value"),
        (Class<? extends Writable>) Text.class),
    "/tmp/srcDstIdsSinkTap", SinkMode.REPLACE);
conf.setOutputFormat(SequenceFileOutputFormat.class);
valueValue = (Writable) new Text(tupleEntry.getObject(0).toString().getBytes());
19. Hive table
CREATE TABLE CASCADING_HIVE_INTER
(
  admo_id string,
  segments string
)
PARTITIONED BY ( batch_id STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS SEQUENCEFILE
20. Good Practices
Use checkpointing optimally.
Use subassemblies instead of rewriting logic.
For further control, pass additional parameters
to subassemblies.
Use compression and SequenceFile() in sink
taps to chain multiple Cascading workflows.
Use failure traps to filter faulty records.
Avoid creating too-small or too-long workflows.
Chain them in Oozie or similar workflow
management engines.
− Example: workflows with 10-20 MR jobs are good
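A minimal sketch of the failure-trap practice above, with hypothetical pipe names and path: any tuple that makes an operation throw is diverted to the trap tap instead of failing the whole flow.

```java
import cascading.flow.FlowDef;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

// Faulty records land here for later inspection instead of killing the job.
Tap trap = new Hfs(new TextLine(), "/tmp/flow_traps", SinkMode.REPLACE);

// pipe, source, and sink are assumed to be defined elsewhere in the flow.
FlowDef flowDef = FlowDef.flowDef()
    .addSource(pipe, source)
    .addTailSink(pipe, sink)
    .addTrap(pipe, trap);   // attach the trap to the named branch
```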
22. Problems with improper configuration
1. Compression parameters: without them, jobs run slow and
can sometimes take double the time. Set the correct
compression type based on your cluster configs.
2. mapred.reduce.tasks: this needs to be set manually
depending on the size of your job. Keeping it too low
slows down the reduce phase.
3. Small-file issue: the input splits read by mappers
can be too small, eventually spinning up more mappers
than required.
4. Any custom configuration parameters: set them on the
flow properties and use getProperty to access them anywhere
in the data workflow:
properties.setProperty("min_cutoff_score", "0.7");
FlowConnector flowConnector = new HadoopFlowConnector(properties);
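As a hedged sketch of reading that property back: any custom operation can retrieve it through the FlowProcess. The filter class and field name below are hypothetical, not from the talk.

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Filter;
import cascading.operation.FilterCall;

// Hypothetical filter: drops tuples whose "score" falls below the
// min_cutoff_score property set on the FlowConnector above.
public class MinScoreFilter extends BaseOperation implements Filter {
    @Override
    public boolean isRemove(FlowProcess flowProcess, FilterCall filterCall) {
        // getProperty returns what was set via properties.setProperty(...)
        double cutoff = Double.parseDouble(
            (String) flowProcess.getProperty("min_cutoff_score"));
        return filterCall.getArguments().getDouble("score") < cutoff;
    }
}
```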
23. Running in YARN
YARN deployment is smooth with Cascading 2.5.
− Make sure the config properties are set as per
YARN, as they are different from MR1.
− While running in workflow engines like Oozie,
make sure properties are set for
• mapred.job.classpath.files and mapred.cache.files,
with all dependency files in colon-separated
format
24. Cascading DSLs in other languages
Scalding (Scala)
PyCascading (Python)
cascading.jruby (JRuby)
Cascalog (Clojure)