Experimentation Platform
on Hadoop
Tony Ng, Director, Data Services
Padma Gopal, Manager, Experimentation
Agenda
• Experimentation 101
• Reporting Workflow
• Why Hadoop?
• Framework Architecture
• Challenges & Learnings
• Q & A
Experimentation 101
• What is A/B Testing?
• Why is it important?
• Intuition vs. Reality
• eBay Wins
What is A/B Testing?
• A/B Testing is comparing two versions of a page or process to see which one
performs better
• Variations could be: UI Components, Content, Algorithms etc.
• Measures: Financial metrics, Click rate, Conversion rate etc.
Control - Current design
Treatment - Variations of current design
EP – Hadoop Summit 2015 4
How is A/B Testing done?
Why is it important?
• Intuition vs. Reality
– Intuition, especially about novel ideas, should be backed up by data.
– Demographics and preferences vary.
• Data Driven; not based on opinion
• Reduce risk
Increased prominence of the BIN (Buy It Now) button compared to Watch leads to faster checkouts.
Merch placements perform much better when title and price
information is provided upfront.
New sign-in design effectively pushed more new users to use
guest checkout
What do we support?
Experimentation Reporting
• How does EP work?
• Workflow
• DW Challenges
Experiment Lifecycle
[Diagram: User Behavior & Transactional Data and Experiment Metadata flow through Detail, Intermediate, and Summaries; 4 billion rows, 4 TB]

User events:
User1  Homepage
User1  Search for IPhone6
User1  View Item1
User2  Search for Coach bag
User2  View Item2
User2  Bid

User events by treatment:
Treatment 2  User1  Homepage
Treatment 1  User1  Search for IPhone6
Treatment 2  User1  Search for IPhone6
Treatment 1  User1  View Item 1
Treatment 2  User1  View Item 1
Treatment 1  User2  Search for Coach bag
Treatment 2  User2  Search for Coach bag

Per-treatment output:
Treatment 1  100+ Metrics, 20 X Dimensions, 10 Metric Insights
Treatment 2  100+ Metrics, 20 X Dimensions, 10 Data Insights
Transactional Metrics
Activity Metrics
Acquisition Metrics
AD Metrics
Email Metrics
Seller Metrics
Engagement metrics
DATA INSIGHTS
Absolute - Actual numbers/counts
Normalized - Weighted mean (by GUID/UID)
Lift - Difference between treatment and control
Standard Deviation - Weighted standard deviation
Confidence Interval - Range within which the treatment effect is likely to lie
P-values - Statistical significance
Outlier capped - Trim tail values
Post Stratified - Adjustment method to reduce variance
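As a rough illustration of how a few of these insights relate to each other, the sketch below computes outlier-capped values, the lift (treatment mean minus control mean), and a normal-approximation 95% confidence interval for that lift. The class name `LiftSketch`, the cap threshold, the unweighted means, and the use of z = 1.96 with per-group sample variances are all assumptions made for illustration; the slides do not state which formulas or weights EP uses.

```java
import java.util.Arrays;

// Hypothetical helper, not part of EP: illustrates lift and a confidence interval.
class LiftSketch {

    // Outlier capping (assumed variant): values above the cap are replaced by the cap.
    static double[] capAt(double[] values, double cap) {
        return Arrays.stream(values).map(v -> Math.min(v, cap)).toArray();
    }

    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    // Unbiased sample variance (divides by n - 1).
    static double sampleVariance(double[] values) {
        double m = mean(values);
        double sumSq = Arrays.stream(values).map(v -> (v - m) * (v - m)).sum();
        return sumSq / (values.length - 1);
    }

    // Absolute lift: treatment mean minus control mean.
    static double lift(double[] treatment, double[] control) {
        return mean(treatment) - mean(control);
    }

    // 95% confidence interval for the lift using a normal approximation.
    // Returns {lower, upper}.
    static double[] confidenceInterval95(double[] treatment, double[] control) {
        double standardError = Math.sqrt(
                sampleVariance(treatment) / treatment.length
              + sampleVariance(control) / control.length);
        double halfWidth = 1.96 * standardError;
        double diff = lift(treatment, control);
        return new double[] {diff - halfWidth, diff + halfWidth};
    }

    public static void main(String[] args) {
        double[] control = {10, 12, 11, 13};
        double[] treatment = {12, 14, 13, 15};
        double[] ci = confidenceInterval95(treatment, control);
        System.out.printf("lift=%.2f, 95%% CI=[%.2f, %.2f]%n",
                lift(treatment, control), ci[0], ci[1]);
    }
}
```

In this toy data the interval lies entirely above zero, which is the usual reading of a statistically significant positive lift at the 5% level.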
Daily
Weekly
Cumulative
Browser
OS
Device
Site/Country
Category
Segment
Geo
Hadoop Migration
• Why Hadoop
• Tech Stack
• Architecture Overview
Why Hadoop?
• Design & Development flexibility
• Store large amounts of data without schema constraints
• System to support complex data transformation logic
• Code base reduction
• Configurability
• Code not tied to environment & easier to share
• Support for complex structures
Physical Architecture
[Diagram: Scheduler/Client, Job Workflow, Hadoop Cluster, ETL, Bridge Agent, RDBMS, MySQL DW, BI & Presentation, and User Behavior Data, connected by steps 1 to 5]
Processing: Hive, Scoobi, Spark (PoC)
File formats: Avro, ORC
Tech Stack - Scoobi
•Scoobi
– Written in Scala, a functional programming language
– Supports Object Oriented Designs
– Abstraction of MR Framework code to lower
– Portability of typical dataset operations like map, flatMap, filter, groupBy, sort, orderBy, partition
– DList (Distributed Lists): Jobs are submitted as a series of “steps” representing granular MR jobs.
– Enables developers to write more concise code than the equivalent Java MR code.
Word Count in Java M/R.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in an input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Word Count in Scoobi
import Scoobi._, Reduction._

val lines = fromTextFile("hdfs://in/...")

val counts = lines.mapFlatten(_.split(" "))
                  .map(word => (word, 1))
                  .groupByKey
                  .combine(Sum.int)

counts.toTextFile("hdfs://out/...", overwrite=true).persist(ScoobiConfiguration())
Tech Stack - File Format
• Avro
– Supports rich and complex data structures such as maps and unions
– Self-describing data files enable portability (the schema is stored with the data)
– Supports dynamic schemas through generic records
– Supports backward compatibility of data files across schema changes
• ORC (Optimized Row Columnar)
– A single file as the output of each task, which reduces the NameNode's load
– Metadata stored using Protocol Buffers, which allows addition and removal of fields
– Better query performance (bounds the amount of memory needed for reading or writing)
– Light-weight indexes stored within the file
Tech Stack - Hive
• Efficient Joins for large datasets.
• UDFs for use cases like median and percentile calculations.
• Hive optimizer joins:
- The smaller table is loaded into memory as a hash table and the larger table is scanned.
- Map joins are picked up automatically by the optimizer.
• Ad-hoc Analysis, Data Reconciliation use-cases and Testing.
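The map join described above can be sketched in plain Java: the smaller dataset is loaded into an in-memory hash table and the larger one is scanned once against it, so the large side never has to be shuffled by join key. `MapJoinSketch` and its sample rows are hypothetical and only illustrate the idea; this is not Hive's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a map-side (hash) join; not Hive code.
class MapJoinSketch {

    // Inner-joins rows of the form {key, value}. The small side must fit in memory.
    static List<String> join(List<String[]> small, List<String[]> large) {
        // Build phase: index the small table by join key.
        Map<String, List<String>> hashTable = new HashMap<>();
        for (String[] row : small) {
            hashTable.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Probe phase: scan the large table once and look up each key.
        List<String> joined = new ArrayList<>();
        for (String[] row : large) {
            List<String> matches = hashTable.get(row[0]);
            if (matches == null) {
                continue; // inner join: drop rows without a match
            }
            for (String match : matches) {
                joined.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> assignments = new ArrayList<>(); // small side
        assignments.add(new String[] {"User1", "Treatment 1"});
        assignments.add(new String[] {"User2", "Treatment 2"});

        List<String[]> events = new ArrayList<>();      // large side
        events.add(new String[] {"User1", "Homepage"});
        events.add(new String[] {"User2", "Bid"});
        events.add(new String[] {"User3", "Homepage"});

        for (String line : join(assignments, events)) {
            System.out.println(line);
        }
    }
}
```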
Fun Facts of EP Processing
• We read more than 200 TB of data for processing daily.
• We run 350 M/R jobs daily.
• We perform more than 30 joins using M/R & Hive, including the ones with heavy data skew.
• We use 40 TB of YARN memory at peak time on a 170 TB Hadoop cluster.
• We can run 150+ concurrent experiments daily.
• Report generation takes around 18 hours.
Logical Architecture
EP Reporting Services
Data stages: Detail, Intermediate 1, Intermediate 2, Summary
Configuration: Filters, Data Providers, Processors, Calculators, Metric Providers
Output: Columns, Metrics, Dimensions
Framework Components: Reporting Context, Cache, Util/Helpers, Command Line, Input/Output Conduit
Ancillary Services: Alerts, Shell Scripts, Processed Data Store, Tools, Logging & Monitoring
Challenges & Learnings
• Joins
• Job Optimization
• Data Skew
Key Challenges
• Performance
– Job runtimes are subject to SLA and heavily tied to resources
• Data Skew (long-tail data distribution)
– May cause unrecoverable runtime failures
– Poor performance
• Joins, Combiner
• Job Resiliency
– Auto-remediation
– Alerts and monitoring
Solution to Key Challenge - Performance
– Tuned the Hadoop job parameters; a few of them are listed below
• -Dmapreduce.input.fileinputformat.split.minsize and -Dmapreduce.input.fileinputformat.split.maxsize
– Job run times were reduced by 9% to 35%
• -Dscoobi.mapreduce.reducers.bytesperreducer
– Adjusting this parameter helped optimize the number of reducers used. Job run times were reduced by up to 50% in some cases
• -Dscoobi.concurrentjobs
– Setting this parameter to true enables multiple steps of a Scoobi job to run concurrently
• -Dmapreduce.reduce.memory.mb
– Tuning this parameter helped relieve memory pressure
Solution to Key Challenge - Performance
– Implement Data cache for objects
• Achieved cache hit ratio of over 99% per job
• Runtime performance improved in the range of 18% to 39% depending on the job
– Redesign/Refactor Jobs and Job Schedules
• Extracted logic from existing jobs into their own jobs
• Job workflow optimization for better parallelism
– Dedicated Hadoop queue with more than 50 TB of YARN memory.
• The shared Hadoop cluster caused long wait times; a dedicated queue resolved the resource crunch.
Joins
– Data skew in one or both datasets
• Scoobi block join divides the skewed data into blocks and joins the data one block at a time.
– Multiple joins in a process
• Rewrote in Hive a process that joins 11 datasets ranging in size from 49 TB to a few megabytes. The process took 6+ hours in Scoobi; the Hive version runs in 3 hours.
– Other join solutions
• Also looked into Hive's bucket join, but the cost of sorting and bucketing the datasets was higher than that of a regular join.
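One common way to realize a block join for skewed keys is to salt the keys: each row of the skewed side is assigned to one of N blocks, and the smaller side is replicated once per block, so a hot key is spread over N join groups instead of one. The sketch below simulates this in plain Java. `BlockJoinSketch`, the round-robin block assignment, and the `key#block` salting format are assumptions for illustration; the code does not use the Scoobi API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a block (salted) join; not Scoobi code.
class BlockJoinSketch {

    // Inner-joins {key, value} rows. On a cluster, each block could go to a different reducer.
    static List<String> join(List<String[]> small, List<String[]> skewed, int blocks) {
        // Replicate every small-side row once per block under the salted key "key#block".
        Map<String, List<String>> replicated = new HashMap<>();
        for (String[] row : small) {
            for (int b = 0; b < blocks; b++) {
                replicated.computeIfAbsent(row[0] + "#" + b, k -> new ArrayList<>()).add(row[1]);
            }
        }
        // Assign each skewed-side row to exactly one block (round-robin here; random also works).
        List<List<String[]>> perBlock = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            perBlock.add(new ArrayList<>());
        }
        for (int i = 0; i < skewed.size(); i++) {
            perBlock.get(i % blocks).add(skewed.get(i));
        }
        // Join one block at a time; the salt never appears in the output.
        List<String> joined = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            for (String[] row : perBlock.get(b)) {
                List<String> matches = replicated.get(row[0] + "#" + b);
                if (matches == null) {
                    continue; // no match on the small side
                }
                for (String match : matches) {
                    joined.add(row[0] + "," + row[1] + "," + match);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> small = new ArrayList<>();
        small.add(new String[] {"hotUser", "Treatment 1"});

        List<String[]> skewed = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            skewed.add(new String[] {"hotUser", "event" + i}); // one key dominates
        }
        System.out.println(join(small, skewed, 3).size() + " joined rows");
    }
}
```

The trade-off is that the small side is copied N times, so the technique pays off only when that side is much smaller than the skewed one.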
Combiner
To relieve Reducer memory pressure and prevent out-of-memory (OOM) errors
Solution – Emit part-values of the complete operation for the same key using Combiners
– Calculating Mean
• Mean = (X1 + X2 + X3 + ... + Xn) / (1 + 1 + 1 + ... + 1) = sum / count
• The formula has two parts, a sum and a count; each mapper emits these two partial values, combining records for the same key.
• The reducer receives far fewer records after combining and applies the two parts from each mapper to the actual mean formula.
• The concept can be applied to other complex formulas such as variance, as long as the formula can be reduced to parts that are commutative and associative.
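The mean calculation above can be sketched as follows. Each mapper-side combiner collapses its records for a key into a (sum, count) pair, and the reducer merges the pairs and divides only at the very end. `MeanCombinerSketch` and `Partial` are hypothetical names, and the sketch is plain Java rather than Hadoop or Scoobi API code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of combining partial values for a mean; not EP code.
class MeanCombinerSketch {

    // Partial aggregate for one key: the two parts of the mean formula.
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Merging is commutative and associative, so it is safe to apply in a combiner.
        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        double mean() {
            return sum / count;
        }
    }

    // Combiner step: collapse one mapper's records for a key into a single Partial.
    static Partial combine(double[] records) {
        Partial acc = new Partial(0.0, 0);
        for (double r : records) {
            acc = acc.merge(new Partial(r, 1));
        }
        return acc;
    }

    // Reducer step: merge the partials from all mappers, then apply the mean formula.
    static double reduce(List<Partial> partials) {
        Partial total = new Partial(0.0, 0);
        for (Partial p : partials) {
            total = total.merge(p);
        }
        return total.mean();
    }

    public static void main(String[] args) {
        Partial fromMapper1 = combine(new double[] {1, 2, 3});
        Partial fromMapper2 = combine(new double[] {10});
        System.out.println(reduce(Arrays.asList(fromMapper1, fromMapper2))); // prints 4.0
    }
}
```

Averaging the two per-mapper means directly (2.0 and 10.0) would give 6.0 instead of the correct 4.0, which is why the sum and the count travel separately until the reducer.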
Job Resiliency
– Auto-remediation
• Auto-restart in case of job failure due to intermittent cluster issues
– Monitoring & alerting for Hadoop jobs
• Continuous monitoring, with an email alert generated when a long-running job or a failure is detected
– Monitoring & alerting for data quality
• Daily monitoring of data trends set up for key metrics, with an email alert on any anomaly or violation detected
– Recon scripts
• Checks and alerts set up for intermediate data
– Daily data backup
• Daily data backup with distcp to a secondary cluster, with the ability to restore
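Auto-restart for intermittent failures can be as simple as a bounded retry loop around job submission. The following is a minimal sketch; `JobRetrySketch`, the attempt limit, and the fixed delay are assumptions, and a real setup would typically restart only on errors known to be transient.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration of auto-restart with a bounded number of attempts; not EP code.
class JobRetrySketch {

    // Runs the job, restarting it after a failure until maxAttempts is reached.
    static <T> T runWithRestart(Callable<T> job, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return job.call();
            } catch (InterruptedException e) {
                throw e; // do not restart if the caller asked us to stop
            } catch (Exception e) {
                lastFailure = e;
                System.err.println("Attempt " + attempt + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // wait before restarting
                }
            }
        }
        throw lastFailure; // every attempt failed; surface the last error for alerting
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = runWithRestart(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("intermittent cluster issue");
            }
            return "SUCCEEDED";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Rethrowing the last failure once the attempts are exhausted lets the surrounding monitoring raise an alert instead of silently swallowing a persistent error.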
Next - Evaluate Spark
Current Problems
- Data processing through MapReduce is slow for a complex DAG, as data is persisted to disk at each step. Multiple stages in the pipeline are chained together, making the overall process very complex.
- Massive joins against very large datasets are slow.
- Expressing complicated business logic in Hadoop MapReduce is difficult.
Alternatives
- Apache Spark is expressive and has wide adoption, industry backing, and a thriving community.
- Apache Spark is reported to run 10x to 100x faster than traditional M/R jobs.
Summary
• Hadoop is ideal for large data processing and provides a highly scalable storage platform.
• The Hadoop ecosystem is still evolving, so expect to face issues with software that is still under development.
• Moving to Hadoop helped free up huge capacity in the DW for deep-dive analysis.
• Huge cost reduction for businesses like ours with exploding data sets.
Q & A

 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 

Recently uploaded (20)

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 

Experimentation Platform on Hadoop

  • 1. Experimentation Platform on Hadoop
      Tony Ng, Director, Data Services
      Padma Gopal, Manager, Experimentation
  • 2. Agenda
      • Experimentation 101
      • Reporting Work flow
      • Why Hadoop?
      • Framework Architecture
      • Challenges & Learnings
      • Q & A
  • 3. Experimentation 101
      • What is A/B Testing?
      • Why is it important?
      • Intuition vs. Reality
      • eBay Wins
  • 4. What is A/B Testing?
      • A/B Testing is comparing two versions of a page or process to see which one performs better
      • Variations could be: UI Components, Content, Algorithms etc.
      • Measures: Financial metrics, Click rate, Conversion rate etc.
      Control: current design. Treatment: variations of current design.
  • 5. How is A/B Testing done?
  • 6. Why is it important?
      • Intuition vs. Reality
          – Intuition, especially on novel ideas, should be backed up by data.
          – Demographics and preferences vary
      • Data driven; not based on opinion
      • Reduce risk
  • 7. Increased prominence of the BIN (Buy It Now) button compared to Watch leads to faster checkouts.
  • 8. Merch placements perform much better when title and price information is provided upfront.
  • 9. New sign-in design effectively pushed more new users to use guest checkout
  • 10. What do we support?
  • 11. Experimentation Reporting
      • How does EP work?
      • Work Flow
      • DW Challenges
  • 12. Experiment Lifecycle
  • 13. Diagram: User Behavior & Transactional Data and Experiment Metadata flow into Detail, Intermediate, and Summaries (4 Billion Rows, 4 TB)
      User behavior events:
          User1 Homepage
          User1 Search for IPhone6
          User1 View Item1
          User2 Search for Coach bag
          User2 View Item2
          User2 Bid
      Events tagged with treatment:
          Treatment 2 User1 Homepage
          Treatment 1 User1 Search for IPhone6
          Treatment 2 User1 Search for IPhone6
          Treatment 1 User1 View Item 1
          Treatment 2 User1 View Item 1
          Treatment 1 User2 Search for Coach bag
          Treatment 2 User2 Search for Coach bag
      Summaries per treatment:
          Treatment 1: 100+ Metrics, 20 X Dimensions, 10 Metric Insights
          Treatment 2: 100+ Metrics, 20 X Dimensions, 10 Data Insights
  • 14. Data Insights
      • Metrics: Transactional Metrics, Activity Metrics, Acquisition Metrics, AD Metrics, Email Metrics, Seller Metrics, Engagement Metrics
      • Data insights:
          – Absolute: actual number/counts
          – Normalized: weighted mean (by GUID/UID)
          – Lift: difference between treatment and control
          – Standard Deviation: weighted standard deviation
          – Confidence Interval: range within which the treatment effect is likely to lie
          – P-values: statistical significance
          – Outlier capped: trim tail values
          – Post Stratified: adjustment method to reduce variance
      • Daily, Weekly, Cumulative
      • Dimensions: Browser, OS, Device, Site/Country, Category, Segment, Geo
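To make the lift and confidence-interval insights on slide 14 concrete, here is a minimal Java sketch. The class and method names are hypothetical, and it assumes unweighted means and a normal-approximation 95% interval (z = 1.96); the slides mention weighted statistics but do not give the exact formulas, so this should not be read as the platform's actual computation.

```java
import java.util.Arrays;

// Hypothetical illustration only: names and formulas are assumptions, not EP code.
class LiftSketch {
    // Arithmetic mean of a sample.
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Unbiased sample variance (divides by n - 1).
    static double variance(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) ss += (x - m) * (x - m);
        return ss / (xs.length - 1);
    }

    // Lift as defined on the slide: treatment mean minus control mean.
    static double lift(double[] treatment, double[] control) {
        return mean(treatment) - mean(control);
    }

    // Approximate 95% confidence interval for the lift (normal approximation).
    static double[] confidenceInterval95(double[] treatment, double[] control) {
        double se = Math.sqrt(variance(treatment) / treatment.length
                            + variance(control) / control.length);
        double d = lift(treatment, control);
        return new double[] { d - 1.96 * se, d + 1.96 * se };
    }

    public static void main(String[] args) {
        double[] control = { 1.0, 2.0, 3.0, 4.0 };
        double[] treatment = { 2.0, 3.0, 4.0, 5.0 };
        double[] ci = confidenceInterval95(treatment, control);
        System.out.printf("lift=%.3f ci=[%.3f, %.3f]%n",
                lift(treatment, control), ci[0], ci[1]);
    }
}
```

If the interval contains zero, as it does for this tiny sample, the observed lift is not distinguishable from noise at that confidence level.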
  • 15. Hadoop Migration
      • Why Hadoop
      • Tech Stack
      • Architecture Overview
  • 16. Why Hadoop?
      • Design & development flexibility
      • Store large amounts of data without schema constraints
      • System to support complex data transformation logic
      • Code base reduction
      • Configurability
      • Code not tied to environment & easier to share
      • Support for complex structures
  • 17. Physical Architecture (diagram)
      • Components: Scheduler/Client, Hadoop Cluster, Job Workflow, RDBMS, ETL, Bridge Agent, BI & Presentation, MySQL, DW, User Behavior Data
      • Processing: Hive, Scoobi, Spark (poc)
      • File formats: AVRO, ORC
  • 18. Tech Stack - Scoobi
      • Scoobi
          – Written in Scala, a functional programming language
          – Supports object-oriented designs
          – Abstracts MR framework code to lower levels
          – Ports typical dataset operations like map, flatMap, filter, groupBy, sort, orderBy, partition to the MR paradigm
          – DList (Distributed Lists): jobs are submitted as a series of "steps" representing granular MR jobs
          – Enables developers to write more concise code compared to Java MR code
  • 19. Word Count in Java M/R

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

            public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();

                public void map(LongWritable key, Text value, Context context)
                        throws IOException, InterruptedException {
                    String line = value.toString();
                    StringTokenizer tokenizer = new StringTokenizer(line);
                    while (tokenizer.hasMoreTokens()) {
                        word.set(tokenizer.nextToken());
                        context.write(word, one);
                    }
                }
            }

            public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) {
                        sum += val.get();
                    }
                    context.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = new Job(conf, "wordcount");
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setMapperClass(Map.class);
                job.setReducerClass(Reduce.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                job.waitForCompletion(true);
            }
        }
  • 20. Word Count in Scoobi

        import Scoobi._, Reduction._

        val lines = fromTextFile("hdfs://in/...")

        val counts = lines.mapFlatten(_.split(" "))
                          .map(word => (word, 1))
                          .groupByKey
                          .combine(Sum.int)

        counts.toTextFile("hdfs://out/...", overwrite=true).persist(ScoobiConfiguration())
  • 21. Tech Stack - File Format
      • Avro
          – Supports rich and complex data structures such as Maps and Unions
          – Self-describing data files enable portability (the schema co-exists with the data)
          – Supports dynamic schemas using Generic Records
          – Supports backward compatibility for data files w.r.t. schema changes
      • ORC (Optimized Row Columnar)
          – A single file as the output of each task, which reduces the NameNode's load
          – Metadata stored using Protocol Buffers, which allows addition and removal of fields
          – Better query performance (bounds the amount of memory needed for reading or writing)
          – Light-weight indexes stored within the file
  • 22. Tech Stack - Hive
      • Efficient joins for large datasets.
      • UDFs for use cases like median and percentile calculations.
      • Hive optimizer joins:
          – The smaller table is loaded into memory as a hash table and the larger table is scanned.
          – Map joins are automatically picked up by the optimizer.
      • Ad-hoc analysis, data reconciliation use cases, and testing.
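The map join described on slide 22 (small side held in memory as a hash table, large side scanned) can be sketched in plain Java as follows. This is a single-process illustration of the idea, not Hive code; the class name, the `{key, value}` row layout, and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a map-side hash join; not Hive's implementation.
class MapJoinSketch {
    // Inner-joins rows of the form {key, value}: the smaller input is loaded into
    // an in-memory hash table, then the larger input is scanned once and probed.
    static List<String> join(List<String[]> small, List<String[]> large) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : small) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] row : large) {
            List<String> matches = table.get(row[0]);
            if (matches == null) continue; // inner join: drop unmatched rows
            for (String m : matches) {
                out.add(row[0] + "," + m + "," + row[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> assignments = List.of(
            new String[] { "User1", "Treatment 1" },
            new String[] { "User2", "Treatment 2" });
        List<String[]> events = List.of(
            new String[] { "User1", "Homepage" },
            new String[] { "User2", "Bid" },
            new String[] { "User3", "Search" });
        join(assignments, events).forEach(System.out::println);
    }
}
```

Because the large input is only streamed, no shuffle or sort of the big dataset is needed, which is why this strategy only applies when the small side fits in memory.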
  • 23. Fun Facts of EP Processing
      • We read more than 200 TB of data for processing daily.
      • We run 350 M/R jobs daily.
      • We perform more than 30 joins using M/R & Hive, including the ones with heavy data skew.
      • We use 40 TB of YARN memory at peak time on a 170 TB Hadoop cluster.
      • We can run 150+ concurrent experiments daily.
      • Report generation takes around 18 hours.
  • 24. Logical Architecture (diagram)
      • EP Reporting Services: Detail, Intermediate 1, Intermediate 2, Summary
      • Configuration, Filters, Data Providers, Processors, Calculators, Metric Providers, Output
      • Columns, Metrics, Dimensions
      • Framework Components: Reporting Context, Cache, Util/Helpers, Command Line, Input/Output Conduit
      • Ancillary Services: Alerts, Shell Scripts, Processed Data Store, Tools, Logging & Monitoring
  • 25. CHALLENGES & LEARNINGS
      • Joins
      • Job Optimization
      • Data Skew
  • 26. Key Challenges
      • Performance
          – Job runtimes are subject to SLA & heavily tied to resources
      • Data Skew (long tail data distribution)
          – May cause unrecoverable runtime failures
          – Poor performance
      • Joins, Combiner
      • Job Resiliency
          – Auto remediation
          – Alerts and Monitoring
  • 27. Solution to Key Challenge - Performance
      – Tuned the Hadoop job parameters; a few of them are listed below
          • -Dmapreduce.input.fileinputformat.split.minsize and -Dmapreduce.input.fileinputformat.split.maxsize: job run times were reduced by 9% to 35%
          • -Dscoobi.mapreduce.reducers.bytesperreducer: adjusting this parameter helped optimize the number of reducers used; job run times were reduced by up to 50% in some cases
          • -Dscoobi.concurrentjobs: setting this parameter to true enables multiple steps of a Scoobi job to run concurrently
          • -Dmapreduce.reduce.memory.mb: tuning this parameter helped relieve memory pressure
  • 28. Solution to Key Challenge - Performance
      – Implemented a data cache for objects
          • Achieved a cache hit ratio of over 99% per job
          • Runtime performance improved by 18% to 39%, depending on the job
      – Redesigned/refactored jobs and job schedules
          • Extracted logic from existing jobs into their own jobs
          • Optimized the job workflow for better parallelism
      – Dedicated Hadoop queue with more than 50 TB of YARN memory
          • The shared Hadoop cluster resulted in long waiting times; a dedicated queue solved the resource crunch.
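The slides do not say how the object cache was built, so the following is only a sketch of one common approach: a bounded LRU cache on top of `java.util.LinkedHashMap` that also tracks its hit ratio. The class name, the capacity, and the loader function are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical LRU object cache with hit-ratio tracking; not the EP implementation.
class ObjectCacheSketch<K, V> {
    private final Map<K, V> entries;
    private long hits;
    private long misses;

    ObjectCacheSketch(final int capacity) {
        // accessOrder = true keeps the least recently used entry first,
        // so removeEldestEntry evicts it once the capacity is exceeded.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Returns the cached object, or builds it with the loader on a miss.
    // A loader that returns null is treated as a miss on every call.
    V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = loader.apply(key);
        entries.put(key, value);
        return value;
    }

    // Fraction of lookups served from the cache.
    double hitRatio() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        ObjectCacheSketch<Integer, String> cache = new ObjectCacheSketch<>(2);
        int[] keys = { 1, 1, 2, 1, 3, 2 };
        for (int k : keys) cache.get(k, id -> "object-" + id);
        System.out.println("hit ratio = " + cache.hitRatio());
    }
}
```

A hit ratio above 99%, as reported on the slide, suggests a small set of expensive objects reused across a very large number of records.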
  • 29. Joins
      – Data skew in one or both datasets
          • Scoobi block join divides the skewed data into blocks and joins the data one block at a time.
      – Multiple joins in a process
          • Rewrote in Hive a process that joins 11 datasets ranging in size from 49 TB to a few megabytes. The process took 6+ hours in Scoobi; the Hive version takes 3 hours.
      – Other join solutions
          • Also looked into Hive's bucket join, but the cost to sort and bucket the datasets was more than that of a regular join.
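The block join idea can be illustrated with a small single-process Java sketch: replicate the small side once per block with the block id folded into the key, tag each row of the skewed side with a random block id, join on the combined pseudo-key, and strip the block id from the output. This is not Scoobi's API; the class name, the `#` separator, and the row layout are hypothetical, and in a real MapReduce job the pseudo-key would decide which reducer receives each row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration of a block (salted) join; not Scoobi's implementation.
class BlockJoinSketch {
    // Joins {key, value} rows. Assumes keys never contain the '#' separator.
    static List<String> blockJoin(List<String[]> small, List<String[]> skewed,
                                  int blocks, Random random) {
        // Replicate every small-side row into each block: key -> key#0 .. key#(blocks-1).
        Map<String, List<String>> replicated = new HashMap<>();
        for (String[] row : small) {
            for (int b = 0; b < blocks; b++) {
                replicated.computeIfAbsent(row[0] + "#" + b, k -> new ArrayList<>())
                          .add(row[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String[] row : skewed) {
            // A random block id spreads one hot key across 'blocks' pseudo-keys.
            String pseudoKey = row[0] + "#" + random.nextInt(blocks);
            for (String left : replicated.getOrDefault(pseudoKey, List.of())) {
                // Strip the block id again: only the original key is emitted.
                out.add(row[0] + "," + left + "," + row[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> items = List.of(new String[] { "item1", "Coach bag" });
        List<String[]> views = new ArrayList<>();
        for (int i = 0; i < 5; i++) views.add(new String[] { "item1", "view" + i });
        blockJoin(items, views, 3, new Random(42)).forEach(System.out::println);
    }
}
```

The result is the same as a plain join; the trade-off is that the small side is copied `blocks` times in exchange for splitting each hot key across several reducers.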
  • 30. Combiner
      Goal: relieve Reducer memory pressure and prevent OOM.
      Solution
          – Emit partial values of the complete operation for the same key using Combiners
          – Calculating Mean
              • Mean = (X1 + X2 + ... + Xn) / (1 + 1 + ... + 1), with n ones in the denominator
              • The formula is composed of 2 parts (a sum and a count), and the mapper emits both partial values after combining records for the same key.
              • The Reducer receives far fewer records after combining and applies the two parts from each mapper to the actual mean formula.
              • The concept can be applied to other complex formulas such as Variance, as long as the formula can be reduced to parts that are commutative and associative.
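The mean example on slide 30 can be sketched in plain Java as a (sum, count) partial aggregate. This is an in-memory illustration rather than Scoobi or Hadoop code; `MeanCombinerSketch`, `Partial`, and the method names are hypothetical.

```java
import java.util.List;

// Hypothetical illustration of combiner-friendly mean computation.
class MeanCombinerSketch {
    // Partial aggregate emitted per key: running sum and record count.
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Merging is commutative and associative, so it is safe in a combiner.
        Partial merge(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        double mean() {
            return sum / count;
        }
    }

    // Map side: fold one mapper's values for a key into a single partial.
    static Partial combine(double[] values) {
        Partial p = new Partial(0.0, 0L);
        for (double v : values) p = p.merge(new Partial(v, 1L));
        return p;
    }

    // Reduce side: merge the partials from all mappers, then apply the formula.
    static double reduce(List<Partial> partials) {
        Partial total = new Partial(0.0, 0L);
        for (Partial p : partials) total = total.merge(p);
        return total.mean();
    }

    public static void main(String[] args) {
        Partial mapper1 = combine(new double[] { 1.0, 2.0, 3.0 });
        Partial mapper2 = combine(new double[] { 4.0, 5.0 });
        System.out.println("mean = " + reduce(List.of(mapper1, mapper2)));
    }
}
```

Averaging the two per-mapper means directly would give a different, incorrect answer when the mappers see different record counts, which is why the sum and the count travel separately. Variance can be handled the same way by also carrying a sum of squares.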
  • 31. Job Resiliency
      – Auto-remediation
          • Auto-restart in case of job failure due to intermittent cluster issues
      – Monitoring & Alerting for Hadoop jobs
          • Continuous monitoring, with an email alert generated when a long-running job or failure is detected
      – Monitoring & Alerting for Data quality
          • Daily monitoring of data trends set up for key metrics, with an email alert on any anomaly or violation detected
      – Recon scripts
          • Checks and alerts set up for intermediate data
      – Daily data backup
          • Daily data backup with distcp to a secondary cluster, with the ability to restore
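The auto-restart behavior is only named on the slide, not described. A minimal sketch of the general pattern, assuming a fixed number of attempts and a linearly growing pause between them, might look like this in Java; `RetrySketch`, `runWithRetry`, and both parameters are hypothetical.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration of auto-restart on intermittent failures.
class RetrySketch {
    // Runs the job and restarts it when it throws, up to maxAttempts attempts.
    // Waits backoffMillis * attempt between attempts; rethrows the last failure.
    static <T> T runWithRetry(Callable<T> job, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return job.call();
            } catch (Exception e) {
                last = e; // remember the failure in case every attempt fails
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = { 0 };
        String result = runWithRetry(() -> {
            if (++calls[0] < 3) throw new IllegalStateException("intermittent cluster issue");
            return "job finished";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In practice a retry like this should be limited to failures known to be transient, so that a genuinely broken job raises an alert instead of being restarted repeatedly.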
  • 32. Next - Evaluate Spark
      Current Problems
          – Data processing through MapReduce is slow for a complex DAG, as data is persisted to disk at each step. Multiple stages in the pipeline are chained together, making the overall process very complex.
          – Massive joins against very large datasets are slow.
          – Expressing every piece of complicated business logic in Hadoop MapReduce is a problem.
      Alternatives
          – Apache Spark is expressive and has wide adoption, industry backing, and thriving community support.
          – Apache Spark offers 10x to 100x speed improvements in comparison to traditional M/R jobs.
  • 33. Summary
      • Hadoop is ideal for large data processing and provides a highly scalable storage platform.
      • The Hadoop ecosystem is still evolving, so users have to face issues with software that is still under development.
      • Moving to Hadoop helped free up huge capacity in the DW for deep dive analysis.
      • Huge cost reduction for businesses like ours with exploding data sets.
  • 34. Q & A

Editor's Notes

  1. Scoobi: advantages compared to Java MR. Written in Scala, a functional programming language, which makes Scoobi suitable for writing MR code. Supports object-oriented designs (and legacy Java object data models). MR framework code is completely abstracted to lower levels, leaving application developers to worry only about business logic. Typical dataset operations like map, flatMap, filter, groupBy, sort, orderBy, partition are ported over in functionality to the MR paradigm. Large datasets are abstracted into a data type called DList (Distributed List). DLists represent delayed computations (a.k.a. the Scoobi plan), using which jobs are submitted as a series of "steps" representing granular MR jobs. Developers do not need to create workflows for individual jobs. Any MR operation can be executed on a DList, enabling developers to write more concise code compared to Java MR code. Multiple similar Scala-based libraries exist, such as Scalding and Scrunch.
  2. If a YARN container grows beyond its heap size setting, the map or reduce task will fail. This can be solved by increasing the container heap size for mappers or reducers, depending on which one is having the problem. Increasing the memory size of mappers or reducers comes at the expense of reduced parallelism in the cluster, since it can now launch fewer containers simultaneously, so experiment with the memory settings to find the lowest heap size that allows your jobs to complete comfortably.
  3. Object caching allows applications to share objects across requests. By storing frequently accessed or expensive-to-create objects in memory, object caching eliminates the need to repeatedly create and load data.
  4. Scoobi block join was used where one of the datasets was heavily skewed. The join key was item_id, and one of the datasets had over a million records for the same key, which was causing the job to fail. Block join divides the skewed data into blocks and joins the data one block at a time: replicate the small (left) side n times, including the id of the replica in the key; on the right side, add a random integer from 0 to n-1 to the key; join using the pseudo-key and strip out the extra fields. This is useful for skewed join keys and large datasets.
  5. To relieve Reducer memory pressure and prevent OOM, a combiner may be used to perform a map-local aggregation, preventing OOM errors on reducers caused by a large number of input records. In Scoobi, a combiner takes the form of a function which may be invoked on a DList. A combiner represents operations that have the commutative and associative properties. Further, two records must be combined in all aspects of the records' attributes to result in a combined record. Combining becomes harder in real-world problems, where the rules of combining may not be directly applicable to the attributes of records.
  6. Current problems: Data processing through MapReduce is slow for a complex DAG, as data is persisted to disk at each step. It is not designed for faster joins. Multiple stages in the pipeline are chained together, making the overall process very complex. Massive joins run against very large datasets. There is an overwhelming need to make data more interactive/responsive, and Hadoop is not built for it. Expressing every piece of complicated business logic in Hadoop MapReduce is a problem.