themidgame-tube: a data platform for finding YouTube influencers on brand names
Pedro Moy
Data Sources / Scraping
19M videos
500 GB
18K brands
Data Pipeline
YARN
AWS Elastic MapReduce
Thrift API
MapReduce MapReduce
Design Choices
1. AWS Elastic MapReduce (EMR) vs. Mesosphere (with Docker)
   - zero deployment efforts
   - EMR with YARN, dynamically scale your cluster
2. Batch-only design
   - lower-maintenance (no Kafka, no Storm/Spark Streaming)
3. S3+HDFS together
   - cheap unlimited storage with HDFS batch performance
   - safely shut down your cluster
System Scalability Test With Spot Instances

| Slave Nodes | Batch Process Time | Overall Throughput | Cost Rate | Total Cost |
|-------------|--------------------|--------------------|-----------|------------|
| 5           | 48 mins            | 10.4 GB/min        | $2.1/h    | $1.7       |
| 10          | 32 mins            | 15.6 GB/min        | $3.8/h    | $2.0       |
| 15          | 21 mins            | 23.8 GB/min        | $5.5/h    | $1.9       |

Load: +500 GB / +19M videos
Cluster Setup:
  Master: m3.xlarge
  Slave: m2.4xlarge (bid at $0.1/h from $1.0/h)
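
The throughput and total-cost columns in the table above can be reproduced from the batch time and the cost rate. The sketch below is a minimal check, assuming the load is exactly 500 GB (the slide says "+500 GB") and that total cost is simply the hourly rate multiplied by the batch time; the function names are illustrative and not taken from the project.

```python
# Rows from the slide: (slave nodes, batch process time in minutes, cost rate in $/h).
RUNS = [(5, 48, 2.1), (10, 32, 3.8), (15, 21, 5.5)]

# Assumption: the "+500 GB" load is treated as exactly 500 GB.
LOAD_GB = 500


def throughput_gb_per_min(minutes, load_gb=LOAD_GB):
    """Overall throughput: data volume divided by batch process time."""
    return load_gb / minutes


def total_cost(minutes, rate_per_hour):
    """Total cost: hourly cluster rate prorated over the batch process time."""
    return rate_per_hour * minutes / 60


def summarize(runs=RUNS):
    """Return (nodes, GB/min, total $) per run, rounded to one decimal like the slide."""
    return [
        (nodes, round(throughput_gb_per_min(mins), 1), round(total_cost(mins, rate), 1))
        for nodes, mins, rate in runs
    ]


if __name__ == "__main__":
    for nodes, gb_per_min, cost in summarize():
        print(f"{nodes:>2} nodes: {gb_per_min} GB/min, ${cost}")
```

Under these assumptions the rounded values agree with the slide. Tripling the cluster from 5 to 15 nodes cuts the batch time by more than half while the total cost stays close to $2, because the higher hourly rate is paid for a shorter time.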
Scalability Challenges & Salted Tables

HBase Schema
RowKey = SALT+Brand+MetricType+… …NumMetric+VideoID
SALTING = predefining “table” (region) splits using the SALT

Rows: RowKey0, RowKey1, RowKey2, …
Data Column (per row): Channel Name, Channel ID, Video Title, Transcript Word Set, Word Frequency
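
The salted row key described above could be sketched as follows. This is an illustration only, not the project's code: the bucket count, the `|` separator, the zero-padding width, and deriving the salt from a hash of the VideoID are all assumptions, and only the fields named on the slide are used (anything elided by the "…" is left out).

```python
import hashlib

# Hypothetical number of salt buckets, i.e. pre-split regions.
NUM_BUCKETS = 16


def salt_for(video_id, num_buckets=NUM_BUCKETS):
    """Derive a stable salt bucket from the video ID (assumed salt source)."""
    digest = hashlib.md5(video_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def row_key(brand, metric_type, num_metric, video_id, num_buckets=NUM_BUCKETS):
    """Build SALT+Brand+MetricType+NumMetric+VideoID as bytes.

    The numeric metric is zero-padded so that byte order matches numeric order.
    """
    salt = salt_for(video_id, num_buckets)
    key = f"{salt:02d}|{brand}|{metric_type}|{num_metric:012d}|{video_id}"
    return key.encode("utf-8")


def split_points(num_buckets=NUM_BUCKETS):
    """Region boundaries predefined from the salt: one region per salt prefix."""
    return [f"{s:02d}".encode("utf-8") for s in range(1, num_buckets)]


if __name__ == "__main__":
    print(row_key("acme", "views", 12345, "vid001"))
    print(split_points()[:3])
```

Because the salt is the leading component of the key, rows for a single popular brand spread across all predefined regions instead of piling onto one. The trade-off is that a read for one brand has to scan every salt prefix and merge the results.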
Born in Brazil
Love mixing with different cultures!
Former Data Analytics Engineer
Boston MA
Studying machine learning (part-time)
OMSCS, Georgia Tech
MSc, Geophysics
King Abdullah University of Science and Tech.
BSc, Mechanical Engineering
University of Massachusetts Lowell
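System Scalability Test With Spot Instances [2]

| Core EC2 Instances | Spark Executors | Spark Jobs Time | Batch Process Time | Overall Throughput | Total Cost    | Savings |
|--------------------|-----------------|-----------------|--------------------|--------------------|---------------|---------|
| 5                  | 119             | 23 mins         | 48 mins (25)       | 10.4 GB/min        | $2.1/h ($1.7) | 68%     |
| 10                 | 239             | 12 mins         | 32 mins (20)       | 15.6 GB/min        | $3.8/h ($2.0) | 70%     |
| 15                 | 359             | 7 mins          | 21 mins (14)       | 23.8 GB/min        | $5.5/h ($1.9) | 70%     |

Load: +500 GB / +19M videos
Spark Setup: 2 GB Memory/Executor
Cluster Setup:

| Role   | Instance   | RAM   | Units | Cores |
|--------|------------|-------|-------|-------|
| Master | m3.xlarge  | 15 GB | 13    | 4     |
| Core   | m2.4xlarge | 68 GB | 26    | 8     |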
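
Two patterns in the figures above can be checked with simple arithmetic. The parenthesized number next to each batch process time equals the batch time minus the Spark jobs time, and the executor counts fit 24 per core instance minus one. The sketch below verifies both; reading the parenthesized value as time spent outside Spark, and the missing executor as one container reserved elsewhere (for example for a YARN application master), are interpretations the slide does not state.

```python
# Rows from the slide:
# (core instances, Spark executors, Spark jobs minutes, batch minutes, parenthesized value)
ROWS = [
    (5, 119, 23, 48, 25),
    (10, 239, 12, 32, 20),
    (15, 359, 7, 21, 14),
]


def non_spark_minutes(batch_min, spark_min):
    """Batch process time not accounted for by Spark jobs."""
    return batch_min - spark_min


def containers_per_instance(executors, instances, reserved=1):
    """Executors plus an assumed reserved container, spread over the core instances."""
    return (executors + reserved) / instances


def check(rows=ROWS):
    """Per row: does the parenthesized value match, and how many containers per instance."""
    return [
        (non_spark_minutes(batch, spark) == paren, containers_per_instance(execs, n))
        for n, execs, spark, batch, paren in rows
    ]


if __name__ == "__main__":
    for matches, per_instance in check():
        print(matches, per_instance)
```

If the first interpretation holds, the Spark jobs themselves scale close to linearly (23, 12, 7 minutes), while the remaining part of the batch shrinks more slowly (25, 20, 14 minutes) and increasingly dominates the total.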
