Overkill Analytics
Claudiu Barbura
VP of Engineering
About Me
• Architect and Dev Mgr at ubix.ai … data science platform
• Infrastructure & real-time services -> automating data science at scale
• 3rd Seattle Spark Meetup!
• xPatterns Big Data Platform (Spark, Mesos, Tachyon, Cassandra …)
• Strata, Spark, C* summits & local meetups
Agenda
• Ubix Data Eng & Science Platform Architecture
• High dimensional sparse feature spaces
• OKA (OverKill Analytics) and Composite Modelling
• (Kaggle) Outbrain Click Prediction: demo in DSL Workbench
• pymap deep dive: distributed scikit-learn through Spark
• python injection into DSL: pySpark scala JVM interop
• Q&A
Data Eng & Science Platform: “Engine”
Unified big data technology stack (Spark, Cassandra, Hadoop, Kafka, ES …)
Cloud agnostic architecture
Universal predictive interface (MLlib, ML Pipeline, VW, scikit-learn, R, H2O … TF)
Extensibility and integration via a fluent and expressive API (DSL)
Enterprise grade: scalability, performance, high availability, geo-replication, resilience, security, manageability, interoperability, testability
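One way to picture a "universal predictive interface" is a single fit/predict contract that every backend adapter implements. The sketch below is only an illustration under that assumption: `Predictor`, `register`, `train` and the toy `MajorityClass` backend are hypothetical names and are not part of the Ubix platform or its DSL.

```python
from abc import ABC, abstractmethod

# Hypothetical registry of backend adapters, keyed by a short name.
REGISTRY = {}


class Predictor(ABC):
    """Common contract that every backend adapter would implement."""

    @abstractmethod
    def fit(self, rows, labels):
        ...

    @abstractmethod
    def predict(self, rows):
        ...


def register(name):
    """Class decorator that makes an adapter discoverable by name."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


@register("majority")
class MajorityClass(Predictor):
    """Toy backend: always predicts the most frequent training label."""

    def fit(self, rows, labels):
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, rows):
        return [self.label_ for _ in rows]


def train(name, rows, labels):
    """Callers pick a backend by name and never touch its internals."""
    return REGISTRY[name]().fit(rows, labels)


model = train("majority", [[0], [1], [2]], [1, 1, 0])
predictions = model.predict([[5], [6]])
```

Under this design, adding another library would mean writing one more adapter class, while calling code stays unchanged.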
High dimensional sparse feature spaces
• high dimensional feature engineering often demands sparse representation
• spark and scipy support vs ubix DSL: compress-sparse, merge-sparse, expand-sparse, filter-sparse, load sparse (libsvm format)
• sparse format: native input to MLlib, spark.ml, scikit-learn algos
• exceptions: Spark 1.6 MLlib’s k-means, GMM, RF (breeze linear algebra or … slow)
• feature (2-way) encoding + vocabulary extraction (error analysis, importance)
• Dimensionality Reduction via Feature Selection (ChiSquare) and Hashing (text)
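The encoding ideas on this slide can be sketched in plain Python: a two-way vocabulary (feature name to column index and back), libsvm-style sparse rows, and the hashing trick as a vocabulary-free alternative. This is a minimal illustration and not the Ubix DSL operators; all function names and the `field=value` token format are assumptions made for the example.

```python
import hashlib


def build_vocabulary(rows):
    """Two-way encoding: token -> column index, plus the inverse map
    so that columns can be decoded again (e.g. for error analysis)."""
    index_of = {}
    for row in rows:
        for token in row:
            if token not in index_of:
                index_of[token] = len(index_of)
    name_of = {index: token for token, index in index_of.items()}
    return index_of, name_of


def to_libsvm(label, row, index_of):
    """One-hot encode a row as '<label> <index>:<value> ...'.
    libsvm indices are 1-based and listed in ascending order."""
    columns = sorted({index_of[token] + 1 for token in row})
    return " ".join([str(label)] + [f"{c}:1" for c in columns])


def parse_libsvm(line):
    """Parse one libsvm line into (label, {index: value}).
    Only non-zero entries are stored, which keeps the row sparse."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


def hash_features(tokens, n_buckets):
    """Hashing trick: map tokens straight to a fixed number of buckets.
    No vocabulary is kept, so colliding tokens share a bucket."""
    vector = {}
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_buckets
        vector[bucket] = vector.get(bucket, 0.0) + 1.0
    return vector


rows = [["site=a", "geo=US"], ["site=b", "geo=US"]]
index_of, name_of = build_vocabulary(rows)
line = to_libsvm(1, rows[1], index_of)
label, features = parse_libsvm(line)
hashed = hash_features(["click", "view", "click"], n_buckets=8)
```

The inverse map `name_of` is what makes the encoding "2-way": a column index that shows up as important in a model can be traced back to its original feature name.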
OKA (OverKill Analytics) & Composite Modelling
• OKA: “design philosophy for predictive models favors volume over precision, utility over elegance, and CPU over IQ. … brute force attack on data science, compromise fine-tuning”
• Alternative to Dimensionality reduction - train on full sparse feature space!
• Composite Modeling = managing part models as one ensemble
• distributed scikit-learn/TF/VW models -> prediction table output for averaging, voting
• unsupervised learning output -> input supervised learning (clustering + ensembling)
• dimensionality reduction or building semantically different models within clusters
• OKA + Comp: larger feature spaces (lower variance in parts -> higher bias in part models)
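The composite modeling bullets describe two mechanics: combining a prediction table by averaging or voting, and feeding a clustering step into per-cluster part models. A minimal sketch of both follows. The table layout (row id mapped to a list of per-model outputs) and the `CompositeModel` class are assumptions for illustration, not the platform's actual data structures.

```python
from collections import Counter


def average_predictions(prediction_table):
    """prediction_table: {row_id: [probability from each part model]}.
    Returns the mean probability per row."""
    return {
        row_id: sum(probabilities) / len(probabilities)
        for row_id, probabilities in prediction_table.items()
    }


def majority_vote(label_table):
    """label_table: {row_id: [class label from each part model]}.
    Returns the most frequent label per row."""
    return {
        row_id: Counter(labels).most_common(1)[0][0]
        for row_id, labels in label_table.items()
    }


class CompositeModel:
    """Clustering + supervised part models managed as one model:
    each row is routed to the part model trained for its cluster."""

    def __init__(self, assign_cluster, part_models):
        self.assign_cluster = assign_cluster  # callable: row -> cluster id
        self.part_models = part_models        # {cluster id: callable model}

    def predict(self, row):
        return self.part_models[self.assign_cluster(row)](row)


averaged = average_predictions({"r1": [0.25, 0.75], "r2": [1.0, 0.5, 0.0]})
voted = majority_vote({"r1": [1, 0, 1]})

# Toy stand-ins for a fitted clusterer and two fitted part models.
composite = CompositeModel(
    assign_cluster=lambda row: "big" if row["x"] > 10 else "small",
    part_models={"big": lambda row: 1, "small": lambda row: 0},
)
routed = composite.predict({"x": 20})
```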
Outbrain Click Prediction
• Outbrain: content discovery platform … 250 billion personalized recommendations/month
• Kaggle challenge: predict which recommended content each user will click
• sample of users’ page views and clicks (14 days) .. sets of content recommendations served to a specific user in a specific context +
• document metadata: mentioned entities (person, organization, location), a taxonomy of categories, the topics mentioned, and the publisher.
• 2 billion page views and 16,900,000 clicks from 700 million unique users across 560 sites
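Since the task is to predict which item in each served recommendation set gets clicked, a model's per-item click probabilities have to be turned into a ranking per set. A small sketch of that final step, assuming scored rows of the form (set id, item id, probability); the identifiers `display_id` and `ad_id` are illustrative names chosen for this example.

```python
from collections import defaultdict


def rank_recommendations(scored_rows):
    """scored_rows: iterable of (display_id, ad_id, click_probability).
    Returns {display_id: [ad_id, ...]} ordered from most to least
    likely to be clicked within each recommendation set."""
    by_display = defaultdict(list)
    for display_id, ad_id, probability in scored_rows:
        by_display[display_id].append((probability, ad_id))
    return {
        display_id: [
            ad_id
            for probability, ad_id in sorted(
                candidates, key=lambda pair: pair[0], reverse=True
            )
        ]
        for display_id, candidates in by_display.items()
    }


scored = [(1, "a", 0.1), (1, "b", 0.7), (2, "c", 0.4), (1, "d", 0.3)]
ranking = rank_recommendations(scored)
```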
pymap - distributed python
• primitives for model management (model + metadata)
• optimizations for clustering + composite modeling techniques
• compute partition size/count to avoid OOM (simple with static allocation of resources: Mesos coarse-grained or YARN)
• wrapped pySpark (jvmContext) through gateway servercontext (JavaGateway)
• python-scala interop through cached temp tables (registerTempTable)
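Two of the ideas above lend themselves to a small sketch: sizing partitions so that each task fits in memory, and applying a Python training function once per partition. The code below runs locally without Spark. The sizing formula, its default factors, and every function name are assumptions for illustration rather than the actual pymap implementation. On a cluster, a generator function like `fit_mean_model` is the kind of callable that PySpark's `RDD.mapPartitions` accepts.

```python
import math
from statistics import mean


def partition_count(total_bytes, executor_memory_bytes, cores_per_executor,
                    memory_fraction=0.5, expansion_factor=4.0):
    """Pick a partition count so one partition fits a task's memory share.
    memory_fraction: assumed share of executor memory usable for data.
    expansion_factor: assumed blow-up when rows become Python objects."""
    budget_per_task = executor_memory_bytes * memory_fraction / cores_per_executor
    max_partition_bytes = budget_per_task / expansion_factor
    return max(1, math.ceil(total_bytes / max_partition_bytes))


def split_into_partitions(rows, n_partitions):
    """Local stand-in for a repartitioned dataset."""
    size = math.ceil(len(rows) / n_partitions)
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def fit_mean_model(partition):
    """Stand-in for fitting a scikit-learn model on one partition:
    the 'model' here is just the mean label of that partition."""
    rows = list(partition)
    yield {"n_rows": len(rows), "mean_label": mean(label for _, label in rows)}


def pymap_local(partitions, fn):
    """Apply fn to an iterator over each partition and collect the results."""
    return [result for part in partitions for result in fn(iter(part))]


GIB = 2 ** 30
n = partition_count(total_bytes=64 * GIB, executor_memory_bytes=8 * GIB,
                    cores_per_executor=4)

data = [(i, i % 2) for i in range(8)]  # (feature, label) pairs
models = pymap_local(split_into_partitions(data, 2), fit_mean_model)
```

With static resource allocation the executor memory and core count are known up front, which is why the partition count can be computed before the job starts.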
Thank you!
Q & A

Overkill Analytics Seattle Spark Meetup

Editor's notes

  1. Logical Architecture Diagram
  2. Physical Architecture Diagram