©2016 MediaMath Inc. 1
Moving Past
Infrastructure Limitations
Presented by MediaMath
FEB 16, 2016
Rory Sawyer – Software Engineer, Data Platform
Massive Volume of Data
• 180 billion impression opportunities a day
• 3+ million peak qps
• 3+ TB of data per day (compressed)
• Logs represent financial transactions
  - Every record counts!
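As a rough sanity check on these figures, the daily volume can be converted into average rates. This is only a sketch: it assumes a uniform 86,400-second day, decimal terabytes, and that the peak qps figure refers to the same traffic as the impression count. None of those assumptions is stated in the talk, and the variable names are illustrative.

```python
# Back-of-the-envelope rates derived from the slide's figures.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

impressions_per_day = 180e9      # "180 billion impression opportunities a day"
peak_qps = 3e6                   # "3+ million peak qps" (treated as a lower bound)
compressed_bytes_per_day = 3e12  # "3+ TB of data per day", assuming decimal TB

average_qps = impressions_per_day / SECONDS_PER_DAY
peak_to_average = peak_qps / average_qps
average_mb_per_second = compressed_bytes_per_day / SECONDS_PER_DAY / 1e6

print(f"average qps: {average_qps:,.0f}")
print(f"peak/average ratio: {peak_to_average:.2f}")
print(f"compressed ingest: {average_mb_per_second:.1f} MB/s")
```

Under these assumptions the average works out to roughly 2.1 million opportunities per second, so the quoted peak is at least about 1.4 times the average.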
MediaMath’s Data Platform
• Centralized location for data at MM
  - Collect data from across the company
  - Standardize access for internal and external clients
• End result of data warehouse transformation
Once Upon A Time
The Old Days
Etc..
Architecture – 2013
Data Warehousing at MM circa 2013
• No proper QA/testing environment
• Production workflows and ad-hoc analytics ran side by side
• Scaling became an issue
  - Developing, testing, and deploying workflow changes was frustrating
  - Copying data to more monolithic systems
  - More shell, more problems
Data Access circa 2013 – Users and Consumers
• Tools: SQL, shell
• Consumers: Data analysts, data engineers
Data Access circa 2013
• Logs: Custom FTP transfers
  - Merely extracting data could cause production problems
  - The FTP server could run out of space
• Heavy reliance on canned reports
  - Served via reporting API
  - Updated at most three times a day, usually just once a day
• Hard to keep pace with growing demands
  - Internal clients
  - External clients
Data Liberation
Moving Past Infrastructure
• Resource flexibility
• Fully own our conceptual problems
  - Can’t just get a bigger box or a higher support license
• Lower barrier to entry
  - Decouple storage and computation
Move to the Cloud
• Simple Storage Service (S3):
  - Primary data store; source of truth
  - Append-only. Update = delete + append
• Elastic MapReduce (EMR):
  - Transient Hadoop clusters
  - Spot instances – save money
• Redshift:
  - Columnar storage for efficient querying
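The "Update = delete + append" rule can be illustrated with a toy in-memory store. This is a minimal sketch and not MediaMath's implementation: the `AppendOnlyStore` class, its methods, and the versioned key names are all hypothetical, and no real S3 API is used. Writing the replacement before deleting the old object is a choice made here so that a failed write never loses data.

```python
class AppendOnlyStore:
    """Toy stand-in for an object store whose objects are never modified in place."""

    def __init__(self):
        self._objects = {}

    def append(self, key, data):
        # Refuse to overwrite: existing objects are treated as immutable.
        if key in self._objects:
            raise KeyError(f"{key} already exists; objects are immutable")
        self._objects[key] = data

    def delete(self, key):
        del self._objects[key]

    def update(self, old_key, new_key, data):
        # "Update = delete + append": write the replacement under a new key,
        # then remove the stale object.
        self.append(new_key, data)
        self.delete(old_key)

    def keys(self):
        return sorted(self._objects)


store = AppendOnlyStore()
store.append("logs/2016-02-16/part-0000.v1", b"raw")
store.update("logs/2016-02-16/part-0000.v1",
             "logs/2016-02-16/part-0000.v2", b"corrected")
print(store.keys())
```

Because nothing is edited in place, any reader that already fetched the old object still holds a consistent copy, which is one reason this model suits a source-of-truth store.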
Data Platform – Today
Data Access – Today
Developer Experience
• Get to say “yes” more
  - Rapid development/testing/deployment removes inertia
• Clearly distinct, perfectly synced QA environment
  - Run multiple versions of workflows simultaneously on same source data
• More control over components
• Localized impact of processing
  - Each team uses their own compute environment
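Running several workflow versions against the same source data might look like the sketch below. Everything in it is hypothetical (the records, the two workflow functions, and the `output/...` prefixes); it only illustrates the idea that a production version and a QA candidate can read one shared input while writing to separate locations.

```python
from collections import Counter

# Shared source data, treated as read-only by every workflow version.
SOURCE_RECORDS = (
    {"campaign": "a", "event": "impression"},
    {"campaign": "a", "event": "click"},
    {"campaign": "b", "event": "impression"},
)


def count_events_v1(records):
    """Production version: number of events per campaign."""
    return dict(Counter(r["campaign"] for r in records))


def count_events_v2(records):
    """Candidate version under QA: events per (campaign, event type)."""
    return dict(Counter((r["campaign"], r["event"]) for r in records))


def run_side_by_side(records, workflows):
    # Each version writes under its own output prefix, so results never collide.
    return {f"output/{name}": fn(records) for name, fn in workflows.items()}


results = run_side_by_side(
    SOURCE_RECORDS, {"prod": count_events_v1, "qa": count_events_v2}
)
print(results["output/prod"])
print(results["output/qa"])
```

Since neither version mutates the input, the QA run sees exactly the data production sees.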
We don’t worry about this like we used to
Improved User Experience
• Augmented standard reporting with easily accessible data warehouse
  - AWS + Qubole provides value to all skill levels
• Transparently handle different data sources
  - Bridge storage types and AWS accounts
• Choose your preferred query method
  - Spark, MapReduce, Flink, or BI tool
• All barriers removed
Productize it, cap’n
• Log-level data API
  - Direct log access on S3
• Interactive Query
  - Scalable data processing with Qubole
Hive
Spark
SmartQuery
Clusters
Qubole’s Greatest Hits
Hybrid Life
New and Old
Managing a Hybrid Warehouse
• Upfront effort to keep old and new consistent
  - After that, could migrate in pieces
• Keeping datasets in sync
  - Store metadata about datasets and processes
  - Keep record of what data was processed by which batches
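One way to keep a record of what data was processed by which batches is a small ledger keyed by system and input file. The sketch below is an assumption about how such metadata could be organized, not a description of MediaMath's actual store; `BatchLedger`, the system names, batch IDs, and file paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BatchLedger:
    """Records which input files each batch processed, per system."""

    processed: dict = field(default_factory=dict)  # system -> {file: batch_id}

    def record(self, system, batch_id, files):
        seen = self.processed.setdefault(system, {})
        for name in files:
            seen[name] = batch_id

    def missing_from(self, system, reference):
        """Files the reference system has processed that `system` has not."""
        done = set(self.processed.get(system, {}))
        expected = set(self.processed.get(reference, {}))
        return sorted(expected - done)


ledger = BatchLedger()
ledger.record("old_warehouse", "batch-001", ["logs/h00.gz", "logs/h01.gz"])
ledger.record("cloud", "batch-101", ["logs/h00.gz"])
print(ledger.missing_from("cloud", reference="old_warehouse"))  # ['logs/h01.gz']
```

Comparing the two systems' entries shows which inputs the new warehouse still has to process before the datasets are back in sync.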
Ch-ch-ch-challenges
• Spot instances: bid too low, jobs never start
  - Build processes around selecting best/cheapest zones
• Maintaining two systems at once
  - Consistency, monitoring, updates…
• Migrating mindset
  - New set of questions to answer
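A process for selecting the cheapest zone could be sketched as follows. This is a simplified illustration under stated assumptions: the price history is invented, the `pick_zone` helper is hypothetical, and using the highest recent price in each zone as the threshold is a conservative heuristic chosen here, not the method described in the talk.

```python
def pick_zone(recent_prices, max_bid):
    """Return (zone, price) for the cheapest zone whose highest recent
    Spot price is within the bid, or None if no zone qualifies."""
    candidates = {
        zone: max(prices)
        for zone, prices in recent_prices.items()
        if prices and max(prices) <= max_bid
    }
    if not candidates:
        return None  # bidding here would likely leave the job waiting
    zone = min(candidates, key=candidates.get)
    return zone, candidates[zone]


# Invented hourly prices (USD) per availability zone.
prices = {
    "us-east-1a": [0.12, 0.31, 0.15],
    "us-east-1b": [0.14, 0.16, 0.15],
    "us-east-1c": [0.40, 0.42],
}
print(pick_zone(prices, max_bid=0.20))  # ('us-east-1b', 0.16)
print(pick_zone(prices, max_bid=0.10))  # None
```

Returning `None` instead of a zone makes the "bid too low, jobs never start" case explicit, so a caller can raise the bid or fall back to on-demand capacity.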
What we’ve learned
Life after Liberation
• Decentralize all the things
  - Single-machine -> distributed computing
  - Single data team -> data engineers on all the teams
• Engineers on every team
  - Data Science – Spark (Scala)
  - Analytics – Spark/Hive (with Redshift connector)
  - Product – Hive
  - Engineering – Spark/Hive/MapReduce
  - Business analysts – SmartQuery
Data Access circa 2013 – Users and Consumers
• Tools: SQL, shell
• Consumers: Data analysts, data engineers
Data Access Today – Users and Consumers
• Tools: Hadoop (Scalding, Hive), Spark, RDBMS
• Consumers: Engineers, product managers, business analysts, etc.
The Cost of Decentralization
• Different producers and consumers have different priorities
  - File format, end-to-end latency, correctness, etc.
• Adding a platform layer could add friction
Not Abandoning Managed Infrastructure
or: There and Back Again
• Managed hardware is still important
  - On-premises Hadoop cluster
  - Clients ETL into managed hardware
• Experience with Data Liberation broke down “walled garden” feel of AWS
Some sort of “last slide” title
• Moving DW to cloud has proven itself
  - Quick development allows us to keep pace
  - Ease of use helps teams and clients fine-tune their own reporting
• Re-thinking the tools and skills needed for data warehousing
• Avoid tech debt by evolving our software and ideas before committing to hardware
• Move away from trickle-down data
THANK YOU!
Rory Sawyer
Software Engineer
Data Platform

Contenu connexe

Tendances

Tendances (20)

Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
The Emerging Data Lake IT Strategy
The Emerging Data Lake IT StrategyThe Emerging Data Lake IT Strategy
The Emerging Data Lake IT Strategy
 
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Agile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
Agile Leadership: Guiding DataOps Teams Through Rapid Change and UncertaintyAgile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
Agile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
Michael Stonebraker: Big Data, Disruption, and the 800 Pound Gorilla in the ...
Michael Stonebraker:  Big Data, Disruption, and the 800 Pound Gorilla in the ...Michael Stonebraker:  Big Data, Disruption, and the 800 Pound Gorilla in the ...
Michael Stonebraker: Big Data, Disruption, and the 800 Pound Gorilla in the ...
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
BAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, SydneyBAR360 open data platform presentation at DAMA, Sydney
BAR360 open data platform presentation at DAMA, Sydney
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business Environment
 
Journey to Cloud Analytics
Journey to Cloud Analytics Journey to Cloud Analytics
Journey to Cloud Analytics
 

Similaire à Moving Past Infrastructure Limitations

Apply Machine Learning to Microservices
Apply Machine Learning to MicroservicesApply Machine Learning to Microservices
Apply Machine Learning to Microservices
Kai Wähner
 
opentextrelease16abetterwaytowork-160411183307
opentextrelease16abetterwaytowork-160411183307opentextrelease16abetterwaytowork-160411183307
opentextrelease16abetterwaytowork-160411183307
L. Phillip Urman
 
Managing and Using Information Systems A Strategic Approach –.docx
Managing and Using Information Systems A Strategic Approach –.docxManaging and Using Information Systems A Strategic Approach –.docx
Managing and Using Information Systems A Strategic Approach –.docx
croysierkathey
 

Similaire à Moving Past Infrastructure Limitations (20)

MediaMath - Big Data Warehousing Meetup - 2/16/2016
MediaMath - Big Data Warehousing Meetup - 2/16/2016MediaMath - Big Data Warehousing Meetup - 2/16/2016
MediaMath - Big Data Warehousing Meetup - 2/16/2016
 
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
Big Data: Introducing BigInsights, IBM's Hadoop- and Spark-based analytical p...
 
Apply Machine Learning to Microservices
Apply Machine Learning to MicroservicesApply Machine Learning to Microservices
Apply Machine Learning to Microservices
 
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
HOW TO APPLY BIG DATA ANALYTICS AND MACHINE LEARNING TO REAL TIME PROCESSING ...
 
ICIC 2016: New Product Introduction Deep SEARCH 9
ICIC 2016: New Product Introduction Deep SEARCH 9ICIC 2016: New Product Introduction Deep SEARCH 9
ICIC 2016: New Product Introduction Deep SEARCH 9
 
Insight Platforms Accelerate Digital Transformation
Insight Platforms Accelerate Digital TransformationInsight Platforms Accelerate Digital Transformation
Insight Platforms Accelerate Digital Transformation
 
Attunity Hortonworks Webinar- Sept 22, 2016
Attunity Hortonworks Webinar- Sept 22, 2016Attunity Hortonworks Webinar- Sept 22, 2016
Attunity Hortonworks Webinar- Sept 22, 2016
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
opentextrelease16abetterwaytowork-160411183307
opentextrelease16abetterwaytowork-160411183307opentextrelease16abetterwaytowork-160411183307
opentextrelease16abetterwaytowork-160411183307
 
Delivering Services Powered by Operational Data - Connected Services
Delivering Services Powered by Operational Data -  Connected ServicesDelivering Services Powered by Operational Data -  Connected Services
Delivering Services Powered by Operational Data - Connected Services
 
Big Data LDN 2017: Data Integration & Big Data Management
Big Data LDN 2017: Data Integration & Big Data ManagementBig Data LDN 2017: Data Integration & Big Data Management
Big Data LDN 2017: Data Integration & Big Data Management
 
Managing and Using Information Systems A Strategic Approach –.docx
Managing and Using Information Systems A Strategic Approach –.docxManaging and Using Information Systems A Strategic Approach –.docx
Managing and Using Information Systems A Strategic Approach –.docx
 
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
Fiducia & GAD IT AG: From Fraud Detection to Big Data Platform: Bringing Hado...
 
Industrial Internet of Things: Protocols an Standards
Industrial Internet of Things: Protocols an StandardsIndustrial Internet of Things: Protocols an Standards
Industrial Internet of Things: Protocols an Standards
 
Real-time Distributed Stream Processing @ Scale
Real-time Distributed Stream Processing@ ScaleReal-time Distributed Stream Processing@ Scale
Real-time Distributed Stream Processing @ Scale
 
Findability Day 2016 - Big data analytics and machine learning
Findability Day 2016 - Big data analytics and machine learningFindability Day 2016 - Big data analytics and machine learning
Findability Day 2016 - Big data analytics and machine learning
 
What’s New in OpenText Content Suite 16
What’s New in OpenText Content Suite 16What’s New in OpenText Content Suite 16
What’s New in OpenText Content Suite 16
 
Modern Reporting At Scale - Migration Path for Dummies
Modern Reporting At Scale - Migration Path for DummiesModern Reporting At Scale - Migration Path for Dummies
Modern Reporting At Scale - Migration Path for Dummies
 
Modern Reporting at Scale: How to Distribute Information and Answers to the M...
Modern Reporting at Scale: How to Distribute Information and Answers to the M...Modern Reporting at Scale: How to Distribute Information and Answers to the M...
Modern Reporting at Scale: How to Distribute Information and Answers to the M...
 
Platform as Art: A Developer’s Perspective
Platform as Art: A Developer’s PerspectivePlatform as Art: A Developer’s Perspective
Platform as Art: A Developer’s Perspective
 

Plus de Caserta

Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
Caserta
 

Plus de Caserta (9)

Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Real Time Big Data Processing on AWS
Real Time Big Data Processing on AWSReal Time Big Data Processing on AWS
Real Time Big Data Processing on AWS
 
Big Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data LakeBig Data: Setting Up the Big Data Lake
Big Data: Setting Up the Big Data Lake
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Dernier (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 

Moving Past Infrastructure Limitations

  • 1. ©2016 MediaMath Inc. 1 Moving Past Infrastructure Limitations Presented by MediaMath FEB 16, 2016
  • 2. ©2016 MediaMath Inc. 2 02.16.2016 Rory Sawyer – Software Engineer, Data Platform Moving Past Infrastructure Limitations
  • 4. ©2016 MediaMath Inc. 4 Massive Volume of Data  180 billion impression opportunities a day
  • 5. ©2016 MediaMath Inc. 5 Massive Volume of Data  180 billion impression opportunities a day  3+ million peak qps
  • 6. ©2016 MediaMath Inc. 6 Massive Volume of Data  180 billion impression opportunities a day  3+ million peak qps  3+ TB of data per day (compressed)
  • 7. ©2016 MediaMath Inc. 7 Massive Volume of Data  180 billion impression opportunities a day  3+ million peak qps  3+ TB of data per day (compressed)  Logs represent financial transactions Every record counts!
  • 8. ©2016 MediaMath Inc. 8 MediaMath’s Data Platform  Centralized location for data at MM Collect data from across the company Standardize access for internal and external clients  End-result of data warehouse transformation
  • 9. ©2016 MediaMath Inc. 9 Once Upon A Time The Old Days Etc..
  • 10. ©2016 MediaMath Inc. 10 Architecture – 2013
  • 11. ©2016 MediaMath Inc. 11 Data Warehousing at MM Circa 2013
  • 12. ©2016 MediaMath Inc. 12 Data Warehousing at MM Circa 2013  No proper QA/testing environment
  • 13. ©2016 MediaMath Inc. 13 Data Warehousing at MM Circa 2013  No proper QA/testing environment  Production workflows and ad-hoc analytics ran side-by-side
  • 14. ©2016 MediaMath Inc. 14 Data Warehousing at MM circa 2013  No proper QA/testing environment  Production workflows and ad-hoc analytics ran side-by-side  Scaling becomes an issue Developing/testing/deploying changes to workflows frustrating Copying data to more monolithic systems More shell, more problems
  • 15. ©2016 MediaMath Inc. 15 Data Access circa 2013 – Users and Consumers  Tools: SQL, shell  Consumers: Data analysts, data engineers
  • 16. ©2016 MediaMath Inc. 16 Data Access circa 2013  Logs: Custom FTP transfers Merely extracting data could cause production problems FTP could run out of space
  • 17. ©2016 MediaMath Inc. 17 Data Access circa 2013  Logs: Custom FTP transfers Merely extracting data could cause production problems FTP could run out of space  Heavy reliance on canned reports Served via reporting API Updated at most three times a day, usually just once a day
  • 18. ©2016 MediaMath Inc. 18 Data Access circa 2013  Logs: Custom FTP transfers Merely extracting data could cause production problems FTP could run out of space  Heavy reliance on canned reports Served via reporting API Updated at most three times a day, usually just once a day  Hard to keep pace with growing demands Internal Clients External Clients
  • 19. ©2016 MediaMath Inc. 19 Data Liberation
  • 20. ©2016 MediaMath Inc. 20 Moving Past Infrastructure  Resource flexibility
  • 21. ©2016 MediaMath Inc. 21 Moving Past Infrastructure  Resource flexibility  Fully own our conceptual problems Can’t just get a bigger box or a higher support license
  • 22. ©2016 MediaMath Inc. 22 Moving Past Infrastructure  Resource flexibility  Fully own our conceptual problems Can’t just get a bigger box or a higher support license  Lower barrier to entry
  • 23. ©2016 MediaMath Inc. 23 Moving Past Infrastructure  Resource flexibility  Fully own our conceptual problems Can’t just get a bigger box or a higher support license  Lower barrier to entry Decouple storage and computation
  • 24. Move to the Cloud
  • 25. Move to the Cloud
      - Simple Storage Service (S3):
          - Primary data store; source of truth
          - Append-only. Update = delete + append
  • 26. Move to the Cloud
      - Simple Storage Service (S3):
          - Primary data store; source of truth
          - Append-only. Update = delete + append
      - Elastic MapReduce (EMR):
          - Transient Hadoop clusters
          - Spot instances – save money
  • 27. Move to the Cloud
      - Simple Storage Service (S3):
          - Primary data store; source of truth
          - Append-only. Update = delete + append
      - Elastic MapReduce (EMR):
          - Transient Hadoop clusters
          - Spot instances – save money
      - Redshift:
          - Columnar storage for efficient querying
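The "Append-only. Update = delete + append" rule on the S3 bullet can be illustrated with a small in-memory stand-in. This is only a sketch under stated assumptions: `AppendOnlyStore`, its methods, and the example key are hypothetical names invented for illustration, and nothing here calls S3 or reflects MediaMath's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyStore:
    """Hypothetical in-memory stand-in for an object store.

    Objects are never modified in place: they can only be added or removed.
    """

    objects: dict = field(default_factory=dict)

    def append(self, key: str, data: bytes) -> None:
        # Refuse to overwrite: an existing object is treated as immutable.
        if key in self.objects:
            raise KeyError(f"{key} already exists; objects are immutable")
        self.objects[key] = data

    def delete(self, key: str) -> None:
        del self.objects[key]

    def update(self, key: str, data: bytes) -> None:
        # "Update = delete + append": drop the old object, then write a new one.
        self.delete(key)
        self.append(key, data)


store = AppendOnlyStore()
store.append("logs/2016-02-16/part-0", b"v1")
store.update("logs/2016-02-16/part-0", b"v2")
```

Treating objects as immutable keeps the primary store simple to reason about: a reader sees either the old object or the new one, never a partially edited file.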
  • 28. Data Platform – Today
  • 29. Data Platform – Today
  • 30. Data Access – Today
  • 31. Developer Experience
  • 32. Developer Experience
      - Get to say “yes” more
          - Rapid development/testing/deployment removes inertia
  • 33. Developer Experience
      - Get to say “yes” more
          - Rapid development/testing/deployment removes inertia
      - Clearly distinct, perfectly synced QA environment
          - Run multiple versions of workflows simultaneously on same source data
  • 34. Developer Experience
      - Get to say “yes” more
          - Rapid development/testing/deployment removes inertia
      - Clearly distinct, perfectly synced QA environment
          - Run multiple versions of workflows simultaneously on same source data
      - More control over components
  • 35. Developer Experience
      - Get to say “yes” more
          - Rapid development/testing/deployment removes inertia
      - Clearly distinct, perfectly synced QA environment
          - Run multiple versions of workflows simultaneously on same source data
      - More control over components
      - Localized impact of processing
          - Each team uses their own compute environment
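Running multiple versions of a workflow on the same source data, as the QA bullet describes, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the two workflow functions, the record fields `campaign` and `test`, and the `diff_outputs` helper are hypothetical and do not come from the talk.

```python
from collections import Counter


def count_by_campaign_v1(records):
    """Hypothetical production workflow: count records per campaign."""
    return Counter(r["campaign"] for r in records)


def count_by_campaign_v2(records):
    """Hypothetical candidate workflow: same count, but skips test traffic."""
    return Counter(r["campaign"] for r in records if not r.get("test", False))


def diff_outputs(source, prod, candidate):
    """Run both versions on the same read-only input and report differences.

    Returns {key: (prod_value, candidate_value)} for keys whose values differ.
    """
    prod_out, cand_out = prod(source), candidate(source)
    keys = sorted(set(prod_out) | set(cand_out))
    return {
        k: (prod_out.get(k, 0), cand_out.get(k, 0))
        for k in keys
        if prod_out.get(k, 0) != cand_out.get(k, 0)
    }


# The source is a tuple so neither version can append to or reorder it.
source = (
    {"campaign": "a"},
    {"campaign": "a", "test": True},
    {"campaign": "b"},
)
differences = diff_outputs(source, count_by_campaign_v1, count_by_campaign_v2)
```

Because both versions read the same input, any difference in output is attributable to the code change rather than to drift between a QA copy and production data.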
  • 36. We don’t worry about this like we used to
  • 37. Improved User Experience
  • 38. Improved User Experience
      - Augmented standard reporting with easily-accessible data warehouse
          - AWS + Qubole provides value to all skill levels
  • 39. Improved User Experience
      - Augmented standard reporting with easily-accessible data warehouse
          - AWS + Qubole provides value to all skill levels
      - Transparently handle different data sources
          - Bridge storage types and AWS accounts
  • 40. Improved User Experience
      - Augmented standard reporting with easily-accessible data warehouse
          - AWS + Qubole provides value to all skill levels
      - Transparently handle different data sources
          - Bridge storage types and AWS accounts
      - Choose your preferred query method
          - Spark, MapReduce, Flink, or BI tool
  • 41. Improved User Experience
      - Augmented standard reporting with easily-accessible data warehouse
          - AWS + Qubole provides value to all skill levels
      - Transparently handle different data sources
          - Bridge storage types and AWS accounts
      - Choose your preferred query method
          - Spark, MapReduce, Flink, or BI tool
      - All barriers removed
  • 42. Productize it, cap’n
  • 43. Productize it, cap’n
      - Log level data API
          - Direct log access on S3
      - Interactive Query
          - Scalable data processing with Qubole
  • 46. SmartQuery
  • 47. Clusters
  • 48. Qubole’s Greatest Hits
  • 49. Hybrid Life
  • 50. New and Old
  • 51. Managing a Hybrid Warehouse
  • 52. Managing a Hybrid Warehouse
      - Upfront effort to keep old and new consistent
          - After that, could migrate in pieces
  • 54. Managing a Hybrid Warehouse
      - Upfront effort to keep old and new consistent
          - After that, could migrate in pieces
      - Keeping datasets in sync
          - Store metadata about datasets and processes
          - Keep record of what data was processed by which batches
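One way to picture "keep record of what data was processed by which batches" is a small ledger keyed by system and workflow. The sketch below is an assumption-laden illustration, not the presenters' implementation: `BatchLedger`, the system labels `"old"` and `"new"`, and the file names are all hypothetical.

```python
from collections import defaultdict


class BatchLedger:
    """Hypothetical metadata store: which inputs each batch has processed."""

    def __init__(self):
        # (system, workflow) -> set of input paths already processed
        self._processed = defaultdict(set)
        # Full history of batches, kept for auditing
        self.batches = []

    def record(self, system, workflow, batch_id, inputs):
        """Log that batch_id on the given system consumed these inputs."""
        self.batches.append((system, workflow, batch_id, tuple(inputs)))
        self._processed[(system, workflow)].update(inputs)

    def pending(self, system, workflow, available):
        """Inputs that exist but have not yet been processed on this system."""
        return sorted(set(available) - self._processed[(system, workflow)])

    def out_of_sync(self, workflow, old="old", new="new"):
        """Inputs processed by exactly one of the two systems."""
        return sorted(
            self._processed[(old, workflow)] ^ self._processed[(new, workflow)]
        )


ledger = BatchLedger()
ledger.record("old", "impressions", "b1", ["a.log", "b.log"])
ledger.record("new", "impressions", "b7", ["a.log"])
todo = ledger.pending("new", "impressions", ["a.log", "b.log", "c.log"])
drift = ledger.out_of_sync("impressions")
```

With a record like this, each side of a hybrid warehouse can be asked what it still has to process, and the two sides can be compared directly while workflows are migrated in pieces.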
  • 55. Ch-ch-ch-challenges
  • 56. Ch-ch-ch-challenges
      - Spot instances: bid too low, jobs never start
          - Build processes around selecting best/cheapest zones
  • 57. Ch-ch-ch-challenges
      - Spot instances: bid too low, jobs never start
          - Build processes around selecting best/cheapest zones
      - Maintaining two systems at once
          - Consistency, monitoring, updates…
  • 58. Ch-ch-ch-challenges
      - Spot instances: bid too low, jobs never start
          - Build processes around selecting best/cheapest zones
      - Maintaining two systems at once
          - Consistency, monitoring, updates…
      - Migrating mindset
          - New set of questions to answer
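A process for "selecting best/cheapest zones" could be as simple as the heuristic sketched below. It is an assumption, not the approach described in the talk: the function name, the `headroom` parameter, the zone labels, and the prices are all made up for illustration, and no AWS API is called.

```python
def pick_zone_and_bid(price_history, on_demand_price, headroom=1.2):
    """Choose a zone and a bid from recent spot prices (hypothetical heuristic).

    price_history maps zone name -> list of recent spot prices in USD/hour.
    The zone with the lowest recent peak is chosen, and the bid is that peak
    plus some headroom, capped at the on-demand price.
    """
    # Rank by peak rather than average: a bid below the peak would have
    # left the job waiting at some point in the observed window.
    zone = min(price_history, key=lambda z: max(price_history[z]))
    bid = min(max(price_history[zone]) * headroom, on_demand_price)
    return zone, round(bid, 4)


history = {
    "zone-a": [0.10, 0.30],  # cheap on average, but spiky
    "zone-b": [0.12, 0.15],  # steady
    "zone-c": [0.20, 0.22],
}
zone, bid = pick_zone_and_bid(history, on_demand_price=0.50)
```

Capping the bid at the on-demand price reflects the point of the bullet: bidding too low means jobs never start, but paying more than on-demand removes the reason to use spot capacity at all.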
  • 59. What we’ve learned
  • 60. Life after Liberation
  • 61. Life after Liberation
      - Decentralize all the things
          - Single-machine -> distributed computing
          - Single data team -> data engineers on all the teams
  • 62. Life after Liberation
      - Decentralize all the things
          - Single-machine -> distributed computing
          - Single data team -> data engineers on all the teams
      - Engineers on every team
          - Data Science – Spark (Scala)
          - Analytics – Spark/Hive (with Redshift connector)
          - Product – Hive
          - Engineering – Spark/Hive/MapReduce
          - Business analysts – SmartQuery
  • 63. Data Access circa 2013 – Users and Consumers
      - Tools: SQL, shell
      - Consumers: Data analysts, data engineers
  • 64. Data Access Today – Users and Consumers
      - Tools: Hadoop (Scalding, Hive), Spark, RDBMS
      - Consumers: Engineers, product managers, business analysts, etc.
  • 65. The Cost of Decentralization
  • 66. The Cost of Decentralization
      - Different producers and consumers have different priorities
          - File format, end-to-end latency, correctness, etc…
  • 67. The Cost of Decentralization
      - Different producers and consumers have different priorities
          - File format, end-to-end latency, correctness, etc…
      - Adding a platform layer could add friction
  • 68. Not Abandoning Managed Infrastructure or: There and Back Again
  • 69. Not Abandoning Managed Infrastructure or: There and Back Again
      - Managed hardware is still important
          - On-premises Hadoop cluster
          - Clients ETL into managed hardware
  • 70. Not Abandoning Managed Infrastructure or: There and Back Again
      - Managed hardware is still important
          - On-premises Hadoop cluster
          - Clients ETL into managed hardware
      - Experience with Data Liberation broke down “walled garden” feel of AWS
  • 71. Some sort of “last slide” title
  • 72. Some sort of “last slide” title
      - Moving DW to cloud has proven itself
          - Quick development allows us to keep pace
          - Ease of use helps teams and clients fine-tune their own reporting
  • 73. Some sort of “last slide” title
      - Moving DW to cloud has proven itself
          - Quick development allows us to keep pace
          - Ease of use helps teams and clients fine-tune their own reporting
      - Re-thinking the tools and skills needed for data warehousing
  • 74. Some sort of “last slide” title
      - Moving DW to cloud has proven itself
          - Quick development allows us to keep pace
          - Ease of use helps teams and clients fine-tune their own reporting
      - Re-thinking the tools and skills needed for data warehousing
      - Avoid tech debt by evolving our software and ideas before committing to hardware
  • 75. Some sort of “last slide” title
      - Moving DW to cloud has proven itself
          - Quick development allows us to keep pace
          - Ease of use helps teams and clients fine-tune their own reporting
      - Re-thinking the tools and skills needed for data warehousing
      - Avoid tech debt by evolving our software and ideas before committing to hardware
      - Move away from trickle-down data
  • 76. THANK YOU! Rory Sawyer, Software Engineer, Data Platform