Data analytics at a PB scale
200 researchers using just Presto and a data-lake
Or Koren | Head of Data @ ironSource
About Me ● Married ● 35 ● Tel Aviv ● Coding
Agenda:
ironSource overview
The past: 2016-2017
The present: 2018-2019
The future: 2020-2021
ironSource overview
[Map of offices: San Francisco and New York (United States), London (United Kingdom), Berlin (Germany), Kiev (Ukraine), Tel Aviv (Israel), Bangalore (India), Hong Kong (China), Tokyo (Japan), Seoul (South Korea), and Beijing, Shanghai and Shenzhen (China). Headcounts shown on the map: 53, 7, 11, 1, 39, 5, 3, 1, 5, 624, 30 (office-to-number mapping not preserved).]
ironSource Overview
● Established: Sep. 2010
● Acquisitions to date: 8
● Employees: 779
● R&D employees: 395
ironSource Solutions & Products
● Developer Solutions: in-app advertising network and mediation platform for app developers
   Products & platforms: ironSource Mediation, In-App Advertising Network
● Enterprise Solutions: engagement platform for carriers & OEMs
   Products & platforms: ironSource Aura
● Digital Solutions: software delivery platform and B2C security products
   Products & platforms: Delivery & Ad Monetization Platform, Security products
* In advanced negotiations
The past
Our old data architecture
● 10 Redshift clusters
● 5 RDS clusters
● 1000+ ETLs
● 1 Tableau
● Hard to scale
● Hard to maintain
● Hard to work with
● Limited data
● Expensive
Our data scientist…
To the future
● Lifetime data
● Fast SQL
● Easy scale
● Data science
● Open source
The present
Data Lake
Parquet
● File based
● Open Source
● Column oriented
● S3 bucket
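Why column orientation matters for analytics can be shown with a toy sketch. This is not Parquet itself: the sample events, field names, and JSON encoding below are illustrative assumptions, and real Parquet adds compression, encodings, and row groups on top of the same idea.

```python
import json

# Hypothetical sample events; field names are illustrative only.
events = [
    {"user_id": 1, "country": "IL", "revenue": 0.25},
    {"user_id": 2, "country": "US", "revenue": 1.10},
    {"user_id": 3, "country": "US", "revenue": 0.40},
]

def to_row_layout(rows):
    """Row-oriented: each record is stored whole, one after another."""
    return [json.dumps(r).encode() for r in rows]

def to_column_layout(rows):
    """Column-oriented: all values of one field are stored together."""
    columns = {}
    for r in rows:
        for name, value in r.items():
            columns.setdefault(name, []).append(value)
    return {name: json.dumps(vals).encode() for name, vals in columns.items()}

def bytes_scanned_row(row_store):
    # Every record has to be read in full, even if one field is needed.
    return sum(len(chunk) for chunk in row_store)

def bytes_scanned_column(col_store, needed):
    # Only the chunks of the requested columns are read.
    return sum(len(col_store[name]) for name in needed)

def total_revenue(col_store):
    # An aggregate over one column touches only that column's chunk.
    return sum(json.loads(col_store["revenue"]))

row_store = to_row_layout(events)
col_store = to_column_layout(events)

print("row layout scan:   ", bytes_scanned_row(row_store), "bytes")
print("column layout scan:", bytes_scanned_column(col_store, ["revenue"]), "bytes")
```

A query that sums a single field reads far fewer bytes from the column layout, which is the property that keeps scans over files in an S3 bucket affordable.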
Hive
Apache Hive is a data warehouse software project
which was built on top of Apache Hadoop in order
to provide data query and analysis.
● One place to rule them all
● Hadoop Ecosystem
● Presto
● Spark
● Athena
Data Lake
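One way to picture the "one place" role of Hive is as a shared table definition that points at Parquet files in S3, so that any engine reading the metastore sees the same schema. The sketch below only builds a Hive DDL string; the table name, columns, bucket, and partition column are hypothetical and not taken from the talk.

```python
def external_table_ddl(table, columns, location, partition_by=None):
    """Build a Hive CREATE EXTERNAL TABLE statement for Parquet files.

    columns and partition_by are lists of (name, hive_type) pairs.
    """
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)"
    if partition_by:
        parts = ", ".join(f"{name} {hive_type}" for name, hive_type in partition_by)
        ddl += f"\nPARTITIONED BY ({parts})"
    ddl += f"\nSTORED AS PARQUET\nLOCATION '{location}'"
    return ddl

# Hypothetical table and bucket, for illustration only.
ddl = external_table_ddl(
    table="analytics.events",
    columns=[("user_id", "BIGINT"), ("event", "STRING"), ("revenue", "DOUBLE")],
    location="s3://example-bucket/events/",
    partition_by=[("dt", "STRING")],
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the definition; the files in the bucket stay where they are.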
Presto & Qubole
Qubole delivers a self-service platform for big data analytics built on Amazon Web Services, Microsoft and Google clouds.
● Scalable clusters for every query: Qubole scales clusters up and down by looking over the execution plans of the queries.
● Spots
● Maintenance & versions: Qubole takes care of new versions and 24/7 support.
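The idea of sizing a cluster from query plans can be sketched roughly as follows. This is not Qubole's actual algorithm, which the slides do not describe; the function name, the use of pending splits (Presto's unit of scan work) as the load signal, and all thresholds are assumptions made for illustration.

```python
import math

def desired_workers(pending_splits, splits_per_worker, min_workers, max_workers):
    """Pick a worker count for the estimated work, clamped to cluster limits.

    pending_splits: estimated units of work across planned queries (assumed input).
    splits_per_worker: how many splits one worker is assumed to handle.
    """
    if splits_per_worker <= 0:
        raise ValueError("splits_per_worker must be positive")
    needed = math.ceil(pending_splits / splits_per_worker)
    return max(min_workers, min(max_workers, needed))

# Idle cluster stays at the floor, heavy load is capped at the ceiling.
print(desired_workers(0, 100, min_workers=2, max_workers=50))
print(desired_workers(1050, 100, min_workers=2, max_workers=50))
print(desired_workers(100_000, 100, min_workers=2, max_workers=50))
```

Clamping to a minimum keeps a warm cluster for interactive users, and the maximum bounds cost when a very large query arrives.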
Auto scaling demo
Presto UI
Our Volume
● 500 TB daily scan (from S3)
● 70K daily queries over Presto
● 200 users
● 500 dashboards
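To put these figures in proportion, a back-of-the-envelope calculation helps. It assumes decimal units (1 TB = 10^12 bytes) and treats the numbers as simple daily averages; the slides do not say how queries are split between people and dashboards, so the per-user ratio is only indicative.

```python
TB = 10**12
GB = 10**9

daily_scan_bytes = 500 * TB   # from the slide
daily_queries = 70_000        # from the slide
users = 200                   # from the slide

# Average data scanned per query, in GB.
avg_scan_per_query_gb = daily_scan_bytes / daily_queries / GB

# Queries per user per day, counting dashboard-driven queries too.
queries_per_user = daily_queries / users

print(f"average scan per query: {avg_scan_per_query_gb:.2f} GB")
print(f"queries per user per day: {queries_per_user:.0f}")
```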
Our data scientist …
Our old data architecture
● 10 Redshift clusters
● 5 RDS clusters
● 1000+ ETLs
● 1 Tableau
● Hard to scale
● Hard to maintain
● Hard to work with
● Limited data
● Expensive
Our new data architecture
● 1 Redshift cluster
● 0 RDS clusters
● 300 ETLs
● 1 Tableau & 1 Re/dash
● Reduce costs by 50%
● Agile to the business
The future
● Replace 90% of our ETLs with ELTs
● Help our data science team by making the logic clearer, reducing their work time by 80%
● Keep raw data without any manipulation
● Reduce ML model deployment time by 50%
● No ETL time, no schedule
The new ETL is ELT: Extract, Load, Transform.
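The ELT pattern can be illustrated with a small sketch: raw rows are loaded untouched, and the transformation lives in a SQL view that runs at query time. SQLite from the Python standard library stands in for Presto over the data lake here; the table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: store the raw events exactly as they arrived.
conn.execute(
    "CREATE TABLE raw_events (user_id INTEGER, event TEXT, revenue_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        (1, "impression", 0),
        (1, "purchase", 250),
        (2, "purchase", 100),
        (2, "impression", 0),
    ],
)

# Transform: the business logic is a view, evaluated when it is queried.
# There is no scheduled job to wait for, and the logic stays readable as SQL.
conn.execute(
    """
    CREATE VIEW revenue_per_user AS
    SELECT user_id, SUM(revenue_cents) / 100.0 AS revenue
    FROM raw_events
    WHERE event = 'purchase'
    GROUP BY user_id
    """
)

rows = conn.execute(
    "SELECT user_id, revenue FROM revenue_per_user ORDER BY user_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(rows, raw_count)
```

If the revenue definition changes, only the view is edited; the raw table is never rewritten, so older data can be re-read under the new logic.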
Presto Connectors
● Kafka: real-time alerts over Presto
● ScyllaDB: increase our insights with our ML models
● Elasticsearch: join business KPIs with R&D logs
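Joining data across connectors means one SQL statement addresses tables that live in different systems; in Presto each system appears as a catalog and tables are named catalog.schema.table. The sketch below imitates that with two attached SQLite databases, so it is only an analogy: the database names, tables, and values are hypothetical stand-ins for a KPI store and an Elasticsearch log index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A second, independent database plays the role of another catalog.
conn.execute("ATTACH DATABASE ':memory:' AS logs")

conn.execute("CREATE TABLE main.kpis (service TEXT, daily_revenue REAL)")
conn.execute("CREATE TABLE logs.errors (service TEXT, message TEXT)")

conn.executemany(
    "INSERT INTO main.kpis VALUES (?, ?)",
    [("ads", 1200.0), ("mediation", 800.0)],
)
conn.executemany(
    "INSERT INTO logs.errors VALUES (?, ?)",
    [("ads", "timeout"), ("ads", "bad request"), ("mediation", "timeout")],
)

# One query joins business KPIs with R&D logs across the two sources.
result = conn.execute(
    """
    SELECT k.service, k.daily_revenue, COUNT(e.message) AS error_count
    FROM main.kpis AS k
    LEFT JOIN logs.errors AS e ON e.service = k.service
    GROUP BY k.service, k.daily_revenue
    ORDER BY k.service
    """
).fetchall()
print(result)
```

The LEFT JOIN keeps services that logged no errors, which is usually what a KPI dashboard wants.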
Key notes to take home
● Data lake: Keep all your raw data in one place. It will help you in the future with costs, research, reduced resources and ML models.
● Qubole: Enjoy the benefits of third-party companies and keep working on your business.
● Scale: Reach endless data with big clusters that scale per query.
● ELT: Move 90% of your ETLs to ELTs to reduce lags and costs.
● Agile: Promote your business with quick insights.
● Free to learn: Take 10% of your time and learn! Try and play with the data :)
Thank You
Or Koren
or@ironsrc.com
Linkedin: korenor

Data analytics at a petabyte scale final

  • 1. Data analytics at a PB scale 200 researchers using just Presto and a data-lake Or Koren | Head of Data @ ironSource
  • 2. About Me ● Married ● 35 ● Tel Aviv ● Coding
  • 3. Agenda: ironSource overview The past: 2016-2017 The present: 2018-2019 The future: 2020-2021
  • 5. ESTABLISHE D ACQUISITIONS TO DATE EMPLOYEES San Francisco United States New York United States London United Kingdom Berlin Germany Kiev Ukraine Tel Aviv Israel Bangalore India Hong Kong China Tokyo Japan Seoul South Korea Beijing Shanghai Shenzhen China 53 7 11 1 39 5 3 1 5 624 30 ironSource Overview ESTABLISHED SEP. 2010 ACQUISITIONS TO DATE 8 EMPLOYEES 779 R&D EMPLOYEES 395
  • 6. ironSource Solutions & Products Developer Solutions In-app advertising network and mediation platform for app developers PRODUCTS & PLATFORMS ironSource Mediation In-App Advertising Network PRODUCTS & PLATFORMS Enterprise Solutions Engagement platform for Carriers & OEMs PRODUCTS & PLATFORMS ironSource Aura PRODUCTS & PLATFORMS Digital Solutions Software delivery platform and B2C security products PRODUCTS & PLATFORMS Delivery & Ad Monetization Platform Security products PRODUCTS & PLATFORMS * *In advanced negotiations
  • 8. Our old data architecture ● 10 redshift clusters ● 5 RDS clusters ● 1000+ ETLs ● 1 Tableau ● Hard to scale ● Hard to maintain ● Hard to work ● Limited data ● Expensive
  • 10. To the future ● Lifetime data ● Fast SQL ● Easy scale ● Data science ● Open source
  • 12. Data Lake Parquet ● Files based ● Open Source ● Column oriented ● S3 bucket
  • 13. Hive Apache Hive is a data warehouse software project which was built on top of Apache Hadoop in order to provide data query and analysis. ● One place to rule them all ● Hadoop Ecosystem ● Presto ● Spark ● Athena Data Lake
  • 14. Presto & Qubole Qubole delivers a Self-Service Platform for Big Data Analytics built on Amazon Web Services, Microsoft and Google Clouds. ● Scalable Clusters: Qubole configuration scales clusters up and down by looking over the execution plan of the queries, for every query ● Spots ● Maintenance & Versions: Qubole takes care of new versions & 24/7 support
  • 16. Our Volume 500TB Daily scan (from S3) 70K Daily queries over Presto 200 Users 500 Dashboards
  • 18. Our data architecture ● 10 redshift clusters ● 5 RDS clusters ● 1000+ ETLs ● 1 Tableau ● Hard to scale ● Hard to maintain ● Hard to work ● Limited data ● Expensive | Our new data architecture ● 1 redshift cluster ● 0 RDS clusters ● 300 ETLs ● 1 Tableau & 1 Re/dash ● Reduce costs by 50% ● Agile to the business
  • 20. The New ETL is ELT: Extract, Load, Transform. ● Replace 90% of our ETLs to ELTs ● Help our data science team by being more clear on the logic, reducing their work time by 80% ● Keeping raw data without any manipulation. Reduce ML Model deployment time by 50% ● No ETL time - no schedule
  • 21. Presto Connectors ● Kafka: Real-time alerts over presto ● ScyllaDB: Increase our insights with our ML models ● Elasticsearch: Join business KPIs with R&D logs
  • 22. Key notes to take home ● Data-Lake: Keep all your raw data in one place. It will help you in the future with costs, research, reduce resources and ML models ● Qubole: Enjoy the benefits of 3rd party companies and continue to work on your business ● Scale: Reach endless data with big clusters that scale per query ● ELT: Move 90% of your ETLs to ELTs, to reduce lags and costs ● Agile: Promote your business with quick insights ● Free to Learn: Take 10% of your time and learn! Try and play with the data :)
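
The volume figures on slide 16 imply some per-query averages. A rough back-of-the-envelope sketch in Python, assuming decimal units (1 TB = 1000 GB) and an even spread over the day, neither of which is stated in the talk:

```python
# Figures taken from the "Our Volume" slide.
DAILY_SCAN_TB = 500      # data scanned from S3 per day
DAILY_QUERIES = 70_000   # Presto queries per day
USERS = 200

# Assumptions (not from the talk): decimal units and a uniform load over 24 hours.
GB_PER_TB = 1000
SECONDS_PER_DAY = 24 * 60 * 60

avg_scan_gb_per_query = DAILY_SCAN_TB * GB_PER_TB / DAILY_QUERIES
queries_per_second = DAILY_QUERIES / SECONDS_PER_DAY
queries_per_user_per_day = DAILY_QUERIES / USERS

print(f"Average scan per query: {avg_scan_gb_per_query:.1f} GB")       # 7.1 GB
print(f"Average query rate:     {queries_per_second:.2f} queries/s")   # 0.81 queries/s
print(f"Queries per user/day:   {queries_per_user_per_day:.0f}")       # 350
```

These are averages only; the real load is likely bursty rather than evenly spread.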

Editor's notes

  1. Good morning everyone! I am Or, and today I will show you how we use Presto and a data lake at PB scale.
  2. Before we start, I want to tell you a bit about myself and my team. This picture was taken at last Purim's costume party, near here, at Hangar 11. For those who noticed: I hurt my knee 4 weeks ago skiing in Valmorel, France, so I had to sit in the sun, drink and relax… I am married, 35, and live in Tel Aviv. And coding has been my life since I was eleven…
  3. I will tell you a bit about ironSource. Then I will take you on a journey through time, from 2016 (before Presto) until today (with Presto), and on to what we are going to use in the future.
  4. ironSource was founded 10 years ago. We have almost 800 employees, and more than 50% of us are in R&D. Our headquarters and R&D center are located in Tel Aviv, and we have 9 more offices around the world.
  5. ironSource has a few different business divisions. Developer solutions: this division focuses on providing tools and technology to mobile app developers, specifically game developers. We offer an SDK which essentially enables developers to run ads in their app to make more money. We are very strong in rewarded video, so if any of you are gamers, you may be familiar with the moment in a game when you run out of lives and you are offered a rewarded video to watch in order to continue playing. That’s an example of what we do. Enterprise solutions: this division focuses on helping mobile device manufacturers and mobile carriers engage with their customers. Instead of having 20 different applications pre-installed on their device, users have the power to set up their device the way they want to, with the apps they really want and need. Digital solutions: this is my division. We focus on the desktop world (Mac, PC). We help software developers with technologies that help them monetize their software and distribute it to new users.
  6. Let's have a look back at our architecture. We had 10 different Redshift clusters: one each for BI, researchers, R&D, data science, real-time data, historic data, DWH, QA, critical ETLs, and backups. As you can imagine, it was really hard to work with. We had 5 RDS clusters, mainly for our applications (like OLAP). We had more than 1000 ETLs. We had 1 Tableau server. And it was really hard. Hard to scale: Redshift scales very slowly, taking from a few hours up to days. Hard to maintain: we had to VACUUM tables, delete old data, and move data from one Redshift cluster to another. Hard to work, from two aspects: not all the tables were on the same cluster, and 30% of our clusters' power went just to inserting the data. Limited data: we could not insert all the data into one cluster. Very expensive.
  7. This is how our data scientist looked at the time. Or even like this.
  8. So we stopped and thought about where we want to be in the future. First of all, we wanted lifetime data, which is very important to our business. Fast SQL: we wanted SQL that is fast enough for our dashboard usage. We wanted the ability to scale very fast. We focused on our data science team, as we knew we were going to grow the team and the number of ML models. Open source: we did not want to be tied to a certain company.
  9. So we started to create our data lake. We chose Parquet files, which are open source and column oriented. We keep all of our data in S3, converting it from JSON into Parquet in near-real-time batch operations.
  10. Hive & Hive Metastore: we have one source of truth for our table definitions, which works perfectly with any Hadoop ecosystem tool, such as Presto, Spark, Athena and more.
  11. Presto and Qubole. We use Presto, via Qubole, to query our data lake. Qubole is a self-service platform that lets us configure Presto clusters that scale easily and use spot instances, and they take care of maintenance, new versions and 24-hour support. Once you have configured your cluster, it can grow by itself, from 3 to 50 nodes for example, within seconds… And that is done for every query you run.
  12. Let’s see an example of auto scaling over Presto. I ran around 50 different dashboards that use Presto and saved a snapshot of the Presto UI every few seconds. As you can see, at the start there are 3 nodes and 4 queries. As I run the dashboards, the number of nodes increases along with the number of queries. After all queries have finished, the cluster shrinks back to normal.
  13. A bit about our volume. We have around 70 thousand queries running via Presto every day. We have 200 users and 500 dashboards, and growing. And we scan half a petabyte per day, just from S3, without counting Presto's caching.
  14. Remember our data scientist? Well, I think this is the best picture to explain how he feels.
  15. Let's see how our architecture looks today. We eliminated 9 of our Redshift clusters and kept only one for Finance/DWH. We eliminated all our RDS clusters; all the data is stored in the data lake. We cut 70% of our ETLs, as we don’t need to move data from one place to another. We added a Re/Dash server alongside our Tableau server. Re/Dash is an open source BI tool; we use Re/Dash for short-term solutions and Tableau for long-term solutions. By adding the data lake, all of our problems disappeared! In addition, we reduced our costs by 50%! And most important, we became much more agile to the business: instead of delivering first insights for a new project in 2 to 8 weeks, we deliver them on the first day, or even in the first hour!
  16. What we expect to use more in the future.
  17. First of all, ELT. The new ETL is ELT. If you don’t know what ELT is: Extract, Load, Transform. It means you create your business logic in one big query (or view). We are going to take around 90% of our ETLs and move them to ELT. Why ELT? 1. Data science: we see that ELT reduces the data science work by 80%! The main reason is that they can create a dataset within minutes, by cloning the ELT of a specific business unit and adding more features. 2. Deployment: ELT helps data engineers deploy the ML model, since all the raw data is in one place and the model was built on this data and not on aggregated data. 3. No lag: no scheduler, so you become more real-time.
  18. We are going to increase our usage of Presto connectors. Kafka: we are going to move our alerts system (for business KPIs) from the data lake to Kafka, to get faster findings in real time. ScyllaDB: increase our insights into ScyllaDB for our ML models. Elasticsearch: we use Elasticsearch via Kibana to monitor server logs and R&D logs, and we see a strong need to be able to join those logs with business KPIs.
  19. A few notes to take home. Data lake: keep all your data in one place; it will save you time, effort and money. Qubole: use big data services like Qubole so you can focus on your business and not on maintenance. Scale: Presto scaling just works perfectly. ELT: don’t do ETLs, you don’t need them anymore. Agile: with Presto, you can be much more agile to your business. Free to learn: as I always encourage my team to do, take 10% of your time to learn and play with the data.
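
Note 17 describes ELT as loading raw data untouched and putting the business logic in a query or view. A minimal sketch of that idea, using Python's built-in sqlite3 as a stand-in for Presto over the data lake; the table, column and view names are hypothetical and not taken from the talk:

```python
import sqlite3

# sqlite3 stands in for Presto here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")

# Extract + Load: raw events are stored as-is, with no transformation.
conn.execute("CREATE TABLE raw_events (event_day TEXT, app TEXT, revenue_usd REAL)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [
        ("2019-05-01", "game_a", 1.5),
        ("2019-05-01", "game_a", 2.5),
        ("2019-05-01", "game_b", 4.0),
        ("2019-05-02", "game_a", 3.0),
    ],
)

# Transform: the business logic lives in a view that is evaluated at query time,
# so there is no scheduled job and no lag behind the raw data.
conn.execute(
    """
    CREATE VIEW daily_revenue AS
    SELECT event_day, app, SUM(revenue_usd) AS revenue_usd, COUNT(*) AS events
    FROM raw_events
    GROUP BY event_day, app
    """
)


def daily_revenue(connection):
    """Return the view's rows ordered by day and app."""
    return connection.execute(
        "SELECT event_day, app, revenue_usd, events FROM daily_revenue "
        "ORDER BY event_day, app"
    ).fetchall()


rows = daily_revenue(conn)

# A newly loaded raw row is visible through the view immediately.
conn.execute("INSERT INTO raw_events VALUES ('2019-05-02', 'game_b', 1.0)")
rows_after = daily_revenue(conn)
```

Because the view reads the raw table directly, cloning it and adding columns is enough to produce a new dataset, which is the workflow the note attributes to the data science team.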