5. [World map of ironSource offices with per-office employee counts: San Francisco and New York (United States), London (United Kingdom), Berlin (Germany), Kiev (Ukraine), Tel Aviv (Israel), Bangalore (India), Hong Kong, Beijing, Shanghai, and Shenzhen (China), Tokyo (Japan), Seoul (South Korea)]
ironSource Overview
ESTABLISHED
SEP. 2010
ACQUISITIONS TO DATE
8
EMPLOYEES
779
R&D EMPLOYEES
395
8. Our old data architecture
● 10 redshift clusters
● 5 RDS clusters
● 1000+ ETLs
● 1 Tableau
● Hard to scale
● Hard to maintain
● Hard to work with
● Limited data
● Expensive
13. Hive
Apache Hive is data warehouse software built on top
of Apache Hadoop to provide data query and analysis.
● One place to rule them all
● Hadoop Ecosystem
● Presto
● Spark
● Athena
Data Lake
14. Presto & Qubole
Qubole delivers a Self-Service Platform for Big Data
Analytics built on the Amazon Web Services, Microsoft,
and Google clouds.
Scalable Clusters
Qubole scales clusters up and down by examining
the execution plans of queries.
Spot Instances
Maintenance & Versions
Qubole takes care of new versions & 24/7 support
For Every Query
18. ● 10 redshift clusters
● 5 RDS clusters
● 1000+ ETLs
● 1 Tableau
● Hard to scale
● Hard to maintain
● Hard to work with
● Limited data
● Expensive
Our data architecture
● 1 redshift cluster
● 0 RDS clusters
● 300 ETLs
● 1 Tableau & 1 Re/dash
● Reduce costs by 50%
● Agile to the business
+
Our new data architecture
20. ● Replace 90% of our ETLs with ELTs
● Help our data science team by being clearer
on the logic, reducing their work time by 80%
● Keep raw data without any manipulation
● Reduce ML model deployment time by 50%
● No ETL time - no schedule
The New ETL
is ELT
Extract,
Load,
Transform.
22. Key notes to take home
Data-Lake - Keep all your raw data in one place.
It will save you costs and resources in the future, and help with research and ML models
Qubole - Enjoy the benefits of 3rd-party services and keep working on your business
Scale - Reach endless data with big clusters that scale per query
ELT - Move 90% of your ETLs to ELTs to reduce lag and costs
Agile - Promote your business with quick insights
Free to Learn - Take 10% of your time and learn!
Try and play with the data :)
Good morning everyone!!!
I am Or, and today I will show you how we use Presto and a data lake at PB scale.
Before we start, I want to tell you a bit about myself and my team. This picture was taken at last Purim's costume party nearby, at Hangar 11 (eleven).
For those who noticed, I hurt my knee 4 weeks ago skiing in Val-Morel, France.
So I had to sit in the sun, drink, and relax…
I am married, 35, and live in Tel Aviv. Coding has been my life since I was eleven…
I will show you a bit about ironSource.
Then I will take you on a journey through time, from 2016 (before Presto) until today (with Presto), and show what we are going to use in the future.
ironSource was created 10 years ago.
We are almost 800 employees & more than 50% of us are R&D.
Our headquarters & R&D center is located in Tel-Aviv & we have 9 more offices around the world.
ironSource has a few different business divisions:
Developer solutions - This division focuses on providing tools and technology to mobile app developers - specifically game developers.
We offer an SDK which essentially enables developers to run ads in their apps to make more money.
We are very strong with rewarded video - so if any of you are gamers, you may be familiar with the moment in a game when you run out of lives and you are offered a rewarded video to watch in order to continue playing. That’s an example of what we do
Enterprise solutions - Focusing on helping mobile device manufacturers and mobile carriers to engage with their customers.
Instead of having 20 different applications pre-installed on your device, users have the power to set up their device the way they want to, with the apps they really want and need.
Digital solutions - This is my division, we are focusing on the desktop world, (Mac, PC).
We help software developers with technologies that help monetize their software and distribute it to new users.
Let's have a look back at our -- AR-KI-TECH-TURE
We had 10 different Redshift clusters
One for BI
One for Researchers
One for R&D
One for Data science
One for Realtime data
One for Historic Data
One for DWH
One for QA
One for Critical ETLs
One for Backups
As you can imagine, it was really hard to work with.
We had 5 RDS clusters - Mainly for our Applications (Like OLAP)
We had more than 1000 ETLs...
We had 1 Tableau Server
And it was really hard.
Hard to scale - Redshift scales very slowly; resizing takes from a few hours to days...
Hard to maintain - We had to vacuum tables, delete old data, and move data from one Redshift cluster to another.
Hard to work with - From two aspects:
Not all the tables were on the same cluster.
30% of our clusters' power went just to inserting the data.
Limited data - We could not insert all the data into one cluster.
Very Expensive
This is what our data scientist looked like at the time.
Or even like that.
So we stopped and thought about where we wanted to be in the future.
First of all, we wanted lifetime data, which is very important to our business
Fast SQL - We wanted SQL that is fast enough for our dashboard usage
We wanted the ability to scale very fast
We focused on our data science team, as we knew we were going to grow that team and our ML models
Open source - we did not want to be tied to a certain company
So we started to create our data lake. We chose Parquet files, an open-source, column-oriented format.
We keep all of our data in S3, converting it from JSON into Parquet in near-real-time batch operations.
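The row-to-column idea behind that conversion can be sketched in a few lines. This is only an illustration of why a columnar layout like Parquet suits analytics; the real pipeline uses proper Parquet writers in batch jobs, and the field names here are invented:

```python
import json

def rows_to_columns(json_lines):
    """Turn newline-delimited JSON events (row layout) into a
    column-oriented dict of lists - the core idea behind Parquet:
    values of one field are stored contiguously."""
    columns = {}
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns

# Hypothetical ad events, as they might arrive from the apps:
events = [
    '{"app": "game_a", "revenue": 0.02}',
    '{"app": "game_b", "revenue": 0.05}',
]
print(rows_to_columns(events))
# → {'app': ['game_a', 'game_b'], 'revenue': [0.02, 0.05]}
```

A query that only needs `revenue` can now read one list and skip the rest, which is why column-oriented files cut S3 scan volume so dramatically.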
Hive & Hive Metastore - we have one source of truth for our table definitions,
which works perfectly with any Hadoop-ecosystem engine
Such as:
Presto
Spark
Athena
And more.
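The "one source of truth" point can be sketched as a toy catalog: every engine resolves a table through the same definitions, so a table is declared once. All names below are invented for illustration; the real catalog is a Hive Metastore service, not a Python dict:

```python
# Toy model of a shared metastore: one table definition,
# consumed identically by any engine (Presto, Spark, Athena, ...).
METASTORE = {
    "events": {
        "location": "s3://data-lake/events/",   # hypothetical bucket
        "format": "parquet",
        "columns": {"app": "string", "revenue": "double"},
    }
}

def resolve(table, engine):
    """Any engine asks the same catalog, so it always gets the
    same file location and schema for a given table name."""
    meta = METASTORE[table]
    return f"{engine} reads {meta['format']} files at {meta['location']}"

print(resolve("events", "presto"))
print(resolve("events", "spark"))
# Both lines point at the same location - no per-engine table drift.
```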
Presto and Qubole.
We use Presto to query our data lake via Qubole.
Qubole is a self-service platform that lets us configure Presto clusters that scale easily and use spot instances; Qubole takes care of maintenance, new versions, and 24/7 support.
Once you have configured your cluster, it can grow by itself, from 3 to 50 nodes for example, within seconds…
And that happens for every query you run
Let's see an example of auto-scaling with Presto.
I ran around 50 different dashboards that use Presto and saved a Presto UI snapshot every few seconds.
As you can see, at the start there are 3 nodes and 4 queries.
As I ran the dashboards, the number of nodes increased along with the number of queries.
After all the queries finished, the cluster scaled back down to normal.
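The scale-with-load behaviour can be sketched as a simple sizing rule. Qubole's actual logic inspects query execution plans and is far more sophisticated; the floor of 3 nodes and ceiling of 50 just mirror the example above, and `queries_per_node` is a made-up capacity assumption:

```python
def target_nodes(running_queries, min_nodes=3, max_nodes=50, queries_per_node=2):
    """Choose a cluster size proportional to query load, clamped
    between a floor (keep the cluster warm) and a ceiling (cost cap).
    queries_per_node is an illustrative capacity assumption."""
    needed = -(-running_queries // queries_per_node)  # ceiling division
    return max(min_nodes, min(max_nodes, needed))

print(target_nodes(4))    # light load stays at the 3-node floor -> 3
print(target_nodes(60))   # heavier load scales out -> 30
print(target_nodes(500))  # never beyond the 50-node ceiling -> 50
```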
A bit about our volume.
We have around 70 thousand queries running via Presto every day.
We have 200 users
500 dashboards and counting
And half a petabyte scanned per day, just from S3, without Presto's caching.
Remember our data scientist?
Well, I think this is the best picture to explain how he feels.
Let's see how our -- AR-KI-TECH-TURE ---- looks today
We eliminated 9 of our Redshift clusters, keeping only one for Finance/DWH.
We eliminated all our RDS clusters - all the data is stored in the data lake.
We reduced our ETLs by 70%, as we no longer need to move data from one place to another.
We added a Re/Dash server alongside our Tableau server. Re/Dash is an open-source BI tool; we use Re/Dash for short-term solutions and Tableau for long-term solutions.
By adding the data lake, all of our problems disappeared!
In addition, we have reduced our costs by 50%!
And most important, we became much more agile for the business: instead of delivering the first insights for a new project in 2 to 8 weeks, we deliver them on the first day, or even in the first hour!
What we expect to use more in the future.
First of all: ELT.
The new ETL is ELT.
If you don't know what ELT is: Extract, Load, Transform.
It means you create your business logic in a big query (or view).
We are going to move around 90% of our ETLs to ELTs.
Why ELT?
1. Data science - We see that ELT reduces our data science team's work by 80%! The main reason is that they can create a dataset within minutes, by cloning the ELT of a specific business unit and adding more features.
2. Deployment - ELT helps data engineers deploy ML models, since all the raw data is in one place and the model was created on this data, not on aggregated data.
3. No lag - no scheduler - you become more real-time.
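The clone-and-extend point can be sketched concretely. In ETL, data is aggregated before loading, so the detail is gone; in ELT, raw events are loaded untouched and each business logic is just a transform over them, which can be copied and given one more dimension in minutes. All table and field names here are illustrative:

```python
# ELT: raw events are loaded exactly as received, no manipulation.
raw_events = [
    {"app": "game_a", "country": "US", "revenue": 0.02},
    {"app": "game_a", "country": "DE", "revenue": 0.05},
    {"app": "game_b", "country": "US", "revenue": 0.10},
]

def revenue_by_app(events):
    """The 'T' of ELT: an aggregation defined on top of raw data,
    like a big query or view."""
    totals = {}
    for e in events:
        totals[e["app"]] = round(totals.get(e["app"], 0) + e["revenue"], 2)
    return totals

def revenue_by_app_country(events):
    """A cloned transform with one extra feature (country) -
    no new pipeline, because the raw detail was never thrown away."""
    totals = {}
    for e in events:
        key = (e["app"], e["country"])
        totals[key] = round(totals.get(key, 0) + e["revenue"], 2)
    return totals

print(revenue_by_app(raw_events))
# → {'game_a': 0.07, 'game_b': 0.1}
```

Had the pipeline been a classic ETL that loaded only `revenue_by_app`, the per-country clone would have required re-extracting the source data.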
We are going to increase our usage of Presto connectors.
Kafka - we are going to move our alerting system (for business KPIs) from the data lake to Kafka, to ensure faster findings in real time!
ScyllaDB - push more of our insights into ScyllaDB for our ML models.
Elasticsearch - we use Elasticsearch via Kibana to monitor server logs and R&D logs; we see a strong need to be able to join those logs with business KPIs
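For reference, wiring a source like Kafka into Presto is mostly a matter of a catalog properties file; a minimal sketch might look like the following (the broker addresses and topic name are placeholders, not our real setup):

```properties
# etc/catalog/kafka.properties — illustrative values only
connector.name=kafka
kafka.nodes=broker1.example.com:9092,broker2.example.com:9092
kafka.table-names=business_kpis.alerts
kafka.hide-internal-columns=false
```

With a catalog like this in place, the same SQL (and the same joins against data-lake tables) works over the live topic.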
A few notes to take home
Data-Lake - Keep all your data in one place; it will save you time, effort & money.
Qubole - Use big data services like Qubole so you can focus on your business and not on maintenance
Scale - Presto scaling just works
ELT - Don't do ETLs; you don't need them anymore.
With Presto, you can be much more agile for your business
Free to Learn - As I always encourage my team to do, take 10% of your time to learn & play with the data.