HELP HADOOP SURVIVE THE 300 MILLION BLOCK BARRIER AND THEN BACK IT UP
Stuart Pook (s.pook@criteo.com @StuartPook)
BROUGHT TO YOU BY LAKE-STORAGE
Anthony Rabier, Dhia Moakhar, Marouane Benalla, Meriam Lachkar,
Stuart Pook
CRITEO DYNAMIC RETARGETING
Online advertising
Target the right user
At the right time
With the right message
Tailored video and display ads
SIMPLIFIED BUSINESS MODEL
We buy
Ad spaces
We sell
Clicks
Sales
We take the risk
TODAY'S PRIMARY DATA INPUT
Kafka
Up to 7 000 000 messages/s
Bids, impressions, sales, visits, …
400 billion messages/day
Up to 3 GB/s (gzipped)
HDFS
500 TB created per day
410 TB removed per day
260 000 MapReduce jobs per day
6 000 Spark jobs per day
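A quick back-of-envelope check of the figures above (the 4.6 M msg/s average against the quoted 7 M msg/s peak, and the net daily HDFS growth implied by the create/remove rates) can be sketched as:

```python
# Sanity-check the ingest figures quoted on this slide.
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400

messages_per_day = 400e9
avg_rate = messages_per_day / SECONDS_PER_DAY
print(f"average Kafka rate: {avg_rate / 1e6:.1f} M msg/s (quoted peak: 7 M msg/s)")

created_tb, removed_tb = 500, 410
print(f"net HDFS growth: {created_tb - removed_tb} TB/day")  # 90 TB/day
```

The 90 TB/day net growth matches the daily HDFS growth figure quoted later in the deck.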
HADOOP'S COMPUTE AND DATA ARE ESSENTIAL
Extract, Transform & Load logs (ETL)
Machine Learning to generate bidding models
Billing
Business analysis
HADOOP PROVIDES LOCAL REDUNDANCY
Failing datanodes (1 or 2)
Failing racks (1)
soon one failing pod
Failing namenodes (1)
Failing resourcemanager (1)
NO PROTECTION AGAINST:
Data centre disaster
Multiple datanode failures in a short time
Multiple rack failures in a day
Operator error
DATA BACKUP IN THE CLOUD
Backup requires lots of bandwidth
≥ 1.2 GB/s backup from HDFS
≥ 8 GB/s for backup catchup
≥ 28 GB/s restore to HDFS
Restoring 2 PB takes too long at 50 Gb/s
What about compute?
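The restore-time claim is easy to verify with back-of-envelope arithmetic: 2 PB over a 50 Gb/s link takes several days even before any protocol overhead.

```python
# Why a cloud restore is impractical: 2 PB over a 50 Gb/s WAN link.
data_bits = 2e15 * 8          # 2 PB expressed in bits
link_bps = 50e9               # 50 Gb/s link
seconds = data_bits / link_bps
print(f"{seconds / 86400:.1f} days")  # ~3.7 days, ignoring overhead and retries
```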
COMPUTE IN THE CLOUD
60 000 cores requires reservation
Reservation expensive
No need for elasticity (batch processing)
Criteo has data centres
Criteo likes bare metal
Cloud is several times more expensive
In-house gets us exactly the network & hardware we need
BUILD NEW DATA CENTRE (PA4) IN 9 MONTHS
Space for 5000 machines
Non-blocking network
< 10% network utilisation
10 Gb/s endpoints
Level 3 routing
Clos topology
Power 1 megawatt + option for 1 MW
NEW HARDWARE FOR PA4
Had one supplier
Need competition to keep prices down
3 replies to our call for tenders
3 similar 2U machines
16 (or 12) 6 TB SATA disks
2 Xeon E5-2650L v3, 24 cores, 48 threads
256 GB RAM
Mellanox 10 Gb/s network card
2 different RAID cards
TEST THE HARDWARE (PA4)
Three 10 node clusters
Disk bandwidth
Disk errors?
Network bandwidth
Teragen
Zero replication (disks)
High replication (network)
Terasort
HARDWARE IS SIMILAR (PA4)
Eliminate the manufacturer with
4 DOA disks
Other failed disks
20% higher power consumption
Choose the cheapest and most dense
Huawei
LSI-3008 RAID card
MIX THE HARDWARE
Hardware interventions are more diverse
Multiple configurations needed
Some clusters have both hardware types
We have more choice at each order
Avoid vendor lock-in
Negotiate better prices
HAVE THE DC, BUILD HADOOP
Configure using Chef
Infrastructure is code in git
Automate to scale (because we don't)
Test Hadoop with 10 hour petasorts
Tune Hadoop for this scale
Namenode machine crashes
Upgrade kernel
Rolling restart on all machines
Rack by rack, not node by node
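The rack-by-rack restart above can be sketched as a simple loop. This is a hypothetical illustration, not Criteo's actual tooling: `restart_node` and `healthy` stand in for whatever restart mechanism and health checks are used. Restarting one whole rack at a time is safe because, with HDFS's default placement policy, a replication-3 file keeps at least one replica outside any single rack, and it is far faster than cycling thousands of nodes one by one.

```python
import time

def rolling_restart(racks, restart_node, healthy):
    """Hypothetical rack-by-rack rolling restart.

    racks: {rack_id: [node, ...]}
    restart_node, healthy: callbacks (e.g. wrappers around the config
    management run and datanode health checks).
    """
    for rack_id, nodes in racks.items():
        for node in nodes:                    # take down one whole rack
            restart_node(node)
        # wait until every node in the rack reports healthy before moving on
        while not all(healthy(n) for n in nodes):
            time.sleep(30)
```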
PA4 CLUSTER ONLINE
More capacity as we have 2 clusters
Users love it
Users find new uses
Soon using more capacity than the new cluster provides
Impossible to stop the old cluster
SITUATION NOW WORSE
Two critical clusters
Two Hadoop versions to support: CDH4 & CDH5
Not one but two SPOFs
GROW THE NEW DATA CENTRE
Add hundreds of new nodes
Soon the new cluster will be big enough
1370 datanodes
+ 650 in Q3 2017
+ 900 in Q4 2017
> 2800 in Q2 2018
NEED FAST NAMENODES
Namenode does many sequential operations
Long locks
Failovers too slow
Heartbeats lost
→ fast CPU
Big namenodes
768 GiB RAM
2 × Intel Xeon CPU E5-2643 v4 @ 3.40GHz
3 × RAID 1 of 2 SSD
FEDERATE FOR MORE BLOCKS
Too many blocks for the namenode
Federation with 3 namespaces
649 734 268 + 281 542 996 + 153 263 676 = 1 084 540 940 file system objects in PA4
Impact on users
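The per-namespace object counts on this slide do add up to the quoted PA4 total:

```python
# Three federated namespaces' object counts should sum to the PA4 total.
namespaces = [649_734_268, 281_542_996, 153_263_676]
total = sum(namespaces)
print(f"{total:,} file system objects")  # 1,084,540,940
```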
ONE SPOF IS ENOUGH
Move all jobs and data to the new cluster (PA4)
Old cluster (AM5) for backups
Only one SPOF but still no compute redundancy
Data on two clusters
IS THE DATA SAFE?
Same operators on both clusters
One chef server for both clusters
Same code on both clusters
Single mistake → both clusters
BACKUP ON DIFFERENT TECHNOLOGY
≥ 1.2 GB/s backup from HDFS
≥ 8 GB/s for catchup
≥ 28 GB/s restore to HDFS
> 200 million files
10 PB PoC on 108 nodes
All three met our objectives, but MapR was chosen
Cluster to be built
Tools to be written
Data Owners to select data
Candidates: Ceph, OpenIO, MapR
HADOOP AT CRITEO TODAY
2860 Huawei datanodes
16 × 6 TB SATA disks
2 Xeon E5-2650L v3 @ 1.80GHz
24 cores, 48 threads
256 GB RAM
Mellanox 10 Gb/s network card
6 big Namenodes with 222 PB raw storage
Resource Manager shows 545 TB RAM
466 620 VCores
AM5 for backups
2018 PLAN FOR A NEW CLUSTER (AM6)
Classify critical jobs
Identify critical data
Critical jobs must fit on one cluster
Critical data must be copied
Other jobs are distributed
Still being put into place
ANOTHER ROUND OF TENDERS
Need more CPU
Denser machines
4U, 8 nodes, 16 CPUs, 4 × 8 × 2.5" disks (8/U)
2U, 4 nodes, 8 CPUs, 6 × 4 × 2.5" disks (12/U)
Infrastructure validation
Hadoop tests and benchmarks
Added test machines to preprod
Statistics for best CPU
Disks can be added
Change disk to CPU & RAM ratio
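The density figures quoted for the two tendered form factors follow directly from the chassis layouts (32 disks in 4U vs 24 disks in 2U):

```python
# Disk density per rack unit for the two tendered chassis.
def disks_per_u(rack_units, nodes, disks_per_node):
    return nodes * disks_per_node / rack_units

print(disks_per_u(4, 8, 4))   # 4U, 8 nodes, 4 x 2.5" disks each -> 8.0 /U
print(disks_per_u(2, 4, 6))   # 2U, 4 nodes, 6 x 2.5" disks each -> 12.0 /U
```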
HADOOP AT CRITEO TOMORROW
PA4 > 2800 nodes
AM6
≥ HPE DL380 Gen10
2 × Xeon Skylake Gold 6140 2.3 GHz
36 Cores (72 threads)
384 GiB RAM
12 × 8 TB disk
2 × 10 Gb/s network
≥ 11 HPE DL380 Gen10
2 × Xeon Skylake Gold 6146 3.2 GHz
24 Cores
512 GiB RAM
1140
STAYING ALIVE — TUNE HADOOP
Increase bandwidth and time limit for checkpoint
GC Tuning: Serial, Parallel, CMS or G1
Permanent (losing) battle
Tune Azul JVM for namenode
580 GiB heap
1 GC thread per 500 MB allocation/s
Easy and it works
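The GC-thread sizing rule above (one GC thread per 500 MB/s of allocation) can be applied mechanically; the 4 GB/s allocation rate below is an illustrative value, not a figure from the talk:

```python
import math

# One GC thread per 500 MB/s of allocation, per the rule quoted above.
def gc_threads(allocation_mb_per_s, mb_per_thread=500):
    return max(1, math.ceil(allocation_mb_per_s / mb_per_thread))

print(gc_threads(4000))  # e.g. 4 GB/s of allocation -> 8 GC threads
```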
STAYING ALIVE — FIX BUGS
The cluster crashes: find the bug; if already fixed upstream, backport the fix, else fix it ourselves
Fix
HDFS-10220 expired leases make namenode unresponsive and failover
Backport
YARN-4041 Slow delegation token renewal prolongs RM recovery
HDFS-9305 Delayed heartbeat processing causes storm of heartbeats
YARN-4546 ResourceManager crash due to scheduling opportunity overflow
HDFS-9906 Remove spammy log spew when a datanode is restarted
STAYING ALIVE — MONITORING
HDFS
Namenode: missing blocks, GC, checkpoints, safemode, QPS, live datanodes
Datanodes: disks, read/write throughput, space
YARN
Queue length, memory & CPU usage, job duration (scheduling + run time)
ResourceManager: QPS
Bad nodes
Probes to emulate client behavior with witness jobs
Zookeeper: availability, probes
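The witness-job idea above can be sketched as a small probe: submit a trivial job exactly as a client would, time it end to end, and flag it when scheduling plus run time exceeds a threshold. This is a hypothetical sketch; `submit_job` stands in for whatever client submission path is being exercised.

```python
import time

def run_probe(submit_job, timeout_s=600):
    """Hypothetical witness-job probe emulating client behaviour.

    submit_job: callback that submits a trivial job and returns when it
    completes. The elapsed time covers scheduling plus run time.
    """
    start = time.monotonic()
    submit_job()
    elapsed = time.monotonic() - start
    return {"elapsed_s": elapsed, "ok": elapsed < timeout_s}
```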
WE HAVE
2 prod clusters
1 prod cluster under construction
2 pre-prod clusters
1 infrastructure cluster
> 4300 datanodes
273 PB raw disk
592 180 vcores
667 TiB RAM in RM
> 265 000 jobs/day
90 TB daily HDFS growth
15 PB read per day
3 PB written per day
UPCOMING CHALLENGES
Optimize and improve Hadoop
Install a new bare-metal data-centre
Schedule jobs over two clusters
Synchronise two HDFS clusters
Backup HDFS to MapR
We are hiring
Come and join us in Paris, Palo Alto or Ann Arbor (MI)
s.pook@criteo.com @StuartPook
Questions?
