SlideShare une entreprise Scribd logo
1  sur  29
Coupang Confidential and Proprietary
이 문서는 쿠팡의 대외비이며 지적자산입니다
Journey to the Continuous and
Scalable Big Data Platform
Matthew (정재화), Coupang
Coupang Confidential and Proprietary
About me
02
• Software Development Manager of BigData & DW Platform team
• 8+ years Hadoop experience
• Apache Tajo Committer and PMC
• blrunner78@gmail.com
• Blog : https://blrunner.tistory.com
• The author of Hadoop tech hand book
Coupang Confidential and Proprietary
Agenda
03
1. On-Premise
2. Cloud 1.0
3. Cloud 2.0
4. Airflow as a Service
5. Zeppelin as a Service
Coupang Confidential and Proprietary
Motivation
04
The purpose of a business is to
create and keep a customer
- Peter Drucker -
Coupang Confidential and Proprietary
1. On-Premise
Coupang Confidential and Proprietary
Architecture
06
• Aggregations and Joins
• MapReduce
• Hive/Pig/Spark
• Oozie
Logs
• Client Logs
• Server Logs
• Adhoc Query
• HiveRDBMS
External Data
ETL Cluster Read-Only Cluster
Coupang Confidential and Proprietary
Team's Responsibility
07
• Architect, build and operate our data infrastructure and tools
• Create and maintain company-wide data pipeline
• Troubleshoot and resolve all issues as users arise
Coupang Confidential and Proprietary
Areas of Improvements
08
• Pros
• A wide variety of workloads
• Continuous increase in users
• Cons
• Multiple copies of Data
• Lack of Elasticity
• Operation overhead
Coupang Confidential and Proprietary
2. Cloud 1.0
Coupang Confidential and Proprietary
Architecture : Decouple compute and storage
010
Domain Cluster #N
Domain Cluster #2
Centralized Resources
Hive
Meta store
Cloud Storage
Batch Cluster
HiveServer2
Ad-hoc Cluster
HiveServer2
Domain Cluster #1
HiveServer2
- Batch Jobs
- High throughput
- fault tolerant, ETL
- Ad-hoc Queries
- Low latency
- Interactive Analysis
- In-memory
Coupang Confidential and Proprietary
Team's Responsibility
011
• Architect, build and operate our data infrastructure and tools
• Troubleshoot and resolve all issues as users arise
• Implement company-wide data pipelines
Coupang Confidential and Proprietary
Areas of Improvements
012
• Pros
• Allows Parsing, Enriching of Data for Custom Need
• Independent scale of CPU and storage capacity
• Cons
• Learning Curve for Cloud Infrastructure
• Operation overhead
• Users want latest tools and more features
Coupang Confidential and Proprietary
3. Cloud 2.0
Coupang Confidential and Proprietary
High Level Architecture
014
Storage
Data Processing Tools
Scheduler Tools
Security
Airflow
LDAP Authentication Apache Ranger ACL & Audit
Zeppelin
Monitoring
Computing Clusters
Cloud Storage
Data Platorm
Portal
Coupang Confidential and Proprietary
Various types of Computing Clusters
015
Centralized Resource
Hive
Meta Store
Cloud
Storage
Transient Cluster
- Batch Jobs
Persistent Cluster
- Interactive Queries
Workload Specific Cluster
Coupang Confidential and Proprietary
Team's Responsibility
016
• Architect, build and our data infrastructure and tools
• Create data APIs and data services
• Support users using SLA policies
• Maintaining security and data privacy
• Application Knowledge Support Artifacts, etc.
Coupang Confidential and Proprietary
Areas of Improvements
017
• Pros
• Onboard lots of users and variety of jobs
• Easier management and added features
• Cons
• Unintended infrastructure costs have increased
• A wide variety of client tools and Dev environments
• Various types of users
Coupang Confidential and Proprietary
Lessons & Learnings
018
• Distribute traffic instead of concentrating the one place
• Optimize all types of system resources in clusters
• Enforce the Lifecycle of Hadoop Cluster
• Monitor clusters and send alarms from the efficiency perspective
• Training Users Continuously and building the community culture
Coupang Confidential and Proprietary
4. Airflow as a Service
Coupang Confidential and Proprietary
Why we love Airflow?
020
• Define Workflows as code
• Makes Workflows more maintainable, versionable, and testable
• More flexible execution and workflow generation
• Lots of features
• Sensor
• Workflow Profiling
• SLA alert
• Rich Web Interface
• Scalable Worker Processes
• In-house Airflow
Coupang Confidential and Proprietary
Airflow : deployment process
021
Cloud Storage
Coupang Confidential and Proprietary
5. Zeppelin as a Service
Coupang Confidential and Proprietary
Why we love Zeppelin?
023
• Easy spark development in personal computer
• Customized Presto Interpreter
• Run presto query easily without complex JDBC configuration
• Export the heavy data file to local machine without exception
• Persistent Storage for Notebook
Coupang Confidential and Proprietary
Zeppelin Architecture
024
Coupang Confidential and Proprietary
Areas of Improvements
025
• Users
• Load all notebooks in the main page -> Too slow
• Big notebook can consume most resources -> Zeppelin Pending
• Platform team
• Spark interpreter doesn’t support YARN cluster mode
• Doesn’t support the life cycle for notebooks
• Difficult to upgrade and improve existing zeppelins gracefully
Coupang Confidential and Proprietary
Resolution
026
• Upgrade Zeppelin to 0.8.1
• Main Page Improvements
• Yarn Cluster Mode for Spark Interpreter
• Interpreter Lifecycle manager
• Interpreter Recovery
• Containerized Zeppelin on Kubernetes
Coupang Confidential and Proprietary
Summary
027
• Understand who is the immediate customer
• Focus on the truly important things
• Detect and solve problems immediately
• Leverage the identity of infrastructure
• Best Practice is not best for you
Coupang Confidential and Proprietary
SELECT question FROM you
https://boards.greenhouse.io/coupang/
Coupang Confidential and Proprietary
Thank you

Contenu connexe

Tendances

Unbreakable Sharepoint 2016 With SQL Server 2016 availability groups
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groupsUnbreakable Sharepoint 2016 With SQL Server 2016 availability groups
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groupsIsabelle Van Campenhoudt
 
FOSDEM 2015 - NoSQL and SQL the best of both worlds
FOSDEM 2015 - NoSQL and SQL the best of both worldsFOSDEM 2015 - NoSQL and SQL the best of both worlds
FOSDEM 2015 - NoSQL and SQL the best of both worldsAndrew Morgan
 
CCI2017 - Considerations for Migrating Databases to Azure - Gianluca Sartori
CCI2017 - Considerations for Migrating Databases to Azure - Gianluca SartoriCCI2017 - Considerations for Migrating Databases to Azure - Gianluca Sartori
CCI2017 - Considerations for Migrating Databases to Azure - Gianluca Sartoriwalk2talk srl
 
Oracle WebLogic 12c New Multitenancy features
Oracle WebLogic 12c New Multitenancy featuresOracle WebLogic 12c New Multitenancy features
Oracle WebLogic 12c New Multitenancy featuresMichel Schildmeijer
 
2 Speed IT powered by Microsoft Azure and Minecraft
2 Speed IT powered by Microsoft Azure and Minecraft2 Speed IT powered by Microsoft Azure and Minecraft
2 Speed IT powered by Microsoft Azure and MinecraftSriram Hariharan
 
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...Cloudera, Inc.
 
Azure database services for PostgreSQL and MySQL
Azure database services for PostgreSQL and MySQLAzure database services for PostgreSQL and MySQL
Azure database services for PostgreSQL and MySQLAmit Banerjee
 
Overcoming 5 Common Docker Challenges: How We Do It at RightScale
Overcoming 5 Common Docker Challenges: How We Do It at RightScaleOvercoming 5 Common Docker Challenges: How We Do It at RightScale
Overcoming 5 Common Docker Challenges: How We Do It at RightScaleRightScale
 
Implement a disaster recovery solution for your on-prem SQL with Azure? Easy!
Implement a disaster recovery solution for your on-prem SQL with Azure? Easy!Implement a disaster recovery solution for your on-prem SQL with Azure? Easy!
Implement a disaster recovery solution for your on-prem SQL with Azure? Easy!Marco Obinu
 
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Zohar Elkayam
 
Project Sherpa: How RightScale Went All in on Docker
Project Sherpa: How RightScale Went All in on DockerProject Sherpa: How RightScale Went All in on Docker
Project Sherpa: How RightScale Went All in on DockerRightScale
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQL
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQLWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQL
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQLContinuent
 
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & BeyondAutomated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & BeyondJeremyOtt5
 
Cloud Design Patterns - Hong Kong Codeaholics
Cloud Design Patterns - Hong Kong CodeaholicsCloud Design Patterns - Hong Kong Codeaholics
Cloud Design Patterns - Hong Kong CodeaholicsTaswar Bhatti
 
10 Strategies for Developing Reliable Jakarta EE & MicroProfile Applications ...
10 Strategies for Developing Reliable Jakarta EE & MicroProfile Applications ...10 Strategies for Developing Reliable Jakarta EE & MicroProfile Applications ...
10 Strategies for Developing Reliable Jakarta EE & MicroProfile Applications ...Payara
 
Key Design Considerations Private and Hybrid Clouds - RightScale Compute 2013
Key Design Considerations Private and Hybrid Clouds - RightScale Compute 2013Key Design Considerations Private and Hybrid Clouds - RightScale Compute 2013
Key Design Considerations Private and Hybrid Clouds - RightScale Compute 2013RightScale
 
8 cloud design patterns you ought to know - Update Conference 2018
8 cloud design patterns you ought to know - Update Conference 20188 cloud design patterns you ought to know - Update Conference 2018
8 cloud design patterns you ought to know - Update Conference 2018Taswar Bhatti
 
An Introduction to Apache Ignite - Mandhir Gidda - Codemotion Rome 2017
An Introduction to Apache Ignite - Mandhir Gidda - Codemotion Rome 2017An Introduction to Apache Ignite - Mandhir Gidda - Codemotion Rome 2017
An Introduction to Apache Ignite - Mandhir Gidda - Codemotion Rome 2017Codemotion
 
Upgrade your SQL Server like a Ninja
Upgrade your SQL Server like a NinjaUpgrade your SQL Server like a Ninja
Upgrade your SQL Server like a NinjaAmit Banerjee
 
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !Piyush Kumar
 

Tendances (20)

Unbreakable Sharepoint 2016 With SQL Server 2016 availability groups
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groupsUnbreakable Sharepoint 2016 With SQL Server 2016 availability groups
Unbreakable Sharepoint 2016 With SQL Server 2016 availability groups
 
FOSDEM 2015 - NoSQL and SQL the best of both worlds
FOSDEM 2015 - NoSQL and SQL the best of both worldsFOSDEM 2015 - NoSQL and SQL the best of both worlds
FOSDEM 2015 - NoSQL and SQL the best of both worlds
 
CCI2017 - Considerations for Migrating Databases to Azure - Gianluca Sartori
CCI2017 - Considerations for Migrating Databases to Azure - Gianluca SartoriCCI2017 - Considerations for Migrating Databases to Azure - Gianluca Sartori
CCI2017 - Considerations for Migrating Databases to Azure - Gianluca Sartori
 
Oracle WebLogic 12c New Multitenancy features
Oracle WebLogic 12c New Multitenancy featuresOracle WebLogic 12c New Multitenancy features
Oracle WebLogic 12c New Multitenancy features
 
2 Speed IT powered by Microsoft Azure and Minecraft
2 Speed IT powered by Microsoft Azure and Minecraft2 Speed IT powered by Microsoft Azure and Minecraft
2 Speed IT powered by Microsoft Azure and Minecraft
 
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
 
Azure database services for PostgreSQL and MySQL
Azure database services for PostgreSQL and MySQLAzure database services for PostgreSQL and MySQL
Azure database services for PostgreSQL and MySQL
 
Overcoming 5 Common Docker Challenges: How We Do It at RightScale
Overcoming 5 Common Docker Challenges: How We Do It at RightScaleOvercoming 5 Common Docker Challenges: How We Do It at RightScale
Overcoming 5 Common Docker Challenges: How We Do It at RightScale
 
Implement a disaster recovery solution for your on-prem SQL with Azure? Easy!
Implement a disaster recovery solution for your on-prem SQL with Azure? Easy!Implement a disaster recovery solution for your on-prem SQL with Azure? Easy!
Implement a disaster recovery solution for your on-prem SQL with Azure? Easy!
 
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527
 
Project Sherpa: How RightScale Went All in on Docker
Project Sherpa: How RightScale Went All in on DockerProject Sherpa: How RightScale Went All in on Docker
Project Sherpa: How RightScale Went All in on Docker
 
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQL
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQLWebinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQL
Webinar Slides: MySQL HA/DR/Geo-Scale - High Noon #4: MS Azure Database MySQL
 
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & BeyondAutomated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
 
Cloud Design Patterns - Hong Kong Codeaholics
Cloud Design Patterns - Hong Kong CodeaholicsCloud Design Patterns - Hong Kong Codeaholics
Cloud Design Patterns - Hong Kong Codeaholics
 
10 Strategies for Developing Reliable Jakarta EE & MicroProfile Applications ...
10 Strategies for Developing Reliable Jakarta EE & MicroProfile Applications ...10 Strategies for Developing Reliable Jakarta EE & MicroProfile Applications ...
10 Strategies for Developing Reliable Jakarta EE & MicroProfile Applications ...
 
Key Design Considerations Private and Hybrid Clouds - RightScale Compute 2013
Key Design Considerations Private and Hybrid Clouds - RightScale Compute 2013Key Design Considerations Private and Hybrid Clouds - RightScale Compute 2013
Key Design Considerations Private and Hybrid Clouds - RightScale Compute 2013
 
8 cloud design patterns you ought to know - Update Conference 2018
8 cloud design patterns you ought to know - Update Conference 20188 cloud design patterns you ought to know - Update Conference 2018
8 cloud design patterns you ought to know - Update Conference 2018
 
An Introduction to Apache Ignite - Mandhir Gidda - Codemotion Rome 2017
An Introduction to Apache Ignite - Mandhir Gidda - Codemotion Rome 2017An Introduction to Apache Ignite - Mandhir Gidda - Codemotion Rome 2017
An Introduction to Apache Ignite - Mandhir Gidda - Codemotion Rome 2017
 
Upgrade your SQL Server like a Ninja
Upgrade your SQL Server like a NinjaUpgrade your SQL Server like a Ninja
Upgrade your SQL Server like a Ninja
 
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
Importance of ‘Centralized Event collection’ and BigData platform for Analysis !
 

Similaire à [Coupang] Journey to the Continuous and Scalable Big Data Platform : 지속적으로 확장 가능한 빅데이터 플랫폼을 향한 여정

Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Cloudera, Inc.
 
When small problems become big problems
When small problems become big problemsWhen small problems become big problems
When small problems become big problemsAdrian Cole
 
Ten tools for ten big data areas 01 informatica
Ten tools for ten big data areas 01 informatica Ten tools for ten big data areas 01 informatica
Ten tools for ten big data areas 01 informatica Will Du
 
Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksDatabricks
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
 
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?TechWell
 
Optimize with Open Source
Optimize with Open SourceOptimize with Open Source
Optimize with Open SourceEDB
 
Praxistaugliche notes strategien 4 cloud
Praxistaugliche notes strategien 4 cloudPraxistaugliche notes strategien 4 cloud
Praxistaugliche notes strategien 4 cloudRoman Weber
 
Designing your API Server for mobile apps
Designing your API Server for mobile appsDesigning your API Server for mobile apps
Designing your API Server for mobile appsMugunth Kumar
 
(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol PerformanceBIOVIA
 
Lessons Learned from Building Enterprise APIs (Gustaf Nyman)
Lessons Learned from Building Enterprise APIs (Gustaf Nyman)Lessons Learned from Building Enterprise APIs (Gustaf Nyman)
Lessons Learned from Building Enterprise APIs (Gustaf Nyman)Nordic APIs
 
Datapolis Guest Expert Presentation: Top 15 SharePoint Server Configuration M...
Datapolis Guest Expert Presentation: Top 15 SharePoint Server Configuration M...Datapolis Guest Expert Presentation: Top 15 SharePoint Server Configuration M...
Datapolis Guest Expert Presentation: Top 15 SharePoint Server Configuration M...Datapolis
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxCalvinSim10
 
Denodo DataFest 2017: Business Needs for a Fast Data Strategy
Denodo DataFest 2017: Business Needs for a Fast Data StrategyDenodo DataFest 2017: Business Needs for a Fast Data Strategy
Denodo DataFest 2017: Business Needs for a Fast Data StrategyDenodo
 

Similaire à [Coupang] Journey to the Continuous and Scalable Big Data Platform : 지속적으로 확장 가능한 빅데이터 플랫폼을 향한 여정 (20)

Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
When small problems become big problems
When small problems become big problemsWhen small problems become big problems
When small problems become big problems
 
Ten tools for ten big data areas 01 informatica
Ten tools for ten big data areas 01 informatica Ten tools for ten big data areas 01 informatica
Ten tools for ten big data areas 01 informatica
 
Building a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with DatabricksBuilding a Turbo-fast Data Warehousing Platform with Databricks
Building a Turbo-fast Data Warehousing Platform with Databricks
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
 
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
Can Your Mobile Infrastructure Survive 1 Million Concurrent Users?
 
Optimize with Open Source
Optimize with Open SourceOptimize with Open Source
Optimize with Open Source
 
Praxistaugliche notes strategien 4 cloud
Praxistaugliche notes strategien 4 cloudPraxistaugliche notes strategien 4 cloud
Praxistaugliche notes strategien 4 cloud
 
Location-independent SharePoint
Location-independent SharePointLocation-independent SharePoint
Location-independent SharePoint
 
Designing your API Server for mobile apps
Designing your API Server for mobile appsDesigning your API Server for mobile apps
Designing your API Server for mobile apps
 
Oracle Database Cloud Service
Oracle Database Cloud ServiceOracle Database Cloud Service
Oracle Database Cloud Service
 
(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance
 
Lessons Learned from Building Enterprise APIs (Gustaf Nyman)
Lessons Learned from Building Enterprise APIs (Gustaf Nyman)Lessons Learned from Building Enterprise APIs (Gustaf Nyman)
Lessons Learned from Building Enterprise APIs (Gustaf Nyman)
 
Datapolis Guest Expert Presentation: Top 15 SharePoint Server Configuration M...
Datapolis Guest Expert Presentation: Top 15 SharePoint Server Configuration M...Datapolis Guest Expert Presentation: Top 15 SharePoint Server Configuration M...
Datapolis Guest Expert Presentation: Top 15 SharePoint Server Configuration M...
 
Apex ace update
Apex ace updateApex ace update
Apex ace update
 
Data platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptxData platform modernization with Databricks.pptx
Data platform modernization with Databricks.pptx
 
ow.ppt
ow.pptow.ppt
ow.ppt
 
ow.ppt
ow.pptow.ppt
ow.ppt
 
Ow
OwOw
Ow
 
Denodo DataFest 2017: Business Needs for a Fast Data Strategy
Denodo DataFest 2017: Business Needs for a Fast Data StrategyDenodo DataFest 2017: Business Needs for a Fast Data Strategy
Denodo DataFest 2017: Business Needs for a Fast Data Strategy
 

Dernier

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Dernier (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

[Coupang] Journey to the Continuous and Scalable Big Data Platform : 지속적으로 확장 가능한 빅데이터 플랫폼을 향한 여정

  • 1. Coupang Confidential and Proprietary 이 문서는 쿠팡의 대외비이며 지적자산입니다 Journey to the Continuous and Scalable Big Data Platform Matthew (정재화), Coupang
  • 2. Coupang Confidential and Proprietary About me 02 • Software Development Manager of BigData & DW Platform team • 8+ years Hadoop experience • Apache Tajo Committer and PMC • blrunner78@gmail.com • Blog : https://blrunner.tistory.com • The author of Hadoop tech hand book
  • 3. Coupang Confidential and Proprietary Agenda 03 1. On-Premise 2. Cloud 1.0 3. Cloud 2.0 4. Airflow as a Service 5. Zeppelin as a Service
  • 4. Coupang Confidential and Proprietary Motivation 04 The purpose of a business is to create and keep a customer - Peter Drucker -
  • 5. Coupang Confidential and Proprietary 1. On-Premise
  • 6. Coupang Confidential and Proprietary Architecture 06 • Aggregations and Joins • MapReduce • Hive/Pig/Spark • Oozie Logs • Client Logs • Server Logs • Adhoc Query • HiveRDBMS External Data ETL Cluster Read-Only Cluster
  • 7. Coupang Confidential and Proprietary Team's Responsibility 07 • Architect, build and operate our data infrastructure and tools • Create and maintain company-wide data pipeline • Troubleshoot and resolve all issues as users arise
  • 8. Coupang Confidential and Proprietary Areas of Improvements 08 • Pros • A wide variety of workloads • Continuous increase in users • Cons • Multiple copies of Data • Lack of Elasticity • Operation overhead
  • 9. Coupang Confidential and Proprietary 2. Cloud 1.0
  • 10. Coupang Confidential and Proprietary Architecture : Decouple compute and storage 010 Domain Cluster #N Domain Cluster #2 Centralized Resources Hive Meta store Cloud Storage Batch Cluster HiveServer2 Ad-hoc Cluster HiveServer2 Domain Cluster #1 HiveServer2 - Batch Jobs - High throughput - fault tolerant, ETL - Ad-hoc Queries - Low latency - Interactive Analysis - In-memory
  • 11. Coupang Confidential and Proprietary Team's Responsibility 011 • Architect, build and operate our data infrastructure and tools • Troubleshoot and resolve all issues as users arise • Implement company-wide data pipelines
  • 12. Coupang Confidential and Proprietary Areas of Improvements 012 • Pros • Allows Parsing, Enriching of Data for Custom Need • Independent scale of CPU and storage capacity • Cons • Learning Curve for Cloud Infrastructure • Operation overhead • Users want latest tools and more features
  • 13. Coupang Confidential and Proprietary 3. Cloud 2.0
  • 14. Coupang Confidential and Proprietary High Level Architecture 014 Storage Data Processing Tools Scheduler Tools Security Airflow LDAP Authentication Apache Ranger ACL & Audit Zeppelin Monitoring Computing Clusters Cloud Storage Data Platorm Portal
  • 15. Coupang Confidential and Proprietary Various types of Computing Clusters 015 Centralized Resource Hive Meta Store Cloud Storage Transient Cluster - Batch Jobs Persistent Cluster - Interactive Queries Workload Specific Cluster
  • 16. Coupang Confidential and Proprietary Team's Responsibility 016 • Architect, build and our data infrastructure and tools • Create data APIs and data services • Support users using SLA policies • Maintaining security and data privacy • Application Knowledge Support Artifacts, etc.
  • 17. Coupang Confidential and Proprietary Areas of Improvements 017 • Pros • Onboard lots of users and variety of jobs • Easier management and added features • Cons • Unintended infrastructure costs have increased • A wide variety of client tools and Dev environments • Various types of users
  • 18. Coupang Confidential and Proprietary Lessons & Learnings 018 • Distribute traffic instead of concentrating the one place • Optimize all types of system resources in clusters • Enforce the Lifecycle of Hadoop Cluster • Monitor clusters and send alarms from the efficiency perspective • Training Users Continuously and building the community culture
  • 19. Coupang Confidential and Proprietary 4. Airflow as a Service
  • 20. Coupang Confidential and Proprietary Why we love Airflow? 020 • Define Workflows as code • Makes Workflows more maintainable, versionable, and testable • More flexible execution and workflow generation • Lots of features • Sensor • Workflow Profiling • SLA alert • Rich Web Interface • Scalable Worker Processes • In-house Airflow
  • 21. Coupang Confidential and Proprietary Airflow : deployment process 021 Cloud Storage
  • 22. Coupang Confidential and Proprietary 5. Zeppelin as a Service
  • 23. Coupang Confidential and Proprietary Why we love Zeppelin? 023 • Easy spark development in personal computer • Customized Presto Interpreter • Run presto query easily without complex JDBC configuration • Export the heavy data file to local machine without exception • Persistent Storage for Notebook
  • 24. Coupang Confidential and Proprietary Zeppelin Architecture 024
  • 25. Coupang Confidential and Proprietary Areas of Improvements 025 • Users • Load all notebooks in the main page -> Too slow • Big notebook can consume most resources -> Zeppelin Pending • Platform team • Spark interpreter doesn’t support YARN cluster mode • Doesn’t support the life cycle for notebooks • Difficult to upgrade and improve existing zeppelins gracefully
  • 26. Coupang Confidential and Proprietary Resolution 026 • Upgrade Zeppelin to 0.8.1 • Main Page Improvements • Yarn Cluster Mode for Spark Interpreter • Interpreter Lifecycle manager • Interpreter Recovery • Containerized Zeppelin on Kubernetes
  • 27. Coupang Confidential and Proprietary Summary 027 • Understand who is the immediate customer • Focus on the truly important things • Detect and solve problems immediately • Leverage the identity of infrastructure • Best Practice is not best for you
  • 28. Coupang Confidential and Proprietary SELECT question FROM you https://boards.greenhouse.io/coupang/
  • 29. Coupang Confidential and Proprietary Thank you