SlideShare une entreprise Scribd logo
1  sur  42
OASIS – DATA ANALYSIS
PLATFORM FOR ENTERPRISE
KEIJI YOSHIDA – DATA ENGINEER, LINE CORPORATION
• We have created a web-based data analysis platform named ”OASIS”
• Employees can analyze data of a Hadoop cluster by writing Spark applications
• 100+ employees use it every day
INTRODUCTION
Agenda 1. Motivation
2. Features & System Architecture
3. Use Cases
Agenda 1. Motivation
2. Features & System Architecture
3. Use Cases
• Based in Tokyo, Japan
• Provides a messaging application named “LINE”
• 164M monthly active users across Japan, Thailand, Taiwan, and Indonesia
• Also provides various related services such as games, sticker market, etc.
LINE CORPORATION
DATA PLATFORM AT LINE
Services
Game
Sticker Market
News
Music
Video Streaming
・
・
・
Processing
Data
Hadoop Cluster
Game
LINE Ads Platform
Sticker
Market
News Music
Video
Streaming
LINE Ads
Platform
・
・
・
Analysis
BI / Reporting
• Enable all employees analyze their services’ data as they like
• Speed up their data analysis process and decision making
PUBLICATION OF HADOOP CLUSTER
1. Security
• Each employee can access only the data related to their service
2. Stability
• Queries must not affect the performance of other queries
3. Features
• Each employee can extract data from the Hadoop cluster as they like
• Results can be visualized and shared within a team or a department
REQUIREMENTS
1. Security
• Kerberize the Hadoop cluster and install Apache Ranger
2. Stability
• Use Apache Spark as a query and application execution engine
3. Features
• Try Apache Zeppelin for the Web UI
SOLUTIONS
• Framework to manage access control over a Hadoop cluster
• Used to control each employee’s data access
APACHE RANGER
1. Security
• Kerberize the Hadoop cluster and install Apache Ranger
2. Stability
• Use Apache Spark as a query and application execution engine
3. Features
• Try Apache Zeppelin for the Web UI
SOLUTIONS
• Web-based data analysis tool
• Supports Apache Spark and user impersonation
• Results can be shared within multiple users as a form of a notebook
APACHE ZEPPELIN
SYSTEM ARCHITECTURE
Apache Zeppelin
0.7.3
Hadoop YARN
Cluster
LDAP
Spark application submission
+
User impersonation
End Users
HDFS + Apache
Ranger
User authentication Data access
1. Security
• An arbitrary user can be set to scheduled notebook’s execution user
• Users can access the data which they don’t have access rights to
ISSUES OF APACHE ZEPPELIN 0.7.3
2. Stability
• Runs only on a single server
• Does not support the “yarn-cluster” mode of Apache Spark
• Freezes when a Spark driver program consumes many server resources
ISSUES OF APACHE ZEPPELIN 0.7.3
3. Features
• Users have to set access control to each notebook
• Users cannot execute a notebook while changing only its parameters
without saving it
ISSUES OF APACHE ZEPPELIN 0.7.3
OASIS
COMPARISON
For a single team
• Does not go with Apache Ranger
• Runs only on a single server
• Access control per notebook
Apache Zeppelin
For an enterprise
• Works well with Apache Ranger
• Runs on multiple servers
• Access control per team
OASIS
Agenda 1. Motivation
2. Features & System Architecture
3. Use Cases
• User impersonation
• Data Visualization
• Notebooks Sharing
• Scalable
OVERVIEW
TOP PAGE
• A root directory of notebooks for a team, a department, or a service
• Access right for each user is separately set in each space
• 2 types of access rights: ”read write” and “read only”
SPACE
Space 1
User A (read write) User B (read only)
Space 2
User C (read write) User D (read only)
NOTEBOOK CREATION
• A single Spark application launches per notebook session
• Notebook author’s account is used to access files
• Spark, Spark SQL, PySpark, and SparkR are available
• Each language of a single notebook session shares a single Spark application
SPARK APPLICATION
SPARK APPLICATION
UTILIZATION OF IN-MEMORY CACHING
• Notebooks can be executed automatically on a prescribed schedule
• Contents of notebooks can be kept updated by this feature
• This feature is also used for creating a light weight ETL processing
SCHEDULING
• Parameters can be injected into a notebook during its execution
• “read only” users can execute a notebook while changing its parameters
PARAMETERS
TABLE DEFINITIONS
FILE UPLOAD
SYSTEM ARCHITECTURE
End Users
Frontend /
API
OASIS
Job
Scheduler
MySQL Redis
Spark
Interpreter
Hadoop
YARN
Cluster
HDFS /
Apache
Ranger
• 500 DataNodes / NodeManagers
• HDFS usage: 20PB
• 150+ Hive databases
• 1,500+ Hive tables
HADOOP CLUSTER
Agenda 1. Motivation
2. Features & System Architecture
3. Use Cases
• 1,500+ users
• 100+ daily active users
• 300+ monthly active users
• 30+ spaces (i.e. departments, teams, or services)
• 1,100+ notebooks
• 200+ scheduled notebooks
STATS
1. Report
2. Interactive dashboard
3. ETL
4. Monitoring
5. Ad hoc analysis
USE CASES
1. REPORT
2. INTERACTIVE DASHBOARD
3. ETL
4. MONITORING
5. AD HOC ANALYSIS
• We created OASIS to solve the issues of Apache Zeppelin
• Extracted data can be visualized and shared within a team
• OASIS utilizes the user impersonation feature of Apache Spark
• At LINE, OASIS is used for reporting, data monitoring, ad hoc analysis, etc.
RECAP
THANK YOU

Contenu connexe

Tendances

More Best Practices With Share Point Solutions
More Best Practices With Share Point SolutionsMore Best Practices With Share Point Solutions
More Best Practices With Share Point Solutions
Alexander Meijers
 
Extending_EBS_12_1_3_with_APEX_5_0_COLLABORATE16
Extending_EBS_12_1_3_with_APEX_5_0_COLLABORATE16Extending_EBS_12_1_3_with_APEX_5_0_COLLABORATE16
Extending_EBS_12_1_3_with_APEX_5_0_COLLABORATE16
Alfredo Abate
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Lucidworks
 
Meson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on MesosMeson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on Mesos
Antony Arokiasamy
 

Tendances (20)

More Best Practices With Share Point Solutions
More Best Practices With Share Point SolutionsMore Best Practices With Share Point Solutions
More Best Practices With Share Point Solutions
 
Spark UDFs are EviL, Catalyst to the rEsCue!
Spark UDFs are EviL, Catalyst to the rEsCue!Spark UDFs are EviL, Catalyst to the rEsCue!
Spark UDFs are EviL, Catalyst to the rEsCue!
 
"Introduction to Sparkling Water" — Jakub Hava, Senior Software Engineer, at ...
"Introduction to Sparkling Water" — Jakub Hava, Senior Software Engineer, at ..."Introduction to Sparkling Water" — Jakub Hava, Senior Software Engineer, at ...
"Introduction to Sparkling Water" — Jakub Hava, Senior Software Engineer, at ...
 
SharePoint 2013 - What's New
SharePoint 2013 - What's NewSharePoint 2013 - What's New
SharePoint 2013 - What's New
 
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache BahirWriting Apache Spark and Apache Flink Applications Using Apache Bahir
Writing Apache Spark and Apache Flink Applications Using Apache Bahir
 
Extending_EBS_12_1_3_with_APEX_5_0_COLLABORATE16
Extending_EBS_12_1_3_with_APEX_5_0_COLLABORATE16Extending_EBS_12_1_3_with_APEX_5_0_COLLABORATE16
Extending_EBS_12_1_3_with_APEX_5_0_COLLABORATE16
 
Feature Toggle for .Net Core Apps on Azure with Azure App Configuration Featu...
Feature Toggle for .Net Core Apps on Azure with Azure App Configuration Featu...Feature Toggle for .Net Core Apps on Azure with Azure App Configuration Featu...
Feature Toggle for .Net Core Apps on Azure with Azure App Configuration Featu...
 
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
Boston Future of Data Meetup: May 2017: Spark Introduction with Credit Card F...
 
SQL TUNING 101
SQL TUNING 101SQL TUNING 101
SQL TUNING 101
 
PerfTest in SOA
PerfTest in SOAPerfTest in SOA
PerfTest in SOA
 
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
Rackspace: Email's Solution for Indexing 50K Documents per Second: Presented ...
 
Elixir in the Wild
Elixir in the WildElixir in the Wild
Elixir in the Wild
 
Scala & Spark Online Training
Scala & Spark Online TrainingScala & Spark Online Training
Scala & Spark Online Training
 
KoprowskiT_SQLSat409_MaintenancePlansForBeginners
KoprowskiT_SQLSat409_MaintenancePlansForBeginnersKoprowskiT_SQLSat409_MaintenancePlansForBeginners
KoprowskiT_SQLSat409_MaintenancePlansForBeginners
 
Meson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on MesosMeson: Building a Machine Learning Orchestration Framework on Mesos
Meson: Building a Machine Learning Orchestration Framework on Mesos
 
SharePoint NYC search presentation
SharePoint NYC search presentationSharePoint NYC search presentation
SharePoint NYC search presentation
 
Spark Hsinchu meetup
Spark Hsinchu meetupSpark Hsinchu meetup
Spark Hsinchu meetup
 
Individual Serverless Development Environments for AWS
Individual Serverless Development Environments for AWSIndividual Serverless Development Environments for AWS
Individual Serverless Development Environments for AWS
 
Building Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source TechnologiesBuilding Enterprise Search Engines using Open Source Technologies
Building Enterprise Search Engines using Open Source Technologies
 
Jupyter con meetup extended jupyter kernel gateway
Jupyter con meetup   extended jupyter kernel gatewayJupyter con meetup   extended jupyter kernel gateway
Jupyter con meetup extended jupyter kernel gateway
 

Similaire à Oasis – data analysis platform for enterprise

One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
Tim Vaillancourt
 
Getting Started with Oracle APEX
Getting Started with Oracle APEXGetting Started with Oracle APEX
Getting Started with Oracle APEX
DataNext Solutions
 

Similaire à Oasis – data analysis platform for enterprise (20)

OASIS - Data Analysis Platform for Multi-tenant Hadoop Cluster
OASIS - Data Analysis Platform for Multi-tenant Hadoop ClusterOASIS - Data Analysis Platform for Multi-tenant Hadoop Cluster
OASIS - Data Analysis Platform for Multi-tenant Hadoop Cluster
 
DataOps with Project Amaterasu
DataOps with Project AmaterasuDataOps with Project Amaterasu
DataOps with Project Amaterasu
 
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apac...
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 
Getting Started with Oracle APEX
Getting Started with Oracle APEXGetting Started with Oracle APEX
Getting Started with Oracle APEX
 
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...
 
Apereo OAE - Bootcamp
Apereo OAE - BootcampApereo OAE - Bootcamp
Apereo OAE - Bootcamp
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
 
Getting started with SparkSQL - Desert Code Camp 2016
Getting started with SparkSQL  - Desert Code Camp 2016Getting started with SparkSQL  - Desert Code Camp 2016
Getting started with SparkSQL - Desert Code Camp 2016
 
Profiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty DetailsProfiling and Tuning a Web Application - The Dirty Details
Profiling and Tuning a Web Application - The Dirty Details
 
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath LakkundiApache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
Apache Spark At Apple with Sam Maclennan and Vishwanath Lakkundi
 
Big analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel GatewayBig analytics meetup - Extended Jupyter Kernel Gateway
Big analytics meetup - Extended Jupyter Kernel Gateway
 
App_Engine_PPT.ppt
App_Engine_PPT.pptApp_Engine_PPT.ppt
App_Engine_PPT.ppt
 

Plus de LINE Corporation

Plus de LINE Corporation (20)

JJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LTJJUG CCC 2018 Fall 懇親会LT
JJUG CCC 2018 Fall 懇親会LT
 
Reduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin CoroutinesReduce dependency on Rx with Kotlin Coroutines
Reduce dependency on Rx with Kotlin Coroutines
 
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみたKotlin/NativeでAndroidのNativeメソッドを実装してみた
Kotlin/NativeでAndroidのNativeメソッドを実装してみた
 
Use Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extensionUse Kotlin scripts and Clova SDK to build your Clova extension
Use Kotlin scripts and Clova SDK to build your Clova extension
 
The Magic of LINE 購物 Testing
The Magic of LINE 購物 TestingThe Magic of LINE 購物 Testing
The Magic of LINE 購物 Testing
 
GA Test Automation
GA Test AutomationGA Test Automation
GA Test Automation
 
UI Automation Test with JUnit5
UI Automation Test with JUnit5UI Automation Test with JUnit5
UI Automation Test with JUnit5
 
Feature Detection for UI Testing
Feature Detection for UI TestingFeature Detection for UI Testing
Feature Detection for UI Testing
 
LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享LINE 新星計劃介紹與新創團隊分享
LINE 新星計劃介紹與新創團隊分享
 
​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享​LINE 技術合作夥伴與應用分享
​LINE 技術合作夥伴與應用分享
 
LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣LINE 開發者社群經營與技術推廣
LINE 開發者社群經營與技術推廣
 
日本開發者大會短講分享
日本開發者大會短講分享日本開發者大會短講分享
日本開發者大會短講分享
 
LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享LINE Chatbot - 活動報名報到設計分享
LINE Chatbot - 活動報名報到設計分享
 
在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes在 LINE 私有雲中使用 Managed Kubernetes
在 LINE 私有雲中使用 Managed Kubernetes
 
LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧LINE TODAY高效率的敏捷測試開發技巧
LINE TODAY高效率的敏捷測試開發技巧
 
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
LINE 區塊鏈平台及代幣經濟 - LINK Chain及LINK介紹
 
LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享LINE Things - LINE IoT平台新技術分享
LINE Things - LINE IoT平台新技術分享
 
LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗LINE Pay - 一卡通支付新體驗
LINE Pay - 一卡通支付新體驗
 
LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務LINE Platform API Update - 打造一個更好的Chatbot服務
LINE Platform API Update - 打造一個更好的Chatbot服務
 
Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發Keynote - ​LINE 的技術策略佈局與跨國產品開發
Keynote - ​LINE 的技術策略佈局與跨國產品開發
 

Dernier

Dernier (20)

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Oasis – data analysis platform for enterprise

  • 1. OASIS – DATA ANALYSIS PLATFORM FOR ENTERPRISE KEIJI YOSHIDA – DATA ENGINEER, LINE CORPORATION
  • 2. • We have created a web-based data analysis platform named ”OASIS” • Employees can analyze data of a Hadoop cluster by writing Spark applications • 100+ employees use it every day INTRODUCTION
  • 3. Agenda 1. Motivation 2. Features & System Architecture 3. Use Cases
  • 4. Agenda 1. Motivation 2. Features & System Architecture 3. Use Cases
  • 5. • Based in Tokyo, Japan • Provides a messaging application named “LINE” • 164M monthly active users across Japan, Thailand, Taiwan, and Indonesia • Also provides various related services such as games, sticker market, etc. LINE CORPORATION
  • 6. DATA PLATFORM AT LINE Services Game Sticker Market News Music Video Streaming ・ ・ ・ Processing Data Hadoop Cluster Game LINE Ads Platform Sticker Market News Music Video Streaming LINE Ads Platform ・ ・ ・ Analysis BI / Reporting
  • 7. • Enable all employees analyze their services’ data as they like • Speed up their data analysis process and decision making PUBLICATION OF HADOOP CLUSTER
  • 8. 1. Security • Each employee can access only the data related to their service 2. Stability • Queries must not affect the performance of other queries 3. Features • Each employee can extract data from the Hadoop cluster as they like • Results can be visualized and shared within a team or a department REQUIREMENTS
  • 9. 1. Security • Kerberize the Hadoop cluster and install Apache Ranger 2. Stability • Use Apache Spark as a query and application execution engine 3. Features • Try Apache Zeppelin for the Web UI SOLUTIONS
  • 10. • Framework to manage access control over a Hadoop cluster • Used to control each employee’s data access APACHE RANGER
  • 11. 1. Security • Kerberize the Hadoop cluster and install Apache Ranger 2. Stability • Use Apache Spark as a query and application execution engine 3. Features • Try Apache Zeppelin for the Web UI SOLUTIONS
  • 12. • Web-based data analysis tool • Supports Apache Spark and user impersonation • Results can be shared within multiple users as a form of a notebook APACHE ZEPPELIN
  • 13. SYSTEM ARCHITECTURE Apache Zeppelin 0.7.3 Hadoop YARN Cluster LDAP Spark application submission + User impersonation End Users HDFS + Apache Ranger User authentication Data access
  • 14. 1. Security • An arbitrary user can be set to scheduled notebook’s execution user • Users can access the data which they don’t have access rights to ISSUES OF APACHE ZEPPELIN 0.7.3
  • 15. 2. Stability • Runs only on a single server • Does not support the “yarn-cluster” mode of Apache Spark • Freezes when a Spark driver program consumes many server resources ISSUES OF APACHE ZEPPELIN 0.7.3
  • 16. 3. Features • Users have to set access control to each notebook • Users cannot execute a notebook while changing only its parameters without saving it ISSUES OF APACHE ZEPPELIN 0.7.3
  • 17. OASIS
  • 18. COMPARISON For a single team • Does not go with Apache Ranger • Runs only on a single server • Access control per notebook Apache Zeppelin For an enterprise • Works well with Apache Ranger • Runs on multiple servers • Access control per team OASIS
  • 19. Agenda 1. Motivation 2. Features & System Architecture 3. Use Cases
  • 20. • User impersonation • Data Visualization • Notebooks Sharing • Scalable OVERVIEW
  • 22. • A root directory of notebooks for a team, a department, or a service • Access right for each user is separately set in each space • 2 types of access rights: ”read write” and “read only” SPACE Space 1 User A (read write) User B (read only) Space 2 User C (read write) User D (read only)
  • 24. • A single Spark application launches per notebook session • Notebook author’s account is used to access files • Spark, Spark SQL, PySpark, and SparkR are available • Each language of a single notebook session shares a single Spark application SPARK APPLICATION
  • 27. • Notebooks can be executed automatically on a prescribed schedule • Contents of notebooks can be kept updated by this feature • This feature is also used for creating a light weight ETL processing SCHEDULING
  • 28. • Parameters can be injected into a notebook during its execution • “read only” users can execute a notebook while changing its parameters PARAMETERS
  • 31. SYSTEM ARCHITECTURE End Users Frontend / API OASIS Job Scheduler MySQL Redis Spark Interpreter Hadoop YARN Cluster HDFS / Apache Ranger
  • 32. • 500 DataNodes / NodeManagers • HDFS usage: 20PB • 150+ Hive databases • 1,500+ Hive tables HADOOP CLUSTER
  • 33. Agenda 1. Motivation 2. Features & System Architecture 3. Use Cases
  • 34. • 1,500+ users • 100+ daily active users • 300+ monthly active users • 30+ spaces (i.e. departments, teams, or services) • 1,100+ notebooks • 200+ scheduled notebooks STATS
  • 35. 1. Report 2. Interactive dashboard 3. ETL 4. Monitoring 5. Ad hoc analysis USE CASES
  • 40. 5. AD HOC ANALYSIS
  • 41. • We created OASIS to solve the issues of Apache Zeppelin • Extracted data can be visualized and shared within a team • OASIS utilizes the user impersonation feature of Apache Spark • At LINE, OASIS is used for reporting, data monitoring, ad hoc analysis, etc. RECAP