SlideShare une entreprise Scribd logo
1  sur  37
Copyright 2017 Trend Micro Inc.1
Data Lake:
centralize in on-prem vs. decentralize on cloud
Jeff Hung /Trend Micro
Sep 30, 2017
Copyright 2017 Trend Micro Inc.2
@jeffhung
• Smart Protection Network (SPN)
• Big-data Platform & Solution
• Hadoop PROD since 2009
• Work on AWS/EMR since 2014
Copyright 2017 Trend Micro Inc.3 Copyright 2017 Trend Micro Inc.3
500,000,000
total malware in 2016
Malware Explosion
Copyright 2017 Trend Micro Inc.4
Web
Crawler
Trend Micro
Endpoint
Protection
Trend Micro
Mail Protection
Trend Micro
Web Protection
Honeypot
CDN / xSP
Researcher
Intelligence
From data to solution
Threat Protections
Regional distribution of ransomware threats
from January 2016 to March 2017
Copyright 2017 Trend Micro Inc.5
2009
~40 Hadoop nodes
~15 Service/user accounts
3 Teams
<50 TB storage
<100 Jobs per day
Copyright 2017 Trend Micro Inc.6
2014
~250 Hadoop nodes
~140 Service/user accounts
13 Teams
~1,500 TB storage
>16,000 Jobs per day
Copyright 2017 Trend Micro Inc.7
The SPN Journey
Why cloud?
2009 2014 2017
on-prem
on-cloud
AWS
Data
Center
Scalability Agility Cost
Copyright 2017 Trend Micro Inc.8
Big-data on the Cloud
EMR
Copyright 2017 Trend Micro Inc.9 Copyright 2017 Trend Micro Inc.9
Analytic EngineData Lake
Copyright 2017 Trend Micro Inc.10
Data Lake
Copyright 2017 Trend Micro Inc.11
“If you think of a datamart as a store of bottled water – cleansed and
packaged and structured for easy consumption – the data lake is a large
body of water in a more natural state. The contents of the data lake
stream in from a source to fill the lake, and various users of the lake can
come to examine, dive in, or take samples.”
– James Dixon, CTO of Pentaho
Data Lake
Exam / Dive in / Take Example
Stream in Dump in
Copyright 2017 Trend Micro Inc.12
Data Lake fixes 2 problems
• Unknown Questions
• Information Silos
Copyright 2017 Trend Micro Inc.13
Unknown Questions
• You don’t know what you don’t know
• Premature optimization is the root of all evil
Data Warehouse/Mart vs. Data Lake
structured and preprocessed
data
DATA structured / semi-structured /
unstructured / raw data
schema-on-write PROCESSING schema-on-read
expensive for large data volume STORAGE designed for low-cost storage
Copyright 2017 Trend Micro Inc.14
• Incompatible data system built by different teams
– Conway’s Law & NIH Syndrome
– Technology Limitation
• Hadoop come to rescue
– Free technology
– Free computing
– Free storage
Information Silos
But...
– Cost justification
– Development agility
Copyright 2017 Trend Micro Inc.15
Data Lake fixes 2 problems – How ?
• Unknown Questions  Preserve low-level visibility
• Information Silos  Make accessing data easy
We’ve done these right in on-premises datacenter.
Now we need to make them work on Cloud, too.
Copyright 2017 Trend Micro Inc.16
vs.
Centralized Infrastructure 
One Big Hadoop Cluster 
Fine-grain Resource Sharing 
Single Big-data Stack 
UNIX Style Permission Control 
 Decentralized on Cloud Services
 Many EMR Clusters / S3 Buckets
 Go Dutch & Pay-as-you-Go
 Diverse Technology
 Cloud Compatible ACL
Things apply to both scenarios: File Format & Schema Design
On-Prem vs. On Cloud
Copyright 2017 Trend Micro Inc.17
Centralized Infrastructure
• Have only one datacenter
• PROD/STG in same place
• Jobs & data in same cluster
• Do everything on our own
Decentralized on Cloud Services
• Could use multiple AWS regions
• PROD/STG in different regions
• Jobs & data in different places
• Could leverage cloud services
vs
Access data in one place Access data from anywhere
Copyright 2017 Trend Micro Inc.18
One Big Hadoop Cluster
• Runs all applications
in same big cluster
• Bigger cluster = better flexibility
• Resource managed by YARN
Many EMR Clusters / S3 Buckets
• Each applications runs
in its own EMR cluster
• Specific cluster = better flexibility
• Resource managed by AWS
vs
Work in shared place Work in my place
Copyright 2017 Trend Micro Inc.19
Fine-grain Resource Sharing
• Finer granularity by CPU/Mem
• Centralized budget account
• Do usage statistics on our own
Go Dutch & Pay-as-you-Go
• Granularity in EMR level
• Budget in each application team
• AWS provides billing reports
vs
Finer usage management Manage my own works
Copyright 2017 Trend Micro Inc.20
Single Big-data Stack
• Tend to select best practices
that are well tested
• Other technology brings trouble
• Easy to optimize from infra-level
• Access through HDFS interface
Diverse Technology
• Tend to allow using different
AWS services
• Technologies covered by AWS
• Infra optimization rely on AWS
• Different access mechanism
vs
Use verified best practice Use tools I like
Copyright 2017 Trend Micro Inc.21
UNIX Style Permission Control
• UNIX style “file” permissions
• Tend to make data lake
read-only to everybody
• No widely adopted encryptions
Cloud Compatible ACL
• IAM-based ACL policies
• Allow granting access rights
in dataset level
• S3 provides native encryptions
vs
Simple that just works Complex but rich
Copyright 2017 Trend Micro Inc.22
Make Accessing Data Easy
On-prem in datacenter
• Raw data is read-only to
everybody
• Canonical software stack with
best practices
• Plan ahead the infra by strong
team
On-cloud in AWS
• Simplify permission granting
process
• Encourage leveraging AWS
services
• Pay-as-you-Go by every involved
teams
Copyright 2017 Trend Micro Inc.23
• Size reduction to lower storage consumption
• Read/write performance to speed up computing
• Data characteristics to hold complex data structure
• Schema evolution which changes from time to time
• Tool interoperability for different kinds of access
File Format & Schema Design Considerations
23
• Size reduction
• Read/write performance
• Data characteristics
• Schema evolution
• Tool interoperability
Copyright 2017 Trend Micro Inc.24
Data Characteristics
• The characteristics of the data schema
– What will not work?
– The mitigations
Fundamental characteristics and principles for “logs”:
• Records shall be independent – don’t reference other record
• Records shall be self-contained – repeat info in path
• Record exist means something, not-exist may mean another
These can only be documented  need schema portal
24
Copyright 2017 Trend Micro Inc.25
Data Characteristics – Recursion
• Avoid recursive schema
• Recursive schema?
25
ProcessInfo {
process_id: long,
image_path: chararray,
image_sha1: bytearray,
is_detected: int,
action: long,
action_result: {
terminate: chararray
},
children_processes: [
ProcessInfo
]
}
root_process: ProcessInfo;
Bad Design
Copyright 2017 Trend Micro Inc.26
Data Characteristics – Recursion
• Mitigation: array and index/id
26
0
1
2
3
... ...
process_id 652
parent_process_id 360
... ...
process_id 696
parent_process_id 652
... ...
process_id 952
parent_process_id 696
... ...
process_id 828
parent_process_id 696
... ...
ProcessInfo {
process_id: long,
image_path: chararray,
image_sha1: bytearray,
is_detected: int,
action: long,
action_result: {
terminate: chararray
},
parent_process_id: long
}
process_infos: [ ProcessInfo ];
Good Design
Copyright 2017 Trend Micro Inc.27
Data Characteristics – Tabular or Nested?
• Depends on data nature
– Raw data like feedback logs are often nested
– Intermediate data are often tabular
• File Format Capability
• Strategy
– Parquet for nested data
– ORC for tabular data
27
Protobuf Parquet ORC key/value pairs
Good at nested data Good at tabular data
Copyright 2017 Trend Micro Inc.28
• Avoid custom key names for KV pairs
– Runtime-determined key names are hard to process
– Hard to parallelize key-enumeration (tool constraint)
• Mitigation
Data Characteristics – Custom Key Name
28
[
{ "name": "key1", "value": "value1" },
{ "name": "key2", "value": "value2" },
...
]
{ "key1": "value1", "key2": "value2", ... }
Copyright 2017 Trend Micro Inc.29
Data Characteristics – Not All JSON Works
• In summary, not all data presentable by JSON are easy to
process using big-data tools & other formats
• Be careful to design schema if you choose JSON
29
Characteristic Parquet ORC JSON
Recursion No Yes Yes
Tabular vs. Nested Nested Tabular Yes
Custom Key Name No No Yes
Copyright 2017 Trend Micro Inc.30
• Schema evolves from time to time
– Backward compatible is not that simple as imaged
– In reality each role will have multiple versions
• Roles:
Schema Evolution
30
Storage (HDFS or S3)
Data Provider Data Consumer
Storer Loader
DataDataData
Copyright 2017 Trend Micro Inc.31 Copyright 2017 Trend Micro Inc.31
Product/v1 Validator/v3 Data/v1 Loader/v3 Service1/v1
Product/v2 Validator/v3 Data/v2 Loader/v3
Product/v3 Validator/v3 Data/v3 Loader/v3 Service2/v3
31
Data Consumer
Storer DataDataData Loader
Data Provider
Product/v1 Validator/v1 Data/v1 Loader/v1 Service1/v1
Product/v1 Validator/v2 Data/v1 Loader/v2 Service1/v1
Product/v2 Validator/v2 Data/v2 Loader/v2 Service2/v2
Product/v1 Validator/v3 Data/v1 Loader/v2 Service1/v1
Product/v2 Validator/v3 Data/v2 Loader/v2 Service2/v2
Product/v3 Validator/v3 Data/v3 Loader/v2
Year 1 – Only one version in the universe
Year 2 – Second version appeared while v1 still exist
Year 3 – Collect v3 data in advance due to schedule or to accumulate data
Year 4 – Data consumer catch up and start using v3 data
Copyright 2017 Trend Micro Inc.32
• File format requirement to enable schema evolution
– Mostly “Loader” requirement, for Pig and Spark
• Requirements
Schema Evolution – Requirements
32
Processing Program
Data File
Future
Schema
Data File
Current
Schema
Load
Save
Internal Structure
Current
Schema
Process
Data File
Current
Schema
Missing
Fields
=
Default Value
or NULL
Unknown
Fields
=
Invisible
or Ignore
Data File
Previous
Schema
– Can load current version
– Can load previous version
– Can load future version
Copyright 2017 Trend Micro Inc.33
Data File
Current
Schema
Data File
Previous
Schema
Data File
Future
Schema
Processing Program
Data File
Current
Schema
Data File
Previous
Schema
Data File
Future
Schema
• File format requirement to enable schema evolution
– Mostly “Loader” requirement, for Pig and Spark
• Requirements
Schema Evolution – Requirements
33
– Can load current version
– Can load previous version
– Can load future version
– Store as original version
when collecting data
Load
Save
Internal Structure
Current
Schema
Process
Missing
Fields
=
Default Value
or NULL
Unknown
Fields
=
Invisible
or Ignore
Copyright 2017 Trend Micro Inc.34
• File format requirement to enable schema evolution
– Mostly “Loader” requirement, for Pig and Spark
• Requirements
– The internal structure of loaded data in Pig and Spark
cannot preserve entire original structure
– Write data ingestion tool in Java whenever possible
Schema Evolution – Requirements
34
– Can load current version
– Can load previous version
– Can load future version
– Store as original versionRequirement SF+PB Parquet ORC SF+PB Parquet ORC Parquet ORC
Load Previous v v v v v v v v
Load Current v v v v v v v v
Load Future v v v v v v v v
Save Original v v v x x x x x
Language Java Pig Spark
Copyright 2017 Trend Micro Inc.35
Preserve low-level visibility
• Choose file format based on usage scenario
• Deview schema to avoid bad design
Copyright 2017 Trend Micro Inc.36
Wrap ups
• Preserve low-level visibility
– Choose file format wisely
– Design schema carefully
• Make accessing data easy
– On-prem & on-cloud are different
– Do something to lower barrier
Copyright 2017 Trend Micro Inc.37 Copyright 2017 Trend Micro Inc.37
Any Questions?

Contenu connexe

Tendances

NoSQL for the SQL Server Pro
NoSQL for the SQL Server ProNoSQL for the SQL Server Pro
NoSQL for the SQL Server Pro
Lynn Langit
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
Capgemini
 

Tendances (19)

IBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep DiveIBM Cloud Day January 2021 Data Lake Deep Dive
IBM Cloud Day January 2021 Data Lake Deep Dive
 
Cloud Big Data Architectures
Cloud Big Data ArchitecturesCloud Big Data Architectures
Cloud Big Data Architectures
 
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data LakesWebinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
 
NoSQL for the SQL Server Pro
NoSQL for the SQL Server ProNoSQL for the SQL Server Pro
NoSQL for the SQL Server Pro
 
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
Journey to the Data Lake: How Progressive Paved a Faster, Smoother Path to In...
 
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0   virtual - subhadeep duttaCWIN17 India / Insights platform architecture v1 0   virtual - subhadeep dutta
CWIN17 India / Insights platform architecture v1 0 virtual - subhadeep dutta
 
IBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data LakeIBM Cloud Native Day April 2021: Serverless Data Lake
IBM Cloud Native Day April 2021: Serverless Data Lake
 
Big Data - in the cloud or rather on-premises?
Big Data - in the cloud or rather on-premises?Big Data - in the cloud or rather on-premises?
Big Data - in the cloud or rather on-premises?
 
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for AnalyticsVerizon Centralizes Data into a Data Lake in Real Time for Analytics
Verizon Centralizes Data into a Data Lake in Real Time for Analytics
 
Practical Machine Learning
Practical Machine LearningPractical Machine Learning
Practical Machine Learning
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
 
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
[Webinar] Getting to Insights Faster: A Framework for Agile Big Data
 
Big Data Use Cases
Big Data Use CasesBig Data Use Cases
Big Data Use Cases
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_data
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes Democratizing Data Science on Kubernetes
Democratizing Data Science on Kubernetes
 

Similaire à [DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud

Lean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataLean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big Data
Stylight
 
Accelerating a Path to Digital With a Cloud Data Strategy
Accelerating a Path to Digital With a Cloud Data StrategyAccelerating a Path to Digital With a Cloud Data Strategy
Accelerating a Path to Digital With a Cloud Data Strategy
MongoDB
 

Similaire à [DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud (20)

Cloud Data Strategy event London
Cloud Data Strategy event LondonCloud Data Strategy event London
Cloud Data Strategy event London
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Lean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataLean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big Data
 
Adam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStackAdam Dagnall: Advanced S3 compatible storage integration in CloudStack
Adam Dagnall: Advanced S3 compatible storage integration in CloudStack
 
The rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computingThe rise of “Big Data” on cloud computing
The rise of “Big Data” on cloud computing
 
(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, A...
(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, A...(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, A...
(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, A...
 
Unit 3 -Data storage and cloud computing
Unit 3 -Data storage and cloud computingUnit 3 -Data storage and cloud computing
Unit 3 -Data storage and cloud computing
 
Big Data
Big DataBig Data
Big Data
 
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
AWS re:Invent 2016: FINRA in the Cloud: the Big Data Enterprise (ENT313)
 
Cloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native appsCloud-Native Data: What data questions to ask when building cloud-native apps
Cloud-Native Data: What data questions to ask when building cloud-native apps
 
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive AdvantageFueling AI & Machine Learning: Legacy Data as a Competitive Advantage
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage
 
Hadoop
HadoopHadoop
Hadoop
 
Big data and cloud computing 9 sep-2017
Big data and cloud computing 9 sep-2017Big data and cloud computing 9 sep-2017
Big data and cloud computing 9 sep-2017
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster AnswersR+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers
 
Accelerating a Path to Digital With a Cloud Data Strategy
Accelerating a Path to Digital With a Cloud Data StrategyAccelerating a Path to Digital With a Cloud Data Strategy
Accelerating a Path to Digital With a Cloud Data Strategy
 
IBM - Introduction to Cloudant
IBM - Introduction to CloudantIBM - Introduction to Cloudant
IBM - Introduction to Cloudant
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
 
SRV415 NEW LAUNCH! DynamoDB just got faster: Deep Dive on DAX and more
SRV415 NEW LAUNCH!  DynamoDB just got faster: Deep Dive on DAX and moreSRV415 NEW LAUNCH!  DynamoDB just got faster: Deep Dive on DAX and more
SRV415 NEW LAUNCH! DynamoDB just got faster: Deep Dive on DAX and more
 

Dernier

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
amitlee9823
 

Dernier (20)

(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 

[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud

  • 1. Copyright 2017 Trend Micro Inc.1 Data Lake: centralize in on-prem vs. decentralize on cloud Jeff Hung /Trend Micro Sep 30, 2017
  • 2. Copyright 2017 Trend Micro Inc.2 @jeffhung • Smart Protection Network (SPN) • Big-data Platform & Solution • Hadoop PROD since 2009 • Work on AWS/EMR since 2014
  • 3. Copyright 2017 Trend Micro Inc.3 Copyright 2017 Trend Micro Inc.3 500,000,000 total malware in 2016 Malware Explosion
  • 4. Copyright 2017 Trend Micro Inc.4 Web Crawler Trend Micro Endpoint Protection Trend Micro Mail Protection Trend Micro Web Protection Honeypot CDN / xSP Researcher Intelligence From data to solution Threat Protections Regional distribution of ransomware threats from January 2016 to March 2017
  • 5. Copyright 2017 Trend Micro Inc.5 2009 ~40 Hadoop nodes ~15 Service/user accounts 3 Teams <50 TB storage <100 Jobs per day
  • 6. Copyright 2017 Trend Micro Inc.6 2014 ~250 Hadoop nodes ~140 Service/user accounts 13 Teams ~1,500 TB storage >16,000 Jobs per day
  • 7. Copyright 2017 Trend Micro Inc.7 The SPN Journey Why cloud? 2009 2014 2017 on-prem on-cloud AWS Data Center Scalability Agility Cost
  • 8. Copyright 2017 Trend Micro Inc.8 Big-data on the Cloud EMR
  • 9. Copyright 2017 Trend Micro Inc.9 Copyright 2017 Trend Micro Inc.9 Analytic EngineData Lake
  • 10. Copyright 2017 Trend Micro Inc.10 Data Lake
  • 11. Copyright 2017 Trend Micro Inc.11 “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” – James Dixon, CTO of Pentaho Data Lake Exam / Dive in / Take Example Stream in Dump in
  • 12. Copyright 2017 Trend Micro Inc.12 Data Lake fixes 2 problems • Unknown Questions • Information Silos
  • 13. Copyright 2017 Trend Micro Inc.13 Unknown Questions • You don’t know what you don’t know • Premature optimization is the root of all evil Data Warehouse/Mart vs. Data Lake structured and preprocessed data DATA structured / semi-structured / unstructured / raw data schema-on-write PROCESSING schema-on-read expensive for large data volume STORAGE designed for low-cost storage
  • 14. Copyright 2017 Trend Micro Inc.14 • Incompatible data system built by different teams – Conway’s Law & NIH Syndrome – Technology Limitation • Hadoop come to rescue – Free technology – Free computing – Free storage Information Silos But... – Cost justification – Development agility
  • 15. Copyright 2017 Trend Micro Inc.15 Data Lake fixes 2 problems – How ? • Unknown Questions  Preserve low-level visibility • Information Silos  Make accessing data easy We’ve done these right in on-premises datacenter. Now we need to make them work on Cloud, too.
  • 16. Copyright 2017 Trend Micro Inc.16 vs. Centralized Infrastructure  One Big Hadoop Cluster  Fine-grain Resource Sharing  Single Big-data Stack  UNIX Style Permission Control   Decentralized on Cloud Services  Many EMR Clusters / S3 Buckets  Go Dutch & Pay-as-you-Go  Diverse Technology  Cloud Compatible ACL Things apply to both scenarios: File Format & Schema Design On-Prem vs. On Cloud
  • 17. Copyright 2017 Trend Micro Inc.17 Centralized Infrastructure • Have only one datacenter • PROD/STG in same place • Jobs & data in same cluster • Do everything on our own Decentralized on Cloud Services • Could use multiple AWS regions • PROD/STG in different regions • Jobs & data in different places • Could leverage cloud services vs Access data in one place Access data from anywhere
  • 18. Copyright 2017 Trend Micro Inc.18 One Big Hadoop Cluster • Runs all applications in same big cluster • Bigger cluster = better flexibility • Resource managed by YARN Many EMR Clusters / S3 Buckets • Each applications runs in its own EMR cluster • Specific cluster = better flexibility • Resource managed by AWS vs Work in shared place Work in my place
  • 19. Copyright 2017 Trend Micro Inc.19 Fine-grain Resource Sharing • Finer granularity by CPU/Mem • Centralized budget account • Do usage statistics on our own Go Dutch & Pay-as-you-Go • Granularity in EMR level • Budget in each application team • AWS provides billing reports vs Finer usage management Manage my own works
  • 20. Copyright 2017 Trend Micro Inc.20 Single Big-data Stack • Tend to select best practices that are well tested • Other technology brings trouble • Easy to optimize from infra-level • Access through HDFS interface Diverse Technology • Tend to allow using different AWS services • Technologies covered by AWS • Infra optimization rely on AWS • Different access mechanism vs Use verified best practice Use tools I like
  • 21. Copyright 2017 Trend Micro Inc.21 UNIX Style Permission Control • UNIX style “file” permissions • Tend to make data lake read-only to everybody • No widely adopted encryptions Cloud Compatible ACL • IAM-based ACL policies • Allow granting access rights in dataset level • S3 provides native encryptions vs Simple that just works Complex but rich
  • 22. Copyright 2017 Trend Micro Inc.22 Make Accessing Data Easy On-prem in datacenter • Raw data is read-only to everybody • Canonical software stack with best practices • Plan ahead the infra by strong team On-cloud in AWS • Simplify permission granting process • Encourage leveraging AWS services • Pay-as-you-Go by every involved teams
  • 23. Copyright 2017 Trend Micro Inc.23 • Size reduction to lower storage consumption • Read/write performance to speed up computing • Data characteristics to hold complex data structure • Schema evolution which changes from time to time • Tool interoperability for different kinds of access File Format & Schema Design Considerations 23 • Size reduction • Read/write performance • Data characteristics • Schema evolution • Tool interoperability
  • 24. Copyright 2017 Trend Micro Inc.24 Data Characteristics • The characteristics of the data schema – What will not work? – The mitigations Fundamental characteristics and principles for “logs”: • Records shall be independent – don’t reference other record • Records shall be self-contained – repeat info in path • Record exist means something, not-exist may mean another These can only be documented  need schema portal 24
  • 25. Copyright 2017 Trend Micro Inc.25 Data Characteristics – Recursion • Avoid recursive schema • Recursive schema? 25 ProcessInfo { process_id: long, image_path: chararray, image_sha1: bytearray, is_detected: int, action: long, action_result: { terminate: chararray }, children_processes: [ ProcessInfo ] } root_process: ProcessInfo; Bad Design
  • 26. Copyright 2017 Trend Micro Inc.26 Data Characteristics – Recursion • Mitigation: array and index/id 26 0 1 2 3 ... ... process_id 652 parent_process_id 360 ... ... process_id 696 parent_process_id 652 ... ... process_id 952 parent_process_id 696 ... ... process_id 828 parent_process_id 696 ... ... ProcessInfo { process_id: long, image_path: chararray, image_sha1: bytearray, is_detected: int, action: long, action_result: { terminate: chararray }, parent_process_id: long } process_infos: [ ProcessInfo ]; Good Design
  • 27. Copyright 2017 Trend Micro Inc.27 Data Characteristics – Tabular or Nested? • Depends on data nature – Raw data like feedback logs are often nested – Intermediate data are often tabular • File Format Capability • Strategy – Parquet for nested data – ORC for tabular data 27 Protobuf Parquet ORC key/value pairs Good at nested data Good at tabular data
  • 28. Copyright 2017 Trend Micro Inc.28 • Avoid custom key names for KV pairs – Runtime-determined key names are hard to process – Hard to parallelize key-enumeration (tool constraint) • Mitigation Data Characteristics – Custom Key Name 28 [ { "name": "key1", "value": "value1" }, { "name": "key2", "value": "value2" }, ... ] { "key1": "value1", "key2": "value2", ... }
  • 29. Copyright 2017 Trend Micro Inc.29 Data Characteristics – Not All JSON Works • In summary, not all data presentable by JSON are easy to process using big-data tools & other formats • Be careful to design schema if you choose JSON 29 Characteristic Parquet ORC JSON Recursion No Yes Yes Tabular vs. Nested Nested Tabular Yes Custom Key Name No No Yes
  • 30. Copyright 2017 Trend Micro Inc.30 • Schema evolves from time to time – Backward compatible is not that simple as imaged – In reality each role will have multiple versions • Roles: Schema Evolution 30 Storage (HDFS or S3) Data Provider Data Consumer Storer Loader DataDataData
  • 31. Copyright 2017 Trend Micro Inc.31 Copyright 2017 Trend Micro Inc.31 Product/v1 Validator/v3 Data/v1 Loader/v3 Service1/v1 Product/v2 Validator/v3 Data/v2 Loader/v3 Product/v3 Validator/v3 Data/v3 Loader/v3 Service2/v3 31 Data Consumer Storer DataDataData Loader Data Provider Product/v1 Validator/v1 Data/v1 Loader/v1 Service1/v1 Product/v1 Validator/v2 Data/v1 Loader/v2 Service1/v1 Product/v2 Validator/v2 Data/v2 Loader/v2 Service2/v2 Product/v1 Validator/v3 Data/v1 Loader/v2 Service1/v1 Product/v2 Validator/v3 Data/v2 Loader/v2 Service2/v2 Product/v3 Validator/v3 Data/v3 Loader/v2 Year 1 – Only one version in the universe Year 2 – Second version appeared while v1 still exist Year 3 – Collect v3 data in advance due to schedule or to accumulate data Year 4 – Data consumer catch up and start using v3 data
  • 32. Copyright 2017 Trend Micro Inc.32 • File format requirement to enable schema evolution – Mostly “Loader” requirement, for Pig and Spark • Requirements Schema Evolution – Requirements 32 Processing Program Data File Future Schema Data File Current Schema Load Save Internal Structure Current Schema Process Data File Current Schema Missing Fields = Default Value or NULL Unknown Fields = Invisible or Ignore Data File Previous Schema – Can load current version – Can load previous version – Can load future version
  • 33. Copyright 2017 Trend Micro Inc.33 Data File Current Schema Data File Previous Schema Data File Future Schema Processing Program Data File Current Schema Data File Previous Schema Data File Future Schema • File format requirement to enable schema evolution – Mostly “Loader” requirement, for Pig and Spark • Requirements Schema Evolution – Requirements 33 – Can load current version – Can load previous version – Can load future version – Store as original version when collecting data Load Save Internal Structure Current Schema Process Missing Fields = Default Value or NULL Unknown Fields = Invisible or Ignore
  • 34. Copyright 2017 Trend Micro Inc.34 • File format requirement to enable schema evolution – Mostly “Loader” requirement, for Pig and Spark • Requirements – The internal structure of loaded data in Pig and Spark cannot preserve entire original structure – Write data ingestion tool in Java whenever possible Schema Evolution – Requirements 34 – Can load current version – Can load previous version – Can load future version – Store as original versionRequirement SF+PB Parquet ORC SF+PB Parquet ORC Parquet ORC Load Previous v v v v v v v v Load Current v v v v v v v v Load Future v v v v v v v v Save Original v v v x x x x x Language Java Pig Spark
  • 35. Copyright 2017 Trend Micro Inc.35 Preserve low-level visibility • Choose file format based on usage scenario • Deview schema to avoid bad design
  • 36. Copyright 2017 Trend Micro Inc.36 Wrap ups • Preserve low-level visibility – Choose file format wisely – Design schema carefully • Make accessing data easy – On-prem & on-cloud are different – Do something to lower barrier
  • 37. Copyright 2017 Trend Micro Inc.37 Copyright 2017 Trend Micro Inc.37 Any Questions?

Notes de l'éditeur

  1. Conway’s law – architecture structure tend to mirror organization structure
  2. Read/write performance Read scenarios Input to EMR Ingest by Kinesis Write scenarios Accumulated from volume process Dump DB snapshot Size reduction Baseline: Sequence File + Protobuf Explicitly sub-schema Data characteristics tabular vs. nested recursive data structure different types in array/tuple key/value pairs with custom keys json subset Schema evolution Principle: backward compatible data x validator x loader Tool interoperability Pig vs. Spark Batch vs. Streaming (in Spark) Write to Firehose Read from S3 to EMR
  3. Recursion: Parquet (x), ORC (?)
  4. Recursion: Parquet (x), ORC (?)
  5. Custom Key Name: Parquet (?), ORC (?)