RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov

2
RUNNING A PETABYTE
SCALE DATA SYSTEM
Alexey Kharlamov
Nov 14st, 2016
Good, Bad, and Ugly Decisions

3
2
1
3
AGENDA
MULTITENANCY
• Problem statement
• Resource management
• Workload isolation
CONTINOUS
INTEGRATION
• What is different?
• Caveats of the conventional approach
• BigData release pipeline
INTRODUCTION
• Who?
• What?
• Why?

4
SERVICES
Data Strategy Big Data
Architecture
Data Science Big Data DevOps
and Support
Solutions
and
Accelerators
BIG DATA AND DATA SCIENCE PRACTICE
15+
World-Class Data
Architects
200+
Big Data Engineers
& Hadoop DevOps
10%
Hadoop Certified
Engineers
20+
Data Scientists

5
BIO
Alexey a Solution Architect at EPAM Systems Ltd, where he leads EMEA Big Data
Competency Center. He has over 20 years of software engineering experience
and built multiple systems in the area of low-latency and distributed data
processing in financial, e-retail and advertising industries.
During his career, Alexey has designed systems processing millions of messages
per second and managing petabytes of stored data. He uses RDBMs, NoSQL, data
grids, and Big Data toolchain in his daily work to help companies on their Big
Data journey.
Alexey Kharlamov
EPAM Systems, Solution Architect

6
DATA THAT CAN NOT BE PROCESSED
ON A SINGLE MACHINE

7
• Data
– Machine generated data by social networks, games, sensors, ad networks
– Large volumes
– Allow to build fine grained models of reality
• Traits
– ~1000 USD/TB
– Hundreds of servers, thousands of rotational drives (Failure is a reality)
– High performance server to server network
– It takes days to copy data from a single server
BIG DATA SYSTEM

8
CONTINOUS INTEGRATION @ SCALE

9
• Multiple environments for
different purposes
– Local/Continuous
Integration
– Quality Assurance
– Production
• The environments are kept
in sync
– Configuration
– Databases
• Code and test datasets are
deployed to the
environments to test
different aspects of a
system
CLASSICAL (WEB) APPROACH
1 Laptop 1 VM 2 hosts 100+
hosts
TRADITIONAL APPROACH

10
TOTALLY DIFFERENT
ENVIRONMENT SYNCRHONIZATION OUTCOME
• CI, QA and PROD are constantly different
• Test failure on CI and QA does not mean it
will fail in PROD and visa versa
• People stop to rely on additional
environments to test their jobs
• The most frequent bugs
– Unexpected field value / rubbish
– Input data change
– Resource issue due data skew or growth
• Environments have different hardware
– Number of nodes
– Generations of servers
• Hard to synchronize configuration
– Reprovisioning takes hours
– Engineers tend to forget to copy
configuration parameters
• Hard to synchronize data
– Different amount of disk space and CPU
– Coping takes hours

11
PREVAILING ISSUE TYPES
• Unexpected field value / rubbish
– Test data do not cover all possible values
– Sampled data may miss exactly this error
– Need to test on production data
• Incompatible change in data format
– Frequently brought in by third-parties and unexpected
– Fall through ETL layers
• Resource issue due data skew or growth
– Causes job termination or cluster failure
– Must be tested on exactly the same hardware
configuration

12
PERFECT TEST USES PRODUCTION DATA
PERFECT TEST USES PRODUCTION
HARDWARE

13
• Logical partitions for DEV, QA, PROD on the
cluster
– Full processing capacity available
– Always up-to-date data and
configuration
– No environment synchronization
needed
• Cluster becomes multitenant
– Partitions must be isolated!
– Code must be portable!
• Developers need more
– Faster turnaround times
– Easy interactive debugging and cross-
process traceability
QA: SINGLE CLUSTER FOR EVERYTHING

14
QA: HADOOP MINICLUSTER
• Full clone of a Hadoop Cluster in a single JVM
– Job Driver
– NameNode
– DataNode
– Hive
– Hbase
• Step Into... Hadoop and debug
– MapReduce Jobs
– User Defined Functions
– Coprocessors
– Queries

15
QA: CONTINUOUS QUALITY MONITORING
• Assertion of invariants per data chunk or
time period
– Number of records
– Field data profile
– Conversion failures
– Missing dictionary/dimension data
– Field values range
• Alerting on assertion failure
– Too many errors!
– Number of records differs!

17
• Uses unit allocated to them, but always
would like to get more
• Wants independence from others
• Do not want to be bothered by other, but can
throw a party from time to time
APARTMENT RENTAL
TENANT
• Provides unit fulfilling tenant needs
• Fixes broken facilities
• Ensures tenants follow rules
• Evicts misbehaving tenants
LANDLORD

18
• A logical partition of platform resources
independently executing a cluster application
– Data processing scripts and drivers
– Cluster services (workflow managers, query engines)
– Bespoken services (REST, Web UI, etc)
• Resource management
– YARN resource pool defines share of resource available
to application
– HDFS quotes for data volume control
• Isolation
– Linux Cgroups enforce CPU/RAM utilization
– Filesystem ACLs restrict access
– Own service instance per domain (Hive, scheduler,
etc)
– YARN can preempt tasks running for too long
– Watchdog processes terminates ran away jobs
APPLICATION DOMAIN

19
ELASTIC COMPUTING CAPACITY
Mesosphere
• Researchers and Developers frequently need a
playground
• Application domains need to dynamically allocate
resources
– Metal as a Service
– Virtualization
– Containerization
• Containers are perfect for portable code bundling
– Statelessness encourages externalization of
configuration
– All dependencies included
– Explicit amount of resources allocated
– Easy migration between hosts

20
2
1
3
TAKE AWAYS
AUGMENT HADOOP WITH FLUID COMPUTATIONAL
CAPACITY
CREATE ISOLATED DOMAINS FOR TENANTS AND
WORKLOADS
USE UNIFIED PLATFORM FOR ALL ACTIVITIES

21
THANK
YOU
alexey@kharlamov.biz
@aih1013

RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (11)

Similaire à RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov

Similaire à RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov (20)

Plus de Big Data Spain

Plus de Big Data Spain (20)

Dernier

Dernier (20)

RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov