RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov

https://www.bigdataspain.org/2016/program/thu-running-petascale-data-system-good-bad-ugly-choices.html

https://www.youtube.com/watch?v=gMrFSwT_O-g&t=18s&list=PL6O3g23-p8Tr5eqnIIPdBD_8eE5JBDBik&index=14

Published in: Technology

  1. (no text extracted)
  2. RUNNING A PETABYTE SCALE DATA SYSTEM: Good, Bad, and Ugly Decisions
     Alexey Kharlamov, Nov 14, 2016
  3. AGENDA
     1 INTRODUCTION
       • Who?
       • What?
       • Why?
     2 CONTINUOUS INTEGRATION
       • What is different?
       • Caveats of the conventional approach
       • BigData release pipeline
     3 MULTITENANCY
       • Problem statement
       • Resource management
       • Workload isolation
  4. BIG DATA AND DATA SCIENCE PRACTICE
     SERVICES
       • Data Strategy
       • Big Data Architecture
       • Data Science
       • Big Data DevOps and Support
       • Solutions and Accelerators
     TEAM
       • 15+ World-Class Data Architects
       • 200+ Big Data Engineers & Hadoop DevOps
       • 10% Hadoop Certified Engineers
       • 20+ Data Scientists
  5. BIO
     Alexey Kharlamov, EPAM Systems, Solution Architect
     Alexey is a Solution Architect at EPAM Systems Ltd, where he leads the EMEA Big Data Competency Center. He has over 20 years of software engineering experience and has built multiple systems for low-latency and distributed data processing in the financial, e-retail and advertising industries. During his career, Alexey has designed systems processing millions of messages per second and managing petabytes of stored data. He uses RDBMS, NoSQL, data grids, and the Big Data toolchain in his daily work to help companies on their Big Data journey.
  6. DATA THAT CANNOT BE PROCESSED ON A SINGLE MACHINE
  7. BIG DATA SYSTEM
     • Data
       – Machine-generated data from social networks, games, sensors, ad networks
       – Large volumes
       – Allows building fine-grained models of reality
     • Traits
       – ~1000 USD/TB
       – Hundreds of servers, thousands of rotational drives (failure is a reality)
       – High-performance server-to-server network
       – It takes days to copy data from a single server
  8. CONTINUOUS INTEGRATION @ SCALE
  9. TRADITIONAL APPROACH
     CLASSICAL (WEB) APPROACH
     • Multiple environments for different purposes
       – Local/Continuous Integration
       – Quality Assurance
       – Production
     • The environments are kept in sync
       – Configuration
       – Databases
     • Code and test datasets are deployed to the environments to test different aspects of a system
     (Diagram labels: 1 Laptop, 1 VM, 2 hosts, 100+ hosts)
  10. ENVIRONMENT SYNCHRONIZATION
     TOTALLY DIFFERENT
     • Environments have different hardware
       – Number of nodes
       – Generations of servers
     • Hard to synchronize configuration
       – Reprovisioning takes hours
       – Engineers tend to forget to copy configuration parameters
     • Hard to synchronize data
       – Different amount of disk space and CPU
       – Copying takes hours
     OUTCOME
     • CI, QA and PROD are constantly different
     • A test failure on CI and QA does not mean it will fail in PROD, and vice versa
     • People stop relying on additional environments to test their jobs
     • The most frequent bugs
       – Unexpected field value / rubbish
       – Input data change
       – Resource issue due to data skew or growth
  11. PREVAILING ISSUE TYPES
     • Unexpected field value / rubbish
       – Test data do not cover all possible values
       – Sampled data may miss exactly this error
       – Need to test on production data
     • Incompatible change in data format
       – Frequently introduced by third parties without warning
       – Falls through ETL layers
       – Need to test on production data
     • Resource issue due to data skew or growth
       – Causes job termination or cluster failure
       – Must be tested on exactly the same hardware configuration
       – Need to test on production data
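
The data-skew issue listed on this slide can often be spotted before a job runs by profiling how unevenly records are spread across a grouping key. The following is a minimal Python sketch under stated assumptions, not something taken from the talk: records are plain dicts, the key is a single field, and the names `skew_ratio` and `is_skewed` and the default threshold of 5.0 are hypothetical.

```python
from collections import Counter
from statistics import mean


def skew_ratio(records, key):
    """Return the largest per-key record count divided by the mean count.

    A value near 1.0 means the key is evenly distributed; a large value
    means one key dominates and may overload a single task.
    """
    counts = Counter(r[key] for r in records)
    if not counts:
        return 0.0
    return max(counts.values()) / mean(counts.values())


def is_skewed(records, key, threshold=5.0):
    # The threshold is an illustrative assumption; tune it per workload.
    return skew_ratio(records, key) > threshold
```

Such a check is only meaningful on the real key distribution, which is consistent with the slide's point that these issues need to be tested on production data.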
  12. PERFECT TEST USES PRODUCTION DATA
     PERFECT TEST USES PRODUCTION HARDWARE
  13. QA: SINGLE CLUSTER FOR EVERYTHING
     • Logical partitions for DEV, QA, PROD on the cluster
       – Full processing capacity available
       – Always up-to-date data and configuration
       – No environment synchronization needed
     • Cluster becomes multitenant
       – Partitions must be isolated!
       – Code must be portable!
     • Developers need more
       – Faster turnaround times
       – Easy interactive debugging and cross-process traceability
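
One way to read "code must be portable" is that a job never hard-codes which partition it runs in, and instead derives its queue and output location from a single setting. The sketch below is an illustration only: the `PARTITION` variable, the `root.<partition>` queue naming, and the directory layout are hypothetical and not described in the slides.

```python
import os

KNOWN_PARTITIONS = {"dev", "qa", "prod"}


def job_settings(env=None):
    """Derive partition-specific settings for a job from its environment.

    All partitions read the same shared input (so data is always up to date),
    but each writes to its own area and submits to its own queue.
    """
    env = os.environ if env is None else env
    partition = env.get("PARTITION", "dev")
    if partition not in KNOWN_PARTITIONS:
        raise ValueError(f"unknown partition: {partition!r}")
    return {
        "queue": f"root.{partition}",               # hypothetical queue name
        "input": "/data/raw/events",                # hypothetical shared input
        "output": f"/{partition}/output/events",    # hypothetical per-partition output
    }
```

With this shape, the same job bundle can be promoted from DEV to QA to PROD by changing one setting rather than editing code.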
  14. QA: HADOOP MINICLUSTER
     • Full clone of a Hadoop cluster in a single JVM
       – Job Driver
       – NameNode
       – DataNode
       – Hive
       – HBase
     • Step into Hadoop and debug
       – MapReduce Jobs
       – User Defined Functions
       – Coprocessors
       – Queries
  15. QA: CONTINUOUS QUALITY MONITORING
     • Assertion of invariants per data chunk or time period
       – Number of records
       – Field data profile
       – Conversion failures
       – Missing dictionary/dimension data
       – Field values range
     • Alerting on assertion failure
       – Too many errors!
       – Number of records differs!
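
The per-chunk invariants on this slide can be sketched as a small profile-then-assert step. This is a minimal Python illustration under assumptions of my own: rows are dicts, the monitored field is numeric, and the names `ChunkStats`, `profile_chunk`, `check_invariants`, all default thresholds, and the "Field value out of range!" message are hypothetical (the other two alert messages come from the slide).

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    records: int
    conversion_failures: int
    min_value: float
    max_value: float


def profile_chunk(rows, field):
    """Profile one data chunk: row count, failed numeric conversions, value range."""
    values, failures = [], 0
    for row in rows:
        try:
            values.append(float(row[field]))
        except (KeyError, TypeError, ValueError):
            failures += 1
    lo = min(values) if values else float("nan")
    hi = max(values) if values else float("nan")
    return ChunkStats(len(rows), failures, lo, hi)


def check_invariants(stats, expected_records, tolerance=0.2,
                     max_failure_rate=0.01, value_range=(0.0, 1e6)):
    """Return a list of alert messages; an empty list means the chunk passed."""
    alerts = []
    if abs(stats.records - expected_records) > tolerance * expected_records:
        alerts.append("Number of records differs!")
    if stats.records and stats.conversion_failures / stats.records > max_failure_rate:
        alerts.append("Too many errors!")
    # NaN bounds (no convertible values) compare False and raise no range alert.
    if stats.min_value < value_range[0] or stats.max_value > value_range[1]:
        alerts.append("Field value out of range!")
    return alerts
```

In practice the expected record count would likely come from the previous period or a trailing average, so the check follows normal growth instead of a fixed number.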
  16. MULTITENANCY
  17. APARTMENT RENTAL
     TENANT
     • Uses the unit allocated to them, but would always like to get more
     • Wants independence from others
     • Does not want to be bothered by others, but can throw a party from time to time
     LANDLORD
     • Provides a unit fulfilling the tenant's needs
     • Fixes broken facilities
     • Ensures tenants follow the rules
     • Evicts misbehaving tenants
  18. APPLICATION DOMAIN
     • A logical partition of platform resources independently executing a cluster application
       – Data processing scripts and drivers
       – Cluster services (workflow managers, query engines)
       – Bespoke services (REST, Web UI, etc.)
     • Resource management
       – YARN resource pool defines the share of resources available to an application
       – HDFS quotas for data volume control
     • Isolation
       – Linux cgroups enforce CPU/RAM utilization limits
       – Filesystem ACLs restrict access
       – Own service instance per domain (Hive, scheduler, etc.)
       – YARN can preempt tasks running for too long
       – Watchdog processes terminate runaway jobs
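
The watchdog mentioned on this slide can be pictured as a periodic pass that compares each job's runtime against a limit for its application domain. The sketch below is a self-contained Python illustration, not the speaker's implementation: the `Job` type, `find_runaway_jobs`, the per-domain limits and the one-hour default are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    domain: str
    runtime_s: int


def find_runaway_jobs(jobs, limits_s, default_limit_s=3600):
    """Return the ids of jobs whose runtime exceeds their domain's limit.

    limits_s maps a domain name to its maximum allowed runtime in seconds;
    domains without an entry fall back to default_limit_s.
    """
    return [
        job.job_id
        for job in jobs
        if job.runtime_s > limits_s.get(job.domain, default_limit_s)
    ]
```

On a real cluster the job list would have to come from the resource manager, and each returned id could then be terminated, for example with `yarn application -kill <application id>`.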
  19. ELASTIC COMPUTING CAPACITY
     (Logo: Mesosphere)
     • Researchers and developers frequently need a playground
     • Application domains need to dynamically allocate resources
       – Metal as a Service
       – Virtualization
       – Containerization
     • Containers are perfect for portable code bundling
       – Statelessness encourages externalization of configuration
       – All dependencies included
       – Explicit amount of resources allocated
       – Easy migration between hosts
  20. TAKEAWAYS
     1 USE UNIFIED PLATFORM FOR ALL ACTIVITIES
     2 CREATE ISOLATED DOMAINS FOR TENANTS AND WORKLOADS
     3 AUGMENT HADOOP WITH FLUID COMPUTATIONAL CAPACITY
  21. THANK YOU
     alexey@kharlamov.biz
     @aih1013
