Challenges & Capabilities in Managing a MapR Cluster by David Tucker

"If you're using Hadoop in production, how do you manage it? Does the distribution you're using provide any tools to make the job easier? What are the pitfalls? Are there parts of the system that are less robust or that have problems more often? Are you running Hadoop on bare metal, or in a cloud environment, and is one easier than the other?" MapR Senior Solutions Architect David Tucker speaks about the challenges and capabilities in managing a cluster. This talk was given at the SF Bay Area Large Scale Production Engineering Meetup (Sept 19, 2013).

1©MapR Technologies - Do Not Redistribute
Challenges and Capabilities in
Managing a MapR Cluster
David Tucker
Senior Solution Architect
MapR Technologies

Overview
Business Challenge
• Keep the cluster running
• Keep the data safe and secure
• Optimize resource utilization
Cluster Capability
• Management at scale
• Integrated HA
• Resiliency
• Authentication / authorization
• Designed for high performance
• Data and processing locality

Easy Management at Scale
• Health Monitoring
• Cluster Administration
• Application Resource Provisioning

High Availability and Dependability
Reliable Compute
• Automated stateful failover
• Automated re-replication
• Automated recovery from HW and SW failures
• Load balancing of critical services
• Rolling upgrades
• No lost jobs or data
• Five 9's (99.999%) of uptime
Dependable Storage
• Business continuity with snapshots and mirrors
• Point-in-time recovery
• End-to-end checksumming
• Strong consistency
• Data safe
• Multi-site mirroring to meet Recovery Time Objectives
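
To put the five 9's figure in concrete terms, the short calculation below converts an availability target into a yearly downtime budget. It is plain arithmetic, not a MapR measurement; the function name and the 365-day year are assumptions made for illustration.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Return the yearly downtime allowed by an availability fraction (e.g. 0.99999)."""
    return MINUTES_PER_YEAR * (1.0 - availability)


if __name__ == "__main__":
    targets = [("three 9's", 0.999), ("four 9's", 0.9999), ("five 9's", 0.99999)]
    for label, availability in targets:
        minutes = downtime_minutes_per_year(availability)
        print(f"{label}: {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five 9's leaves a little over five minutes of downtime per year, which is why automated failover matters more than manual recovery at that target.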

No NameNode Architecture
Other Distributions (HDFS Federation)
• Multiple single points of failure
• Limited to 50M files per NameNode
• Performance bottleneck
• Commercial NAS required
• Metadata must fit in memory
MapR
• HA w/ automatic failover and re-replication
• Up to 1T files (> 5000x advantage)
• Higher performance
• 100% commodity hardware
• Metadata is persisted to disk
[Figure: HDFS Federation with three NameNodes (holding A B, C D and E F), a NAS appliance and nine DataNodes, compared with MapR, where A through F are spread across the cluster nodes]
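
The "> 5000x" figure can be sanity-checked against the file counts quoted in this deck. The sketch below uses only those quoted numbers (1T files for MapR, 50M per NameNode on the slide, 200M as the upper bound given in the speaker notes); it does not verify the numbers themselves, and the variable names are illustrative.

```python
# Scale comparison using the figures quoted in this deck (not independently verified).
MAPR_MAX_FILES = 10**12             # "Up to 1T files" (slide)
NAMENODE_FILES_SLIDE = 50 * 10**6   # "Limited to 50M files per NameNode" (slide)
NAMENODE_FILES_NOTES = 200 * 10**6  # "200M at the maximum" (speaker notes)

ratio_vs_slide = MAPR_MAX_FILES // NAMENODE_FILES_SLIDE
ratio_vs_notes = MAPR_MAX_FILES // NAMENODE_FILES_NOTES

if __name__ == "__main__":
    print(f"1T files vs 50M per NameNode:  {ratio_vs_slide}x")
    print(f"1T files vs 200M per NameNode: {ratio_vs_notes}x")
```

The 5000x figure matches the 200M upper bound from the notes; measured against the 50M figure on the slide, the ratio is larger.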

JobTracker HA
[Figure: JobTracker (JT) availability, Other Distributions (MR or YARN) compared with MapR]

NFS HA (via managed VIPs)

Data Protection via MapR Snapshots
• Snapshots without data duplication
• Saves space by sharing blocks
• Lightning fast
• Zero performance loss on writing to original
• Scheduled or on-demand
• Easy recovery by user
[Figure: Hadoop / HBase applications and NFS applications read and write through MapR Storage Services; Snapshots 1, 2 and 3 share data blocks A, B, C and D, and a write to C is redirected to a new block C’ (redirect on write for snapshot)]
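
The redirect-on-write idea can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not MapR's implementation: the `Volume` class, its methods and the in-memory block store are all hypothetical names introduced here. The point it shows is that a snapshot copies only the block map, and that a later write goes to a fresh block so the snapshot's view is never disturbed.

```python
from typing import Dict, Optional


class Volume:
    """Toy volume: a block store, a live block map, and frozen snapshot maps."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}              # block id -> contents, never overwritten
        self.live: Dict[str, int] = {}                  # logical name -> block id (live view)
        self.snapshots: Dict[str, Dict[str, int]] = {}  # snapshot name -> frozen block map
        self._next_id = 0

    def write(self, name: str, data: bytes) -> None:
        # Redirect on write: new data always lands in a fresh block and only the
        # live map is repointed. Blocks referenced by snapshots stay untouched.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.live[name] = block_id

    def snapshot(self, snap_name: str) -> None:
        # A snapshot copies the (small) block map, not the data blocks.
        self.snapshots[snap_name] = dict(self.live)

    def read(self, name: str, snap_name: Optional[str] = None) -> bytes:
        view = self.live if snap_name is None else self.snapshots[snap_name]
        return self.blocks[view[name]]


if __name__ == "__main__":
    vol = Volume()
    for name in ("A", "B", "C", "D"):
        vol.write(name, name.encode())
    vol.snapshot("snapshot1")
    vol.write("C", b"C'")  # the live view now points at a new block
    print(vol.read("C"), vol.read("C", "snapshot1"))
```

In this model the snapshot and the live view keep sharing the blocks for A, B and D; only the rewritten block C exists twice, which mirrors the A B C C’ D picture on the slide.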

Business Continuity via MapR Mirroring
Business Continuity and Efficiency
Efficient design
• Differential deltas are updated
• Compressed and check-summed
Easy to manage
• Scheduled or on-demand
• WAN, Remote Seeding
• Consistent point-in-time
[Figure: mirroring from a Production cluster to a Research cluster, between datacenters over a WAN, and from Production over a WAN to EC2]
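
A differential, compressed and checksummed transfer can be sketched in a few lines. The deck does not describe MapR's actual mirroring protocol, so everything below is an assumption made for illustration: blocks are modelled as a dictionary, SHA-256 stands in for whatever checksum is really used, and zlib stands in for the compression. Deletions on the source are not handled.

```python
import hashlib
import zlib
from typing import Dict, Tuple

Blocks = Dict[str, bytes]
Delta = Dict[str, Tuple[bytes, str]]


def checksum(data: bytes) -> str:
    """Hex digest used to detect changed or corrupted blocks."""
    return hashlib.sha256(data).hexdigest()


def build_delta(source: Blocks, mirror: Blocks) -> Delta:
    """Collect compressed payloads and checksums for blocks the mirror lacks or holds stale."""
    delta: Delta = {}
    for block_id, data in source.items():
        if block_id not in mirror or checksum(mirror[block_id]) != checksum(data):
            delta[block_id] = (zlib.compress(data), checksum(data))
    return delta


def apply_delta(mirror: Blocks, delta: Delta) -> None:
    """Decompress each payload, verify its checksum, then update the mirror."""
    for block_id, (payload, expected) in delta.items():
        data = zlib.decompress(payload)
        if checksum(data) != expected:
            raise ValueError(f"checksum mismatch for block {block_id}")
        mirror[block_id] = data


if __name__ == "__main__":
    production = {"a": b"alpha", "b": b"bravo v2", "c": b"charlie"}
    research = {"a": b"alpha", "b": b"bravo v1"}
    delta = build_delta(production, research)
    apply_delta(research, delta)
    print(sorted(delta), research == production)
```

Only the changed and missing blocks cross the wire, which is the low-overhead property the slide attributes to mirroring.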

User Authentication and Authorization
• PAM interfaces
  – multiple options for authentication registries
• Basic Hadoop authorization
  – file and directory permissions
  – job queues
• Advanced authorization options
• Don’t forget separation of roles!
  – Cluster administration vs data access

Managing Cluster Resources
• Isolation
  – Tasks sandboxed so they don’t impact other tasks or system daemons
  – System resources protected from runaway jobs
  – Volume-based data segregation based on users and groups
  – Volume-based data placement
  – Label-based job scheduling
• Quotas
  – Storage quotas by volume/user/group
  – CPU and memory quotas by queue/user/group
• Reporting
  – Detailed reporting on resource usage
    • ~100 different cluster metrics
  – All reports are available via UI, CLI and REST API
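
Label-based job scheduling can be pictured as a simple filter over node labels. The sketch below is a generic illustration, not the MapR scheduler: the function name, the label strings and the node names are hypothetical, and real schedulers also weigh load, locality and queue policy.

```python
from typing import Dict, List, Set


def eligible_nodes(job_label: str, node_labels: Dict[str, Set[str]]) -> List[str]:
    """Return the nodes whose label set contains the job's label, sorted by name."""
    return sorted(node for node, labels in node_labels.items() if job_label in labels)


if __name__ == "__main__":
    # Hypothetical cluster: labels describe what each node is reserved for.
    cluster = {
        "node1": {"production", "fast"},
        "node2": {"research"},
        "node3": {"production"},
    }
    print(eligible_nodes("production", cluster))
    print(eligible_nodes("research", cluster))
```

Restricting a job to labelled nodes is one way to keep research workloads from competing with production work on the same machines.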

Advanced Job Management
• Job monitoring and management
• Job and data placement control
• Advanced monitoring, management, isolation and security for Hadoop

Q & A

Thank You

Editor's notes

We all know about hadoop .. so no need to get specific there
Another area of ease of use is the MapR Control System and its heatmap. This simplifies health monitoring, cluster administration and application provisioning at scale. Each small rectangle in the UI represents a separate node. You can select a wide variety of elements to monitor, including custom services. MapR also includes alerts and alarms so administrators are not required to monitor constantly. There are also filters and group operations to simplify actions.
With MapR, Hadoop is lights-out data center ready. MapR provides five 9's (99.999%) of availability, including support for rolling upgrades, self-healing and automated stateful failover. MapR is the only distribution that provides these capabilities. MapR also provides dependable data storage with full data protection and business continuity features. MapR provides point-in-time recovery to protect against application and user errors. There is end-to-end checksumming, so data corruption is automatically detected and corrected with MapR’s self-healing capabilities. Mirroring across sites is fully supported. All these features support lights-out data center operations. Every two weeks an administrator can take a MapR report and a shopping cart full of drives and replace failed drives.
The NameNode in Hadoop today is a single point of failure, a scalability limitation, and a performance bottleneck. With MapR there is no dedicated NameNode. The NameNode function is distributed across the cluster. This provides major advantages in terms of HA, data loss avoidance, scalability and performance. With other distributions you have a bottleneck regardless of the number of nodes in the cluster. With other distributions the maximum number of files you can support is 200M, and that is with an extremely high-end server. 50% of the Hadoop processing at Facebook goes to packing and unpacking files to work around this limitation. MapR scales uniformly.
MapR also uniquely provides full snapshots. No other Hadoop distribution provides this capability. They provide replication, which keeps additional copies to protect against data loss, but it does nothing to protect against application or user errors, which are replicated across the cluster. With MapR you have snapshots and point-in-time recovery. A user or administrator can simply open the snapshot directory and recover a full directory or an individual file. The snapshots use a redirect-on-write method, which provides this protection without duplicating the data. In other words, you can snapshot a 1 petabyte cluster in seconds with no additional data storage.
MapR is also the only distribution for Apache Hadoop that provides wide area replication and mirroring, allowing you to provide full business continuity. MapR’s Hadoop distribution allows you to automatically and transparently mirror your data to another cluster. The system performs incremental synchronization between clusters, transferring only the changed data. That means very low overhead and higher performance. With MapR, you can also easily deploy a research cluster alongside a production cluster so that researchers, developers and analysts can experiment without impacting the production cluster. You can mirror between two geographically separated clusters for disaster recovery and meet your Recovery Time Objectives to assure business continuity. MapR’s mirroring also supports bulk data transfer to other clusters. Hadoop users today do not have a way to interoperate between private and public clouds. You can use MapR’s mirroring to synchronize data between a research cluster and your production cluster, or between a private and public cloud.
Snowden story: he got the documents because he was administering a file server that held classified information.
The MapR Control System also provides advanced job management capabilities, enabling an administrator to have complete visibility and control over the operation of the cluster, jobs and tasks. Unique capabilities of the MapR Control System: automated; comprehensive, covering hardware and software (Cloudera has no visibility into hardware faults); full visibility and control; supports lights-out operation.