Big Data – Apache Hadoop Administrator Training
Objective
This training aims to give participants a comprehensive understanding of all the steps necessary to operate and maintain a Hadoop cluster, from installation and configuration through load balancing and tuning.
Participants will learn the complete installation of a Hadoop cluster, understand the basic and advanced concepts of MapReduce, and study best practices for Apache Hadoop development as experienced by the developers and architects of core Apache Hadoop. With the help of hands-on exercises, participants will cover the following topics during the course.
1. The internals of MapReduce and HDFS, and how to build a Hadoop architecture.
2. Proper cluster configuration and deployment to integrate with the systems and hardware in the data centre.
3. How to load data into the cluster from dynamically generated files using Flume and from an RDBMS using Sqoop.
4. Configuring the FairScheduler to provide service-level agreements for
multiple users of a cluster.
5. Discussing Kerberos-based security for your cluster.
6. Best practices for preparing and maintaining Apache Hadoop in
production.
7. Troubleshooting, diagnosing, tuning and solving Hadoop issues.
Note: The course consists of 20% theoretical discussion and 80% hands-on work.
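As a flavour of the data-loading topic above, a Sqoop import from an RDBMS into HDFS might look like the sketch below. The connection string, credentials, table name, and paths are illustrative placeholders, not part of the course material, and the command requires a live cluster and database to run.

```shell
# Hypothetical example: import a MySQL table named "orders" into HDFS.
# Host, database, user, table, and target directory are placeholders.
sqoop import \
  --connect jdbc:mysql://db.example.com/shop \
  --username dbuser -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```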
Audience & Pre-Requisites
This course is designed for systems administrators and IT managers with basic Linux experience. No prior knowledge of Apache Hadoop is required.
Duration: 30 hours
Course Outline
• Introduction
• The Case for Apache Hadoop
o A Brief History of Hadoop
o Core Hadoop Components
o Fundamental Concepts
• The Hadoop Distributed File System
o HDFS Features
o HDFS Design Assumptions
o Overview of HDFS Architecture
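Hands-on work in the HDFS module typically uses the HDFS shell. A minimal session, with illustrative paths and filenames, might look like:

```shell
hdfs dfs -mkdir -p /user/training/input              # create a directory in HDFS
hdfs dfs -put localfile.txt /user/training/input     # copy a local file into HDFS
hdfs dfs -ls /user/training/input                    # list the directory contents
hdfs dfs -cat /user/training/input/localfile.txt     # read the file back
```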
• MapReduce and YARN
o What Is MapReduce?
o Features of MapReduce
o Basic MapReduce Concepts
o Architectural Overview
o Hands-On Exercise
• An Overview of the Hadoop Ecosystem
o What is the Hadoop Ecosystem?
o Analysis Tools
o Data Storage and Retrieval Tools
• Overview of the Cloudera Distribution of Hadoop
o What is CDH?
• Overview of the Hortonworks Distribution of Hadoop
• Planning your Hadoop Cluster
o General Planning Considerations
o Choosing the Right Hardware
o Network Considerations
• Gen1 – Pseudo-Distributed and 4-Node Cluster – Vanilla Hadoop
o Installation
o Configuration
o Performance Aspects
• Installing a 4-node cluster with NameNode (NN), Secondary NameNode (SNN), and JobTracker (JT) on EC2
• Hadoop Installation
o Deployment Types
o Installing Hadoop
o Basic Configuration Parameters
o Hands-On Exercise
• Advanced Configuration
o Advanced Parameters
o Configuring Rack Awareness
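Rack awareness is configured by pointing `net.topology.script.file.name` (in core-site.xml) at a script that maps node addresses to rack paths. A minimal sketch, assuming a simple subnet-per-rack layout — the subnets and rack names below are assumptions for illustration:

```shell
#!/bin/bash
# Hypothetical topology script: Hadoop invokes it with one or more node
# IP addresses/hostnames and expects one rack path per line of output.
resolve_rack() {
  local node
  for node in "$@"; do
    case "$node" in
      10.0.1.*) echo "/dc1/rack1" ;;    # illustrative subnet-to-rack mapping
      10.0.2.*) echo "/dc1/rack2" ;;
      *)        echo "/default-rack" ;; # fallback for unknown nodes
    esac
  done
}

resolve_rack "$@"
```

After pointing `net.topology.script.file.name` at this file and restarting the NameNode, block placement takes rack locations into account.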
• Hadoop Security
o Why Hadoop Security Is Important
o Hadoop's Security System Concepts
o What Kerberos Is and How it Works
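Enabling Kerberos starts with two properties in core-site.xml, sketched below. This fragment alone is not sufficient — a real deployment also needs per-service principals, keytabs, and matching settings in the other site files.

```xml
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- default is "simple" (no authentication) -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value> <!-- enable service-level authorization checks -->
</property>
```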
• Gen2 Pseudo-Distributed Cluster – Vanilla Hadoop
o Installation of Hadoop
o Hadoop 2 Configuration
o Hadoop Federation Capability
• Configuring HA in Gen2
• Configuring Federation in Gen2
• Managing and Scheduling Jobs
o Managing Running Jobs
o Hands-On Exercise
o The Capacity Scheduler
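The Capacity Scheduler is configured through capacity-scheduler.xml. An illustrative fragment defining two queues splitting cluster capacity 70/30 is sketched below; the queue names and percentages are assumptions, not values taught in the course.

```xml
<!-- Illustrative capacity-scheduler.xml fragment: two queues, "prod"
     and "dev". Queue names and capacities are placeholder choices. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>
```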
• Cluster Maintenance
o Checking HDFS Status
o Hands-On Exercise
o Copying Data Between Clusters
o Adding and Removing Cluster Nodes [Node Maintenance]
o Rebalancing the Cluster
o Hands-On Exercise
o NameNode Metadata Backup
o Cluster Upgrading
o User Management
o Quota Management
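Several of the maintenance tasks above map directly onto standard HDFS administration commands. A sketch, with placeholder hostnames and paths, all requiring a live cluster:

```shell
hdfs dfsadmin -report                   # HDFS status: capacity, live/dead DataNodes
hadoop distcp hdfs://nn1:8020/data hdfs://nn2:8020/data  # copy data between clusters
hdfs balancer -threshold 10             # rebalance until nodes are within 10% of average usage
hdfs dfsadmin -setSpaceQuota 1t /user/alice   # quota management: cap a directory at 1 TB
```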
• Cluster Monitoring and Troubleshooting
o General System Monitoring
o Managing Hadoop's Log Files
o Using the NameNode and JobTracker Web UIs
o Hands-On Exercise
o Cluster Monitoring with Ganglia
o Common Troubleshooting Issues
o Benchmarking Your Cluster
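Benchmarking is commonly done with the TeraSort suite shipped in the Hadoop examples jar. A sketch — the jar filename varies by release and the paths are illustrative:

```shell
# Generate ~10 GB of input (100M rows of 100 bytes), sort it, validate the result.
hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /bench/teragen
hadoop jar hadoop-mapreduce-examples.jar terasort /bench/teragen /bench/terasort
hadoop jar hadoop-mapreduce-examples.jar teravalidate /bench/terasort /bench/teravalidate
```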
• Installing and Managing Other Hadoop Projects
o Hive
o Pig
o Sqoop
• Working with Apache Ambari
o Installation of a 4 Node cluster
o Web HDFS
o Security in Ambari
o Adding new host via Ambari
o Configuring Capacity Scheduler
o Mounting HDFS
o HDFS Snapshots
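The snapshot workflow covered above uses the standard HDFS commands, sketched here with an illustrative directory and snapshot name:

```shell
hdfs dfsadmin -allowSnapshot /data/important            # mark the directory snapshottable (admin)
hdfs dfs -createSnapshot /data/important before-upgrade # take a named, read-only snapshot
hdfs dfs -ls /data/important/.snapshot                  # snapshots appear under .snapshot
hdfs dfs -deleteSnapshot /data/important before-upgrade # remove it when no longer needed
```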