Why is everyone interested in Big Data and Hadoop?
Why should you use Hadoop?
Read this and you too can quickly and easily become the proud owner of a Hadoop kit of your own, using Cloudera Manager Free Edition.
************************NOTE**********************
This presentation is still being edited, and new slides are added every day. Stay tuned...
****************************************************
1. Instant Hadoop of your Own
Created by Jack Bezalel
Senior IT Architect
As part of the CTE Mentorship Program
CA Technologies
2. What’s Hadoop all about?
• Opportunity: We have access to amazingly
valuable data (Social Media, Mobile, …)
• Problem: This data is mostly unstructured
• Relational databases and data warehouses
require structured data, so they are off the list
• Hadoop = fast, reliable analysis of both
structured data and complex data
3. What’s in Hadoop?
• Reliable data storage using the Hadoop
Distributed File System (HDFS)
• High-Performance parallel data processing
using a technique called MapReduce.
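To make the MapReduce idea concrete, here is a toy sketch that mimics the map, shuffle, and reduce phases with ordinary Unix tools on a single machine. It is only an analogy, not Hadoop's API: the function names (map_words, shuffle, reduce_counts, wordcount) are made up for this illustration, and a real job would spread each phase across the cluster's nodes.

```shell
#!/bin/sh
# Toy word count that mirrors the three MapReduce phases with local tools.
# All function names are hypothetical; this is not how Hadoop runs jobs.

# Map: split the input into one lowercase word per line
# (each line stands in for a "word, 1" pair).
map_words() {
    tr -cs '[:alnum:]' '\n' | tr '[:upper:]' '[:lower:]' | sed '/^$/d'
}

# Shuffle: sort so that identical keys end up next to each other.
shuffle() {
    sort
}

# Reduce: count each run of identical keys and print "word<TAB>count".
reduce_counts() {
    uniq -c | awk '{ printf "%s\t%s\n", $2, $1 }'
}

wordcount() {
    map_words | shuffle | reduce_counts
}

# Example run on two short lines of text.
printf 'Hadoop stores data\nHadoop processes data\n' | wordcount
```

The example run prints data 2, hadoop 2, processes 1, stores 1 (one tab-separated pair per line). In Hadoop the same three roles exist, but the map and reduce steps run in parallel on many servers and the framework performs the shuffle between them.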
4. How does it scale so well?
• Hadoop runs on a collection of commodity,
shared-nothing servers
• You can add or remove servers in a Hadoop
cluster at will
• The system detects and compensates for
hardware or system problems on any server
(self-healing)
5. Who uses Hadoop?
• Developed largely at Yahoo; Facebook was among
the early large-scale users
• Hadoop is now widely used in
– Finance
– Technology
– Telecom
– Media and entertainment
– Government
– Research institutions and other markets with
significant data
6. Why did we use Cloudera’s Hadoop
kit?
• Cloudera is an active contributor to the
Hadoop project
• Provides an enterprise-ready, commercial
distribution of Hadoop
• The Cloudera distribution saves time by bundling
and testing the most popular Hadoop-related
projects in a single, easier-to-use package
7. The solution we tested is provided by
Cloudera Manager Free Edition
• Automates the installation and configuration
of CDH3 across an entire cluster (up to 50 nodes)
• Requires only root SSH access to your cluster's
machines
• Download here:
https://ccp.cloudera.com/display/SUPPORT/Cloudera+Manager+Free+Edition+Download
8. Cloudera Manager Free Edition
consists of:
• A small self-executing Cloudera Manager
installation program
• The server and other packages needed to prepare
for cluster host installation
• Cloudera Manager wizard for automating
CDH3 installation and configuration on the
cluster
• Cloudera Manager itself, for monitoring and
configuring the cluster after installation is completed
9. What does Cloudera Include - Flume
• Flume — Reliable Data Mover
• Primary use case: a logging system that
– gathers a set of log files on every machine
– aggregates them into a centralized persistent store
(such as HDFS)
10. What does Cloudera Include - Sqoop
• Sqoop — A tool that imports / exports data
between relational databases and Hadoop
clusters.
• Uses JDBC to import data into HDFS
• Generates Java classes that enable users to
interpret the table's schema
11. What does Cloudera Include - Hue
• Hue — GUI to work with CDH
• Web application
12. What does Cloudera Include - Pig
• Pig — Analyzes large amounts of data
• Uses Pig's own query language, Pig Latin
• Queries run distributed across a Hadoop cluster
13. What does Cloudera Include - Hive
• Hive — A powerful data warehousing APP
• Enables access to your data using Hive QL
• Hive QL = a language similar to SQL
14. What does Cloudera Include - HBase
• HBase — Large-scale tabular storage
• Stores its data in HDFS
• Cloudera recommends installing HBase in a
standalone mode before you try to run it on a
whole cluster.
15. What does Cloudera Include -
ZooKeeper
• ZooKeeper - Service that provides
coordination between distributed processes.
16. What does Cloudera Include - Oozie
• Oozie — A server-based workflow engine
• Runs workflow jobs with actions that execute
Hadoop jobs
• A command line client is also available for
Remote Management
17. What does Cloudera Include - Three more
strangely named tools…
• Whirr — Provides a fast way to run cloud
services
• Snappy — A compression/decompression
library
• Mahout — A machine-learning tool. By
enabling you to build machine-learning
libraries that are scalable to "reasonably
large" datasets, it aims to make building
intelligent applications easier and faster
18. Setup Walkthrough
• Use Red Hat Enterprise Linux 5.5 or later (CentOS and
others are supported as well; we used RHEL 5.7)
• 64-bit only
• 3 VMs used:
– Cloudera Manager
– 2 Nodes to deploy Hadoop on
19. About the Cloudera Manager Free
Edition Installation Program
• Automatically installs the package repositories
for Cloudera Manager and the Oracle JDK
• Installs the Cloudera Manager Server
• Installs and configures an embedded
PostgreSQL database
20. Download the Cloudera Manager
installer (for CDH3)
• http://archive.cloudera.com/cloudera-manager/installer/latest/cloudera-manager-installer.bin
21. Set your proxy in yum.conf, if you have one
• Add these lines to /etc/yum.conf on your first
Red Hat Hadoop node (example values shown)
proxy=http://proxy.corp.com:80
proxy_username=username
proxy_password=password
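If you prefer to script this step, the following is a minimal sketch that appends the three proxy lines only when no proxy= line is present yet, and keeps a backup first. The function name set_yum_proxy is made up for this example, and the sketch assumes the file's last section is [main], which is usual for /etc/yum.conf. The demo runs against a scratch file rather than the real configuration.

```shell
#!/bin/sh
# Hypothetical helper: append proxy settings to a yum configuration file.
# Usage: set_yum_proxy CONF_FILE PROXY_URL USERNAME PASSWORD
# Assumes the last section in CONF_FILE is [main], since lines are appended.
set_yum_proxy() {
    conf_file=$1
    # Do nothing if a proxy is already configured.
    if grep -q '^proxy=' "$conf_file"; then
        echo "proxy already configured in $conf_file" >&2
        return 0
    fi
    # Keep a backup before touching the file.
    cp "$conf_file" "$conf_file.bak"
    {
        echo "proxy=$2"
        echo "proxy_username=$3"
        echo "proxy_password=$4"
    } >> "$conf_file"
}

# Demo on a scratch file (not the real /etc/yum.conf).
demo_conf=$(mktemp)
printf '[main]\nkeepcache=0\n' > "$demo_conf"
set_yum_proxy "$demo_conf" "http://proxy.corp.com:80" "username" "password"
cat "$demo_conf"
rm -f "$demo_conf" "$demo_conf.bak"
```

On the real node you would run it as root against /etc/yum.conf, with your own proxy URL and credentials in place of the example values.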
22. Let the show begin!
• Make sure SELinux is disabled, or this won’t work!
– View the file /etc/sysconfig/selinux
– Make sure it contains this line:
SELINUX=disabled
– You will need to reboot if you changed the SELINUX
setting
• Launch the Cloudera Manager Installation:
sudo chmod u+x ./cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin
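The SELinux check at the start of this slide can also be scripted before launching the installer. This is a small sketch, not part of Cloudera's tooling: the function names selinux_config_mode and selinux_is_disabled are invented here, and the demo reads a scratch file so it can run on any machine.

```shell
#!/bin/sh
# Hypothetical helpers for checking the configured SELinux mode.

# Print the value of the SELINUX= line in a config file
# (defaults to /etc/sysconfig/selinux when no file is given).
selinux_config_mode() {
    sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "${1:-/etc/sysconfig/selinux}" | tail -n 1
}

# Succeed only if the configured mode is "disabled".
selinux_is_disabled() {
    [ "$(selinux_config_mode "$1")" = "disabled" ]
}

# Demo on a scratch file that imitates a typical config.
demo=$(mktemp)
printf '# sample config\nSELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$demo"
if selinux_is_disabled "$demo"; then
    echo "SELinux is disabled, OK to run the installer"
else
    echo "SELinux is '$(selinux_config_mode "$demo")': set SELINUX=disabled and reboot"
fi
rm -f "$demo"
```

The file only records the mode applied at the next boot; on a running system, the getenforce command reports the mode currently in effect.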
55. How to start Hadooping – using its GUI
option (HUE)
• Download the HUE user guide right here:
https://ccp.cloudera.com/display/CDH4B2/Hue+2.0+User+Guide
58. Wait a Minute…
• Expect undocumented issues unless you do this:
• HUE requires a special user (let’s say “admin”)
• Tell HUE about it, the first time you use it
• Add the user to the Unix system as well
• Add the user to groups “hive” and “hadoop”
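The Unix side of these steps can be sketched as follows. The helper names run_or_print and create_hue_user are invented for this example, "admin" is only the example user name from the slide, and the sketch assumes the "hive" and "hadoop" groups already exist (the CDH packages normally create them). By default it only prints the commands; set DRY_RUN=0 and run as root to execute them.

```shell
#!/bin/sh
# Hypothetical helpers: create a Unix account matching the Hue user and
# add it to the "hive" and "hadoop" groups.

# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run_or_print() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

create_hue_user() {
    hue_user=$1
    # Create the account on the Unix system.
    run_or_print useradd "$hue_user"
    # Append the supplementary groups without dropping existing ones.
    run_or_print usermod -a -G hive,hadoop "$hue_user"
}

# Dry run for the example user from the slide.
DRY_RUN=1 create_hue_user admin
```

Telling Hue about the user still happens in the web interface the first time you use it; this sketch covers only the system account and group membership.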