Getting Started with Hadoop
3
Agenda
• 1999: the Database Story
• Intro to Hadoop
• Basic Hadoop Ecosystem
• Why CDH is Awesome
• Q&A
5
6
Indexing the Web
• Web is huge
• Hundreds of millions of pages in 1999
• How do you index it?
• Crawl all the pages
• Rank pages based on relevance metrics
• Build search index of keywords to pages
• Do it in real-time!
7
8
Databases in 1999
1. Buy a really big machine
2. Install an expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another really big machine as
a backup
9
Database Limitations
• Didn’t scale horizontally
• High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
• Complex analysis (PageRank)
• Unstructured data
10
Google Does Something Different
• Designed their own storage and processing
infrastructure
• Google File System and MapReduce
• Goals:
• Cheap
• Scalable
• Reliable
11
Google Does Something Different
• It worked!
• Powered Google Search for many years
• General framework for large-scale batch
computation tasks
• Still used internally at Google to this day
14
15
Google’s messages from the future
• Google was benevolent enough to publish
• 2003: Google File System (GFS) paper
• 2004: MapReduce paper
• Already mature technologies at this point
16
Google’s messages from the future
• Community didn’t get it immediately
• DB people thought it was silly
• Non-Google companies weren’t at the same scale yet
• Google had little interest in releasing GFS and
MapReduce
• Business was ads, not infrastructure
17
Birth of Hadoop
• Doug Cutting and Mike Cafarella
• Nutch
• Open-source search platform
• Ran into scaling issues
• 4 nodes
• Hard to program
• Hard to manage
• Immediate application for GFS and MR
18
Birth of Hadoop
• 2004-2006:
Implemented GFS/MR
and ported Nutch to it
• 2006: Spun out into
Apache Hadoop
• Name of Doug’s son’s
stuffed elephant
Birth of Hadoop
20
Summary
• The web is huge and unstructured
• Databases didn’t fit the problem
• Didn’t scale, expensive, SQL limitations
• Google did their own thing: GFS + MR
• Hadoop is based on the Google papers
21
HDFS and MapReduce
22
HDFS
• Based on GFS
• Distributed, fault-tolerant filesystem
• Primarily designed for cost and scale
• Works on commodity hardware
• 20 PB / 4,000-node cluster at Facebook
23
HDFS design assumptions
• Failures are common
• Massive scale means more failures
• Disks, network, nodes
• Files are append-only
• Files are large (GBs to TBs)
• Accesses are large and sequential
24
Quick primers
• Filesystems
• Hard drives
• Datacenter networking
25
Quick filesystem primer
• Same concepts as the FS on your laptop
• Directory tree
• Create, read, write, delete files
• Filesystems store metadata and data
• Metadata: filename, size, permissions, …
• Data: contents of a file
• Other concerns
• Data integrity, durability, management
26
Quick disk primer
• Disk does a seek for each I/O operation
• Seeks are expensive (~10ms)
• Throughput / IOPS tradeoff
• 100 MB/s and 10 IOPS (large I/Os)
• 10 MB/s and 100 IOPS (small I/Os)
• Big I/Os mean better throughput
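As a rough worked example (assuming a ~10 ms average seek and ~100 MB/s of sequential transfer): a 10 MB read costs about 10 ms of seek plus 100 ms of transfer, so the disk sustains roughly 9 operations per second at ~90 MB/s, while a 100 KB read costs about 10 ms of seek plus 1 ms of transfer, so the disk sustains roughly 90 operations per second but only ~9 MB/s. Large sequential I/Os amortize the seek cost, which is exactly the access pattern HDFS is designed around.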
27
Quick networking primer
[Diagram: servers grouped into racks; each rack has a top-of-rack switch that uplinks to a core switch]
28
Quick networking primer
[Diagram: link speeds across the tiers – 1, 2, 4, or 10 Gbit from each node to its top-of-rack switch, 10 Gbit uplinks to the core, 40 Gbit at the core switch]
29
HDFS Architecture Overview
[Diagram: NameNode on Host 1, Secondary NameNode on Host 2, and DataNodes on Hosts 3, 4, 5, … n]
30
HDFS Block Replication
Block Size = 64 MB, Replication Factor = 3
[Diagram: a file split into five blocks; each block is stored on three of the five DataNodes, so every block survives the loss of any two nodes]
31
HDFS Write Path
• Talk to NameNode
• Store metadata for new file
• Get topology-aware list of DataNodes
• Set up the write pipeline
• Stream data to pipeline
• Tell NameNode when done
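For orientation, here is a minimal client-side sketch of that flow using the standard org.apache.hadoop.fs.FileSystem API (the path and file contents are made up; the pipeline and block mechanics happen inside the library):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS (the NameNode address) from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to record metadata for the new file and
        // to hand back a pipeline of DataNodes for the first block.
        Path file = new Path("/user/demo/hello.txt");  // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true /* overwrite */)) {
            // Data is streamed to the DataNode pipeline in packets as we write.
            out.writeBytes("hello, hdfs\n");
        }
        // Closing the stream flushes the final block and tells the NameNode
        // that the file is complete.
        fs.close();
    }
}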
32
HDFS Fault-tolerance
• Many different failure modes
• Disk corruption, node failure, switch failure
• Primary concern
• Data is safe!!!
• Secondary concerns
• Keep accepting reads and writes
• Do it transparently to clients
33
MapReduce – Map
• Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key/value pairs: e.g., (filename, line).
• map() produces one or more intermediate values along with an output key from the input.
[Diagram: a Map Task emits (key, value) pairs; the Shuffle Phase groups the intermediate values by key; a Reduce Task turns each group into final (key, values) output]
34
MapReduce – Reduce
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce() combines those intermediate values into one or more final values for that same output key
[Diagram: the same map, shuffle, and reduce flow as above]
35
MapReduce – Shuffle and Sort
36
Word Count Example
Mapper Input:
  The cat sat on the mat
  The aardvark sat on the sofa
Mapping (emit (word, 1) for every word):
  The, 1   cat, 1   sat, 1   on, 1   the, 1   mat, 1
  The, 1   aardvark, 1   sat, 1   on, 1   the, 1   sofa, 1
Shuffling (group intermediate values by key):
  aardvark [1]   cat [1]   mat [1]   on [1, 1]   sat [1, 1]   sofa [1]   the [1, 1, 1, 1]
Reducing (sum each list):
  aardvark, 1   cat, 1   mat, 1   on, 2   sat, 2   sofa, 1   the, 4
Final Result:
  aardvark, 1   cat, 1   mat, 1   on, 2   sat, 2   sofa, 1   the, 4
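As a concrete sketch of this pattern, the same word count written against the org.apache.hadoop.mapreduce Java API looks roughly like the stock Hadoop example below (class names and the lowercasing of words are illustrative choices, not something prescribed by the slides):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(): emit one (word, 1) pair per word in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString().toLowerCase());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(): the shuffle has already grouped all the 1s for a word; sum them
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>; Hadoop runs one map task per input split and writes the summed counts back to HDFS.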
37
Summary
• GFS and MR co-design
• Cheap, simple, effective at scale
• Fault-tolerance baked in
• Replicate data 3x
• Incrementally re-execute computation
• Avoid single points of failure
38
Hadoop Ecosystem Overview
What Are All These Things?
39
Sqoop
Performs bidirectional
data transfers between
Hadoop and almost
any SQL database with
a JDBC driver
40
Flume
A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
[Diagram: many clients send events to a tier of Flume agents]
41
Hive
SELECT s.word, s.freq, k.freq
FROM shakespeare s
JOIN kjv k ON (s.word = k.word)
WHERE s.freq >= 5;
• Relational database abstraction using a SQL-like dialect called HiveQL
• Statements are executed as one or more MapReduce jobs
42
Pig
• High-level scripting language for executing one or more MapReduce jobs
• Created to simplify authoring of MapReduce jobs
• Can be extended with user-defined functions

emps = LOAD 'people.txt' AS (id, name, salary);
rich = FILTER emps BY salary > 200000;
sorted_rich = ORDER rich BY salary DESC;
STORE sorted_rich INTO 'rich_people.txt';
43
HBase
• Low-latency, distributed,
columnar key-value store
• Based on BigTable
• Efficient random
reads/writes on HDFS
• Useful for frontend
applications
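To make the "random reads/writes" point concrete, a minimal sketch with the HBase Java client might look like this (the table, column family, and row key are hypothetical, and it uses the newer Connection/Table client API rather than the original HTable class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum and other settings from hbase-site.xml
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table

            // Random write: key the row by user id, store one cell per attribute
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("user42@example.com"));
            table.put(put);

            // Random read: fetch the row back by key
            Result row = table.get(new Get(Bytes.toBytes("user42")));
            byte[] email = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}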
44
Oozie
A workflow engine and scheduler built specifically for large-scale job orchestration on a Hadoop cluster
45
Hue
• Hue is an open source web-based application for making it easier to use Apache Hadoop.
• Hue features
• File Browser for HDFS
• Job Designer/Browser for MapReduce
• Query editors for Hive, Pig and
Cloudera Impala
• Oozie
46
Zookeeper
• Zookeeper is a distributed consensus
engine
• Provides well-defined concurrent
access semantics:
• Leader election
• Service discovery
• Distributed locking / mutual
exclusion
• Message board / mailboxes
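As a small illustration of how those primitives are typically built, the sketch below registers a worker with an ephemeral sequential znode using the raw org.apache.zookeeper client (the connect string and paths are hypothetical, and the parent znodes are assumed to already exist):

import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkerRegistration {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ZooKeeper ensemble (hypothetical hosts)
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // An EPHEMERAL_SEQUENTIAL node vanishes automatically if this process dies,
        // and the sequence numbers give a total order usable for leader election.
        String path = zk.create("/services/myapp/worker-",
                "host1:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Registered as " + path);

        // Service discovery: every client sees the same list of live workers
        List<String> workers = zk.getChildren("/services/myapp", false);
        System.out.println("Live workers: " + workers);
    }
}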
47
Hadoop Ecosystem
[Diagram: the CDH / Cloudera Enterprise stack, spanning ingest, store, explore, process, analyze, and serve.
Connectors: Sqoop, Flume, FUSE-DFS (file), WebHDFS / HttpFS (REST), ODBC / JDBC (SQL).
Storage: HDFS (Hadoop DFS), HBase.
Resource management & coordination: YARN, ZooKeeper.
Batch processing: MapReduce, MapReduce2, Hive, Pig, Mahout, DataFu.
Real-time access & compute: Impala, HBase.
User interface: Hue. Workflow management: Oozie. Cloud integration: Whirr. Metadata: Hive Metastore.
Integrates with BI, ETL, and RDBMS tools.
Management software & technical support subscription options: Cloudera Manager (core, required), Cloudera Navigator (audit, lineage, access, lifecycle, explore), plus RTD, RTQ, and BDR add-ons.]
48
The Cloudera Advantage
49
Cloudera Impala
Cost-effective, ad hoc query environment that offloads the
data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions
• Data processing with tight SLAs
• Query-able archive w/full fidelity
50
Cloudera Impala
Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL, compatible with HiveQL
Native MPP Query Engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem
Open Source
• Apache-licensed
51
Impala Key Features
Fast
• In-memory data transfers
• Partitioned joins
• Fully distributed aggregations
Flexible
• Query data in HDFS & HBase
• Supports multiple file formats & compression algorithms
Secure
• Integrated with Hadoop security
• Kerberos authentication
• Authorization (Sentry)
Easy to Implement
• Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax
• Open source
Easy to Use
• Interact with data via SQL
• Certified with leading BI tools
Simple to Manage
• Deploy, configure & monitor with Cloudera Manager
• Integrated with Hadoop resource management
52
The Impala Advantage
BI Partners: Building on the Enterprise Standard
53
Cloudera Search
Powerful, proven search capabilities that let organizations:
• Offer easy access to non-technical resources
• Explore data prior to processing and modeling
• Gain immediate access and find correlations in mission-critical data
54
Cloudera Search
Interactive Search for All Data
• Full-text and faceted navigation
• Batch, near real-time, and on-demand indexing
Apache Solr Integrated with CDH
• Established, mature search with vibrant community
• Separate runtime like MapReduce, Impala
• Incorporated as part of the Hadoop ecosystem
Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
55
Search Key Features
Scalable
• Index storage & retrieval on HDFS
• Indexing with MapReduce and Flume
• Shard management with ZooKeeper
Flexible
• Indexing and query of any data in HDFS and HBase
• Support for multiple file formats
• Field mapping and matching with Morphlines
Timely
• Indexing in batch, on-demand, and in near real-time
• Scalable extraction and mapping with built-in Solr sink for Flume
Mature
• Proven, enterprise-ready technology
• Rich ecosystem and knowledge within community
Simple to Use
• Familiar full-text search and faceted navigation
• Out-of-the-box Search GUI
• Known, readily available standard Solr APIs
Easy to Manage
• Integrated with Cloudera Manager and Apache Sentry
• Integrated coordination and execution of jobs
• GoLive for incremental changes
56
The Search Advantage
Search Partners:
Building on the
Cloudera Enterprise
Data Hub
57
Sentry
Open source authorization module for Impala & Hive
Unlocks key RBAC requirements:
• Secure, fine-grained, role-based authorization
• Multi-tenant administration
• Open source, submitted to the ASF
58
Defining Security Functions
• Perimeter – guarding access to the cluster itself. Technical concepts: authentication, network isolation
• Data – protecting data in the cluster from unauthorized visibility. Technical concepts: encryption, tokenization, data masking
• Access – defining what users and applications can do with data. Technical concepts: permissions, authorization
• Visibility – reporting on where data came from and how it’s being used. Technical concepts: auditing, lineage
59
Enabling Enterprise Security
• Perimeter – guarding access to the cluster itself (authentication, network isolation): Kerberos, AD/LDAP
• Data – protecting data in the cluster from unauthorized visibility (encryption, tokenization, data masking): certified partners
• Access – defining what users and applications can do with data (permissions, authorization): Apache Sentry
• Visibility – reporting on where data came from and how it’s being used (auditing, lineage): Cloudera Navigator
60
Authorization Requirements
Secure Authorization
Ability to control access to data and/or privileges on data for authenticated users
Fine-Grained Authorization
Ability to give users access to a subset of data (e.g. column) in a database
Role-Based Authorization
Ability to create/apply templatized privileges based on functional roles
Multitenant Administration
Ability for central admin group to empower lower-level admins to manage
security for each database/schema
61
Key Capabilities of Sentry
Fine-Grained Authorization
Specify security for SERVERS, DATABASES, TABLES & VIEWS
Role-Based Authorization
SELECT privilege on views & tables
INSERT privilege on tables
TRANSFORM privilege on servers
ALL privilege on the server, databases, tables & views
ALL privilege is needed to create/modify schema
Multitenant Administration
Separate policies for each database/schema
Can be maintained by separate admins
62
Challenges with Hadoop without Management
Complexity – Hadoop is more than a dozen services running across many machines
• Hundreds of hardware components
• Thousands of settings
• Limitless permutations
Context – Hadoop is a system, not just a collection of parts
• Everything is interrelated
• Raw data about individual pieces is not enough
• Must extract what’s important
Efficiency – Managing Hadoop with multiple tools and manual processes takes longer
• Complicated, error-prone workflows
• Longer issue resolution
• Lack of consistent and repeatable processes
63
Cloudera Manager
End-to-End Administration for Your Enterprise Data Hub
1. Manage – easily deploy, configure & optimize clusters
2. Monitor – maintain a central view of all activity
3. Diagnose – easily identify and resolve issues
4. Integrate – use Cloudera Manager with existing tools
64
One Tool For Everything
Managing Complexity
[Diagram: do-it-yourself management stitches together separate tools for deployment & configuration, monitoring, workflows, events & alerts, log search, diagnostics, reporting, and activity monitoring; with Cloudera Manager these are all in one tool]
65
Three-Year TCO Comparison
Maximizing Efficiency
[Chart: three-year total cost of ownership, self-managed vs. Cloudera Enterprise –
25 nodes: $1.8M vs. $852K ($948K savings);
50 nodes: $3.5M vs. $1.7M ($1.8M savings);
100 nodes: $8.4M vs. $5.4M ($3M savings)]
66
Why Cloudera Manager
• Simple – end-to-end administration for the Enterprise Data Hub in a single tool
• Intelligent – manages Hadoop at a system level; Cloudera’s experience realized in software
• Efficient – simplifies complex workflows and makes administrators more productive
• Best-in-Class – the only enterprise-grade Hadoop management application available
67
Why Backup and Disaster Recovery?
1. Cloudera Enterprise is a mission-critical part of the data management infrastructure
• Stores valuable data and runs important workloads
• Business continuity is a MUST HAVE
2. Managing business continuity for Hadoop is complex
• Different services that store data – HDFS, HBase, Hive
• Backup and disaster recovery is configured separately for each
• Processes are manual
68
BDR in Cloudera Enterprise
Simplified Management of Backup & DR Policies
Central Configuration
• HDFS – select files & directories to replicate
• Hive – select tables to replicate
• Schedule replication jobs for optimal times
Monitoring & Alerting
• Track progress of replication jobs
• Get notified when data is out of sync
Performance & Reliability
• High-performance replication using MapReduce
• CDH-optimized version of DistCp
[Diagram: HDFS and Hive data replicated from nodes at Site A to nodes at Site B]
69
Benefits of BDR
Reduce Complexity
• Centrally manage backup and DR workflows
• Simple setup via an intuitive user interface
Maximize Efficiency
• Simplify processes to meet or exceed SLAs and Recovery Time Objectives (RTOs)
• Optimize system performance and network impact through scheduling
Reduce Risk & Exposure
• Eliminate error-prone manual processes
• Get notified when issues occur
• The only solution for metadata replication (Hive)
70
One Tool For Everything
Managing Complexity
[Diagram: same comparison as before – deployment & configuration, monitoring, workflows, events & alerts, log search, diagnostics, reporting, and activity monitoring, do-it-yourself vs. with Cloudera Manager]
71
Cloudera Manager Key Features
Install a Cluster in Three Simple Steps
1. Find Nodes – enter the names of the hosts which will be included in the Hadoop cluster. Click Continue.
2. Install Components – Cloudera Manager automatically installs the CDH components on the hosts you specified.
3. Assign Roles – verify the roles of the nodes within your cluster. Make changes as necessary.
72
View Service Health and Performance
Cloudera Manager Key Features
73
Monitor and Diagnose Cluster Workloads
Cloudera Manager Key Features
74
Rolling Upgrades
Cloudera Manager Key Features
75
Why You Need Cloudera Navigator
1. Lots of data landing in Cloudera Enterprise
• Huge quantities
• Many different sources – structured and unstructured
• Varying levels of sensitivity
2. Many users working with the data
• Administrators and compliance officers
• Analysts and data scientists
• Business users
3. Need to effectively control and consume data
• Get visibility and control over the environment
• Discover and explore data
76
Cloudera Navigator
Data Management Layer for Cloudera Enterprise
Audit & Access Control
Ensuring appropriate permissions and reporting on
data access for compliance
Discovery & Exploration
Finding out what data is available and what it
looks like
Lineage
Tracing data back to its original source
Lifecycle Management
Migration of data based on policies
Enterprise Metadata Repository
• Business metadata
• Lineage metadata
• Operational metadata
[Diagram: Cloudera Navigator layers audit & access control, discovery & exploration, lineage, and lifecycle management over the CDH services HDFS, HBase, and Hive]
77
Cloudera Navigator
Data Audit & Access Control
Verify Permissions
View which users and groups have access to files and
directories
Audit Configuration
Configuration of audit tracking for HDFS, HBase
and Hive
Audit Dashboard
Simple, queryable interface to view data access
Information Export
Export audit information for integration with SIEM tools
[Diagram: Cloudera Navigator 1.0 – an access service reads view permissions from the IAM/LDAP system, and an audit log service collects audit logs from HDFS, HBase, and Hive and exports them to third-party SIEM/GRC systems]
78
Benefits of Cloudera Navigator
Control
• Store sensitive data
• Maintain full audit history
• The first and only centralized audit tool for Hadoop
Visibility
• Verify access permissions to files and directories
• Report on data access by user and type
Integration
• View permissions for LDAP/IAM users
• Export audit data for integration with third-party SIEM tools