Hadoop in
the Enterprise
A Dell Technical White Paper
By Joey Jablonski




Dell | Hadoop White Paper Series
Dell | Hadoop White Paper Series: Hadoop in the Enterprise




Table of Contents
Introduction
Managing Hadoop as an island versus part of corporate IT
Top challenges and methods to overcome
       Automated deployment
       Configuration management
       Monitoring and alerting
Hardware sizing
       Hadoop node sizing
       Hadoop cluster sizing parameters
Hadoop configuration parameters
Hadoop network design
       Dell recommended network architecture
Hadoop security
       Authentication
       Authorization
       Logging
Why Dell
About the author
Special thanks
About Dell Next Generation Computing Solutions
References
To learn more




This white paper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express
or implied warranties of any kind.


© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden.
For more information, contact Dell. Dell, the Dell logo, the Dell badge, and PowerEdge are trademarks of Dell Inc.





Introduction
This is the second in a series of white papers from Dell about Hadoop. If you are new to Hadoop or Dell’s Hadoop
solutions, Dell recommends reading the “Introduction to Hadoop” white paper before this paper.

Hadoop is becoming a critical part of many modern information technology (IT) departments. It is being used for a
growing range of requirements, including analytics, data storage, data processing, and shared compute resources. As
Hadoop’s significance grows, it should be treated, and managed, as a component of your larger IT organization.
Hadoop is no longer relegated to research projects; it should be managed as your company would manage any other
large component of its IT infrastructure.

Analytics is a growing need in many organizations. As the volumes of data from multiple sources continue to grow,
Hadoop has emerged as a leading enterprise tool for ingesting that data and providing the means to analyze it.

Data warehouses are a key component of many IT departments. They provide the business intelligence that many
companies use for decision making. Hadoop is beginning to augment them as a central location that feeds many different
business intelligence platforms within an organization. This means that Hadoop must be at least as available as the tools
it feeds and supports.

Hadoop has taken a standard path into most IT organizations, first being used for testing and development, then
migrating into production operations. Because of this, your IT department must ensure that the Hadoop environment is
properly planned and designed at initial deployment to support the more rigorous demands of a production environment.
This paper discusses the considerations that help your IT department ensure the Hadoop environment can grow and
change with your business needs, without major refactoring work after the initial deployment due to suboptimal
solution architecture.


Managing Hadoop as an island versus part of corporate IT
Hadoop environments can contain many hundreds or possibly thousands of servers. This large number of devices can
become a management burden for your IT department. As your environment grows, your IT administrators can be left
struggling with complexity.

Maximum attention should be paid during your initial Hadoop deployment to ensure it does not become an island that IT
must manage outside of standard tools and processes. A Hadoop environment, optimally, should utilize external company
shared resources for authentication, monitoring, backup, alerting, and processes. This integration, early and often, will
ensure that the Hadoop environment does not consume an unnecessary amount of time from the IT department relative
to other applications within the corporation.

The design of a Hadoop solution should be optimized for the performance needs and usage model of the intended
Hadoop environment. That is not to say it should diverge needlessly from other solutions in the environment. The
Hadoop environment should share IT best practices and processes with other solutions in the enterprise to ensure
consistency in deployment and operations. This consistency can come from common hardware, common software, or
both: common hardware eases servicing, while common software ensures that tools, scripts, and processes do not
require major modification to support the Hadoop environment.

Hadoop deployment can be a time-consuming process. The number of systems involved in a Hadoop environment can
easily overwhelm the most experienced of system administrators. It is important that an automated solution be utilized for
deploying the Hadoop environment, both to save time and ensure consistency. If your enterprise has an existing
operating system (OS) deployment strategy, it should be evaluated to determine if it will be viable for Hadoop
deployment; if not, a vendor deployment strategy for the Hadoop environment should be considered.

Operating most IT environments can be more costly than the initial purchase and deployment. Processes for IT
operations, support, and escalations should be updated to accommodate the Hadoop environment—you do not want to
start over and create a parallel set of processes and structure. Like any IT environment, Hadoop will require regular
monitoring and user support. Documentation should be updated to accommodate the differences in the Hadoop
environment, and staff should be trained to ensure long-term support of the Hadoop environment.





Top challenges and methods to overcome
Automated deployment
Automated deployment of both operating systems and the Hadoop software ensures consistent, streamlined
deployments. Dell provides proven configurations that are documented and simpler to deploy than traditional manual IT
deployment strategies. Dell augments these proven solutions with services and tools to streamline solution deployment,
testing, and field validation.

Configuration management
Dell recommends the use of a configuration management tool for all Hadoop environments. Dell | Hadoop solutions
include Chef for this purpose. Chef is used for deploying configuration changes, managing the installation of the Hadoop
software, and providing a single interface to the Hadoop environment for updates and configuration changes.
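Chef recipes themselves are written in Ruby, but the core idea they implement, converging each node toward a declared desired state, can be sketched in a few lines of Python. The configuration keys and values below are hypothetical examples, not Dell-recommended settings.

```python
# Illustrative sketch of the idempotent "converge" model a tool like Chef
# applies: declare the desired state, compare it with the current state,
# and change only what differs. Keys and values here are hypothetical.
def converge(current: dict, desired: dict) -> dict:
    """Return only the settings that must change to reach the desired state."""
    return {k: v for k, v in desired.items() if current.get(k) != v}

current_conf = {"dfs.replication": "3", "dfs.blocksize": "67108864"}
desired_conf = {"dfs.replication": "3", "dfs.blocksize": "134217728"}

# Only dfs.blocksize differs, so only it would be rewritten on the node
# (and the affected service restarted).
changes = converge(current_conf, desired_conf)
```

Running the same converge step repeatedly is safe: once the current state matches the desired state, it yields no changes, which is what makes automated configuration management predictable at cluster scale.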

Monitoring and alerting
Hardware monitoring and alerting is an important part of all dynamic IT environments. Successful monitoring and alerting
ensures that problems are caught as soon as possible and administrators are alerted so the problems can be corrected
before users are impacted. Dell provides support for integration with Nagios and Ganglia as part of our Hadoop solution
stack for monitoring the software and hardware environment.
Dell also recommends integrating the Hadoop environment into any existing enterprise monitoring and management
packages. Dell | Hadoop solutions support standard interfaces, including Simple Network Management Protocol (SNMP)
and Intelligent Platform Management Interface (IPMI) for integration with third-party management and operations tools.
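As a rough illustration of how such monitoring hooks work, the following sketch mimics a Nagios-style check. The metric value, thresholds, and function name are hypothetical; a real plugin would query the NameNode (for example via its JMX interface) rather than take the value as an argument.

```python
# Sketch of a Nagios-style check; exit codes follow the Nagios plugin
# convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2

def check_hdfs_capacity(used_pct, warn=80.0, crit=90.0):
    """Map an HDFS disk-usage percentage to a Nagios exit code."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK

status = check_hdfs_capacity(used_pct=72.5)   # below both thresholds -> OK
```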


Hardware sizing
Sizing Hadoop environments is often an error-prone process of trial and error. With Dell | Hadoop solutions, Dell
provides tested, known configurations that streamline the sizing process and deliver guidance based on known,
real-world workloads.

There are two aspects to sizing within a Hadoop environment:

    1.   Individual node size – This is the hardware configuration of each individual node.

    2.   Hadoop cluster – This is the size of the entire Hadoop environment, including the number of nodes and
         interconnects between them.

Hadoop node sizing
The four primary sizing considerations for Hadoop nodes are physical memory, disk capacity, network bandwidth, and
CPU speed and core count. A properly balanced Hadoop node provides enough of each resource so that a shortage in
one category does not become a bottleneck that degrades overall performance.

Simpler is better when sizing any Hadoop environment. Feedback from seasoned Hadoop users consistently carries the
same recommendation: targeting a middle-ground configuration, with a good balance among the key components of the
individual servers, is better than trying to optimize the hardware design for the expected workload. First, workloads
change much more often than the hardware is replaced. Second, the mix of user workloads will also change over time,
and a balanced configuration, while not optimized for a specific use, will not penalize any one job type over another.
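One way to sanity-check a candidate node configuration for balance is to look at simple per-core ratios. The function below is an illustrative sketch; the example figures are hypothetical and are not Dell sizing guidance.

```python
# Illustrative balance check for a candidate DataNode configuration.
# The example figures are hypothetical, not Dell sizing guidance.
def node_balance(cores, ram_gb, disks, nic_gbps):
    """Return simple per-core ratios used to spot an obvious bottleneck."""
    return {
        "ram_per_core_gb": ram_gb / cores,
        "disks_per_core": disks / cores,
        "gbps_per_core": nic_gbps / cores,
    }

ratios = node_balance(cores=12, ram_gb=48, disks=12, nic_gbps=2)
```

If one ratio is far out of line with comparable clusters, say, very little memory per core, that resource is the likely bottleneck.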

Hadoop cluster sizing parameters
Sizing a Hadoop cluster is a different activity from sizing the individual nodes, and should be planned for accordingly.
There are separate parameters that must be considered when sizing the entire cluster; these include number of jobs to be
run, data volume, expected growth, and number of users. Considerations should also be taken for the amount of data to
be replicated, the number of data replicas, and the availability needs of the Hadoop cluster.
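A back-of-envelope calculation ties several of these parameters together. The 25% allowance for temporary and intermediate data and the per-node usable capacity below are assumptions for illustration only.

```python
import math

# Back-of-envelope DataNode count. The 25% allowance for temporary and
# intermediate data and the 12 TB usable per node are assumptions.
def datanodes_needed(data_tb, replication=3, node_usable_tb=12.0,
                     temp_overhead=0.25):
    """Nodes required to hold data_tb of source data after replication."""
    raw_tb = data_tb * replication * (1 + temp_overhead)
    return math.ceil(raw_tb / node_usable_tb)

# 100 TB of source data at 3x replication: 100 * 3 * 1.25 = 375 TB raw.
nodes = datanodes_needed(100)
```

Expected growth can then be folded in by sizing for the projected data volume at the end of the hardware's service life rather than the volume on day one.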

Hadoop contains more than 200 tunable parameters, many of which will not influence your specific job. Dell | Hadoop
solutions provide recommended configuration parameters for each of the three use cases of Hadoop: Compute, Storage,
and Database. Dell has validated these parameters for documented workloads, streamlining system bring-up and tuning
for new Hadoop environments.




Hadoop configuration parameters
Hadoop tuning and optimization is a never-ending process. This recurring process is depicted in Figure 1. Each cycle
begins with "Determine parameter to change"; the process then proceeds clockwise and repeats. Every Hadoop
environment should be reviewed for tuning and performance optimization whenever the mix of jobs and workload
characteristics changes.




Figure 1. The ongoing Hadoop tuning and optimization process.
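The cycle in Figure 1 can be sketched as a simple loop: change one parameter, rerun a representative job, and keep the change only if it helps. Here benchmark() and apply() are hypothetical stand-ins for real measurement and deployment tooling.

```python
# Sketch of the recurring cycle in Figure 1: change one parameter, run a
# representative job, measure, and keep or revert. benchmark() and
# apply() are hypothetical stand-ins for real measurement and deployment.
def tuning_cycle(params, candidates, benchmark, apply):
    """Try each (name, value) candidate; keep only changes that help."""
    best = benchmark()                 # baseline job runtime
    for name, value in candidates:
        previous = params.get(name)
        params[name] = value
        apply(params)                  # push the change to the cluster
        runtime = benchmark()          # rerun the representative job
        if runtime < best:
            best = runtime             # improvement: keep the change
        else:
            params[name] = previous    # no improvement: revert
            apply(params)
    return best
```

Changing one parameter per iteration keeps cause and effect clear; with more than 200 tunables, changing several at once makes results hard to attribute.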


Hadoop network design
The network design is a key component of a Hadoop environment; it must account for the expected usage of the
environment as well as its scalability. The design should factor in the expected workloads, provide for adequate growth
of the environment without major rework, and support monitoring that alerts administration staff to network congestion.

Isolation is a primary design requirement for Hadoop clusters. The network that the Hadoop nodes use for node-to-node
communication should be isolated from other corporate networks. This ensures maximum bandwidth for the Hadoop
environment and minimizes the impact of other operations on Hadoop.

One important consideration in the network design for Hadoop solutions is whether the switches can accommodate
high-packet-count communication patterns. Hadoop can create large volumes of packets on the network during rebuild
operations as well as during high Hadoop Distributed File System (HDFS) I/O activity. The network architecture should
ensure the switches can handle this traffic pattern without additional network latency or dropped packets. The Dell
white paper "Introduction to Hadoop" covers our network architecture in greater detail.

Like many modern applications, Hadoop utilizes IP for all communications. This means Hadoop has the same restrictions
and design requirements you would see for any other network regarding broadcast domains, network separation, and
routing and switching considerations. Dell recommends that Hadoop clusters utilize approximately 60 nodes within a
single switched environment. For clusters larger than that, Dell recommends utilizing a layer 3 network device to segment
the network and maintain adequately sized broadcast domains.

Today, most Hadoop solutions are built utilizing Gigabit Ethernet. Many DataNodes will utilize two Gigabit Ethernet
connections in an aggregated link to the switch, providing additional bandwidth for the node. Some users are beginning
to look at 10 Gigabit Ethernet as its price continues to come down, but the majority of users today do not require the
additional performance of 10 Gigabit Ethernet and can save cost in their Hadoop environments by utilizing Gigabit
Ethernet.

Availability of the network infrastructure is a primary concern if the Hadoop environment is critical to business operations
or functions. The network should include the appropriate level of redundancy to ensure business functions will continue if
components within the network fail. Hadoop has facilities within the software for handling failures of the hardware and
network; these should be accounted for when determining which tiers of the network will contain redundant
components and which will not. You can gain significant efficiency by leveraging Hadoop’s ability to replicate
information to separate racks or to servers on alternate switches, preserving access in the event of a network failure.

Dell recommended network architecture
Figure 2 shows the recommended network architecture for the top-of-rack (ToR) switches within a Hadoop environment.
These connect directly to the DataNodes and allow for all inter-node communication within the Hadoop environment.
The standard configuration Dell recommends is six 48-port Gigabit Ethernet switches with stacking capabilities (e.g.,
Dell™ PowerConnect™ 6248). These six switches are stacked and act as a single switch for management purposes.
This network configuration will support up to 60 DataNodes for Hadoop; additional nodes can easily be added by utilizing
an end-of-row (EoR) switch as described in the next section.




Figure 2. Recommended network architecture for the top-of-rack switches within a Hadoop environment.


Hadoop is a highly scalable software platform that requires a network designed with the same scalability in mind. To
ensure maximum scalability, Dell recommends a network architecture that allows you to start with small Hadoop
configurations and grow those over time by adding components, without requiring a rework of the existing environment.
To meet that design goal, Dell recommends the use of two switches acting as EoR devices. These connect to the ToR
switches, as shown in Figure 3, and add routing and advanced functionality to scale beyond 60 DataNodes. These two
EoR switches allow a maximum of 720 DataNodes in the Hadoop environment before an additional layer of network
connectivity, or larger EoR switches, is needed.




Figure 3. Two switches acting as EoR devices and connecting to ToR switches.
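The scaling arithmetic above reduces to a small calculation: each stack of six 48-port ToR switches serves 60 DataNodes, and the two EoR switches are assumed to accommodate up to twelve such stacks (12 × 60 = 720). The sketch below simply encodes those limits.

```python
# The scaling limits described above, encoded directly: one stack of six
# 48-port ToR switches serves 60 DataNodes, and the two EoR switches are
# assumed to take up to twelve such stacks (12 x 60 = 720 DataNodes).
STACK_NODES = 60
MAX_STACKS = 12

def max_datanodes(stacks):
    """DataNodes supported by `stacks` ToR stacks behind one EoR pair."""
    if stacks > MAX_STACKS:
        raise ValueError("requires another network tier or larger EoR switches")
    return stacks * STACK_NODES
```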





Hadoop security
Authentication
Hadoop supports a variety of authentication mechanisms, including Kerberos, Active Directory, and Lightweight Directory
Access Protocol (LDAP). All of these allow a list of authorized users and their credentials to be stored centrally and
validated against from the Hadoop environment. Dell recommends utilizing an existing companywide authentication
scheme for Hadoop, eliminating the need to support a separate authentication system for Hadoop alone.

Authorization
Authorization is an additional layer on top of the authentication that must occur for users. Authentication verifies that
the user’s credentials are correct: in essence, that the user name and password are valid, active, and current.
Authorization builds on that validation to determine whether the user is allowed to perform the requested action.
Authorization is commonly implemented as file permissions in a file system and as access controls within a relational
database management system (RDBMS).
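A toy sketch can make the separation of the two layers concrete. The user, password, path, and ACL entries below are entirely made up; real deployments would delegate authentication to Kerberos or LDAP and authorization to HDFS permissions.

```python
# Toy model separating the layers: authenticate() validates credentials,
# authorize() decides whether the verified identity may act. The user,
# password, path, and ACL below are entirely made up.
USERS = {"alice": "s3cret"}                      # credential store
ACLS = {"/data/sales": {"alice": {"read"}}}      # path -> user -> actions

def authenticate(user, password):
    return USERS.get(user) == password

def authorize(user, path, action):
    return action in ACLS.get(path, {}).get(user, set())

def access(user, password, path, action):
    # Authorization only matters once authentication has succeeded.
    return authenticate(user, password) and authorize(user, path, action)
```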

By utilizing centralized authentication and authorization for Hadoop and other corporate services, security models can be
developed between the environments to ensure permissions are properly mapped across environments. Hadoop is
commonly used as an intermediary point for data processing and storage; it should enforce all corporate security policies
for data that passes through Hadoop and is processed by Hadoop.

Logging
Logging is a critical part of both system operation and security. The logs from Hadoop and the underlying systems
provide insight into unexpected access that could lead to a system or data compromise, as well as into system stability
issues. Hadoop environments should utilize a central logging facility for correlating logs, recovering logs from failed
hosts, and alerting on unexpected events and anomalies.
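A minimal sketch of funnelling application logs through one handler with Python's standard logging module follows. In a real deployment the handler would be a logging.handlers.SysLogHandler pointed at the central log host; a small in-memory handler stands in here so the sketch runs without a network.

```python
import logging

# In a real deployment this would be logging.handlers.SysLogHandler
# pointed at the central log host; an in-memory handler stands in so
# the sketch runs without a network.
collected = []

class ListHandler(logging.Handler):
    def emit(self, record):
        collected.append(self.format(record))

logger = logging.getLogger("hadoop.app")
logger.setLevel(logging.INFO)
handler = ListHandler()
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("HDFS rebalance complete")
# collected now holds "hadoop.app INFO HDFS rebalance complete"
```

Routing every component through one consistently formatted channel is what makes the cross-stack correlation described below practical.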

Both in development environments and as your environment grows, it is beneficial to compare logs from Hadoop with
logs from the applications that use Hadoop. This capability enables your administrators to correlate problems across
the entire stack of components that make up a functioning application.


Why Dell
Dell has worked with Cloudera to design, test, and support an integrated solution of hardware and software for
implementing the Hadoop ecosystem. This solution has been engineered and validated to work together and provide
known performance parameters and deployment methods. Dell recommends that you utilize known hardware and
software solutions when deploying Hadoop to ensure low-risk deployments and minimal compatibility issues. Dell’s
solution ensures that you get maximum performance with minimal testing prior to purchase.

Dell recommends that you purchase and maintain support on the entire ecosystem of your Hadoop solution. Today’s
solutions are complex combinations of components that require upgrades as new software becomes available and
assistance when staff is working on new parts of the solution. The Dell | Cloudera Hadoop solution provides a full line of
support, including hardware and software, so you always have a primary contact for assistance, ensuring maximum
availability and stability of your Hadoop environment.


About the author
Joey Jablonski is a principal solution architect with Dell’s Data Center Solutions team. Joey works to define and
implement Dell’s solutions for big data, including solutions based on Apache Hadoop. Joey has spent more than 10 years
working in high-performance computing, with an emphasis on interconnects, including InfiniBand, and on parallel file
systems. Joey has led technical solution design and implementation at Sun Microsystems and Hewlett-Packard, as well as
consulted for customers, including Sandia National Laboratories, BP, ExxonMobil, E*Trade, Juelich Supercomputing
Centre, and Clumeq.





Special thanks
The author extends special thanks to:

            Aurelian Dumitru, Senior Cloud Solutions Architect, Dell

            Rebecca Brenton, Cloud Software Alliances, Dell

            Scott Jensen, Director, Cloud Solutions Software Engineering, Dell




About Dell Next Generation Computing Solutions
When cloud computing is the core of your business and its efficiency and vitality underpin your success, the Dell Next
Generation Computing Solutions are Dell’s response to your unique needs. We understand your challenges—from
compute and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune
your company’s “factory” for maximum performance and efficiency.

Dell Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the
needs of companies at all stages of their lifecycles. Solutions are designed to meet the needs of small startups while
allowing scalability as your company grows.

Deployment and support are tailored to your unique operational requirements. Dell Cloud Computing Solutions can help
you minimize the tangible operating costs that have hyperscale impact on your business results.


References
Chef
http://wiki.opscode.com/display/chef/Home

SNMP
http://www.net-snmp.org/

IPMI
http://www.intel.com/design/servers/ipmi/ipmi.htm

Cloudera Hadoop
http://www.cloudera.com/products-services/enterprise/




To learn more
To learn more about Dell cloud solutions, contact your Dell representative or visit:

www.dell.com/cloud


© 2011 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerConnect are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to
refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others. This document is for informational purposes
only. Dell reserves the right to make changes without further notice to the products herein. The content provided is as-is and without expressed or implied warranties of any kind.




Introduction to HadoopIntroduction to Hadoop
Introduction to HadoopJoey Jablonski
 
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesWebinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesCloudera, Inc.
 
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.OW2
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Etu Solution
 
Hadoop Enterprise Readiness
Hadoop Enterprise ReadinessHadoop Enterprise Readiness
Hadoop Enterprise Readinessad17633
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionBenoit Perroud
 
Integrating Big Data Technologies
Integrating Big Data TechnologiesIntegrating Big Data Technologies
Integrating Big Data TechnologiesDATAVERSITY
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataCloudera, Inc.
 
Cloudera Manager Webinar | Cloudera Enterprise 3.7
Cloudera Manager Webinar | Cloudera Enterprise 3.7Cloudera Manager Webinar | Cloudera Enterprise 3.7
Cloudera Manager Webinar | Cloudera Enterprise 3.7Cloudera, Inc.
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopCloudera, Inc.
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...Amr Awadallah
 
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...Cloudera, Inc.
 
Hadoop training-and-placement
Hadoop training-and-placementHadoop training-and-placement
Hadoop training-and-placementsofia taylor
 
Hadoop training-and-placement
Hadoop training-and-placementHadoop training-and-placement
Hadoop training-and-placementIqbal Patel
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Cloudera, Inc.
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 

Similaire à Hadoop in the Enterprise (20)

Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop Business Cases
Hadoop Business CasesHadoop Business Cases
Hadoop Business Cases
 
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo SlidesWebinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
Webinar | From Zero to Big Data Answers in Less Than an Hour – Live Demo Slides
 
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
Scalable ETL with Talend and Hadoop, Cédric Carbone, Talend.
 
Hadoop
HadoopHadoop
Hadoop
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
Hadoop Enterprise Readiness
Hadoop Enterprise ReadinessHadoop Enterprise Readiness
Hadoop Enterprise Readiness
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment Evolution
 
Integrating Big Data Technologies
Integrating Big Data TechnologiesIntegrating Big Data Technologies
Integrating Big Data Technologies
 
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Data Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big DataData Science Day New York: The Platform for Big Data
Data Science Day New York: The Platform for Big Data
 
Cloudera Manager Webinar | Cloudera Enterprise 3.7
Cloudera Manager Webinar | Cloudera Enterprise 3.7Cloudera Manager Webinar | Cloudera Enterprise 3.7
Cloudera Manager Webinar | Cloudera Enterprise 3.7
 
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache HadoopBusiness Intelligence and Data Analytics Revolutionized with Apache Hadoop
Business Intelligence and Data Analytics Revolutionized with Apache Hadoop
 
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
How Apache Hadoop is Revolutionizing Business Intelligence and Data Analytics...
 
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
How CBS Interactive uses Cloudera Manager to effectively manage their Hadoop ...
 
Hadoop training-and-placement
Hadoop training-and-placementHadoop training-and-placement
Hadoop training-and-placement
 
Hadoop training-and-placement
Hadoop training-and-placementHadoop training-and-placement
Hadoop training-and-placement
 
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
Hadoop World 2011: How Hadoop Revolutionized Business Intelligence and Advanc...
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 

Dernier

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Dernier (20)

Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Logging                                                                                                           7
Why Dell                                                                                                          7
About the author                                                                                                  7
Special thanks                                                                                                    8
About Dell Next Generation Computing Solutions                                                                    8
References                                                                                                        8
To learn more                                                                                                     8

This white paper is for informational purposes only, and may contain typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any kind.

© 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell. Dell, the Dell logo, the Dell badge, and PowerEdge are trademarks of Dell Inc.
Introduction

This is the second in a series of white papers from Dell about Hadoop. If you are new to Hadoop or to Dell's Hadoop solutions, Dell recommends reading the "Introduction to Hadoop" white paper before this paper.

Hadoop is becoming a critical part of many modern information technology (IT) departments. It is being used for a growing range of requirements, including analytics, data storage, data processing, and shared compute resources. As Hadoop's significance grows, it is important that it be treated and managed as a component of your larger IT organization. Hadoop is no longer relegated to research projects, and should be managed as your company would manage any other large component of its IT infrastructure.

Analytics is a growing need in many organizations. As the volumes of data from multiple sources continue to grow, Hadoop is in the lead as an enterprise tool that can ingest that data and provide the means to analyze it.

Data warehouses are a key component of many IT departments. They provide the business intelligence that many companies use for decision making. Hadoop is beginning to augment these as a central location that feeds many different business intelligence platforms within an organization. This means Hadoop must be available at the same level as, or greater than, the tools it feeds and supports.

Hadoop has taken a standard path into most IT organizations: first being used for testing and development, then migrating into production operations. Because of this, your IT department must ensure that the Hadoop environment is properly planned and designed at initial deployment to support the more rigorous demands of a production environment.
This paper discusses the considerations that help your IT department ensure your Hadoop environment can grow and change with your business needs, without requiring major refactoring after the initial deployment due to suboptimal solution architecture.

Managing Hadoop as an island versus part of corporate IT

Hadoop environments can contain many hundreds or possibly thousands of servers. This large number of devices can become a management burden for your IT department, and as the environment grows, your IT administrators can be left struggling with complexity. Pay close attention during your initial Hadoop deployment to ensure it does not become an island that IT must manage outside of standard tools and processes. Optimally, a Hadoop environment should utilize external company shared resources for authentication, monitoring, backup, alerting, and processes. Integrating early and often ensures that the Hadoop environment does not consume an unnecessary amount of the IT department's time relative to other applications within the corporation.

The design of a Hadoop solution should be optimized for the performance needs and usage model of the intended environment. That is not to say it should completely contradict other solutions in the environment: the Hadoop environment should share IT best practices and processes with other solutions in the enterprise to ensure consistency in deployment and operations. This consistency can come from common hardware or common software across IT environments. Common hardware ensures ease of servicing, while common software ensures that tools, scripts, and processes do not require major modification to support the Hadoop environment.

Hadoop deployment can be a time-consuming process. The number of systems involved in a Hadoop environment can easily overwhelm the most experienced system administrators. It is important to use an automated solution for deploying the Hadoop environment, both to save time and to ensure consistency. If your enterprise has an existing operating system (OS) deployment strategy, evaluate whether it is viable for Hadoop deployment; if not, consider a vendor deployment strategy for the Hadoop environment.

Operating most IT environments can be more costly than the initial purchase and deployment. Processes for IT operations, support, and escalations should be updated to accommodate the Hadoop environment; you do not want to start over and create a parallel set of processes and structure. Like any IT environment, Hadoop will require regular monitoring and user support. Documentation should be updated to reflect the differences in the Hadoop environment, and staff should be trained to ensure long-term support of the environment.
Top challenges and methods to overcome

Automated deployment

Automated deployment of both the operating system and the Hadoop software ensures consistent, streamlined deployments. Dell provides proven configurations that are documented and simpler to deploy than traditional manual IT deployment strategies. Dell augments these proven solutions with services and tools to streamline solution deployment, testing, and field validation.

Configuration management

Dell recommends the use of a configuration management tool for all Hadoop environments. Dell | Hadoop solutions include Chef for this purpose. Chef is used for deploying configuration changes, managing the installation of the Hadoop software, and providing a single interface to the Hadoop environment for updates and configuration changes.

Monitoring and alerting

Hardware monitoring and alerting is an important part of all dynamic IT environments. Successful monitoring and alerting ensures that problems are caught as early as possible and that administrators are alerted, so problems can be corrected before users are impacted. Dell provides support for integration with Nagios and Ganglia as part of its Hadoop solution stack for monitoring the software and hardware environment. Dell also recommends integrating the Hadoop environment into any existing enterprise monitoring and management packages. Dell | Hadoop solutions support standard interfaces, including Simple Network Management Protocol (SNMP) and Intelligent Platform Management Interface (IPMI), for integration with third-party management and operations tools.

Hardware sizing

Sizing Hadoop environments is often a trial-and-error process. With Dell | Hadoop solutions, Dell provides tested, known configurations to streamline the sizing process and deliver sizing guidance based on known, real-world workloads. There are two aspects to sizing within a Hadoop environment:

1. Individual node size – the hardware configuration of each individual node.
2. Hadoop cluster size – the size of the entire Hadoop environment, including the number of nodes and the interconnects between them.

Hadoop node sizing

The four primary sizing considerations for Hadoop nodes are physical memory, disk capacity, network bandwidth, and CPU speed and core count. A properly balanced Hadoop node contains enough of each resource, without a shortage of one creating a bottleneck that negatively impacts performance. Simpler is better when sizing any Hadoop environment. Feedback from seasoned Hadoop users is consistently the same: targeting a middle-ground configuration, with a good balance of the key components of the individual servers, is better than trying to optimize the hardware design for the expected workload. First, workloads change much more often than the hardware is replaced. Second, the mix of user workloads will also change over time, and a balanced configuration, while not optimized for a specific use, will not penalize any one job type over another.

Hadoop cluster sizing parameters

Sizing a Hadoop cluster is a different activity from sizing the individual nodes, and should be planned for accordingly. Separate parameters must be considered when sizing the entire cluster, including the number of jobs to be run, data volume, expected growth, and number of users. Consideration should also be given to the amount of data to be replicated, the number of data replicas, and the availability needs of the Hadoop cluster.

Hadoop contains more than 200 tunable parameters, many of which will not influence your specific job. Dell | Hadoop solutions provide recommended configuration parameters for each of the three use cases of Hadoop: Compute, Storage, and Database. Dell has validated these parameters for documented workloads, streamlining system bring-up and tuning for new Hadoop environments.
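The cluster sizing parameters above (data volume, replica count, and expected growth) can be combined into a rough, first-pass capacity estimate. The sketch below is illustrative only; the scratch-space overhead and the usable capacity per DataNode are assumptions for the example, not Dell-validated figures:

```python
import math

def estimate_datanodes(raw_data_tb, replication=3, annual_growth=0.5,
                       years=2, scratch_overhead=0.25, usable_tb_per_node=8.0):
    """First-pass DataNode count for a planned Hadoop cluster.

    raw_data_tb        -- initial, un-replicated data volume in TB
    replication        -- HDFS replica count (3 is the HDFS default)
    annual_growth      -- expected yearly data growth rate (assumed)
    years              -- planning horizon in years
    scratch_overhead   -- fraction reserved for MapReduce intermediate data (assumed)
    usable_tb_per_node -- HDFS-usable disk per DataNode (assumed)
    """
    future_data = raw_data_tb * (1 + annual_growth) ** years
    stored = future_data * replication          # every HDFS block is replicated
    total = stored * (1 + scratch_overhead)     # reserve room for scratch space
    return math.ceil(total / usable_tb_per_node)

# 100 TB today, 50% yearly growth, two-year horizon:
nodes = estimate_datanodes(100)   # 106 DataNodes under these assumptions
```

Treat the result as a starting point for the balanced-node discussion above, not as a final bill of materials.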
  • 5. Dell | Hadoop White Paper Series: Hadoop in the Enterprise Hadoop configuration parameters Hadoop tuning and optimization is a never-ending process. This recurring process is depicted in Figure 1. The beginning of each cycle is “Determine parameter to change.” The process then goes in a clockwise fashion and repeats. All Hadoop environments will need to be reviewed for tuning and performance optimization as often as the mix of jobs and workload characteristics change. Figure 1. The ongoing Hadoop tuning and optimization process. Hadoop network design The network design is a key component of a Hadoop environment, and has important factors related to the expected usage of the environment, as well as the scalability of the environment. The network design should factor in the expected workloads, provide for adequate growth of the environment without a major rework, and support monitoring of the network to alert administration staff to network congestion. Hadoop isolation is a primary design requirement for Hadoop clusters. The network that the Hadoop nodes use for node- to-node communication should be isolated from the other corporate networks. This ensures maximum bandwidth for the Hadoop environment and minimizes the impacts of other operations negatively affecting Hadoop. One important consideration of the network design for Hadoop solutions is the switch capabilities to accommodate high-packet-count communication patterns. Hadoop can create large volumes of packets on the network during rebuild operations as well as during high Hadoop Distributed File System (HDFS) I/O activity. The network architecture should ensure the switches have the capability to handle this traffic pattern without additional network latency or dropped packets. The Dell white paper “Introduction to Hadoop” covers our network architecture in greater detail. Like many modern applications, Hadoop utilizes IP for all communications. 
This means Hadoop has the same restrictions and design requirements you would see for any other network regarding broadcast domains, network separation, and routing and switching considerations. Dell recommends that Hadoop clusters utilize approximately 60 nodes within a single switched environment. For larger clusters, Dell recommends utilizing a layer 3 network device to segment the network and maintain adequately sized broadcast domains.

Today, most Hadoop solutions are built on Gigabit Ethernet. Many DataNodes use two Gigabit Ethernet connections in an aggregated link to the switch, providing additional bandwidth for the node. Some users are beginning to look at 10 Gigabit Ethernet as its price continues to come down, but the majority of users today do not require the additional performance of 10 Gigabit Ethernet and can save cost in their Hadoop environments by utilizing Gigabit Ethernet.

Availability of the network infrastructure is a primary concern if the Hadoop environment is critical to business operations or functions. The network should include the appropriate level of redundancy to ensure business functions continue if components within the network fail. Hadoop includes facilities in software for handling hardware and network failures; these should be accounted for when determining which tiers of the network will contain redundant components and which will not. You can gain significant efficiency by leveraging Hadoop's ability to replicate information to separate racks, or to servers on alternate switches, allowing access in the event of a network failure.

Dell recommended network architecture

Figure 2 shows the recommended network architecture for the top-of-rack (ToR) switches within a Hadoop environment. These connect directly to the DataNodes and carry all inter-node communication within the Hadoop environment. The standard configuration Dell recommends is six 48-port Gigabit Ethernet switches (e.g. Dell™ PowerConnect™ 6248) with stacking capabilities. These six switches are stacked and act as a single switch for management purposes. This network configuration supports up to 60 DataNodes for Hadoop; additional nodes can easily be added by utilizing an end-of-row (EoR) switch as described in the next section.

Figure 2. Recommended network architecture for the top-of-rack switches within a Hadoop environment.

Hadoop is a highly scalable software platform that requires a network designed with the same scalability in mind.
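The 60-DataNode figure for a six-switch stack follows from simple port arithmetic. The sketch below illustrates the budget; the assumption that remaining ports are used for uplinks and administrative hosts is ours, not a Dell specification:

```python
# Rough port budget for the stacked ToR configuration described above:
# six 48-port Gigabit Ethernet switches acting as one logical switch.
switches = 6
ports_per_switch = 48
total_ports = switches * ports_per_switch   # 288 ports in the stack

datanodes = 60
links_per_node = 2                          # two GbE links aggregated per DataNode
node_ports = datanodes * links_per_node     # 120 ports consumed by DataNodes

# The remainder is available for uplinks, management hosts, and growth.
spare_ports = total_ports - node_ports
print(total_ports, node_ports, spare_ports)  # 288 120 168
```

The math shows the stack is not port-bound at 60 nodes; the limit is the broadcast-domain sizing discussed above, not raw port count.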
To ensure maximum scalability, Dell recommends a network architecture that lets you start with a small Hadoop configuration and grow it over time by adding components, without reworking the existing environment. To that end, Dell recommends the use of two switches acting as EoR devices. These connect to the ToR switches, as shown in Figure 3, and add the routing and advanced functionality needed to scale above 60 DataNodes. The two EoR switches allow a maximum of 720 DataNodes within the Hadoop environment before an additional layer of network connectivity is needed or larger switches are required for EoR connectivity.

Figure 3. Two switches acting as EoR devices and connecting to ToR switches.
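For Hadoop to place replicas on separate racks in a multi-rack design like this, it must be told which rack each node lives in via a topology script (named by the `net.topology.script.file.name` property in core-site.xml). Below is a minimal sketch of such a script; the subnet-to-rack mapping and rack names are invented for illustration:

```python
#!/usr/bin/env python3
# Minimal sketch of a Hadoop rack-awareness topology script.
# Hadoop invokes the script with one or more IPs/hostnames as arguments
# and expects one rack path per input, printed one per line.
import sys

# Hypothetical mapping of /24 subnets to rack locations.
RACKS = {
    "10.0.1": "/row1/rack1",
    "10.0.2": "/row1/rack2",
}
DEFAULT_RACK = "/default-rack"

def rack_for(host: str) -> str:
    """Map a dotted-quad IP to a rack path by its first three octets."""
    prefix = ".".join(host.split(".")[:3])
    return RACKS.get(prefix, DEFAULT_RACK)

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

With this in place, HDFS can keep one replica on a different rack, which is what makes the cross-switch redundancy described earlier effective.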
Hadoop security

Authentication

Hadoop supports a variety of authentication mechanisms, including Kerberos, Active Directory, and Lightweight Directory Access Protocol (LDAP). Each allows a list of authorized users and their credentials to be stored centrally and validated from the Hadoop environment. Dell recommends utilizing an existing companywide authentication scheme for Hadoop, eliminating the need to support a separate authentication system for Hadoop alone.

Authorization

Authorization is an additional layer on top of the authentication that must occur for users. Authentication verifies that the user's credentials are correct: in essence, that the user name and password are correct, active, and valid for some period of time. Authorization builds on that validation to determine whether the user is allowed to perform the requested action. Authorization is commonly implemented as file permissions in a file system and as access controls within a relational database management system (RDBMS). By utilizing centralized authentication and authorization for Hadoop and other corporate services, security models can be developed between the environments to ensure permissions are properly mapped across them. Hadoop is commonly used as an intermediary point for data processing and storage; it should enforce all corporate security policies for data that passes through Hadoop and is processed by Hadoop.

Logging

Logging is a critical part of both system operation and security. The logs from Hadoop and the underlying systems provide insight into unexpected access that could lead to a system or data compromise, as well as into system stability issues. Hadoop environments should utilize a central logging facility for correlation of logs, recovery of logs from failed hosts, and alerting on unexpected events and anomalies.
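One common way to feed Hadoop daemon logs into such a central facility is through Hadoop's log4j configuration, adding a syslog appender alongside the existing file appender. The fragment below is a sketch, not a complete configuration; the collector hostname and facility are placeholders you would replace with your own:

```properties
# log4j.properties fragment -- forward Hadoop daemon logs to a
# central syslog collector (loghost.example.com is a placeholder).
log4j.rootLogger=INFO,RFA,SYSLOG
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.syslogHost=loghost.example.com
log4j.appender.SYSLOG.facility=LOCAL1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

The same collector can also receive logs from the operating system and network devices, enabling the cross-component correlation discussed here.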
In development environments, as well as when your environment grows, it is beneficial to compare logs from Hadoop with logs from the applications that use Hadoop. This capability enables your administrators to correlate problems across the entire stack of components that make up a functioning application.

Why Dell

Dell has worked with Cloudera to design, test, and support an integrated solution of hardware and software for implementing the Hadoop ecosystem. This solution has been engineered and validated to work together, with known performance parameters and deployment methods. Dell recommends that you utilize known hardware and software solutions when deploying Hadoop to ensure low-risk deployments and minimal compatibility issues. Dell's solution ensures that you get maximum performance with minimal testing prior to purchase.

Dell recommends that you purchase and maintain support on the entire ecosystem of your Hadoop solution. Today's solutions are complex combinations of components that require upgrades as new software becomes available and assistance when staff is working on new parts of the solution. The Dell | Cloudera Hadoop solution provides a full line of support, including hardware and software, so you always have a primary contact for assistance to ensure maximum availability and stability of your Hadoop environment.

About the author

Joey Jablonski is a principal solution architect with Dell's Data Center Solutions team. Joey works to define and implement Dell's solutions for big data, including solutions based on Apache Hadoop. He has spent more than 10 years working in high-performance computing, with an emphasis on interconnects, including InfiniBand, and on parallel file systems. Joey has led technical solution design and implementation at Sun Microsystems and Hewlett-Packard, and has consulted for customers including Sandia National Laboratories, BP, ExxonMobil, E*TRADE, Juelich Supercomputing Centre, and CLUMEQ.
Special thanks

The author extends special thanks to:
• Aurelian Dumitru, Senior Cloud Solutions Architect, Dell
• Rebecca Brenton, Cloud Software Alliances, Dell
• Scott Jensen, Director, Cloud Solutions Software Engineering, Dell

About Dell Next Generation Computing Solutions

When cloud computing is the core of your business and its efficiency and vitality underpin your success, Dell Next Generation Computing Solutions are Dell's response to your unique needs. We understand your challenges: from compute and power density to global scaling and environmental impact. Dell has the knowledge and expertise to tune your company's "factory" for maximum performance and efficiency.

Dell Next Generation Computing Solutions provide operational models backed by unique product solutions to meet the needs of companies at all stages of their lifecycles. Solutions are designed to meet the needs of small startups while allowing scalability as your company grows. Deployment and support are tailored to your unique operational requirements. Dell Cloud Computing Solutions can help you minimize the tangible operating costs that have hyperscale impact on your business results.

References

• Chef: http://wiki.opscode.com/display/chef/Home
• SNMP: http://www.net-snmp.org/
• IPMI: http://www.intel.com/design/servers/ipmi/ipmi.htm
• Cloudera Hadoop: http://www.cloudera.com/products-services/enterprise/

To learn more

To learn more about Dell cloud solutions, contact your Dell representative or visit: www.dell.com/cloud

© 2011 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerConnect are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others. This document is for informational purposes only.
Dell reserves the right to make changes without further notice to the products herein. The content provided is as-is and without expressed or implied warranties of any kind.