1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Keep Your Hadoop
Cluster at its Best!
Chris Nauroth
Sheetal Dolas
Hadoop Summit, San Jose, 2016
About Us
⬢ Principal Engineer @ Hortonworks
⬢ Committer and PMC, Apache Hadoop
– Key contributor to HDFS ACLs, Windows compatibility, and operability improvements
⬢ Hadoop user since 2010
– Experience deploying, maintaining and using Hadoop clusters
cnauroth@hortonworks.com
cnauroth
Chris Nauroth
About Us
⬢ SmartSense Engineering Lead @ Hortonworks
⬢ Career spent mostly in the field, solving real-life business problems
⬢ Last 6+ years in Big Data
⬢ Committer and PMC, Apache Metron
sheetal@hortonworks.com
sheetal_dolas
Sheetal Dolas
Agenda
⬢ Days in the life of Hadoop users – Real war stories!
⬢ Hadoop Operational Challenges
⬢ Winning and avoiding the wars
⬢ Q & A
Days in the life of Hadoop users
Real war stories!
Story I: Unstable NameNode, Frequent Failovers
⬢ NameNode periodically becomes unresponsive
⬢ In an HA setup, it fails over to the standby
⬢ Shortly afterwards, it fails back again
⬢ Very frequent failovers and failbacks
It was the garbage collection!
Story II: Very high CPU usage but low throughput
⬢ Unusually high system CPU usage
⬢ Jobs slowed down
⬢ Reduced data IO
(Chart: system CPU, user CPU, network IO)
Transparent Huge Pages (THP) was turned on!
Story III: Cascading impact and cluster meltdown
⬢ HDFS upgraded
⬢ HDFS utilization kept on increasing even after large data deletion
⬢ Rebalancing made the situation worse
⬢ Eventually HDFS became unresponsive
An un-finalized HDFS upgrade had a cascading impact on the cluster!
Story IV: Overloaded cluster
⬢ Jobs run slower
⬢ Containers and jobs are always waiting; all YARN queues are fully utilized
⬢ Some jobs had to wait for hours to get container slots
Suboptimally configured container sizes!
(Chart: requested memory vs. used memory)
Story V: Accidental deletion of critical datasets
⬢ User accidentally executed hdfs dfs -rm -R on a root directory
⬢ The delete was issued in parallel, so Ctrl+C did not help
⬢ In a panic, the user shut down HDFS immediately (fortunately)
⬢ The user restarted HDFS later to check the trash and lost all the data
⬢ It is nearly impossible to recover blocks from the local file system
This is a more common mistake than one may think!
Story VI: Hive query returning random results
⬢ A Hive query returns different results every time
⬢ Results are usually accurate during office hours
⬢ After office hours, results keep changing randomly on every execution
-- QUERY: WHAT IS TODAY’S TOTAL SALE AS OF NOW?
SELECT SUM(amount)
FROM sales
WHERE sale_date = TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()));
One of the hosts had a different time zone!
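A minimal sketch of why the result depends on the host: the same instant maps to different calendar dates on hosts with different time zones. The UTC offsets and the helper name `local_date` are assumptions chosen for illustration, not values from the incident.

```python
from datetime import datetime, timedelta, timezone


def local_date(epoch_seconds, utc_offset_hours):
    """Calendar date a host with the given UTC offset derives from an epoch timestamp."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz).date()


# One instant after office hours: 2016-06-29 02:30 UTC, which is 19:30 on June 28 at UTC-7.
after_hours = int(datetime(2016, 6, 29, 2, 30, tzinfo=timezone.utc).timestamp())
# One instant during office hours: 2016-06-28 17:00 UTC, which is 10:00 on June 28 at UTC-7.
office_hours = int(datetime(2016, 6, 28, 17, 0, tzinfo=timezone.utc).timestamp())

# Hypothetical cluster: most hosts run at UTC-7, one host was left on UTC.
date_on_most_hosts = local_date(after_hours, -7)
date_on_odd_host = local_date(after_hours, 0)

print(date_on_most_hosts, date_on_odd_host)
```

During office hours both hosts agree on the date, so the query looks correct; after hours, the answer depends on which host evaluates the "today" expression.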
and the stories continue…
Hadoop operational challenges
Hadoop has lots of configurations
⬢ So many configurations! Overwhelming for many users
⬢ Best practices are evolving and change across versions
Many configurations are cluster and workload specific
⬢ A configuration good for one cluster may not be suitable for another cluster
⬢ Clusters that are optimally configured today may become suboptimal tomorrow as they grow
Large clusters add to the complexities
⬢ Managing, updating and keeping nodes in sync becomes challenging
⬢ Nodes that go down miss maintenance cycles and get out of sync
⬢ Newly added nodes may have different standards (Java version, OS, user configurations, etc.)
⬢ Clusters end up with heterogeneous hardware over time
Winning and avoiding the wars with SmartSense
SmartSense
⬢ Proactive support and personalized cluster insights by
– Enabling faster case resolution
– Applying industry best practices
– Providing proactive analysis
⬢ SmartSense is a collection of tools and services
– Evaluates the cluster’s current configuration and runtime environment against a rich set of rules
– Rules are dynamic, reacting to thresholds tailored to the specific cluster and its workloads
– Rule sets evolve and improve continuously, developed by or in close consultation with active committers, support engineers, and field engineers
SmartSense Architecture
(Diagram: agents on worker nodes, SmartSense server, Ambari, bundle, gateway, landing zone, SmartSense Analytics)
Addressing: Unstable NameNode, Frequent Failovers
Daunting Questions
⬢ What is the right heap size for my NN?
⬢ What should the new generation size be?
⬢ Which GC should I use?
⬢ Which GC options should be configured?
⬢ What if my cluster grows?
SmartSense Answer
⬢ Rule: hdfs_nn_jvm_opts
⬢ Calculates heap size based on
– Current heap usage
– Total number of objects in the file system
– Best practices
⬢ Recalculates dependent JVM options based on heap size
⬢ Validates existing JVM opts
⬢ Provides continuous validation and proactive recommendations
NameNode JVM Opts
⬢ Heap size
– 200 bytes per HDFS object (files, directories, blocks)
– 25% buffer
⬢ -Xms should be the same as -Xmx
⬢ New generation size should be 1/8th of -Xmx (capped at 8G)
⬢ Use Concurrent Mark Sweep (CMS) garbage collection
– -XX:+UseConcMarkSweepGC
– -XX:CMSInitiatingOccupancyFraction=70
– -XX:+UseCMSInitiatingOccupancyOnly
– -XX:ParallelGCThreads=8
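The sizing rules on this slide can be sketched as a small calculation. This is an illustration under stated assumptions, not the SmartSense rule itself: the function name `namenode_jvm_opts`, rounding up to whole megabytes, and the use of `-Xmn` for the new generation size are all choices made for this sketch.

```python
import math

BYTES_PER_OBJECT = 200    # from the slide: files, directories, blocks
BUFFER_FRACTION = 0.25    # from the slide: 25% buffer
NEW_GEN_CAP_MB = 8 * 1024 # from the slide: new generation capped at 8G
MIB = 1024 ** 2


def namenode_jvm_opts(hdfs_objects):
    """Return a list of JVM options for a NameNode tracking `hdfs_objects` objects."""
    heap_bytes = hdfs_objects * BYTES_PER_OBJECT * (1 + BUFFER_FRACTION)
    heap_mb = math.ceil(heap_bytes / MIB)            # rounding up is an assumption
    new_gen_mb = min(NEW_GEN_CAP_MB, heap_mb // 8)   # 1/8th of the heap, capped
    return [
        f"-Xms{heap_mb}m",   # same as -Xmx, as the slide recommends
        f"-Xmx{heap_mb}m",
        f"-Xmn{new_gen_mb}m",  # assumption: -Xmn used to set the new generation size
        "-XX:+UseConcMarkSweepGC",
        "-XX:CMSInitiatingOccupancyFraction=70",
        "-XX:+UseCMSInitiatingOccupancyOnly",
        "-XX:ParallelGCThreads=8",
    ]


opts_100m = namenode_jvm_opts(100_000_000)
opts_1b = namenode_jvm_opts(1_000_000_000)
print(" ".join(opts_100m))
```

Re-running the calculation as the object count grows is the point: a heap that fit the cluster last year may be too small today.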
Addressing: Very high CPU usage but low throughput
Daunting Questions
⬢ Is THP applicable to my OS version?
⬢ Is it disabled? Completely disabled?
⬢ How do I make sure it is disabled on newly added nodes too?
⬢ How do I make these configurations person-independent?
SmartSense Answer
⬢ Rule: os_thp
⬢ Checks if THP is completely disabled
⬢ Provides OS-specific disabling instructions
⬢ Continuous evaluation that validates newly added and re-commissioned nodes
Disable THP
⬢ For RedHat & CentOS
echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled
⬢ For Debian, Ubuntu & SUSE
echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
(Chart: system CPU, user CPU, network IO)
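A rough sketch of how a host could be checked for the setting above. The kernel marks the active THP mode with square brackets in the `enabled` file (for example `always madvise [never]`); the helper names below are hypothetical and this is not how the os_thp rule is implemented.

```python
import re
from pathlib import Path

# The two locations named on the slide.
THP_PATHS = [
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
]


def active_thp_mode(file_content):
    """Extract the bracketed (active) mode from the content of an 'enabled' file."""
    match = re.search(r"\[(\w+)\]", file_content)
    return match.group(1) if match else None


def thp_disabled_on_this_host():
    """True/False if a THP control file is found, None if neither path exists."""
    for path in map(Path, THP_PATHS):
        if path.exists():
            return active_thp_mode(path.read_text()) == "never"
    return None


if __name__ == "__main__":
    print("THP disabled:", thp_disabled_on_this_host())
```

Running such a check on every node, including newly added ones, is what keeps the setting from depending on one person remembering it.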
Addressing: Cascading impact and cluster meltdown
Daunting Questions
⬢ Should I finalize the upgrade?
⬢ What is the right time to finalize?
⬢ How do I make sure it does not fall through the cracks?
SmartSense Answer
⬢ Rule: hdfs_nn_finalize_upgrade
⬢ Checks HDFS health after the upgrade
⬢ Evaluates how long HDFS has been running in an un-finalized state
⬢ Reminds until it is finalized
HDFS Upgrade finalization
⬢ Check NN UI / JMX for upgrade status
⬢ Do not finalize the HDFS upgrade until
– All files and blocks have been verified after the upgrade
– Critical jobs have been executed at least once after the upgrade
⬢ Finalize between 2 and 7 days after the upgrade
hdfs dfsadmin -finalizeUpgrade
Addressing: Overloaded cluster
Daunting Questions
⬢ What is the right container size for my cluster?
⬢ If I add additional components (HBase, Storm), how does the container size change?
⬢ How do container sizes change when I add new types of nodes to the cluster?
⬢ What is the impact on container sizes if I add SSDs to the nodes?
SmartSense Answer
⬢ Rules: yarn_container_size, mr_container_size, tez_container_size
⬢ Evaluates resources available on each individual host (CPU, memory, disks, running services, etc.)
⬢ Calculates technology-specific container sizes (MR, Tez, Hive)
⬢ Continuously re-evaluates as the cluster dynamics change
Container sizing
⬢ Identify resources (CPU, memory, disks) available on each node
⬢ Set aside resources required for other processes (OS, DN, NM, HBase RS)
⬢ Calculate the maximum possible containers for each resource (CPU, memory, disks)
– CPU containers: 4x cores
– Disk containers: (3x HDD + 10x SSD)
– Memory containers: (available RAM / 2)
⬢ Number of containers = min(CPU containers, disk containers, memory containers)
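The arithmetic above can be sketched as follows. The slide does not state the unit for available RAM; this sketch assumes gigabytes, and assumes the inputs are already net of what was set aside for other processes. The function name `max_containers` is hypothetical.

```python
def max_containers(cores, hdd_count, ssd_count, available_ram_gb):
    """Number of containers per node: the most constrained resource wins."""
    cpu_containers = 4 * cores
    disk_containers = 3 * hdd_count + 10 * ssd_count
    memory_containers = available_ram_gb // 2   # assumption: RAM expressed in GB
    return min(cpu_containers, disk_containers, memory_containers)


# Hypothetical node: 16 cores, 12 HDDs, no SSDs, 96 GB RAM left for containers.
# CPU allows 64, disks allow 36, memory allows 48, so disks are the limit.
hdd_node = max_containers(cores=16, hdd_count=12, ssd_count=0, available_ram_gb=96)

# Hypothetical node: 8 cores, 4 HDDs, 2 SSDs, 48 GB RAM left for containers.
# CPU allows 32, disks allow 32, memory allows 24, so memory is the limit.
mixed_node = max_containers(cores=8, hdd_count=4, ssd_count=2, available_ram_gb=48)

print(hdd_node, mixed_node)
```

Because the result is a minimum over resources, adding SSDs or new node types can shift which resource is the bottleneck, which is why the value needs re-evaluation as hardware changes.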
Addressing: Accidental deletion of critical datasets
Daunting Questions
⬢ Is HDFS trash enabled?
⬢ What is a safe trash interval?
⬢ How do I prevent accidental deletion of critical data?
SmartSense Answer
⬢ Rule: hdfs_trash_interval
– Checks if trash is enabled
– Validates that the trash interval is within reasonable limits
⬢ Rule: hdfs_nn_protect_imp_dirs
– New feature available in Hadoop 2.8
– Lets you mark critical directories such as “/”, “/user”, “/user/apps/hive”, “/user/apps/hbase”, etc. as delete-protected
HDFS Trash interval and directory protection
⬢ fs.trash.interval determines the number of minutes after which trashed data gets deleted
– 0 means trash is disabled (data gets deleted immediately)
– Keep it in the range 1440 (1 day) – 10080 (7 days)
– Recommended: 4320 (3 days)
⬢ fs.protected.directories specifies directories that will be delete-protected
– Available from Hadoop 2.8
– List all key directories there ("/", "/user", "/user/apps", "/user/apps/hive", "/user/apps/hbase", "/user/apps/hbase/data", "/mapred", "/mapred/system", "/tmp", etc.)
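The trash interval limits from the HDFS trash slide can be expressed as a simple validation. This is a sketch only: `assess_trash_interval` is a hypothetical helper, not a Hadoop API, and treating negative values like 0 is an assumption of this example.

```python
# Thresholds quoted on the slide, in minutes.
MIN_SAFE_MINUTES = 1440      # 1 day
MAX_SAFE_MINUTES = 10080     # 7 days
RECOMMENDED_MINUTES = 4320   # 3 days


def assess_trash_interval(minutes):
    """Classify an fs.trash.interval value against the limits on the slide."""
    if minutes <= 0:   # 0 disables trash; negative values are treated the same here
        return "disabled"
    if minutes < MIN_SAFE_MINUTES:
        return "too short"
    if minutes > MAX_SAFE_MINUTES:
        return "too long"
    return "ok"


for value in (0, 360, RECOMMENDED_MINUTES, 20160):
    print(value, assess_trash_interval(value))
```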
Addressing: Hive query returning random results
Daunting Questions
⬢ Is my cluster configured consistently?
⬢ How do I prevent such hard-to-analyze issues?
⬢ How do I make sure newly added nodes do not bring these types of issues?
⬢ How do I make these setups person-independent?
SmartSense Answer
⬢ Rule: os_time_zone
– Checks if all hosts have the same time zone
⬢ Rule: os_service_ntpd_on
– Makes sure all host times are in sync
⬢ Continuous evaluation that validates newly added and re-commissioned nodes
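A consistency check of this kind can be sketched in a few lines. The host names and zone names below are made up, and `hosts_with_odd_time_zone` is a hypothetical helper that treats the most common zone as the expected one; the actual os_time_zone rule may work differently.

```python
from collections import Counter


def hosts_with_odd_time_zone(zone_by_host):
    """Return hosts whose time zone differs from the most common zone in the cluster."""
    if not zone_by_host:
        return []
    majority_zone, _ = Counter(zone_by_host.values()).most_common(1)[0]
    return sorted(host for host, zone in zone_by_host.items() if zone != majority_zone)


# Hypothetical inventory collected from each host.
inventory = {
    "node1": "America/Los_Angeles",
    "node2": "America/Los_Angeles",
    "node3": "UTC",
}
odd_hosts = hosts_with_odd_time_zone(inventory)
print(odd_hosts)
```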
There are 250+ more such rules
Operations
⬢ hdfs_dn_volume_tolerance
⬢ hdfs_dn_xceivers
⬢ hdfs_nn_handler_count
⬢ …
⬢ yarn_zk_quorum
⬢ yarn_nm_recovery
⬢ …
⬢ os_hostname_reverse_lookup
⬢ os_ssd_tuning
⬢ …
⬢ hive_mr_strict_mode
⬢ hive_datanucleus_cache
⬢ …
⬢ tez_am_heap
⬢ tez_shuffle_buffer
⬢ …
Performance
⬢ ams_mc_distributed_configs
⬢ ams_mc_write_path
⬢ …
⬢ hbase_jvm_opts
⬢ hbase_rs_open_region_threads
⬢ hbase_tcp_nodelay
⬢ …
⬢ hdfs_dn_jvm_opts
⬢ hdfs_mount_options
⬢ hdfs_nn_dn_staleness_interval
⬢ …
⬢ hive_auto_convert_join
⬢ hive_disable_caching
⬢ hive_enable_cbo
⬢ …
Security
⬢ hdfs_dn_volume_tolerance
⬢ hdfs_audit_log
⬢ hdfs_block_access_token
⬢ hdfs_enable_security_check
⬢ hdfs_nn_super_user_group
⬢ hdfs_zkfc_ha_acl
⬢ …
⬢ ranger_policy_refresh_interval
⬢ smartsense_2_way_ssl_enabled
⬢ …
⬢ yarn_ats_security
⬢ yarn_enable_acl
⬢ …
There is more than just configurations
⬢ How do I show back / charge back my tenants?
⬢ Who are the top users of my platform?
⬢ What type of workloads are running on my cluster?
⬢ Which jobs have significant impact on my cluster?
⬢ How do I improve performance of key jobs?
⬢ What is a good time for maintenance?
Activity Analysis
Summary
⬢ There are many things involved in managing a Hadoop cluster
⬢ Best practices evolve and change across versions
⬢ What is optimal today may not be optimal tomorrow
⬢ Changing cluster dynamics and workload characteristics need continuous re-evaluation and configuration adjustments
⬢ SmartSense can significantly help avoid common mistakes, issues, and pitfalls, and simplify Hadoop operations
Let's keep your Hadoop cluster at its best!
Thank You!
Appendix
More Resources
⬢ https://docs.hortonworks.com/index.html
⬢ http://hortonworks.com/products/subscriptions/smartsense/
⬢ http://hortonworks.com/info/smartsense/
⬢ http://hortonworks.com/blog/introducing-hortonworks-smartsense/
⬢ https://www.youtube.com/watch?v=IKulo9c8PjE
⬢ https://community.hortonworks.com/topics/smartsense.html
SmartSense Bundle Security
⬢ All Bundles are Anonymized and Encrypted
⬢ Multiple built-in security measures
– Ambari clear text passwords are not collected
– Hive and Oozie database properties are not collected
– All IP addresses and host names are anonymized
⬢ Extensible security rules
– Exclude properties within specific Hadoop configuration files
– Global REGEX replacements across all configuration, metrics, and logs
SmartSense Stack Support
HDP: 2.4, 2.3, 2.2, 2.1, 2.0
SmartSense 1.x
Ambari: 2.2 (Built-In!), 2.1 (Plug-In), 2.0 (Plug-In), 1.7, 1.6

 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Keep your Hadoop Cluster at its Best

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Keep Your Hadoop Cluster at its Best! Chris Nauroth Sheetal Dolas Hadoop Summit, San Jose, 2016
  • 2. About Us ⬢ Principal Engineer @ Hortonworks ⬢ Committer and PMC, Apache Hadoop – Key contributor to HDFS ACLs, Windows compatibility, and operability improvements ⬢ Hadoop user since 2010 – Experience deploying, maintaining and using Hadoop clusters cnauroth@hortonworks.com cnauroth Chris Nauroth
  • 3. About Us ⬢ SmartSense Engineering Lead @ Hortonworks ⬢ Most of the career has been in the field, solving real life business problems ⬢ Last 6+ years in Big Data ⬢ Committer and PMC, Apache Metron sheetal@hortonworks.com sheetal_dolas Sheetal Dolas
  • 4. Agenda ⬢ Days in a life of Hadoop users – Real war stories! ⬢ Hadoop Operational Challenges ⬢ Winning and avoiding the wars ⬢ Q & A
  • 5. Days in a life of Hadoop users Real war stories!
  • 6. Story I: Unstable NameNode, Frequent Fail Overs ⬢ NameNode periodically becomes unresponsive ⬢ In HA scenario, fails over to standby ⬢ In short time, falls back again ⬢ Very frequent fail overs and fail backs It was the garbage collection!
  • 7. Story II: Very high CPU usage but low throughput ⬢ Unusually high system CPU usage ⬢ Jobs slowed down ⬢ Reduced data IO [Chart: System CPU, User CPU, N/W IO] Transparent Huge Pages (THP) was turned on!
  • 8. Story III: Cascading impact and cluster melt down ⬢ HDFS upgraded ⬢ HDFS utilization kept on increasing even after large data deletion ⬢ Rebalancing made the situation worse ⬢ Eventually HDFS became unresponsive un-finalized HDFS had cascading impact on cluster!
  • 9. Story IV: Overloaded cluster ⬢ Jobs run slower ⬢ Containers and jobs are always waiting; all YARN queues are fully utilized ⬢ Some jobs had to wait for hours to get container slots Sub-optimally configured container sizes! [Chart: Requested Memory vs. Used Memory]
  • 10. Story V: Accidental deletion of critical datasets ⬢ User accidentally executed hdfs dfs -rm -R on a root directory ⬢ Delete is issued in parallel, Ctrl+C did not help ⬢ In panic, user shuts down HDFS immediately (fortunately) ⬢ Restarts later to check trash, loses all data ⬢ It’s nearly impossible to recover blocks from the local file system This is a more common mistake than one may think!
  • 11. Story VI: Hive query returning random results ⬢ A Hive query returns different results every time ⬢ Results are usually accurate during office hours ⬢ After office hours, results keep changing randomly on every execution -- QUERY: WHAT IS TODAY’S TOTAL SALE AS OF NOW ? SELECT SUM(amount) FROM sales WHERE sale_date = TO_DATE (UNIX_TIMESTAMP()) One of the hosts had a different time zone!
  • 12. and the stories continue…
  • 14. Hadoop has lots of configurations ⬢ So many configurations! Overwhelming for many users ⬢ Best practices are evolving and change across versions
  • 15. Many configurations are cluster and workload specific ⬢ A configuration good for one cluster may not be suitable for another cluster ⬢ Optimally configured clusters may become sub-optimal tomorrow as they grow
  • 16. Large clusters add to the complexities ⬢ Managing, updating and keeping nodes in sync becomes challenging ⬢ Nodes that go down miss the maintenance cycles and get out of sync ⬢ Newly added nodes may have different standards (Java version, OS, user configurations etc.) ⬢ Clusters end up with heterogeneous hardware over time
  • 17. Winning and avoiding the wars with SmartSense
  • 18. SmartSense ⬢ Proactive support & personalized cluster insights by – Enabling faster case resolution – Applying industry best practices – Providing proactive analysis ⬢ SmartSense is a collection of tools and services – Evaluates cluster’s current configuration and runtime environment against a rich set of rules – Rules are dynamic, reacting to thresholds tailored to the specific cluster and its workloads – Continuously evolving and improving rule sets, developed by or in close consultation with active committers, support engineers, field engineers.
  • 19. SmartSense Architecture [Diagram labels: Agent, Worker Node, Server, Ambari, Bundle, Gateway, Landing Zone, SmartSense Analytics]
  • 20. Addressing: Unstable NameNode, Frequent Fail Overs Daunting Questions ⬢ What is the right heap size for my NN? ⬢ What should the new gen size be? ⬢ Which GC should I use? ⬢ Which GC options should be configured? ⬢ What if my cluster grows? SmartSense Answer ⬢ Rule: hdfs_nn_jvm_opts ⬢ Calculates heap size based on – Current heap usage – Total number of objects in file system – Best practices ⬢ Recalculates dependent JVM options based on heap size ⬢ Validates existing JVM opts ⬢ Provides continuous validations and proactive recommendations
  • 21. NameNode JVM Opts ⬢ Heap Size – 200 bytes per HDFS object (files, directories, blocks) – 25% buffer ⬢ -Xms should be the same as -Xmx ⬢ New generation size should be 1/8th of -Xmx (capped at 8G) ⬢ Use Concurrent Mark Sweep (CMS) Garbage Collection – -XX:+UseConcMarkSweepGC – -XX:CMSInitiatingOccupancyFraction=70 – -XX:+UseCMSInitiatingOccupancyOnly – -XX:ParallelGCThreads=8
  • 22. Addressing: Very high CPU usage but low throughput Daunting Questions ⬢ Is THP applicable to my OS version? ⬢ Is it disabled? Completely disabled? ⬢ How do I make sure it is disabled on newly added nodes too? ⬢ How do I make these configurations person-independent? SmartSense Answer ⬢ Rule: os_thp ⬢ Checks if THP is completely disabled ⬢ Provides OS-specific disabling instructions ⬢ Continuous evaluation that validates newly added nodes and re-commissioned nodes
  • 23. Disable THP ⬢ For RedHat & CentOS: echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled ⬢ For Debian, Ubuntu & SUSE: echo "never" > /sys/kernel/mm/transparent_hugepage/enabled [Chart: System CPU, User CPU, N/W IO]
  • 24. Addressing: Cascading impact and cluster melt down Daunting Questions ⬢ Should I finalize the upgrade? ⬢ What is the right time to finalize? ⬢ How do I make sure it does not fall through the cracks? SmartSense Answer ⬢ Rule: hdfs_nn_finalize_upgrade ⬢ Checks HDFS health after upgrade ⬢ Evaluates how long HDFS has been running in an un-finalized state ⬢ Reminds until it is finalized
  • 25. HDFS Upgrade finalization ⬢ Check NN UI / JMX for upgrade status ⬢ Do not finalize HDFS upgrade until – All files and blocks have been verified after upgrade – Critical jobs have been executed at least once after upgrade ⬢ Finalize between 2 and 7 days after upgrade: hdfs dfsadmin -finalizeUpgrade
  • 26. Addressing: Overloaded cluster Daunting Questions ⬢ What is the right container size for my cluster? ⬢ If I add additional components (HBase, Storm), how does the container size change? ⬢ How do container sizes change when I add new types of nodes to the cluster? ⬢ What is the impact on container sizes if I add SSDs to the nodes? SmartSense Answer ⬢ Rules: yarn_container_size, mr_container_size, tez_container_size ⬢ Evaluates resources available on each host (CPU, Memory, Disks, Running Services etc.) ⬢ Calculates technology-specific container sizes (MR, Tez, Hive) ⬢ Continuously re-evaluates as the cluster dynamics change
  • 27. Container sizing ⬢ Identify resources (CPU, Memory, Disks) available on each node ⬢ Set aside resources required for other processes (OS, DN, NM, HBase RS) ⬢ Calculate max possible containers for each resource (CPU, Memory, Disks) – CPU Containers: 4x cores – Disk Containers: (3x HDD + 10x SSD) – Memory Containers: (Available RAM / 2) ⬢ Number of containers = Min (CPU Containers, Disk Containers, Memory Containers)
  • 28. Addressing: Accidental deletion of critical datasets Daunting Questions ⬢ Is HDFS trash enabled? ⬢ What is a safe trash interval? ⬢ How do I prevent accidental deletion of critical data? SmartSense Answer ⬢ Rule: hdfs_trash_interval – Checks if trash is enabled – Validates that the trash interval is within reasonable limits ⬢ Rule: hdfs_nn_protect_imp_dirs – New feature available in Hadoop 2.8 – Helps you mark critical directories such as “/”, “/user”, “/user/apps/hive”, “/user/apps/hbase” etc. as delete protected.
  • 29. HDFS Trash interval and directory protection ⬢ fs.trash.interval sets the number of minutes after which trashed data gets deleted – 0 means trash disabled (data gets deleted immediately) – Keep it in the range 1440 (1 day) to 10080 (7 days) – Recommended 4320 (3 days) ⬢ fs.protected.directories specifies directories that will be delete protected – Available from Hadoop 2.8 – List all key directories there ("/", "/user", "/user/apps", "/user/apps/hive", "/user/apps/hbase", "/user/apps/hbase/data", "/mapred", "/mapred/system", "/tmp" etc.)
  • 30. Addressing: Hive query returning random results Daunting Questions ⬢ Is my cluster configured consistently? ⬢ How do I prevent such hard-to-analyze issues? ⬢ How do I make sure newly added nodes do not bring these types of issues? ⬢ How do I make these setups person-independent? SmartSense Answer ⬢ Rule: os_time_zone ⬢ Checks if all hosts have the same time zone ⬢ Rule: os_service_ntpd_on ⬢ Makes sure all host times are in sync ⬢ Continuous evaluation that validates newly added nodes and re-commissioned nodes
  • 31. There are 250+ more such rules Operations ⬢ hdfs_dn_volume_tolerance ⬢ hdfs_dn_xceivers ⬢ hdfs_nn_handler_count ⬢ … ⬢ yarn_zk_quorum ⬢ yarn_nm_recovery ⬢ … ⬢ os_hostname_reverse_lookup ⬢ os_ssd_tuning ⬢ … ⬢ hive_mr_strict_mode ⬢ hive_datanucleus_cache ⬢ … ⬢ tez_am_heap ⬢ tez_shuffle_buffer ⬢ … Performance ⬢ ams_mc_distributed_configs ⬢ ams_mc_write_path ⬢ ... ⬢ hbase_jvm_opts ⬢ hbase_rs_open_region_threads ⬢ hbase_tcp_nodelay ⬢ ... ⬢ hdfs_dn_jvm_opts ⬢ hdfs_mount_options ⬢ hdfs_nn_dn_staleness_interval ⬢ ... ⬢ hive_auto_convert_join ⬢ hive_disable_caching ⬢ hive_enable_cbo ⬢ ... Security ⬢ hdfs_dn_volume_tolerance ⬢ hdfs_audit_log ⬢ hdfs_block_access_token ⬢ hdfs_enable_security_check ⬢ hdfs_nn_super_user_group ⬢ hdfs_zkfc_ha_acl ⬢ ... ⬢ ranger_policy_refresh_interval ⬢ smartsense_2_way_ssl_enabled ⬢ ... ⬢ yarn_ats_security ⬢ yarn_enable_acl ⬢ ...
  • 32. There is more than just configurations ⬢ How do I show back/charge back my tenants? ⬢ Who are the top users of my platform? ⬢ What type of workloads are running on my cluster? ⬢ Which jobs have significant impact on my cluster? ⬢ How do I improve performance of key jobs? ⬢ What is a good time for maintenance?
  • 33. Activity Analysis
  • 34. Summary ⬢ There are many things involved in managing a Hadoop cluster ⬢ Best practices evolve and change across versions ⬢ What is optimal today may not be optimal tomorrow ⬢ Changing cluster dynamics and workload characteristics need continuous re-evaluation and configuration adjustments ⬢ SmartSense can significantly help avoid common mistakes, issues and pitfalls, and simplify Hadoop operations
  • 35. Let's keep your Hadoop cluster at its best! Thank You!
  • 36. Appendix
  • 37. More Resources ⬢ https://docs.hortonworks.com/index.html ⬢ http://hortonworks.com/products/subscriptions/smartsense/ ⬢ http://hortonworks.com/info/smartsense/ ⬢ http://hortonworks.com/blog/introducing-hortonworks-smartsense/ ⬢ https://www.youtube.com/watch?v=IKulo9c8PjE ⬢ https://community.hortonworks.com/topics/smartsense.html
  • 38. SmartSense Bundle Security ⬢ All Bundles are Anonymized and Encrypted ⬢ Multiple built-in security measures – Ambari clear text passwords are not collected – Hive and Oozie database properties are not collected – All IP addresses and host names are anonymized ⬢ Extensible security rules – Exclude properties within specific Hadoop configuration files – Global REGEX replacements across all configuration, metrics, and logs
  • 39. SmartSense Stack Support HDP 2.4 HDP 2.3 HDP 2.2 HDP 2.1 HDP 2.0 SmartSense 1.x Ambari 2.2 Built-In! Ambari 2.1 Plug-In Ambari 2.0 Plug-In Ambari 1.7 Ambari 1.6 SmartSense 1.x
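
The sizing rules on slide 21 (NameNode JVM opts) and slide 27 (container sizing) reduce to simple arithmetic. The Python sketch below is not from the deck; it only restates those two rules. The function names are hypothetical, and two readings are assumptions: "capped at 8G" is taken as 8 GiB, and "Available RAM / 2" is taken as one container per 2 GB of RAM left after other processes are accounted for.

```python
# Illustrative sketch of the rules of thumb on slides 21 and 27.
# Function names are hypothetical; unit interpretations are assumptions.

BYTES_PER_HDFS_OBJECT = 200          # slide 21: files, directories, blocks
NEW_GEN_CAP_BYTES = 8 * 1024 ** 3    # slide 21: "capped at 8G" (assumed GiB)


def namenode_heap_bytes(hdfs_objects: int) -> int:
    """Heap estimate: 200 bytes per HDFS object plus a 25% buffer."""
    raw = hdfs_objects * BYTES_PER_HDFS_OBJECT
    return raw + raw // 4


def new_gen_bytes(heap_bytes: int) -> int:
    """New generation: 1/8th of the heap (-Xmx), capped at 8G."""
    return min(heap_bytes // 8, NEW_GEN_CAP_BYTES)


def container_count(cores: int, hdds: int, ssds: int, available_ram_gb: int) -> int:
    """Containers per node: the minimum of the CPU, disk and memory limits."""
    cpu_containers = 4 * cores
    disk_containers = 3 * hdds + 10 * ssds
    memory_containers = available_ram_gb // 2   # assumed: RAM in GB
    return min(cpu_containers, disk_containers, memory_containers)
```

For example, 100 million HDFS objects give 100,000,000 × 200 × 1.25 = 25,000,000,000 bytes of heap, and a node with 16 cores, 12 HDDs, no SSDs and 96 GB of available RAM is limited by its disks to 36 containers.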

Editor's notes

  1. SmartSense bundles include configuration and metrics; bundles used for support case troubleshooting include configuration, metrics, and log files. This data is captured for the operating system of cluster nodes, as well as for all of the installed HDP services. The capture process can be configured to exclude specific files from capture, or specific Hadoop properties within HDP configuration files. To protect organization-specific data, such as customer IDs, patient IDs, credit card numbers, etc., we provide the capability to specify a regular expression whose matches are removed or replaced in any file that is captured by SmartSense. This protects sensitive data in the event that it is unintentionally leaked into log files. By default we remove all properties associated with clear text passwords. Ambari, Hive, and Oozie by default store DB credentials as cleartext, unless they’ve been configured to encrypt them. Just in case Hadoop operators have not taken the time to do so, we exclude those properties by default.