1. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Keep Your Hadoop Cluster at its Best!
Chris Nauroth
Sheetal Dolas
Hadoop Summit, San Jose, 2016
2.
About Us
⬢ Principal Engineer @ Hortonworks
⬢ Committer and PMC, Apache Hadoop
– Key contributor to HDFS ACLs, Windows compatibility, and operability improvements
⬢ Hadoop user since 2010
– Experience deploying, maintaining and using Hadoop clusters
cnauroth@hortonworks.com
cnauroth
Chris Nauroth
3.
About Us
⬢ SmartSense Engineering Lead @ Hortonworks
⬢ Most of his career has been spent in the field, solving real-life business problems
⬢ Last 6+ years in Big Data
⬢ Committer and PMC, Apache Metron
sheetal@hortonworks.com
sheetal_dolas
Sheetal Dolas
4.
Agenda
⬢ Days in the life of Hadoop users – real war stories!
⬢ Hadoop Operational Challenges
⬢ Winning and avoiding the wars
⬢ Q & A
5.
Days in the Life of Hadoop Users
Real war stories!
6.
Story I: Unstable NameNode, Frequent Fail Overs
⬢ NameNode periodically becomes unresponsive
⬢ In an HA setup, it fails over to the standby
⬢ Shortly after, it fails back again
⬢ Very frequent fail overs and fail backs
It was the garbage collection!
7.
Story II: Very high CPU usage but low throughput
⬢ Unusually high system CPU usage
⬢ Jobs slowed down
⬢ Reduced data I/O
Transparent Huge Pages (THP) was turned on!
8.
Story III: Cascading impact and cluster melt down
⬢ HDFS upgraded
⬢ HDFS utilization kept increasing even after large data deletions
⬢ Rebalancing made the situation worse
⬢ Eventually HDFS became unresponsive
An un-finalized HDFS upgrade had a cascading impact on the cluster!
9.
Story IV: Overloaded cluster
⬢ Jobs run slower
⬢ Containers and jobs always waiting; all YARN queues fully utilized
⬢ Some jobs had to wait for hours to get container slots
Sub-optimally configured container sizes!
10.
Story V: Accidental deletion of critical datasets
⬢ A user accidentally executed hdfs dfs -rm -R on a root directory
⬢ Deletes are issued in parallel; Ctrl+C did not help
⬢ In panic, the user shut down HDFS immediately (fortunately)
⬢ Restarted later to check trash, but the data was lost
⬢ It is nearly impossible to recover blocks from the local file system
This is a more common mistake than one may think!
11.
Story VI: Hive query returning random results
⬢ A Hive query returns different results every time
⬢ Results are usually accurate during office hours
⬢ After office hours, results change randomly on every execution
-- QUERY: WHAT IS TODAY'S TOTAL SALE AS OF NOW?
SELECT SUM(amount)
FROM sales
WHERE sale_date = TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP()))
One of the hosts had a different time zone!
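A quick way to see why a per-host time zone mismatch makes that date expression non-deterministic: the same epoch instant maps to different calendar dates in different zones. A sketch using GNU `date` and a hypothetical timestamp:

```shell
# Same instant, two zones, two different "today"s (GNU date).
TS=1467331200   # 2016-07-01 00:00:00 UTC (hypothetical instant)
TZ=UTC date -d "@$TS" +%F                  # -> 2016-07-01
TZ=America/Los_Angeles date -d "@$TS" +%F  # -> 2016-06-30
```

A host whose clock sits in a different zone evaluates "today" to a different partition, so the SUM covers different rows.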
12.
and the stories continue…
14.
Hadoop has lots of configurations
⬢ So many configurations! Overwhelming for many users
⬢ Best practices are evolving and change across versions
15.
Many configurations are cluster and workload specific
⬢ A configuration good for one cluster may not be suitable for another cluster
⬢ Optimally configured clusters may become sub-optimal tomorrow as they grow
16.
Large clusters add to the complexities
⬢ Managing, updating and keeping nodes in sync becomes challenging
⬢ Nodes that go down miss maintenance cycles and fall out of sync
⬢ Newly added nodes may have different standards (Java version, OS, user configurations, etc.)
⬢ Clusters accumulate heterogeneous hardware over time
17.
Winning and Avoiding the Wars with SmartSense
18.
SmartSense
⬢ Proactive support & personalized cluster insights by
– Enabling faster case resolution
– Applying industry best practices
– Providing proactive analysis
⬢ SmartSense is a collection of tools and services
– Evaluates the cluster's current configuration and runtime environment against a rich set of rules
– Rules are dynamic, reacting to thresholds tailored to the specific cluster and its workloads
– Rule sets continuously evolve and improve, developed by or in close consultation with active committers, support engineers, and field engineers
19.
SmartSense Architecture
[Architecture diagram: Ambari agents on the worker nodes capture data into a bundle, which is delivered to a landing-zone server; a gateway forwards bundles to the SmartSense Analytics service.]
20.
Addressing: Unstable NameNode, Frequent Fail Overs
Daunting Questions
⬢ What is the right heap size for my NN?
⬢ What should the new-gen size be?
⬢ Which GC should I use?
⬢ What GC options should be configured?
⬢ What if my cluster grows?
SmartSense Answer
⬢ Rule: hdfs_nn_jvm_opts
⬢ Calculates heap size based on
– Current heap usage
– Total number of objects in the file system
– Best practices
⬢ Recalculates dependent JVM options based on heap size
⬢ Validates existing JVM opts
⬢ Provides continuous validation and proactive recommendations
21.
NameNode JVM Opts
Heap size
– 200 bytes per HDFS object (files, directories, blocks)
– 25% buffer
-Xms should be the same as -Xmx
New generation size should be 1/8th of -Xmx (capped at 8G)
Use Concurrent Mark Sweep (CMS) garbage collection
– -XX:+UseConcMarkSweepGC
– -XX:CMSInitiatingOccupancyFraction=70
– -XX:+UseCMSInitiatingOccupancyOnly
– -XX:ParallelGCThreads=8
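These rules of thumb can be turned into a quick sizing calculation. A minimal sketch with a hypothetical object count; the 200-bytes-per-object and 25% buffer figures are the heuristics above, not exact measurements:

```shell
# Sketch: estimate NameNode heap from the HDFS object count.
OBJECTS=150000000                              # files + dirs + blocks (hypothetical)
HEAP_BYTES=$(( OBJECTS * 200 * 125 / 100 ))    # 200 bytes/object + 25% buffer
HEAP_GB=$(( HEAP_BYTES / 1024 / 1024 / 1024 + 1 ))
NEWGEN_GB=$(( HEAP_GB / 8 ))                   # new gen = 1/8th of -Xmx ...
[ "$NEWGEN_GB" -gt 8 ] && NEWGEN_GB=8          # ... capped at 8G
echo "-Xms${HEAP_GB}g -Xmx${HEAP_GB}g -XX:NewSize=${NEWGEN_GB}g -XX:MaxNewSize=${NEWGEN_GB}g"
```

With 150M objects this yields roughly a 35G heap with a 4G new generation; the point is that heap follows object count, so the numbers must be revisited as the namespace grows.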
22.
Addressing: Very high CPU usage but low throughput
Daunting Questions
⬢ Is THP applicable to my OS version?
⬢ Is it disabled? Completely disabled?
⬢ How do I make sure it is disabled on newly added nodes too?
⬢ How do I make these configurations person-independent?
SmartSense Answer
⬢ Rule: os_thp
⬢ Checks if THP is completely disabled
⬢ Provides OS-specific disabling instructions
⬢ Continuous evaluation that validates newly added and re-commissioned nodes
23.
Disable THP
⬢ For RedHat & CentOS
echo "never" > /sys/kernel/mm/redhat_transparent_hugepage/enabled
⬢ For Debian, Ubuntu & SUSE
echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
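Because the sysfs path differs across distros, a small helper can normalize the verification step. A sketch; the kernel file reads like `always madvise [never]`, where the bracketed word is the active mode:

```shell
# Sketch: decide whether a transparent_hugepage 'enabled' value means THP is off.
thp_disabled() {
  case "$1" in
    *"[never]"*) return 0 ;;   # 'never' is the active mode: THP disabled
    *)           return 1 ;;   # any other active mode: THP still on
  esac
}

# Probe whichever sysfs path this distro uses.
for f in /sys/kernel/mm/transparent_hugepage/enabled \
         /sys/kernel/mm/redhat_transparent_hugepage/enabled; do
  [ -r "$f" ] && { thp_disabled "$(cat "$f")" && echo "THP disabled" || echo "THP ENABLED: fix $f"; }
done
```

Running this from a config-management hook on every node catches the newly added and re-commissioned nodes the slide warns about.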
24.
Addressing: Cascading impact and cluster melt down
Daunting Questions
⬢ Should I finalize the upgrade?
⬢ What is the right time to finalize?
⬢ How do I make sure it does not fall through the cracks?
SmartSense Answer
⬢ Rule: hdfs_nn_finalize_upgrade
⬢ Checks HDFS health after upgrade
⬢ Evaluates how long HDFS has been running in an un-finalized state
⬢ Reminds until it is finalized
25.
HDFS Upgrade Finalization
Check the NN UI / JMX for upgrade status
Do not finalize an HDFS upgrade until
– All files and blocks have been verified after the upgrade
– Critical jobs have been executed at least once after the upgrade
Finalize between 2 and 7 days after the upgrade
hdfs dfsadmin -finalizeUpgrade
26.
Addressing: Overloaded cluster
Daunting Questions
⬢ What is the right container size for my cluster?
⬢ If I add additional components (HBase, Storm), how does the container size change?
⬢ How do container sizes change when I add new types of nodes to the cluster?
⬢ What is the impact on container sizes if I add SSDs to the nodes?
SmartSense Answer
⬢ Rules: yarn_container_size, mr_container_size, tez_container_size
⬢ Evaluates resources available on each individual host (CPU, memory, disks, running services, etc.)
⬢ Calculates technology-specific container sizes (MR, Tez, Hive)
⬢ Continuously re-evaluates as the cluster dynamics change
27.
Container sizing
Identify resources (CPU, memory, disks) available on each node
Set aside resources required for other processes (OS, DN, NM, HBase RS)
Calculate the maximum possible containers for each resource
– CPU containers: 4x cores
– Disk containers: (3x HDD + 10x SSD)
– Memory containers: (available RAM / 2)
Number of containers = min(CPU containers, disk containers, memory containers)
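Plugged into a hypothetical worker node (16 cores, 8 HDDs, 96 GB of available RAM), the formula above looks like this; the multipliers are the slide's heuristics:

```shell
# Sketch: per-node container count from the rules of thumb above.
CORES=16; HDDS=8; SSDS=0; AVAIL_RAM_GB=96   # hypothetical worker node
CPU_C=$(( CORES * 4 ))                      # 4 containers per core
DISK_C=$(( HDDS * 3 + SSDS * 10 ))          # 3 per HDD, 10 per SSD
MEM_C=$(( AVAIL_RAM_GB / 2 ))               # ~2 GB per container
CONTAINERS=$CPU_C                           # take the minimum of the three
[ "$DISK_C" -lt "$CONTAINERS" ] && CONTAINERS=$DISK_C
[ "$MEM_C"  -lt "$CONTAINERS" ] && CONTAINERS=$MEM_C
echo "containers per node: $CONTAINERS"
```

Here disks are the bottleneck (24 containers vs. 64 by CPU and 48 by memory), which is exactly why adding SSDs or RAM changes the answer and warrants re-evaluation.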
28.
Addressing: Accidental deletion of critical datasets
Daunting Questions
⬢ Is HDFS trash enabled?
⬢ What is a safe trash interval?
⬢ How do I prevent accidental deletion of critical data?
SmartSense Answer
⬢ Rule: hdfs_trash_interval
– Checks if trash is enabled
– Validates that the trash interval is within reasonable limits
⬢ Rule: hdfs_nn_protect_imp_dirs
– New feature available in Hadoop 2.8
– Helps you mark critical directories such as "/", "/user", "/user/apps/hive", "/user/apps/hbase", etc. as delete-protected
29.
HDFS Trash interval and directory protection
fs.trash.interval specifies the number of minutes after which trashed data gets permanently deleted
– 0 means trash is disabled (data gets deleted immediately)
– Keep it in the range 1440 (1 day) – 10080 (7 days)
– Recommended: 4320 (3 days)
fs.protected.directories specifies directories that will be delete-protected
– Available from Hadoop 2.8
– List all key directories there ("/", "/user", "/user/apps", "/user/apps/hive", "/user/apps/hbase", "/user/apps/hbase/data", "/mapred", "/mapred/system", "/tmp", etc.)
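Both settings live in core-site.xml; a sketch with the values recommended above (the directory list is illustrative, not exhaustive):

```xml
<!-- core-site.xml (illustrative values) -->
<property>
  <name>fs.trash.interval</name>
  <value>4320</value> <!-- minutes: 3 days -->
</property>
<property>
  <name>fs.protected.directories</name> <!-- Hadoop 2.8+ -->
  <value>/,/user,/user/apps,/user/apps/hive,/user/apps/hbase</value>
</property>
```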
30.
Addressing: Hive query returning random results
Daunting Questions
⬢ Is my cluster configured consistently?
⬢ How do I prevent such hard-to-analyze issues?
⬢ How do I make sure newly added nodes do not introduce these types of issues?
⬢ How do I make these setups person-independent?
SmartSense Answer
⬢ Rule: os_time_zone
⬢ Checks that all hosts have the same time zone
⬢ Rule: os_service_ntpd_on makes sure all host clocks are in sync
⬢ Continuous evaluation that validates newly added and re-commissioned nodes
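A consistency check in the spirit of os_time_zone can be approximated in a few lines of shell. A sketch: given `host zone` lines (collected however you like, e.g. over ssh), flag hosts whose zone differs from the first:

```shell
# Sketch: report hosts whose time zone differs from the first host listed.
check_tz() {
  awk 'NR==1 {ref=$2} $2 != ref {print $1 " has zone " $2 " (expected " ref ")"}'
}

# Hypothetical input; in practice something like:
#   for h in $HOSTS; do ssh "$h" 'echo "$(hostname) $(date +%Z)"'; done | check_tz
printf 'worker1 UTC\nworker2 UTC\nworker3 PST\n' | check_tz
```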
31.
There are 250+ more such rules
Operations
hdfs_dn_volume_tolerance
hdfs_dn_xceivers
hdfs_nn_handler_count
…
yarn_zk_quorum
yarn_nm_recovery
…
os_hostname_reverse_lookup
os_ssd_tuning
…
hive_mr_strict_mode
hive_datanucleus_cache
…
tez_am_heap
tez_shuffle_buffer
…
Performance
ams_mc_distributed_configs
ams_mc_write_path
…
hbase_jvm_opts
hbase_rs_open_region_threads
hbase_tcp_nodelay
…
hdfs_dn_jvm_opts
hdfs_mount_options
hdfs_nn_dn_staleness_interval
…
hive_auto_convert_join
hive_disable_caching
hive_enable_cbo
…
Security
hdfs_dn_volume_tolerance
hdfs_audit_log
hdfs_block_access_token
hdfs_enable_security_check
hdfs_nn_super_user_group
hdfs_zkfc_ha_acl
…
ranger_policy_refresh_interval
smartsense_2_way_ssl_enabled
…
yarn_ats_security
yarn_enable_acl
…
32.
There is more than just configurations
⬢ How do I show back / charge back my tenants?
⬢ Who are the top users of my platform?
⬢ What types of workloads are running on my cluster?
⬢ Which jobs have significant impact on my cluster?
⬢ How do I improve the performance of key jobs?
⬢ What is a good time for maintenance?
34.
Summary
There are many things involved in managing a Hadoop cluster
Best practices evolve and change across versions
What is optimal today may not be optimal tomorrow
Changing cluster dynamics and workload characteristics need continuous re-evaluation and configuration adjustments
SmartSense can significantly help avoid common mistakes, issues, and pitfalls, and simplify Hadoop operations
35.
Let's keep your Hadoop cluster at its best!
Thank You!
37.
More Resources
⬢ https://docs.hortonworks.com/index.html
⬢ http://hortonworks.com/products/subscriptions/smartsense/
⬢ http://hortonworks.com/info/smartsense/
⬢ http://hortonworks.com/blog/introducing-hortonworks-smartsense/
⬢ https://www.youtube.com/watch?v=IKulo9c8PjE
⬢ https://community.hortonworks.com/topics/smartsense.html
38.
SmartSense Bundle Security
⬢ All Bundles are Anonymized and Encrypted
⬢ Multiple built-in security measures
– Ambari clear text passwords are not collected
– Hive and Oozie database properties are not collected
– All IP addresses and host names are anonymized
⬢ Extensible security rules
– Exclude properties within specific Hadoop configuration files
– Global REGEX replacements across all configuration, metrics, and logs
39.
SmartSense Stack Support
SmartSense 1.x supports HDP 2.0 – 2.4:
– Built into Ambari 2.2
– Available as a plug-in for Ambari 2.1, 2.0, 1.7, and 1.6
Editor's notes: SmartSense bundles include configuration and metrics; bundles used for support-case troubleshooting also include log files. This data is captured for the operating system of cluster nodes, as well as for all installed HDP services.
The capture process can be configured to exclude specific files, or specific Hadoop properties within HDP configuration files. To protect organization-specific data such as customer IDs, patient IDs, and credit card numbers, we provide the capability to specify regular expressions to be removed or replaced in any file captured by SmartSense. This protects sensitive data in the event that it is unintentionally leaked into log files.
By default we remove all properties associated with clear-text passwords. Ambari, Hive, and Oozie store DB credentials as clear text by default, unless they have been configured to encrypt them. In case Hadoop operators have not taken the time to do so, we exclude those properties by default.