DEBUGGING HIVE WITH
HADOOP IN THE CLOUD
Soam Acharya, David Chaiken, Denis Sheahan, Charles Wimmer
Altiscale, Inc.
June 4, 2014
WHO ARE WE?
• Altiscale: Infrastructure Nerds!
• Hadoop As A Service
• Rack and build our own Hadoop clusters
• Provide a suite of Hadoop tools
o Hive, Pig, Oozie
o Others as needed: R, Python, Spark, Mahout, Impala, etc.
• Monthly billing plan: compute, storage
• https://www.altiscale.com
• @Altiscale #HadoopSherpa
TALK ROADMAP
• Our Platform and Perspective
• Hadoop 2 Primer
• Hadoop Debugging Tools
• Accessing Logs in Hadoop 2
• Hive + Hadoop Architecture
• Hive Logs
• Hive Issues + Case Studies
• Conclusion: Making Hive Easier to Use
OUR DYNAMIC PLATFORM
• Hadoop 2.0.5 => Hadoop 2.2.0
• Hive 0.10 => Hive 0.12
• Hive, Pig and Oozie most commonly used tools
• Working with customers on:
Spark, Stinger (Hive 0.13 + Tez), Impala, …
ALTISCALE PERSPECTIVE
• Service provider
o Hadoop Dialtone!
o Keep Hadoop/Hive + other tools running
o Service Level Agreements target application-level metrics
o Multiple clusters/customers
o Operational scalability
o Multi-tenancy
• Operational approach
o How to use Hadoop 2 cluster tools and logs
to debug and to tune
o This talk will not focus on query optimization
[Diagram: Hadoop 2 cluster. Master daemons: NameNode, Secondary NameNode, Resource Manager. Each Hadoop slave runs a Node Manager and a Data Node.]
QUICK PRIMER – HADOOP 2
QUICK PRIMER – HADOOP 2 YARN
• Resource Manager (per cluster)
o Manages job scheduling and execution
o Global resource allocation
• Application Master (per job)
o Manages task scheduling and execution
o Local resource allocation
• Node Manager (per-machine agent)
o Manages the lifecycle of task containers
o Reports to RM on health and resource usage
HADOOP 1 VS HADOOP 2
• No more JobTrackers, TaskTrackers
• YARN ~ Operating System for Clusters
o MapReduce is implemented as a YARN application
o Bring on the applications! (Spark is just the start…)
• Should be transparent to Hive users
HADOOP 2 DEBUGGING TOOLS
• Monitoring
o System state of cluster:
 CPU, Memory, Network, Disk
 Nagios, Ganglia, Sensu!
 Collectd, statsd, Graphite
o Hadoop level
 HDFS usage
 Resource usage:
• Container memory allocated vs used
• # of jobs running at the same time
• Long running tasks
HADOOP 2 DEBUGGING TOOLS
• Hadoop logs
o Daemon logs: Resource Manager, NameNode, DataNode
o Application logs: Application Master, MapReduce tasks
o Job history file: resources allocated during job lifetime
o Application configuration files: store all Hadoop application
parameters
• Source code instrumentation
ACCESSING LOGS IN HADOOP 2
• To view the logs for a job, click on the link under the ID
column in the Resource Manager UI.
ACCESSING LOGS IN HADOOP 2
• To view application top level logs, click on logs.
• To view individual logs for the mappers and reducers,
click on History.
ACCESSING LOGS IN HADOOP 2
• Log output for the entire application.
ACCESSING LOGS IN HADOOP 2
• Click on the Map link for mapper logs and the Reduce
link for reducer logs.
ACCESSING LOGS IN HADOOP 2
• Clicking on a single link under Name provides an
overview for that particular map task.
ACCESSING LOGS IN HADOOP 2
• Finally, clicking on the logs link will take you to the log
output for that map task.
ACCESSING LOGS IN HADOOP 2
• Fun, fun, donuts, and more fun…
HIVE + HADOOP 2 ARCHITECTURE
• Hive 0.10+
[Diagram: the Hive CLI talks directly to the Hive Metastore and the Hadoop 2 cluster; JDBC/ODBC clients such as Kettle and Hue go through HiveServer.]
HIVE LOGS
• Query Log location
• From /etc/hive/hive-site.xml:
<property>
<name>hive.querylog.location</name>
<value>/home/hive/log/${user.name}</value>
</property>
SessionStart SESSION_ID="soam_201402032341"
TIME="1391470900594"
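The TIME field in these query-log entries is epoch milliseconds. A small sketch (GNU date assumed) for converting it when cross-referencing with other logs:

```shell
# SessionStart TIME above is epoch milliseconds; strip the last three
# digits to get seconds and hand them to date(1). GNU date assumed.
ts_ms=1391470900594
date -u -d "@${ts_ms%???}" '+%Y-%m-%d %H:%M:%S'   # prints 2014-02-03 23:41:40 (UTC)
```

The result lines up with the timestamp embedded in the SESSION_ID (soam_201402032341).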
HIVE CLIENT LOGS
• /etc/hive/hive-log4j.properties:
o hive.log.dir=/var/log/hive/${user.name}
2014-05-29 19:51:09,830 INFO parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing
command: select count(*) from dogfood_job_data
2014-05-29 19:51:09,852 INFO parse.ParseDriver (ParseDriver.java:parse(197)) - Parse
Completed
2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG
method=parse start=1401393069830 end=1401393069852 duration=22>
2014-05-29 19:51:09,853 INFO ql.Driver (PerfLogger.java:PerfLogBegin(97)) - <PERFLOG
method=semanticAnalyze>
2014-05-29 19:51:09,890 INFO parse.SemanticAnalyzer
(SemanticAnalyzer.java:analyzeInternal(8305)) - Starting Semantic Analysis
2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer
(SemanticAnalyzer.java:analyzeInternal(8340)) - Completed phase 1 of Semantic Analysis
2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer
(SemanticAnalyzer.java:getMetaData(1060)) - Get metadata for source tables
2014-05-29 19:51:09,906 INFO parse.SemanticAnalyzer
(SemanticAnalyzer.java:getMetaData(1167)) - Get metadata for subqueries
2014-05-29 19:51:09,909 INFO parse.SemanticAnalyzer
(SemanticAnalyzer.java:getMetaData(1187)) - Get metadata for destination tables
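The PERFLOG end markers in a client log like the one above record per-phase durations, so a grep/awk one-liner can rank where compile time goes. A hedged sketch (log path follows the hive.log.dir setting above):

```shell
# Rank Hive client PERFLOG phases by duration, slowest first.
# Expects a hive.log-style file like the excerpt above.
slow_phases() {
  grep -o 'method=[^ ]*.*duration=[0-9]*' "$1" |
    awk -F'duration=' '{ print $2, $1 }' | sort -rn
}
# Usage: slow_phases /var/log/hive/$USER/hive.log | head
```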
HIVE METASTORE LOGS
• /etc/hive-metastore/hive-log4j.properties:
o hive.log.dir=/service/log/hive-metastore/${user.name}
2014-05-29 19:50:50,179 INFO metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94
get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,180 INFO HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94
cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,236 INFO metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94
get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,236 INFO HiveMetaStore.audit
(HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94
cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,261 INFO metastore.HiveMetaStore
(HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94
get_table : db=default tbl=dogfood_job_data
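The audit lines above can be aggregated from the command line to see who or what is hammering the metastore. A hedged sketch (the log file name is assumed to follow the hive-log4j convention, under the hive.log.dir shown above):

```shell
# Count metastore audit events per user (ugi), busiest first.
audit_top() {
  grep -o 'ugi=[^ ]*' "$1" | sort | uniq -c | sort -rn
}
# Usage: audit_top /service/log/hive-metastore/hive/hive.log | head
```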
HIVE ISSUES + CASE STUDIES
• Hive Issues
o Hive client out of memory
o Hive map/reduce task out of memory
o Hive metastore out of memory
o Hive launches too many tasks
• Case Studies:
o Hive “stuck” job
o Hive “missing directories”
o Analyze Hive Query Execution
HIVE CLIENT OUT OF MEMORY
• Memory intensive client side hive query (map-side join)
Number of reduce tasks not specified. Estimated from input data size: 999
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
java.lang.OutOfMemoryError: Java heap space
at java.nio.CharBuffer.wrap(CharBuffer.java:350)
at java.nio.CharBuffer.wrap(CharBuffer.java:373)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
HIVE CLIENT OUT OF MEMORY
• Use HADOOP_HEAPSIZE prior to launching Hive client
• HADOOP_HEAPSIZE=<new heapsize> hive <fileName>
• Watch out for HADOOP_CLIENT_OPTS issue in hive-env.sh!
• Know how much memory is available on the machine
running the client; do not exceed it or claim a
disproportionate share.
$ free -m
total used free shared buffers cached
Mem: 1695 1388 306 0 60 424
-/+ buffers/cache: 903 791
Swap: 895 101 794
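A sketch of turning that `free -m` output into a safe client heap: take half of the post-buffers/cache free memory (the "-/+ buffers/cache" row, old-style procps output as shown above; the one-half rule is an assumption, not a recommendation from the slides):

```shell
# Derive HADOOP_HEAPSIZE (MB) as half the post-cache free memory from
# old-style `free -m` output, so the client never claims the machine.
pick_heap_mb() {
  awk '/buffers\/cache/ { print int($NF / 2) }'
}
# Usage: HADOOP_HEAPSIZE=$(free -m | pick_heap_mb) hive -f query.hql
```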
HIVE TASK OUT OF MEMORY
• Query spawns MapReduce jobs that run out of memory
• How to find this issue?
o Hive diagnostic message
o Hadoop MapReduce logs
HIVE TASK OUT OF MEMORY
• Fix is to increase task RAM allocation…
set mapreduce.map.memory.mb=<new RAM allocation>;
set mapreduce.reduce.memory.mb=<new RAM allocation>;
• Also watch out for…
set mapreduce.map.java.opts=-Xmx<heap size>m;
set mapreduce.reduce.java.opts=-Xmx<heap size>m;
• Not a magic bullet – requires manual tuning
• Increase in individual container memory size:
o Decrease in overall containers that can be run
o Decrease in overall parallelism
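Hypothetical numbers make the tradeoff concrete: on a 24 GB NodeManager, doubling the container size halves the slots per node, and the task heap (-Xmx from the java.opts settings above) should sit below the container size to leave JVM headroom (the 80% ratio here is an assumption):

```shell
# Containers per node = NodeManager memory / container size;
# heap sized at ~80% of the container (hypothetical sizing rule).
nm_mb=24576
for container_mb in 2048 4096; do
  echo "container=${container_mb}MB slots/node=$(( nm_mb / container_mb )) heap=$(( container_mb * 80 / 100 ))m"
done
```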
HIVE METASTORE OUT OF MEMORY
• Out of memory issues not necessarily dumped to logs
• Metastore can become unresponsive
• Can’t submit queries
• Restart with a higher heap size:
export HADOOP_HEAPSIZE in hcat_server.sh
• After notifying hive users about downtime:
service hcat restart
HIVE LAUNCHES TOO MANY TASKS
• Typically a function of the input data set
• Lots of little files
HIVE LAUNCHES TOO MANY TASKS
• Set mapred.max.split.size to an appropriate fraction of the data size
• Also verify that
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
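The split-size rule of thumb as arithmetic: target split size ≈ input size / desired mapper count, with mapred.max.split.size taking a value in bytes. Hypothetical numbers (500 GB input, ~1000 mappers):

```shell
# 500 GB of input split across ~1000 mappers => ~512 MB per split.
data_bytes=$(( 500 * 1024 * 1024 * 1024 ))
echo "set mapred.max.split.size=$(( data_bytes / 1000 ));"
```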
CASE STUDY: HIVE STUCK JOB
From an Altiscale customer:
“This job [jobid] has been running now for
41 hours. Is it still progressing or has
something hung up the map/reduce so it’s
just spinning? Do you have any insight?”
HIVE STUCK JOB
1. Received jobId,
application_1382973574141_4536, from client
2. Logged into client cluster.
3. Pulled up Resource Manager
4. Entered part of jobId (4536) in the search box.
5. Clicked on the link that says:
application_1382973574141_4536
6. On resulting Application Overview page, clicked on link
next to “Tracking URL” that said Application Master
HIVE STUCK JOB
7. On resulting MapReduce Application page, we clicked on the
Job Id (job_1382973574141_4536).
8. The resulting MapReduce Job page displayed detailed status
of the mappers, including 4 failed mappers
9. We then clicked on the 4 link on the Maps row in the Failed
column.
10. Title of the next page was “FAILED Map attempts in
job_1382973574141_4536.”
11. Each failed mapper generated an error message.
12. Buried in the 16th line:
Caused by: java.io.FileNotFoundException: File
does not exist:
hdfs://opaque_hostname:8020/HiveTableDir/FileNa
me.log.date.seq
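The click path above can be shortened from the command line: `yarn logs -applicationId application_1382973574141_4536 > app.log` dumps the aggregated logs, and a one-line grep surfaces the buried root cause. A sketch:

```shell
# Print the first "Caused by" line (plus two lines of context) from an
# aggregated application log dump.
first_cause() { grep -m1 -A2 'Caused by:' "$1"; }
# Usage: first_cause app.log
```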
HIVE STUCK JOB
• Job was stuck for a day or so, retrying a mapper that
would never finish successfully.
• During the job, our customers’ colleague realized input
file was corrupted and deleted it.
• Colleague did not anticipate the effect of removing
corrupted data on a running job
• Hadoop didn’t make it easy to find out:
o RM => search => application link => AM overview page => MR
Application Page => MR Job Page => Failed jobs page =>
parse long logs
o Task retry without hope of success
HIVE “MISSING DIRECTORIES”
From an Altiscale customer:
“One problem we are seeing after the
[Hive Metastore] restart is that we lost
quite a few directories in [HDFS]. Is there
a way to recover these?”
HIVE “MISSING DIRECTORIES”
• Obtained list of “missing” directories from customer:
o /hive/biz/prod/*
• Confirmed they were missing from HDFS
• Searched through NameNode audit log to get block IDs that
belonged to missing directories.
13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock:
/hive/biz/prod/incremental/carryoverstore/postdepuis/lmt_unmapped_pggroup_schema._COPYING_.
BP-798113632-10.251.255.251-1370812162472
blk_3560522076897293424_2448396{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[10.251.255.177:50010|RBW],
ReplicaUnderConstruction[10.251.255.174:50010|RBW],
ReplicaUnderConstruction[10.251.255.169:50010|RBW]]}
HIVE “MISSING DIRECTORIES”
• Used the block ID to locate the exact time of file deletion
from the NameNode logs:
13/07/31 08:10:33 INFO hdfs.StateChange:
BLOCK* addToInvalidates:
blk_3560522076897293424_2448396 to
10.251.255.177:50010 10.251.255.169:50010
10.251.255.174:50010
• Used time of deletion to inspect hive logs
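The two NameNode log lookups above can be folded into one helper: given a block ID, pull its allocation and invalidation events to bracket the lifetime of a deleted file. A sketch (the log path in the usage line is hypothetical):

```shell
# Show when a block was allocated and when it was invalidated,
# from NameNode log files.
trace_block() {
  grep -h "$1" "$2" | grep -E 'allocateBlock|addToInvalidates'
}
# Usage: trace_block blk_3560522076897293424 /var/log/hadoop/namenode.log
```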
HIVE “MISSING DIRECTORIES”
QueryStart QUERY_STRING="create database biz_weekly location
'/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-
ba7d-3bd8a2698f4b" TIME="1375245164667"
:
QueryEnd QUERY_STRING="create database biz_weekly location
'/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-
ba7d-3bd8a2698f4b" QUERY_RET_CODE="0" QUERY_NUM_TASKS="0"
TIME="1375245166203"
:
QueryStart QUERY_STRING="drop database biz_weekly"
QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733"
TIME="1375256014799"
:
QueryEnd QUERY_STRING="drop database biz_weekly"
QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733"
QUERY_NUM_TASKS="0" TIME="1375256014838"
HIVE “MISSING DIRECTORIES”
• In effect, user “usrprod” issued:
At 2013-07-31 04:32:44: create database biz_weekly
location '/hive/biz/prod'
At 2013-07-31 07:33:24: drop database biz_weekly
• This is functionally equivalent to:
hdfs dfs -rm -r /hive/biz/prod
HIVE “MISSING DIRECTORIES”
• Customer manually placed their own data in /hive,
the warehouse directory managed and controlled by Hive
• Customer used CREATE and DROP database commands in
their code
o Hive deletes database and table locations in /hive with
impunity
• Why didn't deleted data end up in .Trash?
o Trash collection was not turned on in the configuration
o It is now, but we still need a -skipTrash option (HIVE-6469)
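The trash setting in question lives in core-site.xml; a sketch of the property whose absence made the delete unrecoverable (the value is in minutes, and 1440 here is an example, not the slide's actual setting):

```xml
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```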
HIVE “MISSING DIRECTORIES”
• Hadoop forensics: piece together disparate sources…
o Hadoop daemon logs (NameNode)
o Hive query and metastore logs
o Hadoop config files
• Need better tools to correlate the different layers of the
system: hive client, hive metastore, MapReduce job,
YARN, HDFS, operating system metrics, …
By the way… Operating any distributed system would be
totally insane without NTP and a standard time zone (UTC).
CASE STUDY – ANALYZE QUERY
• Customer provided Hive query + data sets
(100GBs to ~5 TBs)
• Needed help optimizing the query
• Didn’t rewrite query immediately
• Wanted to characterize query performance and isolate
bottlenecks first
ANALYZE AND TUNE EXECUTION
• Ran original query on the datasets in our environment:
o Two M/R Stages: Stage-1, Stage-2
• Long running reducers run out of memory
o set mapreduce.reduce.memory.mb=5120
o Reduces slots and extends reduce time
• Query fails to launch Stage-2 with out of memory
o set HADOOP_HEAPSIZE=1024 on client machine
• Query has 250,000 Mappers in Stage-2 which causes
failure
o set mapred.max.split.size=5368709120
to reduce Mappers
ANALYSIS: HOW TO VISUALIZE?
• Next challenge - how to visualize job execution?
• Existing hadoop/hive logs not sufficient for this task
• Wrote internal tools
o parse job history files
o plot mapper and reducer execution
ANALYSIS: MAP STAGE-1
Single reduce task
ANALYSIS: REDUCE STAGE-1
ANALYSIS: MAP STAGE-2
ANALYSIS: REDUCE STAGE-2
ANALYZE EXECUTION: FINDINGS
• Lone, long running reducer in first stage of query
• Analyzed input data:
o Query split input data by userId
o Bucketizing input data by userId
o One very large bucket: “invalid” userId
o Discussed “invalid” userid with customer
• An error value is a common pattern!
o Need to differentiate between "don't know and don't care"
and "don't know and do care"
CONCLUSIONS
• Hive + Hadoop debugging can get very complex
o Sifting through many logs and screens
o Automatic transmission versus manual transmission
• Static memory partitioning induced by the Java Virtual
Machine has benefits but also creates challenges.
• Where there are difficulties, there’s opportunity
o Better tooling
o Better instrumentation
o Better integration of disparate logs and metrics
• Hadoop as a Service: aggregate and share expertise
• Need to learn from the traditional database community!
QUESTIONS? COMMENTS?
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Dernier (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Debugging Hive with Hadoop in the Cloud

  • 1. DEBUGGING HIVE WITH HADOOP IN THE CLOUD Soam Acharya, David Chaiken, Denis Sheahan, Charles Wimmer Altiscale, Inc. June 4, 2014
  • 2. WHO ARE WE? • Altiscale: Infrastructure Nerds! • Hadoop As A Service • Rack and build our own Hadoop clusters • Provide a suite of Hadoop tools o Hive, Pig, Oozie o Others as needed: R, Python, Spark, Mahout, Impala, etc. • Monthly billing plan: compute, storage • https://www.altiscale.com • @Altiscale #HadoopSherpa
  • 3. TALK ROADMAP • Our Platform and Perspective • Hadoop 2 Primer • Hadoop Debugging Tools • Accessing Logs in Hadoop 2 • Hive + Hadoop Architecture • Hive Logs • Hive Issues + Case Studies • Conclusion: Making Hive Easier to Use
  • 4. OUR DYNAMIC PLATFORM • Hadoop 2.0.5 => Hadoop 2.2.0 • Hive 0.10 => Hive 0.12 • Hive, Pig and Oozie most commonly used tools • Working with customers on: Spark, Stinger (Hive 0.13 + Tez), Impala, …
  • 5. ALTISCALE PERSPECTIVE • Service provider o Hadoop Dialtone! o Keep Hadoop/Hive + other tools running o Service Level Agreements target application-level metrics o Multiple clusters/customers o Operational scalability o Multi-tenancy • Operational approach o How to use Hadoop 2 cluster tools and logs to debug and to tune o This talk will not focus on query optimization
  • 6. Hadoop 2 Cluster Name Node Hadoop Slave Hadoop Slave Hadoop Slave Resource Manager Secondary NameNode Hadoop Slave Node Managers + Data Nodes QUICK PRIMER – HADOOP 2
  • 7. QUICK PRIMER – HADOOP 2 YARN • Resource Manager (per cluster) o Manages job scheduling and execution o Global resource allocation • Application Master (per job) o Manages task scheduling and execution o Local resource allocation • Node Manager (per-machine agent) o Manages the lifecycle of task containers o Reports to RM on health and resource usage
  • 8. HADOOP 1 VS HADOOP 2 • No more JobTrackers, TaskTrackers • YARN ~ Operating System for Clusters o MapReduce is implemented as a YARN application o Bring on the applications! (Spark is just the start…) • Should be Transparent to Hive users
  • 9. HADOOP 2 DEBUGGING TOOLS • Monitoring o System state of cluster:  CPU, Memory, Network, Disk  Nagios, Ganglia, Sensu!  Collectd, statd, Graphite o Hadoop level  HDFS usage  Resource usage: • Container memory allocated vs used • # of jobs running at the same time • Long running tasks
  • 10. HADOOP 2 DEBUGGING TOOLS • Hadoop logs o Daemon logs: Resource Manager, NameNode, DataNode o Application logs: Application Master, MapReduce tasks o Job history file: resources allocated during job lifetime o Application configuration files: store all Hadoop application parameters • Source code instrumentation
  • 11.
  • 12. ACCESSING LOGS IN HADOOP 2 • To view the logs for a job, click on the link under the ID column in Resource Manager UI.
  • 13. ACCESSING LOGS IN HADOOP 2 • To view application top level logs, click on logs. • To view individual logs for the mappers and reducers, click on History.
  • 14. ACCESSING LOGS IN HADOOP 2 • Log output for the entire application.
  • 15. ACCESSING LOGS IN HADOOP 2 • Click on the Map link for mapper logs and the Reduce link for reducer logs.
  • 16. ACCESSING LOGS IN HADOOP 2 • Clicking on a single link under Name provides an overview for that particular map job.
  • 17. ACCESSING LOGS IN HADOOP 2 • Finally, clicking on the logs link will take you to the log output for that map job.
  • 18. ACCESSING LOGS IN HADOOP 2 • Fun, fun, donuts, and more fun…
  • 19. HIVE + HADOOP 2 ARCHITECTURE • Hive 0.10+ Hadoop 2 Cluster Hive CLI Hive Metastore Hiveserver JDBC/ODBC Kettle, Hue, …
  • 20. HIVE LOGS • Query Log location • From /etc/hive/hive-site.xml: <property> <name>hive.querylog.location</name> <value>/home/hive/log/${user.name}</value> </property> SessionStart SESSION_ID="soam_201402032341" TIME="1391470900594"
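The query log is a flat file of SessionStart/QueryStart events like the one above, so session activity can be listed with a grep. A minimal sketch over a sample log line (the temp file stands in for the real `hive.querylog.location` path, which varies per install):

```shell
# Sketch: list Hive sessions recorded in a query log.
# The log path and entry here are illustrative, copied from the sample above.
log=$(mktemp)
printf 'SessionStart SESSION_ID="soam_201402032341" TIME="1391470900594"\n' > "$log"

# Each SessionStart line carries the session ID; TIME is epoch milliseconds.
grep -o 'SESSION_ID="[^"]*"' "$log"

rm -f "$log"
```

The same pattern applied to `QUERY_STRING=` and `TIME=` fields is what makes the query log useful for forensics later in this talk.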
  • 21. HIVE CLIENT LOGS • /etc/hive/hive-log4j.properties: o hive.log.dir=/var/log/hive/${user.name} 2014-05-29 19:51:09,830 INFO parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: select count(*) from dogfood_job_data 2014-05-29 19:51:09,852 INFO parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed 2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG method=parse start=1401393069830 end=1401393069852 duration=22> 2014-05-29 19:51:09,853 INFO ql.Driver (PerfLogger.java:PerfLogBegin(97)) - <PERFLOG method=semanticAnalyze> 2014-05-29 19:51:09,890 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8305)) - Starting Semantic Analysis 2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8340)) - Completed phase 1 of Semantic Analysis 2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1060)) - Get metadata for source tables 2014-05-29 19:51:09,906 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1167)) - Get metadata for subqueries 2014-05-29 19:51:09,909 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1187)) - Get metadata for destination tables
  • 22. HIVE METASTORE LOGS • /etc/hive-metastore/hive-log4j.properties: o hive.log.dir=/service/log/hive-metastore/${user.name} 2014-05-29 19:50:50,179 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data 2014-05-29 19:50:50,180 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data 2014-05-29 19:50:50,236 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data 2014-05-29 19:50:50,236 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data 2014-05-29 19:50:50,261 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
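Because every audit line carries a `ugi=` field, the metastore log doubles as a per-user activity record. A sketch (sample line mirrors the HiveMetaStore.audit format above; counting heavy clients this way is our suggestion, not something the deck prescribes):

```shell
# Sketch: count metastore audit events per user to spot heavy clients.
log=$(mktemp)
cat > "$log" <<'EOF'
2014-05-29 19:50:50,180 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=get_table : db=default tbl=dogfood_job_data
EOF

# Extract the ugi= token from each audit line and tally per user.
grep -o 'ugi=[^ ]*' "$log" | sort | uniq -c | sort -rn

rm -f "$log"
```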
  • 23. HIVE ISSUES + CASE STUDIES • Hive Issues o Hive client out of memory o Hive map/reduce task out of memory o Hive metastore out of memory o Hive launches too many tasks • Case Studies: o Hive “stuck” job o Hive “missing directories” o Analyze Hive Query Execution
  • 24. HIVE CLIENT OUT OF MEMORY • Memory intensive client side hive query (map-side join) Number of reduce tasks not specified. Estimated from input data size: 999 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapred.reduce.tasks=<number> java.lang.OutOfMemoryError: Java heap space at java.nio.CharBuffer.wrap(CharBuffer.java:350) at java.nio.CharBuffer.wrap(CharBuffer.java:373) at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
  • 25. HIVE CLIENT OUT OF MEMORY • Use HADOOP_HEAPSIZE prior to launching Hive client • HADOOP_HEAPSIZE=<new heapsize> hive <fileName> • Watch out for HADOOP_CLIENT_OPTS issue in hive-env.sh! • Important to know the amount of memory available on the machine running the client… do not exceed it or claim a disproportionate share. $ free -m total used free shared buffers cached Mem: 1695 1388 306 0 60 424 -/+ buffers/cache: 903 791 Swap: 895 101 794
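One way to keep the client within bounds is to derive the heap from the machine's memory rather than guessing. A hypothetical sizing sketch (the 1695 MB figure echoes the `free -m` output above; the "half of total" ratio and `query.hql` are illustrative assumptions):

```shell
# Hypothetical sizing sketch: give the Hive client about half of machine memory.
free_mb=1695             # from `free -m` on the client machine (sample value above)
heap_mb=$((free_mb / 2))

# Print the launch command rather than executing it; hive may not be on PATH here.
echo "HADOOP_HEAPSIZE=${heap_mb} hive -f query.hql"
```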
  • 26. HIVE TASK OUT OF MEMORY • Query spawns MapReduce jobs that run out of memory • How to find this issue? o Hive diagnostic message o Hadoop MapReduce logs
  • 27. HIVE TASK OUT OF MEMORY • Fix is to increase task RAM allocation… set mapreduce.map.memory.mb=<new RAM allocation>; set mapreduce.reduce.memory.mb=<new RAM allocation>; • Also watch out for… set mapreduce.map.java.opts=-Xmx<heap size>m; set mapreduce.reduce.java.opts=-Xmx<heap size>m; • Not a magic bullet – requires manual tuning • Increase in individual container memory size: o Decrease in overall containers that can be run o Decrease in overall parallelism
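The two knobs above must move together: `-Xmx` has to stay below the container size or YARN will kill the task anyway. A common convention (our assumption, not from the slides) is heap at roughly 80% of the container, leaving headroom for JVM overhead. A sketch that emits matched pairs of settings:

```shell
# Sketch: keep -Xmx below the YARN container size (80% is a common rule of thumb).
container_mb=4096                   # hypothetical new RAM allocation
heap_mb=$((container_mb * 8 / 10))  # 80% of the container, in MB

cat <<EOF
set mapreduce.map.memory.mb=${container_mb};
set mapreduce.map.java.opts=-Xmx${heap_mb}m;
set mapreduce.reduce.memory.mb=${container_mb};
set mapreduce.reduce.java.opts=-Xmx${heap_mb}m;
EOF
```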
  • 28. HIVE METASTORE OUT OF MEMORY • Out of memory issues not necessarily dumped to logs • Metastore can become unresponsive • Can’t submit queries • Restart with a higher heap size: export HADOOP_HEAPSIZE in hcat_server.sh • After notifying hive users about downtime: service hcat restart
  • 29. HIVE LAUNCHES TOO MANY TASKS • Typically a function of the input data set • Lots of little files
  • 30. HIVE LAUNCHES TOO MANY TASKS • Set mapred.max.split.size to appropriate fraction of data size • Also verify that hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
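`mapred.max.split.size` is specified in bytes, so "appropriate fraction of data size" can be computed from the input size and a target mapper count. A sketch with hypothetical numbers (1 TiB of input, 2000 mappers):

```shell
# Sketch: pick a split size that yields roughly target_mappers map tasks.
data_bytes=$((1024 * 1024 * 1024 * 1024))   # 1 TiB of input, hypothetical
target_mappers=2000
split_bytes=$((data_bytes / target_mappers))

echo "set mapred.max.split.size=${split_bytes};"
echo "set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;"
```

With CombineHiveInputFormat enabled, many small files are packed into each split, so the mapper count tracks the split size rather than the file count.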
  • 31. CASE STUDY: HIVE STUCK JOB From an Altiscale customer: “This job [jobid] has been running now for 41 hours. Is it still progressing or has something hung up the map/reduce so it’s just spinning? Do you have any insight?”
  • 32. HIVE STUCK JOB 1. Received jobId, application_1382973574141_4536, from client 2. Logged into client cluster. 3. Pulled up Resource Manager 4. Entered part of jobId (4536) in the search box. 5. Clicked on the link that says: application_1382973574141_4536 6. On resulting Application Overview page, clicked on link next to “Tracking URL” that said Application Master
  • 33. HIVE STUCK JOB 7. On resulting MapReduce Application page, we clicked on the Job Id (job_1382973574141_4536). 8. The resulting MapReduce Job page displayed detailed status of the mappers, including 4 failed mappers 9. We then clicked on the 4 link on the Maps row in the Failed column. 10. Title of the next page was “FAILED Map attempts in job_1382973574141_4536.” 11. Each failed mapper generated an error message. 12. Buried in the 16th line: Caused by: java.io.FileNotFoundException: File does not exist: hdfs://opaque_hostname:8020/HiveTableDir/FileName.log.date.seq
  • 34. HIVE STUCK JOB • Job was stuck for a day or so, retrying a mapper that would never finish successfully. • During the job, our customers’ colleague realized input file was corrupted and deleted it. • Colleague did not anticipate the effect of removing corrupted data on a running job • Hadoop didn’t make it easy to find out: o RM => search => application link => AM overview page => MR Application Page => MR Job Page => Failed jobs page => parse long logs o Task retry without hope of success
  • 35. HIVE “MISSING DIRECTORIES” From an Altiscale customer: “One problem we are seeing after the [Hive Metastore] restart is that we lost quite a few directories in [HDFS]. Is there a way to recover these?”
  • 36. HIVE “MISSING DIRECTORIES” • Obtained list of “missing” directories from customer: o /hive/biz/prod/* • Confirmed they were missing from HDFS • Searched through NameNode audit log to get block IDs that belonged to missing directories. 13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hive/biz/prod/incremental/carryoverstore/postdepuis/lmt_unmapped_pggroup_schema._COPYING_. BP-798113632-10.251.255.251-1370812162472 blk_3560522076897293424_2448396{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.251.255.177:50010|RBW], ReplicaUnderConstruction[10.251.255.174:50010|RBW], ReplicaUnderConstruction[10.251.255.169:50010|RBW]]}
  • 37. HIVE “MISSING DIRECTORIES” • Used blockID to locate exact time of file deletion from Namenode logs: 13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates: blk_3560522076897293424_2448396 to 10.251.255.177:50010 10.251.255.169:50010 10.251.255.174:50010 • Used time of deletion to inspect hive logs
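The two NameNode greps above (allocation, then invalidation) can be chained into one script. A self-contained sketch over sample log lines condensed from the excerpts above (the `/hive/biz/prod/incremental/file` path is a shortened stand-in):

```shell
# Sketch: trace a file's block from allocation to deletion in NameNode logs.
log=$(mktemp)
cat > "$log" <<'EOF'
13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hive/biz/prod/incremental/file blk_3560522076897293424_2448396{...}
13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates: blk_3560522076897293424_2448396 to 10.251.255.177:50010
EOF

# 1) find the block ID allocated under the missing path
blk=$(grep 'allocateBlock: /hive/biz/prod' "$log" | grep -o 'blk_[0-9]*_[0-9]*' | head -1)

# 2) find when that block was invalidated, i.e. when the file was deleted
grep "addToInvalidates: $blk" "$log"

rm -f "$log"
```

The timestamp on the `addToInvalidates` line is what lets us pivot into the Hive logs next.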
  • 38. HIVE “MISSING DIRECTORIES” QueryStart QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" TIME="1375245164667" : QueryEnd QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" QUERY_RET_CODE="0" QUERY_NUM_TASKS="0" TIME="1375245166203" : QueryStart QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" TIME="1375256014799" : QueryEnd QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" QUERY_NUM_TASKS="0" TIME="1375256014838"
  • 39. HIVE “MISSING DIRECTORIES” • In effect, user “usrprod” issued: At 2013-07-31 04:32:44: create database biz_weekly location '/hive/biz/prod' At 2013-07-31 07:33:24: drop database biz_weekly • This is functionally equivalent to: hdfs dfs -rm -r /hive/biz/prod
  • 40. HIVE “MISSING DIRECTORIES” • Customer manually placed their own data in /hive – the warehouse directory managed and controlled by hive • Customer used CREATE and DROP db commands in their code o Hive deletes database and table locations in /hive with impunity • Why didn’t deleted data end up in .Trash? o Trash collection not turned on in configuration settings o It is now, but need a -skipTrash option (HIVE-6469)
  • 41. HIVE “MISSING DIRECTORIES” • Hadoop forensics: piece together disparate sources… o Hadoop daemon logs (NameNode) o Hive query and metastore logs o Hadoop config files • Need better tools to correlate the different layers of the system: hive client, hive metastore, MapReduce job, YARN, HDFS, operating system metrics, … By the way… Operating any distributed system would be totally insane without NTP and a standard time zone (UTC).
  • 42. CASE STUDY – ANALYZE QUERY • Customer provided Hive query + data sets (100GBs to ~5 TBs) • Needed help optimizing the query • Didn’t rewrite query immediately • Wanted to characterize query performance and isolate bottlenecks first
  • 43. ANALYZE AND TUNE EXECUTION • Ran original query on the datasets in our environment: o Two M/R Stages: Stage-1, Stage-2 • Long running reducers run out of memory o set mapreduce.reduce.memory.mb=5120 o Reduces slots and extends reduce time • Query fails to launch Stage-2 with out of memory o set HADOOP_HEAPSIZE=1024 on client machine • Query has 250,000 Mappers in Stage-2 which causes failure o set mapred.max.split.size=5368709120 to reduce Mappers
  • 44. ANALYSIS: HOW TO VISUALIZE? • Next challenge - how to visualize job execution? • Existing hadoop/hive logs not sufficient for this task • Wrote internal tools o parse job history files o plot mapper and reducer execution
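Job history (.jhist) files are newline-delimited JSON event records, which makes even crude extraction feasible. A heavily simplified sketch of the kind of parsing such a tool does; the field names follow the .jhist style but these two records are fabricated for illustration, and a real tool would use a proper JSON/Avro parser:

```shell
# Sketch: pull start/finish times out of sample .jhist-style events to get a
# per-task duration (the input records below are fabricated for illustration).
hist=$(mktemp)
cat > "$hist" <<'EOF'
{"type":"MAP_ATTEMPT_STARTED","event":{"taskid":"task_1_m_0","startTime":1000}}
{"type":"MAP_ATTEMPT_FINISHED","event":{"taskid":"task_1_m_0","finishTime":5000}}
EOF

start=$(grep -o '"startTime":[0-9]*' "$hist" | cut -d: -f2)
finish=$(grep -o '"finishTime":[0-9]*' "$hist" | cut -d: -f2)
echo "task_1_m_0 ran $((finish - start)) ms"

rm -f "$hist"
```

Durations extracted this way per task are exactly what feeds a mapper/reducer timeline plot.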
  • 49. ANALYZE EXECUTION: FINDINGS • Lone, long running reducer in first stage of query • Analyzed input data: o Query split input data by userId o Bucketizing input data by userId o One very large bucket: “invalid” userId o Discussed “invalid” userid with customer • An error value is a common pattern! o Need to differentiate between “Don’t know and don’t care” or “don’t know and do care.”
  • 50. CONCLUSIONS • Hive + Hadoop debugging can get very complex o Sifting through many logs and screens o Automatic transmission versus manual transmission • Static partitioning induced by Java Virtual Machine has benefits but also induces challenges. • Where there are difficulties, there’s opportunity o Better tooling o Better instrumentation o Better integration of disparate logs and metrics • Hadoop as a Service: aggregate and share expertise • Need to learn from the traditional database community!