3. EMC Corporation All rights reserved
INTRODUCTION: A SURVEY
• How many developers?
• How many BI/SQL developers?
• How many business analysts/sales?
• How many have used Hadoop?
• How many have used SQL on Hadoop?
WHAT IS HADOOP
• Hadoop is an open-source framework for large-scale data storage and processing.
ABOUT THE HOSTS
• Application Workgroup in EMC
– Focused on
• Big data development/infrastructure
• Application modernization
• DevOps
ABOUT THE HOSTS: APPLICATION WORKGROUP IN EMC
• Fahim Kundi
– 10+ years of experience in EDW and big data
• Haden Pareira
– Data engineer with 5+ years of Hadoop experience
• Muhammad Ali
– Data engineer with 2+ years of Hadoop experience
WHAT IS HADOOP
• HDFS is a file system – it’s all files
• MapReduce requires strong programming skills
• It’s so difficult
SQL ON HADOOP
• SQL is well known in the analytics community
• Faster and easier data insights
• Allows SQL/BI developers to retain their expertise and create value out of big data
SQL ON HADOOP
• Cloudera – Impala
• Hortonworks – Hive/Tez
• Pivotal – HAWQ … now HDB
• MapR – Drill
• IBM – Big SQL
CONTENTS
• Hive Introduction
• How Hive Works
• Apache Tez
• Hive with Tez vs. MapReduce
• ORC and Parquet Format
• HAWQ Introduction
• Query Optimizer
• PXF
HIVE INTRODUCTION (1)
• Apache Hive provides a high-level query language and data warehouse features on top of Hadoop.
• It was initially developed by Facebook and open-sourced in 2008.
• SQL-like query language called HiveQL (HQL).
• Partitioning and bucketing for faster query processing.
• Integration with visualization tools such as Tableau.
HIVE INTRODUCTION (2)
• Hive supports all the common primitive data types, such as INT, BINARY, BOOLEAN, CHAR, DECIMAL, FLOAT, STRING and TIMESTAMP.
• In addition, analysts can combine primitive data types to form complex data types, such as structs, maps and arrays.
HOW HIVE WORKS (1)
• The tables in Hive are similar to tables in a relational database.
• Databases comprise tables, which are made up of partitions.
• Data can be accessed via a simple query language, and Hive supports overwriting or appending data.
• Internally, Hive queries are converted to MapReduce or Tez jobs.
HOW HIVE WORKS (2)
• Within a particular database, data in the tables is serialized and each table has a corresponding Hadoop Distributed File System (HDFS) directory.
• Each table can be subdivided into partitions that determine how data is distributed within subdirectories of the table directory.
• Data within partitions can be further broken down into buckets.
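The table, partition and bucket layout described above can be sketched in a few lines of Python. This is a minimal illustration, not Hive's implementation: the `/warehouse` root, the `bucket_NNNNN` file name and the CRC32 hash are assumptions made for the example (Hive uses its own hash function and file naming).

```python
import zlib


def bucket_for(value: str, num_buckets: int) -> int:
    """Pick a bucket by hashing the bucketing column (illustrative hash, not Hive's)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def storage_path(warehouse: str, database: str, table: str,
                 partition: dict, bucket_value: str, num_buckets: int) -> str:
    """Build: table directory / one subdirectory per partition column / bucket file."""
    # Each partition column becomes a "column=value" subdirectory.
    parts = "/".join(f"{col}={val}" for col, val in partition.items())
    bucket = bucket_for(bucket_value, num_buckets)
    # The bucket file name below is hypothetical.
    return f"{warehouse}/{database}.db/{table}/{parts}/bucket_{bucket:05d}"


# Hypothetical row: partitioned by year and month, bucketed by customer id.
path = storage_path("/warehouse", "sales", "orders",
                    {"year": "2016", "month": "05"}, "customer-42", 4)
print(path)
```

A query that filters on `year` and `month` only needs to read the matching subdirectory, which is why partitioning speeds up query processing.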
APACHE TEZ (1)
• Apache Tez is a distributed execution framework targeted at data-processing applications on Hadoop.
• Tez was developed by Hortonworks and is built on top of YARN (the resource management framework for Hadoop).
• Tez generalizes MapReduce into a more powerful framework by expressing each user job as a dataflow graph. (Example)
APACHE TEZ (2)
• The Tez API has the following components –
– DAG (Directed Acyclic Graph) – defines the overall job.
One DAG object corresponds to one job
– Vertex – defines the user logic along with the resources
and the environment needed to execute the user logic.
One Vertex corresponds to one step in the job
– Edge – defines the connection between producer and
consumer vertices.
• Tez is not meant directly for end users; rather, it enables developers to build end-user applications with much better performance and flexibility.
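The DAG, vertex and edge concepts can be illustrated with Python's standard `graphlib` module. This is only a sketch of the idea, not the Tez API (which is Java); the vertex names and the job itself are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job: two scans feed a join, which feeds an aggregation.
# Each key is a vertex (one step in the job). Each entry in its set is an
# edge from a producer vertex to this consumer vertex.
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}


def execution_order(graph):
    """Return one valid order in which the vertices could be executed."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(order)
```

Unlike a chain of separate MapReduce jobs, the whole graph is one job, so intermediate steps do not each need their own map and reduce phase.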
ORC FILE
• ORC (Optimized Row Columnar) is a columnar file format designed for Hadoop workloads.
• ORC files were developed to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The format is optimized for large streaming reads.
• ORC features:
– Columnar format for complex data types
– Built into Hive from 0.11
– Support for Pig and MapReduce via HCatalog
– Two levels of compression
• Lightweight, type-specific
• General
– Built-in indexes
PARQUET
• Apache Parquet is a columnar storage format available
to any project in the Hadoop ecosystem, regardless of
the choice of data processing framework, data model
or programming language.
• Parquet features:
– Columnar file format
– Supports nested data structures
– Accessible by Hive, Spark, Pig, Drill, MR
– R/W in HDFS or local file system
ORC VS PARQUET
• Major considerations for choosing ORC over Parquet in Hive:
– Many of the performance improvements from the Stinger initiative depend on features of the ORC format, including a block-level index for each column. This can make I/O more efficient, because Hive can skip entire blocks of data when the index shows that the predicate values are not present in them.
– The cost-based optimizer can use the column-level metadata in ORC files to generate the most efficient execution graph.
– In Hive, ACID transactions are only possible when using ORC as the file format.
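The block-skipping idea behind the per-column index can be sketched as follows. This is a simplified illustration assuming a min/max entry per block of one integer column; the function names are hypothetical and the real ORC index layout is more involved.

```python
def build_index(blocks):
    """Record the (min, max) of each block once, as a writer would at write time."""
    return [(min(block), max(block)) for block in blocks]


def scan_equals(blocks, index, target):
    """Return values equal to target, plus how many blocks were actually read."""
    matches, blocks_read = [], 0
    for block, (lo, hi) in zip(blocks, index):
        if target < lo or target > hi:
            continue  # the index proves the value cannot be in this block
        blocks_read += 1
        matches.extend(v for v in block if v == target)
    return matches, blocks_read


# Hypothetical column split into three blocks.
blocks = [[1, 5, 9], [10, 14, 19], [20, 25, 29]]
index = build_index(blocks)
print(scan_equals(blocks, index, 14))
```

Skipping works best when the data is sorted or clustered on the filtered column, because the min/max ranges of the blocks then overlap less.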
HAWQ INTRODUCTION
• HAWQ is an MPP (massively parallel processing) SQL query engine that uses HDFS as its storage layer.
• HAWQ's query processing evolved from the Greenplum Database query planner and does not rely on MapReduce under the hood.
• HAWQ reads data from and writes data to HDFS natively.
• It also has extensions (PXF) that allow it to interact with data held in other services and formats (HBase, Hive, Avro, etc.) that also reside in HDFS.
HAWQ FEATURES
• HAWQ provides all major features found in Greenplum
database
– SQL Completeness: 2003 Extensions
– JDBC Compliant
– Robust Query Optimizer
– Row or Column-Oriented Table Storage
– Parallel Loading and Unloading
– Distributions
– Multi-level Partitioning
– High speed data redistribution
– Views
– External Tables
– Compression
– Resource Management
– Security
– Authentication
– Management and Monitoring
HAWQ ARCHITECTURE
[Architecture diagram. Components shown: HAWQ Master (Parser, Query Optimizer, PXF, Local Storage); HAWQ Standby Master; Interconnect; Segment Hosts, each with Local Temp Storage, Segments (Query Executor, PXF) and an HDFS DataNode; HDFS NameNode and Secondary NameNode.]
HAWQ PARALLEL QUERY OPTIMIZER
[Figure: example parallel query plan. Operators shown: Gather Motion, Sort, HashAggregate, HashJoin, Hash, Redistribute Motion, Broadcast Motion, and Seq Scans on lineitem, orders, customer and nation.]
• Turns a SQL query into an execution plan
• Cost-based optimizer
PIVOTAL EXTENSION FRAMEWORK (PXF)
• PXF is a fast, extensible framework that connects HAWQ to an HDFS data store of choice and exposes a parallel API
– An advanced version of external tables
– Enables combining HAWQ data and Hadoop data in a single query
– Supports connectors for HDFS, HBase and Hive
– Provides an extensible framework API to enable custom connector development for any data source
[Diagram: Xtension Framework connecting to HDFS, HBase, Hive]
IMPALA: OVERVIEW
• Interactive query on top of Hadoop
• ANSI-92 SQL standard
• Native MPP query engine
• Written in C++
IMPALA: OVERVIEW
• Native to Hadoop
– Blends with the ecosystem
– Security
– Hive MetaStore / HCatalog
– Query existing HDFS data
• Not as fault-tolerant as MapReduce
– (or Hive or SparkSQL or …)
– If a single node fails during a query, the whole query fails
– But if it’s 20x faster, you can rerun and still finish faster ;)
IMPALA ARCHITECTURE
Image courtesy of Cloudera
IMPALA: WHERE IT SHINES
• Query execution times (small to medium size)
• Parquet format
– Compression
• High concurrency – kills the competitors
• Partitioning
• Query optimizer (Compute Statistics!)
WHAT THE HELL IS KUDU!
• Distributed columnar storage manager
• Performance of Parquet
– Great for analytical queries
• Mutability of HBase
– Supports UPDATE/DELETE, unlike Parquet
• One common storage to rule them all!
– (not exactly!)
KUDU USE CASES
• IoT use cases
– High-velocity data
– Same data read for analytical queries in near real time
• Predictive modeling
– Large datasets updated frequently
– Retraining models
• Time-series applications
– Kudu offers compound keys and hash-based partitioning
– Avoids hot spotting
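The hot-spotting point for time-series data can be shown with a small simulation. This is a sketch under stated assumptions: the partition count, the hourly range width, the device names and the CRC32 hash are all hypothetical and are not Kudu's actual partitioning code.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical partition count


def range_partition(timestamp: int, width: int = 3600) -> int:
    """Range partitioning on time only: one partition per hour-long window."""
    return timestamp // width


def hash_partition(device_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash partitioning on the device part of a compound key (device_id, timestamp)."""
    return zlib.crc32(device_id.encode("utf-8")) % num_partitions


# 100 hypothetical devices all writing within the same second.
now = 1_700_000_000
writes = [(f"device-{i}", now) for i in range(100)]

# How many writes land on each partition under the two schemes.
range_load = Counter(range_partition(ts) for _, ts in writes)
hash_load = Counter(hash_partition(dev) for dev, _ in writes)
print(len(range_load), len(hash_load))
```

With time-only range partitioning every current write lands on the newest partition, while hashing on part of the compound key spreads the same writes over several partitions.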
2 MIN INTRO TO SPARK
• General Purpose Distributed Computing System
– Multiple language support (Java, Scala, Python, and R)
– Fault tolerant, data distribution, in-memory caching etc.
• RDD
– Resilient distributed datasets
• Operations
– Transformations (define new RDDs)
– Actions (return value)
• No nonsense
– Up to 100x faster than MapReduce for in-memory workloads
– Disk is used only when it can’t be avoided
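The split between transformations and actions can be sketched in plain Python. The `LazyDataset` class below is a hypothetical stand-in written for this example, not the Spark RDD API: transformations only record work, and an action runs it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions execute them."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: returns a new dataset description, computes nothing.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: runs every recorded step over the source and returns values.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out

    def count(self):
        # Action: returns a single value.
        return len(self.collect())


doubled_big = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
print(doubled_big.collect())
```

Because nothing runs until an action is called, the engine sees the whole chain of transformations before executing it and can plan the work as a unit.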
2 MIN INTRO TO SPARK
Image Courtesy: Sachin Parmar
http://www.slideshare.net/sachinparmarss/deep-dive-spark-data-frames-sql-and-catalyst-optimizer?
SPARKSQL
• Structured Data Processing
– Commonly known to us as tables
• Integrated into Spark programming model
• Unified Data Access
• Scalability
• Support for HiveQL
• Cache it!
SPARKSQL
• Two APIs
– DataFrames
• Data organized into named columns
• Similar to Tables
• Can be constructed from structured data files, Hive, external DBs
– Datasets
• Experimental interface
• Strongly typed, and runs on the SQL execution engine
• Can be constructed from regular JVM objects
Hadoop has traditionally been a batch-processing platform for large amounts of data. However, many use cases need near-real-time query performance. There are also several workloads, such as machine learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases.
Compared with the RCFile format, for example, the ORC file format has many advantages, such as:
• a single file as the output of each task, which reduces the NameNode's load
• Hive type support including datetime, decimal, and the complex types (struct, list, map, and union)
• light-weight indexes stored within the file that skip row groups that don't pass predicate filtering
• block-mode compression based on data type
• run-length encoding for integer columns and dictionary encoding for string columns
• concurrent reads of the same file using separate RecordReaders
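The two type-specific encodings named in the list can be sketched in a few lines. These are textbook versions for illustration only; ORC's actual encoders are more elaborate, and the function names here are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encoding: store each run of equal values as (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def dict_encode(strings):
    """Dictionary encoding: store each distinct string once, then small integer codes."""
    dictionary, codes, positions = [], [], {}
    for s in strings:
        if s not in positions:
            positions[s] = len(dictionary)
            dictionary.append(s)
        codes.append(positions[s])
    return dictionary, codes


print(rle_encode([7, 7, 7, 3, 3, 9]))
print(dict_encode(["US", "DE", "US", "US"]))
```

Both encodings pay off in a columnar file because values of the same column sit next to each other, so long runs and repeated strings are common.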
Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query.
Advantages of columnar storage:
• Limits I/O by loading only the columns that are needed.
• Saves space, because a columnar layout compresses better.
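The I/O saving can be seen by pivoting a few rows into columns and aggregating a single column. This is a minimal in-memory sketch with made-up data; the column names and helper functions are assumptions for the example.

```python
# Hypothetical table with three columns.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 20.0},
    {"id": 3, "country": "US", "amount": 5.5},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_column(columns, name):
    """Aggregate one column while touching only that column's values."""
    return sum(columns[name])


columns = to_columns(rows)
values_in_table = sum(len(col) for col in columns.values())
values_read = len(columns["amount"])
print(sum_column(columns, "amount"), values_read, values_in_table)
```

A row-oriented reader would have to pass over all nine values to compute the same sum, while the columnar reader touches only the three in `amount`.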
The query optimizer:
• Converts SQL into a physical execution plan
• Cost-based optimization looks for the most efficient plan
• Physical plan contains scans, joins, sorts, aggregations, etc.
• Global planning avoids sub-optimal ‘SQL pushing’ to segments
• Directly inserts motion nodes for inter-segment communication
• Directly inserts motion nodes for efficient non-local join processing
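How a cost-based planner might choose between motion nodes for a non-local join can be sketched with a toy cost model. Everything below is an assumption made for illustration: the formulas, the row counts and the function names are hypothetical and do not describe HAWQ's actual optimizer.

```python
def motion_costs(small_rows: int, large_rows: int, num_segments: int) -> dict:
    """Toy estimate of rows moved over the interconnect for each strategy."""
    return {
        # Broadcast: every segment receives a full copy of the smaller table.
        "broadcast": small_rows * num_segments,
        # Redistribute: both tables are re-hashed on the join key, each row moves once.
        "redistribute": small_rows + large_rows,
    }


def choose_motion(small_rows: int, large_rows: int, num_segments: int) -> str:
    """Pick the strategy with the lowest estimated cost."""
    costs = motion_costs(small_rows, large_rows, num_segments)
    return min(costs, key=costs.get)


# A tiny dimension table joined to a large table: broadcasting is cheaper.
print(choose_motion(25, 1_000_000, 16))
# Two large tables: redistributing is cheaper.
print(choose_motion(500_000, 1_000_000, 16))
```

The point of the sketch is only that the choice follows from estimated sizes, which is why up-to-date table statistics matter to a cost-based optimizer.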