Hadoop databases for Oracle DBAs

Hadoop databases:
Hive, Impala, Spark, Presto
For ORACLE DBAs
Session ID: 557
Prepared by: Maxym Kharchenko, Gluent
@maxymkh

April 2-6, 2017 in Las Vegas, NV USA #C17LV
Whoami
• Database Kernel developer -> ORACLE DBA -> Database Hadoop/Cloud developer
• Worked with ORACLE for the last 15 years
• OCM, ORACLE Ace alumni, Amazon alumni
• Last year: OLTP -> Hadoop

Shameless plug about my company
[Diagram: Gluent, with App X, App Y, App Z and the data sources Oracle, Teradata, MSSQL, NoSQL, Big Data Sources]

Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. ORACLE (hopefully, demo)
• Best ways to start

What is Hadoop:
• For “Big data”
• Can deal with “Unstructured” data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs, NoSQL

Yes, but what does it all mean?

Imagine that you are Google
in the early 2000s

Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use

Let’s build a Data Warehouse

(traditional) Data warehouse
• Has been around for years
• Mature and (relatively) advanced
• SQL !!!

Data Warehouse scorecard
Requirements RDBMS
(reasonably) Fast   
(reasonably) Cheap   
(reasonably) Easy to use   
Able to process data ¯_(ツ)_/¯

Scaling up “Big data” ain’t cheap
• Can’t fit all of the data on a single box
• Cost quickly gets out of hand

(cheap) Commodity systems make “big data” feasible

Solution = commodity systems

Commodity systems scorecard
Requirements Commodity
(reasonably) Fast   
(reasonably) Cheap   
(reasonably) Easy to use   
Able to process data   

All your queries are Java Classes
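
To give a feel for what "queries as classes" means, here is a minimal single-process Python sketch of the map, shuffle and reduce phases for a count-per-product query. It is not Hadoop code (real jobs were Java classes run across a cluster); the function names and sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a sales data set: (prod_id, channel)
SALES = [
    (43, "Catalog"), (46, "Catalog"), (43, "Catalog"),
    (22, "Internet"), (46, "Catalog"), (43, "Internet"),
]

def map_phase(row):
    """Emit (key, 1) for every row that passes the filter (the WHERE clause)."""
    prod_id, channel = row
    if channel == "Catalog":
        yield prod_id, 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result (the COUNT)."""
    return key, sum(values)

def run_job(rows):
    """Chain the three phases and sort by count descending, then by prod_id."""
    pairs = (pair for row in rows for pair in map_phase(row))
    grouped = shuffle(pairs)
    results = [reduce_phase(k, v) for k, v in grouped.items()]
    return sorted(results, key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    print(run_job(SALES))
```

Even in this toy form, a one-line GROUP BY turns into three hand-written functions, which is the "easy to use" gap that SQL-on-Hadoop engines later closed.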

Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper

Hadoop
• 2006: Hadoop

“Traditional Data Warehouse” vs. Hadoop
Requirements Hadoop Data Warehouse
(reasonably) Fast      
(reasonably) Cheap      
(reasonably) Easy to use      
Able to process data    ¯_(ツ)_/¯

SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop !

Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop !

And then, it exploded …

“Hadoop” vs “Relational” databases
Demo … hopefully

This is not about NoSQL :-)

Same: Running SQL queries
sql> select prod_id, count(1)
from sh.sales s, sh.channels c
where c.channel_id = s.channel_id
and c.channel_desc='Catalog'
group by prod_id
order by 2 desc
limit 5;
+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s

Different: What gets optimized
• No “regular” indexes
• But many operations are distributed
[Diagram: SALES 1 / TIMES 1, SALES 2 / TIMES 2, SALES 3 / TIMES 3]
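
A minimal Python sketch of the "no indexes, but distributed" idea: each node scans its own slice of SALES in full and produces a partial count, and a coordinator merges the partials. Threads stand in for cluster nodes, and the slice contents and function names are assumptions for illustration only.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical slices of SALES, one per node (prod_id values only)
SALES_SLICES = [
    [43, 46, 43],   # SALES 1
    [22, 43],       # SALES 2
    [46, 46, 22],   # SALES 3
]

def scan_and_count(rows):
    """Runs on each node: full scan of the local slice, local partial count."""
    return Counter(rows)

def distributed_count(slices):
    """Coordinator: fan the scans out in parallel, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(scan_and_count, slices))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    print(distributed_count(SALES_SLICES).most_common())
```

No node looks anything up by key; the speed comes from every slice being scanned at the same time.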

Different: Native cloud filesystem support
sql> show partitions sh.sales;
s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
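
One way to picture a table whose partitions live on different filesystems: the table is little more than a mapping from partition key to a location and a file format. The Python sketch below uses the two partitions listed above; the `Partition` class and the `prune` and `filesystem` helpers are hypothetical names for illustration, not an Impala or Hive API.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass(frozen=True)
class Partition:
    time_id: str      # partition key value, e.g. "2011-01"
    location: str     # URI of the directory holding the data files
    file_format: str

# The two partitions shown in the output above
SALES_PARTITIONS = [
    Partition("2011-01", "s3a://bucket1/sh/sales/time_id=2011-01", "PARQUET"),
    Partition("2011-09", "hdfs://clust1/sh/sales/time_id=2011-09", "PARQUET"),
]

def prune(partitions, low, high):
    """Keep only partitions whose key falls in [low, high]; the rest are never read."""
    return [p for p in partitions if low <= p.time_id <= high]

def filesystem(partition):
    """Return the URI scheme, i.e. which storage system holds the partition."""
    return urlparse(partition.location).scheme

if __name__ == "__main__":
    for p in prune(SALES_PARTITIONS, "2011-06", "2011-12"):
        print(p.time_id, filesystem(p), p.file_format)
```

The query layer only needs the URI; whether the bytes sit in S3 or HDFS is a detail of the location string.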

Database engine does NOT “own” data

Different: Different engines can work with the same data files (even at the same time)
Oracle datafiles:
example01.dbf
sysaux01.dbf
system01.dbf
temp01.dbf
undotbs01.dbf
users01.dbf
Parquet files:
a01_data.parq
a02_data.parq
a03_data.parq
a04_data.parq
a05_data.parq
a06_data.parq

Different: … or copies of the data files
hdfs://adhoc/a.parq
hdfs://adhoc/b.parq
hdfs://adhoc/c.parq
hdfs://adhoc/d.parq
hdfs://adhoc/e.parq
hdfs://adhoc/f.parq
hdfs://prod/a.parq
hdfs://prod/b.parq
hdfs://prod/c.parq
hdfs://prod/d.parq
hdfs://prod/e.parq
hdfs://prod/f.parq
s3://backup/a.parq
s3://backup/b.parq
s3://backup/c.parq
s3://backup/d.parq
s3://backup/e.parq
s3://backup/f.parq

Different: Open data formats
• Not proprietary: many tools can read/write
• No additional $$ for “advanced features”:
  • Columnar storage
  • Storage indexes
  • Compression

Same: “sqlplus-like” clients
> impala-shell -i 10.0.0.1
[10.0.0.1:21000] > select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+
> beeline -u 'jdbc:hive2://10.0.0.1:10000'
0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;

Different: External dictionary
[Diagram: in each database, User data sits alongside its own Dictionary (SYS); in Hadoop, the dictionary is external: the Hive Metastore]

Different: Append only, “ETL-like” DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read
UPDATE t SET a=12 WHERE b=1;
Before: Table T (base): a_data.orc
After the UPDATE: Table T (base): a_data.orc + Table T (delta): b_data.orc
Compactor runs …
After compaction: Table T (base): c_data.orc
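
The base/delta/compaction flow above can be sketched in a few lines of Python. This is a simplified model, assuming rows keyed by a hypothetical `row_id` and in-memory dicts in place of ORC files; it is not how Hive's compactor is actually implemented.

```python
# Immutable "files": once written, a file's contents are never changed in place.
base = {  # stands in for a_data.orc; row_id -> row
    1: {"a": 5, "b": 1},
    2: {"a": 7, "b": 2},
}
deltas = []  # each UPDATE appends a new delta "file" (e.g. b_data.orc)

def read_table(base, deltas):
    """Merge on read: start from base, then apply deltas in the order written."""
    rows = {row_id: dict(row) for row_id, row in base.items()}
    for delta in deltas:
        for row_id, new_row in delta.items():
            rows[row_id] = dict(new_row)
    return rows

def update(base, deltas, set_a, where_b):
    """UPDATE t SET a=<set_a> WHERE b=<where_b>, written as a new delta."""
    current = read_table(base, deltas)
    delta = {
        row_id: {**row, "a": set_a}
        for row_id, row in current.items()
        if row["b"] == where_b
    }
    deltas.append(delta)

def compact(base, deltas):
    """Fold base + deltas into a fresh base (c_data.orc) and drop the deltas."""
    return read_table(base, deltas), []

if __name__ == "__main__":
    update(base, deltas, set_a=12, where_b=1)
    print(read_table(base, deltas))   # readers see the update via the delta
    base, deltas = compact(base, deltas)
    print(base, deltas)               # one new base, no deltas left
```

Note that `update` never touches the base: the original file stays as it was until the compactor replaces it, which is why the DML feels more like ETL than like an in-place row change.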

Apache Hive
• “Designed” for “batch” queries (*)
• Runs on top of standard Hadoop RM: YARN
• Supports multiple “engines”: MR, TEZ, Spark
• SerDes
[Diagram: Master runs Hiveserver2, namenode and YARN RM; each of the three slave nodes runs a YARN NM and a datanode]

Apache Impala
• Designed for “quick interactive” queries
• “Data-local” execution
• In-memory processing
[Diagram: Master runs statestored, catalogd and namenode; Slaves A, B and C each run impalad and a datanode]

Apache Spark
• “Better Hadoop” with “native”: SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many cluster managers: “native”, YARN, Mesos
• Flexible programming model
[Diagram: Master runs the Driver; Slaves A, B and C each run an Executor]

Presto
• Designed for “interactive” queries
• In-memory processing
• Custom storage “plugins”: Hive, Kafka, MySQL, Postgres, …
[Diagram: Master runs the coordinator; Slaves A, B and C each run a worker]

Step 1: Google “Hadoop ecosystem”

Step 2: Try to install the simplest thing

Step 3

Step 4

Hint: Nobody builds their own Linux anymore

Choose a Hadoop distribution that suits you

Hadoop distributions
• Pre-built and pre-integrated
(aka: all things work out of the box)
• Each has its own “philosophy” …
• … As well as preferred Hadoop database

So what’s in it for me?
• It’s interesting (cool technology that hits many recent buzzwords)
• If you know ORACLE, it’s close to your skill set
• It’s promising and future oriented

