2. April 2-6, 2017 in Las Vegas, NV USA #C17LV
Whoami
• Database Kernel developer
-> ORACLE DBA
-> Database Hadoop/Cloud developer
• Worked with ORACLE for the last 15 years
• OCM, ORACLE Ace alumni, Amazon alumni
• Last year: OLTP -> Hadoop
3.
Shameless plug about my company
[Diagram labels: Gluent; Sources: Oracle, Teradata, MSSQL, NoSQL, Big Data; Apps: App X, App Y, App Z]
4.
Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. ORACLE (hopefully, with a demo)
• Best ways to start
5.
What is Hadoop:
• For “Big data”
• Can deal with “Unstructured” data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs, NoSql
6.
Yes, but what does it all mean?
7.
Imagine that you are Google
in the early 2000s
8.
Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
9.
Let’s build a Data Warehouse
10.
(traditional) Data warehouse
• Has been around for years
• Mature and (relatively) advanced
• SQL!!!
11.
Data Warehouse scorecard
Requirements RDBMS
(reasonably) Fast
(reasonably) Cheap
(reasonably) Easy to use
Able to process data ¯\_(ツ)_/¯
12.
Scaling up “Big data” ain’t cheap
• Can’t fit all of the data on a single box
• Cost is quickly getting out of hand
13.
(cheap) Commodity systems
make “big data” feasible
14.
Solution = commodity systems
=
$$$$$ $$
15.
Commodity systems scorecard
Requirements Commodity
(reasonably) Fast
(reasonably) Cheap
(reasonably) Easy to use
Able to process data
16.
All your queries are Java Classes
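To make the point concrete, here is a minimal sketch of what "select prod_id, count(1) from sales group by prod_id" turns into when written as a MapReduce job. It is written in Python rather than Java for brevity, it runs in a single process, and the sample rows and function names are made up for illustration. It shows the shape of the job, not the Hadoop API.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a sales table: (prod_id, amount_sold)
SALES = [(48, 10.0), (13, 5.5), (48, 7.25), (30, 1.0), (48, 3.0), (13, 2.0)]


def map_phase(rows):
    """Mapper: emit one (key, 1) pair per input row."""
    for prod_id, _amount in rows:
        yield prod_id, 1


def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle_phase(map_phase(SALES)))
# Equivalent of "order by 2 desc limit 1"
top = max(counts.items(), key=lambda kv: kv[1])
print(top)
```

One line of SQL becomes three hand-written phases, which is why the "easy to use" row of the scorecard suffers.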
17.
Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper
18.
Hadoop
• 2006: Hadoop
19.
“Traditional Data Warehouse” vs. Hadoop
Requirements Hadoop Data Warehouse
(reasonably) Fast
(reasonably) Cheap
(reasonably) Easy to use
Able to process data ¯\_(ツ)_/¯
20.
SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop!
21.
Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop!
22.
And then, it exploded …
28.
Different: What gets optimized
• No “regular” indexes
• But many operations are distributed
[Diagram: three nodes, each holding one slice of both tables: SALES 1 / TIMES 1, SALES 2 / TIMES 2, SALES 3 / TIMES 3]
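A rough sketch of the idea: with no index to consult, each node fully scans its own slice and produces a partial result, and the partial results are merged at the end. The three slices, the column layout and the function names below are hypothetical, and the "nodes" are just list entries in one Python process.

```python
from collections import Counter

# Hypothetical slices of SALES, one per node: (time_id, amount_sold)
NODE_SLICES = [
    [("2011-01", 10), ("2011-02", 5)],  # SALES 1
    [("2011-01", 7), ("2011-03", 1)],   # SALES 2
    [("2011-02", 3), ("2011-01", 2)],   # SALES 3
]


def local_aggregate(rows):
    """Runs on each node: full scan of the local slice, partial sum per time_id."""
    partial = Counter()
    for time_id, amount in rows:
        partial[time_id] += amount
    return partial


def merge(partials):
    """Runs once: add the per-node partial sums together."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return dict(total)


totals = merge(local_aggregate(node_rows) for node_rows in NODE_SLICES)
print(totals)
```

The optimization target shifts from "find the rows quickly" to "spread the scan over many nodes and move as little data as possible between them".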
29.
Different: Native cloud filesystem support
sql> show partition sh.sales;
s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
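Because the partition key is part of the directory name, an engine can decide which locations to read before opening a single file, no matter which filesystem they live on. The sketch below rebuilds the listing above and prunes it for a range predicate on time_id. The helper names are made up, and real engines do this inside the planner rather than with string handling.

```python
# Partition locations as in the listing above: 2011-01..08 on S3, 2011-09..12 on HDFS
PARTITIONS = (
    [f"s3a://bucket1/sh/sales/time_id=2011-{m:02d}" for m in range(1, 9)]
    + [f"hdfs://clust1/sh/sales/time_id=2011-{m:02d}" for m in range(9, 13)]
)


def partition_value(location):
    """Extract the time_id value encoded in the directory name."""
    return location.rsplit("time_id=", 1)[1]


def prune(locations, low, high):
    """Keep only partitions whose time_id falls within [low, high]."""
    return [loc for loc in locations if low <= partition_value(loc) <= high]


# Roughly: WHERE time_id BETWEEN '2011-08' AND '2011-09'
to_scan = prune(PARTITIONS, "2011-08", "2011-09")
for location in to_scan:
    print(location)
```

In this example one query ends up reading one partition from S3 and one from HDFS.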
30.
Database engine does NOT “own” data
31.
Different: Different engines can work with the same data files (even at the same time)
example01.dbf
sysaux01.dbf
system01.dbf
temp01.dbf
undotbs01.dbf
users01.dbf
a01_data.parq
a02_data.parq
a03_data.parq
a04_data.parq
a05_data.parq
a06_data.parq
32.
Different: … or copies of the data files
hdfs://adhoc/a.parq
hdfs://adhoc/b.parq
hdfs://adhoc/c.parq
hdfs://adhoc/d.parq
hdfs://adhoc/e.parq
hdfs://adhoc/f.parq
hdfs://prod/a.parq
hdfs://prod/b.parq
hdfs://prod/c.parq
hdfs://prod/d.parq
hdfs://prod/e.parq
hdfs://prod/f.parq
s3://backup/a.parq
s3://backup/b.parq
s3://backup/c.parq
s3://backup/d.parq
s3://backup/e.parq
s3://backup/f.parq
33.
Different: Open data formats
• Not proprietary – many tools can read/write
• No additional $$ for “advanced features”:
• Columnar storage
• Storage indexes
• Compression
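As an illustration of the "storage indexes" bullet: columnar formats can keep min/max statistics per block of rows, so a reader can skip blocks that cannot contain the value it is looking for. The sketch below is a toy model with made-up row groups and field names. It is not the Parquet or ORC file layout.

```python
# Hypothetical row groups of a prod_id column, each with min/max statistics
ROW_GROUPS = [
    {"min": 1, "max": 40, "values": [1, 17, 40]},
    {"min": 41, "max": 90, "values": [41, 48, 90]},
    {"min": 91, "max": 148, "values": [91, 120, 148]},
]


def scan_equal(row_groups, target):
    """Return matching values and the number of row groups actually read."""
    hits = []
    scanned = 0
    for group in row_groups:
        # Skip the whole group when the statistics rule the target out
        if not (group["min"] <= target <= group["max"]):
            continue
        scanned += 1
        hits.extend(v for v in group["values"] if v == target)
    return hits, scanned


hits, scanned = scan_equal(ROW_GROUPS, 48)
print(hits, scanned)
```

Only one of the three row groups is read for prod_id = 48. The other two are ruled out by their statistics alone.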
34.
Same: “sqlplus-like” clients
> impala-shell -i 10.0.0.1
[10.0.0.1:21000] > select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
+-----------------------+----------+
| prod_id | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026 |
+-----------------------+----------+
> beeline -u 'jdbc:hive2://10.0.0.1:10000'
0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
35.
Different: External dictionary
User data
Dictionary (SYS)
User data
Dictionary (SYS)
Hive Metastore
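A small sketch of what an external dictionary means in practice: the table definition lives outside the data files, and a reader looks it up and applies it to the raw bytes at read time. The dictionary entry, the sample file contents and the function name are all hypothetical. A real Hive Metastore is a separate service backed by a relational database.

```python
import csv
import io

# Hypothetical metastore entry: the data files themselves carry no table definition
METASTORE = {
    "sh.sales": {
        "location": "hdfs://clust1/sh/sales",  # descriptive only in this sketch
        "format": "TEXTFILE",
        "columns": [("prod_id", int), ("quantity", int)],
    }
}

# Hypothetical contents of one raw data file
RAW_FILE = "48,2\n13,1\n"


def read_rows(table, raw_text):
    """Look up the schema in the metastore and apply it while reading."""
    columns = METASTORE[table]["columns"]
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(columns, fields)}


rows = list(read_rows("sh.sales", RAW_FILE))
print(rows)
```

Any engine that can reach the same metastore entry can interpret the same files the same way, which is what lets several engines share one copy of the data.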
36.
Different: Append only, “ETL-like” DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read
UPDATE t SET a=12 WHERE b=1;
Before the update: Table T (base): a_data.orc
After the update: Table T (base): a_data.orc + Table T (delta): b_data.orc
Compactor runs …
After compaction: Table T (base): c_data.orc
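The base/delta/compactor flow above can be sketched as follows. This is a toy model and not Hive's implementation: rows are keyed by a made-up row id, "files" are plain dictionaries, and the function names are hypothetical. The point is that UPDATE never rewrites the base file in place.

```python
def read_table(base, deltas):
    """Merge on read: start from the base rows, then apply deltas in order."""
    rows = {row_id: dict(row) for row_id, row in base.items()}
    for delta in deltas:
        for row_id, row in delta.items():
            rows[row_id] = dict(row)
    return rows


def update(base, deltas, predicate, changes):
    """UPDATE: leave the base untouched and append a new delta with the changed rows."""
    current = read_table(base, deltas)
    delta = {
        row_id: {**row, **changes}
        for row_id, row in current.items()
        if predicate(row)
    }
    return deltas + [delta]


def compact(base, deltas):
    """Compactor: fold all deltas into a new base and start with no deltas."""
    return read_table(base, deltas), []


base = {1: {"a": 1, "b": 1}, 2: {"a": 2, "b": 2}}                # a_data.orc
deltas = update(base, [], lambda row: row["b"] == 1, {"a": 12})  # b_data.orc
new_base, new_deltas = compact(base, deltas)                     # c_data.orc
print(new_base, new_deltas)
```

Until the compactor runs, every reader pays the cost of merging base and delta files, which is part of why the slide says "data is presumed static".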
38.
Apache Hive
• “Designed” for “batch” queries (*)
• Runs on top of standard Hadoop RM: YARN
• Supports multiple “engines”: MR, Tez, Spark
• SerDes
[Diagram: Master node (Hiveserver2, namenode, YARN RM); three slave nodes, each running a YARN NM and a datanode]
39.
Apache Impala
• Designed for “quick interactive” queries
• “Data-local” execution
• In-memory processing
[Diagram: Master node (statestored, catalogd, namenode); Slaves A, B and C, each running impalad and a datanode]
40.
Apache Spark
• “Better Hadoop” with “native”: SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many cluster managers: “native”, YARN, Mesos
• Flexible programming model
[Diagram: Master node (Driver); Slaves A, B and C, each running an Executor]
41.
Presto
• Designed for “interactive” queries
• In-memory processing
• Custom storage “plugins”: Hive, Kafka, MySQL, Postgres, …
[Diagram: Master node (coordinator); Slaves A, B and C, each running a worker]
47.
Hint: Nobody builds their own Linux anymore
48.
Choose a Hadoop distribution that suits you
49.
Hadoop distributions
• Pre-built and pre-integrated (aka: all things work out of the box)
• Each has its own “philosophy” …
• … As well as a preferred Hadoop database
50.
So what’s in it for me?
• It’s interesting (cool technology that hits many recent buzzwords)
• If you know ORACLE, it’s close to your skill set
• It’s promising and future-oriented
52.
Session ID: 557
Editor's notes
All your queries are essentially Java Classes
With Apache Hadoop, “everybody” can run “big data” queries
Columns and Rows
Schemas
Similar data types
Other familiar objects: Views (*), Functions (**)
Notably missing: Indexes