Hadoop databases: Hive, Impala, Spark, Presto
For ORACLE DBAs
Session ID: 557
Prepared by: Maxym Kharchenko, Gluent
@maxymkh
April 2-6, 2017 in Las Vegas, NV USA #C17LV
Whoami
• Database Kernel developer -> ORACLE DBA -> Database Hadoop/Cloud developer
• Worked with ORACLE for the last 15 years
• OCM, ORACLE Ace alumni, Amazon alumni
• Last year: OLTP -> Hadoop
Shameless plug about my company
[Diagram: Gluent between data sources (Oracle, Teradata, MSSQL, NoSQL, Big Data Sources) and applications (App X, App Y, App Z)]
Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. ORACLE (hopefully, demo)
• Best ways to start
What is Hadoop:
• For “Big data”
• Can deal with “Unstructured” data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs, NoSql
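For readers who have never seen a MapReduce job, here is a minimal single-process sketch of the map, shuffle and reduce steps, written in Python rather than the Java a real Hadoop job would use. The input records and all function names are made up for illustration; nothing here calls a Hadoop API.

```python
from collections import defaultdict

def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(records):
    """Run map -> shuffle -> reduce in one process (a real cluster spreads this over many nodes)."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical input standing in for crawler data.
crawl_sample = ["hadoop stores data", "hadoop processes data"]
word_counts = run_job(crawl_sample)
print(word_counts)
```

Even a simple count needs three hand-written functions, which is the "easy to use" problem that SQL engines on Hadoop set out to address.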
Yes, but what does it all mean?
Imagine that you are Google in the early 2000s
Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
  • (reasonably) Fast
  • (reasonably) Cheap
  • (reasonably) Easy to use
Let’s build a Data Warehouse
(traditional) Data warehouse
• Been there for years
• Mature and (relatively) advanced
• SQL !!!
Data Warehouse scorecard
Requirements RDBMS
(reasonably) Fast   
(reasonably) Cheap   
(reasonably) Easy to use   
Able to process data ¯_(ツ)_/¯
Scaling up “Big data” ain’t cheap
• Can’t fit all of the data on a single box
• Cost is quickly getting out of hand
(cheap) Commodity systems make “big data” feasible
Solution = commodity systems
=
$$$$$ $$
Commodity systems scorecard
Requirements Commodity
(reasonably) Fast   
(reasonably) Cheap   
(reasonably) Easy to use   
Able to process data   
All your queries are Java Classes
Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper
Hadoop
• 2006: Hadoop
”Traditional Data Warehouse” vs. Hadoop
Requirements Hadoop Data Warehouse
(reasonably) Fast      
(reasonably) Cheap      
(reasonably) Easy to use      
Able to process data    ¯_(ツ)_/¯
SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop!
Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop!
And then, it exploded …
“Hadoop” vs “Relational” databases
Demo … hopefully
This is not about NoSQL :-)
Same: Tables
sql> describe sh.products;
+-----------------------+----------------+---------+
| name                  | type           | comment |
+-----------------------+----------------+---------+
| prod_id               | bigint         |         |
| prod_name             | string         |         |
| prod_desc             | string         |         |
| prod_category_id      | bigint         |         |
| prod_category_desc    | string         |         |
| supplier_id           | bigint         |         |
| prod_total_id         | decimal(38,18) |         |
| prod_src_id           | decimal(38,18) |         |
| prod_eff_from         | timestamp      |         |
| prod_eff_to           | timestamp      |         |
| prod_valid            | string         |         |
+-----------------------+----------------+---------+
Same: Running SQL queries
sql> select prod_id, count(1)
from sh.sales s, sh.channels c
where c.channel_id = s.channel_id
and c.channel_desc='Catalog'
group by prod_id
order by 2 desc
limit 5;
+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s
Same: Queries are optimized
sql> explain select count(1) from sh.times;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
|                                                          |
| 03:AGGREGATE [FINALIZE]                                  |
| |  output: count:merge(1)                                |
| |                                                        |
| 02:EXCHANGE [UNPARTITIONED]                              |
| |                                                        |
| 01:AGGREGATE                                             |
| |  output: count(1)                                      |
| |                                                        |
| 00:SCAN HDFS [sh.times]                                  |
|    partitions=16/16 files=32 size=500.45KB               |
+----------------------------------------------------------+
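The plan reads bottom-up: each host scans and counts its own data, the partial counts are exchanged to one node, and a final aggregate merges them. A rough Python sketch of that two-phase count follows; the per-host row lists are hypothetical and the function names only mirror the plan node labels.

```python
def partial_count(rows):
    """01:AGGREGATE - each host counts only the rows it scanned locally."""
    return sum(1 for _ in rows)

def exchange(partials):
    """02:EXCHANGE [UNPARTITIONED] - gather every partial result on a single node."""
    return list(partials)

def finalize(partials):
    """03:AGGREGATE [FINALIZE] - merge the partial counts into the final answer."""
    return sum(partials)

# Hypothetical split of sh.times rows across three hosts (00:SCAN HDFS).
rows_per_host = [
    ["2011-01-01", "2011-01-02"],
    ["2011-01-03"],
    ["2011-01-04", "2011-01-05", "2011-01-06"],
]

total = finalize(exchange(partial_count(rows) for rows in rows_per_host))
print(total)
```

Only one small number per host crosses the network, which is why count:merge appears as a separate step above the exchange.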
Different: What gets optimized
• No “regular” indexes
• But many operations are distributed
[Diagram: three nodes, each holding one slice of SALES and TIMES (SALES 1/TIMES 1, SALES 2/TIMES 2, SALES 3/TIMES 3)]
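One way to picture the trade-off: without an index every lookup is a full scan, but each node scans only its own slice and the slices are scanned in parallel. The sketch below uses a thread pool as a stand-in for cluster nodes; the slices, their contents and the function names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical slices of SALES held by three nodes: (prod_id, channel) pairs.
sales_slices = [
    [(43, "Catalog"), (22, "Internet")],
    [(46, "Catalog"), (43, "Catalog")],
    [(32, "Partners")],
]

def scan_slice(rows, channel):
    """Full scan of one node's slice; no index lookup is involved."""
    return [prod_id for prod_id, ch in rows if ch == channel]

def distributed_scan(slices, channel):
    """Scan every slice concurrently, then concatenate the partial results."""
    # One worker per slice stands in for one node per slice.
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = pool.map(lambda rows: scan_slice(rows, channel), slices)
        return [prod_id for partial in partials for prod_id in partial]

catalog_products = distributed_scan(sales_slices, "Catalog")
print(catalog_products)
```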
Different: Native cloud filesystem support
sql> show partitions sh.sales;
s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
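Because each partition is just a directory URI, a few lines of Python are enough to see which filesystem holds which months. This sketch uses a few of the locations from the listing above and only the standard library; the function name is illustrative.

```python
from collections import defaultdict
from urllib.parse import urlparse

# A few partition locations copied from the listing above.
locations = [
    "s3a://bucket1/sh/sales/time_id=2011-07",
    "s3a://bucket1/sh/sales/time_id=2011-08",
    "hdfs://clust1/sh/sales/time_id=2011-09",
    "hdfs://clust1/sh/sales/time_id=2011-10",
]

def partitions_by_filesystem(uris):
    """Group partition values (e.g. '2011-07') by the filesystem that stores them."""
    grouped = defaultdict(list)
    for uri in uris:
        parsed = urlparse(uri)
        # The last path component looks like "time_id=2011-07".
        partition_value = parsed.path.rsplit("/", 1)[-1].split("=", 1)[1]
        grouped[f"{parsed.scheme}://{parsed.netloc}"].append(partition_value)
    return dict(grouped)

by_fs = partitions_by_filesystem(locations)
print(by_fs)
```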
Database engine does NOT ”own” data
Different: Different engines can work with the same data files (even at the same time)
ORACLE data files: example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf
Parquet data files: a01_data.parq, a02_data.parq, a03_data.parq, a04_data.parq, a05_data.parq, a06_data.parq
Different: … or copies of the data files
hdfs://adhoc/a.parq
hdfs://adhoc/b.parq
hdfs://adhoc/c.parq
hdfs://adhoc/d.parq
hdfs://adhoc/e.parq
hdfs://adhoc/f.parq
hdfs://prod/a.parq
hdfs://prod/b.parq
hdfs://prod/c.parq
hdfs://prod/d.parq
hdfs://prod/e.parq
hdfs://prod/f.parq
s3://backup/a.parq
s3://backup/b.parq
s3://backup/c.parq
s3://backup/d.parq
s3://backup/e.parq
s3://backup/f.parq
Different: Open data formats
• Not proprietary: many tools can read/write
• No additional $$ for “advanced features”:
  • Columnar storage
  • Storage indexes
  • Compression
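To give a feel for what a "storage index" buys you, here is a sketch of the general idea: keep min/max statistics per chunk of a column and skip any chunk whose range cannot contain the value you want. The row groups and numbers are invented, and this is not the Parquet or ORC API, only the principle.

```python
# Each hypothetical row group keeps min/max statistics for the prod_id column.
row_groups = [
    {"min": 1, "max": 40, "rows": [3, 17, 40]},
    {"min": 41, "max": 90, "rows": [43, 46, 88]},
    {"min": 91, "max": 150, "rows": [123, 99]},
]

def scan_equal(groups, wanted):
    """Return matching values plus the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in groups:
        if wanted < group["min"] or wanted > group["max"]:
            continue  # statistics prove no row here can match, so skip the I/O
        groups_read += 1
        matches.extend(value for value in group["rows"] if value == wanted)
    return matches, groups_read

matches, groups_read = scan_equal(row_groups, 43)
print(matches, groups_read)
```

Here two of the three row groups are never read.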
Same: “sqlplus-like” clients
> impala-shell -i 10.0.0.1
[10.0.0.1:21000] > select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+
> beeline -u 'jdbc:hive2://10.0.0.1:10000'
0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
Different: External dictionary
[Diagram: on one side, User data together with Dictionary (SYS); on the other, User data with the Dictionary (SYS) role taken by the Hive Metastore]
Different: Append only, “ETL-like” DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read
UPDATE t SET a=12 WHERE b=1;
Before: Table T (base): a_data.orc
After UPDATE: Table T (base): a_data.orc + Table T (delta): b_data.orc
Compactor runs …
After compaction: Table T (base): c_data.orc
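The base/delta/compactor sequence can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (rows keyed by a made-up row id, deltas holding only changed columns), not how Hive lays out ORC files; it only shows why the base file is never modified in place.

```python
def read_table(base, deltas):
    """Merge-on-read: start from the base rows, then apply each delta in order."""
    rows = {row_id: dict(row) for row_id, row in base.items()}
    for delta in deltas:
        for row_id, changes in delta.items():
            rows[row_id].update(changes)
    return rows

def update_where_b_equals_1(base, deltas):
    """UPDATE t SET a=12 WHERE b=1, expressed as a new delta; base is untouched."""
    current = read_table(base, deltas)
    return {row_id: {"a": 12} for row_id, row in current.items() if row["b"] == 1}

def compact(base, deltas):
    """Compactor: fold all deltas into a new base so that no deltas remain."""
    return read_table(base, deltas), []

base = {1: {"a": 5, "b": 1}, 2: {"a": 7, "b": 2}}   # stands in for a_data.orc
deltas = [update_where_b_equals_1(base, [])]        # stands in for b_data.orc
visible = read_table(base, deltas)                  # what a query sees
new_base, remaining = compact(base, deltas)         # stands in for c_data.orc
print(visible, remaining)
```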
Databases
Apache Hive
• “Designed” for “batch” queries (*)
• Runs on top of standard Hadoop RM: YARN
• Supports multiple “engines”: MR, TEZ, Spark
• SerDes
[Diagram: Master (Hiveserver2, namenode, YARN RM) and three slave nodes, each running YARN NM and datanode]
Apache Impala
• Designed for “quick interactive” queries
• “Data-local” execution
• In-memory processing
[Diagram: Master (statestored, catalogd, namenode) and Slaves A, B, C, each running impalad and datanode]
Apache Spark
• “Better Hadoop” with “native”: SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many clusters: “native”, YARN, Mesos
• Flexible programming model
[Diagram: Master (Driver) and Slaves A, B, C, each running an Executor]
Presto
• Designed for “interactive” queries
• In-memory processing
• Custom storage “plugins”: Hive, Kafka, MySQL, Postgres,…
[Diagram: Master (coordinator) and Slaves A, B, C, each running a worker]
How to start
Step 1: Google “Hadoop ecosystem”
Step 2: Try to install the simplest thing
Step 3
Step 4
Hint: Nobody builds their own Linux anymore
Choose a Hadoop distribution that suits you
Hadoop distributions
• Pre-built and pre-integrated
(aka: all things work out of the box)
• Each has their own “philosophy” …
• … As well as preferred Hadoop database
So what’s in it for me?
• It’s interesting (cool technology that hits many recent buzzwords)
• If you know ORACLE, it’s close to your skill set
• It’s promising and future oriented
Q&A
Please Complete Your
Session Evaluation
Evaluate this session in your COLLABORATE app.
Pull up this session and tap "Session Evaluation"
to complete the survey.
Session ID: 557

 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 

Dernier (20)

Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATIONCapstone in Interprofessional Informatic  // IMPACT OF COVID 19 ON EDUCATION
Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 

Hadoop databases for Oracle DBAs

  • 1. Hadoop databases: Hive, Impala, Spark, Presto. For ORACLE DBAs. Session ID: 557. Prepared by: Maxym Kharchenko, Gluent, @maxymkh
  • 2. April 2-6, 2017 in Las Vegas, NV USA #C17LV Whoami • Database Kernel developer -> ORACLE DBA -> Database Hadoop/Cloud developer • Worked with ORACLE for the last 15 years • OCM, ORACLE Ace alumni, Amazon alumni • Last year: OLTP -> Hadoop
  • 3. Shameless plug about my company: Gluent. Diagram labels: Oracle, Teradata, NoSQL, Big Data Sources, MSSQL, App X, App Y, App Z
  • 4. Agenda • What are Hadoop databases? • Hive/Impala/Spark vs. ORACLE (hopefully, demo) • Best ways to start
  • 5. What is Hadoop: • For “Big data” • Can deal with “Unstructured” data • Distributed • Consists of: HDFS + MapReduce • Requires you to write MapReduce jobs, NoSQL
  • 6. Yes, but what does it all mean?
  • 7. Imagine that you are Google in the early 2000s
  • 8. Target Ads • You need to query web crawler data • Which is unbelievably huge • These queries need to be: • (reasonably) Fast • (reasonably) Cheap • (reasonably) Easy to use
  • 9. Let’s build a Data Warehouse
  • 10. (traditional) Data warehouse • Been there for years • Mature and (relatively) advanced • SQL !!!
  • 11. Data Warehouse scorecard (Requirements vs. RDBMS): (reasonably) Fast; (reasonably) Cheap; (reasonably) Easy to use; Able to process data: ¯_(ツ)_/¯
  • 12. Scaling up “Big data” ain’t cheap • Can’t fit all of the data on a single box • Cost is quickly getting out of hand
  • 13. (cheap) Commodity systems make “big data” feasible
  • 14. Solution = commodity systems = $$$$$ $$
  • 15. Commodity systems scorecard (Requirements vs. Commodity): (reasonably) Fast; (reasonably) Cheap; (reasonably) Easy to use; Able to process data
  • 16. All your queries are Java Classes
  • 17. Google • 2003: Google File System (GFS) paper • 2004: Google MapReduce (MR) paper
  • 18. Hadoop • 2006: Hadoop
  • 19. “Traditional Data Warehouse” vs. Hadoop (Requirements; Hadoop; Data Warehouse): (reasonably) Fast; (reasonably) Cheap; (reasonably) Easy to use; Able to process data: ¯_(ツ)_/¯
  • 20. SQL on Hadoop - Hive • 2010: Facebook releases Apache Hive • SQL on Hadoop !
  • 21. Another SQL on Hadoop - Impala • 2012: Cloudera announces Impala • Faster SQL on Hadoop !
  • 22. And then, it exploded …
  • 24. This is not about NoSQL :-)
  • 25. Same: Tables
        sql> describe sh.products;
        +-----------------------+----------------+---------+
        | name                  | type           | comment |
        +-----------------------+----------------+---------+
        | prod_id               | bigint         |         |
        | prod_name             | string         |         |
        | prod_desc             | string         |         |
        | prod_category_id      | bigint         |         |
        | prod_category_desc    | string         |         |
        | supplier_id           | bigint         |         |
        | prod_total_id         | decimal(38,18) |         |
        | prod_src_id           | decimal(38,18) |         |
        | prod_eff_from         | timestamp      |         |
        | prod_eff_to           | timestamp      |         |
        | prod_valid            | string         |         |
        +-----------------------+----------------+---------+
  • 26. Same: Running SQL queries
        sql> select prod_id, count(1)
             from sh.sales s, sh.channels c
             where c.channel_id = s.channel_id and c.channel_desc='Catalog'
             group by prod_id order by 2 desc limit 5;
        +------------------------+----------+
        | prod_id                | count(1) |
        +------------------------+----------+
        | 43.000000000000000000  | 5182     |
        | 46.000000000000000000  | 5165     |
        | 22.000000000000000000  | 5162     |
        | 123.000000000000000000 | 5152     |
        | 32.000000000000000000  | 5145     |
        +------------------------+----------+
        Fetched 5 row(s) in 3.26s
  • 27. Same: Queries are optimized
        sql> explain select count(1) from sh.times;
        +----------------------------------------------------------+
        | Explain String                                           |
        +----------------------------------------------------------+
        | Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
        |                                                          |
        | 03:AGGREGATE [FINALIZE]                                  |
        | |  output: count:merge(1)                                |
        | |                                                        |
        | 02:EXCHANGE [UNPARTITIONED]                              |
        | |                                                        |
        | 01:AGGREGATE                                             |
        | |  output: count(1)                                      |
        | |                                                        |
        | 00:SCAN HDFS [sh.times]                                  |
        |    partitions=16/16 files=32 size=500.45KB               |
        +----------------------------------------------------------+
  • 28. Different: What gets optimized • No “regular” indexes • But many operations are distributed. Diagram labels: SALES 1 / TIMES 1, SALES 2 / TIMES 2, SALES 3 / TIMES 3
  • 29. Different: Native cloud filesystem support
        sql> show partitions sh.sales;
        s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
        s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
        s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
        s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
        s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
        s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
        s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
        s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
        hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
        hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
        hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
        hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
  • 30. Database engine does NOT “own” data
  • 31. Different: Different engines can work with the same data files (even at the same time)
        Oracle data files: example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf
        Hadoop data files: a01_data.parq, a01_data.parq, a03_data.parq, a04_data.parq, a05_data.parq, a06_data.parq
  • 32. Different: … or copies of the data files
        hdfs://adhoc/a.parq hdfs://adhoc/b.parq hdfs://adhoc/c.parq hdfs://adhoc/d.parq hdfs://adhoc/e.parq hdfs://adhoc/f.parq
        hdfs://prod/a.parq hdfs://prod/b.parq hdfs://prod/c.parq hdfs://prod/d.parq hdfs://prod/e.parq hdfs://prod/f.parq
        s3://backup/a.parq s3://backup/b.parq s3://backup/c.parq s3://backup/d.parq s3://backup/e.parq s3://backup/f.parq
  • 33. Different: Open data formats • Not proprietary: many tools can read/write • No additional $$ for “advanced features”: • Columnar storage • Storage indexes • Compression
  • 34. Same: “sqlplus-like” clients
        > impala-shell -i 10.0.0.1
        [10.0.0.1:21000] > select prod_id, count(1) from sh.sales group by prod_id order by 2 desc limit 1;
        +-----------------------+----------+
        | prod_id               | count(1) |
        +-----------------------+----------+
        | 48.000000000000000000 | 74026    |
        +-----------------------+----------+
        > beeline -u 'jdbc:hive2://10.0.0.1:10000'
        0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1) from sh.sales group by prod_id order by 2 desc limit 1;
  • 35. Different: External dictionary. Diagram labels: User data, Dictionary (SYS), User data, Dictionary (SYS), Hive Metastore
  • 36. Different: Append only, “ETL-like” DML • Hadoop DML is more like ETL • Data is presumed static • ACID: some interpretation required • Schema on read
        UPDATE t SET a=12 WHERE b=1;
        Before: Table T (base): a_data.orc
        After the update: Table T (base): a_data.orc + Table T (delta): b_data.orc
        Compactor runs …
        After compaction: Table T (base): c_data.orc
  • 38. Apache Hive • “Designed” for “batch” queries (*) • Runs on top of standard Hadoop RM: YARN • Supports multiple “engines”: MR, TEZ, Spark • SerDes. Diagram: Master (Hiveserver2, namenode, YARN RM); three slave nodes, each with YARN NM and datanode
  • 39. Apache Impala • Designed for “quick interactive” queries • “Data-local” execution • In-memory processing. Diagram: Master (statestored, namenode, catalogd); Slaves A, B, C, each with impalad and datanode
  • 40. Apache Spark • “Better Hadoop” with “native”: SQL, MLlib, GraphX • In-memory processing, based on RDDs • Supports many clusters: “native”, YARN, Mesos • Flexible programming model. Diagram: Master (Driver); Slaves A, B, C, each with an Executor
  • 41. Presto • Designed for “interactive” queries • In-memory processing • Custom storage “plugins”: Hive, Kafka, MySQL, Postgres, … Diagram: Master (coordinator); Slaves A, B, C, each with a worker
  • 43. Step 1: Google “Hadoop ecosystem”
  • 44. Step 2: Try to install the simplest thing
  • 45. Step 3
  • 46. Step 4
  • 47. Hint: Nobody builds their own Linux anymore
  • 48. Choose a Hadoop distribution that suits you
  • 49. Hadoop distributions • Pre-built and pre-integrated (aka: all things work out of the box) • Each has its own “philosophy” … • … as well as a preferred Hadoop database
  • 50. So what’s in it for me? • It’s interesting (cool technology that hits many recent buzzwords) • If you know ORACLE, it’s close to your skill set • It’s promising and future oriented
  • 51. Q&A
  • 52. Please Complete Your Session Evaluation. Evaluate this session in your COLLABORATE app. Pull up this session and tap "Session Evaluation" to complete the survey. Session ID: 557
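
The base/delta/compactor sequence on slide 36 can be illustrated with a toy Python model. This is only a sketch under simplifying assumptions: "files" are in-memory lists, rows are addressed by position, and the names `AppendOnlyTable`, `read`, `update`, and `compact` are hypothetical, not the API of Hive or any other engine.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyTable:
    # Rows of the current base "file"; never modified in place by update().
    base: list
    # One delta "file" per UPDATE: a list of (row_index, new_row) pairs.
    deltas: list = field(default_factory=list)

    def read(self):
        """Return the visible rows: base with every delta applied in order."""
        rows = [dict(row) for row in self.base]
        for delta in self.deltas:
            for index, new_row in delta:
                rows[index] = dict(new_row)
        return rows

    def update(self, assignments, predicate):
        """Record changed rows in a new delta instead of rewriting the base."""
        delta = [
            (index, {**row, **assignments})
            for index, row in enumerate(self.read())
            if predicate(row)
        ]
        self.deltas.append(delta)

    def compact(self):
        """Merge base and deltas into a new base, then drop the deltas."""
        self.base = self.read()
        self.deltas = []


# UPDATE t SET a=12 WHERE b=1;
table = AppendOnlyTable(base=[{"a": 1, "b": 1}, {"a": 2, "b": 2}])
table.update({"a": 12}, lambda row: row["b"] == 1)
print(table.base)    # base is unchanged after the update
print(table.read())  # readers see base + delta
table.compact()
print(table.base, table.deltas)
```

The point of the sketch is that an update costs an append plus extra work at read time until compaction runs, which is why this style of DML behaves more like ETL than like OLTP.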

Editor's notes

  1. All your queries are essentially Java Classes
  2. With Apache Hadoop, “everybody” can run “big data” queries
  3. With Apache Hadoop, “everybody” can run “big data” queries
  4. Columns and rows. Schemas. Similar data types. Other familiar objects: Views (*), Functions (**). Notably missing: Indexes
  5. Joins, subqueries, aggregate functions. Optimizer, statistics. Different SQL dialects
  6. Joins, subqueries, aggregate functions. Optimizer, statistics. Different SQL dialects
  7. Different databases support different formats. Some (e.g. Hive) support “hookups” for custom formats
  8. Some dictionary information (e.g. partitions) can be read directly from the file system
  9. Each database supports different DML semantics. Apache Kudu is coming to change that
  10. Java. Old and trusty
  11. Not Hadoop. C++
  12. Better Hadoop. Different apps use different executors
  13. Does not spill to disk, not as stable as Hive. https://www.quora.com/What-are-the-main-differences-between-Hive-and-Facebook-Presto
  14. Technically, not Hadoop. Different apps use different executors
  15. Technically, not Hadoop. Different apps use different executors
  16. Technically, not Hadoop. Different apps use different executors
  17. Technically, not Hadoop. Different apps use different executors
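
Note 8 says that some dictionary information, such as partitions, can be read straight from the file system. A minimal Python sketch of that idea follows, assuming Hive-style `key=value` directory names like the ones listed on slide 29. The helper name `parse_partition_path` is hypothetical, and real engines also consult the metastore rather than relying on paths alone.

```python
from collections import defaultdict


def parse_partition_path(path):
    """Split a path into its storage scheme and its key=value partition parts."""
    scheme, _, rest = path.partition("://")
    partition = {}
    for segment in rest.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only directory names of the form key=value carry partition info
            partition[key] = value
    return scheme, partition


# Two of the partition locations shown on slide 29.
paths = [
    "s3a://bucket1/sh/sales/time_id=2011-08",
    "hdfs://clust1/sh/sales/time_id=2011-09",
]

# Group partition values by the storage system that holds them.
by_scheme = defaultdict(list)
for p in paths:
    scheme, partition = parse_partition_path(p)
    by_scheme[scheme].append(partition["time_id"])

print(dict(by_scheme))
```

Because the partition key and value are encoded in the directory name, one table can span several storage systems without the engine having to own the files.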