Smart SQL for Big Data
2. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Smart SQL Processing for
Databases, Hadoop, and Beyond
Dan McClary, Ph.D.
Big Data Product Management
Oracle
June, 2014
Oracle Confidential – Internal/Restricted/Highly Restricted
3.
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
4.
Databases, Hadoop, and Beyond
1. How and Why Companies are Using Big Data
2. Making Hadoop a first-class citizen
3. Smarter SQL Processing
5.
Big Data Customer Snapshot
Big Data Analytic Services
• R&D, Cross-property analytics, massive ingestion
• Consolidated data science platform
Business Transformation
• Leading Spanish Bank > 13M customers
• Collect & unify all relevant information
Innovative Network Defense
• Hadoop and NoSQL DB for data of different speeds
• Detect 0-days, uncover intrusions
[Each deployment: Big Data Appliance + Exadata]
6.
Exploit the Strengths of Both Systems
[Radar chart, 0–5 scale, comparing Hadoop and RDBMS on: tooling maturity, stringent functionals, ACID transactions, security, variety of data formats, release pace, ETL simplicity, cost-effective data storage, ingestion rate, business interoperability]
• Hadoop is good at some things
• Databases are good at others
• Don’t reinvent wheels
7.
BDMS: Big Data Management System
Run the Business (Relational)
Integrate existing systems
Support mission-critical tasks
Protect existing expenditures
Ensure skills relevance
Change the Business (Hadoop)
Disrupt competitors
Disintermediate supply chains
Leverage new paradigms
Exploit new analyses
Scale the Business (NoSQL)
Serve faster
Meet mobile challenges
Scale-out economically
8.
Remarkable Innovation
Hadoop Ecosystem
9.
Innovation Breeds Challenge
Operations:
Custom assembly
HW/SW optimization
Security
Redundancy
Integration
Support
Languages:
Complexity
APIs in flux
Constant upgrade
Skill sets
Hadoop Ecosystem
10.
Building for Database Operations At Scale
Intelligent Storage
Smart Scan
Storage Indexing
Advanced Compression
Optimized Network Protocols
Easy Upgrades
Easy Consolidation
Engineered System for
Oracle Database
11.
Building for Hadoop Operations at Scale
Integrated Enterprise Management
OOB Authentication
Auditing
Role-based Access Control
Encryption
High Availability
Easy Upgrades
Rapid Provisioning
Engineered System for
Hadoop & NoSQL
12.
Real Barriers to Adopting Big Data
The Platform is not the Problem
• Skills
– Hadoop requires new expertise
– Let experts be experts!
– Ensure experts can work together
• Integration
– Prevent Hadoop from becoming a silo
• Security
– Need clear routes to governance or enforcement
13.
How do we make Hadoop
a first-class citizen?
14.
SQL
15.
Why?
16.
40 Years of SQL
SELECT dept.deptno, SUM(emp.salary)
FROM emp, dept
WHERE dept.deptno = emp.deptno
GROUP BY dept.deptno
Still works
Faster and in more places
YEAR 1974 → YEAR 2014
17.
SQL on Hadoop is Obvious
Stinger
18.
Data Lives in Many Places
Profit and Loss → Relational
Application Logs → Hadoop
Customer Profiles → NoSQL
SQL (across all three)
19.
The Challenge is ON
Create a system that:
• Gives you the full power of SQL
• Requires no changes to application code
• Gives you a single view of All Data stored in RDBMS and in Hadoop (++)
• No changes (required) to Hadoop or your data
• Best possible performance on your Hadoop data
20.
Smart SQL Processing
on Hadoop (and more) data
21.
100% of you are wondering how we do this!
22.
BDMS Requirements
• Full Power of SQL and Advanced Analytics
• No Changes to Application Code
• Single View of All Data
• Fastest Performance
• No Changes to Hadoop
+
• Unified Metadata Across RDBMS & Hadoop
• SQL Access to NoSQL
23.
How did we do this?
1. Give database queries the ability to be a Hadoop client
2. Expand the database metadata to understand Hadoop objects
3. Add services to Hadoop to execute and optimize data requests
24.
Teaching Oracle About Hadoop
25.
How does MapReduce process data?
• Scan and row creation must work on “any” data format
• User-defined Java classes are used to scan and create the rows
RecordReader => Scans data (keys and values)
InputFormat => Defines parallelism
[Diagram: DataNode disk → scan → create rows → consumer]
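The split/scan flow above can be sketched in miniature. This is a hypothetical Python analogue, not Hadoop's actual Java API: an "input format" defines parallelism by carving the input into byte ranges, and a "record reader" scans each range into (key, value) rows, with the key being the byte offset.

```python
def input_splits(data: bytes, split_size: int):
    """Define parallelism: carve the input into fixed-size byte ranges."""
    return [(start, min(start + split_size, len(data)))
            for start in range(0, len(data), split_size)]

def record_reader(data: bytes, start: int, end: int):
    """Scan one split, yielding (offset, line) pairs as rows."""
    offset = start
    for line in data[start:end].split(b"\n"):
        if line:
            yield offset, line.decode()
        offset += len(line) + 1

data = b"alpha\nbravo\ncharlie\n"
for start, end in input_splits(data, split_size=12):
    for key, value in record_reader(data, start, end):
        print(key, value)
```

Each split can be handed to a different worker, which is what gives the scan its parallelism.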
26.
How does Hive help?
• Definitions are represented as tables in the
Hive Metastore
• Hive leverages a SerDe (Java class) to define
columns on rows generated
SerDe => Creates columns
RecordReader => Scans data (keys and values)
InputFormat => Defines parallelism
[Diagram: DataNode disk → scan → create rows & columns → consumer]
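Layered on top of the record reader, a SerDe-like step turns each raw row into named columns, analogous to what a Hive SerDe Java class does. A hypothetical sketch; the column names and delimiter are assumptions, not from a real schema:

```python
COLUMNS = ["ca_customer_id", "ca_state", "ca_zip"]  # assumed schema

def serde(raw_line: str, delimiter: str = ","):
    """Turn one raw row into a dict of named columns."""
    values = raw_line.split(delimiter)
    return dict(zip(COLUMNS, values))

row = serde("1001,CA,94065")
print(row["ca_state"])
```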
27.
[Diagram: Big Data Appliance + Hadoop (HDFS DataNodes; NameNode holding Hive metadata) alongside Exadata + Oracle Database (Oracle Catalog holding the External Table definition)]
create table customer_address
( ca_customer_id number(10,0)
, ca_street_number char(10)
, ca_state char(2)
, ca_zip char(10)
)
organization external (
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(com.oracle.bigdata.cluster hadoop_cl_1)
LOCATION ('hive://customer_address')
)
Publish Hadoop Metadata to Oracle Catalog
28.
create table customer_address
( ca_customer_id number(10,0)
, ca_street_number char(10)
, ca_state char(2)
, ca_zip char(10)
)
organization external (
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(com.oracle.bigdata.cluster hadoop_cl_1)
LOCATION ('hive://customer_address')
)
Publish Hadoop Metadata to Oracle Catalog
• SerDe
• RecordReader
• InputFormat
• StorageHandlers!
29.
Executing Queries on Hadoop
Select c_customer_id
, c_customer_last_name
, ca_county
From customers
, customer_address
where c_customer_id = ca_customer_id
and ca_state = 'CA'
[Diagram: Oracle Catalog holding External Table definitions; HDFS NameNode holding Hive metadata; multiple HDFS DataNodes]
Determine:
• Data locations
• Data structure
• Parallelism
Send to specific data nodes:
• Data request
• Context
There’s a bottleneck here!
30.
Making SQL Processing Smarter
31.
What Can Big Data Learn from Exadata?
Minimized data movement → performance
Smart Scan
− Filters data as it streams from disk
Storage Indexing
− Ensures only relevant data is read
Caching
− Frequently accessed data takes less time to read
32.
Executing Queries on Hadoop
Select c_customer_id
, c_customer_last_name
, ca_county
From customers
, customer_address
where c_customer_id = ca_customer_id
and ca_state = 'CA'
[Diagram: Oracle Catalog holding External Table definitions; HDFS NameNode holding Hive metadata; HDFS DataNodes exposing “tables”]
Do I/O and Smart Scan:
• Filter rows
• Project columns
Move only relevant data:
• Relevant rows
• Relevant columns
Apply join with database data
Note: This also works without Hive definitions, as the underlying HDFS access concepts apply…
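As an illustration of the idea (not Oracle's implementation), a smart scan applies the WHERE filter and the column projection on the storage side, so only the relevant rows and columns travel to the database for the join. The table mirrors the query above; the data is invented:

```python
customer_address = [
    {"ca_customer_id": 1, "ca_county": "Marin",   "ca_state": "CA", "ca_zip": "94901"},
    {"ca_customer_id": 2, "ca_county": "Travis",  "ca_state": "TX", "ca_zip": "73301"},
    {"ca_customer_id": 3, "ca_county": "Alameda", "ca_state": "CA", "ca_zip": "94501"},
]

def smart_scan(rows, predicate, projection):
    """Filter rows and project columns where the data lives."""
    return [{col: r[col] for col in projection} for r in rows if predicate(r)]

shipped = smart_scan(customer_address,
                     predicate=lambda r: r["ca_state"] == "CA",
                     projection=["ca_customer_id", "ca_county"])
# Only 2 of 3 rows, and 2 of 4 columns, leave the storage layer.
```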
33.
Optimizing Scans on Hadoop: Storage Indexes
• Automatically collect and store the minimum and maximum value within a storage unit
• Before scanning a storage unit, verify whether the data required falls within the min-max range
• If not, skip scanning the block and reduce scan time
[Diagram: HDFS NameNode holding Hive metadata; HDFS DataNodes holding blocks, each with stored Min/Max values]
Note: This also works without Hive definitions; simply leverage the SerDe
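The min-max skip can be illustrated with a toy sketch (assumed block layout, not the actual on-disk format): the index check costs no I/O, and any block whose range cannot contain the target is never read.

```python
blocks = [
    {"min": 10, "max": 40, "values": [10, 25, 40]},
    {"min": 55, "max": 80, "values": [55, 61, 80]},
    {"min": 41, "max": 54, "values": [41, 47, 54]},
]

def scan_with_index(blocks, target):
    """Scan only blocks whose min-max range could hold the target."""
    hits, scanned = [], 0
    for b in blocks:
        if b["min"] <= target <= b["max"]:   # index check, no I/O
            scanned += 1
            hits += [v for v in b["values"] if v == target]
    return hits, scanned

hits, scanned = scan_with_index(blocks, 61)
# Only 1 of 3 blocks is actually scanned.
```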
34.
What Does This Mean for Me?
35.
What if You Could Query All Data?
[Diagram: JSON data stored unconverted in Hadoop on Oracle Big Data Appliance; business-critical data stored in Oracle Database 12c; all data analyzed via SQL]
select
customers_document.address.state,
sum(revenue)
from
customers, sales
where
customers_document.id=sales.custID
group by
customers_document.address.state;
Push down to Hadoop
− JSON parsing
− Column projection
− Bloom filter for faster join
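The Bloom-filter pushdown can be sketched as follows (an illustrative toy, not Oracle's implementation; the hash choices are arbitrary). The database builds a small bit array from its join keys, ships it to Hadoop, and Hadoop drops rows whose key cannot possibly match before any data moves:

```python
def bloom_build(keys, m=64, k=3):
    """Set k hashed bit positions per key in an m-bit array."""
    bits = 0
    for key in keys:
        for i in range(k):
            bits |= 1 << (hash((i, key)) % m)
    return bits

def bloom_maybe_contains(bits, key, m=64, k=3):
    """True if all k bit positions for `key` are set (may be a false positive)."""
    return all(bits >> (hash((i, key)) % m) & 1 for i in range(k))

db_cust_ids = {101, 205, 333}            # join keys from the database side
bloom = bloom_build(db_cust_ids)

hadoop_rows = [{"custID": cid} for cid in (101, 999, 333, 42)]
survivors = [r for r in hadoop_rows if bloom_maybe_contains(bloom, r["custID"])]
```

False positives are possible but false negatives are not: every row that really joins survives the filter, so correctness is preserved while most non-matching rows never cross the wire.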
36.
What if You Could Govern All Data?
[Diagram: JSON data stored unconverted in Hadoop on Oracle Big Data Appliance; business-critical data stored in Oracle Database 12c; all data analyzed via SQL]
DBMS_REDACT.ADD_POLICY(
object_schema => 'txadp_hive_01',
object_name => 'customer_address_ext',
column_name => 'ca_street_name',
policy_name => 'customer_address_redaction',
function_type => DBMS_REDACT.RANDOM,
expression => 'SYS_CONTEXT(''SYS_SESSION_ROLES'',
''REDACTION_TESTER'')=''TRUE'''
);
Apply advanced security on Hadoop
− Masking/Redaction
− Virtual Private Database
− Fine-grained Access Control
37.
Oracle’s Big Data Management System
One fast SQL query, on all your data.
Oracle SQL on Hadoop and beyond
• With a Smart Scan service as in Exadata
• With native SQL operators
• With the security and certainty of Oracle Database
Happy 40th Birthday, SQL
38.
http://www.oracle.com/bigdatabreakthrough
@dan_mcclary
39.
Editor's notes
InputFormat
Hadoop relies on the input format of the job to do three things: 1. Validate the input configuration for the job (i.e., check that the data is there). 2. Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing. 3. Create the RecordReader implementation used to create key/value pairs from the raw InputSplit. These pairs are sent one by one to their mapper.
A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. In the context of file-based input, the “start” is the byte position in the file where the RecordReader should start generating key/value pairs. The “end” is where it should stop reading records. These are not hard boundaries as far as the API is concerned; there is nothing stopping a developer from reading the entire file for each map task. While reading the entire file is not advised, reading outside of the boundaries is often necessary to ensure that a complete record is generated.
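The boundary rule described above can be sketched as follows (a simplified, hypothetical reader for newline-delimited text, not Hadoop's actual LineRecordReader): every split except the first skips its leading partial record, and a reader may read past its "end" to finish a record it started, so each record is produced exactly once across splits.

```python
def read_records(data: bytes, start: int, end: int):
    """Yield complete newline-terminated records for one split."""
    pos = start
    if start != 0:
        # Not the first split: skip the partial record at the front;
        # the previous split is responsible for emitting it.
        pos = data.index(b"\n", start) + 1
    # Emit every record that *starts* at or before `end`, which may mean
    # reading past `end` to finish a record begun inside this split.
    while pos <= end and pos < len(data):
        nl = data.index(b"\n", pos)
        yield data[pos:nl].decode()
        pos = nl + 1

data = b"first\nsecond\nthird\n"
split_a = list(read_records(data, 0, 8))          # boundary falls inside "second"
split_b = list(read_records(data, 8, len(data)))  # skips the partial "second"
```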