Smart SQL for Big Data
2. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Smart SQL Processing for
Databases, Hadoop, and Beyond
Dan McClary, Ph.D.
Big Data Product Management
Oracle
June, 2014
Oracle Confidential – Internal/Restricted/Highly Restricted
3.
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for
information purposes only, and may not be incorporated into any contract. It is not a
commitment to deliver any material, code, or functionality, and should not be relied upon
in making purchasing decisions. The development, release, and timing of any features or
functionality described for Oracle’s products remains at the sole discretion of Oracle.
4.
Databases, Hadoop, and Beyond
1. How and Why Companies are Using Big Data
2. Making Hadoop a first-class citizen
3. Smarter SQL Processing
5.
Big Data Customer Snapshot
Big Data Analytic Services
• R&D, Cross-property analytics, massive ingestion
• Consolidated data science platform
Business Transformation
• Leading Spanish Bank > 13M customers
• Collect & unify all relevant information
Innovative Network Defense
• Hadoop and NoSQL DB for data of different speeds
• Detect 0-days, uncover intrusions
[Each deployment: Big Data Appliance + Exadata]
6.
Exploit the Strengths of Both Systems
[Radar chart, 0–5 scale, comparing Hadoop and RDBMS on: tooling maturity, stringent functionals, ACID transactions, security, variety of data formats, release pace, ETL simplicity, cost-effective data storage, ingestion rate, business interoperability]
• Hadoop is good at some things
• Databases are good at others
• Don’t reinvent wheels
7.
BDMS: Big Data Management System
Run the Business (Relational)
Integrate existing systems
Support mission-critical tasks
Protect existing expenditures
Ensure skills relevance
Change the Business (Hadoop)
Disrupt competitors
Disintermediate supply chains
Leverage new paradigms
Exploit new analyses
Scale the Business (NoSQL)
Serve faster
Meet mobile challenges
Scale-out economically
8.
Remarkable Innovation
Hadoop Ecosystem
9.
Innovation Breeds Challenge
Operations:
Custom assembly
HW/SW optimization
Security
Redundancy
Integration
Support
Languages:
Complexity
APIs in flux
Constant upgrade
Skill sets
Hadoop Ecosystem
10.
Building for Database Operations At Scale
Intelligent Storage
Smart Scan
Storage Indexing
Advanced Compression
Optimized Network Protocols
Easy Upgrades
Easy Consolidation
Engineered System for
Oracle Database
11.
Building for Hadoop Operations at Scale
Integrated Enterprise Management
OOB Authentication
Auditing
Role-based Access Control
Encryption
High Availability
Easy Upgrades
Rapid Provisioning
Engineered System for
Hadoop & NoSQL
12.
Real Barriers to Adopting Big Data
The Platform is not the Problem
• Skills
– Hadoop requires new expertise
– Let experts be experts!
– Ensure experts can work together
• Integration
– Prevent Hadoop from becoming a silo
• Security
– Need clear routes to governance or enforcement
13.
How do we make Hadoop
a first-class citizen?
14.
SQL
15.
Why?
16.
40 Years of SQL
SELECT dept.deptno, SUM(emp.salary)
FROM emp, dept
WHERE dept.deptno = emp.deptno
GROUP BY dept.deptno
Still works
Faster and in more places
YEAR 1974 → YEAR 2014
17.
SQL on Hadoop is Obvious
Stinger
18.
Data Lives in Many Places
Profit and Loss → Relational
Application Logs → Hadoop
Customer Profiles → NoSQL
SQL (across all three)
19.
The Challenge is ON
Create a system that:
• Gives you the full power of SQL
• Requires no changes to application code
• Gives you a single view of All Data stored in RDBMS and in Hadoop (++)
• No changes (required) to Hadoop or your data
• Best possible performance on your Hadoop data
20.
Smart SQL Processing
on Hadoop (and more) data
21.
100% of you are wondering how we do this!
22.
BDMS Requirements
• Full Power of SQL and Advanced Analytics
• No Changes to Application Code
• Single View of All Data
• Fastest Performance
• No Changes to Hadoop
+
• Unified Metadata Across RDBMS & Hadoop
• SQL Access to NoSQL
23.
How did we do this?
1. Give database queries the ability to be a Hadoop client
2. Expand the database metadata to understand Hadoop objects
3. Add services to Hadoop to execute and optimize data requests
24.
Teaching Oracle About Hadoop
25.
How does MapReduce process data?
• Scan and row creation must work on “any” data format
• User-defined Java classes are used to scan and create the rows
RecordReader => Scans data (keys and values)
InputFormat => Defines parallelism
[Diagram: DataNode disk → scan → create rows → consumer]
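The split/scan flow above can be sketched in miniature. This is a hypothetical Python analogue, not Hadoop's actual Java API: an "input format" defines parallelism by carving the input into byte ranges, and a "record reader" scans each range into (key, value) rows, with the key being the byte offset.

```python
def input_splits(data: bytes, split_size: int):
    """Define parallelism: carve the input into fixed-size byte ranges."""
    return [(start, min(start + split_size, len(data)))
            for start in range(0, len(data), split_size)]

def record_reader(data: bytes, start: int, end: int):
    """Scan one split, yielding (offset, line) pairs as rows."""
    offset = start
    for line in data[start:end].split(b"\n"):
        if line:
            yield offset, line.decode()
        offset += len(line) + 1

data = b"alpha\nbravo\ncharlie\n"
for start, end in input_splits(data, split_size=12):
    for key, value in record_reader(data, start, end):
        print(key, value)
```

Each split can be handed to a different worker, which is what gives the scan its parallelism.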
26.
How does Hive help?
• Definitions are represented as tables in the
Hive Metastore
• Hive leverages a SerDe (Java class) to define
columns on rows generated
SerDe => Creates columns
RecordReader => Scans data (keys and values)
InputFormat => Defines parallelism
[Diagram: DataNode disk → scan → create rows & columns → consumer]
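Layered on top of the record reader, a SerDe-like step turns each raw row into named columns, analogous to what a Hive SerDe Java class does. A hypothetical sketch; the column names and delimiter are assumptions, not from a real schema:

```python
COLUMNS = ["ca_customer_id", "ca_state", "ca_zip"]  # assumed schema

def serde(raw_line: str, delimiter: str = ","):
    """Turn one raw row into a dict of named columns."""
    values = raw_line.split(delimiter)
    return dict(zip(COLUMNS, values))

row = serde("1001,CA,94065")
print(row["ca_state"])
```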
27.
[Diagram: Big Data Appliance + Hadoop (HDFS DataNodes; NameNode holding Hive metadata) alongside Exadata + Oracle Database (Oracle Catalog holding the External Table definition)]
create table customer_address
( ca_customer_id number(10,0)
, ca_street_number char(10)
, ca_state char(2)
, ca_zip char(10)
)
organization external (
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(com.oracle.bigdata.cluster hadoop_cl_1)
LOCATION ('hive://customer_address')
)
Publish Hadoop Metadata to Oracle Catalog
28.
create table customer_address
( ca_customer_id number(10,0)
, ca_street_number char(10)
, ca_state char(2)
, ca_zip char(10)
)
organization external (
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(com.oracle.bigdata.cluster hadoop_cl_1)
LOCATION ('hive://customer_address')
)
Publish Hadoop Metadata to Oracle Catalog
• SerDe
• RecordReader
• InputFormat
• StorageHandlers!
29.
Executing Queries on Hadoop
Select c_customer_id
, c_customer_last_name
, ca_county
From customers
, customer_address
where c_customer_id = ca_customer_id
and ca_state = 'CA'
[Diagram: Oracle Catalog holding External Table definitions; HDFS NameNode holding Hive metadata; multiple HDFS DataNodes]
Determine:
• Data locations
• Data structure
• Parallelism
Send to specific data nodes:
• Data request
• Context
There’s a bottleneck here!
30.
Making SQL Processing Smarter
31.
What Can Big Data Learn from Exadata?
Minimized data movement → performance
Smart Scan
− Filters data as it streams from disk
Storage Indexing
− Ensures only relevant data is read
Caching
− Frequently accessed data takes less time to read
32.
Executing Queries on Hadoop
Select c_customer_id
, c_customer_last_name
, ca_county
From customers
, customer_address
where c_customer_id = ca_customer_id
and ca_state = 'CA'
[Diagram: Oracle Catalog holding External Table definitions; HDFS NameNode holding Hive metadata; HDFS DataNodes exposing “tables”]
Do I/O and Smart Scan:
• Filter rows
• Project columns
Move only relevant data:
• Relevant rows
• Relevant columns
Apply join with database data
Note: This also works without Hive definitions, as the underlying HDFS access concepts apply…
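As an illustration of the idea (not Oracle's implementation), a smart scan applies the WHERE filter and the column projection on the storage side, so only the relevant rows and columns travel to the database for the join. The table mirrors the query above; the data is invented:

```python
customer_address = [
    {"ca_customer_id": 1, "ca_county": "Marin",   "ca_state": "CA", "ca_zip": "94901"},
    {"ca_customer_id": 2, "ca_county": "Travis",  "ca_state": "TX", "ca_zip": "73301"},
    {"ca_customer_id": 3, "ca_county": "Alameda", "ca_state": "CA", "ca_zip": "94501"},
]

def smart_scan(rows, predicate, projection):
    """Filter rows and project columns where the data lives."""
    return [{col: r[col] for col in projection} for r in rows if predicate(r)]

shipped = smart_scan(customer_address,
                     predicate=lambda r: r["ca_state"] == "CA",
                     projection=["ca_customer_id", "ca_county"])
# Only 2 of 3 rows, and 2 of 4 columns, leave the storage layer.
```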
33.
Optimizing Scans on Hadoop: Storage Indexes
• Automatically collect and store the minimum and maximum value within a storage unit
• Before scanning a storage unit, verify whether the data required falls within the min-max range
• If not, skip scanning the block and reduce scan time
[Diagram: HDFS NameNode holding Hive metadata; HDFS DataNodes holding blocks, each with stored Min/Max values]
Note: This also works without Hive definitions; simply leverage the SerDe
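The min-max skip can be illustrated with a toy sketch (assumed block layout, not the actual on-disk format): the index check costs no I/O, and any block whose range cannot contain the target is never read.

```python
blocks = [
    {"min": 10, "max": 40, "values": [10, 25, 40]},
    {"min": 55, "max": 80, "values": [55, 61, 80]},
    {"min": 41, "max": 54, "values": [41, 47, 54]},
]

def scan_with_index(blocks, target):
    """Scan only blocks whose min-max range could hold the target."""
    hits, scanned = [], 0
    for b in blocks:
        if b["min"] <= target <= b["max"]:   # index check, no I/O
            scanned += 1
            hits += [v for v in b["values"] if v == target]
    return hits, scanned

hits, scanned = scan_with_index(blocks, 61)
# Only 1 of 3 blocks is actually scanned.
```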
34.
What Does This Mean for Me?
35.
What if You Could Query All Data?
[Diagram: JSON data stored unconverted in Hadoop on Oracle Big Data Appliance; business-critical data stored in Oracle Database 12c; all data analyzed via SQL]
select
customers_document.address.state,
sum(revenue)
from
customers, sales
where
customers_document.id=sales.custID
group by
customers_document.address.state;
Push down to Hadoop
− JSON parsing
− Column projection
− Bloom filter for faster join
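The Bloom-filter pushdown can be sketched as follows (an illustrative toy, not Oracle's implementation; the hash choices are arbitrary). The database builds a small bit array from its join keys, ships it to Hadoop, and Hadoop drops rows whose key cannot possibly match before any data moves:

```python
def bloom_build(keys, m=64, k=3):
    """Set k hashed bit positions per key in an m-bit array."""
    bits = 0
    for key in keys:
        for i in range(k):
            bits |= 1 << (hash((i, key)) % m)
    return bits

def bloom_maybe_contains(bits, key, m=64, k=3):
    """True if all k bit positions for `key` are set (may be a false positive)."""
    return all(bits >> (hash((i, key)) % m) & 1 for i in range(k))

db_cust_ids = {101, 205, 333}            # join keys from the database side
bloom = bloom_build(db_cust_ids)

hadoop_rows = [{"custID": cid} for cid in (101, 999, 333, 42)]
survivors = [r for r in hadoop_rows if bloom_maybe_contains(bloom, r["custID"])]
```

False positives are possible but false negatives are not: every row that really joins survives the filter, so correctness is preserved while most non-matching rows never cross the wire.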
36.
What if You Could Govern All Data?
[Diagram: JSON data stored unconverted in Hadoop on Oracle Big Data Appliance; business-critical data stored in Oracle Database 12c; all data analyzed via SQL]
DBMS_REDACT.ADD_POLICY(
object_schema => 'txadp_hive_01',
object_name => 'customer_address_ext',
column_name => 'ca_street_name',
policy_name => 'customer_address_redaction',
function_type => DBMS_REDACT.RANDOM,
expression => 'SYS_CONTEXT(''SYS_SESSION_ROLES'',
''REDACTION_TESTER'')=''TRUE'''
);
Apply advanced security on Hadoop
− Masking/Redaction
− Virtual Private Database
− Fine-grained Access Control
37.
Oracle’s Big Data Management System
One fast SQL query, on all your data.
Oracle SQL on Hadoop and beyond
• With a Smart Scan service as in Exadata
• With native SQL operators
• With the security and certainty of Oracle Database
Happy 40th Birthday, SQL
38.
http://www.oracle.com/bigdatabreakthrough
@dan_mcclary
39.
Editor's notes
InputFormat
Hadoop relies on the input format of the job to do three things: 1. Validate the input configuration for the job (i.e., check that the data is there). 2. Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing. 3. Create the RecordReader implementation used to create key/value pairs from the raw InputSplit. These pairs are sent one by one to their mapper.
A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. In the context of file-based input, the “start” is the byte position in the file where the RecordReader should start generating key/value pairs. The “end” is where it should stop reading records. These are not hard boundaries as far as the API is concerned; there is nothing stopping a developer from reading the entire file for each map task. While reading the entire file is not advised, reading outside of the boundaries is often necessary to ensure that a complete record is generated.
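The boundary rule described above can be sketched as follows (a simplified, hypothetical reader for newline-delimited text, not Hadoop's actual LineRecordReader): every split except the first skips its leading partial record, and a reader may read past its "end" to finish a record it started, so each record is produced exactly once across splits.

```python
def read_records(data: bytes, start: int, end: int):
    """Yield complete newline-terminated records for one split."""
    pos = start
    if start != 0:
        # Not the first split: skip the partial record at the front;
        # the previous split is responsible for emitting it.
        pos = data.index(b"\n", start) + 1
    # Emit every record that *starts* at or before `end`, which may mean
    # reading past `end` to finish a record begun inside this split.
    while pos <= end and pos < len(data):
        nl = data.index(b"\n", pos)
        yield data[pos:nl].decode()
        pos = nl + 1

data = b"first\nsecond\nthird\n"
split_a = list(read_records(data, 0, 8))          # boundary falls inside "second"
split_b = list(read_records(data, 8, len(data)))  # skips the partial "second"
```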