2. whoami
Never Worked for Oracle
Worked with Oracle Since 1982 (V2)
Working with Exadata since early 2010
Work for Enkitec (www.enkitec.com)
(Enkitec owns a Half Rack – V2/X2)
(Enkitec owns a Big Data Appliance)
Many Exadata customers and POCs
Exadata Book (recently translated to Chinese)
Hadoop Aficionado
Blog: kerryosborne.oracle-guy.com
Twitter: @KerryOracleGuy
4. What’s the Point?
Data Volumes are Increasing Rapidly
Cost of Processing / Storing is High
Something’s Gotta Give!
Besides – managing large quantities of data is what we do!
12. HDFS/Hadoop Architecture
[Diagram: master nodes – a Job Tracker and a Name Node (HA?) – coordinate worker nodes, each running a DataNode and a TaskTracker against local storage]
13. HDFS/Hadoop Architecture
[Diagram: same HDFS layout as slide 12, with the Name Node relabeled "Block Mapper (namenode)"]
14. Exadata Architecture
[Diagram: RAC nodes as workers (with cache) on top, ASM as the "Block Mapper", and Storage Nodes below, which also run workers against local storage]
15. HDFS/Hadoop Architecture
[Diagram: repeat of slide 13 – HDFS with the namenode as "Block Mapper" – shown again for comparison with the Exadata layout]
20. Sqoop (SQL-to-Hadoop)
• Graduated from Incubator Status in March 2012
• Slower (no direct path?)
• Quest has a plug-in (OraOop)
• Bi-Directional – imports into HDFS and exports back to Oracle (example below)
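A minimal sketch of the bi-directional flow from the command line. The flags are standard Sqoop options; the host, schema, and table names are made up for illustration:

# Import an Oracle table into HDFS (Oracle -> Hadoop)
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username SCOTT -P \
  --table SALES \
  --target-dir /data/sales \
  --num-mappers 4

# Export aggregated results back into an Oracle table (Hadoop -> Oracle)
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username SCOTT -P \
  --table SALES_AGG \
  --export-dir /data/sales_agg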
21. Oracle Big Data Connectors
Oracle Loader for Hadoop – OLH
Oracle Direct Connector for HDFS – ODCH
Oracle R Connector for Hadoop – ORHC
Oracle Data Integrator Application Adapter for Hadoop
Note:
All Connectors are One-Way (into Oracle)
All sold together for $2K per core (list price)
23. Oracle R Connector for Hadoop (ORHC)
• Provides ability to pull data from Oracle RDBMS
• Provides ability to pull data from HDFS
• Provides access to local file system
• Not really a loader tool
• Most useful for analysts
24. Oracle Loader for Hadoop (OLH)
• Implemented as a MapReduce job (oraloader.jar) – see the sketch below
• Saves CPU on the DB Server
• Can convert to Oracle datatypes
• Can partition data and optionally sort it
• Online – direct into Oracle tables
• Can load into Oracle via JDBC or OCI Direct Path
• Offline – generates preprocessed files in HDFS (Data Pump format)
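As a rough sketch, an OLH run is just a MapReduce job driven from the hadoop command line; the target table, input format, and online (JDBC/OCI) vs. offline (Data Pump files) behavior are specified in an XML configuration file. The OLH_HOME path and conf file name below are assumptions:

# OLH runs as a MapReduce job, so the heavy lifting (datatype conversion,
# partitioning, sorting) happens on the Hadoop cluster, not the DB server.
hadoop jar $OLH_HOME/jlib/oraloader.jar \
  oracle.hadoop.loader.OraLoader \
  -conf sales_load_conf.xml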
25. Oracle Direct Connector for HDFS (ODCH)
• My Favorite
• Uses External Tables
• Fastest
• ~12 TB per hour
• Can load Data Pump files preprocessed by OLH
• Allows Oracle SQL to query HDFS data
• Doesn't require loading into Oracle
• Pretty Cool! (example below)
• Downside – uses DB CPUs
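A hedged sketch of the external table setup ODCH uses: the hdfs_stream preprocessor streams HDFS file contents to the access driver, and the location files list the HDFS paths to read. The directory objects, columns, and file names here are assumptions, not the exact demo config:

# hdfs_bin_dir points at ODCH's hdfs_stream script; sales1.loc is a
# location file containing the HDFS paths to read.
sqlplus scott/tiger <<'EOF'
CREATE TABLE sales_hdfs_ext (
  sale_id   NUMBER,
  amount    NUMBER,
  sale_date VARCHAR2(20)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY hdfs_ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR hdfs_bin_dir:'hdfs_stream'
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales1.loc')
)
REJECT LIMIT UNLIMITED;

-- Plain Oracle SQL against HDFS data, no load step required
SELECT COUNT(*) FROM sales_hdfs_ext;
EOF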
27. Exadoop
Unusual Situation!
Half Rack with 4 Spare Storage Servers
Exadata Cells Very Similar to BDA Servers
slower CPUs
less memory
but same drives (12 x 3TB)
and InfiniBand
and Flash
4 Cells ≈ Mini BDA! :)
31. Exadoop
Situation
• Pilot Underway – But Wanted More Power
• 4 Exadata Storage Servers Were Sitting Idle
• Suggestion Was to Install a Hadoop Cluster on Them
• 1st Concern Was Being Able to Reclaim Them for Exadata
• Removing a Data Node from HDFS – Not a Problem
• Adding Storage Back to ASM – Not a Problem (see the sketch after this list)
• So the Decision Was Made to Move Forward
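For what it's worth, a hedged sketch of why reclaiming was not a concern, assuming CDH3-era commands and made-up host and disk names:

# 1. Decommission a data node: add it to the excludes file referenced by
#    dfs.hosts.exclude in hdfs-site.xml, then have the namenode re-read it.
#    HDFS re-replicates the node's blocks elsewhere before it drops out.
echo "cell05.example.com" >> /etc/hadoop/conf/dfs.exclude
hadoop dfsadmin -refreshNodes

# 2. Hand the cell's disks back to Exadata by adding them to an ASM disk
#    group; ASM rebalances data onto the new disks automatically.
sqlplus / as sysasm <<'EOF'
ALTER DISKGROUP data ADD DISK 'o/192.168.10.5/DATA_CD_*_cell05'
REBALANCE POWER 8;
EOF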
32. Exadoop
Set Up
• Removed the Internal USBs
• Installed OEL 6.2
• Installed CDH3
• Loaded Some Data
• Set Up ODCH with External Tables
33. Exadoop
Testing
• Selecting Data Using External Tables Was Not Very Fast
• Quickly Determined We Had Used the Default 1G Network
• Reconfigured with InfiniBand
• Helped, But Not as Much as Expected
• Using Little CPU on the Data Nodes
• But a Single Process Was Pegging a CPU on the DB Server
• Added Parallelism
• No Good – Only One PX Slave Active
• Added Multiple Files to External Table Def. – Bingo! (sketched below)
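The fix, as a sketch (reusing the hypothetical table and file names from the ODCH example): with a single location file the external table gives PX nothing to split, so only one slave stays busy; listing several files lets each slave take its own.

sqlplus scott/tiger <<'EOF'
-- Multiple location files give the PX slaves separate granules to work on
ALTER TABLE sales_hdfs_ext
LOCATION ('sales1.loc', 'sales2.loc', 'sales3.loc', 'sales4.loc');

-- Now a parallel query actually engages more than one slave
SELECT /*+ parallel(s 4) */ COUNT(*) FROM sales_hdfs_ext s;
EOF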
34. Exadoop
Testing - Continued
• Added the FUSE Client (see the sketch below)
• Created External Tables with FUSE
• PX Seems to Work Even on Single Files
• Puts Additional CPU Load on the DB Server (2 TB/hr)
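A rough sketch of the FUSE approach, assuming CDH's hadoop-fuse-dfs package and made-up mount points and names: HDFS gets mounted like an ordinary file system, so the external table points straight at the files and no preprocessor is needed – which is also why the usual PX granule logic can carve up even a single large file.

# Mount HDFS through FUSE (namenode host/port are assumptions)
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs

# Point an external table at the mounted path like any ordinary file
sqlplus scott/tiger <<'EOF'
CREATE OR REPLACE DIRECTORY hdfs_fuse_dir AS '/mnt/hdfs/data/sales';

CREATE TABLE sales_fuse_ext (
  sale_id NUMBER,
  amount  NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY hdfs_fuse_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales.csv')
)
REJECT LIMIT UNLIMITED;
EOF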
35. Wrap Up
Right Tool For The Job?
Maybe
All the Cool Kids Are Doing It!
Many companies that are using Hadoop in a big way still have Oracle databases sitting right next to them. Nokia – I had a meeting with a guy from Nokia a couple of weeks ago, and he described basically an ETL kind of setup: the HDFS cluster ingests data, which is then processed by MapReduce jobs, and the aggregated data is fed into a relational database so the analysts can have their way with it. People have preferences for certain tools (BI tools, for example), and an RDBMS can be very fast for this type of access if the data is of reasonable size. Not using Flume, ??? Using it for many things, but positional data from phones was one of the main cases we discussed. Canadian NSA – they have Exadata and a Hadoop cluster – rows of racks of both.
Use Firefox http://192.168.9.98:7777/pls/apex/f?p=100:2:1849672391763932::NO#
With all the new options available, it will take some serious thought about which architecture makes the most sense for any given problem. I had a conversation two weeks ago with the Canadian NSA (CSE) – a completely static data set, never updated. Good for Hadoop or for HCC. HCC provides about 10x compression on their data set, so a single Exadata rack, which has a raw storage capacity of about half a petabyte (roughly 250 TB usable under normal redundancy), can store over 2 petabytes of uncompressed data. On the other hand, I had a conversation with Nokia about how they are using Hadoop. They have been investing heavily in the technology for a couple of years. A large part of what they do involves ingesting data produced by mobile phones. The data is typically mined by MapReduce jobs, and aggregated data sets are then loaded into RDBMSs where analysts can use standard BI tools to do what they do. So they described it as an ETL-type process.