This document discusses loading data from Hadoop into Oracle databases using Oracle connectors. It describes how the Oracle Loader for Hadoop and Oracle SQL Connector for HDFS can load data from HDFS into Oracle tables much faster than traditional methods like Sqoop by leveraging parallel processing in Hadoop. The connectors optimize the loading process by automatically partitioning, sorting, and formatting the data into Oracle blocks to achieve high performance loads. Measuring the CPU time needed per gigabyte loaded allows estimating how long full loads will take based on available resources.
Data Wrangling and Oracle Connectors for Hadoop
1. Wrangling Data
With Oracle Connectors for Hadoop
Gwen Shapira, Solutions Architect
gshapira@cloudera.com
@gwenshap
2. Data Has Changed in the Last 30 Years
[Chart: data growth from 1980 to 2013, driven by end-user applications, the internet, mobile devices, and sophisticated machines; structured data is ~10% of the total, unstructured data ~90%.]
5. Hadoop Is…
• HDFS – Massive, redundant data storage
• MapReduce – Batch-oriented data processing at scale
[Diagram: core Hadoop system components – the Hadoop Distributed File System (HDFS), replicated, high-bandwidth, clustered storage; and MapReduce, a distributed computing framework.]
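As a hedged illustration of the MapReduce model (mine, not from the deck): the classic word count written as a Hadoop Streaming mapper and reducer in Python. Streaming runs ordinary scripts over HDFS data and feeds the reducer its input sorted by key.

    # mapper.py – emits (word, 1) for every word on stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # reducer.py – input arrives sorted by key, so counts can be summed per word
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))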
6. Hadoop and Databases
“Schema-on-Write” (relational databases):
• A schema must be created before any data can be loaded
• An explicit load operation transforms the data into the database’s internal structure
• New columns must be added explicitly
Pros: 1) Reads are fast 2) Standards and governance

“Schema-on-Read” (Hadoop):
• Data is simply copied to the file store; no transformation is needed
• A serializer/deserializer (SerDe) is applied at read time to extract the required columns
• New data can start flowing at any time and will appear retroactively
Pros: 1) Loads are fast 2) Flexibility and agility
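A minimal sketch of the schema-on-read idea in Python (mine, not from the deck); the pipe-delimited layout and field names are assumptions.

    # Raw lines land in the file store exactly as they arrived – no load step.
    raw = "2013-06-01 14:02:11|gshapira|login|OK\n2013-06-01 14:05:40|gshapira|query|OK\n"

    # A SerDe-like function applies the schema only at read time and pulls out
    # just the columns the query needs.
    def read_events(text, columns=("timestamp", "user", "action", "status")):
        names = ("timestamp", "user", "action", "status")
        for line in text.splitlines():
            rec = dict(zip(names, line.split("|")))
            yield tuple(rec[c] for c in columns)

    # A "query" that only cares about user and action.
    for row in read_events(raw, columns=("user", "action")):
        print(row)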
7. Hadoop rocks Data Wrangling
• Cheap storage for messy data
• Tools to play with data:
• Acquire
• Clean
• Transform
• Flexibility where you need it most
11. Data Sources
• Internal
• OLTP
• Log files
• Documents
• Sensors / network events
• External:
• Geo-location
• Demographics
• Public data sets
• Websites
12. Free External Data
Name URL
U.S. Census Bureau http://factfinder2.census.gov/
U.S. Executive Branch http://www.data.gov/
U.K. Government http://data.gov.uk/
E.U. Government http://publicdata.eu/
The World Bank http://data.worldbank.org/
Freebase http://www.freebase.com/
Wikidata http://meta.wikimedia.org/wiki/Wikidata
Amazon Web Services http://aws.amazon.com/datasets
13. Data for Sale
Source Type URL
Gnip Social Media http://gnip.com/
AC Nielsen Media Usage http://www.nielsen.com/
Rapleaf Demographic http://www.rapleaf.com/
ESRI Geographic (GIS) http://www.esri.com/
eBay Auctions https://developer.ebay.com/
D&B Business Entities http://www.dnb.com/
Trulia Real Estate http://www.trulia.com/
Standard & Poor’s Financial http://standardandpoors.com/
22. Endless Inconsistencies
• Upper vs. lower case
• Date formats
• Times, time zones, 24h
• Missing values
• NULL vs. empty string vs. NA
• Variation in free format input
• 1 PATCH EVERY 24 HOURS
• Replace patches on skin daily
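A minimal normalization sketch for inconsistencies like these (mine, not from the deck); the canonical formats and missing-value markers are assumptions.

    # Normalize case, date formats, and missing-value markers before analysis.
    from datetime import datetime

    MISSING = {"", "NA", "N/A", "NULL", "null", None}
    DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%b-%y")   # assumed set of input formats

    def norm_text(value):
        return None if value in MISSING else value.strip().lower()

    def norm_date(value):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
            except (TypeError, ValueError):
                pass
        return None   # unparsable dates are flagged rather than guessed

    print(norm_text("  1 PATCH EVERY 24 HOURS  "), norm_date("06/01/2013"))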
23. Hadoop Strategies
• A validation script is ALWAYS the first step
• But not always enough
• We have known unknowns and unknown unknowns
24. Known Unknowns
• Script to (see the sketch after this list):
• Check the number of columns per row
• Validate not-null
• Validate data type (“is number”)
• Date constraints
• Other business logic
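A minimal sketch of such a validation script (mine, not from the deck); the column layout, delimiter, and the sample business rule are assumptions.

    # validate.py – checks each row and sends bad rows to stderr so the load can continue
    import csv, sys
    from datetime import datetime

    EXPECTED_COLUMNS = 5        # assumed layout: id, name, dose_mg, start_date, notes
    NOT_NULL = (0, 3)           # columns that must not be empty
    NUMERIC = (2,)              # columns that must parse as numbers

    def problems(row):
        if len(row) != EXPECTED_COLUMNS:
            return "wrong column count"
        if any(not row[i].strip() for i in NOT_NULL):
            return "null in required column"
        for i in NUMERIC:
            try:
                float(row[i])
            except ValueError:
                return "not a number: %s" % row[i]
        try:                     # sample business rule: start_date can't be in the future
            if datetime.strptime(row[3], "%Y-%m-%d") > datetime.now():
                return "date in the future"
        except ValueError:
            return "bad date: %s" % row[3]
        return None

    for row in csv.reader(sys.stdin, delimiter="\t"):
        err = problems(row)
        if err:
            sys.stderr.write("BAD (%s): %s\n" % (err, "\t".join(row)))
        else:
            print("\t".join(row))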
25. Unknown Unknowns
• Bad records will happen
• Your job should move on
• Use counters in the Hadoop job to count bad records
• Log errors
• Write bad records to a re-loadable file (see the mapper sketch below)
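A hedged sketch of that pattern for a Hadoop Streaming mapper in Python: bad records are counted with a job counter (Streaming increments counters from specially formatted stderr lines), logged, and tagged so a later step can split them into a re-loadable file. The parse rule and record width are assumptions.

    import sys

    def parse(line):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:                      # assumed record width
            raise ValueError("expected 5 columns, got %d" % len(fields))
        return fields

    for line in sys.stdin:
        try:
            fields = parse(line)
        except ValueError as err:
            # Hadoop Streaming increments job counters from lines like this on stderr
            sys.stderr.write("reporter:counter:DataQuality,BadRecords,1\n")
            # Log the error, and tag the record so a later step can collect the
            # bad rows into a re-loadable file
            sys.stderr.write("bad record (%s): %s" % (err, line))
            print("BAD\t" + line.rstrip("\n"))
            continue
        print("\t".join(fields))                  # good records pass through unchanged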
26. Solving Bad Data
• Can be done at many levels:
• Fix at source
• Improve acquisition process
• Pre-process before analysis
• Fix during analysis
• How many times will you analyze this data?
• 0,1, many, lots
27
28. Endless Possibilities
• MapReduce (in any language)
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
29. De-Identification
• Remove PII data
• Names, addresses, possibly more
• Remove columns
• Remove IDs *after* joins
• Hash (see the sketch below)
• Use partial data
• Create statistically similar fake data
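A minimal sketch of the hashing and partial-data ideas above (mine, not from the deck); the salt handling and field names are assumptions.

    # Replace direct identifiers with a salted hash so records can still be joined
    # on the hashed key.
    import hashlib, hmac

    SALT = b"load-from-a-secret-store-not-source-code"   # assumption: secret, per-project salt

    def pseudonymize(patient_id):
        # HMAC-SHA256 keyed with the salt; deterministic, so joins still work
        return hmac.new(SALT, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()

    record = {"patient_id": "12345", "zip": "94304", "dob": "1976-03-02"}
    record["patient_id"] = pseudonymize(record["patient_id"])
    record["zip"] = record["zip"][:3] + "xx"   # partial data: keep only the ZIP3 prefix
    print(record)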
30. 87% of the US population can be identified from gender, zip code and date of birth
31. Joins
• Do at source if possible
• Can be done with MapReduce (see the sketch below)
• Or with Hive (Hadoop SQL)
• Joins are expensive:
• Do once and store the results
• De-aggregate aggressively
• Everything a hospital knows about a patient
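A minimal sketch of a map-side (replicated) join, the pattern a Hadoop Streaming mapper can use when one side of the join is small enough to hold in memory; file names and field positions are assumptions.

    import sys

    # Small dimension table: zip code -> (latitude, longitude), loaded once per mapper
    zip_latlong = {}
    with open("zip_latlong.tsv") as f:
        for line in f:
            zipcode, lat, lon = line.rstrip("\n").split("\t")
            zip_latlong[zipcode] = (lat, lon)

    # Large fact stream on stdin, assumed layout: patient_id \t zip \t ...
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        lat, lon = zip_latlong.get(fields[1], ("", ""))
        print("\t".join(fields + [lat, lon]))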
36. FUSE-DFS
• Mount HDFS on the Oracle server:
    sudo yum install hadoop-0.20-fuse
    hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
• Use external tables to load the data into Oracle
38. Oracle Connectors
• SQL Connector for Hadoop
• Oracle Loader for Hadoop
• ODI with Hadoop
• OBIEE with Hadoop
• R connector for Hadoop
You don’t need BDA (the Oracle Big Data Appliance) to use them
39. Oracle Loader for Hadoop
• Kinda like SQL*Loader
• Data is on HDFS
• Runs as a MapReduce job
• Partitions, sorts, and converts the data into Oracle block format
• Appended to database tables
• Or written to Data Pump files for later load
40. Oracle SQL Connector for HDFS
• Data is in HDFS
• Connector creates external table
• That automatically matches Hadoop data
• Control degree of parallelism
• You know External Tables, right?
41. Data Types Supported
• Data Pump
• Delimited text
• Avro
• Regular expressions
• Custom formats
47. Measuring Data Load
• Disks: ~300 MB/s each
• SSD: ~1.6 GB/s each
• Network:
• ~100 MB/s (1 GbE)
• ~1 GB/s (10 GbE)
• ~4 GB/s (InfiniBand)
• CPU: 1 CPU-second per second per core
• Need to know: CPU-seconds per GB loaded
48. Let’s walk through this…
We have 5 TB to load. Each core provides 3,600 CPU-seconds per hour.
Loading 5,000 GB will take:
• With FUSE: 5,000 GB × 150 CPU-sec/GB = 750,000 CPU-sec ≈ 208 CPU-hours
• With the SQL Connector: 5,000 GB × 40 CPU-sec/GB = 200,000 CPU-sec ≈ 55 CPU-hours
Our X2-3 half rack has 84 cores, so with the SQL Connector that is roughly 40 minutes to load 5 TB at 100% CPU –
assuming you use Exadata (InfiniBand + SSD ≈ 8 TB/h load rate)
and use all CPUs for loading
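A minimal sketch reproducing the estimate above in Python; the CPU-seconds-per-GB figures come from the slide, everything else is arithmetic.

    data_gb = 5000            # 5 TB to load
    cores = 84                # Exadata X2-3 half rack
    cpu_sec_per_gb = {"FUSE-DFS": 150, "Oracle SQL Connector": 40}

    for method, cost in cpu_sec_per_gb.items():
        cpu_hours = data_gb * cost / 3600.0
        wall_hours = cpu_hours / cores            # assumes all cores busy loading
        print("%-22s %6.0f CPU-hours, ~%.0f minutes on %d cores"
              % (method, cpu_hours, wall_hours * 60, cores))

    # Prints ~208 CPU-hours (about 2.5 h of wall time) for FUSE and ~56 CPU-hours
    # (about 40 minutes) for the SQL Connector, assuming the network and disks
    # (InfiniBand + SSD) can keep up.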
49. Given fast enough network and disks, data loading will take all available CPU.
This is a good thing.
Data, especially from outside sources, is not in a perfect condition to be useful to your business. Not only does it need to be processed into useful formats, it also needs:
• Filtering for potentially useful information – 99% of everything is crap
• Statistical analysis – is this data significant?
• Integration with existing data
• Entity resolution – is “Oracle Corp” the same as “Oracle” and “Oracle Corporation”?
• De-duplication
Good processing and filtering of data can reduce the volume and variety of data. It is important to distinguish between true and accidental variety. This requires massive use of processing power. In a way, there is a trade-off between storage space and CPU: if you don’t invest CPU in filtering, de-duping and entity resolution, you’ll need more storage.
Oracle uses the acquire-organize-analyze model. We are looking at the acquire and organize phases in some detail.
Internal data sources are typically more valuable. Hadoop lets you utilize data that doesn’t make financial sense to load into an RDBMS. In a large enough organization, internal data becomes external – no control over quality, format, or changes.
Example: find out how far people live from the nearest doctor and pharmacy, using zip codes and a zip-code-to-longitude/latitude mapping.
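A minimal sketch for that example (mine, not from the deck): once records carry latitude/longitude, for instance via the join sketch earlier, the haversine formula gives the great-circle distance. The coordinates below are made up.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371.0 * asin(sqrt(a))      # Earth radius ~6371 km

    # Patient's zip-code centroid vs. the nearest pharmacy's centroid (assumed values)
    print("%.1f km" % haversine_km(37.44, -122.14, 37.48, -122.22))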
ESRI data is probably the most common – oil & gas, defense.
Inconsistent – data is correct, but has small formatting issues (1999 vs. 99, M vs. male, etc.). Invalid – format is correct, but something is wrong with the data (an update from 2036 or 1976). Corrupt – format completely unparsable. You can fix inconsistencies, identify invalid data, and throw out corrupt data.