SlideShare une entreprise Scribd logo
1  sur  50
1
Wrangling Data
With Oracle Connectors for Hadoop
Gwen Shapira, Solutions Architect
gshapira@cloudera.com
@gwenshap
Data Has Changed in the Last 30 YearsDATAGROWTH
END-USER
APPLICATIONS
THE INTERNET
MOBILE DEVICES
SOPHISTICATED
MACHINES
STRUCTURED DATA – 10%
1980 2013
UNSTRUCTURED DATA – 90%
Data is Messy
5
Hadoop Is…
• HDFS – Massive, redundant data storage
• Map-Reduce – Batch oriented data processing at scale
6
Hadoop Distributed
File System (HDFS)
Replicated
High Bandwidth
Clustered Storage
MapReduce
Distributed Computing
Framework
CORE HADOOP SYSTEM COMPONENTS
Hadoop and Databases
7
“Schema-on-Write” “Schema-on-Read”
 Schema must be created before any data
can be loaded
 An explicit load operation has to take place
which transforms data to DB internal
structure
 New columns must be added explicitly
 Data is simply copied to the file store, no
transformation is needed
 Serializer/Deserlizer is applied during read
time to extract the required columns
 New data can start flowing anytime and will
appear retroactively
1) Reads are Fast
2) Standards and Governance
PROS
1) Loads are Fast
2) Flexibility and Agility
Hadoop rocks Data Wrangling
• Cheap storage for messy data
• Tools to play with data:
• Acquire
• Clean
• Transform
• Flexibility where you need it most
8
Got unstructured data?
• Data Warehouse:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, ProtoBuffs, ORC, Parquet
• Compression
• Office, OpenDocument, iWorks
• PDF, Epup, RTF
• Midi, MP3
• JPEG, Tiff
• Java Classes
• Mbox, RFC822
• Autocad
• TrueType Parser
• HFD / NetCDF
9
10
What Data Wrangling Looks Like?
Source Acquire Clean Transform Load
11
Data Sources
• Internal
• OLTP
• Log files
• Documents
• Sensors / network events
• External:
• Geo-location
• Demographics
• Public data sets
• Websites
12
Free External Data
Name URL
U.S. Census Bureau http://factfinder2.census.gov/
U.S. Executive Branch http://www.data.gov/
U.K. Government http://data.gov.uk/
E.U. Government http://publicdata.eu/
The World Bank http://data.worldbank.org/
Freebase http://www.freebase.com/
Wikidata http://meta.wikimedia.org/wiki/Wikidata
Amazon Web Services http://aws.amazon.com/datasets
13
Data for Sell
Source Type URL
Gnip Social Media http://gnip.com/
AC Nielsen Media Usage http://www.nielsen.com/
Rapleaf Demographic http://www.rapleaf.com/
ESRI Geographic (GIS) http://www.esri.com/
eBay AucAon https://developer.ebay.com/
D&B Business Entities http://www.dnb.com/
Trulia Real Estate http://www.trulia.com/
Standard & Poor’s Financial http://standardandpoors.com/
14
Source Acquire Clean Transform Load
15
Getting Data into Hadopp
• Sqoop
• Flume
• Copy
• Write
• Scraping
• Data APIs
16
Sqoop Import Examples
• Sqoop import --connect
jdbc:oracle:thin:@//dbserver:1521/masterdb
--username hr --table emp
--where “start_date > ’01-01-2012’”
• Sqoop import
jdbc:oracle:thin:@//dbserver:1521/masterdb
--username myuser
--table shops --split-by shop_id
--num-mappers 16
Must be
indexed or
partitioned to
avoid 16 full
table scans
Or…
• Hadoop fs -put myfile.txt /big/project/myfile.txt
• curl –i list_of_urls.txt
• curl
https://api.twitter.com/1/users/show.json?screen_name=
cloudera
{ "id":16134540,
"name":"Cloudera",
"screen_name":"cloudera",
"location":"Palo Alto, CA",
"url":"http://www.cloudera.com”
"followers_count":11359 }
18
And even…
$cat scraper.py
import urllib
from BeautifulSoup import BeautifulSoup
txt = urllib.urlopen("http://
www.example.com/")
soup = BeautifulSoup(txt)
headings = soup.findAll("h2")
for heading in headings:
print heading.string
19
Source Acquire Clean Transform Load
20
Data Quality Issues
• Given enough data – quality issues are inevitable
• Main issues:
• Inconsistent – “99” instead of “1999”
• Invalid – last_update: 2036
• Corrupt - #$%&@*%@
21
22
Happy families are all alike.
Each unhappy family is unhappy
in its own way.
Endless Inconsistencies
• Upper vs. lower case
• Date formats
• Times, time zones, 24h
• Missing values
• NULL vs. empty string vs. NA
• Variation in free format input
• 1 PATCH EVERY 24 HOURS
• Replace patches on skin daily
23
Hadoop Strategies
• Validation script is
ALWAYS first step
• But not always enough
• We have
known unknowns and
unknowns unknowns
24
Known Unknowns
• Script to:
• Check number of columns per row
• Validate not-null
• Validate data type (“is number”)
• Date constraints
• Other business logic
25
Unknown Unknowns
• Bad records will happen
• Your job should move on
• Use counters in Hadoop job to count bad records
• Log errors
• Write bad records to re-loadable file
26
Solving Bad Data
• Can be done at many levels:
• Fix at source
• Improve acquisition process
• Pre-process before analysis
• Fix during analysis
• How many times will you analyze this data?
• 0,1, many, lots
27
Source Acquire Clean Transform Load
28
Endless Possibilities
• Map Reduce
(in any language)
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
29
De-Identification
• Remove PII data
• Names, addresses, possibly
more
• Remove columns
• Remove IDs *after* joins
• Hash
• Use partial data
• Create statistically similar
fake data
30
31
87% of US population
can be identified from
gender, zip code and date of birth
Joins
• Do at source if possible
• Can be done with MapReduce
• Or with Hive (Hadoop SQL )
• Joins are expensive:
• Do once and store results
• De-aggregate aggressively
• Everything a hospital knows about a patient
32
DataWrangler
33
Process Tips
• Keep track of data lineage
• Keep track of all changes to data
• Use source control for code
34
Source Acquire Clean Transform Load
35
Sqoop
sqoop export
--connect jdbc:mysql://db.example.com/foo
--table bar
--export-dir /results/bar_data
36
FUSE-DFS
• Mount HDFS on Oracle server:
• sudo yum install hadoop-0.20-fuse
• hadoop-fuse-dfs
dfs://<name_node_hostname>:<namenode_port>
<mount_point>
• Use external tables to load data into Oracle
37
38
That’s nice.
But can you load data FAST?
Oracle Connectors
• SQL Connector for Hadoop
• Oracle Loader for Hadoop
• ODI with Hadoop
• OBIEE with Hadoop
• R connector for Hadoop
You don’t need BDA
39
Oracle Loader for Hadoop
• Kinda like SQL Loader
• Data is on HDFS
• Runs as Map-Reduce job
• Partitions, sorts, converts format to Oracle Blocks
• Appended to database tables
• Or written to Data Pump files for later load
40
Oracle SQL Connector for HDFS
• Data is in HDFS
• Connector creates external table
• That automatically matches Hadoop data
• Control degree of parallelism
• You know External Tables, right?
41
Data Types Supported
• Data Pump
• Delimited text
• Avro
• Regular expressions
• Custom formats
43
44
Main Benefit:
Processing is done in Hadoop
Benefits
• High performance
• Reduce CPU usage on Database
• Automatic optimizations:
• Partitions
• Sort
• Load balance
45
Measuring Data Load
46
Concerns
How much time?
How much CPU?
Bottlenecks
Disk
CPU
Network
I Know What This Means:
47
What does this mean?
48
Measuring Data Load
• Disks: ~300MB /s each
• SSD: ~ 1.6 GB/s each
• Network:
• ~ 100MB/s (1gE)
• ~ 1GB/s (10gE)
• ~ 4GB/s (IB)
• CPU: 1 CPU second per second per core.
• Need to know: CPU seconds per GB
49
Lets walk through this…
We have 5TB to load
Each core: 3600 seconds per hour
5000GB will take:
With Fuse: 5000*150 cpu-sec = 750000/3600 = 208 cpu-hours
With SQL Connector: 5000 * 40 = 55 cpu-hours
Our X2-3 half rack has 84 cores.
So, around 30 minutes to load 5TB at 100% CPU.
Assuming you use Exadata (Infiniband + SSD = 8TB/h load rate)
And use all CPUs for loading
50
51
Given fast enough network and disks,
data loading will take all available CPU
This is a good thing
52

Contenu connexe

Tendances

Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive TuningAdam Muise
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRyan Bosshart
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...DataWorks Summit/Hadoop Summit
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keownCisco Canada
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDataWorks Summit
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInDataWorks Summit
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleDataWorks Summit/Hadoop Summit
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Lucidworks
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Uwe Printz
 

Tendances (20)

Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning2013 July 23 Toronto Hadoop User Group Hive Tuning
2013 July 23 Toronto Hadoop User Group Hive Tuning
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
How to overcome mysterious problems caused by large and multi-tenancy Hadoop ...
 
HDFS tiered storage
HDFS tiered storageHDFS tiered storage
HDFS tiered storage
 
Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale Time-oriented event search. A new level of scale
Time-oriented event search. A new level of scale
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
 
Cisco connect toronto 2015 big data sean mc keown
Cisco connect toronto 2015 big data  sean mc keownCisco connect toronto 2015 big data  sean mc keown
Cisco connect toronto 2015 big data sean mc keown
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
 
Deep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profitDeep Learning using Spark and DL4J for fun and profit
Deep Learning using Spark and DL4J for fun and profit
 
Scaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedInScaling Deep Learning on Hadoop at LinkedIn
Scaling Deep Learning on Hadoop at LinkedIn
 
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleManaging Hadoop, HBase and Storm Clusters at Yahoo Scale
Managing Hadoop, HBase and Storm Clusters at Yahoo Scale
 
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
Near Real Time Indexing Kafka Messages into Apache Blur: Presented by Dibyend...
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

En vedette

The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionInside Analysis
 
Informatica object migration
Informatica object migrationInformatica object migration
Informatica object migrationAmit Sharma
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryInside Analysis
 
Impact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherenceImpact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherenceSkillet Tony
 
Informatica Power Center 7.1
Informatica Power Center 7.1Informatica Power Center 7.1
Informatica Power Center 7.1ganblues
 
Real time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudReal time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudNeeraj Sabharwal
 
Tune up your data science process
Tune up your data science processTune up your data science process
Tune up your data science processBenjamin Skrainka
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsPat Patterson
 
Analysis, data & process modeling
Analysis, data & process modelingAnalysis, data & process modeling
Analysis, data & process modelingChi D. Nguyen
 
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"Naoto MATSUMOTO
 
Cross border - off-shoring and outsourcing privacy sensitive data
Cross border - off-shoring and outsourcing privacy sensitive dataCross border - off-shoring and outsourcing privacy sensitive data
Cross border - off-shoring and outsourcing privacy sensitive dataUlf Mattsson
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabadKelly Technologies
 
Statistical analysis of process data 7 stages oil flow chart power point temp...
Statistical analysis of process data 7 stages oil flow chart power point temp...Statistical analysis of process data 7 stages oil flow chart power point temp...
Statistical analysis of process data 7 stages oil flow chart power point temp...SlideTeam.net
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesArvind Prabhakar
 
Data Science and Goodhart's Law
Data Science and Goodhart's LawData Science and Goodhart's Law
Data Science and Goodhart's LawDomino Data Lab
 
DataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census dataDataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census dataRitvvij Parrikh
 
Informatica to ODI Migration – What, Why and How | Informatica to Oracle Dat...
Informatica to ODI Migration – What, Why and How |  Informatica to Oracle Dat...Informatica to ODI Migration – What, Why and How |  Informatica to Oracle Dat...
Informatica to ODI Migration – What, Why and How | Informatica to Oracle Dat...Jade Global
 

En vedette (20)

The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop Adoption
 
Informatica object migration
Informatica object migrationInformatica object migration
Informatica object migration
 
Real time analytics in Big Data
Real time analytics in Big DataReal time analytics in Big Data
Real time analytics in Big Data
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data Discovery
 
Informatica session
Informatica sessionInformatica session
Informatica session
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Impact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherenceImpact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherence
 
Informatica Power Center 7.1
Informatica Power Center 7.1Informatica Power Center 7.1
Informatica Power Center 7.1
 
Real time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid CloudReal time data ingestion and Hybrid Cloud
Real time data ingestion and Hybrid Cloud
 
Tune up your data science process
Tune up your data science processTune up your data science process
Tune up your data science process
 
Building Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSetsBuilding Data Pipelines with Spark and StreamSets
Building Data Pipelines with Spark and StreamSets
 
Analysis, data & process modeling
Analysis, data & process modelingAnalysis, data & process modeling
Analysis, data & process modeling
 
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
OUR GOAL AND FOCUS FOR "OPEN FOG CONSORTIUM"
 
Cross border - off-shoring and outsourcing privacy sensitive data
Cross border - off-shoring and outsourcing privacy sensitive dataCross border - off-shoring and outsourcing privacy sensitive data
Cross border - off-shoring and outsourcing privacy sensitive data
 
Data science training in hyderabad
Data science training in hyderabadData science training in hyderabad
Data science training in hyderabad
 
Statistical analysis of process data 7 stages oil flow chart power point temp...
Statistical analysis of process data 7 stages oil flow chart power point temp...Statistical analysis of process data 7 stages oil flow chart power point temp...
Statistical analysis of process data 7 stages oil flow chart power point temp...
 
Building Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion PipelinesBuilding Continuously Curated Ingestion Pipelines
Building Continuously Curated Ingestion Pipelines
 
Data Science and Goodhart's Law
Data Science and Goodhart's LawData Science and Goodhart's Law
Data Science and Goodhart's Law
 
DataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census dataDataMeet 4: Data cleaning & census data
DataMeet 4: Data cleaning & census data
 
Informatica to ODI Migration – What, Why and How | Informatica to Oracle Dat...
Informatica to ODI Migration – What, Why and How |  Informatica to Oracle Dat...Informatica to ODI Migration – What, Why and How |  Informatica to Oracle Dat...
Informatica to ODI Migration – What, Why and How | Informatica to Oracle Dat...
 

Similaire à Wrangling Data with Oracle Connectors for Hadoop

Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewAbhishek Roy
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and OracleTanel Poder
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloudgluent.
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...Rittman Analytics
 

Similaire à Wrangling Data with Oracle Connectors for Hadoop (20)

Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Hadoop Master Class : A concise overview
Hadoop Master Class : A concise overviewHadoop Master Class : A concise overview
Hadoop Master Class : A concise overview
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Hadoop
HadoopHadoop
Hadoop
 
Intro to Big Data
Intro to Big DataIntro to Big Data
Intro to Big Data
 
Apache drill
Apache drillApache drill
Apache drill
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 

Plus de Gwen (Chen) Shapira

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep DiveGwen (Chen) Shapira
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote Gwen (Chen) Shapira
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGwen (Chen) Shapira
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Gwen (Chen) Shapira
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebookGwen (Chen) Shapira
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Gwen (Chen) Shapira
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupGwen (Chen) Shapira
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Gwen (Chen) Shapira
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings MeetupGwen (Chen) Shapira
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereGwen (Chen) Shapira
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersGwen (Chen) Shapira
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupGwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupGwen (Chen) Shapira
 

Plus de Gwen (Chen) Shapira (20)

Velocity 2019 - Kafka Operations Deep Dive
Velocity 2019  - Kafka Operations Deep DiveVelocity 2019  - Kafka Operations Deep Dive
Velocity 2019 - Kafka Operations Deep Dive
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote Lies Enterprise Architects Tell - Data Day Texas 2018  Keynote
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
 
Gluecon - Kafka and the service mesh
Gluecon - Kafka and the service meshGluecon - Kafka and the service mesh
Gluecon - Kafka and the service mesh
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
 
Papers we love realtime at facebook
Papers we love   realtime at facebookPapers we love   realtime at facebook
Papers we love realtime at facebook
 
Kafka reliability velocity 17
Kafka reliability   velocity 17Kafka reliability   velocity 17
Kafka reliability velocity 17
 
Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017Multi-Datacenter Kafka - Strata San Jose 2017
Multi-Datacenter Kafka - Strata San Jose 2017
 
Streaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data MeetupStreaming Data Integration - For Women in Big Data Meetup
Streaming Data Integration - For Women in Big Data Meetup
 
Kafka at scale facebook israel
Kafka at scale   facebook israelKafka at scale   facebook israel
Kafka at scale facebook israel
 
Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016Kafka connect-london-meetup-2016
Kafka connect-london-meetup-2016
 
Fraud Detection for Israel BigThings Meetup
Fraud Detection  for Israel BigThings MeetupFraud Detection  for Israel BigThings Meetup
Fraud Detection for Israel BigThings Meetup
 
Kafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn Meetup
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 

Dernier

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 

Dernier (20)

Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

Wrangling Data with Oracle Connectors for Hadoop

  • 1. 1 Wrangling Data With Oracle Connectors for Hadoop Gwen Shapira, Solutions Architect gshapira@cloudera.com @gwenshap
  • 2. Data Has Changed in the Last 30 YearsDATAGROWTH END-USER APPLICATIONS THE INTERNET MOBILE DEVICES SOPHISTICATED MACHINES STRUCTURED DATA – 10% 1980 2013 UNSTRUCTURED DATA – 90%
  • 4. 5
  • 5. Hadoop Is… • HDFS – Massive, redundant data storage • Map-Reduce – Batch oriented data processing at scale 6 Hadoop Distributed File System (HDFS) Replicated High Bandwidth Clustered Storage MapReduce Distributed Computing Framework CORE HADOOP SYSTEM COMPONENTS
  • 6. Hadoop and Databases 7 “Schema-on-Write” “Schema-on-Read”  Schema must be created before any data can be loaded  An explicit load operation has to take place which transforms data to DB internal structure  New columns must be added explicitly  Data is simply copied to the file store, no transformation is needed  Serializer/Deserlizer is applied during read time to extract the required columns  New data can start flowing anytime and will appear retroactively 1) Reads are Fast 2) Standards and Governance PROS 1) Loads are Fast 2) Flexibility and Agility
  • 7. Hadoop rocks Data Wrangling • Cheap storage for messy data • Tools to play with data: • Acquire • Clean • Transform • Flexibility where you need it most 8
  • 8. Got unstructured data? • Data Warehouse: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache Logs • Avro, ProtoBuffs, ORC, Parquet • Compression • Office, OpenDocument, iWorks • PDF, Epup, RTF • Midi, MP3 • JPEG, Tiff • Java Classes • Mbox, RFC822 • Autocad • TrueType Parser • HFD / NetCDF 9
  • 9. 10
  • 10. What Data Wrangling Looks Like? Source Acquire Clean Transform Load 11
  • 11. Data Sources • Internal • OLTP • Log files • Documents • Sensors / network events • External: • Geo-location • Demographics • Public data sets • Websites 12
  • 12. Free External Data Name URL U.S. Census Bureau http://factfinder2.census.gov/ U.S. Executive Branch http://www.data.gov/ U.K. Government http://data.gov.uk/ E.U. Government http://publicdata.eu/ The World Bank http://data.worldbank.org/ Freebase http://www.freebase.com/ Wikidata http://meta.wikimedia.org/wiki/Wikidata Amazon Web Services http://aws.amazon.com/datasets 13
  • 13. Data for Sell Source Type URL Gnip Social Media http://gnip.com/ AC Nielsen Media Usage http://www.nielsen.com/ Rapleaf Demographic http://www.rapleaf.com/ ESRI Geographic (GIS) http://www.esri.com/ eBay AucAon https://developer.ebay.com/ D&B Business Entities http://www.dnb.com/ Trulia Real Estate http://www.trulia.com/ Standard & Poor’s Financial http://standardandpoors.com/ 14
  • 14. Source Acquire Clean Transform Load 15
  • 15. Getting Data into Hadopp • Sqoop • Flume • Copy • Write • Scraping • Data APIs 16
  • 16. Sqoop Import Examples • Sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where “start_date > ’01-01-2012’” • Sqoop import jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16 Must be indexed or partitioned to avoid 16 full table scans
  • 17. Or… • Hadoop fs -put myfile.txt /big/project/myfile.txt • curl –i list_of_urls.txt • curl https://api.twitter.com/1/users/show.json?screen_name= cloudera { "id":16134540, "name":"Cloudera", "screen_name":"cloudera", "location":"Palo Alto, CA", "url":"http://www.cloudera.com” "followers_count":11359 } 18
  • 18. And even… $cat scraper.py import urllib from BeautifulSoup import BeautifulSoup txt = urllib.urlopen("http:// www.example.com/") soup = BeautifulSoup(txt) headings = soup.findAll("h2") for heading in headings: print heading.string 19
  • 19. Source Acquire Clean Transform Load 20
  • 20. Data Quality Issues • Given enough data – quality issues are inevitable • Main issues: • Inconsistent – “99” instead of “1999” • Invalid – last_update: 2036 • Corrupt - #$%&@*%@ 21
  • 21. 22 Happy families are all alike. Each unhappy family is unhappy in its own way.
  • 22. Endless Inconsistencies • Upper vs. lower case • Date formats • Times, time zones, 24h • Missing values • NULL vs. empty string vs. NA • Variation in free format input • 1 PATCH EVERY 24 HOURS • Replace patches on skin daily 23
  • 23. Hadoop Strategies • Validation script is ALWAYS first step • But not always enough • We have known unknowns and unknowns unknowns 24
  • 24. Known Unknowns • Script to: • Check number of columns per row • Validate not-null • Validate data type (“is number”) • Date constraints • Other business logic 25
  • 25. Unknown Unknowns • Bad records will happen • Your job should move on • Use counters in Hadoop job to count bad records • Log errors • Write bad records to re-loadable file 26
  • 26. Solving Bad Data • Can be done at many levels: • Fix at source • Improve acquisition process • Pre-process before analysis • Fix during analysis • How many times will you analyze this data? • 0,1, many, lots 27
  • 27. Source Acquire Clean Transform Load 28
  • 28. Endless Possibilities • Map Reduce (in any language) • Hive (i.e. SQL) • Pig • R • Shell scripts • Plain old Java 29
  • 29. De-Identification • Remove PII data • Names, addresses, possibly more • Remove columns • Remove IDs *after* joins • Hash • Use partial data • Create statistically similar fake data 30
  • 30. 31 87% of US population can be identified from gender, zip code and date of birth
  • 31. Joins • Do at source if possible • Can be done with MapReduce • Or with Hive (Hadoop SQL ) • Joins are expensive: • Do once and store results • De-aggregate aggressively • Everything a hospital knows about a patient 32
  • 33. Process Tips • Keep track of data lineage • Keep track of all changes to data • Use source control for code 34
  • 34. Source Acquire Clean Transform Load 35
  • 36. FUSE-DFS • Mount HDFS on Oracle server: • sudo yum install hadoop-0.20-fuse • hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> • Use external tables to load data into Oracle 37
  • 37. 38 That’s nice. But can you load data FAST?
  • 38. Oracle Connectors • SQL Connector for Hadoop • Oracle Loader for Hadoop • ODI with Hadoop • OBIEE with Hadoop • R connector for Hadoop You don’t need BDA 39
  • 39. Oracle Loader for Hadoop • Kinda like SQL Loader • Data is on HDFS • Runs as Map-Reduce job • Partitions, sorts, converts format to Oracle Blocks • Appended to database tables • Or written to Data Pump files for later load 40
  • 40. Oracle SQL Connector for HDFS • Data is in HDFS • Connector creates external table • That automatically matches Hadoop data • Control degree of parallelism • You know External Tables, right? 41
  • 41. Data Types Supported • Data Pump • Delimited text • Avro • Regular expressions • Custom formats 43
  • 43. Benefits • High performance • Reduce CPU usage on Database • Automatic optimizations: • Partitions • Sort • Load balance 45
  • 44. Measuring Data Load 46 Concerns How much time? How much CPU? Bottlenecks Disk CPU Network
  • 45. I Know What This Means: 47
  • 46. What does this mean? 48
  • 47. Measuring Data Load • Disks: ~300MB /s each • SSD: ~ 1.6 GB/s each • Network: • ~ 100MB/s (1gE) • ~ 1GB/s (10gE) • ~ 4GB/s (IB) • CPU: 1 CPU second per second per core. • Need to know: CPU seconds per GB 49
  • 48. Lets walk through this… We have 5TB to load Each core: 3600 seconds per hour 5000GB will take: With Fuse: 5000*150 cpu-sec = 750000/3600 = 208 cpu-hours With SQL Connector: 5000 * 40 = 55 cpu-hours Our X2-3 half rack has 84 cores. So, around 30 minutes to load 5TB at 100% CPU. Assuming you use Exadata (Infiniband + SSD = 8TB/h load rate) And use all CPUs for loading 50
  • 49. 51 Given fast enough network and disks, data loading will take all available CPU This is a good thing
  • 50. 52

Notes de l'éditeur

  1. Data, especially from outside sources is not in a perfect condition to be useful to your business.Not only does it need to be processed into useful formats, it also needs:Filtering for potentially useful information. 99% of everything is crapStatistical analysis – is this data significant?Integration with existing dataEntity resolution. Is “Oracle Corp” the same as “Oracle” and “Oracle Corporation”? De-DuplicationGood processing and filtering of data can reduce the volume and variety of data. It is important to distinguish between true and accidental variety.This requires massive use of processing power. In a way, there is a trade-off between storage space and CPU. If you don’t invest CPU in filtering, de-duping and entity resolution – you’ll need more storage.
  2. Oracle uses acquire-organize-analyze model. We are looking at acquire and organize phases in some details.
  3. Internal data sources are typically more valuable.Hadoop lets you utilize data that doesn’t make financial sense to load to RDBMSIn large enough organization, internal data becomes external – no control over quality, format, changes.
  4. Example: Find our how far people live from nearest doctor and pharmacy. Using zipcodes and zipcode-long/lat mapping.
  5. ESRI data is probably the most common. Oil&amp;gas, defense.
  6. Oracle uses acquire-organize-analyze model. We are looking at acquire and organize phases in some details.
  7. Oracle uses acquire-organize-analyze model. We are looking at acquire and organize phases in some details.
  8. Inconsistent – data is correct, but has small formatting issues (1999 vs. 99. M vs. male, etc)Invalid – format is correct, but something is wrong with the data (update from 2036 or 1976)Corrupt – format completely unparsable.You can fix inconsistencies, identify invalid data and throw out corrupt data.
  9. Oracle uses acquire-organize-analyze model. We are looking at acquire and organize phases in some details.
  10. Oracle uses acquire-organize-analyze model. We are looking at acquire and organize phases in some details.