Wrangling Data
With Oracle Connectors for Hadoop
Gwen Shapira, Solutions Architect
gshapira@cloudera.com
@gwenshap
Data Has Changed in the Last 30 Years
[Chart: data growth from 1980 to 2013, driven by end-user applications, the Internet, mobile devices, and sophisticated machines. Structured data: 10%. Unstructured data: 90%.]
Data is Messy
5
Hadoop Is…
• HDFS – Massive, redundant data storage
• Map-Reduce – Batch oriented data processing at scale
6
CORE HADOOP SYSTEM COMPONENTS
• Hadoop Distributed File System (HDFS): replicated, high-bandwidth, clustered storage
• MapReduce: distributed computing framework
Hadoop and Databases
7
“Schema-on-Write”
• Schema must be created before any data can be loaded
• An explicit load operation has to take place which transforms data to DB internal structure
• New columns must be added explicitly
PROS
1) Reads are Fast
2) Standards and Governance
“Schema-on-Read”
• Data is simply copied to the file store, no transformation is needed
• Serializer/Deserializer is applied during read time to extract the required columns
• New data can start flowing anytime and will appear retroactively
PROS
1) Loads are Fast
2) Flexibility and Agility
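To make the schema-on-read side concrete, here is a minimal sketch in plain Python. It is not Hive or any Hadoop SerDe; the sample data, the column names, and the `read_with_schema` helper are all invented for illustration. The point it shows is that the stored file never changes, and a column added to the reader later appears "retroactively" for data that was already there.

```python
import csv
import io

# Raw file content exactly as it was copied into storage; no schema was declared.
RAW = "1,alice,2013-01-05\n2,bob,2013-02-11,gold\n"

def read_with_schema(raw_text, columns):
    """Apply a column list at read time; missing trailing fields become None."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        padded = fields + [None] * (len(columns) - len(fields))
        rows.append(dict(zip(columns, padded)))  # extra fields are ignored
    return rows

# The first reader only knows about three columns.
v1 = read_with_schema(RAW, ["id", "name", "signup_date"])
# A later reader adds "tier"; old files are re-read, not re-loaded.
v2 = read_with_schema(RAW, ["id", "name", "signup_date", "tier"])
print(v2[1]["tier"])
```

The cost of this flexibility is that every read pays for parsing, which is why the schema-on-write column lists fast reads as its advantage.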
Hadoop rocks Data Wrangling
• Cheap storage for messy data
• Tools to play with data:
• Acquire
• Clean
• Transform
• Flexibility where you need it most
8
Got unstructured data?
• Data Warehouse:
• Text
• CSV
• XLS
• XML
• Hadoop:
• HTML
• XML, RSS
• JSON
• Apache Logs
• Avro, Protocol Buffers, ORC, Parquet
• Compression
• Office, OpenDocument, iWork
• PDF, EPUB, RTF
• MIDI, MP3
• JPEG, TIFF
• Java Classes
• Mbox, RFC822
• AutoCAD
• TrueType Parser
• HDF / NetCDF
9
10
What Does Data Wrangling Look Like?
Source Acquire Clean Transform Load
11
Data Sources
• Internal
• OLTP
• Log files
• Documents
• Sensors / network events
• External:
• Geo-location
• Demographics
• Public data sets
• Websites
12
Free External Data
Name URL
U.S. Census Bureau http://factfinder2.census.gov/
U.S. Executive Branch http://www.data.gov/
U.K. Government http://data.gov.uk/
E.U. Government http://publicdata.eu/
The World Bank http://data.worldbank.org/
Freebase http://www.freebase.com/
Wikidata http://meta.wikimedia.org/wiki/Wikidata
Amazon Web Services http://aws.amazon.com/datasets
13
Data for Sale
Source Type URL
Gnip Social Media http://gnip.com/
AC Nielsen Media Usage http://www.nielsen.com/
Rapleaf Demographic http://www.rapleaf.com/
ESRI Geographic (GIS) http://www.esri.com/
eBay Auction https://developer.ebay.com/
D&B Business Entities http://www.dnb.com/
Trulia Real Estate http://www.trulia.com/
Standard & Poor’s Financial http://standardandpoors.com/
14
Source Acquire Clean Transform Load
15
Getting Data into Hadoop
• Sqoop
• Flume
• Copy
• Write
• Scraping
• Data APIs
16
Sqoop Import Examples
• sqoop import --connect
jdbc:oracle:thin:@//dbserver:1521/masterdb
--username hr --table emp
--where "start_date > '01-01-2012'"
• sqoop import --connect
jdbc:oracle:thin:@//dbserver:1521/masterdb
--username myuser
--table shops --split-by shop_id
--num-mappers 16
Note: the --split-by column must be indexed or partitioned to avoid 16 full table scans
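Why the split column matters: as far as I understand it, Sqoop looks up the minimum and maximum of the split column and hands each mapper one slice of that range, so every mapper issues its own range query. The sketch below is a simplified illustration of that idea in Python, not Sqoop's actual code. The `split_ranges` helper and the generated SQL text are assumptions made up for the example.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into contiguous, non-overlapping integer ranges."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    lo = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than distinct key values
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Each range becomes one mapper's query (illustrative SQL, 4 mappers for brevity).
for lo, hi in split_ranges(1, 1000, 4):
    print(f"SELECT * FROM shops WHERE shop_id >= {lo} AND shop_id <= {hi}")
```

Without an index or partitioning on the split column, each of those range predicates can only be answered by scanning the whole table, once per mapper.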
Or…
• hadoop fs -put myfile.txt /big/project/myfile.txt
• wget -i list_of_urls.txt
• curl "https://api.twitter.com/1/users/show.json?screen_name=cloudera"
{ "id":16134540,
"name":"Cloudera",
"screen_name":"cloudera",
"location":"Palo Alto, CA",
"url":"http://www.cloudera.com",
"followers_count":11359 }
18
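Once an API response like the one above is on disk, it usually has to be flattened before it is useful in a delimited-text pipeline. A minimal sketch, assuming the response has already been saved as text; the choice of fields and the tab delimiter are illustrative assumptions, not part of the original talk.

```python
import json

# Sample response text, matching the fields shown in the curl example.
response_text = """
{ "id": 16134540,
  "name": "Cloudera",
  "screen_name": "cloudera",
  "location": "Palo Alto, CA",
  "url": "http://www.cloudera.com",
  "followers_count": 11359 }
"""

record = json.loads(response_text)

# Keep a few fields and emit one tab-delimited line, ready for `hadoop fs -put`.
fields = ("id", "screen_name", "location", "followers_count")
line = "\t".join(str(record[name]) for name in fields)
print(line)
```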
And even…
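$ cat scraper.py
# Python 3 version; needs the beautifulsoup4 package (bs4)
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and print the text of every <h2> heading
html = urlopen("http://www.example.com/").read()
soup = BeautifulSoup(html, "html.parser")
for heading in soup.find_all("h2"):
    print(heading.string)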
19
Source Acquire Clean Transform Load
20
Data Quality Issues
• Given enough data – quality issues are inevitable
• Main issues:
• Inconsistent – “99” instead of “1999”
• Invalid – last_update: 2036
• Corrupt - #$%&@*%@
21
22
Happy families are all alike.
Each unhappy family is unhappy
in its own way.
Endless Inconsistencies
• Upper vs. lower case
• Date formats
• Times, time zones, 24h
• Missing values
• NULL vs. empty string vs. NA
• Variation in free format input
• 1 PATCH EVERY 24 HOURS
• Replace patches on skin daily
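A small normalization pass handles the mechanical inconsistencies in this list (case, NULL spellings, date formats). The sketch below assumes a fixed set of null tokens and three input date formats; both lists are examples, and real data will need its own. Free-format text such as the two patch instructions is not something a few lines of code can reconcile.

```python
from datetime import datetime

NULL_TOKENS = {"", "null", "na", "n/a", "none"}        # assumed spellings of "missing"
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")    # assumed input formats

def normalize_text(value):
    """Trim, lower-case, and map the many spellings of 'missing' to None."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return None if cleaned in NULL_TOKENS else cleaned

def normalize_date(value):
    """Try each known format in order; return an ISO date string or None."""
    cleaned = normalize_text(value)
    if cleaned is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognized format: leave it for the bad-record path

print(normalize_date("01/05/2013"), normalize_text(" NA "))
```

Note that the format order is itself a decision: "01/05/2013" is read here as January 5, which is only right if the source really uses month-first dates.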
23
Hadoop Strategies
• Validation script is
ALWAYS first step
• But not always enough
• We have
known unknowns and
unknown unknowns
24
Known Unknowns
• Script to:
• Check number of columns per row
• Validate not-null
• Validate data type (“is number”)
• Date constraints
• Other business logic
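The checks listed above can be sketched as a single row validator. The four-column layout (id, name, amount, last_update) and the date format are hypothetical, chosen only so each rule has something to test; business-logic rules would be added the same way.

```python
from datetime import date, datetime

EXPECTED_COLUMNS = 4  # hypothetical layout: id, name, amount, last_update

def validate_row(fields, today=None):
    """Return a list of rule violations for one delimited record."""
    today = today or date.today()
    # Column count first: the other checks make no sense on a misaligned row.
    if len(fields) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}"]
    errors = []
    row_id, name, amount, last_update = fields
    # Not-null checks
    for label, value in (("id", row_id), ("name", name)):
        if not value.strip():
            errors.append(f"{label} is null")
    # Data type check ("is number")
    try:
        float(amount)
    except ValueError:
        errors.append("amount is not a number")
    # Date constraint: must parse and must not be in the future
    try:
        updated = datetime.strptime(last_update, "%Y-%m-%d").date()
        if updated > today:
            errors.append("last_update is in the future")
    except ValueError:
        errors.append("last_update is not a valid date")
    return errors

print(validate_row(["7", "shop", "12.50", "2036-01-01"], today=date(2013, 6, 1)))
```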
25
Unknown Unknowns
• Bad records will happen
• Your job should move on
• Use counters in Hadoop job to count bad records
• Log errors
• Write bad records to re-loadable file
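The same pattern in miniature: catch the failure, count it, log it, save the raw record, and keep going. This is plain Python standing in for a Hadoop job, with `collections.Counter` playing the role of Hadoop counters; the `process` and `parse_amount` helpers and the output file name are invented for the example.

```python
import logging
import os
import tempfile
from collections import Counter

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("loader")

def process(lines, parse, bad_path):
    """Parse each line; count, log, and save failures instead of aborting."""
    counters = Counter()
    good = []
    with open(bad_path, "w", encoding="utf-8") as bad_file:
        for line_no, line in enumerate(lines, start=1):
            try:
                good.append(parse(line))
                counters["good"] += 1
            except Exception as exc:  # unknown unknowns: catch broadly on purpose
                counters["bad"] += 1
                log.warning("line %d rejected: %s", line_no, exc)
                bad_file.write(line.rstrip("\n") + "\n")  # raw record, re-loadable
    return good, counters

def parse_amount(line):
    """Hypothetical parser: expects 'name,amount' with a numeric amount."""
    name, amount = line.split(",")
    return name, float(amount)

bad_path = os.path.join(tempfile.gettempdir(), "bad_records.txt")
rows, counts = process(["a,1.5", "b,oops", "c"], parse_amount, bad_path)
print(counts)
```

Because rejected records are written unmodified, the bad-record file can be fixed by hand or by a later script and fed back through the same job.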
26
Solving Bad Data
• Can be done at many levels:
• Fix at source
• Improve acquisition process
• Pre-process before analysis
• Fix during analysis
• How many times will you analyze this data?
• 0, 1, many, lots
27
Source Acquire Clean Transform Load
28
Endless Possibilities
• Map Reduce
(in any language)
• Hive (i.e. SQL)
• Pig
• R
• Shell scripts
• Plain old Java
29
De-Identification
• Remove PII data
• Names, addresses, possibly
more
• Remove columns
• Remove IDs *after* joins
• Hash
• Use partial data
• Create statistically similar
fake data
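Three of these techniques (remove columns, hash, use partial data) fit in a few lines. The sketch below uses a keyed hash (HMAC-SHA256) rather than a bare hash, which is my choice rather than anything the slide specifies: a plain hash of a small ID space is easy to reverse by brute force. The column names, the record layout, and the key are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key; keep it out of source control
PII_COLUMNS = {"name", "address"}           # assumed column names

def pseudonymize(value, key=SECRET_KEY):
    """Keyed hash: stable across tables, so joins still work after hashing."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def de_identify(record):
    """Drop PII columns, hash the ID, and keep only part of the zip code."""
    cleaned = {k: v for k, v in record.items() if k not in PII_COLUMNS}
    cleaned["patient_id"] = pseudonymize(record["patient_id"])
    cleaned["zip"] = record["zip"][:3]  # partial data: first three digits only
    return cleaned

sample = {"patient_id": "42", "name": "Ann", "address": "1 Main St",
          "zip": "94301", "dx": "flu"}
print(de_identify(sample))
```

Since the same input always maps to the same digest, hashing consistently is an alternative to removing IDs only after joins; dropping the ID entirely remains the stronger option when no further joins are needed.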
30
31
87% of US population
can be identified from
gender, zip code and date of birth
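A quick way to see how exposed a data set is to this kind of re-identification is to count how many records are unique on the quasi-identifier columns. A minimal sketch; the field names and the four sample records are made up, and the result says nothing about the 87% figure itself.

```python
from collections import Counter

def unique_fraction(records, keys=("gender", "zip", "dob")):
    """Fraction of records whose quasi-identifier combination occurs exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)

people = [
    {"gender": "F", "zip": "94301", "dob": "1980-02-01"},
    {"gender": "F", "zip": "94301", "dob": "1980-02-01"},
    {"gender": "M", "zip": "94301", "dob": "1975-07-30"},
    {"gender": "F", "zip": "10001", "dob": "1990-11-12"},
]
print(unique_fraction(people))
```

Coarsening a column (year of birth instead of full date, three-digit zip) and re-running the check shows how much each generalization helps.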
Joins
• Do at source if possible
• Can be done with MapReduce
• Or with Hive (Hadoop SQL)
• Joins are expensive:
• Do once and store results
• De-aggregate aggressively
• Everything a hospital knows about a patient
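For a sense of what a join costs in MapReduce, here is a reduce-side join sketched in plain Python. It mimics the map, shuffle, and reduce steps in memory and is not Hadoop code; the `patients` and `visits` data sets and their fields are hypothetical.

```python
from collections import defaultdict

def map_phase(patients, visits):
    """Tag every record with its source and emit (join_key, tagged_record)."""
    for p in patients:
        yield p["patient_id"], ("patient", p)
    for v in visits:
        yield v["patient_id"], ("visit", v)

def reduce_phase(pairs):
    """Group by key (the shuffle), then pair each patient with its visits."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    for key, tagged_records in groups.items():
        left = [r for tag, r in tagged_records if tag == "patient"]
        right = [r for tag, r in tagged_records if tag == "visit"]
        for p in left:
            for v in right:
                yield {**p, **v}  # inner join: keys with no match emit nothing

patients = [{"patient_id": 1, "name": "A"}, {"patient_id": 2, "name": "B"}]
visits = [{"patient_id": 1, "visit": "x"}, {"patient_id": 1, "visit": "y"}]
joined = list(reduce_phase(map_phase(patients, visits)))
print(joined)
```

Every record from both inputs crosses the shuffle, which is the expensive part and the reason to do the join once and store the result.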
32
DataWrangler
33
Process Tips
• Keep track of data lineage
• Keep track of all changes to data
• Use source control for code
34
Source Acquire Clean Transform Load
35
Sqoop
sqoop export
--connect jdbc:mysql://db.example.com/foo
--table bar
--export-dir /results/bar_data
36
FUSE-DFS
• Mount HDFS on Oracle server:
• sudo yum install hadoop-0.20-fuse
• hadoop-fuse-dfs
dfs://<name_node_hostname>:<namenode_port>
<mount_point>
• Use external tables to load data into Oracle
37
38
That’s nice.
But can you load data FAST?
Oracle Connectors
• SQL Connector for Hadoop
• Oracle Loader for Hadoop
• ODI with Hadoop
• OBIEE with Hadoop
• R connector for Hadoop
You don’t need the Big Data Appliance (BDA)
39
Oracle Loader for Hadoop
• Kinda like SQL*Loader
• Data is on HDFS
• Runs as Map-Reduce job
• Partitions, sorts, converts format to Oracle Blocks
• Appended to database tables
• Or written to Data Pump files for later load
40
Oracle SQL Connector for HDFS
• Data is in HDFS
• Connector creates external table
• That automatically matches Hadoop data
• Control degree of parallelism
• You know External Tables, right?
41
Data Types Supported
• Data Pump
• Delimited text
• Avro
• Regular expressions
• Custom formats
43
44
Main Benefit:
Processing is done in Hadoop
Benefits
• High performance
• Reduce CPU usage on Database
• Automatic optimizations:
• Partitions
• Sort
• Load balance
45
Measuring Data Load
46
Concerns
How much time?
How much CPU?
Bottlenecks
Disk
CPU
Network
I Know What This Means:
47
What does this mean?
48
Measuring Data Load
• Disks: ~300 MB/s each
• SSD: ~1.6 GB/s each
• Network:
• ~100 MB/s (1 GbE)
• ~1 GB/s (10 GbE)
• ~4 GB/s (IB)
• CPU: 1 CPU second per second per core.
• Need to know: CPU seconds per GB
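These rates turn into a back-of-the-envelope estimate: compute the time each resource would need on its own, and the slowest one is the bottleneck. The configuration below (one disk, 1 GbE, 8 cores, 40 CPU seconds per GB) is a hypothetical example, not a measurement.

```python
def load_hours(data_gb, disk_mb_s, network_mb_s, cpu_sec_per_gb, cores):
    """Return hours needed per resource; the largest value is the bottleneck."""
    data_mb = data_gb * 1000
    return {
        "disk": data_mb / disk_mb_s / 3600,
        "network": data_mb / network_mb_s / 3600,
        "cpu": data_gb * cpu_sec_per_gb / cores / 3600,
    }

# Hypothetical setup: 5 TB over a single 300 MB/s disk, 1 GbE, and 8 cores.
hours = load_hours(data_gb=5000, disk_mb_s=300, network_mb_s=100,
                   cpu_sec_per_gb=40, cores=8)
bottleneck = max(hours, key=hours.get)
print(bottleneck, round(hours[bottleneck], 1))
```

In this setup the network is the limit at roughly 14 hours, so adding cores would not help until the link is upgraded.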
49
Let's walk through this…
We have 5TB to load
Each core: 3600 seconds per hour
5000GB will take:
With Fuse: 5000 * 150 cpu-sec = 750000 / 3600 = 208 cpu-hours
With SQL Connector: 5000 * 40 cpu-sec = 200000 / 3600 = 56 cpu-hours
Our X2-3 half rack has 84 cores.
So, around 40 minutes to load 5TB at 100% CPU (56 cpu-hours / 84 cores).
Assuming you use Exadata (Infiniband + SSD = 8TB/h load rate)
And use all CPUs for loading
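The walk-through as a few lines of arithmetic, using only the figures quoted on the slide (150 and 40 CPU seconds per GB, 5000 GB, 84 cores). The variable and function names are mine.

```python
DATA_GB = 5000
CORES = 84
SECONDS_PER_HOUR = 3600

def cpu_hours(cpu_sec_per_gb, data_gb=DATA_GB):
    """Total CPU time for the load, in core-hours."""
    return data_gb * cpu_sec_per_gb / SECONDS_PER_HOUR

fuse = cpu_hours(150)           # 750000 cpu-sec / 3600
sql_connector = cpu_hours(40)   # 200000 cpu-sec / 3600

# Spread across every core at 100% utilization.
wall_clock_minutes = sql_connector / CORES * 60

print(round(fuse), round(sql_connector, 1), round(wall_clock_minutes))
```

The SQL Connector figure comes to about 55.6 core-hours, or just under 40 minutes of wall-clock time on 84 cores, which is consistent with the 8 TB/h load rate assumed for Exadata (5 TB at 8 TB/h is 37.5 minutes).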
50
51
Given fast enough network and disks,
data loading will take all available CPU
This is a good thing
52

Kafka Reliability - When it absolutely, positively has to be thereKafka Reliability - When it absolutely, positively has to be there
Kafka Reliability - When it absolutely, positively has to be there
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clustersNyc kafka meetup 2015 - when bad things happen to good kafka clusters
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 
Kafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn MeetupKafka and Hadoop at LinkedIn Meetup
Kafka and Hadoop at LinkedIn Meetup
 
Kafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka MeetupKafka & Hadoop - for NYC Kafka Meetup
Kafka & Hadoop - for NYC Kafka Meetup
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 

Dernier

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 

Dernier (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 

Data Wrangling and Oracle Connectors for Hadoop

  • 1. Wrangling Data With Oracle Connectors for Hadoop. Gwen Shapira, Solutions Architect, gshapira@cloudera.com, @gwenshap
  • 2. Data Has Changed in the Last 30 Years. Data growth, 1980 to 2013: end-user applications, the internet, mobile devices, sophisticated machines. Structured data: 10%. Unstructured data: 90%.
  • 5. Hadoop Is… • HDFS – Massive, redundant data storage • MapReduce – Batch-oriented data processing at scale. Core Hadoop system components: Hadoop Distributed File System (HDFS), replicated, high-bandwidth clustered storage; MapReduce, a distributed computing framework.
  • 6. Hadoop and Databases. "Schema-on-Write": • Schema must be created before any data can be loaded • An explicit load operation has to take place which transforms data to the DB internal structure • New columns must be added explicitly • Pros: 1) Reads are fast 2) Standards and governance. "Schema-on-Read": • Data is simply copied to the file store, no transformation is needed • Serializer/Deserializer is applied at read time to extract the required columns • New data can start flowing anytime and will appear retroactively • Pros: 1) Loads are fast 2) Flexibility and agility
  • 7. Hadoop rocks Data Wrangling • Cheap storage for messy data • Tools to play with data: • Acquire • Clean • Transform • Flexibility where you need it most
  • 8. Got unstructured data? • Data Warehouse: • Text • CSV • XLS • XML • Hadoop: • HTML • XML, RSS • JSON • Apache logs • Avro, Protocol Buffers, ORC, Parquet • Compression • Office, OpenDocument, iWork • PDF, EPUB, RTF • MIDI, MP3 • JPEG, TIFF • Java classes • Mbox, RFC822 • AutoCAD • TrueType Parser • HDF / NetCDF
  • 10. What Does Data Wrangling Look Like? Source → Acquire → Clean → Transform → Load
  • 11. Data Sources • Internal: • OLTP • Log files • Documents • Sensors / network events • External: • Geo-location • Demographics • Public data sets • Websites
  • 12. Free External Data (name: URL) • U.S. Census Bureau: http://factfinder2.census.gov/ • U.S. Executive Branch: http://www.data.gov/ • U.K. Government: http://data.gov.uk/ • E.U. Government: http://publicdata.eu/ • The World Bank: http://data.worldbank.org/ • Freebase: http://www.freebase.com/ • Wikidata: http://meta.wikimedia.org/wiki/Wikidata • Amazon Web Services: http://aws.amazon.com/datasets
  • 13. Data for Sale (source, type: URL) • Gnip, social media: http://gnip.com/ • AC Nielsen, media usage: http://www.nielsen.com/ • Rapleaf, demographic: http://www.rapleaf.com/ • ESRI, geographic (GIS): http://www.esri.com/ • eBay, auction: https://developer.ebay.com/ • D&B, business entities: http://www.dnb.com/ • Trulia, real estate: http://www.trulia.com/ • Standard & Poor's, financial: http://standardandpoors.com/
  • 14. Source → Acquire → Clean → Transform → Load
  • 15. Getting Data into Hadoop • Sqoop • Flume • Copy • Write • Scraping • Data APIs
  • 16. Sqoop Import Examples
      sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'"
      sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16
      The --split-by column must be indexed or partitioned to avoid 16 full table scans.
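To see why the split column matters, here is a rough Python sketch of range-based splitting. It assumes an integer key and evenly sized ranges between the column's minimum and maximum; the names `split_ranges` and `range_predicates` and the sample bounds are made up for illustration and are not Sqoop's actual code. Each mapper ends up with its own range query, so without an index or partitioning on `shop_id` every one of those queries scans the whole table.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous chunks.

    Illustrative only: mimics the idea of range-based splitting on an
    integer column, not Sqoop's real implementation.
    """
    total = hi - lo + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more mappers than keys
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_predicates(column, ranges):
    """Render one WHERE predicate per mapper."""
    return [f"{column} >= {a} AND {column} <= {b}" for a, b in ranges]


if __name__ == "__main__":
    # Hypothetical key bounds; a real import would look them up in the table.
    for predicate in range_predicates("shop_id", split_ranges(1, 1600, 16)):
        print(predicate)
```

With 16 mappers and keys 1 to 1600, each mapper gets a 100-key slice, which is cheap only if the database can seek directly to that slice.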
  • 17. Or…
      hadoop fs -put myfile.txt /big/project/myfile.txt
      curl -i list_of_urls.txt
      curl https://api.twitter.com/1/users/show.json?screen_name=cloudera
      { "id": 16134540, "name": "Cloudera", "screen_name": "cloudera", "location": "Palo Alto, CA", "url": "http://www.cloudera.com", "followers_count": 11359 }
  • 18. And even… $ cat scraper.py
      # Print the text of every <h2> heading on a page (Python 3, beautifulsoup4)
      import urllib.request
      from bs4 import BeautifulSoup
      txt = urllib.request.urlopen("http://www.example.com/").read()
      soup = BeautifulSoup(txt, "html.parser")
      for heading in soup.find_all("h2"):
          print(heading.string)
  • 19. Source → Acquire → Clean → Transform → Load
  • 20. Data Quality Issues • Given enough data, quality issues are inevitable • Main issues: • Inconsistent – "99" instead of "1999" • Invalid – last_update: 2036 • Corrupt – #$%&@*%@
  • 21. Happy families are all alike. Each unhappy family is unhappy in its own way.
  • 22. Endless Inconsistencies • Upper vs. lower case • Date formats • Times, time zones, 24h • Missing values • NULL vs. empty string vs. NA • Variation in free-format input: • 1 PATCH EVERY 24 HOURS • Replace patches on skin daily
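A couple of these inconsistencies can be normalized mechanically. The sketch below is one possible approach in Python, not something from the talk: the set of tokens treated as missing and the two-digit-year pivot of 30 are assumptions you would tune to your own data.

```python
import re

# Assumed spellings of "missing"; extend for your own sources.
NULL_TOKENS = {"", "null", "na", "n/a", "none"}


def normalize_missing(value):
    """Map NULL, empty string and NA variants to None; trim everything else."""
    if value is None or value.strip().lower() in NULL_TOKENS:
        return None
    return value.strip()


def normalize_year(value, pivot=30):
    """Expand a two-digit year; values below `pivot` are read as 20xx."""
    value = value.strip()
    if re.fullmatch(r"\d{4}", value):
        return int(value)
    if re.fullmatch(r"\d{2}", value):
        two = int(value)
        return (2000 if two < pivot else 1900) + two
    raise ValueError(f"unrecognized year: {value!r}")


if __name__ == "__main__":
    print(normalize_missing(" NA "))
    print(normalize_year("99"))
```

Free-format text such as the two patch instructions is a different problem: it usually needs a lookup table or human review rather than a regular expression.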
  • 23. Hadoop Strategies • A validation script is ALWAYS the first step • But not always enough • We have known unknowns and unknown unknowns
  • 24. Known Unknowns • Script to: • Check number of columns per row • Validate not-null • Validate data type ("is number") • Date constraints • Other business logic
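A validation script along these lines might look like the following Python sketch. The four-column layout (id, name, amount, last_update), the date format, and the function name `validate_row` are hypothetical; the checks themselves mirror the list above.

```python
from datetime import date, datetime

EXPECTED_COLUMNS = 4  # hypothetical layout: id, name, amount, last_update


def validate_row(fields, today=None):
    """Return a list of problems found in one parsed row (empty if clean)."""
    today = today or date.today()
    # Check number of columns per row
    if len(fields) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}"]

    record_id, name, amount, last_update = fields
    problems = []
    # Validate not-null
    if not record_id.strip():
        problems.append("id is null")
    # Validate data type ("is number")
    try:
        float(amount)
    except ValueError:
        problems.append("amount is not a number")
    # Date constraints: must parse and must not lie in the future
    try:
        parsed = datetime.strptime(last_update, "%Y-%m-%d").date()
        if parsed > today:
            problems.append("last_update is in the future")
    except ValueError:
        problems.append("last_update is not a valid date")
    return problems


if __name__ == "__main__":
    print(validate_row(["", "shop", "abc", "2036-01-01"], today=date(2013, 1, 1)))
```

Returning a list of problems rather than a boolean makes it easy to report every issue with a row in one pass.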
  • 25. Unknown Unknowns • Bad records will happen • Your job should move on • Use counters in the Hadoop job to count bad records • Log errors • Write bad records to a re-loadable file
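The same pattern can be sketched outside Hadoop in plain Python. Here a `collections.Counter` stands in for Hadoop job counters, and the three-column comma-separated layout with a numeric third column is an assumed example rule; the point is that a bad record is counted, logged, and written to a reject stream while the job keeps going.

```python
import io
import logging
from collections import Counter

log = logging.getLogger("loader")


def load(lines, good_out, reject_out, expected_columns=3):
    """Copy clean rows to good_out and rejects to reject_out, counting both."""
    counters = Counter()  # stand-in for Hadoop job counters
    for line in lines:
        raw = line.rstrip("\n")
        fields = raw.split(",")
        try:
            if len(fields) != expected_columns:
                raise ValueError(
                    f"expected {expected_columns} columns, got {len(fields)}"
                )
            float(fields[2])  # hypothetical rule: third column is numeric
        except ValueError as err:
            counters["bad_records"] += 1
            log.warning("rejected %r: %s", raw, err)
            reject_out.write(raw + "\n")  # kept verbatim so it can be reloaded
            continue  # the job moves on
        counters["good_records"] += 1
        good_out.write(raw + "\n")
    return counters


if __name__ == "__main__":
    good, rejects = io.StringIO(), io.StringIO()
    sample = ["1,a,2.5\n", "#$%&@*%@\n", "2,b,x\n", "3,c,7\n"]
    print(load(sample, good, rejects))
```

Writing rejects unchanged, one per line, means the reject file can be corrected and fed straight back through the same loader.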
  • 26. Solving Bad Data • Can be done at many levels: • Fix at source • Improve acquisition process • Pre-process before analysis • Fix during analysis • How many times will you analyze this data? • 0, 1, many, lots
  • 27. Source → Acquire → Clean → Transform → Load
  • 28. Endless Possibilities • MapReduce (in any language) • Hive (i.e. SQL) • Pig • R • Shell scripts • Plain old Java
  • 29. De-Identification • Remove PII data • Names, addresses, possibly more • Remove columns • Remove IDs *after* joins • Hash • Use partial data • Create statistically similar fake data
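Three of these techniques (remove columns, hash, use partial data) fit in a short Python sketch. The field names, the three-digit zip prefix, and the choice of a keyed HMAC-SHA256 over a plain hash are assumptions made for illustration; the talk does not prescribe a specific scheme.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (hex string)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, secret_key,
               drop=("name", "address"),     # hypothetical PII columns
               hash_fields=("patient_id",),  # hypothetical join key
               zip_digits=3):
    """Drop PII columns, hash identifiers, and keep only a zip prefix."""
    clean = {k: v for k, v in record.items() if k not in drop}
    for field in hash_fields:
        if field in clean:
            clean[field] = pseudonymize(str(clean[field]), secret_key)
    if "zip" in clean:
        clean["zip"] = clean["zip"][:zip_digits]  # partial data
    return clean


if __name__ == "__main__":
    row = {"patient_id": 42, "name": "Jane Doe", "address": "1 Main St",
           "zip": "94040", "diagnosis": "flu"}
    print(deidentify(row, b"example-key-only"))
```

Because the same key always yields the same digest, hashed IDs can still be joined across data sets, which is one way to keep joins working without exposing the raw identifier.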
  • 30. 87% of the US population can be identified from gender, zip code and date of birth
  • 31. Joins • Do at source if possible • Can be done with MapReduce • Or with Hive (Hadoop SQL) • Joins are expensive: • Do once and store results • De-aggregate aggressively • Everything a hospital knows about a patient
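For readers who have not seen a MapReduce join, here is a toy reduce-side join written in plain Python. It only imitates the shape of the technique (tag each row with its source in the map phase, group by key, combine in the reduce phase); the patient and visit tables and their field names are invented for the example.

```python
from collections import defaultdict


def map_phase(patients, visits):
    """Emit (join_key, (source_tag, row)) pairs from both inputs."""
    for row in patients:
        yield row["patient_id"], ("patient", row)
    for row in visits:
        yield row["patient_id"], ("visit", row)


def reduce_phase(pairs):
    """Group by key, then combine each patient row with its visits."""
    groups = defaultdict(lambda: {"patient": [], "visit": []})
    for key, (tag, row) in pairs:
        groups[key][tag].append(row)
    for key in sorted(groups):
        for patient in groups[key]["patient"]:
            for visit in groups[key]["visit"]:
                yield {**patient, **visit}  # inner join: both sides required


if __name__ == "__main__":
    patients = [{"patient_id": 1, "zip": "940"}, {"patient_id": 2, "zip": "100"}]
    visits = [{"patient_id": 1, "dept": "ER"}, {"patient_id": 1, "dept": "XR"}]
    for joined in reduce_phase(map_phase(patients, visits)):
        print(joined)
```

In a real cluster the grouping step is the shuffle, which moves every row across the network; that cost is the reason to join once and store the result.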
  • 33. Process Tips • Keep track of data lineage • Keep track of all changes to data • Use source control for code
  • 34. Source → Acquire → Clean → Transform → Load
  • 36. FUSE-DFS • Mount HDFS on the Oracle server: • sudo yum install hadoop-0.20-fuse • hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> • Use external tables to load data into Oracle
  • 37. That's nice. But can you load data FAST?
  • 38. Oracle Connectors • SQL Connector for Hadoop • Oracle Loader for Hadoop • ODI with Hadoop • OBIEE with Hadoop • R connector for Hadoop. You don't need the Big Data Appliance (BDA).
  • 39. Oracle Loader for Hadoop • Kinda like SQL*Loader • Data is on HDFS • Runs as a MapReduce job • Partitions, sorts, converts format to Oracle blocks • Appended to database tables • Or written to Data Pump files for later load
  • 40. Oracle SQL Connector for HDFS • Data is in HDFS • Connector creates an external table • That automatically matches Hadoop data • Control degree of parallelism • You know external tables, right?
  • 41. Data Types Supported • Data Pump • Delimited text • Avro • Regular expressions • Custom formats
  • 43. Benefits • High performance • Reduce CPU usage on the database • Automatic optimizations: • Partitions • Sort • Load balance
  • 44. Measuring Data Load • Concerns: how much time? How much CPU? • Bottlenecks: disk, CPU, network
  • 45. I Know What This Means:
  • 46. What does this mean?
  • 47. Measuring Data Load • Disks: ~300 MB/s each • SSD: ~1.6 GB/s each • Network: • ~100 MB/s (1 GbE) • ~1 GB/s (10 GbE) • ~4 GB/s (InfiniBand) • CPU: 1 CPU second per second per core • Need to know: CPU seconds per GB
  • 48. Let's walk through this… We have 5 TB to load. Each core provides 3600 CPU-seconds per hour. 5000 GB will take: • With FUSE: 5000 × 150 CPU-seconds = 750,000 / 3600 ≈ 208 CPU-hours • With SQL Connector: 5000 × 40 CPU-seconds = 200,000 / 3600 ≈ 55 CPU-hours. Our X2-3 half rack has 84 cores, so around 40 minutes (55 / 84 ≈ 0.66 hours) to load 5 TB at 100% CPU, assuming you use Exadata (InfiniBand + SSD = 8 TB/h load rate) and use all CPUs for loading.
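The arithmetic on this slide is easy to replay in Python. The figures of 150 and 40 CPU-seconds per GB and the 84 cores are taken from the slide; the function names are made up, and the wall-clock estimate assumes the load is purely CPU-bound and spreads perfectly across all cores.

```python
SECONDS_PER_HOUR = 3600


def cpu_hours(size_gb, cpu_seconds_per_gb):
    """Total CPU time needed to load size_gb, in CPU-hours."""
    return size_gb * cpu_seconds_per_gb / SECONDS_PER_HOUR


def wall_clock_minutes(size_gb, cpu_seconds_per_gb, cores):
    """Elapsed minutes if the work is CPU-bound and uses every core fully."""
    return cpu_hours(size_gb, cpu_seconds_per_gb) / cores * 60


if __name__ == "__main__":
    for label, cost in (("FUSE", 150), ("SQL Connector", 40)):
        print(f"{label}: {cpu_hours(5000, cost):.1f} CPU-hours, "
              f"{wall_clock_minutes(5000, cost, 84):.0f} minutes on 84 cores")
```

Under those assumptions the SQL Connector case comes out just under 40 minutes, and the FUSE case at roughly two and a half hours.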
  • 49. Given fast enough network and disks, data loading will take all available CPU. This is a good thing.

Editor's notes

  1. Data, especially from outside sources, is not in perfect condition to be useful to your business. Not only does it need to be processed into useful formats, it also needs: filtering for potentially useful information (99% of everything is crap); statistical analysis (is this data significant?); integration with existing data; entity resolution (is "Oracle Corp" the same as "Oracle" and "Oracle Corporation"?); de-duplication. Good processing and filtering of data can reduce the volume and variety of data. It is important to distinguish between true and accidental variety. This requires massive use of processing power. In a way, there is a trade-off between storage space and CPU: if you don't invest CPU in filtering, de-duping and entity resolution, you'll need more storage.
  2. Oracle uses an acquire-organize-analyze model. We are looking at the acquire and organize phases in some detail.
  3. Internal data sources are typically more valuable. Hadoop lets you use data that doesn't make financial sense to load into an RDBMS. In a large enough organization, internal data becomes external: no control over quality, format, or changes.
  4. Example: find out how far people live from the nearest doctor and pharmacy, using zip codes and a zip code to longitude/latitude mapping.
  5. ESRI data is probably the most common. Oil & gas, defense.
  8. Inconsistent – data is correct, but has small formatting issues (1999 vs. 99, M vs. male, etc.). Invalid – format is correct, but something is wrong with the data (update from 2036 or 1976). Corrupt – format completely unparsable. You can fix inconsistencies, identify invalid data and throw out corrupt data.