Integrated Data Warehouse with Hadoop and Oracle Database
- 2. Why Pythian
• Recognized Leader:
• Global industry leader in data infrastructure managed services and consulting with expertise
in Oracle, Oracle Applications, Microsoft SQL Server, MySQL, big data and systems
administration
• Work with over 200 multinational companies such as Forbes.com, Fox Sports, Nordion and
Western Union to help manage their complex IT deployments
• Expertise:
• One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employ 8
Oracle ACEs/ACE Directors
• Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata,
Oracle GoldenGate & Oracle RAC
• Global Reach & Scalability:
• 24/7/365 global remote support for DBA and consulting, systems administration, special
projects or emergency response
© 2012 – Pythian
- 3. About Gwen Shapira
• Oracle ACE Director
• 13 years with a pager
• 7 of them as an Oracle DBA
• Senior Consultant:
• Has MacBook, will travel.
• @gwenshap
• http://www.pythian.com/news/author/shapira/
- 4. Agenda
• What is Big Data?
• Why do we care about Big Data?
• Why does your DWH need Hadoop?
• Examples of Hadoop in the DWH
• How to integrate Hadoop into your DWH
• Avoiding major pitfalls
- 9. Data Arriving at Fast Rates
• Typically unstructured
• Stored without aggregation
• Analyzed in real time
• At a reasonable cost
- 10. Where does Big Data come from?
• Social media
• Enterprise transactional data
• Consumer behaviour
• Multimedia
• Sensors and embedded devices
• Network devices
- 14. Big Problems with Big Data
• It is:
• Unstructured
• Unprocessed
• Un-aggregated
• Un-filtered
• Repetitive
• And generally messy.
Oh, and there is a lot of it.
- 15. Technical Challenges
• Storage capacity, storage throughput, pipeline throughput → scalable storage
• Processing power, parallel processing → massively parallel processing
• System integration, data analysis → ready-to-use tools
- 17. Hadoop in a Nutshell
• HDFS: replicated, distributed big-data file system
• Map-Reduce: framework for writing massively parallel jobs
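The two halves of Hadoop can be sketched in miniature. Below is a minimal, single-process word count showing the map, shuffle, and reduce steps of the Map-Reduce model; the function names and sample data are illustrative only, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between
    # the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently, which is
    # what lets Hadoop run reducers in parallel across the cluster.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data is messy"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)
```

In a real cluster the map and reduce functions run on many nodes at once and the shuffle moves data between them over the network; the logic per record is the same.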
- 18. Hadoop Benefits
• Reliable solution based on unreliable hardware
• Designed for large files
• Load data first, structure later
• Designed to maximize throughput of large scans
• Designed to maximize parallelism
• Designed to scale
• Flexible development platform
• Solution Ecosystem
- 19. Hadoop Limitations
• Hadoop is scalable but not fast
• Batteries not included
• Instrumentation not included either
• Well-known reliability limitations
- 21. ETL for Unstructured Data
Logs (web servers, app servers, clickstreams) → Flume → Hadoop (cleanup, aggregation, long-term storage) → DWH → BI, batch reports
- 22. ETL for Structured Data
OLTP (Oracle, MySQL, Informix…) → Sqoop, Perl → Hadoop (transformation, aggregation, long-term storage) → DWH → BI, batch reports
- 28. Sqoop
- 29. Sqoop is Flexible (for import)
• Select columns, from table, where condition
• Or write your own query
• Split column
• Parallel
• Incremental
• File formats
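These options combine on a single command line. A sketch in Python that assembles such an import command; the connection string, table, and column names are invented placeholders, not values from the deck:

```python
# Assemble a sqoop import command from its parts. All names here
# (dbserver, sales, sale_id, etc.) are hypothetical placeholders.
options = {
    "--connect": "jdbc:oracle:thin:@//dbserver:1521/masterdb",
    "--username": "hr",
    "--table": "sales",
    "--columns": "sale_id,amount,sale_date",  # select specific columns
    "--where": '"amount > 100"',              # push the filter to the DB
    "--split-by": "sale_id",                  # column used to split work
    "--num-mappers": "8",                     # degree of parallelism
    "--incremental": "append",                # only rows new since last run
    "--check-column": "sale_id",              # column that marks new rows
}
cmd = "sqoop import " + " ".join(f"{k} {v}" for k, v in options.items())
print(cmd)
```

Each flag maps to one bullet above: column selection, a pushed-down condition, a split column, a parallelism degree, and incremental mode.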
- 30. Sqoop Import Examples
• sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'"
• sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16
(the split column must be indexed or partitioned to avoid 16 full table scans)
- 31. Less Flexible Export
• 100-row batch inserts
• Commit every 100 batches
• Parallel export
• Update mode
Example:
sqoop export --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --table bar --export-dir /results/bar_data
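The batch-and-commit pattern Sqoop export uses is worth seeing in isolation: insert rows in batches of 100 and commit every 100 batches, so each commit covers 10,000 rows. A minimal sketch, using sqlite3 as a stand-in for Oracle purely so the example is self-contained:

```python
import sqlite3

BATCH_ROWS = 100          # rows per batch insert
BATCHES_PER_COMMIT = 100  # commits are this many batches apart

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bar (id INTEGER, val TEXT)")

rows = [(i, f"v{i}") for i in range(25_000)]
batches_done = 0
for start in range(0, len(rows), BATCH_ROWS):
    batch = rows[start:start + BATCH_ROWS]
    conn.executemany("INSERT INTO bar VALUES (?, ?)", batch)
    batches_done += 1
    if batches_done % BATCHES_PER_COMMIT == 0:
        conn.commit()  # one commit per 10,000 rows
conn.commit()          # flush the final partial run of batches

count = conn.execute("SELECT COUNT(*) FROM bar").fetchone()[0]
print(count)
```

Batching keeps round trips down while periodic commits keep transactions, and the redo they generate, to a manageable size.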
- 32. Fuse-DFS
• Mount HDFS on Oracle server:
• sudo yum install hadoop-0.20-fuse
• hadoop-fuse-dfs dfs://name_node_hostname:namenode_port mount_point
• Use external tables to load data into Oracle
• File Formats may vary
• All ETL best practices apply
- 33. Oracle Loader for Hadoop
• Load data from Hadoop into Oracle
• Map-Reduce job inside Hadoop
• Converts data types.
• Partitions and sorts
• Direct path loads
• Reduces CPU utilization on database
- 34. Oracle Direct Connector to HDFS
• Create external tables of files in HDFS
• PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
• All the features of External Tables
• Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS
- 39. Use Hadoop Efficiently
• Understand your bottlenecks:
• CPU, storage or network?
• Reduce use of temporary data:
• It all travels over the network
• And is written to disk in triplicate
• Eliminate unbalanced workloads
• Offload work to RDBMS
• Fine-tune optimization with Map-Reduce
- 40. Your Data
is NOT
as BIG
as you think
- 41. Getting Started
• Pick a Business Problem
• Acquire Data
• Use right tool for the job
• Hadoop can start on the cheap
• Integrate the systems
• Analyze data
• Get operational
- 42. Thank You and Q&A
To contact us…
sales@pythian.com
1-877-PYTHIAN
To follow us…
http://www.pythian.com/news/
http://www.facebook.com/pages/The-Pythian-Group/163902527671
@pythian
@pythianjobs
http://www.linkedin.com/company/pythian