Speaker: 尹寒柏, Senior Product Consultant, Informatica
Session overview: In the Big Data era, the real competition is not about how much data you have but about how deeply you understand it. Now that Big Data technology has matured, even CXOs without an IT background can turn CI (Customer Intelligence), once little more than a buzzword, into a verb: moving from BI to CI, connecting to the pulse of the consumer economy, and gaining insight into customer intent. One mindset matters in the Big Data era: in the end, the competition is not only about growth in data volume but about who understands the data more deeply, and Informatica is the best answer to that challenge. With Informatica we relieve the enormous pressure on enterprises to deliver trusted data in a timely way; and as data volume and complexity keep rising, Informatica can also aggregate data more quickly, making the data meaningful and usable for improving efficiency, raising quality, providing certainty, and building advantage. Informatica offers a faster and more effective way to reach this goal and is SYSTEX Group's (精誠集團) tool of choice in the Big Data era.
2. The Challenge: Data fragmentation becomes the barrier to business success

(Timeline diagram: each technology wave reaches a larger user population and shifts where business value is created.)
• Mainframe (OS/360), 1960s-1970s: a few employees (~10^2 users); back-office automation
• Client-server, 1980s: many employees (~10^4 users); front-office productivity
• Web, 1990s: customers/consumers (~10^6 users); e-commerce
• Cloud, 2007: business ecosystems (~10^7 users); line-of-business self-service
• Social, 2011: communities & society (~10^9 users); social engagement
• Internet of Things, 2014: devices & machines (~10^11 sources); real-time optimization
5. 80% of the work in big data projects is data integration and quality
"I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis."
"80% of the work in any data project is in cleaning the data."
"70% of my value is an ability to pull the data, 20% of my value is using data science…"
7. PowerCenter Big Data Edition

Big Transaction Data
• Online Transaction Processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server, …
• Online Analytical Processing (OLAP) & DW appliances: Teradata, Redbrick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel, …
• Cloud: Salesforce.com, Concur, Google App Engine, Amazon, …

Big Interaction Data
• Social media & web data: Facebook, Twitter, LinkedIn, YouTube, …
• Web applications: blogs, discussion forums, communities, partner portals, …
• Other interaction data: clickstream, image/text, scientific, genomic/pharma, medical, medical devices, sensors/meters, RFID tags, CDR/mobile, …

Big Data Processing (built on the Vibe™ virtual data machine)
• Universal data access
• High-speed data ingestion and extraction
• ETL on Hadoop
• Profiling on Hadoop
• Complex data parsing on Hadoop
• Entity extraction and data classification on Hadoop
• No-code productivity
• Business-IT collaboration
• Unified administration
8. Get Data Into and Out of Hadoop (a minimal HDFS-landing sketch follows this list)
• PowerExchange for Hadoop
• Replication to Hadoop
• Streaming to Hadoop
• Data Archiving to Hadoop
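The simplest way to land data in Hadoop is the standard HDFS command line; this minimal sketch (which is not PowerExchange for Hadoop) just wraps "hdfs dfs" from Python, and the local and HDFS paths are hypothetical examples.

import subprocess

def put_into_hdfs(local_path, hdfs_dir):
    # Create the target directory if needed, then copy (or overwrite) the file into HDFS.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)

if __name__ == "__main__":
    put_into_hdfs("/tmp/orders_extract.csv", "/data/landing/orders")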
9. Data Ingestion and Extraction: Moving terabytes of data per hour
• Sources: transactions (OLTP, OLAP); social media and web logs; documents and email; industry standards; machine/device and scientific data
• Movement: replicate, stream, batch load, or archive data into a low-cost store, and extract it back out
• Targets: data warehouse, MDM, applications
10. PowerExchange Connectors (✔ = accessible in real time and/or via Change Data Capture, CDC)

• Enterprise Applications, Software as a Service (SaaS): JDE EnterpriseOne, JDE World, Lotus Notes, Oracle E-Business Suite ✔, PeopleSoft Enterprise, Salesforce (salesforce.com) ✔, SAP NetWeaver ✔, SAP NetWeaver BI ✔, SAS, Siebel, NetSuite, Microsoft Dynamics
• Databases and Data Warehouses: Adabas for UNIX/Windows, C-ISAM, DB2 for LUW ✔, Essbase, EMC/Greenplum, Informix Dynamic Server, Netezza Performance Server, ODBC, Oracle ✔, SQL Server ✔, Sybase, Teradata
• Messaging Systems: JMS ✔, MSMQ ✔, TIBCO ✔, webMethods Broker ✔, WebSphere MQ ✔
• Technology Standards: Email (POP, IMAP), HTTP(S) ✔, LDAP ✔, Web Services ✔, XML
• Mainframe: Adabas for z/OS ✔, Datacom ✔, DB2 for z/OS and z/Linux ✔, IDMS ✔, IMS DB ✔, Oracle for z/Linux ✔, Teradata, WebSphere MQ for z/Linux ✔, VSAM ✔
• Big Data: Aster Data, Greenplum, Vertica, ParAccel, Microsoft PDW, Kognitio
• Social: Facebook, Twitter, LinkedIn, DataSift, Kapow
• NoSQL: MongoDB
• Hadoop: HDFS, Hive, HBase
11. NoSQL Support for HBase
• Read from HBase as a standard source
• Write to HBase as a standard target
• A complete mapping with an HBase source/target can execute on Hadoop
• Sample HBase column families (stored in JSON/complex formats) – see the sketch below
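Outside the Informatica tooling, the same source/target idea can be shown in a few lines. A minimal sketch, assuming the happybase Python client against an HBase Thrift gateway; the host, table, row key, and column names below are hypothetical examples, and the JSON-in-a-column-family layout mirrors the slide.

import json
import happybase

# Requires the HBase Thrift server to be running on the target cluster.
connection = happybase.Connection("hbase-host")
table = connection.table("customer_events")

# Write: HBase as a target -- one row, with JSON stored inside a column family.
table.put(b"cust-0001",
          {b"profile:json": json.dumps({"name": "Alice", "tier": "gold"}).encode()})

# Read: HBase as a source -- scan a few rows and decode the JSON column back out.
for row_key, columns in table.scan(limit=10):
    profile = json.loads(columns[b"profile:json"])
    print(row_key, profile["name"], profile["tier"])

connection.close()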
12. NoSQL Support for MongoDB
• Access, integrate, transform, and ingest MongoDB data into other analytic systems (e.g. Hadoop, a data warehouse)
• Access, integrate, transform, and ingest data into MongoDB
• Sample MongoDB data and flatten it to relational format (see the sketch below)
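To make the "flatten to relational format" step concrete, here is a minimal sketch using the pymongo client, independent of Informatica's connector; the connection string, database, and collection names are hypothetical examples.

from pymongo import MongoClient

def flatten(doc, parent_key="", sep="."):
    # Recursively turn {"a": {"b": 1}} into {"a.b": 1} so each field maps to a column.
    items = {}
    for key, value in doc.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items

client = MongoClient("mongodb://localhost:27017")
collection = client["crm"]["customers"]

# Sample a handful of documents and print them as flat, relational-style rows.
for doc in collection.find().limit(5):
    doc.pop("_id", None)          # drop the ObjectId for the flat view
    print(flatten(doc))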
14. Real-Time Data Collection and Streaming

Discrete data messages flow from sources (web servers; operations monitors such as rsyslog and SLF4J; handhelds and smart meters; Internet of Things and sensor data) across a publish/subscribe Ultra Messaging bus to targets: HDFS and HBase, real-time analysis and complex event processing, and NoSQL databases (Cassandra, Riak, MongoDB). ZooKeeper provides management and monitoring across the nodes. Publishing with Ultra Messaging leverages a high-performance messaging infrastructure for global distribution without additional staging or landing.
15. Informatica Vibe Data Stream for Machine Data
• High-performance, efficient streaming data collection over LAN/WAN
• GUI interface provides ease of configuration, deployment, and use
• Continuous ingestion of data generated in real time (sensors, logs, etc.), from machine-generated and other data sources
• Enables real-time interactions and response
• Real-time delivery directly to multiple targets (batch/stream processing)
• Highly available, efficient, scalable
• Available ecosystem of lightweight agents for sources and targets (a conceptual sketch of such an agent follows)
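This is NOT the Vibe Data Stream agent API, which is configured through the product's GUI; the sketch below only illustrates, under invented assumptions (the log path and collector URL are hypothetical), what a lightweight source agent conceptually does: tail a machine-generated log and forward each new record to a collector in real time.

import time
import requests

LOG_PATH = "/var/log/sensors/readings.log"          # hypothetical machine-data log
COLLECTOR_URL = "http://collector-host:8080/ingest" # hypothetical collector endpoint

with open(LOG_PATH, "r") as log:
    log.seek(0, 2)                   # start at the end; only ship newly written records
    while True:
        line = log.readline()
        if not line:
            time.sleep(0.1)          # no new data yet; poll again shortly
            continue
        requests.post(COLLECTOR_URL, data=line.strip().encode())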
16. Predictive Maintenance with Event Processing and Analytics

United Technologies Aerospace Systems (UTAS) provides engines and aircraft components to leading commercial and defense manufacturers, including the new Airbus A380 and Boeing B787.

The challenge:
• 5,000+ aircraft in service, plus new design wins, exponentially increase the amount of sensor data being generated
• The "Power by the Hour" leasing model means maintenance costs and service outages fall to UTAS
• No proactive capability to predict when a safety issue might occur
• Once-per-day sensor readings are moving to real-time, over-the-air collection
17. Archive to Hadoop: Compression Extends Hadoop Cluster Capacity

Without Informatica optimized archive compression: 10 TB of data, replicated 3X by HDFS (10 TB + 10 TB + 10 TB) = 30 TB stored.
With Informatica optimized archive compression (95%): the same 10 TB compresses to 500 GB; replicated 3X (500 GB + 500 GB + 500 GB) = 1.5 TB stored.
Result: 20X less I/O bandwidth required; 20 min vs. 1 min response time; 8 hours vs. 24 min backup window. (The arithmetic is worked through below.)
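A quick check of the slide's figures, as a worked computation; the only assumptions are the slide's own numbers (95% compression, HDFS default 3x replication).

raw_tb = 10
replication = 3
compression = 0.95

uncompressed_footprint = raw_tb * replication        # 30 TB stored on the cluster
compressed_tb = raw_tb * (1 - compression)           # 0.5 TB = 500 GB after compression
compressed_footprint = compressed_tb * replication   # 1.5 TB stored on the cluster

print(uncompressed_footprint / compressed_footprint) # 20.0 -> "20X less I/O bandwidth"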
19. The Data Transformation (DT) engine: development, deployment, and invocation

1. The developer uses Studio to develop a transformation.
2. The developer deploys the transformation to a local service repository (directory); all files needed for the transformation are moved.
3. To deploy to the server, this service folder is moved to the server via FTP, copy, script, etc. (NOTE: if the server file system is mountable directly from the developer machine, step 2 can deploy straight to the server.)
4. The DT engine can immediately use this service to process data.

The DT engine is fully embeddable and can be invoked using any of the supported APIs: Java, C++, C, .NET, and web services. For simple integration, a command-line interface is available to invoke services. Internal custom applications can embed transformation services using the various APIs. PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI transformation widget in PowerCenter that wraps the DT API and engine.

DT can also be embedded in other middleware technologies. For some (WBIMB, webMethods, BizTalk) Informatica provides similar GUI widgets (agents) for the respective design environments; for others, the API layer can be used directly.

DT can be invoked in two general ways:
1. Filenames can be passed to it, and DT will directly open the file(s) for processing. On the output side, DT can also write directly to the filesystem.
2. The calling application can buffer the data and send buffers to DT for processing. On the output side, DT can also write back to memory buffers that are returned to the calling application.
Though not shown here, the engine fully supports multiple input and output files or buffers as needed by the transformation.

Engine invocation is via a shared library: the DT engine runs fully within the process of the calling application. It is not an external engine, which removes any overhead from passing data between processes or across the network. The engine is also dynamically invoked and does not need to be started up or maintained externally.

The DT engine is also thread-safe and re-entrant, so the calling application can invoke DT in multiple threads to increase throughput; a good example is DT's support of PowerCenter partitioning to scale up processing.

The actual transformation logic is completely independent of any calling application. This means you can develop a transformation once and leverage it in multiple environments simultaneously, resulting in reduced development and maintenance time and a lower impact of change.
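A purely hypothetical illustration of the two invocation styles just described (file-based vs. in-memory buffers). DT's real APIs are Java, C++, C, .NET, and web services; the DTService class, its methods, and the paths below are stand-ins invented for this sketch only.

class DTService:
    """Stand-in for a deployed DT transformation service (hypothetical)."""

    def __init__(self, service_folder):
        self.service_folder = service_folder   # the deployed service repository folder

    def run_files(self, input_files, output_files):
        # Style 1: pass filenames; the engine opens inputs and writes outputs itself.
        print(f"[{self.service_folder}] transforming {input_files} -> {output_files}")

    def run_buffers(self, input_buffers):
        # Style 2: the caller owns the data; buffers go in, transformed buffers come out.
        return [buf.upper() for buf in input_buffers]   # placeholder "transformation"

service = DTService("/opt/dt/services/InvoiceParser")
service.run_files(["/data/in/invoices.edi"], ["/data/out/invoices.xml"])
print(service.run_buffers([b"raw edi record"]))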
Parse and Prepare Data on Hadoop: the broadest coverage for Big Data
• Handles flat files and documents, interaction data, industry standards, and XML, including social, device/sensor, and scientific data
• Productivity: visual parsing environment and predefined translations
• Fits any DI/BI architecture (Pig, EDW, MDM), with parsing services deployed to a service repository
20. Example use case: Call Detail Records (CDR)
• Why Hadoop? CDR data sets are large: every 7 seconds, every mobile phone in the region creates a record
• Desire to analyze behavior and location to personalize and optimize pricing and marketing
21. Parse and Prepare Data on Hadoop: how does it work?
1. Define the parser in the HParser visual studio
2. Deploy the parser on the Hadoop Distributed File System (HDFS)
3. Run HParser to extract the data and produce tabular format in Hadoop:
hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
24. Hadoop Data Profiling Results

Hadoop Data Profiling results are exposed to anyone in the enterprise via a browser (examples: CUSTOMER_ID and COUNTRY_CODE columns). A minimal stand-alone sketch of these statistics follows this list.
1. Profiling stats: min/max values, NULLs, inferred data types, etc. – statistics to identify outliers and anomalies in the data
2. Value & pattern analysis of Hadoop data: value and pattern frequencies to isolate inconsistent/dirty data or unexpected patterns
3. Drill-down analysis (into Hadoop data): drill down into actual data values to inspect results across the entire data set, including potential duplicates
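The same kinds of statistics can be sketched outside the product with pandas; this is not Informatica's profiling engine, and the file path and the CUSTOMER_ID / COUNTRY_CODE column names are hypothetical examples taken from the slide.

import pandas as pd

df = pd.read_csv("customer_sample.csv")

print(df.dtypes)                      # inferred data types per column
print(df.describe(include="all"))     # min/max and other basic stats per column
print(df.isna().sum())                # NULL counts, to spot sparse or broken columns

# Value frequency: surfaces inconsistent or dirty values (e.g. "US", "USA", "U.S.").
print(df["COUNTRY_CODE"].value_counts().head(10))

# Pattern frequency: map each value to a shape such as "AAA999" and count the shapes.
pattern = (df["CUSTOMER_ID"].astype(str)
           .str.replace(r"[A-Za-z]", "A", regex=True)
           .str.replace(r"[0-9]", "9", regex=True))
print(pattern.value_counts().head(10))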
25. Hadoop Data Domain Discovery: finding the functional meaning of data in Hadoop
• Leverage Informatica rules/mapplets to identify the functional meaning of Hadoop data, e.g. sensitive data (SSN, credit card number, etc.), PHI (Protected Health Information), and PII (Personally Identifiable Information)
• Scalable to look for and discover ANY domain type
• View and share a report of the data domains and sensitive data contained in Hadoop, with the ability to drill down to see suspect data values (a simplified rule-based sketch follows)
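A minimal sketch of rule-based domain discovery, not Informatica's rules/mapplets: regular expressions flag likely SSN and credit-card values in a sampled column and report the match rate so suspect columns can be drilled into. The patterns, column names, and sample values are simplified, hypothetical examples.

import re

DOMAIN_PATTERNS = {
    "SSN":         re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "CREDIT_CARD": re.compile(r"^\d{4}([ -]?\d{4}){3}$"),
}

def discover_domains(column_name, values):
    # Report what fraction of the sampled values matches each domain pattern.
    for domain, pattern in DOMAIN_PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(str(v)))
        if hits:
            print(f"{column_name}: {hits}/{len(values)} values look like {domain}")

discover_domains("tax_id", ["123-45-6789", "987-65-4321", "n/a"])
discover_domains("payment_ref", ["4111 1111 1111 1111", "PO-2231"])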
28. Reuse and Import PC Metadata for Hadoop
• Import existing PowerCenter artifacts into the Hadoop development environment
• Validate the import logic before the actual import process to ensure compatibility
31. Configure Mapping for Hadoop Execution
• No need to redesign mapping logic to execute on either traditional or Hadoop infrastructure
• Configure where the integration logic should run – Hadoop or native
32. Data Integration & Quality on Hadoop: generated Hive-QL

FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME,
         customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY, count(ORDERKEY2) GROUP BY CUSTKEY;

1. The entire Informatica mapping is translated to Hive Query Language (a Hive multi-insert statement such as the one above).
2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through MapReduce User Defined Functions (UDFs) using Vibe.
33. Example Mapping Execution

Sources: an external flat file, external relational data, and an HDFS file; target: an HDFS file. The engine and repository sit alongside a cluster of Linux machines.
• Mapping logic is translated to HQL and submitted to the Hadoop cluster
• Relational data is streamed to Hadoop for processing
• The local flat file is staged temporarily on HDFS (as a temporarily staged lookup file)
• HDFS file data is read directly
• The final processed data is loaded into the target HDFS file
35. Mixed Workflow Orchestration
One workflow running tasks in both Hadoop and local environments.

Workflow tasks: Cmd_ChooseLoadPath → MT_Load2Hadoop + Parse, Cmd_Load2Hadoop, MT_Parse, Cmd_ProfileData, MT_Cleanse, MT_DataAnalysis, Notification.

List of workflow variables:
Name | Type | Default Value | Description
$User.LoadOptionPath | Integer | 2 | Load path for the workflow, depending on the output of the command task
$User.DataSourceConnection | String | HiveSourceConnection | Source connection object
$User.ProfileResult | Integer | 100 | Output from the "profiling" command task
36. Unified Administration: a single place to manage and monitor
• Full traceability from the workflow down to MapReduce jobs
• View generated Hive scripts