Organizations are increasingly interested in more efficient ways to handle deeply hierarchical data in Hadoop, including XML and JSON as well as other complex formats such as web logs, binaries, and machine-generated data.
How are you currently developing and setting up data parsing tasks inside MapReduce? Are you interested in native streaming and splitting capabilities that allow effective handling of files of any size, regardless of format? In this session, we will introduce HParser, which is optimized for parallel parsing in Hadoop, and give a technical demonstration.
11. Informatica HParser
Tackling Diversity of Big Data
The broadest coverage for Big Data
[Slide diagram and notes; the text boxes were interleaved during extraction. Recoverable content:]
Data sources: flat files and documents, industry standards, XML, interaction data (social network, device/sensor, scientific)
Productivity
• Visual parsing environment
• Predefined translations
Any DI/BI architecture: PIG, EDW, MDM
DT engine
• The DT engine is a shared library that is thread-safe and re-entrant, and it is completely independent of any calling application or technology.
• It can be invoked through APIs (C, C++, .NET, Java, web services), from the command line, from middleware (WBIMB, WebMethods, BizTalk), and from PowerCenter through the Unstructured Data Transformation (UDT).
• Data is passed to and from the engine as files or buffers.
• You can develop a transformation once and leverage it in multiple environments simultaneously, resulting in reduced development and maintenance times and lower impact of change.
12. Universal Data Transformation
Data Formats - Subset of Supported Data Formats
UNSTRUCTURED: Microsoft Word, Microsoft Excel, PDF, PowerPoint, ASCII reports, HTML, EBCDIC, Custom binaries, Flat files, RPG, ANSI
SEMI STRUCTURED: HL7, SWIFT, AL3, HIPAA, EDI-X12, EDI-Fact, FIX, NACHA, ASTM, Cargo IMP, COBOL, PL1, UCS, WINS, VICS, ASN.1
XML/JSON: ACORD XML, LegalXML, IFX, cXML, ebXML, HL7 V3.0, RosettaNet, ISO 20022, xBRL, Other
PRINT STREAMS: AFP, PostScript
14. Universal Data Transformation
Productivity: Data Transformation Studio
Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI – V2.0 Lockbox, CREST, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO
Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
B2B Standards: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI, DEX
Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
Other: IATA-PADIS, PLMXML, NEIM
• Out of the box transformations for all messages in all versions
• Easy example based visual enhancements and edits
• Updates and new versions delivered from Informatica
• Definition is done using Business (industry) terminology and definitions
• Enhanced Validations
15. HParser – How Does It Work?
hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
HDFS
1. Develop a DT transformation
2. Deploy the transformation
3. Run HParser to produce tabular data
4. Analyze the data with HIVE / PIG / MapReduce / Other
16. Example use cases
Trade data
• Why Hadoop?
• Trade data sets are extremely large
• We do not know in advance which trade patterns we will want to investigate
• Compare with other large data sets: Bloomberg, Reuters, NYSE
17. Example use cases
Trade data
• Why is handling FIX data complex?
• Variable length
• Name-value pairs
• Meaningful tags
• Hierarchy
• Variations
• Proprietary tags
• Yearly releases
• FIXML - XML version
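To make the name-value and proprietary-tag points concrete, here is a minimal Python sketch that turns one FIX tag=value message into a fixed-width row. It is not how HParser works internally; the sample message is simplified (it omits BodyLength and CheckSum), tag 5001 stands in for a hypothetical proprietary tag, and the helper names `parse_fix` and `to_row` are made up for illustration.

```python
SOH = "\x01"  # the standard FIX field delimiter

def parse_fix(message):
    """Split a raw FIX message into ordered (tag, value) pairs."""
    pairs = []
    for field in message.strip(SOH).split(SOH):
        tag, _, value = field.partition("=")
        pairs.append((int(tag), value))
    return pairs

def to_row(pairs, columns):
    """Project the pairs onto a fixed column list.

    Tags missing from the message become None; tags outside the
    column list (for example proprietary ones) are dropped. If a tag
    repeats, the first occurrence wins, so repeating groups would
    need extra handling that this sketch does not attempt.
    """
    seen = {}
    for tag, value in pairs:
        seen.setdefault(tag, value)
    return [seen.get(tag) for tag in columns]

# Simplified sample: 35=MsgType, 55=Symbol, 38=OrderQty, 44=Price.
raw = SOH.join(
    ["8=FIX.4.2", "35=D", "55=IBM", "38=100", "44=125.50", "5001=desk-7"]
) + SOH

pairs = parse_fix(raw)
row = to_row(pairs, [35, 55, 38, 44, 54])  # 54=Side is absent here
print(row)
```

The message carries six fields but the row always has five columns, which is the variable-length-to-tabular step; hierarchy and version differences are exactly the parts a sketch like this leaves out.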
18. Example use cases
Call detail records (CDRs)
• Why Hadoop?
• CDRs form large data sets: every 7 seconds, every mobile phone in the region creates a record
• Desire to analyze behavior and location in order to personalize and optimize pricing and marketing
19. Example use cases
Call detail records (CDRs)
• Why is handling CDR data complex?
• Binary format
• ASN.1
• Meaningful tags
• Vendor variations
• Switch software updates
• Hierarchy
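As an illustration of why binary, hierarchical ASN.1 data needs a real parser, the following Python sketch walks a BER-style tag-length-value structure. It is a simplification under stated assumptions: single-byte tags and definite lengths only, and the sample bytes are a made-up record, not a real vendor CDR layout. The function names `read_tlv` and `walk` are hypothetical.

```python
def read_tlv(data, offset=0):
    """Read one tag-length-value element starting at offset.

    Returns (tag, constructed, value_bytes, next_offset).
    """
    tag = data[offset]
    if tag & 0x1F == 0x1F:
        raise ValueError("multi-byte tags are not handled in this sketch")
    constructed = bool(tag & 0x20)  # bit 6 set: value holds nested elements
    length = data[offset + 1]
    offset += 2
    if length == 0x80:
        raise ValueError("indefinite lengths are not handled in this sketch")
    if length & 0x80:  # long form: low 7 bits give the number of length bytes
        count = length & 0x7F
        length = int.from_bytes(data[offset:offset + count], "big")
        offset += count
    value = data[offset:offset + length]
    return tag, constructed, value, offset + length

def walk(data, depth=0):
    """Yield (depth, tag, value) for every element, recursing into containers."""
    offset = 0
    while offset < len(data):
        tag, constructed, value, offset = read_tlv(data, offset)
        if constructed:
            yield depth, tag, None
            yield from walk(value, depth + 1)
        else:
            yield depth, tag, value

# Made-up record: SEQUENCE { INTEGER 7, OCTET STRING 0x1234 }
record = bytes.fromhex("300702010704021234")
elements = list(walk(record))
for depth, tag, value in elements:
    print("  " * depth + hex(tag), value)
```

The walker recovers structure but not meaning: mapping each tag to a field name depends on the vendor's ASN.1 definition, which is where vendor variations and switch software updates come in.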
20. Example use cases
Proprietary logs
• Why Hadoop?
• Extremely large data sets
• Information is often split across multiple files
• We are not sure what we are looking for
21. Example use cases
Proprietary logs
• Why is handling proprietary logs complex?
• Often hierarchical data:
• flat files
• JSON
• XML
• Data logic and business/context logic
• Variations
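The hierarchical-to-tabular step can be sketched in Python for a JSON log line. This only illustrates the general idea and does not show HParser's implementation; the log layout, the field names, and the helpers `flatten` and `explode` are assumptions made up for the example.

```python
import json

def flatten(node, prefix=""):
    """Flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat

def explode(record, list_key):
    """Emit one flat row per element of record[list_key].

    Parent fields are repeated on every row, which is the usual way
    to make a one-to-many hierarchy fit a table.
    """
    parent = {k: v for k, v in record.items() if k != list_key}
    base = flatten(parent)
    rows = []
    for child in record.get(list_key, []):
        row = dict(base)
        row.update(flatten(child, list_key))
        rows.append(row)
    return rows

# Hypothetical log line: one session containing two events.
line = (
    '{"session": {"id": "s1", "user": "u42"}, '
    '"events": [{"type": "login", "ms": 12}, {"type": "search", "ms": 87}]}'
)
rows = explode(json.loads(line), "events")
for row in rows:
    print(row)
```

Each event becomes its own row with the session fields repeated, so the result can be loaded into a table and queried; deciding which list to explode is the business/context logic the slide refers to.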
22. Thank you
http://www.informatica.com/HParser