Nov 2011 HUG: HParser

Ronen Schwartz
VP Products, B2B Data Exchange BU, Informatica

November 2011



1
Hadoop
[Diagram: records -> map tasks (M) -> reduce task (R) -> results]

                             2
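
The diagram above can be imitated in a few lines of plain Python. This is only an in-memory sketch of the map, shuffle, and reduce stages, not the Hadoop API; the record layout (`symbol,quantity`) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_record(record):
    """Map step (M): emit (key, value) pairs for one input record."""
    symbol, quantity = record.split(",")
    yield symbol, int(quantity)


def reduce_values(key, values):
    """Reduce step (R): combine all values emitted for one key."""
    return key, sum(values)


def run_job(records):
    # Shuffle: group the mapper output by key before reducing.
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map_record(r) for r in records):
        grouped[key].append(value)
    return dict(reduce_values(k, v) for k, v in grouped.items())


# Hypothetical input records: "symbol,quantity".
records = ["IBM,100", "INFA,50", "IBM,25"]
results = run_job(records)
print(results)
```

The sketch assumes the input is already a clean list of records; the mappers do no parsing beyond a `split`.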
[Diagram: real-world data -> records -> Hadoop map tasks (M) -> reduce task (R) -> results]

                                          3
[Diagram: real-world data -> records -> Hadoop map tasks (M) -> reduce task (R) -> results; label: 80%]

                                          5
HParser UI
  - any format
  - any complexity
  - easily
  - in Map Reduce
[Diagram: real-world data -> records -> Hadoop map tasks (M) -> reduce task (R) -> results; label: 80%]

                                              6
HParser UI
  - any format
  - any complexity
  - easily
  - in Map Reduce
[Diagram: real-world data -> records -> Hadoop map tasks (M) -> reduce task (R) -> results; labels: 80% and, next to HParser UI, 5%]

                                              7
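Demo
[Diagram: Construction (Windows): in the HParser UI, binary records go in and text records come out, producing a transform definition. Execution (Linux): the transform definition is applied to the input in Map, followed by Reduce, to produce the output.]
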
                                                        8
Real-world Data

[Diagram: the sources below feed HParser, which produces records]
•  Flat files
•  Logs
•  XML, JSON
•  Industry standards, e.g. FIX, SWIFT, X12, ASN.1
•  Documents, e.g. PDF, Excel

                                                    9
DEMO




10
Informatica HParser
Tackling Diversity of Big Data
The broadest coverage for Big Data
  - Flat Files & Documents
  - XML
  - Industry Standards
  - Interaction data
  - social network, device/sensor, scientific

[Overlapping notes text on this slide did not extract cleanly. The legible fragments describe the DT engine as a thread-safe, re-entrant shared library that can be embedded in a calling application, invoked through APIs for Java, C++, C and .NET or as a web service, and that reads and writes files or memory buffers. Transformations are developed in the Studio and deployed to a service repository (Svc Repository).]

Productivity
•  Visual parsing environment
•  Predefined translations

Any DI/BI architecture
  PIG, EDW, MDM

                                                                                                                                   11
Universal Data Transformation
Data Formats - Subset of Supported Data Formats

UNSTRUCTURED: Microsoft Word, Microsoft Excel, PDF, PowerPoint, ASCII reports, HTML, EBCDIC, Custom binaries, Flat files, RPG, ANSI
SEMI STRUCTURED: HL7, SWIFT, AL3, HIPAA, EDI-X12, EDI-Fact, FIX, NACHA, ASTM, Cargo IMP, COBOL, PL1, UCS, WINS, VICS, ASN.1
XML/JSON: ACORD XML, LegalXML, IFX, cXML, ebXML, HL7 V3.0, RosettaNet, ISO 20022, xBRL, Other
PRINT STREAMS: AFP, PostScript

                                                     12
Universal Data Transformation
Productivity: Data Transformation Studio




                                           13
Universal Data Transformation
Productivity: Data Transformation Studio

Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI – V2.0 Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO
Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML
Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC
B2B Standards: UN/EDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI
Other: IATA-PADIS, PLMXML, NEIM

•  Out of the box transformations for all messages in all versions
•  Updates and new versions delivered from Informatica
•  Enhanced validations
•  Definition is done using business (industry) terminology and definitions
•  Easy example-based visual enhancements and edits

                                                                                  14
HParser – How Does It Work?
hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt

                                                                     HDFS




1.  Develop a DT transformation
2.  Deploy the transformation
3.  Run HParser to produce
    tabular data
4.  Analyze the data with HIVE / PIG /
    MapReduce / Other


                                                                            15
Example use cases
 Trade data




•  Why Hadoop?
  •  Trade data represents extremely large data sets
  •  We are not sure which trade patterns we would like to
     investigate
  •  Compare with other large data sets: Bloomberg, Reuters, NYSE


                                                                  16
Example use cases
  Trade data




•  Why is handling FIX data complex?
  •  Variable length
  •  Name-value pairs
  •  Meaningful tags
  •  Hierarchy
  •  Variations
  •  Proprietary tags
  •  Yearly releases
  •  FIXML (the XML version)
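
To make the name-value structure concrete, here is a minimal hand-written sketch (not HParser, and not a full FIX engine) that splits a FIX tag=value message on the standard SOH delimiter and projects a few well-known tags into a tab-delimited record. The choice of columns and the abbreviated sample message (it omits BodyLength and CheckSum) are assumptions for illustration.

```python
SOH = "\x01"  # standard FIX field delimiter

# Hypothetical choice of columns: standard FIX tag numbers and their names.
COLUMNS = {
    "35": "MsgType",
    "55": "Symbol",
    "54": "Side",
    "38": "OrderQty",
    "44": "Price",
}


def parse_fix(message):
    """Split a FIX tag=value message into a list of (tag, value) pairs."""
    pairs = []
    for field in message.strip(SOH).split(SOH):
        tag, _, value = field.partition("=")
        pairs.append((tag, value))
    return pairs


def to_record(message, columns=COLUMNS):
    """Project selected tags into one tab-delimited row; missing tags become ''."""
    # Note: building a dict keeps only the last value of a repeated tag,
    # so repeating groups (the hierarchy mentioned above) are lost here.
    fields = dict(parse_fix(message))
    return "\t".join(fields.get(tag, "") for tag in columns)


# Abbreviated sample: a new order (35=D) to buy (54=1) 100 INFA at 42.50.
sample = SOH.join(["8=FIX.4.2", "35=D", "55=INFA", "54=1", "38=100", "44=42.50"]) + SOH
print(to_record(sample))
```

Even this small sketch runs into the issues listed above: fields are variable length, tags only carry meaning through a dictionary, and a flat dictionary cannot represent repeating groups or proprietary tags without extra work.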


                                                    17
Example use cases
 Call detail records (CDRs)




•  Why Hadoop?
  •  CDRs are large data sets: every 7 seconds, every mobile phone
     in the region creates a record
  •  Desire to analyze behavior and location to personalize and
     optimize pricing and marketing


                                                                18
Example use cases
  Call detail records (CDRs)




•  Why is handling CDR data complex?
  •  Binary format
  •  ASN.1
  •  Meaningful tags
  •  Vendor variations
  •  Switch software updates
  •  Hierarchy
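
As a rough illustration of why binary ASN.1 data is harder to handle than text, the sketch below decodes generic BER tag-length-value elements and flattens nested ones into rows. It is an assumption-laden sketch: it handles only single-byte tags and definite lengths, and it knows nothing about any vendor's actual CDR schema, which is what gives the tags their meaning.

```python
def read_tlv(data, offset=0):
    """Decode one BER tag-length-value element starting at offset.

    Returns (tag, constructed, value_bytes, next_offset).
    Only single-byte tags and definite lengths are handled.
    """
    tag = data[offset]
    if tag & 0x1F == 0x1F:
        raise ValueError("multi-byte tags are not handled in this sketch")
    length = data[offset + 1]
    offset += 2
    if length == 0x80:
        raise ValueError("indefinite lengths are not handled in this sketch")
    if length & 0x80:  # long form: low 7 bits give the number of length bytes
        count = length & 0x7F
        length = int.from_bytes(data[offset:offset + count], "big")
        offset += count
    value = data[offset:offset + length]
    return tag, bool(tag & 0x20), value, offset + length


def walk(data, depth=0):
    """Flatten nested elements into (depth, tag, value) rows."""
    rows, offset = [], 0
    while offset < len(data):
        tag, constructed, value, offset = read_tlv(data, offset)
        if constructed:
            # Constructed elements contain further TLV elements.
            rows.append((depth, tag, None))
            rows.extend(walk(value, depth + 1))
        else:
            rows.append((depth, tag, value))
    return rows


# Hand-built sample: SEQUENCE { INTEGER 7, OCTET STRING "hi" }
sample = bytes([0x30, 0x07, 0x02, 0x01, 0x07, 0x04, 0x02, 0x68, 0x69])
for row in walk(sample):
    print(row)
```

Turning the raw rows into named columns still requires the schema for the specific switch vendor and software version, which is where the variations listed above come in.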



                                                   19
Example use cases
Proprietary logs

                   •  Why Hadoop?
                     •  Extremely large data sets
                     •  Information is often split
                        across multiple files
                     •  Not sure what we are
                        looking for




                                                     20
Example use cases
Proprietary logs
                    •  Why is handling
                       proprietary logs
                       complex?
                       •  Often hierarchical data:
                          •  flat files
                          •  JSON
                          •  XML
                       •  Data logic and business/
                          context logic
                       •  Variations
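
One way to picture the hierarchy problem for JSON logs is a small flattener that turns a nested log entry into a single row of path/value columns. This is a generic sketch, not how HParser works; the sample log line and the dotted-path naming are assumptions for illustration.

```python
import json


def flatten(node, prefix=""):
    """Yield (path, value) pairs for every leaf in a nested JSON value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}{key}.")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{prefix}{index}.")
    else:
        # Leaf value: drop the trailing dot from the accumulated path.
        yield prefix.rstrip("."), node


# Hypothetical log line with nested objects and a list.
line = '{"ts": "2011-11-01T10:00:00", "user": {"id": 7, "tags": ["a", "b"]}}'
row = dict(flatten(json.loads(line)))
print(row)
```

The flattening itself is mechanical; deciding which paths matter and how variations between log versions map to the same columns is the business and context logic mentioned above.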




                                                          21
Thank you

     http://www.informatica.com/HParser




22

The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Dernier

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Dernier (20)

"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

Nov 2011 HUG: HParser

  • 1. Ronen Schwartz, VP Products, B2B Data Exchange BU, Informatica. November 2011
  • 2. [Diagram] Hadoop: records feed three Map tasks (M), which feed one Reduce task (R), which produces results
  • 3. [Diagram] Real-world data becomes records, which enter the same Hadoop Map/Reduce flow and produce results
  • 4. [Diagram] Same as slide 3
  • 5. [Diagram] Same as slide 3, with the label "80%" under the real-world data stage
  • 6. [Diagram] Same as slide 5, with HParser UI added between real-world data and records: any format, any complexity, easily, in Map Reduce
  • 7. [Diagram] Same as slide 6, with the label "5%" next to HParser UI (the "80%" label remains)
  • 8. Demo. Construction (Windows): HParser UI takes binary records in and text records out, and produces a transform definition. Execution (Linux): input, Map, Reduce, output
  • 9. Real-world Data (converted to records by HParser): Flat files • Logs • XML, JSON • Industry standards, e.g. FIX, SWIFT, X12, ASN.1 • Documents, e.g. PDF, Excel
  • 11. Informatica HParser: Tackling Diversity of Big Data. The broadest coverage for Big Data: Flat Files & XML • Industry Standards • Documents • Interaction data (social network, etc.) • Device/sensor and scientific data. Productivity: Visual parsing environment • Predefined translations. Any DI/BI architecture: PIG, EDW, MDM. [Overlapping notes text about the Data Transformation (DT) engine is garbled in the extraction; the legible fragments mention a shared-library engine that is thread-safe and re-entrant, invocation from PowerCenter, from the command line, and through APIs (C, C++, Java, .NET, web services), and deployment of transformations to a service repository.]
  • 12. Universal Data Transformation. Data Formats: Subset of Supported Data Formats. UNSTRUCTURED: Microsoft Word, Microsoft Excel, PDF, PowerPoint, ASCII reports, HTML, EBCDIC, Custom binaries, Flat files, Print streams (AFP, PostScript) • SEMI STRUCTURED: HL7, SWIFT, AL3, HIPAA, EDI–X12, EDI-Fact, FIX, NACHA, ASTM, Cargo IMP, COBOL, RPG, PL1, ANSI, UCS WINS, VICS, ASN.1 • XML/JSON: ACORD XML, LegalXML, IFX, cXML, ebXML, HL7 V3.0, RosettaNet, ISO 20022, xBRL, Other
  • 13. Universal Data Transformation. Productivity: Data Transformation Studio
  • 14. Universal Data Transformation. Productivity: Data Transformation Studio. Financial: SWIFT MT, SWIFT MX, NACHA, FIX, Telekurs, FpML, BAI – V2.0 Lockbox, CREST DEX, IFX, TWIST, UNIFI (ISO 20022), SEPA, FIXML, MISMO • Insurance: DTCC-NSCC, ACORD-AL3, ACORD XML • Healthcare: HL7, HL7 V3, HIPAA, NCPDP, CDISC • B2B Standards: UNEDIFACT, EDI-X12, EDI ARR, EDI UCS+WINS, EDI VICS, RosettaNet, OAGI • Other: IATA-PADIS, PLMXML, NEIM. Callouts: Out of the box transformations for all messages in all versions • Easy example based visual enhancements and edits • Updates and new versions delivered from Informatica • Definition is done using business (industry) terminology and definitions • Enhanced validations
  • 15. HParser – How Does It Work? Command shown: hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt (input read from HDFS). Steps: 1. Develop a DT transformation 2. Deploy the transformation 3. Run HParser to produce tabular data 4. Analyze the data with Hive / Pig / MapReduce / other tools
  • 16. Example use cases: Trade data. Why Hadoop? • Trade data represents extremely large data sets • We are not sure which trade patterns we would like to investigate • Compare to other large data sets: Bloomberg, Reuters, NYSE
  • 17. Example use cases: Trade data. Why is handling FIX data complex? • Variable length • Variations • Name-value pairs • Proprietary tags • Meaningful tags • Yearly releases • Hierarchy • FIXML, the XML version
  • 18. Example use cases: Call detail records. Why Hadoop? • CDRs are large data sets: every 7 seconds, every mobile phone in the region creates a record • Desire to analyze behavior and location to personalize and optimize pricing and marketing
  • 19. Example use cases: Call detail records. Why is handling CDR data complex? • Binary format • Vendor variations • ASN.1 • Switch software updates • Meaningful tags • Hierarchy
  • 20. Example use cases: Proprietary logs. Why Hadoop? • Extremely large data sets • Information is often split across multiple files • We are not sure what we are looking for
  • 21. Example use cases: Proprietary logs. Why is handling proprietary logs complex? • Often hierarchical data: flat files, JSON, XML • Data logic and business/context logic • Variations
  • 22. Thank you. http://www.informatica.com/HParser
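
Slides 15 and 17 describe turning FIX messages (variable-length name-value pairs, some with proprietary tags) into tabular data for Hive or Pig. The sketch below illustrates that idea in plain Python. It is not HParser or its API. It assumes the standard FIX tag=value encoding with the SOH (0x01) delimiter; the function names and the chosen columns are hypothetical, and repeating groups (the "Hierarchy" point on slide 17) are not handled.

```python
SOH = "\x01"  # standard FIX field delimiter


def parse_fix(message, delimiter=SOH):
    """Split a FIX tag=value message into an ordered list of (tag, value) pairs."""
    fields = []
    for part in message.split(delimiter):
        if not part:
            continue  # skip the empty piece after the trailing delimiter
        tag, sep, value = part.partition("=")
        if not sep or not tag:
            raise ValueError(f"malformed FIX field: {part!r}")
        fields.append((tag, value))
    return fields


def to_row(fields, columns):
    """Project parsed fields onto a fixed column list; missing tags become ''."""
    by_tag = dict(fields)  # if a tag repeats, the last value wins
    return [by_tag.get(tag, "") for tag in columns]


def to_tsv_line(message, columns):
    """Render one message as a tab-separated line, a shape Hive or Pig can load."""
    return "\t".join(to_row(parse_fix(message), columns))


if __name__ == "__main__":
    # 35=MsgType, 55=Symbol, 38=OrderQty, 44=Price; 5001 stands in for a proprietary tag
    msg = "8=FIX.4.2\x0135=D\x0155=IBM\x0138=100\x0144=125.5\x015001=custom\x01"
    print(to_tsv_line(msg, ["35", "55", "38", "44"]))
```

A fixed column list keeps every output row the same width even when messages vary in length, which is what makes the result tabular; unknown or proprietary tags are simply ignored unless they are listed as columns.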