DATA ANALYSIS USING DATA FLUX
FROM: SUNIL PAI
TYPICAL USAGE - CUSTOMER DATA OPERATIONS
• Data De-Duping
• Data Standardization
• Data Analysis and Data Profiling
• Data Consolidation from various sources
• Comparing multiple data sets as per predefined parameters
• Insert Data into Target Databases
• Match-at-a-Glance Reports for various New Acquisitions
DF NODE - DATA INPUTS
DF can use various input sources, such as relational databases (using queries), Excel files, Access files, and text files.
These sources are connected via ODBC.
Examples: A query is entered in the SQL Query node after selecting a database or Access file in the node properties.
For Excel: the area to be read needs to be defined as a named range, using Name Manager under the Formulas tab in the Excel sheet. For Excel sheets, the Data Source Input node is used.
DF NODE - DATA OUTPUTS
With DF, a job or result output can be written to an Excel file, an Access file, a text file, or a relational database such as Oracle or SQL Server.
DF uses the Insert, Update, Target, and Output utilities for the data output stage.
Examples: The output result can be inserted directly into a database table by using the Data Target (Insert) node.
Output can also be written to a text file via the Text File Output node.
DF NODE – QUALITY
• Standardization
- dfPower Architect's Standardization node is used to make similar items the same.
- The available standardization definitions include Name, Address, Organization, Zip, Phone, Email Address, Country, State, Non-Alphanumeric Remover, Numeric Remover, Alphanumeric Remover, Space Remover, Quotation Remover, etc.
- Various schemas can also be selected; these are defined in the DataFlux QKB (Quality Knowledge Base).
- For example: using full company names instead of initials ("International Business Machines" vs. "IBM").
DF NODE – QUALITY
• Standardization (More Examples) - Addresses
1 Comcast Center to 1 Comcast Ctr
10 Glenlake Pkwy north east to 10 Glenlake Pkwy NE
"North Dakota" vs. "ND"
United States vs. USA
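The address examples above can be sketched as a simple phrase lookup. This is only an illustration under stated assumptions: the table `ADDRESS_SCHEME` and the function `standardize_address` are hypothetical names, and a real QKB standardization definition is far more extensive than a four-entry dictionary.

```python
import re

# Hypothetical replacement table built from the slide's examples only.
ADDRESS_SCHEME = {
    "center": "Ctr",
    "north east": "NE",
    "north dakota": "ND",
    "united states": "USA",
}

def standardize_address(value):
    """Replace each known phrase with its standard form, ignoring case."""
    result = value
    for phrase, standard in ADDRESS_SCHEME.items():
        # Word boundaries keep "center" from matching inside longer words.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, standard, result, flags=re.IGNORECASE)
    return result

print(standardize_address("1 Comcast Center"))
print(standardize_address("10 Glenlake Pkwy north east"))
```

A lookup of this kind only covers phrases it already knows, which is why standardization is usually driven by a maintained knowledge base rather than ad hoc rules.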
DF NODE – QUALITY
• Parsing
- dfPower Architect's Parsing node is a simple but intelligent tool for separating multi-part field values into multiple single-part fields.
For example, if you have a Name field that includes the value "Mr. Igor Bela Bonski III, Esq.", you can use parsing to create six separate fields:
Name Prefix: "Mr."
Given Name: "Igor"
Middle Name: "Bela"
Family Name: "Bonski"
Name Suffix: "III"
Name Appendage: "Esq."
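To make the six-field split concrete, here is a minimal token-based sketch. It is not the Parsing node's actual algorithm; the vocabularies `PREFIXES`, `SUFFIXES`, and `APPENDAGES` and the function `parse_name` are assumptions chosen to cover the slide's example.

```python
# Hypothetical vocabularies; a real parse definition knows many more tokens.
PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr.", "ii", "iii", "iv"}
APPENDAGES = {"esq.", "phd", "md"}

def parse_name(full_name):
    """Split a full name into six single-part fields."""
    tokens = [t.strip(",") for t in full_name.split()]
    tokens = [t for t in tokens if t]
    parsed = {
        "Name Prefix": "", "Given Name": "", "Middle Name": "",
        "Family Name": "", "Name Suffix": "", "Name Appendage": "",
    }
    # Peel recognised tokens off the ends first, then split what remains.
    if tokens and tokens[0].lower() in PREFIXES:
        parsed["Name Prefix"] = tokens.pop(0)
    if tokens and tokens[-1].lower() in APPENDAGES:
        parsed["Name Appendage"] = tokens.pop()
    if tokens and tokens[-1].lower() in SUFFIXES:
        parsed["Name Suffix"] = tokens.pop()
    if tokens:
        parsed["Given Name"] = tokens.pop(0)
    if tokens:
        parsed["Family Name"] = tokens.pop()
    parsed["Middle Name"] = " ".join(tokens)
    return parsed

print(parse_name("Mr. Igor Bela Bonski III, Esq."))
```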
DF NODE – INTEGRATION
• Match Codes
dfPower Architect's Match Codes node is used to identify duplicate records in your data. This step creates match codes that evaluate the quantity of duplicate fields in your data and eliminate the extra fields.
Match code sensitivity can be set from 50% (lowest) to 100% (exact), and various schemas can be selected.
Field Name | Definition | Sensitivity
AccountName | Business Title | 85%
Address_Line1 | Address/AddressLong | 85%
City | City | Exact-All, Exact-10 characters
Country | Country | Exact-All, Exact-10 characters
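The idea of matching at a chosen sensitivity can be approximated with a short sketch. This is an assumption-laden stand-in, not the DataFlux match code algorithm: `match_key`, `is_match`, and `NOISE_WORDS` are hypothetical names, and the standard library's `difflib.SequenceMatcher` similarity ratio is swapped in for the product's own match codes.

```python
import re
from difflib import SequenceMatcher

# Hypothetical noise words dropped before comparison.
NOISE_WORDS = {"inc", "incorporated", "llc", "corp", "corporation", "company", "co"}

def match_key(value):
    """Lowercase, strip punctuation and noise words, and remove spaces."""
    words = re.sub(r"[^a-z0-9 ]", " ", value.lower()).split()
    return "".join(w for w in words if w not in NOISE_WORDS)

def is_match(a, b, sensitivity=0.85):
    """Treat two values as duplicates if their keys are similar enough."""
    ratio = SequenceMatcher(None, match_key(a), match_key(b)).ratio()
    return ratio >= sensitivity

print(is_match("Textron Inc.", "Textron Incorporated"))
print(is_match("Textron Inc.", "Eaton Corporation"))
```

Raising `sensitivity` toward 1.0 makes matching stricter, and lowering it admits looser matches, which mirrors the 50% to 100% range described above.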
DF NODE – INTEGRATION
• Clustering
dfPower Architect's Data Clustering node uses clustering functionality to group matching duplicates, or sets of unique records, according to the conditions defined. See the cluster numbers in the example below.
Cluster | AccountName | AccountAddress1 | MatchCriteria
7231 | NewJerseyManufacturersInsuranceCompany | 301SullivanWay | ExactCompanyName+Address-1
7231 | NewJerseyManufacturersInsuranceCompany | 301SullivanWay | ExactCompanyName+Address-1
7231 | NewJerseyManufacturersInsuranceCompany | 301SullivanWay | ExactCompanyName+Address-1
7663 | Metlife,Incorporated | 27-01QueensPlzN | ExactCompanyName+Address-1
7663 | Metlife,Incorporated | 27-01QueensPlzN | ExactCompanyName+Address-1
7791 | EatonCorporation | 34899CurtisBlvd | ExactCompanyName+Address-1
7791 | EatonCorporation | 34899CurtisBlvd | ExactCompanyName+Address-1
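The grouping shown in the table can be sketched as assigning one cluster number per distinct key. This is an illustrative sketch only: `cluster_records` is a hypothetical function, the key is an exact (case-insensitive) match on the chosen fields, and cluster numbers start at 1 rather than reproducing the values in the slide.

```python
def cluster_records(records, key_fields):
    """Give every record with the same key-field values the same cluster number."""
    cluster_ids = {}
    clustered = []
    for rec in records:
        key = tuple(rec[f].strip().lower() for f in key_fields)
        # A new key receives the next unused cluster number.
        cluster_id = cluster_ids.setdefault(key, len(cluster_ids) + 1)
        clustered.append({**rec, "Cluster": cluster_id})
    return clustered

rows = [
    {"AccountName": "EatonCorporation", "AccountAddress1": "34899CurtisBlvd"},
    {"AccountName": "Metlife,Incorporated", "AccountAddress1": "27-01QueensPlzN"},
    {"AccountName": "EatonCorporation", "AccountAddress1": "34899CurtisBlvd"},
]
for row in cluster_records(rows, ["AccountName", "AccountAddress1"]):
    print(row["Cluster"], row["AccountName"])
```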
DF NODE – INTEGRATION
• Surviving Record Identification
- dfPower Architect's Surviving Record Identification (SRI) node examines clustered data and determines a surviving record for each cluster. This process lets you eliminate duplicate information in a data source. The surviving record is identified using one or more user-configurable record rules. The user may also enter field rules to perform automated field-level edits of the surviving record's data during SRI processing. The SRI step can be configured to keep all existing data, marking the surviving records with a flag or primary key value, or it can remove all data except for that associated with the surviving records.
Example: Suppose you have a set of duplicate accounts and addresses in the system, and you need to keep one distinct record out of those duplicates, but that record should have a proper phone number. You can use the SRI node and define a selection rule in the properties of the SRI node. See the example on the next slide.
DF NODE – INTEGRATION
• Surviving Record Identification
Example (continued): See the Cluster column and the SurvivingRecord column below. Each cluster has only one surviving record.
Cluster | AccountName | AccountAddress1 | Phone | SurvivingRecord
7231 | NewJerseyManufacturersInsuranceCompany | 301SullivanWay | (609) 883-1300 | TRUE
7231 | NewJerseyManufacturersInsuranceCompany | 301SullivanWay | Null | FALSE
7231 | NewJerseyManufacturersInsuranceCompany | 301SullivanWay | 987 | FALSE
7663 | Metlife,Incorporated | 27-01QueensPlzN | 1-800-638-5000 | TRUE
7663 | Metlife,Incorporated | 27-01QueensPlzN | Null | FALSE
7791 | EatonCorporation | 34899CurtisBlvd | 1-900-735-5674 | TRUE
7791 | EatonCorporation | 34899CurtisBlvd | Null | FALSE
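A record rule of the kind used in this example can be sketched as follows. The rule here ("prefer the record whose phone has at least ten digits") is an assumption made for illustration, as are the names `phone_score` and `flag_survivors`; the SRI node's own rules are configured in its properties and may behave differently.

```python
import re

def phone_score(phone):
    """Score a phone value: digit count if it has 10 or more digits, else 0."""
    digits = re.sub(r"\D", "", phone or "")
    return len(digits) if len(digits) >= 10 else 0

def flag_survivors(records):
    """Mark exactly one record per cluster as the surviving record."""
    best = {}  # cluster -> (score, index of the best record seen so far)
    for idx, rec in enumerate(records):
        score = phone_score(rec.get("Phone"))
        current = best.get(rec["Cluster"])
        if current is None or score > current[0]:
            best[rec["Cluster"]] = (score, idx)
    survivors = {idx for _, idx in best.values()}
    return [{**rec, "SurvivingRecord": i in survivors}
            for i, rec in enumerate(records)]

rows = [
    {"Cluster": 7231, "Phone": None},
    {"Cluster": 7231, "Phone": "987"},
    {"Cluster": 7231, "Phone": "(609) 883-1300"},
]
for row in flag_survivors(rows):
    print(row["Phone"], row["SurvivingRecord"])
```

This sketch keeps all records and only adds a flag, which corresponds to the "keep all existing data" configuration described on the previous slide.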
DF MATCH EXAMPLES
• Standardization and match codes combined in a job flow give remarkable results, as shown below.
Exact or 100% Match Results
Input-COMPANYNAME | Matched/OutputCompanyName | ADDRESS1(Input) | ADDR(Matched)
NetscapeCommunicationsCorporation | NetscapeCommunicationsCorporation | 501EMiddlefieldRd | 501EMiddlefieldRd
Alston&BirdLLP | Alston&BirdLLP | 1201WPeachtreeSt | 1201WPeachtreeSt
GeorgiaPerimeterCollege | GeorgiaPerimeterCollege | 3251PanthersvilleRd | 3251PanthersvilleRd
CountyofOneida | CountyofOneida | 800ParkAve | 800ParkAve
EliLillyandCompany | EliLillyandCompany | POBox6034 | POBox6034
ActuateCorporation | ActuateCorporation | 2207BridgepointePkwy.Ste.500 | 2207BridgepointePkwySte500
ShrinersHospitalsForChildren | ShrinersHospitalsForChildren | 3551NBroadSt | 3551NBroadSt
CatholicHealthInitiatives | CatholicHealthInitiatives | 440CreameryWay | 440CreameryWay
ElPasoElectricCompany | ElPasoElectricCompany | 123WMillsAve | 123WMillsAve
DF MATCH EXAMPLES
• 75% Match Results
Input-Name | MatchedName | Input-ADDRESS | Matched-Address
ArizonaStateUniversity | ArizonaStateUniversity | UniversityDrandalsoMillAve | UniversityDrive&MillAvenue
CybernetSoftwareSystems,Inc. | CybernetSoftwareSystemsIncorporated | 3031TischWaySte.1002 | 3031TischWay
VertrueInc. | VertrueIncorporated | 20GloverAve. | 20GloverAve
DollarBank,FSB | DollarBank | 3GatewayCenter | 3GatewayCenter8East
TextronInc. | TextronIncorporated | 40WestminsterStreet | 40WestminsterSt
ArcherTechnologies | ArcherTechnologiesLLC | 13200Metcalf,Suite300 | 13200MetcalfAve
BMWFinancialServicesNA | BMWFinancialServicesNAIncorporated | 5515ParkCenterCircle | 5515ParkcenterCir
GreatAmericanFinancialResources,Inc. | GreatAmericanFinancialResourcesIncorporated | 250E.5thSt. | 250E5thSt
CecEntertainment,Inc. | CECEntertainmentIncorporated | 4441WAirportFreeway | 4441WAirportFwy
DF MATCH EXAMPLES
• Loose and Tight Contact Matches (see the email addresses)
100% Matches
EMAILADDRESS(InputSource) | EMAIL_ADDRESS(Matched) | NAME(InputSource) | FIRST_NAME-MatchedOutput
adam.fenech@priorityhealth.com | adam.fenech@priority-health.com | AdamFenech | AdamFenech
braddpiontek@alliant-energy.com | braddpiontek@alliantenergy.com | BraddPiontek | BraddPiontek

EMAILADDRESS-Input | CONTACT_EMAIL_ADDRESS-Matched | NAME-Input | CONTACT_FIRST_NAME-Matched
brent.alexander@cingular.com | brentalexander@cingular.com | BrentAlexander | BrentAlexander
chris.sims@fiserv.com | chris.sims@fiserv.com | ChrisSims | ChrisSims
DF NODE – UTILITIES
• Data Joining Node
This node is used to join data from various sources, such as two different databases, Excel files, or Access files.
dfPower Architect's Data Joining job flow step is based on the SQL concept of JOIN. You can use Data Joining to combine two data sets in an intelligent way, so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
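The JOIN concept referred to above can be sketched in a few lines. This is a simplified illustration, not the node's implementation: the function `join` and its `how` parameter are hypothetical, and only the inner and left variants are shown (records of both sets, or of the first set, as the basis of the result).

```python
def join(left, right, key, how="inner"):
    """Combine two lists of dict records on a shared key field."""
    # Index the right-hand records by key so each lookup is a dictionary access.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for lrow in left:
        matches = index.get(lrow[key], [])
        for rrow in matches:
            joined.append({**lrow, **rrow})
        # A left join keeps unmatched left-hand records as they are.
        if not matches and how == "left":
            joined.append(dict(lrow))
    return joined

accounts = [{"id": 1, "name": "EatonCorporation"}, {"id": 2, "name": "DollarBank"}]
phones = [{"id": 1, "phone": "1-900-735-5674"}]
print(join(accounts, phones, "id"))
print(join(accounts, phones, "id", how="left"))
```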
DF NODE – UTILITIES
• SQL Lookup
SQL Lookup lets the user find rows in a database table that have one or more fields matching those in the job flow. It offers a clear performance advantage, especially with large databases, because the large database is not copied locally to the hard drive in order to perform the operation (as is the case with joins).
DF NODE – UTILITIES
• SQL Execute
This is a stand-alone node (no parents or children) that lets you construct and execute any valid SQL statement (or series of statements). It performs database-specific tasks before, after, or in between Architect job flows.
Examples: SQL statements such as UPDATE, DELETE, and COMMIT for a particular table can be used in this node.
DF NODE – UTILITIES
• Data Union
- dfPower Architect's Data Union node is based on the SQL concept of UNION. As with Data Joining, use the Data Union node to combine data from two data sets. Unlike Data Joining, however, Data Union does not perform an intelligent combination. Rather, Data Union simply adds the two data sets together; the resulting data set contains one record for each record in each of the original data sets.
Example: When data from two or more sheets, databases, or DF job flows needs to be combined, this node performs the task.
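For contrast with joining, the "simply adds the data sets together" behavior can be sketched as below. The function `union` is a hypothetical name, and filling missing columns with `None` is an assumption of this sketch, not documented node behavior.

```python
def union(*datasets):
    """Stack data sets: one output record per input record, no matching."""
    # Collect column names in first-seen order so every row has the same shape.
    columns = []
    for dataset in datasets:
        for row in dataset:
            for col in row:
                if col not in columns:
                    columns.append(col)
    return [{c: row.get(c) for c in columns}
            for dataset in datasets for row in dataset]

sheet_a = [{"name": "TextronInc.", "city": "Providence"}]
sheet_b = [{"name": "VertrueInc."}, {"name": "TextronInc.", "city": "Providence"}]
print(union(sheet_a, sheet_b))
```

Duplicates are kept, so the result always has as many records as the inputs combined.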
DF NODE – UTILITIES
• Branch
This step lets multiple children (up to 32) simultaneously access data from a single source. Depending on the step's configuration and the children's access patterns, data can be passed from the parent directly to each of the children, or it may be temporarily stored in memory and/or disk caches before being passed to the children.
In other words, it has one input and multiple outputs (maximum 32).
DF NODE – UTILITIES
• Concatenate
dfPower Architect's Concatenate node performs the opposite function of the Parse node. Rather than separating a single field into multiple fields, Concatenate combines one or more fields into a single field.
Example
Prefix: Mr; First Name: Rahul; Last Name: Jain
Concatenate output: Mr Rahul Jain
DF NODE – UTILITIES
• Expression
- Use dfPower Architect's Expression node to run a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio. The Expression language provides many statements, functions, and variables for manipulating data.
Example: creating a column Match_Criteria in the middle of a job flow. The syntax would be:
Expression: Match_Criteria = " "
Pre-Processing Expression: string Match_Criteria
DF NODE – UTILITIES
• Data Sorting
Use dfPower Architect's Data Sorting node to re-order your data set (ascending or descending) at any point in a job flow.
DF NODE – PROFILING
• Basic Statistics
- dfPower Architect's Basic Statistics node is used to calculate statistics about your data, such as value ranges, counts, or sums for any given field.
The Basic Statistics node is typically used on numerical rather than text fields. However, statistics such as Count, Missing, MAX, and MIN could be useful on any field type.
It can also be used in the middle of a job for fault finding, by checking the record counts at each step.
Example: basic statistics of a Siebel table
Statistic | Row_Id | Created | Created_By | Account Name | Partner Flag | Email Addr | Phone | CSN
Records | 267413 | 267413 | 267413 | 267413 | 267413 | 267413 | 267413 | 267413
Count | 267413 | 267413 | 267413 | 267413 | 267413 | 5 | 72552 | 181643
Null Count | 0 | 0 | 0 | 0 | 0 | 267408 | 194861 | 85770
Distinct | yes | yes | yes | yes | no | yes | yes | yes
Min 1 0-5200 1/1/1980 0:00 0-1 N dllee@pentasoft.co.kr ###iswrong 1
Max 1 O-2 9/9/2010 21:55 1-XVOET ültje GmbH Y tloughran@infopath.net xxxxxxxxx
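The count-style statistics in the table can be reproduced for a single field with a short sketch. The function `basic_statistics` is a hypothetical name, and treating `None` and empty strings as nulls is an assumption of this sketch.

```python
def basic_statistics(values):
    """Compute record, count, null-count, min and max figures for one field."""
    present = [v for v in values if v is not None and v != ""]
    return {
        "Records": len(values),
        "Count": len(present),
        "Null Count": len(values) - len(present),
        "Min": min(present) if present else None,
        "Max": max(present) if present else None,
    }

partner_flag = ["Y", "N", "N", None, "Y", ""]
print(basic_statistics(partner_flag))
```

Running the same function after each step of a job and comparing the Records and Count figures is one way to do the fault finding described above.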
DF NODE – PROFILING
• Pattern Analysis
dfPower Architect's Pattern Analysis node is used to generate a new field containing alphanumeric patterns that represent each value in a selected field. You can specify whether these patterns represent each character or each word (as separated by spaces) in a field.
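Character-level and word-level patterns can be illustrated as follows. The pattern symbols used here ("A" for an uppercase letter, "a" for a lowercase letter, "9" for a digit, "*" for a mixed word) are assumptions of this sketch, as are the function names `char_pattern` and `word_pattern`; the node's actual symbols may differ.

```python
def char_pattern(value):
    """Map each character to a pattern symbol; keep punctuation and spaces."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)

def word_pattern(value):
    """Map each space-separated word to a single pattern symbol."""
    symbols = []
    for word in value.split():
        if word.isdigit():
            symbols.append("9")
        elif word.isalpha():
            symbols.append("A")
        else:
            symbols.append("*")
    return " ".join(symbols)

print(char_pattern("(609) 883-1300"))
print(word_pattern("301 Sullivan Way"))
```

Grouping records by pattern makes malformed values, such as a three-digit phone number, stand out quickly.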
DF NODE – PROFILING
• Frequency Distribution
- dfPower Architect's Frequency Distribution node is used to calculate the number of occurrences of each unique value in a field.
For example, Frequency Distribution can determine how many customers in your customer database are in each of the 50 US states, the District of Columbia, and the 13 Canadian provinces and territories.
State | Count of Customers | % Total
CA | 19593 | 12
CO | 4041 | 2
CT | 2807 | 1
DC | 2555 | 1
DE | 746 | 0
FL | 7105 | 4
GA | 5198 | 3
GE | 1 | 0

GEO | GEO_count | GEO %
Americas | 187235 | 57
AsiaPacific | 30642 | 9
EMEA | 107412 | 33
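A frequency distribution of this shape can be computed with the standard library's `collections.Counter`. The function name `frequency_distribution` and the whole-number rounding of the percentage are choices made for this sketch.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (value, count, percent of total) rows, sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count, round(100 * count / total))
            for value, count in sorted(counts.items())]

states = ["CA", "CA", "CO", "GE"]
for row in frequency_distribution(states):
    print(row)
```

Rare values such as the single "GE" row in the table above are often data entry errors, which is what makes this profile useful before standardization.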
DF NODE – PROFILING
• Data Validation
- dfPower Architect's Data Validation node is used to analyze the content of data by setting validation conditions. These conditions create validation expressions that you can use to filter data for a more accurate view of that data.
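Validation conditions can be thought of as one predicate per field. The sketch below is illustrative only: the rule set `RULES`, the field names, and the function `validate` are hypothetical, and the email and phone checks are deliberately simple.

```python
import re

# Hypothetical validation conditions, one per field.
RULES = {
    "Email Addr": lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "")),
    "Phone": lambda v: len(re.sub(r"\D", "", v or "")) >= 10,
}

def validate(record, rules=RULES):
    """Return the names of the fields that fail their validation condition."""
    return [field for field, rule in rules.items() if not rule(record.get(field))]

row = {"Email Addr": "chris.sims@fiserv.com", "Phone": "###iswrong"}
print(validate(row))
```

Records for which `validate` returns an empty list pass every condition and can be filtered into the clean branch of a flow.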
DF NODE – ENRICHMENT
• Address Verification
- Use dfPower Architect's Address Verification (US/Canada/World) node to verify, correct, and enhance any addresses in your existing data (QKB). Address Verification (US/Canada/World) uses geographic information from various reference databases to match and standardize addresses. You can also use Address Verification (US/Canada) for proper casing and CASS/SERP compliance. Addresses are classified according to the result codes listed on the next slide, which indicate how valid each address is.
DF NODE – ENRICHMENT
• For US Addresses
Text Result Code | Numeric Result Code | Description
OK | 0 | Address was verified successfully.
PARSE | 11 | Error parsing address. Components of the address may be missing.
CITY | 12 | Could not locate city/state or ZIP in the USPS database. At least (city and state) or ZIP must be present in the input.
MULTI | 13 | Ambiguous address. There were two or more possible matches for this address with differing data.
NOMATCH | 14 | No matching address found in the USPS data.
OVER | 15 | One or more input strings is too long (maximum 100 characters).
• For Canada Addresses
Result Code Description
0 No error occurred
1 Internal error
2 Cannot load database
3 Invalid - unspecified reason
4 Invalid civic number
5 Invalid street
6 Invalid unit
7 Invalid delivery mode
8 Invalid delivery installation
9 Invalid city
10 Invalid province
11 Invalid postal code
12 Address is not Canadian
• Rest of World (Excluding US and Canada)
Result Code | Description
0 | Address correct as entered.
1 | Address corrected automatically.
2 | Address needs to be corrected, but could not
3 | Address needs to be corrected, but could not be determined automatically. There is a fair
4 | Address needs to be corrected, but could not be determined automatically. There is a small
DF NODE – MONITORING
• Data Monitoring
- The Data Monitoring node enables you to analyze data according to business rules you create using the Business Rule Manager. The business rules you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an email alert, when a condition is detected. By using the Data Monitoring node, you can insert these business rules in your job flow to analyze data at various points in the flow.

Contenu connexe

Tendances

The Database Environment Chapter 11
The Database Environment Chapter 11The Database Environment Chapter 11
The Database Environment Chapter 11Jeanie Arnoco
 
Etl - Extract Transform Load
Etl - Extract Transform LoadEtl - Extract Transform Load
Etl - Extract Transform LoadABDUL KHALIQ
 
Bank mangement system
Bank mangement systemBank mangement system
Bank mangement systemFaisalGhffar
 
Data modeling star schema
Data modeling star schemaData modeling star schema
Data modeling star schemaSayed Ahmed
 
Lecture 03 - The Data Warehouse and Design
Lecture 03 - The Data Warehouse and Design Lecture 03 - The Data Warehouse and Design
Lecture 03 - The Data Warehouse and Design phanleson
 
Chapter12 designing databases
Chapter12 designing databasesChapter12 designing databases
Chapter12 designing databasesDhani Ahmad
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional ModelingSunita Sahu
 
Sap business intelligence 4.0 report basic
Sap business intelligence 4.0   report basicSap business intelligence 4.0   report basic
Sap business intelligence 4.0 report basictovetrivel
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process Omid Vahdaty
 
Lesson 2 network database system
Lesson 2 network database systemLesson 2 network database system
Lesson 2 network database systemGiO Friginal
 
Introduction To Msbi By Yasir
Introduction To Msbi By YasirIntroduction To Msbi By Yasir
Introduction To Msbi By Yasiryasir873
 
Lecture 02 - The Data Warehouse Environment
Lecture 02 - The Data Warehouse Environment Lecture 02 - The Data Warehouse Environment
Lecture 02 - The Data Warehouse Environment phanleson
 
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2tovetrivel
 

Tendances (20)

The Database Environment Chapter 11
The Database Environment Chapter 11The Database Environment Chapter 11
The Database Environment Chapter 11
 
Data warehouse physical design
Data warehouse physical designData warehouse physical design
Data warehouse physical design
 
Richa_Profile
Richa_ProfileRicha_Profile
Richa_Profile
 
Etl - Extract Transform Load
Etl - Extract Transform LoadEtl - Extract Transform Load
Etl - Extract Transform Load
 
Bank mangement system
Bank mangement systemBank mangement system
Bank mangement system
 
Transaction
TransactionTransaction
Transaction
 
Data modeling star schema
Data modeling star schemaData modeling star schema
Data modeling star schema
 
Lecture 03 - The Data Warehouse and Design
Lecture 03 - The Data Warehouse and Design Lecture 03 - The Data Warehouse and Design
Lecture 03 - The Data Warehouse and Design
 
Chapter12 designing databases
Chapter12 designing databasesChapter12 designing databases
Chapter12 designing databases
 
Dimensional Modeling
Dimensional ModelingDimensional Modeling
Dimensional Modeling
 
Sap business intelligence 4.0 report basic
Sap business intelligence 4.0   report basicSap business intelligence 4.0   report basic
Sap business intelligence 4.0 report basic
 
Data warehouse logical design
Data warehouse logical designData warehouse logical design
Data warehouse logical design
 
Ranjitbanshpal1
Ranjitbanshpal1Ranjitbanshpal1
Ranjitbanshpal1
 
Introduction to ETL process
Introduction to ETL process Introduction to ETL process
Introduction to ETL process
 
Lesson 2 network database system
Lesson 2 network database systemLesson 2 network database system
Lesson 2 network database system
 
Introduction To Msbi By Yasir
Introduction To Msbi By YasirIntroduction To Msbi By Yasir
Introduction To Msbi By Yasir
 
Lecture 02 - The Data Warehouse Environment
Lecture 02 - The Data Warehouse Environment Lecture 02 - The Data Warehouse Environment
Lecture 02 - The Data Warehouse Environment
 
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
Day 9 __10_introduction_to_bi_enterprise_reporting_1___2
 
Final
FinalFinal
Final
 
D01 etl
D01 etlD01 etl
D01 etl
 

Similaire à Data Analysis using Data Flux

Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)Michael Rys
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresSteven Johnson
 
Dqs mds-matching 15042015
Dqs mds-matching 15042015Dqs mds-matching 15042015
Dqs mds-matching 15042015Neil Hambly
 
Intro to Database Design
Intro to Database DesignIntro to Database Design
Intro to Database DesignSondra Willhite
 
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Cynthia Saracco
 
CA_Plex_SupportForModernizingIBM_DB2_for_i
CA_Plex_SupportForModernizingIBM_DB2_for_iCA_Plex_SupportForModernizingIBM_DB2_for_i
CA_Plex_SupportForModernizingIBM_DB2_for_iGeorge Jeffcock
 
TSQL in SQL Server 2012
TSQL in SQL Server 2012TSQL in SQL Server 2012
TSQL in SQL Server 2012Eduardo Castro
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Databricks
 
Database@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenDatabase@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenTammy Bednar
 
MDI Training DB2 Course
MDI Training DB2 CourseMDI Training DB2 Course
MDI Training DB2 CourseMarcus Davage
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 
Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with PythonMartin Loetzsch
 
SQL Azure Dec 2010 Update
SQL Azure Dec 2010 UpdateSQL Azure Dec 2010 Update
SQL Azure Dec 2010 UpdateEric Nelson
 
SQL Azure Dec Update
SQL Azure Dec UpdateSQL Azure Dec Update
SQL Azure Dec UpdateEric Nelson
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data scienceShilpaKrishna6
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 

Similaire à Data Analysis using Data Flux (20)

Sql Server 2000
Sql Server 2000Sql Server 2000
Sql Server 2000
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)ADL/U-SQL Introduction (SQLBits 2016)
ADL/U-SQL Introduction (SQLBits 2016)
 
MIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome MeasuresMIS5101 WK10 Outcome Measures
MIS5101 WK10 Outcome Measures
 
Dqs mds-matching 15042015
Dqs mds-matching 15042015Dqs mds-matching 15042015
Dqs mds-matching 15042015
 
Intro to Database Design
Intro to Database DesignIntro to Database Design
Intro to Database Design
 
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
 
ETL
ETL ETL
ETL
 
CA_Plex_SupportForModernizingIBM_DB2_for_i
CA_Plex_SupportForModernizingIBM_DB2_for_iCA_Plex_SupportForModernizingIBM_DB2_for_i
CA_Plex_SupportForModernizingIBM_DB2_for_i
 
TSQL in SQL Server 2012
TSQL in SQL Server 2012TSQL in SQL Server 2012
TSQL in SQL Server 2012
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Database@Home : The Future is Data Driven
Database@Home : The Future is Data DrivenDatabase@Home : The Future is Data Driven
Database@Home : The Future is Data Driven
 
DP-900.pdf
DP-900.pdfDP-900.pdf
DP-900.pdf
 
MDI Training DB2 Course
MDI Training DB2 CourseMDI Training DB2 Course
MDI Training DB2 Course
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 
Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with Python
 
SQL Azure Dec 2010 Update
SQL Azure Dec 2010 UpdateSQL Azure Dec 2010 Update
SQL Azure Dec 2010 Update
 
SQL Azure Dec Update
SQL Azure Dec UpdateSQL Azure Dec Update
SQL Azure Dec Update
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 

Data Analysis using Data Flux

  • 1. DATA ANALYSIS USING DATA FLUX FROM-SUNIL PAI
  • 2. TYPICAL USAGE - CUSTOMER DATA OPERATIONS • Data De-Duping • Data Standardization • Data Analysis and Data Profiling • Data Consolidation from various sources • Comparing multiple data sets as per predefined parameters • Insert Data in to Target Data Bases • Match at the glance Reports for various New Acquisitions
  • 3. DF NODE - DATA INPUTS DF could use various Input sources such as Relational Databases (using queries), Excel files, Access Files, Text files This sources are connected Via ODBC Examples-A query is inserted in SQL Query Node .By selecting a database/Access file in the node properties For Excel-Area needs to be defined for selection by using Name manager under formula tab in excel sheet .For excel sheets Data Source Input node is used
  • 4. DF NODE - DATA OUTPUTS By using DF we can insert a Job/Result output in an Excel, Access ,Text, relational database like Oracle /Sql Server DF uses Insert/Update/Target/Output utilities for Data output stage Examples-The output result can be directly inserted into Database table by using Data Target Insert Node Output can also be taken in an text file via Text file output node
  • 5. DF NODE – QUALITY • Standardization  dfPower Architect's Standardization node is used to make similar items the same  The various definition of standardizations are Name, Address, Organization,Zip, Phone, email address ,country, State ,Non Alpha numeric remover, Numeric remover, Alpha Numeric remover ,space remover ,Quotation remover etc  Various Schemas can also be selected which can be defined in QKB of DataFlux  For Example-using full company names instead of initials ("International Business Machines" vs. "IBM"),
  • 6. DF NODE – QUALITY • Standardization (More Examples)-Addresses 1 Comcast Center to 1 Comcast Ctr 10 Glenlake Pkwy north east to 10 Glenlake Pkwy NE "North Dakota" vs. "ND“ United States vs USA
  • 7. DF NODE – QUALITY • Parsing  DF Power Architect's Parsing node is a simple but intelligent tool for separating multi-part field values into multiple, single-part fields. For example, if you have a Name field that includes the value "Mr. Igor Bela Bonski III, Esq.," you can use parsing to create six separate fields: Name Prefix: "Mr." Given Name: "Igor" Middle Name: "Bela" Family Name: "Bonski" Name Suffix: "III" Name Appendage: "Esq."
  • 8. DF NODE – INTEGRATION • Match Codes dfPower Architect's Match Codes is to identify duplicate records in your data. These steps create match codes, that evaluate the quantity of duplicate fields in your data and eliminate the extra fields. Match codes can be set from 50%(Lowest) to 100%(Exact) and various schemas can be selectedFieldName Defination Sensitivity AccountName BussinessTiTtle 85% Address_Line1 Address/AddressLong 85% City City Exact-All,Exact-10characters Country Country Exact-All,Exact-10characters
  • 9. DF NODE – INTEGRATION • Clustering DFPower Architect's Data Clustering node is used to employ the clustering functionality to group match duplicates or set of unique records as per conditions defined. See cluster numbers in given example belowCluster AccountName AccountAddress1 MatchCriteria 7231 NewJerseyManufacturersInsuranceCompany 301SullivanWay ExactCompanyName+Address-1 7231 NewJerseyManufacturersInsuranceCompany 301SullivanWay ExactCompanyName+Address-1 7231 NewJerseyManufacturersInsuranceCompany 301SullivanWay ExactCompanyName+Address-1 7663 Metlife,Incorporated 27-01QueensPlzN ExactCompanyName+Address-1 7663 Metlife,Incorporated 27-01QueensPlzN ExactCompanyName+Address-1 7791 EatonCorporation 34899CurtisBlvd ExactCompanyName+Address-1 7791 EatonCorporation 34899CurtisBlvd ExactCompanyName+Address-1
  • 10. DF NODE – INTEGRATION • Surviving Record Identification  DFPower Architect's Surviving Record Identification (SRI) node examines clustered data and determines a surviving record for each cluster. This process lets you eliminate duplicate information in a data source. The surviving record is identified using one or more user-configurable record rules. The user may also enter field rules to perform automated field-level edits of the surviving record's data during SRI processing. The SRI step can be configured to keep all existing data, marking the surviving records with a flag or primary key value, or it can remove all data except for that associated with the surviving records. Examples- Consider you have set of duplicate Accounts and addresses in the system and you need to keep one distinct record out of those duplicates but the record should have proper phone numbers in it. You can use SRI node and define rule for selection which can be done in properties of SRI Node. Please see the example given in the next slide
  • 11. DF NODE – INTEGRATION • Surviving Record Identification Examples (Continued) –Please see the cluster column and the Surviving record column given below. So each cluster has only one surviving record Cluster AccountName AccountAddress1 Phone SurvivingRecord 7231 NewJerseyManufacturersInsuranceCompany 301SullivanWay (609) 883-1300 TRUE 7231 NewJerseyManufacturersInsuranceCompany 301SullivanWay Null FALSE 7231 NewJerseyManufacturersInsuranceCompany 301SullivanWay 987 FALSE 7663 Metlife,Incorporated 27-01QueensPlzN 1-800-638-5000 TRUE 7663 Metlife,Incorporated 27-01QueensPlzN Null FALSE 7791 EatonCorporation 34899CurtisBlvd 1-900-735-5674 TRUE 7791 EatonCorporation 34899CurtisBlvd Null FALSE
  • 12. DF MATCH EXAMPLES • Standardization and Match codes combined in job flow gives Remarkable results as shown below Exact or 100% Match results Input-COMPANYNAME Matched/OutputCompanyName ADDRESS1(Input) ADDR(Matched) NetscapeCommunicationsCorporation NetscapeCommunicationsCorporation 501EMiddlefieldRd 501EMiddlefieldRd Alston&BirdLLP Alston&BirdLLP 1201WPeachtreeSt 1201WPeachtreeSt GeorgiaPerimeterCollege GeorgiaPerimeterCollege 3251PanthersvilleRd 3251PanthersvilleRd CountyofOneida CountyofOneida 800ParkAve 800ParkAve EliLillyandCompany EliLillyandCompany POBox6034 POBox6034 ActuateCorporation ActuateCorporation 2207BridgepointePkwy.Ste.500 2207BridgepointePkwySte500 ShrinersHospitalsForChildren ShrinersHospitalsForChildren 3551NBroadSt 3551NBroadSt CatholicHealthInitiatives CatholicHealthInitiatives 440CreameryWay 440CreameryWay ElPasoElectricCompany ElPasoElectricCompany 123WMillsAve 123WMillsAve
  • 13. DF MATCH EXAMPLES • 75% Match Results Input-Name MatchedName Input-ADDRESS Matched-Address ArizonaStateUniversity ArizonaStateUniversity UniversityDrandalsoMillAve UniversityDrive&MillAvenue CybernetSoftwareSystems,Inc. CybernetSoftwareSystemsIncorporated3031TischWaySte.1002 3031TischWay VertrueInc. VertrueIncorporated 20GloverAve. 20GloverAve DollarBank,FSB DollarBank 3GatewayCenter 3GatewayCenter8East TextronInc. TextronIncorporated 40WestminsterStreet 40WestminsterSt ArcherTechnologies ArcherTechnologiesLLC 13200Metcalf,Suite300 13200MetcalfAve BMWFinancialServicesNA BMWFinancialServicesNAIncorporated 5515ParkCenterCircle 5515ParkcenterCir GreatAmericanFinancialResources,Inc. GreatAmericanFinancialResourcesIncorporated250E.5thSt. 250E5thSt CecEntertainment,Inc. CECEntertainmentIncorporated 4441WAirportFreeway 4441WAirportFwy
  • 14. DF MATCH EXAMPLES • Loose and Tight Contact Matches (see the email addresses)
    100% Matches
    EMAILADDRESS (Input Source) | EMAIL_ADDRESS (Matched) | NAME (Input Source) | FIRST_NAME (Matched Output)
    adam.fenech@priorityhealth.com | adam.fenech@priority-health.com | Adam Fenech | Adam Fenech
    braddpiontek@alliant-energy.com | braddpiontek@alliantenergy.com | Bradd Piontek | Bradd Piontek

    EMAILADDRESS (Input) | CONTACT_EMAIL_ADDRESS (Matched) | NAME (Input) | CONTACT_FIRST_NAME (Matched)
    brent.alexander@cingular.com | brentalexander@cingular.com | Brent Alexander | Brent Alexander
    chris.sims@fiserv.com | chris.sims@fiserv.com | Chris Sims | Chris Sims
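The email pairs above differ only in dots and hyphens. One way to reproduce that kind of loose match is to compare a normalized key instead of the raw address, as in this Python sketch. The `email_key` function is hypothetical and far cruder than a DataFlux match definition.

```python
import re

def email_key(address):
    """Build a loose comparison key: lowercase, keep only letters and digits on each side of '@'."""
    local, _, domain = address.strip().lower().partition("@")

    def strip_punctuation(part):
        return re.sub(r"[^a-z0-9]", "", part)

    return f"{strip_punctuation(local)}@{strip_punctuation(domain)}"

def loose_email_match(a, b):
    """True when two addresses share the same loose key."""
    return email_key(a) == email_key(b)

hyphen_pair = loose_email_match("adam.fenech@priorityhealth.com",
                                "adam.fenech@priority-health.com")
dot_pair = loose_email_match("brent.alexander@cingular.com",
                             "brentalexander@cingular.com")
other_pair = loose_email_match("chris.sims@fiserv.com",
                               "brent.alexander@cingular.com")
```

A tight match would compare the addresses as entered; the looser the key, the more likely it is that two different people collapse into one match, so the name columns are checked alongside the email.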
  • 15. DF NODE – UTILITIES • Data Joining This node joins data from various sources, such as two different databases, Excel files, or Access files. dfPower Architect's Data Joining job flow step is based on the SQL concept of JOIN. You can use Data Joining to combine two data sets in an intelligent way, so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
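The phrase "one, the other, or both data sets as the basis" corresponds to the left/right, inner, and full join types in SQL. A minimal Python sketch of the idea follows, using lists of dictionaries as stand-ins for the two inputs; the `join` function and the sample rows are illustrative and are not part of dfPower.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is 'inner', 'left', or 'full'."""
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)

    result, matched_keys = [], set()
    for row in left:
        matches = right_index.get(row[key], [])
        if matches:
            matched_keys.add(row[key])
            result.extend({**match, **row} for match in matches)
        elif how in ("left", "full"):
            result.append(dict(row))  # left row kept without right-hand fields
    if how == "full":
        # Right rows that found no partner are kept as well.
        result.extend(dict(row) for row in right if row[key] not in matched_keys)
    return result

accounts = [{"id": 1, "name": "Eaton Corporation"}, {"id": 2, "name": "Metlife, Incorporated"}]
phones = [{"id": 2, "phone": "1-800-638-5000"}, {"id": 3, "phone": "987"}]

inner_rows = join(accounts, phones, "id", how="inner")
left_rows = join(accounts, phones, "id", how="left")
full_rows = join(accounts, phones, "id", how="full")
```

A right join is the same as a left join with the two inputs swapped.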
  • 16. DF NODE – UTILITIES • SQL Lookup SQL Lookup lets the user find rows in a database table that have one or more fields matching those in the job flow. It offers a clear performance advantage, especially with large databases, because the table is not copied to the local hard drive to perform the operation (as it is with joins).
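The difference from a join can be sketched with Python's built-in `sqlite3` module: each row in the flow issues a keyed query, so the table stays in the database and only the matching rows come back. The table name, columns, and sample data are assumptions for illustration.

```python
import sqlite3

# In-memory database standing in for the large lookup table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_name TEXT PRIMARY KEY, phone TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("Eaton Corporation", "1-900-735-5674"), ("Metlife, Incorporated", "1-800-638-5000")],
)

def sql_lookup(connection, flow_rows):
    """Add the phone found in the database to each flow row (None when no row matches)."""
    enriched = []
    for row in flow_rows:
        hit = connection.execute(
            "SELECT phone FROM accounts WHERE account_name = ?",
            (row["account_name"],),
        ).fetchone()
        enriched.append({**row, "phone": hit[0] if hit else None})
    return enriched

flow = [{"account_name": "Eaton Corporation"}, {"account_name": "Dollar Bank"}]
looked_up = sql_lookup(conn, flow)
```

The lookup is only fast if the matched field is indexed in the database (here it is the primary key).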
  • 17. DF NODE – UTILITIES • SQL Execute This is a stand-alone node (no parents or children) that lets you construct and execute any valid SQL statement (or series of statements). It performs database-specific tasks before, after, or in between Architect job flows. Example: SQL statements such as UPDATE, DELETE, and COMMIT for a particular table can be run in this node.
  • 18. DF NODE – UTILITIES • Data Union dfPower Architect's Data Union node is based on the SQL concept of UNION. As with Data Joining, use the Data Union node to combine data from two data sets. Unlike Data Joining, however, Data Union does not perform an intelligent combination. Rather, Data Union simply adds the two data sets together; the resulting data set contains one record for each record in each of the original data sets. Example: when data from two or more sheets, databases, or DF job flows needs to be combined into one data set, this node performs the task.
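Because every input record appears in the output, the behaviour described here is closest to SQL's UNION ALL (no duplicate removal). A small Python sketch follows, assuming the inputs are lists of dictionaries and that fields missing from one input are filled with None; `data_union` is a hypothetical helper.

```python
def data_union(*datasets):
    """Append the data sets one after another, aligning them on the combined field list."""
    fields = []
    for dataset in datasets:
        for row in dataset:
            for field in row:
                if field not in fields:
                    fields.append(field)
    # One output record per input record; absent fields become None.
    return [{field: row.get(field) for field in fields}
            for dataset in datasets for row in dataset]

sheet_a = [{"name": "Eaton Corporation", "phone": "1-900-735-5674"}]
sheet_b = [{"name": "Eaton Corporation"}, {"name": "Dollar Bank"}]
combined = data_union(sheet_a, sheet_b)
```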
  • 19. DF NODE – UTILITIES • Branch This step lets multiple children (up to 32) simultaneously access data from a single source. Depending on the step's configuration and the children's access patterns, data can be passed from the parent directly to each child, or it may be temporarily stored in memory and/or disk caches before being passed to the children. In other words, the step takes one input and feeds multiple outputs (maximum 32).
  • 20. DF NODE – UTILITIES • Concatenate dfPower Architect's Concatenate node performs the opposite function of the Parse node. Rather than separating a single field into multiple fields, Concatenate combines one or more fields into a single field.
    Example
    Prefix: Mr
    First Name: Rahul
    Last Name: Jain
    Concatenate output: Mr Rahul Jain
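The "Mr Rahul Jain" example can be reproduced with a few lines of Python. The `concatenate` helper and the dictionary keys are assumptions for illustration; empty fields are skipped so that no doubled separators appear.

```python
def concatenate(record, fields, separator=" "):
    """Join the named fields of a record into one string, skipping empty values."""
    parts = [str(record[field]).strip() for field in fields if record.get(field)]
    return separator.join(parts)

contact = {"Prefix": "Mr", "First Name": "Rahul", "Last Name": "Jain"}
full_name = concatenate(contact, ["Prefix", "First Name", "Last Name"])

no_prefix = concatenate({"Prefix": "", "First Name": "Rahul", "Last Name": "Jain"},
                        ["Prefix", "First Name", "Last Name"])
```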
  • 21. DF NODE – UTILITIES • Expression Use dfPower Architect's Expression node to run a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio. The Expression language provides many statements, functions, and variables for manipulating data.
    Example: creating a column Match_Criteria in the middle of a job flow. The syntax would be:
    Pre-Processing Expression: string Match_Criteria
    Expression: Match_Criteria = " "
  • 22. DF NODE – UTILITIES • Data Sorting Use dfPower Architect's Data Sorting node to re-order your data set (ascending or descending) at any point in a job flow.
  • 23. DF NODE – PROFILING • Basic Statistics
    dfPower Architect's Basic Statistics node calculates statistics about your data, such as value ranges, counts, or sums for any given field. It is typically used on numerical rather than text fields; however, statistics such as Count, Missing, MAX, and MIN can be useful on any field type. It can also be used in the middle of a job for fault finding, by checking the record counts at each step.
    Example: basic statistics of a Siebel table
    Field | Records | Count | Null Count | Distinct
    Row_Id | 267413 | 267413 | 0 | yes
    Created | 267413 | 267413 | 0 | yes
    Created_By | 267413 | 267413 | 0 | yes
    Account Name | 267413 | 267413 | 0 | yes
    Partner Flag | 267413 | 267413 | 0 | no
    Email Addr | 267413 | 5 | 267408 | yes
    Phone | 267413 | 72552 | 194861 | yes
    CSN | 267413 | 181643 | 85770 | yes
    Min: 1 0-5200 1/1/1980 0:00 0-1 N dllee@pentasoft.co.kr ###iswrong 1
    Max: 1 O-2 9/9/2010 21:55 1-XVOET ültje GmbH Y tloughran@infopath.net xxxxxxxxx
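The per-field figures in this kind of report are straightforward to compute. The Python sketch below does it for one column, assuming that None and the empty string both count as missing and that all present values share one type; `basic_stats` is a hypothetical helper, not the node's implementation.

```python
def basic_stats(values):
    """Return record count, populated count, null count, distinct count, min, and max."""
    present = [v for v in values if v is not None and v != ""]
    return {
        "records": len(values),
        "count": len(present),
        "null_count": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

partner_flag = ["N", "Y", "N", None, "N", ""]
stats = basic_stats(partner_flag)
```

Running the same function after each step of a job and comparing the `records` values is the fault-finding use mentioned on the slide.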
  • 24. DF NODE – PROFILING • Pattern Analysis dfPower Architect's Pattern Analysis node is used to generate a new field containing alphanumeric patterns that represent each value in a selected field. You can specify whether these patterns represent each character or each word (as separated by spaces) in a field.
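A rough Python sketch of both pattern modes follows. The symbols are assumptions chosen for illustration ("A" for an uppercase letter, "a" for a lowercase letter, "9" for a digit, and "M" for a mixed word) and may not match the symbols the node itself emits.

```python
def char_pattern(value):
    """Replace each character by a class symbol; punctuation and spaces are kept as-is."""
    symbols = []
    for ch in value:
        if ch.isdigit():
            symbols.append("9")
        elif ch.isalpha():
            symbols.append("A" if ch.isupper() else "a")
        else:
            symbols.append(ch)
    return "".join(symbols)

def word_pattern(value):
    """Replace each space-separated word by one symbol: 9 numeric, A alphabetic, M mixed."""
    symbols = []
    for word in value.split():
        if word.isdigit():
            symbols.append("9")
        elif word.isalpha():
            symbols.append("A")
        else:
            symbols.append("M")
    return " ".join(symbols)

phone_pattern = char_pattern("(609) 883-1300")
address_pattern = word_pattern("301 Sullivan Way")
```

Grouping a phone column by its character pattern quickly separates well-formed numbers from entries such as "###iswrong".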
  • 25. DF NODE – PROFILING • Frequency Distribution dfPower Architect's Frequency Distribution node is used to calculate the number of occurrences of each unique value in a field. For example, Frequency Distribution can determine how many customers in your customer database are in each of the 50 US states, the District of Columbia, and the 13 Canadian provinces and territories.
    State | Count of Customers | % Total
    CA | 19593 | 12
    CO | 4041 | 2
    CT | 2807 | 1
    DC | 2555 | 1
    DE | 746 | 0
    FL | 7105 | 4
    GA | 5198 | 3
    GE | 1 | 0

    GEO | GEO_count | GEO %
    Americas | 187235 | 57
    Asia Pacific | 30642 | 9
    EMEA | 107412 | 33
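The same count-and-percentage report can be sketched with Python's `collections.Counter`. The sample values are made up for illustration and the function name is hypothetical.

```python
from collections import Counter

def frequency_distribution(values):
    """Return (value, count, percent of total) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count, 100.0 * count / total)
            for value, count in counts.most_common()]

states = ["CA", "CO", "CA", "GA"]
distribution = frequency_distribution(states)
```

Values that occur only once, such as the single "GE" row in the table above, are often typos for a neighbouring value and are worth reviewing.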
  • 26. DF NODE – PROFILING • Data Validation dfPower Architect's Data Validation node is used to analyze the content of data by setting validation conditions. These conditions create validation expressions that you can use to filter data for a more accurate view of that data.
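Conceptually, each validation condition is a test applied to every row, and the results are used to filter the flow. A minimal Python sketch follows, with two made-up conditions on a Phone field; the helper name and the conditions are assumptions, not the node's expression syntax.

```python
def validate(rows, conditions):
    """Split rows into (valid, invalid); each row records the names of the conditions it failed."""
    valid, invalid = [], []
    for row in rows:
        failed = [name for name, check in conditions.items() if not check(row)]
        (invalid if failed else valid).append({**row, "failed": failed})
    return valid, invalid

conditions = {
    "phone_present": lambda row: bool(row.get("Phone")),
    "phone_has_10_digits": lambda row: sum(c.isdigit() for c in (row.get("Phone") or "")) >= 10,
}
rows = [{"Phone": "(609) 883-1300"}, {"Phone": None}, {"Phone": "987"}]
valid_rows, invalid_rows = validate(rows, conditions)
```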
  • 27. DF NODE – ENRICHMENT • Address Verification Use dfPower Architect's Address Verification (US/Canada/World) node to verify, correct, and enhance any addresses in your existing data (QKB). Address Verification (US/Canada/World) uses geographic information from various reference databases to match and standardize addresses. You can also use Address Verification (US/Canada) for proper casing and CASS/SERP compliance. Each address is assigned a result code, listed on the next slides, that indicates its status, that is, how valid the address is.
  • 28. DF NODE – ENRICHMENT • For US Addresses
    Text Result Code | Numeric Result Code | Description
    OK | 0 | Address was verified successfully.
    PARSE | 11 | Error parsing address. Components of the address may be missing.
    CITY | 12 | Could not locate city/state or ZIP in the USPS database. At least (city and state) or ZIP must be present in the input.
    MULTI | 13 | Ambiguous address. There were two or more possible matches for this address with differing data.
    NOMATCH | 14 | No matching address found in the USPS data.
    OVER | 15 | One or more input strings is too long (maximum 100 characters).
  • 29. • For Canada Addresses
    Result Code | Description
    0 | No error occurred
    1 | Internal error
    2 | Cannot load database
    3 | Invalid - unspecified reason
    4 | Invalid civic number
    5 | Invalid street
    6 | Invalid unit
    7 | Invalid delivery mode
    8 | Invalid delivery installation
    9 | Invalid city
    10 | Invalid province
    11 | Invalid postal code
    12 | Address is not Canadian
  • 30. • Rest of World (Excluding US and Canada)
    Result Code | Description
    0 | Address correct as entered.
    1 | Address corrected automatically.
    2 | Address needs to be corrected, but could not
    3 | Address needs to be corrected, but could not be determined automatically. There is a fair
    4 | Address needs to be corrected, but could not be determined automatically. There is a small
  • 31. DF NODE – MONITORING • Data Monitoring The Data Monitoring node enables you to analyze data according to business rules you create using the Business Rule Manager. The business rules you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an email alert, when a condition is detected. By using the Data Monitoring node, you can insert these business rules in your job flow to analyze data at various points in the flow.
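The rule-plus-event idea can be sketched in Python as a set of named conditions and a callback that fires whenever a condition is detected. Here the "event" just appends to a list, where a real job would log a message or send an email alert; the rule name and the `monitor` function are hypothetical and unrelated to the Business Rule Manager's own format.

```python
def monitor(rows, rules, on_event):
    """Apply every rule to every row; call on_event(rule_name, row) on each hit and return the hit count."""
    triggered = 0
    for row in rows:
        for name, condition in rules.items():
            if condition(row):
                on_event(name, row)
                triggered += 1
    return triggered

rules = {"missing_phone": lambda row: not row.get("Phone")}
rows = [
    {"AccountName": "Eaton Corporation", "Phone": "1-900-735-5674"},
    {"AccountName": "Eaton Corporation", "Phone": None},
]
events = []
hits = monitor(rows, rules, lambda name, row: events.append((name, row["AccountName"])))
```

Calling `monitor` at several points in a flow, with different rule sets, mirrors inserting the node at various points in the job.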