Data Analysis using Data Flux

FROM: SUNIL PAI

TYPICAL USAGE - CUSTOMER DATA OPERATIONS
• Data de-duplication
• Data standardization
• Data analysis and data profiling
• Data consolidation from various sources
• Comparing multiple data sets against predefined parameters
• Inserting data into target databases
• Match-at-a-glance reports for new acquisitions

DF NODE - DATA INPUTS
DataFlux (DF) can read from various input sources, such as relational
databases (using queries), Excel files, Access files, and text files.
These sources are connected via ODBC.
Example: a query is entered in the SQL Query node after selecting a
database or Access file in the node properties.
For Excel, the input area must first be defined as a named range using
Name Manager under the Formulas tab of the Excel sheet. Excel sheets
are read with the Data Source Input node.

DF NODE - DATA OUTPUTS
DF can write a job or result output to an Excel file, an Access file,
a text file, or a relational database such as Oracle or SQL Server.
DF uses the Insert, Update, Target, and Output utilities for the data
output stage.
Examples: the output can be inserted directly into a database table
using the Data Target (Insert) node. Output can also be written to a
text file using the Text File Output node.

DF NODE – QUALITY
• Standardization
  - dfPower Architect's Standardization node is used to make similar items
    the same.
  - The available standardization definitions include Name, Address,
    Organization, Zip, Phone, Email Address, Country, State, Non-Alphanumeric
    Remover, Numeric Remover, Alphanumeric Remover, Space Remover,
    Quotation Remover, etc.
  - Various schemas can also be selected; these are defined in the DataFlux
    Quality Knowledge Base (QKB).
  - Example: using full company names instead of initials ("International
    Business Machines" vs. "IBM").

DF NODE – QUALITY
• Standardization (more examples): addresses
"1 Comcast Center" to "1 Comcast Ctr"
"10 Glenlake Pkwy north east" to "10 Glenlake Pkwy NE"
"North Dakota" to "ND"
"United States" to "USA"
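The address examples above can be approximated with a small lookup-based sketch. This is not how DataFlux implements standardization; the ADDRESS_SCHEME table and the standardize function are hypothetical names used only for illustration, and a real scheme would come from the QKB.

```python
import re

# Hypothetical scheme for illustration only; real schemes live in the QKB.
ADDRESS_SCHEME = {
    "center": "Ctr",
    "parkway": "Pkwy",
    "north east": "NE",
    "north dakota": "ND",
    "united states": "USA",
}


def standardize(value: str, scheme: dict[str, str] = ADDRESS_SCHEME) -> str:
    """Replace each known phrase with its standard form, ignoring case."""
    result = value
    # Apply longer phrases first so multi-word entries are replaced whole.
    for phrase in sorted(scheme, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        result = re.sub(pattern, scheme[phrase], result, flags=re.IGNORECASE)
    return result


print(standardize("1 Comcast Center"))
print(standardize("10 Glenlake Pkwy north east"))
```

Values with no entry in the scheme (for example "Pkwy", which is already abbreviated) pass through unchanged.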

DF NODE – QUALITY
• Parsing
  - dfPower Architect's Parsing node is a simple but intelligent tool for
    separating multi-part field values into multiple single-part fields.
    For example, if you have a Name field that includes the value "Mr.
    Igor Bela Bonski III, Esq.", you can use parsing to create six separate
    fields:
Name Prefix: "Mr."
Given Name: "Igor"
Middle Name: "Bela"
Family Name: "Bonski"
Name Suffix: "III"
Name Appendage: "Esq."
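A rough sketch of the same idea in Python is shown below. It is a token-based approximation, not the Parsing node's actual logic; the word lists (PREFIXES, SUFFIXES, APPENDAGES) and the parse_name function are assumptions made for this example.

```python
# Hypothetical vocabularies; a real parse definition would be far richer.
PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr.", "ii", "iii", "iv"}
APPENDAGES = {"esq.", "phd", "md"}


def parse_name(full_name: str) -> dict[str, str]:
    """Split a full name into six single-part fields."""
    tokens = [token.strip(",") for token in full_name.split()]
    fields = {
        "Name Prefix": "", "Given Name": "", "Middle Name": "",
        "Family Name": "", "Name Suffix": "", "Name Appendage": "",
    }
    # Peel known tokens off both ends, then split what remains.
    if tokens and tokens[0].lower() in PREFIXES:
        fields["Name Prefix"] = tokens.pop(0)
    if tokens and tokens[-1].lower() in APPENDAGES:
        fields["Name Appendage"] = tokens.pop()
    if tokens and tokens[-1].lower() in SUFFIXES:
        fields["Name Suffix"] = tokens.pop()
    if tokens:
        fields["Given Name"] = tokens.pop(0)
    if tokens:
        fields["Family Name"] = tokens.pop()
    fields["Middle Name"] = " ".join(tokens)
    return fields


parsed = parse_name("Mr. Igor Bela Bonski III, Esq.")
for field, value in parsed.items():
    print(f"{field}: {value}")
```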

DF NODE – INTEGRATION
• Match Codes
dfPower Architect's Match Codes node is used to identify duplicate
records in your data. This step creates match codes, which are used to
evaluate how many duplicates exist in your data and to eliminate the
extras.
Match sensitivity can be set from 50% (lowest) to 100% (exact), and
various schemas can be selected.
FieldName     | Definition          | Sensitivity
AccountName   | Business Title      | 85%
Address_Line1 | Address/AddressLong | 85%
City          | City                | Exact-All, Exact-10 characters
Country       | Country             | Exact-All, Exact-10 characters
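To give a feel for how a sensitivity setting can loosen or tighten a match, here is a deliberately crude sketch. DataFlux match codes are generated from QKB match definitions and work differently; the NOISE_WORDS list, the prefix-length rule, and the match_code function below are all assumptions made for illustration.

```python
import re

# Hypothetical noise words that are ignored below 100% sensitivity.
NOISE_WORDS = {"inc", "incorporated", "corp", "corporation", "llc"}


def match_code(value: str, sensitivity: int = 85) -> str:
    """Build a crude match key; lower sensitivity keeps fewer characters."""
    if not 50 <= sensitivity <= 100:
        raise ValueError("sensitivity must be between 50 and 100")
    words = re.findall(r"[a-z0-9]+", value.lower())
    if sensitivity == 100:
        # Exact: keep every alphanumeric character.
        return "".join(words).upper()
    kept = [word for word in words if word not in NOISE_WORDS] or words
    # Assumed rule: one key character per 5 points of sensitivity.
    return "".join(kept).upper()[: sensitivity // 5]


print(match_code("Vertrue Inc."), match_code("Vertrue Incorporated"))
print(match_code("Textron Inc.", 100), match_code("Textron Incorporated", 100))
```

Two values are treated as a match when their keys are equal, so "Vertrue Inc." and "Vertrue Incorporated" match at 85 but not at 100.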

• Clustering
dfPower Architect's Data Clustering node groups matching duplicates, or
sets of unique records, according to the conditions you define. See the
cluster numbers in the example below.
Cluster | AccountName                                | AccountAddress1    | MatchCriteria
7231    | New Jersey Manufacturers Insurance Company | 301 Sullivan Way   | Exact Company Name + Address-1
7663    | Metlife, Incorporated                      | 27-01 Queens Plz N | Exact Company Name + Address-1
7663    | Metlife, Incorporated                      | 27-01 Queens Plz N | Exact Company Name + Address-1
7791    | Eaton Corporation                          | 34899 Curtis Blvd  | Exact Company Name + Address-1
7791    | Eaton Corporation                          | 34899 Curtis Blvd  | Exact Company Name + Address-1
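The grouping shown in the table can be sketched as follows, assuming an exact match on the chosen fields after trimming and lower-casing. The assign_clusters function is a hypothetical helper; the real node clusters on match codes and assigns its own cluster numbers.

```python
from itertools import count


def assign_clusters(records, key_fields):
    """Give records that agree on all key fields the same cluster number."""
    next_id = count(1)
    cluster_by_key = {}
    clustered = []
    for record in records:
        key = tuple(record[field].strip().lower() for field in key_fields)
        if key not in cluster_by_key:
            cluster_by_key[key] = next(next_id)
        clustered.append({**record, "Cluster": cluster_by_key[key]})
    return clustered


accounts = [
    {"AccountName": "Metlife, Incorporated", "AccountAddress1": "27-01 Queens Plz N"},
    {"AccountName": "Eaton Corporation", "AccountAddress1": "34899 Curtis Blvd"},
    {"AccountName": "Metlife, Incorporated", "AccountAddress1": "27-01 Queens Plz N"},
]
for row in assign_clusters(accounts, ["AccountName", "AccountAddress1"]):
    print(row["Cluster"], row["AccountName"])
```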

• Surviving Record Identification
  - dfPower Architect's Surviving Record Identification (SRI) node
    examines clustered data and determines a surviving record for each
    cluster. This process lets you eliminate duplicate information in a
    data source. The surviving record is identified using one or more
    user-configurable record rules. The user may also enter field rules to
    perform automated field-level edits of the surviving record's data
    during SRI processing. The SRI step can be configured to keep all
    existing data, marking the surviving records with a flag or primary
    key value, or it can remove all data except for that associated with
    the surviving records.
Example: suppose you have a set of duplicate accounts and addresses
in the system and need to keep one distinct record from each set of
duplicates, and that record should have a proper phone number.
You can use the SRI node and define the selection rule in the node's
properties. See the example on the next slide.

• Surviving Record Identification
Example (continued): see the Cluster and SurvivingRecord columns
below. Each cluster has only one surviving record.
Cluster | AccountName                                | AccountAddress1    | Phone          | SurvivingRecord
7231    | New Jersey Manufacturers Insurance Company | 301 Sullivan Way   | (609) 883-1300 | TRUE
7231    | New Jersey Manufacturers Insurance Company | 301 Sullivan Way   | Null           | FALSE
7231    | New Jersey Manufacturers Insurance Company | 301 Sullivan Way   | 987            | FALSE
7663    | Metlife, Incorporated                      | 27-01 Queens Plz N | 1-800-638-5000 | TRUE
7663    | Metlife, Incorporated                      | 27-01 Queens Plz N | Null           | FALSE
7791    | Eaton Corporation                          | 34899 Curtis Blvd  | 1-900-735-5674 | TRUE
7791    | Eaton Corporation                          | 34899 Curtis Blvd  | Null           | FALSE
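A record rule such as "keep the record with a proper phone number" could be sketched like this. The scoring rule (at least 10 digits counts as proper) and the names phone_score and flag_survivors are assumptions for illustration, not the SRI node's configuration.

```python
import re


def phone_score(phone):
    """Score a phone value; missing values and short fragments score 0."""
    if not phone:
        return 0
    digits = len(re.sub(r"\D", "", phone))
    return digits if digits >= 10 else 0


def flag_survivors(records):
    """Mark one surviving record per cluster, preferring the best phone."""
    best = {}  # cluster -> (score, index of best record so far)
    for index, record in enumerate(records):
        score = phone_score(record["Phone"])
        cluster = record["Cluster"]
        # Strict comparison: on a tie the earlier record survives.
        if cluster not in best or score > best[cluster][0]:
            best[cluster] = (score, index)
    survivors = {index for _, index in best.values()}
    return [
        {**record, "SurvivingRecord": index in survivors}
        for index, record in enumerate(records)
    ]


clustered = [
    {"Cluster": 7231, "Phone": "(609) 883-1300"},
    {"Cluster": 7231, "Phone": None},
    {"Cluster": 7231, "Phone": "987"},
    {"Cluster": 7663, "Phone": None},
    {"Cluster": 7663, "Phone": "1-800-638-5000"},
]
for row in flag_survivors(clustered):
    print(row["Cluster"], row["Phone"], row["SurvivingRecord"])
```

This version keeps all records and only flags the survivors, which corresponds to the "keep all existing data" configuration described above.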

DF MATCH EXAMPLES
• Standardization and match codes combined in a job flow give
remarkable results, as shown below.
Exact or 100% match results
COMPANYNAME (Input) | CompanyName (Matched/Output) | ADDRESS1 (Input) | ADDR (Matched)
Netscape Communications Corporation | Netscape Communications Corporation | 501 E Middlefield Rd | 501 E Middlefield Rd
Alston & Bird LLP | Alston & Bird LLP | 1201 W Peachtree St | 1201 W Peachtree St
Georgia Perimeter College | Georgia Perimeter College | 3251 Panthersville Rd | 3251 Panthersville Rd
County of Oneida | County of Oneida | 800 Park Ave | 800 Park Ave
Eli Lilly and Company | Eli Lilly and Company | PO Box 6034 | PO Box 6034
Actuate Corporation | Actuate Corporation | 2207 Bridgepointe Pkwy. Ste. 500 | 2207 Bridgepointe Pkwy Ste 500
Shriners Hospitals For Children | Shriners Hospitals For Children | 3551 N Broad St | 3551 N Broad St
Catholic Health Initiatives | Catholic Health Initiatives | 440 Creamery Way | 440 Creamery Way
El Paso Electric Company | El Paso Electric Company | 123 W Mills Ave | 123 W Mills Ave

DF MATCH EXAMPLES
• 75% match results
Name (Input) | Name (Matched) | Address (Input) | Address (Matched)
Arizona State University | Arizona State University | University Dr and also Mill Ave | University Drive & Mill Avenue
Cybernet Software Systems, Inc. | Cybernet Software Systems Incorporated | 3031 Tisch Way Ste. 1002 | 3031 Tisch Way
Vertrue Inc. | Vertrue Incorporated | 20 Glover Ave. | 20 Glover Ave
Dollar Bank, FSB | Dollar Bank | 3 Gateway Center | 3 Gateway Center 8 East
Textron Inc. | Textron Incorporated | 40 Westminster Street | 40 Westminster St
Archer Technologies | Archer Technologies LLC | 13200 Metcalf, Suite 300 | 13200 Metcalf Ave
BMW Financial Services NA | BMW Financial Services NA Incorporated | 5515 Park Center Circle | 5515 Parkcenter Cir
Great American Financial Resources, Inc. | Great American Financial Resources Incorporated | 250 E. 5th St. | 250 E 5th St
Cec Entertainment, Inc. | CEC Entertainment Incorporated | 4441 W Airport Freeway | 4441 W Airport Fwy

DF MATCH EXAMPLES
• Loose and tight contact matches (see the email addresses)
100% matches
EMAILADDRESS (Input Source) | EMAIL_ADDRESS (Matched) | NAME (Input Source) | FIRST_NAME (Matched Output)
adam.fenech@priorityhealth.com | adam.fenech@priority-health.com | Adam Fenech | Adam Fenech
braddpiontek@alliant-energy.com | braddpiontek@alliantenergy.com | Bradd Piontek | Bradd Piontek

EMAILADDRESS (Input) | CONTACT_EMAIL_ADDRESS (Matched) | NAME (Input) | CONTACT_FIRST_NAME (Matched)
brent.alexander@cingular.com | brentalexander@cingular.com | Brent Alexander | Brent Alexander
chris.sims@fiserv.com | chris.sims@fiserv.com | Chris Sims | Chris Sims

DF NODE – UTILITIES
• Data Joining Node
This node is used to join data from various sources, such as two
different databases, Excel files, or Access files.
dfPower Architect's Data Joining job flow step is based on the
SQL concept of JOIN. You can use Data Joining to combine two
data sets in an intelligent way so that the records of one, the
other, or both data sets are used as the basis for the resulting
data set.
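The "one, the other, or both" choice corresponds to left, right, and full joins in SQL terms. A minimal in-memory sketch, assuming both data sets are lists of dictionaries that share one key field (join_records is a hypothetical helper, not a DataFlux function):

```python
def join_records(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; how is inner, left, right, or full."""
    if how not in {"inner", "left", "right", "full"}:
        raise ValueError(f"unsupported join type: {how}")
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)

    result, matched_keys = [], set()
    for row in left:
        matches = right_index.get(row[key], [])
        if matches:
            matched_keys.add(row[key])
            result.extend({**row, **match} for match in matches)
        elif how in {"left", "full"}:
            result.append(dict(row))  # unmatched left row is kept as-is
    if how in {"right", "full"}:
        result.extend(dict(r) for r in right if r[key] not in matched_keys)
    return result


names = [{"id": 1, "name": "Eaton"}, {"id": 2, "name": "Metlife"}]
phones = [{"id": 2, "phone": "1-800-638-5000"}, {"id": 3, "phone": "987"}]
print(join_records(names, phones, "id", how="full"))
```

Unmatched rows keep only their own columns here; a SQL join would fill the missing columns with NULL.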

• SQL Lookup
SQL Lookup lets the user find rows in a database table that
have one or more fields matching those in the job flow. It
offers a clear performance advantage, especially with large
databases, because the table is not copied to the local hard
drive in order to perform the operation (as is the case with
joins).
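The difference from a join can be illustrated with Python's built-in sqlite3 module: each row in the flow triggers a keyed query, and the table itself stays in the database. The accounts table, its columns, and the sql_lookup function are assumptions for this sketch.

```python
import sqlite3


def sql_lookup(connection, rows, key_field):
    """Look up each flow row in the accounts table without copying the table."""
    cursor = connection.cursor()
    enriched = []
    for row in rows:
        cursor.execute(
            "SELECT phone FROM accounts WHERE account_name = ?",
            (row[key_field],),
        )
        hit = cursor.fetchone()
        enriched.append({**row, "phone": hit[0] if hit else None})
    return enriched


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE accounts (account_name TEXT PRIMARY KEY, phone TEXT)"
)
connection.execute(
    "INSERT INTO accounts VALUES ('Eaton Corporation', '1-900-735-5674')"
)
flow_rows = [{"AccountName": "Eaton Corporation"}, {"AccountName": "Unknown Co"}]
looked_up = sql_lookup(connection, flow_rows, "AccountName")
print(looked_up)
```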

• SQL Execute
This is a stand-alone node (no parents or children) that lets you
construct and execute any valid SQL statement (or series of
statements). It performs database-specific tasks before, after,
or in between Architect job flows.
Examples: SQL statements such as UPDATE, DELETE, or COMMIT for a
particular table can be used in this node.
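As a rough equivalent outside DataFlux, a series of statements can be run against a database with Python's sqlite3 module. The staging table and the run_statements helper are hypothetical and serve only to show an UPDATE and a DELETE followed by a COMMIT.

```python
import sqlite3


def run_statements(connection, statements):
    """Execute statements in order and commit once; roll back on failure."""
    try:
        for statement in statements:
            connection.execute(statement)
        connection.commit()
    except sqlite3.Error:
        connection.rollback()
        raise


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE staging (id INTEGER, status TEXT)")
connection.executemany(
    "INSERT INTO staging VALUES (?, ?)", [(1, "new"), (2, "old"), (3, "new")]
)
connection.commit()

run_statements(connection, [
    "UPDATE staging SET status = 'loaded' WHERE status = 'new'",
    "DELETE FROM staging WHERE status = 'old'",
])
remaining = connection.execute(
    "SELECT id, status FROM staging ORDER BY id"
).fetchall()
print(remaining)
```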

• Data Union
  - dfPower Architect's Data Union node is based on the SQL concept of
    UNION. As with Data Joining, use the Data Union node to combine data
    from two data sets. Unlike Data Joining, however, Data Union does not
    perform an intelligent combination. Rather, Data Union simply adds the
    two data sets together; the resulting data set contains one record for
    each record in each of the original data sets.
Example: when data from two or more sheets, databases, or DF job flows
needs to be combined, this node performs the task.
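Because every record from every input is kept, the behaviour described is closest to SQL's UNION ALL (plain UNION would drop duplicates). A minimal sketch, with union_all as a hypothetical helper name:

```python
def union_all(*data_sets):
    """Append the data sets one after another, keeping every record."""
    combined = []
    for data_set in data_sets:
        combined.extend(data_set)
    return combined


sheet_1 = [{"AccountName": "Eaton Corporation"}]
sheet_2 = [{"AccountName": "Eaton Corporation"}, {"AccountName": "Dollar Bank"}]
print(len(union_all(sheet_1, sheet_2)))
```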

• Branch
This step lets multiple children (up to 32) simultaneously
access data from a single source. Depending on the step's
configuration and the children's access patterns, data can be
passed from the parent directly to each of the children, or it
may be temporarily stored in memory and/or disk caches before
being passed to the children.
In other words, it provides one input and multiple outputs
(maximum 32).

• Concatenate
dfPower Architect's Concatenate node performs the opposite
function of the Parse node. Rather than separate a single field
into multiple fields, Concatenate combines one or more fields
into a single field.
Example
Prefix: Mr   First Name: Rahul   Last Name: Jain
Concatenate output: Mr Rahul Jain
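The same operation in a short Python sketch; the field names and the concatenate helper are illustrative assumptions, and empty fields are skipped so that no double separators appear.

```python
def concatenate(record, fields, separator=" "):
    """Join the non-empty values of the chosen fields into one string."""
    return separator.join(record[field] for field in fields if record.get(field))


contact = {"Prefix": "Mr", "First Name": "Rahul", "Last Name": "Jain"}
print(concatenate(contact, ["Prefix", "First Name", "Last Name"]))
```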

• Expression
  - Use dfPower Architect's Expression node to run a Visual Basic-like
    language to process your data sets in ways that are not built into
    dfPower Studio. The Expression language provides many statements,
    functions, and variables for manipulating data.
Example: creating a column named Match_Criteria in the middle of a job
flow. The syntax would be:
Pre-processing expression: string Match_Criteria
Expression: Match_Criteria = " "

• Data Sorting
Use dfPower Architect's Data Sorting node to re-order your data set
(ascending or descending) at any point in a job flow.

DF NODE – PROFILING
• Basic Statistics
  - dfPower Architect's Basic Statistics node is used to calculate
    statistics about your data, such as value ranges, counts, or
    sums for any given field.
The Basic Statistics node is typically used on numerical rather
than text fields. However, statistics such as Count, Missing,
MAX, and MIN can be useful on any field type.
It can also be used in the middle of a job for fault finding, by
checking the counts at each step.
Example: basic statistics of a Siebel table
           | Row_Id | Created | Created_By | Account Name | Partner Flag | Email Addr | Phone  | CSN
Records    | 267413 | 267413  | 267413     | 267413       | 267413       | 267413     | 267413 | 267413
Count      | 267413 | 267413  | 267413     | 267413       | 267413       | 5          | 72552  | 181643
Null Count | 0      | 0       | 0          | 0            | 0            | 267408     | 194861 | 85770
Distinct   | yes    | yes     | yes        | yes          | no           | yes        | yes    | yes
Min 1 0-5200 1/1/1980 0:00 0-1 N dllee@pentasoft.co.kr ###iswrong 1
Max 1 O-2 9/9/2010 21:55 1-XVOET ültje GmbH Y tloughran@infopath.net xxxxxxxxx
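The statistics in the example can be computed for a single field with a few lines of Python. This is a sketch under two assumptions: "Distinct" is read as "all non-missing values are unique", and missing means None or an empty string. The basic_statistics function is a hypothetical helper.

```python
def basic_statistics(rows, field):
    """Compute record, count, null, distinct, min, and max for one field."""
    values = [row.get(field) for row in rows]
    present = [value for value in values if value not in (None, "")]
    return {
        "Records": len(values),
        "Count": len(present),
        "Null Count": len(values) - len(present),
        # Assumption: "Distinct" means every non-missing value is unique.
        "Distinct": len(set(present)) == len(present),
        "Min": min(present, default=None),
        "Max": max(present, default=None),
    }


rows = [{"Phone": "111"}, {"Phone": None}, {"Phone": "222"}, {"Phone": "111"}]
print(basic_statistics(rows, "Phone"))
```

Records should always equal Count plus Null Count, which is a quick consistency check when comparing counts between job steps.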

• Pattern Analysis
dfPower Architect's Pattern Analysis node is used to generate a
new field containing alphanumeric patterns that represent each
value in a selected field. You can specify whether these patterns
represent each character or each word (as separated by spaces)
in a field.
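Both modes can be sketched in Python. The symbols used here (9 for digits, A and a for upper- and lower-case letters, * for mixed words) are assumptions for illustration and may differ from the symbols the node actually emits.

```python
def character_pattern(value):
    """Map each character to a symbol: 9 digit, A upper, a lower, else itself."""
    symbols = []
    for char in value:
        if char.isdigit():
            symbols.append("9")
        elif char.isupper():
            symbols.append("A")
        elif char.islower():
            symbols.append("a")
        else:
            symbols.append(char)
    return "".join(symbols)


def word_pattern(value):
    """Map each space-separated word to 9 (digits), A (letters), or * (mixed)."""
    symbols = []
    for word in value.split():
        if word.isdigit():
            symbols.append("9")
        elif word.isalpha():
            symbols.append("A")
        else:
            symbols.append("*")
    return " ".join(symbols)


print(character_pattern("(609) 883-1300"))
print(word_pattern("301 Sullivan Way"))
```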

• Frequency Distribution
  - dfPower Architect's Frequency Distribution node is used to
    calculate the number of occurrences of each unique value in a
    field.
For example, Frequency Distribution can determine how many
customers in your customer database are in each of the 50 US
states, the District of Columbia, and the 13 Canadian
provinces and territories.
State | Count of Customers | % Total
CA    | 19593              | 12
CO    | 4041               | 2
CT    | 2807               | 1
DC    | 2555               | 1
DE    | 746                | 0
FL    | 7105               | 4
GA    | 5198               | 3
GE    | 1                  | 0

GEO         | GEO_count | GEO %
Americas    | 187235    | 57
AsiaPacific | 30642     | 9
EMEA        | 107412    | 33
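A frequency distribution is simple to reproduce with Python's collections.Counter. In this sketch the percentage is truncated to a whole number, which is an assumption based on the GEO figures above; frequency_distribution is a hypothetical helper name.

```python
from collections import Counter


def frequency_distribution(values):
    """Return (value, count, whole-number percent) for each unique value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        (value, count, 100 * count // total)
        for value, count in sorted(counts.items())
    ]


states = ["CA", "CA", "CO", "CA"]
for value, count, percent in frequency_distribution(states):
    print(value, count, percent)
```

A stray value such as "GE" with a count of 1 stands out immediately in this kind of output, which makes the distribution useful for spotting invalid codes.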

• Data Validation
  - dfPower Architect's Data Validation node is used to analyze
    the content of data by setting validation conditions. These
    conditions create validation expressions that you can use to
    filter data for a more accurate view of that data.
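Filtering by validation conditions can be sketched as a list of predicates applied to each row. The conditions and the split_by_validation helper below are hypothetical examples, not the node's expression syntax.

```python
def split_by_validation(rows, conditions):
    """Separate rows that satisfy every condition from those that do not."""
    valid, invalid = [], []
    for row in rows:
        passed = all(condition(row) for condition in conditions)
        (valid if passed else invalid).append(row)
    return valid, invalid


KNOWN_STATES = {"CA", "CO", "CT", "DC", "DE", "FL", "GA"}
conditions = [
    lambda row: row["State"] in KNOWN_STATES,  # state code must be known
    lambda row: row["Count"] > 0,              # count must be positive
]
rows = [{"State": "GA", "Count": 5198}, {"State": "GE", "Count": 1}]
valid, invalid = split_by_validation(rows, conditions)
print(valid, invalid)
```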

DF NODE – ENRICHMENT
• Address Verification
  - Use dfPower Architect's Address Verification (US/Canada/World) node to
    verify, correct, and enhance any addresses in your existing data (QKB).
    Address Verification (US/Canada/World) uses geographic information
    from various reference databases to match and standardize addresses.
    You can also use Address Verification (US/Canada) for proper casing and
    CASS/SERP compliance. Each address is assigned one of the result codes
    listed on the next slide, which indicates how valid the address is.

DF NODE – ENRICHMENT
• For US Addresses
Text Result Code | Numeric Result Code | Description
OK               | 0                   | Address was verified successfully.
PARSE            | 11                  | Error parsing address. Components of the address may be missing.
CITY             | 12                  | Could not locate city/state or ZIP in the USPS database. At least (city and state) or ZIP must be present in the input.
MULTI            | 13                  | Ambiguous address. There were two or more possible matches for this address with differing data.
NOMATCH          | 14                  | No matching address found in the USPS data.
OVER             | 15                  | One or more input strings is too long (maximum 100 characters).

• For Canada Addresses
Result Code Description
0 No error occurred
1 Internal error
2 Cannot load database
3 Invalid - unspecified reason
4 Invalid civic number
5 Invalid street
6 Invalid unit
7 Invalid delivery mode
8 Invalid delivery installation
9 Invalid city
10 Invalid province
11 Invalid postal code
12 Address is not Canadian

• Rest of World (excluding US and Canada)
Result Code | Description
0           | Address correct as entered.
1           | Address corrected automatically.
2           | Address needs to be corrected, but could not
3           | Address needs to be corrected, but could not be determined automatically. There is a fair
4           | Address needs to be corrected, but could not be determined automatically. There is a small

DF NODE – MONITORING
• Data Monitoring
  - The Data Monitoring node enables you to analyze data according to
    business rules you create using the Business Rule Manager. The
    business rules you create in Rule Manager can analyze the structure of
    the data and trigger an event, such as logging a message or sending an
    email alert, when a condition is detected. By using the Data Monitoring
    node, you can insert these business rules in your job flow to analyze
    data at various points in the flow.
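The rule-and-event idea can be sketched in plain Python as named conditions that each produce an event when detected. The rule names, the event structure, and the monitor function are assumptions for illustration; they do not reflect the Business Rule Manager's actual format.

```python
def monitor(rows, rules):
    """Return one event for every (row, rule) pair whose condition is detected."""
    events = []
    for index, row in enumerate(rows):
        for name, condition in rules.items():
            if condition(row):
                events.append({"row": index, "rule": name})
    return events


rules = {
    "missing_phone": lambda row: not row.get("Phone"),
    "missing_email": lambda row: not row.get("Email"),
}
rows = [
    {"Phone": "1-800-638-5000", "Email": None},
    {"Phone": None, "Email": None},
]
events = monitor(rows, rules)
for event in events:
    print(f"row {event['row']}: rule {event['rule']} triggered")
```

In a real flow the print call would be replaced by whatever the event should do, such as writing a log entry or sending an alert.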
