Hadoop Powered Corporate Data
How to Produce and Manage Meaningful Data and Analytics
Dr. Geoffrey Malafsky
Phasic Systems Inc.
Phasic Systems Inc. 2
Governance
Warehouse
Analytics
NoSQL Streaming
BI
Integration
Architecture
Modeling
Big Data Hadoop Velocity,
Volume,
Variety
Veracity
Phasic Systems Inc. 3
What does this really mean for my corporate data?
Disruption
Phasic Systems Inc. 4
Organizational Issues
Technology Issues
Business Issues
Phasic Systems Inc. 5
Are we discovering new knowledge?
Are we analyzing business and operations for decisions, audit, compliance, consolidation?
Are we fulfilling required reports?
Phasic Systems Inc. 6
Veracity, Meaningful: Does it matter?

Topic              Should   Does
BI                 Yes      Sometimes
Required Reports   Yes      Sometimes
Audit              Yes      Yes
Compliance         Yes      Yes
Consolidation      Yes      Sometimes
Marketing          Yes      Sometimes
Financial          Yes      Yes, but…
Decision Making    Yes      Yes, but…
TechLab by InsideAnalysis
Phasic Systems Inc. 7
Normalizing Corporate Small Data With Hadoop and Data Science
By Dr. Geoffrey P Malafsky
In part one of this discussion series (Hadoop for Small Data), I introduced the idea that Small Data is the mission-critical data management challenge. To
reiterate, Small Data is “corporate structured data that is the fuel of its main activities, and whose problems with accuracy and trustworthiness are past
the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision
making, applications, reports, and Business Intelligence.”
I am excluding what I call stochastic data use cases, which can succeed even if there is error in the source data and uncertainty in the results, since the
business objective is getting trends or making general associations. Most Big Data examples are of this type. In stark contrast are deterministic use cases,
which I am focusing on here and in the next TechLab in September, where the ramifications of wrong results are severely negative. This is the realm of
executive decision making, accounting, risk management, regulatory compliance, and security, to name a few.
Corporate Small Data is structured data that is the fuel of its main activities.
Data Normalization combines subject matter knowledge, governance, business rules, and raw data to make it meaningful.
Phasic Systems Inc. 8
Hadoop was created to handle extraordinarily
large and constantly changing data sets. It is a
very well-engineered software framework and
set of tools for distributed storage and cluster
computing. But can it help solve the intractable
challenges with key corporate data?
The Challenge of Corporate Small Data
Phasic Systems Inc. 9
multiple sources
multiple definitions
multiple copies
variable structures
different data values
hidden conflicts in data definitions
which to use
different model types & standards
more storage
more data flows
many DW & marts
different ETL
complex dependencies
conflicting business rules
analyses restricted by inconsistencies
Phasic Systems Inc. 10
An example of embedded errors that defy traditional tools and methods: two
authoritative data systems have many occurrences of conflicts, errors, and
quantitative discrepancies. Finding these has been too difficult with common tools.
But using a small Hadoop cluster (this is Corporate Data, not Big Data) allows us to
iteratively detect, learn, and adjust. Once a discrepancy is detected, investigated, and
understood, we can find just the one answer from the business that is needed to correct it.
Phasic Systems Inc. 11
136666505 adese genc petrol
136666505 amy lily chung
136666505 anderson erin ruth
136666505 andrew william knef
136666505 anduaga-arias laura
136666505 angelica m. de la cruz
136666505 anthony o'brien, 330531-5100194
136666505 batac belle
136666505 bottesini beth ms.
136666505 bouck shannon
136666505 bunn amy b.
136666505 carlene clark
136666505 cho, boong haeng
136666505 choe, sun young
136666505 christina michajlyszyn
136666505 christopher cannon
136666505 christopher l. booth
136666505 chun, kil mo
136666505 conflict + transition consultancies
136666505 cozzone elaine
136666505 deborah p. carney
136666505 denihan patricia joann
136666505 dong sook mcgeorge, 690525-2716816
136666505 dorene d.lukewalton,pharm d.
136666505 dr. terry a. klein
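The listing above shows one identifier attached to many unrelated names. A minimal sketch of how such keys could be surfaced follows, assuming the data arrives as (identifier, name) pairs. The function names, the `max_names` threshold, and the `000000001` rows are hypothetical illustrations, not part of the original system.

```python
from collections import defaultdict


def names_per_key(records):
    """Map each identifier to the set of distinct (trimmed, lowercased) names seen with it."""
    grouped = defaultdict(set)
    for key, name in records:
        grouped[key].add(name.strip().lower())
    return grouped


def suspicious_keys(records, max_names=3):
    """Return identifiers that carry more distinct names than max_names (hypothetical threshold)."""
    grouped = names_per_key(records)
    return {key: sorted(names) for key, names in grouped.items() if len(names) > max_names}


sample = [
    ("136666505", "amy lily chung"),
    ("136666505", "batac belle"),
    ("136666505", "bouck shannon"),
    ("136666505", "carlene clark"),
    ("000000001", "Example Vendor Inc"),   # hypothetical row
    ("000000001", "example vendor inc "),  # same name, different case and spacing
]

flagged = suspicious_keys(sample)
for key, names in flagged.items():
    print(key, len(names), "distinct names")
```

On a cluster the same grouping would typically be expressed as a group-by over the full table; the logic is unchanged.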
[Figure: Proportion of DUNS Matched by Transform Type. Y-axis: percent of DUNS with >= 50% of names matched (0 to 100). X-axis: transform type (WhiteSpace, Transpose, Acronym, NoiseWord, LowSim, Punctuation). Series: FPDS, FPDS-WAWF, FPDS-WAWF-GDUNS.]
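The chart labels name several transform types used when matching names. The sketch below is one possible reading of four of them (WhiteSpace, Punctuation, NoiseWord, Transpose), assuming each is a simple string normalization applied before comparison. The function names and the `NOISE_WORDS` list are hypothetical, and the Acronym and LowSim transforms are not attempted here.

```python
import string

# Hypothetical noise-word list; a real one would come from governance and subject matter experts.
NOISE_WORDS = {"inc", "llc", "corp", "co", "ltd", "ms", "mr", "dr"}

# Translation table that turns every ASCII punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation(name):
    """Punctuation transform: replace punctuation with spaces."""
    return name.translate(_PUNCT_TO_SPACE)


def collapse_whitespace(name):
    """WhiteSpace transform: trim and collapse runs of whitespace."""
    return " ".join(name.split())


def drop_noise_words(name):
    """NoiseWord transform: remove tokens that carry no identifying information."""
    return " ".join(token for token in name.split() if token not in NOISE_WORDS)


def sort_tokens(name):
    """Transpose transform: make token order irrelevant ("bunn amy" vs "amy bunn")."""
    return " ".join(sorted(name.split()))


def normalize(name):
    """Apply the transforms in a fixed order to get a comparison key."""
    name = strip_punctuation(name.lower())
    name = collapse_whitespace(name)
    name = drop_noise_words(name)
    return sort_tokens(name)


def same_entity(a, b):
    """Two names match when their normalized keys are identical."""
    return normalize(a) == normalize(b)


print(same_entity("Bunn, Amy B.", "amy b  bunn"))
print(same_entity("carlene clark", "christopher cannon"))
```

Applying the transforms one at a time, rather than all together as `normalize` does, would show how many matches each transform contributes, which is what the chart appears to report.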
Requirements for Data Analytics
1. Data must be understood
2. The right definitions must apply at the right time for the right user
3. Data’s lineage and provenance must be clear
4. Data integrity must be preserved
5. Data must be accurate, consistent, complete, timely, unique and valid
6. Data and system access must be secure
7. Data must be provided in multiple arrangements to meet different user needs and analytical
processing requirements
8. Data must be prepared and tracked to support meaningful analysis for different user needs
9. Data processing must be flexible to adapt to new knowledge and discoveries on data already
being used
10. Data must be normalized using authoritative or best known sets of codes, lookup values, and
source adjudication knowledge and rules
11. High speed, low maintenance techniques and tools are needed to be cost and time effective
12. Lifecycle audits and data maintenance must be performed, including maintaining and
documenting data from raw source to intermediate transformed to fully normalized
13. Use common data models that align, correct, and semantically unify data from multiple
sources to enforce meaningful and consistent analysis
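Requirements 3, 10, and 12 can be illustrated together with a small sketch: a value that keeps its raw form and records every rule applied on the way to the normalized form. This assumes a simple in-memory representation; `TrackedValue`, `STATE_CODES`, and the rule names are hypothetical and stand in for whatever lookup sets and lineage store an organization actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical lookup table standing in for an authoritative code set.
STATE_CODES = {"virginia": "VA", "va": "VA", "va.": "VA"}


@dataclass
class TrackedValue:
    """A value that remembers its raw form and each intermediate transformation."""
    raw: str
    steps: list = field(default_factory=list)  # (rule name, value after the rule)

    @property
    def current(self):
        # Latest transformed value, or the raw value if no rule has run yet.
        return self.steps[-1][1] if self.steps else self.raw

    def apply(self, rule_name, func):
        # Record the rule and its result so the lineage can be audited later.
        self.steps.append((rule_name, func(self.current)))
        return self


def lookup_state(value):
    # Values missing from the lookup pass through unchanged.
    return STATE_CODES.get(value, value)


tracked = TrackedValue(raw="  Virginia ")
tracked.apply("trim", str.strip).apply("lowercase", str.lower).apply("state_lookup", lookup_state)

print(tracked.raw, "->", tracked.current)
for rule_name, value in tracked.steps:
    print(rule_name, repr(value))
```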
Phasic Systems Inc. 12
An Example of Hidden Business Rules and Logic
• If (DELIVERY_ORDER=NULL) v_piid = CONTRACT else v_piid = DELIVERY_ORDER
• If (x1='0') v_modification_number = '0' else v_modification_number = x2
• where x1: if (ACO_MOD=NULL) x1 = x3 else x1 = ACO_MOD
• where x3: if (PCO_MOD=NULL) x3 = '0' else x3 = PCO_MOD
• where x2: if (x4=NULL) x2 = '0' else x2 = x4
• where x4: x4 = LTRIM(x5)
• where x5: x5 = x1
• Essentially, this first tries to use ACO_MOD; if that is NULL it tries PCO_MOD, and it sets the value to '0' if both are NULL
• If (DELIVERY_ORDER=NULL) v_idv_piid = y1 else v_idv_piid = CONTRACT
• where y1: y1 = REF_PROC_INSTRUMENT with all '-' characters removed
Phasic Systems Inc. 14
key business logic as buried in a database stored procedure (condensed)
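The rules on this slide can be restated as a single explicit function, which makes the logic visible and testable outside the stored procedure. This is a sketch, assuming each row is a dictionary keyed by the column names on the slide, that None and the empty string both count as NULL, and that LTRIM strips leading spaces. The function names and the sample values are hypothetical.

```python
def is_null(value):
    # Assumption: None and the empty string both count as NULL.
    return value is None or value == ""


def derive_identifiers(row):
    """Derive v_piid, v_modification_number, and v_idv_piid from one source row."""
    delivery_order = row.get("DELIVERY_ORDER")
    contract = row.get("CONTRACT")

    # v_piid: prefer the delivery order, fall back to the contract.
    v_piid = contract if is_null(delivery_order) else delivery_order

    # x3, x1: use ACO_MOD, else PCO_MOD, else '0'.
    x3 = "0" if is_null(row.get("PCO_MOD")) else row["PCO_MOD"]
    x1 = x3 if is_null(row.get("ACO_MOD")) else row["ACO_MOD"]

    # x5 = x1, x4 = LTRIM(x5), x2 = '0' when x4 is NULL.
    x4 = x1.lstrip(" ")
    x2 = "0" if is_null(x4) else x4
    v_modification_number = "0" if x1 == "0" else x2

    # y1: REF_PROC_INSTRUMENT with all '-' characters removed.
    ref = row.get("REF_PROC_INSTRUMENT")
    y1 = None if is_null(ref) else ref.replace("-", "")
    v_idv_piid = y1 if is_null(delivery_order) else contract

    return {
        "v_piid": v_piid,
        "v_modification_number": v_modification_number,
        "v_idv_piid": v_idv_piid,
    }


# Hypothetical rows for illustration only.
no_order = derive_identifiers({
    "CONTRACT": "C-100", "DELIVERY_ORDER": None,
    "ACO_MOD": None, "PCO_MOD": "  P00002", "REF_PROC_INSTRUMENT": "GS-35F-0001",
})
with_order = derive_identifiers({
    "CONTRACT": "C-100", "DELIVERY_ORDER": "0004",
    "ACO_MOD": None, "PCO_MOD": None, "REF_PROC_INSTRUMENT": "GS-35F-0001",
})
print(no_order)
print(with_order)
```

Keeping the intermediate names x1 through x4 and y1 makes it easy to check the function line by line against the condensed stored-procedure logic.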
Phasic Systems Inc. 15
Flexible, Fast, Adaptive, Multi-Tool Data Analytics Environment
Phasic Systems Inc. 17
[Figure: FPDS Hadoop Query Times, Text Field (secs). Engines: Hive, Impala, SQLServer. Storage formats: Text, Parquet, Parquet Partitioned. Axis range: 0 to 400.]
Phasic Systems Inc. 18
Parallel Jobs in Hadoop