Data Governance in a big data era
1. Data Governance
in a Big Data Era
Pieter De Leenheer, PhD
Columbia University in the City of New York
August 7, 2017
2. Administration
• Slide Deck
• < slideshare link>
• Software
• Data Governance Center:
• inno.collibra.com
• User: johnfisher / Pwd: 08072017
• On the go:
• Collibra on the App Store
• Collibra University
• Sign up for free: http://university.collibra.com
• Collibra Blog, e.g.:
• https://www.collibra.com/blog/unleash-the-data-democracy-5-misconceptions-of-data-governance/
3. Overview
Intro - A Data Governance
Odyssey
•Library of Babel
•About the Company
Part 1 – The Chief Data Officer
Rises
•Digital Darwinism
•The Big Data Bang
•Data Brawls and FUD
•The Chief Data Officer Role Types
Part 2: Data Universe Expands
• Data Value Hierarchies, Networks and Hybrids
• Shift in Data Governance Approaches
• Systems of Record vs. Systems of Engagement
• Challenges:
• Big Data Analytics
• Digitalization of Trust
• Weapons of Math Destruction
Part 3: A Lens on the Data Universe
• A system of record for data
• Use Cases
12. What do the companies in these groups have in
common?
• Group A: American Motors, Brown Shoe, Studebaker,
Collins Radio, Detroit Steel, Zenith Electronics, and
National Sugar Refining.
• Group B: Boeing, Campbell Soup, General Motors,
Kellogg, Procter and Gamble, Deere, IBM and Whirlpool.
• Group C: Facebook, eBay, Home Depot, Microsoft, Office
Depot and Target.
Conclusion
• only 12.2% of the Fortune 500 companies in 1955 were
still on the list 59 years later in 2014
• life expectancy of a firm in the Fortune 500
• 50 years ago : ~ 75 years
• Today: < 15 years and declining
MIT Technology Review, Sept., 2013
13. What happened between 1955 and today
that caused this ‘creative destruction’?
• Name some compelling events in information
technology history
• Order them chronologically
• Try to explain the phenomenon in terms of the events
• E.g.,
• Invention of the transistor
• First modern computer
• Publication of the Internet protocol
• Launch of the World Wide Web
• Wikipedia
• Internet startups: FB, Google, etc.
• data big bang
14. Data Big Bang
• Phenomenon: connectivity between
• Social
• Knowledge
• Technology
• Draws curiosity
• Web Science (Pentland, etc)
• Big data native market entrants (23andMe, Uber,
Inventure)
• Enter bottom-up at the low end and disrupt
• Pure data strategy
• Serving “data-citizen” Millennials
• +80% unstructured data or ‘dark energy’
24. Role Types for the Chief Data Officer
(CDO)
(Lee et al., 2014)
• Dimensions of CDO Roles
• Collaboration: inwards / outwards
• Data Space: traditional data / big data
• Value Impact: service / strategy
• Reporting:
• 30% to CEO
• 20% to COO
• 10% to CFO
26. Shift in Data Governance Approaches
• Digital forces pose enormous risks as well as opportunities for organizations;
a balance is needed between:
• Hierarchical data governance (system of record)
• CDO as a Coordinator: Inward-oriented / Traditional Data / Service
• Defensive: Risk-driven
• Scarcity: Few consumers, few producers
• Rests on obsolete cost assumptions about digital power
• Use of digital optimizes only to some extent
• Does not scale to big data used by larger ‘data scientist’ populations
• Networked data governance (systems of engagement)
• CDO as an Experimenter: Outward / Big Data / Strategy
• Offensive: Value-driven
• Abundance
• Many Producers (Data Democratization)
• Eliminate Breadlines
• Consumerization of BI and cheap digital power
• Many serve many
• Supports customer
• Many Consumers (Data Amazonification)
• Access, SLA, Trust, Secure Cloud, etc
28. Big Data Analytics
Challenges
• When everybody has data scientists, predicting the next
transaction is no longer a competitive advantage
• Move from 'predict next transaction' to life-long relationship
building and value creation
• Reduce search and navigation for the customer with
better apps
• Use crowdsourcing to compare with and learn
from other customers (Opower, INRIX, Zillow)
• Earn the customer's trust through branded, non-intrusive
apps: personal health monitoring, Nest
• Retention analysis example
29. Digitalization of Trust
Challenges
• In Hierarchical Data Governance, trust is
• established by a centrally sanctioned competence center
• Or external appointed trustees with formal roles: steward,
owners, architects
• In Networked Data Governance, trust is more complicated:
• Authenticity: is the data fact or opinion?
• Intention: does this data have good intentions? Can I use
it without peril? Hidden privacy concerns I should be
aware of?
• Assess expertise or quality: are people involved skilled or
certified stewards?
• Is it accurately representing our business reality, i.e.
customer base?
• Is it complete and up to date?
• Has it been certified through a standard process?
30. Danger of the old paradigm models
• Weapons of Math Destruction (WMD) are
models
• Threaten to destabilize
• Equality
• Democracy
• Traits of WMDs
• Opaque
• Unregulated
• Uncontestable
• …hence : ungoverned
31. Preliminary Conclusions
• Digital forces have digitally empowered individuals in the organization
• Hybrid data governance approach should combine
• Top-down governance of critical data assets to enhance internal coordination
• Networked peer-driven empowerment to drive ‘serendipity’
• On a shared platform
• Key challenges are:
• Digitalization of trust with focus on social capital
• Big data analytics that drives life-time value for customer
• Data Valuation based on Usage
• Legacy of opaque, unregulated and incontestable models
• Recognize CDO Leadership and Role transition
37. 4 Industries
• Technology – Big Data Valuation
• Health Care – Reference Data
• Manufacturing – IoT and GDPR
• Banking - Compliance
• Can you identify hierarchical vs networked mechanisms in these
business cases?
39. Not all data is of equal value
• At Dell, lean data governance:
catalogs an inventory of all types of
data assets while implementing a
minimum set of business specific
metadata attributes
• Data is governed based on level of
consumption – the value of the
data and how much is shared
• Categorized as enterprise
supported “operationalized” or
innovation discovery
(courtesy Barbara Tulippe)
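The consumption-based tiering described on this slide could be sketched as follows. This is a minimal illustration in Python, not Dell's actual rules: the `DataAsset` fields, the `SHARED_THRESHOLD` value and the use of "number of consuming teams" as the consumption measure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DataAsset:
    name: str
    consuming_teams: int  # hypothetical measure of how widely the asset is shared


# Hypothetical cut-off: assets shared by at least this many teams
# are treated as enterprise supported ("operationalized").
SHARED_THRESHOLD = 3


def governance_tier(asset: DataAsset) -> str:
    """Classify an asset by its level of consumption."""
    if asset.consuming_teams >= SHARED_THRESHOLD:
        return "enterprise supported"
    return "innovation discovery"


# Illustrative catalog entries (made-up names and counts).
catalog = [
    DataAsset("customer_master", consuming_teams=12),
    DataAsset("clickstream_sandbox", consuming_teams=1),
]
tiers = {asset.name: governance_tier(asset) for asset in catalog}
print(tiers)
```

Only the tier decides how much metadata and stewardship an asset receives; the catalog itself still lists every asset, which matches the "lean" idea of inventorying everything with a minimum set of attributes.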
41. Reference Data in Health Care
• Independence is the leading health
insurer in southeastern
Pennsylvania.
• Serve close to seven million
people nationwide, including
2.1 million in the region.
• 42,000 physicians
• 160 hospitals
https://prezi.com/ve1ws8jmpqcn/workflow/
42. IoT + GDPR in Manufacturing
• Internet of Things and GDPR
• From responsive to competitive advantage
• Steps
• Identify processes ‘touching’ EU citizen data: employees / customers /…
• Identify critical data elements: name, SSN, address
• Lineage / Traceability
https://hbr.org/2014/11/how-smart-connected-products-are-transforming-competition
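The steps above can be sketched as a lineage traversal: start from the data sets that hold critical data elements and follow them downstream to find every process or report they reach. This is a minimal sketch in Python; the `LINEAGE` graph, the data set names and the `CRITICAL_ELEMENTS` mapping are hypothetical examples, not taken from the deck.

```python
from collections import deque

# Hypothetical lineage graph: each data set maps to the downstream
# data sets or reports it feeds.
LINEAGE = {
    "crm.customers": ["billing.invoices", "analytics.customer_360"],
    "billing.invoices": ["finance.report"],
    "analytics.customer_360": [],
    "finance.report": [],
    "sensors.telemetry": ["analytics.device_health"],
    "analytics.device_health": [],
}

# Hypothetical critical data elements per source data set.
CRITICAL_ELEMENTS = {"crm.customers": {"name", "SSN", "address"}}


def downstream(source):
    """Breadth-first traversal returning every node reachable from source."""
    seen = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in LINEAGE.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# For each source holding critical elements, list what it flows into.
touching = {src: sorted(downstream(src)) for src in CRITICAL_ELEMENTS}
print(touching)
```

In this toy graph the sensor telemetry branch is not flagged, because no critical data element was registered for it; in practice that registration step is where the governance effort goes.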
44. Compliance in Banking - Scorecard
• How many Critical Data Elements (CDEs) have a dedicated
stewardship resource assigned?
• Are those Business Stewards actively participating in stewardship
activities?
• Are CDEs progressing through the expected life cycle?
• Are relations to physical data assets and source systems defined?
• Is data profiling occurring based on defined Data Quality Rules?
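Most of the scorecard questions above reduce to simple coverage ratios over the set of CDEs. The following Python sketch shows one way to compute them; the `CDE` fields, the `"accepted"` status value and the metric names are assumptions for illustration and do not reflect the Collibra data model. The question about stewards actively participating is left out because it needs activity data that this toy model does not have.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CDE:
    name: str
    steward: Optional[str] = None          # assigned stewardship resource, if any
    status: str = "candidate"              # hypothetical life-cycle status
    source_systems: List[str] = field(default_factory=list)
    quality_rules: int = 0                 # number of data quality rules profiled


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal; 0.0 for an empty set."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def scorecard(cdes: List[CDE]) -> dict:
    total = len(cdes)
    return {
        "steward_assigned_pct": pct(sum(1 for c in cdes if c.steward), total),
        "accepted_pct": pct(sum(1 for c in cdes if c.status == "accepted"), total),
        "source_mapped_pct": pct(sum(1 for c in cdes if c.source_systems), total),
        "profiled_pct": pct(sum(1 for c in cdes if c.quality_rules > 0), total),
    }


# Made-up sample data.
cdes = [
    CDE("counterparty_id", steward="alice", status="accepted",
        source_systems=["core_banking"], quality_rules=2),
    CDE("exposure_amount", steward="bob", status="accepted",
        source_systems=["risk_engine"]),
    CDE("collateral_value", steward="carol"),
    CDE("rating_grade"),
]
result = scorecard(cdes)
print(result)
```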
45. Compliance in Banking – Operating Model
• Diagram elements: Regulations & Regulators, Principles & Requirements, Policies & Standards, Regulatory Report Catalog, Critical reporting elements
• Business Dimensions: Lines of Business, Business Process, Data Categories, System inventory
• Business glossaries: Enterprise, Market risk, Credit risk, Third party, Financial instruments
• Life Cycle Management (2.0): Critical Business Term, Policies, Data Categories, Principles
• Compliance score cards; compliance results are fetched through:
• Collibra DGC automation for computed results on asset counts
• Collibra Connect for results coming from external systems
• Collibra Workflows for results captured by stakeholders
• Layers: BCBS 239 model & content (1.0); BCBS 239 Metamodel; BCBS 239 compliance KPI, result capturing mechanisms & scorecards (3.0); Company specifics; Reference Business Glossaries (1.0); …
• Legend (items numbered 1 to 4 in the diagram):
• BCBS 239 model & content – Collibra-configured metamodel loaded with BCBS 239 content from the Basel Committee and other recognized bodies. Independent of customer context. Can be used out of the box.
• BCBS 239 compliance KPI, result capturing mechanisms & scorecards – Collibra out-of-the-box workflows, asset-count dashboards, compliance KPIs and scorecards. To be used with specific customer configurations and integrations when required.
• Company specifics – Collibra standard content, to be modified to fit the company's specifics.
• Business Glossaries – Common definitions of major financial business concepts. To be used with specific customer adaptations.
48. Recommended Reading
• Books:
• O’Neil, C.: Weapons of Math Destruction
• Franks, B.: Taming the Big Data Tidal Wave
• Sundararajan, A.: The Sharing Economy
• Pentland, S.: Social Physics: How Good Ideas Spread
• Zittrain, J.: The Future of the Internet
• Tunguz, T.; Bien, F.: Winning with Data (2016)
• Articles:
• Lee et al. (2014) A Cubic Framework for the Chief Data Officer: Succeeding in a World of Big Data. MIS Quarterly Executive 13:1
• AIIM, Systems of Engagement and the Future of Enterprise IT (2017)
• http://mitiq.mit.edu/IQIS/Documents/CDOIQS_201177/Papers/05_01_7A-1_Laney.pdf
• http://si.deis.unical.it/zumpano/2004-2005/PSI/lezione2/ValueOfInformation.pdf
• http://dupress.deloitte.com/dup-us-en/topics/emerging-technologies/the-burdens-of-the-past.html
• Blog Posts
• https://www.collibra.com/blog/unleash-the-data-democracy-5-misconceptions-of-data-governance/
• https://www.collibra.com/blog/the-rise-of-the-chief-data-officer-cdo/
• https://www.collibra.com/blog/blognew-years-resolution/
• https://www.collibra.com/blog/data-lineage-diagrams-paradigm-shift-information-architects/
Editor's Notes
Intro: intro to our company to make sure what the background is of what I am going to tell you
inno.com
Invention of the transistor
60’s Internet
1989 Publication by CERN of global hypertext system
2004 Social media
streaming
Explosion of mobile devices
* Evolution, not transformation, because the latter would not take into account the “creative destruction” part.
An acceleration will happen as many of these young data citizens enter the job market and change their employers' behaviour.
Data-driven is the word of the day. Organizations increasingly depend on data to thrive.
According to Forrester Research, even though 74% of firms say they want to be data-driven, in reality only 29% say they are good at connecting analytics to action.
There are many reasons why this happens. See if this one sounds familiar:
You spend hours preparing for a meeting. You collect the data you need from finance, IT, and even retrieve some from the data lake. You analyze it from every angle, and prepare your insights and recommendations. You’re confident in your findings, and are ready to make your data-driven argument.
But before you have a chance to present, your colleague presents her findings and recommendations, using data she analyzed from the data lake, salesforce.com, and IT. It’s vaguely familiar, but distinctly different, and leads to a data-driven argument with a completely different conclusion.
Now, instead of making a data driven argument, you find yourself engaged in a data brawl. The meeting is no longer focused on the decisions at hand, but rather on answering questions about the data itself:
Where did it come from?
What does it mean?
What metrics were applied to it?
Who can access it?
Is it correct? If not, how can I fix it?
Everyone realizes that “data-driven” doesn’t work without trustworthy data.
In an effort to answer those questions so this situation doesn’t happen again, you head back to your office and put in place processes driven by Excel spreadsheets, emails, meetings, and Sharepoint documents. You track information about the data - where it comes from, who can access it, what it means, and more - in all these different ways, holding the process together with duct tape and string. Your colleague does the same, and before you know it, many people across your organization have a similar process that they follow, which only multiplies the problem. These processes are inefficient, inaccurate, and expensive. It’s no wonder your meetings dissolve into data brawls. This approach isn’t just unsustainable – it simply doesn’t work.
ENDLESS MEETINGS – people fly in every two weeks
On the left side of the diagram, you have the traditional data infrastructure (IT) and they are coping with the trend of exploding volumes and types of data.
In addition, data growth is becoming more and more complex, increasing their challenge.
At the same time, on the right side, you have the business, which needs to report more complex data faster than ever and also needs analytics.
This can include internal reports, which are often compiled manually on spreadsheets; often reports rely on multiple layers of interrelated spreadsheets. This makes it more difficult than ever to trust and depend on the data.
Then from the Business focus, in the center, you have all kinds of trends like consumerization of IT (where IT is purchasing software tools), social and data maturity.
This leads to a need for a data authority.
If all these communication channels are in place, we can trace whereabouts and usage of every data product individually, from the definition level down to the storage.
Outwards: e.g., a manufacturing company may agree with its suppliers and distributors on one global product ID
Traditional: enterprise-level MDM, BI and analytics
Big Data: more at the application level, more self-service BI; more data scientists experimenting with big data, which requires appropriate approvals for data usage and sharing
Data as a service: an immediate need to improve service quality, regulatory compliance and the reputation of the company
Data as a strategy: build aggregated data products and resell them, e.g., a cab company selling GPS information of cabs to Google.
Top-down:
Methodical step-wise decomposition into sub-components (‘black boxes’) in a reverse-engineering fashion
Presumes a preconception of the ‘big picture’
Bottom-up:
Allowed behaviour based on a simple rule set
Subsystems give rise to complex systems
Original systems become subsystems of the emerging system
Perception
Seed model
To be data driven, an organization needs three things:
Knowing where its data comes from
Knowing what it means, and
Knowing that it’s right
What if everyone in your organization had the same ability – to find, understand, and trust their data?
This is what Collibra brings to its clients:
Collibra allows you to find, understand and trust your data.
Finding data: a catalog ingests information about datasets and the metadata underneath.
Understanding data: this metadata is put into context by linking it to business concepts: tags, business units, data dimensions, KPIs, reports...
And finally, Collibra allows you to trust this information by enabling policy management, a data helpdesk and the world’s best stewardship platform.