Building the enterprise data architecture

Building The Enterprise Data
Architecture
The Components, Processes and Tools of a
Successful Enterprise Data Architecture

Enterprise Data Serves Two
Purposes
 Running the Business
◦ Tracking transactions such as sales, invoices,
payments, deliveries, payroll, benefits, etc.
 Managing the Business
◦ Accounting, Budgeting and Planning
◦ Marketing
◦ Asset Management
◦ Process Optimization
◦ Strategy
 Definition
 Monitoring
 Investment and resource allocation

Why is Data Architecture So
Hard?
 Complex
◦ Data IS complex
◦ The world changes rapidly, hard to keep up
 Shared Data and Competing Stakeholder Interests
◦ Stakeholders have different definitions and uses for shared data
◦ Conflicts are often resolved with incompatible models and duplication
 Data Proliferation
◦ Useful data is replicated or transferred to other systems (e.g. accounting)
◦ How and where data is used is often unknown
 Data can be inaccurate
◦ Data entry is error prone and data duplication is common
◦ Calculation errors are surprisingly common
 Resource Intensive
◦ Databases can be very large
◦ Managing data can consume a lot of staff resources
◦ Tools and licenses can be pricey

A Robust Data Architecture
Should Be
 Trusted
 Accessible
 Timely
 Integrated
 Consistent
 Comprehensive (Complete)
 Verified
 Documented
 Discoverable
 Flexible
 Scalable
 Cost Effective
 Transparent and Visible
 Safe
 Governed
 Auditable (Lineage and eDiscovery)
 Secure

The Data Lifecycle
 Data goes through a lifecycle with phases such as:
◦ Creation/capture
◦ Modification
◦ Processing
 Aggregation, joining, filtering, transformation
◦ Consumption:
 Product/service delivery/fulfillment
 Reporting
 Statistical analysis
 Exploration (Data discovery, search)
◦ Archiving/destruction
 Data lifecycle can be complex and difficult to manage

A Successful Data Architecture
Must Deal With
1. Data Repositories
2. Data Capture and Ingestion
3. Data Definition and Design
4. Data Integration
5. Data Access and Distribution
6. Data Analysis
And More….

Data Repositories
 Objective:
◦ Repositories must be extremely solid and reliable.
 Issues
◦ Complexity of tools
◦ Licensing, resource and skill costs can be high
◦ Tolerance for failure is extremely low
 Types of Data Repositories
◦ Application
◦ Reference/Master
◦ Meta Data
◦ Data Warehouse
◦ Discovery
◦ Archive
◦ Offsite/DR Repositories
 Technology
◦ Traditional RDBMS: Oracle, DB2, SQL Server, Informix
◦ Data Warehouse: Neteeza, Greenplum, Exadata, etc
◦ Big Data: Hadoop, CouchDB, Redis, HBase, CouchDB

Data Capture and Ingestion
 Objective:
◦ Capture all relevant business data
◦ Ensure data is both accurate and complete at point of capture
 Issues
◦ Great diversity (online entry, ftp, APIs, real time streams, etc., unstructured
data)
◦ Incoming data is often unreliable, incomplete
◦ Big data explosion
 Types of Data Capture
◦ Online
◦ File transfers
◦ Social media (real time)
◦ Logs (web, security, o/s)
◦ Data subscriptions (address databases, etc.)
 Technology
◦ Online Applications (JEE, .NET, COBOL, etc.)
◦ File transfer: FTP
◦ APIs: Web services, REST, SOAP

Data Definition and Design
 Objective:
◦ Organize the data to suit business requirements
◦ Define data components, structures and processes to enable shared use and collaboration
◦ Define rules for format, domain, structure and relationships
 Issues
◦ Data modeling is complex – often requires unsatisfactory trade-offs
◦ As data is distributed or shared, quality, structure and definition tend to worsen
◦ Provide sufficient flexibility for graceful evolution
 Types of Data Definition Artefacts
◦ Data Glossaries
◦ Meta Data
◦ Data Domains
◦ Data Structures
◦ Data Relationships
◦ Validation Rules
◦ Data Transformations
 Technology
◦ Data modeling tools (Erwin, Power Designer, XMLSpy, etc.)
◦ Metadata workbenches
◦ Business glossaries

Data Integration
 Objectives
◦ Data must usually be shared (copied or transferred) to realize its full value
◦ Shared data requires transformation, filtering, verification, security controls, calculations,
aggregations etc.
 Issues
◦ In most cases, data integration is done in an inconsistent manner
◦ Lack of enterprise standards and controls. Department silos
◦ Data can become less unreliable each time it is handled
◦ Rules for data integration poorly understood and documented
 Types of Data Integration
◦ EAI, SOA
◦ ETL
◦ Federation
 Technology
◦ ETL Tools: DataStage, Informatica, SSIS, Oracle ETL
◦ ESB: Web services, REST, Websphere, WebLogic
◦ Messaging servers: MQSeries, WebSphere, ActiveMQ, MessageBroker, Biztalk
◦ Federated databases

Data Access and Distribution
 Objectives:
◦ Data is of no use unless it can be efficiently and reliably consumed by client applications
◦ Data must be presented in a consistent, trusted and well understood way
 Issues
◦ Access patterns are very diverse
◦ Client requirements can cause conflicts that are difficult to resolve
 Types of Access
◦ SQL
◦ File APIs
◦ Dashboards
◦ Query engines
◦ BI tools
◦ Canned Reports
◦ Discovery tools
◦ APIs
◦ Published views and queries
◦ Publication and replication
 Technology
◦ Reporting tools: Cognos, BO, Tableau
◦ Data feeds: ETL
◦ Query languages: SQL, SAS, R, etc.

Data Analysis
 Objectives:
◦ Provide insight into business operations
◦ Assist in planning; provide predictions
 Issues:
◦ Complex and difficult to do right
◦ Skills shortages
 Types of Analysis
◦ Predictive analytics
◦ Correlation
◦ Pricing and underwriting
 Technology
◦ SAS, SSPS, R
◦ Machine Learning algorithms

More Details
 There are many aspects to consider in
good architecture, including:
◦ Business Glossary
◦ Metadata
◦ Data Discovery and Profiling
◦ Data Lineage
◦ Data Reconciliation
◦ Data Modeling
◦ Master Data
◦ Data Wrangling
◦ Data Pipelines

Business Glossary
 Business Glossary is the place where all
stakeholders can agree what the data
actually means mean and what data rules
apply
◦ In the real world, lack of precision and conflict
about terminology and meaning is common
 This information needs to be managed,
coordinated and subject to quality controls
 An essential first step to a robust enterprise
architecture
 Business stakeholders own the data
definitions

Meta Data
 For every data field, record and
process its useful to provide detailed
definitions, descriptions and rules that
govern production, use and storage of
any data item
 This information is managed,
coordinated and subject to quality
controls
 Targeted to technical stakeholders

Data Discovery and Profiling
 Data Discovery is about knowing
where the data arrives from, where it
goes to and where it is used
 Data profiling analyzes the content,
structure, and relationships within data
to uncover patterns and rules,
inconsistencies, anomalies, and
redundancies.

Data Lineage
 In many cases its important to know how
data has reached a particular point.
 Example: Why do reports A and B
disagree about last month’s revenues?
 Answer: Examine the data sources to
ensure that both reports are using the
same data sources.
 NOTE: Sometimes data can move a long
way and through many steps before it
ends up in a report

Data Synchronization and
Reconciliation
 When data exists in multiple locations,
there is always a danger that errors will
creep in
 Therefore as part of a robust control
process, its useful to compare the data in
source/target systems to find errors or
omissions
 This can however become very complex
as it may involve reconstructing the data
lineage of every item

Data Modeling
 Data modeling with 3NF and
dimensional modeling is relatively
mature, but the real world is complex,
which can make some models hard to
design, build and use
 NoSQL takes a different approach
◦ Data is stored as documents
◦ Data is flexible and dynamic

Master Data
 When data is generated and consumed
in multiple locations and systems its very
difficult to manage the data
 A Master Data Management (MDM)
repository can:
◦ Act as the corporate system of record
◦ Define the shared canonical data model
◦ Enforce data rules consistently across all
systems
◦ Resolve conflicts
◦ Detect and remove duplicates

Data Wrangling
 Data is often less than perfect
◦ Missing data
◦ Incompatible fields
◦ Semantic incompatibilities
 Data wrangling is the art of making
merging diverse data sources into a
consistent data store with consistent
semantics, formats and values

Data Pipelines
 A data pipeline is the process which handles the transfer of data
from one or more sources and delivers it to one or more targets
 A data pipeline may have the following stages:
1. Extraction
2. Cleansing
3. Transformation
4. Load
 Types of Transfer:
◦ Batch (ETL)
◦ Continuous/Real Time (SOA, Federation)
 Push
 Pull
 Issues:
◦ Completeness
◦ Accuracy
◦ Latency
◦ Access method
◦ Technology and tools

Data Pipelines – Data Extraction
 Data is extracted from a source
system and landed in a staging area
 Data extracts can be:
◦ Complete dumps of all data
◦ Changes only
 Principles:
◦ Capture Everything
◦ Land everything in a staging area
◦ Read once, write many

Data Pipelines - Data Cleansing
 Cleansing processes will fix or flag:
◦ Invalid data
◦ Missing data
◦ Inaccurate data
 Principles:
◦ Partition data into clean and rejected data.
◦ Generate rejection reports
◦ Land data into a zone for ‘clean’ data
◦ Errors should be corrected in the source
systems when detected

Data Pipelines - Data
Transformation
 Transformation is a data integration function
that modifies existing data or creates new data
through functions such as calculations and
aggregations.
 Common transformations include:
◦ Calculations
◦ Splits
◦ Joins
◦ Lookups
◦ Aggregations
◦ Filtering
 Principles
◦ Stage transformed files to publish ready landing
zones

Data Pipelines - Data Loading
 Data can be loaded:
◦ As files
◦ Via RDBMS utilities
◦ SQL
◦ Messages
 Principles
◦ Group loading files by target repository

Data Pipelines – Design
Practices
 For each stage of the data pipeline, its
advisable to create:
◦ Conceptual Models – what will happen
◦ Logical Models – how will it happen (no
technical details)
◦ Physical Models – Technical details to
realize the design

Case Study – Extraction Model
 Conceptual Model
◦ Extract 2013 policy records from system A and system B
 Logical Model
◦ Extract 2013 policy records from daily batch files generated
by system A
◦ Issue SQL queries to tables S, T in system B and store in
file X
 Physical Model
◦ From system A read file CS403.txt from directory x/y/z and
store extract in ….
◦ Eliminate all records in A that…
◦ From system B run SQL query Select * from Policies,
Clients … and store extract in file … with format….
◦ Etc

The Future
 Hortonworks claims HALF the worlds
data will soon be in Hadoop clusters
 For the first time in a generation,
relational databases are seriously
threatened by NOSQL data bases –
Redis, HBase, MongoDB, CouchDB, etc
 Data exploration tools such as Tableau,
QlikView are disrupting traditional BI
market
 In memory databases make it feasible to
eliminate the data mart altogether

Summary
 A Data Architecture is much more than
a collection of Erwin data models –
that’s just the starting point
 Coordination, synchronization and
collaboration are the hallmarks of a
successful data architecture

Building the enterprise data architecture

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Building the enterprise data architecture

Similaire à Building the enterprise data architecture (20)

Dernier

Dernier (20)

Building the enterprise data architecture