WHITE PAPER
Enterprise Data Everywhere
Knowledge value chain approaches: A decision framework

- Bill Peer, Kiran Kumar Kaipa, Shyamala Sadananda, and Swaminathan Natarajan
Abstract
The proliferation of new analytic processing capabilities, the deafening marketing
hype of Big Data, and the radical dropping of processing power barriers have
brought focus on a problem that has long existed in the area of information
harvesting: data resides in lots of places and in lots of forms. There is no singular
solution or approach available today that allows all the information latent in an
enterprise to appear and be exploitable by all systems in the enterprise. That is,
there is no ideal way to handle the enterprise knowledge value-chain. This paper
provides an articulation of a decision framework for identifying the “best fitting
approach” for an enterprise’s “data everywhere” challenge, exploring the common
models of data foraging, data virtualization, data consolidation, and information
fabrics. The viewpoint of this paper is based on a common usage of enterprise-
wide data: Business Intelligence (BI). Within the realm of BI, this paper further
refines specific usage scenarios that many of our forward-looking clients expect:
advanced analytics and self-service BI.
Introduction
The proliferation of new analytic
processing capabilities, the deafening
marketing hype of Big Data, and the radical
dropping of processing power barriers
have brought focus on a problem that has
long existed in information harvesting:
data resides in lots of places and in lots
of forms. Companies that address the
problem are creating a new generation
of business applications that support
customer service, risk management,
multichannel integration, loyalty
management, regulatory compliance,
marketing, business performance
management, and other critical functions
[3, 2]. These applications require cross-
functional data, with varying semantic
meaning, to be blended in near real-time.
This has created an increased demand for
data access and integration solutions [3].
Increasing volumes of data, faster data
consumption and decision-making, and
integration cost and timeline pressures
make data integration a complex
challenge. This challenge acquires further
complexity due to the diversity of data sources
and data freshness requirements [3, 2].
Compounding things even further is the
growing diversity of the user population of
the synthesized, blended data knowledge
products. With so many dimensions to
the problem, companies are seeking
more flexible ways of making integrated
data available to constituencies, people,
applications, and systems – in ways that
control costs and complexity by keeping
customized application code to a minimum
[3, 2].
As of this writing, there is no singular
solution or approach available that
allows all the information latent in an
enterprise to appear and be exploitable
by all systems in the enterprise. There
are, however, numerous approaches to
address various aspects of the knowledge
value chain problem and they go by the
names of data lakes, information fabric [1],
data federation, data virtualization, data
warehouse augmentation, and more.
Vendors, research organizations and
the like offer varying approaches and
perspectives on “the best approach” for
blending enterprise data. Our experience
at Infosys in implementing these solutions
in a variety of contexts highlights that,
while there is no panacea available, there
is a set of decision-making criteria that can be
followed to help an organization pick
the “best fitting” answer for a particular
scenario and need. By “best fitting” we
mean a solution that is more effective at
realizing the desired outcome under one
set of conditions than another. This paper
explores these conditions to help our
clients make the best choice.
This paper is organized as follows. First,
a baseline of terms and concepts is
established such that the reader and
authors can carry forward the dialog.
Second, the “data everywhere” problem
is briefly discussed, noting the different
critical dimensions that must be
considered. This is followed by a discussion
on two common scenarios, advanced
analytics and self-service business
intelligence (BI). The paper finally presents
a decision-making framework that will help
enterprises identify which approach best
fits their situation.
Terms and Concepts
The terms and concepts covering the
domain of enterprise information
exploitation are constantly evolving. There
is a multitude of definitions from academia,
consultants, vendors, and historical
precedents. This paper defines a set of
working terms, definitions, and concepts
that bound the discussion that follows.
The intent of this paper is not to usurp
working definitions that may be in use by
any one group, but simply to make the ideas
conveyable. If these definitions differ from
those in the reader’s knowledge base, the authors
ask the reader to grant an exception for the
duration of this paper.
Basic Terms
•	 Data is a set of values. The numbers 5,
15, 32, 38, and 41 are data.
•	 Information is data placed in context. The
values 5, 15, 32, 38, and 41 labeled as
“Lottery Numbers” are information.
•	 Knowledge is the awareness of patterns
of information. If the information
“Lottery Numbers 5, 15, 32, 38, and 41”
is known to appear every three weeks,
then there is knowledge.
•	 Business intelligence (BI) is the
awareness of business operational
information.
•	 Analytics is the art and science
of discerning patterns in data and
information.
•	 Knowledge value chain (KVC) is a
set of processes or activities by which
a company refines data such that it
becomes knowledge. Each step in the
process is known as a Knowledge Value
Link.
Basic Concept
An enterprise is laden with data and
information. By applying analytics one
can create knowledge from these assets
by finding patterns of interest. This
knowledge can be showcased through
reports, dashboards, or as inter-system
actions such as ordering more stock in a
particular geographic location. Knowledge
creation and knowledge sharing requires
people, processes, and technology. The
entirety of this knowledge creation and
sharing is collectively known as the Knowledge
Value Chain (KVC). Different approaches
to realizing different parts of the KVC give
rise to different implementations seen in
the field.
The Problem
In today’s fast moving world, business
decision-makers, customers, and trading
partners need to act on rapidly changing,
difficult to harmonize information in
near real-time to generate value as
part of the KVC [3, 2]. The need for near
real-time response, combined with the
sheer number and complexity of the data
sources, creates a new set of data access
and integration challenges across the
enterprise [3, 2]. Enterprise data assets
are typified by different technologies in
underlying data sources and ownership
issues with lines of business data [3, 2].
Information in an enterprise can be
scattered across systems, technologies,
and geographies in various formats. There
are numerous reasons for such distributed
and heterogeneous information, such
as varying project funding models,
organizational structure constraints,
mergers and acquisitions, partner
integration requirements, regulatory
constraints, legal restrictions, and more
[3]. Regardless of the reason, blending
such widely distributed data is difficult to
accomplish. Even if one overcomes this
challenge, who is to say that the labeled
data being blended has the same format?
Typically, the information required for
blending is present in heterogeneous
formats thereby making the consolidated
access of this information difficult [3]. For
example, many companies today have
a variety of data marts and warehouses
for data consolidation activities as well as
operational data stores for transactional
purposes [3]. While it may be of value to
blend the information in these separate
systems for some analytical purpose, the
information contained within may be of
different formats, each optimized for its
particular need (e.g. business reporting
on one hand and shipment tracking on
another). Even if one overcomes this
challenge, who is to say that the unified
formatted information has the same
semantic meaning?
It is common for organizations to have
multiple stores with labeled data names
that are identical but whose semantic
meaning is not identical. For example, an
accounting department may have a stored
piece of information called “Sales” and
the sales department may have a stored
piece of information called “Sales.” In the
accounting world, this is specifically net
sales and it refers to a company’s revenue
earned from the sale of a product or
service while in the sales world it refers
to a specific transaction where money is
exchanged for ownership of a good or
service. If these subtle differences are not
recognized, blending these commonly
labeled informational elements will result
in polluted synthetic information. Even if
one overcomes this challenge, who is to
say that the harmonized information is
on compatible technology platforms that
allow interchange?
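As a minimal illustration of the harmonization step described above, the sketch below maps two identically labeled “Sales” fields onto distinct canonical names before blending. The department names, field names, values, and mapping table are hypothetical; this is one possible approach, not a prescribed one.

```python
# Illustrative sketch (hypothetical names): map identically labeled fields with
# different semantics onto distinct canonical names before blending them.

CANONICAL_MAP = {
    ("accounting", "Sales"): "net_sales_revenue",       # revenue earned from sales
    ("sales", "Sales"): "sales_transaction_amount",     # value of a single transaction
}

def harmonize(record, source):
    """Rename source-specific labels to canonical, semantically distinct names."""
    harmonized = {}
    for field, value in record.items():
        canonical = CANONICAL_MAP.get((source, field), f"{source}.{field}")
        harmonized[canonical] = value
    return harmonized

# Blending the two "Sales" figures without this step would silently mix
# net revenue with per-transaction amounts, i.e. polluted synthetic information.
print(harmonize({"Sales": 1250000}, source="accounting"))  # {'net_sales_revenue': 1250000}
print(harmonize({"Sales": 499.99}, source="sales"))        # {'sales_transaction_amount': 499.99}
```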
Any company that has had digital systems
for more than a year is guaranteed to
have different platforms of information
technology underpinning its landscape.
These varied platforms have varied
protocols, varied requirements on
information interchange, and varied
performance capabilities. These variances
introduce a swath of engineering
challenges.
Culling a unified blend of information
amid the aforementioned challenges
introduces significant latency, as not only
do the capability variances of the
underlying technologies have to be
accommodated, but information must
be harmonized, varying formats need to
be resolved, and physical spread must
be traversed. Even if one overcomes
these collective challenges, a slew of
new challenges and requirements come
forth from the new synthesized (blended)
information.
Dealing with synthesized information
in the enterprise requires a revisiting of
policies, access rights, data stewardship,
and provisioning. An individual element
of information in isolation may be fine
to operate on, analyze, and report (such
as a social security number), but the
moment it is blended with a first and last
name, the synthesized element is subject
to personally identifiable information
constraints. Such cases are particularly
challenging to identify when dealing with
technical data that may be subject to
export control laws.
All the aforementioned challenges beget
four broad categories of operational
challenges to be addressed:
1. Information service level agreements:
The right information needs to be
delivered to the right system in a timely
and consistent manner, irrespective of its
source. Many applications require real-
time data retrieval abilities [3, 2].
2. Data integration: Data which can
belong to different heterogeneous types
needs to be transformed, integrated
and aggregated to create intelligent
information for applications, systems,
and platforms [3, 2]. Sometimes this
requires creating integrated views of
data from different data sources [3]. The
variance in the business semantics from
one business unit to another creates an
additional challenge for data integration
and interleaving.
3. Data stewardship and provisioning:
Data architects and administrators must
protect the integrity and security of data
sources while making them available
for consumption by applications across
the enterprise [3, 2]. Added to these
concerns is the growing list of regulatory
compliance requirements that require
more careful auditing of enterprise
information usage [3, 2].
4. Information access rights: Information
security administrators
must ensure that only the right people
and the right systems
have access to the right information,
including dynamically created synthetic
information. If a user’s access to a particular
atomic component of some synthesized
data is revoked, this must cascade to all
other systems that leverage the blended
data.
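To make the cascading effect of an access revocation concrete, the sketch below derives access to a blended data product from the grants on its atomic components, so that revoking one component grant immediately removes access to every synthesis built on it. This is our own simplified model, not a mechanism described in the paper; all user, element, and dataset names are hypothetical.

```python
# Hypothetical sketch: access to synthetic (blended) data is computed from the
# grants on its component elements, so a revocation cascades to every blend.

from dataclasses import dataclass, field

@dataclass
class AccessRegistry:
    grants: dict = field(default_factory=dict)    # element -> set of users granted access
    lineage: dict = field(default_factory=dict)   # blended dataset -> atomic elements it uses

    def grant(self, user, element):
        self.grants.setdefault(element, set()).add(user)

    def revoke(self, user, element):
        self.grants.get(element, set()).discard(user)

    def can_access(self, user, dataset):
        # Access to a blend requires access to every one of its components.
        components = self.lineage.get(dataset, {dataset})
        return all(user in self.grants.get(c, set()) for c in components)

registry = AccessRegistry()
registry.lineage["customer_360"] = {"ssn", "first_name", "last_name"}
for element in ("ssn", "first_name", "last_name"):
    registry.grant("analyst_a", element)

print(registry.can_access("analyst_a", "customer_360"))  # True
registry.revoke("analyst_a", "ssn")                       # revoke one atomic component...
print(registry.can_access("analyst_a", "customer_360"))  # ...and access to the blend is gone
```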
Possible Solutions
The fundamental problem faced by
large organizations is that distributed
data in variant forms is difficult for an
enterprise to exploit. This macro problem
is often referred to as the enterprise data
integration challenge. The fundamental
challenge stems from two basic issues:
1. Knowing what data is where (finding);
2. Knowing what information can be
derived from the data (blending). As with
most things, these issues can be solved
with people, processes, technology, or
(more likely) some combination. There are
a multitude of approaches within each of
these three dimensions.
The reader is cautioned that the usage of
the simple terms finding and blending
is only meant to make the concept more
consumable. Neither of these notions is
trivial when it comes to implementation;
each requires critical tradeoff
considerations along the triad of people,
processes, and technology. For example,
when blending there are many approaches
that can be adopted to help data make
its life cycle progression into actionable
insights as part of the KVC. While in some
cases people serve as the primary blender
(such as with statistical analysis), in other
cases the machine acts as the blender (such as in
machine learning).
It may not be surprising to learn that the
solution choice is often driven by the
constituency driving the need. Some
companies first face this problem with BI
(often when doing enterprise reporting)
and solve the problem with one method,
while some address this problem first with
advanced analytics (finding patterns that
cross organizational boundaries), resulting
in a different choice. Other first drivers
include transactional personalization
(targeting offers based on demand
moments), omni-channel (the unified
singular transaction experience), and so on.
The bottom line is that there is no single
solution as of this writing that fits every
situation perfectly.
To solve this fundamental challenge of
enterprise data integration, a number of
approaches have emerged over the past
few years, each addressing the problem
based on a particular need (often driven
by the project’s funding source). Below is
a brief summary of each method along
with key points associated with the
approach to help in decision-making. The
reader is reminded that the vocabulary
used in defining each basic solution type
is not presented as the definitive name for
an approach; rather, it serves as a
common pattern to help convey the
decision-making criteria that this paper is
highlighting.
[Figure 1: Four key knowledge value chain approaches: data foraging, data consolidation, data virtualization, and information fabric]
Data Foraging
This approach to enterprise data
integration relies on each consumer of data
to source and pull data on his/her own.
Following this approach places all finding
and blending burdens on the consumer
(be it a digital system or a human being)
along with all data movement and
orchestration activities. It is not uncommon
for this approach to be aided with some
form of online accessible data dictionary
or even a data encyclopedia to aid the
data hunters in their quest to find data
that meets a given need. Many exploratory
efforts begin in this model but eventually
morph into, or become inclusive of, one
of the other solution patterns defined. In
this approach the entirety of the KVC rests
with the person (or system) doing the
data foraging. As the variety of discrete
informational entities under study grows,
this solution becomes untenable. The
wider the range of different elements that
are to be considered, the more difficult this
solution becomes.
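A minimal sketch of the foraging pattern follows, assuming a simple online data dictionary and consumer-supplied fetch logic. The dictionary entries, dataset names, and functions are hypothetical; the point is only to show that both the finding and the blending burdens stay with the consumer.

```python
# Hypothetical sketch: the consumer searches a data dictionary (finding) and
# then pulls and stitches the data on his or her own (blending).

DATA_DICTIONARY = [
    {"dataset": "orders_2014", "owner": "sales ops", "keywords": {"orders", "revenue"}},
    {"dataset": "customer_master", "owner": "crm team", "keywords": {"customer", "segment"}},
]

def find_datasets(keyword):
    """Finding burden: scan the dictionary for candidate datasets."""
    return [entry for entry in DATA_DICTIONARY if keyword in entry["keywords"]]

def forage(keywords, fetch):
    """Blending burden: pull each candidate and stitch the results by hand."""
    blended = {}
    for keyword in keywords:
        for entry in find_datasets(keyword):
            blended[entry["dataset"]] = fetch(entry["dataset"])
    return blended

# The consumer supplies the per-source fetch logic; nothing is centralized,
# which is why the approach degrades as the variety of elements grows.
rows = forage({"orders", "customer"}, fetch=lambda name: f"<rows pulled from {name}>")
print(rows)
```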
Data Consolidation
This approach to enterprise data
integration creates consolidated pools
of data, separate from the originating
data stores. These amalgams of data
create a pool teeming with new
blend possibilities. These consolidated
repositories of data are characterized by
data movements including synchronization
actions and duplicated data (e.g. data
exists in the source data store and in the
consolidated store). There is a possibility
of inconsistent data with this approach
as the source system must first change
and then this change cascades to the
consolidated pool with the time interval
being an important decision driver and
engineering concern. By placing all the
data together, the finding problem is
made easier and the processing paradigm
of moving algorithms to the data can
be exploited; a great value add with the
explosion of Big Data paradigm processing
technologies. Normally this approach is
implemented as a physical store, but it
could be implemented purely in memory
as well. Depending upon which capabilities
are expected from the data consolidation
approach, the label applied to it varies
from augmented data warehouse to data
lakes, data pools, or data marts. In this
approach, the wider the range of source
systems to be considered, the more
difficult this solution becomes, as large
troves of data are constantly shuffling
around and keeping track of which systems are
to be pulled and when (i.e. orchestration and
synchronization) becomes a nightmare to
manage.
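The sketch below illustrates the consolidation pattern at its simplest, assuming a periodic job that copies changed rows from source stores into a separate pool; the store names, records, and watermark scheme are hypothetical. It shows why data is duplicated and why the pool can be stale between synchronization runs.

```python
# Hypothetical sketch: periodic synchronization from source stores into a
# separate consolidated pool, with a watermark marking what has been copied.

sources = {
    "orders_db": [{"id": 1, "updated_at": 100, "total": 40.0}],
    "crm_db":    [{"id": 7, "updated_at": 120, "segment": "gold"}],
}
consolidated_pool = []      # duplicated copies of source rows live here
last_sync_watermark = 0     # rows updated after this point still need copying

def sync_once(now):
    """Copy rows changed since the last watermark into the consolidated pool."""
    global last_sync_watermark
    for source_name, rows in sources.items():
        for row in rows:
            if row["updated_at"] > last_sync_watermark:
                consolidated_pool.append({"source": source_name, **row})
    last_sync_watermark = now

sync_once(now=200)
# A source row changes after the sync: the pool is stale (inconsistent) until
# the next scheduled run, which is the key trade-off of this approach.
sources["orders_db"].append({"id": 2, "updated_at": 250, "total": 15.5})
print(len(consolidated_pool))   # 2 rows copied so far
sync_once(now=300)
print(len(consolidated_pool))   # 3, once the new order is picked up
```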
Data Virtualization
This approach to enterprise data
integration leaves all data in their source
data stores, and it simply provides a proxy
that takes in-bound requests for data and
routes them on behalf of the caller to the
source system. This creates a rich pool
without moving data until it is actually
requested, creating consistency guarantees
(see exception that follows). This approach
introduces some data retrieval latency due to
the extra routing, which is often ameliorated
to some degree with some form of memory-
based caching, which in turn brings forth a set of
tradeoffs. In the cache implementation
case, the possibility of inconsistent data
(e.g. different values in the source system
and the virtualized view) arises, but the
localized cache has
the net effect of faster overall data retrieval.
For example, if a source system is located
half-way around the world but a virtualized
data store is local and implemented with
a cache, it will be quicker to get the data.
Depending upon what is expected of this
approach (transactional abilities, canonical
forms of data, read and/or write, etc.), it
can be labeled as data virtualization or
data federation. However, the more the
consumer base grows and the needs vary,
the more difficult it is to meet all needs
consistently under varying loads.
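As a rough illustration of the routing-plus-cache idea described above, the sketch below stands in for remote sources with in-memory callables and adds an optional time-to-live cache. The class, source names, and TTL value are hypothetical; it is one simple way such a proxy could be structured, not a reference implementation.

```python
# Hypothetical sketch: a virtualization proxy routes queries to source systems
# on demand, with an optional TTL cache trading freshness for latency.

import time

class VirtualizationProxy:
    def __init__(self, fetchers, cache_ttl_seconds=None):
        self.fetchers = fetchers           # source name -> callable returning rows
        self.cache_ttl = cache_ttl_seconds
        self._cache = {}                   # source name -> (fetched_at, rows)

    def query(self, source):
        """Return rows for a source, serving from the cache while still fresh."""
        if self.cache_ttl is not None and source in self._cache:
            fetched_at, rows = self._cache[source]
            if time.time() - fetched_at < self.cache_ttl:
                return rows                # possibly stale, but avoids the round trip
        rows = self.fetchers[source]()     # data stays at the source until requested
        self._cache[source] = (time.time(), rows)
        return rows

proxy = VirtualizationProxy(
    fetchers={
        "orders": lambda: [{"order_id": 1, "total": 40.0}],
        "crm":    lambda: [{"customer_id": 7, "segment": "gold"}],
    },
    cache_ttl_seconds=60,
)
print(proxy.query("orders"))   # fetched from the source system
print(proxy.query("orders"))   # served from the local cache within the TTL
```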
Information Layer
This approach to enterprise data
integration not only unifies data stores, but
also provides the capability to integrate
operational systems. Many operational
systems in the company have transient
data that is of value to the enterprise, but
the data is ephemeral (e.g. an intermediate
processing step). The information fabric
approach adds the capability to refine data
such that it is available to the enterprise
and its systems as information. It is a
software solution that enables applications
to access both raw and integrated data
from multiple, heterogeneous, and
distributed data sources and systems while
hiding the complexity of the disparate
data sources [3]. Instead of moving data
or creating new stores of integrated
data, an information layer creates a loose
federation of multiple existing data sources
and provides a single, virtual data source
through which people, applications, and
systems can access data [3]. In other words,
an information layer creates a “data service”
or “data veneer” that allows applications
and end users to treat a broad variety of
multiple data sources as if they were one
large single source of information [3]. As
such all capabilities of access rights, lineage
tracking, provisioning, etc. are included.
While the tactical implementation
approach to information layer can vary, it
is most common for the implementation
to be based on some form of data
virtualization. According to the technology
research company Forrester, an information
layer also comprises, among other
capabilities, the ability to do transactional processing
and conduct roll-backs on information
changes [1]. However, as the service
level expectations and variations grow,
this solution becomes complex as varied
implementation approaches are required.
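To show how the “data veneer” differs from plain virtualization, the sketch below adds the access-rights and lineage-tracking concerns mentioned above to a single facade over several sources. The class, source names, and entitlement model are hypothetical, and real information fabrics would add canonical forms, transactions, and write-back, which are omitted here.

```python
# Hypothetical sketch: one entry point over several sources that also records
# lineage and enforces access rights on the blended view it returns.

class InformationLayer:
    def __init__(self, sources, entitlements):
        self.sources = sources            # logical name -> callable returning rows
        self.entitlements = entitlements  # user -> set of logical names allowed
        self.lineage_log = []             # audit trail of what fed each answer

    def get(self, user, names):
        """Serve a blended view over several sources as if they were one."""
        allowed = self.entitlements.get(user, set())
        denied = [name for name in names if name not in allowed]
        if denied:
            raise PermissionError(f"{user} lacks access to: {denied}")
        blended = {name: self.sources[name]() for name in names}
        self.lineage_log.append({"user": user, "sources": list(names)})
        return blended

layer = InformationLayer(
    sources={
        "orders":    lambda: [{"order_id": 1, "total": 40.0}],
        "customers": lambda: [{"customer_id": 7, "segment": "gold"}],
    },
    entitlements={"analyst_a": {"orders", "customers"}, "analyst_b": {"orders"}},
)
print(layer.get("analyst_a", ["orders", "customers"]))  # unified view, lineage recorded
# layer.get("analyst_b", ["customers"]) would raise PermissionError
```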
Knowledge Value Chain
Decision Making Framework
The four basic solution models for KVC
implementation discussed in this paper
provide a different set of trade-offs that
must be considered when picking an
approach. In practice, we find that most
enterprises have two (or more) of the four
solutions as driven by different business
and operational requirements, resulting
in a hybrid solution to the KVC challenge.
However, the question that still remains
unanswered is: which particular solution
is the “best solution” for a particular “data
everywhere” challenge?
The KVC approach decision-making
framework this paper proposes identifies
three dimensions that must be considered:
constituency, containment, and
composition.
The primary informants of “best fit” are
driven by the study of expectations of the
KVC as represented by the following three Cs:
Constituency: The user population of the
KVC must be defined. This could be people,
applications, and/or other systems. Who is
the constituency of the value chain?
Containment: The universe of data or
information that is considered to be a
part of the KVC must be defined. What is
contained by, or within, the value chain?
Composition: The approach (people,
processes, and/or technology) for each
knowledge value link progression must be
discerned. How is each link composed?
The KVC decision framework takes each
of these dimensions and explores them at
three different levels. First, at a macro level,
we study the variety of expectations of the
KVC itself in a given enterprise. Second, we
begin decomposing these expectations of
the KVC and its individual links. Third, we
ask a collection of specific questions that
will inform our decision-making flowchart.
While not required, a “current situation”
analysis can play a part in selecting the
future strategy based on existing (and
already paid for) capabilities. To this end,
a KVC’s subsystem capabilities can be
discerned from the Figure 3 graphic, where
each individual capability required for the KVC
is articulated.
Knowledge Value Chain: Variety
At the highest level, the variety of sources
and usages that the KVC must service
drives the “best approach” decision-making
framework. The variety of information (i.e.
the scope) to be blended is compared with
the variety of usage (i.e. the constituency
of users, be it people or systems).
Figure 2 shows the outcomes we’ve found
at the macro level.
[Figure 2: Knowledge value chain variety: the recommended approach (data foraging, data consolidation, data virtualization, or information fabric) plotted against information scope and variety of consumers]
Knowledge Value Chain: Expectations
The constituency, containment, and
composition dimensions must be explored
for their individual expectations. To this
end, the following questions must be
answered:
•	 How durable do we want the results of
the KVC to be? Some progressions of
the KVC are ephemeral while others are
intended to provide long-term results
that inform the enterprise. Discerning
typical usage is important and a
reflection of the constituency.
•	 How self-contained and self-consistent
are the data (or informational) elements
that will be included in the KVC? This
drives data integration, data cohesion,
and data harmonization complexity
concerns.
•	 How much shall technology address the
KVC? This helps identify options that can
be used in the composition of the value
chain.
Knowledge Value Chain: Details and
Specifics
The third level of analysis along the three
dimensions manifests the following
specific considerations that an enterprise
would have to look at.
Cost: The cost incurred for building and
maintaining the solution. This would
also include cost incurred for procuring
infrastructure, continued licenses and
maintenance contracts. The solution needs to be
identified based on the budgets available for
resources (skills, infrastructure, and
continued support).
Timeliness: Time taken for consolidation,
cleansing, processing, augmenting,
persisting and accessing data for business
operations is one of the most critical
parameters or dimensions for the decision
matrix framework. Most solution options
depend on the timeliness of data for
business decisions. Ease, frequency of data
access, and quick response to business
scenarios are critical for success of most
business operations. Timeliness and
frequency also depend on how often the
client system pulls data. For BI systems the
data would be accessed very frequently
(data size would not be big), whereas for
analytics data access is not frequent but
the data size could be big.
Ownership: Not all data sets are available
for persistent storage. Some data
source owners limit the duplication of
data in other storage systems due to
regulatory, legal, or contractual reasons.
The ownership of the data is solely limited
to the original stewards and owners of the
data.
Impact: Load placed on the underlying database
can slow down other business systems
that use the same database. For
example, frequent access to the order
book could slow down the transaction
processing system.
Throughput rates: Throughput rates
required by the workload need to be
considered carefully.
Stability: Stability of data requirements
from the consumer workloads needs to
be checked. For example, in the case of BI
reports the nature and composition of the
report can be modeled easily, but for some
analytics workloads the requirements are
not defined and are more exploratory in
nature.
Quality: Quality of data is another critical
decision parameter for the framework. For
example, self-service BI will need good
quality data and thorough cleansing and
augmentation before business operations
can explore, analyze and visualize the data.
Some analytics models, however,
can be built on noisy data to identify
patterns of inconsistency.
Volume: Volume of data sources used,
information processed, and the need
for storing the processed information is
yet another concern. If the time taken to
process the information is high, we might
want to avoid repeating the processing,
even if it is easy to do.
Some other areas to be considered in
this level of depth include configuration,
technology standards in the enterprise,
technology adaptability, resource
constraints (including the knowledge worker
pool), and data availability.
Knowledge Value Chain: Subsystem
Capability Requirements
It is instructive and informative to view the
requisite capabilities and potential tie-ins
between the solution options to help form
the final choice. This enables two types of
analyses. First, if an inventory of existing
capabilities is being done, the decision
framework can be used to identify what is
possible within a given scenario. Second, if
a particular solution model is picked, then
the sourcing and implementation patterns
can ensure the required capabilities
are present. The functions also help us
identify and evaluate the right tools and
technology vendors for the enterprise.
[Figure 3: Knowledge value chain decision framework]
The functions can be distinctly categorized
into the following five groups based on the
implemented approach:
Data Sources: This function includes
all data sets that are required by the
enterprise to create information or
meaningful decisions. This could include
relational databases, third party market
analyst data or data augmented from
external sources, data on the cloud, manual
data or Excel sheets created by business
users, data streamed via other internal/
external sources, or data from social media
or web portals.
Data Consolidation: Data can be
consolidated with the help of tools and
technologies. The functions enabled during
consolidation lead to the creation of data
lakes, data marts, and/or data warehouses.
Data Virtualization: Data virtualization
provides an abstraction layer for the source
data which can be directly accessed by the
BI and analytical applications.
Information Fabric: The information
layer or information fabric converts
data into information by leveraging
the functionalities of data quality
along with transactional management
capabilities of writing back data into the
data consolidation layer. Also integral
to this layer is the capability to provide
the information to the downstream
applications via data or information
services.
BI Applications and Analytics: BI
applications and analytics provide the
final insight to the enterprise data. Data
can be foraged from data sources by
manually massaging, integrating or
consolidating data for BI applications
and analytical capabilities. Advanced
analytical and BI capabilities can best be
derived by leveraging the data consolidation,
data virtualization, and information fabric
approaches.
Knowledge Value Chain Best Fit
Decision Flow
The enterprise can further envisage the
“best” solution with the help of a decision-
making flowchart by answering some basic
questions across the various informants
(three Cs) and dimensions as mentioned
in the earlier sections. The reader should
note that Figure 4 is a pragmatic flowchart
to identify the “best suited” solution
approach, but it nevertheless needs thorough
evaluation against your particular enterprise
environment and constraints. Also to be
considered are the tools and technologies
(which are ever evolving) available for the
chosen solution. In cases where a straightforward
single approach is not possible, we might
need a hybrid solution.
[Figure 4: Flowchart to identify the “best suited” solution approach in the knowledge value chain.
Across the constituency, containment, and composition dimensions, the flow asks: Do we have resources (people, process, and technology) allocated to this program? Do we have clarity in the business and operational requirements for BI/analytics? Do we have any regulatory/legal or contractual reasons to not persist the data? Do we need the BI/analytics responses on demand? Is performance a very critical factor for the BI and analytical applications? Does the data need aggregation, augmentation, cleansing, and/or other data quality activities? Do downstream applications need data/information access via services? Do we need transaction write-back? Is the data semi-structured or unstructured? Do we need to augment data from numerous social and web portals? Are users asking for data in an ad hoc manner, and is the data requirement more exploratory in nature? Is the data manually manageable? Is the data available in real time? Can the data be persisted in a separate storage? Are we constrained by time for coming up with BI/analytical decisions? Do we have restrictions from source owners to run queries and processes on their infrastructure? Do we need the data for advanced analytical models, simulations, and processing?
The answers lead to data foraging, data consolidation, data virtualization, information fabric, or a hybrid solution.]
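As a loose illustration only, the sketch below encodes a handful of these yes/no questions as rules pointing to a candidate approach. It is not a faithful encoding of Figure 4: the question keys, their ordering, and the mapping to outcomes are our own simplification of the flowchart.

```python
# Hypothetical sketch: a few of the flowchart's questions reduced to rules.
# A real selection would weigh all three Cs and the constraints discussed above.

def suggest_approach(answers):
    if not answers.get("resources_allocated", False):
        return "data foraging"         # no dedicated program: consumers fend for themselves
    if answers.get("cannot_persist_data", False):
        return "data virtualization"   # regulatory/contractual limits on copying data
    if answers.get("needs_services_or_write_back", False):
        return "information fabric"    # data services, write-back, transactional needs
    if answers.get("needs_cleansing_and_history", False):
        return "data consolidation"    # aggregation, cleansing, quality, persistence
    return "hybrid solution"           # no single approach dominates

print(suggest_approach({
    "resources_allocated": True,
    "needs_cleansing_and_history": True,
}))  # data consolidation
```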
Usage Scenarios
This section illustrates two possible usage
scenarios: advanced
analytics and self-service BI.
Advanced Analytics
In this scenario, we cover data needs for
data science teams that are engaged
in development of advanced analytics
models. Data science team members are
qualified power users who are frequently
engaged in all aspects of the data life cycle,
from defining the source for data to data
formats and data quality aspects.
[Figure 5: Knowledge value chain with advanced analytics human links.
Operational data, EDW/marts, and other data sources are integrated and managed; analytics data is accessed, transformed, and explored; models are developed, validated, and deployed; and insights are delivered to users and applications. Data scientists, data engineers, and data analysts drive the early links, with business analysts, executives, and business applications consuming insight delivery. Enterprise data may be supplied via any of the possible solutions (data foraging, data consolidation, data virtualization, information layer/fabric), with modeling (advanced analytics algorithms, service APIs for business applications) and insight delivery (advanced visualizations, interactive analysis) layered on top.]
The data intensive activities that a typical
data science team goes through while
building and deploying advanced analytics
models are listed below:
Exploratory Data Analytics is the most
data intensive phase that is iterative and
runs through the entire data science life
cycle. It starts at the finding phase where
data scientists work with analysts and
data engineers to explore all relevant
data available (to address the business
problem), identify patterns, formulate
hypothesis, and strategize the analytical
models that would be needed to deliver
the solution. The team accesses the data in
real time, requiring an interactive solution
for accessing large data sets.
Data Cleansing and Transformation is
closely knit with the exploratory analysis
phase where analysts and data engineers
blend the data to derive information,
formulate hypothesis, and develop
analytical models. As the requirement is
not real-time, this activity is carried out as
a batch process. However, the solution has
to provision clean and transformed data for
real-time (interactive) access.
Analytical Model Building is experimental
and iterative. In this phase analysts work
with data scientists to test out their initial
hypothesis. They start by working on a
representative data sample (and later
on the entire data cube created for the
analytical model) where the team develops
and evaluates the model, analyzes the
performance, fine-tunes the models and
finally selects the model(s) that should be
deployed in production.
Analytical Model Deployment is done
when the data scientists are satisfied with
the models developed in the experiments
that were carried out in the model
development phase. In this phase data
scientists work with data engineers to
embed and run the model on the entire
data set to carry out the analysis and
deliver results based on the type of the
business problem. Depending on the use
case, the model may need to be deployed
as a service and interact with applications
in real-time to enable data-driven
decisions. The “next best offer” is a good
example of this. Alternatively, the model
may be deployed as a batch program to
analyze data, such as sales forecast, in
offline mode.
Analytical Model Maintenance is the
process of maintaining and managing the
model life cycle, including reconstructing
sets of ephemeral data. This aids regulatory,
governance, legal, and contractual
requirements.
Insight Delivery - Visualization,
Dashboard and Reporting is the most
crucial part of the entire data science
life cycle. Articulating the findings and
insights that the analytical models have
unearthed in an easy to understand format
is necessary for delivering the business
value to human beings as well as to digital
systems. It is therefore important for the
solution to deliver insights to business and
business systems in a form where they can
access, interact and analyze the findings
and take data-driven business decisions.
While all data science teams would like to
have an infinitely scalable, comprehensive
data platform with data from all sources in
the enterprise at their disposal, the reality
is dictated by different factors such as
cost, time, effort, compliance issues, etc. In
reality, a combination of the methods listed
in this document are deployed by IT teams
to meet the needs of advanced analytics.
To decide which methods are required,
the following constraints and factors
need to be considered:
•	 The amount of variance between
informational elements drives the
program scope. The scope of the
analytics program (all business
functions, limited business functions,
and/or business informational domain
elements) is a driving factor in
determining the methods that need
to be adopted for addressing the “data
everywhere” problem.
•	 Potential benefits of specific advanced
analytics models determine the overall
spend that can be made in data, without
impacting the return on investments.
The more impactful models should have
correspondingly higher investments in data.
•	 Longevity of the advanced analytics
models determines the effort that needs
to be spent to ensure availability of
up-to-date and valid data across the
life cycle. For a long running model it
is important to have all relevant data
aggregated in one place, with data
quality and data lineage capabilities built
in. This would help with validations on
the efficacy of the model across different
time periods and also with tracing back
important decisions that were taken
during the life of the model.
•	 No matter how formal and rigorous the
planning process is, it is difficult to avoid
ad hoc requests for advanced analytics
models. This brings in an aspect of agility
that needs to be provided in sourcing
data with reasonable data quality.
•	 In some cases there is a short time to
market for the analytics model which
makes it even more important to adopt
agility and enable the data scientists
to source and provision data for the
analytics model.
Self-service BI
In this usage scenario, we discuss one
of the frequent scenarios for business
operations, managers and business users.
Business users prefer to explore and access
data themselves for BI and reporting.
They require functional capabilities that
could help them build reports, scorecards,
dashboards, and exploratory visualizations
using self-service tools and wizards.
Self-service BI, as the name suggests,
lets users drag and drop or create quick
visualizations, reports, and dashboards in
a limited period of time by using wizards,
so that the users can generate BI reports
themselves. Users work on the underlying
integrated data from heterogeneous
sources brought together using our four
solution approaches (data foraging, data
consolidation, data virtualization, and
information fabric).
Business users can leverage self-service BI
environments for their daily BI activities
along with business decisions and
operations activities. They could:
•	 Run a semantic (using natural language)
and quantitative search to discover data
available in the enterprise, and/or blend
enterprise data with data available in
the public domain, to make meaningful
insights. This discovery and exploration
could further lead to standard
operational BI reports that can be
presented to executives and managers
at set regular frequencies using job
schedulers.
•	 Request for purchase or loading of
data not found in the integrated data
environment. Data extracts and sets can
be requested and rendered with minimal
IT involvement. For data sets that are not
available to the enterprise, self-service BI
helps as an investigative and exploratory
mode of analysis to further enhance and
blend data.
•	 Run predefined and ad hoc queries on
the data sets. Build quick and custom
visualizations and analysis to resolve
immediate business problems and needs.
•	 Collaborate with other users to create
data mash-ups across multiple data
sources.
•	 Enable social BI features such as
community rating, collaborative
metadata enrichment, and more.
Besides functions such as collaboration,
integration, data mash-ups, data
visualizations along with data lineage,
metadata search, self-service BI also
provides an easy, simple and intuitive user
interface (UI) that can be made available
on a desktop as well as on smartphones,
tablets and other mobile devices. This
makes information available to business
users even when they are on-the-go,
thus making it an integrated BI platform
with ease of authoring, modeling, and
publishing to the end user without IT
support.
[Figure 6: Knowledge value chain with human BI links.
Information producers, consumers, and collaborators (analytics builders, analysts/modelers, and executives) discover and enhance, analyze and publish, and decide using operational data, EDW/marts, workgroup data, and other data sources, exposed through easy, collaborative, and mobile UIs (office support, advanced visualization, portal, search, analytical models, collaboration, usage tracking, mobile access) built on the possible enterprise data solutions (data foraging, data consolidation, data virtualization, information layer/fabric).]
Typical user groups that leverage the
information supply chain would include
data engineers (who are aware of the data
sources that supply the required data), BI
authors (who are capable of leveraging
the self-service intuitive UI and wizards to
publish reports, visualizations, dashboards
and scorecards for their managers) and
business executives (who leverage the
outputs for decision-making).
Conclusion
Data integration can be a significant effort
whether you are engaged in building new
data-intensive applications, adapting a
packaged application to a new context,
or are trying to create a single point of
access for your enterprise data [3]. More
companies are turning to a multitude
of solutions which can shorten data
integration projects and lower data
management and maintenance costs over
time [3]. The information fabric approach
can simplify data provisioning, access
and integration, thus shortening the data
integration time frame and enhancing
productivity by enabling developers
to spend their time concentrating on
developing actual application logic [3].
However, it requires a tremendously
skilled systems integration capability,
in-depth knowledge of data theory, and more. On the
other extreme, those who need integrated
data can be left to their own devices. In
the modern landscape, this is simply not
a long term option. Therefore, without
conducting in-depth situation analysis,
the data consolidation solution is the most
pragmatic. Many of the efforts required
in data consolidation activity can roll into
either data virtualization or the information
layer.
Citations
1. Noel Yuhanna and Mike Gilpin. “Information Fabric 3.0”. Forrester Research, August 8, 2013.
2. “Infosys Gradient: An EII Solution”. Infosys, 2011.
3. “Infosys Gradient: Enabling Enterprise Data Virtualization”. Infosys, April 2005.
About the Authors
Bill Peer is Principal Technology Architect
at Infosys Labs. He has over 20 years of IT
experience. His focus area is IT strategy
for business competitive advantage, with
hyper specializations in innovation, large
scale global enterprise architecture, and
emergent technology exploitation (such as
‘Big Data’ systems).
Kiran Kumar Kaipa is Senior Consultant
at Infosys Labs. He has over nine years
of experience in the IT industry where
he has worked in both consulting and
technical roles. His current focus area is
Big Data analytics with specializations
in data munging, data analysis and data
visualization.
Shyamala Sadananda is Senior Architect
at Infosys Labs. She has over 15 years of
professional experience in systematic
innovation, architecture, consulting, and
solution implementation. She primarily
anchors engagements with focus on
emerging technologies in the areas of data
virtualization, analytics, data visualizations
and business intelligence.
Swaminathan Natarajan is Principal
Product Architect at Infosys Labs. He has
over 17 years of experience in the software
industry and has worked in various
functions such as product engineering,
R&D and technology consulting. His
current focus area is Big Data analytics
with specific interest in data munging and
unstructured data analytics.
© 2015 Infosys Limited, Bangalore, India. All Rights Reserved. Infosys believes the information in this document is accurate as of its publication date; such information is subject to change without notice.
Infosys acknowledges the proprietary rights of other companies to the trademarks, product names and such other intellectual property rights mentioned in this document. Except as expressly permitted,
neither this documentation nor any part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, printing, photocopying, recording or
otherwise, without the prior permission of Infosys Limited and/ or any named intellectual property rights holders under this document.
About Infosys
Infosys is a global leader in consulting, technology, outsourcing and next-generation services. We enable clients, in more than 50 countries, to stay a
step ahead of emerging business trends and outperform the competition. We help them transform and thrive in a changing world by co-creating
breakthrough solutions that combine strategic insights and execution excellence.
Visit www.infosys.com to see how Infosys (NYSE: INFY), with US$8.25 B in annual revenues and 165,000+ employees, is helping enterprises renew
themselves while also creating new avenues to generate value.
For more information, contact askus@infosys.com www.infosys.com

Recommandé

Information economics and big data par
Information economics and big dataInformation economics and big data
Information economics and big dataMark Albala
336 vues6 diapositives
Data warehouse Vs Big Data par
Data warehouse Vs Big Data Data warehouse Vs Big Data
Data warehouse Vs Big Data Lisette ZOUNON
634 vues19 diapositives
DATA VIRTUALIZATION FOR DECISION MAKING IN BIG DATA par
DATA VIRTUALIZATION FOR DECISION MAKING IN BIG DATADATA VIRTUALIZATION FOR DECISION MAKING IN BIG DATA
DATA VIRTUALIZATION FOR DECISION MAKING IN BIG DATAijseajournal
34 vues9 diapositives
Data Warehouse Application Of Insurance Industry par
Data Warehouse Application Of Insurance IndustryData Warehouse Application Of Insurance Industry
Data Warehouse Application Of Insurance Industryinfoarup
3.1K vues8 diapositives
Tech Connect Live 30th May 2018 ,GDPR Summit Ken O'Connor par
Tech Connect Live 30th May 2018 ,GDPR Summit Ken O'ConnorTech Connect Live 30th May 2018 ,GDPR Summit Ken O'Connor
Tech Connect Live 30th May 2018 ,GDPR Summit Ken O'ConnorEvents2018
47 vues19 diapositives
DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET par
DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASETDATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASET
DATA MINING WITH CLUSTERING ON BIG DATA FOR SHOPPING MALL’S DATASETAM Publications
129 vues5 diapositives

Contenu connexe

Tendances

Semantic 'Radar' Steers Users to Insights in the Data Lake par
Semantic 'Radar' Steers Users to Insights in the Data LakeSemantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data LakeCognizant
2.9K vues12 diapositives
Whitebook on Big Data par
Whitebook on Big DataWhitebook on Big Data
Whitebook on Big DataViren Aul
701 vues66 diapositives
The Economic Value of Data: A New Revenue Stream for Global Custodians par
The Economic Value of Data: A New Revenue Stream for Global CustodiansThe Economic Value of Data: A New Revenue Stream for Global Custodians
The Economic Value of Data: A New Revenue Stream for Global CustodiansCognizant
6.5K vues8 diapositives
Trends 2011 and_beyond_business_intelligence par
Trends 2011 and_beyond_business_intelligenceTrends 2011 and_beyond_business_intelligence
Trends 2011 and_beyond_business_intelligencedivjeev
1.5K vues16 diapositives
2012 Data Acquisition Report par
2012 Data Acquisition Report 2012 Data Acquisition Report
2012 Data Acquisition Report Oceanos
405 vues7 diapositives
The Second Big Bang par
The Second Big BangThe Second Big Bang
The Second Big BangConnexica
85 vues8 diapositives

Tendances(19)

Semantic 'Radar' Steers Users to Insights in the Data Lake par Cognizant
Semantic 'Radar' Steers Users to Insights in the Data LakeSemantic 'Radar' Steers Users to Insights in the Data Lake
Semantic 'Radar' Steers Users to Insights in the Data Lake
Cognizant2.9K vues
Whitebook on Big Data par Viren Aul
Whitebook on Big DataWhitebook on Big Data
Whitebook on Big Data
Viren Aul701 vues
The Economic Value of Data: A New Revenue Stream for Global Custodians par Cognizant
The Economic Value of Data: A New Revenue Stream for Global CustodiansThe Economic Value of Data: A New Revenue Stream for Global Custodians
The Economic Value of Data: A New Revenue Stream for Global Custodians
Cognizant6.5K vues
Trends 2011 and_beyond_business_intelligence par divjeev
Trends 2011 and_beyond_business_intelligenceTrends 2011 and_beyond_business_intelligence
Trends 2011 and_beyond_business_intelligence
divjeev1.5K vues
2012 Data Acquisition Report par Oceanos
2012 Data Acquisition Report 2012 Data Acquisition Report
2012 Data Acquisition Report
Oceanos 405 vues
The Second Big Bang par Connexica
The Second Big BangThe Second Big Bang
The Second Big Bang
Connexica85 vues
Mastering data-modeling-for-master-data-domains par Chanukya Mekala
Mastering data-modeling-for-master-data-domainsMastering data-modeling-for-master-data-domains
Mastering data-modeling-for-master-data-domains
Chanukya Mekala315 vues
Value creation with big data analytics for enterprises: a survey par TELKOMNIKA JOURNAL
Value creation with big data analytics for enterprises: a surveyValue creation with big data analytics for enterprises: a survey
Value creation with big data analytics for enterprises: a survey
Supply Chain Finance and Artificial Intelligence - a game changing relationsh... par Igor Zax (Zaks)
Supply Chain Finance and Artificial Intelligence - a game changing relationsh...Supply Chain Finance and Artificial Intelligence - a game changing relationsh...
Supply Chain Finance and Artificial Intelligence - a game changing relationsh...
Reference data management in financial services industry par NIIT Technologies
Reference data management in financial services industryReference data management in financial services industry
Reference data management in financial services industry
NIIT Technologies1.9K vues
Monetizing data - An Evening with Eight of Chicago's Data Product Management... par Randy Horton
Monetizing data  - An Evening with Eight of Chicago's Data Product Management...Monetizing data  - An Evening with Eight of Chicago's Data Product Management...
Monetizing data - An Evening with Eight of Chicago's Data Product Management...
Randy Horton2.4K vues
Business Intelligence for Consumer Products: Actionable Insights for Business... par FindWhitePapers
Business Intelligence for Consumer Products: Actionable Insights for Business...Business Intelligence for Consumer Products: Actionable Insights for Business...
Business Intelligence for Consumer Products: Actionable Insights for Business...
FindWhitePapers921 vues
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv... par Denodo
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...
Accelerating Data-Driven Enterprise Transformation in Banking, Financial Serv...
Denodo 263 vues
DAS Slides: Metadata Management From Technical Architecture & Business Techni... par DATAVERSITY
DAS Slides: Metadata Management From Technical Architecture & Business Techni...DAS Slides: Metadata Management From Technical Architecture & Business Techni...
DAS Slides: Metadata Management From Technical Architecture & Business Techni...
DATAVERSITY2.3K vues

En vedette

жаби par
жаби жаби
жаби labinskiir-33
3.4K vues28 diapositives
Las esculturas par
Las esculturasLas esculturas
Las esculturasAlexa_Rusher
101 vues7 diapositives
Tics par
TicsTics
TicsRominix05
193 vues16 diapositives
Religion par
ReligionReligion
Religionduendesitaprincess
381 vues11 diapositives
Joomla par
JoomlaJoomla
JoomlaCOMET
135 vues6 diapositives
Reparalia en los medios: Septiembre 2013 par
Reparalia en los medios: Septiembre 2013Reparalia en los medios: Septiembre 2013
Reparalia en los medios: Septiembre 2013Beatriz A
795 vues21 diapositives

En vedette(20)

Joomla par COMET
JoomlaJoomla
Joomla
COMET135 vues
Reparalia en los medios: Septiembre 2013 par Beatriz A
Reparalia en los medios: Septiembre 2013Reparalia en los medios: Septiembre 2013
Reparalia en los medios: Septiembre 2013
Beatriz A795 vues
MecáNica De Fluidos Viscosidad DináMica par Diego Guanga
MecáNica De Fluidos Viscosidad DináMicaMecáNica De Fluidos Viscosidad DináMica
MecáNica De Fluidos Viscosidad DináMica
Diego Guanga921 vues
Marco 2014 iib90_overview_port par Juan Garay
Marco 2014 iib90_overview_portMarco 2014 iib90_overview_port
Marco 2014 iib90_overview_port
Juan Garay435 vues
DN13_U3_A6_HOEA par Erock300
DN13_U3_A6_HOEADN13_U3_A6_HOEA
DN13_U3_A6_HOEA
Erock300200 vues
12 x1 t02 01 differentiating exponentials (2013) par Nigel Simmons
12 x1 t02 01 differentiating exponentials (2013)12 x1 t02 01 differentiating exponentials (2013)
the "best fitting" answer for a particular scenario and need. By "best fitting" we mean a solution that is more effective at realizing the desired outcome under one set of conditions than another. This paper explores these conditions to help our clients make the best choice.

This paper is organized as follows. First, a baseline of terms and concepts is established such that the reader and authors can carry forward the dialog. Second, the "data everywhere" problem is briefly discussed, noting the different critical dimensions that must be considered. This is followed by a discussion of two common scenarios: advanced analytics and self-service business intelligence (BI). The paper finally presents a decision-making framework that will help enterprises identify which approach best fits their situation.
Terms and Concepts
The terms and concepts covering the domain of enterprise information exploitation are constantly evolving. There is a multitude of definitions from academia, consultants, vendors, and historical precedent. This paper defines a set of working terms, definitions, and concepts that contain and constrain the knowledge being presented. The intent is not to usurp working definitions that may be in use by any one group, but to make conveying the ideas possible. If these definitions differ from the reader's own, the authors ask the reader to grant an exception for the duration of this paper.

Basic Terms
• Data is a set of values. The numbers 5, 15, 32, 38, and 41 are data.
• Knowledge is the awareness of patterns of information. If the information "Lottery Numbers 5, 15, 32, 38, and 41" is known to appear every three weeks, then there is knowledge.
• Business intelligence (BI) is the awareness of business operational information.
• Analytics is the art and science of discerning patterns in data and information.
• Knowledge value chain (KVC) is the set of processes or activities by which a company refines data such that it becomes knowledge. Each step in the process is known as a knowledge value link.

Basic Concept
An enterprise is laden with data and information. By applying analytics, one can create knowledge from these assets by finding patterns of interest. This knowledge can be showcased through reports and dashboards, or as inter-system actions such as ordering more stock for a particular geographic location. Knowledge creation and knowledge sharing require people, processes, and technology. The entirety of this knowledge creation and sharing is collectively known as the knowledge value chain (KVC). Different approaches to realizing different parts of the KVC give rise to the different implementations seen in the field.
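To make the chain of knowledge value links concrete, the minimal Python sketch below models a KVC as a sequence of link functions that progressively refine raw data into information and then into knowledge. The link names and the lottery example are illustrative assumptions only and are not part of any product or method described in this paper.

    # Minimal, illustrative sketch of a knowledge value chain (KVC):
    # each link refines the output of the previous one. Names are hypothetical.
    from collections import Counter

    def collect_data(draws):
        """Raw data: lists of drawn numbers, one list per weekly draw."""
        return draws

    def add_context(draws):
        """Information: data with context (which draw, which numbers)."""
        return [{"week": i + 1, "numbers": tuple(sorted(d))} for i, d in enumerate(draws)]

    def find_patterns(information):
        """Knowledge: awareness of patterns, e.g. a combination that recurs."""
        counts = Counter(item["numbers"] for item in information)
        return {combo: n for combo, n in counts.items() if n > 1}

    # Compose the links into a chain and run it end to end.
    kvc = [collect_data, add_context, find_patterns]
    raw = [[5, 15, 32, 38, 41], [7, 9, 11, 20, 33], [5, 15, 32, 38, 41]]
    result = raw
    for link in kvc:
        result = link(result)
    print(result)   # {(5, 15, 32, 38, 41): 2} -> a recurring pattern, i.e. knowledge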
The Problem
In today's fast-moving world, business decision-makers, customers, and trading partners need to act on rapidly changing, difficult-to-harmonize information in near real-time to generate value as part of the KVC [3, 2]. The need for near real-time response, combined with the sheer number and complexity of the data sources, creates a new set of data access and integration challenges across the enterprise [3, 2]. Enterprise data assets are typified by different technologies in underlying data sources and by ownership issues with line-of-business data [3, 2].

Information in an enterprise can be scattered across systems, technologies, and geographies in various formats. There are numerous reasons for such distributed and heterogeneous information, such as varying project funding models, organizational structure constraints, mergers and acquisitions, partner integration requirements, regulatory constraints, legal restrictions, and more [3]. Regardless of the reason, blending such widely distributed data is difficult to accomplish.

Even if one overcomes this challenge, who is to say that the data being blended has the same format? Typically, the information required for blending is present in heterogeneous formats, making consolidated access to this information difficult [3]. For example, many companies today have a variety of data marts and warehouses for data consolidation activities, as well as operational data stores for transactional purposes [3]. While it may be valuable to blend the information in these separate systems for some analytical purpose, the information contained within may be in different formats, each optimized for its particular need (e.g., business reporting on one hand and shipment tracking on the other).

Even if one overcomes this challenge, who is to say that the uniformly formatted information has the same semantic meaning? It is common for organizations to have multiple stores with data labels that are identical but whose semantic meaning is not. For example, an accounting department and a sales department may each have a stored piece of information called "Sales." In the accounting world, this is specifically net sales and refers to the company's revenue earned from the sale of a product or service, while in the sales world it refers to a specific transaction where money is exchanged for ownership of a good or service. If these subtle differences are not recognized, blending these commonly labeled informational elements will result in polluted synthetic information.

Even if one overcomes this challenge, who is to say that the harmonized information is on compatible technology platforms that allow interchange? Any company that has had digital systems for more than a year is guaranteed to have different platforms of information technology underpinning its landscape. These varied platforms have varied protocols, varied requirements on information interchange, and varied performance capabilities. These variances introduce a swath of engineering challenges. Culling a unified blend of information in the face of the aforementioned challenges introduces significant latency: not only must the capability variances in underlying technologies be accommodated, but information must be harmonized, varying formats need to be resolved, and physical spread must be traversed.
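As a hedged illustration of the semantic pitfall described above, the Python sketch below shows how naively blending two fields that are both labeled "Sales" (net revenue in accounting versus individual transactions in the sales system) produces a polluted total, and how an explicit semantic mapping avoids it. The field names and figures are invented for illustration only.

    # Hypothetical illustration: identical labels, different semantics.
    accounting_records = [{"Sales": 120000.0}]                    # net sales (revenue) for the month
    sales_system_records = [{"Sales": 450.0}, {"Sales": 300.0}]   # individual transactions

    # Naive blend: summing everything labeled "Sales" mixes revenue with transactions.
    polluted_total = sum(r["Sales"] for r in accounting_records + sales_system_records)

    # Safer blend: map each source label to an explicit, semantically distinct name first.
    semantic_map = {
        "accounting": {"Sales": "net_sales"},
        "sales_system": {"Sales": "transaction_amount"},
    }

    def harmonize(source, records):
        mapping = semantic_map[source]
        return [{mapping[k]: v for k, v in r.items()} for r in records]

    blended = harmonize("accounting", accounting_records) + harmonize("sales_system", sales_system_records)
    net_sales = sum(r.get("net_sales", 0.0) for r in blended)
    transactions = sum(r.get("transaction_amount", 0.0) for r in blended)
    print(polluted_total, net_sales, transactions)  # 120750.0 vs. 120000.0 and 750.0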
Even if one overcomes these collective challenges, a slew of new challenges and requirements arises from the newly synthesized (blended) information. Dealing with synthesized information in the enterprise requires revisiting policies, access rights, data stewardship, and provisioning. An individual element of information may be fine to operate on, analyze, and report in isolation, such as a social security number, but the moment it is blended with a first and last name, the synthesized element becomes subject to personally identifiable information constraints. Such cases are particularly challenging to identify when dealing with technical data that may be subject to export control laws.

All the aforementioned challenges beget four broad categories of operational challenges to be addressed:

1. Information service level agreements: The right information needs to be delivered to the right system in a timely and consistent manner, irrespective of its source. Many applications require real-time data retrieval abilities [3, 2].

2. Data integration: Data of different heterogeneous types needs to be transformed, integrated, and aggregated to create intelligent information for applications, systems, and platforms [3, 2]. Sometimes this requires creating integrated views of data from different data sources [3]. Variance in business semantics from one business unit to another creates an additional challenge for data integration and interleaving.

3. Data stewardship and provisioning: Data architects and administrators must protect the integrity and security of data sources while making them available for consumption by applications across the enterprise [3, 2]. Added to these concerns is the growing list of regulatory compliance requirements that demand more careful auditing of enterprise information usage [3, 2].
4. Information access rights: Information security administrators must ensure that only the right people and the right systems have access to the right information, including dynamically created synthetic information. If a user's access to a particular atomic component of some synthesized data is revoked, the revocation must cascade to all other systems that leverage the blended data (see the sketch following this list).
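The sketch below illustrates, under assumed rules, the two concerns just described: a blended record inherits the most restrictive classification of its parts (for example, name plus social security number becomes personally identifiable information), and revoking access to one atomic component removes access to every synthesized product derived from it. The classification rules, product names, and field names are hypothetical.

    # Hypothetical sketch: classification and access revocation for blended data.
    PII_COMBINATIONS = [{"first_name", "last_name", "ssn"}]   # assumed policy, not a standard

    def classify(fields):
        """A blended record is PII if its fields cover any restricted combination."""
        return "PII" if any(combo <= set(fields) for combo in PII_COMBINATIONS) else "GENERAL"

    # Synthesized data products and the atomic fields each one blends.
    products = {
        "customer_360": {"first_name", "last_name", "ssn", "order_total"},
        "sales_summary": {"order_total", "region"},
    }
    access = {"analyst": {"customer_360", "sales_summary"}}

    def revoke_field(user, field):
        """Revoking one atomic field cascades to every product that blends it."""
        access[user] = {p for p in access[user] if field not in products[p]}

    print({p: classify(f) for p, f in products.items()})   # customer_360 -> PII
    revoke_field("analyst", "ssn")
    print(access["analyst"])   # {'sales_summary'} -- the cascade removed customer_360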
Possible Solutions
The fundamental problem faced by large organizations is that distributed data in variant forms is difficult for an enterprise to exploit. This macro problem is often referred to as the enterprise data integration challenge. It stems from two basic issues:

1. Knowing what data is where (finding);
2. Knowing what information can be derived from the data (blending).

As with most things, these issues can be solved with people, processes, technology, or (more likely) some combination of the three. There is a multitude of approaches within each of these dimensions. The reader is cautioned that the simple terms finding and blending are used only to make the concepts more consumable. Neither notion is trivial when it comes to implementation; each requires critical trade-off considerations along the triad of people, processes, and technology. For example, when blending, there are many approaches that can help data make its life cycle progression into actionable insights as part of the KVC. In some cases people serve as the primary blender (such as with statistical analysis), while in other cases the machine acts as the blender (such as in machine learning).

It may not be surprising to learn that the solution choice is often driven by the constituency driving the need. Some companies first face this problem with BI (often when doing enterprise reporting) and solve it with one method, while others address it first with advanced analytics (finding patterns that cross organizational boundaries), resulting in a different choice. Other first drivers include transactional personalization (targeting offers based on demand moments), omni-channel (the unified, singular transaction experience), and so on. The bottom line is that, as of this writing, there is no single solution that fits every situation perfectly.

To solve this fundamental challenge of enterprise data integration, a number of approaches have emerged over the past few years, each addressing the problem based on a particular need (often driven by the project's funding source). Below is a brief summary of each method, along with key points to help in decision-making. The reader is reminded that the vocabulary used in defining each basic solution type is not presented as the definitive name for an approach; rather, it names a common pattern and helps convey the decision-making criteria that this paper is highlighting.

Figure 1: Four key knowledge value chain approaches: data foraging, data consolidation, data virtualization, and information fabric.

Data Foraging
This approach to enterprise data integration relies on each consumer of data to source and pull data on his/her own. Following this approach places all finding and blending burdens on the consumer (be it a digital system or a human being), along with all data movement and orchestration activities. It is not uncommon for this approach to be aided by some form of online accessible data dictionary, or even a data encyclopedia, to help the data hunters find data that meets a given need. Many exploratory efforts begin in this model but eventually morph into, or become inclusive of, one of the other solution patterns defined here. In this approach, the entirety of the KVC rests with the person (or system) doing the data foraging. As the variety of discrete informational entities under study grows, this solution becomes untenable: the wider the range of elements to be considered, the more difficult the solution becomes.
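A hedged sketch of the data foraging pattern follows: the consumer locates and pulls each source directly (here a CSV export and a small SQLite table) and then does all of the blending work locally. The file names, table, and columns are assumptions for illustration, not references to any real system.

    # Illustrative data-foraging sketch: the consumer finds, pulls, and blends everything.
    import csv
    import sqlite3

    def pull_orders_from_csv(path="orders_export.csv"):         # assumed file
        with open(path, newline="") as f:
            return [(row["customer_id"], float(row["amount"])) for row in csv.DictReader(f)]

    def pull_customers_from_db(path="crm.db"):                   # assumed database
        with sqlite3.connect(path) as conn:
            return dict(conn.execute("SELECT customer_id, region FROM customers"))

    def blend(orders, customers):
        """All harmonization logic lives with the forager: totals per region."""
        totals = {}
        for customer_id, amount in orders:
            region = customers.get(customer_id, "UNKNOWN")
            totals[region] = totals.get(region, 0.0) + amount
        return totals

    if __name__ == "__main__":
        print(blend(pull_orders_from_csv(), pull_customers_from_db()))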
Data Consolidation
This approach to enterprise data integration creates consolidated pools of data, separate from the originating data stores. These amalgams create a pool of data teeming with new blend possibilities. Such consolidated repositories are characterized by data movement, including synchronization actions and duplicated data (i.e., data exists both in the source data store and in the consolidated store). There is a possibility of inconsistent data with this approach, since the source system changes first and the change then cascades to the consolidated pool; the length of that interval is an important decision driver and engineering concern. By placing all the data together, the finding problem is made easier and the processing paradigm of moving algorithms to the data can be exploited, which is a great value-add given the explosion of Big Data processing technologies. Normally this approach is implemented as a physical store, but it could be implemented purely in memory as well. Depending upon which capabilities are expected of the data consolidation approach, the label applied to it varies from augmented data warehouse to data lake, data pool, or data mart. In this approach, the wider the range of source systems to be considered, the more difficult the solution becomes: large troves of data are constantly shuffling around, and keeping track of which systems are to be pulled when (i.e., orchestration and synchronization) becomes a nightmare to manage.
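The sketch below shows, in a simplified and assumed form, the consolidation pattern: a periodic batch job copies rows from several source stores into one consolidated store, which makes finding easy but duplicates data and introduces a refresh interval during which the pool can lag its sources. The database and table names are hypothetical.

    # Illustrative consolidation sketch: copy source tables into one consolidated store.
    import sqlite3

    SOURCES = ["erp.db", "crm.db"]           # assumed source databases
    CONSOLIDATED = "data_pool.db"            # assumed consolidated store

    def refresh_pool():
        """Batch synchronization: truncate and reload the consolidated copy."""
        pool = sqlite3.connect(CONSOLIDATED)
        pool.execute("CREATE TABLE IF NOT EXISTS customers (customer_id TEXT, source TEXT, region TEXT)")
        pool.execute("DELETE FROM customers")    # full reload; incremental sync is also possible
        for src in SOURCES:
            with sqlite3.connect(src) as conn:
                rows = conn.execute("SELECT customer_id, region FROM customers").fetchall()
            pool.executemany(
                "INSERT INTO customers (customer_id, source, region) VALUES (?, ?, ?)",
                [(cid, src, region) for cid, region in rows],
            )
        pool.commit()
        pool.close()

    # Scheduled (e.g. nightly); between refreshes the pool may lag the sources.
    refresh_pool()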
Data Virtualization
This approach to enterprise data integration leaves all data in its source data stores and simply provides a proxy that takes inbound requests for data and routes them, on behalf of the caller, to the source system. This creates a rich pool without moving data until it is actually requested, which provides consistency guarantees (see the exception that follows). The approach introduces some data retrieval latency due to the extra routing, which is often ameliorated to some degree by a form of memory-based caching, itself bringing a set of trade-offs. In the cached implementation, the possibility of inconsistent data (i.e., different values in the source system and the virtualized layer) arises, but the localized cache has the net effect of faster overall data retrieval. For example, if a source system is located halfway around the world but the virtualized data store is local and implemented with a cache, it will be quicker to get the data. Depending upon what is expected of this approach (transactional abilities, canonical forms of data, read and/or write, etc.), it can be labeled data virtualization or data federation. However, the more the consumer base grows and the more its needs vary, the more difficult it is to meet all needs consistently under varying loads.
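A minimal sketch of the virtualization pattern described above, under assumed source adapters: a proxy routes each request to the owning source system, and an optional time-bounded cache trades freshness for latency. The adapters, TTL value, and query keys are hypothetical.

    # Illustrative data-virtualization proxy: data stays in the sources; the proxy routes
    # requests and (optionally) caches results, trading consistency for latency.
    import time

    class VirtualizedStore:
        def __init__(self, sources, cache_ttl_seconds=60):   # assumed 60 s TTL
            self.sources = sources           # e.g. {"orders": callable, "customers": callable}
            self.cache_ttl = cache_ttl_seconds
            self._cache = {}                 # key -> (timestamp, value)

        def query(self, source_name, key):
            cache_key = (source_name, key)
            hit = self._cache.get(cache_key)
            if hit and time.time() - hit[0] < self.cache_ttl:
                return hit[1]                # served from cache; may lag the source
            value = self.sources[source_name](key)    # route to the owning source system
            self._cache[cache_key] = (time.time(), value)
            return value

    # Hypothetical source adapter standing in for a remote system.
    def orders_source(order_id):
        return {"order_id": order_id, "amount": 99.0}

    store = VirtualizedStore({"orders": orders_source})
    print(store.query("orders", "A-1001"))   # first call routed to the source
    print(store.query("orders", "A-1001"))   # second call served from the local cache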
Information Fabric
This approach to enterprise data integration, also referred to as an information layer, not only unifies data stores but also provides the capability to integrate operational systems. Many operational systems in the company hold transient data that is of value to the enterprise but is ephemeral (e.g., an intermediate processing step). The information fabric approach adds the capability to refine data such that it is available to the enterprise and its systems as information. It is a software solution that enables applications to access both raw and integrated data from multiple, heterogeneous, and distributed data sources and systems while hiding the complexity of the disparate sources [3]. Instead of moving data or creating new stores of integrated data, an information layer creates a loose federation of multiple existing data sources and provides a single, virtual data source through which people, applications, and systems can access data [3]. In other words, an information layer creates a "data service" or "data veneer" that allows applications and end users to treat a broad variety of data sources as if they were one large, single source of information [3]. As such, capabilities for access rights, lineage tracking, provisioning, and the like are all included. While the tactical implementation of an information layer can vary, it is most commonly based on some form of data virtualization. According to the technology research company Forrester, an information fabric also comprises, among other aspects, the ability to do transactional processing and conduct roll-backs on information changes [1]. However, as the service level expectations and variations grow, this solution becomes complex, as varied implementation approaches are required.
Knowledge Value Chain Decision-Making Framework
The four basic solution models for KVC implementation discussed in this paper each present a different set of trade-offs that must be considered when picking an approach. In practice, we find that most enterprises have two (or more) of the four solutions in place, driven by different business and operational requirements, resulting in a hybrid solution to the KVC challenge. However, the question that still remains unanswered is: which particular solution is the "best solution" for a particular "data everywhere" challenge?

The KVC approach decision-making framework this paper proposes identifies three dimensions that must be considered: constituency, containment, and composition. The primary informants of "best fit" are driven by studying the expectations of the KVC as represented by these three Cs:

Constituency: The user population of the KVC must be defined. This could be people, applications, and/or other systems. Who is the constituency of the value chain?

Containment: The universe of data or information that is considered to be part of the KVC must be defined. What is contained by, or within, the value chain?

Composition: The approach (people, processes, and/or technology) for each knowledge value link progression must be discerned. How is each link composed?

The KVC decision framework takes each of these dimensions and explores them at three different levels. First, at a macro level, we study the variety of expectations of the KVC itself in a given enterprise. Second, we decompose these expectations of the KVC and its individual links. Third, we ask a collection of specific questions that inform our decision-making flowchart. While not required, a "current situation" analysis can play a part in selecting the future strategy based on existing (and already paid for) capabilities. To this end, a KVC's subsystem capabilities can be discerned from Figure 3, where each individual capability required for the KVC is articulated.

Knowledge Value Chain: Variety
At the highest level, the variety of sources and usages that the KVC must service drives the "best approach" decision. The variety of information to be blended (i.e., the scope) is compared with the variety of usage (i.e., the constituency of users, be they people or systems). Figure 2 shows the outcomes we have found at the macro level.

Figure 2: Knowledge value chain variety: the fit of data foraging, data consolidation, data virtualization, and information fabric as the information scope and the variety of consumers each range from low to high.

Knowledge Value Chain: Expectations
The constituency, containment, and composition dimensions must be explored for their individual expectations. To this end, the following questions must be answered:

• How durable do we want the results of the KVC to be? Some progressions of the KVC are ephemeral, while others are intended to provide long-term results that inform the enterprise. Discerning typical usage is important and is a reflection of the constituency.

• How self-contained and self-consistent are the data (or informational) elements that will be included in the KVC? This drives data integration, data cohesion, and data harmonization complexity concerns.

• How much shall technology address the KVC? This helps identify options that can be used in the composition of the value chain.

Knowledge Value Chain: Details and Specifics
The third level of analysis along the three dimensions manifests the following specific considerations that an enterprise would have to look at.

Cost: The cost incurred for building and maintaining the solution.
This also includes the cost incurred for procuring infrastructure, continuing licenses, and maintenance contracts. The solution needs to be identified based on the budgets available for resources (skills, infrastructure, and continued support).

Timeliness: The time taken for consolidating, cleansing, processing, augmenting, persisting, and accessing data for business operations is one of the most critical parameters of the decision framework. Most solution options depend on the timeliness of data for business decisions. Ease and frequency of data access, and quick response to business scenarios, are critical for the success of most business operations. Timeliness and frequency also depend on how often the client system pulls data. For BI systems the data would be accessed very frequently (and the data size would not be big), whereas for analytics data access is less frequent but the data size could be big.
Ownership: Not all data sets are available for persistent storage. Some data source owners limit the duplication of data in other storage systems for regulatory, legal, or contractual reasons; ownership of such data remains solely with its original stewards and owners.

Impact: Impact on the underlying database, and on other business systems using that database, can cause slowdowns. For example, frequent access to the order book could slow down the transaction processing system.

Throughput rates: The throughput rates required by the workload need to be considered carefully.

Stability: The stability of the data requirements coming from the consumer workloads needs to be checked. For example, the nature and composition of a BI report can be modeled easily, but for some analytics workloads the requirements are not defined and are more exploratory in nature.

Quality: The quality of data is another critical decision parameter for the framework. For example, self-service BI needs good quality data, with thorough cleansing and augmentation, before business operations can explore, analyze, and visualize it. Some analytics models, however, can be built on noisy data to identify patterns of inconsistency.

Volume: The volume of data sources used, of information processed, and the need to store the processed information is yet another concern. If the time taken to process the information is high, we might want to avoid repeating the data processing even when it is easy to do.

Some other areas to be considered at this level of depth include configuration, technology standards in the enterprise, technology adaptability, resource constraints (including the knowledge worker pool), and data availability.
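As one way to operationalize the criteria above, the hypothetical sketch below scores each candidate approach against a handful of the parameters just listed (timeliness, ownership restrictions, volume, quality needs). The weights and per-approach scores are illustrative assumptions, not values recommended by the paper.

    # Hypothetical scoring sketch for the decision criteria; weights and scores are illustrative.
    CRITERIA_WEIGHTS = {"timeliness": 0.3, "ownership": 0.3, "volume": 0.2, "quality": 0.2}

    # How well each approach handles each criterion on a 0-5 scale (assumed, for illustration).
    APPROACH_SCORES = {
        "data_foraging":       {"timeliness": 1, "ownership": 4, "volume": 1, "quality": 1},
        "data_consolidation":  {"timeliness": 3, "ownership": 1, "volume": 5, "quality": 4},
        "data_virtualization": {"timeliness": 4, "ownership": 5, "volume": 3, "quality": 2},
        "information_fabric":  {"timeliness": 4, "ownership": 4, "volume": 4, "quality": 5},
    }

    def rank_approaches(weights=CRITERIA_WEIGHTS, scores=APPROACH_SCORES):
        totals = {
            name: sum(weights[c] * s for c, s in per_criterion.items())
            for name, per_criterion in scores.items()
        }
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

    print(rank_approaches())   # ranked list; the weights should reflect enterprise priorities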
Knowledge Value Chain: Subsystem Capability Requirements
It is instructive and informative to view the requisite capabilities and the potential tie-ins between the solution options to help form the final choice. This enables two types of analyses. First, if an inventory of existing capabilities is being done, the decision framework can be used to identify what is possible within a given scenario. Second, if a particular solution model is picked, then the sourcing and implementation patterns can ensure the required capabilities are present. The functions also help identify and evaluate the right tools and technology vendors for the enterprise. They can be distinctly categorized into the following five groups, based on the implemented approach:

Data Sources: This function includes all data sets that are required by the enterprise to create information or meaningful decisions. This could include relational databases, third-party market analyst data or data augmented from external sources, data on the cloud, manual data or spreadsheets created by business users, data streamed via other internal or external sources, and data from social media or web portals.

Data Consolidation: Data can be consolidated with the help of tools and technologies. The functions enabled during consolidation lead to the creation of data lakes, data marts, and/or data warehouses.

Data Virtualization: Data virtualization provides an abstraction layer over the source data which can be accessed directly by BI and analytical applications.

Information Fabric: The information layer, or information fabric, converts data into information by leveraging data quality functionality along with the transactional capability of writing data back into the data consolidation layer. Also integral to this layer is the capability to provide information to downstream applications via data or information services.

BI Applications and Analytics: BI applications and analytics provide the final insight into the enterprise data. Data can be foraged from data sources by manually massaging, integrating, or consolidating data for BI applications and analytical capabilities. Advanced analytical and BI capabilities are best derived by leveraging the data consolidation, data virtualization, and information fabric approaches.

Figure 3: Knowledge value chain decision framework.

Knowledge Value Chain Best Fit Decision Flow
The enterprise can further envisage the "best" solution with the help of a decision-making flowchart, by answering some basic questions across the various informants (the three Cs) and dimensions mentioned in the earlier sections. The reader should note that Figure 4 is a pragmatic flowchart to identify the "best suited" solution approach, but it nevertheless needs thorough evaluation against your particular enterprise environment and constraints.
Also to be considered are the tools and technologies (which are ever evolving) available for the solution. In cases where a straightforward single approach is not possible, a hybrid solution may be needed.

Figure 4: Flowchart to identify the "best suited" solution approach (data foraging, data consolidation, data virtualization, information fabric, or a hybrid solution) across the constituency, containment, and composition dimensions. The decision flow works through questions such as: Do we have resources (people, process, and technology) allocated to this program? Do we have clarity in the business and operational requirements for BI/analytics? Are we constrained by time for coming up with BI/analytical decisions? Do we have any regulatory, legal, or contractual reasons not to persist the data? Can the data be persisted in separate storage? Do we have restrictions from source owners on running queries and processes on their infrastructure? Do we need the BI/analytics responses on demand? Is performance a very critical factor for the BI and analytical applications? Does the data need aggregation, augmentation, cleansing, and/or other data quality activities? Do downstream applications need data/information access via services? Do we need transaction write-back? Is the data semi-structured or unstructured? Do we need to augment data from numerous social and web portals? Are users asking for data in an ad hoc manner? Is the data requirement more exploratory in nature? Is the data manually manageable? Is the data available in real time? Do we need the data for advanced analytical models, simulations, and processing?
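To show how a few of the flowchart's questions might translate into an executable rule of thumb, the sketch below encodes a small subset of them. The branch order and the mapping to approaches are simplifying assumptions for illustration, not a faithful transcription of Figure 4.

    # Simplified, assumed encoding of a few Figure 4 questions; not the full flowchart.
    def suggest_approach(answers):
        """answers: dict of yes/no booleans keyed by question id."""
        if not answers.get("resources_allocated"):
            return "data_foraging"            # no dedicated program: consumers forage on their own
        if answers.get("cannot_persist_data") or answers.get("source_owner_restrictions"):
            # Regulatory/contractual limits on copying data push toward leaving data at the source.
            return "information_fabric" if answers.get("services_needed") else "data_virtualization"
        if answers.get("needs_cleansing") and answers.get("can_persist_data"):
            return "data_consolidation"       # heavy data quality work favors a consolidated store
        if answers.get("transaction_write_back") or answers.get("services_needed"):
            return "information_fabric"
        return "hybrid_solution"              # no single approach clearly fits

    example = {
        "resources_allocated": True,
        "cannot_persist_data": False,
        "source_owner_restrictions": False,
        "needs_cleansing": True,
        "can_persist_data": True,
    }
    print(suggest_approach(example))          # -> data_consolidation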
Usage Scenarios
This section illustrates two possible usage scenarios that leverage the advanced analytics and self-service BI approaches.

Advanced Analytics
In this scenario, we cover the data needs of data science teams that are engaged in the development of advanced analytics models. Data science team members are qualified power users who are frequently engaged in all aspects of the data life cycle, from defining the source for data to data formats and data quality aspects.

Figure 5: Knowledge value chain with advanced analytics human links, spanning the possible solutions for enterprise data (data foraging, data consolidation, data virtualization, information layer/fabric), modeling (advanced analytics algorithms and models, with service APIs to integrate models into business applications), and insight delivery (advanced visualizations, interactive analysis), with data engineers, data analysts, data scientists, business analysts, executives, and business applications participating along the chain.
The data-intensive activities that a typical data science team goes through while building and deploying advanced analytics models are listed below:

Exploratory Data Analytics is the most data-intensive phase; it is iterative and runs through the entire data science life cycle. It starts at the finding phase, where data scientists work with analysts and data engineers to explore all relevant data available (to address the business problem), identify patterns, formulate hypotheses, and strategize the analytical models that would be needed to deliver the solution. The team accesses the data in real time to create an interactive solution for exploring large data sets.

Data Cleansing and Transformation is closely knit with the exploratory analysis phase, where analysts and data engineers blend the data to derive information, formulate hypotheses, and develop analytical models. As the requirement is not real-time, this activity is carried out as a batch process. However, the solution has to provision clean and transformed data for real-time (interactive) access.

Analytical Model Building is experimental and iterative. In this phase, analysts work with data scientists to test their initial hypotheses. They start by working on a representative data sample (and later on the entire data cube created for the analytical model); the team develops and evaluates the model, analyzes its performance, fine-tunes it, and finally selects the model(s) to be deployed in production.

Analytical Model Deployment is done when the data scientists are satisfied with the models developed during the experiments carried out in the model development phase. In this phase, data scientists work with data engineers to embed and run the model on the entire data set, carry out the analysis, and deliver results based on the type of the business problem. Depending on the use case, the model may need to be deployed as a service and interact with applications in real time to enable data-driven decisions; the "next best offer" is a good example of this. Alternatively, the model may be deployed as a batch program to analyze data, such as a sales forecast, in offline mode.

Analytical Model Maintenance is the process of maintaining and managing the model life cycle, including reconstructing sets of ephemeral data. This aids regulatory, governance, legal, and contractual requirements.

Insight Delivery (visualization, dashboards, and reporting) is the most crucial part of the entire data science life cycle. Articulating the findings and insights that the analytical models have unearthed, in an easy-to-understand format, is necessary for delivering business value to human beings as well as to digital systems. It is therefore important for the solution to deliver insights to the business and to business systems in a form in which they can access, interact with, and analyze the findings and take data-driven business decisions.
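As an illustration of the deployment distinction above (real-time service versus offline batch), the sketch below wraps a trivial stand-in model both ways. The model rule, endpoint, and port are assumptions for illustration and do not describe any specific tooling used by the teams in this scenario.

    # Illustrative sketch: the same stand-in model deployed as a real-time service or a batch job.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def score(record):
        """Stand-in 'next best offer' model: a trivial rule, purely for illustration."""
        return "premium_offer" if record.get("lifetime_value", 0) > 1000 else "standard_offer"

    # Batch deployment: score a whole data set offline (e.g. a nightly run).
    def run_batch(records):
        return [{**r, "offer": score(r)} for r in records]

    # Real-time deployment: expose the model behind an HTTP endpoint (assumed path and port).
    class ScoreHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers["Content-Length"]))
            result = {"offer": score(json.loads(body))}
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(result).encode())

    if __name__ == "__main__":
        print(run_batch([{"customer_id": "C1", "lifetime_value": 1500}]))
        # HTTPServer(("localhost", 8080), ScoreHandler).serve_forever()  # real-time mode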
While all data science teams would like to have an infinitely scalable, comprehensive data platform with data from all sources in the enterprise at their disposal, the reality is dictated by factors such as cost, time, effort, and compliance issues. In practice, a combination of the methods listed in this document is deployed by IT teams to meet the needs of advanced analytics. To decide which methods are required, the following constraints and factors need to be considered:

• The amount of variance between informational elements drives the program scope. The scope of the analytics program (all business functions, limited business functions, and/or business informational domain elements) is a driving factor in determining the methods that need to be adopted for addressing the "data everywhere" problem.

• The potential benefits of specific advanced analytics models determine the overall spend that can be made on data without impacting the return on investment. The more impactful models should have correspondingly higher investments in data.

• The longevity of the advanced analytics models determines the effort that needs to be spent to ensure the availability of up-to-date and valid data across the life cycle. For a long-running model, it is important to have all relevant data aggregated in one place, with data quality and data lineage capabilities built in. This helps with validating the efficacy of the model across different time periods, and with tracing back important decisions that were taken during the life of the model.

• No matter how formal and rigorous the planning process is, it is difficult to avoid ad hoc requests for advanced analytics models. This brings in an aspect of agility that needs to be provided when sourcing data with reasonable data quality.

• In some cases there is a short time to market for the analytics model, which makes it even more important to adopt agility and enable the data scientists to source and provision data for the model themselves.

Self-service BI
In this usage scenario, we discuss one of the frequent scenarios for business operations, managers, and business users. Business users prefer to explore and access data themselves for BI and reporting. They require functional capabilities that help them build reports, scorecards, dashboards, and exploratory visualizations using self-service tools and wizards.
Self-service BI, as the name suggests, is used to drag-and-drop or quickly create visualizations, reports, and dashboards in a limited period of time using wizards, so that users can generate BI reports themselves. Users work with the underlying integrated data from heterogeneous sources brought together using our four solution approaches (data foraging, data consolidation, data virtualization, and information fabric). Business users can leverage self-service BI environments for their daily BI activities along with business decisions and operations activities. They could:

• Run a semantic (natural language) and quantitative search to discover data available in the enterprise, and/or blend enterprise data with data available in the public domain, to produce meaningful insights. This discovery and exploration could further lead to standard operational BI reports that are presented to executives and managers at set, regular frequencies using job schedulers.

• Request the purchase or loading of data not found in the integrated data environment. Data extracts and sets can be requested and rendered with minimal IT involvement. For data sets that are not yet available to the enterprise, self-service BI serves as an investigative and exploratory mode of analysis to further enhance and blend data.

• Run predefined and ad hoc queries on the data sets, and build quick, custom visualizations and analyses to resolve immediate business problems and needs.

• Collaborate with other users to create data mash-ups across multiple data sources.

• Enable social BI features such as community rating, collaborative metadata enrichment, and more.

Besides functions such as collaboration, integration, data mash-ups, and data visualizations, along with data lineage and metadata search, self-service BI also provides an easy, simple, and intuitive user interface (UI) that can be made available on a desktop as well as on smartphones, tablets, and other mobile devices. This makes information available to business users even when they are on the go, thus making it an integrated BI platform with ease of authoring, modeling, and publishing for the end user without IT support. Typical user groups that leverage the information supply chain include data engineers (who are aware of the data sources that supply the required data), BI authors (who are capable of leveraging the self-service intuitive UI and wizards to publish reports, visualizations, dashboards, and scorecards for their managers), and business executives (who leverage the outputs for decision-making).

Figure 6: Knowledge value chain with human BI links, spanning the possible solutions for enterprise data (data foraging, data consolidation, data virtualization, information layer/fabric), an easy UI (office support, advanced visualization, portal, search, analytical model), a collaborative UI (collaboration, usage tracking), and a mobile UI (access on mobile devices), with information producers, analytics builders, analysts/modelers, information consumers, information collaborators, and executives participating along the chain.
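As a small, hedged illustration of the self-service pattern, the sketch below lets a business user run an ad hoc filter-and-aggregate request against an already integrated data set without writing SQL. The request vocabulary and the sample rows are invented for illustration and do not correspond to any particular self-service tool.

    # Hypothetical self-service sketch: an ad hoc request over integrated data, no SQL required.
    INTEGRATED_DATA = [   # assumed output of one of the four integration approaches
        {"region": "EMEA", "product": "A", "net_sales": 1200.0},
        {"region": "EMEA", "product": "B", "net_sales": 800.0},
        {"region": "APAC", "product": "A", "net_sales": 950.0},
    ]

    def ad_hoc_report(rows, filters=None, group_by="region", measure="net_sales"):
        """Filter, group, and total a measure, the way a wizard-driven report builder might."""
        filters = filters or {}
        selected = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
        report = {}
        for r in selected:
            key = r[group_by]
            report[key] = report.get(key, 0.0) + r[measure]
        return report

    # A business user's ad hoc question: net sales for product A, by region.
    print(ad_hoc_report(INTEGRATED_DATA, filters={"product": "A"}, group_by="region"))
    # -> {'EMEA': 1200.0, 'APAC': 950.0}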
Conclusion
Data integration can be a significant effort, whether you are building new data-intensive applications, adapting a packaged application to a new context, or trying to create a single point of access for your enterprise data [3]. More companies are turning to a multitude of solutions that can shorten data integration projects and lower data management and maintenance costs over time [3]. The information fabric approach can simplify data provisioning, access, and integration, thus shortening the data integration time frame and enhancing productivity by letting developers concentrate on developing actual application logic [3]. However, it requires a tremendously skilled systems integration capability, in-depth data theory, and more. At the other extreme, those who need integrated data can be left to their own devices; in the modern landscape, this is simply not a long-term option. Therefore, absent an in-depth situation analysis, the data consolidation solution is the most pragmatic: much of the effort required in data consolidation activity can roll forward into either data virtualization or the information layer.

Citations
1. Noel Yuhanna and Mike Gilpin. "Information Fabric 3.0." Forrester Research, August 8, 2013.
2. "Infosys Gradient: An EII Solution." Infosys, 2011.
3. "Infosys Gradient: Enabling Enterprise Data Virtualization." Infosys, April 2005.
About the Authors
Bill Peer is a Principal Technology Architect at Infosys Labs. He has over 20 years of IT experience. His focus area is IT strategy for business competitive advantage, with hyper-specializations in innovation, large-scale global enterprise architecture, and emergent technology exploitation (such as 'Big Data' systems).

Kiran Kumar Kaipa is a Senior Consultant at Infosys Labs. He has over nine years of experience in the IT industry, where he has worked in both consulting and technical roles. His current focus area is Big Data analytics, with specializations in data munging, data analysis, and data visualization.

Shyamala Sadananda is a Senior Architect at Infosys Labs. She has over 15 years of professional experience in systematic innovation, architecture, consulting, and solution implementation. She primarily anchors engagements focused on emerging technologies in the areas of data virtualization, analytics, data visualization, and business intelligence.

Swaminathan Natarajan is a Principal Product Architect at Infosys Labs. He has over 17 years of experience in the software industry and has worked in various functions such as product engineering, R&D, and technology consulting. His current focus area is Big Data analytics, with specific interest in data munging and unstructured data analytics.
© 2015 Infosys Limited, Bangalore, India. All Rights Reserved. Infosys believes the information in this document is accurate as of its publication date; such information is subject to change without notice. Infosys acknowledges the proprietary rights of other companies to the trademarks, product names, and such other intellectual property rights mentioned in this document. Except as expressly permitted, neither this documentation nor any part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, printing, photocopying, recording, or otherwise, without the prior permission of Infosys Limited and/or any named intellectual property rights holders under this document.

About Infosys
Infosys is a global leader in consulting, technology, outsourcing, and next-generation services. We enable clients in more than 50 countries to stay a step ahead of emerging business trends and outperform the competition. We help them transform and thrive in a changing world by co-creating breakthrough solutions that combine strategic insights and execution excellence.

Visit www.infosys.com to see how Infosys (NYSE: INFY), with US$8.25 billion in annual revenues and 165,000+ employees, is helping enterprises renew themselves while also creating new avenues to generate value.

For more information, contact askus@infosys.com
www.infosys.com