1. Business Intelligence, Data Warehousing
Data Marts, Data Mining
Presented by
Mr. Manish Tripathi ( I – 15-18-19)
Thakur Institute of Management Studies
&
Research
(Sunday 26 March, 2017)
3. WHAT IS BUSINESS INTELLIGENCE?
• BI is a technology-driven process for analyzing data
and presenting actionable information to help
corporate executives, business managers and other
end users make more informed business decisions
• BI encompasses a wide variety of tools, applications
and methodologies that enable organizations to
collect data from internal systems and external
sources, prepare it for analysis, develop and run
queries against the data, and create reports,
dashboards and data visualizations that make the
analytical results available to corporate decision
makers as well as operational workers
4. WHAT IS BUSINESS INTELLIGENCE?
• BI technologies provide historical, current and
predictive views of business operations
• Identifying new opportunities and implementing
an effective strategy based on insights can provide
businesses with a competitive market advantage
and long-term stability
• Business intelligence can be used to support a
wide range of business decisions ranging from
operational to strategic
5. BENEFITS OF BUSINESS INTELLIGENCE
• The potential benefits of business intelligence
programs include accelerating and improving
decision making; optimizing internal business
processes; increasing operational efficiency;
driving new revenues; and gaining competitive
advantages over business rivals.
• Removes guesswork
• Gives quicker responses to your business-related
queries
• Delivers important business metrics reports
whenever and wherever you need them
6. BENEFITS OF BUSINESS INTELLIGENCE
• Gain a better understanding of the business's past,
present and future
• Gain valuable insight into your customers'
behaviour
• Pinpoint up-selling as well as cross-selling
opportunities
• Improve efficiency
9. BUSINESS INTELLIGENCE TOOLS
• SAP Crystal Reports
• SAS Enterprise BI Server
• Oracle Business Intelligence Enterprise Edition Plus
• IBM Cognos 8 BI
• Microsoft PowerPivot
• MicroStrategy Reporting Suite
• Salesforce CRM
• TIBCO Spotfire Analytics
• Information Builders WebFOCUS
12. WHAT IS DATA WAREHOUSING?
• A data warehouse is a federated repository for all
the data that an enterprise's various business
systems collect
• It is a collection of corporate information and data
derived from operational systems and external
data sources
• A data warehouse is designed to support business
decisions by allowing data consolidation, analysis
and reporting at different aggregate levels
13. MOST POPULAR DATA WAREHOUSING DEFINITIONS
Ralph Kimball
• A data warehouse is a copy of transaction data
specifically structured for query and analysis
Bill Inmon
• A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile collection
of data in support of management's decision
making process
15. Subject-Oriented
A data warehouse can be used to analyze a
particular subject area. For example, "sales"
can be a particular subject
16. Integrated
A data warehouse integrates data from
multiple data sources. For example, source A
and source B may have different ways of
identifying a product, but in a data warehouse,
there will be only a single way of identifying a
product
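The integration idea above can be sketched in a few lines of Python. This is only an illustration: the two source layouts, the field names and the `PRODUCT_KEY` mapping table are assumptions made for the example, not part of any particular warehouse product.

```python
# Hypothetical source records: each system identifies the same product differently.
source_a = [{"sku": "A-1001", "name": "Desk Lamp", "units": 3}]
source_b = [{"item_code": 77, "title": "desk lamp", "units": 5}]

# Assumed mapping table from (source, local id) to the single warehouse key.
PRODUCT_KEY = {("A", "A-1001"): "P001", ("B", 77): "P001"}


def integrate(rows_a, rows_b):
    """Return rows that all use the single warehouse product identifier."""
    unified = []
    for row in rows_a:
        key = PRODUCT_KEY[("A", row["sku"])]
        unified.append({"product_key": key, "units": row["units"]})
    for row in rows_b:
        key = PRODUCT_KEY[("B", row["item_code"])]
        unified.append({"product_key": key, "units": row["units"]})
    return unified


warehouse_rows = integrate(source_a, source_b)
total_units = sum(r["units"] for r in warehouse_rows)
```

Once both sources share `product_key`, totals per product can be computed without knowing which system a row came from.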
17. Time-Variant
Historical data is kept in a data warehouse. For
example, one can retrieve data from 3 months,
6 months, 12 months, or even older data from
a data warehouse. This contrasts with a
transaction system, where often only the most
recent data is kept
18. Non-volatile
Once data is in the data warehouse, it will not
change. So, historical data in a data warehouse
should never be altered
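A minimal sketch of the time-variant and non-volatile properties together, assuming a hypothetical `HistoryTable` class invented for this example: rows are only ever appended, so any past state can still be read back.

```python
from datetime import date


class HistoryTable:
    """Append-only store: rows can be added and read, never changed."""

    def __init__(self):
        self._rows = []  # (as_of_date, product, price)

    def append(self, as_of, product, price):
        self._rows.append((as_of, product, price))

    def price_as_of(self, product, when):
        """Return the latest price recorded on or before `when`, else None."""
        candidates = [r for r in self._rows if r[1] == product and r[0] <= when]
        if not candidates:
            return None
        return max(candidates, key=lambda r: r[0])[2]


history = HistoryTable()
history.append(date(2017, 1, 1), "lamp", 20.0)
history.append(date(2017, 3, 1), "lamp", 22.0)  # new row; the old one is kept

price_feb = history.price_as_of("lamp", date(2017, 2, 15))
price_mar = history.price_as_of("lamp", date(2017, 3, 26))
```

A price change is recorded as a new row rather than an update, which is why the February price is still available after the March load.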
20. Purpose of Data Warehousing
• Keeping Analysis/Reporting and Production Separate
• Information Integration from multiple systems: single
point source for information
• Data Consistency and Quality
• Fast response time: production databases are tuned
to the expected transaction load, not to analytical queries
• Fast response time: dimensional modeling serves
analytical queries better than normalized data
• Establish the foundation for Decision Support
• Maintain data history, even if the source transaction
systems do not
22. Data Warehousing vs. normal Database
1- Size
Data warehouses are potentially much bigger than
the databases from where the data is derived.
Databases usually store only the data that is currently
in active use; older records can be purged and moved
to backups, mainly for performance reasons. Data
warehouses are used to store much older historical
records; it's also common to use data warehouses to
store additional information that is bought or
captured elsewhere to complement the information
that is generated and stored by the internal database
system
23. Data Warehousing vs. normal Database
2- Normalization
Databases are usually normalized, which means that
a lot of work is done to guarantee that there's a
unique copy of any given bit of information, which is
important for performance and consistency reasons.
But it's common to store different versions of the
same information on a data warehouse, using
different structures to compose and access the
information. In other words, data warehouses are
messier and more irregular, partly by design, as they
need to be able to work with so many different
sources of information
24. Data Warehousing vs. normal Database
3- Access pattern
Database records are often retrieved and updated
one by one; data warehouses are nearly always
accessed by reporting engines that work on entire
datasets at a time to generate aggregates and other
analytical information. Databases are frequently
updated, sometimes only a field or record at a time;
data warehouses aren't updated very frequently, and
for all practical purposes, never at the field or record
level; instead data is appended in large batches
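The two access patterns can be contrasted with Python's built-in `sqlite3` module. This is a toy sketch: the `orders` table and its rows are invented for the example, and a single in-memory SQLite database stands in for both kinds of system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 70.0)],
)

# Database-style access: read or update one record by its key.
conn.execute("UPDATE orders SET amount = 120.0 WHERE id = 1")

# Warehouse-style access: scan the whole dataset and aggregate it.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
```

The update touches exactly one row, while the report reads every row to produce one number per region.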
25. Data Warehousing vs. normal Database
4- Use
Normal databases are used for OLTP whereas data
warehousing is used for OLAP
26. Data Warehousing vs. normal Database
5- Performance
In a normal database, performance is critical and the
system is optimized for write operations, whereas in a
data warehouse real-time performance is less critical
and the system is optimized for read operations.
27. Data Warehousing vs. normal Database
6- Table & Joins
In a normal database, the tables and joins are complex
because they are normalized (for an RDBMS). This is done
to reduce redundant data and to save storage space.
In a data warehouse, the tables and joins are simple
because they are de-normalized. This is done to reduce
the response time for analytical queries.
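As a small illustration of the difference, the sketch below stores the same sales both ways in an in-memory SQLite database. The table names (`product`, `sale`, `sale_flat`) and the data are assumptions for the example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the product name is stored once and referenced by key.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
    INSERT INTO product VALUES (1, 'Lamp'), (2, 'Chair');
    INSERT INTO sale VALUES (1, 1, 20.0), (2, 1, 22.0), (3, 2, 45.0);

    -- De-normalized: the product name is repeated on every sales row.
    CREATE TABLE sale_flat (sale_id INTEGER PRIMARY KEY, product_name TEXT, amount REAL);
    INSERT INTO sale_flat VALUES (1, 'Lamp', 20.0), (2, 'Lamp', 22.0), (3, 'Chair', 45.0);
""")

# The normalized report needs a join to recover the product name.
normalized = conn.execute("""
    SELECT p.name, SUM(s.amount)
    FROM sale s JOIN product p ON p.product_id = s.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()

# The de-normalized report reads a single table.
denormalized = conn.execute("""
    SELECT product_name, SUM(amount)
    FROM sale_flat
    GROUP BY product_name ORDER BY product_name
""").fetchall()
```

Both queries return the same totals; the flat table trades repeated product names (more storage) for a simpler query.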
28. Data Warehousing vs. normal Database
7- Data source
A normal database mostly uses internal data sources,
whereas a data warehouse may also use external data
sources such as macroeconomic indicators, competitor
data and market data.
29. DATA WAREHOUSING PRODUCTS
• Teradata EDW (enterprise data warehouse)
• Oracle Exadata
• Amazon Redshift
• Cloudera Enterprise Data Hub (EDH)
• MarkLogic
• IBM Netezza data warehouse appliance
• SAP Business Warehouse
• Microsoft SQL Server Parallel Data Warehouse
31. 7 STEPS IN BUILDING DATA WAREHOUSE
(MANAGEMENT VIEW)
• Step 1: Determine Business Objectives
• Step 2: Collect and Analyze Information
• Step 3: Identify Core Business Processes
• Step 4: Construct a Conceptual Data Model
• Step 5: Locate Data Sources and Plan Data
Transformations
• Step 6: Set Tracking Duration
• Step 7: Implement the Plan
32. 3 STEPS IN BUILDING DATA WAREHOUSE
(TECHNICAL VIEW)
• Extract
• Transform
• Load
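The three technical steps can be sketched as three small Python functions. This is a minimal illustration only: the CSV text, the cleaning rules and the `fact_orders` table name are assumptions made for the example, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational system.
RAW_CSV = "order_id,region,amount\n1, north ,100.50\n2,SOUTH,70\n3,north,\n"


def extract(text):
    """Extract: read the raw rows exactly as the source provides them."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, fix types and region spelling."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue
        clean.append(
            (int(row["order_id"]), row["region"].strip().title(), float(row["amount"]))
        )
    return clean


def load(conn, rows):
    """Load: append the cleaned rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()
```

Keeping the steps as separate functions makes it easy to swap the source or the target without touching the cleaning logic.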
35. DATA MART
• The data mart is a subset of the data warehouse and
is usually oriented to a specific business line or team
• A data mart is a repository of data that is designed to
serve a particular community of knowledge workers
• Because data marts are optimized to look at data in a
unique way, the design process tends to start with an
analysis of user needs
• Today, data virtualization software can be used to
create virtual data marts, pulling data from disparate
sources and combining it with other data as necessary
to meet the needs of specific business users
36. DATA MART
• A virtual data mart provides knowledge workers
with access to the data they need while
preventing data silos and giving the organization's
data management team a level of control over the
organization's data throughout its lifecycle
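One simple way to picture a virtual data mart is a database view: a stored query over warehouse data that copies nothing. The sketch below uses an in-memory SQLite database; the `warehouse_sales` table and `marketing_mart` view are hypothetical names, and real data virtualization software typically spans several physical sources rather than one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_sales (region TEXT, department TEXT, amount REAL);
    INSERT INTO warehouse_sales VALUES
        ('North', 'Marketing', 10.0),
        ('North', 'Finance', 99.0),
        ('South', 'Marketing', 15.0);

    -- The "mart" is only a stored query; no data is copied out of the warehouse.
    CREATE VIEW marketing_mart AS
        SELECT region, amount FROM warehouse_sales WHERE department = 'Marketing';
""")

mart_rows = conn.execute(
    "SELECT region, amount FROM marketing_mart ORDER BY region"
).fetchall()
```

Because the view reads the warehouse table directly, the marketing team sees only its own rows while the data stays under central control.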
37. REASONS FOR CREATING A DATA MART
• Easy access to frequently needed data
• Creates collective view by a group of users
• Improves end-user response time
• Ease of creation
• Lower cost than implementing a full data
warehouse
• Potential users are more clearly defined than in a
full data warehouse
• Contains only business essential data and is less
cluttered.
40. DATA LAKE
• A data lake is a storage repository that holds a vast
amount of raw data in its native format until it is needed
• A data lake uses a flat architecture to store data
• Each data element in a lake is assigned a unique identifier
and tagged with a set of extended metadata tags
• When a business question arises, the data lake can be
queried for relevant data, and that smaller set of data can
then be analyzed to help answer the question
• The term data lake is often associated with Hadoop-
oriented object storage
• In such a scenario, an organization's data is first loaded
into the Hadoop platform, and then business analytics and
data mining tools are applied to the data where it resides
on Hadoop's cluster nodes
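The flat layout, unique identifiers and metadata tags described above can be mimicked with a small in-memory class. `DataLake`, its methods and the tag names are all invented for this sketch; it does not model Hadoop or any real storage system.

```python
import uuid


class DataLake:
    """Toy flat store: every object gets a unique id and a set of tags."""

    def __init__(self):
        self._objects = {}  # flat namespace: id -> (raw bytes, tags)

    def put(self, raw, **tags):
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, tags)
        return object_id

    def find(self, **wanted):
        """Return the ids of objects whose tags match every requested value."""
        return [
            oid
            for oid, (_, tags) in self._objects.items()
            if all(tags.get(k) == v for k, v in wanted.items())
        ]

    def get(self, object_id):
        return self._objects[object_id][0]


lake = DataLake()
log_id = lake.put(b"GET /index.html 200", source="web", kind="log")
img_id = lake.put(b"\x89PNG...", source="web", kind="image")

# A business question narrows the lake down to a small, relevant subset.
web_logs = lake.find(source="web", kind="log")
```

The raw bytes are stored untouched; only the tags are consulted when a question arrives.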
42. DATA MART VS. DATA WAREHOUSE
1- Data Scope
The first and most obvious difference is
the scope of the information each one stores.
A data warehouse holds all kinds of data
related to the system, whereas a data mart
stores information on a specific subject
only, which makes it much more focused.
43. DATA MART VS. DATA WAREHOUSE
2- Size
A data warehouse is usually much bigger
than a data mart because it holds far
more data.
44. DATA MART VS. DATA WAREHOUSE
3- Integration
A data warehouse usually integrates
several sources of data in order to feed
its database and the system’s needs. In
contrast, a data mart has a lot less
integration to do, since its data is very
specific
46. DATA MART VS. DATA WAREHOUSE
5- Creation
Creating a data warehouse is far more
difficult and time-consuming than building
a data mart. Building the whole structure and
the relationships between the data is a long and
very important step. We also need to analyse
how to integrate all of the information sources.
Since data marts are smaller and subject-oriented,
these actions tend to be much simpler.
47. DATA MART VS. DATA WAREHOUSE
6- Management
Like creation, the management of a data
warehouse is far more complex than that of
a data mart, and for the same reason: with
far more data, relationships and processes
to manage, the task becomes harder.
48. DATA MART VS. DATA WAREHOUSE
7- Cost
Overall, data marts are cheaper than data
warehouses. Building and maintaining a data
warehouse takes significantly more physical
resources such as servers, disk space,
memory and CPU. Because it is a less complex
system, a data mart also requires less time
to build and operate.
49. DATA MART VS. DATA WAREHOUSE
8- Performance
The performance of a system always
depends on how it is built, the
infrastructure that supports it, the
processes, the number of users, etc.
Usually a data mart is faster than a data
warehouse because of the warehouse's
inherent complexity and larger data volume.
51. MULTIDIMENSIONAL ANALYSIS
• Multi-dimensional analysis is an informational
analysis of data that takes into account many
different relationships, each of which represents a
dimension
• For example, a retail analyst may want to
understand the relationships among sales by
region, by quarter, by demographic distribution
(income, education level, gender), by product
• Multi-dimensional analysis will yield results for
these complex relationships
52. MULTIDIMENSIONAL ANALYSIS
• Multi-dimensional Data Analysis (MDDA) refers to
the process of summarizing data across multiple
levels (called dimensions) and then presenting the
results in a multi-dimensional grid format
• This process is also referred to as OLAP cube, Data
Pivot, Decision Cube, and Crosstab
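A crosstab over two dimensions can be built with nothing more than nested dictionaries. The sales records below are made up for the example, and `crosstab` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Hypothetical sales records: (region, quarter, amount).
sales = [
    ("North", "Q1", 100),
    ("North", "Q2", 120),
    ("South", "Q1", 80),
    ("South", "Q2", 90),
    ("North", "Q1", 30),
]


def crosstab(records):
    """Summarize amounts into a region-by-quarter grid."""
    grid = defaultdict(lambda: defaultdict(int))
    for region, quarter, amount in records:
        grid[region][quarter] += amount
    return {region: dict(quarters) for region, quarters in grid.items()}


grid = crosstab(sales)
```

Each cell of `grid` holds the total for one combination of the two dimensions, which is the grid format the slide describes.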
53. OLAP CUBE
• An OLAP cube is a multidimensional database that
is optimized for data warehouse and online
analytical processing (OLAP) applications
• An OLAP cube is a method of storing data in a
multidimensional form, generally for reporting
purposes
• In OLAP cubes, data are categorized by dimensions
• OLAP cubes are often pre-summarized across
dimensions to drastically improve query time over
relational databases
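Pre-summarization can be illustrated by computing every combination of dimensions once, up front, so that later queries become dictionary lookups. This is a toy sketch with invented data; real OLAP engines use far more compact storage and do not necessarily materialize every combination.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("region", "quarter", "product")

# Hypothetical fact rows.
facts = [
    {"region": "North", "quarter": "Q1", "product": "Lamp", "amount": 100},
    {"region": "North", "quarter": "Q2", "product": "Lamp", "amount": 120},
    {"region": "South", "quarter": "Q1", "product": "Chair", "amount": 80},
]


def build_cube(rows):
    """Pre-compute the total amount for every subset of the dimensions."""
    cube = defaultdict(int)
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row["amount"]
    return dict(cube)


cube = build_cube(facts)

# Queries are now lookups instead of scans over the fact rows.
north_total = cube[(("region",), ("North",))]
north_q1 = cube[(("region", "quarter"), ("North", "Q1"))]
grand_total = cube[((), ())]
```

The cost is paid once at build time, which is the trade-off behind the faster query times mentioned above.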
56. WHAT IS DATA MINING?
• Data mining is the practice of automatically searching
large stores of data to discover patterns and trends
that go beyond simple analysis
• Data mining uses sophisticated mathematical
algorithms to segment the data and evaluate the
probability of future events
• It is the process of finding anomalies, patterns and
correlations within large data sets to predict outcomes
• The overall goal of the data mining process is to
extract information from a data set and transform it
into an understandable structure for further use
• Also known as Knowledge Discovery in Data (KDD)
57. The phases, and the iterative nature, of a data mining project.
The process flow shows that a data mining project does not
stop when a particular solution is deployed. The results of
data mining trigger new business questions, which in turn can
be used to develop more focused models.
58. 1- PROBLEM DEFINITION
• This initial phase of a data mining project focuses
on understanding the project objectives and
requirements. Once we have specified the project
from a business perspective, we can formulate it
as a data mining problem and develop a
preliminary implementation plan.
• For example, the business problem might be:
"How can I sell more of my product to customers?"
You might translate this into a data mining
problem such as: "Which customers are most likely
to purchase the product?"
59. 2- Data Gathering and Preparation
The data understanding phase involves data collection
and exploration. As you take a closer look at the data,
you can determine how well it addresses the business
problem. You might decide to remove some of the
data or add additional data. This is also the time to
identify data quality problems and to scan for
patterns in the data.
60. 3- Model Building and Evaluation
In this phase, you select and apply various modeling
techniques and calibrate the parameters to optimal
values. If the algorithm requires data transformations,
you will need to step back to the previous phase to
implement them.
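Calibrating a parameter can be as simple as trying candidate values and keeping the one that scores best. The sketch below tunes a single threshold for a one-rule classifier; the training data, the rule and the candidate range are all assumptions for illustration, and a real project would evaluate on held-out data rather than on the training set.

```python
# Hypothetical labelled data: (monthly_visits, bought_product).
training = [
    (1, False), (2, False), (3, False),
    (4, True), (6, True), (7, True), (9, True),
]


def accuracy(threshold, data):
    """Share of customers classified correctly by 'visits >= threshold'."""
    correct = sum((visits >= threshold) == bought for visits, bought in data)
    return correct / len(data)


def calibrate(data, candidates):
    """Return the candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy(t, data))


best_threshold = calibrate(training, range(1, 11))
best_accuracy = accuracy(best_threshold, training)
```

The same loop generalizes to several parameters, at the cost of trying every combination.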
61. 4- Knowledge Deployment
• Knowledge deployment is the use of data mining
within a target environment
• In the deployment phase, insight and actionable
information can be derived from data
63. Data Mining Models
• A mining model is created by applying an
algorithm to data
• It is a set of data, statistics, and patterns that can
be applied to new data to generate predictions
and make inferences about relationships
• A data mining model gets data from a mining
structure and then analyzes that data by using a
data mining algorithm
• The mining structure and mining model are
separate objects
• The mining structure stores information that
defines the data source
64. Data Mining Models
• A mining model stores information derived from
statistical processing of the data, such as the
patterns found as a result of analysis
• A mining model is empty until the data provided
by the mining structure has been processed and
analyzed.
• After a mining model has been processed, it
contains metadata, results, and bindings back to
the mining structure
• Model contains metadata, patterns, and bindings
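The separation between a mining structure and a mining model can be sketched with two small classes. The class names mirror the slide's terms, but the code is a hypothetical illustration and does not reproduce any vendor's API; the "pattern" learned here is just a column average.

```python
from statistics import mean


class MiningStructure:
    """Defines the data source: which rows and which columns to use."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = columns


class MiningModel:
    """Holds patterns derived from a structure; empty until processed."""

    def __init__(self, structure):
        self.structure = structure  # binding back to the mining structure
        self.patterns = {}          # filled in by process()

    def process(self):
        for column in self.structure.columns:
            self.patterns[column] = mean(row[column] for row in self.structure.rows)
        return self


structure = MiningStructure(
    rows=[{"age": 20, "spend": 100}, {"age": 40, "spend": 300}],
    columns=["age", "spend"],
)
model = MiningModel(structure)
patterns_before = dict(model.patterns)
model.process()
```

Several models could share one `structure` object, which is the point of keeping the two as separate objects.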
67. Data Mining Algorithms
• An algorithm in data mining is a set of heuristics
and calculations that creates a model from data
• To create a model, the algorithm first analyzes the
data you provide, looking for specific types of
patterns or trends
• The algorithm uses the results of this analysis over
many iterations to find the optimal parameters for
creating the mining model
• These parameters are then applied across the
entire data set to extract actionable patterns and
detailed statistics.
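The iterate-until-good-parameters idea can be seen in a one-dimensional k-means, used here only as one familiar example of such an algorithm. The spend values, starting centers and fixed iteration count are assumptions made for the sketch.

```python
def kmeans_1d(values, centers, iterations=10):
    """Repeatedly assign values to the nearest center, then re-average."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster received no points.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer spend figures with two obvious groups.
spend = [10, 12, 11, 50, 52, 49]
centers, clusters = kmeans_1d(spend, centers=[0, 100])
```

The centers are the parameters being refined: each pass uses the previous assignment to move them closer to the groups in the data.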
68. Data Mining Algorithms
• The mining model that an algorithm creates from
your data can take various forms, including:
1. A set of clusters that describe how the cases in a
dataset are related
2. A decision tree that predicts an outcome, and
describes how different criteria affect that outcome
3. A mathematical model that forecasts sales
4. A set of rules that describe how products are grouped
together in a transaction, and the probabilities that
products are purchased together
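The fourth form, rules about products bought together, can be sketched by counting item pairs in a handful of baskets. The transactions are invented, and `pair_rules` is a hypothetical helper that only handles pairs; production association-rule miners such as Apriori also prune by minimum support and handle larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]


def pair_rules(baskets):
    """Return {(antecedent, consequent): confidence} for every item pair."""
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        frozenset(pair)
        for basket in baskets
        for pair in combinations(sorted(basket), 2)
    )
    rules = {}
    for pair, together in pair_counts.items():
        for antecedent in pair:
            (consequent,) = pair - {antecedent}
            # Confidence: share of baskets with the antecedent that also
            # contain the consequent.
            rules[(antecedent, consequent)] = together / item_counts[antecedent]
    return rules


rules = pair_rules(transactions)
```

Here `rules[("butter", "bread")]` is the estimated probability that a basket containing butter also contains bread.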