2. What is a Data Warehouse?
A data warehouse is an information system that contains historical and cumulative data from single or multiple sources. It simplifies an organization's reporting and analysis processes.
It also serves as a single version of the truth for decision making and forecasting.
Characteristics of a Data Warehouse
A data warehouse has the following characteristics:
Subject-Oriented
Integrated
Time-variant
Non-volatile
3. Data Warehouse Architectures
There are three main types of data warehouse architecture:
Single-tier architecture
The objective of a single-tier architecture is to minimize the amount of data stored by removing data redundancy. This architecture is not frequently used in practice.
Two-tier architecture
Two-tier architecture physically separates the source systems from the data warehouse. This architecture is not easily expandable and does not support a large number of end users. It can also suffer connectivity problems because of network limitations.
Three-tier architecture
This is the most widely used architecture.
It consists of a bottom, middle and top tier.
Bottom Tier: The data warehouse database server forms the bottom tier. Data is cleansed, transformed, and loaded into this layer using back-end (ETL) tools.
Middle Tier: The middle tier is an OLAP server, implemented using either the ROLAP or the MOLAP model.
Top Tier: The top tier is the front-end client layer: the tools and APIs used to connect to the warehouse and retrieve data.
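As a rough conceptual illustration only (not a real deployment), the sketch below imitates the three tiers in Python, with SQLite standing in for the warehouse database; the sales table and its values are invented.

```python
# Minimal sketch of the three tiers, using SQLite as a stand-in for the
# warehouse RDBMS; the "sales" table and its rows are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")

# Bottom tier: cleansed, transformed data is loaded into the warehouse database.
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", 2022, 120.0), ("North", 2023, 150.0), ("South", 2023, 90.0)],
)

# Middle tier: an OLAP-style aggregation (here a plain SQL GROUP BY as a proxy).
cursor = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales GROUP BY region, year"
)

# Top tier: a front-end client consumes the query results for reporting.
for region, year, total in cursor:
    print(f"{region} {year}: {total}")
```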
4. Data warehouse Components
There are five main components of a data warehouse:
Data Warehouse Database
The central database is the foundation of the data warehousing environment. This database is implemented on RDBMS technology.
Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
The data sourcing, transformation, and migration tools are used to perform all the conversions, summarizations, and changes needed to transform data into a unified format in the data warehouse.
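A minimal ETL sketch, assuming pandas is available; the inline source records, column names, and the SQLite target are invented for illustration.

```python
import sqlite3
import pandas as pd

# Extract: raw records from a source system (constructed inline here to keep
# the sketch self-contained; in practice they would come from files or systems).
raw = pd.DataFrame({
    "cust_id": [1, 2, 2],
    "amount": ["10.5", "7.0", None],
    "date": ["2024-01-03", "2024-01-04", "2024-01-04"],
})

# Transform: drop bad rows, fix types, standardize into a unified format.
clean = raw.dropna(subset=["amount"]).copy()
clean["amount"] = clean["amount"].astype(float)
clean["date"] = pd.to_datetime(clean["date"])

# Load: write the unified records into the warehouse (SQLite as a stand-in).
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_sales", conn, if_exists="append", index=False)
```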
5. Metadata
Metadata is data about data; it defines the data warehouse and is used for building, maintaining and managing it.
Metadata can be classified into the following categories:
Technical Meta Data
Business Meta Data
Query Tools
One of the primary objectives of data warehousing is to provide information to businesses for making strategic decisions. Query tools allow users to interact with the data warehouse system.
These tools fall into four different categories:
Query and reporting tools
Application Development tools
Data mining tools
OLAP tools
6. Data warehouse Bus Architecture
The Data Warehouse Bus determines the flow of data in your warehouse. The data flow in a data warehouse can be categorized as Inflow, Upflow, Downflow, Outflow and Meta flow.
Data Marts
A data mart is an access layer used to get data out to the users. It is presented as an alternative to a large data warehouse, as it takes less time and money to build.
7. Data Mining
Data mining is defined as a process used to extract usable data from a larger set of raw data.
It implies analysing data patterns in large batches of data using one or more software tools.
For segmenting the data and evaluating the probability of future events, data mining uses sophisticated mathematical algorithms. Data mining is also known as Knowledge Discovery in Data (KDD).
8. Key features of data mining
Automatic pattern predictions based on trend and
behaviour analysis.
Prediction based on likely outcomes.
Creation of decision-oriented information.
Focus on large data sets and databases for analysis.
Clustering based on finding and visually documenting groups of facts not previously known.
9. Data Mining Functionalities
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. There are two types of tasks:
Descriptive Task:
These tasks present the general properties of data stored in a database. Descriptive tasks are used to find patterns in the data, e.g. clusters, correlations, trends and anomalies.
Predictive Tasks:
Predictive data mining tasks predict the value of one attribute, known as the target or dependent variable, on the basis of the values of other attributes, which are known as the independent variables.
11. Clustering
Clustering is used to identify data objects that are similar to one another. It is the process of partitioning a set of objects so that similar objects fall into the same group, called a cluster.
It is used in machine learning, pattern recognition, image analysis and information retrieval. For example, an insurance company can cluster its customers based on age, residence, income, etc.
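A minimal sketch of the insurance example: clustering customers by age and income with k-means. The data is invented and scikit-learn is assumed to be available.

```python
from sklearn.cluster import KMeans

customers = [  # [age, annual income in thousands]; invented values
    [23, 30], [25, 32], [47, 90], [52, 110], [46, 85], [22, 28],
]

# Partition the customers into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)  # cluster assignment for each customer, e.g. [0 0 1 1 1 0]
```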
Associations and correlations:
Association discovers the association or connection among a set of items.
A retailer can identify the products that customers normally purchase together, or find the customers who respond to promotions for the same kind of products.
For example, a table and a chair are items that are frequently purchased together.
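A simple co-occurrence count over market baskets, as a rough stand-in for a full association-rule algorithm such as Apriori; the transactions are invented.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"table", "chair"},
    {"table", "chair", "lamp"},
    {"chair", "lamp"},
    {"table", "chair"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # e.g. [(('chair', 'table'), 3)]
```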
Summarization
A set of relevant data is summarized, resulting in a smaller set that gives aggregated information about the data.
For example, the shopping done by a customer can be summarized into total products, total spending, offers used, etc. (see the sketch below).
Data mining under Descriptive Task
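To make the summarization example concrete, here is a small pandas sketch; the columns and values are invented for illustration.

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer":   ["A", "A", "A", "B", "B"],
    "amount":     [10.0, 25.5, 7.0, 40.0, 12.5],
    "offer_used": [True, False, True, False, False],
})

# Summarize each customer's shopping into totals.
summary = purchases.groupby("customer").agg(
    total_products=("amount", "size"),
    total_spending=("amount", "sum"),
    offers_used=("offer_used", "sum"),
)
print(summary)
```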
12. Prediction
The prediction task predicts the possible values of future data.
Prediction involves developing a model from the available data; this model is then used to predict values for a new data set of interest.
For example, a model can predict the income of an employee based on education, experience and other demographic factors like place of stay, gender, etc.
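A minimal sketch of such a prediction model: a regression that estimates income from years of education and experience. The data is invented and scikit-learn is assumed to be available.

```python
from sklearn.linear_model import LinearRegression

X = [[16, 2], [18, 5], [12, 10], [16, 8]]   # [years of education, years of experience]
y = [52000, 75000, 48000, 70000]            # observed incomes (invented)

# Fit the model on the available data, then predict for a new employee.
model = LinearRegression().fit(X, y)
print(model.predict([[16, 4]]))
```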
Time-Series Analysis
A time series is a sequence of events where the next event is determined by one or more of the preceding events.
Time series analysis includes methods to analyze time-series data in order to extract useful patterns, trends, rules and statistics. Stock market prediction is an important application of time-series analysis.
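A very simple time-series operation, sketched with pandas: a rolling mean that exposes the trend in a made-up daily price series.

```python
import pandas as pd

# Invented daily prices indexed by date.
prices = pd.Series(
    [100, 102, 101, 105, 107, 110, 108, 112],
    index=pd.date_range("2024-01-01", periods=8, freq="D"),
)

# A 3-day rolling mean smooths out day-to-day noise and shows the trend.
trend = prices.rolling(window=3).mean()
print(trend.dropna())
```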
Classification:
Classification builds models from data with predefined classes; the model is then used to classify new instances whose class is not known.
For example, one may classify an employee's potential salary on the basis of the salary classification of similar employees in the company (see the sketch below).
Data mining under Predictive Task
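A minimal sketch of the classification example: predicting an employee's salary band from education and experience with a decision tree. The data is invented and scikit-learn is assumed to be available.

```python
from sklearn.tree import DecisionTreeClassifier

X = [[12, 1], [16, 3], [18, 7], [12, 10], [16, 12]]  # [education, experience]
y = ["low", "medium", "high", "medium", "high"]      # known salary bands (invented)

# Train on employees with known bands, then classify a new employee.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[16, 5]]))
```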
13. Applications of Data Mining
Sales and Marketing
Banking and Finance
Healthcare and Insurance
Retail Industry
Telecommunications Industry
Higher Education
15. Amazon Web Services, Inc.
(on-demand cloud computing provider)
AWS allows you to take advantage of all of the core benefits associated with on-demand
computing, such as access to seemingly limitless storage and compute capacity, and the
ability to scale your system in parallel with the growing amount of data collected, stored,
and queried, paying only for the resources you provision.
Further, AWS offers a broad set of managed services that integrate seamlessly with each
other so that you can quickly deploy an end-to-end analytics and data warehousing
solution.
16. Amazon Redshift
Amazon Redshift is a fast, fully managed, and cost-effective data
warehouse that gives you petabyte scale data warehousing and exabyte
scale data lake analytics together in one service.
Amazon Redshift is up to ten times faster than traditional on-premises
data warehouses. Get unique insights by querying across petabytes of
data in Redshift and exabytes of structured data or open file formats in
Amazon S3, without the need to move or transform your data.
Redshift is 1/10th the cost of traditional on-premises data warehouse
solutions. You can start small for just $0.25 per hour with no commitments,
scale out to petabytes of data for $250 to $333 per uncompressed terabyte
per year, and extend analytics to your Amazon S3 data lake for as little as
$0.05 for every 10 gigabytes of data scanned.
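As a minimal sketch of querying Redshift from Python: because Redshift speaks the PostgreSQL wire protocol, the psycopg2 driver can be used. The endpoint, database name, credentials, and the sales table below are placeholders, not real values.

```python
import psycopg2

# Placeholder connection details for an example Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,            # Redshift's default port
    dbname="analytics",
    user="awsuser",
    password="********",
)

with conn, conn.cursor() as cur:
    # Placeholder aggregate query against a hypothetical sales table.
    cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    for region, total in cur.fetchall():
        print(region, total)
```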
17. Amazon Redshift Customer Success
“Amazon Redshift enables faster business insights and growth, and provides an
easy-to-manage infrastructure to support our data workloads. Redshift has given us
the confidence to run more data and analytics workloads on AWS and helps us meet
the growing needs of our customers.”
(Abhi Bhatt, Director Global Data & Analytics, McDonald’s)
“Amazon Redshift allows us to ingest, optimize, transform, and aggregate billions
of transactional events per day at scale, coming to us from a variety of first and
third party sources. We query live data across our data warehouse and data lake,
and now with the new Amazon Redshift Federated Query feature we can easily
query and analyse live data across our relational databases as well.”
(Alex Tverdohleb, Vice President Data Services, Consumer Products & Engineering,
FOX Corporation)
“At WD we use Amazon Redshift to enable the enterprise to gain value and insights
from large, complex, and dispersed datasets. Our data is nearly doubling every year
and we run six Redshift clusters with a total of 78 nodes and 631+TB of compressed
data stored to get insights that our business analysts and leadership depend on.”
(Fayaz Syed, Sr. Manager, Big Data Platform, Western Digital)