Dw 07032018-dr pl pradhan
1. Prepared By
Dr. P L Pradhan, Ph D
CSE (System Security)
Dept of Information Technology
TGPCET, RTM Nagpur University,
Nagpur, India
2. Database, BigData, Data Science
18. • What is the difference between a primary key
and a foreign key?
• A primary key uniquely identifies each row in
its own table. In a foreign key reference, a link is
created between two tables when the column
or columns that hold the primary key value
for one table are referenced by the column or
columns in another table. That column
becomes a foreign key in the second table.
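The link described above can be sketched with SQLite from the Python standard library; the table and column names (departments, employees, dept_id) are illustrative, not from the slides:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Primary key: uniquely identifies each row in departments.
conn.execute("CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT)")

# Foreign key: employees.dept_id references departments.dept_id,
# creating the link between the two tables.
conn.execute("""
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES departments(dept_id)
    )
""")

conn.execute("INSERT INTO departments VALUES (1, 'IT')")
conn.execute("INSERT INTO employees VALUES (10, 'Asha', 1)")

# A row that references a non-existent department is rejected.
try:
    conn.execute("INSERT INTO employees VALUES (11, 'Ravi', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The rejected insert shows the practical effect of the foreign key: the second table can only hold values that exist as primary keys in the first.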
22. Information
• A set of data items satisfying a specific
objective.
• Data about data = metadata
23. Database
• A set of logically interconnected data items
serving several users simultaneously over
a LAN or WAN.
Oracle
Sybase
MS-SQL
Ingres
29. Data Items - Records - Tables
• Tuples make tables
• Tables make a database
• Databases make Big Data => Data Science
• Hadoop helps to extract the desired data &
information
33. Operational data
Operational data is not permanent; it is current data.
The data is volatile.
At any time, data can be read, written & executed
(RWX), i.e. inserted, deleted & updated.
Modification & updating of data is therefore risky.
Hence, operational data offers little security & privacy.
38. DATA WAREHOUSING
• Separate
• High Availability, Reliability & Scalability
• Integrated
• Time-stamped (RX)
• Subject-oriented
• Non-volatile (permanent)
• Accessible at all times
41. OLTP-OLAP
• Source of data
• OLTP: Operational data; OLTPs are the original source
of the data.
• OLAP: Consolidated data; OLAP data comes from the
various OLTP databases.
• Purpose of data
• OLTP: To control and run fundamental business tasks
(raw, current data)
• OLAP: To help with planning, problem solving, and
decision support (past & present data)
42. OLTP-OLAP
• What the data reveals
• OLTP: A snapshot of ongoing business processes
• OLAP: Multi-dimensional views of various kinds of
business activities
• Inserts and Updates
• OLTP: Short and fast inserts and updates initiated
by end users
• OLAP: Periodic long-running batch jobs refresh
the data
43. OLTP-OLAP
• Queries
• OLTP: Relatively standardized and simple queries
returning relatively few records
• OLAP: Often complex queries involving
aggregations, association, and collaboration
• Processing Speed
• OLTP: Typically very fast
• OLAP: Depends on the amount of data involved;
batch data refreshes and complex queries may
take many hours; query speed can be improved
by creating indexes
44. OLTP-OLAP
• Space Requirements
• OLTP: Can be relatively small if historical data is
archived
• OLAP: Larger due to the existence of aggregation
structures and history data; requires more indexes than
OLTP
• Database Design
• OLTP: Highly normalized with many tables (3-NF)
• OLAP: Typically de-normalized with fewer tables; use of
star and/or snowflake schemas
45. Backup and Recovery
• OLTP: Backup religiously; operational data is
critical to run the business, and data loss is likely
to entail significant monetary loss and legal
liability.
• OLAP: Instead of regular backups, some
environments may consider simply reloading
the OLTP data as a recovery method.
49. OLTP
• Online transaction processing (OLTP) is a
class of information systems that facilitate and
manage transaction-oriented applications,
typically for data entry and retrieval
transaction processing.
• Temporary data (current data)
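The defining feature of such a transaction-oriented system is that each short unit of work either completes fully or not at all. A toy sketch in Python's sqlite3, with invented account names and amounts:

```python
import sqlite3

# A toy OLTP-style transaction: the debit and the credit are applied
# together, or neither is applied at all.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
db.executemany("INSERT INTO accounts VALUES (?, ?)",
               [("alice", 100.0), ("bob", 50.0)])

try:
    # The connection used as a context manager opens a transaction:
    # it commits on success and rolls back if an exception occurs.
    with db:
        db.execute("UPDATE accounts SET balance = balance - 30 "
                   "WHERE name = 'alice'")
        db.execute("UPDATE accounts SET balance = balance + 30 "
                   "WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure the whole transfer is undone as a unit

print(dict(db.execute("SELECT name, balance FROM accounts")))
```

After the transfer both sides reflect the change (alice 70.0, bob 80.0), which is exactly the short, fast, end-user-initiated update pattern the OLTP slides describe.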
50. OLAP
• OLAP is an acronym for Online Analytical
Processing. OLAP performs multidimensional
analysis of business data and provides the
capability for complex calculations, trend
analysis, and sophisticated data modeling.
• Past & Present Data
57. Big Data
• Extremely large data sets that may be
analysed computationally to reveal patterns,
trends, and associations, especially relating to
human behaviour and interactions.
• HCI-Human Computer Interaction on BD
• “Much more IT investment is going towards
managing and maintaining big data"
58. Big-Data
• Challenges include analysis, capture, data
curation, search, sharing, storage, transfer,
visualization, querying,
updating and information privacy.
• The term often refers simply to the use of
predictive analytics, user behavior analytics,
or certain other advanced data analytics
methods that extract value from data, and
seldom to a particular size of data set
59. Characteristics
• Big Data represents the information assets
characterized by such high Volume, Velocity
and Variety as to require specific technology and
analytical methods for their transformation into
Value.
60. Big data
• Big data is arriving from multiple sources at
an alarming velocity, volume and variety. To
extract meaningful value from big data, you
need optimal processing power, analytics
capabilities and skills. ... Insights from big
data can enable all employees to make
better decisions ...
62. BRT
• Big Data is a collection of large datasets that
cannot be processed using traditional
computing techniques. It is not a single
technique or a tool; rather, it involves many
areas of Business, Resource and Technology.
65. Characteristics
• Volume: big data doesn't sample; it just observes
and tracks what happens
• Velocity: big data is often available in real-time
• Variety: big data draws from text, images, audio,
video; plus it completes missing pieces
through data fusion
• Machine Learning: big data often doesn't ask why
and simply detects patterns
• Digital footprint: big data is often a cost-free
byproduct of digital interaction
66. Characteristics
• Volume
• The quantity of generated and stored data. The size of the data determines the
value and potential insight- and whether it can actually be considered big data or
not.
• Variety
• The type and nature of the data. This helps people who analyze it to effectively use
the resulting insight.
• Velocity
• In this context, the speed at which the data is generated and processed to meet
the demands and challenges that lie in the path of growth and development.
• Variability
• Inconsistency of the data set can hamper processes to handle and manage it.
• Veracity
• The quality of captured data can vary greatly, affecting accurate analysis.
67. 6C
• Factory work and Cyber-physical systems may
have a 6C system:
• Connection (sensor and networks)
• Cloud (computing and data on demand)
• Cyber (model and memory)
• Content/context (meaning and correlation)
• Community (sharing and collaboration)
• Customization (personalization and value)
68. What Comes Under Big Data?
• Black Box Data : It is a component of helicopters, airplanes, jets,
etc. It captures the voices of the flight crew, recordings of microphones
and earphones, and the performance information of the aircraft.
• Social Media Data : Social media such as Facebook and Twitter hold
information and the views posted by millions of people across the
globe.
• Stock Exchange Data : The stock exchange data holds information
about the ‘buy’ and ‘sell’ decisions made by customers on shares of
different companies.
• Power Grid Data : The power grid data holds information
consumed by a particular node with respect to a base station.
• Transport Data : Transport data includes model, capacity, distance
and availability of a vehicle.
• Search Engine Data : Search engines retrieve lots of data from
different databases.
70. 3V
• Thus Big Data includes huge volume, high
velocity, and an extensible variety of data.
The data in it will be of three types:
• Structured data : Relational data.
• Semi-structured data : XML data.
• Unstructured data : Word, PDF, text, media
logs.
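The three types above can be contrasted in a short sketch using only the Python standard library; the sample values are invented for illustration:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: relational data with a fixed schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (item TEXT, qty INTEGER)")
db.execute("INSERT INTO sales VALUES ('pen', 3)")
structured = db.execute("SELECT item, qty FROM sales").fetchone()

# Semi-structured: XML carries its own tags, but the shape can vary
# from record to record.
root = ET.fromstring("<order><item>pen</item><qty>3</qty></order>")
semi = (root.find("item").text, int(root.find("qty").text))

# Unstructured: free text has no schema; extracting fields requires
# parsing or analytics rather than a simple query.
log_line = "sold 3 pens at the counter"
unstructured_tokens = log_line.split()

print(structured, semi, unstructured_tokens)
```

The same fact ("3 pens sold") is trivial to query in the structured form, needs tag navigation in the semi-structured form, and needs text processing in the unstructured form, which is why variety is counted among the defining challenges of Big Data.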
73. Big Data Challenges
The major challenges associated with big data are
as follows:
• Capturing data
• Data curation
• Storage
• Searching
• Sharing
• Transfer
• Analysis, visualization, association, collaboration,
communications ( OOS, OOP, UML)
• Presentation
75. DSP
• The Data Science Process
• The Data Science Process is a framework for
approaching data science tasks, and is crafted
by Joe Blitzstein and Hanspeter Pfister of
Harvard's CS 109. The goal of CS 109, as per
Blitzstein himself, is to introduce students to
the overall process of data science
investigation, a goal which should provide
some insight into the framework itself.
77. Data Science
• Data science is an interdisciplinary field about
processes and systems to
extract knowledge or insights from data in
various forms, either structured or
unstructured, which is a continuation of some
of the data analysis fields such
as statistics, data mining, and predictive
analytics, similar to Knowledge Discovery in
Databases (KDD).
78. DS
• Data science employs techniques and theories drawn from
many fields within the broad areas of mathematics,
statistics, operations research, information science, and
computer science, including signal processing, probability
models, machine learning, statistical learning, data mining,
database, data engineering, pattern recognition and
learning, visualization, predictive analytics, uncertainty
modelling, data warehousing, data compression, computer
programming, artificial intelligence, and high performance
computing. Methods that scale to big data are of particular
interest in data science, although the discipline is not
generally considered to be restricted to such big data, and
big data solutions are often focused on organizing and pre-
processing the data instead of analysis. The development of
machine learning has enhanced the growth and importance
of data science.
79. CRISP-DM
• CRISP-DM
• As a comparison to the Data Science Process put
forth by Blitzstein & Pfister, and elaborated upon
by Squire, we take a quick look at the de facto
official (yet unquestionably falling out of fashion)
data mining framework (which has been
extended to data science problems), the Cross
Industry Standard Process for Data Mining
(CRISP-DM). Though the standard is no longer
actively maintained, it remains a popular
framework for navigating data science projects.
82. Knowledge Discovery in Databases
• KDD Process
• Around the same time that CRISP-DM was emerging, the KDD
Process had finished developing. The KDD (Knowledge Discovery
in Databases) Process, by Fayyad, Piatetsky-Shapiro, and Smyth, is
a framework which has, at its core, "the application of specific data-
mining methods for pattern discovery and extraction." The
framework consists of the following steps:
Selection
Preprocessing
Transformation
Data Mining
Interpretation
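The five KDD steps above can be sketched as a toy pipeline; the records and the "mining" rule are invented purely to make each step concrete:

```python
# Invented raw records: (period, reading), one of them dirty.
raw = [("2024-01", "42"), ("2024-02", "oops"),
       ("2024-03", "58"), ("2024-04", "61")]

# Selection: pick the fields relevant to the task.
selected = [value for _, value in raw]

# Preprocessing: drop records that are not clean numbers.
clean = [v for v in selected if v.isdigit()]

# Transformation: convert to the representation mining needs.
numbers = [int(v) for v in clean]

# Data Mining: apply a specific method for pattern extraction
# (here, a trivial monotone-trend check stands in for a real miner).
rising = all(a < b for a, b in zip(numbers, numbers[1:]))

# Interpretation: turn the extracted pattern into a statement.
print("values are rising" if rising else "no clear trend")
```

Each line maps one-to-one onto a KDD stage, which is the point Fayyad et al. make: data mining proper is only one step inside a longer discovery process.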
84. SAS-SEMMA
• Discussion
• It is important to note that these are not the only
frameworks in this space; SEMMA (for Sample, Explore,
Modify, Model and Assess), from SAS, and the
agile-oriented Guerilla Analytics both come to mind. There
are also numerous in-house processes that various
data science teams and individuals no doubt employ
across any number of companies and industries in
which data scientists work.
• So, is the Data Science Process a new take on CRISP-
DM, which is just a reworking of KDD, or is it a new,
independent framework in its own right?
86. Data science
Exploratory data analysis
Information design
Interactive data visualization
Descriptive statistics
Inferential statistics
Statistical graphics
Plot
Data analysis
Infographic
88. DS
• Data science affects academic and applied research in
many domains, including machine translation, speech
recognition, robotics, search engines, digital economy,
but also the biological sciences, medical
informatics, health care, social sciences and the
humanities.
• It heavily influences economics, business and finance.
From the business perspective, data science is an
integral part of competitive intelligence, a newly
emerging field that encompasses a number of
activities, such as data mining and data analysis.
89. Data scientist
• Data scientists use their data and analytical ability to
find and interpret rich data sources; manage large
amounts of data despite hardware, software, and
bandwidth constraints; merge data sources; ensure
consistency of datasets; create visualizations to aid in
understanding data; build mathematical models using
the data; and present and communicate the data
insights/findings. They are often expected to produce
answers in days rather than months, work by
exploratory analysis and rapid iteration, and to produce
and present results with dashboards (displays of
current values) rather than papers/reports, as
statisticians normally do.
94. Fact Data
• Facts of a business process
• Quality of business: sales, cost, and profit
• In data warehousing, a fact table consists of the measurements,
metrics, or facts of a business process. It is located at the center of a
star schema or a snowflake schema, surrounded by
dimension tables. Where multiple fact tables are used, these are
arranged as a fact constellation schema.
• Fact tables are the large tables in our warehouse schema that store
business measurements. Fact tables typically contain facts and
foreign keys to the dimension tables. Fact tables represent data,
usually numeric and additive, that can be analyzed and
examined. Examples include sales, cost, and profit.
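A minimal star-schema sketch in sqlite3: one fact table with foreign keys into two dimension tables. All table names and figures here are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE date_dim    (date_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT);

    -- Fact table: numeric, additive measures plus dimension keys.
    CREATE TABLE sales_fact (
        date_id    INTEGER REFERENCES date_dim(date_id),
        product_id INTEGER REFERENCES product_dim(product_id),
        sales      REAL,
        cost       REAL
    );

    INSERT INTO date_dim    VALUES (1, 'Jan'), (2, 'Feb');
    INSERT INTO product_dim VALUES (1, 'pen');
    INSERT INTO sales_fact  VALUES (1, 1, 100.0, 60.0),
                                   (2, 1, 150.0, 80.0);
""")

# Because the facts are additive, profit per month is just
# SUM(sales) - SUM(cost), grouped through the dimension table.
for month, profit in db.execute("""
    SELECT d.month, SUM(f.sales) - SUM(f.cost)
    FROM sales_fact f
    JOIN date_dim d ON f.date_id = d.date_id
    GROUP BY d.month
    ORDER BY d.date_id
"""):
    print(month, profit)
```

The query illustrates why fact tables hold additive measures: aggregating sales and cost across any dimension (here, month) stays meaningful without special handling.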