18CSE355T - DATA MINING AND ANALYTICS
COURSE LEARNING RATIONALE (CLR)
The purpose of learning this course is to:
CLR -1: Understand the concepts of Data Mining
CLR -2: Familiarize with Association rule mining
CLR -3: Familiarize with various Classification algorithms
CLR -4: Understand the concepts of Cluster Analysis
CLR -5: Familiarize with Outlier analysis techniques
CLR -6: Familiarize with applications of Data mining in
different domains
COURSE LEARNING OUTCOMES (CLO)
At the end of this course, learners will be able to:
CLO -1: Gain knowledge about the concepts of Data Mining
CLO -2: Understand and Apply Association rule mining
techniques
CLO -3: Understand and Apply various Classification algorithms
CLO -4: Gain knowledge on the concepts of Cluster Analysis
CLO -5: Gain knowledge on Outlier analysis techniques
CLO -6: Understand the importance of applying Data mining
concepts in different domains
LEARNING RESOURCES
TEXT BOOKS
1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann Publishers, 2011.
UNIT I
INTRODUCTION
Why Data mining? - What is Data mining? - Kinds of data meant for mining - Kinds of patterns that can be mined - Applications suitable for data mining - Issues in Data mining - Data objects and Attribute types - Statistical descriptions of data - Need for data preprocessing and data quality - Data cleaning - Data integration - Data reduction - Data transformation - Data cube and its usage
Why Data Mining?
• The Explosive Growth of Data: from terabytes (1000^4 bytes) to petabytes (1000^5 bytes)
– Data collection and data availability
• Automated data collection tools, database systems, web
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: bioinformatics, scientific simulation, medical research …
• Society and everyone: news, digital cameras, …
Why Data Mining?
 The abundance of data, coupled with the need for powerful data analysis
tools, has been described as a data rich but information poor situation.
 The fast-growing, tremendous amount of data, collected and stored in large
and numerous data repositories, has far exceeded our human ability for
comprehension without powerful tools.
 As a result, data collected in large data repositories become "data tombs": data
archives that are seldom visited.
 Important decisions are often made based not on the information-rich data
stored in data repositories but rather on a decision maker’s intuition, simply
because the decision maker does not have the tools to extract the valuable
knowledge embedded in the vast amounts of data.
Why Data Mining?
 Efforts have been made to develop expert systems and knowledge-based
technologies, which typically rely on users or domain experts to manually
input knowledge into knowledge bases.
 Unfortunately, however, the manual knowledge input procedure is prone to
biases and errors and is extremely costly and time consuming.
 The widening gap between data and information calls for the systematic
development of data mining tools that can turn data tombs into “golden
nuggets” of knowledge.
Evolution of Database Technology
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data.
– Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data.
– The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system
dynamically.
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Potential Applications
• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis
Ex.: Market Analysis and Management
• Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.,
• E.g. Most customers with income level 60k – 80k with food expenses $600 -
$800 a month live in that area
– Determine customer purchasing patterns over time
• E.g. Customers who are between 20 and 29 years old, with income of 20k –
29k usually buy this type of CD player
• Cross-market analysis—Find associations/co-relations between product sales, &
predict based on such association
– E.g. Customers who buy computer A usually buy software B
Knowledge Discovery (KDD) Process
7 Steps of the KDD Process
– Data cleaning (remove noise and inconsistent data)
– Data integration (multiple data sources maybe combined)
– Data selection (data relevant to the analysis task are retrieved from database)
– Data transformation (data transformed or consolidated into forms appropriate for
mining)
(Done with data preprocessing)
– Data mining (an essential process where intelligent methods are applied to extract
data patterns)
– Pattern evaluation (identify the truly interesting patterns)
– Knowledge presentation (presentation of the mined knowledge to the user using visualization techniques such as trees, tables, rules, graphs, charts, matrices)
Data Mining and Business Intelligence
A layered view: the potential to support business decisions increases from bottom to top, and each layer is served by a different role.
– Decision Making (End User)
– Data Presentation: visualization techniques (Business Analyst)
– Data Mining: information discovery (Data Analyst)
– Data Exploration: statistical summary, querying, and reporting (Data Analyst)
– Data Preprocessing/Integration, Data Warehouses (DBA)
– Data Sources: paper, files, Web documents, scientific experiments, database systems
On What Kinds of Data Can Mining Be Performed?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Object-Relational Databases
– Temporal Databases, Sequence Databases, Time-Series databases
– Spatial Databases and Spatiotemporal Databases
– Text databases and Multimedia databases
– Heterogeneous Databases and Legacy Databases
– Data Streams
– The World-Wide Web
Relational Databases
• DBMS – database management system, contains a collection of
interrelated databases
e.g. Faculty database, student database, publications database
• Each database contains a collection of tables and functions to
manage and access the data.
e.g. student_bio, student_graduation, student_parking
• Each table contains columns and rows, with columns as attributes of data and
rows as records.
• Tables can be used to represent the relationships between or among multiple
tables.
Relational Databases (2) – AllElectronics store
Relational Databases (3)
• With a relational query language, e.g. SQL, we will be able to find answers to
questions such as:
 How many items were sold last year?
 Who has earned commissions higher than 10%?
 What were the total sales last month for Dell laptops?
• When data mining is applied to relational databases, we can search for trends or data
patterns.
• For example, data mining systems can analyze customer data to predict the credit risk of new
customers based on their income, age, and previous credit information.
Data Warehouses
• A repository of information collected from multiple sources, stored
under a unified schema, and that usually resides at a single site.
• Constructed via a process of data cleaning, data integration, data
transformation, data loading and periodic data refreshing.
Data Warehouses (2)
• Modelled by a multidimensional data structure, called a data cube, in
which each dimension corresponds to an attribute or a set of attributes.
• Each cell stores the value of some aggregate measure, such as count.
• The data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.
• Data are organized around major subjects, e.g. customer, item, supplier
and activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10
years)
• Typically summarized to a higher level (e.g. a summary of the
transactions per item type for each store)
• User can perform drill-down or roll-up operation to view the data at
different degrees of summarization
Data Warehouses (3)
Transactional Databases
• It captures a transaction, such as a flight booking, a customer's purchase, or a
user's click on a web page.
• Consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction ID and a list of the
items making up the transaction.
• Either stored in a flat file or unfolded into relational tables
• Easy to identify items that are frequently sold together
Transactional Databases
 "Which items sold well together?"
 This kind of market basket data analysis would enable you to bundle
groups of items together as a strategy for boosting sales.
 e.g. purchasing a computer along with a printer.
 A traditional database system is not able to perform market basket data
analysis.
 Data mining on transactional data can do so by mining frequent item sets
that is, sets of items that are frequently sold together.
Data Mining Functionalities
What kinds of patterns can be mined?
• Data Mining Functionalities are used to specify the kinds of patterns to be found
in data mining tasks.
• Two kinds of tasks: descriptive and predictive.
• Descriptive mining tasks characterize the general properties of the data.
• Predictive mining tasks perform inference on the current data in order to make predictions.
Types of Data Mining Functionalities
• Concept/Class Description
• Mining Frequent Patterns, Associations, and Correlations
• Classification and Regression for Predictive Analysis
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Data Mining Functionalities
- What kinds of patterns can be mined?
1. Concept/Class Description
Data can be associated with classes or concepts.
•E.g. classes of items – computers, printers, …
concepts of customers – bigSpenders, budgetSpenders, …
It can be useful to describe individual classes and concepts in summarized, concise,
and yet precise terms.
Data characterization – summarizing the general characteristics of a target class of
data.
–E.g. summarizing the characteristics of customers who spend more than $1,000 a
year at AllElectronics.
The result can be a general profile of the customers, e.g. they are 40–50 years old, employed, and have excellent credit ratings.
1.4 Data Mining Functionalities
- What kinds of patterns can be mined?
•Data discrimination – comparing the target class with one or a set of
comparative classes.
–E.g. compare the general features of software products whose sales increased by
10% in the last year
–with those whose sales decreased by 30% during the same period.
2.Mining Frequent Patterns, Associations and
Correlations
Mining Frequent Patterns ( patterns that occur frequently in data ).
Kinds of frequent patterns
– Frequent item set: a set of items that frequently appear together in a
transactional data set (e.g. milk and bread)
– Frequent subsequence: A frequently occurring subsequence, such as the
pattern that customers, tend to purchase first a laptop, followed by a digital
camera, and then a memory card, is a (frequent) sequential pattern.
– Frequent Substructures: A substructure can refer to different structural forms
(e.g., graphs, trees, or lattices) that may be combined with item sets or
subsequences.
– If a substructure occurs frequently, it is called a (frequent) structured pattern.
– Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
1.4 Data Mining Functionalities
- What kinds of patterns can be mined?
– Association Analysis: find frequent patterns
• E.g. a sample analysis result – an association rule:
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence
= 50%]
if a customer buys a computer, there is a 50% chance that she will buy
software. 1% of all of the transactions under analysis showed that
computer and software are purchased together.
• Association rules are discarded as uninteresting if they do not satisfy
both a minimum support threshold and a minimum confidence threshold.
– Correlation Analysis: additional analysis to find statistical correlations
between associated pairs
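For concreteness, here is a minimal Python sketch (with a hypothetical five-transaction data set) of how the support and confidence of a rule such as buys(computer) => buys(software) are computed:

```python
# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"computer"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(lhs => rhs) = support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"computer", "software"}, transactions))      # 0.4 (support of the rule)
print(confidence({"computer"}, {"software"}, transactions))  # 0.5 (confidence of the rule)
```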
What kinds of patterns can be mined?
3.Classification and Prediction for predictive analysis
– Classification
• It is a data analysis task, i.e. the process of finding a model that
describes and distinguishes data classes and concepts
• The goal of classification is to accurately predict the target class for each
case in the data.
• The model can be represented in classification (IF-THEN) rules,
decision trees, neural networks, etc.
What kinds of patterns can be mined?
3.Classification and Prediction for predictive analysis
A decision tree is a flowchart-like tree structure, where each node denotes a
test on an attribute value, each branch represents an outcome of the test, and
tree leaves represent classes or class distributions.
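A minimal sketch, using scikit-learn and synthetic customer records (the feature encoding below is assumed, not from the text), of learning such a decision tree and printing it as IF-THEN style rules:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded customer records: [age, income]; label 1 = buys_computer.
X = [[25, 30000], [45, 60000], [35, 80000], [22, 20000], [50, 90000], [28, 40000]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # flowchart-like test structure
print(tree.predict([[40, 70000]]))                         # predicted class for a new customer
```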
What kinds of patterns can be mined?
3.Classification and Prediction for predictive analysis
• A neural network, when used for classification, is typically a collection of
neuron-like processing units with weighted connections between the units.
• Neural networks are used for effective data mining in order to turn
raw data into useful information.
• Neural networks look for patterns in large batches of data, allowing
businesses to learn more about their customers, which helps direct their
marketing strategies, increase sales, and lower costs.
What kinds of patterns can be mined?
Classification and Regression
 Regression is used to predict missing or unavailable numerical data values
rather than (discrete) class labels.
 The term prediction refers to both numeric prediction and class label
prediction.
 Regression analysis is a statistical methodology that is most often used for
numeric prediction, although other methods exist as well.
 Regression also encompasses the identification of distribution trends based
on the available data.
 Classification and regression may need to be preceded by relevance
analysis, which attempts to identify attributes that are significantly relevant
to the classification and regression process.
 Such attributes will be selected for the classification and regression
process. Other attributes, which are irrelevant, can then be excluded from
consideration
What kinds of patterns can be mined?
4.Cluster Analysis
– Clustering can be used to generate class labels for a group of data.
– Clusters of objects are formed based on the principle of maximizing intra-
class similarity & minimizing interclass similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters
may
represent individual target groups for marketing.
What kinds of patterns can be mined?
5. Outlier Analysis: identify objects that deviate from the general behavior of the data
– A data set may contain objects that do not comply with the general behavior or
model of the data. These data objects are outliers
– Outliers are usually discarded as noise or exceptions.
– Useful for fraud detection, where an outlier may indicate fraudulent activity.
• E.g. Detect purchases of extremely large amounts
6.Evolution Analysis
– Describes and models regularities or trends for objects whose behavior
changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the
stocks of particular companies.
What kinds of patterns can be mined?
Are All of the Patterns Interesting?
• Data mining may generate thousands of patterns: Not all of them
are interesting
• A pattern is interesting if it is
– easily understood by humans
– valid on new or test data with some degree of certainty,
– potentially useful
– novel
– validates some hypothesis that a user seeks to confirm
• An interesting pattern represents knowledge!
What kinds of patterns can be mined?
Are All of the Patterns Interesting?
• Objective measures
– Based on statistics and structures of patterns, e.g., support, confidence, etc. (Rules
that do not satisfy a threshold are considered uninteresting.)
• Subjective measures
– Reflect the needs and interests of a particular user.
• E.g. A marketing manager is only interested in characteristics of customers who shop
frequently.
– Based on user’s belief in the data.
• e.g., Patterns are interesting if they are unexpected, or can be used for strategic planning, etc
• Objective and subjective measures need to be combined.
Major Issues in Data Mining
There are many challenging issues in data mining research.
Areas include
i. Mining methodology
ii. User interaction
iii. Efficiency and scalability
iv. Dealing with diverse data types
v. Data mining and society
Major Issues in Data Mining
■ Mining Methodology
■ Mining various and new kinds of knowledge
■ Mining knowledge in multi-dimensional space
■ Data mining: An interdisciplinary effort
■ Boosting the power of discovery in a networked environment
■ Handling noise, uncertainty, and incompleteness of data
■ Pattern evaluation and pattern- or constraint-guided mining
Major Issues in Data Mining
■ Mining Methodology
■ Mining various and new kinds of knowledge
 Different users may use the same database in different ways, which requires the development of
numerous data mining techniques.
 Due to the diversity of applications, new mining tasks continue to emerge,
making data mining a dynamic and fast-growing field.
e.g.
for effective knowledge discovery in information networks, integrated clustering
and ranking may lead to the discovery of high-quality clusters and object ranks
in large networks.
Major Issues in Data Mining
■ Mining Methodology
■ Mining knowledge in multi-dimensional space
■ When searching for knowledge in large data sets, we can explore the data in
multidimensional space.
■ That is, we can search for interesting patterns among combinations of
dimensions (attributes) at varying levels of abstraction. Such mining is
known as (exploratory) multidimensional data mining.
■ In many cases, data can be aggregated or viewed as a multidimensional data
cube.
■ Mining knowledge in cube space can enhance the power and flexibility of
data mining.
Major Issues in Data Mining
Data mining an interdisciplinary effort:
The power of data mining can be enhanced by integrating new methods from
multiple disciplines.
E.g. - mining data in natural language text
- the mining of software bugs in large programs (bug mining), which requires
software engineering knowledge.
Boosting the power of discovery in a networked environment:
• Most data objects reside in a linked or interconnected environment,
whether it be the Web, database relations, files, or documents.
• Semantic links across multiple data objects can be used to advantage in
data mining.
• Knowledge derived from one set of objects can be used to aid the discovery of
knowledge in a "related" set of objects.
Major Issues in Data Mining
Handling uncertainty, noise, or incompleteness of data:
 Data often contain noise, errors, exceptions, or uncertainty, or are incomplete.
 Errors and noise may confuse the data mining process, leading to the derivation
of erroneous patterns.
 Data cleaning, data preprocessing, outlier detection and removal are examples
of techniques that need to be integrated with the data mining process.
Pattern evaluation and pattern- or constraint-guided mining:
 Not all the patterns generated by data mining processes are interesting.
 What makes a pattern interesting may vary from user to user.
 Therefore, techniques are needed to assess the interestingness of discovered
patterns based on subjective measures.
Major Issues in Data Mining
ii)User Interaction
 The user plays an important role in the data mining process.
 Interesting areas of research include how to interact with a data mining system,
how to incorporate a user’s background knowledge in mining, and how to
visualize data mining results.
Interactive mining:
 The data mining process should be highly interactive.
 Thus, it is important to build flexible user interfaces.
Incorporation of background knowledge:
 Background knowledge, constraints, rules, and other information regarding the
domain should be incorporated into the knowledge discovery process.
 Such knowledge can be used for pattern evaluation and guide the search to find
interesting patterns.
Major Issues in Data Mining
Presentation and visualization of data mining results:
 How can a data mining system present data mining results flexibly?
 This is especially crucial if the data mining process is interactive.
 It requires the system to adopt expressive knowledge representations, user
friendly interfaces, and visualization techniques.
iii)Efficiency and Scalability
 Efficiency and scalability are always considered when comparing data
mining algorithms.
 As data amounts continue to multiply, these two factors are especially
critical.
Major Issues in Data Mining
Efficiency and scalability of data mining algorithms:
 Data mining algorithms must be efficient and scalable in order to
effectively extract information from huge amounts of data .
 The running time of a data mining algorithm must be predictable, short, and
acceptable by applications.
 Efficiency, scalability, performance, optimization, and the ability to execute
in real time are key criteria that drive the development of many new data
mining algorithms.
Parallel, distributed, and incremental mining algorithms:
 The parallel processes may interact with one another. The patterns from
each partition are eventually merged.
Major Issues in Data Mining
iv)Diversity of Database Types
The wide diversity of database types brings about challenges to data mining.
Handling complex types of data
Mining dynamic, networked, and global data repositories
Handling complex types of data
 The construction of effective and efficient data mining tools for diverse
applications remains a challenging and active area of research.
Mining dynamic, networked, and global data repositories
Multiple sources of data are connected by the Internet and various kinds of
networks, forming distributed, and heterogeneous global information systems
and networks.
Major Issues in Data Mining
 The discovery of knowledge from different sources of structured, semi-
structured, or unstructured and interconnected data with diverse data
semantics poses great challenges to data mining.
 Web mining, multisource data mining, and information network mining
have become challenging and fast-evolving data mining fields.
v)Data Mining and Society
How does data mining impact society?
Social impacts of data mining:
 The improper use of data and the potential violation of individual privacy
and data protection rights are areas of concern that need to be addressed.
Major Issues in Data Mining
Privacy-preserving data mining:
 Data mining will help scientific discovery, business management, economic
recovery, and security protection (e.g., the real-time discovery of intruders
and cyberattacks).
 However, it poses the risk of disclosing an individual’s personal
information.
 We need to observe data sensitivity and preserve people's privacy while performing
successful data mining.
Major Issues in Data Mining
Invisible data mining:
 We cannot expect everyone in society to learn and master data mining
techniques.
 More and more systems should have data mining functions built within so
that people can perform data mining or use data mining results simply by
mouse clicking, without any knowledge of data mining algorithms.
 Intelligent search engines and Internet-based stores perform such invisible
data mining by incorporating data mining into their components to improve
their functionality and performance. This is done often unbeknownst to the
user.
 For example, when purchasing items online, users may be unaware that the
store is likely collecting data on the buying patterns of its customers, which
may then be used to recommend other items of interest.
Data Objects and Attribute Types
A data object is a region of storage that contains a value or group of values.
Data sets are made up of data objects.
A data object represents an entity
• in a sales database, the objects may be customers, store items, and sales;
• in a university database, the objects may be students, professors, and courses.
Rows -> data objects; Columns -> attributes
Data objects are typically described by attributes.
Data objects can also be referred to as samples, examples, instances, data points,
or objects.
If the data objects are stored in a database, they are data tuples.
That is, the rows of a database correspond to the data objects, and the columns
correspond to the attributes.
Data Objects and Attribute Types
An attribute is a data field, representing a characteristic or feature of a
data object.
The type of an attribute is determined by the set of possible values: nominal,
binary, ordinal, or numerical
• Nominal Attributes provide only enough information to distinguish one
object from another.
 - relating to names
 e.g. hair_color = {brown, black, white}
• e.g. Student Roll No.
• Ordinal Attribute:
• Values have a meaningful order.
The ordinal attribute value provides sufficient information to order the
objects.
e.g. pizza_size = {small, medium, large}
Rankings, grades, height
Data Objects and Attribute Types
• Binary Attribute:
These have only two states, 0 and 1, where 0 indicates the absence of a feature and 1
indicates its presence.
• Binary attributes are qualitative (e.g. smoker = {0, 1}).
• Numeric attribute: It is quantitative, i.e. a measurable quantity, represented
in integer or real values.
Two types
i) Interval-scaled attribute:
It is measured on a scale of equal-size units.
These attributes allow us to compare values, such as temperature in °C or °F; thus
the values of the attribute have an order.
ii) Ratio-scaled attribute:
Both differences and ratios (multiples) are meaningful.
e.g. age, length, weight.
e.g. we can say that 10 is twice (a multiple of) 5.
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
Why Data Preprocessing?
Data Quality:
Data have quality if they satisfy the requirements of the intended use.
Factors comprising quality
• accuracy
• completeness
• consistency
• timeliness
• believability
• interpretability
• Accessibility
e.g.: Analyzing branch sales at the AllElectronics store illustrates
three of the elements defining data quality:
• Accuracy
• completeness
• consistency
Why Data Preprocessing?
Data Quality:
Reasons for inaccurate data(having incorrect attribute value)
• The data collection instruments used may be faulty.
• There may have been human or computer errors occurring at data entry.
• Users may purposely submit incorrect data values for mandatory fields when they do not wish to disclose personal information (e.g. choosing the default value "January 1" for birthday).
• This is known as disguised missing data.
 There may be technology limitations such as limited buffer size for
coordinating synchronized data transfer and consumption.
Why Data Preprocessing?
Data Quality:
Reason for Incorrect data
• inconsistencies in naming conventions or data codes, or inconsistent formats for
input fields (e.g., date).
• Duplicate tuples also require data cleaning.
Reasons for Incomplete data
• Missing the Attribute information.
 e.g customer information for sales transaction data.
• Relevant data may not be recorded due to a misunderstanding.
• Data that were inconsistent with other recorded data, may have been deleted.
Why Data Preprocessing?
Data Quality:
Timeliness also affects data quality.
e.g: monthly sales bonuses to the top sales representatives at All Electronics
Store.
2 other factors affect quality
Believability reflects how much the data are trusted by users.
Interpretability reflects how easy the data are understood
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, files, or notes
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data reduction
– Obtains a reduced representation in volume that produces the same or similar
analytical results
– Data discretization: of particular importance, especially for numerical data
– Data aggregation, dimensionality reduction, data compression, generalization
Forms of data preprocessing
Data Cleaning
■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
■ incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
■ e.g., Occupation=“ ” (missing data)
■ noisy: containing noise, errors, or outliers
■ e.g., Salary=“−10” (an error)
■ inconsistent: containing discrepancies in codes or names, e.g.,
■ Age=“42”, Birthday=“03/07/2010”
■ Was rating “1, 2, 3”, now rating “A, B, C”
■ discrepancy between duplicate records
■ Intentional (e.g., disguised missing data)
■ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
■ Data is not always available
■ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
■ Missing data may be due to
■ equipment malfunction
■ inconsistent with other recorded data and thus deleted
■ data not entered due to misunderstanding
■ certain data may not be considered important at the time of entry
■ not register history or changes of the data
■ Missing data may need to be inferred
Incomplete (Missing) Data
Methods for data cleaning:
1.Missing values
2.Noisy data
3. Data cleaning as a process.
1.How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
■ Fill in the missing value manually: this approach is time consuming and may
not be feasible given a large data set with many missing values.
■ Use a global constant to fill in the missing value:
 Replace all missing attribute values by the same constant such as a label like
“Unknown” or −∞.
 If missing values are replaced by, “Unknown,” then the mining program may
mistakenly think that they form an interesting concept, since they all have a
value in common—that of “Unknown.”
 Hence, although this method is simple, it is not foolproof.
1.How to Handle Missing Data?
■ Use the attribute mean or median for all samples belonging to the same
class as the given tuple:
 For example, if classifying customers according to credit risk, we may replace
the missing value with the mean income value for customers in the same credit
risk category as that of the given tuple.
 If the data distribution for a given class is skewed, the median value is a better
choice.
How to Handle Missing Data?
■ Use the most probable value to fill in the missing value:
 This may be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction.
 For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
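A minimal pandas sketch (column names are hypothetical) of the class-mean strategy from the previous slide: each missing income is filled with the mean income of tuples in the same credit-risk class.

```python
import pandas as pd

# Hypothetical customer data with a missing income value.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [60000, 58000, 25000, None, 62000],
})

# Replace each missing income with the mean income of its credit-risk class.
df["income"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(df)   # the missing "high" income becomes 25000, the class mean
```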
2.Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments
■ data entry problems
■ data transmission problems
■ technology limitation
■ inconsistency in naming convention
■ Other data problems which require data cleaning
■ duplicate records
■ incomplete data
■ inconsistent data
How to Handle Noisy Data?
■ Binning
■ Binning is a way to group a
number of more or less
continuous values into a smaller
number of "bins".
■ The sorted values are distributed
into a number of “buckets,” or
bins .
■ For example, if you have data about a
group of people, you might want
to arrange their ages into a
smaller number of age intervals.
Sorted data for price (in dollars): 4,
8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency)
bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
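A minimal Python sketch reproducing the equal-frequency binning and both smoothing variants shown above:

```python
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]   # equal-frequency bins of size 3

means = [[round(sum(b) / len(b))] * len(b) for b in bins]    # smoothing by bin means
bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
          for b in bins]                                     # smoothing by bin boundaries

print(bins)    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```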
How to Handle Noisy Data?
Regression:
 Regression is a technique used to model and analyze the relationships
between variables.
 Predicts the value.
 Ex. Predict the children height ,given their age ,weight and other
factors.
 Linear regression involves finding the “best” line to fit two attributes
so that one attribute can be used to predict the other.
 Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
Outlier analysis:
 Outliers may be detected by clustering,
 for example, where similar values are organized into groups, or "clusters";
 values that fall outside of the set of clusters may be considered outliers.
Data Cleaning as a Process
■ Data cleaning is usually performed as an iterative two-step process consisting
of discrepancy detection and data transformation.
■ First step is Data discrepancy detection.
 Discrepancies can be caused by several factors
 poorly designed data entry forms that have many optional fields.
 human error in data entry
 deliberate errors
e.g., respondents not wanting to divulge information about themselves.
 Other sources of discrepancies include errors in instrumentation devices that
record data and system errors.
 Errors can also occur when the data are used for purposes other than
originally intended.
 There may also be inconsistencies due to data integration.
Data Cleaning as a Process
■ "How can we proceed with discrepancy detection?"
■ Data discrepancy detection
Data auditing tools find discrepancies by analyzing the data to discover rules and
relationships, and detecting data that violate such conditions.
■ Use metadata
e.g., domain, range, dependency
Check field overloading
■ Check the uniqueness rule, consecutive rule, and null rule to examine the data.
■ Use commercial tools
■ Data scrubbing tools use simple domain knowledge to detect errors and make
corrections.
e.g., postal codes, spell-checking
■ Data auditing tools find discrepancies by analyzing data to discover rules and
relationships and to detect violators.
■ e.g., correlation and clustering to find outliers
Data Cleaning as a Process
■ Once discrepancies are found, we need to define and apply transformations to correct them.
■ Data migration and integration
■ Data migration tools allow simple transformations to be specified such as
to replace the string “gender” by “sex.”
■ ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface.
■ The two-step process of discrepancy detection and data transformation
iterates.
Data Cleaning as a Process
 This iterative process is error-prone and time consuming; as a result, the entire data cleaning process suffers from a lack of interactivity.
 New approaches to data cleaning emphasize increased interactivity.
 Potter's Wheel is a publicly available data cleaning tool that integrates
discrepancy detection and transformation.
 The tool automatically performs discrepancy checking in the background on the
latest transformed view of the data.
 Users can gradually develop and refine transformations as discrepancies are
found, leading to more effective and efficient data cleaning.
 For data transformation, declarative extensions to SQL and supporting algorithms enable users
to express data cleaning specifications efficiently.
 As more is learned about the data, it is important to keep updating the metadata to reflect this knowledge. This will
help speed up data cleaning on future versions of the same data store.
Data Integration
■ Data integration:
■ Merging of data from multiple sources into a coherent store
How can we match schema and objects from different sources?
1.Entity identification problem
2. Redundancy and correlation analysis
3.Tuple duplication
4.Data value conflict detection and resolution
Data Integration
1.Entity identification problem
How can equivalent real world entities from multiple data sources be matched up?
■ Identify real world entities from multiple data sources,
e.g., cust_id in one database and customer_number in another may refer to the same attribute.
 Metadata can be used to help avoid errors in schema integration.
Redundancy and Correlation Analysis
■ An attribute may be redundant if it can be “derived” from another attribute or
set of attributes.
■ E.g: annual revenue
■ Inconsistencies in attribute or dimension naming can also cause redundancies
in the resulting data set.
■ Redundant attributes can be detected by correlation analysis and covariance analysis.
■ Given two attributes, such analysis can measure how strongly one attribute
implies the other, based on the available data.
■ For nominal data, use the χ 2 (chi-square) test.
■ For numeric attributes, we can use the correlation coefficient and covariance, both
of which assess how one attribute's values vary from those of another.
■ Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality.
Correlation Analysis for Nominal Data
■ A correlation relationship between two attributes, A and B, can be discovered
by a χ² (chi-square) test:
   χ² = Σ (observed − expected)² / expected,
   where the expected count of a cell is (row total × column total) / grand total.
■ The larger the χ² value, the more likely the variables are related.
■ The cells that contribute the most to the χ² value are those whose actual
count is very different from the expected count.
■ Correlation does not imply causality
■ # of hospitals and # of car thefts in a city are correlated
■ Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
■ χ² (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories)

                            Play chess   Not play chess   Sum (row)
  Like science fiction       250 (90)      200 (360)        450
  Not like science fiction    50 (210)    1000 (840)       1050
  Sum (col.)                 300          1200             1500

■ The result shows that like_science_fiction and play_chess are correlated in the group
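A minimal Python sketch that derives the expected counts and the χ² statistic for the table above; the result (about 507.93) far exceeds the 0.001 significance threshold, supporting the conclusion that the two attributes are correlated.

```python
observed = [[250, 200],   # like science fiction:     [play chess, not play chess]
            [50, 1000]]   # not like science fiction

row_sums = [sum(r) for r in observed]        # [450, 1050]
col_sums = [sum(c) for c in zip(*observed)]  # [300, 1200]
total = sum(row_sums)                        # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / total  # expected count, e.g. 450*300/1500 = 90
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # 507.93 -> well above 10.828, the 0.001 critical value for 1 degree
                       # of freedom, so the hypothesis of independence is rejected
```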
Correlation Analysis for Numeric Data
■ Correlation coefficient (also called Pearson's product-moment coefficient):

   r_A,B = Σ (a_i − Ā)(b_i − B̄) / (n · σ_A · σ_B) = (Σ a_i·b_i − n·Ā·B̄) / (n · σ_A · σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and
B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i·b_i is the
sum of the AB cross-product.
■ If r_A,B > 0, A and B are positively correlated (A's values increase
as B's do). The higher the value, the stronger the correlation.
■ r_A,B = 0: independent (uncorrelated); r_A,B < 0: negatively correlated (the values of
one attribute increase as the values of the other attribute decrease)
Correlation Analysis for Numeric Data
• correlation does not imply causality.
• That is, if A and B are correlated, this does not necessarily imply that A
causes B or that B causes A.
• For example, in analyzing a demographic database, we may find that
attributes representing the number of hospitals and the number of car thefts
in a region are correlated.
• This does not mean that one causes the other. Both are actually causally
linked to a third attribute, namely, population
Covariance of Numeric Data
■ Covariance refers to the measure of the directional relationship between two
random variables:

   Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ (a_i − Ā)(b_i − B̄) / n

where n is the number of tuples, Ā and B̄ are the respective means or expected
values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
■ Positive covariance: If Cov(A, B) > 0, then A and B both tend to be larger than their
expected values.
■ Negative covariance: If Cov(A, B) < 0, then if A is larger than its expected value, B
is likely to be smaller than its expected value.
■ Independence: If A and B are independent, Cov(A, B) = 0, but the converse is not true:
■ Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions does a covariance of 0 imply independence.
Correlation coefficient:

   r_A,B = Cov(A, B) / (σ_A · σ_B)
Co-Variance: An Example
■ It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
■ Suppose two stocks A and B have the following values in one week: (2, 5), (3,
8), (5, 10), (4, 11), (6, 14).
■ Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
■ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
■ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
■ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
■ Thus, A and B rise together since Cov(A, B) > 0.
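A minimal Python sketch verifying this computation and also deriving the correlation coefficient for the same two stocks:

```python
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_A, mean_B = sum(A) / n, sum(B) / n                        # 4 and 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_A * mean_B   # E(A*B) - mean_A*mean_B

std_A = (sum((a - mean_A) ** 2 for a in A) / n) ** 0.5
std_B = (sum((b - mean_B) ** 2 for b in B) / n) ** 0.5
r = cov / (std_A * std_B)                                      # correlation coefficient

print(cov)          # 4.0  -> positive, so the stocks tend to rise together
print(round(r, 3))  # 0.941 -> strong positive correlation
```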
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level.
• e.g., where there are two or more identical tuples for a given unique data entry
case
 Inconsistencies arise between various duplicates, due to inaccurate data entry
or updating some but not all data occurrences.
e.g., if a purchase order database contains attributes for the purchaser's name and
address instead of a key to this information in a purchaser database, discrepancies
can occur, such as the same purchaser’s name appearing with different addresses
within the purchase order database
Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts.
For example, for the same real-world entity, attribute values from different sources
may differ.
This may be due to differences in representation, scaling, or encoding.
e.g,
 a weight attribute may be stored in metric units in one system and British imperial units in another.
 Different universities may adopt a quarter or a semester system and offer different database
courses, which makes it difficult to work out a grade-exchange scheme between them.
Attributes may also differ on the abstraction level.
 An attribute in one system may be recorded at a lower abstraction level than the "same"
attribute in another.
e.g.:
 the total sales in one database may refer to one branch of AllElectronics, while an
attribute of the same name in another database may refer to the total sales for
AllElectronics stores in a given region.
Data Reduction Strategies
■ Data reduction:
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the
original data.
Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
1.Data reduction strategies
 dimensionality reduction
 numerosity reduction
 data compression.
Data Reduction Strategies
Dimensionality reduction represents the original data in a compressed or reduced
form by applying encoding schemes or transformations.
It is the process of reducing the number of random variables or attributes under
consideration (removing unimportant attributes).
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
(Wavelet transforms and PCA transform the original data onto a smaller space.)
Data Reduction Strategies
■ Numerosity reduction reduces the data volume by choosing alternative, smaller
forms of data representation.
■ These techniques may be parametric or nonparametric.
In parametric methods, a model is used to estimate the data, so that only the model
parameters need to be stored instead of the actual data. (Outliers may also be stored.)
e.g.:
regression and log-linear models.
Nonparametric methods store reduced representations of the data, such as
histograms.
e.g.:
clustering, sampling, and data cube aggregation.
Data Reduction Strategies
Data compression
 In data compression, transformations are applied to obtain a reduced or
"compressed" representation of the original data.
 If the original data can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless.
 If we can reconstruct only an approximation of the original data, then the data
reduction is called lossy.
 There are several lossless algorithms for string compression, but they typically allow
only limited data manipulation.
 Dimensionality reduction and numerosity reduction techniques can also be
considered forms of data compression.
 The time spent on data reduction should not outweigh the time saved by mining on
a reduced data set.
Data Reduction :Wavelet Transform
■ The discrete wavelet transform (DWT) is a linear signal processing technique that,
when applied to a data vector X, transforms it to a numerically different vector, X′,
of wavelet coefficients.
■ All wavelet coefficients larger than some user-defined threshold can be retained;
the remaining coefficients are set to 0.
■ This also helps remove noise from the data.
■ 1.The length, L, of the input data vector must be an integer power of 2. This
condition can be met by padding the data vector with zeros as necessary (L ≥ n).
■ 2. Each transform involves applying two functions. The first applies some data
smoothing, such as a sum or weighted average.
■ The second performs a weighted difference, which acts to bring out the detailed
features of the data.
Wavelet Transform
■ 3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (x2i ,x2i+1). This results in two data sets of length L/2.
■ These represent a smoothed or low-frequency version of the input data and the
high frequency content of it, respectively.
■ 4. The two functions are applied recursively to the smoothed data sets obtained in the previous loop, until the resulting data sets reach the desired length.
■ 5. Selected values from the data sets obtained in the previous iterations are
designated the wavelet coefficients of the transformed data.
 a matrix multiplication can be applied to the input data in order to obtain the
wavelet coefficients.
 The matrix must be orthonormal.
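A minimal Python sketch (a plain Haar-style transform with the usual 1/√2 scaling, assumed here for illustration) of the pairwise smoothing/difference steps applied recursively:

```python
import math

def haar_step(x):
    """Split x into a smoothed half (pairwise sums) and a detail half
    (pairwise differences), scaled by 1/sqrt(2) to keep the transform orthonormal."""
    s = 1 / math.sqrt(2)
    smooth = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    detail = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return smooth, detail

def haar_dwt(x):
    """Apply the smoothing/difference pair recursively to the smoothed half."""
    coeffs = []
    while len(x) > 1:
        x, detail = haar_step(x)
        coeffs = detail + coeffs
    return x + coeffs          # overall average first, then the detail coefficients

# Input length must be a power of 2 (pad with zeros otherwise).
print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```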
Wavelet Transform
■ Wavelet transforms can be applied to multidimensional data such as a data cube.
This is done by first applying the transform to the first dimension, then to the
second, and so on.
■ real world applications
compression of fingerprint images
 computer vision
 analysis of time-series data
 data cleaning
Principal Component Analysis (PCA)
■ dimensionality-reduction method
■ used to reduce the dimensionality of large data sets, by transforming a large set of
variables into a smaller one.
Principal Component Analysis (PCA)
• Getting the principal components of the
data matrix X.
• Procedure:
• The first principal component is the
normalized linear combination of the
variables that has the highest variance.
• The second principal component has the
largest variance, subject to being
uncorrelated with the first.
• The principal components produce
linear combinations of the data that are
high in variance and are uncorrelated.
The direction in which the data varies the
most actually falls along the green line.
This is the direction with the most
variation in the data, this is why it's the
first principal component.
Principal Component Analysis (Steps)
■ PCA can be applied to ordered and unordered attributes, and can handle sparse
data and skewed data.
■ Multidimensional data of more than two dimensions can be handled by
reducing the problem to two dimensions.
■ Principal components may be used as inputs to multiple regression and cluster
analysis.
■ In comparison with wavelet transforms, PCA tends to be better at handling
sparse data, whereas wavelet transforms are more suitable for data of high
dimensionality.
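A minimal scikit-learn sketch (synthetic two-attribute data) of projecting the data onto its first principal component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# Two correlated attributes, so most variance lies along one direction.
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.3, size=100)])

pca = PCA(n_components=1)          # keep the single highest-variance component
X_reduced = pca.fit_transform(X)   # 100 x 1 representation of the 100 x 2 data

print(pca.explained_variance_ratio_)  # most of the variance is captured by one component
```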
Attribute Subset Selection
• Why attribute subset selection
– Data sets for analysis may contain hundreds of attributes, many of which may
be irrelevant to the mining task or redundant.
– For example,
◆ if the task is to classify customers as to whether or not they are likely to
purchase a popular new CD at AllElectronics when notified of a sale,
attributes such as the customer’s telephone number are likely to be
irrelevant, unlike attributes such as age or music taste.
Attribute Subset Selection
• Using domain expert to pick out some of the useful attributes
– Sometimes this can be a difficult and time-consuming task, especially when the
behavior of the data is not well known.
• Leaving out relevant attributes or keeping irrelevant attributes result in discovered
patterns of poor quality.
• the added volume of irrelevant or redundant attributes can slow down the mining
process.
Attribute Subset Selection
• Attribute subset selection (feature selection):
– Reduce the data set size by removing irrelevant or redundant attributes.
– Goal: select a minimum set of features (attributes) such that the probability
distribution of different classes given the values for those features is as close as
possible to the original distribution given the values of all features
– It reduces the number of attributes appearing in the discovered patterns, helping
to make the patterns easier to understand.
Attribute Subset Selection
• How can we find a ‘good’ subset of the original attributes?
– For n attributes, there are 2^n possible subsets.
– An exhaustive search for the optimal subset of attributes can be
prohibitively expensive, especially as n increase.
– Heuristic methods are commonly used for attribute subset selection.
– These methods are typically greedy in that, while searching through
attribute space, they always make what looks to be the best choice at the
time.
– Such greedy methods are effective in practice and may come close to
estimating an optimal solution.
Attribute Subset Selection
• Heuristic methods:
– Step-wise forward selection
– Step-wise backward elimination
– Combining forward selection and backward elimination
– Decision-tree induction
• The “best” and “worst” attributes are typically determined using:
– the tests of statistical significance, which assume that the attributes are
independent of one another.
– the information gain measure used in building decision trees for
classification.
Attribute Subset Selection
• Stepwise forward selection:
– The procedure starts with an empty set of attributes as the
reduced set.
– First: The best single-feature is picked.
– Next: At each subsequent iteration or step, the best of the
remaining original attributes is added to the set.
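A minimal Python sketch of this greedy procedure; the score function is a hypothetical stand-in for whatever quality measure (e.g. information gain or accuracy estimated on the data) is actually used:

```python
def forward_selection(attributes, score, k):
    """Greedy stepwise forward selection of at most k attributes.
    `score(subset)` is an assumed quality measure evaluated on the data."""
    selected = []
    while len(selected) < k:
        best = max((a for a in attributes if a not in selected),
                   key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                      # no remaining attribute improves the subset
        selected.append(best)
    return selected

# Toy score that prefers the hypothetical attributes "age" and "income".
toy = lambda s: len(set(s) & {"age", "income"})
print(forward_selection(["age", "income", "phone", "zip"], toy, 3))  # ['age', 'income']
```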
Attribute Subset Selection
• Stepwise backward elimination:
– The procedure starts with the full set of attributes.
– At each step, it removes the worst attribute remaining in the set.
Attribute Subset Selection
• Combining forward selection and backward elimination:
– The stepwise forward selection and backward elimination
methods can be combined
– At each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
Attribute Subset Selection
• Decision tree induction:
– Decision tree induction constructs a flowchart-like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction.
– At each node, the algorithm chooses the “best” attribute to partition the
data into individual classes.
– When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data.
– All attributes that do not appear in the tree are irrelevant.
Attribute Subset Selection
• Decision tree induction
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• There are several methods for storing reduced representations of
the data include histograms, clustering, and sampling.
Data Reduction: Sampling
• Sampling: obtaining a small sample s to represent the whole
data set N
• Suppose that a large data set, D, contains N instances.
• The most common ways that we could sample D for data
reduction:
– Simple random sample without replacement (SRSWOR)
– Simple random sample with replacement (SRSWR)
– Cluster sample
– Stratified sample
Data Reduction: Sampling
• Simple random sample without replacement (SRSWOR) of size s:
– SRSWOR is a method of selection of n units out of the N units one by one such
that at any stage of selection, any one of the remaining units has the same
chance of being selected, i.e. 1/N.
• Simple random sample with replacement (SRSWR) of size s:
– SRSWR is a method of selection of n units out of the N units one by one such
that at each stage of selection, each unit has an equal chance of being selected,
i.e., 1/ N.
Data Reduction: Sampling
• Procedure of selection of a random sample:
• 1. Identify the N units in the population with the numbers 1 to N.
• 2. Choose any random number arbitrarily in the random number table and start
reading numbers.
• 3. Choose the sampling unit whose serial number corresponds to the random
number drawn from the table of random numbers.
• 4. In the case of SRSWR, all the random numbers are accepted even if repeated
more than once.
• In the case of SRSWOR, if any random number is repeated, then it is ignored, and
more numbers are drawn.
Data Reduction:
Sampling
Raw Data
Data Reduction: Sampling
• Stratified Sample:
– This technique divides the elements of the population into small
subgroups (strata) based on similarity,
– in such a way that the elements within a group are homogeneous and
heterogeneous across the other subgroups formed.
– the elements are randomly selected from each of these strata.
– We need to have prior information about the population to create
subgroups.
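A minimal Python sketch (toy data, hypothetical strata) of the three sampling schemes: SRSWOR, SRSWR, and stratified sampling.

```python
import random

data = list(range(1, 101))                         # N = 100 records
random.seed(1)

srswor = random.sample(data, 10)                   # without replacement
srswr = [random.choice(data) for _ in range(10)]   # with replacement (repeats allowed)

# Stratified sample: pick proportionally from each (hypothetical) stratum.
strata = {
    "young":  list(range(1, 41)),
    "middle": list(range(41, 81)),
    "senior": list(range(81, 101)),
}
stratified = {name: random.sample(group, max(1, len(group) // 10))
              for name, group in strata.items()}

print(srswor, srswr, stratified, sep="\n")
```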
Data Reduction: Sampling
Raw Data Stratified Sample
Data Cube Aggregation
• used to aggregate data in a simpler form.
• Example
• Imagine that the information gathered for your analysis covers the years
2012 to 2014 and includes the revenue of your company for every
quarter.
• If you are interested in the annual sales rather than the quarterly
totals, the data can be aggregated so that the resulting
data summarize the total sales per year instead of per quarter.
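A minimal pandas sketch (the quarterly figures are made up for illustration) of rolling quarterly sales up to annual totals:

```python
import pandas as pd

# Hypothetical quarterly sales figures for 2012-2014.
quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [224, 408, 350, 586, 680, 585, 789, 844, 401, 323, 360, 533],
})

# Roll up from quarter level to year level (one row per year).
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```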
Data Cube Aggregation
• Sales data for a given branch of AllElectronics for the years
2002 to 2004.
Data Cube Aggregation
• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to precomputed, summarized data, thereby
benefiting on-line analytical processing as well as data mining.
• A data cube for sales at AllElectronics.
Data Cube Aggregation
• Base cuboid:
– The cube created at the lowest level of abstraction is referred to as
the base cuboid.
– The base cuboid should correspond to an individual entity of
interest, such as sales or customer.
• Apex cuboid:
– A cube at the highest level of abstraction is the apex cuboid.
– For the sales data, the apex cuboid would give one total— the total
sales.
Parametric Data Reduction: Regression
and Log-Linear Models
■ In parametric methods, data are represented using some model.
■ Regression can be a simple linear regression or multiple linear regression.
■ simple linear regression
only single independent attribute
■ multiple linear regression
multiple independent attributes
■ The data are modeled to fit a straight line.
■ ex,
■ a random variable y can be modeled as a linear function of another random
variable x with the equation y = ax+b
where a and b (regression coefficients) specifies the slope and y-intercept of
the line, respectively.
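A minimal numpy sketch (synthetic points) of fitting the regression coefficients a and b by least squares, so that the fitted line can stand in for the raw data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
# Noisy observations of y = 3x + 2.
y = 3 * x + 2 + np.random.default_rng(0).normal(scale=0.1, size=5)

a, b = np.polyfit(x, y, deg=1)   # least-squares fit of y = a*x + b
print(a, b)                      # close to 3 and 2
print(a * 6 + b)                 # predicted y for a new x = 6
```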
Parametric Data Reduction: Regression
and Log-Linear Models
• Log-Linear Model:
Log-linear model can be used to estimate the probability of each data point
in a multidimensional space for a set of discretized attributes, based on a
smaller subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from lower-
dimensional attributes.
• Regression and log-linear model can both be used on sparse data, although
their application may be limited.
Histogram Analysis
■ A histogram represents the data in
terms of frequencies.
■ It uses binning to approximate the data
distribution.
■ It is a popular form of data reduction.
Clustering
■ Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
■ In data reduction, the cluster representation of the data are used to replace
the actual data.
■ It also helps to detect outliers in data
Data Transformation
■ data transformed or consolidated into forms appropriate for mining
■ Methods
■ Smoothing: Remove noise from data
■ Attribute/feature construction
■ New attributes constructed from the given ones
■ Aggregation: Summarization, data cube construction
■ Normalization: Scaled to fall within a smaller, specified range
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
■ Discretization: divide the range of continuous attribute into intervals.
Normalization
• Min-max normalization:
• This transforms the original data linearly.
• Suppose that min_A is the minimum and max_A is the maximum value of an
attribute, A.
• Formula: v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
• Where v is the original attribute value.
• v' is the new value you get after normalizing the old value.
Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000
is mapped to (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709.
Normalization
■ Z-score normalization
• In z-score (zero-mean) normalization, the values of an attribute A are normalized
based on the mean (μ) and standard deviation (σ) of A.
• A value, v, of attribute A is normalized to v' by computing v' = (v - μ) / σ
• Ex. Let μ = 54,000, σ = 16,000. Then a value of $73,000 is normalized to
(73,000 - 54,000) / 16,000 ≈ 1.19.
• Decimal Scaling:
• It normalizes the values of an attribute by moving the position of their
decimal points.
• The number of places the decimal point is moved is determined
by the maximum absolute value of attribute A.
• A value, v, of attribute A is normalized to v' by computing v' = v / 10^j,
where j is the smallest integer such that Max(|v'|) < 1.
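A minimal NumPy sketch of the three normalization methods above, reusing the slide's income figures; the extra values placed in the array are illustrative:

```python
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_000.0, 98_000.0])

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - 12_000) / (98_000 - 12_000) * (new_max - new_min) + new_min

# Z-score normalization using the slide's mean and standard deviation
mu, sigma = 54_000.0, 16_000.0
zscore = (income - mu) / sigma

# Decimal scaling: divide by 10^j so the largest |normalized value| is < 1
j = int(np.floor(np.log10(np.abs(income).max()))) + 1
decimal_scaled = income / 10 ** j

print(minmax)          # 73,000 maps to ~0.709
print(zscore)          # 73,000 maps to ~1.19
print(decimal_scaled)  # 98,000 maps to 0.98 with j = 5
```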
Data Discretization Methods
■ Typical methods: All the methods can be applied recursively
■ Binning
■ Binning groups related values together in bins to reduce the
number of distinct values.
■ Top-down split, unsupervised
■ Histogram analysis
■ partition the values for an attribute into disjoint ranges called
buckets.
■ Top-down split, unsupervised(does not use class name)
Data Discretization Methods
■ Typical methods: All the methods can be applied recursively
■ Clustering analysis
• Cluster analysis is a popular data discretization method.
• A clustering algorithm can be applied to discretize a numerical attribute, A, by
partitioning the values of A into clusters or groups.
• Each initial cluster or partition may be further decomposed into several
subclusters, forming a lower level of the hierarchy.
■ unsupervised, top-down split or bottom-up merge
■ Decision-tree analysis
■ supervised, top-down split
■ Correlation (e.g., χ2) analysis
■ unsupervised, bottom-up merge
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
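A small Python sketch (an assumed implementation, not from the slides) reproducing the equal-frequency binning and the two smoothing strategies shown above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4                                                # equal-frequency bin size
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value by the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```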
Concept Hierarchy Generation
■ Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
■ in the multidimensional model, data are organized into multiple dimensions,
and each dimension contains multiple levels of abstraction defined by concept
hierarchies.
■ Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
■ Concept hierarchy can be automatically formed for both numeric and nominal
data.
Concept Hierarchy Generation
for Nominal Data
■ Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
■ street < city < state < country
■ Specification of a hierarchy for a set of values by explicit data
grouping
■ {Urbana, Champaign, Chicago} < Illinois
■ Specification of only a partial set of attributes
■ E.g., only street < city, not others
■ Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
■ E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on the analysis
of the number of distinct values per attribute in the data set
■ The attribute with the most distinct values is placed at the lowest
level of the hierarchy
■ Exceptions, e.g., weekday, month, quarter, year
■ E.g., country (15 distinct values), province_or_state (365 distinct values),
city (3,567 distinct values), street (674,339 distinct values), giving the hierarchy
street < city < province_or_state < country, with street at the lowest level.
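A brief sketch (an assumption, not from the slides) of this heuristic: order the attributes by their distinct-value counts so that the attribute with the most distinct values lands at the lowest hierarchy level:

```python
# Distinct-value counts per attribute (figures from the example above)
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# Most distinct values -> lowest level; fewest -> highest (most general) level
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```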
Data cube
• A grouping of data in a multidimensional matrix is called a data cube.
• A data cube is generally used to make data easier to interpret.
• It is especially useful when representing data along dimensions that correspond to
certain measures of business.
• It is an extension of the 2-dimensional matrix (columns and
rows) to multiple dimensions.
Data cube
• We often need to abstract the relevant or important data from a large, complex
data set; this is where the data cube comes into the picture.
• A data cube is basically used to represent the specific information to be
retrieved from a huge set of complex data.
• e.g., purchases in a shopping mall
Data cube:Types
• The data cube can be classified into two categories:
• Multidimensional data cube:
• It basically helps in storing large amounts of data by making use of a multi-
dimensional array.
• It increases efficiency by keeping an index for each dimension; thus,
data can be retrieved quickly.
• Relational data cube:
• It basically helps in storing large amounts of data by making use of relational
tables.
• Each relational table displays the dimensions of the data cube.
• It is slower compared to a Multidimensional Data Cube
Data Cube :characteristics
• It can be extended to include many more dimensions.
• It improves business strategies through analysis of all the data.
• It helps capture the latest market scenario by establishing trends and
performance analysis.
• It plays a pivotal role by creating intermediate data cubes to
serve reporting requirements and to bridge the gap between the data
warehouse and the reporting tools.
Data Cube:Benefits
• Increases the productivity of an enterprise.
• Improves the overall performance and efficiency.
• Representation of huge and complex data sets gets simplified and
streamlined.
• Huge databases and complex SQL queries also become manageable.
• Indexing and ordering provide the best set of data for analysis
and data mining techniques.
• Faster and more easily accessible, since it holds predefined and
precalculated data sets (data cubes).
Data Cube:Benefits
• Aggregation of data makes access to the data very fast at each micro
level, which ultimately leads to easy and efficient maintenance and
reduced development time.
• OLAP on data cubes helps provide fast response times, a fast learning
curve, a versatile environment, reach across a wide range of
applications, modest resource needs for deployment, and less wait time
with quality results.
Statistical Descriptions of Data
 Statistics help in identifying patterns that further help identify
differences between random noise and significant findings.
 Descriptive statistics are used to describe or summarize data in
ways that are meaningful and useful.
 For data preprocessing to be successful, it is essential to have an overall
picture of your data.
 used to identify properties of the data and highlight which data
values should be treated as noise or outliers.
Statistical Descriptions of Data
• Statistics is a form of mathematical analysis.
• It is an area of applied mathematics concerned with data collection, analysis,
interpretation, and presentation.
• Statistics deals with how data can be used to solve complex problems.
• Statistics simplifies the work and provides a clear, clean
picture of the data you work with on a regular basis.
• Basic terminology of Statistics :
• Population
A population is a collection of individuals, objects, or events whose
properties are to be analyzed.
Statistical Descriptions of Data
Descriptive statistics describe the population either through numerical calculations or
through graphs and tables, providing a numerical or graphical summary of the data.
• Measure of central tendency
• Measure of Variability
Measure of central tendency
A summary statistic that is used to represent the center point or typical value of a
data set or sample set.
(i) Mean :
It is the average of all values in a sample set.
For example, the mean of {1, 3, 5, 6, 7} is (1 + 3 + 5 + 6 + 7) / 5 = 4.4.
Statistical Descriptions of Data
(ii) Median :
It is the central value of a sample set: the data set is ordered
from lowest to highest value and the exact middle value is taken.
For example, the median of {1, 3, 5, 6, 7} is 5.
Statistical Descriptions of Data
(iii) Mode :
It is the value that occurs most frequently in the sample set; the value repeated
most often is the mode. For example, the mode of {1, 3, 3, 5, 6} is 3.
Statistical Descriptions of Data
• Measure of Variability –
Measure of Variability is also known as measure of dispersion and helps to
understand the distribution of the data.
• three common measures of variability :
• (i) Range :
It measures how spread apart the values in a sample set or data set are.
Range = Maximum value - Minimum value
1, 3, 5, 6, 7 => Range = 7 - 1 = 6
• (ii) Variance :
Variance measures how far each number in the set is from the mean.
S² = (1/n) ∑ (xᵢ - x̄)², summed over i = 1..n
• where n is the total number of data points, x̄ is the mean of the data points, and xᵢ is
an individual data point.
• (iii) Dispersion (standard deviation) :
Dispersion in statistics is a way of describing how spread out a set of data is.
σ = √( (1/n) ∑ (xᵢ - μ)² ), summed over i = 1..n
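These measures can be computed directly with Python's standard statistics module; a minimal sketch, with illustrative data values:

```python
import statistics

data = [1, 3, 5, 6, 7]

mean   = statistics.mean(data)             # 4.4
median = statistics.median(data)           # 5
mode   = statistics.mode([1, 3, 3, 5, 6])  # 3 (mode needs a repeated value)
rng    = max(data) - min(data)             # 7 - 1 = 6
var    = statistics.pvariance(data)        # population variance, (1/n) * sum((x - mean)^2)
sigma  = statistics.pstdev(data)           # population standard deviation

print(mean, median, mode, rng, var, sigma)
```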
Contenu connexe

Tendances

1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesKrish_ver2
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text miningKrish_ver2
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalSudarsun Santhiappan
 
Multidimensional data models
Multidimensional data  modelsMultidimensional data  models
Multidimensional data models774474
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemKiran kumar
 
Data warehouse and data mining
Data warehouse and data miningData warehouse and data mining
Data warehouse and data miningPradnya Saval
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Data Mining: Applying data mining
Data Mining: Applying data miningData Mining: Applying data mining
Data Mining: Applying data miningDataminingTools Inc
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Data mining PPT
Data mining PPTData mining PPT
Data mining PPTKapil Rode
 
Data warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaData warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaRadhika Kotecha
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)snegacmr
 

Tendances (20)

1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
Multidimensional data models
Multidimensional data  modelsMultidimensional data  models
Multidimensional data models
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
 
Data warehouse and data mining
Data warehouse and data miningData warehouse and data mining
Data warehouse and data mining
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Data mining
Data miningData mining
Data mining
 
Data Mining: Applying data mining
Data Mining: Applying data miningData Mining: Applying data mining
Data Mining: Applying data mining
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Data mining PPT
Data mining PPTData mining PPT
Data mining PPT
 
CS8080 information retrieval techniques unit iii ppt in pdf
CS8080 information retrieval techniques unit iii ppt in pdfCS8080 information retrieval techniques unit iii ppt in pdf
CS8080 information retrieval techniques unit iii ppt in pdf
 
Data warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaData warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika Kotecha
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)
 

Similaire à Dma unit 1

Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basicNivaTripathy2
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Information_System_and_Data_mining12.ppt
Information_System_and_Data_mining12.pptInformation_System_and_Data_mining12.ppt
Information_System_and_Data_mining12.pptPrasadG76
 
chap1.ppt
chap1.pptchap1.ppt
chap1.pptImXaib
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slidestafosepsdfasg
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining Sushil Kulkarni
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryYoung Alista
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryHarry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryJames Wong
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryFraboni Ec
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryLuis Goldster
 

Similaire à Dma unit 1 (20)

Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basic
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
chap1.ppt
chap1.pptchap1.ppt
chap1.ppt
 
chap1.ppt
chap1.pptchap1.ppt
chap1.ppt
 
Information_System_and_Data_mining12.ppt
Information_System_and_Data_mining12.pptInformation_System_and_Data_mining12.ppt
Information_System_and_Data_mining12.ppt
 
chap1.ppt
chap1.pptchap1.ppt
chap1.ppt
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
 
BAS 250 Lecture 1
BAS 250 Lecture 1BAS 250 Lecture 1
BAS 250 Lecture 1
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Dm unit i r16
Dm unit i   r16Dm unit i   r16
Dm unit i r16
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 

Dernier

Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spaintimesproduction05
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLManishPatel169454
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 

Dernier (20)

Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 

Dma unit 1

  • 1. 18CSE355T -DATA MINING AND ANALYTICS
  • 2. COURSE LEARNING RATIONALE (CLR) The purpose of learning this course is to: CLR -1: Understand the concepts of Data Mining CLR -2: Familiarize with Association rule mining CLR -3: Familiarize with various Classification algortihms CLR -4: Understand the concepts of Cluster Analysis CLR -5: Familiarize with Outlier analysis techniques CLR -6: Familiarize with applications of Data mining in different domains
  • 3. COURSE LEARNING OUTCOMES (CLO) At the end of this course, learners will be able to: CLO -1: Gain knowledge about the concepts of Data Mining CLO -2: Understand and Apply Association rule mining techniques CLO -3: Understand and Apply various Classification algortihms CLO -4: Gain knowledge on the concepts of Cluster Analysis CLO -5: Gain knowledge on Outlier analysis techniques CLO -6: Understand the importance of applying Data mining concepts in different domains
  • 4. LEARNING RESOURCES S. No., TEXT BOOKS 1 Jiawei Han and Micheline Kamber, ― Data Mining: Concepts and Techniques‖, 3rd Edition, Morgan Kauffman Publishers, 2011.
  • 5. UNIT I INTRODUCTION Why Data mining? What is Data mining ?-Kinds of data meant for mining -Kinds of patterns that can be mined- Applications suitable for data mining-Issues in Data mining-Data objects and Attribute types-Statistical descriptions of data-Need for data preprocessing and data quality-Data cleaning-Data integration-Data reduction- Data transformation-Data cube and its usage
  • 6. Why Data Mining? • The Explosive Growth of Data: from terabytes(10004) to petabytes(10008) – Data collection and data availability • Automated data collection tools, database systems, web – Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: bioinformatics, scientific simulation, medical research … • Society and everyone: news, digital cameras, …
  • 7. Why Data Mining?  The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data rich but information poor situation.  The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has far exceeded our human ability for comprehension without powerful tools.  As a result, data collected in large data repositories become “data tombs” data archives that are seldom visited.  Important decisions are often made based not on the information-rich data stored in data repositories but rather on a decision maker’s intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data.
  • 8. Why Data Mining?  Efforts have been made to develop expert system and knowledge-based technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases.  Unfortunately, however, the manual knowledge input procedure is prone to biases and errors and is extremely costly and time consuming.  The widening gap between data and information calls for the systematic development of data mining tools that can turn data tombs into “golden nuggets” of knowledge.
  • 10. What Is Data Mining? • Data mining (knowledge discovery from data) – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data. – Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. – The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically. • Alternative names – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • 11. Potential Applications • Data analysis and decision support – Market analysis and management • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation – Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis – Fraud detection and detection of unusual patterns (outliers) • Other Applications – Text mining (news group, email, documents) and Web mining – Stream data mining – Bioinformatics and bio-data analysis
  • 12. Ex.: Market Analysis and Management • Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, surveys … • Target marketing – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc., • E.g. Most customers with income level 60k – 80k with food expenses $600 - $800 a month live in that area – Determine customer purchasing patterns over time • E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k usually buy this type of CD player • Cross-market analysis—Find associations/co-relations between product sales, & predict based on such association – E.g. Customers who buy computer A usually buy software B
  • 14. KDD Process 7 Steps of KDD Process – Data cleaning (remove noise and inconsistent data) – Data integration (multiple data sources maybe combined) – Data selection (data relevant to the analysis task are retrieved from database) – Data transformation (data transformed or consolidated into forms appropriate for mining) (Done with data preprocessing) – Data mining (an essential process where intelligent methods are applied to extract data patterns) – Pattern evaluation (indentify the truly interesting patterns) – Knowledge presentation ( presentation of knowledge to the user for visualization in terms of trees, tables, rules graphs, charts, matrices..)
  • 15. Data Mining and Business Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
  • 16. On What Kinds of Data to be minined? • Database-oriented data sets and applications – Relational database, data warehouse, transactional database • Advanced data sets and advanced applications – Object-Relational Databases – Temporal Databases, Sequence Databases, Time-Series databases – Spatial Databases and Spatiotemporal Databases – Text databases and Multimedia databases – Heterogeneous Databases and Legacy Databases – Data Streams – The World-Wide Web
  • 17. Relational Databases • DBMS – database management system, contains a collection of interrelated databases e.g. Faculty database, student database, publications database • Each database contains a collection of tables and functions to manage and access the data. e.g. student_bio, student_graduation, student_parking • Each table contains columns and rows, with columns as attributes of data and rows as records. • Tables can be used to represent the relationships between or among multiple tables.
  • 18. Relational Databases (2) – AllElectronics store
  • 19. Relational Databases (3) • With a relational query language, e.g. SQL, we will be able to find answers to questions such as:  How many items were sold last year?  Who has earned commissions higher than 10%?  What is the total sales of last month for Dell laptops? • When data mining is applied to relational databases, we can search for trends or data patterns. • data mining systems can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.
  • 20. Data Warehouses • A repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. • Constructed via a process of data cleaning, data integration, data transformation, data loading and periodic data refreshing.
  • 21. Data Warehouses (2) • Modelled by multidimensional data structure , called as data cube, in which each dimension corresponds to set of attributes. • Each cell stores the value of some aggregate measures like count. • Data cube provides the multidimensional view of data and allows the precomputation and fast access of summarized data. • Data are organized around major subjects, e.g. customer, item, supplier and activity. • Provide information from a historical perspective (e.g. from the past 5 – 10 years) • Typically summarized to a higher level (e.g. a summary of the transactions per item type for each store) • User can perform drill-down or roll-up operation to view the data at different degrees of summarization
  • 23. Transactional Databases • It captures a transaction, such as flight booking, customer’s purchase and user’s click on a web page. • Consists of a file where each record represents a transaction. • A transaction typically includes a unique transaction ID and a list of the items making up the transaction. • Either stored in a flat file or unfolded into relational tables • Easy to identify items that are frequently sold together
  • 24. Transactional Databases  Which items sold well together?”  This kind of market basket data analysis would enable you to bundle groups of items together as a strategy for boosting sales.  e.g purchase the computer along with printer.  A traditional database system is not able to perform market basket data analysis.  Data mining on transactional data can do so by mining frequent item sets that is, sets of items that are frequently sold together.
  • 25. Data Mining Functionalities What kinds of patterns can be mined? • Data Mining Functionalities are used to specify the kinds of patterns to be found in data mining tasks. • 2 tasks are descriptive and predictive. • Descriptive mining tasks –describes the general properties of the data • Predictive mining tasks – it makes the prediction based on current data. Types of Data Mining Functionalities • Concept/Class Description • Mining Frequent Patterns, Associations, and Correlations • Classification and Regression for Predictive Analysis • Cluster Analysis • Outlier Analysis • Evolution Analysis .
  • 26. Data Mining Functionalities - What kinds of patterns can be mined? 1. Concept/Class Description Data can be associated with classes or concepts. •E.g. classes of items – computers, printers, … concepts of customers – bigSpenders, budgetSpenders, … It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Data characterization – summarizing the general characteristics of a target class of data. –E.g. summarizing the characteristics of customers who spend more than $1,000 a year at AllElectronics. Result can be a general profile of the customers, 40 – 50 years old, Employed have excellent credit ratings
  • 27. 1.4 Data Mining Functionalities - What kinds of patterns can be mined? •Data discrimination – comparing the target class with one or a set of comparative classes –E.g. Compare the general features of software products whole sales increase by 10% in the last year , –with those whose sales decrease by 30% during the same period
  • 28. 2.Mining Frequent Patterns, Associations and Correlations Mining Frequent Patterns ( patterns that occur frequently in data ). Kinds of frequent patterns – Frequent item set: a set of items that frequently appear together in a transactional data set (e.g. milk and bread) – Frequent subsequence: A frequently occurring subsequence, such as the pattern that customers, tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. – Frequent Substructures: A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with item sets or subsequences. – If a substructure occurs frequently, it is called a (frequent) structured pattern. – Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
  • 29. 1.4 Data Mining Functionalities - What kinds of patterns can be mined? ` – Association Analysis: find frequent patterns • E.g. a sample analysis result – an association rule: buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%] if a customer buys a computer, there is a 50% chance that she will buy software. 1% of all of the transactions under analysis showed that computer and software are purchased together. • Associations rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold. – Correlation Analysis: additional analysis to find statistical correlations between associated pairs
  • 30. What kinds of patterns can be mined? 3.Classification and Prediction for predictive analysis – Classification • It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts • The goal of classification is to accurately predict the target class for each case in the data. • The model can be represented in classification (IF-THEN) rules, decision trees, neural networks, etc.
  • 31. What kinds of patterns can be mined? 3.Classification and Prediction for predictive analysis A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
  • 32. What kinds of patterns can be mined? 3.Classification and Prediction for predictive analysis • A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. • Neural networks are used for effective data mining in order to turn raw data into useful information. • Neural networks look for patterns in large batches of data, allowing businesses to learn more about their customers which directs their marketing strategies, increase sales and lowers costs.
  • 33. What kinds of patterns can be mined? Classification and Regression  Regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels.  The term prediction refers to both numeric prediction and class label prediction.  Regression analysis is a statistical methodology that is most often used for numeric prediction, although other methods exist as well.  Regression also encompasses the identification of distribution trends based on the available data.  Classification and regression may need to be preceded by relevance analysis, which attempts to identify attributes that are significantly relevant to the classification and regression process.  Such attributes will be selected for the classification and regression process. Other attributes, which are irrelevant, can then be excluded from consideration
  • 34. What kinds of patterns can be mined? 4.Cluster Analysis – Clustering can be used to generate class labels for a group of data. – Clusters of objects are formed based on the principle of maximizing intra- class similarity & minimizing interclass similarity • E.g. Identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing.
  • 35. What kinds of patterns can be mined? 5.Outlier Analysis: identify similar objects – A data set may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers – Outliers are usually discarded as noise or exceptions. – Useful for fraud detection. the outlier indicates a fraudulent activity. • E.g. Detect purchases of extremely large amounts 6.Evolution Analysis – Describes and models regularities or trends for objects whose behavior changes over time. • E.g. Identify stock evolution regularities for overall stocks and for the stocks of particular companies.
  • 36. What kinds of patterns can be mined? Are All of the Patterns Interesting? • Data mining may generate thousands of patterns: Not all of them are interesting • A pattern is interesting if it is – easily understood by humans – valid on new or test data with some degree of certainty, – potentially useful – novel – validates some hypothesis that a user seeks to confirm • An interesting patterns represents knowledge !
  • 37. What kinds of patterns can be mined? Are All of the Patterns Interesting? • Objective measures – Based on statistics and structures of patterns, e.g., support, confidence, etc. (Rules that do not satisfy a threshold are considered uninteresting.) • Subjective measures – Reflect the needs and interests of a particular user. • E.g. A marketing manager is only interested in characteristics of customers who shop frequently. – Based on user’s belief in the data. • e.g., Patterns are interesting if they are unexpected, or can be used for strategic planning, etc • Objective and subjective measures need to be combined.
  • 38. Major Issues in Data Mining There are many challenging issues in data mining research. Areas include i. Mining methodology ii. User interaction iii. Efficiency and scalability iv. Dealing with diverse data types v. Data mining and society
  • 39. 39 Major Issues in Data Mining ■ Mining Methodology ■ Mining various and new kinds of knowledge ■ Mining knowledge in multi-dimensional space ■ Data mining: An interdisciplinary effort ■ Boosting the power of discovery in a networked environment ■ Handling noise, uncertainty, and incompleteness of data ■ Pattern evaluation and pattern- or constraint-guided mining
  • 40. 40 Major Issues in Data Mining ■ Mining Methodology ■ Mining various and new kinds of knowledge  use the same database in different ways and require the development of numerous data mining techniques.  Due to the diversity of applications, new mining tasks continue to emerge, making data mining a dynamic and fast-growing field. e.g effective knowledge discovery in information networks, integrated clustering and ranking may lead to the discovery of high-quality clusters and object ranks in large networks.
  • 41. 41 Major Issues in Data Mining ■ Mining Methodology ■ Mining knowledge in multi-dimensional space ■ When searching for knowledge in large data sets, explore the data in multidimensional space. ■ That is, can search for interesting patterns among combinations of dimensions (attributes) at varying levels of abstraction. Such mining is known as (exploratory) multidimensional data mining. ■ In many cases, data can be aggregated or viewed as a multidimensional data cube. ■ Mining knowledge in cube space can enhance the power and flexibility of data mining.
  • 42. Major Issues in Data Mining Data mining an interdisciplinary effort: The power of data mining can be enhanced by integrating new methods from multiple disciplines. Eg.- mining data in natural language - the mining of software bugs in large programs(bug mining), requires software engineering knowledge. Boosting the power of discovery in a networked environment: • Most data objects reside in a linked or interconnected environment, whether it be the Web, database relations, files, or documents. • Semantic links across multiple data objects can be used to advantage in data mining. • Knowledge derived in one set of objects can be used discovery of knowledge in a “related” objects.
  • 43. Major Issues in Data Mining Handling uncertainty, noise, or incompleteness of data:  Data often contain noise, errors, exceptions, or uncertainty, or are incomplete.  Errors and noise may confuse the data mining process, leading to the derivation of erroneous patterns.  Data cleaning, data preprocessing, outlier detection and removal are examples of techniques that need to be integrated with the data mining process. Pattern evaluation and pattern- or constraint-guided mining:  Not all the patterns generated by data mining processes are interesting.  What makes a pattern interesting may vary from user to user.  Therefore, techniques are needed to assess the interestingness of discovered patterns based on subjective measures.
  • 44. Major Issues in Data Mining ii)User Interaction  The user plays an important role in the data mining process.  Interesting areas of research include how to interact with a data mining system, how to incorporate a user’s background knowledge in mining, and how to visualize data mining results. Interactive mining:  The data mining process should be highly interactive.  important to build flexible user interfaces. Incorporation of background knowledge:  Background knowledge, constraints, rules, and other information regarding the domain should be incorporated into the knowledge discovery process.  Such knowledge can be used for pattern evaluation and guide the search to find interesting patterns.
  • 45. Major Issues in Data Mining Presentation and visualization of data mining results:  How can a data mining system present data mining results flexibly?  This is especially crucial if the data mining process is interactive.  It requires the system to adopt expressive knowledge representations, user friendly interfaces, and visualization techniques. iii)Efficiency and Scalability  Efficiency and scalability are always considered when comparing data mining algorithms.  As data amounts continue to multiply, these two factors are especially critical.
  • 46. Major Issues in Data Mining Efficiency and scalability of data mining algorithms:  Data mining algorithms must be efficient and scalable in order to effectively extract information from huge amounts of data .  The running time of a data mining algorithm must be predictable, short, and acceptable by applications.  Efficiency, scalability, performance, optimization, and the ability to execute in real time are key criteria that drive the development of many new data mining algorithms. Parallel, distributed, and incremental mining algorithms:  The parallel processes may interact with one another. The patterns from each partition are eventually merged.
  • 47. 47 Major Issues in Data Mining iv)Diversity of Database Types The wide diversity of database types brings about challenges to data mining. Handling complex types of data Mining dynamic, networked, and global data repositories Handling complex types of data  The construction of effective and efficient data mining tools for diverse applications remains a challenging and active area of research. Mining dynamic, networked, and global data repositories Multiple sources of data are connected by the Internet and various kinds of networks, forming distributed, and heterogeneous global information systems and networks.
  • 48. Major Issues in Data Mining  The discovery of knowledge from different sources of structured, semi- structured, or unstructured and interconnected data with diverse data semantics having great challenges to data mining.  Web mining, multisource data mining, and information network mining have become challenging and fast-evolving data mining fields. v)Data Mining and Society How does data mining impact society? Social impacts of data mining:  The improper use of data and the potential violation of individual privacy and data protection rights are areas of concern that need to be addressed.
  • 49. Major Issues in Data Mining Privacy-preserving data mining:  Data mining will help scientific discovery, business management, economy recovery, and security protection (e.g., the real-time discovery of intruders and cyberattacks).  However, it poses the risk of disclosing an individual’s personal information.  observe data sensitivity and preserve people’s privacy while performing successful data mining.
  • 50. Major Issues in Data Mining Invisible data mining:  We cannot expect everyone in society to learn and master data mining techniques.  More and more systems should have data mining functions built within so that people can perform data mining or use data mining results simply by mouse clicking, without any knowledge of data mining algorithms.  Intelligent search engines and Internet-based stores perform such invisible data mining by incorporating data mining into their components to improve their functionality and performance. This is done often unbeknownst to the user.  For example, when purchasing items online, users may be unaware that the store is likely collecting data on the buying patterns of its customers, which
  • 51. Data Objects and Attribute Types A data object is a region of storage that contains a value or group of values. Data sets are made up of data objects. A data object represents an entity • in a sales database, the objects may be customers, store items, and sales; • in a university database, the objects may be students, professors, and courses. Row-> data objects Columns->attributes Data objects are typically described by attributes. Data objects can also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes.
  • 52. Data Objects and Attribute Types An attribute is a data field, representing a characteristic or feature of a data object. The type of an attribute is determined by the set of possible values: nominal, binary, ordinal, or numerical • Nominal Attributes only provide enough attributes to differentiate between one object and another.  -relating to name  Hair-color{brown,black,white} • Such as Student Roll No. • Ordinal Attribute: • Value have meaningful order. The ordinal attribute value provides sufficient information to order the objects. Piazza={small,medium,large} Rankings, Grades, Height
  • 53. Data Objects and Attribute Types • Binary Attribute: These are 0 and 1. Where 0 is the absence of any features and 1 is the inclusion of any characteristics. • Quality • Numeric attribute: It is quantitative, such that quantity can be measured and represented in integer or real values . Two types i)Interval Scaled attribute: It is measured on a scale of equal size units. These attributes allows us to compare such as temperature in C or F . Thus values of attributes have order. ii)Ratio Scaled attribute: Both differences and ratios(or multiply) are significant for Ratio. For eg. age, length, Weight. e.g.10 we can say multiply of 5 .
  • 54. Why Data Preprocessing? • Data in the real world is dirty – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data – noisy: containing errors or outliers – inconsistent: containing discrepancies in codes or names • No quality data, no quality mining results!
  • 55. Why Data Preprocessing? Data Quality: Data have quality if they satisfy the requirements of the intended use. Factors comprising quality • accuracy • completeness • consistency • timeliness • believability • interpretability • Accessibility e.g: Analyze the branch sale at All Electronics Store three of the elements defining data quality: • Accuracy • completeness • consistency
  • 56. Why Data Preprocessing? Data Quality: Reasons for inaccurate data(having incorrect attribute value) • The data collection instruments used may be faulty. • There may have been human or computer errors occurring at data entry. • Users may purposely submit incorrect data values for mandatory fields • This is known as disguised missing data.  There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption.
  • 57. Why Data Preprocessing? Data Quality: Reason for Incorrect data • inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). • Duplicate tuples also require data cleaning. Reasons for Incomplete data • Missing the Attribute information.  e.g customer information for sales transaction data. • Relevant data may not be recorded due to a misunderstanding. • Data that were inconsistent with other recorded data, may have been deleted.
  • 58. Why Data Preprocessing? Data Quality: Timeliness also affects data quality. e.g: monthly sales bonuses to the top sales representatives at All Electronics Store. 2 other factors affect quality Believability reflects how much the data are trusted by users. Interpretability reflects how easy the data are understood
  • 59. Major Tasks in Data Preprocessing • Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration – Integration of multiple databases, data cubes, files, or notes • Data transformation -- forms of Normalization (scaling to a specific range) – Aggregation • Data reduction – Obtains reduced representation in volume but produces the same or similar analytical results – Data discretization: with particular importance, especially for numerical data – Data aggregation, dimensionality reduction, data compression, generalization
  • 60. Forms of data preprocessing
  • 61. 61 Data Cleaning ■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human or computer error, transmission error ■ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data ■ e.g., Occupation=“ ” (missing data) ■ noisy: containing noise, errors, or outliers ■ e.g., Salary=“−10” (an error) ■ inconsistent: containing discrepancies in codes or names, e.g., ■ Age=“42”, Birthday=“03/07/2010” ■ Was rating “1, 2, 3”, now rating “A, B, C” ■ discrepancy between duplicate records ■ Intentional (e.g., disguised missing data) ■ Jan. 1 as everyone’s birthday?
  • 62. 62 Incomplete (Missing) Data ■ Data is not always available ■ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data ■ Missing data may be due to ■ equipment malfunction ■ inconsistent with other recorded data and thus deleted ■ data not entered due to misunderstanding ■ certain data may not be considered important at the time of entry ■ not register history or changes of the data ■ Missing data may need to be inferred
  • 63. 63 Incomplete (Missing) Data Data cleaning is discussed in terms of: 1. Handling missing values 2. Handling noisy data 3. Data cleaning as a process.
  • 64. 64 1.How to Handle Missing Data? ■ Ignore the tuple: usually done when class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably ■ Fill in the missing value manually: this approach is time consuming and may not be feasible given a large data set with many missing values. ■ Use a global constant to fill in the missing value:  Replace all missing attribute values by the same constant such as a label like “Unknown” or −∞.  If missing values are replaced by, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common—that of “Unknown.”  Hence, although this method is simple, it is not foolproof.
  • 65. 65 1.How to Handle Missing Data? ■ Use the attribute mean or median for all samples belonging to the same class as the given tuple:  For example, if classifying customers according to credit risk, may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple.  If the data distribution for a given class is skewed, the median value is a better choice.
  • 66. 66 How to Handle Missing Data? ■ Use the most probable value to fill in the missing value:  This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.  For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
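The fill-in strategies above can be sketched with pandas; the DataFrame, column names, and credit-risk grouping below are hypothetical and only illustrate the idea, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, None, 18000, None, 61000],
})

# (a) Global constant: flag missing values with a label such as "Unknown"
df["income_const"] = df["income"].fillna("Unknown")

# (b) Class-wise mean: replace a missing income with the mean income of
#     customers in the same credit-risk category (use median if skewed)
df["income_mean"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```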
  • 67. 67 2.Noisy Data ■ Noise: random error or variance in a measured variable ■ Incorrect attribute values may be due to ■ faulty data collection instruments ■ data entry problems ■ data transmission problems ■ technology limitation ■ inconsistency in naming convention ■ Other data problems which require data cleaning ■ duplicate records ■ incomplete data ■ inconsistent data
  • 68. 68 How to Handle Noisy Data? ■ Binning ■ Binning is a way to group a number of more or less continuous values into a smaller number of "bins". ■ The sorted values are distributed into a number of “buckets,” or bins . ■ example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals. Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34 Partition into (equal-frequency) bins: Bin 1: 4, 8, 15 Bin 2: 21, 21, 24 Bin 3: 25, 28, 34 Smoothing by bin means: Bin 1: 9, 9, 9 Bin 2: 22, 22, 22 Bin 3: 29, 29, 29 Smoothing by bin boundaries: Bin 1: 4, 4, 15 Bin 2: 21, 21, 24 Bin 3: 25, 25, 34
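A small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, applied to the price data above (plain Python, no libraries assumed):

```python
# Sorted prices from the example above
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
n_bins = 3
size = len(prices) // n_bins           # equal-frequency: 3 values per bin
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of
# the bin's minimum or maximum value
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```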
  • 69. 69 How to Handle Noisy Data? Regression:  Regression is a technique used to model and analyze the relationships between variables.  It predicts values by fitting the data to a function.  Ex.: predict children's height given their age, weight, and other factors.  Linear regression involves finding the “best” line to fit two attributes so that one attribute can be used to predict the other.  Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Outlier analysis:  Outliers may be detected by clustering,  for example, where similar values are organized into groups, or “clusters”;  values that fall outside of the set of clusters may be considered outliers.
  • 70. 70 Data Cleaning as a Process ■ Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation. ■ First step is Data discrepancy detection.  Discrepancies can be caused by several factors  poorly designed data entry forms that have many optional fields.  human error in data entry  deliberate errors e.g., respondents not wanting to divulge information about themselves.  Other sources of discrepancies include errors in instrumentation devices that record data and system errors.  Errors can also occur when the data are used for purposes other than originally intended.  There may also be inconsistencies due to data integration.
  • 71. 71 Data Cleaning as a Process ■ How can we proceed with discrepancy detection? ■ Use metadata (e.g., domain, range, dependency) and check for field overloading. ■ Examine the data using the uniqueness rule, consecutive rule, and null rule. ■ Use commercial tools: ■ Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal codes, spell-checking) to detect errors and make corrections. ■ Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions, e.g., using correlation and clustering to find outliers.
  • 72. 72 Data Cleaning as a Process ■ how can we proceed with discrepancy detection?” ■ Data migration and integration ■ Data migration tools allow simple transformations to be specified such as to replace the string “gender” by “sex.” ■ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface. ■ The two-step process of discrepancy detection and data transformation iterates.
  • 73. 73 Data Cleaning as a Process  As a result, the entire data cleaning process suffers from a lack of interactivity.  New approaches to data cleaning emphasize increased interactivity.  Potter's Wheel is a publicly available data cleaning tool that integrates discrepancy detection and transformation.  The tool automatically performs discrepancy checking in the background on the latest transformed view of the data.  Users can gradually develop and refine transformations as discrepancies are found, leading to more effective and efficient data cleaning.  For data transformation, declarative languages (e.g., SQL extensions) and algorithms have also been developed to let users express data cleaning specifications efficiently.  It is important to keep updating the metadata to reflect this knowledge; this will help speed up data cleaning on future versions of the same data store.
  • 74. 74 74 Data Integration ■ Data integration: ■ Merging of data from multiple sources into a coherent store How can we match schema and objects from different sources? 1.Entity identification problem 2. Redundancy and correlation analysis 3.Tuple duplication 4.Data value conflict detection and resolution
  • 75. 75 75 Data Integration 1.Entity identification problem How can equivalent real world entities from multiple data sources be matched up? ■ Identify real world entities from multiple data sources, e.g., cust_id=customer number  metadata can be used to help avoid errors in schema integration.
  • 76. 76 Redundancy and Correlation Analysis ■ An attribute may be redundant if it can be “derived” from another attribute or set of attributes. ■ E.g.: annual revenue ■ Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set. ■ Redundant attributes can be detected by correlation analysis and covariance analysis. ■ Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. ■ For nominal data, use the χ2 (chi-square) test. ■ For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another. ■ Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
  • 77. 77 Correlation Analysis for Nominal Data ■ A correlation relationship between two attributes, A and B, can be discovered by a χ2 (chi-square) test. ■ The larger the χ2 value, the more likely the variables are related. ■ The cells that contribute the most to the χ2 value are those whose actual count is very different from the expected count. ■ Correlation does not imply causality ■ # of hospitals and # of car-theft in a city are correlated ■ Both are causally linked to the third variable: population
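The χ2 statistic referred to above is computed in the standard way by comparing observed and expected counts over all cells of the contingency table:

```latex
\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij}-e_{ij})^{2}}{e_{ij}},
\qquad
e_{ij} = \frac{\mathrm{count}(A=a_i)\times\mathrm{count}(B=b_j)}{n}
```

where o_ij is the observed (actual) count and e_ij is the expected count of the joint event (A = a_i, B = b_j), and n is the number of tuples.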
  • 78. 78 Chi-Square Calculation: An Example ■ χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

                            Play chess   Not play chess   Sum (row)
  Like science fiction      250 (90)     200 (360)        450
  Not like science fiction  50 (210)     1000 (840)       1050
  Sum (col.)                300          1200             1500

  ■ It shows that like_science_fiction and play_chess are correlated in the group.
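Plugging the observed and expected counts from the table into the χ2 formula gives:

```latex
\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210}
       + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840}
       = 284.44 + 121.90 + 71.11 + 30.48 \approx 507.93
```

With 1 degree of freedom, this far exceeds the 0.001-significance threshold of 10.828, so the hypothesis that the two attributes are independent is rejected; they are strongly correlated.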
  • 79. 79 Correlation Analysis for Numeric Data ■ Correlation coefficient (also called Pearson’s product moment coefficient), where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product (see the formula below). ■ If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation. ■ rA,B = 0: independent; rA,B < 0: negatively correlated (the values of one attribute increase as the values of the other attribute decrease).
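Written out in the standard Pearson form, using the symbols described above:

```latex
r_{A,B} = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n\,\sigma_A\,\sigma_B}
        = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\,\bar{B}}{n\,\sigma_A\,\sigma_B}
```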
  • 80. 80 Correlation Analysis for Numeric Data • correlation does not imply causality. • That is, if A and B are correlated, this does not necessarily imply that A causes B or that B causes A. • For example, in analyzing a demographic database, we may find that attributes representing the number of hospitals and the number of car thefts in a region are correlated. • This does not mean that one causes the other. Both are actually causally linked to a third attribute, namely, population
  • 81. 81 Covariance of Numeric Data ■ Covariance is a measure of the directional relationship between two random variables, where n is the number of tuples, Ā and B̄ are the respective mean (expected) values of A and B, and σA and σB are the respective standard deviations of A and B (see the formulas below). ■ Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values. ■ Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value. ■ Independence: if A and B are independent, CovA,B = 0, but the converse is not true: ■ Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions does a covariance of 0 imply independence. ■ Correlation coefficient: rA,B = CovA,B / (σA σB).
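Written out, the covariance and its relation to the correlation coefficient mentioned above are the standard definitions:

```latex
\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big]
                  = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B}),
\qquad
r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A\,\sigma_B}
```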
  • 82. Co-Variance: An Example ■ The covariance can be simplified in computation as CovA,B = E(A·B) − Ā·B̄. ■ Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). ■ Question: If the stocks are affected by the same industry trends, will their prices rise or fall together? ■ E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4 ■ E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6 ■ Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4 ■ Thus, A and B rise together since Cov(A, B) > 0.
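A quick check of the stock example with NumPy; note that np.cov uses the sample (n − 1) denominator by default, so bias=True is passed to match the population formula used above.

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])     # stock A prices over one week
B = np.array([5, 8, 10, 11, 14])  # stock B prices over one week

# Simplified computation: Cov(A, B) = E(A*B) - E(A)*E(B)
cov_simple = np.mean(A * B) - np.mean(A) * np.mean(B)

# Same result from NumPy's covariance matrix (population version)
cov_matrix = np.cov(A, B, bias=True)

print(cov_simple)        # 4.0
print(cov_matrix[0, 1])  # 4.0
```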
  • 83. Tuple Duplication • In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level, • e.g., where there are two or more identical tuples for a given unique data entry case.  Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences. For example, if a purchase order database contains attributes for the purchaser’s name and address instead of a key to this information in a purchaser database, discrepancies can occur, such as the same purchaser’s name appearing with different addresses within the purchase order database.
  • 84. Data Value Conflict Detection and Resolution Data integration also involves the detection and resolution of data value conflicts. For example, for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding. e.g.,  a weight attribute may be stored in metric units in one system and in imperial units in another;  one university may use a quarter system and another a semester system, and they may offer different database courses, which makes it difficult to work out grade conversions between them. Attributes may also differ in abstraction level,  where an attribute in one system is recorded at a lower abstraction level than the “same” attribute in another. Ex.:  the total sales in one database may refer to one branch of AllElectronics, while an attribute of the same name in another database may refer to the total sales for AllElectronics stores in a given region.
  • 85. 85 Data Reduction Strategies ■ Data reduction: Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set. 1.Data reduction strategies  dimensionality reduction  numerosity reduction  data compression.
  • 86. 86 Data Reduction Strategies Dimensionality reduction represents the original data in a compressed or reduced form by applying an encoding or transformation. It is the process of reducing the number of random variables or attributes under consideration (removing unimportant attributes).  Wavelet transforms  Principal Components Analysis (PCA)  Feature subset selection, feature creation These transform or project the original data onto a smaller space.
  • 87. 87 Data Reduction Strategies ■ Numerosity reduction reduces the data volume by choosing alternative, smaller forms of data representation. ■ These techniques may be parametric or nonparametric. In parametric methods, a model is used to estimate the data, so only the model parameters need to be stored instead of the actual data. (Outliers may also be stored.) E.g.: regression and log-linear models. Nonparametric methods store reduced representations of the data, e.g., histograms, clustering, sampling, and data cube aggregation.
  • 88. 88 Data Reduction Strategies Data compression  In data compression, transformations are applied to obtain a reduced or “compressed” representation of the original data.  If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.  If only an approximation of the original data can be reconstructed, the data reduction is called lossy.  There are several lossless algorithms for string compression, although they typically allow only limited data manipulation.  Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression; the time saved by mining on a reduced data set generally outweighs the time spent on the reduction itself.
  • 89. 89 Data Reduction: Wavelet Transform ■ The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. ■ All wavelet coefficients larger than some user-defined threshold can be retained; the remaining coefficients are set to 0. This also helps remove noise from the data. ■ 1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n). ■ 2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. ■ The second performs a weighted difference, which acts to bring out the detailed features of the data.
  • 90. 90 Wavelet Transform ■ 3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x2i, x2i+1). This results in two data sets of length L/2. ■ These represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively. ■ 4. The two functions are applied recursively to the resulting data sets until the desired length is reached. ■ 5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data.  Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients; the matrix must be orthonormal.
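As a small illustration of this procedure, the sketch below applies a hierarchical Haar DWT, thresholds the coefficients, and reconstructs an approximation. It assumes the third-party PyWavelets (pywt) package is available; the input vector and threshold value are arbitrary choices for illustration only.

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

# Input vector; length 8 is already a power of 2, so no zero-padding is needed
x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# Hierarchical Haar decomposition: a smoothed (low-frequency) part plus
# detail (high-frequency) coefficients at each level
coeffs = pywt.wavedec(x, "haar", level=2)

# Keep only coefficients above a user-defined threshold, zero the rest
threshold = 1.0
reduced = [pywt.threshold(c, threshold, mode="hard") for c in coeffs]

# Approximate (lossy) reconstruction from the reduced coefficients
x_approx = pywt.waverec(reduced, "haar")
print(x_approx)
```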
  • 91. 91 Wavelet Transform ■ Wavelet transforms can be applied to multidimensional data such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on. ■ real world applications compression of fingerprint images  computer vision  analysis of time-series data  data cleaning
  • 92. 92 Principal Component Analysis (PCA) ■ A dimensionality-reduction method ■ used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one.
  • 93. 93 Principal Component Analysis (PCA) • Goal: obtain the principal components of the data matrix X. • Procedure: • The first principal component is the normalized linear combination of the variables that has the highest variance. • The second principal component has the largest variance subject to being uncorrelated with the first. • The principal components are thus linear combinations of the data that have high variance and are mutually uncorrelated. In a two-dimensional scatter plot, the direction along which the data varies the most is the first principal component.
  • 94. 94 Principal Component Analysis (Steps) ■ PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data. ■ Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions. ■ Principal components may be used as inputs to multiple regression and cluster analysis. ■ In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
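A short PCA sketch with scikit-learn (assumed available); the random data and the choice of 2 components are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # hypothetical data: 100 tuples, 5 attributes

pca = PCA(n_components=2)            # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X)     # project the data onto those components

print(X_reduced.shape)               # (100, 2) -- reduced representation
print(pca.explained_variance_ratio_) # variance captured by each component
```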
  • 95. Attribute Subset Selection • Why attribute subset selection – Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. – For example, ◆ if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer’s telephone number are likely to be irrelevant, unlike attributes such as age or music taste.
  • 96. Attribute Subset Selection • Using a domain expert to pick out some of the useful attributes – Sometimes this can be a difficult and time-consuming task, especially when the behavior of the data is not well known. • Leaving out relevant attributes or keeping irrelevant attributes may result in discovered patterns of poor quality. • The added volume of irrelevant or redundant attributes can also slow down the mining process.
  • 97. Attribute Subset Selection • Attribute subset selection (feature selection): – Reduce the data set size by removing irrelevant or redundant attributes. – Goal: select a minimum set of features (attributes) such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features – It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
  • 98. Attribute Subset Selection • How can we find a ‘good’ subset of the original attributes? – For n attributes, there are 2^n possible subsets. – An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n increases. – Heuristic methods are commonly used for attribute subset selection. – These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time. – Such greedy methods are effective in practice and may come close to estimating an optimal solution.
  • 99. Attribute Subset Selection • Heuristic methods: – Step-wise forward selection – Step-wise backward elimination – Combining forward selection and backward elimination – Decision-tree induction • The “best” and “worst” attributes are typically determined using: – the tests of statistical significance, which assume that the attributes are independent of one another. – the information gain measure used in building decision trees for classification.
  • 100. Attribute Subset Selection • Stepwise forward selection: – The procedure starts with an empty set of attributes as the reduced set. – First: The best single-feature is picked. – Next: At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
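Stepwise forward selection can be sketched with scikit-learn's SequentialFeatureSelector (assumed available); the classifier, data set, and number of features to keep are illustrative choices, not part of the original material.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Start from an empty set and greedily add the best attribute at each step
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",       # use "backward" for stepwise elimination
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes
```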
  • 101. Attribute Subset Selection • Stepwise backward elimination: – The procedure starts with the full set of attributes. – At each step, it removes the worst attribute remaining in the set.
  • 102. Attribute Subset Selection • Combining forward selection and backward elimination: – The stepwise forward selection and backward elimination methods can be combined – At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
  • 103. Attribute Subset Selection • Decision tree induction: – Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. – At each node, the algorithm chooses the “best” attribute to partition the data into individual classes. – When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. – All attributes that do not appear in the tree are irrelevant.
  • 104. Attribute Subset Selection • Decision tree induction
  • 105. Numerosity Reduction • Reduce data volume by choosing alternative, smaller forms of data representation • There are several methods for storing reduced representations of the data include histograms, clustering, and sampling.
  • 106. Data Reduction: Sampling • Sampling: obtaining a small sample s to represent the whole data set N • Suppose that a large data set, D, contains N instances. • The most common ways that we could sample D for data reduction: – Simple random sample without replacement (SRSWOR) – Simple random sample with replacement (SRSWR) – Cluster sample – Stratified sample
  • 107. Data Reduction: Sampling • Simple random sample without replacement (SRSWOR) of size s: – SRSWOR is a method of selecting s units out of the N units one by one such that at any stage of selection, each of the remaining units has the same chance of being selected, i.e., 1/N. • Simple random sample with replacement (SRSWR) of size s: – SRSWR is a method of selecting s units out of the N units one by one such that at each stage of selection, each unit has an equal chance of being selected, i.e., 1/N.
  • 108. Data Reduction: Sampling • Procedure for selecting a random sample: • 1. Identify the N units in the population with the numbers 1 to N. • 2. Choose any random number arbitrarily in the random number table and start reading numbers. • 3. Choose the sampling unit whose serial number corresponds to the random number drawn from the table of random numbers. • 4. In the case of SRSWR, all the random numbers are accepted even if repeated more than once. • In the case of SRSWOR, if any random number is repeated, it is ignored and more numbers are drawn.
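A sketch of SRSWOR and SRSWR with NumPy; the population of 100 units and the sample size of 10 are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(1, 101)   # N = 100 units, numbered 1..N

srswor = rng.choice(population, size=10, replace=False)  # without replacement
srswr = rng.choice(population, size=10, replace=True)    # with replacement

print(srswor)  # 10 distinct units
print(srswr)   # 10 units, repeats possible
```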
  • 110. Data Reduction: Sampling • Stratified Sample: – This technique divides the elements of the population into small subgroups (strata) based on similarity, – in such a way that the elements within a group are homogeneous and heterogeneous with respect to the other subgroups formed. – Elements are then randomly selected from each of these strata. – We need prior information about the population to create the subgroups.
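A stratified-sample sketch with pandas; the stratum column, the toy data, and the 50% sampling fraction are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 11),
    "age_group":   ["youth"] * 4 + ["adult"] * 4 + ["senior"] * 2,
})

# Draw 50% of the rows from each stratum, keeping the group proportions
stratified = df.groupby("age_group").sample(frac=0.5, random_state=1)
print(stratified)
```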
  • 111. Data Reduction: Sampling Raw Data Stratified Sample
  • 112. Data Cube Aggregation • Used to aggregate data into a simpler form. • Example • Imagine that the information gathered for your analysis covers the years 2012 to 2014 and includes your company's revenue for every three months (per quarter). • If you are interested in annual sales rather than quarterly totals, the data can be aggregated so that the resulting data set summarizes the total sales per year instead of per quarter.
  • 113. Data Cube Aggregation • Sales data for a given branch of AllElectronics for the years 2002 to 2004.
  • 114. Data Cube Aggregation • Data cubes store multidimensional aggregated information. • Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining. • A data cube for sales at AllElectronics.
  • 115. Data Cube Aggregation • Base cuboid: – The cube created at the lowest level of abstraction is referred to as the base cuboid. – The base cuboid should correspond to an individual entity of interest, such as sales or customer. • Apex cuboid: – A cube at the highest level of abstraction is the apex cuboid. – For the sales data, the apex cuboid would give one total— the total sales.
  • 116. 116 Parametric Data Reduction: Regression and Log-Linear Models ■ In parametric methods, data are represented using some model. ■ Regression can be simple linear regression or multiple linear regression. ■ Simple linear regression involves only a single independent attribute. ■ Multiple linear regression involves multiple independent attributes. ■ In simple linear regression, the data are modeled to fit a straight line. ■ Ex.: ■ a random variable y can be modeled as a linear function of another random variable x with the equation y = ax + b, where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively.
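A minimal sketch of fitting the y = ax + b model with NumPy least squares; the data points are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # independent attribute
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])  # dependent attribute

# Fit y = a*x + b: only the two regression coefficients need to be stored,
# not the original data (parametric numerosity reduction)
a, b = np.polyfit(x, y, deg=1)

print(a, b)        # slope and y-intercept
print(a * 6 + b)   # estimate y for a new x value
```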
  • 117. 117 Parametric Data Reduction: Regression and Log-Linear Models • Log-Linear Model: Log-linear model can be used to estimate the probability of each data point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. • This allows a higher-dimensional data space to be constructed from lower- dimensional attributes. • Regression and log-linear model can both be used on sparse data, although their application may be limited.
  • 118. 118 Histogram Analysis ■ Histogram is the data representation in terms of frequency. ■ It uses binning to approximate data distribution ■ popular form of data reduction
  • 119. 119 Clustering ■ Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only ■ In data reduction, the cluster representation of the data are used to replace the actual data. ■ It also helps to detect outliers in data
  • 120. 120 Data Transformation ■ data transformed or consolidated into forms appropriate for mining ■ Methods ■ Smoothing: Remove noise from data ■ Attribute/feature construction ■ New attributes constructed from the given ones ■ Aggregation: Summarization, data cube construction ■ Normalization: Scaled to fall within a smaller, specified range ■ min-max normalization ■ z-score normalization ■ normalization by decimal scaling ■ Discretization: divide the range of continuous attribute into intervals.
  • 121. 121 Normalization • Min-max normalization: • This transforms the original data linearly. • Suppose that min_A is the minimum and max_A is the maximum value of an attribute A, and the target range is [new_min_A, new_max_A]. • Formula: v’ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A • where v is the original attribute value and v’ is the new value obtained after normalizing the old value. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709.
  • 122. 122 Normalization ■ Z-score normalization • In zero-mean normalization, the values of an attribute A are normalized based on the mean (μ) of A and its standard deviation (σ). • A value, v, of attribute A is normalized to v’ by computing v’ = (v − μ) / σ. • Ex. Let μ = 54,000 and σ = 16,000. Then the income value $73,000 used earlier is normalized to (73,000 − 54,000) / 16,000 ≈ 1.19. • Decimal Scaling: • It normalizes the values of an attribute by changing the position of their decimal point. • The number of positions by which the decimal point is moved is determined by the maximum absolute value of attribute A. • A value, v, of attribute A is normalized to v’ by computing v’ = v / 10^j, where j is the smallest integer such that Max(|v’|) < 1.
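The three normalization methods above, sketched in plain Python using the income figures from the slides:

```python
# Income value to normalize, with the statistics given above
v = 73_000.0
min_a, max_a = 12_000.0, 98_000.0   # observed minimum and maximum
mu, sigma = 54_000.0, 16_000.0      # mean and standard deviation

# Min-max normalization to the new range [0.0, 1.0]
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0

# Z-score (zero-mean) normalization
v_zscore = (v - mu) / sigma

# Decimal scaling: divide by 10**j, with j chosen so the largest |value| < 1
j = len(str(int(abs(max_a))))       # 98,000 has 5 digits -> j = 5
v_decimal = v / 10 ** j

print(round(v_minmax, 3))   # ~0.709
print(round(v_zscore, 3))   # ~1.188
print(v_decimal)            # 0.73
```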
  • 123. 123 Data Discretization Methods ■ Typical methods: All the methods can be applied recursively ■ Binning ■ Binning groups related values together in bins to reduce the number of distinct values. ■ Top-down split, unsupervised ■ Histogram analysis ■ partition the values for an attribute into disjoint ranges called buckets. ■ Top-down split, unsupervised(does not use class name)
  • 124. 124 Data Discretization Methods ■ Typical methods: All the methods can be applied recursively ■ Clustering analysis • Cluster analysis is a popular data discretization method. • A clustering algorithm can be applied to discretize a numeric attribute A by partitioning the values of A into clusters or groups. • Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy. ■ unsupervised, top-down split or bottom-up merge ■ Decision-tree analysis ■ supervised, top-down split ■ Correlation (e.g., χ2) analysis ■ unsupervised, bottom-up merge
  • 125. 125 Binning Methods for Data Smoothing ❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
  • 126. 126 Concept Hierarchy Generation ■ Concept hierarchy formation: Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior) ■ in the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. ■ Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers ■ Concept hierarchy can be automatically formed for both numeric and nominal data.
  • 127. 127 Concept Hierarchy Generation for Nominal Data ■ Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts ■ street < city < state < country ■ Specification of a hierarchy for a set of values by explicit data grouping ■ {Urbana, Champaign, Chicago} < Illinois ■ Specification of only a partial set of attributes ■ E.g., only street < city, not others ■ Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values ■ E.g., for a set of attributes: {street, city, state, country}
  • 128. 128 Automatic Concept Hierarchy Generation ■ Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set ■ The attribute with the most distinct values is placed at the lowest level of the hierarchy ■ Exceptions, e.g., weekday, month, quarter, year ■ E.g.: street (674,339 distinct values) < city (3,567 distinct values) < province_or_state (365 distinct values) < country (15 distinct values)
  • 129. Data cube • The grouping of data in a multidimensional matrix is called a data cube. • A data cube is generally used to easily interpret data. • It is especially useful when representing data together with dimensions as certain measures of business. • It is an extension of the 2-dimensional matrix (rows and columns) to more dimensions.
  • 130. Data cube • We often need to abstract the relevant or important data from a large, complex data set; this is where the data cube comes into the picture. • A data cube is basically used to represent specific information to be retrieved from a huge set of complex data. • e.g.: purchasing in a shopping mall
  • 131. Data cube: Types • The data cube can be classified into two categories: • Multidimensional data cube: • It helps in storing large amounts of data by making use of a multi-dimensional array. • It increases its efficiency by keeping an index on each dimension, and is thus able to retrieve data quickly. • Relational data cube: • It helps in storing large amounts of data by making use of relational tables. • Each relational table displays the dimensions of the data cube. • It is slower compared to a multidimensional data cube.
  • 132. Data Cube: Characteristics • It can go far beyond two dimensions to include many more dimensions. • It improves business strategies by supporting analysis of all the data. • It helps to capture the latest market scenario by establishing trends and performance analysis. • It plays a pivotal role by creating intermediate data cubes to serve requirements and to bridge the gap between the data warehouse and the reporting tools.
  • 133. Data Cube: Benefits • Increases the productivity of an enterprise. • Improves overall performance and efficiency. • Representation of huge and complex data sets gets simplified and streamlined. • Huge databases and complex SQL queries also become manageable. • Indexing and ordering provide the best set of data for analysis and data mining techniques. • Faster and more easily accessible, as it possesses pre-defined and pre-calculated data sets (data cubes).
  • 134. Data Cube: Benefits • Aggregation of data makes access to all data very fast at each micro-level, which ultimately leads to easy and efficient maintenance and reduced development time. • OLAP on data cubes helps in getting fast response times, a fast learning curve, a versatile environment, reach to a wide range of applications, fewer resources needed for deployment, and less wait time with quality results.
  • 135. Statistical Descriptions of Data  Statistics help in identifying patterns that in turn help distinguish random noise from significant findings.  Descriptive statistics are used to describe or summarize data in ways that are meaningful and useful.  For data preprocessing to be successful, it is essential to have an overall picture of your data.  Statistical descriptions are used to identify properties of the data and highlight which data values should be treated as noise or outliers.
  • 136. Statistical Descriptions of Data • Statistics is actually a form of mathematical analysis. • It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. • Statistics deals with how data can be used to solve complex problems. • Statistics makes work easy and simple and provides a clear and clean picture of the work you do on a regular basis. • Basic terminology of Statistics: • Population: a collection or set of individuals, objects, or events whose properties are to be analyzed.
  • 137. Statistical Descriptions of Data Descriptive statistics uses data to provide a description of the population, either through numerical calculations or graphs or tables. It provides a graphical summary of data. • Measure of central tendency • Measure of variability Measures of central tendency are summary statistics used to represent the center point or a typical value of a data set or sample set. (i) Mean: the average of all values in a sample set. For example, the mean of 1, 3, 5, 6, 7 is (1 + 3 + 5 + 6 + 7) / 5 = 4.4.
  • 138. Statistical Descriptions of Data (ii) Median: the central value of a sample set. The data set is ordered from lowest to highest value and the exact middle value is taken. For example, the median of 1, 3, 5, 6, 7 is 5.
  • 139. Statistical Descriptions of Data (iii) Mode: the value that occurs most frequently in the sample set. The value repeated most of the time in the data set is the mode. For example, the mode of 1, 3, 3, 5, 6 is 3.
  • 140. Statistical Descriptions of Data • Measure of Variability – Measure of variability is also known as measure of dispersion and helps to understand the distribution of the data. • Three common measures of variability: • (i) Range: a measure of how spread apart the values in a data set are. Range = Maximum value − Minimum value; e.g., for 1, 3, 5, 6, 7, Range = 7 − 1 = 6. • (ii) Variance: measures how far each number in the set is from the mean. s² = (1/n) Σᵢ₌₁..ₙ (xᵢ − x̄)², where n is the total number of data points, x̄ is the mean of the data points, and xᵢ is an individual data point. • (iii) Standard deviation (a measure of dispersion): describes how spread out a set of data is. σ = √[ (1/n) Σᵢ₌₁..ₙ (xᵢ − μ)² ]
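The central-tendency and variability measures above can be computed directly with Python's standard statistics module; the sample values are the ones used in the range example (with a small made-up set for the mode).

```python
import statistics

data = [1, 3, 5, 6, 7]

print(statistics.mean(data))             # 4.4    -- mean
print(statistics.median(data))           # 5      -- median
print(statistics.mode([1, 3, 3, 5, 6]))  # 3      -- mode of a set with a repeat
print(max(data) - min(data))             # 6      -- range
print(statistics.pvariance(data))        # 4.64   -- population variance (divide by n)
print(statistics.pstdev(data))           # ~2.154 -- population standard deviation
```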