18CSE355T - DATA MINING AND ANALYTICS
COURSE LEARNING RATIONALE (CLR)
The purpose of learning this course is to:
CLR -1: Understand the concepts of Data Mining
CLR -2: Familiarize with Association rule mining
CLR -3: Familiarize with various Classification algorithms
CLR -4: Understand the concepts of Cluster Analysis
CLR -5: Familiarize with Outlier analysis techniques
CLR -6: Familiarize with applications of Data mining in
different domains
COURSE LEARNING OUTCOMES (CLO)
At the end of this course, learners will be able to:
CLO -1: Gain knowledge about the concepts of Data Mining
CLO -2: Understand and Apply Association rule mining
techniques
CLO -3: Understand and Apply various Classification algorithms
CLO -4: Gain knowledge on the concepts of Cluster Analysis
CLO -5: Gain knowledge on Outlier analysis techniques
CLO -6: Understand the importance of applying Data mining
concepts in different domains
LEARNING RESOURCES
TEXT BOOKS
1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann Publishers, 2011.
UNIT I
INTRODUCTION
Why Data mining? - What is Data mining? - Kinds of data meant for mining - Kinds of patterns that can be mined - Applications suitable for data mining - Issues in Data mining - Data objects and Attribute types - Statistical descriptions of data - Need for data preprocessing and data quality - Data cleaning - Data integration - Data reduction - Data transformation - Data cube and its usage
Why Data Mining?
• The Explosive Growth of Data: from terabytes (1000^4 bytes) to petabytes (1000^5 bytes)
– Data collection and data availability
• Automated data collection tools, database systems, web
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: bioinformatics, scientific simulation, medical research …
• Society and everyone: news, digital cameras, …
Why Data Mining?
 The abundance of data, coupled with the need for powerful data analysis
tools, has been described as a data rich but information poor situation.
 The fast-growing, tremendous amount of data, collected and stored in large
and numerous data repositories, has far exceeded our human ability for
comprehension without powerful tools.
 As a result, data collected in large data repositories become "data tombs": data
archives that are seldom visited.
 Important decisions are often made based not on the information-rich data
stored in data repositories but rather on a decision maker’s intuition, simply
because the decision maker does not have the tools to extract the valuable
knowledge embedded in the vast amounts of data.
Why Data Mining?
 Efforts have been made to develop expert systems and knowledge-based
technologies, which typically rely on users or domain experts to manually
input knowledge into knowledge bases.
 Unfortunately, however, the manual knowledge input procedure is prone to
biases and errors and is extremely costly and time consuming.
 The widening gap between data and information calls for the systematic
development of data mining tools that can turn data tombs into “golden
nuggets” of knowledge.
Evolution of Database Technology
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data.
– Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data.
– The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system
dynamically.
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Potential Applications
• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM),
market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality
control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other Applications
– Text mining (news group, email, documents) and Web mining
– Stream data mining
– Bioinformatics and bio-data analysis
Ex.: Market Analysis and Management
• Where does the data come from?—Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, surveys …
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest,
income level, spending habits, etc.,
• E.g. Most customers with income level 60k – 80k with food expenses $600 -
$800 a month live in that area
– Determine customer purchasing patterns over time
• E.g. Customers who are between 20 and 29 years old, with income of 20k –
29k usually buy this type of CD player
• Cross-market analysis—Find associations/co-relations between product sales, &
predict based on such association
– E.g. Customers who buy computer A usually buy software B
Knowledge Discovery (KDD) Process
7 Steps of the KDD Process
– Data cleaning (remove noise and inconsistent data)
– Data integration (multiple data sources maybe combined)
– Data selection (data relevant to the analysis task are retrieved from database)
– Data transformation (data transformed or consolidated into forms appropriate for
mining)
(Done with data preprocessing)
– Data mining (an essential process where intelligent methods are applied to extract
data patterns)
– Pattern evaluation (identify the truly interesting patterns)
– Knowledge presentation (presentation of the mined knowledge to the user using visualization techniques such as trees, tables, rules, graphs, charts, matrices)
Data Mining and Business Intelligence
A layered view: the potential to support business decisions increases from bottom to top, and each layer is served by a different role.
– Decision Making (End User)
– Data Presentation: visualization techniques (Business Analyst)
– Data Mining: information discovery (Data Analyst)
– Data Exploration: statistical summary, querying, and reporting (Data Analyst)
– Data Preprocessing/Integration, Data Warehouses (DBA)
– Data Sources: paper, files, Web documents, scientific experiments, database systems
On What Kinds of Data Can Mining Be Performed?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Object-Relational Databases
– Temporal Databases, Sequence Databases, Time-Series databases
– Spatial Databases and Spatiotemporal Databases
– Text databases and Multimedia databases
– Heterogeneous Databases and Legacy Databases
– Data Streams
– The World-Wide Web
Relational Databases
• DBMS – database management system, contains a collection of
interrelated databases
e.g. Faculty database, student database, publications database
• Each database contains a collection of tables and functions to
manage and access the data.
e.g. student_bio, student_graduation, student_parking
• Each table contains columns and rows, with columns as attributes of data and
rows as records.
• Tables can be used to represent the relationships between or among multiple
tables.
Relational Databases (2) – AllElectronics store
Relational Databases (3)
• With a relational query language, e.g. SQL, we will be able to find answers to
questions such as:
 How many items were sold last year?
 Who has earned commissions higher than 10%?
 What were the total sales last month for Dell laptops?
• When data mining is applied to relational databases, we can search for trends or data
patterns.
• For example, data mining systems can analyze customer data to predict the credit risk of new
customers based on their income, age, and previous credit information.
Data Warehouses
• A repository of information collected from multiple sources, stored
under a unified schema, and that usually resides at a single site.
• Constructed via a process of data cleaning, data integration, data
transformation, data loading and periodic data refreshing.
Data Warehouses (2)
• Modelled by a multidimensional data structure, called a data cube, in
which each dimension corresponds to an attribute or a set of attributes.
• Each cell stores the value of some aggregate measure, such as count.
• The data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.
• Data are organized around major subjects, e.g. customer, item, supplier
and activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10
years)
• Typically summarized to a higher level (e.g. a summary of the
transactions per item type for each store)
• User can perform drill-down or roll-up operation to view the data at
different degrees of summarization
Data Warehouses (3)
Transactional Databases
• It captures a transaction, such as a flight booking, a customer's purchase, or a
user's click on a web page.
• Consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction ID and a list of the
items making up the transaction.
• Either stored in a flat file or unfolded into relational tables
• Easy to identify items that are frequently sold together
Transactional Databases
 "Which items sold well together?"
 This kind of market basket data analysis would enable you to bundle
groups of items together as a strategy for boosting sales.
 e.g. purchasing a computer along with a printer.
 A traditional database system is not able to perform market basket data
analysis.
 Data mining on transactional data can do so by mining frequent item sets
that is, sets of items that are frequently sold together.
Data Mining Functionalities
What kinds of patterns can be mined?
• Data Mining Functionalities are used to specify the kinds of patterns to be found
in data mining tasks.
• Two kinds of tasks: descriptive and predictive.
• Descriptive mining tasks characterize the general properties of the data.
• Predictive mining tasks perform inference on the current data in order to make predictions.
Types of Data Mining Functionalities
• Concept/Class Description
• Mining Frequent Patterns, Associations, and Correlations
• Classification and Regression for Predictive Analysis
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Data Mining Functionalities
- What kinds of patterns can be mined?
1. Concept/Class Description
Data can be associated with classes or concepts.
•E.g. classes of items – computers, printers, …
concepts of customers – bigSpenders, budgetSpenders, …
It can be useful to describe individual classes and concepts in summarized, concise,
and yet precise terms.
Data characterization – summarizing the general characteristics of a target class of
data.
–E.g. summarizing the characteristics of customers who spend more than $1,000 a
year at AllElectronics.
The result can be a general profile of the customers, e.g. they are 40–50 years old, employed, and have excellent credit ratings.
1.4 Data Mining Functionalities
- What kinds of patterns can be mined?
•Data discrimination – comparing the target class with one or a set of
comparative classes.
–E.g. compare the general features of software products whose sales increased by
10% in the last year
–with those whose sales decreased by 30% during the same period.
2.Mining Frequent Patterns, Associations and
Correlations
Mining Frequent Patterns ( patterns that occur frequently in data ).
Kinds of frequent patterns
– Frequent item set: a set of items that frequently appear together in a
transactional data set (e.g. milk and bread)
– Frequent subsequence: A frequently occurring subsequence, such as the
pattern that customers, tend to purchase first a laptop, followed by a digital
camera, and then a memory card, is a (frequent) sequential pattern.
– Frequent Substructures: A substructure can refer to different structural forms
(e.g., graphs, trees, or lattices) that may be combined with item sets or
subsequences.
– If a substructure occurs frequently, it is called a (frequent) structured pattern.
– Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
1.4 Data Mining Functionalities
- What kinds of patterns can be mined?
– Association Analysis: find frequent patterns
• E.g. a sample analysis result – an association rule:
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence
= 50%]
if a customer buys a computer, there is a 50% chance that she will buy
software. 1% of all of the transactions under analysis showed that
computer and software are purchased together.
• Association rules are discarded as uninteresting if they do not satisfy
both a minimum support threshold and a minimum confidence threshold.
– Correlation Analysis: additional analysis to find statistical correlations
between associated pairs
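For concreteness, here is a minimal Python sketch (with a hypothetical five-transaction data set) of how the support and confidence of a rule such as buys(computer) => buys(software) are computed:

```python
# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"computer"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(lhs => rhs) = support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"computer", "software"}, transactions))      # 0.4 (support of the rule)
print(confidence({"computer"}, {"software"}, transactions))  # 0.5 (confidence of the rule)
```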
What kinds of patterns can be mined?
3.Classification and Prediction for predictive analysis
– Classification
• It is a data analysis task, i.e. the process of finding a model that
describes and distinguishes data classes and concepts
• The goal of classification is to accurately predict the target class for each
case in the data.
• The model can be represented in classification (IF-THEN) rules,
decision trees, neural networks, etc.
What kinds of patterns can be mined?
3.Classification and Prediction for predictive analysis
A decision tree is a flowchart-like tree structure, where each node denotes a
test on an attribute value, each branch represents an outcome of the test, and
tree leaves represent classes or class distributions.
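A minimal sketch, using scikit-learn and synthetic customer records (the feature encoding below is assumed, not from the text), of learning such a decision tree and printing it as IF-THEN style rules:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded customer records: [age, income]; label 1 = buys_computer.
X = [[25, 30000], [45, 60000], [35, 80000], [22, 20000], [50, 90000], [28, 40000]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # flowchart-like test structure
print(tree.predict([[40, 70000]]))                         # predicted class for a new customer
```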
What kinds of patterns can be mined?
3.Classification and Prediction for predictive analysis
• A neural network, when used for classification, is typically a collection of
neuron-like processing units with weighted connections between the units.
• Neural networks are used for effective data mining in order to turn
raw data into useful information.
• Neural networks look for patterns in large batches of data, allowing
businesses to learn more about their customers, which helps direct their
marketing strategies, increase sales, and lower costs.
What kinds of patterns can be mined?
Classification and Regression
 Regression is used to predict missing or unavailable numerical data values
rather than (discrete) class labels.
 The term prediction refers to both numeric prediction and class label
prediction.
 Regression analysis is a statistical methodology that is most often used for
numeric prediction, although other methods exist as well.
 Regression also encompasses the identification of distribution trends based
on the available data.
 Classification and regression may need to be preceded by relevance
analysis, which attempts to identify attributes that are significantly relevant
to the classification and regression process.
 Such attributes will be selected for the classification and regression
process. Other attributes, which are irrelevant, can then be excluded from
consideration
What kinds of patterns can be mined?
4.Cluster Analysis
– Clustering can be used to generate class labels for a group of data.
– Clusters of objects are formed based on the principle of maximizing intra-
class similarity & minimizing interclass similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters
may
represent individual target groups for marketing.
What kinds of patterns can be mined?
5. Outlier Analysis: identify objects that deviate from the general behavior of the data
– A data set may contain objects that do not comply with the general behavior or
model of the data. These data objects are outliers
– Outliers are usually discarded as noise or exceptions.
– Useful for fraud detection, where an outlier may indicate fraudulent activity.
• E.g. Detect purchases of extremely large amounts
6.Evolution Analysis
– Describes and models regularities or trends for objects whose behavior
changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the
stocks of particular companies.
What kinds of patterns can be mined?
Are All of the Patterns Interesting?
• Data mining may generate thousands of patterns: Not all of them
are interesting
• A pattern is interesting if it is
– easily understood by humans
– valid on new or test data with some degree of certainty,
– potentially useful
– novel
– validates some hypothesis that a user seeks to confirm
• An interesting pattern represents knowledge!
What kinds of patterns can be mined?
Are All of the Patterns Interesting?
• Objective measures
– Based on statistics and structures of patterns, e.g., support, confidence, etc. (Rules
that do not satisfy a threshold are considered uninteresting.)
• Subjective measures
– Reflect the needs and interests of a particular user.
• E.g. A marketing manager is only interested in characteristics of customers who shop
frequently.
– Based on user’s belief in the data.
• e.g., Patterns are interesting if they are unexpected, or can be used for strategic planning, etc
• Objective and subjective measures need to be combined.
Major Issues in Data Mining
There are many challenging issues in data mining research.
Areas include
i. Mining methodology
ii. User interaction
iii. Efficiency and scalability
iv. Dealing with diverse data types
v. Data mining and society
Major Issues in Data Mining
■ Mining Methodology
■ Mining various and new kinds of knowledge
■ Mining knowledge in multi-dimensional space
■ Data mining: An interdisciplinary effort
■ Boosting the power of discovery in a networked environment
■ Handling noise, uncertainty, and incompleteness of data
■ Pattern evaluation and pattern- or constraint-guided mining
Major Issues in Data Mining
■ Mining Methodology
■ Mining various and new kinds of knowledge
 Different users may use the same database in different ways, which requires the development of
numerous data mining techniques.
 Due to the diversity of applications, new mining tasks continue to emerge,
making data mining a dynamic and fast-growing field.
e.g.
for effective knowledge discovery in information networks, integrated clustering
and ranking may lead to the discovery of high-quality clusters and object ranks
in large networks.
Major Issues in Data Mining
■ Mining Methodology
■ Mining knowledge in multi-dimensional space
■ When searching for knowledge in large data sets, we can explore the data in
multidimensional space.
■ That is, we can search for interesting patterns among combinations of
dimensions (attributes) at varying levels of abstraction. Such mining is
known as (exploratory) multidimensional data mining.
■ In many cases, data can be aggregated or viewed as a multidimensional data
cube.
■ Mining knowledge in cube space can enhance the power and flexibility of
data mining.
Major Issues in Data Mining
Data mining an interdisciplinary effort:
The power of data mining can be enhanced by integrating new methods from
multiple disciplines.
E.g. - mining data in natural language text
- the mining of software bugs in large programs (bug mining), which requires
software engineering knowledge.
Boosting the power of discovery in a networked environment:
• Most data objects reside in a linked or interconnected environment,
whether it be the Web, database relations, files, or documents.
• Semantic links across multiple data objects can be used to advantage in
data mining.
• Knowledge derived from one set of objects can be used to aid the discovery of
knowledge in a "related" set of objects.
Major Issues in Data Mining
Handling uncertainty, noise, or incompleteness of data:
 Data often contain noise, errors, exceptions, or uncertainty, or are incomplete.
 Errors and noise may confuse the data mining process, leading to the derivation
of erroneous patterns.
 Data cleaning, data preprocessing, outlier detection and removal are examples
of techniques that need to be integrated with the data mining process.
Pattern evaluation and pattern- or constraint-guided mining:
 Not all the patterns generated by data mining processes are interesting.
 What makes a pattern interesting may vary from user to user.
 Therefore, techniques are needed to assess the interestingness of discovered
patterns based on subjective measures.
Major Issues in Data Mining
ii)User Interaction
 The user plays an important role in the data mining process.
 Interesting areas of research include how to interact with a data mining system,
how to incorporate a user’s background knowledge in mining, and how to
visualize data mining results.
Interactive mining:
 The data mining process should be highly interactive.
 Thus, it is important to build flexible user interfaces.
Incorporation of background knowledge:
 Background knowledge, constraints, rules, and other information regarding the
domain should be incorporated into the knowledge discovery process.
 Such knowledge can be used for pattern evaluation and guide the search to find
interesting patterns.
Major Issues in Data Mining
Presentation and visualization of data mining results:
 How can a data mining system present data mining results flexibly?
 This is especially crucial if the data mining process is interactive.
 It requires the system to adopt expressive knowledge representations, user
friendly interfaces, and visualization techniques.
iii)Efficiency and Scalability
 Efficiency and scalability are always considered when comparing data
mining algorithms.
 As data amounts continue to multiply, these two factors are especially
critical.
Major Issues in Data Mining
Efficiency and scalability of data mining algorithms:
 Data mining algorithms must be efficient and scalable in order to
effectively extract information from huge amounts of data .
 The running time of a data mining algorithm must be predictable, short, and
acceptable by applications.
 Efficiency, scalability, performance, optimization, and the ability to execute
in real time are key criteria that drive the development of many new data
mining algorithms.
Parallel, distributed, and incremental mining algorithms:
 The parallel processes may interact with one another. The patterns from
each partition are eventually merged.
Major Issues in Data Mining
iv)Diversity of Database Types
The wide diversity of database types brings about challenges to data mining.
Handling complex types of data
Mining dynamic, networked, and global data repositories
Handling complex types of data
 The construction of effective and efficient data mining tools for diverse
applications remains a challenging and active area of research.
Mining dynamic, networked, and global data repositories
Multiple sources of data are connected by the Internet and various kinds of
networks, forming distributed, and heterogeneous global information systems
and networks.
Major Issues in Data Mining
 The discovery of knowledge from different sources of structured, semi-
structured, or unstructured and interconnected data with diverse data
semantics poses great challenges to data mining.
 Web mining, multisource data mining, and information network mining
have become challenging and fast-evolving data mining fields.
v)Data Mining and Society
How does data mining impact society?
Social impacts of data mining:
 The improper use of data and the potential violation of individual privacy
and data protection rights are areas of concern that need to be addressed.
Major Issues in Data Mining
Privacy-preserving data mining:
 Data mining will help scientific discovery, business management, economic
recovery, and security protection (e.g., the real-time discovery of intruders
and cyberattacks).
 However, it poses the risk of disclosing an individual’s personal
information.
 We need to observe data sensitivity and preserve people's privacy while performing
successful data mining.
Major Issues in Data Mining
Invisible data mining:
 We cannot expect everyone in society to learn and master data mining
techniques.
 More and more systems should have data mining functions built within so
that people can perform data mining or use data mining results simply by
mouse clicking, without any knowledge of data mining algorithms.
 Intelligent search engines and Internet-based stores perform such invisible
data mining by incorporating data mining into their components to improve
their functionality and performance. This is done often unbeknownst to the
user.
 For example, when purchasing items online, users may be unaware that the
store is likely collecting data on the buying patterns of its customers, which
may then be used to recommend other items of interest.
Data Objects and Attribute Types
A data object is a region of storage that contains a value or group of values.
Data sets are made up of data objects.
A data object represents an entity
• in a sales database, the objects may be customers, store items, and sales;
• in a university database, the objects may be students, professors, and courses.
Rows -> data objects; Columns -> attributes
Data objects are typically described by attributes.
Data objects can also be referred to as samples, examples, instances, data points,
or objects.
If the data objects are stored in a database, they are data tuples.
That is, the rows of a database correspond to the data objects, and the columns
correspond to the attributes.
Data Objects and Attribute Types
An attribute is a data field, representing a characteristic or feature of a
data object.
The type of an attribute is determined by the set of possible values: nominal,
binary, ordinal, or numerical
• Nominal Attributes provide only enough information to distinguish one
object from another.
 - relating to names
 e.g. hair_color = {brown, black, white}
• e.g. Student Roll No.
• Ordinal Attribute:
• Values have a meaningful order.
The ordinal attribute value provides sufficient information to order the
objects.
e.g. pizza_size = {small, medium, large}
Rankings, grades, height
Data Objects and Attribute Types
• Binary Attribute:
These have only two states, 0 and 1, where 0 indicates the absence of a feature and 1
indicates its presence.
• Binary attributes are qualitative (e.g. smoker = {0, 1}).
• Numeric attribute: It is quantitative, i.e. a measurable quantity, represented
in integer or real values.
Two types
i) Interval-scaled attribute:
It is measured on a scale of equal-size units.
These attributes allow us to compare values, such as temperature in °C or °F; thus
the values of the attribute have an order.
ii) Ratio-scaled attribute:
Both differences and ratios (multiples) are meaningful.
e.g. age, length, weight.
e.g. we can say that 10 is twice (a multiple of) 5.
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
Why Data Preprocessing?
Data Quality:
Data have quality if they satisfy the requirements of the intended use.
Factors comprising quality
• accuracy
• completeness
• consistency
• timeliness
• believability
• interpretability
• Accessibility
e.g.: Analyzing branch sales at the AllElectronics store illustrates
three of the elements defining data quality:
• Accuracy
• completeness
• consistency
Why Data Preprocessing?
Data Quality:
Reasons for inaccurate data(having incorrect attribute value)
• The data collection instruments used may be faulty.
• There may have been human or computer errors occurring at data entry.
• Users may purposely submit incorrect data values for mandatory fields when they do not wish to disclose personal information (e.g. choosing the default value "January 1" for birthday).
• This is known as disguised missing data.
 There may be technology limitations such as limited buffer size for
coordinating synchronized data transfer and consumption.
Why Data Preprocessing?
Data Quality:
Reason for Incorrect data
• inconsistencies in naming conventions or data codes, or inconsistent formats for
input fields (e.g., date).
• Duplicate tuples also require data cleaning.
Reasons for Incomplete data
• Missing the Attribute information.
 e.g customer information for sales transaction data.
• Relevant data may not be recorded due to a misunderstanding.
• Data that were inconsistent with other recorded data, may have been deleted.
Why Data Preprocessing?
Data Quality:
Timeliness also affects data quality.
e.g: monthly sales bonuses to the top sales representatives at All Electronics
Store.
2 other factors affect quality
Believability reflects how much the data are trusted by users.
Interpretability reflects how easy the data are understood
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, files, or notes
• Data transformation
– Normalization (scaling to a specific range)
– Aggregation
• Data reduction
– Obtains a reduced representation in volume that produces the same or similar
analytical results
– Data discretization: of particular importance, especially for numerical data
– Data aggregation, dimensionality reduction, data compression, generalization
Forms of data preprocessing
Data Cleaning
■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
■ incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
■ e.g., Occupation=“ ” (missing data)
■ noisy: containing noise, errors, or outliers
■ e.g., Salary=“−10” (an error)
■ inconsistent: containing discrepancies in codes or names, e.g.,
■ Age=“42”, Birthday=“03/07/2010”
■ Was rating “1, 2, 3”, now rating “A, B, C”
■ discrepancy between duplicate records
■ Intentional (e.g., disguised missing data)
■ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
■ Data is not always available
■ E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
■ Missing data may be due to
■ equipment malfunction
■ inconsistent with other recorded data and thus deleted
■ data not entered due to misunderstanding
■ certain data may not be considered important at the time of entry
■ not register history or changes of the data
■ Missing data may need to be inferred
Incomplete (Missing) Data
Methods for data cleaning:
1.Missing values
2.Noisy data
3. Data cleaning as a process.
1.How to Handle Missing Data?
■ Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
■ Fill in the missing value manually: this approach is time consuming and may
not be feasible given a large data set with many missing values.
■ Use a global constant to fill in the missing value:
 Replace all missing attribute values by the same constant such as a label like
“Unknown” or −∞.
 If missing values are replaced by, “Unknown,” then the mining program may
mistakenly think that they form an interesting concept, since they all have a
value in common—that of “Unknown.”
 Hence, although this method is simple, it is not foolproof.
1.How to Handle Missing Data?
■ Use the attribute mean or median for all samples belonging to the same
class as the given tuple:
 For example, if classifying customers according to credit risk, we may replace
the missing value with the mean income value for customers in the same credit
risk category as that of the given tuple.
 If the data distribution for a given class is skewed, the median value is a better
choice.
How to Handle Missing Data?
■ Use the most probable value to fill in the missing value:
 This may be determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction.
 For example, using the other customer attributes in your data set, you may
construct a decision tree to predict the missing values for income.
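A minimal pandas sketch (column names are hypothetical) of the class-mean strategy from the previous slide: each missing income is filled with the mean income of tuples in the same credit-risk class.

```python
import pandas as pd

# Hypothetical customer data with a missing income value.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [60000, 58000, 25000, None, 62000],
})

# Replace each missing income with the mean income of its credit-risk class.
df["income"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)
print(df)   # the missing "high" income becomes 25000, the class mean
```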
2.Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments
■ data entry problems
■ data transmission problems
■ technology limitation
■ inconsistency in naming convention
■ Other data problems which require data cleaning
■ duplicate records
■ incomplete data
■ inconsistent data
How to Handle Noisy Data?
■ Binning
■ Binning is a way to group a
number of more or less
continuous values into a smaller
number of "bins".
■ The sorted values are distributed
into a number of “buckets,” or
bins .
■ For example, if you have data about a
group of people, you might want
to arrange their ages into a
smaller number of age intervals.
Sorted data for price (in dollars): 4,
8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency)
bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
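A minimal Python sketch reproducing the equal-frequency binning and both smoothing variants shown above:

```python
prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = [prices[i:i + 3] for i in range(0, len(prices), 3)]   # equal-frequency bins of size 3

means = [[round(sum(b) / len(b))] * len(b) for b in bins]    # smoothing by bin means
bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
          for b in bins]                                     # smoothing by bin boundaries

print(bins)    # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```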
How to Handle Noisy Data?
Regression:
 Regression is a technique used to model and analyze the relationships
between variables.
 Predicts the value.
 Ex. Predict the children height ,given their age ,weight and other
factors.
 Linear regression involves finding the “best” line to fit two attributes
so that one attribute can be used to predict the other.
 Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
Outlier analysis:
 Outliers may be detected by clustering,
 for example, where similar values are organized into groups, or "clusters";
 values that fall outside of the set of clusters may be considered outliers.
Data Cleaning as a Process
■ Data cleaning is usually performed as an iterative two-step process consisting
of discrepancy detection and data transformation.
■ First step is Data discrepancy detection.
 Discrepancies can be caused by several factors
 poorly designed data entry forms that have many optional fields.
 human error in data entry
 deliberate errors
e.g., respondents not wanting to divulge information about themselves.
 Other sources of discrepancies include errors in instrumentation devices that
record data and system errors.
 Errors can also occur when the data are used for purposes other than
originally intended.
 There may also be inconsistencies due to data integration.
Data Cleaning as a Process
■ "How can we proceed with discrepancy detection?"
■ Data discrepancy detection
Data auditing tools find discrepancies by analyzing the data to discover rules and
relationships, and detecting data that violate such conditions.
■ Use metadata
e.g., domain, range, dependency
Check field overloading
■ Check the uniqueness rule, consecutive rule, and null rule to examine the data.
■ Use commercial tools
■ Data scrubbing tools use simple domain knowledge to detect errors and make
corrections.
e.g., postal codes, spell-checking
■ Data auditing tools find discrepancies by analyzing data to discover rules and
relationships and to detect violators.
■ e.g., correlation and clustering to find outliers
Data Cleaning as a Process
■ Once discrepancies are found, we need to define and apply transformations to correct them.
■ Data migration and integration
■ Data migration tools allow simple transformations to be specified such as
to replace the string “gender” by “sex.”
■ ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface.
■ The two-step process of discrepancy detection and data transformation
iterates.
Data Cleaning as a Process
 This iterative process is error-prone and time consuming; as a result, the entire data cleaning process suffers from a lack of interactivity.
 New approaches to data cleaning emphasize increased interactivity.
 Potter's Wheel is a publicly available data cleaning tool that integrates
discrepancy detection and transformation.
 The tool automatically performs discrepancy checking in the background on the
latest transformed view of the data.
 Users can gradually develop and refine transformations as discrepancies are
found, leading to more effective and efficient data cleaning.
 For data transformation, declarative extensions to SQL and supporting algorithms enable users
to express data cleaning specifications efficiently.
 As more is learned about the data, it is important to keep updating the metadata to reflect this knowledge. This will
help speed up data cleaning on future versions of the same data store.
Data Integration
■ Data integration:
■ Merging of data from multiple sources into a coherent store
How can we match schema and objects from different sources?
1.Entity identification problem
2. Redundancy and correlation analysis
3.Tuple duplication
4.Data value conflict detection and resolution
Data Integration
1.Entity identification problem
How can equivalent real world entities from multiple data sources be matched up?
■ Identify real world entities from multiple data sources,
e.g., cust_id in one database and customer_number in another may refer to the same attribute.
 Metadata can be used to help avoid errors in schema integration.
Redundancy and Correlation Analysis
■ An attribute may be redundant if it can be “derived” from another attribute or
set of attributes.
■ E.g: annual revenue
■ Inconsistencies in attribute or dimension naming can also cause redundancies
in the resulting data set.
■ Redundant attributes can be detected by correlation analysis and covariance analysis.
■ Given two attributes, such analysis can measure how strongly one attribute
implies the other, based on the available data.
■ For nominal data, use the χ 2 (chi-square) test.
■ For numeric attributes, we can use the correlation coefficient and covariance, both
of which assess how one attribute's values vary from those of another.
■ Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality.
Correlation Analysis for Nominal Data
■ A correlation relationship between two attributes, A and B, can be discovered
by a χ² (chi-square) test:
   χ² = Σ (observed − expected)² / expected,
   where the expected count of a cell is (row total × column total) / grand total.
■ The larger the χ² value, the more likely the variables are related.
■ The cells that contribute the most to the χ² value are those whose actual
count is very different from the expected count.
■ Correlation does not imply causality
■ # of hospitals and # of car thefts in a city are correlated
■ Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
■ χ² (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories)

                            Play chess   Not play chess   Sum (row)
  Like science fiction       250 (90)      200 (360)        450
  Not like science fiction    50 (210)    1000 (840)       1050
  Sum (col.)                 300          1200             1500

■ The result shows that like_science_fiction and play_chess are correlated in the group
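A minimal Python sketch that derives the expected counts and the χ² statistic for the table above; the result (about 507.93) far exceeds the 0.001 significance threshold, supporting the conclusion that the two attributes are correlated.

```python
observed = [[250, 200],   # like science fiction:     [play chess, not play chess]
            [50, 1000]]   # not like science fiction

row_sums = [sum(r) for r in observed]        # [450, 1050]
col_sums = [sum(c) for c in zip(*observed)]  # [300, 1200]
total = sum(row_sums)                        # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / total  # expected count, e.g. 450*300/1500 = 90
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))  # 507.93 -> well above 10.828, the 0.001 critical value for 1 degree
                       # of freedom, so the hypothesis of independence is rejected
```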
Correlation Analysis for Numeric Data
■ Correlation coefficient (also called Pearson's product-moment coefficient):

   r_A,B = Σ (a_i − Ā)(b_i − B̄) / (n · σ_A · σ_B) = (Σ a_i·b_i − n·Ā·B̄) / (n · σ_A · σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and
B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i·b_i is the
sum of the AB cross-product.
■ If r_A,B > 0, A and B are positively correlated (A's values increase
as B's do). The higher the value, the stronger the correlation.
■ r_A,B = 0: independent (uncorrelated); r_A,B < 0: negatively correlated (the values of
one attribute increase as the values of the other attribute decrease)
Correlation Analysis for Numeric Data
• correlation does not imply causality.
• That is, if A and B are correlated, this does not necessarily imply that A
causes B or that B causes A.
• For example, in analyzing a demographic database, we may find that
attributes representing the number of hospitals and the number of car thefts
in a region are correlated.
• This does not mean that one causes the other. Both are actually causally
linked to a third attribute, namely, population
Covariance of Numeric Data
■ Covariance refers to the measure of the directional relationship between two
random variables:

   Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ (a_i − Ā)(b_i − B̄) / n

where n is the number of tuples, Ā and B̄ are the respective means or expected
values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
■ Positive covariance: If Cov(A, B) > 0, then A and B both tend to be larger than their
expected values.
■ Negative covariance: If Cov(A, B) < 0, then if A is larger than its expected value, B
is likely to be smaller than its expected value.
■ Independence: If A and B are independent, Cov(A, B) = 0, but the converse is not true:
■ Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions does a covariance of 0 imply independence.
Correlation coefficient:

   r_A,B = Cov(A, B) / (σ_A · σ_B)
Co-Variance: An Example
■ It can be simplified in computation as Cov(A, B) = E(A·B) − Ā·B̄
■ Suppose two stocks A and B have the following values in one week: (2, 5), (3,
8), (5, 10), (4, 11), (6, 14).
■ Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
■ E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
■ E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
■ Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
■ Thus, A and B rise together since Cov(A, B) > 0.
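A minimal Python sketch verifying this computation and also deriving the correlation coefficient for the same two stocks:

```python
A = [2, 3, 5, 4, 6]
B = [5, 8, 10, 11, 14]
n = len(A)

mean_A, mean_B = sum(A) / n, sum(B) / n                        # 4 and 9.6
cov = sum(a * b for a, b in zip(A, B)) / n - mean_A * mean_B   # E(A*B) - mean_A*mean_B

std_A = (sum((a - mean_A) ** 2 for a in A) / n) ** 0.5
std_B = (sum((b - mean_B) ** 2 for b in B) / n) ** 0.5
r = cov / (std_A * std_B)                                      # correlation coefficient

print(cov)          # 4.0  -> positive, so the stocks tend to rise together
print(round(r, 3))  # 0.941 -> strong positive correlation
```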
Tuple Duplication
• In addition to detecting redundancies between attributes, duplication should also be
detected at the tuple level.
• e.g., where there are two or more identical tuples for a given unique data entry
case
 Inconsistencies arise between various duplicates, due to inaccurate data entry
or updating some but not all data occurrences.
e.g., if a purchase order database contains attributes for the purchaser's name and
address instead of a key to this information in a purchaser database, discrepancies
can occur, such as the same purchaser’s name appearing with different addresses
within the purchase order database
Data Value Conflict Detection and Resolution
Data integration also involves the detection and resolution of data value conflicts.
For example, for the same real-world entity, attribute values from different sources
may differ.
This may be due to differences in representation, scaling, or encoding.
e.g,
 a weight attribute may be stored in metric units in one system and British imperial units in another.
 Different universities may adopt a quarter or a semester system and offer different database
courses, which makes it difficult to work out a grade-exchange scheme between them.
Attributes may also differ on the abstraction level.
 An attribute in one system may be recorded at a lower abstraction level than the "same"
attribute in another.
e.g.:
 the total sales in one database may refer to one branch of AllElectronics, while an
attribute of the same name in another database may refer to the total sales for
AllElectronics stores in a given region.
Data Reduction Strategies
■ Data reduction:
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the
original data.
Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
1.Data reduction strategies
 dimensionality reduction
 numerosity reduction
 data compression.
Data Reduction Strategies
Dimensionality reduction represents the original data in a compressed or reduced
form by applying encoding schemes or transformations.
It is the process of reducing the number of random variables or attributes under
consideration (removing unimportant attributes).
 Wavelet transforms
 Principal Components Analysis (PCA)
 Feature subset selection, feature creation
(Wavelet transforms and PCA transform the original data onto a smaller space.)
Data Reduction Strategies
■ Numerosity reduction reduces the data volume by choosing alternative, smaller
forms of data representation.
■ These techniques may be parametric or nonparametric.
In parametric methods, a model is used to estimate the data, so that only the model
parameters need to be stored instead of the actual data. (Outliers may also be stored.)
e.g.:
regression and log-linear models.
Nonparametric methods store reduced representations of the data, such as
histograms.
e.g.:
clustering, sampling, and data cube aggregation.
Data Reduction Strategies
Data compression
 In data compression, transformations are applied to obtain a reduced or
"compressed" representation of the original data.
 If the original data can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless.
 If we can reconstruct only an approximation of the original data, then the data
reduction is called lossy.
 There are several lossless algorithms for string compression, but they typically allow
only limited data manipulation.
 Dimensionality reduction and numerosity reduction techniques can also be
considered forms of data compression.
 The time spent on data reduction should not outweigh the time saved by mining on
a reduced data set.
Data Reduction :Wavelet Transform
■ The discrete wavelet transform (DWT) is a linear signal processing technique that,
when applied to a data vector X, transforms it to a numerically different vector, X′,
of wavelet coefficients.
■ All wavelet coefficients larger than some user-defined threshold can be retained;
the remaining coefficients are set to 0.
■ This also helps remove noise from the data.
■ 1.The length, L, of the input data vector must be an integer power of 2. This
condition can be met by padding the data vector with zeros as necessary (L ≥ n).
■ 2. Each transform involves applying two functions. The first applies some data
smoothing, such as a sum or weighted average.
■ The second performs a weighted difference, which acts to bring out the detailed
features of the data.
Wavelet Transform
■ 3. The two functions are applied to pairs of data points in X, that is, to all pairs of
measurements (x2i ,x2i+1). This results in two data sets of length L/2.
■ These represent a smoothed or low-frequency version of the input data and the
high frequency content of it, respectively.
■ 4. The two functions are applied recursively to the smoothed data sets obtained in the previous loop, until the resulting data sets reach the desired length.
■ 5. Selected values from the data sets obtained in the previous iterations are
designated the wavelet coefficients of the transformed data.
 a matrix multiplication can be applied to the input data in order to obtain the
wavelet coefficients.
 The matrix must be orthonormal.
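A minimal Python sketch (a plain Haar-style transform with the usual 1/√2 scaling, assumed here for illustration) of the pairwise smoothing/difference steps applied recursively:

```python
import math

def haar_step(x):
    """Split x into a smoothed half (pairwise sums) and a detail half
    (pairwise differences), scaled by 1/sqrt(2) to keep the transform orthonormal."""
    s = 1 / math.sqrt(2)
    smooth = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    detail = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return smooth, detail

def haar_dwt(x):
    """Apply the smoothing/difference pair recursively to the smoothed half."""
    coeffs = []
    while len(x) > 1:
        x, detail = haar_step(x)
        coeffs = detail + coeffs
    return x + coeffs          # overall average first, then the detail coefficients

# Input length must be a power of 2 (pad with zeros otherwise).
print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```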
Wavelet Transform
■ Wavelet transforms can be applied to multidimensional data such as a data cube.
This is done by first applying the transform to the first dimension, then to the
second, and so on.
■ real world applications
compression of fingerprint images
 computer vision
 analysis of time-series data
 data cleaning
Principal Component Analysis (PCA)
■ dimensionality-reduction method
■ used to reduce the dimensionality of large data sets, by transforming a large set of
variables into a smaller one.
Principal Component Analysis (PCA)
• Getting the principal components of the
data matrix X.
• Procedure:
• The first principal component is the
normalized linear combination of the
variables that has the highest variance.
• The second principal component has the
largest variance, subject to being
uncorrelated with the first.
• The principal components produce
linear combinations of the data that are
high in variance and are uncorrelated.
The direction in which the data varies the
most actually falls along the green line.
This is the direction with the most
variation in the data, this is why it's the
first principal component.
Principal Component Analysis (Steps)
■ PCA can be applied to ordered and unordered attributes, and can handle sparse
data and skewed data.
■ Multidimensional data of more than two dimensions can be handled by
reducing the problem to two dimensions.
■ Principal components may be used as inputs to multiple regression and cluster
analysis.
■ In comparison with wavelet transforms, PCA tends to be better at handling
sparse data, whereas wavelet transforms are more suitable for data of high
dimensionality.
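A minimal scikit-learn sketch (synthetic two-attribute data) of projecting the data onto its first principal component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# Two correlated attributes, so most variance lies along one direction.
X = np.column_stack([x1, 2 * x1 + rng.normal(scale=0.3, size=100)])

pca = PCA(n_components=1)          # keep the single highest-variance component
X_reduced = pca.fit_transform(X)   # 100 x 1 representation of the 100 x 2 data

print(pca.explained_variance_ratio_)  # most of the variance is captured by one component
```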
Attribute Subset Selection
• Why attribute subset selection
– Data sets for analysis may contain hundreds of attributes, many of which may
be irrelevant to the mining task or redundant.
– For example,
◆ if the task is to classify customers as to whether or not they are likely to
purchase a popular new CD at AllElectronics when notified of a sale,
attributes such as the customer’s telephone number are likely to be
irrelevant, unlike attributes such as age or music taste.
Attribute Subset Selection
• Using domain expert to pick out some of the useful attributes
– Sometimes this can be a difficult and time-consuming task, especially when the
behavior of the data is not well known.
• Leaving out relevant attributes or keeping irrelevant attributes result in discovered
patterns of poor quality.
• the added volume of irrelevant or redundant attributes can slow down the mining
process.
Attribute Subset Selection
• Attribute subset selection (feature selection):
– Reduce the data set size by removing irrelevant or redundant attributes.
– Goal: select a minimum set of features (attributes) such that the probability
distribution of different classes given the values for those features is as close as
possible to the original distribution given the values of all features
– It reduces the number of attributes appearing in the discovered patterns, helping
to make the patterns easier to understand.
Attribute Subset Selection
• How can we find a ‘good’ subset of the original attributes?
– For n attributes, there are 2^n possible subsets.
– An exhaustive search for the optimal subset of attributes can be
prohibitively expensive, especially as n increase.
– Heuristic methods are commonly used for attribute subset selection.
– These methods are typically greedy in that, while searching through
attribute space, they always make what looks to be the best choice at the
time.
– Such greedy methods are effective in practice and may come close to
estimating an optimal solution.
Attribute Subset Selection
• Heuristic methods:
– Step-wise forward selection
– Step-wise backward elimination
– Combining forward selection and backward elimination
– Decision-tree induction
• The “best” and “worst” attributes are typically determined using:
– the tests of statistical significance, which assume that the attributes are
independent of one another.
– the information gain measure used in building decision trees for
classification.
Attribute Subset Selection
• Stepwise forward selection:
– The procedure starts with an empty set of attributes as the
reduced set.
– First: The best single-feature is picked.
– Next: At each subsequent iteration or step, the best of the
remaining original attributes is added to the set.
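A minimal Python sketch of this greedy procedure; the score function is a hypothetical stand-in for whatever quality measure (e.g. information gain or accuracy estimated on the data) is actually used:

```python
def forward_selection(attributes, score, k):
    """Greedy stepwise forward selection of at most k attributes.
    `score(subset)` is an assumed quality measure evaluated on the data."""
    selected = []
    while len(selected) < k:
        best = max((a for a in attributes if a not in selected),
                   key=lambda a: score(selected + [a]))
        if score(selected + [best]) <= score(selected):
            break                      # no remaining attribute improves the subset
        selected.append(best)
    return selected

# Toy score that prefers the hypothetical attributes "age" and "income".
toy = lambda s: len(set(s) & {"age", "income"})
print(forward_selection(["age", "income", "phone", "zip"], toy, 3))  # ['age', 'income']
```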
Attribute Subset Selection
• Stepwise backward elimination:
– The procedure starts with the full set of attributes.
– At each step, it removes the worst attribute remaining in the set.
Attribute Subset Selection
• Combining forward selection and backward elimination:
– The stepwise forward selection and backward elimination
methods can be combined
– At each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
Attribute Subset Selection
• Decision tree induction:
– Decision tree induction constructs a flowchart-like structure where each
internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node
denotes a class prediction.
– At each node, the algorithm chooses the “best” attribute to partition the
data into individual classes.
– When decision tree induction is used for attribute subset selection, a tree is
constructed from the given data.
– All attributes that do not appear in the tree are irrelevant.
Attribute Subset Selection
• Decision tree induction
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• There are several methods for storing reduced representations of
the data include histograms, clustering, and sampling.
Data Reduction: Sampling
• Sampling: obtaining a small sample s to represent the whole
data set N
• Suppose that a large data set, D, contains N instances.
• The most common ways that we could sample D for data
reduction:
– Simple random sample without replacement (SRSWOR)
– Simple random sample with replacement (SRSWR)
– Cluster sample
– Stratified sample
Data Reduction: Sampling
• Simple random sample without replacement (SRSWOR) of size s:
– SRSWOR is a method of selection of n units out of the N units one by one such
that at any stage of selection, any one of the remaining units has the same
chance of being selected, i.e. 1/N.
• Simple random sample with replacement (SRSWR) of size s:
– SRSWR is a method of selection of n units out of the N units one by one such
that at each stage of selection, each unit has an equal chance of being selected,
i.e., 1/ N.
Data Reduction: Sampling
• Procedure of selection of a random sample:
• 1. Identify the N units in the population with the numbers 1 to N.
• 2. Choose any random number arbitrarily in the random number table and start
reading numbers.
• 3. Choose the sampling unit whose serial number corresponds to the random
number drawn from the table of random numbers.
• 4. In the case of SRSWR, all the random numbers are accepted even if repeated
more than once.
• In the case of SRSWOR, if any random number is repeated, then it is ignored, and
more numbers are drawn.
Data Reduction:
Sampling
Raw Data
Data Reduction: Sampling
• Stratified Sample:
– This technique divides the elements of the population into small
subgroups (strata) based on similarity,
– in such a way that the elements within a group are homogeneous and
heterogeneous across the other subgroups formed.
– the elements are randomly selected from each of these strata.
– We need to have prior information about the population to create
subgroups.
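A minimal Python sketch (toy data, hypothetical strata) of the three sampling schemes: SRSWOR, SRSWR, and stratified sampling.

```python
import random

data = list(range(1, 101))                         # N = 100 records
random.seed(1)

srswor = random.sample(data, 10)                   # without replacement
srswr = [random.choice(data) for _ in range(10)]   # with replacement (repeats allowed)

# Stratified sample: pick proportionally from each (hypothetical) stratum.
strata = {
    "young":  list(range(1, 41)),
    "middle": list(range(41, 81)),
    "senior": list(range(81, 101)),
}
stratified = {name: random.sample(group, max(1, len(group) // 10))
              for name, group in strata.items()}

print(srswor, srswr, stratified, sep="\n")
```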
Data Reduction: Sampling
Raw Data Stratified Sample
Data Cube Aggregation
• used to aggregate data in a simpler form.
• Example
• Imagine that the information gathered for your analysis covers the years
2012 to 2014 and includes the revenue of your company for every
quarter.
• If you are interested in the annual sales rather than the quarterly
totals, the data can be aggregated so that the resulting
data summarize the total sales per year instead of per quarter.
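A minimal pandas sketch (the quarterly figures are made up for illustration) of rolling quarterly sales up to annual totals:

```python
import pandas as pd

# Hypothetical quarterly sales figures for 2012-2014.
quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [224, 408, 350, 586, 680, 585, 789, 844, 401, 323, 360, 533],
})

# Roll up from quarter level to year level (one row per year).
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```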
Data Cube Aggregation
• Sales data for a given branch of AllElectronics for the years
2002 to 2004.
Data Cube Aggregation
• Data cubes store multidimensional aggregated information.
• Data cubes provide fast access to precomputed, summarized data, thereby
benefiting on-line analytical processing as well as data mining.
• A data cube for sales at AllElectronics.
Data Cube Aggregation
• Base cuboid:
– The cube created at the lowest level of abstraction is referred to as
the base cuboid.
– The base cuboid should correspond to an individual entity of
interest, such as sales or customer.
• Apex cuboid:
– A cube at the highest level of abstraction is the apex cuboid.
– For the sales data, the apex cuboid would give one total— the total
sales.
Parametric Data Reduction: Regression
and Log-Linear Models
■ In parametric methods, data are represented using some model.
■ Regression can be a simple linear regression or multiple linear regression.
■ simple linear regression
only single independent attribute
■ multiple linear regression
multiple independent attributes
■ The data are modeled to fit a straight line.
■ ex,
■ a random variable y can be modeled as a linear function of another random
variable x with the equation y = ax+b
where a and b (regression coefficients) specifies the slope and y-intercept of
the line, respectively.
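A minimal numpy sketch (synthetic points) of fitting the regression coefficients a and b by least squares, so that the fitted line can stand in for the raw data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
# Noisy observations of y = 3x + 2.
y = 3 * x + 2 + np.random.default_rng(0).normal(scale=0.1, size=5)

a, b = np.polyfit(x, y, deg=1)   # least-squares fit of y = a*x + b
print(a, b)                      # close to 3 and 2
print(a * 6 + b)                 # predicted y for a new x = 6
```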
Parametric Data Reduction: Regression
and Log-Linear Models
• Log-Linear Model:
Log-linear model can be used to estimate the probability of each data point
in a multidimensional space for a set of discretized attributes, based on a
smaller subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from lower-
dimensional attributes.
• Regression and log-linear model can both be used on sparse data, although
their application may be limited.
Histogram Analysis
■ A histogram represents the data in
terms of frequencies.
■ It uses binning to approximate the data
distribution.
■ It is a popular form of data reduction.
Clustering
■ Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
■ In data reduction, the cluster representation of the data are used to replace
the actual data.
■ It also helps to detect outliers in data
Data Transformation
■ data transformed or consolidated into forms appropriate for mining
■ Methods
■ Smoothing: Remove noise from data
■ Attribute/feature construction
■ New attributes constructed from the given ones
■ Aggregation: Summarization, data cube construction
■ Normalization: Scaled to fall within a smaller, specified range
■ min-max normalization
■ z-score normalization
■ normalization by decimal scaling
■ Discretization: divide the range of continuous attribute into intervals.
Normalization
• Min-max normalization:
• This transforms the original data linearly.
• Suppose that min_A is the minimum and max_A is the maximum value of an
attribute, A.
• Formula: v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
• Where v is the original attribute value.
• v' is the new value you get after normalizing the old value.
Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000
is mapped to (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709.
Normalization
■ Z-score normalization
• In z-score (zero-mean) normalization, the values of an attribute A are normalized
based on the mean (μ) and standard deviation (σ) of A.
• A value, v, of attribute A is normalized to v' by computing v' = (v - μ) / σ
• Ex. Let μ = 54,000, σ = 16,000. Then a value of $73,000 is normalized to
(73,000 - 54,000) / 16,000 ≈ 1.19.
• Decimal Scaling:
• It normalizes the values of an attribute by moving the position of their
decimal points.
• The number of places the decimal point is moved is determined
by the maximum absolute value of attribute A.
• A value, v, of attribute A is normalized to v' by computing v' = v / 10^j,
where j is the smallest integer such that Max(|v'|) < 1.
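A minimal NumPy sketch of the three normalization methods above, reusing the slide's income figures; the extra values placed in the array are illustrative:

```python
import numpy as np

income = np.array([12_000.0, 54_000.0, 73_000.0, 98_000.0])

# Min-max normalization to the new range [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (income - 12_000) / (98_000 - 12_000) * (new_max - new_min) + new_min

# Z-score normalization using the slide's mean and standard deviation
mu, sigma = 54_000.0, 16_000.0
zscore = (income - mu) / sigma

# Decimal scaling: divide by 10^j so the largest |normalized value| is < 1
j = int(np.floor(np.log10(np.abs(income).max()))) + 1
decimal_scaled = income / 10 ** j

print(minmax)          # 73,000 maps to ~0.709
print(zscore)          # 73,000 maps to ~1.19
print(decimal_scaled)  # 98,000 maps to 0.98 with j = 5
```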
Data Discretization Methods
■ Typical methods: All the methods can be applied recursively
■ Binning
■ Binning groups related values together in bins to reduce the
number of distinct values.
■ Top-down split, unsupervised
■ Histogram analysis
■ partition the values for an attribute into disjoint ranges called
buckets.
■ Top-down split, unsupervised(does not use class name)
Data Discretization Methods
■ Typical methods: All the methods can be applied recursively
■ Clustering analysis
• Cluster analysis is a popular data discretization method.
• A clustering algorithm can be applied to discretize a numerical attribute, A, by
partitioning the values of A into clusters or groups.
• Each initial cluster or partition may be further decomposed into several
subclusters, forming a lower level of the hierarchy.
■ unsupervised, top-down split or bottom-up merge
■ Decision-tree analysis
■ supervised, top-down split
■ Correlation (e.g., χ2) analysis
■ unsupervised, bottom-up merge
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
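A small Python sketch (an assumed implementation, not from the slides) reproducing the equal-frequency binning and the two smoothing strategies shown above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4                                                # equal-frequency bin size
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value by the closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```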
Concept Hierarchy Generation
■ Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)
■ in the multidimensional model, data are organized into multiple dimensions,
and each dimension contains multiple levels of abstraction defined by concept
hierarchies.
■ Concept hierarchies can be explicitly specified by domain experts and/or data
warehouse designers
■ Concept hierarchy can be automatically formed for both numeric and nominal
data.
Concept Hierarchy Generation
for Nominal Data
■ Specification of a partial/total ordering of attributes explicitly at
the schema level by users or experts
■ street < city < state < country
■ Specification of a hierarchy for a set of values by explicit data
grouping
■ {Urbana, Champaign, Chicago} < Illinois
■ Specification of only a partial set of attributes
■ E.g., only street < city, not others
■ Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
■ E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
■ Some hierarchies can be automatically generated based on the analysis
of the number of distinct values per attribute in the data set
■ The attribute with the most distinct values is placed at the lowest
level of the hierarchy
■ Exceptions, e.g., weekday, month, quarter, year
■ E.g., country (15 distinct values), province_or_state (365 distinct values),
city (3,567 distinct values), street (674,339 distinct values), giving the hierarchy
street < city < province_or_state < country, with street at the lowest level.
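A brief sketch (an assumption, not from the slides) of this heuristic: order the attributes by their distinct-value counts so that the attribute with the most distinct values lands at the lowest hierarchy level:

```python
# Distinct-value counts per attribute (figures from the example above)
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# Most distinct values -> lowest level; fewest -> highest (most general) level
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```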
Data cube
• A grouping of data in a multidimensional matrix is called a data cube.
• A data cube is generally used to make data easier to interpret.
• It is especially useful when representing data along dimensions that correspond to
certain measures of business.
• It is an extension of the 2-dimensional matrix (columns and
rows) to multiple dimensions.
Data cube
• We often need to abstract the relevant or important data from a large, complex
data set; this is where the data cube comes into the picture.
• A data cube is basically used to represent the specific information to be
retrieved from a huge set of complex data.
• e.g., purchases in a shopping mall
Data cube:Types
• The data cube can be classified into two categories:
• Multidimensional data cube:
• It basically helps in storing large amounts of data by making use of a multi-
dimensional array.
• It increases efficiency by keeping an index for each dimension; thus,
data can be retrieved quickly.
• Relational data cube:
• It basically helps in storing large amounts of data by making use of relational
tables.
• Each relational table displays the dimensions of the data cube.
• It is slower compared to a Multidimensional Data Cube
Data Cube :characteristics
• It can be extended to include many more dimensions.
• It improves business strategies through analysis of all the data.
• It helps capture the latest market scenario by establishing trends and
performance analysis.
• It plays a pivotal role by creating intermediate data cubes to
serve reporting requirements and to bridge the gap between the data
warehouse and the reporting tools.
Data Cube:Benefits
• Increases the productivity of an enterprise.
• Improves the overall performance and efficiency.
• Representation of huge and complex data sets gets simplified and
streamlined.
• Huge databases and complex SQL queries also become manageable.
• Indexing and ordering provide the best set of data for analysis
and data mining techniques.
• Faster and more easily accessible, since it holds predefined and
precalculated data sets (data cubes).
Data Cube:Benefits
• Aggregation of data makes access to the data very fast at each micro
level, which ultimately leads to easy and efficient maintenance and
reduced development time.
• OLAP on data cubes helps provide fast response times, a fast learning
curve, a versatile environment, reach across a wide range of
applications, modest resource needs for deployment, and less wait time
with quality results.
Statistical Descriptions of Data
 Statistics help in identifying patterns that further help identify
differences between random noise and significant findings.
 Descriptive statistics are used to describe or summarize data in
ways that are meaningful and useful.
 For data preprocessing to be successful, it is essential to have an overall
picture of your data.
 used to identify properties of the data and highlight which data
values should be treated as noise or outliers.
Statistical Descriptions of Data
• Statistics is a form of mathematical analysis.
• It is an area of applied mathematics concerned with data collection, analysis,
interpretation, and presentation.
• Statistics deals with how data can be used to solve complex problems.
• Statistics simplifies the work and provides a clear, clean
picture of the data you work with on a regular basis.
• Basic terminology of Statistics :
• Population
A population is a collection of individuals, objects, or events whose
properties are to be analyzed.
Statistical Descriptions of Data
Descriptive statistics describe the population either through numerical calculations or
through graphs and tables, providing a numerical or graphical summary of the data.
• Measure of central tendency
• Measure of Variability
Measure of central tendency
A summary statistic that is used to represent the center point or typical value of a
data set or sample set.
(i) Mean :
It is the average of all values in a sample set.
For example, the mean of {1, 3, 5, 6, 7} is (1 + 3 + 5 + 6 + 7) / 5 = 4.4.
Statistical Descriptions of Data
(ii) Median :
It is the central value of a sample set: the data set is ordered
from lowest to highest value and the exact middle value is taken.
For example, the median of {1, 3, 5, 6, 7} is 5.
Statistical Descriptions of Data
(iii) Mode :
It is the value that occurs most frequently in the sample set; the value repeated
most often is the mode. For example, the mode of {1, 3, 3, 5, 6} is 3.
Statistical Descriptions of Data
• Measure of Variability –
Measure of Variability is also known as measure of dispersion and helps to
understand the distribution of the data.
• three common measures of variability :
• (i) Range :
It measures how spread apart the values in a sample set or data set are.
Range = Maximum value - Minimum value
1, 3, 5, 6, 7 => Range = 7 - 1 = 6
• (ii) Variance :
Variance measures how far each number in the set is from the mean.
S² = (1/n) ∑ (xᵢ - x̄)², summed over i = 1..n
• where n is the total number of data points, x̄ is the mean of the data points, and xᵢ is
an individual data point.
• (iii) Dispersion (standard deviation) :
Dispersion in statistics is a way of describing how spread out a set of data is.
σ = √( (1/n) ∑ (xᵢ - μ)² ), summed over i = 1..n
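These measures can be computed directly with Python's standard statistics module; a minimal sketch, with illustrative data values:

```python
import statistics

data = [1, 3, 5, 6, 7]

mean   = statistics.mean(data)             # 4.4
median = statistics.median(data)           # 5
mode   = statistics.mode([1, 3, 3, 5, 6])  # 3 (mode needs a repeated value)
rng    = max(data) - min(data)             # 7 - 1 = 6
var    = statistics.pvariance(data)        # population variance, (1/n) * sum((x - mean)^2)
sigma  = statistics.pstdev(data)           # population standard deviation

print(mean, median, mode, rng, var, sigma)
```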
Contenu connexe

Tendances

1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesKrish_ver2
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessingSalah Amean
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text miningKrish_ver2
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalSudarsun Santhiappan
 
Multidimensional data models
Multidimensional data  modelsMultidimensional data  models
Multidimensional data models774474
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesSaif Ullah
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemKiran kumar
 
Data warehouse and data mining
Data warehouse and data miningData warehouse and data mining
Data warehouse and data miningPradnya Saval
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Data Mining: Applying data mining
Data Mining: Applying data miningData Mining: Applying data mining
Data Mining: Applying data miningDataminingTools Inc
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Miningidnats
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Data mining PPT
Data mining PPTData mining PPT
Data mining PPTKapil Rode
 
Data warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaData warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaRadhika Kotecha
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)snegacmr
 

Tendances (20)

1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
Data Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessingData Mining:  Concepts and Techniques (3rd ed.)- Chapter 3 preprocessing
Data Mining: Concepts and Techniques (3rd ed.) - Chapter 3 preprocessing
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
Multidimensional data models
Multidimensional data  modelsMultidimensional data  models
Multidimensional data models
 
Data mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniquesData mining (lecture 1 & 2) conecpts and techniques
Data mining (lecture 1 & 2) conecpts and techniques
 
Business Intelligence Data Warehouse System
Business Intelligence Data Warehouse SystemBusiness Intelligence Data Warehouse System
Business Intelligence Data Warehouse System
 
Data warehouse and data mining
Data warehouse and data miningData warehouse and data mining
Data warehouse and data mining
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Data mining
Data miningData mining
Data mining
 
Data Mining: Applying data mining
Data Mining: Applying data miningData Mining: Applying data mining
Data Mining: Applying data mining
 
Data Warehousing and Data Mining
Data Warehousing and Data MiningData Warehousing and Data Mining
Data Warehousing and Data Mining
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Data mining PPT
Data mining PPTData mining PPT
Data mining PPT
 
CS8080 information retrieval techniques unit iii ppt in pdf
CS8080 information retrieval techniques unit iii ppt in pdfCS8080 information retrieval techniques unit iii ppt in pdf
CS8080 information retrieval techniques unit iii ppt in pdf
 
Data warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika KotechaData warehousing - Dr. Radhika Kotecha
Data warehousing - Dr. Radhika Kotecha
 
Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)Discretization and concept hierarchy(os)
Discretization and concept hierarchy(os)
 

Similaire à Dma unit 1

Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basicNivaTripathy2
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 abhagathk
 
Information_System_and_Data_mining12.ppt
Information_System_and_Data_mining12.pptInformation_System_and_Data_mining12.ppt
Information_System_and_Data_mining12.pptPrasadG76
 
chap1.ppt
chap1.pptchap1.ppt
chap1.pptImXaib
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slidestafosepsdfasg
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining Sushil Kulkarni
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryYoung Alista
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryHarry Potter
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryJames Wong
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryFraboni Ec
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryLuis Goldster
 

Similaire à Dma unit 1 (20)

Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basic
 
Dwdmunit1 a
Dwdmunit1 aDwdmunit1 a
Dwdmunit1 a
 
chap1.ppt
chap1.pptchap1.ppt
chap1.ppt
 
chap1.ppt
chap1.pptchap1.ppt
chap1.ppt
 
Information_System_and_Data_mining12.ppt
Information_System_and_Data_mining12.pptInformation_System_and_Data_mining12.ppt
Information_System_and_Data_mining12.ppt
 
chap1.ppt
chap1.pptchap1.ppt
chap1.ppt
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
 
BAS 250 Lecture 1
BAS 250 Lecture 1BAS 250 Lecture 1
BAS 250 Lecture 1
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Dm unit i r16
Dm unit i   r16Dm unit i   r16
Dm unit i r16
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 

Dernier

Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfRagavanV2
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756dollysharma2066
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spaintimesproduction05
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLManishPatel169454
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 

Dernier (20)

Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
FULL ENJOY Call Girls In Mahipalpur Delhi Contact Us 8377877756
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Vivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design SpainVivazz, Mieres Social Housing Design Spain
Vivazz, Mieres Social Housing Design Spain
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 

Dma unit 1

  • 1. 18CSE355T -DATA MINING AND ANALYTICS
  • 2. COURSE LEARNING RATIONALE (CLR) The purpose of learning this course is to: CLR -1: Understand the concepts of Data Mining CLR -2: Familiarize with Association rule mining CLR -3: Familiarize with various Classification algortihms CLR -4: Understand the concepts of Cluster Analysis CLR -5: Familiarize with Outlier analysis techniques CLR -6: Familiarize with applications of Data mining in different domains
  • 3. COURSE LEARNING OUTCOMES (CLO) At the end of this course, learners will be able to: CLO -1: Gain knowledge about the concepts of Data Mining CLO -2: Understand and Apply Association rule mining techniques CLO -3: Understand and Apply various Classification algortihms CLO -4: Gain knowledge on the concepts of Cluster Analysis CLO -5: Gain knowledge on Outlier analysis techniques CLO -6: Understand the importance of applying Data mining concepts in different domains
  • 4. LEARNING RESOURCES S. No., TEXT BOOKS 1 Jiawei Han and Micheline Kamber, ― Data Mining: Concepts and Techniques‖, 3rd Edition, Morgan Kauffman Publishers, 2011.
  • 5. UNIT I INTRODUCTION Why Data mining? What is Data mining ?-Kinds of data meant for mining -Kinds of patterns that can be mined- Applications suitable for data mining-Issues in Data mining-Data objects and Attribute types-Statistical descriptions of data-Need for data preprocessing and data quality-Data cleaning-Data integration-Data reduction- Data transformation-Data cube and its usage
  • 6. Why Data Mining? • The Explosive Growth of Data: from terabytes(10004) to petabytes(10008) – Data collection and data availability • Automated data collection tools, database systems, web – Major sources of abundant data • Business: Web, e-commerce, transactions, stocks, … • Science: bioinformatics, scientific simulation, medical research … • Society and everyone: news, digital cameras, …
  • 7. Why Data Mining?  The abundance of data, coupled with the need for powerful data analysis tools, has been described as a data rich but information poor situation.  The fast-growing, tremendous amount of data, collected and stored in large and numerous data repositories, has far exceeded our human ability for comprehension without powerful tools.  As a result, data collected in large data repositories become “data tombs” data archives that are seldom visited.  Important decisions are often made based not on the information-rich data stored in data repositories but rather on a decision maker’s intuition, simply because the decision maker does not have the tools to extract the valuable knowledge embedded in the vast amounts of data.
  • 8. Why Data Mining?  Efforts have been made to develop expert system and knowledge-based technologies, which typically rely on users or domain experts to manually input knowledge into knowledge bases.  Unfortunately, however, the manual knowledge input procedure is prone to biases and errors and is extremely costly and time consuming.  The widening gap between data and information calls for the systematic development of data mining tools that can turn data tombs into “golden nuggets” of knowledge.
  • 10. What Is Data Mining? • Data mining (knowledge discovery from data) – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data. – Data mining is the process of discovering interesting patterns and knowledge from large amounts of data. – The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically. • Alternative names – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • 11. Potential Applications • Data analysis and decision support – Market analysis and management • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation – Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis – Fraud detection and detection of unusual patterns (outliers) • Other Applications – Text mining (news group, email, documents) and Web mining – Stream data mining – Bioinformatics and bio-data analysis
  • 12. Ex.: Market Analysis and Management • Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, surveys … • Target marketing – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc., • E.g. Most customers with income level 60k – 80k with food expenses $600 - $800 a month live in that area – Determine customer purchasing patterns over time • E.g. Customers who are between 20 and 29 years old, with income of 20k – 29k usually buy this type of CD player • Cross-market analysis—Find associations/co-relations between product sales, & predict based on such association – E.g. Customers who buy computer A usually buy software B
  • 14. KDD Process 7 Steps of KDD Process – Data cleaning (remove noise and inconsistent data) – Data integration (multiple data sources maybe combined) – Data selection (data relevant to the analysis task are retrieved from database) – Data transformation (data transformed or consolidated into forms appropriate for mining) (Done with data preprocessing) – Data mining (an essential process where intelligent methods are applied to extract data patterns) – Pattern evaluation (indentify the truly interesting patterns) – Knowledge presentation ( presentation of knowledge to the user for visualization in terms of trees, tables, rules graphs, charts, matrices..)
  • 15. Data Mining and Business Intelligence Increasing potential to support business decisions End User Business Analyst Data Analyst DBA Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems
  • 16. On What Kinds of Data to be minined? • Database-oriented data sets and applications – Relational database, data warehouse, transactional database • Advanced data sets and advanced applications – Object-Relational Databases – Temporal Databases, Sequence Databases, Time-Series databases – Spatial Databases and Spatiotemporal Databases – Text databases and Multimedia databases – Heterogeneous Databases and Legacy Databases – Data Streams – The World-Wide Web
  • 17. Relational Databases • DBMS – database management system, contains a collection of interrelated databases e.g. Faculty database, student database, publications database • Each database contains a collection of tables and functions to manage and access the data. e.g. student_bio, student_graduation, student_parking • Each table contains columns and rows, with columns as attributes of data and rows as records. • Tables can be used to represent the relationships between or among multiple tables.
  • 18. Relational Databases (2) – AllElectronics store
  • 19. Relational Databases (3) • With a relational query language, e.g. SQL, we will be able to find answers to questions such as:  How many items were sold last year?  Who has earned commissions higher than 10%?  What is the total sales of last month for Dell laptops? • When data mining is applied to relational databases, we can search for trends or data patterns. • data mining systems can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.
  • 20. Data Warehouses • A repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. • Constructed via a process of data cleaning, data integration, data transformation, data loading and periodic data refreshing.
  • 21. Data Warehouses (2) • Modelled by multidimensional data structure , called as data cube, in which each dimension corresponds to set of attributes. • Each cell stores the value of some aggregate measures like count. • Data cube provides the multidimensional view of data and allows the precomputation and fast access of summarized data. • Data are organized around major subjects, e.g. customer, item, supplier and activity. • Provide information from a historical perspective (e.g. from the past 5 – 10 years) • Typically summarized to a higher level (e.g. a summary of the transactions per item type for each store) • User can perform drill-down or roll-up operation to view the data at different degrees of summarization
  • 23. Transactional Databases • It captures a transaction, such as flight booking, customer’s purchase and user’s click on a web page. • Consists of a file where each record represents a transaction. • A transaction typically includes a unique transaction ID and a list of the items making up the transaction. • Either stored in a flat file or unfolded into relational tables • Easy to identify items that are frequently sold together
  • 24. Transactional Databases  Which items sold well together?”  This kind of market basket data analysis would enable you to bundle groups of items together as a strategy for boosting sales.  e.g purchase the computer along with printer.  A traditional database system is not able to perform market basket data analysis.  Data mining on transactional data can do so by mining frequent item sets that is, sets of items that are frequently sold together.
  • 25. Data Mining Functionalities What kinds of patterns can be mined? • Data Mining Functionalities are used to specify the kinds of patterns to be found in data mining tasks. • 2 tasks are descriptive and predictive. • Descriptive mining tasks –describes the general properties of the data • Predictive mining tasks – it makes the prediction based on current data. Types of Data Mining Functionalities • Concept/Class Description • Mining Frequent Patterns, Associations, and Correlations • Classification and Regression for Predictive Analysis • Cluster Analysis • Outlier Analysis • Evolution Analysis .
  • 26. Data Mining Functionalities - What kinds of patterns can be mined? 1. Concept/Class Description Data can be associated with classes or concepts. •E.g. classes of items – computers, printers, … concepts of customers – bigSpenders, budgetSpenders, … It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Data characterization – summarizing the general characteristics of a target class of data. –E.g. summarizing the characteristics of customers who spend more than $1,000 a year at AllElectronics. Result can be a general profile of the customers, 40 – 50 years old, Employed have excellent credit ratings
  • 27. 1.4 Data Mining Functionalities - What kinds of patterns can be mined? •Data discrimination – comparing the target class with one or a set of comparative classes –E.g. Compare the general features of software products whole sales increase by 10% in the last year , –with those whose sales decrease by 30% during the same period
  • 28. 2.Mining Frequent Patterns, Associations and Correlations Mining Frequent Patterns ( patterns that occur frequently in data ). Kinds of frequent patterns – Frequent item set: a set of items that frequently appear together in a transactional data set (e.g. milk and bread) – Frequent subsequence: A frequently occurring subsequence, such as the pattern that customers, tend to purchase first a laptop, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. – Frequent Substructures: A substructure can refer to different structural forms (e.g., graphs, trees, or lattices) that may be combined with item sets or subsequences. – If a substructure occurs frequently, it is called a (frequent) structured pattern. – Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
  • 29. 1.4 Data Mining Functionalities - What kinds of patterns can be mined? ` – Association Analysis: find frequent patterns • E.g. a sample analysis result – an association rule: buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%] if a customer buys a computer, there is a 50% chance that she will buy software. 1% of all of the transactions under analysis showed that computer and software are purchased together. • Associations rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold. – Correlation Analysis: additional analysis to find statistical correlations between associated pairs
  • 30. What kinds of patterns can be mined? 3.Classification and Prediction for predictive analysis – Classification • It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts • The goal of classification is to accurately predict the target class for each case in the data. • The model can be represented in classification (IF-THEN) rules, decision trees, neural networks, etc.
  • 31. What kinds of patterns can be mined? 3.Classification and Prediction for predictive analysis A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
  • 32. What kinds of patterns can be mined? 3.Classification and Prediction for predictive analysis • A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. • Neural networks are used for effective data mining in order to turn raw data into useful information. • Neural networks look for patterns in large batches of data, allowing businesses to learn more about their customers which directs their marketing strategies, increase sales and lowers costs.
  • 33. What kinds of patterns can be mined? Classification and Regression  Regression is used to predict missing or unavailable numerical data values rather than (discrete) class labels.  The term prediction refers to both numeric prediction and class label prediction.  Regression analysis is a statistical methodology that is most often used for numeric prediction, although other methods exist as well.  Regression also encompasses the identification of distribution trends based on the available data.  Classification and regression may need to be preceded by relevance analysis, which attempts to identify attributes that are significantly relevant to the classification and regression process.  Such attributes will be selected for the classification and regression process. Other attributes, which are irrelevant, can then be excluded from consideration
  • 34. What kinds of patterns can be mined? 4.Cluster Analysis – Clustering can be used to generate class labels for a group of data. – Clusters of objects are formed based on the principle of maximizing intra- class similarity & minimizing interclass similarity • E.g. Identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing.
  • 35. What kinds of patterns can be mined? 5.Outlier Analysis: identify similar objects – A data set may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers – Outliers are usually discarded as noise or exceptions. – Useful for fraud detection. the outlier indicates a fraudulent activity. • E.g. Detect purchases of extremely large amounts 6.Evolution Analysis – Describes and models regularities or trends for objects whose behavior changes over time. • E.g. Identify stock evolution regularities for overall stocks and for the stocks of particular companies.
  • 36. What kinds of patterns can be mined? Are All of the Patterns Interesting? • Data mining may generate thousands of patterns: Not all of them are interesting • A pattern is interesting if it is – easily understood by humans – valid on new or test data with some degree of certainty, – potentially useful – novel – validates some hypothesis that a user seeks to confirm • An interesting patterns represents knowledge !
  • 37. What kinds of patterns can be mined? Are All of the Patterns Interesting? • Objective measures – Based on statistics and structures of patterns, e.g., support, confidence, etc. (Rules that do not satisfy a threshold are considered uninteresting.) • Subjective measures – Reflect the needs and interests of a particular user. • E.g. A marketing manager is only interested in characteristics of customers who shop frequently. – Based on user’s belief in the data. • e.g., Patterns are interesting if they are unexpected, or can be used for strategic planning, etc • Objective and subjective measures need to be combined.
  • 38. Major Issues in Data Mining There are many challenging issues in data mining research. Areas include i. Mining methodology ii. User interaction iii. Efficiency and scalability iv. Dealing with diverse data types v. Data mining and society
  • 39. 39 Major Issues in Data Mining ■ Mining Methodology ■ Mining various and new kinds of knowledge ■ Mining knowledge in multi-dimensional space ■ Data mining: An interdisciplinary effort ■ Boosting the power of discovery in a networked environment ■ Handling noise, uncertainty, and incompleteness of data ■ Pattern evaluation and pattern- or constraint-guided mining
  • 40. 40 Major Issues in Data Mining ■ Mining Methodology ■ Mining various and new kinds of knowledge  use the same database in different ways and require the development of numerous data mining techniques.  Due to the diversity of applications, new mining tasks continue to emerge, making data mining a dynamic and fast-growing field. e.g effective knowledge discovery in information networks, integrated clustering and ranking may lead to the discovery of high-quality clusters and object ranks in large networks.
  • 41. 41 Major Issues in Data Mining ■ Mining Methodology ■ Mining knowledge in multi-dimensional space ■ When searching for knowledge in large data sets, explore the data in multidimensional space. ■ That is, can search for interesting patterns among combinations of dimensions (attributes) at varying levels of abstraction. Such mining is known as (exploratory) multidimensional data mining. ■ In many cases, data can be aggregated or viewed as a multidimensional data cube. ■ Mining knowledge in cube space can enhance the power and flexibility of data mining.
  • 42. Major Issues in Data Mining Data mining an interdisciplinary effort: The power of data mining can be enhanced by integrating new methods from multiple disciplines. Eg.- mining data in natural language - the mining of software bugs in large programs(bug mining), requires software engineering knowledge. Boosting the power of discovery in a networked environment: • Most data objects reside in a linked or interconnected environment, whether it be the Web, database relations, files, or documents. • Semantic links across multiple data objects can be used to advantage in data mining. • Knowledge derived in one set of objects can be used discovery of knowledge in a “related” objects.
  • 43. Major Issues in Data Mining Handling uncertainty, noise, or incompleteness of data:  Data often contain noise, errors, exceptions, or uncertainty, or are incomplete.  Errors and noise may confuse the data mining process, leading to the derivation of erroneous patterns.  Data cleaning, data preprocessing, outlier detection and removal are examples of techniques that need to be integrated with the data mining process. Pattern evaluation and pattern- or constraint-guided mining:  Not all the patterns generated by data mining processes are interesting.  What makes a pattern interesting may vary from user to user.  Therefore, techniques are needed to assess the interestingness of discovered patterns based on subjective measures.
  • 44. Major Issues in Data Mining ii)User Interaction  The user plays an important role in the data mining process.  Interesting areas of research include how to interact with a data mining system, how to incorporate a user’s background knowledge in mining, and how to visualize data mining results. Interactive mining:  The data mining process should be highly interactive.  important to build flexible user interfaces. Incorporation of background knowledge:  Background knowledge, constraints, rules, and other information regarding the domain should be incorporated into the knowledge discovery process.  Such knowledge can be used for pattern evaluation and guide the search to find interesting patterns.
  • 45. Major Issues in Data Mining Presentation and visualization of data mining results:  How can a data mining system present data mining results flexibly?  This is especially crucial if the data mining process is interactive.  It requires the system to adopt expressive knowledge representations, user friendly interfaces, and visualization techniques. iii)Efficiency and Scalability  Efficiency and scalability are always considered when comparing data mining algorithms.  As data amounts continue to multiply, these two factors are especially critical.
  • 46. Major Issues in Data Mining Efficiency and scalability of data mining algorithms:  Data mining algorithms must be efficient and scalable in order to effectively extract information from huge amounts of data .  The running time of a data mining algorithm must be predictable, short, and acceptable by applications.  Efficiency, scalability, performance, optimization, and the ability to execute in real time are key criteria that drive the development of many new data mining algorithms. Parallel, distributed, and incremental mining algorithms:  The parallel processes may interact with one another. The patterns from each partition are eventually merged.
  • 47. 47 Major Issues in Data Mining iv)Diversity of Database Types The wide diversity of database types brings about challenges to data mining. Handling complex types of data Mining dynamic, networked, and global data repositories Handling complex types of data  The construction of effective and efficient data mining tools for diverse applications remains a challenging and active area of research. Mining dynamic, networked, and global data repositories Multiple sources of data are connected by the Internet and various kinds of networks, forming distributed, and heterogeneous global information systems and networks.
  • 48. Major Issues in Data Mining  The discovery of knowledge from different sources of structured, semi- structured, or unstructured and interconnected data with diverse data semantics having great challenges to data mining.  Web mining, multisource data mining, and information network mining have become challenging and fast-evolving data mining fields. v)Data Mining and Society How does data mining impact society? Social impacts of data mining:  The improper use of data and the potential violation of individual privacy and data protection rights are areas of concern that need to be addressed.
  • 49. Major Issues in Data Mining Privacy-preserving data mining:  Data mining will help scientific discovery, business management, economy recovery, and security protection (e.g., the real-time discovery of intruders and cyberattacks).  However, it poses the risk of disclosing an individual’s personal information.  observe data sensitivity and preserve people’s privacy while performing successful data mining.
  • 50. Major Issues in Data Mining Invisible data mining:  We cannot expect everyone in society to learn and master data mining techniques.  More and more systems should have data mining functions built within so that people can perform data mining or use data mining results simply by mouse clicking, without any knowledge of data mining algorithms.  Intelligent search engines and Internet-based stores perform such invisible data mining by incorporating data mining into their components to improve their functionality and performance. This is done often unbeknownst to the user.  For example, when purchasing items online, users may be unaware that the store is likely collecting data on the buying patterns of its customers, which
  • 51. Data Objects and Attribute Types A data object is a region of storage that contains a value or group of values. Data sets are made up of data objects. A data object represents an entity • in a sales database, the objects may be customers, store items, and sales; • in a university database, the objects may be students, professors, and courses. Row-> data objects Columns->attributes Data objects are typically described by attributes. Data objects can also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes.
  • 52. Data Objects and Attribute Types An attribute is a data field, representing a characteristic or feature of a data object. The type of an attribute is determined by the set of possible values: nominal, binary, ordinal, or numerical • Nominal Attributes only provide enough attributes to differentiate between one object and another.  -relating to name  Hair-color{brown,black,white} • Such as Student Roll No. • Ordinal Attribute: • Value have meaningful order. The ordinal attribute value provides sufficient information to order the objects. Piazza={small,medium,large} Rankings, Grades, Height
  • 53. Data Objects and Attribute Types • Binary Attribute: These are 0 and 1. Where 0 is the absence of any features and 1 is the inclusion of any characteristics. • Quality • Numeric attribute: It is quantitative, such that quantity can be measured and represented in integer or real values . Two types i)Interval Scaled attribute: It is measured on a scale of equal size units. These attributes allows us to compare such as temperature in C or F . Thus values of attributes have order. ii)Ratio Scaled attribute: Both differences and ratios(or multiply) are significant for Ratio. For eg. age, length, Weight. e.g.10 we can say multiply of 5 .
  • 54. Why Data Preprocessing? • Data in the real world is dirty – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data – noisy: containing errors or outliers – inconsistent: containing discrepancies in codes or names • No quality data, no quality mining results!
  • 55. Why Data Preprocessing? Data Quality: Data have quality if they satisfy the requirements of the intended use. Factors comprising quality • accuracy • completeness • consistency • timeliness • believability • interpretability • Accessibility e.g: Analyze the branch sale at All Electronics Store three of the elements defining data quality: • Accuracy • completeness • consistency
  • 56. Why Data Preprocessing? Data Quality: Reasons for inaccurate data(having incorrect attribute value) • The data collection instruments used may be faulty. • There may have been human or computer errors occurring at data entry. • Users may purposely submit incorrect data values for mandatory fields • This is known as disguised missing data.  There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption.
  • 57. Why Data Preprocessing? Data Quality: Reason for Incorrect data • inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). • Duplicate tuples also require data cleaning. Reasons for Incomplete data • Missing the Attribute information.  e.g customer information for sales transaction data. • Relevant data may not be recorded due to a misunderstanding. • Data that were inconsistent with other recorded data, may have been deleted.
  • 58. Why Data Preprocessing? Data Quality: Timeliness also affects data quality. e.g: monthly sales bonuses to the top sales representatives at All Electronics Store. 2 other factors affect quality Believability reflects how much the data are trusted by users. Interpretability reflects how easy the data are understood
  • 59. Major Tasks in Data Preprocessing • Data cleaning – Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data integration – Integration of multiple databases, data cubes, files, or notes • Data transformation -- forms of Normalization (scaling to a specific range) – Aggregation • Data reduction – Obtains reduced representation in volume but produces the same or similar analytical results – Data discretization: with particular importance, especially for numerical data – Data aggregation, dimensionality reduction, data compression, generalization
  • 60. Forms of data preprocessing
  • 61. 61 Data Cleaning ■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human or computer error, transmission error ■ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data ■ e.g., Occupation=“ ” (missing data) ■ noisy: containing noise, errors, or outliers ■ e.g., Salary=“−10” (an error) ■ inconsistent: containing discrepancies in codes or names, e.g., ■ Age=“42”, Birthday=“03/07/2010” ■ Was rating “1, 2, 3”, now rating “A, B, C” ■ discrepancy between duplicate records ■ Intentional (e.g., disguised missing data) ■ Jan. 1 as everyone’s birthday?
  • 62. 62 Incomplete (Missing) Data ■ Data is not always available ■ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data ■ Missing data may be due to ■ equipment malfunction ■ inconsistent with other recorded data and thus deleted ■ data not entered due to misunderstanding ■ certain data may not be considered important at the time of entry ■ not register history or changes of the data ■ Missing data may need to be inferred
  • 63. 63 Incomplete (Missing) Data Data cleaning is discussed in terms of: 1. Handling missing values 2. Handling noisy data 3. Data cleaning as a process.
  • 64. 64 1.How to Handle Missing Data? ■ Ignore the tuple: usually done when class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably ■ Fill in the missing value manually: this approach is time consuming and may not be feasible given a large data set with many missing values. ■ Use a global constant to fill in the missing value:  Replace all missing attribute values by the same constant such as a label like “Unknown” or −∞.  If missing values are replaced by, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common—that of “Unknown.”  Hence, although this method is simple, it is not foolproof.
  • 65. 65 1.How to Handle Missing Data? ■ Use the attribute mean or median for all samples belonging to the same class as the given tuple:  For example, if classifying customers according to credit risk, may replace the missing value with the mean income value for customers in the same credit risk category as that of the given tuple.  If the data distribution for a given class is skewed, the median value is a better choice.
  • 66. 66 How to Handle Missing Data? ■ Use the most probable value to fill in the missing value:  This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.  For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
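The fill-in strategies above can be sketched with pandas; the DataFrame, column names, and credit-risk grouping below are hypothetical and only illustrate the idea, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical customer data with missing income values
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, None, 18000, None, 61000],
})

# (a) Global constant: flag missing values with a label such as "Unknown"
df["income_const"] = df["income"].fillna("Unknown")

# (b) Class-wise mean: replace a missing income with the mean income of
#     customers in the same credit-risk category (use median if skewed)
df["income_mean"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(df)
```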
  • 67. 67 2.Noisy Data ■ Noise: random error or variance in a measured variable ■ Incorrect attribute values may be due to ■ faulty data collection instruments ■ data entry problems ■ data transmission problems ■ technology limitation ■ inconsistency in naming convention ■ Other data problems which require data cleaning ■ duplicate records ■ incomplete data ■ inconsistent data
  • 68. 68 How to Handle Noisy Data? ■ Binning ■ Binning is a way to group a number of more or less continuous values into a smaller number of "bins". ■ The sorted values are distributed into a number of “buckets,” or bins . ■ example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals. Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34 Partition into (equal-frequency) bins: Bin 1: 4, 8, 15 Bin 2: 21, 21, 24 Bin 3: 25, 28, 34 Smoothing by bin means: Bin 1: 9, 9, 9 Bin 2: 22, 22, 22 Bin 3: 29, 29, 29 Smoothing by bin boundaries: Bin 1: 4, 4, 15 Bin 2: 21, 21, 24 Bin 3: 25, 25, 34
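A small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries, applied to the price data above (plain Python, no libraries assumed):

```python
# Sorted prices from the example above
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
n_bins = 3
size = len(prices) // n_bins           # equal-frequency: 3 values per bin
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of
# the bin's minimum or maximum value
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```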
  • 69. 69 How to Handle Noisy Data? Regression:  Regression is a technique used to model and analyze the relationships between variables.  It predicts values by fitting the data to a function.  Ex.: predict children's height given their age, weight, and other factors.  Linear regression involves finding the “best” line to fit two attributes so that one attribute can be used to predict the other.  Multiple linear regression is an extension of linear regression, where more than two attributes are involved and the data are fit to a multidimensional surface. Outlier analysis:  Outliers may be detected by clustering,  for example, where similar values are organized into groups, or “clusters”;  values that fall outside of the set of clusters may be considered outliers.
  • 70. 70 Data Cleaning as a Process ■ Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation. ■ First step is Data discrepancy detection.  Discrepancies can be caused by several factors  poorly designed data entry forms that have many optional fields.  human error in data entry  deliberate errors e.g., respondents not wanting to divulge information about themselves.  Other sources of discrepancies include errors in instrumentation devices that record data and system errors.  Errors can also occur when the data are used for purposes other than originally intended.  There may also be inconsistencies due to data integration.
  • 71. 71 Data Cleaning as a Process ■ How can we proceed with discrepancy detection? ■ Use metadata (e.g., domain, range, dependency) and check for field overloading. ■ Examine the data using the uniqueness rule, consecutive rule, and null rule. ■ Use commercial tools: ■ Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal codes, spell-checking) to detect errors and make corrections. ■ Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions, e.g., using correlation and clustering to find outliers.
  • 72. 72 Data Cleaning as a Process ■ how can we proceed with discrepancy detection?” ■ Data migration and integration ■ Data migration tools allow simple transformations to be specified such as to replace the string “gender” by “sex.” ■ ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface. ■ The two-step process of discrepancy detection and data transformation iterates.
  • 73. 73 Data Cleaning as a Process  As a result, the entire data cleaning process suffers from a lack of interactivity.  New approaches to data cleaning emphasize increased interactivity.  Potter's Wheel is a publicly available data cleaning tool that integrates discrepancy detection and transformation.  The tool automatically performs discrepancy checking in the background on the latest transformed view of the data.  Users can gradually develop and refine transformations as discrepancies are found, leading to more effective and efficient data cleaning.  For data transformation, declarative languages (e.g., SQL extensions) and algorithms have also been developed to let users express data cleaning specifications efficiently.  It is important to keep updating the metadata to reflect this knowledge; this will help speed up data cleaning on future versions of the same data store.
  • 74. 74 74 Data Integration ■ Data integration: ■ Merging of data from multiple sources into a coherent store How can we match schema and objects from different sources? 1.Entity identification problem 2. Redundancy and correlation analysis 3.Tuple duplication 4.Data value conflict detection and resolution
  • 75. 75 75 Data Integration 1.Entity identification problem How can equivalent real world entities from multiple data sources be matched up? ■ Identify real world entities from multiple data sources, e.g., cust_id=customer number  metadata can be used to help avoid errors in schema integration.
  • 76. 76 Redundancy and Correlation Analysis ■ An attribute may be redundant if it can be “derived” from another attribute or set of attributes. ■ E.g.: annual revenue ■ Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set. ■ Redundant attributes can be detected by correlation analysis and covariance analysis. ■ Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. ■ For nominal data, use the χ2 (chi-square) test. ■ For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another. ■ Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
  • 77. 77 Correlation Analysis for Nominal Data ■ A correlation relationship between two attributes, A and B, can be discovered by a χ2 (chi-square) test. ■ The larger the χ2 value, the more likely the variables are related. ■ The cells that contribute the most to the χ2 value are those whose actual count is very different from the expected count. ■ Correlation does not imply causality ■ # of hospitals and # of car-theft in a city are correlated ■ Both are causally linked to the third variable: population
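The χ2 statistic referred to above is computed in the standard way by comparing observed and expected counts over all cells of the contingency table:

```latex
\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij}-e_{ij})^{2}}{e_{ij}},
\qquad
e_{ij} = \frac{\mathrm{count}(A=a_i)\times\mathrm{count}(B=b_j)}{n}
```

where o_ij is the observed (actual) count and e_ij is the expected count of the joint event (A = a_i, B = b_j), and n is the number of tuples.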
  • 78. 78 Chi-Square Calculation: An Example ■ χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

                            Play chess   Not play chess   Sum (row)
  Like science fiction      250 (90)     200 (360)        450
  Not like science fiction  50 (210)     1000 (840)       1050
  Sum (col.)                300          1200             1500

  ■ It shows that like_science_fiction and play_chess are correlated in the group.
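Plugging the observed and expected counts from the table into the χ2 formula gives:

```latex
\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210}
       + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840}
       = 284.44 + 121.90 + 71.11 + 30.48 \approx 507.93
```

With 1 degree of freedom, this far exceeds the 0.001-significance threshold of 10.828, so the hypothesis that the two attributes are independent is rejected; they are strongly correlated.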
  • 79. 79 Correlation Analysis for Numeric Data ■ Correlation coefficient (also called Pearson’s product moment coefficient), where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(aibi) is the sum of the AB cross-product (see the formula below). ■ If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation. ■ rA,B = 0: independent; rA,B < 0: negatively correlated (the values of one attribute increase as the values of the other attribute decrease).
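Written out in the standard Pearson form, using the symbols described above:

```latex
r_{A,B} = \frac{\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B})}{n\,\sigma_A\,\sigma_B}
        = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\,\bar{B}}{n\,\sigma_A\,\sigma_B}
```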
  • 80. 80 Correlation Analysis for Numeric Data • correlation does not imply causality. • That is, if A and B are correlated, this does not necessarily imply that A causes B or that B causes A. • For example, in analyzing a demographic database, we may find that attributes representing the number of hospitals and the number of car thefts in a region are correlated. • This does not mean that one causes the other. Both are actually causally linked to a third attribute, namely, population
  • 81. 81 Covariance of Numeric Data ■ Covariance is a measure of the directional relationship between two random variables, where n is the number of tuples, Ā and B̄ are the respective mean (expected) values of A and B, and σA and σB are the respective standard deviations of A and B (see the formulas below). ■ Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected values. ■ Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value. ■ Independence: if A and B are independent, CovA,B = 0, but the converse is not true: ■ Some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions does a covariance of 0 imply independence. ■ Correlation coefficient: rA,B = CovA,B / (σA σB).
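Written out, the covariance and its relation to the correlation coefficient mentioned above are the standard definitions:

```latex
\mathrm{Cov}(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big]
                  = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B}),
\qquad
r_{A,B} = \frac{\mathrm{Cov}(A,B)}{\sigma_A\,\sigma_B}
```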
  • 82. Co-Variance: An Example ■ The covariance can be simplified in computation as CovA,B = E(A·B) − Ā·B̄. ■ Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). ■ Question: If the stocks are affected by the same industry trends, will their prices rise or fall together? ■ E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4 ■ E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6 ■ Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4 ■ Thus, A and B rise together since Cov(A, B) > 0.
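A quick check of the stock example with NumPy; note that np.cov uses the sample (n − 1) denominator by default, so bias=True is passed to match the population formula used above.

```python
import numpy as np

A = np.array([2, 3, 5, 4, 6])     # stock A prices over one week
B = np.array([5, 8, 10, 11, 14])  # stock B prices over one week

# Simplified computation: Cov(A, B) = E(A*B) - E(A)*E(B)
cov_simple = np.mean(A * B) - np.mean(A) * np.mean(B)

# Same result from NumPy's covariance matrix (population version)
cov_matrix = np.cov(A, B, bias=True)

print(cov_simple)        # 4.0
print(cov_matrix[0, 1])  # 4.0
```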
  • 83. Tuple Duplication • In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level, • e.g., where there are two or more identical tuples for a given unique data entry case.  Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences. For example, if a purchase order database contains attributes for the purchaser’s name and address instead of a key to this information in a purchaser database, discrepancies can occur, such as the same purchaser’s name appearing with different addresses within the purchase order database.
  • 84. Data Value Conflict Detection and Resolution Data integration also involves the detection and resolution of data value conflicts. For example, for the same real-world entity, attribute values from different sources may differ. This may be due to differences in representation, scaling, or encoding. e.g.,  a weight attribute may be stored in metric units in one system and in imperial units in another;  one university may use a quarter system and another a semester system, and they may offer different database courses, which makes it difficult to work out grade conversions between them. Attributes may also differ in abstraction level,  where an attribute in one system is recorded at a lower abstraction level than the “same” attribute in another. Ex.:  the total sales in one database may refer to one branch of AllElectronics, while an attribute of the same name in another database may refer to the total sales for AllElectronics stores in a given region.
  • 85. 85 Data Reduction Strategies ■ Data reduction: Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set. 1.Data reduction strategies  dimensionality reduction  numerosity reduction  data compression.
  • 86. 86 Data Reduction Strategies Dimensionality reduction represents the original data in a compressed or reduced form by applying an encoding or transformation. It is the process of reducing the number of random variables or attributes under consideration (removing unimportant attributes).  Wavelet transforms  Principal Components Analysis (PCA)  Feature subset selection, feature creation These transform or project the original data onto a smaller space.
  • 87. 87 Data Reduction Strategies ■ Numerosity reduction reduces the data volume by choosing alternative, smaller forms of data representation. ■ These techniques may be parametric or nonparametric. In parametric methods, a model is used to estimate the data, so only the model parameters need to be stored instead of the actual data. (Outliers may also be stored.) E.g.: regression and log-linear models. Nonparametric methods store reduced representations of the data, e.g., histograms, clustering, sampling, and data cube aggregation.
  • 88. 88 Data Reduction Strategies Data compression  In data compression, transformations are applied to obtain a reduced or “compressed” representation of the original data.  If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.  If only an approximation of the original data can be reconstructed, the data reduction is called lossy.  There are several lossless algorithms for string compression, although they typically allow only limited data manipulation.  Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression; the time saved by mining on a reduced data set generally outweighs the time spent on the reduction itself.
  • 89. 89 Data Reduction: Wavelet Transform ■ The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied to a data vector X, transforms it to a numerically different vector, X′, of wavelet coefficients. ■ All wavelet coefficients larger than some user-defined threshold can be retained; the remaining coefficients are set to 0. This also helps remove noise from the data. ■ 1. The length, L, of the input data vector must be an integer power of 2. This condition can be met by padding the data vector with zeros as necessary (L ≥ n). ■ 2. Each transform involves applying two functions. The first applies some data smoothing, such as a sum or weighted average. ■ The second performs a weighted difference, which acts to bring out the detailed features of the data.
  • 90. 90 Wavelet Transform ■ 3. The two functions are applied to pairs of data points in X, that is, to all pairs of measurements (x2i, x2i+1). This results in two data sets of length L/2. ■ These represent a smoothed or low-frequency version of the input data and the high-frequency content of it, respectively. ■ 4. The two functions are applied recursively to the resulting data sets until the desired length is reached. ■ 5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data.  Equivalently, a matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients; the matrix must be orthonormal.
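As a small illustration of this procedure, the sketch below applies a hierarchical Haar DWT, thresholds the coefficients, and reconstructs an approximation. It assumes the third-party PyWavelets (pywt) package is available; the input vector and threshold value are arbitrary choices for illustration only.

```python
import numpy as np
import pywt  # PyWavelets, assumed installed

# Input vector; length 8 is already a power of 2, so no zero-padding is needed
x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

# Hierarchical Haar decomposition: a smoothed (low-frequency) part plus
# detail (high-frequency) coefficients at each level
coeffs = pywt.wavedec(x, "haar", level=2)

# Keep only coefficients above a user-defined threshold, zero the rest
threshold = 1.0
reduced = [pywt.threshold(c, threshold, mode="hard") for c in coeffs]

# Approximate (lossy) reconstruction from the reduced coefficients
x_approx = pywt.waverec(reduced, "haar")
print(x_approx)
```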
  • 91. 91 Wavelet Transform ■ Wavelet transforms can be applied to multidimensional data such as a data cube. This is done by first applying the transform to the first dimension, then to the second, and so on. ■ real world applications compression of fingerprint images  computer vision  analysis of time-series data  data cleaning
  • 92. 92 Principal Component Analysis (PCA) ■ A dimensionality-reduction method ■ used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one.
  • 93. 93 Principal Component Analysis (PCA) • Goal: obtain the principal components of the data matrix X. • Procedure: • The first principal component is the normalized linear combination of the variables that has the highest variance. • The second principal component has the largest variance subject to being uncorrelated with the first. • The principal components are thus linear combinations of the data that have high variance and are mutually uncorrelated. In a two-dimensional scatter plot, the direction along which the data varies the most is the first principal component.
  • 94. 94 Principal Component Analysis (Steps) ■ PCA can be applied to ordered and unordered attributes, and can handle sparse data and skewed data. ■ Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions. ■ Principal components may be used as inputs to multiple regression and cluster analysis. ■ In comparison with wavelet transforms, PCA tends to be better at handling sparse data, whereas wavelet transforms are more suitable for data of high dimensionality.
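A short PCA sketch with scikit-learn (assumed available); the random data and the choice of 2 components are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # hypothetical data: 100 tuples, 5 attributes

pca = PCA(n_components=2)            # keep the 2 strongest principal components
X_reduced = pca.fit_transform(X)     # project the data onto those components

print(X_reduced.shape)               # (100, 2) -- reduced representation
print(pca.explained_variance_ratio_) # variance captured by each component
```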
  • 95. Attribute Subset Selection • Why attribute subset selection – Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. – For example, ◆ if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer’s telephone number are likely to be irrelevant, unlike attributes such as age or music taste.
  • 96. Attribute Subset Selection • Using a domain expert to pick out some of the useful attributes – Sometimes this can be a difficult and time-consuming task, especially when the behavior of the data is not well known. • Leaving out relevant attributes or keeping irrelevant attributes may result in discovered patterns of poor quality. • The added volume of irrelevant or redundant attributes can also slow down the mining process.
  • 97. Attribute Subset Selection • Attribute subset selection (feature selection): – Reduce the data set size by removing irrelevant or redundant attributes. – Goal: select a minimum set of features (attributes) such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features – It reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
  • 98. Attribute Subset Selection • How can we find a ‘good’ subset of the original attributes? – For n attributes, there are 2^n possible subsets. – An exhaustive search for the optimal subset of attributes can be prohibitively expensive, especially as n increases. – Heuristic methods are commonly used for attribute subset selection. – These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time. – Such greedy methods are effective in practice and may come close to estimating an optimal solution.
  • 99. Attribute Subset Selection • Heuristic methods: – Step-wise forward selection – Step-wise backward elimination – Combining forward selection and backward elimination – Decision-tree induction • The “best” and “worst” attributes are typically determined using: – the tests of statistical significance, which assume that the attributes are independent of one another. – the information gain measure used in building decision trees for classification.
  • 100. Attribute Subset Selection • Stepwise forward selection: – The procedure starts with an empty set of attributes as the reduced set. – First: The best single-feature is picked. – Next: At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
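Stepwise forward selection can be sketched with scikit-learn's SequentialFeatureSelector (assumed available); the classifier, data set, and number of features to keep are illustrative choices, not part of the original material.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Start from an empty set and greedily add the best attribute at each step
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=2,
    direction="forward",       # use "backward" for stepwise elimination
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes
```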
  • 101. Attribute Subset Selection • Stepwise backward elimination: – The procedure starts with the full set of attributes. – At each step, it removes the worst attribute remaining in the set.
  • 102. Attribute Subset Selection • Combining forward selection and backward elimination: – The stepwise forward selection and backward elimination methods can be combined – At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes.
  • 103. Attribute Subset Selection • Decision tree induction: – Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. – At each node, the algorithm chooses the “best” attribute to partition the data into individual classes. – When decision tree induction is used for attribute subset selection, a tree is constructed from the given data. – All attributes that do not appear in the tree are irrelevant.
  • 104. Attribute Subset Selection • Decision tree induction
  • 105. Numerosity Reduction • Reduce data volume by choosing alternative, smaller forms of data representation • There are several methods for storing reduced representations of the data include histograms, clustering, and sampling.
  • 106. Data Reduction: Sampling • Sampling: obtaining a small sample s to represent the whole data set N • Suppose that a large data set, D, contains N instances. • The most common ways that we could sample D for data reduction: – Simple random sample without replacement (SRSWOR) – Simple random sample with replacement (SRSWR) – Cluster sample – Stratified sample
  • 107. Data Reduction: Sampling • Simple random sample without replacement (SRSWOR) of size s: – SRSWOR is a method of selecting s units out of the N units one by one such that at any stage of selection, each of the remaining units has the same chance of being selected, i.e., 1/N. • Simple random sample with replacement (SRSWR) of size s: – SRSWR is a method of selecting s units out of the N units one by one such that at each stage of selection, each unit has an equal chance of being selected, i.e., 1/N.
  • 108. Data Reduction: Sampling • Procedure for selecting a random sample: • 1. Identify the N units in the population with the numbers 1 to N. • 2. Choose any random number arbitrarily in the random number table and start reading numbers. • 3. Choose the sampling unit whose serial number corresponds to the random number drawn from the table of random numbers. • 4. In the case of SRSWR, all the random numbers are accepted even if repeated more than once. • In the case of SRSWOR, if any random number is repeated, it is ignored and more numbers are drawn.
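A sketch of SRSWOR and SRSWR with NumPy; the population of 100 units and the sample size of 10 are illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(1, 101)   # N = 100 units, numbered 1..N

srswor = rng.choice(population, size=10, replace=False)  # without replacement
srswr = rng.choice(population, size=10, replace=True)    # with replacement

print(srswor)  # 10 distinct units
print(srswr)   # 10 units, repeats possible
```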
  • 110. Data Reduction: Sampling • Stratified Sample: – This technique divides the elements of the population into small subgroups (strata) based on similarity, – in such a way that the elements within a group are homogeneous and heterogeneous with respect to the other subgroups formed. – Elements are then randomly selected from each of these strata. – We need prior information about the population to create the subgroups.
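A stratified-sample sketch with pandas; the stratum column, the toy data, and the 50% sampling fraction are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": range(1, 11),
    "age_group":   ["youth"] * 4 + ["adult"] * 4 + ["senior"] * 2,
})

# Draw 50% of the rows from each stratum, keeping the group proportions
stratified = df.groupby("age_group").sample(frac=0.5, random_state=1)
print(stratified)
```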
  • 111. Data Reduction: Sampling Raw Data Stratified Sample
  • 112. Data Cube Aggregation • Used to aggregate data into a simpler form. • Example • Imagine that the information gathered for your analysis covers the years 2012 to 2014 and includes your company's revenue for every three months (per quarter). • If you are interested in annual sales rather than quarterly totals, the data can be aggregated so that the resulting data set summarizes the total sales per year instead of per quarter.
  • 113. Data Cube Aggregation • Sales data for a given branch of AllElectronics for the years 2002 to 2004.
  • 114. Data Cube Aggregation • Data cubes store multidimensional aggregated information. • Data cubes provide fast access to precomputed, summarized data, thereby benefiting on-line analytical processing as well as data mining. • A data cube for sales at AllElectronics.
  • 115. Data Cube Aggregation • Base cuboid: – The cube created at the lowest level of abstraction is referred to as the base cuboid. – The base cuboid should correspond to an individual entity of interest, such as sales or customer. • Apex cuboid: – A cube at the highest level of abstraction is the apex cuboid. – For the sales data, the apex cuboid would give one total— the total sales.
  • 116. 116 Parametric Data Reduction: Regression and Log-Linear Models ■ In parametric methods, data are represented using some model. ■ Regression can be simple linear regression or multiple linear regression. ■ Simple linear regression involves only a single independent attribute. ■ Multiple linear regression involves multiple independent attributes. ■ In simple linear regression, the data are modeled to fit a straight line. ■ Ex.: ■ a random variable y can be modeled as a linear function of another random variable x with the equation y = ax + b, where a and b (the regression coefficients) specify the slope and y-intercept of the line, respectively.
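A minimal sketch of fitting the y = ax + b model with NumPy least squares; the data points are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # independent attribute
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])  # dependent attribute

# Fit y = a*x + b: only the two regression coefficients need to be stored,
# not the original data (parametric numerosity reduction)
a, b = np.polyfit(x, y, deg=1)

print(a, b)        # slope and y-intercept
print(a * 6 + b)   # estimate y for a new x value
```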
  • 117. 117 Parametric Data Reduction: Regression and Log-Linear Models • Log-Linear Model: Log-linear model can be used to estimate the probability of each data point in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations. • This allows a higher-dimensional data space to be constructed from lower- dimensional attributes. • Regression and log-linear model can both be used on sparse data, although their application may be limited.
  • 118. 118 Histogram Analysis ■ Histogram is the data representation in terms of frequency. ■ It uses binning to approximate data distribution ■ popular form of data reduction
  • 119. 119 Clustering ■ Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only ■ In data reduction, the cluster representation of the data are used to replace the actual data. ■ It also helps to detect outliers in data
  • 120. 120 Data Transformation ■ data transformed or consolidated into forms appropriate for mining ■ Methods ■ Smoothing: Remove noise from data ■ Attribute/feature construction ■ New attributes constructed from the given ones ■ Aggregation: Summarization, data cube construction ■ Normalization: Scaled to fall within a smaller, specified range ■ min-max normalization ■ z-score normalization ■ normalization by decimal scaling ■ Discretization: divide the range of continuous attribute into intervals.
  • 121. 121 Normalization • Min-max normalization: • This transforms the original data linearly. • Suppose that min_A is the minimum and max_A is the maximum value of an attribute A, and the target range is [new_min_A, new_max_A]. • Formula: v’ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A • where v is the original attribute value and v’ is the new value obtained after normalizing the old value. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,000 is mapped to (73,000 − 12,000) / (98,000 − 12,000) ≈ 0.709.
  • 122. 122 Normalization ■ Z-score normalization • In zero-mean normalization, the values of an attribute A are normalized based on the mean (μ) of A and its standard deviation (σ). • A value, v, of attribute A is normalized to v’ by computing v’ = (v − μ) / σ. • Ex. Let μ = 54,000 and σ = 16,000. Then the income value $73,000 used earlier is normalized to (73,000 − 54,000) / 16,000 ≈ 1.19. • Decimal Scaling: • It normalizes the values of an attribute by changing the position of their decimal point. • The number of positions by which the decimal point is moved is determined by the maximum absolute value of attribute A. • A value, v, of attribute A is normalized to v’ by computing v’ = v / 10^j, where j is the smallest integer such that Max(|v’|) < 1.
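The three normalization methods above, sketched in plain Python using the income figures from the slides:

```python
# Income value to normalize, with the statistics given above
v = 73_000.0
min_a, max_a = 12_000.0, 98_000.0   # observed minimum and maximum
mu, sigma = 54_000.0, 16_000.0      # mean and standard deviation

# Min-max normalization to the new range [0.0, 1.0]
v_minmax = (v - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0

# Z-score (zero-mean) normalization
v_zscore = (v - mu) / sigma

# Decimal scaling: divide by 10**j, with j chosen so the largest |value| < 1
j = len(str(int(abs(max_a))))       # 98,000 has 5 digits -> j = 5
v_decimal = v / 10 ** j

print(round(v_minmax, 3))   # ~0.709
print(round(v_zscore, 3))   # ~1.188
print(v_decimal)            # 0.73
```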
  • 123. 123 Data Discretization Methods ■ Typical methods: All the methods can be applied recursively ■ Binning ■ Binning groups related values together in bins to reduce the number of distinct values. ■ Top-down split, unsupervised ■ Histogram analysis ■ partition the values for an attribute into disjoint ranges called buckets. ■ Top-down split, unsupervised(does not use class name)
  • 124. 124 Data Discretization Methods ■ Typical methods: All the methods can be applied recursively ■ Clustering analysis • Cluster analysis is a popular data discretization method. • A clustering algorithm can be applied to discretize a numeric attribute A by partitioning the values of A into clusters or groups. • Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy. ■ unsupervised, top-down split or bottom-up merge ■ Decision-tree analysis ■ supervised, top-down split ■ Correlation (e.g., χ2) analysis ■ unsupervised, bottom-up merge
  • 125. 125 Binning Methods for Data Smoothing ❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
  • 126. 126 Concept Hierarchy Generation ■ Concept hierarchy formation: Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior) ■ in the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. ■ Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers ■ Concept hierarchy can be automatically formed for both numeric and nominal data.
  • 127. 127 Concept Hierarchy Generation for Nominal Data ■ Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts ■ street < city < state < country ■ Specification of a hierarchy for a set of values by explicit data grouping ■ {Urbana, Champaign, Chicago} < Illinois ■ Specification of only a partial set of attributes ■ E.g., only street < city, not others ■ Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values ■ E.g., for a set of attributes: {street, city, state, country}
  • 128. 128 Automatic Concept Hierarchy Generation ■ Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set ■ The attribute with the most distinct values is placed at the lowest level of the hierarchy ■ Exceptions, e.g., weekday, month, quarter, year ■ E.g.: street (674,339 distinct values) < city (3,567 distinct values) < province_or_state (365 distinct values) < country (15 distinct values)
  • 129. Data cube • The grouping of data in a multidimensional matrix is called a data cube. • A data cube is generally used to easily interpret data. • It is especially useful when representing data together with dimensions as certain measures of business. • It is an extension of the 2-dimensional matrix (rows and columns) to more dimensions.
  • 130. Data cube • We often need to abstract the relevant or important data from a large, complex data set; this is where the data cube comes into the picture. • A data cube is basically used to represent specific information to be retrieved from a huge set of complex data. • e.g.: purchasing in a shopping mall
  • 131. Data cube: Types • The data cube can be classified into two categories: • Multidimensional data cube: • It helps in storing large amounts of data by making use of a multi-dimensional array. • It increases its efficiency by keeping an index on each dimension, and is thus able to retrieve data quickly. • Relational data cube: • It helps in storing large amounts of data by making use of relational tables. • Each relational table displays the dimensions of the data cube. • It is slower compared to a multidimensional data cube.
  • 132. Data Cube: Characteristics • It can go far beyond two dimensions to include many more dimensions. • It improves business strategies by supporting analysis of all the data. • It helps to capture the latest market scenario by establishing trends and performance analysis. • It plays a pivotal role by creating intermediate data cubes to serve requirements and to bridge the gap between the data warehouse and the reporting tools.
  • 133. Data Cube: Benefits • Increases the productivity of an enterprise. • Improves overall performance and efficiency. • Representation of huge and complex data sets gets simplified and streamlined. • Huge databases and complex SQL queries also become manageable. • Indexing and ordering provide the best set of data for analysis and data mining techniques. • Faster and more easily accessible, as it possesses pre-defined and pre-calculated data sets (data cubes).
  • 134. Data Cube: Benefits • Aggregation of data makes access to all data very fast at each micro-level, which ultimately leads to easy and efficient maintenance and reduced development time. • OLAP on data cubes helps in getting fast response times, a fast learning curve, a versatile environment, reach to a wide range of applications, fewer resources needed for deployment, and less wait time with quality results.
  • 135. Statistical Descriptions of Data  Statistics help in identifying patterns that in turn help distinguish random noise from significant findings.  Descriptive statistics are used to describe or summarize data in ways that are meaningful and useful.  For data preprocessing to be successful, it is essential to have an overall picture of your data.  Statistical descriptions are used to identify properties of the data and highlight which data values should be treated as noise or outliers.
  • 136. Statistical Descriptions of Data • Statistics is actually a form of mathematical analysis. • It is an area of applied mathematics concerned with data collection, analysis, interpretation, and presentation. • Statistics deals with how data can be used to solve complex problems. • Statistics makes work easy and simple and provides a clear and clean picture of the work you do on a regular basis. • Basic terminology of Statistics: • Population: a collection or set of individuals, objects, or events whose properties are to be analyzed.
  • 137. Statistical Descriptions of Data Descriptive statistics uses data to provide a description of the population, either through numerical calculations or graphs or tables. It provides a graphical summary of data. • Measure of central tendency • Measure of variability Measures of central tendency are summary statistics used to represent the center point or a typical value of a data set or sample set. (i) Mean: the average of all values in a sample set. For example, the mean of 1, 3, 5, 6, 7 is (1 + 3 + 5 + 6 + 7) / 5 = 4.4.
  • 138. Statistical Descriptions of Data (ii) Median: the central value of a sample set. The data set is ordered from lowest to highest value and the exact middle value is taken. For example, the median of 1, 3, 5, 6, 7 is 5.
  • 139. Statistical Descriptions of Data (iii) Mode: the value that occurs most frequently in the sample set. The value repeated most of the time in the data set is the mode. For example, the mode of 1, 3, 3, 5, 6 is 3.
  • 140. Statistical Descriptions of Data • Measure of Variability – Measure of variability is also known as measure of dispersion and helps to understand the distribution of the data. • Three common measures of variability: • (i) Range: a measure of how spread apart the values in a data set are. Range = Maximum value − Minimum value; e.g., for 1, 3, 5, 6, 7, Range = 7 − 1 = 6. • (ii) Variance: measures how far each number in the set is from the mean. s² = (1/n) Σᵢ₌₁..ₙ (xᵢ − x̄)², where n is the total number of data points, x̄ is the mean of the data points, and xᵢ is an individual data point. • (iii) Standard deviation (a measure of dispersion): describes how spread out a set of data is. σ = √[ (1/n) Σᵢ₌₁..ₙ (xᵢ − μ)² ]
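The central-tendency and variability measures above can be computed directly with Python's standard statistics module; the sample values are the ones used in the range example (with a small made-up set for the mode).

```python
import statistics

data = [1, 3, 5, 6, 7]

print(statistics.mean(data))             # 4.4    -- mean
print(statistics.median(data))           # 5      -- median
print(statistics.mode([1, 3, 3, 5, 6]))  # 3      -- mode of a set with a repeat
print(max(data) - min(data))             # 6      -- range
print(statistics.pvariance(data))        # 4.64   -- population variance (divide by n)
print(statistics.pstdev(data))           # ~2.154 -- population standard deviation
```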