1. SUBMITTED BY: SHUVRA GHOSH
ROLL NO: 07
COURSE: MLIS
GUIDED BY: PROF. UDAYAN BHATTACHARYA
DEPARTMENT OF LIBRARY AND
INFORMATION SCIENCE
JADAVPUR UNIVERSITY
2.
The process of discovering valuable information from a
collection of data; that is, the process of converting raw
data into useful information.
Knowledge discovery is an activity that produces
knowledge by discovering it or deriving it from existing
information.
Knowledge Discovery refers to the overall process of
discovering useful knowledge from data, and data mining
refers to a particular step in this process.
5. • Database data
• Data Warehouse
• Transactional data
• Other kinds of data:
  Time-related data
  Sequence data (historical data records, stock exchange data)
  Data streams (video surveillance, sensor data)
  Spatial data (maps)
  Hypertext and multimedia data (text, video, audio)
  Graph and networked data
  Engineering design data (AutoCAD)
  Web data
6. • Interactive
• Iterative
• Procedure to extract knowledge from data
• Knowledge being searched for is –
implicit
previously unknown
potentially useful
8. Data Cleaning − in this step, noise and inconsistent data are
removed. Example: parsing the data.
Parsing is performed to detect syntax errors: the parser decides
whether a given string of data is acceptable within the data
specification.
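The parsing example can be sketched as follows. This is a minimal illustration that assumes a hypothetical data specification (an account ID of the form `AC` plus four digits, a comma, and an ISO date); the pattern, the function `parse_record`, and the sample lines are invented for the example.

```python
import re
from datetime import datetime

# Hypothetical specification: "<account id>,<date>", e.g. "AC1234,2020-01-31".
RECORD_PATTERN = re.compile(r"^(AC\d{4}),(\d{4}-\d{2}-\d{2})$")


def parse_record(line):
    """Return (account_id, date) if the line meets the specification, else None."""
    match = RECORD_PATTERN.match(line.strip())
    if match is None:
        return None  # syntax error: the string does not have the required shape
    account_id, date_text = match.groups()
    try:
        date = datetime.strptime(date_text, "%Y-%m-%d").date()
    except ValueError:
        return None  # right shape, but not a real calendar date
    return account_id, date


raw_lines = [
    "AC1234,2020-01-31",
    "AC12,2020-01-31",      # account ID too short
    "AC5678,2020-02-30",    # impossible date
    " AC9999,2021-12-01 ",  # acceptable once surrounding spaces are stripped
]

clean = [rec for rec in map(parse_record, raw_lines) if rec is not None]
rejected = [line for line in raw_lines if parse_record(line) is None]
```

Records that fail the parser land in `rejected`, where they can be corrected or discarded before the later steps.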
9. Data Integration − in this step, multiple data sources are combined.
Example: data from the retail loan application, the commercial loan
application, and the demand deposit application are combined in a
bank data warehouse.
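A minimal sketch of the bank example, assuming each application exports records keyed by a shared `customer_id`. The field names, the sample values, and the function `integrate` are hypothetical.

```python
# Hypothetical extracts from three separate banking applications.
retail_loans = [
    {"customer_id": 1, "retail_loan": 5000.0},
    {"customer_id": 2, "retail_loan": 12000.0},
]
commercial_loans = [{"customer_id": 2, "commercial_loan": 250000.0}]
demand_deposits = [
    {"customer_id": 1, "deposit": 800.0},
    {"customer_id": 3, "deposit": 150.0},
]


def integrate(*sources):
    """Merge the records of every source into one row per customer_id."""
    warehouse = {}
    for source in sources:
        for record in source:
            row = warehouse.setdefault(record["customer_id"], {})
            row.update(record)
    return [warehouse[key] for key in sorted(warehouse)]


customers = integrate(retail_loans, commercial_loans, demand_deposits)
```

Each customer ends up with a single combined row, whichever applications they appear in.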
10. Data Selection − in this step, data relevant to the analysis task
are retrieved from the database.
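As an illustration, selection can be expressed as a database query that keeps only the rows and columns the task needs. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table `accounts` and its columns are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (customer_id INTEGER, product TEXT, amount REAL, branch TEXT)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?, ?)",
    [
        (1, "retail_loan", 5000.0, "Kolkata"),
        (2, "commercial_loan", 250000.0, "Delhi"),
        (3, "retail_loan", 12000.0, "Kolkata"),
        (4, "deposit", 800.0, "Mumbai"),
    ],
)

# Analysis task (assumed): study retail loan amounts, so only those rows
# and only the two relevant columns are retrieved.
task_rows = conn.execute(
    "SELECT customer_id, amount FROM accounts WHERE product = ? ORDER BY customer_id",
    ("retail_loan",),
).fetchall()
conn.close()
```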
11. Data Transformation − in this step, data is transformed or consolidated into
forms appropriate for mining by performing summary or aggregation
operations.
The aggregation operators perform mathematical operations such as Average,
Aggregate, Count, Max, Min, and Sum on a numeric property of the
elements in a collection.
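A minimal sketch of such summary operations in Python, assuming hypothetical transaction data given as (region, amount) pairs:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactions: (region, amount).
transactions = [
    ("north", 120.0),
    ("north", 80.0),
    ("south", 200.0),
    ("south", 50.0),
    ("south", 50.0),
]

# Group the numeric property (amount) by region.
by_region = defaultdict(list)
for region, amount in transactions:
    by_region[region].append(amount)

# Consolidate each group into one summary row.
summary = {
    region: {
        "count": len(amounts),
        "sum": sum(amounts),
        "average": mean(amounts),
        "min": min(amounts),
        "max": max(amounts),
    }
    for region, amounts in by_region.items()
}
```

The five detailed rows are reduced to one summary row per region, which is the form a mining method would then work on.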
12. Data Mining − in this step, intelligent methods are applied in order to
extract data patterns.
The intelligent methods include:
• Association
• Classification (for example, decision trees)
• Clustering
• Regression
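To illustrate one of these methods, the sketch below computes the support and confidence of an association rule over a few hypothetical market baskets. The data and the helper names `support` and `confidence` are invented for the example; real association mining would also search for candidate rules rather than test a single one.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


# Rule under test: bread -> milk
rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
```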
19. Efforts to establish a KDP model were initiated in
academia in the mid-1990s.
When the DM field was being shaped, researchers started
defining multistep procedures to guide users of DM tools in
the complex knowledge discovery world.
The two process models developed in 1996 and 1998 are,
respectively, the nine-step model by Fayyad et al. and the
eight-step model by Anand and Buchner.
20. 1. Developing and understanding the application domain. This step
includes learning the relevant prior knowledge and the goals of the end user of
the discovered knowledge.
2. Creating a target data set. Here the data miner selects a subset of variables
(attributes) and data points (examples) that will be used to perform discovery
tasks. This step usually includes querying the existing data to select the desired
subset.
3. Data cleaning and pre-processing. This step consists of removing outliers,
dealing with noise and missing values in the data, and accounting for time
sequence information and known changes.
4. Data reduction and projection. This step consists of finding useful
attributes by applying dimension reduction and transformation methods, and
finding an invariant representation of the data.
5. Choosing the data mining task. Here the data miner matches the goals
defined in Step 1 with a particular DM method, such as classification,
regression, clustering, etc.
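Step 3 above can be sketched as follows. This is one possible approach, not the one prescribed by the model: outliers are dropped with a median-absolute-deviation rule and missing values are filled with the median of the remaining readings. The function `clean`, the threshold `k`, and the sample readings are assumptions.

```python
from statistics import median


def clean(values, k=3.0):
    """Drop outliers and fill missing values (None) in a list of readings.

    A reading is treated as an outlier when it lies more than k median
    absolute deviations from the median of the readings that are present.
    """
    present = [v for v in values if v is not None]
    center = median(present)
    mad = median([abs(v - center) for v in present])
    inliers = [v for v in present if abs(v - center) <= k * mad]
    fill = median(inliers)  # value used for the missing entries

    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(fill)
        elif abs(v - center) <= k * mad:
            cleaned.append(v)
        # otherwise the reading is an outlier and is dropped
    return cleaned


readings = [10.0, 11.0, None, 9.0, 10.5, 95.0, 10.0]
cleaned = clean(readings)
```

Here the reading 95.0 is removed and the missing entry is replaced by 10.0, the median of the remaining values.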
22. Two representative industrial models are the five-step model by
Cabena et al., developed with support from IBM, and the six-step
CRISP-DM model, developed by a large consortium of
European companies.
23. The CRISP-DM (Cross-Industry Standard Process for Data Mining)
was first established in the late 1990s by four companies: Integral
Solutions Ltd. (a provider of commercial data mining solutions),
NCR (a database provider), DaimlerChrysler (an automobile
manufacturer), and OHRA (an insurance company).
26. The development of academic and industrial models has led to the
development of hybrid models, i.e., models that combine aspects of both.
One such model is a six-step KDP model developed by Cios et al.
Compared with the CRISP-DM model, the main differences and extensions include
• providing a more general, research-oriented description of the steps,
• introducing a data mining step instead of the modeling step,
• introducing several new explicit feedback mechanisms (the CRISP-DM
model has only three major feedback sources, while the hybrid model
has more detailed feedback mechanisms), and
• modifying the last step, since in the hybrid model the knowledge
discovered for a particular domain may be applied in other domains.
28. 1. Understanding of the problem domain. This initial step involves
working closely with domain experts to define the problem and
determine the project goals, identifying key people, and learning about
current solutions to the problem. It also involves learning domain-
specific terminology. A description of the problem, including its
restrictions, is prepared. Finally, project goals are translated into DM
goals, and the initial selection of DM tools to be used later in the process
is performed.
2. Understanding of the data. This step includes collecting sample data
and deciding which data, including format and size, will be needed.
Background knowledge can be used to guide these efforts. Data are
checked for completeness, redundancy, missing values, plausibility of
attribute values, etc. Finally, the step includes verification of the
usefulness of the data with respect to the DM goals.
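The checks named in step 2 can be sketched as a small data-profiling routine. The records, the field names, the plausible ranges, and the function `profile` are all hypothetical; a real project would take the ranges from domain experts.

```python
# Hypothetical sample data collected for the project.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},  # missing value
    {"id": 3, "age": 212, "income": 48000},   # implausible age
    {"id": 1, "age": 34, "income": 52000},    # redundant (duplicate) row
]

# Assumed plausible ranges per attribute: (low, high), inclusive.
plausible_ranges = {"age": (0, 120), "income": (0, 10_000_000)}


def profile(records, plausible_ranges):
    """Count duplicate rows, missing values, and out-of-range values."""
    report = {"rows": len(records), "duplicates": 0, "missing": {}, "implausible": {}}
    seen = set()
    for record in records:
        key = tuple(sorted(record.items()))
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
        for field, value in record.items():
            if value is None:
                report["missing"][field] = report["missing"].get(field, 0) + 1
            elif field in plausible_ranges:
                low, high = plausible_ranges[field]
                if not low <= value <= high:
                    report["implausible"][field] = report["implausible"].get(field, 0) + 1
    return report


report = profile(records, plausible_ranges)
```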
31. Knowledge Discovery in Databases is the process by which a task is
identified and performed upon a database in order to extract
information about the elements of the database. The process involves
first collecting the data to be analysed, cleaning the data, and
reducing it to the features of interest. The tool or tools to be used
upon the data are then identified and used to mine the data for
information. Once the information has been produced, it must be
evaluated for its efficacy with respect to the task. Any knowledge
gained is then re-incorporated into the process, and can also be used
for purposes outside its scope.
This is a very complex process, but it lends itself to a fair
degree of automation. As such, it enters the field of artificial
intelligence, not just because of the tools it employs, but because the
process tries to re-incorporate the knowledge it has created.