2. WEB MINING
•Extraction of knowledge from web data.
•Web data includes:
web documents.
hyperlinks between documents.
usage logs of web sites, etc.
A panel organized at ICTAI 1997 (Srivastava and
Mobasher 1997) asked the question “Is there anything
distinct about web mining (compared to data mining in
general)?”
3. WEB MINING: APPROACHES
First was a “Process-centric view” which defined web
mining as a sequence of tasks (Etzioni 1996).
Resource finding.
Information selection and preprocessing.
Generalization.
Analysis.
Kosala and Blockeel divided the web mining process into
the following five subtasks:
Resource finding and retrieving.
Information selection and preprocessing.
Patterns analysis and recognition.
4. WEB MINING: APPROACHES (CONT.)
Validation and interpretation.
Visualization.
Second was a “Data-centric view”, which defined web
mining in terms of the types of web data that were being used
in the mining process (Cooley, Srivastava, and Mobasher
1997). The second definition has become more widely accepted.
In this presentation we follow the data-centric view of web
mining, which is defined as follows:
“Web mining is the application of data mining techniques
to extract knowledge from web data, i.e. Web content,
Web structure, and Web usage data.”
6. WEB CONTENT MINING
Mining, extraction and integration of useful data, information
and knowledge from Web page content.
Content data is the collection of facts a web page is
designed to contain. It may consist of text, images, audio,
video, or structured records such as lists and tables.
Search engines generally provide neither structural information
nor the ability to categorize, filter, or interpret documents.
In recent years these factors have prompted researchers to
develop more intelligent tools for information retrieval, such
as intelligent web agents.
Research is ongoing in information retrieval methods,
natural language processing and computer vision.
7. WEB CONTENT MINING- PROBLEMS
Data/information extraction: Our focus will be on the extraction
of structured data from Web pages, such as products and
search results. Extracting such data allows one to provide
value-added services. Two main types of techniques are used:
machine learning and automatic extraction.
Web information integration and schema matching:
Although the Web contains a huge amount of data, each
web site (or even page) represents similar information
differently. How to identify or match semantically similar data
is a very important problem with many practical applications.
Opinion extraction from online sources: There are many
online opinion sources, e.g., customer reviews of products,
forums, blogs and chat rooms. Mining opinions (especially
consumer opinions) is of great importance for marketing
intelligence and product benchmarking.
8. WEB CONTENT MINING- PROBLEMS
Knowledge synthesis: Concept hierarchies or ontologies are
useful in many applications. However, generating them
manually is very time consuming.
Segmenting Web pages and detecting noise: In many
Web applications, one only wants the main content of the
Web page, without advertisements, navigation links, or
copyright notices. Automatically segmenting Web pages to
extract their main content is an interesting problem.
10. WEB CONTENT MINING - APPROACHES
The database approaches to Web mining have generally
focused on techniques for integrating and organizing the
heterogeneous and semi-structured data on the Web into more
structured and high-level collections of resources, such as in
relational databases, and using standard database querying
mechanisms and data mining techniques to access and analyze
this information.
Multilevel databases: The main idea behind these proposals is
that the lowest level of the database contains primitive semi-
structured information stored in various Web repositories, such
as hypertext documents. At the higher level(s), metadata or
generalizations are extracted from the lower levels and organized in
structured collections such as relational or object-oriented
databases. The ARANEUS system extracts relevant information from
hypertext documents and integrates it into higher-level
derived Web Hypertexts, which are generalizations of the notion
of database views.
11. WEB CONTENT MINING - APPROACHES
Web query systems: Many Web-based query systems and
languages have been developed recently that attempt to utilize
standard database query languages such as SQL, structural
information about Web documents, and even natural language
processing to accommodate the types of queries that are used
in World Wide Web searches. Examples include W3QL, WebLog,
Lorel, UnQL, and TSIMMIS.
The agent-based approach to web mining involves the
development of sophisticated AI systems that can act
autonomously or semi-autonomously on behalf of a particular
user, to discover and organize web-based information.
Intelligent search agents: Several intelligent Web agents
have been developed that search for relevant information using
characteristics of a particular domain (and possibly a user profile)
to organize and interpret the discovered information.
12. WEB CONTENT MINING - APPROACHES
Examples include Harvest, FAQ-Finder, Information Manifold,
OCCAM, ParaSite, ShopBot, and ILA.
Information filtering/categorization: A number of Web
agents use various information retrieval techniques and
characteristics of open hypertext Web documents to
automatically retrieve, filter, and categorize them.
Examples include HyPursuit and BO (Bookmark Organizer).
18. PAGE RANK
• Simple solution: create a stochastic matrix of
the Web:
• Each page i corresponds to row i and column
i of the matrix.
• If page j has n successors (links), then the ijth
cell of the matrix is equal to:
1/n if page i is one of these n
successors of page j,
0 otherwise.
19. PAGE RANK – EXAMPLE
Assume that the Web consists of only three pages - A, B,
and C. The links among these pages are shown below.
[Figure: link graph among pages A, B, and C]
Let [a, b, c] be the vector of importances for these three
pages. The stochastic matrix of this Web is:

      A    B    C
A    1/2  1/2   0
B    1/2   0    1
C     0   1/2   0
20. PAGE RANK – EXAMPLE (CONT.)
The equation describing the asymptotic values of these
three variables is:
[a]   [1/2  1/2   0] [a]
[b] = [1/2   0    1] [b]
[c]   [ 0   1/2   0] [c]
We can solve equations like this one by starting with the
assumption a = b = c = 1, and applying the matrix to the current
estimate of these values repeatedly. The first four iterations give
the following estimates:
a = 1 1 5/4 9/8 5/4 … 6/5
b = 1 3/2 1 11/8 17/16 … 6/5
c = 1 1/2 3/4 1/2 11/16 ... 3/5
21. PAGE RANK – EXAMPLE (CONT.)
In the limit, the solution is a = b = 6/5 and c = 3/5. That is, a
and b have the same importance, each twice that of c.
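The iteration above can be sketched in Python. This is a minimal illustration, not a production PageRank: the successor lists are read off the example matrix (A links to A and B, B links to A and C, C links to B), and no damping factor is used.

```python
# Power-iteration sketch of the three-page PageRank example above.
def pagerank(successors, pages, iterations=100):
    """Repeatedly apply the column-stochastic matrix M, where
    M[i][j] = 1/n if page i is one of page j's n successors."""
    rank = {p: 1.0 for p in pages}          # start with a = b = c = 1
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for j, succs in successors.items():
            share = rank[j] / len(succs)    # page j splits its rank evenly
            for i in succs:
                new[i] += share
        rank = new
    return rank

links = {"A": ["A", "B"], "B": ["A", "C"], "C": ["B"]}
r = pagerank(links, ["A", "B", "C"])
# converges to a = b = 6/5 and c = 3/5, as in the slide
```

Because the matrix is column-stochastic, the total rank (here 3) is preserved at every step, which is why the limits sum to 6/5 + 6/5 + 3/5 = 3.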
26. HITS
Define a matrix A whose rows and columns
correspond to Web pages with entry Aij=1 if page i
links to page j, and 0 if not.
Let a and h be vectors whose ith components
correspond to the degrees of authority and
hubbiness of the ith page. Then:
h = A × a. That is, the hubbiness of each page is the
sum of the authorities of all the pages it links to.
a = AT × h. That is, the authority of each page is the
sum of the hubbiness of all the pages that link to it (AT
is the transposed matrix).
Then: a = AT × A × a and h = A × AT × h.
27. HUB AND AUTHORITIES - EXAMPLE
Consider the Web presented below.
[Figure: link graph among pages A, B, and C]

     1 1 1        1 0 1         3 1 2         2 2 1
A =  0 0 1   AT = 1 0 1   AAT = 1 1 0   ATA = 2 2 1
     1 1 0        1 1 0         2 0 2         1 1 2
28. HUB AND AUTHORITIES - EXAMPLE
If we assume that the vectors
h = [ ha, hb, hc ] and a = [ aa, ab, ac ] are
each initially [ 1,1,1 ], the first three iterations of
the equations for a and h are the following:
aa = 1 5 24 114
ab = 1 5 24 114
ac = 1 4 18 84
ha = 1 6 28 132
hb = 1 2 8 36
hc = 1 4 20 96
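The three unnormalized iterations above can be reproduced with a small sketch, applying a = (ATA)a and h = (AAT)h with the matrices from the example (real HITS implementations normalize a and h each round to keep the values bounded):

```python
# Unnormalized HITS iterations for the example above.
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

ATA = [[2, 2, 1], [2, 2, 1], [1, 1, 2]]   # A^T A from the example
AAT = [[3, 1, 2], [1, 1, 0], [2, 0, 2]]   # A A^T from the example

a = [1, 1, 1]   # authorities [aa, ab, ac]
h = [1, 1, 1]   # hubbiness   [ha, hb, hc]
for _ in range(3):
    a = matvec(ATA, a)
    h = matvec(AAT, h)
# a == [114, 114, 84] and h == [132, 36, 96], matching the table
```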
30. Data Sources
server level collection: the server stores data
regarding requests performed by the clients, so the data
generally regard just one source;
client level collection: the client itself sends
information regarding the user's behaviour to a repository
(this can be implemented using a remote agent, such as
JavaScript or a Java applet, or by modifying the source
code of an existing browser, such as Mosaic or Mozilla,
to enhance its data collection capabilities);
proxy level collection: information is stored on the
proxy side, so the Web data regard several Web sites, but
only users whose Web clients pass through the proxy.
35. PREPROCESSING
Convert raw usage data into suitable data
abstractions.
This is the most difficult task in Web usage mining, due
to the incompleteness of the available data.
37. DATA CLEANING
Irrelevant records in the web access log are eliminated during
data cleaning.
Since the target of Web usage mining is to get the user's
travel patterns, the following two kinds of records are
unnecessary and should be removed:
Records of graphics, videos and format information:
records whose filenames have suffixes such as GIF, JPEG, or
CSS, which can be found in the URI field of each record.
Records with a failed HTTP status code.
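A minimal cleaning pass over these two rules might look like the sketch below. The record structure (a dict with "uri" and "status" fields) is an assumption; real code would first parse the server's log format.

```python
# Data-cleaning sketch: drop graphics/format files and failed requests.
IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg", ".css", ".js")

def clean(records):
    kept = []
    for rec in records:
        if rec["uri"].lower().endswith(IRRELEVANT_SUFFIXES):
            continue                 # graphics / format-information records
        if rec["status"] >= 400:
            continue                 # failed HTTP status codes
        kept.append(rec)
    return kept

log = [
    {"uri": "/index.html", "status": 200},
    {"uri": "/logo.gif", "status": 200},
    {"uri": "/style.css", "status": 200},
    {"uri": "/missing.html", "status": 404},
]
# clean(log) keeps only the /index.html record
```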
38. USER & SESSION IDENTIFICATION
The task of user and session identification is to find out the
different user sessions from the original web access log.
User identification determines who accessed the web site and
which pages were accessed.
Session identification divides the page accesses of each
user into individual sessions.
The difficulties in accomplishing this step are introduced by
proxy servers: different users may share the same IP
address in the log.
A referrer-based method is proposed to solve these
problems in this study.
39. USER & SESSION IDENTIFICATION
Different IP addresses distinguish different users.
If the IP addresses are the same, different browsers and
operating systems indicate different users.
If the IP address, browser and operating system are all the
same, the referrer information should be taken into account:
“The Referrer URI field is checked, and a new user session is
identified if the URL in the Referrer URI field has not been
accessed previously, or, if the Referrer URI field is empty,
if there is a large interval (usually more than 10 seconds)
between the access times of this record and the previous
one.”
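The three rules above can be sketched as follows. This is a hypothetical simplification: records are assumed to be dicts with ip, agent, uri, and referrer fields, and the 10-second interval check is omitted.

```python
# User-identification sketch: rules 1-2 via the (IP, agent) key,
# rule 3 (simplified) via the referrer field.
def identify_users(records):
    """Return, for each record, its user key and whether it starts
    a new session under the simplified referrer rule."""
    seen = {}                                # (ip, agent) -> pages visited
    labels = []
    for rec in records:
        key = (rec["ip"], rec["agent"])      # rules 1 and 2
        pages = seen.setdefault(key, set())
        # rule 3 (simplified): an unseen referrer suggests a new session
        new_session = (rec["referrer"] is not None
                       and rec["referrer"] not in pages)
        pages.add(rec["uri"])
        labels.append((key, new_session))
    return labels

hits = [
    {"ip": "1.2.3.4", "agent": "Firefox/1.0", "uri": "/a", "referrer": None},
    {"ip": "1.2.3.4", "agent": "Chrome/1.0", "uri": "/a", "referrer": None},
    {"ip": "1.2.3.4", "agent": "Firefox/1.0", "uri": "/b", "referrer": "/a"},
]
users = identify_users(hits)
# the first two hits share an IP but differ in agent: two distinct users
```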
40. PATH COMPLETION
A session identified by rule 3 may contain more than one
visit by the same user at different times; a time-oriented
heuristic is then used to divide the different visits into
different user sessions.
Path completion process should be used for acquiring the
complete user access path.
Several factors result in incomplete paths: for instance, the
local cache, agent cache, the “POST” technique and the
browser's “back” button can all cause important accesses to go
unrecorded in the access log file, so the number of Uniform
Resource Locators (URLs) recorded in the log may be less than
the real one.
41. PATH COMPLETION
Local caching and proxy servers also make path completion
difficult, because users can access pages from the local
cache or the proxy server's cache without leaving any record
in the server's access log.
As a result, the user access paths are incompletely
preserved in the web access log. To discover user’s travel
pattern, the missing pages in the user access path should be
appended. The purpose of the path completion is to
accomplish this task.
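One common referrer-based heuristic for this task can be sketched as below: if a request's referrer is not the previous page in the path but was visited earlier, the user likely returned to it via the cached "back" button, so the referrer page is re-inserted before the current request. This is a hypothetical simplification of path completion, not the full algorithm.

```python
# Path-completion sketch using the referrer heuristic.
def complete_path(session):
    """session: list of (page, referrer) pairs in access-log order.
    Returns the path with inferred cached revisits re-inserted."""
    path = []
    for page, referrer in session:
        if (path and referrer is not None
                and referrer != path[-1] and referrer in path):
            path.append(referrer)    # cached revisit missing from the log
        path.append(page)
    return path

# User visits A, then B, then goes "back" to A (served from cache,
# so unlogged) and on to C; the completed path restores the revisit:
# complete_path([("A", None), ("B", "A"), ("C", "A")])
# == ["A", "B", "A", "C"]
```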
42. CONTENT PREPROCESSING
Converting the text, image, scripts, and other files
such as multimedia into forms that are useful for the
Web Usage Mining process.
This consists of performing content mining such as
classification or clustering. (also found in pattern
discovery)
43. PATTERN DISCOVERY
Pattern discovery draws upon methods and
algorithms developed from several fields such as
statistics, data mining, machine learning and
pattern recognition.
45. PATTERN DISCOVERY (CONT.)
Statistics:-
Most common method.
This kind of analysis is performed by many tools; its aim
is to give a description of the traffic on a Web site, such
as the most visited pages, average daily hits, etc.
Useful for improving the system performance,
enhancing the security of the system, facilitating the site
modification task, etc.
46. PATTERN DISCOVERY (CONT.)
Association rules:
Its main idea is to consider every URL requested by a
user in a visit as basket data (item) and to discover
relationships with a minimum support level between
them;
Discover the correlations among references to various
pages of a web site in a single server session.
Useful for restructuring the web site and as a heuristic
for pre-fetching documents to reduce latency.
47. Association Rules (cont.)
Discovers affinities among sets of items across transactions:
X =====> Y
where X and Y are sets of items, each rule having an
associated confidence and support.
Examples:
60% of clients who accessed /products/, also accessed
/products/software/webminer.htm.
30% of clients who accessed /special-offer.html, placed
an online order in /products/software/.
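Support and confidence for such rules can be computed directly from the session baskets, as in the sketch below. The sessions and URLs are made-up illustrations echoing the examples above, not real data.

```python
# Support/confidence sketch for a rule X => Y over session baskets.
def support_confidence(sessions, x, y):
    """sessions: list of sets of URLs; x, y: sets of URLs."""
    n_x = sum(1 for s in sessions if x <= s)           # sessions containing X
    n_xy = sum(1 for s in sessions if (x | y) <= s)    # sessions with X and Y
    support = n_xy / len(sessions)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

sessions = [
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/products/software/webminer.htm"},
    {"/products/"},
    {"/special-offer.html"},
]
s, c = support_confidence(
    sessions, {"/products/"}, {"/products/software/webminer.htm"})
# support = 2/4 = 0.5; confidence = 2/3 of /products/ sessions
```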
48. PATTERN DISCOVERY (CONT.)
Clustering:-
Meaningful clusters of URLs can be created by discovering
similar characteristics among them according to user
behavior.
Usage clusters
Useful to perform market segmentation in E-commerce or
provide personalized Web content to the users.
Pages clusters
Useful for Internet search engines and web assistance
providers.
49. PATTERN DISCOVERY (CONT.)
Classification:-
Develop a profile of users belonging to a particular class
or category.
Require extraction and selection of features that best
describe the properties of a given class or category.
50. CLUSTERING AND CLASSIFICATION: EXAMPLES
clients who often access
/products/software/webminer.html tend to be
from educational institutions.
clients who placed an online order for software tend to
be students in the 20-25 age group and live in the United
States.
75% of clients who download software from
/products/software/demos/ visit between 7:00 and 11:00
pm on weekends.
51. SEQUENTIAL PATTERNS
30% of clients who visited /products/software/ had done a
search in Yahoo using the keyword “software” before their visit.
60% of clients who placed an online order for WEBMINER
placed another online order for software within 15 days.
52. PATTERN DISCOVERY (CONT.)
Path Analysis:-
Types of Path/Usage Information
Most Frequent paths traversed by users
Entry and Exit Points
Distribution of user session durations.
53. PATTERN ANALYSIS
The challenge of pattern analysis is to filter out
uninteresting information and to visualize and interpret the
interesting patterns for the user.
First, delete the less significant rules or models from the
store of discovered models; next, use technologies such as
OLAP to carry out comprehensive mining and analysis;
then make the discovered data or knowledge visible;
finally, provide tailored services to the e-commerce
website.
54. WEB MINING SOFTWARE
SPSS Clementine.
Megaputer PolyAnalyst.
ClickTracks (web analytics).
QL2 by QL2 Software Inc.