PRESENTATION
ON
WEB MINING
(CONTENT + STRUCTURE + USAGE)
Presented By:-
Mr. Jagrat Gupta
M.Tech. 1st Year
CSE Branch
WEB MINING
•Extraction of knowledge from web data.
•Web data includes:
web documents,
hyperlinks between documents,
usage logs of web sites, etc.
A panel organized at ICTAI 1997 (Srivastava and
Mobasher 1997) asked the question “Is there anything
distinct about web mining (compared to data mining in
general)?”
WEB MINING: APPROACHES
 First was a “Process-centric view” which defined web
mining as a sequence of tasks (Etzioni 1996).
 Resource finding.
 Information selection and preprocessing.
 Generalization.
 Analysis.
Kosala and Blockeel divided the web mining process into
the following five subtasks:
 Resource finding and retrieving.
 Information selection and preprocessing.
 Patterns analysis and recognition.
WEB MINING: APPROACHES
 Validation and interpretation.
 Visualization.
 Second was a “Data-centric view”, which defined web
mining in terms of the types of web data being used in the
mining process (Cooley, Srivastava, and Mobasher 1997).
The second definition has become more widely accepted.
 In this presentation we follow the data-centric view of web
mining, which is defined as follows:
“Web mining is the application of data mining techniques
to extract knowledge from web data, i.e. Web content,
Web structure, and Web usage data.”
WEB MINING TAXONOMY
WEB CONTENT MINING
 Mining, extraction and integration of useful data, information
and knowledge from Web page content.
 Content data is the collection of facts a web page is
designed to contain. It may consist of text, images, audio,
video, or structured records such as lists and tables.
 Search Engines do not generally provide structural information
nor categorize, filter, or interpret documents.
 In recent years these factors have prompted researchers to
develop more intelligent tools for information retrieval, such
as intelligent web agents.
 Research is ongoing in information retrieval methods,
natural language processing, and computer vision.
WEB CONTENT MINING - PROBLEMS
 Data/information extraction: Our focus will be on extraction
of structured data from Web pages, such as products and
search results. Extracting such data allows one to provide
value-added services. Two main types of techniques are
used: machine learning and automatic extraction.
 Web information integration and schema matching:
Although the Web contains a huge amount of data, each
web site (or even page) represents similar information
differently. How to identify or match semantically similar data
is a very important problem with many practical applications.
 Opinion extraction from online sources: There are many
online opinion sources, e.g., customer reviews of products,
forums, blogs and chat rooms. Mining opinions (especially
consumer opinions) is of great importance for marketing
intelligence and product benchmarking.
WEB CONTENT MINING - PROBLEMS
 Knowledge synthesis: Concept hierarchies or ontologies are
useful in many applications. However, generating them
manually is very time consuming.
 Segmenting Web pages and detecting noise: In many
Web applications, one wants only the main content of the
Web page, without advertisements, navigation links, or
copyright notices. Automatically segmenting Web pages to
extract their main content is an interesting problem.
WEB CONTENT MINING - APPROACHES
WEB CONTENT MINING - APPROACHES
 The database approaches to Web mining have generally
focused on techniques for integrating and organizing the
heterogeneous and semi-structured data on the Web into more
structured and high-level collections of resources, such as in
relational databases, and using standard database querying
mechanisms and data mining techniques to access and analyze
this information.
 Multilevel-Databases:- The main idea behind these proposals is
that the lowest level of the database contains primitive semi-
structured information stored in various Web repositories, such
as hypertext documents. At the higher level(s), meta-data or
generalizations are extracted from the lower levels and organized
in structured collections such as relational or object-oriented
databases. The ARANEUS system extracts relevant information
from hypertext documents and integrates it into higher-level
derived Web Hypertexts, which are generalizations of the notion
of database views.
WEB CONTENT MINING - APPROACHES
 WebQuery-Systems:- There have been many Web-based query
systems and languages developed recently that attempt to utilize
standard database query languages such as SQL, structural
information about Web documents, and even natural language
processing for accommodating the types of queries that are used
in World Wide Web searches. Examples: W3QL, WebLog,
Lorel, UnQL, TSIMMIS.
 The agent-based approach to web mining involves the
development of sophisticated AI systems that can act
autonomously or semi-autonomously on behalf of a particular
user, to discover and organize web-based information.
 Intelligent-Search-Agents:- Several intelligent Web agents
have been developed that search for relevant information using
characteristics of a particular domain (and possibly a user profile)
to organize and interpret the discovered information.
WEB CONTENT MINING - APPROACHES
Examples: Harvest, FAQ-Finder, Information Manifold, OCCAM,
ParaSite, ShopBot, ILA.
 Information-Filtering/Categorization:- A number of Web
agents use various information retrieval techniques and
characteristics of open hypertext Web documents to
automatically retrieve, filter, and categorize them.
HyPursuit, BO (Bookmark Organizer).
PREDECESSORS AND SUCCESSORS OF A
WEB PAGE
(Diagram: predecessor pages link into the page; the page links
out to successor pages.)
PAGE RANK
• Simple solution: create a stochastic matrix of
the Web:
• Each page i corresponds to row i and column
i of the matrix.
• If page j has n successors (links), then the ijth
cell of the matrix equals
1/n if page i is one of these n
successors of page j,
0 otherwise.
PAGE RANK – EXAMPLE
 Assume that the Web consists of only three pages - A, B,
and C. The links among these pages are shown below.
(Diagram: A links to A and B; B links to A and C; C links to B.)
Let [a, b, c] be the vector of importances for these three pages.
      A    B    C
A    1/2  1/2   0
B    1/2   0    1
C     0   1/2   0
PAGE RANK – EXAMPLE (CONT.)
 The equation describing the asymptotic values of these
three variables is:
a 1/2 1/2 0 a
b = 1/2 0 1 b
c 0 1/2 0 c
We can solve equations like this one by starting with the
assumption a = b = c = 1, and applying the matrix to the current
estimate of these values repeatedly. The first four iterations give
the following estimates:
a = 1 1 5/4 9/8 5/4 … 6/5
b = 1 3/2 1 11/8 17/16 … 6/5
c = 1 1/2 3/4 1/2 11/16 ... 3/5
PAGE RANK – EXAMPLE (CONT.)
 In the limit, the solution is a = b = 6/5, c = 3/5. That is, a
and b each have the same importance, twice that of c.
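The iteration above is easy to reproduce in code. The sketch below (plain Python; the function name is illustrative) applies the stochastic matrix repeatedly to the importance vector, starting from a = b = c = 1:

```python
def page_rank(M, v, iterations=100):
    """Repeatedly apply the stochastic matrix M to the importance
    vector v, as in the iteration shown above."""
    n = len(v)
    for _ in range(iterations):
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v

# The stochastic matrix of the three-page example (rows/columns A, B, C).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

# Converges to [1.2, 1.2, 0.6], i.e. [6/5, 6/5, 3/5] as derived above.
print(page_rank(M, [1.0, 1.0, 1.0]))
```

Because the columns of M sum to 1, the total importance (3 here) is preserved at every step, which is why the limit values sum to 3.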
HITS
 Define a matrix A whose rows and columns
correspond to Web pages with entry Aij=1 if page i
links to page j, and 0 if not.
 Let a and h be vectors, whose ith component
corresponds to the degrees of authority and
hubbiness of the ith page. Then:
 h = A × a. That is, the hubbiness of each page is the
sum of the authorities of all the pages it links to.
 a = AT × h. That is, the authority of each page is the
sum of the hubbiness of all the pages that link to it (AT
is the transposed matrix).
Then: a = AT × A × a and h = A × AT × h.
HUB AND AUTHORITIES - EXAMPLE
Consider the Web presented below: from the link matrix A,
page A links to A, B, and C; B links to C; and C links to A and B.

        1 1 1              1 0 1
A   =   0 0 1      AT  =   1 0 1
        1 1 0              1 1 0

        3 1 2              2 2 1
AAT =   1 1 0      ATA =   2 2 1
        2 0 2              1 1 2
HUB AND AUTHORITIES - EXAMPLE
If we assume that the vectors
h = [ ha, hb, hc ] and a = [ aa, ab, ac ] are
each initially [ 1,1,1 ], the first three iterations of
the equations for a and h are the following:
aa = 1 5 24 114
ab = 1 5 24 114
ac = 1 4 18 84
ha = 1 6 28 132
hb = 1 2 8 36
hc = 1 4 20 96
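These numbers can be checked with a few lines of code. The sketch below (plain Python; the helper names are made up for illustration) applies a ← AT × A × a and h ← A × AT × h three times, starting from all-ones vectors:

```python
def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Link matrix of the example: A[i][j] = 1 if page i links to page j.
A = [
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
]
AT = transpose(A)

a = [1, 1, 1]  # authorities
h = [1, 1, 1]  # hubbiness
for _ in range(3):
    a = matvec(AT, matvec(A, a))  # a ← AT × A × a
    h = matvec(A, matvec(AT, h))  # h ← A × AT × h

print(a)  # [114, 114, 84]
print(h)  # [132, 36, 96]
```

Unnormalized HITS values grow without bound, which is why practical implementations rescale a and h after every iteration; only the ratios matter.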
Data Sources
server level collection: the server stores data
regarding requests performed by clients, so the data
generally concern just one source (the site itself).
client level collection: the client itself sends
information regarding the user's behaviour to a
repository; this can be implemented with a remote agent
(such as JavaScript or Java applets) or by modifying the
source code of an existing browser (such as Mosaic or
Mozilla) to enhance its data collection capabilities.
proxy level collection: information is stored on the
proxy side, so the Web data cover several Web sites, but
only for users whose Web clients pass through the proxy.
WEB SERVER LOG
THREE PHASES
PREPROCESSING
Converts raw usage data into suitable data
abstractions.
This is the most difficult task in Web usage mining,
due to the incompleteness of the available data.
DATA CLEANING
 Irrelevant records in the web access log are eliminated during
data cleaning.
 Since the target of Web usage mining is to get the user’s
travel patterns, the following two kinds of records are
unnecessary and should be removed:
 Records of graphics, videos and format information, i.e.
records whose filename suffixes are GIF, JPEG, CSS, and
so on, which can be found in the URI field of each record.
 Records with a failed HTTP status code.
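A minimal sketch of this cleaning step in Python (the Common Log Format layout and the suffix list are illustrative assumptions, not a fixed schema):

```python
import re

# One line of a Common-Log-Format access log; field names are illustrative.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d{3}) \S+'
)
NOISE_SUFFIXES = ('.gif', '.jpeg', '.jpg', '.css', '.js', '.png')

def clean(lines):
    """Keep only successful page requests, dropping graphics, style
    sheets and records with failed HTTP status codes."""
    kept = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue                     # malformed record
        uri = m.group('uri').lower()
        status = int(m.group('status'))
        if uri.endswith(NOISE_SUFFIXES):
            continue                     # graphics / format records
        if not 200 <= status < 400:
            continue                     # failed HTTP status codes
        kept.append(m.groupdict())
    return kept
```

For example, of the three records below only the HTML page with status 200 survives cleaning; the GIF and the 404 are dropped.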
USER & SESSION IDENTIFICATION
 The task of user and session identification is to find the
different user sessions in the original web access log.
 User identification means identifying who accesses the web
site and which pages are accessed.
 Session identification means dividing the page accesses of
each user into individual sessions.
 The difficulties in accomplishing this step are introduced by
proxy servers, e.g. different users may have the same IP
address in the log.
 A referrer-based method is proposed to solve these
problems in this study.
USER & SESSION IDENTIFICATION
 Different IP addresses distinguish different users.
 If the IP addresses are the same, different browsers and
operating systems indicate different users.
 If the IP address, browser and operating system are all
the same, the referrer information should be taken into account:
“The Referer URI field is checked, and a new user session is
identified if the URL in the Referer URI field hasn’t been
accessed previously, or, if the Referer URI field is empty,
there is a large interval (usually more than 10 seconds)
between the accessing time of this record and the
previous one.”
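The three rules above can be sketched as follows (plain Python; the record dictionaries and the field names 'ip', 'agent', 'time', 'uri' and 'referrer' are illustrative assumptions about the cleaned log, not a fixed schema):

```python
def split_sessions(records, gap=10):
    """Group time-ordered log records into user sessions using the
    IP / user-agent / referrer heuristic described above."""
    sessions = []
    for rec in records:
        current = sessions[-1] if sessions else None
        new_session = (
            current is None
            # Rules 1 and 2: a new IP or a new browser/OS means a new user.
            or rec['ip'] != current[-1]['ip']
            or rec['agent'] != current[-1]['agent']
            # Rule 3a: empty referrer and a gap over `gap` seconds.
            or (not rec['referrer']
                and rec['time'] - current[-1]['time'] > gap)
            # Rule 3b: a referrer never accessed in the current session.
            or (rec['referrer']
                and rec['referrer'] not in {r['uri'] for r in current})
        )
        if new_session:
            sessions.append([rec])
        else:
            current.append(rec)
    return sessions
```

With a toy log of four records (two consecutive hits, one hit after a long gap, one hit from a different IP), this yields three sessions.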
PATH COMPLETION
 A session identified by rule 3 may contain more than one
visit by the same user at different times; time-oriented
heuristics are then used to divide the different visits into
different user sessions.
 A path completion process should then be used to acquire the
complete user access path.
 Several factors cause incomplete paths: for
instance, the local cache, agent cache, “post”
technique and the browser’s “back” button can result in some
important accesses not being recorded in the access log file,
so the number of Uniform Resource Locators (URLs) recorded
in the log may be less than the real one.
PATH COMPLETION
 Local caching and proxy servers also create difficulties
for path completion, because users can access pages in the
local cache or the proxy server’s cache without leaving any
record in the server’s access log.
 As a result, user access paths are incompletely
preserved in the web access log. To discover users’ travel
patterns, the missing pages in the user access path should be
appended; the purpose of path completion is to accomplish
this task.
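One common heuristic can be sketched in a few lines (a simplification for illustration, not the full algorithm): when a request's referrer is not the page just visited but does appear earlier in the path, assume the user pressed "back" through cached pages and re-insert those pages.

```python
def complete_path(session):
    """Given a session as (uri, referrer) pairs, re-insert the cached
    back-navigation steps that left no entry in the server log."""
    path = []
    for uri, referrer in session:
        if path and referrer and referrer != path[-1] and referrer in path:
            # Replay the pages between the current position and the
            # referrer: these were served from cache, not logged.
            i = len(path) - 1
            while path[i] != referrer:
                i -= 1
                path.append(path[i])
        path.append(uri)
    return path
```

For a logged session /a → /b → /c followed by a request for /d whose referrer is /a, the completed path replays /b and /a before /d, reconstructing the back-button navigation.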
CONTENT PREPROCESSING
Converting text, images, scripts, and other files
such as multimedia into forms that are useful for the
Web usage mining process.
This consists of performing content mining such as
classification or clustering (also found in pattern
discovery).
PATTERN DISCOVERY
Pattern discovery draws upon methods and
algorithms developed from several fields such as
statistics, data mining, machine learning and
pattern recognition.
PATTERN DISCOVERY
Statistics
Association Rules
Clustering
Classification
Sequential Patterns
Path Analysis
etc...
PATTERN DISCOVERY(CONT.)
Statistics:-
 Most common method.
 This kind of analysis is performed by many tools; its aim
is to describe the traffic on a Web site, e.g.
most visited pages, average daily hits, etc.
 Useful for improving the system performance,
enhancing the security of the system, facilitating the site
modification task, etc.
PATTERN DISCOVERY (CONT.)
Association rules:
 The main idea is to consider every URL requested by a
user in a visit as basket data (an item) and to discover
relationships with a minimum support level between
them.
 Discovers the correlations among references to various
pages of a web site within a single server session.
 Useful for restructuring the web site and as a heuristic
for pre-fetching documents to reduce latency.
Association Rules (cont.)
 Discovers affinities among sets of items across transactions:
X =====> Y
where X and Y are sets of items, and each rule
has an associated confidence and support.
 Examples:
 60% of clients who accessed /products/, also accessed
/products/software/webminer.htm.
 30% of clients who accessed /special-offer.html, placed
an online order in /products/software/.
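For illustration, support and confidence over page-view baskets can be computed directly. The sketch below is a naive enumeration of single items and pairs (not a full Apriori implementation); sessions are assumed to be plain lists of URLs:

```python
from itertools import combinations
from collections import Counter

def association_rules(sessions, min_support=0.3, min_confidence=0.6):
    """Report rules X -> Y between single pages, with their support
    (fraction of sessions containing both) and confidence
    (fraction of sessions containing X that also contain Y)."""
    n = len(sessions)
    counts = Counter()
    for basket in map(set, sessions):
        for size in (1, 2):
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    rules = []
    for itemset, pair_count in counts.items():
        if len(itemset) != 2:
            continue
        x, y = itemset
        support = pair_count / n
        if support < min_support:
            continue
        for lhs, rhs in ((x, y), (y, x)):
            confidence = pair_count / counts[(lhs,)]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules
```

On four toy sessions where /a and /b co-occur twice, this reports the rules /a ⇒ /b and /b ⇒ /a with support 0.5 and confidence 2/3, mirroring the percentage statements above.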
PATTERN DISCOVERY (CONT.)
Clustering:-
 Meaningful clusters of URLs can be created by discovering
similar characteristics between them according to users’
behaviors.
 Usage clusters
Useful to perform market segmentation in E-commerce or
provide personalized Web content to the users.
 Pages clusters
Useful for Internet search engines and web assistance
providers.
PATTERN DISCOVERY (CONT.)
Classification:-
 Develops a profile of users belonging to a particular class
or category.
 Requires extraction and selection of features that best
describe the properties of a given class or category.
Clustering and Classification:-
clients who often access
/products/software/webminer.html tend to be
from educational institutions.
clients who placed an online order for software tend to
be students in the 20-25 age group and live in the United
States.
75% of clients who download software from
/products/software/demos/ visit between 7:00 and 11:00
pm on weekends.
Pattern Discovery (cont.)
Sequential Patterns:-
30% of clients who visited /products/software/, had done a
search in Yahoo using the keyword “software” before their visit
60% of clients who placed an online order for WEBMINER,
placed another online order for software within 15 days
PATTERN DISCOVERY (CONT.)
Path Analysis:-
Types of Path/Usage Information
Most Frequent paths traversed by users
Entry and Exit Points
Distribution of user session durations.
PATTERN ANALYSIS
 The challenge of pattern analysis is to filter out uninteresting
information and to visualize and interpret the interesting
patterns for the user.
 First, delete the less significant rules or models from the
store of discovered models; next, use technologies such as
OLAP to carry out comprehensive mining and analysis;
then make the discovered data or knowledge visible;
finally, provide tailored services to the electronic
commerce website.
WEB MINING SOFTWARES
 SPSS Clementine.
 Megaputer PolyAnalyst.
 ClickTracks (Web analytics software).
 QL2 by QL2 Software Inc.
Web Mining Presentation Final

Contenu connexe

Tendances

Data mining slides
Data mining slidesData mining slides
Data mining slides
smj
 

Tendances (20)

Data Mining
Data MiningData Mining
Data Mining
 
Web mining
Web mining Web mining
Web mining
 
Web Mining & Text Mining
Web Mining & Text MiningWeb Mining & Text Mining
Web Mining & Text Mining
 
Web mining
Web miningWeb mining
Web mining
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Web mining
Web miningWeb mining
Web mining
 
Web mining (1)
Web mining (1)Web mining (1)
Web mining (1)
 
WEB MINING.
WEB MINING.WEB MINING.
WEB MINING.
 
Text mining
Text miningText mining
Text mining
 
Web mining
Web miningWeb mining
Web mining
 
Web content mining
Web content miningWeb content mining
Web content mining
 
Web data mining
Web data miningWeb data mining
Web data mining
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
Dbms presentaion
Dbms presentaionDbms presentaion
Dbms presentaion
 
Data mining
Data mining Data mining
Data mining
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Multimedia Database
Multimedia DatabaseMultimedia Database
Multimedia Database
 
Crawling and Indexing
Crawling and IndexingCrawling and Indexing
Crawling and Indexing
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
 

Similaire à Web Mining Presentation Final

Data preparation for mining world wide web browsing patterns (1999)
Data preparation for mining world wide web browsing patterns (1999)Data preparation for mining world wide web browsing patterns (1999)
Data preparation for mining world wide web browsing patterns (1999)
OUM SAOKOSAL
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Concept
ijceronline
 

Similaire à Web Mining Presentation Final (20)

ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
ANALYSIS OF RESEARCH ISSUES IN WEB DATA MINING
 
A comprehensive study of mining web data
A comprehensive study of mining web dataA comprehensive study of mining web data
A comprehensive study of mining web data
 
A Study on Web Structure Mining
A Study on Web Structure MiningA Study on Web Structure Mining
A Study on Web Structure Mining
 
a novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studioa novel technique to pre-process web log data using sql server management studio
a novel technique to pre-process web log data using sql server management studio
 
A Study On Web Structure Mining
A Study On Web Structure MiningA Study On Web Structure Mining
A Study On Web Structure Mining
 
Web Page Recommendation Using Web Mining
Web Page Recommendation Using Web MiningWeb Page Recommendation Using Web Mining
Web Page Recommendation Using Web Mining
 
Recommendation Based On Comparative Analysis of Apriori and BW-Mine Algorithm
Recommendation Based On Comparative Analysis of Apriori and BW-Mine AlgorithmRecommendation Based On Comparative Analysis of Apriori and BW-Mine Algorithm
Recommendation Based On Comparative Analysis of Apriori and BW-Mine Algorithm
 
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
An Enhanced Approach for Detecting User's Behavior Applying Country-Wise Loca...
 
L017418893
L017418893L017418893
L017418893
 
WEB MINING.pptx
WEB MINING.pptxWEB MINING.pptx
WEB MINING.pptx
 
Webmining ppt
Webmining pptWebmining ppt
Webmining ppt
 
WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages
WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web PagesWSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages
WSO-LINK: Algorithm to Eliminate Web Structure Outliers in Web Pages
 
Comparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining CategoriesComparable Analysis of Web Mining Categories
Comparable Analysis of Web Mining Categories
 
Minning www
Minning wwwMinning www
Minning www
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
01635156
0163515601635156
01635156
 
Data preparation for mining world wide web browsing patterns (1999)
Data preparation for mining world wide web browsing patterns (1999)Data preparation for mining world wide web browsing patterns (1999)
Data preparation for mining world wide web browsing patterns (1999)
 
Web Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features ConceptWeb Content Mining Based on Dom Intersection and Visual Features Concept
Web Content Mining Based on Dom Intersection and Visual Features Concept
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Pxc3893553
Pxc3893553Pxc3893553
Pxc3893553
 

Web Mining Presentation Final

  • 1. PRESENTATION ON WEB MINING (CONTENT + STRUCTURE + USAGE) Presented By:- Mr. Jagrat Gupta M.Tech. 1st Year CSE Branch
  • 2. WEB MINING •Extraction of knowledge from web data. •Web Data Includes:- web documents. hyperlinks between documents. usage logs of web sites, etc. A panel organized at ICTAI 1997 (Srivastava and Mobasher 1997) asked the question “Is there anything distinct about web mining (compared to data mining in general)?”
  • 3. WEB MINING:APPROACHES  First was a “Process-centric view” which defined web mining as a sequence of tasks (Etzioni 1996).  Resource finding.  Information selection and preprocessing.  Generalization.  Analysis. Kosala and Blockeel divided web mining process into the following five subtasks:  Resource finding and retrieving.  Information selection and preprocessing.  Patterns analysis and recognition.
  • 4. WEB MINING:APPROACHES  Validation and interpretation.  Visualization.  Second was a “Data-centric view” which defined web mining in terms of the types of web data that was being used in the mining process (Cooley, Srivastava, and Mobasher 1997). The second definition has become more acceptable.  In this Presentation we follow the data-centric view of web mining which is defined as follows- “Web mining is the application of data mining techniques to extract knowledge from web data, i.e. Web content, Web structure, and Web usage data.”
  • 6. WEB CONTENT MINING  Mining, extraction and integration of useful data, information and knowledge from Web page content.  Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables.  Search Engines do not generally provide structural information nor categorize, filter, or interpret documents.  In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents.  Research activities are going on in Information retrieval methods, Natural language processing and Computer vision.
  • 7. WEB CONTENT MINING- PROBLEMS  Data/information extraction: Our focus will be on extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide services. Two main types of techniques, machine learning and automatic extraction are used.  Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications.  Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking.
  • 8. WEB CONTENT MINING- PROBLEMS  Knowledge synthesis: Concept hierarchies or ontology are useful in many applications. However, generating them manually is very time consuming.  Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page without advertisements, navigation links, copyright notices. Automatically segmenting Web page to extract the main content of the pages is interesting problem.
  • 9. WEB CONTENT MINING - APPROACHES
  • 10. WEB CONTENT MINING - APPROACHES  The database approaches to Web mining have generally focused on techniques for integrating and organizing the heterogeneous and semi-structured data on the Web into more structured and high-level collections of resources, such as in relational databases, and using standard database querying mechanisms and data mining techniques to access and analyze this information.  Multilevel-Databases:- The main idea behind these proposals is that the lowest level of the database contains primitive semi- structured information stored in various Web repositories, such as hypertext documents. At the higher level(s) meta data or generalizations are extracted from lower levels and organized in structured collections such as relational or object-oriented databases. ARANEUS system extracts relevant information from hypertext documents and integrates these into higher-level derived Web Hypertexts which are generalizations of the notion of database views.
  • 11. WEB CONTENT MINING - APPROACHES  WebQuery-Systems:- There have been many Web-based query systems and languages developed recently that attempt to utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for accommodating the types of queries that are used in World Wide Web searches. W3QL, WebLog, Lorel and UnQL , TSIMMIS.  The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user, to discover and organize web-based information.  Intelligent-Search-Agents:- Several intelligent Web agents have been developed that search for relevant information using characteristics of a particular domain (and possibly a user profile) to organize and interpret the discovered information.
  • 12. WEB CONTENT MINING - APPROACHES Harvest , FAQ-Finder , Information Manifold , OCCAM ,ParaSite, ShopBot, ILA.  Information-Filtering/Categorization:- A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. HyPursuit, BO (Bookmark Organizer).
  • 13.
  • 14.
  • 15.
  • 16.
  • 17. PREDECESSORS AND SUCCESSORS OF A WEB PAGE … … Predecessors Successors
  • 18. PAGE RANK • Simple solution: create a stochastic matrix of the Web: • Each page i corresponds to row i and column j of the matrix. • If page j has n successors (links) then the ijth cell of the matrix is equal to- 1/n if page i is one of these n succesors of page j 0 otherwise.
  • 19. PAGE RANK – EXAMPLE  Assume that the Web consists of only three pages - A, B, and C. The links among these pages are shown below. A B C Let [a, b, c] be the vector of importances for these three pages A B C A 1/2 1/2 0 B 1/2 0 1 C 0 1/2 0
  • 20. PAGE RANK – EXAMPLE (CONT.)  The equation describing the asymptotic values of these three variables is: a 1/2 1/2 0 a b = 1/2 0 1 b c 0 1/2 0 c We can solve the equations like this one by starting with the assumption a = b = c = 1, and applying the matrix to the current estimate of these values repeatedly. The first four iterations give the following estimates: a = 1 1 5/4 9/8 5/4 … 6/5 b = 1 3/2 1 11/8 17/16 … 6/5 c = 1 1/2 3/4 1/2 11/16 ... 3/5
  • 21. PAGE RANK – EXAMPLE (CONT.)  In the limit, the solution is a=b=6/5, c=3/5. That is, a and b each have the same importance, and twice of c.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26. HITS  Define a matrix A whose rows and columns correspond to Web pages with entry Aij=1 if page i links to page j, and 0 if not.  Let a and h be vectors, whose ith component corresponds to the degrees of authority and hubbiness of the ith page. Then:  h = A × a. That is, the hubbiness of each page is the sum of the authorities of all the pages it links to.  a = AT × h. That is, the authority of each page is the sum of the hubbiness of all the pages that link to it (AT - transponed matrix). Then, a = AT × A × a h = A × AT × h
  • 27. HUB AND AUTHORITIES - EXAMPLE Consider the Web presented below. A C B 1 1 1 A = 0 0 1 1 1 0 1 0 1 AT = 1 0 1 1 1 0 3 1 2 AAT = 1 1 0 2 0 2 2 2 1 ATA = 2 2 1 1 1 2
  • 28. HUB AND AUTHORITIES - EXAMPLE If we assume that the vectors h = [ ha, hb, hc ] and a = [ aa, ab, ac ] are each initially [ 1,1,1 ], the first three iterations of the equations for a and h are the following: aa = 1 5 24 114 ab = 1 5 24 114 ac = 1 4 18 84 ha = 1 6 28 132 hb = 1 2 8 36 hc = 1 4 20 96
  • 29.
  • 30. Data Sources server level collection: the server stores data regarding requests performed by the client, thus data regard generally just one source. client level collection: it is the client itself which sends to a repository information regarding the user's behaviour (can be implemented by using a remote agent (such as Javascripts or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. ); proxy level collection: information is stored at the proxy side, thus Web data regards several Websites, but only users whose Web clients pass through the proxy.
  • 31.
  • 33.
  • 35. PREPROCESSING Convert raw usage data into the data abstractions. Most difficult task in Web usage mining due to the incompleteness of the available data.
  • 36.
  • 37. DATA CLEANING  Irrelevant records in web access log will be eliminated during data cleaning.  Since the target of Web Usage Mining is to get the user’s travel patterns, following two kinds of records are unnecessary and should be removed:-  The records of graphics, videos and the format information The records have filename suffixes of GIF, JPEG, CSS, and so on, which can found in the URI field of the every record.  The records with the failed HTTP status code.
  • 38. USER & SESSION IDENTIFICATION  The task of user and session identification is find out the different user sessions from the original web access log.  User’s identification is, to identify who access web site and which pages are accessed.  Session identification is to divide the page accesses of each user at a time into individual sessions.  The difficulties to accomplish this step are introduced by using proxy servers, e.g. different users may have same IP address in the log.  A referrer-based method is proposed to solve these problems in this study.
  • 39. USER & SESSION IDENTIFICATION  The different IP addresses distinguish different users.  If the IP addresses are same, the different browsers and operation systems indicate different users.  If all of the IP address, browsers and operating systems are same, the referrer information should be taken into account. “The Refer URI field is checked, and a new user session is identified if the URL in the Refer URI field hasn’t been accessed previously, or there is a large interval (usually more than 10 seconds) between the accessing time of this record and the previous one if the Refer URI field is empty.”
  • 40. PATH COMPLETION  The session identified by rule 3 may contains more than one visit by the same user at different time, the time oriented heuristics is then used to divide the different visits into different user sessions.  Path completion process should be used for acquiring the complete user access path.  There are some reasons that result in path’s incompletion, for instance, local cache, agent cache, “post” technique and browser’s “back” button can result in some important accesses not recorded in the access log file, and the number of Uniform Resource Locators(URL) recorded in log may be less than the real one.
  • 41. PATH COMPLETION  Local caches and proxy servers also make path completion difficult, because users can access pages from the local cache or the proxy server’s cache without leaving any record in the server’s access log.  As a result, user access paths are incompletely preserved in the web access log. To discover the user’s travel pattern, the missing pages in the user access path should be appended; the purpose of path completion is to accomplish this task.
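A common way to sketch this step: when a page's referrer is not the previous page in the session but does occur earlier in it, the user likely navigated back through cached pages, and those pages are re-inserted into the path. The pairing of pages with their referrers below is an illustrative assumption.

```python
# Minimal path-completion sketch: if a request's referrer is not the
# previous page but appears earlier in the path, the intermediate pages
# were revisited from the cache via the "Back" button, so they are
# appended to reconstruct the real traversal (illustrative heuristic).
def complete_path(pages, referrers):
    """pages[i] was requested with referrer referrers[i] (or None)."""
    path = []
    for page, ref in zip(pages, referrers):
        if path and ref is not None and ref != path[-1] and ref in path:
            idx = len(path) - 1
            while path[idx] != ref:       # backtrack to the referrer
                idx -= 1
                path.append(path[idx])    # re-insert cached page
        path.append(page)
    return path

# User went A -> B -> C, pressed Back twice, then went to D from A.
pages     = ["/A", "/B", "/C", "/D"]
referrers = [None, "/A", "/B", "/A"]
completed = complete_path(pages, referrers)
```

The log shows only four requests, but the completed path recovers the two backward steps: `["/A", "/B", "/C", "/B", "/A", "/D"]`.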
  • 42. CONTENT PREPROCESSING Converting text, images, scripts and other files such as multimedia into forms that are useful for the Web Usage Mining process. This consists of performing content mining such as classification or clustering (also found in pattern discovery).
  • 43. PATTERN DISCOVERY Pattern discovery draws upon methods and algorithms developed in several fields, such as statistics, data mining, machine learning and pattern recognition.
  • 45. PATTERN DISCOVERY (CONT.) Statistics:-  The most common method.  This kind of analysis is performed by many tools; its aim is to describe the traffic on a web site, e.g. the most visited pages, average daily hits, etc.  Useful for improving system performance, enhancing system security, facilitating the site modification task, etc.
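The two example statistics named above are one-liners over parsed log records. A toy sketch, with made-up data and field layout:

```python
# Toy statistical analysis: most visited pages and average daily hits,
# computed from (date, uri) pairs; data and layout are illustrative.
from collections import Counter

hits = [
    ("2024-01-01", "/index"), ("2024-01-01", "/products"),
    ("2024-01-02", "/index"), ("2024-01-02", "/index"),
]
page_counts = Counter(uri for _, uri in hits)
most_visited = page_counts.most_common(1)[0]   # ("/index", 3)
days = len({date for date, _ in hits})
avg_daily_hits = len(hits) / days              # 4 hits / 2 days = 2.0
```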
  • 46. PATTERN DISCOVERY (CONT.) Association rules:-  The main idea is to consider every URL requested by a user in a visit as basket data (an item) and to discover relationships between them with a minimum support level.  Discovers correlations among references to various pages of a web site within a single server session.  Useful for restructuring a web site, and as a heuristic for pre-fetching documents to reduce latency.
  • 47. Association Rules (cont.)  Discovers affinities among sets of items across transactions: X ⇒ Y, where X and Y are sets of items, each rule having an associated confidence and support.  Examples:  60% of clients who accessed /products/ also accessed /products/software/webminer.htm.  30% of clients who accessed /special-offer.html placed an online order in /products/software/.
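The support and confidence of a rule X ⇒ Y can be sketched directly from their definitions: support is the fraction of sessions containing X ∪ Y, and confidence is that count divided by the sessions containing X. The session data below is made up.

```python
# Sketch of support/confidence for an association rule X => Y over
# sessions, each session treated as a basket of URLs (data illustrative).
def rule_stats(sessions, x, y):
    n = len(sessions)
    n_x  = sum(1 for s in sessions if x.issubset(s))
    n_xy = sum(1 for s in sessions if (x | y).issubset(s))
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

sessions = [
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/products/software/webminer.htm"},
    {"/products/", "/special-offer.html"},
    {"/index.html"},
]
sup, conf = rule_stats(sessions,
                       {"/products/"},
                       {"/products/software/webminer.htm"})
```

Here the rule /products/ ⇒ /products/software/webminer.htm has support 2/4 = 0.5 and confidence 2/3.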
  • 48. PATTERN DISCOVERY (CONT.) Clustering:-  Meaningful clusters of URLs can be created by discovering similar characteristics between them according to users’ behavior.  Usage clusters: useful for market segmentation in e-commerce or for providing personalized web content to users.  Page clusters: useful for Internet search engines and web assistance providers.
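One simple way to cluster pages by user behavior is to group URLs whose sets of visitors overlap strongly. The similarity measure (Jaccard), the 0.5 threshold and the greedy leader-style grouping below are all illustrative choices, not a method named in the deck.

```python
# Illustrative page clustering: URLs join a cluster when the Jaccard
# similarity of their visitor sets with the cluster's first member is
# at least 0.5; threshold, method and data are all assumptions.
def jaccard(a, b):
    return len(a & b) / len(a | b)

visitors = {                       # uri -> set of user ids
    "/news":    {1, 2, 3},
    "/sports":  {1, 2, 3, 4},
    "/careers": {9},
}
clusters = []
for uri, users in visitors.items():
    for cluster in clusters:
        if jaccard(users, visitors[cluster[0]]) >= 0.5:
            cluster.append(uri)
            break
    else:
        clusters.append([uri])     # start a new cluster
```

With this data, /news and /sports share most visitors and cluster together, while /careers stands alone.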
  • 49. PATTERN DISCOVERY (CONT.) Classification:-  Develops a profile of users belonging to a particular class or category.  Requires extraction and selection of features that best describe the properties of a given class or category.
  • 50. PATTERN DISCOVERY (CONT.) Clustering and Classification:-  Clients who often access /products/software/webminer.html tend to be from educational institutions.  Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.  75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.
  • 51. Sequential Patterns:-  30% of clients who visited /products/software/ had done a search in Yahoo using the keyword “software” before their visit.  60% of clients who placed an online order for WEBMINER placed another online order for software within 15 days.
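The basic count behind statements like these is the fraction of ordered sessions that contain a given subsequence. A minimal sketch with made-up sessions:

```python
# Sketch: share of sessions containing an ordered subsequence of pages,
# the counting step behind sequential-pattern statements (data made up).
def contains_subsequence(session, pattern):
    it = iter(session)
    # `page in it` advances the iterator, so order is enforced.
    return all(page in it for page in pattern)

sessions = [
    ["/search", "/products/software/", "/order"],
    ["/products/software/", "/order"],
    ["/index", "/search"],
]
pattern = ["/products/software/", "/order"]
share = sum(contains_subsequence(s, pattern) for s in sessions) / len(sessions)
```

Two of the three sessions contain the pattern in order, so the share is 2/3.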
  • 52. PATTERN DISCOVERY (CONT.) Path Analysis:- Types of path/usage information:  Most frequent paths traversed by users.  Entry and exit points.  Distribution of user session durations.
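The first two kinds of path information reduce to simple counts over completed sessions: page-to-page transitions, first pages and last pages. A toy sketch with invented sessions:

```python
# Toy path analysis over completed sessions: most frequent traversed
# edge, plus entry and exit page counts (all data illustrative).
from collections import Counter

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/about", "/home", "/products"],
]
edges   = Counter((a, b) for s in sessions for a, b in zip(s, s[1:]))
entries = Counter(s[0]  for s in sessions)   # entry points
exits   = Counter(s[-1] for s in sessions)   # exit points
top_edge = edges.most_common(1)[0]
```

Here the most frequent traversal is /home → /products, taken in all three sessions.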
  • 53. PATTERN ANALYSIS  The challenge of pattern analysis is to filter out uninteresting information and to visualize and interpret the interesting patterns for the user.  First, delete the less significant rules or models from the discovered model repository; next, use technologies such as OLAP to carry out comprehensive mining and analysis; then make the discovered data or knowledge visible; finally, provide characteristic services to the e-commerce website.
  • 54. WEB MINING SOFTWARE  SPSS Clementine.  Megaputer PolyAnalyst.  ClickTracks (web analytics).  QL2 by QL2 Software Inc.