International Journal of Information Technology & Management Information System (IJITMIS),
ISSN 0976 – 6405(Print), ISSN 0976 – 6413(Online), Volume 5, Issue 1, January - April (2014), © IAEME
WEB USAGE MINING CONTEXTUAL FACTOR: HUMAN
INFORMATION BEHAVIOR
Ms. Ravita Mishra
Information Technology Dept, Ramrao Adik Institute of Technology, Nerul Navi Mumbai,
India
ABSTRACT
With the rapid development of information technology, the World Wide Web has
been widely used in various applications, such as search engines, online learning and
electronic commerce. These applications are used by a diverse population of users with
heterogeneous backgrounds in terms of their knowledge, skills, and needs. Therefore, human
factors are key issues for the development of web-based applications and research. This paper
first reviews studies by different authors and then examines three important human
factors: gender differences, prior knowledge, and cognitive styles. Because the results of the
reviewed studies are not fully consistent, a new model is proposed that accesses the log data
and shows human access behavior. The proposed model has two stages: web intelligence and
navigation pattern. Stage 1 (the web intelligence system) captures data from different servers
and converts it into tabular form (a data store). Stage 2 uses the N-gram algorithm, which
assumes that the last N pages browsed affect the probability of the next page to be visited;
user navigation sessions are modelled as a hypertext probabilistic grammar whose higher
probability strings correspond to the user’s preferred trails. In this paper, web caching and
pre-fetching are the two approaches used to reduce the noticeable response time perceived
by users. The model improves users’ navigation patterns and identifies user behavior
(gender difference and user type); these findings can be used by site designers and
researchers, and also for detecting and avoiding terror threats worldwide. The
paper is organized into six parts: the first part contains the introduction, the second
describes the different types of web mining, the third covers usage mining on the web, the
fourth analyses human factors and evaluation techniques, the fifth presents the proposed
methodology, and the last covers applications, limitations, conclusion and further work.
Keywords: Pattern Discovery, Contextual factor, Information Retrieval, N-gram,
Gender difference, Cognitive style and Prior experience.
INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY &
MANAGEMENT INFORMATION SYSTEM (IJITMIS)
ISSN 0976 – 6405(Print)
ISSN 0976 – 6413(Online)
Volume 5, Issue 1, January - April (2014), pp. 12-29
© IAEME: http://www.iaeme.com/IJITMIS.asp
Journal Impact Factor (2013): 5.2372 (Calculated by GISI)
www.jifactor.com
1. INTRODUCTION
Web mining is an active research topic that combines two research
areas: Data Mining and the World Wide Web. Web mining research relates to several
research communities, such as Database, Information Retrieval and Artificial Intelligence.
Web mining is categorized into three areas: Web content mining, Web structure mining
and Web usage mining. Web content mining focuses on the discovery and retrieval of useful
information from web contents, data and documents, while Web structure mining
emphasizes the discovery of how to model the underlying link structures of the web [14,
16]. Web usage mining is a relatively independent, but not isolated, category, which mainly
describes the techniques that discover users' usage patterns and try to predict users'
behaviors. Web mining refers to applying data mining techniques to automatically
discover and extract useful information from World Wide Web documents and services
[16]. Here, human factors are increasingly seen as important issues, as reflected in the
substantial number of existing studies in the area. Among various human factors, gender
differences (e.g., Roy, Taylor, & Chi, 2003), prior knowledge (e.g., Calisir & Gurel, 2003) and
cognitive styles (e.g., Chen & Macredie, 2004) have significant impacts on web-based
interaction. Furthermore, these three human factors have certain inter-relations. For example,
females tend to behave similarly to novices in terms of the extent to which they experience
disorientation problems; males and experts seem to have similar preferences in their
interaction patterns, with studies reporting that they enjoy non-linear interaction (Ford &
Chen, 2000). Despite the growing number of studies looking at these three human factors,
there is a lack of an integrated review which synthesizes their effects.
2. WEB DATA MINING
2.1 Overview: Today, with the tremendous growth of the data sources available on the Web
and the dramatic popularity of e-commerce in the business community, Web mining has
become the focus of quite a few research projects and papers [13, 14, 15]. Previous
research suggested decomposing web mining into the following subtasks:
Resource Discovery: retrieving the intended information from the web.
Information Extraction: automatically selecting and pre-processing specific information
from the retrieved web resources. Generalization: automatically discovering general patterns
both at individual web sites and across multiple sites. Analysis: analyzing the mined
patterns. The authors of [10] claim that the web involves three types of data: data on the web
(content), web log data (usage) and web structure data. Another classification divides the
data into content data, structure data, usage data, and user profile data.
2.1.1 Web Content Mining: Web content mining describes the automatic search of
information resources available online and involves mining web data contents. A web
document usually contains several types of data, such as text, image, audio, video, metadata
and hyperlinks. The technologies normally used in web content mining are natural language
processing (NLP) and information retrieval (IR). Some of the data are semi-structured, such
as HTML documents, or more structured, like the data in tables or database-generated HTML
pages, but most of the data is unstructured text [14].
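As a rough illustration of how the text and hyperlinks mentioned above might be pulled out of a semi-structured HTML page, the following sketch uses only the Python standard library. The class name `PageContent` and the function `extract_content` are hypothetical names chosen for this example; the paper does not prescribe a particular parser.

```python
from html.parser import HTMLParser


class PageContent(HTMLParser):
    """Collects visible text and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html):
    """Return (plain_text, list_of_hrefs) for an HTML string."""
    parser = PageContent()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

The extracted text could then be passed to NLP or IR tooling, while the link list feeds structure mining.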
2.1.2 Web Structure Mining: Technically, web content mining mainly focuses on the
structure within a document, while web structure mining tries to discover the link structure of
the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, web
structure mining categorizes the web pages and generates information such as the
similarity and relationship between different web sites. Web structure mining can also take
another direction: discovering the structure of the web document itself. This type of structure
mining can be used to reveal the structure (schema) of web pages, which helps
navigation and makes it possible to compare and integrate web page schemas. The
structural information generated by web structure mining includes the following: the
frequency of local links in the web tuples of a web table; the
frequency of web tuples in a web table containing interior links, that is, links
within the same document; the
frequency of web tuples in a web table containing global links, that is, links that
span different web sites; and the frequency of identical web tuples that
appear in a web table or among web tables [15, 20]. In general, if a web page is linked to
another web page directly, or the web pages are neighbors, we would like to discover the
relationships among those web pages. The relations may fall into one of several types: the
pages may be related by synonyms or ontology, they may have similar contents, or both of
them may sit on the same web server and therefore have been created by the same person [13, 14].
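The link-frequency measures described above can be sketched as a simple classification of each hyperlink on a page. The sketch below follows the text in treating interior links as links within the same document and global links as links that span different web sites; treating "local" as "another document on the same site" is an assumption made for this example, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urljoin, urlparse


def classify_link(page_url, href):
    """Label a hyperlink as 'interior', 'local' or 'global' relative to page_url."""
    source = urlparse(page_url)
    target = urlparse(urljoin(page_url, href))  # resolve relative links
    if target.netloc != source.netloc:
        return "global"    # spans different web sites
    if target.path == source.path:
        return "interior"  # stays within the same document
    return "local"         # assumed: another document on the same site


def link_frequencies(page_url, hrefs):
    """Count how many links of each kind a page contains."""
    return Counter(classify_link(page_url, h) for h in hrefs)
```

For a page at `http://example.org/docs/a.html`, a link to `#top` would count as interior, `b.html` as local, and a link to another host as global.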
2.1.3 Web Usage Mining: Analyzing the web access logs of different web sites can help
understand user behaviour and web structure, thereby improving the design of this
colossal collection of resources. There are two main tendencies in web usage mining, driven
by the applications of the discoveries: General Access Pattern Tracking and Customized
Usage Tracking. General access pattern tracking analyzes the web logs to understand
access patterns and trends. These analyses can be used for better structure and grouping of
resource providers. Applying data mining techniques to access logs unveils interesting access
patterns that can be used to restructure sites into a more efficient grouping, pinpoint effective
advertising locations, and target specific users with specific ads. Customized usage
tracking analyzes individual trends. Its purpose is to customize web sites to users. The
information displayed, the depth of the site structure and the format of the resources can all be
dynamically customized for each user over time based on their access patterns.
2.2. STEPS IN WEB MINING
Web usage mining falls into three steps: (1) pre-processing, (2) pattern discovery and
(3) pattern analysis.
2.2.1 Pre-processing: Pre-processing is divided into three types: content pre-processing,
structure pre-processing and usage pre-processing. Content pre-processing is the
process of converting text, images, scripts and other files into forms that can be used by
usage mining. For the content of static page views, pre-processing can easily be done
by parsing the HTML and reformatting the information or running additional algorithms as
desired [15]. Structure pre-processing can be treated similarly to content pre-processing.
However, each server session may have to construct a different site structure than others [13,
15]. The inputs of the pre-processing phase may include the Web server logs, referral logs,
registration files, index server logs, and optionally usage statistics from a previous analysis.
The outputs are the user session file, transaction file, site topology, and page classifications.
It is always necessary to adopt data cleaning techniques to eliminate the impact of
irrelevant items on the analysis result. Without sufficient data, it is very difficult to identify
the users [14]. Session identification is also a part of usage pre-processing. The goal of
it is to divide the page accesses of each user, who is likely to visit the Web site more than
once, into individual sessions. The simplest way to do this is to use a timeout to break a user’s
click-stream into sessions. Another problem is path completion, which means
determining whether any important accesses are missing from the access log. The methods used
for user identification can also be used for path completion. The final procedure of
pre-processing is formatting, a preparation module that properly formats the sessions or
transactions.
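The data cleaning and timeout-based session identification steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record layout `(user, timestamp_seconds, url)`, the list of ignored file suffixes and the 30-minute timeout are all assumptions chosen for the example.

```python
from collections import defaultdict

# Assumed examples of "irrelevant items": embedded resources, not page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
SESSION_TIMEOUT = 30 * 60  # seconds; an arbitrary illustrative value


def clean(records):
    """Drop requests for embedded resources; records are (user, timestamp, url)."""
    return [r for r in records if not r[2].lower().endswith(IGNORED_SUFFIXES)]


def sessionize(records, timeout=SESSION_TIMEOUT):
    """Split each user's click-stream into sessions at gaps longer than timeout."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(records, key=lambda r: (r[0], r[1])):
        by_user[user].append((ts, url))

    sessions = []
    for user, clicks in by_user.items():
        current = [clicks[0][1]]
        for (prev_ts, _), (ts, url) in zip(clicks, clicks[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))  # gap too long: close session
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions
```

Calling `sessionize(clean(records))` yields one `(user, [urls])` pair per session, which could serve as the user session file mentioned above.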
2.2.2 Pattern Discovery
Pattern discovery draws on algorithms and techniques from several research
areas, such as data mining, machine learning, statistics, and pattern recognition. Pattern
discovery falls into the following categories: Statistical Analysis, Association Rules, Clustering,
Classification, Sequential Pattern and Dependency Modeling. Statistical techniques are among
the most powerful tools for extracting knowledge about visitors to a Web site. Analysts may
perform different kinds of descriptive statistical analyses based on different variables when
analyzing the session file [13]. The statistical information contained in
periodic web system reports can be useful for improving
system performance and enhancing system security. Association rule mining techniques
can be used to discover unordered correlations between items found in a database of
transactions [13]. Association rules refer to sets of pages that are accessed together with a
support value exceeding some specified threshold. Web designers can restructure their
web sites efficiently based on the presence or absence of such rules.
Clustering analysis is a technique for grouping together users or data items with similar
characteristics. Clustering of user information or pages can facilitate the development and
execution of future marketing strategies [13]. Clustering of users helps to discover
groups of users who have similar navigation patterns. It is very useful for inferring user
demographics to perform market segmentation in e-commerce applications or to provide
personalized web content to individual users. Classification is a supervised inductive
learning technique that maps a data item into one of several predefined classes. In the web
domain, a Web master or marketer uses this technique to establish a
profile of users belonging to a particular class or category. This requires extraction and
selection of features that best describe the properties of a given class or category [13].
Sequential pattern discovery finds inter-session patterns in which the presence of one set of
items is followed by another item in a time-ordered set of sessions. It also includes other
types of temporal analysis, such as trend analysis, change point detection and similarity
analysis. It is very useful for web marketers to predict future trends, which helps to place
advertisements aimed at certain user groups [13]. Dependency modeling represents significant
dependencies among the various variables in the web domain [13]. The modeling technique
provides a theoretical framework for analyzing the behavior of users, and is potentially
useful for predicting future web resource consumption.
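To make the notion of association rules with a support threshold concrete, the sketch below counts how often pairs of pages occur together in sessions. It is a simplified pair-counting illustration rather than a full association rule miner such as Apriori, and the function names and the session representation (a list of page URLs per session) are assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_page_pairs(sessions, min_support):
    """Return {pair: support} for page pairs with support >= min_support.

    Support is the fraction of sessions containing both pages, in any order.
    """
    counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing antecedent that also contain consequent."""
    with_a = [s for s in sessions if antecedent in s]
    if not with_a:
        return 0.0
    return sum(consequent in s for s in with_a) / len(with_a)
```

A pair with high support and high confidence would be a candidate for a direct link between the two pages when restructuring a site.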
2.3 PATTERN ANALYSIS
The goal of this process is to eliminate irrelevant rules or patterns and to extract
the interesting rules or patterns from the output of the pattern discovery process. The output of
the algorithms is not in a form suitable for direct human consumption, and thus needs to be
transformed into a format that can be assimilated easily [13]. There are two common approaches
to pattern analysis. One is to use a knowledge query mechanism such as SQL; the
other is to construct a multi-dimensional data cube and then perform OLAP operations.
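The SQL-based approach to pattern analysis can be illustrated with a small in-memory database. In this sketch, page-to-page transitions are loaded into a table and an SQL query keeps only those seen at least a given number of times. The table name `transition`, its columns and the function name are hypothetical; the paper does not specify a schema.

```python
import sqlite3


def top_transitions(sessions, min_count=2):
    """Load page-to-page transitions into SQLite and query the frequent ones."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transition (src TEXT, dst TEXT)")
    rows = [(a, b) for pages in sessions for a, b in zip(pages, pages[1:])]
    conn.executemany("INSERT INTO transition VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT src, dst, COUNT(*) AS n FROM transition "
        "GROUP BY src, dst HAVING COUNT(*) >= ? "
        "ORDER BY n DESC, src, dst",
        (min_count,),
    ).fetchall()
    conn.close()
    return result
```

The `HAVING` clause plays the role of the filter that eliminates uninteresting patterns; an analyst could vary the threshold or add further conditions without changing the mining step.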

3. ANALYSIS OF CONTEXTUAL FACTOR (HUMAN INFORMATION BEHAVIOR)
In the given framework, the contextual parameter human information behaviour is
analysed; it includes information exploration, seeking, filtering, use and communication.
Based on the framework, various contextual factors (user interest, difficulty, time taken,
credibility and browser dependence) and their influential factors (physical, cognitive,
affective, economic, social, and political), together with their implications, were
investigated [12]. In the previous
model, the user dimension is considered to be influenced by the particular task, information
need, knowledge state, cognitive style, affective state and so on. The authors measured users’
cognitive styles and affective states before a user study, applied a process-tracing technique
while users were conducting information-seeking tasks, and found various types of
relationships among the elements of the dimensions. In (Rieh 2002), the authors found that
users judge cognitive authority and information quality by two types of judgment, predictive
judgment and evaluative judgment, and they also identified the main facets and keywords of
the judgments through a user study.
Review process: Due to the massive growth of e-commerce, privacy has become a sensitive
topic and has attracted more and more attention recently. The basic goal of web mining is to
extract information from data sets for business needs, which means its application is
highly customer-related. In business-related customer data, the human factor is a fertile
research area which demonstrates complete human information behavior based on an
experimental dataset. Analysis of this factor is based on four points: 1. Gender difference, 2.
Cognitive style, 3. Prior experience and 4. Web-based interaction. Although there are quite a
few commercial analysis applications available and many more are free, developing
efficient, flexible and powerful tools still requires a lot of work from both researchers and
developers.
Figure 4.1 illustrates the review process, which consists of four stages. Stage one searches
electronic journals and search engines; these resources were selected because they were known to
include empirical studies related to gender differences, prior knowledge and cognitive styles.
The search terms for these electronic resources included four groups: (1) Internet and WWW;
(2) Gender, females/males, boys/girls, and men/women; (3) Prior knowledge, system
experience, novices/experts, domain expertise, domain knowledge, computer experience,
previous experience, Internet experience; and (4) Cognitive styles, learning styles, field
dependence. Stage two analyzes the search results based on timeline. Stage three selects the
analysis based on titles, elements and keywords. Stage four assesses the behavior based on
credibility.
3.1 GENDER DIFFERENCES
Gender difference is an important variable that influences computing skills and helps reveal
human information behavior and emotions. As the web has become a popular platform
for various applications, such as search engines, online learning and electronic commerce, a
growing body of studies has examined gender differences in the use of the
web. This literature suggests that the major differences between males and females lie in
navigation patterns, attitudes and perceptions [8, 9]. A number of previous
theoretical surveys suggest that males report lower levels
of computer anxiety than their female counterparts; in addition, it also seems that males
achieve better outcomes than females in the use of computers (Karavidas, Lim,
& Katsikas, 2004). Gender difference is analyzed here by navigation pattern and by attitudes
and perceptions.
Navigation pattern is defined as the way a user accesses web pages. Without good
navigation, a site becomes useless to visitors: they cannot find the information they need and
seek out competing sites instead. It is therefore vital for a designer that a site be easy to
navigate. Certain navigation patterns work on virtually all
sites. The first pattern is tabbed navigation, the second is header navigation, and the third
covers blog, informational and reference, and corporate sites. Large et al. (2002) examined how
boys and girls behaved differently when retrieving information from the web. The subjects of
their study were 53 students, comprising 23 boys and 30 girls from two grade-six classes.
Overall, the boys explored more hypertext links per minute, tended to perform more page
jumps per minute, entered more searches in search engines, and gathered and saved
information more often than the girls, while spending less time viewing pages than the
girls [8, 9]. Furthermore, Ford, Miller and Moss (2001) investigated individual differences in
internet searching using a sample of 64 Master’s students, 20 males and 44 females. The
above-mentioned studies suggest that females and males take different approaches to
navigation, reflected in the navigation patterns that they exhibit, but also that there are
contradictory findings. Table 1 summarizes how male and female students explore web
pages.
Table 1: Gender Difference
Author/Year | Male | Female
Large et al./2002 (23 boys and 30 girls) | Explore more hyperlinks | Explore fewer hyperlinks
Roy et al./2003 (equal no. of boys and girls) | More page jumps | Fewer page jumps
Lorigo/2006 (23 boys and 30 girls) | Linear | Non-linear
Lio, Huang/2008 (equal no. of boys and girls) | Non-linear | Linear
Ford, Miller/1996 (24 boys and 44 girls) | More effective | Less effective
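Navigation measures of the kind compared in these studies, such as pages visited per minute and time spent viewing each page, could be derived from a logged click-stream roughly as follows. This is an illustrative sketch only; the input format (a time-ordered list of `(timestamp_seconds, url)` pairs for one session) and the metric names are assumptions, not the instruments used in the cited studies.

```python
def navigation_metrics(clicks):
    """Compute simple navigation measures for one session.

    clicks: list of (timestamp_seconds, url) tuples in time order.
    """
    if len(clicks) < 2:
        return {"pages_per_minute": 0.0, "mean_view_seconds": 0.0}
    duration = clicks[-1][0] - clicks[0][0]
    # Time between consecutive clicks approximates viewing time per page.
    gaps = [b[0] - a[0] for a, b in zip(clicks, clicks[1:])]
    return {
        "pages_per_minute": (len(clicks) - 1) * 60.0 / duration if duration else 0.0,
        "mean_view_seconds": sum(gaps) / len(gaps),
    }
```

Averaging such metrics per user group would give a log-based counterpart to the observational comparisons summarized in Table 1.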

Attitudes and Perceptions: Perception can determine attitude; it defines how you perceive
the world. Attitude is what the individual thinks about the perception, and perception is the
human subjective experience of information provided by the senses. A number of studies
suggest that there are gender differences in attitudes towards web-based interaction and
perceptions. In the first survey, 630 Anglo-American undergraduates completed
the Student Computer and Internet Survey, the results of which indicated that females
reported more computer anxiety and less computer self-efficacy than males. Schumacher and
Morahan-Martin (2001) conducted a survey to identify gender differences in attitudes
towards computers and the Internet. The survey was completed by 619 students, and the results
likewise indicated that females reported more computer anxiety and less computer self-efficacy
than males. Similar results were also found in the study by Koohang (2004), which
investigated 154 students of an undergraduate management program; the results indicated
that males had significantly more positive perceptions than females toward using the
digital library [5]. The studies reviewed so far in this section indicate that females tend to have
more negative attitudes towards the use of the web than males and that they feel less able
when using the web than their male peers.
Table 2: Attitude and Perception
Author/Year | Male | Female
Jackson, Ervin/2001 (630 students) | Less computer anxiety | More computer anxiety
Koohang/2004 (245 students) | Positive perception | Negative perception
Koohang, Durante/2003 (125 students) | No significant difference | ---
Hong/2002 (24 students) | Asynchronous learning | Synchronous learning
3.2 PRIOR KNOWLEDGE
A user’s prior knowledge includes system experience and domain knowledge, and
also refers to the user’s understanding of the content area (Lazonder, 2000). Prior knowledge is
studied here in relation to web-based instruction, text structure, navigation facilities
and internet searching. A growing body of
research suggests that low prior knowledge users and high prior knowledge users show
different levels of familiarity and have different requirements. In the first survey, 200 students
participated in a web-based course, and the authors found that the participants with more
experience in the use of internet tools used less time to organize their work and visited fewer
pages in each session [5]. Other results showed that experts issued longer queries than
non-experts and also used many more technical query terms than non-experts [8]. Prior
knowledge is examined under the following categories:
Web-based instruction, Text structure, Navigation facilities and Internet Searching.
Web-based instruction: Some research has suggested that individuals with different levels of
prior knowledge show preferences for different types of text structure and different kinds of
navigation facilities.

Text structure: Three types of text structure have been identified: hierarchical, non-linear,
and mixed (hierarchical structure with cross-referential links). A number of studies have
examined how text structure interacts with users’ prior knowledge; the findings suggest that
experts and novices differ in their performance depending on the text structure used in
Web-based instruction. In the first survey, McDonald and Stevenson (1998) examined the effects
of text structure and prior knowledge on navigation performance [8, 9]. The results showed that
knowledgeable participants performed better than non-knowledgeable
participants, as they had a better conception of the subject matter. In the second survey,
Calisir and Gurel (2003) also investigated the interaction of three types
of text structure (linear, hierarchical and mixed) with the prior knowledge of users.
However, in contrast to the study by McDonald and Stevenson (1998), they examined the
influence of text structure and prior knowledge on learning performance, rather than on
navigation performance. In the third survey, Amadieu, Tricot, and MarinéDo (2005) obtained
similar results. Three types of structure were provided: hierarchical, network and linear. The
results indicated that low prior knowledge learners demonstrated better performance with the
hierarchical structure, whereas the hierarchical structure seemed to obstruct the domain
representation for high prior knowledge learners. The findings suggest that a hierarchical
structure is most appropriate for non-knowledgeable subjects. The text structure
analysis is summarized below:
Table 3: Text Structure
Author/Year | Knowledgeable participants | Non-knowledgeable participants
McDonald and Stevenson (1998) (three structures: non-linear, hierarchical and mixed) | Better understanding of subject matter | Less understanding of subject matter
Calisir and Gurel (2003) (three types of text structure: linear, hierarchical and mixed) | Linear and mixed structure | Hierarchical structure
Amadieu, Tricot, and MarinéDo (2005) (three types of structure: hierarchical, network and linear) | Non-linear structure | Hierarchical structure
Mitchell, Chen, and Macredie (2005) (students' reactions to Web-based instruction, 74 undergraduate students) | Non-linear | Linear

Navigation facilities: When considering the relationships between learning strategies and
navigation facilities, a student’s prior knowledge is an important factor in determining whether
a particular navigation facility is likely to be useful. Most current Web-based instruction
applications provide a range of navigation facilities so that users can employ multiple
approaches to support their learning. Hierarchical maps and alphabetical indices are most
commonly used in Web-based instruction; each provides different functions in
relation to information access. The characteristics of the different navigation facilities may
influence how users develop their learning strategies, making navigation support a critical
issue. Farrell and Moore (2001) investigated whether the use of different navigation facilities
(linear, main menu and search engine) influences users’ achievement and attitude [2, 3]. 200
students were placed into three groups based on their knowledge levels (low, middle, and
high), and the results indicated that high-knowledge users commonly tended to use search
engines to locate specific topics. Conversely, low-knowledge users seemed to benefit from
hierarchical maps, which can facilitate the integration of individual topics [4].
Internet Searching: The goal of each fact-finding task was to find one specific answer to a
simple question, while the broader tasks required the participants to find several documents
that would satisfy the task. The results indicated no significant differences
between experts and novices on the fact-finding tasks. Several studies also argue that prior
knowledge plays a substantial role in internet searching, which covers three aspects: search
strategies, search performance and search perception. Regarding search strategies, Tabatabai
and Luconi (1998) investigated the strategies used by three experts and three novices.
The results showed that experts used more keywords, while novices used the ‘Back’ key more
often, used fewer search engines, and missed some highly relevant sites [5].
Table 4: Internet searching
Author/Year | Expert | Novices
Tabatabai and Luconi/1998 | More keywords | Back key
2006 | One specific answer | Broader answer
Thatcher/2008 | Web experience | Cognitive search
3.3 COGNITIVE STYLES
Cognitive style also plays an essential role in web-based instruction, learning
preference, learning performance and internet searching. Field Dependence is the degree to
which a user’s perception or comprehension of information is influenced by the surrounding
perceptual or contextual field.
Web-based instruction: Studies of Web-based instruction examine the relationships between the
degree of Field Dependence and students’ learning performance and learning preferences.
Learning performance: Students’ cognitive styles are determined using Cognitive Styles
Analysis (Riding, 1991), and their learning performance is compared in breadth-first and depth-first
versions. Ford and Chen (2000) found that Field Dependent learners in the breadth-first
version performed better than those in the depth-first version. Conversely, Field Independent
students performed better in the depth-first version than those in the breadth-first version [5].
Graff (2003) determined individuals’ cognitive styles and examined the relationship between
cognitive style and performance in two versions of the system: long-page and short-page
versions [4]. The study’s findings indicated that Field Independent students achieved superior
scores in the long-page condition, whereas Field Dependent students were superior in the
short-page condition [5].
Learning preferences: Learning preferences are the choices that learners show in certain
types of learning environments and activities, such as the selection of certain navigation paths
or facilities. Studies state that field independent and field dependent students show different
learning preferences. Lee, Cheng, Rai, and Depickere (2005) investigated students’ learning
preferences in WebCT. The study’s findings indicate that field dependent students were
accustomed to linear learning, whereas field independent students tended to prefer
non-linear learning.
Internet searching: In this analysis, the Group Embedded Figures Test (GEFT) was used to
identify the participants’ cognitive styles, and participants were asked to find answers on the
Web for two search questions. The results showed a statistically significant correlation
between GEFT scores and both the time spent searching and the number of URLs visited. The
participants with higher GEFT scores conducted longer search sessions and visited more URLs.
In contrast, the participants with lower GEFT scores had shorter search sessions. Kim, Yun,
and Kim (2004) compared the search strategies of different cognitive style groups, and the
results showed that the Field Dependent group made significantly more repeated search
attempts and made more use of search operators [4, 5].
4. PROPOSED MODEL
4.1 WEB INTELLIGENCE ARCHITECTURE
The proposed model addresses the problems discussed above, provides an easier technique
for finding user behaviour, and increases the reliability of the system. The model is divided
into two parts. In the first part, a web intelligence system records the web logs from the
server or the client through the ISP. The second part uses the N-gram technique to combine
content and usage mining. The framework should enable the collection of online data from
various Internet Service Providers (ISPs), optionally analyzing the data in real time, and
transmitting the relevant data for cleaning.
Previous review results had the following limitations:
Inconsistent results: The results reported in existing studies are not fully consistent.
There are contradictory findings as to whether gender differences influence users' attitudes
and perceptions towards Web-based interaction and whether cognitive styles affect users'
learning performance. In the future, we plan to develop a standard template for the
questionnaires so that the accuracy of the results can be improved.
Lack of mixed methods and limited application: The survey suggests that quantitative
methods are favoured when seeking to find the overall effectiveness of the systems. It is
clear that quantitative and qualitative methods have different strengths and weaknesses.
However, few existing studies mix quantitative and qualitative methods.
Fig.2. Proposed Architecture.
As illustrated in Fig. 1, individual surfers' activities are managed by various ISPs and are
recorded by each ISP. The data is cleaned and filtered according to requirements. Filtered
data is transmitted to a relay and is further propagated to a persistent data store, where it
can be further analyzed by Big-Data analysis tools.

Stage-1
Data sets consisting of web log records for 5063 users are collected from De Paul
University website. Web log is unprocessed text file which is recorded from the IIS Web
Server. E.g. Log file of DePaul University (www.cdm.maya.depaul.edu). Recorded log file of
De-Paul University (or any other log file) will be used for analysis. The pattern of log file is
shown below:
<Date><Time><C-ip><Cs-Username><S-Sitename><S-Computername><S-Ip><S-Port>
<Cs-Method><Cs-Uri-Stem><Cs-Uri-Query><Sc-Status><Time-Taken><Cs-Version>
[Figure, Stage-1: Web Page, Pre-Processing, Relay (Real time Analysis), Persistent Data Store, Big Data Analytics]
[Figure, Stage-2: Web Log File, EOI Parameter (Behavioral Parameter), N-gram Feature Generation and Extraction, Classification/Prediction, Contextual Factor (Human Behavior), Classification of Web Pages]
Fig. 2 Web intelligent System
The structure of log file:
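As an illustration only, the field pattern above can be read into a dictionary per request. The sketch below assumes space-separated fields and '#' directive lines, as is usual for IIS logs; the function name and the sample line are hypothetical and do not come from the DePaul data set.

```python
from typing import Dict, Optional

# Field order taken from the log pattern given in the text.
LOG_FIELDS = [
    "Date", "Time", "C-ip", "Cs-Username", "S-Sitename", "S-Computername",
    "S-Ip", "S-Port", "Cs-Method", "Cs-Uri-Stem", "Cs-Uri-Query",
    "Sc-Status", "Time-Taken", "Cs-Version",
]


def parse_log_line(line: str) -> Optional[Dict[str, str]]:
    """Split one space-separated log line into a field dictionary.

    Returns None for blank lines, '#' directive lines, and lines whose
    field count does not match LOG_FIELDS.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split()
    if len(parts) != len(LOG_FIELDS):
        return None
    return dict(zip(LOG_FIELDS, parts))


# Hypothetical sample line (values invented for illustration only).
sample = ("2002-04-01 00:00:10 10.0.0.1 - W3SVC1 WEBSRV 10.0.0.2 80 "
          "GET /courses/syllabus.asp course=1 200 120 HTTP/1.1")
record = parse_log_line(sample)
print(record["Cs-Uri-Stem"], record["Sc-Status"])
```

Keeping every field as a string leaves type conversion (dates, status codes, durations) to the later analysis steps, which use different subsets of the fields.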
Here we suggest a few parameters that indicate the active involvement of the subject in an
EOI. While each parameter in itself may have limited predictive value, the combination of
these parameters may yield an accurate prediction or evidence.
A. Intensity of surfing/accessing
It measures the intensity of the user's Internet surfing activities. The browsing intensity
value is the number of pages that the user visited in a given time. When a user shows an
increased interest in a given event, we can assume that he or she will visit related web
pages more intensively than usual. Consequently, historical data on the user's surfing
intensity should be used when searching for anomalies. We measure the browsing intensity
of users with the Cs-Uri-Stem and Cs-Version fields of the log file.
B. Frequency of revisiting/refreshing a given page
It measures the number of revisit/refresh operations performed by the user on each
page. Through this information the system may locate stressful behavior, where the user
strives for immediate updates regarding his or her topic of interest. The user may repeatedly
and frequently revisit the same page, or simply push the 'refresh' button on the browser.
Significant peaks in this parameter may be observed in real time. The parameter is calculated
from the Cs-Uri-Stem and Time-Taken fields of the log file.
C. Irregular/Unusual hours of activity
It measures irregular surfing hours and irregular lengths of surfing sessions.
Examination of a user's historical data may reveal a regular pattern in his or her surfing
hours. This parameter requires analyzing the user's historical data to learn the regular
surfing hours and session lengths. The irregular hours are calculated from the Time-Taken
field of the log file, and deviations from such patterns can be found by anomaly detection
methods.
D. Interaction level (Passive (low)/Active (high))
It measures the level of the user's interaction, ranging from 'low' (passive only) to
'high' (mostly active). In passive surfing the user only reads pages, whereas in active
surfing he or she may chat, write email, commit responses or talkbacks, do Internet shopping,
and so on. The interaction level is calculated from the Sc-Status and Cs-Method fields of the
log file. Regarding our 'terrorist' scenario, we hypothesize that, as the deadline comes
closer, the subject will lower his or her active profile, and will focus on passive
consumption of relevant information.
E. Diversity of interest topics/content topics
It measures the user's range of interest topics. Surfers are often attracted to diverse
topics such as news, sports, music, gaming or finances. When the subject is focused on an
urgent issue, we assume that it will affect his or her surfing pattern, restricting the range
of visited sites to a specific topic. The diversity measure can be learned from the user's
historical data using clustering methods, and it is calculated from the S-Sitename,
Cs-Uri-Stem and Cs-Uri-Query fields of the log file. Significant deviations show up as
anomalies or outliers.
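Parameters A and B lend themselves to a short illustration. The sketch below is only one possible reading: it assumes each request has already been reduced to a (user, timestamp, URL) triple, counts pages per user per clock hour as the intensity value, and counts repeat requests of the same URL as revisits. The hourly window and all names are assumptions, not part of the proposed model.

```python
from collections import Counter
from datetime import datetime


def surfing_intensity(visits):
    """Count pages requested per (user, clock hour)."""
    counts = Counter()
    for user, ts, _url in visits:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(user, hour)] += 1
    return counts


def revisit_counts(visits):
    """Count repeat requests per (user, url): all requests after the first."""
    totals = Counter((user, url) for user, _ts, url in visits)
    return {key: n - 1 for key, n in totals.items() if n > 1}


# Hypothetical visits for one user (invented for illustration only).
visits = [
    ("u1", datetime(2014, 1, 1, 10, 5), "/news"),
    ("u1", datetime(2014, 1, 1, 10, 20), "/news"),
    ("u1", datetime(2014, 1, 1, 10, 40), "/sports"),
    ("u1", datetime(2014, 1, 1, 23, 15), "/news"),
]
intensity = surfing_intensity(visits)
revisits = revisit_counts(visits)
```

Comparing these counts with the same user's historical values is what would expose the peaks and anomalies described above; that comparison is left out here.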
F. Classification of webpage
Web pages are classified as index pages or content pages. An index page is a page used by
the user for navigation of the web site. It normally contains little information except
links. A content page is a page containing information the user would be interested in, and
its content offers something other than links.
Algorithm step
• Set two thresholds: count_threshold and link_threshold
• Set χ = 1/(mean reference length of all pages)
• Set t = -ln(1 - γ)/χ, where γ is the assumed fraction of index pages
• For each page P:
• If P's file type is not HTML or P's end-of-session count > count_threshold, mark P as a
content page
• Else if P's number of links > link_threshold, mark P as an index page
• Else if P's reference length < t, mark P as an index page
• Otherwise, mark P as a content page
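The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the source formula omits the symbol inside the logarithm, so the parameter gamma (an assumed fraction of index pages) is our reading; the PageStats record, the threshold values and the sample pages are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PageStats:
    url: str
    is_html: bool
    end_of_session_count: int  # how often the page ended a session
    link_count: int            # number of links on the page
    reference_length: float    # mean viewing time, in seconds


def cutoff_time(pages, gamma):
    """t = -ln(1 - gamma) / chi, with chi = 1 / mean reference length."""
    mean_length = sum(p.reference_length for p in pages) / len(pages)
    chi = 1.0 / mean_length
    return -math.log(1.0 - gamma) / chi


def classify_page(page, t, count_threshold, link_threshold):
    """Apply the three tests in the order given by the algorithm steps."""
    if not page.is_html or page.end_of_session_count > count_threshold:
        return "content"
    if page.link_count > link_threshold:
        return "index"
    return "index" if page.reference_length < t else "content"


# Hypothetical pages (values invented for illustration only).
pages = [
    PageStats("/paper.pdf", False, 0, 0, 90.0),
    PageStats("/index.html", True, 0, 40, 4.0),
    PageStats("/menu.html", True, 0, 5, 6.0),
    PageStats("/article.html", True, 0, 5, 60.0),
]
t = cutoff_time(pages, gamma=0.5)
labels = [classify_page(p, t, count_threshold=10, link_threshold=30) for p in pages]
print(labels)  # ['content', 'index', 'index', 'content']
```

With a mean reference length of 40 seconds and gamma = 0.5, the cut-off t is about 27.7 seconds, so the briefly viewed menu page is treated as an index page and the long-viewed article as a content page.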
Correlation with EOI timing
We assume that our five behavioral parameters are correlated with the timing of the
EOI. When the timing of the EOI is known to the investigator, as in forensic investigations,
such correlations can provide supportive evidence in a rather straightforward manner.
However, when the timing of the EOI is unknown to the investigator, as in pre-emptive
investigations, the behavioural parameters can still be used for prediction.
4.2 IMPROVED NAVIGATION PATTERN
Here we use the N-gram model, which assumes that the last N pages browsed affect the
probability of the next page to be visited. The model is based on the theory of
probabilistic grammars, which provides it with a sound theoretical foundation for future
enhancements. We propose a new model for handling the problem of mining log data which
directly captures the semantics of the user navigation sessions. We model the user navigation
records, inferred from log data, as a hypertext probabilistic grammar whose higher
probability generated strings correspond to the user's preferred trails. There are two
contexts in which such a model is potentially useful. On the one hand, it can help the
service provider to understand the users' needs and, as a result, improve the quality of its
service. The quality of service can be improved by providing adaptive pages suited to the
individual user and by building dynamic pages in advance to reduce waiting time. On the other
hand, such a model can be useful to the individual web user by acting as a personal assistant
integrated with his or her web browser. The model has the advantage of being compact,
self-contained, coherent, and based on the well-established work on probabilistic grammars.
In fact, the size of the model depends only on the size of the web site being analysed and
the amount of data collected. Extensive experiments with both real and random data were
conducted and the results show that, in practice, the algorithm runs in linear time in the
size of the grammar. Our model has potential use both in helping the web site designer to
understand the preferences of the site's visitors and in helping individual users to better
understand their own navigation patterns and increase their knowledge of the web's content.
Our approach has the following characteristics: 1) Extracting search-focused information
from web pages. 2) Taking key n-grams as the representations of search-focused information.
3) Employing data mining for the extraction model using search log data. 4) Employing
learning to rank with search-focused key n-grams as features.
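The N-gram assumption can be illustrated with a simple count-based sketch. It is not the full hypertext probabilistic grammar: it only counts which page follows each run of N consecutive pages and turns the counts into conditional probabilities. The function names and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict


def train_ngram(sessions, n=2):
    """Count which page follows each run of n consecutive pages."""
    model = defaultdict(Counter)
    for session in sessions:
        for i in range(len(session) - n):
            history = tuple(session[i:i + n])
            model[history][session[i + n]] += 1
    return model


def next_page_probabilities(model, history):
    """Estimate P(next page | last n pages) from the counts."""
    counts = model.get(tuple(history), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {page: count / total for page, count in counts.items()}


# Hypothetical navigation sessions (invented for illustration only).
sessions = [
    ["home", "courses", "syllabus"],
    ["home", "courses", "faculty"],
    ["home", "courses", "syllabus"],
    ["news", "home", "courses"],
]
model = train_ngram(sessions, n=2)
probs = next_page_probabilities(model, ["home", "courses"])
```

Here the trail home, courses is followed by syllabus in two of three sessions, so syllabus would be the natural candidate for pre-fetching after that trail.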
4.2.1 KEY N-GRAM EXTRACTION
The extraction step requires data pre-processing, training data generation, N-gram
feature generation, and N-gram extraction with task classification.
Pre-processing: We assume that the objects to be searched and ranked by the search engine
are web pages. During pre-processing, a web page in HTML format is parsed and represented
as a sequence of tags/words.
Algorithm step
• Read the records in Logtable; for each record in Logtable:
• Read fields (Sc_code, Sc_method)
• If Sc_code = ‘**’ and Sc_method = ‘**’ then
• Get IP_address and URL_link
• If suffix.URL_link = {*.gif, *.jpg, *.css} then
• Delete suffix.URL_link
• Save IP_address and URL_link
• End if
• Else read next record
• End
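A minimal sketch of this cleaning step is given below. The source leaves the wanted status code and method unspecified ('**'), so the defaults "200" and "GET" are assumptions, as are the function name and the sample records. Records are assumed to be dictionaries keyed by the log field names.

```python
# Suffixes of embedded resources named in the algorithm steps.
IGNORED_SUFFIXES = (".gif", ".jpg", ".css")


def clean_log(records, wanted_status="200", wanted_method="GET"):
    """Keep (ip, url) pairs for matching requests that are not embedded resources.

    wanted_status and wanted_method are assumed defaults; the source does not
    state which values to keep.
    """
    kept = []
    for rec in records:
        if rec["Sc-Status"] != wanted_status or rec["Cs-Method"] != wanted_method:
            continue  # read next record
        url = rec["Cs-Uri-Stem"]
        if url.lower().endswith(IGNORED_SUFFIXES):
            continue  # drop images and style sheets
        kept.append((rec["C-ip"], url))
    return kept


# Hypothetical records (invented for illustration only).
records = [
    {"C-ip": "10.0.0.1", "Cs-Method": "GET", "Cs-Uri-Stem": "/index.html", "Sc-Status": "200"},
    {"C-ip": "10.0.0.1", "Cs-Method": "GET", "Cs-Uri-Stem": "/logo.GIF", "Sc-Status": "200"},
    {"C-ip": "10.0.0.2", "Cs-Method": "GET", "Cs-Uri-Stem": "/missing.html", "Sc-Status": "404"},
    {"C-ip": "10.0.0.3", "Cs-Method": "POST", "Cs-Uri-Stem": "/form.asp", "Sc-Status": "200"},
]
cleaned = clean_log(records)
print(cleaned)  # [('10.0.0.1', '/index.html')]
```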
Training Data Generation: We consider automatically extracting queries from the page.
Head pages generally have a number of associated queries in the search log data. Such data
can naturally be used as training data for the automatic extraction of queries, particularly
for tail pages. We treat the n-grams in each of the document’s queries as its labelled key
n-grams. For example, when a document “ABDC” is associated with the query “ABC”, we
consider the unigrams “A”, “B”, “C” and the bigram “AB” to be key n-grams, with the
assumption that they should be ranked higher than the unigram “D” and the bigrams “BD” and
“DC” by the extraction model.
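The “ABDC”/“ABC” example can be reproduced with a few lines of Python. The sketch assumes that a key n-gram is any document n-gram that also occurs in an associated query; the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def label_key_ngrams(doc_tokens, query_tokens, max_n=2):
    """Split the document's n-grams into key and non-key sets."""
    key, non_key = set(), set()
    for n in range(1, max_n + 1):
        doc_grams = ngrams(doc_tokens, n)
        query_grams = ngrams(query_tokens, n)
        key |= doc_grams & query_grams       # also present in the query
        non_key |= doc_grams - query_grams   # present only in the document
    return key, non_key


key, non_key = label_key_ngrams(list("ABDC"), list("ABC"))
# key     -> unigrams A, B, C and bigram AB
# non_key -> unigram D and bigrams BD, DC
```

The query bigram “BC” is not a key n-gram because it never occurs contiguously in the document.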
N-gram Features Generation: Web pages contain rich formatting information compared to
plain text. We utilize both textual and formatting information to create features in the
extraction model in order to accurately extract key n-grams. Feature generation is based on
two kinds of features: 1. Frequency features 2. Appearance features.
1. Frequency Features
The original/normalized term frequencies of an n-gram within several fields, tags and
attributes are utilized.
• Frequency in Fields: The fields are the URL, page title, meta-keyword and
meta-description.
• Frequency within Structure Tags: The frequencies of n-grams in texts within a header,
table or list indicated by HTML tags including <h1>, . . . ,<h6>, <table>, <li> and
<dd>.
• Frequency within Highlight Tags: Texts highlighted or emphasized by HTML tags
including <a>, <b>, <i>, <em> and <strong>.
• Frequency within Attributes of Tags: These are hidden texts which are not visible to
users. Specifically, title, alt, href and src tag attributes are used.
• Frequencies in other Contexts: These include page headers, page meta-data, page body
and the whole HTML file.
2. Appearance Features
The appearances of n-grams, in terms of position, coverage and distribution, are also
important indicators of their importance. Position indicates where an n-gram first appears in
the title, paragraph and document. Coverage indicates the coverage of an n-gram in the title
or a header. Distribution indicates how an n-gram is distributed across different parts of a
page.
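Part of the frequency features can be sketched with Python's standard html.parser module. The sketch covers only n-gram counts in the page title, within structure tags, within highlight tags and in the whole file; URL, meta and attribute features, normalization and the appearance features are left out. The class and function names and the sample page are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

STRUCTURE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "table", "li", "dd"}
HIGHLIGHT_TAGS = {"a", "b", "i", "em", "strong"}


class NgramFrequencyCollector(HTMLParser):
    """Count word n-grams per field while walking the HTML."""

    def __init__(self, n):
        super().__init__()
        self.n = n
        self.open_tags = []
        self.counts = defaultdict(Counter)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        tokens = data.lower().split()
        grams = [" ".join(tokens[i:i + self.n])
                 for i in range(len(tokens) - self.n + 1)]
        fields = ["all"]
        if "title" in self.open_tags:
            fields.append("title")
        if STRUCTURE_TAGS.intersection(self.open_tags):
            fields.append("structure")
        if HIGHLIGHT_TAGS.intersection(self.open_tags):
            fields.append("highlight")
        for field in fields:
            self.counts[field].update(grams)


def frequency_features(html, n=2):
    collector = NgramFrequencyCollector(n)
    collector.feed(html)
    collector.close()
    return collector.counts


# Hypothetical page (invented for illustration only).
page = ("<html><head><title>Web Usage Mining</title></head><body>"
        "<h1>Usage Mining Basics</h1>"
        "<p>Web usage mining studies <b>usage mining</b> logs.</p>"
        "</body></html>")
features = frequency_features(page, n=2)
```

N-grams are counted within each text run separately, so no n-gram spans a tag boundary; whether that is the desired behaviour is a design choice of this sketch.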
N-Gram Extraction and Task Classification: Features for each n-gram are extracted, and an
extraction model is then trained. Key n-gram extraction is formalized as a learning to rank
problem. In learning, a ranking model is trained to rank n-grams, and the user’s current
task is then determined. The main aim of the task classification algorithm is to find the
user’s task, which is classified into two main groups: casual user and careful user. In
casual searching the user wants to find the precise and credible information.
Algorithm step
• Use frequently visited URLs (Cs-Uri-Stem field) as indicators for the task type
classification.
• Set the web task threshold (t=5ms).
• Store all frequently visited URLs and count the occurrences of the frequently visited
URLs.
• If the number of frequently visited URLs is greater than or equal to 5, set the user task
to careful user; otherwise set the user task to casual user.
• If the frequently visited URLs have a query (Cs-Uri-Query) and that query is the same, set
the user task to casual user; otherwise set the user task to careful user.
• The total number of URLs in casual searching was higher than the total number of URLs in
careful searching.
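One possible reading of these steps is sketched below. The text does not define "frequently visited" or say how the two rules combine, so the sketch assumes that a URL is frequent when it is requested at least twice in a session and that the query rule, when any query is present, overrides the count rule. All names, the min_visits parameter and the sample requests are hypothetical.

```python
from collections import Counter


def classify_task(requests, threshold=5, min_visits=2):
    """Classify a session as 'careful' or 'casual'.

    requests: list of (Cs-Uri-Stem, Cs-Uri-Query) pairs for one user.
    min_visits is an assumed definition of 'frequently visited'.
    """
    stem_counts = Counter(stem for stem, _query in requests)
    frequent = {stem for stem, count in stem_counts.items() if count >= min_visits}

    # Count rule: many frequently visited URLs indicate a careful user.
    label = "careful" if len(frequent) >= threshold else "casual"

    # Query rule (assumed to override): identical queries indicate a casual user.
    queries = {query for stem, query in requests
               if stem in frequent and query not in ("", "-")}
    if queries:
        label = "casual" if len(queries) == 1 else "careful"
    return label


# Hypothetical sessions (invented for illustration only).
many_pages = [(f"/page{i}", "-") for i in range(5)] * 2
same_query = [("/search", "q=x"), ("/search", "q=x")]
varied_query = [("/search", "q=x"), ("/search", "q=y")]
print(classify_task(many_pages), classify_task(same_query), classify_task(varied_query))
```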
5. APPLICATION AND FUTURE TRENDS AND CONCLUSION
5.1 APPLICATION
Web-wide tracking – DoubleClick: ‘Web-wide tracking’, i.e. tracking an individual across
all the sites he or she visits, is one of the most intriguing and controversial
technologies. It provides an understanding of an individual’s lifestyle and habits. The
value of this technology in applications such as cyber-threat analysis and homeland defense
is quite clear, and it might be only a matter of time before these organizations are asked to
provide this information.
Understanding Web communities – AOL: Applying web mining to the data collected from
community interactions provides AOL with a very good understanding of its communities,
which it has used for targeted marketing through ads and e-mail solicitations. The idea is to
treat the community as a highly specialized focus group, understand its needs and opinions on
new and existing products, and also test strategies for influencing opinions.
Web Caching: Web caching aims to improve the performance of web-based systems by storing
and reusing web objects that are likely to be used in the near future. It has proven to be an
effective technique for reducing network traffic, decreasing access latency and lowering
the server load [18]. Web caching has focused on the use of historic information about web
objects to aid the cache replacement policies.
Web Prefetching: Web prefetching is a technique for reducing web latency based on
predicting the next web objects to be accessed by the user and prefetching them during idle
times. The prefetching technique has two main components: the prediction engine and the
prefetching engine. The prediction engine runs a prediction algorithm to predict the user’s
next request [18].
5.2 FUTURE DIRECTION
Fraud and Threat analysis: The anonymity provided by the Web has led to a significant
increase in attempted fraud, from unauthorized use of individual credit cards to hacking into
credit card databases for blackmail purposes. Yet another example is auction fraud, which has
been increasing on popular sites like eBay. Since all these frauds are being perpetrated
through the Internet, Web mining is the perfect analysis technique for detecting and
preventing them.
Web mining and Privacy: While there are many benefits to be gained from Web mining, a clear
drawback is the potential for severe violations of privacy. Public attitude towards privacy
seems to be almost schizophrenic – i.e. people say one thing and do quite the opposite. The
research issue generated by this attitude is the need to develop approaches, methodologies
and tools that can be used to verify and validate that a Web service is indeed using an
end-user’s information in a manner consistent with its stated policies.
5.3 CONCLUSION
This paper presented a state-of-the-art review of the current research associated with
these human factors. This review will be important for practitioners who want to develop a
sound understanding of the needs and preferences of users with various characteristics such
as intensity of surfing, interest, gender difference and topic similarity. Our model has
potential use in helping the web site designer to understand the preferences of the site's
visitors and their behaviour and access patterns, which can be used to determine human
information behaviour. The model also analyzes users’ web surfing patterns and traces
terrorist and criminal activities. In this paper we apply N-gram methods to search log data,
and the characteristics of key n-grams can be applied to other data sets. The extracted key
n-grams are used as features of the relevance ranking model for finding users' current tasks
and their access behaviour. This approach is also applicable to understanding navigation
patterns and increasing users' knowledge of the web’s content, and it is also applicable in
a posterior forensic investigation. The model will also help designers to develop web-based
personalized applications that can accommodate users' individual differences, and it can be
used for detecting and avoiding the terror threats caused by terrorists all over the world.

REFERENCES
[1] Ford, N., Miller, D., & Moss, N., “Web search strategies and human individual
differences: Cognitive and demographic factors, internet attitudes, and approaches”,
Journal of the American Society for Information Science and Technology, pp. 741–756,
2005.
[2] Graff, M. (2003). “Learning from web-based instructional Systems and cognitive
style”. British Journal of Educational Technology, 34(4), 407–418.
[3] Chi E. H.; Pirolli P.; Chen K.; and Pitkow J. 2001. “Using information scent to model
user information needs and actions on the Web”. In Proceedings of the SIGCHI
conference on Human factors in computing systems, 490-497, Seattle, Washington,
United States: ACM.
[4] Kim K. and Allen B. 2002. Cognitive and task influences on web searching behavior.
Journal of the American Society for Information Science and Technology, 53(2):109-119:
John Wiley & Sons.
[5] Sherry Y. Chen, Robert Macredie, “Web-based interaction: A review of three important
human factors”, International Journal of Information Management, journal homepage:
www.elsevier.com/locate/ijinfomgt pp. 1-9, 2010.
[6] G. Eason, B. Noble, and I. N. Sneddon, “On certain integrals of Lipschitz-Hankel
type involving products of Bessel functions,” Phil. Trans. Roy. Soc. London, vol.
A247, pp. 529–551, April 1955.
[7] White R. W. and Drucker S. M. 2007. Investigating behavioral variability in web
search. In Proceedings of the 16th international conference on World Wide Web, 21-30,
Banff, Alberta, Canada: ACM.
[8] K. R. Suneetha, K. R. Krishnamoorthy, “Identifying User Behavior by Analyzing Web
Server Access File”, IJCSNS International Journal of Computer Science and Network
Security, Vol. 9, No. 4, April 2009.
[9] Alaa El-Halees, “Mining Students Data to Analyzing Learning Behavior: a Case
Study”, http://eref.uqu.edu.sa/files/eref2/folder6/f158.pdf
[10] R. Cooley, B. Mobasher, and J. Srivastava, “Web mining: Information and Pattern
Discovery on the World Wide Web”, Proc. IEEE Intl. Conf. Tools with AI, Newport
Beach, CA, pp. 558-56, 1997.
[11] Mahesh thyloreramkrishna, LathaKomalGowdar, LalatessSomashekarHavanur, “Web
mining: Key Accomplishments, Application, and Future Directions”, International
conference on Data Storage and Data Engineering, pp. 186-191, 2010
[12] Jinhyuk Choi, Jeongseok Seo, Geehyuk Lee “Analysis of web usage pattern using
various contextual factors” Association of advancement of artificial intelligence pp. 1-
9, 2009.
[13] R. Cooley, B. Mobasher, J. Srivastava, “Web Mining: Information and Pattern
Discovery on the World Wide Web”, In Proceedings of the 9th IEEE International
Conference on Tools With Artificial Intelligence, Newport Beach, CA, 1997.
[14] J. Srivastava, R. Cooley, M. Deshpande and P.-N. Tan, “Web Usage Mining:
Discovery and Applications of Usage Patterns from Web Data”, SIGKDD
Explorations, Vol. 1, Issue 2, 2000.
[15] Cooley, R., Mobasher, B., & Srivastava, J. (1999). “Data preparation for mining world
wide web browsing patterns”, Journal of Knowledge and Information Systems, 1(1),
5-32.

[16] R. Kosala, H. Blockeel, “Web Mining Research: A Survey”,in SIGKDD Explorations
2(1), ACM, July 2000.
[17] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, “Web
Usage Mining: Discovery and Applications of Usage Patterns from Web
Data”, SIGKDD Explorations, ACM SIGKDD, pp. 1-10, Jan 2000.
[18] Sandhaya Gawade, Hitesh Gupta, “Review of Algorithms for Web Pre-fetching
and Caching”, International Journal of Advanced Research in Computer and
Communication Engineering, Vol. 1, Issue 2, pp. 1-4, April 2012.
[19] Rozita Jamili Osfouei, “Behaviour mining of female students by analysing log files”, In
Proceedings of the IEEE Fifth International Conference on Digital
Information Management ICDM 2010, Canada, pp. 5-8, July 2010.
[20] T. Anand, S. Padmapriya, E. Kirubakram, “Terror Tracking Using Advanced Web
Mining Perspective”, In Proceedings of the IEEE Fourth International Conference on
Intelligent Agent and Multimedia, pp. 1-4, 2009.
[21] José Borges and Mark Levene, “Data Mining of User Navigation Patterns”,
Department of Computer Science, University College London, Gower Street, London,
pp. 1-19, April 2000.
[22] Chen Wang, Keping Bi, Yunhua Hu, “Extracting Search-Focused Key N-Grams for
Relevance Ranking in Web Search”, WSDM’12, February 8–12, 2012, Seattle,
Washington, USA, ACM, pp. 1-10, 2012.
[23] Prof. Sindhu P Menon and Dr. Nagaratna P Hegde, “Research on Classification
Algorithms and its Impact on Web Mining”, International Journal of Computer
Engineering & Technology (IJCET), Volume 4, Issue 4, 2013, pp. 495 - 504,
ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[24] Alamelu Mangai J, Santhosh Kumar V and Sugumaran V, “Recent Research in Web
Page Classification – A Review”, International Journal of Computer Engineering &
Technology (IJCET), Volume 1, Issue 1, 2010, pp. 112 - 122, ISSN Print: 0976 –
6367, ISSN Online: 0976 – 6375.
[25] Suresh Subramanian and Dr. Sivaprakasam, “Genetic Algorithm with a Ranking
Based Objective Function and Inverse Index Representation for Web Data Mining”,
International Journal of Computer Engineering & Technology (IJCET), Volume 4,
Issue 5, 2013, pp. 84 - 90, ISSN Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[26] Purvi Dubey and Asst. Prof. Sourabh Dave, “Effective Web Mining Technique for
Retrieval Information on the World Wide Web”, International Journal of Computer
Engineering & Technology (IJCET), Volume 4, Issue 6, 2013, pp. 156 - 160, ISSN
Print: 0976 – 6367, ISSN Online: 0976 – 6375.
[27] Hemprasad Badgujar and Dr. R.C.Thool, “His: Human Identification Schemes on
Web”, International Journal of Computer Engineering & Technology (IJCET),
Volume 4, Issue 2, 2013, pp. 198 - 212, ISSN Print: 0976 – 6367, ISSN Online:
0976 – 6375.
