DATA MINING


                              PROJECT REPORT


                           Submitted by SHYAM KUMAR S, MTHIN GOPINADH, AJITH JOHN ALIAS RITO, GEORGE CHERIAN


                                               CHAPTER 1 INTRODUCTION




  1.1 ABOUT THE TOPIC


       Data Mining is the process of discovering new correlations, patterns, and trends by digging into (mining) large amounts of data stored in warehouses, using artificial intelligence, statistical and mathematical techniques. Data mining can also be defined as the process of extracting knowledge hidden in large volumes of raw data, i.e. the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Alternative names for data mining include knowledge discovery in databases (KDD), knowledge extraction, and data/pattern analysis.


       Data mining is the practice of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases".




  1.2 ABOUT THE PROJECT


       The project has been developed in our college in an effort to identify the most frequently visited sites, the sites from which the most voluminous downloading has taken place, and the sites that have been denied access when referred to by the users.




Our college uses the Squid proxy server and our aim is to extract useful knowledge from one of the log files in it. After a combined scrutiny of the log files, the log named access.log was chosen as the database. Hence our project was to mine the contents of access.log. Finally, the PERL programming language was used for manipulating the contents of the log file. PERL EXPRESS 2.5 was the platform used to develop the mining application.


     The log file content is in the form of a standard text file requiring extensive and quick string manipulation to retrieve the necessary contents. The programs were required to sort the mined contents in descending order of their frequency of usage and size.


                                         CHAPTER 2 REQUIREMENT ANALYSIS




2.1 INTRODUCTION


       Requirement analysis is the process of gathering and interpreting facts, diagnosing problems and using the information to recommend improvements to the system. It is a problem solving activity that requires intensive communication between the system users and system developers.


       Requirement analysis or study is an important phase of any system development process. The system is studied to the minutest detail and analyzed. The system analyst plays the role of an interrogator and delves deep into the working of the present system. The system is viewed as a whole and the inputs to the system are identified. The outputs from the organization are traced through the various processing steps that the inputs pass through in the organization.


        A detailed study of these processes must be made using various techniques like interviews, questionnaires, etc. The data collected from these sources must be scrutinized to arrive at a conclusion. The conclusion is an understanding of how the system functions. This system is called the existing system.

Now, the existing system is subjected to close study and the problem areas are identified. The designer

now functions as a problem solver and tries to sort out the difficulties that the enterprise faces. The

solutions are given as a proposal.


        The proposal is then weighed against the existing system analytically and the best one is selected. The proposal is presented to the user for endorsement. The proposal is reviewed on user request and suitable changes are made. This loop ends as soon as the user is satisfied with the proposal.




2.2 PROPOSED SYSTEM


         In order to make the programming strategy optimal, complete and least complex, a detailed understanding of data mining, related concepts and associated algorithms is required. This is to be followed by effective implementation of the algorithm using the best possible alternative.




2.3 DATA MINING (KDD PROCESS)


          The Knowledge Discovery from Data (KDD) process includes understanding relevant prior knowledge and the goals of the application, creating a target dataset, preprocessing the data, filtering or cleaning it, transforming it, and identifying dimensionality and useful features. It also involves classification, association, regression, clustering and summarization. Choosing the mining algorithm is the most important parameter of the process.


         The final stage includes pattern evaluation, which covers visualization, transformation, removal of redundant patterns, and the use of the discovered knowledge.


         DM Technology and Systems: Data mining methods involve neural networks, evolutionary programming, memory-based reasoning, decision trees, genetic algorithms and nonlinear regression methods. This work also involves fuzzy logic, which is a superset of conventional Boolean logic that has been extended to handle the concept of partial truth: truth values between completely true and completely false.



           The term data mining is often used to apply to the two separate processes of knowledge discovery

and prediction. Knowledge discovery provides explicit information that has a readable form and can be

understood by a user. Forecasting, or predictive modeling, provides predictions of future events and may

be transparent and readable in some approaches (e.g. rule based systems) and opaque in others such as

neural networks. Moreover, some data mining systems such as neural networks are inherently geared

towards prediction and pattern recognition, rather than knowledge discovery.

          Metadata, or data about a given data set, are often expressed in a condensed data mine-able format,

or one that facilitates the practice of data mining. Common examples include executive summaries and

scientific abstracts.




Data Mining is the process of discovering new correlations, patterns, and trends by digging into

(mining) large amounts of data stored in warehouses, using artificial intelligence, statistical and

mathematical techniques.


       Data mining can also be defined as the process of extracting knowledge hidden from large

volumes of raw data i.e. the nontrivial extraction of implicit, previously unknown, and potentially useful

information from data. The alternative name of Data Mining is Knowledge discovery (mining) in

databases (KDD), knowledge extraction, data/pattern analysis, etc. The importance of collecting data that reflect your business or scientific activities to achieve competitive advantage is widely recognized now.

Powerful systems for collecting data and managing it in large databases are in place in all large and mid-

range companies.


       Log files → Preprocessing (data cleaning, session identification, data conversion) → Frequent Itemset Discovery, Frequent Sequence Discovery and Frequent Subtree Discovery (each controlled by a minimum support threshold, min_sup) → Pattern Analysis → Results

      Figure 2.3.1 : Process of web usage mining

       However, the bottleneck in turning this data into your success is the difficulty of extracting knowledge about the system you study from the collected data. Decision Support Systems (DSS) are computerized tools developed to assist decision makers through the process of making decisions; they are inherently prescriptive and enhance decision making in some way. DSS are closely related to the concept of rationality, which means the tendency to act in a reasonable way to make good decisions. The key decisions for an organization involve the product/service, distribution of the product using different distribution channels, computation of the output over different times and locations, prediction of output trends for an individual product or service within an estimated time frame and, finally, the scheduling of production on the basis of demand, capacity and resources.


          The main aim and objective of the work is to develop a system for dynamic decisions that depend on the product life cycle and individual characteristics; graph analysis has been done to give enhanced and advanced insight into the pattern of the product. The system has been reviewed in terms of both local and global aspects.




2.4 WORKING OF DATA MINING



          While large-scale information technology has been evolving separate transaction and analytical

systems, data mining provides the link between the two. Data mining software analyzes relationships and

patterns in stored transaction data based on open-ended user queries. Several types of analytical software

are available: statistical, machine learning, and neural networks. Generally, any of four types of

relationships are sought:


          Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant

chain could mine customer purchase data to determine when customers visit and what they typically

order. This information could be used to increase traffic by having daily specials.


          Clusters: Data items are grouped according to logical relationships or consumer preferences. For

example, data can be mined to identify market segments or consumer affinities.

          Associations: Data can be mined to identify associations. The beer-diaper example is an example
of associative mining.


          Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes. Data mining consists of five major elements:


     •Extract, transform, and load transaction data onto the data warehouse system.


     •Store and manage the data in a multidimensional database system.


     •Provide data access to business analysts and information technology professionals.



•Analyze the data by application software.


     •Present the data in a useful format, such as a graph or table.
     •Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID): CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.


•Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique (a short sketch follows this list).


•Rule induction: The extraction of useful if-then rules from data based on statistical significance.

•   Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
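
     As referenced above, the following is a minimal Perl sketch of the nearest neighbor idea with k = 1; the sample records, feature values and the use of Euclidean distance are illustrative assumptions, not part of the project's system.

    # Minimal 1-nearest-neighbor sketch (illustrative data and field names).
    use strict;
    use warnings;

    my @history = (
        { features => [1.0, 2.0], class => 'A' },
        { features => [8.0, 9.0], class => 'B' },
    );

    # Euclidean distance between two feature vectors.
    sub distance {
        my ($p, $q) = @_;
        my $sum = 0;
        $sum += ($p->[$_] - $q->[$_]) ** 2 for 0 .. $#$p;
        return sqrt($sum);
    }

    # Classify a new record by copying the class of its closest historical record.
    sub classify {
        my ($record) = @_;
        my ($best_class, $best_dist);
        for my $h (@history) {
            my $d = distance($record, $h->{features});
            if (!defined $best_dist || $d < $best_dist) {
                ($best_class, $best_dist) = ($h->{class}, $d);
            }
        }
        return $best_class;
    }

    print classify([7.5, 8.0]), "\n";    # prints "B": the nearest historical record has class B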




2.5 DATA MINING ALGORITHMS


        The data mining algorithm is the mechanism that creates mining models. To create a model, an
algorithm first analyzes a set of data, looking for specific patterns and trends. The algorithm then uses
the results of this analysis to define the parameters of the mining model.


     The mining model that an algorithm creates can take various forms, including:


     •A set of rules that describe how products are grouped together in a transaction.

     •A decision tree that predicts whether a particular customer will buy a product.

     •A mathematical model that forecasts sales.
          •     A set of clusters that describe how the cases in a dataset are related.




Microsoft SQL Server 2005 Analysis Services (SSAS) provides several algorithms for use in your

data mining solutions. These algorithms are a subset of all the algorithms that can be used for data

mining. You can also use third-party algorithms that comply with the OLE DB for Data Mining

specification. For more information about third-party algorithms, see Plugin Algorithms.


       Analysis Services includes the following algorithm types:


         •Classification algorithms predict one or more discrete variables, based on the other attributes in
         the dataset. An example of a classification algorithm is the Decision Trees Algorithm.


         •Regression algorithms predict one or more continuous variables, such as profit or loss, based on
         other attributes in the dataset. An example of a regression algorithm is the Time Series

         Algorithm.


         •Segmentation algorithms divide data into groups, or clusters, of items that have similar
         properties. An example of a segmentation algorithm is the Clustering Algorithm.


         •Association algorithms find correlations between different attributes in a dataset. The most
         common application of this kind of algorithm is for creating association rules, which can be

         used in a market basket analysis.


         •Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow. An example of a sequence analysis algorithm is the Sequence Clustering Algorithm.




 2.6 SOFTWARE REQUIREMENTS


          OPERATING SYSTEM                            WINDOWS XP SP2

          PERL COMPILER                               ACTIVE PERL

          SCRIPT EDITOR                               PERL EXPRESS

          SERVER SOFTWARE                             IIS SERVER




2.7 FUZZY LOGIC



           Fuzzy logic is a form of multi-valued logic derived from fuzzy set theory to deal with

reasoning that is approximate rather than precise. Just as in fuzzy set theory the set membership values

can range (inclusively) between 0 and 1, in fuzzy logic the degree of truth of a statement can range

between 0 and 1 and is not constrained to the two truth values {true, false} as in classic predicate logic.

And when linguistic variables are used, these degrees may be managed by specific functions, as

discussed below.


           Both fuzzy degrees of truth and probabilities range between 0 and 1 and hence may seem
similar at first. However, they are distinct conceptually; fuzzy truth represents membership in vaguely
defined sets, not likelihood of some event or condition as in probability theory. For example, if a 100-ml
glass contains 30 ml of water, then, for two fuzzy sets, Empty and Full, one might define the glass as
being 0.7 empty and 0.3 full.


        Note that the concept of emptiness would be subjective and thus would depend on the observer

or designer. Another designer might equally well design a set membership function where the glass

would be considered full for all values down to 50 ml. A probabilistic setting would first define a

scalar variable for the fullness of the glass, and second, conditional distributions describing the

probability that someone would call the glass full given a specific fullness level. Note that the

conditioning can be achieved by having a specific observer that randomly selects the label for the

glass, a distribution over deterministic observers, or both. While fuzzy logic avoids talking about

randomness in this context, this simplification at the same time obscures what is exactly meant by the

statement the 'glass is 0.3 full'.
2.7.1 APPLYING FUZZY TRUTH VALUES

        A basic application might characterize sub ranges of a continuous variable. For instance, a

temperature measurement for anti-lock brakes might have several separate membership functions

defining particular temperature ranges needed to control the brakes properly. Each function maps the

same temperature value to a truth value in the 0 to 1 range. These truth values can then be used to

determine how the brakes should be controlled.


        In this image, cold, warm, and hot are functions mapping a temperature scale. A point on that
scale has three "truth values" — one for each of the three functions. The vertical line in the image
represents a particular temperature that the three arrows (truth values) gauge. Since the red arrow




points to zero, this temperature may be interpreted as "not hot". The orange arrow (pointing at 0.2)
  may describe it as "slightly warm" and the blue arrow (pointing at 0.8) "fairly cold".
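
        As an illustration of such membership functions, here is a minimal Perl sketch. The piecewise-linear shapes and the breakpoints (10, 20 and 30 degrees) are assumptions chosen so that a reading of 12 degrees reproduces the truth values described above; they are not values taken from the report.

    # Minimal sketch of three fuzzy membership functions (assumed breakpoints).
    use strict;
    use warnings;

    sub cold { my $t = shift; return $t <= 10 ? 1 : $t >= 20 ? 0 : (20 - $t) / 10; }
    sub warm { my $t = shift; return ($t <= 10 || $t >= 30) ? 0
                                   : $t < 20 ? ($t - 10) / 10 : (30 - $t) / 10; }
    sub hot  { my $t = shift; return $t >= 30 ? 1 : $t <= 20 ? 0 : ($t - 20) / 10; }

    my $t = 12;    # one temperature reading
    printf "cold=%.1f warm=%.1f hot=%.1f\n", cold($t), warm($t), hot($t);
    # prints: cold=0.8 warm=0.2 hot=0.0  ("fairly cold", "slightly warm", "not hot")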




  2.7.2 FUZZY LINGUISTIC VARIABLES

           While variables in mathematics usually take numerical values, in fuzzy logic applications, the

  non-numeric linguistic variables are often used to facilitate the expression of rules and facts.


        A linguistic variable such as age may have a value such as young or its opposite defined as old. However, the great utility of linguistic variables is that they can be modified via linguistic operations on the primary terms. For instance, if young is associated with the value 0.7 then very young is automatically deduced as having the value 0.7 * 0.7 = 0.49. And not very young gets the value (1 - 0.49), i.e. 0.51.


          In this example, the operator very(X) was defined as X * X, however in general these operators

  may be uniformly, but flexibly defined to fit the application, resulting in a great deal of power for the

  expression of both rules and fuzzy facts.
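
        A minimal Perl sketch of these hedge operators (the subroutine names are purely illustrative):

    # Linguistic hedges applied to a fuzzy truth value.
    use strict;
    use warnings;

    sub very { my $x = shift; return $x * $x; }   # very(X) = X * X
    sub not_ { my $x = shift; return 1 - $x;  }   # not(X)  = 1 - X

    my $young = 0.7;
    printf "very young     = %.2f\n", very($young);          # 0.49
    printf "not very young = %.2f\n", not_(very($young));    # 0.51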




                                                 CHAPTER 3


                                              SYSTEM DESIGN




        System design is the solution to the creation of a new system. This phase is composed of several steps. It focuses on the detailed implementation of the feasible system, and its emphasis is on translating design specifications into performance specifications. System design has two phases of development: logical and physical design.


        During the logical design phase the analyst describes inputs (sources), outputs (destinations), databases (data stores) and procedures (data flows), all in a format that meets the user's requirements. The analyst also specifies the user's needs at a level that virtually determines the information flow into and out of the system and the data resources. Here the logical design is done through data flow diagrams and database design.


          The logical design is followed by the physical design, or coding. Physical design produces the working system by defining the design specifications, which tell the programmers exactly what the candidate system must do. The programmers write the necessary programs that accept input from the user, perform the necessary processing on the accepted data and produce the required report on a hard copy or display it on the screen.




3.1 DATABASE DESIGN


          The data mining process involves the manipulation of large data sets. Hence, a large database is a

key requirement in the mining operation. Ordered set of information is now to be extracted from this

database.

          The overall objective in the development of database technology has been to treat data as an

organizational resource and as an integrated whole. DBMS allow data to be protected and organized

separately from other resources.


          Database is an integrated collection of data. The most significant form of data as seen by the

programmers is data as stored on the direct access storage devices. This is the difference between logical

and physical data.


          Database files are the key source of information into the system. It is the process of designing
database files, which are the key source of information to the system. The files should be properly designed
and planned for collection, accumulation, editing and retrieving the required information.


      The organization of data in database aims to achieve three major objectives: -


      •Data integration.

      •Data integrity.

      •Data independence.




A large data set is difficult to parse, and it is difficult to interpret the knowledge contained in it. Since the database used in this project is the log file of a proxy server called Squid, a detailed study of the Squid style of transaction logging is also required.




3.2 PROXY SERVER


       A proxy server is a server (a computer system or an application program) which services the
requests of its clients by forwarding requests to other servers. A client connects to the proxy server,
requesting some service, such as a file, connection, web page, or other resource, available from a different
server. The proxy server provides the resource by connecting to the specified server and requesting the
service on behalf of the client. A proxy server may optionally alter the client's request or the server's
response, and sometimes it may serve the request without contacting the specified server. In this case, it
would 'cache' the first request to the remote server, so it could save the information for later, and make
everything as fast as possible.


        A proxy server that passes all requests and replies unmodified is usually called a gateway or

sometimes tunneling proxy. A proxy server can be placed in the user's local computer or at specific key

points between the user and the destination servers or the Internet.


      • Caching proxy server


         A proxy server can service requests without contacting the specified server, by retrieving content

saved from a previous request, made by the same client or even other clients. This is called caching.


      • Web proxy


        A proxy that focuses on WWW traffic is called a "web proxy". The most common use of a web
proxy is to serve as a web cache. Most proxy programs (e.g. Squid, Net Cache) provide a means to deny
access to certain URLs in a blacklist, thus providing content filtering.


      • Content Filtering Web Proxy


        A content filtering web proxy server provides administrative control over the content that may be
relayed through the proxy. It is commonly used in commercial and non-commercial organizations
(especially schools) to ensure that Internet usage conforms to acceptable use policy.


      • Anonymizing proxy server



An anonymous proxy server (sometimes called a web proxy) generally attempts to anonymize web

surfing. These can easily be overridden by site administrators, and thus rendered useless in some cases.

There are different varieties of anonymizers.
      • Hostile proxy


        Proxies can also be installed by online criminals, in order to eavesdrop upon the dataflow between

the client machine and the web. All accessed pages, as well as all forms submitted, can be captured and

analyzed by the proxy operator.




3.3 THE SQUID PROXY SERVER


       Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces

bandwidth and improves response times by caching and reusing frequently-requested web pages. Squid has

extensive access controls and makes a great server accelerator. It runs on Unix and Windows and is

licensed under the GNU GPL. Squid is used by hundreds of Internet Providers world-wide to provide their

users with the best possible web access.


       Squid optimizes the data flow between client and server to improve performance and caches

frequently-used content to save bandwidth. Squid can also route content requests to servers in a wide

variety of ways to build cache server hierarchies which optimize network throughput.


       Thousands of web-sites around the Internet use Squid to drastically increase their content delivery.

Squid can reduce your server load and improve delivery speeds to clients. Squid can also be used to deliver

content from around the world - copying only the content being used, rather than inefficiently copying

everything. Finally, Squid's advanced content routing configuration allows you to build content clusters to

route and load balance requests via a variety of web servers.


       Squid is a fully-featured HTTP/1.0 proxy which is almost HTTP/1.1 compliant. Squid offers a rich
access control, authorization and logging environment to develop web proxy and content serving
applications. Squid is one of the projects which grew out of the initial content distribution and caching
work in the mid-90s.


        It has grown to include extra features such as powerful access control, authorization, logging, content distribution/replication, traffic management and shaping, and more. It has many, many workarounds, new and old, to deal with incomplete and incorrect HTTP implementations.




Squid allows Internet Providers to save on their bandwidth through content caching. Cached

content means data is served locally and users will see this through faster download speeds with

frequently-used content.


        A well-tuned proxy server (even without caching!) can improve user speeds purely by optimizing TCP flows. It is easy to tune servers to deal with the wide variety of latencies found on the internet - something that desktop environments just aren't tuned for.


      Squid allows ISPs to avoid needing to spend large amounts of money on upgrading core equipment
and transit links to cope with ever-demanding content growth. It also allows ISPs to prioritize and control
certain web content types where dictated by technical or economic reasons.




3.3.1 SQUID STYLE TRANSACTION-LOGGING


       Transaction logs allow administrators to view the traffic that has passed through the Content

Engine. Typical fields in the transaction log are the date and time when a request was made, the URL that

was requested, whether it was a cache-hit or a cache-miss, the type of request, the number of bytes

transferred, and the source IP.


        High-performance caching presents additional challenges other than how to quickly retrieve objects

from storage, memory, or the web. Administrators of caches are often interested in what requests have been

made of the cache and what the results of these requests were. This information is then used for such

applications as:


     •Problem identification and solving

     •Load monitoring

     •Billing

     •Statistical analysis

     •Security problems
      • Cost analysis and provisioning




Squid log file format is:




     time elapsed remotehost code/status bytes method URL rfc931 peerstatus/peerhost type

A Squid log format example looks like this:




       1012429341.115 100 172.16.100.152 TCP_REFRESH_MISS/304 1100 GET http://www.cisco.com/images/homepage/news.gif - DIRECT/www.cisco.com -
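
       As a quick illustration of how such a line maps onto the fields listed above, here is a minimal Perl sketch; the variable names simply mirror the field names and this is not code taken from the project itself.

    # Split one Squid native-format log line into its fields.
    use strict;
    use warnings;

    my $line = '1012429341.115 100 172.16.100.152 TCP_REFRESH_MISS/304 1100 GET '
             . 'http://www.cisco.com/images/homepage/news.gif - DIRECT/www.cisco.com -';

    my ($time, $elapsed, $remotehost, $code_status, $bytes,
        $method, $url, $rfc931, $peer, $type) = split /\s+/, $line;

    print "$remotehost requested $url ($code_status, $bytes bytes)\n";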




       Squid logs are a valuable source of information about cache workloads and performance. The logs

record not only access information but also system configuration errors and resource consumption, such as

memory and disk space.




Field                 Description

Time                  UNIX time stamp as Coordinated Universal Time (UTC) seconds with a millisecond resolution.

Elapsed               Length of time in milliseconds that the cache was busy with the transaction. Note: entries are logged after the reply has been sent, not during the lifetime of the transaction.

Remote Host           IP address of the requesting instance.

Code/Status           Two entries separated by a slash. The first entry contains information on the result of the transaction: the kind of request, how it was satisfied, or in what way it failed. The second entry contains the HTTP result codes.

Bytes                 Amount of data delivered to the client. This does not constitute the net object size, because headers are also counted. Also, failed requests may deliver an error page, the size of which is also logged here.

Method                Request method to obtain an object, for example, GET.

URL                   URL requested.

Rfc931                Contains the authentication server's identification or lookup names of the requesting client. This field will always be a "-" (dash).

Peerstatus/Peerhost   Two entries separated by a slash. The first entry represents a code that explains how the request was handled, for example, by forwarding it to a peer, or returning the request to the source. The second entry contains the name of the host from which the object was requested. This host may be the origin site, a parent, or any other peer. Also note that the host name may be numerical.

Type                  Content type of the object as seen in the HTTP reply header. In the ACNS 4.1 software, this field will always contain a "-" (dash).

Table 3.3.1.1 : Squid-Style Format
3.3.2 SQUID LOG FILES


        The logs are a valuable source of information about Squid workloads and performance. The logs

record not only access information, but also system configuration errors and resource consumption (eg,

memory, disk space). There are several log files maintained by Squid. Some have to be explicitly activated during compile time; others can safely be deactivated during run-time.


        There are a few basic points common to all log files. The time stamps logged into the log files are

usually UTC seconds unless stated otherwise. The initial time stamp usually contains a millisecond

extension.




        SQUID.OUT


        If you run your Squid from the RunCache script, a file squid.out contains the Squid startup times and also all fatal errors, e.g. as produced by an assert() failure. If you are not using RunCache, you will not see such a file.




        CACHE.LOG


        The cache.log file contains the debug and error messages that Squid generates. If you start your Squid using the default RunCache script, or start it with the -s command line option, a copy of certain messages will go into your syslog facilities. It is a matter of personal preference to use a separate file for the squid log data.


        From the area of automatic log file analysis, the cache.log file does not have much to offer. We

will usually look into this file for automated error reports, when programming Squid, testing new

features, or searching for reasons of a perceived misbehavior, etc.
        USERAGENT.LOG


        The user agent log file is only maintained if:


        1. We configured the compile-time --enable-useragent-log option, and




        2. We pointed the useragent_log configuration option to a file.


       From the user agent log file you are able to find out about the distribution of browsers among your clients.

Using this option in conjunction with a loaded production squid might not be the best of all ideas.




      STORE.LOG


       The store.log file covers the objects currently kept on disk or removed ones. As a kind of

transaction log it is usually used for debugging purposes. A definitive statement, whether an object

resides on your disks is only possible after analyzing the complete log file. The release (deletion) of an

object may be logged at a later time than the swap out (save to disk).


       The store.log file may be of interest to log file analysis which looks into the objects on your

disks and the time they spend there, or how many times a hot object was accessed. The latter may be

covered by another log file, too. With knowledge of the cache_dir configuration option, this log file

allows for a URL to filename mapping without recursing your cache disks. However, the Squid developers recommend treating store.log primarily as a debug file, and so should you, unless you know what you are doing.




HIERARCHY.LOG


     This log file exists for Squid-1.0 only. The format is


      [date] URL peerstatus peerhost




     ACCESS.LOG


          Most log file analysis programs are based on the entries in access.log. Currently, there are two file formats possible for the log file, depending on your configuration for the emulate_httpd_log option. By default, Squid will log in its native log file format. If the above option is enabled, Squid will log in the common log file format as defined by the CERN web daemon.


          The Common Logfile Format is used by numerous HTTP servers. This format consists of the

following seven fields:

     remote host rfc931 authuser [date] "method URL" status bytes


           It is parsable by a variety of tools. The common format contains different information than the native log file format: the HTTP version is logged, which is not logged in the native log file format.


           The log contents include the site name, the IP address of the requesting instance, date and time

in unix time format, bytes transferred, the requesting method and other such features. Log files are

usually large in size, large enough to be mined. However, the values of an entire line of input change with the change in header.


          The common log file format contains other information than the native log file, and less. The

native format contains more information for the admin interested in cache evaluation. The access.log is

the squid log that has been made use of in this project. The log file was in the form of a text file shown

below :




[Sample entries from access.log in the Squid native log format]
Figure 3.3.2.1 : Access.log used as database
3.3.3 SQUID RESULT CODES


                 The TCP_ codes refer to requests on the HTTP port (usually 3128). The UDP_ codes

refer to requests on the ICP port (usually 3130). If ICP logging was disabled using the log_icp_queries option, no ICP replies will be logged.


      TCP_HIT

A valid copy of the requested object was in the cache.

      TCP_MISS

The requested object was not in the cache.

      TCP_REFRESH_HIT

The requested object was cached but STALE. The IMS query for the object resulted in "304 not modified".

      TCP_REF_FAIL_HIT

The requested object was cached but STALE. The IMS query failed and the stale object was delivered.

      TCP_REFRESH_MISS

The requested object was cached but STALE. The IMS query returned the new content.

      TCP_CLIENT_REFRESH_MISS

The client issued a "no-cache" pragma, or some analogous cache control command along with the request. Thus, the cache has to re-fetch the object.

      TCP_IMS_HIT

The client issued an IMS request for an object which was in the cache and fresh.

      TCP_SWAPFAIL_MISS

The object was believed to be in the cache, but could not be accessed.

      TCP_NEGATIVE_HIT

Request for a negatively cached object, e.g. "404 not found", for which the cache believes to know that it is inaccessible. Also refer to the explanations for negative_ttl in your squid.conf file.

      TCP_MEM_HIT

A valid copy of the requested object was in the cache and it was in memory, thus avoiding disk accesses.

      TCP_DENIED

Access was denied for this request.

      TCP_OFFLINE_HIT

The requested object was retrieved from the cache during offline mode. The offline mode never validates any object.

      UDP_HIT

A valid copy of the requested object was in the cache.

      UDP_MISS

The requested object is not in this cache.

      UDP_DENIED

Access was denied for this request.

      UDP_INVALID

An invalid request was received.

      UDP_MISS_NOFETCH

During "-Y" startup, or during frequent failures, a cache in hit only mode will return either UDP_HIT or this code. Neighbors will thus only fetch hits.

      NONE

Seen with errors and cache manager requests.




3.4 HTTP RESULT CODES


          These are taken from RFC 2616 and verified for Squid. Squid-2 uses almost all codes except 307 (Temporary Redirect), 416 (Request Range Not Satisfiable), and 417 (Expectation Failed). Extra codes include 0 for a result code being unavailable, and 600 to signal an invalid header, a proxy error. Also, some definitions were added as for RFC 2518. Yes, there are really two entries for status code 424; compare with http_status in src/enums.h.


    000          USED MOSTLY WITH UDP TRAFFIC
    100          CONTINUE
    101          SWITCHING PROTOCOLS
    102          PROCESSING
    200          OK
    201          CREATED
    202          ACCEPTED
    203          NON-AUTHORITATIVE INFORMATION
    204          NO CONTENT
    205          RESET CONTENT
    206          PARTIAL CONTENT
    207          MULTI STATUS
    300          MULTIPLE CHOICES
    301          MOVED PERMANENTLY
    302          MOVED TEMPORARILY
    304          NOT MODIFIED
    305          USE PROXY
    307          TEMPORARY REDIRECT
    400          BAD REQUEST
    401          UNAUTHORIZED
    402          PAYMENT REQUIRED
    403          FORBIDDEN
    404          NOT FOUND
    405          METHOD NOT ALLOWED
    406          NOT ACCEPTABLE
    407          PROXY AUTHENTICATION REQUIRED
    408          REQUEST TIMEOUT
    409          CONFLICT
    410          GONE
    411          LENGTH REQUIRED
    412          PRECONDITION FAILED
    413          REQUEST ENTITY TOO LARGE
    414          REQUEST URI TOO LARGE
    415          UNSUPPORTED MEDIA TYPE
    416          REQUEST RANGE NOT SATISFIABLE
    417          EXPECTATION FAILED
    424          LOCKED
    424          FAILED DEPENDENCY
    433          UNPROCESSABLE ENTITY
    500          INTERNAL SERVER ERROR
    501          NOT IMPLEMENTED
    502          BAD GATEWAY

TABLE 3.4.1 : HTTP result codes
3.5 HTTP REQUEST METHODS


         Squid recognizes several request methods as defined in RFC 2616. Newer versions of Squid also recognize the RFC 2518 "HTTP Extensions for Distributed Authoring" (WEBDAV) extensions.


 GET                       OBJECT RETRIEVAL AND SIMPLE SEARCHES.

 HEAD                      METADATA RETRIEVAL.

 POST                      SUBMIT DATA (TO A PROGRAM).

 PUT                       UPLOAD DATA (E.G. TO A FILE).

 DELETE                    REMOVE RESOURCE (E.G. FILE).

 TRACE                     APPLN LAYER TRACE OF REQUEST ROUTE.

 OPTIONS                   REQUEST AVAILABLE COMM. OPTIONS.

 CONNECT                   TUNNEL SSL CONNECTION.

 PROPFIND                  RETRIEVE PROPERTIES OF AN OBJECT.

 PROPPATCH                 CHANGE PROPERTIES OF AN OBJECT.

 COPY                      CREATE A DUPLICATE OF SRC IN DST.

 MOVE                      ATOMICALLY MOVE SRC TO DST.

 LOCK                      LOCK AN OBJECT AGAINST MODIFICATIONS.

 UNLOCK                    UNLOCK AN OBJECT.

TABLE 3.4.2 : HTTP request methods
                                                  CHAPTER 4


                                                    CODING




4.1 FEATURES OF LANGUAGE (PERL)

Practical Extraction and Reporting Language (Perl) is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It is also a good language for many system management tasks.


         •The language is intended to be practical (easy to use, efficient, complete) rather than beautiful
         (tiny, elegant, minimal).


         •It combines (in the author's opinion, anyway) some of the best features of C, sed, awk, and sh, so people familiar with those languages should have little difficulty with it. (Language historians will also note some vestiges of Pascal and even BASIC-PLUS.)


         •Unlike most UNIX utilities, Perl does not arbitrarily limit the size of our data: if we have got the memory, Perl can slurp in our whole file as a single string. Recursion is of unlimited depth.


         •The hash tables used by associative arrays grow as necessary to prevent degraded performance.
         Perl uses sophisticated pattern matching techniques to scan large amounts of data very quickly.


         •Although optimized for scanning text, Perl can also deal with binary data, and can make dbm files look like associative arrays (where dbm is available). Setuid Perl scripts are safer than C programs through a dataflow tracing mechanism which prevents many stupid security holes.




•The overall structure of Perl derives broadly from C. Perl is procedural in nature, with variables,
expressions, assignment statements, brace-delimited code blocks, control structures, and

subroutines.

 •Perl also takes features from shell programming. All variables are marked with leading sigils, which unambiguously identify the data type (scalar, array, hash, etc.) of the variable in context. Importantly, sigils allow variables to be interpolated directly into strings (see the short sketch after this list).


 •Perl has many built-in functions which provide tools often used in shell programming (though
 many of these tools are implemented by programs external to the shell) like sorting, and calling

 on system facilities.


 •Perl takes lists from Lisp, associative arrays (hashes) from AWK, and regular expressions
 from sed. These simplify and facilitate many parsing, text handling, and data management

 tasks.


 •In Perl 5, features were added that support complex data structures, first-class functions (i.e.,
 closures as values), and an object-oriented programming model. These include references,

 packages, class-based method dispatch, and lexically scoped variables, along with compiler

 directives.


 •All versions of Perl do automatic data typing and memory management. The interpreter knows
 the type and storage requirements of every data object in the program; it allocates and frees

 storage for them as necessary using reference counting (so it cannot reallocate circular data

 structures without manual intervention). Legal type conversions -for example, conversions

 from number to string—are done automatically at run time; illegal type conversions are fatal

 errors.


 •Perl has a context-sensitive grammar which can be affected by code executed during an
 intermittent run-time phase. Therefore Perl cannot be parsed by a straight Lex/Yacc

 lexer/parser combination. Instead, the interpreter implements its own lexer, which coordinates with a modified GNU bison parser to resolve ambiguities in the language.


 •The execution of a Perl program divides broadly into two phases: compile-time and runtime.
 At compile time, the interpreter parses the program text into a syntax tree. At run time, it

 executes the program by walking the tree.
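
 The short sketch referenced above illustrates the sigils and the three built-in data types; the names used here are purely illustrative.

    # Perl sigils: $ for scalars, @ for arrays, % for hashes (associative arrays).
    use strict;
    use warnings;

    my $site  = "www.google.com";            # scalar
    my @sites = ($site, "login.icq.com");    # array of site names
    my %count = ($site => 3);                # hash: site name => request count

    $count{"login.icq.com"}++;               # adds a new entry and increments it
    print "visited $site $count{$site} times\n";   # sigils interpolate into strings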




4.2 PERL CODE FOR MINING
                                                                 FIGURE 4.2.1 : PERL Program for mining




        The Perl code to mine access.log makes use of the split() construct, which is required to split a line of text in the log file. The extracted site name is pushed into an array for comparison purposes. After the required comparison to determine the number of times that a site has been repeated, both the site and its corresponding count are inserted into a hash array.

        The hashed array is then used for sorting the site names in descending order of their counts. The count and the corresponding site name are displayed as the output.
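
 Since the program itself appears only as the screenshot in Figure 4.2.1, the listing below is an illustrative sketch of this mining step rather than the original code. It assumes Squid's native access.log layout, in which the requested URL is the seventh whitespace-separated field; the file name, variable names and output wording are placeholders.

    use strict;
    use warnings;

    my %count;    # site / URL  =>  number of requests seen in the log

    open my $log, '<', 'access.log' or die "Cannot open access.log: $!";
    while (my $line = <$log>) {
        chomp $line;
        # Squid native format: timestamp, elapsed, client, code/status,
        # bytes, method, URL, ident, hierarchy/host, content type.
        my @fields = split ' ', $line;
        next unless defined $fields[6];
        $count{ $fields[6] }++;
    }
    close $log;

    print "TOTAL SITES VISITED : ", scalar(keys %count), "\n\n";
    print "SITES SORTED IN ORDER OF FREQUENCY OF USAGE:\n";
    foreach my $site (sort { $count{$b} <=> $count{$a} } keys %count) {
        printf "%-10d %s\n", $count{$site}, $site;
    }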
                                                               4.3 DISPLAYED OUTPUT


 FIGURE 4.2.2: VISITED SITES
 [Screenshot of the program output listing the requested URLs and host:port entries extracted from access.log; the individual entries are not legible in the extracted text.]



             This is the output of the program in Figure 4.2.1. It displays the sites that have been requested, those that were visited and even those that were denied access by the proxy server. Hence, the log records all the transactions that have been successful as well as those that have failed.




 TOTAL SITES VISITED : 5238
 SITES SORTED IN ORDER OF FREQUENCY OF USAGE:
 [Screenshot of the program output listing each request count and the corresponding site in descending order; the individual rows are not legible in the extracted text.]
 Figure 4.2.3 : Sites sorted in frequency of usage




 [Screenshot of the program output with two columns, BYTES DOWNLOADED and SITE NAME, listing sites in descending order of bytes transferred; the individual rows are not legible in the extracted text.]
 Figure 4.2.4 : Sites sorted in terms of bytes downloaded
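
 A similar illustrative sketch for the bytes-downloaded report of Figure 4.2.4, again not the original listing; it assumes the byte count occupies the fifth field and the URL the seventh field of each access.log line.

    use strict;
    use warnings;

    my %bytes;    # site / URL  =>  total bytes transferred

    open my $log, '<', 'access.log' or die "Cannot open access.log: $!";
    while (my $line = <$log>) {
        my ($time, $elapsed, $client, $status, $size, $method, $url) = split ' ', $line;
        next unless defined $url and defined $size and $size =~ /^\d+$/;
        $bytes{$url} += $size;
    }
    close $log;

    print "BYTES DOWNLOADED    SITE NAME\n";
    foreach my $site (sort { $bytes{$b} <=> $bytes{$a} } keys %bytes) {
        printf "%-16d %s\n", $bytes{$site}, $site;
    }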




 [Screenshot of the program output showing the Squid result code recorded for each request (e.g. TCP_MISS/200, TCP_MISS/302, TCP_DENIED/403), followed by the ACCESS DENIED SITES list, e.g. www.google.com:80, TCP_DENIED/403, and the number of sites that were denied access.]
 Figure 4.2.5 : Sites that were denied access
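
 The denied-access report of Figure 4.2.5 filters on the Squid result code. An illustrative sketch of that step, assuming the result code (for example TCP_DENIED/403) is the fourth field of each access.log line and counting the distinct sites that were refused:

    use strict;
    use warnings;

    my %denied;    # "site, result code"  =>  number of refused requests

    open my $log, '<', 'access.log' or die "Cannot open access.log: $!";
    while (my $line = <$log>) {
        my ($time, $elapsed, $client, $status, $size, $method, $url) = split ' ', $line;
        next unless defined $url and defined $status and $status =~ m{^TCP_DENIED/};
        $denied{"$url, $status"}++;
    }
    close $log;

    print "**** ACCESS DENIED SITES ****\n";
    print "$_\n" for sort keys %denied;
    print "\nNUMBER OF SITES THAT WERE DENIED ACCESS : ", scalar(keys %denied), "\n";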
CHAPTER 5


                                                   TESTING




5.1 SYSTEM TESTING


         Testing is a set of activities that can be planned and conducted systematically. Testing begins at the module level and works towards the integration of the entire computer-based system. Nothing is complete without testing, as it is vital to the success of the system.


         Testing Objectives:

         There are several rules that can serve as testing objectives:

         •Testing is a process of executing a program with the intent of finding an error.

         •A good test case is one that has a high probability of finding an undiscovered error.

         •A successful test is one that uncovers an as-yet-undiscovered error.




         If testing is conducted successfully according to the objectives stated above, it will uncover errors in the software. Testing also demonstrates that the software functions appear to be working according to the specification and that the performance requirements appear to have been met.


       There are three ways to test a program:


      •For Correctness

      •For Implementation efficiency

      •For Computational Complexity.

        Tests for correctness are supposed to verify that a program does exactly what it was designed

to do. This is much more difficult than it may at first appear, especially for large programs.


        Tests for implementation efficiency attempt to find ways to make a correct program faster or

use less storage. It is a code-refining process, which reexamines the implementation phase of algorithm

development.


        Tests for computational complexity amount to an experimental analysis of the complexity of an

algorithm or an experimental comparison of two or more algorithms, which solve the same problem.


      Testing Correctness



The following ideas should be a part of any testing plan:


        •Preventive Measures

        •Spot checks

        •Testing all parts of the program

        •Test Data

        •Looking for trouble

        •Time for testing

        •Re Testing

          The data was entered in all forms separately, and whenever an error occurred it was corrected immediately. A quality team deputed by the management verified all the necessary documents and tested the software while entering the data at all levels. The entire testing process can be divided into three phases:


         Unit Testing

         Integration Testing

         Final / System Testing




5.1.1 UNIT TESTING


         As this system was a partially GUI-based Windows application, the following were tested in this phase:


        Tab Order


        Reverse Tab Order


        Field length


        Front end validations


           In our system, unit testing has been successfully handled. The test data was given to each and every module in all respects and the desired output was obtained. Each module has been tested and found to be working properly.




5.1.2 INTEGRATION TESTING


       Test data should be prepared carefully, since the test data alone determines the efficiency and accuracy of the system. Artificial data were prepared solely for testing. Every program validates the input data.
5.1.3 VALIDATION TESTING


         In this phase, all the code modules were tested individually one after the other. The following were tested in all the modules:


       Loop testing


       Boundary Value analysis


       Equivalence Partitioning Testing


         In our case all the modules were combined and given the test data. The combined module works successfully without any side effects on the other programs. Everything was found to be working fine.




5.1.4 OUTPUT TESTING


         This is the final step in testing. In this step, the entire system was tested as a whole with all forms, code, modules and class modules. This form of testing is popularly known as Black Box testing or system testing.


        Black Box testing methods focus on the functional requirement of the software. That is, Black

Box testing enables the software engineer to derive sets of input conditions that will fully exercise all

functional requirements for a program.


         Black Box testing attempts to find errors in the following categories: incorrect or missing functions, interface errors, errors in data structures or external database access, performance errors, and initialization and termination errors.


                                                CHAPTER 6


                                                CONCLUSION




         The project report entitled "DATAMINING USING FUZZY LOGIC" has come to its final stage. The system has been developed with much care so that it is free of errors and at the same time efficient and less time consuming. The important thing is that the system is robust. We have tried our level best to complete the project with all its required features.




        However, due to time constraints, the fuzzy implementation over the mined data has not been possible. Since the queries related to mining require the proper retrieval of data, the actual count is preferred over applying fuzziness to the count.
                                                  APPENDICES




OVERVIEW OF PERL EXPRESS 2.5




         PERL EXPRESS 2.5 is a free integrated development environment (IDE) for Perl with multiple tools for writing and debugging scripts. It features multiple CGI scripts for editing, running, and debugging; multiple input files; full server simulation; queries created from an internal Web browser or query editor; testing of MySQL and MS Access scripts; interactive I/O; a directory window; a code library; and code templates.


         Perl Express allows us to set environment variables used for running and debugging scripts. It has a customizable code editor with syntax highlighting, unlimited text size, printing, line numbering, bookmarks, column selection, a search-and-replace engine, and multilevel undo/redo operations. Version 2.5 adds a command line and bug fixes.
                                                RESUME




          The developed system is flexible and changes can be made easily. The system is developed

with an insight into the necessary modification that may be required in the future. Hence the system

can be maintained successfully without much rework.


           One of the main future enhancements of our system is to include fuzzy logic which is a form

of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate rather

than precise.
                                                       REFERENCES




        1. Frequent Pattern Mining in Web Log Data - Renata Iváncsy, István Vajk

        2. Squid-Style Transaction Logging (log formats) - http://www.cisco.com/

        3. Mining Interesting Knowledge from Weblogs: A Survey - Federico Michele Facca, Pier Luca Lanzi

        4. http://software.techrepublic.com.com/abstract.aspx

        5. http://en.wikipedia.org/

        6. http://msdn.microsoft.com/


                                                  36

Contenu connexe

Tendances

Herbal plant recognition using deep convolutional neural network
Herbal plant recognition using deep convolutional neural networkHerbal plant recognition using deep convolutional neural network
Herbal plant recognition using deep convolutional neural networkjournalBEEI
 
Paper id 252014139
Paper id 252014139Paper id 252014139
Paper id 252014139IJRAT
 
Software Bug Detection Algorithm using Data mining Techniques
Software Bug Detection Algorithm using Data mining TechniquesSoftware Bug Detection Algorithm using Data mining Techniques
Software Bug Detection Algorithm using Data mining TechniquesAM Publications
 
Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesIJTET Journal
 
IRJET- Detection of Plant Leaf Diseases using Image Processing and Soft-C...
IRJET-  	  Detection of Plant Leaf Diseases using Image Processing and Soft-C...IRJET-  	  Detection of Plant Leaf Diseases using Image Processing and Soft-C...
IRJET- Detection of Plant Leaf Diseases using Image Processing and Soft-C...IRJET Journal
 
International Journal of Computational Engineering Research(IJCER)
 International Journal of Computational Engineering Research(IJCER)  International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) ijceronline
 
Privacy Preservation and Restoration of Data Using Unrealized Data Sets
Privacy Preservation and Restoration of Data Using Unrealized Data SetsPrivacy Preservation and Restoration of Data Using Unrealized Data Sets
Privacy Preservation and Restoration of Data Using Unrealized Data SetsIJERA Editor
 
Privacy Preserving Clustering on Distorted data
Privacy Preserving Clustering on Distorted dataPrivacy Preserving Clustering on Distorted data
Privacy Preserving Clustering on Distorted dataIOSR Journals
 

Tendances (9)

1699 1704
1699 17041699 1704
1699 1704
 
Herbal plant recognition using deep convolutional neural network
Herbal plant recognition using deep convolutional neural networkHerbal plant recognition using deep convolutional neural network
Herbal plant recognition using deep convolutional neural network
 
Paper id 252014139
Paper id 252014139Paper id 252014139
Paper id 252014139
 
Software Bug Detection Algorithm using Data mining Techniques
Software Bug Detection Algorithm using Data mining TechniquesSoftware Bug Detection Algorithm using Data mining Techniques
Software Bug Detection Algorithm using Data mining Techniques
 
Characterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining TechniquesCharacterizing and Processing of Big Data Using Data Mining Techniques
Characterizing and Processing of Big Data Using Data Mining Techniques
 
IRJET- Detection of Plant Leaf Diseases using Image Processing and Soft-C...
IRJET-  	  Detection of Plant Leaf Diseases using Image Processing and Soft-C...IRJET-  	  Detection of Plant Leaf Diseases using Image Processing and Soft-C...
IRJET- Detection of Plant Leaf Diseases using Image Processing and Soft-C...
 
International Journal of Computational Engineering Research(IJCER)
 International Journal of Computational Engineering Research(IJCER)  International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
Privacy Preservation and Restoration of Data Using Unrealized Data Sets
Privacy Preservation and Restoration of Data Using Unrealized Data SetsPrivacy Preservation and Restoration of Data Using Unrealized Data Sets
Privacy Preservation and Restoration of Data Using Unrealized Data Sets
 
Privacy Preserving Clustering on Distorted data
Privacy Preserving Clustering on Distorted dataPrivacy Preserving Clustering on Distorted data
Privacy Preserving Clustering on Distorted data
 

En vedette

EPUB Media Overlays 3.0とFixed Layout(固定レイアウト)
EPUB Media Overlays 3.0とFixed Layout(固定レイアウト)EPUB Media Overlays 3.0とFixed Layout(固定レイアウト)
EPUB Media Overlays 3.0とFixed Layout(固定レイアウト)Youji Sakai
 
Weblogic Server
Weblogic ServerWeblogic Server
Weblogic Serveracsvianabr
 
WebLion Hosting: Leveraging Laziness, Impatience, and Hubris
WebLion Hosting: Leveraging Laziness, Impatience, and HubrisWebLion Hosting: Leveraging Laziness, Impatience, and Hubris
WebLion Hosting: Leveraging Laziness, Impatience, and HubrisErik Rose
 
The Django Book - Chapter 6 the django admin site
The Django Book - Chapter 6  the django admin siteThe Django Book - Chapter 6  the django admin site
The Django Book - Chapter 6 the django admin siteVincent Chien
 
Uhy global directory-2013
Uhy global directory-2013Uhy global directory-2013
Uhy global directory-2013Dawgen Global
 
saic annual reports 2003
saic annual reports 2003saic annual reports 2003
saic annual reports 2003finance42
 
Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012Igor Sfiligoi
 
East Algarve Magazine - NOVEMBER 2010
East Algarve Magazine - NOVEMBER 2010East Algarve Magazine - NOVEMBER 2010
East Algarve Magazine - NOVEMBER 2010Nick Eamag
 
Be2Awards and Be2Talks 2013 - event slides
Be2Awards and Be2Talks 2013 - event slidesBe2Awards and Be2Talks 2013 - event slides
Be2Awards and Be2Talks 2013 - event slidesBe2camp Admin
 
Securing the e health cloud
Securing the e health cloudSecuring the e health cloud
Securing the e health cloudBong Young Sung
 
Cookies
CookiesCookies
Cookiesepo273
 
Nca career wise detailer edition march 2010
Nca career wise detailer edition march 2010Nca career wise detailer edition march 2010
Nca career wise detailer edition march 2010guest9c4d5d
 
For Self-Published Authors. Creative Content Opps. Bookexpo America uPublishU...
For Self-Published Authors. Creative Content Opps. Bookexpo America uPublishU...For Self-Published Authors. Creative Content Opps. Bookexpo America uPublishU...
For Self-Published Authors. Creative Content Opps. Bookexpo America uPublishU...Susannah Greenberg
 
2345014 unix-linux-bsd-cheat-sheets-i
2345014 unix-linux-bsd-cheat-sheets-i2345014 unix-linux-bsd-cheat-sheets-i
2345014 unix-linux-bsd-cheat-sheets-iLogesh Kumar Anandhan
 
ISC West 2014 Korea Pavilion Directory
ISC West 2014 Korea Pavilion DirectoryISC West 2014 Korea Pavilion Directory
ISC West 2014 Korea Pavilion DirectoryCindy Moon
 

En vedette (20)

EPUB Media Overlays 3.0とFixed Layout(固定レイアウト)
EPUB Media Overlays 3.0とFixed Layout(固定レイアウト)EPUB Media Overlays 3.0とFixed Layout(固定レイアウト)
EPUB Media Overlays 3.0とFixed Layout(固定レイアウト)
 
Backlink iconia
Backlink iconiaBacklink iconia
Backlink iconia
 
ITIL v3 story
ITIL v3 storyITIL v3 story
ITIL v3 story
 
Weblogic Server
Weblogic ServerWeblogic Server
Weblogic Server
 
WebLion Hosting: Leveraging Laziness, Impatience, and Hubris
WebLion Hosting: Leveraging Laziness, Impatience, and HubrisWebLion Hosting: Leveraging Laziness, Impatience, and Hubris
WebLion Hosting: Leveraging Laziness, Impatience, and Hubris
 
Discover the Baltic states for studies
Discover the Baltic states for studiesDiscover the Baltic states for studies
Discover the Baltic states for studies
 
The Django Book - Chapter 6 the django admin site
The Django Book - Chapter 6  the django admin siteThe Django Book - Chapter 6  the django admin site
The Django Book - Chapter 6 the django admin site
 
Uhy global directory-2013
Uhy global directory-2013Uhy global directory-2013
Uhy global directory-2013
 
saic annual reports 2003
saic annual reports 2003saic annual reports 2003
saic annual reports 2003
 
Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012Condor overview - glideinWMS Training Jan 2012
Condor overview - glideinWMS Training Jan 2012
 
East Algarve Magazine - NOVEMBER 2010
East Algarve Magazine - NOVEMBER 2010East Algarve Magazine - NOVEMBER 2010
East Algarve Magazine - NOVEMBER 2010
 
Be2Awards and Be2Talks 2013 - event slides
Be2Awards and Be2Talks 2013 - event slidesBe2Awards and Be2Talks 2013 - event slides
Be2Awards and Be2Talks 2013 - event slides
 
Securing the e health cloud
Securing the e health cloudSecuring the e health cloud
Securing the e health cloud
 
Cookies
CookiesCookies
Cookies
 
EdCamp News & UpDates
EdCamp News & UpDatesEdCamp News & UpDates
EdCamp News & UpDates
 
Nca career wise detailer edition march 2010
Nca career wise detailer edition march 2010Nca career wise detailer edition march 2010
Nca career wise detailer edition march 2010
 
For Self-Published Authors. Creative Content Opps. Bookexpo America uPublishU...
For Self-Published Authors. Creative Content Opps. Bookexpo America uPublishU...For Self-Published Authors. Creative Content Opps. Bookexpo America uPublishU...
For Self-Published Authors. Creative Content Opps. Bookexpo America uPublishU...
 
2345014 unix-linux-bsd-cheat-sheets-i
2345014 unix-linux-bsd-cheat-sheets-i2345014 unix-linux-bsd-cheat-sheets-i
2345014 unix-linux-bsd-cheat-sheets-i
 
ISC West 2014 Korea Pavilion Directory
ISC West 2014 Korea Pavilion DirectoryISC West 2014 Korea Pavilion Directory
ISC West 2014 Korea Pavilion Directory
 
Uk norway ib directory
Uk norway ib directoryUk norway ib directory
Uk norway ib directory
 

Similaire à Datamining

knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)Kartik Kalpande Patil
 
A review on data mining
A  review on data miningA  review on data mining
A review on data miningEr. Nancy
 
A Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningA Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningBRNSSPublicationHubI
 
Data Mining – A Perspective Approach
Data Mining – A Perspective ApproachData Mining – A Perspective Approach
Data Mining – A Perspective ApproachIRJET Journal
 
Association rule visualization technique
Association rule visualization techniqueAssociation rule visualization technique
Association rule visualization techniquemustafasmart
 
The Transpose Technique On Number Of Transactions Of...
The Transpose Technique On Number Of Transactions Of...The Transpose Technique On Number Of Transactions Of...
The Transpose Technique On Number Of Transactions Of...Amanda Brady
 
Applying Classification Technique using DID3 Algorithm to improve Decision Su...
Applying Classification Technique using DID3 Algorithm to improve Decision Su...Applying Classification Technique using DID3 Algorithm to improve Decision Su...
Applying Classification Technique using DID3 Algorithm to improve Decision Su...IJMER
 
Data Mining System and Applications: A Review
Data Mining System and Applications: A ReviewData Mining System and Applications: A Review
Data Mining System and Applications: A Reviewijdpsjournal
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousingSunny Gandhi
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxRupaRani28
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applicationsSubrat Swain
 
Mining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmMining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmIRJET Journal
 
Enhanced K-Mean Algorithm to Improve Decision Support System Under Uncertain ...
Enhanced K-Mean Algorithm to Improve Decision Support System Under Uncertain ...Enhanced K-Mean Algorithm to Improve Decision Support System Under Uncertain ...
Enhanced K-Mean Algorithm to Improve Decision Support System Under Uncertain ...IJMER
 

Similaire à Datamining (20)

Introduction
IntroductionIntroduction
Introduction
 
knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)knowledge discovery and data mining approach in databases (2)
knowledge discovery and data mining approach in databases (2)
 
A review on data mining
A  review on data miningA  review on data mining
A review on data mining
 
Data Mining Applications And Feature Scope Survey
Data Mining Applications And Feature Scope SurveyData Mining Applications And Feature Scope Survey
Data Mining Applications And Feature Scope Survey
 
Hi2413031309
Hi2413031309Hi2413031309
Hi2413031309
 
A Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data MiningA Comprehensive Study on Outlier Detection in Data Mining
A Comprehensive Study on Outlier Detection in Data Mining
 
Seminar Report Vaibhav
Seminar Report VaibhavSeminar Report Vaibhav
Seminar Report Vaibhav
 
Data Mining – A Perspective Approach
Data Mining – A Perspective ApproachData Mining – A Perspective Approach
Data Mining – A Perspective Approach
 
Association rule visualization technique
Association rule visualization techniqueAssociation rule visualization technique
Association rule visualization technique
 
The Transpose Technique On Number Of Transactions Of...
The Transpose Technique On Number Of Transactions Of...The Transpose Technique On Number Of Transactions Of...
The Transpose Technique On Number Of Transactions Of...
 
FR.pptx
FR.pptxFR.pptx
FR.pptx
 
Applying Classification Technique using DID3 Algorithm to improve Decision Su...
Applying Classification Technique using DID3 Algorithm to improve Decision Su...Applying Classification Technique using DID3 Algorithm to improve Decision Su...
Applying Classification Technique using DID3 Algorithm to improve Decision Su...
 
Data Mining System and Applications: A Review
Data Mining System and Applications: A ReviewData Mining System and Applications: A Review
Data Mining System and Applications: A Review
 
Unit 3.pdf
Unit 3.pdfUnit 3.pdf
Unit 3.pdf
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
 
data mining and data warehousing
data mining and data warehousingdata mining and data warehousing
data mining and data warehousing
 
Business Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptxBusiness Intelligence and Analytics Unit-2 part-A .pptx
Business Intelligence and Analytics Unit-2 part-A .pptx
 
Fundamentals of data mining and its applications
Fundamentals of data mining and its applicationsFundamentals of data mining and its applications
Fundamentals of data mining and its applications
 
Mining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmMining Big Data using Genetic Algorithm
Mining Big Data using Genetic Algorithm
 
Enhanced K-Mean Algorithm to Improve Decision Support System Under Uncertain ...
Enhanced K-Mean Algorithm to Improve Decision Support System Under Uncertain ...Enhanced K-Mean Algorithm to Improve Decision Support System Under Uncertain ...
Enhanced K-Mean Algorithm to Improve Decision Support System Under Uncertain ...
 

Dernier

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Dernier (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Datamining

  • 1. DATAMINING PROJECT REPORT Submitted by SHY AM KUMAR S MTHIN GOPINADH AJITH JOHN ALIAS RI TO GEORGE CHERIAN 1 INTRODUCTION 1.1 ABOUT THE TOPIC Data Mining is the process of discovering new correlations, patterns, and trends by digging into (mining) large amounts of data stored in warehouses, using artificial intelligence, statistical and mathematical techniques. Data mining can also be defined as the process of extracting knowledge hidden from large volumes of raw data i.e. the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. The alternative name of Data Mining is Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, etc. Data mining is the principle of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations, and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods, it has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases". 1.2 ABOUT THE PROJECT The Project has been developed in our college in an effort to identify the most frequently visited sites, the site from where the most voluminous downloading has taken place and the sites that have been denied access when referred to by the users. 1
  • 2. Our college uses the Squid proxy server and our aim is to extract useful knowledge from one of the log files in it. After a combined scrutiny of the log files the log named access.log was decided to be used as the database. Hence our project was to mine the contents ofaccess.log .
  • 3. Finally the PERL programming language was used for manipulating the contents of the log file. PERL EXPRESS 2.5 was the platform used to develop the mining application. The log file content is in the form of standard text file requiring extensive and quick siring manipulation to retrieve the necessary contents. The programs were required to sort the mined contents in the descending order of its frequency of usage and size. CHAPTER 2 REQUIREMENT ANALYSIS 2.1 INTRODUCTION Requirement analysis is the process of gathering and interpreting facts, diagnosing problems and using the information lo recommend improvements on the system. It is a problem solving activity that requires intensive communication between the system users and system developers. Requirement analysis or study is an important phase of any system development process. The system is studied to the minutest detail and analyzed. The system analyst plays the role of an interrogator and dwells deep into the working of the present system. The system is viewed as a whole and the inputs to the system are identified. The outputs from the organization are traced through the various processing that the inputs phase through in the organization. A detailed study of these processes must be made by various techniques like Interviews, Questionnaires etc. The data collected by these sources must be scrutinized to arrive to a conclusion. The conclusion is an understanding of how the system functions. This system is called the existing system. Now, the existing system is subjected to close study and the problem areas are identified. The designer now functions as a problem solver and tries to sort out the difficulties that the enterprise faces. The solutions are given as a proposal. The proposal is then weighed with the existing system analytically and the best one is selected. The proposal is presented to the user for an endorsement by the user. The proposal is reviewed on user request and suitable changes are made. This loop ends as soon as the user is satisfied with the proposal. 3
  • 4. 2.2 PROPOSED SYSTEM In order to make the programming strategy optimal, complete and least complex a detailed understanding of data mining, related concepts and associated algorithms are required. This is to be followed by effective implementation of the algorithm using the best possible alternative. 2.3 DATAM1NING (KDD PROCESS) The Knowledge Discovery from Data process involved / includes relevant prior knowledge and goals of applications: Creating a large dataset, Preprocessing of the data, Filtering or clearing, data transformation, identifying dimcnsionally and useful feature. It also involves classification, association, regression, clustering and summarization. Choosing the mining algorithm is the most important parameter for the process. The final stage includes pattern evaluation which means visualization, transformation, removing redundant pattern etc. use of discovery knowledge of the process. DM Technology and System: Data mining methods involves neural network, evolutionary programming, memory base programming, Decision trees. Genetic Algorithms, Nonlinear regression methods these work also involve fuzzy logic, which is a superset of conventional Boolean logic that has been extended handle the concept of partial truth, partial false between completely true and complete false. The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction. Knowledge discovery provides explicit information that has a readable form and can be understood by a user. Forecasting, or predictive modeling provides predictions of future events and may be transparent and readable in some approaches (e.g. rule based systems) and opaque in others such as neural networks. Moreover, some data mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery. Metadata, or data about a given data set, are often expressed in a condensed data mine-able format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts. 4
  • 5. Data Mining is the process of discovering new correlations, patterns, and trends by digging into (mining) large amounts of data stored in warehouses, using artificial intelligence, statistical and mathematical techniques. Data mining can also be defined as the process of extracting knowledge hidden from large volumes of raw data i.e. the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. The alternative name of Data Mining is Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, etc. The importance of collecting data thai reflect your business or scientific activities to achieve competitive advantage is widely recognized now. Powerful systems for collecting data and managing it in large databases are in place in all large and mid- range companies. LOG files Preprocessing Data cleaning Session identification Data conversion mjnsup Frequent mjnsup Frequent mjnsup Frequent Iternset Sequence Subtree Discovery Discovery Discovery | Pattern RESULTS i Analysis Figure 2.3.1 : Process of web usage mining However, the bottleneck of turning this data into your success is the difficulty of extracting knowledge about the system you study from the collected data. DSS are computerize tools develop assist decision makers through the process of making of decision. This is inherently prescription which enhances decision making in some way. DSS are closely related to the concept of rationality which means the tendency to act in a reasonable'way to make good decision. To produce the key decision for an organization involve product/service, distribution of the product using different distribution channel, calculation /computation of the output on different time and space, prediction/trend of the output for 5
  • 6. individual product or service with in estimated time frame and finally the schedule of the production on the basis of demand, capacity and resource. The main aim and objective of the work is to develop a system on dynamic decision which depend on product life cycle individual characteristics graph analysis has been done to give enhance and advance thought to analysis the pattern of the product. The system has been reviewed in terms of local and global aspect. 2.4 WORKING OF DATAMINTNG While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought: Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials. Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities. Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining. Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an otitdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes. Data mining consists of five major elements: •Extract, transform, and load transaction data onto the data warehouse system. •Store and manage the data in a multidimensional database system. •Provide data access to business analysts and information technology professionals. 6
  • 7. •Analyze the data by application software. •Present the data in a useful format, such as a graph or table. 1 .Classification and Regression Trees (CART) and Chi Square 2.Detection (CHAID) : CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART' segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID. •Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the A:-nearest neighbor technique. •Rule induction: The extraction of useful if-then rules from data based on statistical significance. • Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relation. 2.5 DATA MINING ALGORITHMS The data mining algorithm is the mechanism that creates mining models. To create a model, an algorithm first analyzes a set of data, looking for specific patterns and trends. The algorithm then uses the results of this analysis to define the parameters of the mining model. The mining model that an algorithm creates can take various forms, including: •A set of rules that describe how products are grouped together in a transaction. •A decision tree that predicts whether a particular customer will buy a product. •A mathematical model that forecasts sales. • A set of clusters that describe how the cases in a dataset are related. 7
Microsoft SQL Server 2005 Analysis Services (SSAS) provides several algorithms for use in your data mining solutions. These algorithms are a subset of all the algorithms that can be used for data mining. You can also use third-party algorithms that comply with the OLE DB for Data Mining specification. For more information about third-party algorithms, see Plugin Algorithms. Analysis Services includes the following algorithm types:

•Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset. An example of a classification algorithm is the Decision Trees Algorithm.
•Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset. An example of a regression algorithm is the Time Series Algorithm.
•Segmentation algorithms divide data into groups, or clusters, of items that have similar properties. An example of a segmentation algorithm is the Clustering Algorithm.
•Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
•Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow. An example of a sequence analysis algorithm is the Sequence Clustering Algorithm.

2.6 SOFTWARE REQUIREMENTS

OPERATING SYSTEM   : WINDOWS XP SP2
PERL COMPILER      : ACTIVE PERL
PERL SCRIPT EDITOR : PERL EXPRESS
SERVER SOFTWARE    : IIS SERVER
2.7 FUZZY LOGIC

Fuzzy logic is a form of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate rather than precise. Just as in fuzzy set theory the set membership values can range (inclusively) between 0 and 1, in fuzzy logic the degree of truth of a statement can range between 0 and 1 and is not constrained to the two truth values {true, false} as in classic predicate logic. And when linguistic variables are used, these degrees may be managed by specific functions, as discussed below.

Both fuzzy degrees of truth and probabilities range between 0 and 1 and hence may seem similar at first. However, they are conceptually distinct; fuzzy truth represents membership in vaguely defined sets, not the likelihood of some event or condition as in probability theory. For example, if a 100-ml glass contains 30 ml of water, then, for two fuzzy sets, Empty and Full, one might define the glass as being 0.7 empty and 0.3 full. Note that the concept of emptiness would be subjective and thus would depend on the observer or designer. Another designer might equally well design a set membership function where the glass would be considered full for all values down to 50 ml. A probabilistic setting would first define a scalar variable for the fullness of the glass, and second, conditional distributions describing the probability that someone would call the glass full given a specific fullness level. Note that the conditioning can be achieved by having a specific observer that randomly selects the label for the glass, a distribution over deterministic observers, or both. While fuzzy logic avoids talking about randomness in this context, this simplification at the same time obscures what is exactly meant by the statement 'the glass is 0.3 full'.

2.7.1 APPLYING FUZZY TRUTH VALUES

A basic application might characterize sub-ranges of a continuous variable. For instance, a temperature measurement for anti-lock brakes might have several separate membership functions defining particular temperature ranges needed to control the brakes properly. Each function maps the same temperature value to a truth value in the 0 to 1 range. These truth values can then be used to determine how the brakes should be controlled.

In this image, cold, warm, and hot are functions mapping a temperature scale. A point on that scale has three "truth values", one for each of the three functions. The vertical line in the image represents a particular temperature that the three arrows (truth values) gauge. Since the red arrow points to zero, this temperature may be interpreted as "not hot". The orange arrow (pointing at 0.2) may describe it as "slightly warm" and the blue arrow (pointing at 0.8) "fairly cold".

2.7.2 FUZZY LINGUISTIC VARIABLES

While variables in mathematics usually take numerical values, in fuzzy logic applications non-numeric linguistic variables are often used to facilitate the expression of rules and facts. A linguistic variable such as age may have a value such as young or its opposite, old. However, the great utility of linguistic variables is that they can be modified via linguistic operations on the primary terms. For instance, if young is associated with the value 0.7, then very young is automatically deduced as having the value 0.7 * 0.7 = 0.49, and not very young gets the value (1 - 0.49), i.e. 0.51. In this example the operator very(X) was defined as X * X; however, in general these operators may be uniformly but flexibly defined to fit the application, resulting in a great deal of power for the expression of both rules and fuzzy facts.
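As an illustration of the hedge arithmetic described above, the following short Perl fragment computes the membership values for young, very young and not very young. It is only a sketch based on the example values given in this section (the membership 0.7 and the definitions very(x) = x * x and not(x) = 1 - x); it is not part of the project code.

use strict;
use warnings;

# Illustrative sketch of the fuzzy hedges discussed above.
# Assumption: very(x) = x * x and not(x) = 1 - x, with young = 0.7.
sub very  { my ($x) = @_; return $x * $x; }
sub f_not { my ($x) = @_; return 1 - $x;  }

my $young          = 0.7;
my $very_young     = very($young);        # 0.49
my $not_very_young = f_not($very_young);  # 0.51

printf "young = %.2f, very young = %.2f, not very young = %.2f\n",
       $young, $very_young, $not_very_young;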
CHAPTER 3 SYSTEM DESIGN

System design is the solution to the creation of a new system. This phase is composed of several steps and focuses on the detailed implementation of the feasible system. Its emphasis is on translating design specifications into performance specifications. System design has two phases of development: logical design and physical design.

During the logical design phase the analyst describes inputs (sources), outputs (destinations), databases (data stores) and procedures (data flows), all in a format that meets the user's requirements. The analyst also specifies the user needs at a level that virtually determines the information flow into and out of the system and the data resources. Here the logical design is done through data flow diagrams and database design.

The logical design is followed by the physical design, or coding. Physical design produces the working system by defining the design specifications, which tell the programmers exactly what the candidate system must do. The programmers write the necessary programs that accept input from the user, perform the necessary processing on the accepted data, and produce the required report on hard copy or display it on the screen.

3.1 DATABASE DESIGN

The data mining process involves the manipulation of large data sets. Hence, a large database is a key requirement of the mining operation. An ordered set of information is to be extracted from this database. The overall objective in the development of database technology has been to treat data as an organizational resource and as an integrated whole. A DBMS allows data to be protected and organized separately from other resources. A database is an integrated collection of data. The most significant form of data as seen by the programmers is data as stored on direct access storage devices; this is the difference between logical and physical data.

Database files are the key source of information into the system. Database design is the process of designing these files, which should be properly planned for the collection, accumulation, editing and retrieval of the required information. The organization of data in a database aims to achieve three major objectives:

•Data integration.
•Data integrity.
•Data independence.
A large data set is difficult to parse and to interpret the knowledge contained in it. Since the database used in this project is the log file of a proxy server called Squid, a detailed study of Squid-style transaction logging is also required.

3.2 PROXY SERVER

A proxy server is a server (a computer system or an application program) which services the requests of its clients by forwarding requests to other servers. A client connects to the proxy server, requesting some service, such as a file, connection, web page, or other resource, available from a different server. The proxy server provides the resource by connecting to the specified server and requesting the service on behalf of the client. A proxy server may optionally alter the client's request or the server's response, and sometimes it may serve the request without contacting the specified server. In this case, it would 'cache' the first request to the remote server, so it could save the information for later, and make everything as fast as possible. A proxy server that passes all requests and replies unmodified is usually called a gateway or sometimes a tunneling proxy. A proxy server can be placed in the user's local computer or at specific key points between the user and the destination servers or the Internet.

•Caching proxy server: A proxy server can service requests without contacting the specified server, by retrieving content saved from a previous request, made by the same client or even other clients. This is called caching.

•Web proxy: A proxy that focuses on WWW traffic is called a "web proxy". The most common use of a web proxy is to serve as a web cache. Most proxy programs (e.g. Squid, NetCache) provide a means to deny access to certain URLs in a blacklist, thus providing content filtering.

•Content filtering web proxy: A content filtering web proxy server provides administrative control over the content that may be relayed through the proxy. It is commonly used in commercial and non-commercial organizations (especially schools) to ensure that Internet usage conforms to an acceptable use policy.

•Anonymizing proxy server:
An anonymous proxy server (sometimes called a web proxy) generally attempts to anonymize web surfing. These can easily be overridden by site administrators, and thus rendered useless in some cases. There are different varieties of anonymizers.

•Hostile proxy: Proxies can also be installed by online criminals in order to eavesdrop upon the data flow between the client machine and the web. All accessed pages, as well as all forms submitted, can be captured and analyzed by the proxy operator.

3.3 THE SQUID PROXY SERVER

Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently-requested web pages. Squid has extensive access controls and makes a great server accelerator. It runs on Unix and Windows and is licensed under the GNU GPL. Squid is used by hundreds of Internet providers world-wide to provide their users with the best possible web access. Squid optimizes the data flow between client and server to improve performance and caches frequently-used content to save bandwidth. Squid can also route content requests to servers in a wide variety of ways to build cache server hierarchies which optimize network throughput.

Thousands of web sites around the Internet use Squid to drastically increase their content delivery. Squid can reduce server load and improve delivery speeds to clients. Squid can also be used to deliver content from around the world, copying only the content being used rather than inefficiently copying everything. Finally, Squid's advanced content routing configuration allows you to build content clusters to route and load balance requests via a variety of web servers.

Squid is a fully-featured HTTP/1.0 proxy which is almost HTTP/1.1 compliant. Squid offers a rich access control, authorization and logging environment to develop web proxy and content serving applications. Squid is one of the projects which grew out of the initial content distribution and caching work in the mid-90s. It has grown to include extra features such as powerful access control, authorization, logging, content distribution/replication, traffic management and shaping, and more. It has many workarounds, new and old, to deal with incomplete and incorrect HTTP implementations.
Squid allows Internet providers to save on their bandwidth through content caching. Cached content means data is served locally and users will see this through faster download speeds for frequently-used content. A well-tuned proxy server (even without caching!) can improve user speeds purely by optimizing TCP flows. It is easy to tune servers to deal with the wide variety of latencies found on the Internet, something that desktop environments just aren't tuned for. Squid allows ISPs to avoid needing to spend large amounts of money on upgrading core equipment and transit links to cope with ever-growing content demand. It also allows ISPs to prioritize and control certain web content types where dictated by technical or economic reasons.

3.3.1 SQUID-STYLE TRANSACTION LOGGING

Transaction logs allow administrators to view the traffic that has passed through the Content Engine. Typical fields in the transaction log are the date and time when a request was made, the URL that was requested, whether it was a cache hit or a cache miss, the type of request, the number of bytes transferred, and the source IP. High-performance caching presents additional challenges other than how to quickly retrieve objects from storage, memory, or the web. Administrators of caches are often interested in what requests have been made of the cache and what the results of these requests were. This information is then used for such applications as:

•Problem identification and solving
•Load monitoring
•Billing
•Statistical analysis
•Security problems
•Cost analysis and provisioning
The Squid log file format is:

time elapsed remotehost code/status bytes method URL rfc931 peerstatus/peerhost type

A Squid log entry, for example, looks like this:

1012429341.115 100 172.16.100.152 TCP_REFRESH_MISS/304 1100 GET http://www.cisco.com/images/homepage/news.gif - DIRECT/www.cisco.com -

Squid logs are a valuable source of information about cache workloads and performance. The logs record not only access information but also system configuration errors and resource consumption, such as memory and disk space.
Time : UNIX time stamp as Coordinated Universal Time (UTC) seconds with a millisecond resolution.

Elapsed : Length of time in milliseconds that the cache was busy with the transaction. Note: entries are logged after the reply has been sent, not during the lifetime of the transaction.

Remote Host : IP address of the requesting instance.

Code/Status : Two entries separated by a slash. The first entry contains information on the result of the transaction: the kind of request, how it was satisfied, or in what way it failed. The second entry contains the HTTP result code.

Bytes : Amount of data delivered to the client. This does not constitute the net object size, because headers are also counted. Also, failed requests may deliver an error page, the size of which is also logged here.

Method : Request method used to obtain an object, for example GET.

URL : URL requested.

Rfc931 : Contains the authentication server's identification or lookup names of the requesting client. This field will always be a "-" (dash).

Peerstatus/Peerhost : Two entries separated by a slash. The first entry represents a code that explains how the request was handled, for example by forwarding it to a peer, or returning the request to the source. The second entry contains the name of the host from which the object was requested. This host may be the origin site, a parent, or any other peer. Also note that the host name may be numerical.

Type : Content type of the object as seen in the HTTP reply header. In the ACNS 4.1 software, this field will always contain a "-" (dash).

Table 3.3.1.1 : Squid-Style Format
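As an illustration, the example entry shown earlier can be separated into the fields of Table 3.3.1.1 with Perl's split. The fragment below is only a sketch of that mapping, assuming the whitespace-separated native format; it is not taken from the project code.

use strict;
use warnings;

# Split one native-format Squid log entry into the fields of Table 3.3.1.1.
my $entry = '1012429341.115 100 172.16.100.152 TCP_REFRESH_MISS/304 1100 GET '
          . 'http://www.cisco.com/images/homepage/news.gif - DIRECT/www.cisco.com -';

my ($time, $elapsed, $remotehost, $code_status, $bytes,
    $method, $url, $rfc931, $peer, $type) = split /\s+/, $entry;

# The code/status field itself holds two values separated by a slash.
my ($result, $http_status) = split m{/}, $code_status, 2;

print "$remotehost $method $url -> $result (HTTP $http_status), $bytes bytes\n";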
3.3.2 SQUID LOG FILES

The logs are a valuable source of information about Squid workloads and performance. The logs record not only access information, but also system configuration errors and resource consumption (e.g. memory, disk space). There are several log files maintained by Squid. Some have to be explicitly activated during compile time; others can safely be deactivated during run time. There are a few basic points common to all log files. The time stamps logged into the log files are usually UTC seconds unless stated otherwise. The initial time stamp usually contains a millisecond extension.

SQUID.OUT

If we run Squid from the RunCache script, a file squid.out contains the Squid startup times, and also all fatal errors, e.g. as produced by an assert() failure. If we are not using RunCache, we will not see such a file.

CACHE.LOG

The cache.log file contains the debug and error messages that Squid generates. If we start Squid using the default RunCache script, or start it with the -s command line option, a copy of certain messages will go into the syslog facilities. It is a matter of personal preference whether to use a separate file for the Squid log data. From the area of automatic log file analysis, the cache.log file does not have much to offer. We will usually look into this file for automated error reports, when programming Squid, testing new features, or searching for the reasons of a perceived misbehavior.

USERAGENT.LOG

The user agent log file is only maintained if:

1. We configured the compile-time --enable-useragent-log option, and
2. We pointed the useragent_log configuration option to a file.

From the user agent log file we are able to find out about the distribution of browsers among our clients. Using this option in conjunction with a loaded production Squid might not be the best of all ideas.

STORE.LOG

The store.log file covers the objects currently kept on disk, as well as removed ones. As a kind of transaction log it is usually used for debugging purposes. A definitive statement as to whether an object resides on your disks is only possible after analyzing the complete log file; the release (deletion) of an object may be logged at a later time than the swap-out (save to disk). The store.log file may be of interest to log file analysis which looks into the objects on your disks and the time they spend there, or how many times a hot object was accessed (the latter may be covered by another log file, too). With knowledge of the cache_dir configuration option, this log file allows for a URL to filename mapping without recursing your cache disks. However, the Squid developers recommend treating store.log primarily as a debug file, and so should you, unless you know what you are doing.
HIERARCHY.LOG

This log file exists for Squid-1.0 only. The format is:

[date] URL peerstatus peerhost

ACCESS.LOG

Most log file analysis programs are based on the entries in access.log. Currently, there are two file formats possible for this log file, depending on the configuration of the emulate_httpd_log option. By default, Squid will log in its native log file format. If the above option is enabled, Squid will log in the common log file format as defined by the CERN web daemon. The Common Logfile Format is used by numerous HTTP servers. This format consists of the following seven fields:

remotehost rfc931 authuser [date] "method URL" status bytes

It is parsable by a variety of tools. The common format contains different information than the native log file format: the HTTP version is logged, which is not logged in the native format, but the native format contains more information for an administrator interested in cache evaluation. The log contents include the site name, the IP address of the requesting instance, the date and time in Unix time format, the bytes transferred, the request method and other such features. Log files are usually large in size, large enough to be mined; however, the interpretation of an entire line of input changes with a change in the header. The access.log is the Squid log that has been made use of in this project. The log file is a plain text file, an excerpt of which is shown below:
Figure 3.3.2.1 : Access.log used as database

3.3.3 SQUID RESULT CODES

The TCP_ codes refer to requests on the HTTP port (usually 3128). The UDP_ codes refer to requests on the ICP port (usually 3130). If ICP logging was disabled using the log_icp_queries option, no ICP replies will be logged.

TCP_HIT
A valid copy of the requested object was in the cache.

TCP_MISS
The requested object was not in the cache.

TCP_REFRESH_HIT
The requested object was cached but STALE. The IMS query for the object resulted in "304 Not Modified".

TCP_REF_FAIL_HIT
The requested object was cached but STALE. The IMS query failed and the stale object was delivered.

TCP_REFRESH_MISS
The requested object was cached but STALE. The IMS query returned the new content.

TCP_CLIENT_REFRESH_MISS
The client issued a "no-cache" pragma, or some analogous cache control command, along with the request. Thus, the cache has to re-fetch the object.
TCP_IMS_HIT
The client issued an IMS request for an object which was in the cache and fresh.

TCP_SWAPFAIL_MISS
The object was believed to be in the cache, but could not be accessed.

TCP_NEGATIVE_HIT
Request for a negatively cached object, e.g. "404 Not Found", which the cache believes to be inaccessible. Also refer to the explanations for negative_ttl in your squid.conf file.

TCP_MEM_HIT
A valid copy of the requested object was in the cache and it was in memory, thus avoiding disk accesses.

TCP_DENIED
Access was denied for this request.

TCP_OFFLINE_HIT
The requested object was retrieved from the cache during offline mode. The offline mode never validates any object.

UDP_HIT
A valid copy of the requested object was in the cache.

UDP_MISS
The requested object is not in this cache.

UDP_DENIED
Access was denied for this request.

UDP_INVALID
An invalid request was received.

UDP_MISS_NOFETCH
During "-Y" startup, or during frequent failures, a cache in hit-only mode will return either UDP_HIT or this code. Neighbors will thus only fetch hits.

NONE
Seen with errors and cache manager requests.

3.4 HTTP RESULT CODES

These are taken from RFC 2616 and verified for Squid. Squid-2 uses almost all codes except 307 (Temporary Redirect), 416 (Request Range Not Satisfiable), and 417 (Expectation Failed). Extra codes include 0 for a result code being unavailable, and 600 to signal an invalid header, a proxy error. Also, some definitions were added as per RFC 2518. Yes, there are really two entries for status code 424; compare with http_status in src/enums.h:

000  USED MOSTLY WITH UDP TRAFFIC
100  CONTINUE
101  SWITCHING PROTOCOLS
102  PROCESSING
200  OK
201  CREATED
202  ACCEPTED
203  NON-AUTHORITATIVE INFORMATION
204  NO CONTENT
205  RESET CONTENT
206  PARTIAL CONTENT
207  MULTI STATUS
300  MULTIPLE CHOICES
301  MOVED PERMANENTLY
302  MOVED TEMPORARILY
304  NOT MODIFIED
305  USE PROXY
307  TEMPORARY REDIRECT
400  BAD REQUEST
401  UNAUTHORIZED
402  PAYMENT REQUIRED
403  FORBIDDEN
404  NOT FOUND
405  METHOD NOT ALLOWED
406  NOT ACCEPTABLE
407  PROXY AUTHENTICATION REQUIRED
408  REQUEST TIMEOUT
409  CONFLICT
410  GONE
411  LENGTH REQUIRED
412  PRECONDITION FAILED
413  REQUEST ENTITY TOO LARGE
414  REQUEST URI TOO LARGE
415  UNSUPPORTED MEDIA TYPE
416  REQUEST RANGE NOT SATISFIABLE
417  EXPECTATION FAILED
422  UNPROCESSABLE ENTITY
424  LOCKED
424  FAILED DEPENDENCY
500  INTERNAL SERVER ERROR
501  NOT IMPLEMENTED
502  BAD GATEWAY

TABLE 3.4.1 : HTTP result codes

3.5 HTTP REQUEST METHODS

Squid recognizes several request methods as defined in RFC 2616. Newer versions of Squid also recognize the RFC 2518 "HTTP Extensions for Distributed Authoring" (WebDAV) extensions.

GET        OBJECT RETRIEVAL AND SIMPLE SEARCHES.
HEAD       METADATA RETRIEVAL.
POST       SUBMIT DATA (TO A PROGRAM).
PUT        UPLOAD DATA (E.G. TO A FILE).
DELETE     REMOVE RESOURCE (E.G. FILE).
TRACE      APPLICATION LAYER TRACE OF REQUEST ROUTE.
OPTIONS    REQUEST AVAILABLE COMM. OPTIONS.
CONNECT    TUNNEL SSL CONNECTION.
PROPFIND   RETRIEVE PROPERTIES OF AN OBJECT.
PROPPATCH  CHANGE PROPERTIES OF AN OBJECT.
COPY       CREATE A DUPLICATE OF SRC IN DST.
MOVE       ATOMICALLY MOVE SRC TO DST.
LOCK       LOCK AN OBJECT AGAINST MODIFICATIONS.
UNLOCK     UNLOCK AN OBJECT.

TABLE 3.4.2 : HTTP request methods

CHAPTER 4 CODING

4.1 FEATURES OF THE LANGUAGE (PERL)

Perl (Practical Extraction and Report Language) is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It is also a good language for many system management tasks.

•The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).
•It combines (in the author's opinion, anyway) some of the best features of C, sed, awk, and sh, so people familiar with those languages should have little difficulty with it. (Language historians will also note some vestiges of Pascal and even BASIC-PLUS.)
•Unlike most UNIX utilities, Perl does not arbitrarily limit the size of our data: if we have the memory, Perl can slurp in a whole file as a single string, and recursion is of unlimited depth.
•The hash tables used by associative arrays grow as necessary to prevent degraded performance, and Perl uses sophisticated pattern matching techniques to scan large amounts of data very quickly.
•Although optimized for scanning text, Perl can also deal with binary data, and can make dbm files look like associative arrays (where dbm is available).
•Setuid Perl scripts are safer than C programs through a dataflow tracing mechanism which prevents many stupid security holes.
•The overall structure of Perl derives broadly from C. Perl is procedural in nature, with variables, expressions, assignment statements, brace-delimited code blocks, control structures, and subroutines.
•Perl also takes features from shell programming. All variables are marked with leading sigils, which unambiguously identify the data type (scalar, array, hash, etc.) of the variable in context. Importantly, sigils allow variables to be interpolated directly into strings.
•Perl has many built-in functions which provide tools often used in shell programming (though many of these tools are implemented by programs external to the shell), like sorting, and calling on system facilities.
•Perl takes lists from Lisp, associative arrays (hashes) from AWK, and regular expressions from sed. These simplify and facilitate many parsing, text handling, and data management tasks.
•In Perl 5, features were added that support complex data structures, first-class functions (i.e., closures as values), and an object-oriented programming model. These include references, packages, class-based method dispatch, and lexically scoped variables, along with compiler directives.
•All versions of Perl do automatic data typing and memory management. The interpreter knows the type and storage requirements of every data object in the program; it allocates and frees storage for them as necessary using reference counting (so it cannot deallocate circular data structures without manual intervention). Legal type conversions, for example from number to string, are done automatically at run time; illegal type conversions are fatal errors.
•Perl has a context-sensitive grammar which can be affected by code executed during an intermittent run-time phase. Therefore Perl cannot be parsed by a straight Lex/Yacc lexer/parser combination. Instead, the interpreter implements its own lexer, which coordinates with a modified GNU bison parser to resolve ambiguities in the language.
•The execution of a Perl program divides broadly into two phases: compile time and run time. At compile time, the interpreter parses the program text into a syntax tree. At run time, it executes the program by walking the tree.
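Of these features, hashes and regular expressions are the ones most used in this project. The fragment below is a small illustrative sketch only; the sample request line and variable names are invented for the example and are not taken from the project code.

use strict;
use warnings;

# A hash used as an associative counter and a regular expression used to
# pull the host name out of a URL; both features are described above.
my %hits;
my $sample = 'GET http://www.google.com/search?q=perl';   # invented sample line
if ($sample =~ m{https?://([^/\s]+)}) {
    $hits{$1}++;                       # the hash element springs into existence
}
print "www.google.com seen $hits{'www.google.com'} time(s)\n";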
4.2 PERL CODE FOR MINING

FIGURE 4.2.1: PERL Program for mining

The Perl code to mine access.log makes use of the split() construct, which is required to split a line of text in the log file. The extracted site name is pushed into an array for comparison purposes. After the required comparison to determine the number of times that a site has been repeated, both the site and its corresponding count are inserted into a hash array. The hash array is then used to sort the site names in the descending order of their counts. The count and the corresponding site name are displayed as the output.
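Since the listing in Figure 4.2.1 is reproduced only as a screenshot, the following sketch reconstructs the logic described above. It is an approximation rather than the exact program: the URL is assumed to be the seventh whitespace-separated field (see Table 3.3.1.1) and the input file is assumed to be named access.log.

use strict;
use warnings;

# Sketch: count how many times each site occurs in access.log and print the
# sites in descending order of frequency, as described in the text above.
open(my $dat, '<', 'access.log') or die "Cannot open access.log: $!";

my %count;                                   # site name => number of requests
while (my $line = <$dat>) {
    chomp $line;
    my @field = split /\s+/, $line;          # native Squid format, space separated
    my $url   = $field[6] or next;           # the seventh field is the requested URL
    my ($site) = $url =~ m{^(?:[a-z]+://)?([^/]+)}i;   # keep only the site name
    $count{$site}++ if defined $site;
}
close $dat;

print "TOTAL SITES VISITED : ", scalar(keys %count), "\n";
print "SITES SORTED IN ORDER OF FREQUENCY OF USAGE :\n";
foreach my $site (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$count{$site}\t$site\n";
}

A hash is used here because, as noted in section 4.1, Perl hashes grow as needed, so the per-site tally is a single increment; the sort block then orders the site names by their counts in descending order.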
4.3 DISPLAYED OUTPUT

FIGURE 4.2.2 : VISITED SITES

This is the output of the program shown in Figure 4.2.1. It displays the sites that have been requested, including those that were visited successfully and even those that were denied access by the proxy server. Hence, the log records all the transactions that have been successful as well as those that have failed.

TOTAL SITES VISITED : 5238
SITES SORTED IN ORDER OF FREQUENCY OF USAGE:
Figure 4.2.3 : Sites sorted in frequency of usage

BYTES DOWNLOADED    SITE NAME
Figure 4.2.4 : Sites sorted in terms of bytes downloaded

NUMBER OF SITES THAT WERE DENIED ACCESS
ACCESS DENIED SITES

Figure 4.2.5 : Sites that were denied access
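The report in Figure 4.2.5 can be obtained with the same parsing approach by filtering on the TCP_DENIED result code described in section 3.3.3. The fragment below is only an illustrative sketch of that idea, not the program that produced the figure; the field positions and the file name access.log are assumed as before.

use strict;
use warnings;

# Sketch: list the sites whose requests were denied by the proxy (TCP_DENIED).
open(my $dat, '<', 'access.log') or die "Cannot open access.log: $!";

my %denied;
while (my $line = <$dat>) {
    chomp $line;
    my ($code_status, $url) = (split /\s+/, $line)[3, 6];   # result code and URL
    next unless defined $url and defined $code_status;
    $denied{$url}++ if $code_status =~ /^TCP_DENIED\b/;
}
close $dat;

print "NUMBER OF SITES THAT WERE DENIED ACCESS : ", scalar(keys %denied), "\n";
print "ACCESS DENIED SITES :\n";
print "$_\n" for sort keys %denied;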
CHAPTER 5 TESTING

5.1 SYSTEM TESTING

Testing is a set of activities that can be planned and conducted systematically. Testing begins at the module level and works towards the integration of the entire computer-based system. Nothing is complete without testing, as it is vital to the success of the system.

Testing Objectives: There are several rules that can serve as testing objectives. They are:

•Testing is a process of executing a program with the intent of finding an error.
•A good test case is one that has a high probability of finding an undiscovered error.
•A successful test is one that uncovers an undiscovered error.

If testing is conducted successfully according to the objectives stated above, it will uncover errors in the software. Testing also demonstrates that the software functions appear to be working according to the specification and that the performance requirements appear to have been met.

There are three ways to test a program:

•For correctness
•For implementation efficiency
•For computational complexity

Tests for correctness are supposed to verify that a program does exactly what it was designed to do. This is much more difficult than it may at first appear, especially for large programs. Tests for implementation efficiency attempt to find ways to make a correct program faster or use less storage. It is a code-refining process, which re-examines the implementation phase of algorithm development. Tests for computational complexity amount to an experimental analysis of the complexity of an algorithm, or an experimental comparison of two or more algorithms which solve the same problem.

Testing Correctness
The following ideas should be a part of any testing plan:

•Preventive measures
•Spot checks
•Testing all parts of the program
•Test data
•Looking for trouble
•Time for testing
•Re-testing

The data is entered in all forms separately, and whenever an error occurs it is corrected immediately. A quality team deputed by the management verified all the necessary documents and tested the software while entering the data at all levels. The entire testing process can be divided into three phases:

•Unit testing
•Integration testing
•Final/system testing

5.1.1 UNIT TESTING

As this system is partially a GUI-based Windows application, the following were tested in this phase:

•Tab order
•Reverse tab order
•Field length
•Front-end validations

In our system, unit testing has been handled successfully. Test data was given to each and every module in all respects and the desired output was obtained. Each module has been tested and found to be working properly.
5.1.2 INTEGRATION TESTING

Test data should be prepared carefully, since the data alone determines the efficiency and accuracy of the system. Artificial data are prepared solely for testing. Every program validates the input data.

5.1.3 VALIDATION TESTING

In this, all the code modules were tested individually, one after the other. The following were tested in all the modules:

•Loop testing
•Boundary value analysis
•Equivalence partitioning testing

In our case all the modules were combined and given the test data. The combined module works successfully without any side effects on other programs. Everything was found to be working fine.

5.1.4 OUTPUT TESTING

This is the final step in testing. In this step the entire system was tested as a whole with all forms, code, modules and class modules. This form of testing is popularly known as black box testing or system testing. Black box testing methods focus on the functional requirements of the software. That is, black box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements of a program. Black box testing attempts to find errors in the following categories: incorrect or missing functions, interface errors, errors in data structures or external database access, performance errors, and initialization and termination errors.

CHAPTER 6 CONCLUSION

The project report entitled "DATAMINING USING FUZZY LOGIC" has come to its final stage. The system has been developed with such care that it is free of errors and at the same time efficient and less time consuming. The important thing is that the system is robust. We have tried our level best to complete the project with all its required features.
However, due to time constraints the fuzzy implementation over the mined data has not been possible. Since the queries related to mining require the proper retrieval of data, the actual count is preferred over applying fuzziness to the count.

APPENDICES

OVERVIEW OF PERL EXPRESS 2.5

Perl Express 2.5 is a free integrated development environment (IDE) for Perl with multiple tools for writing and debugging scripts. It features multiple CGI scripts for editing, running, and debugging; multiple input files; full server simulation; queries created from an internal Web browser or query editor; testing of MySQL and MS Access scripts; interactive I/O; a directory window; a code library; and code templates. Perl Express allows us to set environment variables used for running and debugging scripts. It has a customizable code editor with syntax highlighting, unlimited text size, printing, line numbering, bookmarks, column selection, a search-and-replace engine, and multilevel undo/redo operations. Version 2.5 adds command line support and bug fixes.

RESUME

The developed system is flexible and changes can be made easily. The system is developed with an insight into the necessary modifications that may be required in the future. Hence the system can be maintained successfully without much rework. One of the main future enhancements of our system is to include fuzzy logic, which is a form of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate rather than precise.

REFERENCES

1. Frequent Pattern Mining in Web Log Data - Renata Ivancsy, Istvan Vajk
2. Squid-Style Transaction Logging (log formats) - http://www.cisco.com/
3. Mining Interesting Knowledge from Weblogs: A Survey - Federico Michele Facca, Pier Luca Lanzi
4. http://software.techrepublic.com.com/abstract.aspx
5. http://en.wikipedia.org/
6. http://msdn.microsoft.com/