Mikhail Golovnya, Dan Steinberg, Scott Cardell
        golomi@salford-systems.com
   Question 1: Given a set of page views, will the visitor
    view another page on the site or will the visitor leave?

   Question 2: Given a set of page views, which product
    will the visitor view in the remainder of the session?

   Question 3: Given a set of purchases over a period of
    time, characterize visitors who spend more than $12
    (order amount) on an average order at the site?

   Questions 4 and 5: insight versions of questions 1 and
    2
   Gazelle.com is a leg-wear and leg-care web
    retailer

   Soft-launch: Jan 30, 2000

   Hard-launch: Feb 29, 2000

    ◦ With an Ally McBeal TV ad on 28th and strong $10 off
      promotion

   Training set: 2 months

   Test sets: one month (split into two test sets)
   (insert decision tree process)
   (insert home page image)
   Web Application Server:

    ◦ Takes care of sessionizing (unique session ID is assigned
      to each user’s session)

    ◦ Takes care of registration and logging in (unique
      customer ID is assigned to each registered user)

    ◦ Uses dynamic HTML: a unique page view is identified via
      a combination of page view template (*.jhtml or *.jsp)
      and query parameters (product ID, vendor
      ID, assortment ID, etc.)

   All data supplied come directly from the web
    application server logs
   Acxiom enhancements: age, gender, marital status, vehicle
    lifestyle, own/rent, etc.

   Keynote records (about 250,000) removed. They hit the
    home page 3 times a minute, 24 hours a day

   Personal information was removed, including:
    Names, addresses, login, credit card, phones, host
    name/IP, verification question/answer. Cookie, e-mail
    were obfuscated.

   Test users were removed based on multiple criteria (e.g.
    credit card number) not available to competitors

   Original data and aggregated data (to session level) were
    provided
   CLICKS
    ◦ Contains click-stream information

    ◦ Each record is a page view

    ◦ Basis for questions 1 and 2

    ◦ Each sequence of clicks forms a session

    ◦ Session continues for any page view except for the last

   ORDER LINES
    ◦ Contains order information

    ◦ Each record is an order line

    ◦ Order is a collection of order lines with the same order ID

    ◦ Basis for question 3
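
A minimal sketch of how the question 1 target could be derived from CLICKS, given that a session continues for every page view except the last. The (session ID, sequence number) pair layout is an assumption made for illustration, not the actual CLICKS schema.

```python
from collections import defaultdict

def label_continuation(clicks):
    """Map each (session_id, sequence) to True if the session continues after it.

    `clicks` is a list of (session_id, sequence) pairs; this layout is a
    hypothetical simplification of the CLICKS table.
    """
    # First pass: find the highest sequence number seen in each session.
    last_seq = defaultdict(int)
    for session_id, sequence in clicks:
        last_seq[session_id] = max(last_seq[session_id], sequence)
    # Second pass: every click before the last one is a "continue" case.
    return {
        (session_id, sequence): sequence < last_seq[session_id]
        for session_id, sequence in clicks
    }

clicks = [("s1", 1), ("s1", 2), ("s1", 3), ("s2", 1)]
labels = label_continuation(clicks)
```

A one-click session such as "s2" gets a single "exit" label, which matches the "come and leave" pattern discussed later.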
Session                   Session ID
Sequence                  Sequence number of the click
SessckID                  Session cookie ID
Visitnum                  Session visit count (from the
                          cookie)
Proctime                  Request processing time
Npage                     Session length (in clicks)
Sesslen                   Session length (in seconds)
Usragent                  Session user agent
Sessref                   Session referrer
Date and time variables
Contlvl*                    Page view template

Prodlvl*                    Product for product templates

Asslvl*                     Assortment for other than product
                            templates
Final                       Last page in this session

Refcont*                    Referring page content,
Refasrt*                    assortment, product
Refprod*
Weekday, hour, date         Day, hour, and date variables

Other auxiliary variables
Brand                              Brand name (leg-wear products)
Maker                              Product maker
Audience                           Product audience
Basorfas                           Basic or fashion
Prodform                           Product form
Look                               Product look
Length, size                       Length, size, depth, etc.
Collect                            Collection
Texture                            Texture
Over 40 different variables, all
highly missing
CustID                       Customer ID


Nfail                        Number of failed logins


Sesslcnt                     Session login count


Account creation date/time
variables
Email                        User’s e-mail address
Freqwear                     What do you wear most frequently?
Howfind                      How did you find us?
Legcare                      Your favorite leg care brand
Sendmail                     Allow sending solicitation e-mails
Nadult                       Number of adults
Nkids                        Number of kids
State                        Residency state
19 variables in total, all
significantly missing
Owntruck, Own***                    Truck Owner, RV owner, etc.
Ownbkcrd, Own*Crd                   Bank card holder, gas card holder,
                                    etc.
Age                                 Age
Marital                             Marital status
Mailresp                            Mail responder
Income                              Estimated income
Pool                                Presence of pool
61 variables in total, all highly
missing
   Detailed understanding of all initial variables (this took
    nearly 50% of total project time!!!)

   Creating new predictors (features):
    ◦ Slicing a variable into a set of key dimensions
    ◦ Combining different levels into logical groups to reduce the total
      number of categories
    ◦ Combining a set of variables into one informative dimension
    ◦ Creating new features to account for different layers of
      aggregation (CLICKS vs. SESSIONS vs. ORDERS vs. USERS)

   Developing the master KEEP list:
    ◦ Separating “illegal” predictors from “legitimate” ones
    ◦ Removing “useless” predictors (duplicates, nearly unary, extremely
      missing)
   Possibly dividing the large CLICKS data base into logical
    segments (Registered Users vs. Unregistered, Short Sessions vs.
    Long Sessions) with subsequent separate analyses and KEEP lists
    within each segment

   Defining the right CART model set-up (especially for PRIORS and
    COSTS)

   Running different CART models, analyzing the
    performance, revisiting all of the steps above to
    develop/test/reject new features

   For questions 1 and 2 choose the models with the highest overall
    score (adjusted for the evaluation criteria)

   For question 3 learn as much as possible from all of the above
   SESSION REFERRER (SESSREF)

    ◦ Carries extremely useful information regarding where the user
      was immediately before initiating a GAZELLE session

    ◦ In its raw form practically useless (too many levels)

   SESSION USER AGENT (USRAGENT)

    ◦ Provides detailed information about the user’s browser, including
      operating system and AOL/MSN connection

    ◦ Helps in identifying “artificial” users (ROBOTS)

    ◦ Again, practically useless in its raw form
   Referring Host (REFWEB) is one of the dimensions
    extracted after slicing the referrer
    ◦ Still has thousands of distinct levels (How many web-servers are
      out there?!!)

    ◦ Want to simplify for a more informative use

    ◦ Same services may have a variety of different host names

   New logical groups of REFWEB:
    ◦ Search engines (yahoo, excite, Google, etc.)

    ◦ Fashion sites (Fashion Mall, Shop Now)

    ◦ Bargain sites (Free Gifts, My Coupons, etc.)

    ◦ “Specialty Sites” (Winnie-Cooper!!!)

    ◦ NULL (session was initiated via a bookmark or direct typing in)
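
The grouping of REFWEB hosts might be sketched as a keyword lookup. The keyword lists, the group labels, and the OTHER fallback below are assumptions for illustration; the actual groups were built by inspecting the data.

```python
# Hypothetical keyword lists, one per logical group of referring hosts.
GROUP_KEYWORDS = {
    "SEARCH": ("yahoo", "excite", "google"),
    "FASHION": ("fashionmall", "shopnow"),
    "BARGAIN": ("freegifts", "mycoupons"),
    "SPECIALTY": ("winnie-cooper",),
}

def refweb_group(host):
    """Collapse a referring host name into one of a few logical groups."""
    if not host:
        return "NULL"  # bookmark or directly typed URL
    host = host.lower()
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in host for keyword in keywords):
            return group
    return "OTHER"  # assumed catch-all for hosts matching no group
```

Matching on substrings rather than exact host names is one way to handle services that appear under a variety of different host names.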
   Answer: Winnie-Cooper is a 31 year old guy
    who wears pantyhose and has a pantyhose
    site. 8,700 visitors came from his site(!)

   We might and we should expect different
    behavior of “Winnie-Cooper” users from
    everyone else
   All PRODLVL*, CONTLVL*, and ASSLVL* variables turned
    out to be nearly useless for direct modeling and awkward
    for interpretation

   PRODLVL1-PRODLVL3 represent different path levels in
    the file system that point into individual product
    information

   Reasonable to combine all three paths into a unique
    product descriptor PRODP

   Similarly, generate unique assortment and content
    descriptors CONTP and ASSP

   Finally, combine all three descriptors into a single page
    view descriptor (static equivalent of dynamic HTML)
    VIEWCAT- an extremely useful interpretation variable
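
A rough sketch of the descriptor construction, assuming missing levels are None or empty strings. The "/" and "|" separators and the example values are illustrative choices, not taken from the original data.

```python
def join_levels(*levels):
    """Join non-missing path levels into one descriptor (e.g. PRODLVL1-3 -> PRODP)."""
    parts = [str(level) for level in levels if level not in (None, "")]
    return "/".join(parts) if parts else None

def view_category(contp, assp, prodp):
    """Combine content, assortment, and product descriptors into one VIEWCAT-style label."""
    parts = [p for p in (contp, assp, prodp) if p]
    return "|".join(parts) if parts else None

# Hypothetical level values for a single page view.
prodp = join_levels("legwear", "hanes", None)
viewcat = view_category("product.jhtml", None, prodp)
```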
   (insert images)
   Adding clicks history
    ◦ 1-page back, 2-pages back, 3-pages back, etc.

    ◦ Dummies indicating if a given “epoch” page (home page,
      registration page, Donna Karan, etc.) has already been viewed
      prior to this click in the current session

    ◦ Counting the number of views up to this click in the session for
      the selected “epoch” pages

   Adding session history
    ◦ Identifying previous sessions based on either USERID (registered
      users) or COOKIE (unregistered users)

    ◦ Collect history features from the previous sessions (first visit,
      ordered ever, ordered previously, viewed Donna Karan products
      before, etc.)
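
The click-history features can be sketched for a single session as follows. The input is assumed to be the ordered list of page labels in the session; the feature names (back1, seen_*, count_*) are hypothetical.

```python
def add_click_history(pages, epoch_pages, lags=3):
    """Build per-click history features for one session.

    pages: ordered page labels for the session.
    epoch_pages: the selected "epoch" pages to track.
    """
    rows = []
    seen_counts = {page: 0 for page in epoch_pages}
    for i, page in enumerate(pages):
        row = {"page": page}
        # Pages viewed 1, 2, ..., `lags` clicks back (None before session start).
        for k in range(1, lags + 1):
            row[f"back{k}"] = pages[i - k] if i - k >= 0 else None
        # Dummies and counts refer to views strictly before the current click.
        for epoch in epoch_pages:
            row[f"seen_{epoch}"] = seen_counts[epoch] > 0
            row[f"count_{epoch}"] = seen_counts[epoch]
        rows.append(row)
        if page in seen_counts:
            seen_counts[page] += 1
    return rows

rows = add_click_history(["home", "register", "home", "product"], ["home"])
```

Updating the counts only after the row is emitted keeps the current click out of its own history, so the features never leak the page being viewed.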
   Adding registration history
    ◦ CUSTID is only defined for the session in which the user logged in
      explicitly

    ◦ Using COOKIEID, it is possible to approximately identify
      anonymous sessions that belong to a registered user

    ◦ Define REGISTEV=YES for any session that was initiated by a
      registered user (even prior to the registration event)

    ◦ This also gives rise to additional related features (registered
      previously , have yet to register, etc.)

   Aggregating order lines
    ◦ Mostly for question 3: summarizing order-line characteristics to
      the ORDERS and USER levels (buy socks, buy leg-care, buy black,
      buy fashion, etc.)
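
As an illustration of aggregating order lines for question 3, the sketch below rolls line amounts up to orders and then to customers, and flags customers whose average order exceeds $12. The (customer ID, order ID, amount) tuple layout is an assumption, not the ORDER LINES schema.

```python
from collections import defaultdict

def average_order_amount(order_lines):
    """Return {customer_id: average order total} from (customer_id, order_id, amount) rows."""
    # An order is the collection of order lines sharing one order ID.
    order_totals = defaultdict(float)
    for customer_id, order_id, amount in order_lines:
        order_totals[(customer_id, order_id)] += amount
    per_customer = defaultdict(list)
    for (customer_id, _order_id), total in order_totals.items():
        per_customer[customer_id].append(total)
    return {c: sum(totals) / len(totals) for c, totals in per_customer.items()}

def heavy_spenders(order_lines, threshold=12.0):
    """Flag customers whose average order amount is above the threshold."""
    averages = average_order_amount(order_lines)
    return {c: avg > threshold for c, avg in averages.items()}

lines = [("c1", "o1", 5.0), ("c1", "o1", 10.0), ("c1", "o2", 11.0), ("c2", "o3", 8.0)]
flags = heavy_spenders(lines)
```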
   Initial CLICKS data base had about 900,000
    records and 220 variables

   After filtering and adding new features, the
    number of variables grew to 450

   Dividing CLICKS into segments seems justifiable

   A CART run with DEPTH=2 reveals that
    SEQUENCE=1 is the root splitter for both
    question 1 and 2

   There is something special about the first click!
   (insert tables)

   Conclusion: usually the first click also
    becomes the last (come and leave!)
   Again, running CART DEPTH=2 on SEQUENCE>1
    shows that the next split separates registered ever
    users from non-registered

   Median session length (after removing sessions of length 1):
    ◦ Never registered 8
    ◦ Registered at some point 26

   Naturally, a registered user will have a longer session
    than a non-registered user

   Similarly, CART finds additional splits on
    SEQUENCE=2, SEQUENCE=[3,4,5], and SEQUENCE>5
   (insert image)
   Complete CLICKS data set should be used for training
    to exploit all available information

   However, the evaluation criterion for question 1 is
    referring to the SESSION level: will the SESSION
    continue?

   Prior Probabilities should be set manually to SESSION
    level values to adjust CART to the evaluation criterion

   Since we have 5 different partitions of the CLICKS
    database, 5 different sets of Prior Probabilities must
    be specified
   (insert image)

   The “majority rule” is very hard to beat!
   Checking rules for the right child of the root
    split

   (insert image)

   Root split separates crawlers, robots, and
    unusual browsers
   Node Report- Further insight into the root
    splitter

   (insert image)

   The root splitter is very powerful

   The root splitter is also quite “unique”
   Checking the second split

   (insert image)

   Second split distinguishes ever registered
    users from anonymous users
   (insert image)

   This node has the largest probability of exit

   This segment gives the best predictive power
   Root split separates “killer” pages from
    “killing” pages

   (insert images)
   (insert image)

   Still quite difficult to predict!
   Again, root split separates “killer” pages from
    “killing” pages!

   (insert image)

   This variable might be difficult to interpret

   CONTP1 could be used instead- much easier
    to interpret
   (insert image)

   The tree is large, yet it is extremely difficult
    to predict!
   (insert image)

   Still a very hard prediction problem
   QUESTION: Given a set of page views, which product brand
    (Hanes, Donna Karan, American Essentials, or None) will the
    visitor view in the remainder of the session?

   Evaluation Criterion:
    ◦ 2 units if the session visited the predicted brand;

    ◦ 1 unit if the session did not visit any of the three brands
      and the prediction was none;

    ◦ 0 units otherwise;

    ◦ All sessions of length 1 will be excluded
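
The evaluation criterion above can be written as a small scoring function. The single-letter brand codes follow the outcome codes used in this deck; the "NONE" label and the function name are illustrative.

```python
BRANDS = {"H", "D", "A"}  # Hanes, Donna Karan, American Essentials

def score_prediction(predicted, visited_brands):
    """Score one session: `predicted` is a brand code or "NONE";
    `visited_brands` holds the brands viewed in the remainder of the session."""
    visited = set(visited_brands) & BRANDS
    if predicted in visited:
        return 2  # the session visited the predicted brand
    if predicted == "NONE" and not visited:
        return 1  # no brand visited and none predicted
    return 0
```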
   For the given (truncated) session only 8 outcomes are
    possible in the remainder of the session:
    ◦ No brands are visited (O)
    ◦ Only Hanes visited (H)
    ◦ Only Donna Karan visited (D)
    ◦ Only American Essentials visited (A)
    ◦ Only Hanes and Donna Karan (HD)
    ◦ Only Donna and American Essentials (DA)
    ◦ Only Hanes and American Essentials (HA)
    ◦ All three visited (AHD)

   “Single event” outcomes (O, H, D, A): will use directly

   “Double or triple” outcomes (HD, DA, HA, AHD): must
    convert to “single”

   Thus we have an 8-level target that should be mapped into
    4 distinct levels for final prediction and scoring
Outcome   # sessions   Converted to
O         72,269
H         4,417
D         3,964
A         2,644
HD        325          D
DA        20           D
HA        153          H
AHD       8            D
Total     90,800

   Number of sessions in the clipped click stream

   Only a few sessions result in “double or triple”

   Conversion rules are defined by the dominant class
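
A sketch of collapsing the 8-level outcome into the 4 prediction levels. It assumes the conversion rules on this slide read HD to D, DA to D, HA to H, and AHD to D; that reading of the slide layout is an assumption and should be checked against the original deck.

```python
# Assumed conversion rules (multi-brand outcomes map to a single brand).
CONVERSION = {
    frozenset(): "O",
    frozenset("H"): "H",
    frozenset("D"): "D",
    frozenset("A"): "A",
    frozenset("HD"): "D",
    frozenset("DA"): "D",
    frozenset("HA"): "H",
    frozenset("AHD"): "D",
}

def collapse_outcome(visited_brands):
    """Map the set of brands visited in the remainder of a session to one target level."""
    return CONVERSION[frozenset(visited_brands)]
```

Keying on a frozenset makes the lookup independent of the order in which brands were visited.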
   Costs must be used to incorporate the
    evaluation criterion

   (insert table)
   The segmentation is done using the same
    technique that was used in Question 1
    segmentation

   (insert image)
   First we try GINI splitting rule

   (insert image)

   The tree is big, but the accuracy is low
   Now let’s try TWOING

   (insert image)

   All red nodes predict NONE

   Smaller tree, better accuracy
   Now focus on DONNA views

   (insert image)

   Now all red nodes predict DONNA
   Variable Importance clarifies which variables
    have the largest predictive power

   (insert image)
   Using TWOING splitting rule

   (insert image)

   Short sessions are the easiest to predict
   For SEQUENCE=5 and above

   (insert image)

   Longer sessions are becoming quite
    challenging
   In the evaluation, each session with at least 2 clicks is
    randomly clipped to a shorter length

   This means that a session of length T>1 is clipped to
    length S with probability 1/(T-1) for S=1,…,T-1

   For each terminal node in a CART tree the training cases
    must be weighted by the appropriate clipping probability
    when calculating the within-node probabilities

   Predict OTHER if its revised probability was more than
    twice that of the highest probability brand; otherwise, the
    highest probability brand was predicted
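
The clipping adjustment and the decision rule can be sketched as below, under the assumption that each training case in a terminal node is an (outcome, full session length) pair. The function names and the "OTHER" label are illustrative; this is not the actual CART implementation.

```python
from collections import defaultdict

def clip_weight(session_length):
    """Probability 1/(T-1) that a session of length T > 1 is clipped at a given position."""
    if session_length < 2:
        raise ValueError("sessions of length 1 are excluded from the evaluation")
    return 1.0 / (session_length - 1)

def node_probabilities(cases):
    """Clipping-weighted class probabilities for the training cases in one terminal node."""
    totals = defaultdict(float)
    for outcome, session_length in cases:
        totals[outcome] += clip_weight(session_length)
    grand_total = sum(totals.values())
    return {outcome: weight / grand_total for outcome, weight in totals.items()}

def predict(probs, other="OTHER"):
    """Predict OTHER only if it is more than twice as likely as the best brand."""
    brands = {k: v for k, v in probs.items() if k != other}
    if not brands:
        return other
    best = max(brands, key=brands.get)
    if probs.get(other, 0.0) > 2 * brands[best]:
        return other
    return best

probs = node_probabilities([("OTHER", 2), ("D", 3), ("D", 3), ("H", 5)])
```

The factor of two is consistent with the scoring rule, where a correct brand prediction earns 2 units and a correct "none" prediction earns 1.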
   Characterize visitors who spend more than $12 on an
    average order at the site

   Small dataset of 3,465 purchases, 1,831 customers

   Insight question- no test set

   Submission requirement:
    ◦ Report of up to 1,000 words and 10 graphs

    ◦ Business users should be able to understand report

    ◦ Observations should be correct and interesting: average order
      tax>$2 implies heavy spender is not interesting nor actionable
   (insert graph)
   (insert images)
   (insert graphs)
   (insert graph)
   (insert graph)
   (insert graphs)
   Orders come from different cities:
    ◦ 80% of orders coming from San Francisco and Chicago are
      heavy spenders

    ◦ 40% of orders coming from New York are heavy spenders

    ◦ Orders coming from elsewhere have only 25% of heavy
      spenders

   Color makes the difference- buying black products
    implies heavy spender

   Color is also related to city: orders from large cities
    have higher percent of black color
   Leg-care products are more expensive
    ◦ 75% of leg-care orders were above $12 threshold
    ◦ Only 25% of leg-wear orders were above $12

   Pantyhose are more expensive than socks

   Hanes and Donna Karan imply heavy
    spenders

   American Essentials imply low spenders
   Referrals from Shopnow or Fashion Mall imply heavy
    spenders, whereas MyCoupons are low spenders

   Work dress of business casual or business implies heavy
    spender

   Sunday and Monday are heavy spender days

   Income makes the difference, but not much:
    ◦ 40% of very high income users are high spenders

    ◦ 32% of very low income are also high spenders

    ◦ Only 25% are high spenders for everyone else
   AOL users tend to spend less

    ◦ 20% of AOL users are high spenders

    ◦ 29% of the remaining users are high spenders

    ◦ This might also be explained by the lack of testing
      GAZELLE site on the AOL browsers (incompatibility
      issues)

   Luxury vehicle implies heavy spender
    (slightly)
   (insert graph)
   (insert image)
   The parts marked in red might safely be removed

   Normally want to remove all graphic content
    queries like
    ◦ GIF and JPG files
    ◦ Other unnecessary content

   May reduce the size of the raw web log up to 5
    times

   The resulting web log now contains only the
    most important pieces of information
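
A minimal sketch of dropping graphic-content requests from a web log, working on the requested path only. GIF and JPG come from the slide; the remaining extensions in the pattern are assumed examples of "other unnecessary content".

```python
import re

# Extensions treated as non-content; everything beyond gif/jpg is an assumption.
STATIC_RE = re.compile(r"\.(gif|jpe?g|png|css|js|ico)(\?|$)", re.IGNORECASE)

def is_content_request(path):
    """Return True if the requested path is worth keeping in the cleaned log."""
    return STATIC_RE.search(path) is None

def clean_paths(paths):
    """Keep only the requests that are not static graphic or auxiliary content."""
    return [path for path in paths if is_content_request(path)]

kept = clean_paths(["/index.jhtml", "/images/logo.gif", "/cart.jsp", "/a.JPG?x=1"])
```

Anchoring the extension at the end of the path or just before "?" keeps dynamic templates such as *.jsp from being mistaken for *.js files.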
   (insert image)
   The clean web log is still not suitable for any
    data processing since each row basically
    represents a set of characters

   Need to convert each legitimate line into a
    delimited list of data fields

   Want to choose a delimiter that never occurs
    in the raw web log

   Will have to drop all corrupt log entries
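
One way to sketch the conversion, assuming the raw log is in the common "combined" format (an assumption; the slides do not state the format). Lines that do not match the pattern are treated as corrupt and dropped, and a tab is used as the delimiter.

```python
import re

# Pattern for the assumed "combined" log format.
COMBINED_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)
FIELDS = ("ip", "time", "method", "url", "status", "size", "referrer", "agent")

def to_delimited(lines, delimiter="\t"):
    """Convert legitimate log lines into delimited records; drop the rest."""
    records = []
    for line in lines:
        match = COMBINED_RE.match(line.strip())
        if match is None:
            continue  # corrupt entry
        values = [match.group(name) for name in FIELDS]
        if any(delimiter in value for value in values):
            continue  # delimiter occurs in the data, so the record would be ambiguous
        records.append(delimiter.join(values))
    return records

sample = [
    '10.0.0.1 - - [29/Feb/2000:10:00:00 -0800] "GET /index.jhtml HTTP/1.0" '
    '200 1234 "http://www.yahoo.com/" "Mozilla/4.0"',
    "garbled ### entry",
]
records = to_delimited(sample)
```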
   (insert image)
   Each line (entry) in a web log corresponds to a
    single resource request

   A user normally issues a set of logically
    connected requests called SESSION

   Multiple users may share the same time frame, producing
    intermixed log entries

   HTTP protocol is MEMORYLESS: need to solve
    the problem of identifying different sessions
   Using COOKIES to mark client’s station
    ◦ Might be disabled by “paranoid” clients

    ◦ Might be deleted or “exhausted”

   Using URL encoding
    ◦ Requires dynamic HTML (ASP, JSP, Servlets)

   Using pure web log heuristics
    ◦ Mostly matching on IP-address, user agent, and referrer
      fields

    ◦ May be done on any server that supports extended log
      format

    ◦ Somewhat imprecise in identifying sessions under certain
      “unfavorable” conditions
   Identifying END OF SESSION event
    ◦ Widely used 30-minute standard does not always work

   Proxy Servers
    ◦ Multiple users may share the same IP address
    ◦ Cached requests are “forever lost” for the server’s log
   Dynamic IP addresses
    ◦ A single user might have different IP address within the same
      session

   Spiders and Robots
    ◦ Completely violate any human “logic” and may generate a lot of
      “false” or “huge” sessions

   Smart heuristic programming may reduce the ambiguity
    down to as low as 5%
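
A simple version of the pure web log heuristic might look like the sketch below: entries are matched on IP address and user agent, and a gap longer than 30 minutes starts a new session. The tuple layout is hypothetical, and the sketch deliberately ignores the proxy, dynamic IP, and robot problems listed above.

```python
from datetime import datetime, timedelta

def sessionize(entries, timeout=timedelta(minutes=30)):
    """Assign session IDs to (timestamp, ip, user_agent) log entries.

    Returns (session_id, timestamp, ip, user_agent) tuples in time order.
    """
    last_seen = {}  # (ip, user_agent) -> (last timestamp, session id)
    next_id = 0
    result = []
    for timestamp, ip, agent in sorted(entries):
        key = (ip, agent)
        previous = last_seen.get(key)
        if previous is None or timestamp - previous[0] > timeout:
            next_id += 1  # first request from this client, or the timeout expired
            session_id = next_id
        else:
            session_id = previous[1]
        last_seen[key] = (timestamp, session_id)
        result.append((session_id, timestamp, ip, agent))
    return result

t0 = datetime(2000, 2, 29, 10, 0)
entries = [
    (t0, "1.1.1.1", "A"),
    (t0 + timedelta(minutes=10), "1.1.1.1", "A"),
    (t0 + timedelta(minutes=50), "1.1.1.1", "A"),
    (t0 + timedelta(minutes=5), "2.2.2.2", "B"),
]
session_ids = [row[0] for row in sessionize(entries)]
```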
   (insert image)
   The “referrer” field provides extremely valuable
    information about the user

   “Referrer” links back to the previous request

   Empty “referrer” indicates that the request was initiated
    from a bookmark or by direct typing of the URL

   Non empty “referrer” either links back to the previous
    resource requested from the server or gives the URL of the
    “outside” resource that the user was accessing
    immediately before initiating the current session

   Not easy to use directly: too many distinct values
   Referrer just like any other URL might be decomposed into the
    following pieces
    ◦ Protocol used (http, https, etc.)

    ◦ Site (domain name of the server)

    ◦ Domain (com, edu, uk, etc.)

    ◦ Resource (including path relative to the server)

    ◦ Port (usually missing for default assignment)

    ◦ Query string

   Should consider grouping “sites” into logical segments (search
    engines, specialty sites, etc.)

   May require further processing of the “resource” and “query
    string” (key-words, categories, etc.)
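
The decomposition above can be sketched with Python's standard urllib.parse module. The output field names mirror the pieces listed on the slide; taking the last dot-separated label of the host as the "domain" is a simplifying assumption.

```python
from urllib.parse import urlsplit, parse_qs

def slice_referrer(referrer):
    """Split a referrer URL into protocol, site, domain, port, resource, and query."""
    if not referrer:
        return None  # empty referrer: bookmark or directly typed URL
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    return {
        "protocol": parts.scheme,
        "site": host,
        "domain": host.rsplit(".", 1)[-1] if "." in host else "",
        "port": parts.port,  # None when the default port is used
        "resource": parts.path,
        "query": parse_qs(parts.query),
    }

sliced = slice_referrer("http://search.yahoo.com/bin/search?p=pantyhose")
```

The parsed query dictionary is a natural starting point for extracting search key-words from search engine referrers.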
   (insert image)

Contenu connexe

Similaire à 2000 KDD Cup Winners

EuroIA 2009 Designing Exploding Websites Smit Boersma
EuroIA 2009 Designing Exploding Websites Smit BoersmaEuroIA 2009 Designing Exploding Websites Smit Boersma
EuroIA 2009 Designing Exploding Websites Smit BoersmaIskander Smit
 
Designing Exploding Websites (Euro IA 2009)
Designing Exploding Websites (Euro IA 2009)Designing Exploding Websites (Euro IA 2009)
Designing Exploding Websites (Euro IA 2009)Peter Boersma
 
Building Social Enterprise with Ruby and Salesforce
Building Social Enterprise with Ruby and SalesforceBuilding Social Enterprise with Ruby and Salesforce
Building Social Enterprise with Ruby and SalesforceRaymond Gao
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...yalisassoon
 
Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...Looker
 
1,2,3 … testing : is this thing on(line)? Meet your new Microsoft Testing tools
1,2,3 … testing : is this thing on(line)? Meet your new Microsoft Testing tools1,2,3 … testing : is this thing on(line)? Meet your new Microsoft Testing tools
1,2,3 … testing : is this thing on(line)? Meet your new Microsoft Testing toolsNETUsergroupZentrals
 
Advance sql - window functions patterns and tricks
Advance sql - window functions patterns and tricksAdvance sql - window functions patterns and tricks
Advance sql - window functions patterns and tricksEyal Trabelsi
 
Deep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
Deep.bi - Real-time, Deep Data Analytics Platform For EcommerceDeep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
Deep.bi - Real-time, Deep Data Analytics Platform For EcommerceDeep.BI
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiHBaseCon
 
Functional Domain Modeling - The ZIO 2 Way
Functional Domain Modeling - The ZIO 2 WayFunctional Domain Modeling - The ZIO 2 Way
Functional Domain Modeling - The ZIO 2 WayDebasish Ghosh
 
Euro Ia Designing Exploding Websites Share
Euro Ia Designing Exploding Websites ShareEuro Ia Designing Exploding Websites Share
Euro Ia Designing Exploding Websites ShareInfo.nl
 
Online shopping ppt
Online shopping pptOnline shopping ppt
Online shopping pptNitesh Dubey
 
Web Analytics Primer
Web Analytics PrimerWeb Analytics Primer
Web Analytics PrimerChad Richeson
 
20.project inventry management system
20.project inventry management system20.project inventry management system
20.project inventry management systemLapi Mics
 
Data and Consumer Product Development
Data and Consumer Product DevelopmentData and Consumer Product Development
Data and Consumer Product DevelopmentGaurav Bhalotia
 
Srs_of_E_commerce_Online_Book_Shopping_1.doc.pdf
Srs_of_E_commerce_Online_Book_Shopping_1.doc.pdfSrs_of_E_commerce_Online_Book_Shopping_1.doc.pdf
Srs_of_E_commerce_Online_Book_Shopping_1.doc.pdfBdBangladesh
 

Similaire à 2000 KDD Cup Winners (20)

EuroIA 2009 Designing Exploding Websites Smit Boersma
EuroIA 2009 Designing Exploding Websites Smit BoersmaEuroIA 2009 Designing Exploding Websites Smit Boersma
EuroIA 2009 Designing Exploding Websites Smit Boersma
 
Designing Exploding Websites (Euro IA 2009)
Designing Exploding Websites (Euro IA 2009)Designing Exploding Websites (Euro IA 2009)
Designing Exploding Websites (Euro IA 2009)
 
Building Social Enterprise with Ruby and Salesforce
Building Social Enterprise with Ruby and SalesforceBuilding Social Enterprise with Ruby and Salesforce
Building Social Enterprise with Ruby and Salesforce
 
Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...Why use big data tools to do web analytics? And how to do it using Snowplow a...
Why use big data tools to do web analytics? And how to do it using Snowplow a...
 
1030 track2 komp
1030 track2 komp1030 track2 komp
1030 track2 komp
 
Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...Creating a Single Source of Truth: Leverage all of your data with powerful an...
Creating a Single Source of Truth: Leverage all of your data with powerful an...
 
1,2,3 … testing : is this thing on(line)? Meet your new Microsoft Testing tools
1,2,3 … testing : is this thing on(line)? Meet your new Microsoft Testing tools1,2,3 … testing : is this thing on(line)? Meet your new Microsoft Testing tools
1,2,3 … testing : is this thing on(line)? Meet your new Microsoft Testing tools
 
Advance sql - window functions patterns and tricks
Advance sql - window functions patterns and tricksAdvance sql - window functions patterns and tricks
Advance sql - window functions patterns and tricks
 
Deep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
Deep.bi - Real-time, Deep Data Analytics Platform For EcommerceDeep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
Deep.bi - Real-time, Deep Data Analytics Platform For Ecommerce
 
1120 track2 komp
1120 track2 komp1120 track2 komp
1120 track2 komp
 
Design Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and KijiDesign Patterns for Building 360-degree Views with HBase and Kiji
Design Patterns for Building 360-degree Views with HBase and Kiji
 
Functional Domain Modeling - The ZIO 2 Way
Functional Domain Modeling - The ZIO 2 WayFunctional Domain Modeling - The ZIO 2 Way
Functional Domain Modeling - The ZIO 2 Way
 
Euro Ia Designing Exploding Websites Share
Euro Ia Designing Exploding Websites ShareEuro Ia Designing Exploding Websites Share
Euro Ia Designing Exploding Websites Share
 
Web Servers
Web Servers Web Servers
Web Servers
 
Designing DDD Aggregates
Designing DDD AggregatesDesigning DDD Aggregates
Designing DDD Aggregates
 
Online shopping ppt
Online shopping pptOnline shopping ppt
Online shopping ppt
 
Web Analytics Primer
Web Analytics PrimerWeb Analytics Primer
Web Analytics Primer
 
20.project inventry management system
20.project inventry management system20.project inventry management system
20.project inventry management system
 
Data and Consumer Product Development
Data and Consumer Product DevelopmentData and Consumer Product Development
Data and Consumer Product Development
 
Srs_of_E_commerce_Online_Book_Shopping_1.doc.pdf
Srs_of_E_commerce_Online_Book_Shopping_1.doc.pdfSrs_of_E_commerce_Online_Book_Shopping_1.doc.pdf
Srs_of_E_commerce_Online_Book_Shopping_1.doc.pdf
 

Plus de Salford Systems

Datascience101presentation4
Datascience101presentation4Datascience101presentation4
Datascience101presentation4Salford Systems
 
Improve Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsImprove Your Regression with CART and RandomForests
Improve Your Regression with CART and RandomForestsSalford Systems
 
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...
Improved Predictions in Structure Based Drug Design Using Cart and Bayesian M...Salford Systems
 
Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Churn Modeling-For-Mobile-Telecommunications
Churn Modeling-For-Mobile-Telecommunications Salford Systems
 
The Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningThe Do's and Don'ts of Data Mining
The Do's and Don'ts of Data MiningSalford Systems
 
Introduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerIntroduction to Random Forests by Dr. Adele Cutler
Introduction to Random Forests by Dr. Adele CutlerSalford Systems
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like YouSalford Systems
 
Statistically Significant Quotes To Remember
Statistically Significant Quotes To RememberStatistically Significant Quotes To Remember
Statistically Significant Quotes To RememberSalford Systems
 
Using CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetUsing CART For Beginners with A Teclo Example Dataset
Using CART For Beginners with A Teclo Example DatasetSalford Systems
 
CART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideCART Classification and Regression Trees Experienced User Guide
CART Classification and Regression Trees Experienced User GuideSalford Systems
 
Evolution of regression ols to gps to mars
Evolution of regression   ols to gps to marsEvolution of regression   ols to gps to mars
Evolution of regression ols to gps to marsSalford Systems
 
Data Mining for Higher Education
Data Mining for Higher EducationData Mining for Higher Education
Data Mining for Higher EducationSalford Systems
 
Comparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingComparison of statistical methods commonly used in predictive modeling
Comparison of statistical methods commonly used in predictive modelingSalford Systems
 
Molecular data mining tool advances in hiv
Molecular data mining tool  advances in hivMolecular data mining tool  advances in hiv
Molecular data mining tool advances in hivSalford Systems
 
TreeNet Tree Ensembles & CART Decision Trees: A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees:  A Winning CombinationTreeNet Tree Ensembles & CART Decision Trees:  A Winning Combination
TreeNet Tree Ensembles & CART Decision Trees: A Winning CombinationSalford Systems
 
SPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSPM User's Guide: Introducing MARS
SPM User's Guide: Introducing MARSSalford Systems
 
Hybrid cart logit model 1998
Hybrid cart logit model 1998Hybrid cart logit model 1998
Hybrid cart logit model 1998Salford Systems
 
Session Logs Tutorial for SPM
Session Logs Tutorial for SPMSession Logs Tutorial for SPM
Session Logs Tutorial for SPMSalford Systems
 
Some of the new features in SPM 7
Some of the new features in SPM 7Some of the new features in SPM 7
Some of the new features in SPM 7Salford Systems
 

2000 KDD Cup Winners

  • 1. Mikhail Golovnya, Dan Steinberg, Scott Cardell golomi@salford-systems.com
  • 2. Question 1: Given a set of page views, will the visitor view another page on the site or will the visitor leave?  Question 2: Given a set of page views, which product will the visitor view in the remainder of the session?  Question 3: Given a set of purchases over a period of time, characterize visitors who spend more than $12 (order amount) on an average order at the site?  Question 4 and 5: insight versions of questions 1 and 2
  • 3. Gazelle.com is a leg-wear and leg-care web retailer  Soft-launch: Jan 30, 2000  Hard-launch: Feb 29, 2000 ◦ With an Ally McBeal TV ad on 28th and strong $10 off promotion  Training set: 2 months  Test sets: one month (split into two test sets)
  • 4. (insert decision tree process)
  • 5. (insert home page image)
  • 6. Web Application Server: ◦ Takes care of sessionizing (a unique session ID is assigned to each user’s session) ◦ Takes care of registration and logging in (a unique customer ID is assigned to each registered user) ◦ Uses dynamic HTML: a unique page view is identified via a combination of page view template (*.jhtml or *.jsp) and query parameters (product ID, vendor ID, assortment ID, etc.)  All data supplied come directly from the web application server logs
  • 7. Acxiom enhancements: age, gender, marital status, vehicle lifestyle, own/rent, etc.  Keynote records (about 250,000) were removed. They hit the home page 3 times a minute, 24 hours a day  Personal information was removed, including: names, addresses, login, credit card, phones, host name/IP, verification question/answer. Cookie and e-mail were obfuscated.  Test users were removed based on multiple criteria (e.g. credit card number) not available to competitors  Original data and aggregated data (to session level) were provided
  • 8. CLICKS ◦ Contains click-stream information ◦ Each record is a page view ◦ Basis for questions 1 and 2 ◦ Each sequence of clicks forms a session ◦ Session continues for any page view except for the last  ORDER LINES ◦ Contains order information ◦ Each record is an order line ◦ Order is a collection of order lines with the same order ID ◦ Basis for question 3
  • 9. Session: session ID; Sequence: sequence number of the click; SessckID: session cookie ID; Visitnum: session visit count (from the cookie); Proctime: request processing time; Npage: session length (in clicks); Sesslen: session length (in seconds); Usragent: session user agent; Sessref: session referrer; date and time variables
  • 10. Contlvl*: page view template; Prodlvl*: product for product templates; Asslvl*: assortment for other than product templates; Final: last page in this session; Refcont*, Refasrt*, Refprod*: referring page content, assortment, product; Weekday, hour, date: day, hour, and date variables; other auxiliary variables
  • 11. Brand: brand name (leg-wear products); Maker: product maker; Audience: product audience; Basorfas: basic or fashion; Prodform: product form; Look: product look; Length, size: length, size, depth, etc.; Collect: collection; Texture: texture  Over 40 different variables, all highly missing
  • 12. CustID: customer ID; Nfail: number of failed logins; Sesslcnt: session login count; account creation date/time variables
  • 13. Email: user’s e-mail address; Freqwear: “What do you wear most frequently?”; Howfind: “How did you find us?”; Legcare: “Your favorite leg care brand”; Sendmail: allow sending solicitation e-mails; Nadult: number of adults; Nkids: number of kids; State: residency state  19 variables in total, all significantly missing
  • 14. Owntruck, Own***: truck owner, RV owner, etc.; Ownbkcrd, Own*Crd: bank card holder, gas card holder, etc.; Age: age; Marital: marital status; Mailresp: mail responder; Income: estimated income; Pool: presence of pool  61 variables in total, all highly missing
  • 15. Detailed understanding of all initial variables (this took nearly 50% of total project time!!!)  Creating new predictors (features): ◦ Slicing a variable into a set of key dimensions ◦ Combining different levels into logical groups to reduce the total number of categories ◦ Combining a set of variables into one informative dimension ◦ Creating new features to account for different layers of aggregation (CLICKS vs. SESSIONS vs. ORDERS vs. USERS)  Developing the master KEEP list: ◦ Separating “illegal” predictors from “legitimate” ones ◦ Removing “useless” predictors (duplicates, nearly unary, extremely missing)
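The "useless predictor" screen from the KEEP-list step can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `useless_predictors` and both thresholds (99% missing, 99% in one level) are assumptions chosen for the example.

```python
from collections import Counter

def useless_predictors(rows, max_missing=0.99, max_dominant=0.99):
    """Return {name: reason} for predictors to leave off the KEEP list.

    rows: list of dicts mapping variable name -> hashable value (None = missing).
    Thresholds are illustrative assumptions, not values from the slides.
    """
    n = len(rows)
    if n == 0:
        return {}
    names = set().union(*(r.keys() for r in rows))
    drop = {}
    for name in sorted(names):
        present = [r[name] for r in rows if r.get(name) is not None]
        # Share of records where the variable is missing.
        if 1 - len(present) / n >= max_missing:
            drop[name] = "extremely missing"
            continue
        # Share of non-missing records taken by the most common level.
        top_count = Counter(present).most_common(1)[0][1]
        if top_count / len(present) >= max_dominant:
            drop[name] = "nearly unary"
    return drop
```

Duplicate predictors would need a separate pairwise comparison, which is omitted here.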
  • 16. Possibly dividing the large CLICKS data base into logical segments (Registered Users vs. Unregistered, Short Sessions vs. Long Sessions) with subsequent separate analyses and KEEP lists within each segment  Defining the right CART model set-up (especially for PRIORS and COSTS)  Running different CART models, analyzing the performance, revisiting all of the steps above to develop/test/reject new features  For questions 1 and 2 choose the models with the highest overall score (adjusted for the evaluation criteria)  For question 3 learn as much as possible from all of the above
  • 17. SESSION REFERRER (SESSREF) ◦ Carries extremely useful information about where the user was immediately before initiating a GAZELLE session ◦ Practically useless in its raw form (too many levels)  SESSION USER AGENT (USRAGENT) ◦ Provides detailed information about the user’s browser, including operating system and AOL/MSN connection ◦ Helps in identifying “artificial” users (ROBOTS) ◦ Again, practically useless in its raw form
  • 18. Referring Host (REFWEB) is one of the dimensions extracted after slicing the referrer ◦ Still has thousands of distinct levels (How many web-servers are out there?!!) ◦ Want to simplify for a more informative use ◦ Same services may have a variety of different host names  New logical groups of REFWEB: ◦ Search engines (yahoo, excite, Google, etc.) ◦ Fashion sites (Fashion Mall, Shop Now) ◦ Bargain sites (Free Gifts, My Coupons, etc.) ◦ “Specialty Sites” (Winnie-Cooper!!!) ◦ NULL (session was initiated via a bookmark or direct typing in)
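Grouping REFWEB into logical segments can be sketched as a keyword lookup. The keyword table and the function name `group_refweb` are hypothetical; the slides name the groups and a few example sites but do not give the actual matching rules.

```python
# Hypothetical keyword table built from the example sites named on the slide.
REFWEB_GROUPS = {
    "SEARCH": ("yahoo", "excite", "google"),
    "FASHION": ("fashionmall", "shopnow"),
    "BARGAIN": ("freegifts", "mycoupons"),
    "SPECIALTY": ("winnie-cooper",),
}

def group_refweb(host):
    """Map a raw referring host name to a coarse logical group."""
    if not host:
        # Empty referrer: bookmark or directly typed URL.
        return "NULL"
    host = host.lower()
    for group, keywords in REFWEB_GROUPS.items():
        if any(keyword in host for keyword in keywords):
            return group
    return "OTHER"
```

Substring matching is one way to handle the point that the same service may appear under many host names; a production rule set would likely need a curated list.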
  • 19. Answer: Winnie-Cooper is a 31 year old guy who wears pantyhose and has a pantyhose site. 8,700 visitors came from his site(!)  We might and we should expect different behavior of “Winnie-Cooper” users from everyone else
  • 20. All PRODLVL*, CONTLVL*, and ASSLVL* variables turned out to be nearly useless for direct modeling and awkward for interpretation  PRODLVL1-PRODLVL3 represent different path levels in the file system that point to individual product information  Reasonable to combine all three path levels into a unique product descriptor PRODP  Similarly, generate unique content and assortment descriptors CONTP and ASSP  Finally, combine all three descriptors into a single page view descriptor (static equivalent of dynamic HTML) VIEWCAT, an extremely useful interpretation variable
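A minimal sketch of building the combined descriptors, assuming simple string concatenation. The separators and the `-` placeholder for a missing descriptor are assumptions; the slides do not say how PRODP, CONTP, ASSP, or VIEWCAT were encoded.

```python
def join_levels(*levels):
    """Join non-missing path levels into one descriptor (e.g. PRODLVL1-3 into PRODP)."""
    parts = [level for level in levels if level]
    return "/".join(parts) if parts else None

def viewcat(contp, assp, prodp):
    """Combine content, assortment, and product descriptors into one page view label."""
    return "|".join(d if d else "-" for d in (contp, assp, prodp))
```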
  • 21. (insert images)
  • 22. Adding clicks history ◦ 1-page back, 2-pages back, 3-pages back, etc. ◦ Dummies indicating if a given “epoch” page (home page, registration page, Donna Karan, etc.) has already been viewed prior to this click in the current session ◦ Counting the number of views up to this click in the session for the selected “epoch” pages  Adding session history ◦ Identifying previous sessions based on either USERID (registered users) or COOKIE (unregistered users) ◦ Collect history features from the previous sessions (first visit, ordered ever, ordered previously, viewed Donna Karan products before, etc.)
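The click-history features can be sketched for a single session as below. The function name, feature names, and the representation of a session as a list of page descriptors are assumptions made for illustration.

```python
def add_click_history(pages, epoch_pages, max_back=3):
    """Build history features for each click of one session.

    pages: page descriptors in click order.
    epoch_pages: the selected "epoch" pages to track (e.g. the home page).
    """
    rows = []
    seen = {p: 0 for p in epoch_pages}  # views of each epoch page so far
    for i, page in enumerate(pages):
        row = {"page": page}
        # 1-page back, 2-pages back, ... (None before the session start).
        for k in range(1, max_back + 1):
            row[f"back{k}"] = pages[i - k] if i - k >= 0 else None
        # Dummies and counts refer only to clicks strictly before this one.
        for p in epoch_pages:
            row[f"seen_{p}"] = seen[p] > 0
            row[f"count_{p}"] = seen[p]
        rows.append(row)
        if page in seen:
            seen[page] += 1
    return rows
```

Updating the counters after the row is written keeps the current click out of its own history, which avoids leaking the page being viewed into the "already viewed" features.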
  • 23. Adding registration history ◦ CUSTID is only defined for the session in which the user logged in explicitly ◦ Using COOKIEID, it is possible to approximately identify anonymous sessions that belong to a registered user ◦ Define REGISTEV=YES for any session that was initiated by a registered user (even prior to the registration event) ◦ This also gives rise to additional related features (registered previously , have yet to register, etc.)  Aggregating order lines ◦ Mostly for question 3: summarizing order-line characteristics to the ORDERS and USER levels (buy socks, buy leg-care, buy black, buy fashion, etc.)
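The REGISTEV flag can be sketched as a two-pass lookup over sessions: first collect the cookies that ever appear with a customer ID, then flag every session carrying one of those cookies. The dictionary keys `cookie` and `custid` are assumed names for the example.

```python
def flag_registered_ever(sessions):
    """Return REGISTEV ('YES'/'NO') for each session.

    sessions: list of dicts with 'cookie' and 'custid' (None when not logged in).
    """
    # Cookies that were seen at least once together with a customer ID.
    registered_cookies = {
        s["cookie"]
        for s in sessions
        if s["custid"] is not None and s["cookie"] is not None
    }
    return [
        "YES" if s["custid"] is not None or s["cookie"] in registered_cookies else "NO"
        for s in sessions
    ]
```

As the slide notes, the match is approximate: shared machines or deleted cookies will mislabel some sessions.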
  • 24. The initial CLICKS database had about 900,000 records and 220 variables  After filtering and adding new features, the number of variables grew to 450  Dividing CLICKS into segments seems justifiable  A CART run with DEPTH=2 reveals that SEQUENCE=1 is the root splitter for both questions 1 and 2  There is something special about the first click!
  • 25. (insert tables)  Conclusion: usually the first click also becomes the last (come and leave!)
  • 26. Again, running CART DEPTH=2 on SEQUENCE>1 shows that the next split separates registered ever users from non-registered  Median session length (after removing lengths 1): ◦ Never registered 8 ◦ Registered at some point 26  Naturally, a registered user will have a longer session than a non-registered user  Similarly, CART finds additional splits on SEQUENCE=2, SEQUENCE=[3,4,5], and SEQUENCE>5
  • 27. (insert image)
  • 28. Complete CLICKS data set should be used for training to exploit all available information  However, the evaluation criterion for question 1 is referring to the SESSION level: will the SESSION continue?  Prior Probabilities should be set manually to SESSION level values to adjust CART to the evaluation criterion  Since we have 5 different partitions of the CLICKS database, 5 different sets of Prior Probabilities must be specified
  • 29. (insert image)  The “majority rule” is very hard to beat!
  • 30. Checking rules for the right child of the root split  (insert image)  Root split separates crawlers, robots, and unusual browsers
  • 31. Node Report- Further insight into the root splitter  (insert image)  The root splitter is very powerful  The root splitter is also quite “unique”
  • 32. Checking the second split  (insert image)  Second split distinguishes ever registered users from anonymous users
  • 33. (insert image)  This node has the largest probability of exit  This segment gives the best predictive power
  • 34. Root split separates “killer” pages from “killing” pages  (insert images)
  • 35. (insert image)  Still quite difficult to predict!
  • 36. Again, root split separates “killer” pages from “killing” pages!  (insert image)  This variable might be difficult to interpret  CONTP1 could be used instead, as it is much easier to interpret
  • 37. (insert image)  The tree is large, yet it is extremely difficult to predict!
  • 38. (insert image)  Still a very hard prediction problem
  • 39. QUESTION: Given a set of page views, which product brand (Hanes, Donna Karan, American Essentials, or None) will the visitor view in the remainder of the session?  Evaluation Criterion: ◦ 2 units if the session visited the predicted brand; ◦ 1 unit if the session did not visit any of the three brands and the prediction was none; ◦ 0 units otherwise; ◦ All sessions of length 1 will be excluded
  • 40. For the given (truncated) session only 8 outcomes are possible in the remainder of the session: ◦ No brands are visited -O ◦ Only Hanes visited -H ◦ Only Donna Karan visited -D ◦ Only American Essentials visited -A ◦ Only Hanes and Donna Karan -HD ◦ Only Donna Karan and American Essentials -DA ◦ Only Hanes and American Essentials -HA ◦ All three visited -AHD  “Single event” outcomes (O, H, D, A): will use directly  “Double or triple” outcomes (HD, DA, HA, AHD): must convert to “single”  Thus we have an 8-level target that should be mapped into 4 distinct levels for final prediction and scoring
  • 41. Outcome (# sessions in the clipped click stream): O 72,269; H 4,417; D 3,964; A 2,644; HD 325; DA 20; HA 153; AHD 8; Total 90,800  Only a few sessions result in “double or triple” outcomes  Conversion rules (defined by the dominant class), as the labels appear on the slide: HD → D, DA → D, HA → H, AHD → D
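The evaluation criterion can be written as a small scoring function. The single-letter brand codes (H, D, A) and the label `NONE` are naming assumptions for the sketch; sessions of length 1 are assumed to have been excluded before scoring.

```python
BRANDS = {"H", "D", "A"}  # Hanes, Donna Karan, American Essentials

def score(prediction, visited):
    """Score one session under the question 2 criterion.

    prediction: 'H', 'D', 'A', or 'NONE'.
    visited: set of brand codes viewed in the remainder of the session.
    """
    if prediction in BRANDS:
        return 2 if prediction in visited else 0
    # Prediction is NONE: rewarded only if no brand was visited.
    return 1 if not (visited & BRANDS) else 0
```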
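Collapsing the 8-level outcome into the 4-level target can be sketched as two lookups. The conversion table assumes the slide's labels line up as HD → D, DA → D, HA → H, AHD → D; that reading of the slide layout is an assumption and should be checked against the original deck.

```python
# Outcome code for each possible set of brands visited in the remainder.
OUTCOME_CODES = {
    frozenset(): "O",
    frozenset("H"): "H",
    frozenset("D"): "D",
    frozenset("A"): "A",
    frozenset("HD"): "HD",
    frozenset("DA"): "DA",
    frozenset("HA"): "HA",
    frozenset("AHD"): "AHD",
}

# Assumed conversion of "double or triple" outcomes to a single class.
CONVERSION = {"HD": "D", "DA": "D", "HA": "H", "AHD": "D"}

def target(visited):
    """Map the set of visited brand codes to the 4-level target (O, H, D, A)."""
    code = OUTCOME_CODES[frozenset(visited)]
    return CONVERSION.get(code, code)
```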
  • 42. Costs must be used to incorporate the evaluation criterion  (insert table)
  • 43. The segmentation is done using the same technique that was used in Question 1 segmentation  (insert image)
  • 44. First we try GINI splitting rule  (insert image)  The tree is big, but the accuracy is low
  • 45. Now let’s try TWOING  (insert image)  All red nodes predict NONE  Smaller tree, better accuracy
  • 46. Now focus on DONNA views  (insert image)  Now all red nodes predict DONNA
  • 47. Variable Importance clarifies which variables have the largest predictive power  (insert image)
  • 48. Using TWOING splitting rule  (insert image)  Short sessions are the easiest to predict
  • 49. For SEQUENCE=5 and above  (insert image)  Longer sessions are becoming quite challenging
  • 50. In the evaluation, each session with at least 2 clicks is randomly clipped to a shorter length  This means that a session of length T>1 is clipped to length S with probability 1/(T-1) for S=1,…,T-1  For each terminal node in a CART tree the training cases must be weighted by the appropriate clipping probability when calculating the within-node probabilities  OTHER was predicted if its revised probability was more than twice that of the highest probability brand; otherwise, the highest probability brand was predicted
  • 51. Characterize visitors who spend more than $12 on an average order at the site  Small dataset of 3,465 purchases, 1,831 customers  Insight question: no test set  Submission requirement: ◦ Report of up to 1,000 words and 10 graphs ◦ Business users should be able to understand the report ◦ Observations should be correct and interesting ◦ “Average order tax > $2 implies heavy spender” is neither interesting nor actionable
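The reweighting and the decision rule can be sketched as follows. The factor of two follows from the scoring: predicting a brand earns 2 units when right and predicting no brand earns 1, so no brand is the better bet only when its probability exceeds twice the best brand's. The function names and the `NONE` label (the slide calls this class OTHER) are assumptions, and the weighting shown is one plausible reading of "weighted by the appropriate clipping probability".

```python
from collections import defaultdict

def node_probabilities(cases):
    """Within-node class probabilities with clipping weights.

    cases: iterable of (target, T) for the training clicks in one terminal
    node, where T is the full length of the session the click came from.
    Each case gets weight 1/(T-1), the chance the session is clipped there.
    """
    totals = defaultdict(float)
    for target, length in cases:
        if length > 1:  # sessions of length 1 are excluded from evaluation
            totals[target] += 1.0 / (length - 1)
    norm = sum(totals.values())
    return {t: w / norm for t, w in totals.items()} if norm else {}

def predict(probs, none_label="NONE"):
    """Predict no brand only if it is more than twice as likely as the best brand."""
    brands = {t: p for t, p in probs.items() if t != none_label}
    if not brands:
        return none_label
    best = max(brands, key=brands.get)
    return none_label if probs.get(none_label, 0.0) > 2 * brands[best] else best
```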
  • 52. (insert graph)
  • 53. (insert images)
  • 54. (insert graphs)
  • 55. (insert graph)
  • 56. (insert graph)
  • 57. (insert graphs)
  • 58. Orders come from different cities: ◦ 80% of orders coming from San Francisco and Chicago are heavy spenders ◦ 40% of orders coming from New York are heavy spenders ◦ Orders coming from elsewhere have only 25% of heavy spenders  Color makes the difference- buying black products implies heavy spender  Color is also related to city: orders from large cities have higher percent of black color
  • 59. Leg-care products are more expensive ◦ 75% of leg-care orders were above $12 threshold ◦ Only 25% of leg-wear orders were above $12  Pantyhose are more expensive than socks  Hanes and Donna Karan imply heavy spenders  American Essentials imply low spenders
  • 60. Referrals from Shopnow or Fashion Mall imply heavy spenders, whereas MyCoupons are low spenders  Work Dress business casual or business imply heavy spender  Sunday and Monday are heavy spender days  Income makes the difference, but not much: ◦ 40% of very high income users are high spenders ◦ 32% of very low income are also high spenders ◦ Only 25% are high spenders for everyone else
  • 61. AOL users tend to spend less ◦ 20% of AOL users are high spenders ◦ 29% of the remaining users are high spenders ◦ This might also be explained by the lack of testing GAZELLE site on the AOL browsers (incompatibility issues)  Luxury vehicle implies heavy spender (slightly)
  • 62. (insert graph)
  • 63.
  • 64. (insert image)
  • 65. The parts marked in red might safely be removed  Normally want to remove all graphic content queries like ◦ GIF and JPG files ◦ Other unnecessary content  May reduce the size of the raw web log up to 5 times  The resulting web log now contains only the most important pieces of information
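Dropping graphic-content requests can be sketched as a filter on the requested path. The extension list covers only the GIF and JPG files named above; what counts as "other unnecessary content" is site-specific and is not modeled here.

```python
import re

# Matches paths ending in .gif, .jpg, or .jpeg, optionally followed by a query string.
GRAPHIC = re.compile(r"\.(gif|jpe?g)(\?|$)", re.IGNORECASE)

def is_graphic_request(path):
    """True if the requested resource is an image file."""
    return bool(GRAPHIC.search(path))

def drop_graphics(paths):
    """Keep only the requests that are not image fetches."""
    return [p for p in paths if not is_graphic_request(p)]
```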
  • 66. (insert image)
  • 67. The clean web log is still not suitable for any data processing since each row basically represents a set of characters  Need to convert each legitimate line into a delimited list of data fields  Want to choose a delimiter that never occurs in the raw web log  Will have to drop all corrupt log entries
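Converting a raw log line into delimited fields can be sketched with a regular expression. The sketch assumes the common "combined" access-log layout (host, identity, user, timestamp, request, status, size, referrer, user agent); the actual layout of the log shown on the slides is not recoverable from the transcript. The sample line in the usage note is made up.

```python
import re

# Assumed layout: combined access-log format.
COMBINED = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def to_delimited(line, delimiter="\t"):
    """Return the line as delimiter-joined fields, or None for a corrupt entry."""
    match = COMBINED.match(line.strip())
    if match is None:
        return None  # corrupt entry: drop it
    fields = match.groups()
    # The delimiter must never occur inside a field.
    if any(delimiter in field for field in fields):
        return None
    return delimiter.join(fields)
```

For example, `to_delimited('10.0.0.1 - - [29/Feb/2000:10:00:00 -0800] "GET /main/home.jhtml HTTP/1.0" 200 5120 "http://www.yahoo.com/" "Mozilla/4.0"')` yields seven tab-separated fields.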
  • 68. (insert image)
  • 69. Each line (entry) in a web log corresponds to a single resource request  A user normally issues a set of logically connected requests called a SESSION  Multiple users may share the same time frame, producing intermixed log entries  The HTTP protocol is MEMORYLESS, so we need to solve the problem of identifying different sessions
  • 70. Using COOKIES to mark client’s station ◦ Might be disabled by “paranoid” clients ◦ Might be deleted or “exhausted”  Using URL encoding ◦ Requires dynamic HTML (ASP, JSP, Servlets)  Using pure web log heuristics ◦ Mostly matching on IP-address, user agent, and referrer fields ◦ May be done on any server that supports extended log format ◦ Somewhat imprecise in identifying sessions under certain “unfavorable” conditions
  • 71. Identifying END OF SESSION event ◦ Widely used 30-minute standard does not always work  Proxy Servers ◦ Multiple users may share the same IP address ◦ Cached requests are “forever lost” for the server’s log  Dynamic IP addresses ◦ A single user might have different IP addresses within the same session  Spiders and Robots ◦ Completely violate any human “logic” and may generate a lot of “false” or “huge” sessions  Smart heuristic programming may reduce the ambiguity down to as low as 5%
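A pure web-log heuristic can be sketched as below: requests are matched on IP address and user agent, and a new session starts after 30 minutes of inactivity. This is the simplest version of the idea and is subject to every caveat listed above (proxies, dynamic IPs, robots); it does not use the referrer field, and the function name is an assumption.

```python
from datetime import datetime, timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Assign session ids to log entries.

    requests: iterable of (timestamp, ip, user_agent) tuples in any order.
    Returns a time-sorted list of (session_id, timestamp, ip, user_agent).
    """
    last_seen = {}  # (ip, agent) -> (last timestamp, session id)
    next_id = 0
    out = []
    for ts, ip, agent in sorted(requests, key=lambda r: r[0]):
        key = (ip, agent)
        prev = last_seen.get(key)
        if prev is None or ts - prev[0] > timeout:
            # First request from this client, or the idle gap was too long.
            next_id += 1
            session_id = next_id
        else:
            session_id = prev[1]
        last_seen[key] = (ts, session_id)
        out.append((session_id, ts, ip, agent))
    return out
```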
  • 72. (insert image)
  • 73. The “referrer” field provides extremely valuable information about the user  “Referrer” links back to the previous request  Empty “referrer” indicates that the request was initiated from a bookmark or by direct typing of the URL  Non empty “referrer” either links back to the previous resource requested from the server or gives the URL of the “outside” resource that the user was accessing immediately before initiating the current session  Not easy to use directly: too many distinct values
  • 74. A referrer, just like any other URL, can be decomposed into the following pieces ◦ Protocol used (http, https, etc.) ◦ Site (domain name of the server) ◦ Domain (com, edu, uk, etc.) ◦ Resource (including path relative to the server) ◦ Port (usually missing for default assignment) ◦ Query string  Should consider grouping “sites” into logical segments (search engines, specialty sites, etc.)  May require further processing of the “resource” and “query string” (key-words, categories, etc.)
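The decomposition can be sketched with Python's standard `urllib.parse` module. The dictionary keys mirror the pieces listed above; taking the last dot-separated label of the host as the "domain" is a simplifying assumption.

```python
from urllib.parse import urlsplit, parse_qs

def decompose_referrer(url):
    """Split a referrer URL into protocol, site, domain, port, resource, and query."""
    if not url:
        return None  # empty referrer: bookmark or directly typed URL
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "protocol": parts.scheme,
        "site": host,
        # Last label of the host name (com, edu, uk, ...).
        "domain": host.rsplit(".", 1)[-1] if "." in host else "",
        "port": parts.port,  # None when the default port is used
        "resource": parts.path,
        "query": parse_qs(parts.query),
    }
```

The parsed query dictionary is a natural starting point for the key-word extraction mentioned above, for example pulling the search term out of a search-engine referrer.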
  • 75. (insert image)