User Experiments in Recommender Systems
Introduction
 Welcome everyone!
Introduction
Bart Knijnenburg
 - Current: UC Irvine
    - Informatics
    - PhD candidate

 - TU Eindhoven
    - Human Technology Interaction
    - Researcher & Teacher
    - Master Student

 - Carnegie Mellon University
    - Human-Computer Interaction
    - Master Student

                                     INFORMATION AND COMPUTER SCIENCES
Introduction
Bart Knijnenburg
 - First user-centric evaluation framework
    - UMUAI, 2012

 - Founder of UCERSTI
    - Workshop on user-centric evaluation of recommender systems and their interfaces

 - Statistics expert reviewer
    - Conference + journal
    - Research methods teacher
    - SEM advisor


Introduction




Introduction

        “What is a user experiment?”



  “A user experiment is a scientific method to
    investigate how and why system aspects
influence the users’ experience and behavior.”


Introduction
My goal:
   Teach how to scientifically evaluate recommender systems
   using a user-centric approach
   How? User experiments!

My approach:
 - I will provide a broad theoretical framework
 - I will cover every step in conducting a user experiment
 - I will teach the “statistics of the 21st century”
Introduction
Welcome everyone!

Evaluation framework
A theoretical foundation for user-centric evaluation

Hypotheses
What do I want to find out?

Participants
Population and sampling

Testing A vs. B
Experimental manipulations

Measurement
Measuring subjective valuations

Analysis
Statistical evaluation of the results
Evaluation framework
A theoretical foundation for user-centric evaluation
Framework

Offline evaluations may not give the same outcome as
online evaluations
   Cosley et al., 2002; McNee et al., 2002

Solution: Test with real users




Framework

System (algorithm) -> Interaction (rating)
Framework

Higher accuracy does not
always mean higher
satisfaction
   McNee et al., 2006

Solution: Consider other
behaviors



Framework

System (algorithm) -> Interaction (rating, consumption, retention)
Framework

The algorithm accounts for only 5% of the relevance of a
recommender system
   Francisco Martin - RecSys 2009 keynote

Solution: test those other aspects




Framework

System (algorithm, interaction, presentation) -> Interaction (rating, consumption, retention)
Framework


 “Testing a recommender against a random
videoclip system, the number of clicked clips
    and total viewing time went down!”




Framework
[Path model: personalized recommendations (OSA) increase perceived recommendation quality (SSA), which increases perceived system effectiveness and choice satisfaction (EXP); these in turn relate to the number of clips watched from beginning to end, total viewing time, and the number of clips clicked.]

Knijnenburg et al.: “Receiving Recommendations and Providing Feedback”, EC-Web 2010
Framework
Behavior is hard to interpret
   Relationship between behavior and satisfaction is not
   always trivial

User experience is a better predictor of long-term retention
   With behavior only, you will need to run for a long time

Questionnaire data is more robust
   Fewer participants needed


Framework
Measure subjective valuations with questionnaires
   Perception and experience

Triangulate these data with behavior
   Ground subjective valuations in observable actions
   Explain observable actions with subjective valuations

Measure every step in your theory
   Create a chain of mediating variables


Framework

  System        Perception   Experience         Interaction
 algorithm       usability     system                 rating
interaction       quality      process          consumption
presentation      appeal      outcome              retention




Framework

Personal and situational characteristics may have an
important impact
   Adomavicius et al., 2005; Knijnenburg et al., 2012

Solution: measure those as well




Framework
                         Situational Characteristics
                   routine         system trust        choice goal


  System        Perception         Experience           Interaction
 algorithm       usability            system                  rating
interaction       quality            process            consumption
presentation      appeal             outcome               retention


                             Personal Characteristics
                   gender             privacy            expertise

Framework
Objective System Aspects (OSA)
These are manipulations:
 - visual / interaction design
 - recommender algorithm
 - presentation of recommendations
 - additional features
Framework
User Experience (EXP)
Different aspects may influence different things:
 - interface -> system evaluations
 - preference elicitation method -> choice process
 - algorithm -> quality of the final choice
Framework
Subjective System Aspects (SSA)
 - Link OSA to EXP (mediation)
 - Increase the robustness of the effects of OSAs on EXP
 - Explain how and why OSAs affect EXP
Framework
Personal and Situational Characteristics (PC and SC)
 - Effect of the specific user and task
 - Beyond the influence of the system
 - Here used for evaluation, not for augmenting algorithms
Framework
Interaction (INT)
 - Observable behavior, e.g. browsing, viewing, log-ins
 - Final step of evaluation
 - Grounds EXP in “real” data
Framework
                         Situational Characteristics
                   routine         system trust        choice goal


  System        Perception         Experience           Interaction
 algorithm       usability            system                  rating
interaction       quality            process            consumption
presentation      appeal             outcome               retention


                             Personal Characteristics
                   gender             privacy            expertise

Framework
Has suggestions for measurement scales
   e.g. How can we measure something like “satisfaction”?

Provides a good starting point for causal relations
   e.g. How and why do certain system aspects influence the
   user experience?

Useful for integrating existing work
   e.g. How do recommendation list length, diversity and
   presentation each have an influence on user experience?

Framework


“Statisticians, like artists, have the bad habit
    of falling in love with their models.”

                 George Box




test your recommender system with real users


measure behaviors other                          test aspects other than
     than ratings                                     the algorithm



  measure subjective                              take situational and
   valuations with                               personal characteristics
    questionnaires                                    into account

          Evaluation framework
               Help in measurement and causal relations


     use our framework as a starting point for user-centric research
Hypotheses
What do I want to find out?
Hypotheses


“Can you test if my recommender system
                 is good?”




Hypotheses

What does good mean?
- Recommendation accuracy?
- Recommendation quality?
- System usability?
- System satisfaction?
We need to define measures



Hypotheses


“Can you test if the user interface of my
recommender system scores high on this
           usability scale?”




Hypotheses

What does high mean?
   Is 3.6 out of 5 on a 5-point scale “high”?
   What are 1 and 5?
   What is the difference between 3.6 and 3.7?

We need to compare the UI against something



Hypotheses


“Can you test if the UI of my recommender
 system scores high on this usability scale
     compared to this other system?”




Hypotheses




My new travel recommender   Travelocity


Hypotheses
Say we find that it scores higher on usability... why does it?
 - different date-picker method
 - different layout
 - different number of options available
Apply the concept of ceteris paribus to get rid of
confounding variables
   Keep everything the same, except for the thing you want
   to test (the manipulation)
   Any difference can be attributed to the manipulation

Hypotheses




My new travel recommender     Previous version
                            (too many options)

Hypotheses
To learn something from the study, we need a theory behind
the effect
   For industry, this may suggest further improvements to the
   system
   For research, this makes the work generalizable

Measure mediating variables
   Measure understandability (and a number of other
   concepts) as well
   Find out how they mediate the effect on usability

Hypotheses

An example:

We compared three
recommender systems
  Three different algorithms
  Ceteris paribus!



Hypotheses



Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012



The mediating variables show the entire story


Hypotheses

[Path model: a matrix factorization recommender with explicit feedback (MF-E) and one with implicit feedback (MF-I), each versus a generally-most-popular baseline (GMP; both OSAs), positively affect perceived recommendation variety and perceived recommendation quality (SSAs), which in turn increase perceived system effectiveness (EXP).]

Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012
Hypotheses

“An approximate answer to the right problem
  is worth a good deal more than an exact
    answer to an approximate problem.”

                John Tukey



define measures

  compare system                                      get rid of
aspects against each                                confounding
       other                                          variables



apply the concept of                              look for a theory
  ceteris paribus                              behind the found effects


                       Hypotheses
                       What do I want to find out?


        measure mediating variables to explain the effects
Participants
Population and sampling
Participants

“We are testing our recommender system
      on our colleagues/students.”

                 -or-

       “We posted the study link
        on Facebook/Twitter.”


Participants
Are your connections, colleagues, or students typical users
of your system?
 - They may have more knowledge of the field of study
 - They may feel more excited about the system
 - They may know what the experiment is about
 - They probably want to please you
You should sample from your target population
   An unbiased sample of users of your system

Participants


“We only use data from frequent users.”




Participants
What are the consequences of limiting your scope?
   You run the risk of catering to that subset of users only
   You cannot make generalizable claims about users

For scientific experiments, the target population may be
unrestricted
   Especially when your study is more about human nature
   than about a specific system


Participants


“We tested our system with 10 users.”




Participants
Is this a decent sample size?
   Can you attain statistically significant results?
   Does it provide a wide enough inductive base?

Make sure your sample is large enough
   40 is typically the bare minimum

   Anticipated effect size   Needed sample size
   small                     385
   medium                    54
   large                     25
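As a rough sanity check, the required sample size per group for a two-sample t-test can be derived from the anticipated effect size (Cohen's d). This is a minimal stdlib sketch using the normal approximation; exact numbers depend on the test used, so they differ somewhat from the table above.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided,
    two-sample t-test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_beta = z.inv_cdf(power)           # 0.84 for power = .80
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# Cohen's conventional effect sizes: small, medium, large
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, n_per_group(d))  # small 393, medium 63, large 25
```

Note how steeply the required sample grows as the anticipated effect shrinks: detecting a small effect takes roughly 15 times as many participants as a large one.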
sample from your target population




the target population                             make sure your sample
may be unrestricted                                 is large enough


                        Participants
                        Population and sampling
Testing A vs. B
 Experimental manipulations
Manipulations


“Are our users more satisfied if our news
recommender shows only recent items?”




Manipulations
Proposed system or treatment:
   Filter out any items > 1 month old

What should be my baseline?
 - Filter out items < 1 month old?
 - Unfiltered recommendations?
 - Filter out items > 3 months old?
You should test against a reasonable alternative
   “Absence of evidence is not evidence of absence”

Manipulations


“The first 40 participants will get the baseline,
     the next 40 will get the treatment.”




Manipulations

These two groups cannot be expected to be similar!
   Some news item may affect one group but not the other

Randomize the assignment of conditions to participants
   Randomization neutralizes (but doesn’t eliminate)
   participant variation



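Randomized assignment is easiest to get right with a shuffled, balanced list rather than an independent coin flip per participant (which can leave the conditions unequal in size). A minimal sketch; the function name is illustrative:

```python
import random

def balanced_assignment(n_participants, conditions=("A", "B"), seed=None):
    """Return a randomly shuffled, balanced list of condition labels."""
    rng = random.Random(seed)
    per_cell, remainder = divmod(n_participants, len(conditions))
    assignment = list(conditions) * per_cell + list(conditions)[:remainder]
    rng.shuffle(assignment)  # random order, but group sizes stay balanced
    return assignment

assignment = balanced_assignment(80, seed=42)
print(assignment.count("A"), assignment.count("B"))  # 40 40
```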
Manipulations
Between-subjects design: 100 participants

Randomly assign half the participants to A, half to B (50 / 50)
 - Realistic interaction
 - Manipulation hidden from the user
 - Many participants needed
Manipulations
Within-subjects design:         50 participants


Give participants A first, then B
 - Removes subject variability
 - Participants may see the manipulation
 - Spill-over effects
Manipulations
Within-subjects design:          50 participants


Show participants A and B simultaneously
 - Removes subject variability
 - Participants can compare conditions
 - Not a realistic interaction
Manipulations
Should I do within-subjects or between-subjects?

Use between-subjects designs for user experience
   Closer to a real-world usage situation
   No unwanted spill-over effects

Use within-subjects designs for psychological research
   Effects are typically smaller
   Nice to control between-subjects variability


Manipulations
You can test multiple manipulations in a factorial design

The more conditions, the more participants you will need!

               Low diversity    High diversity
    5 items    5 + low          5 + high
   10 items    10 + low         10 + high
   20 items    20 + low         20 + high
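The cells of a factorial design are just the cross product of the factor levels, so they are easy to enumerate. A sketch mirroring the 3 × 2 design above:

```python
from itertools import product

list_lengths = [5, 10, 20]         # first factor: number of items
diversification = ["low", "high"]  # second factor: list diversity

# Each participant is randomly assigned to one of these cells
conditions = [(n, d) for n, d in product(list_lengths, diversification)]
print(len(conditions))  # 6
```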
Manipulations

Let’s test an algorithm against random recommendations
   What should we tell the participant?

Beware of the Placebo effect!
   Remember: ceteris paribus!
   Other option: manipulate the message (factorial design)



Manipulations
     “We were demonstrating our new
recommender to a client. They were amazed
by how well it predicted their preferences!”

“Later we found out that we forgot to activate
     the algorithm: the system was giving
   completely random recommendations.”

                (anonymized)

test against a reasonable alternative


randomize assignment                            use between-subjects
    of conditions                                 for user experience



use within-subjects for                            you can test more
psychological research                            than two conditions


                  Testing A vs. B
                     Experimental manipulations


       you can test multiple manipulations in a factorial design
Measurement
Measuring subjective valuations
Measurement


“To measure satisfaction, we asked users
     whether they liked the system
      (on a 5-point rating scale).”




Measurement

Does the question mean the same to everyone?
 - John likes the system because it is convenient
 - Mary likes the system because it is easy to use
 - Dave likes it because the recommendations are good
A single question is not enough to establish content validity
   We need a multi-item measurement scale



Measurement
Perceived system effectiveness:
 - Using the system is annoying
 - The system is useful
 - Using the system makes me happy
 - Overall, I am satisfied with the system
 - I would recommend the system to others
 - I would quickly abandon using this system
Measurement
Use both positively and negatively phrased items
 - They make the questionnaire less “leading”
 - They help filter out careless participants
 - They explore the “flip-side” of the scale
The word “not” is easily overlooked
   Bad: “The recommendations were not very novel”
   Good: “The recommendations felt outdated”


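One practical consequence (not spelled out on the slide) is that negatively phrased items must be reverse-coded before scoring, so that a high value always means a more positive evaluation. On a 5-point scale the recode is simply 6 − x; item names here are illustrative:

```python
def reverse_code(score, scale_max=5):
    """Reverse-code a rating on a 1..scale_max scale."""
    return scale_max + 1 - score

# "Using the system is annoying" is negatively phrased, so flip it:
raw = {"annoying": 2, "useful": 4, "satisfied": 5}
recoded = dict(raw)
recoded["annoying"] = reverse_code(raw["annoying"])  # 2 -> 4
print(recoded)
```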
Measurement

Choose simple over specialized words
   Participants may have no idea they are using a
   “recommender system”

Avoid double-barreled questions
   Bad: “The recommendations were relevant and fun”



Measurement


“We asked users ten 5-point scale questions
       and summed the answers.”




Measurement
Is the scale really measuring a single thing?
 - 5 items measure satisfaction, the other 5 convenience
 - The items are not related enough to make a reliable scale
Are two scales really measuring different things?
 - They are so closely related that they actually measure the
   same thing

We need to establish convergent and discriminant validity
   This makes sure the scales are unidimensional

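Before running a full factor analysis, a quick classical check on whether a set of items hangs together as one scale is Cronbach's alpha (not covered in the slides). A stdlib sketch with made-up ratings:

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of responses per item (same respondents, same order)."""
    k = len(items)
    item_vars = sum(variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sum
    return k / (k - 1) * (1 - item_vars / variance(totals))

# Three hypothetical satisfaction items, five respondents each:
items = [[4, 5, 3, 4, 2],
         [5, 5, 3, 4, 2],
         [4, 4, 2, 5, 1]]
print(round(cronbach_alpha(items), 2))  # 0.94: items covary strongly
```

Alpha above roughly 0.7–0.8 is conventionally taken as acceptable reliability, but a high alpha alone does not prove unidimensionality, which is why factor analysis is still needed.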
Measurement
Solution: factor analysis
 - Define latent factors, specify how items “load” on them
 - Factor analysis will determine how well the items “fit”
 - It will give you suggestions for improvement
Benefits of factor analysis:
 - Establishes convergent and discriminant validity
 - Outcome is a normally distributed measurement scale
 - The scale captures the “shared essence” of the items
Measurement
[Factor model: items var1–var6 load on perceived recommendation variety, qual1–qual4 on perceived recommendation quality, diff1–diff5 on choice difficulty, exp1–exp3 on movie expertise, and sat1–sat7 on choice satisfaction.]
Measurement
[The same factor model, annotated with problems: several items show low communality, one item loads on quality, variety, and satisfaction at once, and some item pairs have high residual correlations (e.g. with qual2 and with var1).]
Measurement
[Validity statistics after removing the problematic items:
 - perceived recommendation variety (var1–var4, var6): AVE 0.622, sqrt(AVE) = 0.789 > largest correlation 0.491
 - perceived recommendation quality (qual1–qual4): AVE 0.756, sqrt(AVE) = 0.870 > largest correlation 0.709
 - choice difficulty (diff1, diff2, diff4): AVE 0.435 (!), sqrt(AVE) = 0.659 > largest correlation −0.438
 - movie expertise (exp1–exp3): AVE 0.793, sqrt(AVE) = 0.891 > highest correlation 0.234
 - choice satisfaction (sat1, sat3–sat6): AVE 0.655, sqrt(AVE) = 0.809 > highest correlation 0.709]
Measurement

“Great! Can I learn how to do this myself?”



       Check the video tutorials at
          www.statmodel.com



Measurement
Measuring subjective valuations
 - establish content validity with multi-item scales
 - follow the general principles for good questionnaire items
 - establish convergent and discriminant validity
 - use factor analysis
Analysis
Statistical evaluation of the results
Analysis

Manipulation -> perception:
   Do these two algorithms lead to a different level of perceived quality?

T-test

[Bar chart: mean perceived quality for conditions A and B]
                                                  INFORMATION AND COMPUTER SCIENCES
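This first step can be sketched in a few lines of code. The example below computes Welch's two-sample t statistic by hand; the scores are made up for illustration (in practice, each value would be a participant's averaged perceived-quality score):

```python
import math

def welch_t(a, b):
    """Welch's two-sample t statistic: do groups A and B differ in mean?"""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    # unbiased sample variances
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / se

# hypothetical perceived-quality scores for algorithms A and B
quality_a = [1, 2, 3]
quality_b = [4, 5, 6]
t = welch_t(quality_a, quality_b)  # t is about -3.67: B is rated higher
```

A statistics package (e.g. scipy, R, or MPlus) would also return the degrees of freedom and p-value; the point here is only the shape of the comparison.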
Analysis

Perception -> experience:
   Does perceived quality influence system effectiveness?

Linear regression

[Scatter plot: system effectiveness vs. recommendation quality, with fitted regression line]
                                                    INFORMATION AND COMPUTER SCIENCES
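The regression step can likewise be sketched with the closed-form least-squares solution for one predictor (the data below are invented for illustration):

```python
def ols(x, y):
    """Simple least-squares regression of y on x: returns (slope, intercept)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# hypothetical scores: perceived quality predicting system effectiveness
quality = [0, 1, 2, 3]
effectiveness = [1, 3, 5, 7]
slope, intercept = ols(quality, effectiveness)  # slope 2.0, intercept 1.0
```

A positive, significant slope would support the claim that perceived quality drives effectiveness (significance testing is omitted here for brevity).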
Analysis

Two manipulations -> perception:
   What is the combined effect of list diversity and list length on perceived recommendation quality?

Factorial ANOVA

[Bar chart: perceived quality for low vs. high diversification at 5, 10, and 20 items]
Willemsen et al.: “Not just more of the same”, submitted to TiiS
                                                                      INFORMATION AND COMPUTER SCIENCES
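The logic of a factorial ANOVA can be sketched for a balanced design: partition the variance into the two main effects, their interaction, and within-cell noise. This is a minimal 2x2 illustration with made-up ratings (the actual study crossed diversification with three list lengths):

```python
def two_way_anova(cells):
    """Balanced two-way ANOVA on a dict {(a_level, b_level): [scores]}.
    Returns F statistics for factor A, factor B, and the A x B interaction."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))        # observations per cell (balanced)
    all_scores = [s for ss in cells.values() for s in ss]
    grand = sum(all_scores) / len(all_scores)
    cell_mean = {k: sum(v) / n for k, v in cells.items()}
    mean_a = {a: sum(cell_mean[a, b] for b in b_levels) / len(b_levels) for a in a_levels}
    mean_b = {b: sum(cell_mean[a, b] for a in a_levels) / len(a_levels) for b in b_levels}
    # sums of squares for main effects, interaction, and residual
    ss_a = n * len(b_levels) * sum((mean_a[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((mean_b[b] - grand) ** 2 for b in b_levels)
    ss_cells = n * sum((cell_mean[k] - grand) ** 2 for k in cells)
    ss_ab = ss_cells - ss_a - ss_b
    ss_within = sum((s - cell_mean[k]) ** 2 for k, v in cells.items() for s in v)
    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_within = len(all_scores) - len(cells)
    ms_within = ss_within / df_within
    return (ss_a / df_a / ms_within,
            ss_b / df_b / ms_within,
            ss_ab / (df_a * df_b) / ms_within)

# hypothetical perceived-quality ratings per (diversification, list length) cell
cells = {("low", 5): [1, 3], ("low", 20): [2, 4],
         ("high", 5): [5, 7], ("high", 20): [4, 6]}
f_div, f_len, f_inter = two_way_anova(cells)  # F = 9.0, 0.0, 1.0
```

In this toy data only diversification has an effect; a real analysis would compare each F against its critical value.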
Analysis

Only one method: structural equation modeling
   The statistical method of the 21st century

Combines factor analysis and path models
 - Turn items into factors
 - Test causal relations

[Path model diagram: Top-20 vs Top-5 and Lin-20 vs Top-5 manipulations; factors perceived recommendation variety, perceived recommendation quality, choice difficulty, movie expertise, and choice satisfaction, each measured by its items, connected by paths with coefficients]
                                                                                                                     INFORMATION AND COMPUTER SCIENCES
Analysis

Very simple reporting
 - Report overall fit + effect of each causal relation
 - A path that explains the effects

[Path model diagram with the coefficient, standard error, and p-value for each path]
                                                                                                                      INFORMATION AND COMPUTER SCIENCES
Analysis
Example from Bollen et al.: “Choice Overload”
   What is the effect of the number of recommendations?
   What about the composition of the recommendation list?

Tested with 3 conditions:
 - Top-5:  recs ranked 1 2 3 4 5
 - Top-20: recs ranked 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
 - Lin-20: recs ranked 1 2 3 4 5 99 199 299 399 499 599 699 799 899 999 1099 1199 1299 1399 1499
                                                                INFORMATION AND COMPUTER SCIENCES
Analysis




Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010

                                                                         INFORMATION AND COMPUTER SCIENCES
Analysis

[Conceptual path diagram: Top-20 vs Top-5 and Lin-20 vs Top-5 conditions affecting perceived recommendation variety, perceived recommendation quality, and choice difficulty, which in turn affect choice satisfaction; movie expertise as a covariate]
 - “Simple regression”: the direct effect of the manipulation
 - “Full mediation” (Top-20 side): the condition works entirely through perceived variety and quality
 - “Inconsistent mediation” (Lin-20 side): opposing effects via quality (−) and difficulty (+)
 - “Trade-off”: quality increases (+) and difficulty decreases (−) choice satisfaction
 - “Additional effect”: a further positive path to choice satisfaction
 - Each factor is measured by multiple items, e.g. variety by var1–var6 (not shown here)

Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010
                                                                                                 INFORMATION AND COMPUTER SCIENCES
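A full SEM is best left to dedicated software, but the mediation logic itself can be illustrated with two sequential regressions (Baron–Kenny style): an a-path from condition to mediator, a b-path from mediator to outcome controlling for the condition, and an indirect effect a*b. All data below are made up; real path coefficients also come with standard errors and fit statistics:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small system A x = b."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def ols_multi(X, y):
    """OLS coefficients via the normal equations X'X b = X'y (intercept first)."""
    X = [[1.0] + row for row in X]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

cond = [0, 0, 1, 1]        # hypothetical: Top-20 (1) vs Top-5 (0)
variety = [1, 2, 3, 4]     # mediator: perceived recommendation variety
quality = [2, 4, 6, 8]     # outcome: perceived recommendation quality

a = ols_multi([[c] for c in cond], variety)[1]                    # a-path
_, direct, b = ols_multi([[c, m] for c, m in zip(cond, variety)], quality)
indirect = a * b           # mediated effect; here direct is ~0: full mediation
```

With these toy numbers the condition's total effect flows entirely through the mediator, mirroring the "full mediation" pattern in the slide above.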
Analysis

[Path model results, coefficient (SE):]
 - Top-20 (vs Top-5) -> perceived recommendation variety: .455 (.211), p < .05 (+)
 - Top-20 (vs Top-5) -> perceived recommendation quality: .612 (.220), p < .01 (+)
 - Lin-20 (vs Top-5) -> perceived recommendation quality: -.804 (.230), p < .001 (−)
 - Lin-20 (vs Top-5) -> choice difficulty: .879 (.265), p < .001 (+)
 - perceived variety -> perceived quality: .503 (.090), p < .001 (+)
 - perceived quality -> choice difficulty: .336 (.089), p < .001 (+)
 - movie expertise -> perceived variety: .181 (.075), p < .05 (+)
 - movie expertise -> perceived quality: .205 (.083), p < .05 (+)
 - perceived quality -> choice satisfaction: 1.151 (.161), p < .001 (+)
 - choice difficulty -> choice satisfaction: -.417 (.125), p < .005 (−)
 - additional effect on choice satisfaction: .894 (.287), p < .005 (+)

Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010
                                                                                                INFORMATION AND COMPUTER SCIENCES
Analysis

“Great! Can I learn how to do this myself?”



       Check the video tutorials at
          www.statmodel.com



                                  INFORMATION AND COMPUTER SCIENCES
Analysis
Homoscedasticity is a prerequisite for linear stats
   Not true for count data, time, etc.

Outcomes need to be unbounded and continuous
   Not true for binary answers, counts, etc.

[Scatter plot: interaction time (min) vs. level of commitment]
                                                                          INFORMATION AND COMPUTER SCIENCES
Analysis
Use the correct methods for non-normal data
 - Binary data: logit/probit regression
 - 5- or 7-point scales: ordered logit/probit regression
 - Count data: Poisson regression
Don’t use “distribution-free” stats
   They were invented when calculations were done by hand

Do use structural equation modeling
   MPlus can do all these things “automatically”

                                                   INFORMATION AND COMPUTER SCIENCES
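For a binary outcome (e.g. "did the user follow a recommendation?"), the logit model fits log-odds instead of raw values. Below is a minimal gradient-ascent sketch on invented data; a real analysis would use a stats package (statsmodels, R, or MPlus), which also provides standard errors and tests:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.1, steps=2000):
    """Logistic regression of binary y on one predictor x, fit by gradient
    ascent on the log-likelihood. Returns (intercept, slope)."""
    b0 = b1 = 0.0
    for _ in range(steps):
        # gradient of the log-likelihood: sum of (observed - predicted)
        g0 = sum(yi - sigmoid(b0 + b1 * xi) for xi, yi in zip(x, y))
        g1 = sum((yi - sigmoid(b0 + b1 * xi)) * xi for xi, yi in zip(x, y))
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# hypothetical data: predictor = standardized commitment, outcome = followed (0/1)
x = [-2, -1, 0, 1, 2]
y = [0, 0, 1, 0, 1]
b0, b1 = fit_logistic(x, y)
p_low, p_high = sigmoid(b0 + b1 * -2), sigmoid(b0 + b1 * 2)
```

The fitted slope is on the log-odds scale: each unit increase in the predictor multiplies the odds of following a recommendation by exp(b1).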
Analysis
Standard regression requires uncorrelated errors
   Not true for repeated measures, e.g. “rate these 5 items”
   There will be a user-bias (and maybe an item-bias)

Golden rule: data-points should be independent

[Scatter plot: number of followed recommendations vs. level of commitment]
                                                                                     INFORMATION AND COMPUTER SCIENCES
Analysis
Two ways to account for repeated measures:
 - Define a random intercept for each user
 - Impose an error covariance structure

Again, in MPlus this is easy

[Scatter plot: number of followed recommendations vs. level of commitment, per user]
                                                                                       INFORMATION AND COMPUTER SCIENCES
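To see why this matters: pooling repeated measures across users can even flip the sign of an effect (a Simpson's-paradox pattern). Centering each user's observations on their own means, which is equivalent to giving every user a fixed intercept, recovers the within-user slope. The data below are made up to produce exactly this reversal:

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# hypothetical repeated measures: {user: [(commitment, followed_recs), ...]}
data = {"user_a": [(1, 4), (2, 3)], "user_b": [(3, 6), (4, 5)]}

# naive pooling ignores each user's baseline -> misleadingly positive slope
pooled_x = [x for obs in data.values() for x, _ in obs]
pooled_y = [y for obs in data.values() for _, y in obs]
naive = slope(pooled_x, pooled_y)

# within-user centering removes the per-user intercept -> true negative slope
cx, cy = [], []
for obs in data.values():
    ux = sum(x for x, _ in obs) / len(obs)
    uy = sum(y for _, y in obs) / len(obs)
    cx += [x - ux for x, _ in obs]
    cy += [y - uy for _, y in obs]
within = slope(cx, cy)
```

A proper mixed-effects model (random intercepts, as MPlus or lme4 fit them) generalizes this idea while also estimating the between-user variance.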
Analysis
A manipulation only causes things

For all other variables, base the causal direction on:
 - Common sense
 - Psych literature
 - Evaluation frameworks

Example: privacy study
[Path diagram: Disclosure help (R² = .302), Perceived privacy threat (R² = .565), Trust in the company (R² = .662), and Satisfaction with the system (R² = .674), connected by positive and negative paths]
Knijnenburg & Kobsa: “Making Decisions about Privacy”
                                                           INFORMATION AND COMPUTER SCIENCES
Analysis


“All models are wrong, but some are useful.”

                George Box




                                   INFORMATION AND COMPUTER SCIENCES
Analysis
Statistical evaluation of the results
 - use structural equation models
 - use correct methods for non-normal data
 - use correct methods for repeated measures
 - use manipulations and theory to make inferences about causality
Introduction
User experiments: user-centric evaluation of recommender systems

Evaluation framework
Use our framework as a starting point for user experiments

Hypotheses
Construct a measurable theory behind the expected effect

Participants
Select a large enough sample from your target population

Testing A vs. B
Assign users to different versions of a system aspect, ceteris paribus

Measurement
Use factor analysis and follow the principles for good questionnaires

Analysis
Use structural equation models to test causal models
“It is the mark of a truly intelligent person
         to be moved by statistics.”




          George Bernard Shaw
Resources
User-centric evaluation
   Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining
   the User Experience of Recommender Systems. UMUAI 2012.
   Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to
   Support the User-Centric Evaluation of Recommender Systems. RecSys 2011.

Choice overload
   Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not just more of the
   same: Preventing Choice Overload in Recommender Systems by Offering
   Small Diversified Sets. Submitted to TiiS.
   Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.: Understanding
   Choice Overload in Recommender Systems. RecSys 2010.



                                                                   INFORMATION AND COMPUTER SCIENCES
Resources
Preference elicitation methods
   Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How
   Different Users Call for Different Interaction Methods in Recommender
   Systems. RecSys 2011.
   Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation
   Methods on the User Experience of a Recommender System. CHI 2010.
   Knijnenburg, B.P., Willemsen, M.C.: Understanding the effect of adaptive
   preference elicitation methods on user satisfaction of a recommender system.
   RecSys 2009.

Social recommenders
   Knijnenburg, B.P., Bostandjiev, S., O'Donovan, J., Kobsa, A.: Inspectability and
   Control in Social Recommender Systems. RecSys 2012.

                                                                   INFORMATION AND COMPUTER SCIENCES
Resources
User feedback and privacy
   Knijnenburg, B.P., Kobsa, A.: Making Decisions about Privacy: Information
   Disclosure in Context-Aware Recommender Systems. Submitted to TiiS.
   Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Getting Recommendations and
   Providing Feedback: The User-Experience of a Recommender System. EC-Web 2010.

Statistics books
   Agresti, A.: An Introduction to Categorical Data Analysis. 2nd ed. 2007.
   Fitzmaurice, G.M., Laird, N.M., Ware, J.H.: Applied Longitudinal Analysis. 2004.
   Kline, R.B.: Principles and Practice of Structural Equation Modeling. 3rd ed. 2010.
   Online tutorial videos at www.statmodel.com


                                                                   INFORMATION AND COMPUTER SCIENCES
Resources
Questions? Suggestions? Collaboration proposals?
   Contact me!



Contact info
   E: bart.k@uci.edu
   W: www.usabart.nl
   T: @usabart



                                            INFORMATION AND COMPUTER SCIENCES

Metaphors as design points for collaboration 2012KM Chicago
 
Chi1417 Richardson
Chi1417 RichardsonChi1417 Richardson
Chi1417 Richardsonmerichar
 
Typicality based collaborative filtering recommendation
Typicality based collaborative filtering recommendationTypicality based collaborative filtering recommendation
Typicality based collaborative filtering recommendationPapitha Velumani
 
Safety Management Systems Process Vs Tradition
Safety Management Systems Process Vs TraditionSafety Management Systems Process Vs Tradition
Safety Management Systems Process Vs TraditionEdward Hanna, CSP, CIH
 
A Usability Perspective on Clinical Handover Improvement- CHIPS with Everythi...
A Usability Perspective on Clinical Handover Improvement- CHIPS with Everythi...A Usability Perspective on Clinical Handover Improvement- CHIPS with Everythi...
A Usability Perspective on Clinical Handover Improvement- CHIPS with Everythi...Health Informatics New Zealand
 
Testing & implementation
Testing & implementationTesting & implementation
Testing & implementationmira benissa
 
CA Quality Management System
CA Quality Management SystemCA Quality Management System
CA Quality Management SystemWahyu Prasetianto
 
A&D - Introduction to Analysis & Design of Information System
A&D - Introduction to Analysis & Design of Information SystemA&D - Introduction to Analysis & Design of Information System
A&D - Introduction to Analysis & Design of Information Systemvinay arora
 
ATI Courses Professional Development Short Course Applied Measurement Engin...
ATI Courses Professional Development Short Course Applied  Measurement  Engin...ATI Courses Professional Development Short Course Applied  Measurement  Engin...
ATI Courses Professional Development Short Course Applied Measurement Engin...Jim Jenkins
 
A vision on collaborative computation of things for personalized analyses
A vision on collaborative computation of things for personalized analysesA vision on collaborative computation of things for personalized analyses
A vision on collaborative computation of things for personalized analysesDaniele Gianni
 
Fcv core liu
Fcv core liuFcv core liu
Fcv core liuzukun
 
Recommendation System using Machine Learning Techniques
Recommendation System using Machine Learning TechniquesRecommendation System using Machine Learning Techniques
Recommendation System using Machine Learning TechniquesIRJET Journal
 
ENTERTAINMENT CONTENT RECOMMENDATION SYSTEM USING MACHINE LEARNING
ENTERTAINMENT CONTENT RECOMMENDATION SYSTEM USING MACHINE LEARNINGENTERTAINMENT CONTENT RECOMMENDATION SYSTEM USING MACHINE LEARNING
ENTERTAINMENT CONTENT RECOMMENDATION SYSTEM USING MACHINE LEARNINGIRJET Journal
 

Similaire à Tutorial on Conducting User Experiments in Recommender Systems (20)

Inspectability and Control in Social Recommenders
Inspectability and Control in Social RecommendersInspectability and Control in Social Recommenders
Inspectability and Control in Social Recommenders
 
KW001
KW001KW001
KW001
 
NessPRO Italy on CAST
NessPRO Italy on CASTNessPRO Italy on CAST
NessPRO Italy on CAST
 
Typicality-Based Collaborative Filtering Recommendation
Typicality-Based Collaborative Filtering RecommendationTypicality-Based Collaborative Filtering Recommendation
Typicality-Based Collaborative Filtering Recommendation
 
(2008) Statistical Analysis Framework for Biometric System Interoperability T...
(2008) Statistical Analysis Framework for Biometric System Interoperability T...(2008) Statistical Analysis Framework for Biometric System Interoperability T...
(2008) Statistical Analysis Framework for Biometric System Interoperability T...
 
Metaphors as design points for collaboration 2012
Metaphors as design points for collaboration 2012Metaphors as design points for collaboration 2012
Metaphors as design points for collaboration 2012
 
Title
TitleTitle
Title
 
Chi1417 Richardson
Chi1417 RichardsonChi1417 Richardson
Chi1417 Richardson
 
Typicality based collaborative filtering recommendation
Typicality based collaborative filtering recommendationTypicality based collaborative filtering recommendation
Typicality based collaborative filtering recommendation
 
Safety Management Systems Process Vs Tradition
Safety Management Systems Process Vs TraditionSafety Management Systems Process Vs Tradition
Safety Management Systems Process Vs Tradition
 
A Usability Perspective on Clinical Handover Improvement- CHIPS with Everythi...
A Usability Perspective on Clinical Handover Improvement- CHIPS with Everythi...A Usability Perspective on Clinical Handover Improvement- CHIPS with Everythi...
A Usability Perspective on Clinical Handover Improvement- CHIPS with Everythi...
 
Testing & implementation
Testing & implementationTesting & implementation
Testing & implementation
 
CA Quality Management System
CA Quality Management SystemCA Quality Management System
CA Quality Management System
 
A&D - Introduction to Analysis & Design of Information System
A&D - Introduction to Analysis & Design of Information SystemA&D - Introduction to Analysis & Design of Information System
A&D - Introduction to Analysis & Design of Information System
 
ATI Courses Professional Development Short Course Applied Measurement Engin...
ATI Courses Professional Development Short Course Applied  Measurement  Engin...ATI Courses Professional Development Short Course Applied  Measurement  Engin...
ATI Courses Professional Development Short Course Applied Measurement Engin...
 
A vision on collaborative computation of things for personalized analyses
A vision on collaborative computation of things for personalized analysesA vision on collaborative computation of things for personalized analyses
A vision on collaborative computation of things for personalized analyses
 
UX Research
UX ResearchUX Research
UX Research
 
Fcv core liu
Fcv core liuFcv core liu
Fcv core liu
 
Recommendation System using Machine Learning Techniques
Recommendation System using Machine Learning TechniquesRecommendation System using Machine Learning Techniques
Recommendation System using Machine Learning Techniques
 
ENTERTAINMENT CONTENT RECOMMENDATION SYSTEM USING MACHINE LEARNING
ENTERTAINMENT CONTENT RECOMMENDATION SYSTEM USING MACHINE LEARNINGENTERTAINMENT CONTENT RECOMMENDATION SYSTEM USING MACHINE LEARNING
ENTERTAINMENT CONTENT RECOMMENDATION SYSTEM USING MACHINE LEARNING
 

Plus de Bart Knijnenburg

Profiling Facebook Users' Privacy Behaviors
Profiling Facebook Users' Privacy BehaviorsProfiling Facebook Users' Privacy Behaviors
Profiling Facebook Users' Privacy BehaviorsBart Knijnenburg
 
Information Disclosure Profiles for Segmentation and Recommendation
Information Disclosure Profiles for Segmentation and RecommendationInformation Disclosure Profiles for Segmentation and Recommendation
Information Disclosure Profiles for Segmentation and RecommendationBart Knijnenburg
 
Counteracting the negative effect of form auto-completion on the privacy calc...
Counteracting the negative effect of form auto-completion on the privacy calc...Counteracting the negative effect of form auto-completion on the privacy calc...
Counteracting the negative effect of form auto-completion on the privacy calc...Bart Knijnenburg
 
Simplifying Privacy Decisions: Towards Interactive and Adaptive Solutions
Simplifying Privacy Decisions: Towards Interactive and Adaptive SolutionsSimplifying Privacy Decisions: Towards Interactive and Adaptive Solutions
Simplifying Privacy Decisions: Towards Interactive and Adaptive SolutionsBart Knijnenburg
 
FYI: Communication Style Preferences Underlie Differences in Location-Sharing...
FYI: Communication Style Preferences Underlie Differences in Location-Sharing...FYI: Communication Style Preferences Underlie Differences in Location-Sharing...
FYI: Communication Style Preferences Underlie Differences in Location-Sharing...Bart Knijnenburg
 
Preference-based Location Sharing: Are More Privacy Options Really Better?
Preference-based Location Sharing: Are More Privacy Options Really Better?Preference-based Location Sharing: Are More Privacy Options Really Better?
Preference-based Location Sharing: Are More Privacy Options Really Better?Bart Knijnenburg
 
Big data - A critical appraisal
Big data - A critical appraisalBig data - A critical appraisal
Big data - A critical appraisalBart Knijnenburg
 
Helping Users with Information Disclosure Decisions: Potential for Adaptation...
Helping Users with Information Disclosure Decisions: Potential for Adaptation...Helping Users with Information Disclosure Decisions: Potential for Adaptation...
Helping Users with Information Disclosure Decisions: Potential for Adaptation...Bart Knijnenburg
 
CSCW networked privacy workshop - keynote
CSCW networked privacy workshop - keynoteCSCW networked privacy workshop - keynote
CSCW networked privacy workshop - keynoteBart Knijnenburg
 
Privacy in Mobile Personalized Systems - The Effect of Disclosure Justifications
Privacy in Mobile Personalized Systems - The Effect of Disclosure JustificationsPrivacy in Mobile Personalized Systems - The Effect of Disclosure Justifications
Privacy in Mobile Personalized Systems - The Effect of Disclosure JustificationsBart Knijnenburg
 
Using latent features diversification to reduce choice difficulty in recommen...
Using latent features diversification to reduce choice difficulty in recommen...Using latent features diversification to reduce choice difficulty in recommen...
Using latent features diversification to reduce choice difficulty in recommen...Bart Knijnenburg
 
Recsys2011 presentation "Each to his own - How Different Users Call for Diffe...
Recsys2011 presentation "Each to his own - How Different Users Call for Diffe...Recsys2011 presentation "Each to his own - How Different Users Call for Diffe...
Recsys2011 presentation "Each to his own - How Different Users Call for Diffe...Bart Knijnenburg
 
Recommendations and Feedback - The user-experience of a recommender system
Recommendations and Feedback - The user-experience of a recommender systemRecommendations and Feedback - The user-experience of a recommender system
Recommendations and Feedback - The user-experience of a recommender systemBart Knijnenburg
 

Plus de Bart Knijnenburg (14)

Profiling Facebook Users' Privacy Behaviors
Profiling Facebook Users' Privacy BehaviorsProfiling Facebook Users' Privacy Behaviors
Profiling Facebook Users' Privacy Behaviors
 
Information Disclosure Profiles for Segmentation and Recommendation
Information Disclosure Profiles for Segmentation and RecommendationInformation Disclosure Profiles for Segmentation and Recommendation
Information Disclosure Profiles for Segmentation and Recommendation
 
Counteracting the negative effect of form auto-completion on the privacy calc...
Counteracting the negative effect of form auto-completion on the privacy calc...Counteracting the negative effect of form auto-completion on the privacy calc...
Counteracting the negative effect of form auto-completion on the privacy calc...
 
Simplifying Privacy Decisions: Towards Interactive and Adaptive Solutions
Simplifying Privacy Decisions: Towards Interactive and Adaptive SolutionsSimplifying Privacy Decisions: Towards Interactive and Adaptive Solutions
Simplifying Privacy Decisions: Towards Interactive and Adaptive Solutions
 
FYI: Communication Style Preferences Underlie Differences in Location-Sharing...
FYI: Communication Style Preferences Underlie Differences in Location-Sharing...FYI: Communication Style Preferences Underlie Differences in Location-Sharing...
FYI: Communication Style Preferences Underlie Differences in Location-Sharing...
 
Preference-based Location Sharing: Are More Privacy Options Really Better?
Preference-based Location Sharing: Are More Privacy Options Really Better?Preference-based Location Sharing: Are More Privacy Options Really Better?
Preference-based Location Sharing: Are More Privacy Options Really Better?
 
Big data - A critical appraisal
Big data - A critical appraisalBig data - A critical appraisal
Big data - A critical appraisal
 
Helping Users with Information Disclosure Decisions: Potential for Adaptation...
Helping Users with Information Disclosure Decisions: Potential for Adaptation...Helping Users with Information Disclosure Decisions: Potential for Adaptation...
Helping Users with Information Disclosure Decisions: Potential for Adaptation...
 
CSCW networked privacy workshop - keynote
CSCW networked privacy workshop - keynoteCSCW networked privacy workshop - keynote
CSCW networked privacy workshop - keynote
 
Hcsd talk ibm
Hcsd talk ibmHcsd talk ibm
Hcsd talk ibm
 
Privacy in Mobile Personalized Systems - The Effect of Disclosure Justifications
Privacy in Mobile Personalized Systems - The Effect of Disclosure JustificationsPrivacy in Mobile Personalized Systems - The Effect of Disclosure Justifications
Privacy in Mobile Personalized Systems - The Effect of Disclosure Justifications
 
Using latent features diversification to reduce choice difficulty in recommen...
Using latent features diversification to reduce choice difficulty in recommen...Using latent features diversification to reduce choice difficulty in recommen...
Using latent features diversification to reduce choice difficulty in recommen...
 
Recsys2011 presentation "Each to his own - How Different Users Call for Diffe...
Recsys2011 presentation "Each to his own - How Different Users Call for Diffe...Recsys2011 presentation "Each to his own - How Different Users Call for Diffe...
Recsys2011 presentation "Each to his own - How Different Users Call for Diffe...
 
Recommendations and Feedback - The user-experience of a recommender system
Recommendations and Feedback - The user-experience of a recommender systemRecommendations and Feedback - The user-experience of a recommender system
Recommendations and Feedback - The user-experience of a recommender system
 

Dernier

DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxMichelleTuguinay1
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfPrerana Jadhav
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDhatriParmar
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfVanessa Camilleri
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptxmary850239
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseCeline George
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQuiz Club NITW
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdfMr Bounab Samir
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptxmary850239
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptxDhatriParmar
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...DhatriParmar
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...Nguyen Thanh Tu Collection
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxAnupam32727
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxlancelewisportillo
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operationalssuser3e220a
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Association for Project Management
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvRicaMaeCastro1
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6Vanessa Camilleri
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Projectjordimapav
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfChristalin Nelson
 

Dernier (20)

DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptxDIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
DIFFERENT BASKETRY IN THE PHILIPPINES PPT.pptx
 
Narcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdfNarcotic and Non Narcotic Analgesic..pdf
Narcotic and Non Narcotic Analgesic..pdf
 
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptxDecoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
Decoding the Tweet _ Practical Criticism in the Age of Hashtag.pptx
 
ICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdfICS2208 Lecture6 Notes for SL spaces.pdf
ICS2208 Lecture6 Notes for SL spaces.pdf
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
How to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 DatabaseHow to Make a Duplicate of Your Odoo 17 Database
How to Make a Duplicate of Your Odoo 17 Database
 
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITWQ-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
Q-Factor HISPOL Quiz-6th April 2024, Quiz Club NITW
 
MS4 level being good citizen -imperative- (1) (1).pdf
MS4 level   being good citizen -imperative- (1) (1).pdfMS4 level   being good citizen -imperative- (1) (1).pdf
MS4 level being good citizen -imperative- (1) (1).pdf
 
4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx4.11.24 Mass Incarceration and the New Jim Crow.pptx
4.11.24 Mass Incarceration and the New Jim Crow.pptx
 
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
Unraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptxUnraveling Hypertext_ Analyzing  Postmodern Elements in  Literature.pptx
Unraveling Hypertext_ Analyzing Postmodern Elements in Literature.pptx
 
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
Blowin' in the Wind of Caste_ Bob Dylan's Song as a Catalyst for Social Justi...
 
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
31 ĐỀ THI THỬ VÀO LỚP 10 - TIẾNG ANH - FORM MỚI 2025 - 40 CÂU HỎI - BÙI VĂN V...
 
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptxCLASSIFICATION OF ANTI - CANCER DRUGS.pptx
CLASSIFICATION OF ANTI - CANCER DRUGS.pptx
 
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptxQ4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
Q4-PPT-Music9_Lesson-1-Romantic-Opera.pptx
 
Expanded definition: technical and operational
Expanded definition: technical and operationalExpanded definition: technical and operational
Expanded definition: technical and operational
 
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
Team Lead Succeed – Helping you and your team achieve high-performance teamwo...
 
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnvESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
ESP 4-EDITED.pdfmmcncncncmcmmnmnmncnmncmnnjvnnv
 
ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6ICS 2208 Lecture Slide Notes for Topic 6
ICS 2208 Lecture Slide Notes for Topic 6
 
ClimART Action | eTwinning Project
ClimART Action    |    eTwinning ProjectClimART Action    |    eTwinning Project
ClimART Action | eTwinning Project
 
Indexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdfIndexing Structures in Database Management system.pdf
Indexing Structures in Database Management system.pdf
 

Tutorial on Conducting User Experiments in Recommender Systems

  • 1. User Experiments in Recommender Systems
  • 2.
  • 4. Introduction Bart Knijnenburg - Current: UC Irvine, Informatics, PhD candidate - TU Eindhoven, Human Technology Interaction, Researcher & Teacher, Master Student - Carnegie Mellon University, Human-Computer Interaction, Master Student INFORMATION AND COMPUTER SCIENCES
  • 5. Introduction Bart Knijnenburg - First user-centric evaluation framework (UMUAI, 2012) - Founder of UCERSTI - workshop on user-centric evaluation of recommender systems and their interfaces - Statistics expert - conference + journal reviewer, research methods teacher, SEM advisor
  • 6. Introduction
  • 7. Introduction “What is a user experiment?” “A user experiment is a scientific method to investigate how and why system aspects influence the users’ experience and behavior.”
  • 8. Introduction My goal: Teach how to scientifically evaluate recommender systems using a user-centric approach. How? User experiments! My approach: - I will provide a broad theoretical framework - I will cover every step in conducting a user experiment - I will teach the “statistics of the 21st century”
  • 9. Introduction Welcome everyone! Evaluation framework A theoretical foundation for user-centric evaluation Hypotheses What do I want to find out? Participants Population and sampling Testing A vs. B Experimental manipulations Measurement Measuring subjective valuations Analysis Statistical evaluation of the results
  • 10.
  • 11.
  • 12. Evaluation framework A theoretical foundation for user-centric evaluation
  • 13. Framework Offline evaluations may not give the same outcome as online evaluations (Cosley et al., 2002; McNee et al., 2002). Solution: Test with real users
  • 14. Framework [diagram] System (algorithm) → Interaction (rating)
  • 15. Framework Higher accuracy does not always mean higher satisfaction (McNee et al., 2006). Solution: Consider other behaviors
  • 16. Framework [diagram] System (algorithm) → Interaction (rating, consumption, retention)
  • 17. Framework “The algorithm counts for only 5% of the relevance of a recommender system” Francisco Martin - RecSys 2009 keynote. Solution: test those other aspects
  • 18. Framework [diagram] System (algorithm, interaction, presentation) → Interaction (rating, consumption, retention)
  • 19. Framework “Testing a recommender against a random videoclip system, the number of clicked clips and total viewing time went down!”
  • 20. Framework [path model] personalized recommendations (OSA) → + perceived recommendation quality (SSA) → + perceived system effectiveness (EXP) and + choice satisfaction (EXP); personalized recommendations → − total number of clips clicked and − total viewing time; perceived system effectiveness → + number of clips watched from beginning to end. Knijnenburg et al.: “Receiving Recommendations and Providing Feedback”, EC-Web 2010
  • 21. Framework Behavior is hard to interpret: the relationship between behavior and satisfaction is not always trivial. User experience is a better predictor of long-term retention. With behavior only, you will need to run for a long time; questionnaire data is more robust, so fewer participants are needed.
  • 22. Framework Measure subjective valuations with questionnaires (perception and experience). Triangulate these data with behavior: ground subjective valuations in observable actions, and explain observable actions with subjective valuations. Measure every step in your theory to create a chain of mediating variables.
  • 23. Framework [diagram] System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention)
  • 24. Framework Personal and situational characteristics may have an important impact (Adomavicius et al., 2005; Knijnenburg et al., 2012). Solution: measure those as well
  • 25. Framework [diagram] Situational Characteristics (routine, system trust, choice goal) and Personal Characteristics (gender, privacy, expertise) surround the chain: System (algorithm, interaction, presentation) → Perception (usability, quality, appeal) → Experience (system, process, outcome) → Interaction (rating, consumption, retention)
  • 26. Framework Objective System Aspects (OSA): these are the manipulations - visual / interaction design - recommender algorithm - presentation of recommendations - additional features
  • 27. Framework User Experience (EXP): different aspects may influence different things - interface → system evaluations - preference elicitation method → choice process - algorithm → quality of the final choice
  • 28. Framework Subjective System Aspects (SSA): link OSA to EXP (mediation). They increase the robustness of the effects of OSAs on EXP, and explain how and why OSAs affect EXP.
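The mediation idea on this slide (OSA → SSA → EXP) can be sketched with ordinary regression. This is a minimal, illustrative sketch on simulated data - the variable names (`condition`, `perceived_quality`, `satisfaction`) and effect sizes are invented, not taken from the tutorial, and a real analysis would use a structural equation model with significance tests.

```python
# Regression-based mediation check (Baron & Kenny style) on simulated data:
# does perceived quality (SSA) carry the effect of a manipulated condition
# (OSA) on satisfaction (EXP)?
import numpy as np

rng = np.random.default_rng(42)
n = 500
condition = rng.integers(0, 2, n).astype(float)              # OSA: 0 = baseline, 1 = new algorithm
perceived_quality = 0.8 * condition + rng.normal(0, 1, n)    # SSA (simulated questionnaire score)
satisfaction = 0.6 * perceived_quality + 0.1 * condition + rng.normal(0, 1, n)  # EXP

def ols(y, *xs):
    """Return OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

c_total = ols(satisfaction, condition)[1]        # total effect of OSA on EXP
a = ols(perceived_quality, condition)[1]         # OSA -> SSA path
coefs = ols(satisfaction, perceived_quality, condition)
b, c_direct = coefs[1], coefs[2]                 # SSA -> EXP path, direct OSA -> EXP path
indirect = a * b                                 # effect of OSA carried through the SSA

print(f"total={c_total:.2f} direct={c_direct:.2f} indirect={indirect:.2f}")
```

For linear OLS the total effect decomposes exactly as total = direct + indirect; a large indirect relative to direct is what "the SSA mediates the OSA's effect" means here.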
  • 29. Framework Situational Characteristics routine system trust choice goal System Perception Experience Interaction Personal and Situational characteristics (PC and SC) algorithm usability system rating Effect of specific user and task interaction quality process consumption Beyond the influence of the system presentation appeal outcome retention Here used for evaluation, not for augmenting algorithms Personal Characteristics gender privacy expertise INFORMATION AND COMPUTER SCIENCES
  • 30. Framework Situational Characteristics routine system trust choice goal Interaction (INT) System Perception Experience Interaction Observable behavior algorithm usability system rating - browsing, viewing, log-ins interaction quality process consumption presentation of evaluation Final step appeal outcome retention Grounds EXP in “real” data Personal Characteristics gender privacy expertise INFORMATION AND COMPUTER SCIENCES
  • 31. Framework Situational Characteristics routine system trust choice goal System Perception Experience Interaction algorithm usability system rating interaction quality process consumption presentation appeal outcome retention Personal Characteristics gender privacy expertise INFORMATION AND COMPUTER SCIENCES
• 32. Framework Has suggestions for measurement scales, e.g.: How can we measure something like “satisfaction”? Provides a good starting point for causal relations, e.g.: How and why do certain system aspects influence the user experience? Useful for integrating existing work, e.g.: How do recommendation list length, diversity and presentation each have an influence on user experience?
• 33. Framework “Statisticians, like artists, have the bad habit of falling in love with their models.” George Box
• 34. Evaluation framework: use our framework as a starting point for user-centric research; it helps in measurement and causal relations. Recap: test your recommender system with real users; test aspects other than the algorithm; measure behaviors other than ratings; measure subjective valuations with questionnaires; take situational and personal characteristics into account.
  • 37. Hypotheses What do I want to find out?
• 38. Hypotheses “Can you test if my recommender system is good?”
• 39. Hypotheses What does good mean? - Recommendation accuracy? - Recommendation quality? - System usability? - System satisfaction? We need to define measures
• 40. Hypotheses “Can you test if the user interface of my recommender system scores high on this usability scale?”
• 41. Hypotheses What does high mean? Is 3.6 out of 5 on a 5-point scale “high”? What are 1 and 5? What is the difference between 3.6 and 3.7? We need to compare the UI against something
• 42. Hypotheses “Can you test if the UI of my recommender system scores high on this usability scale compared to this other system?”
• 43. Hypotheses (Screenshots: my new travel recommender vs. Travelocity.)
• 44. Hypotheses Say we find that it scores higher on usability... why does it? - different date-picker method - different layout - different number of options available Apply the concept of ceteris paribus to get rid of confounding variables Keep everything the same, except for the thing you want to test (the manipulation) Any difference can be attributed to the manipulation
• 45. Hypotheses (Screenshots: my new travel recommender vs. the previous version, which had too many options.)
• 46. Hypotheses To learn something from the study, we need a theory behind the effect For industry, this may suggest further improvements to the system For research, this makes the work generalizable Measure mediating variables Measure understandability (and a number of other concepts) as well Find out how they mediate the effect on usability
• 47. Hypotheses An example: We compared three recommender systems Three different algorithms Ceteris paribus!
• 48. Hypotheses Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012 The mediating variables show the entire story
• 49. Hypotheses (Path diagram: OSA: Matrix Factorization recommender with explicit feedback (MF-E) and with implicit feedback (MF-I), each versus a generally most popular baseline (GMP) → SSA: perceived recommendation variety and perceived recommendation quality → EXP: perceived system effectiveness.) Knijnenburg et al.: “Explaining the user experience of recommender systems”, UMUAI 2012
• 50. Hypotheses “An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.” John Tukey
• 51. Hypotheses What do I want to find out? Recap: define measures; compare system aspects against each other; get rid of confounding variables; apply the concept of ceteris paribus; look for a theory behind the found effects; measure mediating variables to explain the effects.
• 55. Participants “We are testing our recommender system on our colleagues/students.” -or- “We posted the study link on Facebook/Twitter.”
• 56. Participants Are your connections, colleagues, or students typical users of your system? - They may have more knowledge of the field of study - They may feel more excited about the system - They may know what the experiment is about - They probably want to please you You should sample from your target population An unbiased sample of users of your system
• 57. Participants “We only use data from frequent users.”
• 58. Participants What are the consequences of limiting your scope? You run the risk of catering to that subset of users only You cannot make generalizable claims about users For scientific experiments, the target population may be unrestricted Especially when your study is more about human nature than about a specific system
• 59. Participants “We tested our system with 10 users.”
• 60. Participants Is this a decent sample size? Can you attain statistically significant results? Does it provide a wide enough inductive base? Anticipated effect size vs. needed sample size: small → 385, medium → 54, large → 25. Make sure your sample is large enough; 40 is typically the bare minimum.
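The needed sample size can be estimated up front with an a-priori power analysis. A minimal sketch using the normal approximation for a two-group comparison (exact power tables, such as the one summarized on this slide, give slightly different numbers):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sample comparison
    (two-tailed test, effect size as Cohen's d)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Cohen's conventional effect sizes: small .2, medium .5, large .8
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(label, n_per_group(d))  # small 393, medium 63, large 25
```

The practical lesson is the same as the slide's: small effects require an order of magnitude more participants than large ones.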
• 61. Participants Population and sampling. Recap: sample from your target population; the target population may be unrestricted; make sure your sample is large enough.
  • 64. Testing A vs. B Experimental manipulations
• 65. Manipulations “Are our users more satisfied if our news recommender shows only recent items?”
• 66. Manipulations Proposed system or treatment: Filter out any items > 1 month old What should be my baseline? - Filter out items < 1 month old? - Unfiltered recommendations? - Filter out items > 3 months old? You should test against a reasonable alternative “Absence of evidence is not evidence of absence”
• 67. Manipulations “The first 40 participants will get the baseline, the next 40 will get the treatment.”
• 68. Manipulations These two groups cannot be expected to be similar! Some news item may affect one group but not the other Randomize the assignment of conditions to participants Randomization neutralizes (but doesn’t eliminate) participant variation
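Randomized, balanced assignment is easy to implement. A minimal sketch (shuffle the participants, then cycle through the conditions):

```python
import random

def assign_conditions(participant_ids, conditions, seed=None):
    """Randomly assign participants to conditions with (near-)equal
    group sizes, so arrival order cannot confound the results."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(ids)}

assignment = assign_conditions(range(80), ["baseline", "treatment"], seed=42)
sizes = [list(assignment.values()).count(c) for c in ["baseline", "treatment"]]
print(sizes)  # [40, 40]
```

Unlike the "first 40 / next 40" scheme, a participant's position in time no longer determines their condition.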
• 69. Manipulations Between-subjects design: 100 participants. Randomly assign half the participants (50) to A, half (50) to B. + Realistic interaction + Manipulation hidden from user − Many participants needed
• 70. Manipulations Within-subjects design: 50 participants. Give participants A first, then B. + Removes subject variability − Participant may see the manipulation − Spill-over effect
• 71. Manipulations Within-subjects design: 50 participants. Show participants A and B simultaneously. + Removes subject variability + Participants can compare conditions − Not a realistic interaction
• 72. Manipulations Should I do within-subjects or between-subjects? Use between-subjects designs for user experience Closer to a real-world usage situation No unwanted spill-over effects Use within-subjects designs for psychological research Effects are typically smaller Nice to control between-subjects variability
• 73. Manipulations You can test multiple manipulations in a factorial design, e.g. list length (5, 10, 20 items) × diversity (low, high) gives six conditions: 5+low, 5+high, 10+low, 10+high, 20+low, 20+high. The more conditions, the more participants you will need!
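A full factorial design is just the cross product of the factor levels, which is exactly what `itertools.product` computes:

```python
from itertools import product

# Two manipulations: list length (3 levels) x diversification (2 levels)
list_lengths = [5, 10, 20]
diversification = ["low", "high"]

# Every combination of the two factors becomes one experimental condition
conditions = list(product(list_lengths, diversification))
print(len(conditions))  # 6
```

With between-subjects assignment and, say, 40 participants per condition, this design already calls for 240 participants.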
• 74. Manipulations Let’s test an algorithm against random recommendations What should we tell the participant? Beware of the Placebo effect! Remember: ceteris paribus! Other option: manipulate the message (factorial design)
• 75. Manipulations “We were demonstrating our new recommender to a client. They were amazed by how well it predicted their preferences!” “Later we found out that we forgot to activate the algorithm: the system was giving completely random recommendations.” (anonymized)
• 76. Testing A vs. B Experimental manipulations. Recap: test against a reasonable alternative; randomize assignment of conditions; use between-subjects for user experience; use within-subjects for psychological research; you can test more than two conditions; you can test multiple manipulations in a factorial design.
• 80. Measurement “To measure satisfaction, we asked users whether they liked the system (on a 5-point rating scale).”
• 81. Measurement Does the question mean the same to everyone? - John likes the system because it is convenient - Mary likes the system because it is easy to use - Dave likes it because the recommendations are good A single question is not enough to establish content validity We need a multi-item measurement scale
• 82. Measurement Perceived system effectiveness: - Using the system is annoying - The system is useful - Using the system makes me happy - Overall, I am satisfied with the system - I would recommend the system to others - I would quickly abandon using this system
• 83. Measurement Use both positively and negatively phrased items - They make the questionnaire less “leading” - They help filter out bad participants - They explore the “flip-side” of the scale The word “not” is easily overlooked Bad: “The recommendations were not very novel” Good: “The recommendations felt outdated”
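Negatively phrased items must be reverse-scored before the scale is aggregated, so that a higher number always means a more positive evaluation. A minimal sketch with hypothetical answers to the effectiveness items of slide 82:

```python
def reverse_score(answer, scale_min=1, scale_max=5):
    """Recode a negatively phrased item so that higher = more positive."""
    return scale_min + scale_max - answer

# One participant's hypothetical answers to the six items (1-5 Likert);
# True marks the negatively phrased items ("annoying" and "abandon")
answers = [(2, True), (4, False), (4, False), (5, False), (4, False), (1, True)]
scored = [reverse_score(a) if negative else a for a, negative in answers]
print(scored)  # [4, 4, 4, 5, 4, 5]
```

Reverse-scoring is also where bad participants show up: someone who answers 5 on both "I am satisfied" and "I would quickly abandon this system" was probably not paying attention.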
• 84. Measurement Choose simple over specialized words Participants may have no idea they are using a “recommender system” Avoid double-barreled questions Bad: “The recommendations were relevant and fun”
• 85. Measurement “We asked users ten 5-point scale questions and summed the answers.”
• 86. Measurement Is the scale really measuring a single thing? - 5 items measure satisfaction, the other 5 convenience - The items are not related enough to make a reliable scale Are two scales really measuring different things? - They are so closely related that they actually measure the same thing We need to establish convergent and discriminant validity This makes sure the scales are unidimensional
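Before running the full factor analysis described on the next slide, a quick internal-consistency check such as Cronbach's alpha can flag scales whose items do not hang together. A sketch in plain Python with made-up data (alpha above roughly .7 or .8 is conventionally considered acceptable):

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one list of participants' answers per questionnaire item."""
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)
    totals = [sum(person) for person in zip(*items)]  # per-participant sums
    return k / (k - 1) * (1 - item_vars / pvariance(totals))

# Hypothetical answers of five participants to three related items
items = [[4, 5, 3, 4, 2],
         [4, 4, 3, 5, 2],
         [5, 5, 2, 4, 3]]
print(round(cronbach_alpha(items), 2))  # 0.89
```

Alpha only gauges reliability; it cannot establish convergent and discriminant validity across scales, which is why the factor analysis is still needed.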
• 87. Measurement Solution: factor analysis - Define latent factors, specify how items “load” on them - Factor analysis will determine how well the items “fit” - It will give you suggestions for improvement Benefits of factor analysis: - Establishes convergent and discriminant validity - Outcome is a normally distributed measurement scale - The scale captures the “shared essence” of the items
• 88. Measurement (Factor model diagram, items loading on factors: var1–var6 on perceived recommendation variety, qual1–qual4 on perceived recommendation quality, diff1–diff5 on choice difficulty, exp1–exp3 on movie expertise, and sat1–sat7 on choice satisfaction.)
• 89. Measurement (The same factor model, annotated with diagnostics: several items show low communality, cross-loadings on quality, variety, and satisfaction, or high residual correlations, e.g. with qual2 and with var1, making them candidates for removal.)
• 90. Measurement (The pruned factor model with validity statistics per factor: AVE values of 0.622, 0.756, 0.435 (!), 0.793, and 0.655; convergent and discriminant validity are checked by comparing each factor’s sqrt(AVE) — 0.789, 0.870, 0.659, 0.891, 0.809 — against its largest correlation with any other factor — 0.491, 0.709, −0.438, 0.234, 0.709.)
• 91. Measurement “Great! Can I learn how to do this myself?” Check the video tutorials at www.statmodel.com
• 92. Measurement Measuring subjective valuations. Recap: establish content validity with multi-item scales; follow the general principles for good questionnaire items; establish convergent and discriminant validity; use factor analysis.
• 96. Analysis Manipulation → perception: Do these two algorithms lead to a different level of perceived quality? Use a t-test. (Bar chart: perceived quality for conditions A and B.)
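A minimal sketch of the t-test on hypothetical perceived-quality scores; this version computes Welch's t statistic (which does not assume equal variances), while obtaining the exact p-value additionally requires the t distribution, e.g. from a stats package:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    return (mean(a) - mean(b)) / sqrt(va + vb)

# Hypothetical perceived-quality scores under conditions A and B
quality_a = [3.2, 3.8, 4.1, 3.5, 4.0, 3.7, 3.9, 4.2]
quality_b = [2.9, 3.1, 3.4, 2.8, 3.3, 3.0, 3.2, 2.7]
t = welch_t(quality_a, quality_b)
print(round(t, 2))  # 5.17, well above ~2: unlikely to be chance
```

In practice one would of course report the exact p-value and degrees of freedom rather than eyeballing the statistic.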
• 97. Analysis Perception → experience: Does perceived quality influence system effectiveness? Use linear regression. (Scatter plot: system effectiveness against recommendation quality.)
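The regression itself is a one-liner in any stats package; as a sketch of what it computes, here is ordinary least squares for a single predictor on hypothetical data:

```python
from statistics import mean

def ols(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) \
        / sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Hypothetical data: perceived quality -> system effectiveness
quality       = [1, 2, 2, 3, 4, 4, 5]
effectiveness = [1.2, 2.1, 1.9, 3.2, 3.8, 4.1, 4.9]
a, b = ols(quality, effectiveness)
print(round(b, 2))  # positive slope: higher quality, higher effectiveness
```

The slope b is the estimated change in effectiveness per one-point increase in perceived quality.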
• 98. Analysis Two manipulations → perception: What is the combined effect of list diversity and list length on perceived recommendation quality? Use factorial ANOVA. (Bar chart: perceived quality for low vs. high diversification at 5, 10, and 20 items.) Willemsen et al.: “Not just more of the same”, submitted to TiiS
• 99. Analysis Or use only one method: structural equation modeling, the statistical method of the 21st century. It combines factor analysis and path models: turn items into factors, then test causal relations between the factors. (Diagrams: the factor model from slide 88 and a fitted structural model for the Top-5/Top-20/Lin-20 study, with path coefficients and p-values.)
• 100. Analysis Very simple reporting: - Report overall fit + effect of each causal relation - A path model that explains the effects (Diagram: the structural equation model with path coefficients, standard errors, and p-values.)
• 101. Analysis Example from Bollen et al.: “Choice Overload” What is the effect of the number of recommendations? What about the composition of the recommendation list? Tested with 3 conditions: - Top-5: recommendations ranked 1–5 - Top-20: recommendations ranked 1–20 - Lin-20: recommendations ranked 1–5, plus 15 items taken linearly from further down the ranking (99, 199, 299, …, 1499)
• 102. Analysis Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010
• 103. Analysis (Path model summary for the choice overload study: the Top-20 vs Top-5 and Lin-20 vs Top-5 manipulations affect perceived recommendation variety (measured by var1–var6, not shown here) and perceived recommendation quality, which in turn affect choice difficulty and choice satisfaction, showing “inconsistent mediation”, “full mediation”, a trade-off between quality and difficulty, and an additional effect of movie expertise.) Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010
• 104. Analysis (The same path model with full parameter estimates: path coefficients with standard errors and p-values, e.g. .455 (.211), p < .05 and −.804 (.230), p < .001.) Bollen et al.: “Understanding Choice Overload in Recommender Systems”, RecSys 2010
• 105. Analysis “Great! Can I learn how to do this myself?” Check the video tutorials at www.statmodel.com
• 106. Analysis Homoscedasticity is a prerequisite for linear stats. Not true for count data, time, etc. Outcomes need to be unbounded and continuous. Not true for binary answers, counts, etc. (Scatter plot: interaction time in minutes against level of commitment, with variance increasing along the fitted line.)
• 107. Analysis Use the correct methods for non-normal data - Binary data: logit/probit regression - 5- or 7-point scales: ordered logit/probit regression - Count data: Poisson regression Don’t use “distribution-free” stats They were invented when calculations were done by hand Do use structural equation modeling MPlus can do all these things “automatically”
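To illustrate the logit case: a tiny logistic regression with one predictor, fitted by gradient ascent on the log-likelihood. This is a didactic sketch on hypothetical data; real analyses should use a stats package (or MPlus, as the slide suggests):

```python
from math import exp

def fit_logit(x, y, lr=0.1, steps=5000):
    """Logistic regression P(y=1) = 1/(1+exp(-(a+b*x))),
    fitted by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        grad_a = sum(yi - 1 / (1 + exp(-(a + b * xi)))
                     for xi, yi in zip(x, y)) / n
        grad_b = sum((yi - 1 / (1 + exp(-(a + b * xi)))) * xi
                     for xi, yi in zip(x, y)) / n
        a, b = a + lr * grad_a, b + lr * grad_b
    return a, b

# Hypothetical binary outcome (clicked a recommendation?) vs. its rating
rating  = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
clicked = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
a, b = fit_logit(rating, clicked)
print(b > 0)  # positive coefficient: higher-rated items get more clicks
```

A linear regression on this 0/1 outcome would happily predict probabilities below 0 or above 1; the logit link keeps predictions inside [0, 1].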
• 108. Analysis Standard regression requires uncorrelated errors. Not true for repeated measures, e.g. “rate these 5 items”: there will be a user bias (and maybe an item bias). Golden rule: data points should be independent. (Scatter plot: number of followed recommendations against level of commitment, with several observations per user.)
• 109. Analysis Two ways to account for repeated measures: - Define a random intercept for each user - Impose an error covariance structure Again, in MPlus this is easy. (The same scatter plot, now modeled with per-user differences accounted for.)
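Random intercepts require a mixed-effects model, but the user bias they absorb is easy to see: each user has their own baseline. A crude stand-in, on hypothetical data, is per-user centering (subtracting each user's mean); a real analysis would fit the random intercept instead:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical repeated measures: five ratings from each of two users
observations = [("u1", 4), ("u1", 5), ("u1", 4), ("u1", 5), ("u1", 4),
                ("u2", 1), ("u2", 2), ("u2", 1), ("u2", 2), ("u2", 2)]

by_user = defaultdict(list)
for user, rating in observations:
    by_user[user].append(rating)
user_mean = {u: mean(r) for u, r in by_user.items()}

# Per-user centering strips out each user's bias before further analysis
centered = [(u, r - user_mean[u]) for u, r in observations]
print(user_mean)  # u1 rates ~4.4 on average, u2 only ~1.6
```

Treating these ten ratings as ten independent data points would mostly measure the difference between the two users, not the effect of interest.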
• 110. Analysis A manipulation only causes things; for all other variables, the causal direction must come from: - Common sense - Psych literature - Evaluation frameworks Example: privacy study. (Path model: disclosure help (R² = .302), perceived privacy threat (R² = .565), trust in the company (R² = .662), satisfaction with the system (R² = .674).) Knijnenburg & Kobsa: “Making Decisions about Privacy”
• 111. Analysis “All models are wrong, but some are useful.” George Box
• 112. Analysis Statistical evaluation of the results. Recap: use structural equation models; use correct methods for non-normal data; use correct methods for repeated measures; use manipulations and theory to make inferences about causality.
  • 115. Introduction User experiments: user-centric evaluation of recommender systems Evaluation framework Use our framework as a starting point for user experiments Hypotheses Construct a measurable theory behind the expected effect Participants Select a large enough sample from your target population Testing A vs. B Assign users to different versions of a system aspect, ceteris paribus Measurement Use factor analysis and follow the principles for good questionnaires Analysis Use structural equation models to test causal models
  • 117. “It is the mark of a truly intelligent person to be moved by statistics.” George Bernard Shaw
• 118. Resources User-centric evaluation Knijnenburg, B.P., Willemsen, M.C., Gantner, Z., Soncu, H., Newell, C.: Explaining the User Experience of Recommender Systems. UMUAI 2012. Knijnenburg, B.P., Willemsen, M.C., Kobsa, A.: A Pragmatic Procedure to Support the User-Centric Evaluation of Recommender Systems. RecSys 2011. Choice overload Willemsen, M.C., Graus, M.P., Knijnenburg, B.P., Bollen, D.: Not just more of the same: Preventing Choice Overload in Recommender Systems by Offering Small Diversified Sets. Submitted to TiiS. Bollen, D.G.F.M., Knijnenburg, B.P., Willemsen, M.C., Graus, M.P.: Understanding Choice Overload in Recommender Systems. RecSys 2010.
• 119. Resources Preference elicitation methods Knijnenburg, B.P., Reijmer, N.J.M., Willemsen, M.C.: Each to His Own: How Different Users Call for Different Interaction Methods in Recommender Systems. RecSys 2011. Knijnenburg, B.P., Willemsen, M.C.: The Effect of Preference Elicitation Methods on the User Experience of a Recommender System. CHI 2010. Knijnenburg, B.P., Willemsen, M.C.: Understanding the effect of adaptive preference elicitation methods on user satisfaction of a recommender system. RecSys 2009. Social recommenders Knijnenburg, B.P., Bostandjiev, S., O'Donovan, J., Kobsa, A.: Inspectability and Control in Social Recommender Systems. RecSys 2012.
• 120. Resources User feedback and privacy Knijnenburg, B.P., Kobsa, A.: Making Decisions about Privacy: Information Disclosure in Context-Aware Recommender Systems. Submitted to TiiS. Knijnenburg, B.P., Willemsen, M.C., Hirtbach, S.: Getting Recommendations and Providing Feedback: The User-Experience of a Recommender System. EC-Web 2010. Statistics books Agresti, A.: An Introduction to Categorical Data Analysis. 2nd ed. 2007. Fitzmaurice, G.M., Laird, N.M., Ware, J.H.: Applied Longitudinal Analysis. 2004. Kline, R.B.: Principles and Practice of Structural Equation Modeling. 3rd ed. 2010. Online tutorial videos at www.statmodel.com
• 121. Resources Questions? Suggestions? Collaboration proposals? Contact me! Contact info E: bart.k@uci.edu W: www.usabart.nl T: @usabart