Talk at Evaluation, SummerPIT 2019, Aarhus University, 15th August 2019
https://alandix.com/academic/talks/PIT-2019-validation-and-mechanism/
Sometimes evaluation is straightforward. Perhaps our goal is to create a system in a well-understood environment that is fastest to use or has the fewest errors. In this case, and if we believe design choices are effectively independent, we can run a lab or in-situ study to compare design alternatives. However, many things do not fit into this easy-to-evaluate category. Sometimes our goals are more diffuse or long term: sustainability, behavioural change, improving education. Sometimes the thing we wish to 'evaluate' is 'generative', such as toolkits or frameworks used by developers or designers to create systems that are then used by others. In these cases simple post-hoc 'try it and measure it' approaches to evaluation fail, or at best give partial results. However, post-hoc evaluation is only one way to validate work – data (quantitative or qualitative) should be combined with an understanding of mechanism, of how things work, in order to justify, generalise and innovate.
Validation and mechanism: exploring the limits of evaluation
1. Summer PIT 2019
Validation and Mechanism
exploring the limits of evaluation
Alan Dix
http://alandix.com/academic/talks/PIT-2019-validation-and-mechanism/
7. easy questions
How fast do people recognise a menu option?
Is product A easier to learn than product B?
even then …
individuals or average
better for whom?
WEIRD …
8. WEIRD people
Henrich J, Heine S, Norenzayan A (2010). The weirdest people in the world?
Behavioral and Brain Sciences, 33(2-3):61–83; discussion 83–135.
doi:10.1017/S0140525X0999152X
Western, Educated, Industrialized,
Rich, and Democratic
14. purpose
Two – now three – types of evaluation

  type            purpose                stage
  formative       improve a design       development
  summative       say "this is good"     contractual / sales
  investigative   gain understanding     user research / big changes
15. exploration / formative
– find any interesting issues
– stats about deciding priorities
validation / summative
– exhaustive: find all problems/issues
– verifying: is hypothesis true, does system work
– mensuration: how good, how prevalent
explanation / investigative
– matching qualitative/quantitative, small/large samples
16. are five users enough?
original work
Nielsen & Landauer (1993) about iterative process
not summative – not for stats!
how many?
to find enough to do in next development cycle
depends on size of project and complexity
nowadays with cheap development maybe n=1
but always more in next cycle
N.B. later work on saturation
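For concreteness, the problem-discovery model usually cited from Nielsen & Landauer (1993) can be sketched as below; the per-user detection probability of 0.31 is the often-quoted average and is used here purely as an illustrative assumption.

    # Sketch of the problem-discovery model usually cited from Nielsen & Landauer (1993):
    # found(n) = N * (1 - (1 - p)**n). The per-user detection rate p = 0.31 is the
    # often-quoted average; real projects vary widely, which is the slide's point.

    def proportion_found(n_users, p=0.31):
        """Expected proportion of usability problems found after testing n_users."""
        return 1 - (1 - p) ** n_users

    for n in (1, 3, 5, 10, 15):
        print(f"{n:2d} users -> {proportion_found(n):.0%} of problems (model estimate)")

Even under this model, five users find roughly 85% of problems only if p really is around 0.31; the formative point stands – test a few, fix, and test again in the next cycle.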
19. validating work
• justification
– expert opinion
– previous research
– new experiments
• evaluation
– experiments
– user studies
– peer review
[diagram: justification → your work → evaluation]
20. generative artefacts
[diagram: justification → generative artefact → evaluation; the evaluated artefact is a singularity, but in use there are different people, different situations, different designers]
• toolkits
• devices
• interfaces
• guidelines
• methodologies
(pure) evaluation of generative artefacts
is methodologically unsound
21. your work
validating work
• justification
– expert opinion
– previous research
– new experiments
• evaluation
– experiments
– user studies
– peer review
22. justification vs. validation
• different disciplines
– mathematics: proof = justification
– medicine: drug trials = evaluation
• combine them:
– look for weakness in justification
– focus evaluation there
27. mechanism
• reduction / reconstruction
– formal hypothesis testing
+ may be qualitative too
– more scientific precision
• wholistic / analytic
– field studies, ethnographies
+ ‘end to end’ experiments
– more ecological validity
28. example: mobile font size
early paper on fonts in mobile menus:
well-conducted experiment
statistically significant results
conclusion gives best font size
but … a menu selection task includes:
1. visual search (better big fonts)
2. if not found scroll/page display (better small fonts)
3. when found touch target (better big fonts)
no single best size – the balance depends on menu length, etc.
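A toy model can make the trade-off concrete; all the constants and cost terms below are made up for illustration, not taken from the paper.

    # Toy model (made-up constants) of the menu-selection trade-off: bigger fonts help
    # visual search and touch accuracy, but fit fewer items per screen and so cost more
    # scrolling. The 'optimum' shifts with menu length, so there is no single best size.
    import math

    def selection_time(font_pt, menu_items, screen_px=600):
        items_per_screen = max(1, screen_px // (2 * font_pt))      # rough line height
        search = 0.3 * menu_items / font_pt                        # big text is easier to scan
        scroll = 0.5 * max(0, menu_items / items_per_screen - 1)   # big text means more pages
        touch = 0.1 * math.log2(1 + 50 / font_pt)                  # Fitts-like: big targets are faster
        return search + scroll + touch

    for items in (8, 40):
        best = min(range(8, 33), key=lambda pt: selection_time(pt, items))
        print(f"{items:2d}-item menu: toy-model optimum ~ {best}pt")

Under these invented numbers a short menu favours large fonts while a long menu favours a middling size – exactly the kind of interaction a single 'best font size' conclusion hides.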
31. what have you really shown?
stats are about the measure, but what does it measure?
32. what have you really shown
• think about the conditions
– are there other explanations for data?
• individual or population
– small # of groups/individuals, many measurements
– sig. statistics => effect reliable for each individual
– but are individuals representative of all?
• systems vs properties
33. a little story …
BIG ACM conference – ‘good’ empirical paper
looking at collaborative support for a task X
three pieces of software:
A – domain specific software, synchronous
B – generic software, synchronous
C – generic software, asynchronous
[diagram: A – domain specific; B, C – generic]
34. experiment
sensible quality measures
reasonable numbers of subjects in each condition
significant results p<0.05
domain spec. > generic
asynchronous > synchronous
conclusion: really want asynchronous domain specific
                 sync     async
  domain spec.    A       (none)
  generic         B        C
35. what’s wrong with that?
interaction effects
gap is interesting to study
not necessarily good to implement
more important …
if you blinked at the wrong moment …
NOT independent variables
three different pieces of software
like experiment on 3 people!
say system B was just bad
                 sync     async
  domain spec.    A        ?
  generic         B        C

both significant results reduce to: B < A and B < C (see the sketch below)
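A made-up simulation shows how this plays out: if only system B is genuinely worse, pooling the three systems into 'domain-specific vs generic' and 'sync vs async' produces two apparent main effects, even though neither factor was independently manipulated. All numbers below are invented.

    # Made-up simulation: three systems where only B is genuinely worse.
    # Pooling the scores by 'factor' makes it look as if domain-specific beats generic
    # AND asynchronous beats synchronous, but both gaps are just B < A and B < C in
    # disguise, because software and factor levels are confounded.
    import random, statistics

    random.seed(1)
    true_mean = {"A": 7.0, "B": 4.0, "C": 7.0}           # only B is bad
    scores = {s: [random.gauss(true_mean[s], 1.0) for _ in range(20)] for s in "ABC"}

    domain_specific = scores["A"]                        # A: domain-specific, synchronous
    generic = scores["B"] + scores["C"]                  # B: generic sync, C: generic async
    synchronous = scores["A"] + scores["B"]
    asynchronous = scores["C"]

    for label, data in [("domain-specific", domain_specific), ("generic", generic),
                        ("synchronous", synchronous), ("asynchronous", asynchronous)]:
        print(f"{label:16s} mean quality = {statistics.mean(data):.2f}")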
36. what went wrong?
borrowed psych method
… but method embodies assumptions
single simple cause, controlled environment
interaction needs ecologically valid experiment
multiple causes, open situations
what to do?
understand assumptions and modify
39. don’t just look at average!
e.g. overall system A lower error rate than system B
but … system B better for experts
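A tiny worked example with hypothetical numbers makes the point: A can have the lower overall error rate while B is better for the experts who will use the system most.

    # Hypothetical error counts only, to make the slide's point concrete:
    # system A wins on the overall average, yet system B is better for experts.
    errors = {
        "A": {"novice": (20, 100), "expert": (8, 100)},   # (errors, trials)
        "B": {"novice": (30, 100), "expert": (4, 100)},
    }
    for system, groups in errors.items():
        total_err = sum(e for e, _ in groups.values())
        total_trials = sum(t for _, t in groups.values())
        per_group = {g: f"{e}/{t}" for g, (e, t) in groups.items()}
        print(system, per_group, f"overall {total_err / total_trials:.0%}")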
40. … and tasks too
e.g. PieTree
(interactive circular treemap)
– exploding pie chart: good for finding large things
– unfolding hierarchical text view: good for finding small things
44. points of comparison
• measures:
– average satisfaction 3.2 on a 5 point scale
– time to complete task in range 13.2–27.6 seconds
– good or bad?
• need a point of comparison
– but what?
– self, similar system, created or real??
– think purpose ...
• what constitutes a ‘control’
– think!!
45. types of knowledge
• descriptive
– explaining what happened
• predictive
– saying what will happen (cause → effect)
– where science often ends
• synthetic
– working out what to do to make what you want happen (effect → cause)
– design and engineering
46. different kinds of evaluation
endless arguments
– quantitative vs. qualitative
– in the lab vs. in the wild
– experts vs. real users (vs UG students!)
really
– combine methods
e.g. quantitative – what is true & qualitative – why
– what is appropriate and possible
47. when does it end?
in a world of perpetual beta ...
real use is the ultimate evaluation
• logging, bug reporting, etc.
• how do people really use the product?
• are some features never used?
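As a rough illustration of the kind of lightweight instrumentation the slide has in mind (the event-log format and feature names below are invented for the example):

    # Rough illustration only: tally feature usage from an event log to spot
    # features that are never (or rarely) used in real use.
    from collections import Counter

    ALL_FEATURES = {"search", "export", "share", "bulk_edit"}    # hypothetical feature set

    def unused_features(event_log):
        """event_log: iterable of (user_id, feature_name) tuples."""
        used = Counter(feature for _, feature in event_log)
        return [f for f in ALL_FEATURES if used[f] == 0]

    log = [("u1", "search"), ("u2", "search"), ("u1", "export")]
    print("never used:", sorted(unused_features(log)))           # -> ['bulk_edit', 'share']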