This is our presentation of the paper General Features in Knowledge Tracing to Model Multiple Subskills, Temporal Item Response Theory, and Expert Knowledge, which is nominated for the Best Paper Award. FAST is a new student model that allows flexible features to help infer the latent knowledge state. Code is available at http://ml-smores.github.io/fast/.
EDM2014 paper: General Features in Knowledge Tracing to Model Multiple Subskills, Temporal Item Response Theory, and Expert Knowledge
1. General Features in Knowledge Tracing
Applications to Multiple Subskills,
Temporal IRT & Expert Knowledge
* First authors
Yun Huang, University of Pittsburgh*
José P. González-Brenes, Pearson*
Peter Brusilovsky, University of Pittsburgh
2. This talk…
• What? Determine student mastery of a skill
• How? Novel algorithm called FAST
– Enables features in Knowledge Tracing
• Why? Better and faster student modeling
– 25% better AUC, a classification metric
– 300 times faster than popular general purpose
student modeling techniques (BNT-SM)
3. Outline
• Introduction
• FAST – Feature-Aware Student Knowledge Tracing
• Experimental Setup
• Applications
1. Multiple subskills
2. Temporal Item Response Theory
3. Paper exclusive: Expert knowledge
• Execution time
• Conclusion
4. Motivation
• Personalize learning of students
– For example, teach students new material as
they learn, so we don’t teach students
material they know
• How? Typically with Knowledge Tracing
6. :
:
û û ü
ü
ü
û û ü
ü
Masters a
skill or not
• Knowledge Tracing fits a two-
state HMM per skill
• Binary latent variables indicate
the knowledge of the student
of the skill
• Four parameters:
1. Initial Knowledge
2. Learning
3. Guess
4. Slip
Transition
Emission
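The four parameters on this slide drive a simple Bayesian update over the latent mastery state. A minimal sketch (not the authors' code; parameter values are illustrative) of one Knowledge Tracing inference step:

```python
# Minimal sketch of the classic Knowledge Tracing update: Bayes' rule on the
# observation (with guess/slip emission noise), then the learning transition.

def kt_update(p_mastery, correct, p_learn, p_guess, p_slip):
    """One step of Knowledge Tracing inference for a single skill."""
    if correct:
        # P(mastered | correct): mastered students answer correctly unless they slip
        num = p_mastery * (1 - p_slip)
        den = num + (1 - p_mastery) * p_guess
    else:
        # P(mastered | incorrect): mastered students only fail by slipping
        num = p_mastery * p_slip
        den = num + (1 - p_mastery) * (1 - p_guess)
    posterior = num / den
    # Learning transition: an unmastered skill may become mastered
    return posterior + (1 - posterior) * p_learn

p = 0.4  # initial knowledge
for obs in [True, True, False, True]:
    p = kt_update(p, obs, p_learn=0.1, p_guess=0.2, p_slip=0.1)
print(round(p, 3))
```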
7. What’s wrong?
• Only uses performance data
(correct or incorrect)
• We are now able to capture feature rich data
– MOOCs & intelligent tutoring systems are able to
log fine-grained data
– Used a hint, watched a video, after-hours practice…
• … these features can carry information about, or intervene on, learning
8. What’s a researcher gotta do?
• Modify Knowledge Tracing algorithm
• For example, just on a small-scale
literature survey, we find at least nine
different flavors of Knowledge Tracing
9. So you want to publish in EDM?
1. Think of a feature (e.g., from a MOOC)
2. Modify Knowledge Tracing
3. Write Paper
4. Publish
5. Loop!
10. Are all of those models sooooo
different?
• No! We identify three main variants
• We call them the “Knowledge Tracing
Family”
11. Knowledge Tracing Family
• No features: classic Knowledge Tracing
• Emission (guess/slip): item difficulty (Gowda et al. '11; Pardos et al. '11), student ability (Pardos et al. '10)
• Transition (learning): subskills (Xu et al. '12), help (Sao Pedro et al. '13)
• Both (guess/slip and learning): student ability (Lee et al. '12; Yudelson et al. '13), item difficulty (Schultz et al. '13), help (Becker et al. '08)
[Figure: HMM diagrams of the four variants, showing where the features (f) attach to the knowledge (k) and performance (y) nodes]
12. • Each model is successful for an ad hoc purpose only
– Hard to compare models
– Doesn't help build a theory of cognition
14. • These models are not
scalable:
– Rely on Bayes Net’s
conditional probability tables
– Memory performance grows
exponentially with number of
features
– Runtime performance grows
exponentially with number of
features (with exact
inference)
19. Something old…
[Figure: the most general HMM in the family, with features (f) on both transition and emission]
• Uses the most general model in the Knowledge Tracing Family
• Parameterizes learning and emission (guess + slip) probabilities
20. Something new…
[Figure: the same model, with logistic regressions replacing the conditional probability tables]
• Instead of using inefficient conditional probability tables, we use logistic regression [Berg-Kirkpatrick et al. '10]
• Exponential complexity → linear complexity
21. Example:
# of features | # of parameters in KTF | # of parameters in FAST
0             | 2                      | 2
1             | 4                      | 3
10            | 2,048                  | 12
25            | 67,108,864             | 27
25 features are not that many, and yet they can become intractable for the Knowledge Tracing Family
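The counts in this table follow from the two parameterizations: a conditional probability table over f binary features plus the binary mastery state needs 2^(f+1) entries, while the logistic regression needs one weight per feature plus an intercept per mastery level (f + 2). A quick illustrative check:

```python
# Illustrative check of the slide's parameter counts.

def ktf_params(n_features):
    # CPT: one entry per joint configuration of f binary features
    # and the binary mastery state.
    return 2 ** (n_features + 1)

def fast_params(n_features):
    # Logistic regression: one weight per feature, plus two intercepts
    # (one per mastery level).
    return n_features + 2

for f in [0, 1, 10, 25]:
    print(f, ktf_params(f), fast_params(f))
```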
22. Something blue?
[Figure: the FAST graphical model]
• Prediction requires few changes to implement
• Training requires quite a few changes
– We use a recent modification of the Expectation-Maximization algorithm proposed for Computational Linguistics problems [Berg-Kirkpatrick et al. '10]
23. (A parenthesis)
• José's corollary: each equation in a presentation would send half the audience to sleep
• Equations are in the paper!
"Each equation I include in the book would halve the sales"
26. Slip/guess lookup:
Mastery   | P(Correct)
False (1) |
True (2)  |
Use the multiple parameters of logistic regression to fill the values of a "no-features" conditional probability table! [Berg-Kirkpatrick et al. '10]
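A sketch of this trick (weight values are hypothetical, not from the paper): the fitted logistic regression is evaluated once per mastery level to fill the lookup table; with only intercept terms, this reduces to plain Knowledge Tracing's guess and slip parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fill_lookup(w_mastered, w_not_mastered):
    """Fill a 'no-features' CPT, P(correct | mastery), from the two
    intercept weights of the fitted logistic regression."""
    return {
        False: sigmoid(w_not_mastered),  # guess probability
        True:  sigmoid(w_mastered),      # 1 - slip probability
    }

# Hypothetical fitted intercepts for the two mastery levels:
table = fill_lookup(w_mastered=2.2, w_not_mastered=-1.4)
```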
28. Slip/Guess logistic regression
[Figure: the design matrix — each observation's features 1…k appear in three copies: one active when mastered, one active when not mastered, and one always active; the instance weights are the probabilities of mastering and of not mastering]
29. Slip/Guess logistic regression (cont.)
[Figure: the same design matrix as on the previous slide]
When FAST uses only intercept terms as features for the two levels of mastery, it is equivalent to Knowledge Tracing!
31. Tutoring System
March 28, 2014
Collected from QuizJET, a tutor for learning Java programming.
Each question is generated from a template, and students can make multiple attempts. Students give values for a variable or the output of Java code.
32. Data
• Smaller dataset:
– ~21,000 observations
– First attempts only: ~7,000 observations
– 110 students
• Unbalanced: 70% correct
• 95 question templates
• "Hierarchical" cognitive model: 19 skills, 99 subskills
33. Evaluation
• Predict future performance given history
– Will a student answer correctly at t = 0?
– At t = 1, given performance at t = 0?
– At t = 2, given performance at t = 0, 1? …
• Area Under Curve (AUC) metric
– 1: perfect classifier
– 0.5: random classifier
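For reference, AUC equals the probability that a randomly chosen correct response receives a higher predicted score than a randomly chosen incorrect one (ties count half). A small self-contained sketch, not the evaluation code from the paper:

```python
# AUC via its rank-statistic definition: the fraction of
# (positive, negative) pairs in which the positive is scored higher.

def auc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4]))
```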
36. Multiple subskills & Knowledge Tracing
• Original Knowledge Tracing cannot model multiple subskills
• Most Knowledge Tracing variants assume equal importance of subskills during training (and then adjust it during testing)
• The state-of-the-art method, LR-DBN [Xu and Mostow '11], assigns importance in both training and testing
37. FAST can handle multiple subskills
• Parameterize learning
• Parameterize slip and guess
• Features: binary variables that indicate
presence of subskills
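The binary subskill indicators described above could be built like this (subskill names are hypothetical, for illustration only):

```python
# Hypothetical subskill inventory for a Java tutor; each observation gets
# one binary indicator per subskill that its item exercises, which FAST
# can then use to parameterize learning, guess, and slip.

SUBSKILLS = ["for-loop", "array-index", "println"]  # hypothetical names

def subskill_features(item_subskills):
    """Binary indicator vector for the subskills present in an item."""
    return [1 if s in item_subskills else 0 for s in SUBSKILLS]

print(subskill_features({"for-loop", "println"}))  # → [1, 0, 1]
```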
38. FAST vs. Knowledge Tracing: slip parameters of subskills
[Figure: estimated slip parameters for the subskills within a skill]
• Conventional Knowledge Tracing assumes that all subskills have the same difficulty (red line)
• FAST can identify different difficulties across subskills
• Does it matter?
39. State of the art (Xu & Mostow '11)
Model         | AUC
LR-DBN        | .71
KT - Weakest  | .69
KT - Multiply | .62
• The 95% confidence intervals are within ±.01 points
40. Benchmark
Model           | AUC
LR-DBN          | .71
Single-skill KT | .71
KT - Weakest    | .69
KT - Multiply   | .62
• The 95% confidence intervals are within ±.01 points
• We test on non-overlapping students; LR-DBN was designed and tested on overlapping students and was not compared to single-skill KT
42. Benchmark
Model           | AUC
FAST            | .74
LR-DBN          | .71
Single-skill KT | .71
KT - Weakest    | .69
KT - Multiply   | .62
• The 95% confidence intervals are within ±.01 points
44. Two paradigms: (50 years of research in 1 slide)
• Knowledge Tracing
– Allows learning
– Every item = same difficulty
– Every student = same ability
• Item Response Theory
– NO learning
– Models item difficulties
– Models student abilities
46. Item Response Theory
• In its simplest form, it's the Rasch model
• The Rasch model can be formulated in many ways:
– Typically with latent variables
– As logistic regression:
  • a feature per student
  • a feature per item
• We end up with a lot of features! Good thing we are using FAST ;-)
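An illustrative sketch (not the authors' code) of the Rasch model and its logistic-regression encoding: P(correct) = σ(ability − difficulty), where ability and difficulty enter as learned weights on one-hot student and item indicators.

```python
import math

def rasch_prob(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def one_hot_features(student, item, n_students, n_items):
    """Logistic-regression encoding: one indicator feature per student
    (ability) and one per item (difficulty)."""
    x = [0] * (n_students + n_items)
    x[student] = 1               # student indicator
    x[n_students + item] = 1     # item indicator
    return x

# With many students and items this yields a lot of features -- which is
# exactly what FAST's logistic-regression parameterization handles.
x = one_hot_features(student=2, item=1, n_students=5, n_items=3)
```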
47. Results
Model             | AUC
Knowledge Tracing | .65
FAST + student    | .64
FAST + item       | .73
FAST + IRT        | .76 ← 25% improvement
• The 95% confidence intervals are within ±.03 points
48. Disclaimer
• In our dataset, most students answer
items in the same order
• Item estimates are biased
• Future work: define continuous IRT
difficulty features
– It’s easy in FAST ;-)
50. Execution time
[Figure: execution time (min.) vs. # of observations (7,100–19,800) for BNT-SM with no features (23–54 min.) and FAST with no features (0.08–0.15 min.)]
FAST is 300x faster than BNT-SM!
51. LR-DBN vs. FAST
• We use the authors' implementation of LR-DBN
• On 15,500 datapoints, LR-DBN takes about 250 minutes
• FAST takes only about 44 seconds
• This is on an old laptop, with no parallelization, nothing fancy
• (Details in the paper)
53. Comparison of existing techniques
Model                                      | allows features | slip/guess | recency/ordering | learning
FAST                                       | ✓ | ✓ | ✓ | ✓
PFA (Pavlik et al. '09)                    | ✓ | ✗ | ✗ | ✓
Knowledge Tracing (Corbett & Anderson '95) | ✗ | ✓ | ✓ | ✓
Rasch Model (Rasch '60)                    | ✓ | ✗ | ✗ | ✗
54. • FAST lives up to its name
• FAST provides high flexibility in using features, and, as our studies show, even simple features yield significant improvements over Knowledge Tracing
55. • The effect of features depends on how smartly they are designed, and on the dataset
• I look forward to more clever uses of feature engineering with FAST in the community