Machine learning techniques enable instructional systems to learn about their students, topics, and pedagogical strategies. Just as master teachers improve over years of experience, tutoring systems can learn to adapt their teaching strategies to new students and new domains, and to personalize their teaching for individual students. Typical instructional systems persist in the behavior originally encoded within them. By using ML, however, tutoring systems learn from the behavior of earlier students and extend their existing knowledge. A variety of ML techniques are used with intelligent tutors, including hidden Markov models (HMMs), neural networks, expectation maximization, Bayesian networks, statistical learning, regression modeling, causal modeling, and statistical models.
Learning to Teach: Improving Instruction with Machine Learning Techniques
1. Beverly Park Woolf
School of Computer Science, University of Massachusetts
bev@cs.umass.edu
Learning to Teach: Machine Learning to Improve Instruction
NIPS 2014 Workshop on Human Propelled Machine Learning, Dec 13, 2014
2. Long, long Term Goal
Millions of schoolchildren will have access to what Alexander the Great enjoyed as a royal prerogative:
“the personal services of a tutor as well informed as Aristotle”
Pat Suppes, Stanford University, 1966 (died Nov 2014)
“Students will have instant access to vast stores of knowledge through their computerized tutors”
3. Alexander the Great valued learning so highly, that
he said he was more indebted to Aristotle for giving him
knowledge than to his father for giving him life.
4. We are on track.
Key components:
Artificial Intelligence
Machine Learning
Learning Sciences
We are able to achieve personal services of a
tutor for every student and instant access to
vast stores of knowledge
6. Model the Student
Model the Domain
Personalize Tutoring
Assess Learning
Intelligent Tutoring Systems
Learning @ Scale
7. Research Questions
How to retrieve substance from educational data?
What do teachers and students need to know?
What do researchers in Learning Sciences want to know?
8. • Explore large educational data sets and how they are analyzed
– Create models and find patterns
• How are researchers in the field of educational technology applying a variety of techniques to data to improve teaching and learning?
9. What Kind of ML Techniques?
– Visualization and modeling
– Decision trees
– Bayesian networks
– Logistic Regression
– Temporal Models
– Markov Models
– Classification: Naïve Bayes, Neural Networks,
Decision trees
19. A data-driven approach toward automatic prediction of students’ emotional states, without sensors and while students are still actively engaged in their learning.
Models are built from students’ ongoing behavior. Cross-validation revealed small gains in accuracy for the more sophisticated state-based models, and better predictions of the remaining unpredicted cases, compared to the baseline models.
By modifying the context of the tutoring system, including students’ perceived emotion around mathematics, a tutor can now optimize and improve a student’s attitudes toward mathematics.
David H. Shanabrook, David G. Cooper, Beverly Park Woolf, and Ivon Arroyo
28. Problem state patterns
IBM’s Many Eyes Word Tree algorithm, applied to the total of 1280 ATT (attempted and solved) events.
Most frequently, ATT was followed by a SOF event (see top tree). The second level of the tree shows that after the sequence ATT ATT, the most frequent next event changes to the ATT event, i.e., the shift in behavior occurs after two ATT states (see second tree and
30. Problem Statement
• Background
– Develop a machine learning component for a math tutoring
system used by high school students (SAT, MCAS)
– Focus on estimating the “state” of a student, which is then used
for selecting an appropriate pedagogical action
• Problem
– Using a model to estimate student ability, but…
– Students appear unmotivated in ~30% of problems
• Solution
– Explicitly model motivation (as a dynamic variable) and student
proficiency in a single model
31. Detection of Motivation
Unmotivated students do not reap the full rewards of
using a computer-based intelligent tutoring system. Detection
of improper behavior is thus an important component of an
online student model.
Dynamic mixture model based on Item Response Theory. This
model simultaneously estimates a student’s proficiency and
changing motivation level.
By accounting for student motivation, the dynamic mixture model can more accurately estimate proficiency and the probability of a correct response.
32. • Created Item Response Theory (IRT) models for modeling the student's knowledge
• Data consists of responses (correct/incorrect) for 400 students across 70 problems, where a student performs ~33 problems on average
• Implemented an EM algorithm to learn the parameters of the IRT model
• Cross-validated results indicate the model can predict with 72% accuracy how the student will perform on each problem
• Algorithms can be used online to estimate a student's ability while interacting with the tutor
• Currently working on an extension of the IRT model to include information relevant to a student's motivation (time spent on problem, number of hints requested)
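The slide does not say which IRT variant was used. As a minimal sketch, assuming a two-parameter logistic (2PL) model with hypothetical item parameters, the response probability and a simple grid-search ability estimate could look like the following. The names `p_correct` and `estimate_ability` and all item values are illustrative, not taken from the tutor.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT (assumed variant): probability of a correct response given
    ability theta, item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, items, grid=None):
    """Maximum-likelihood ability estimate by grid search.

    responses: list of 0/1 outcomes; items: list of (a, b) pairs.
    """
    if grid is None:
        grid = [x / 10.0 for x in range(-40, 41)]  # theta in [-4, 4]

    def log_lik(theta):
        total = 0.0
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p) if u else math.log(1.0 - p)
        return total

    return max(grid, key=log_lik)

# Hypothetical item parameters (discrimination, difficulty).
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
theta_hat = estimate_ability([1, 1, 0], items)
```

A grid search is used only to keep the sketch short; the slides describe an EM procedure for learning the item parameters, which is not reproduced here.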
33. Low Student Motivation
• Example: Actual data from a student performing
12 problems (green = correct, red = incorrect)
– Problems are of roughly equal difficulty
• Student appears to perform well in beginning and
worse toward the end
• Conclusion: The student’s proficiency is average
[Figure: correct/incorrect outcomes for problems 1 to 12]
34. Low Student Motivation
• Conclusion: Poor performance on the last five
problems is due to low motivation (not
proficiency)
[Figure: time to first response, in seconds (axis 0 to 50), for problems 1 to 12. Annotations: “Student is unmotivated”; “Use observed data to infer motivation!”]
35. Low Student Motivation
• Opportunity for intelligent tutoring systems to
improve student learning by addressing
motivation
• This issue is being dealt with on a larger scale
by the educational assessment community
– Wise & Demars 2005. Low Examinee Effort in
Low-Stakes Assessment: Potential Problems and
Solutions. Educational Assessment.
36. Hidden Markov Model (HMM)
• An HMM is used to capture a student’s changing behavior (level of motivation)
[Diagram: chain of hidden motivation states M1, M2, …, Mn, each emitting an observed behavior H1, H2, …, Hn]
Mi (hidden) | Hi (observed)
Unmotivated – Hint | Time to first response < tmin AND number of hints before correct response > hmax
Unmotivated – Guess | Time to first response < tmin AND number of hints before correct response < hmin
Motivated | If other two cases don’t apply
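The observation rules on this slide can be sketched as a small function. The thresholds tmin, hmin, and hmax are not given on the slide, so the values used below are hypothetical, as is the function name.

```python
def classify_behavior(time_to_first_response, hints_before_correct,
                      t_min, h_min, h_max):
    """Map one problem's observed data to the label H_i from the slide.

    The two 'unmotivated' rules are checked first; anything else is
    labeled 'motivated'.
    """
    if time_to_first_response < t_min and hints_before_correct > h_max:
        return "unmotivated-hint"
    if time_to_first_response < t_min and hints_before_correct < h_min:
        return "unmotivated-guess"
    return "motivated"

# Hypothetical thresholds: 5 seconds, fewer than 1 hint, more than 3 hints.
label = classify_behavior(2.0, 5, t_min=5.0, h_min=1, h_max=3)
```

Note that a fast response with a moderate number of hints (between hmin and hmax) falls through to "motivated" under these rules.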
37. • New edges (in red) change the conditional probability of a student’s response: P(Ui | θ, Mi), where θ is the student’s proficiency
[Diagram: the HMM chain M1 … Mn with observations H1 … Hn, extended with responses U1 … Un. Motivation (Mi) affects student response (Ui).]
38. Parameter Estimation
• Uses an Expectation-Maximization algorithm to
estimate parameters
– M-Step is iterative, similar to the Iterative Reweighted
Least Squares (IRLS) algorithm
• Model consists of discrete and continuous variables
– Integral for the continuous variable is approximated using
a quadrature technique
• Only parameters not estimated
– P(Ui | θ, Mi=unmotivated-guess) = 0.2
– P(Ui | θ, Mi=unmotivated-hint) = 0.02
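One way to read the fixed values above is as the probability of a correct response in each unmotivated state, with the IRT curve used only when the student is motivated. The sketch below makes that assumption explicit. The logistic form, the `discrimination` default, and the function names are hypothetical.

```python
import math

# Fixed (not estimated) values from the slide, read here as P(correct).
FIXED_P_CORRECT = {"unmotivated-guess": 0.2, "unmotivated-hint": 0.02}

def p_correct_given_state(theta, difficulty, state, discrimination=1.0):
    """P(U = correct | theta, M = state) under the assumed mixture."""
    if state in FIXED_P_CORRECT:
        return FIXED_P_CORRECT[state]
    # Motivated: a logistic IRT curve (assumed form).
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def p_correct_marginal(theta, difficulty, belief):
    """Average over a belief {state: probability} about motivation."""
    return sum(p * p_correct_given_state(theta, difficulty, state)
               for state, p in belief.items())

belief = {"motivated": 0.5, "unmotivated-guess": 0.5}
p = p_correct_marginal(0.0, 0.0, belief)
```

Because the unmotivated branches ignore the ability value, wrong answers given while unmotivated do not pull the ability estimate down.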
39. Modeling Ability and Motivation
• Combined model does not decrease the ability
estimate when the student is unmotivated
Combined model
separates ability from
motivation (IRT model
lumps them together)
40. Experiments
• Data: 400 high school students, 70 problems, a student
finished 32 problems on average
• Train the Model
– Estimate parameters
• Test the Model
– For each student, for each problem:
• Estimate proficiency θ and P(Mi) via maximum likelihood
• Predict P(Mi+1) given HMM dynamics
• Predict Ui+1. Does it match actual Ui+1?
• Compare combined model vs. just an IRT model
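The step “Predict P(Mi+1) given HMM dynamics” amounts to pushing the current belief about motivation through the transition probabilities P(Mi+1 | Mi). A minimal sketch, with a hypothetical transition table (the learned values are not given in the slides):

```python
def predict_next_motivation(belief, transition):
    """One-step HMM prediction.

    belief: {state: P(M_i = state)}
    transition: {current: {next: P(M_{i+1} = next | M_i = current)}}
    Returns {state: P(M_{i+1} = state)}.
    """
    states = list(transition)
    return {
        nxt: sum(belief[cur] * transition[cur][nxt] for cur in states)
        for nxt in states
    }

# Hypothetical dynamics: students tend to stay in their current state.
transition = {
    "motivated": {"motivated": 0.9, "unmotivated": 0.1},
    "unmotivated": {"motivated": 0.3, "unmotivated": 0.7},
}
next_belief = predict_next_motivation(
    {"motivated": 1.0, "unmotivated": 0.0}, transition)
```

The predicted belief can then be combined with the response model to predict Ui+1 and compare it with the actual response.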
41. Results
• Combined model achieved 72.5% cross-
validation accuracy versus 72.0% for the IRT
model
– Gap is not statistically significant
• Opportunities for improving the accuracy of
the combined model
– Longer sequences (per student)
– Better model of the dynamics, P(Mi+1 | Mi)
42. Conclusions
• Proposed a new, flexible model to jointly estimate
student motivation and ability
– Not separating ability from motivation conflates the two
concepts
– Easily adjusted for other tutoring systems
• Combined model achieved similar accuracy to IRT
model
• Online inference in real-time
– Implemented in Java; ran it in one high school in May ’06
44. Sensors used in the classroom
Bayesian networks
and Linear regression
45. Linear Models to Predict Emotions
Variables that help predict self-report of emotions. The results suggest that emotion depends on the context in which the emotion occurs (math problem just solved) and can also be predicted from physiological activity captured by the sensors (bottom row).
48. Domain Model
The Andes Bayesian network before (left) and
after (right) the observation A-is-a body.
Kurt VanLehn.
49. Domain Model
Student actions (left) and the self-explanation model (right).
The physics problem asks the student to find the tension force exerted on a person hanging by a rope tied to his waist. Assume the midshipman was named Jake.
55. Predicting Student Time To Complete
Two agents were built to predict student time to solve problems (Beck et al., 2000).
1) Population student model (PSM): responsible for modeling how students interacted with the tutor, based on data from the entire population of users. It took as input characteristics of the student and information about the problem to be solved, and output the expected time (in seconds) the student would need to solve that problem.
2) Pedagogical agent (PA): responsible for constructing a teaching policy. It was a reinforcement learning agent that reasoned about a student’s knowledge and provided customized examples and hints tailored for each student (Beck and Woolf, 2001; Beck et al., 1999a, 2000).
56. The tutor predicted a current student’s reaction to a variety of teaching actions, such as presentation of a specific problem type.
(Beck et al., 2000)
Overview of the ADVISOR machine
learning component in AnimalWatch.
57. The tutor predicted a current student’s reaction to a variety of teaching actions, such as presentation of a specific problem type.
The predicted time accounted for roughly 50% of the variance in the actual time students spent solving a problem.
(Beck et al., 2000)
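“Variance accounted for” is commonly reported as a coefficient of determination. The slide does not say exactly how the figure was computed, so the sketch below assumes the usual R² definition and uses made-up timing data.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((y - mean_actual) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_residual / ss_total

# Made-up example: actual vs. predicted seconds on four problems.
score = r_squared([30.0, 45.0, 60.0, 75.0], [35.0, 40.0, 65.0, 70.0])
```

A value near 0.5 would correspond to the “roughly 50%” on the slide. A value of 1.0 means perfect prediction, and 0.0 means no better than always predicting the mean time.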
59. Cycle Network
Cycle network in DT tutor. The network is rolled out to three time periods
representing current, possible, and projected student actions. (From Murray et
al., 2004.)
60. Models being Evaluated
Sarah Schultz, WPI
Which model, learned over data, helps predict future performance best?
Few issues to solve
62. Pedagogical Moves: Dynamically adjusted
Empirical-based estimates of effort lead to adjusted problem difficulty and
other affective and meta-cognitive feedback
63. What is “normal” behavior?
Looking across the whole population of students who used a problem.
In EACH problem pi, i = 1, …, N (N = total problems in system):
[Figure: three histograms for problem pi. Incorrect attempts, with expected value E(Ii) and markers IL and IH; hints, with E(Hi) and markers HL and HH; time (each bar = 5 seconds), with E(Ti) and markers TL and TH. The region around each expected value is labeled “Within expected behavior”.]
A new student encounters this problem…
Is their behavior within expectation, or atypical?
64. What is odd behavior?
In any problem pi, i = 1, …, N (N = total problems in system):
[Figure: three histograms (incorrect attempts, hints, time with each bar = 5 seconds) marked with E(Ii), IL, IH, E(Hi), HL, HH, E(Ti), TL, TH]
Odd behavior: few incorrect attempts, lots of hints, little time
Attempts < E(Ii) − IL    Hints > E(Hi) + HH    Time < E(Ti) − TL
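A minimal sketch of the odd-behavior check follows. It assumes that the expected values are population means for the problem, that IL, HH, and TL are margins subtracted from or added to those means, and that all three conditions must hold together. The slide does not spell these points out, and the data and names are hypothetical.

```python
from statistics import mean

def expected_behavior(population):
    """population: list of (incorrect_attempts, hints, seconds) tuples
    recorded for one problem across all students who used it."""
    attempts, hints, seconds = zip(*population)
    return mean(attempts), mean(hints), mean(seconds)

def is_odd(student, expected, i_low, h_high, t_low):
    """True when the student shows few incorrect attempts, lots of hints,
    and little time, relative to the expected values."""
    attempts, hints, seconds = student
    e_attempts, e_hints, e_seconds = expected
    return (attempts < e_attempts - i_low
            and hints > e_hints + h_high
            and seconds < e_seconds - t_low)

# Hypothetical population data for one problem.
expected = expected_behavior([(2, 1, 30), (1, 0, 25), (3, 2, 35)])
flag = is_odd((0, 4, 5), expected, i_low=1, h_high=1, t_low=10)
```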
65. Increasing Problem Difficulty
At the next time step. Assume we know problem difficulty of items.
$\mathrm{Harder}(H[1..m],\, g) = H\left[\left\lceil \frac{m}{g} \right\rceil\right]$
Parameter g = 3 (challenge rate)
[Figure: H = sorted list of m harder math problems, ordered from easiest to hardest of all; LastProbSeen and the selected problem (X) are marked.]
66. Decreasing Problem Difficulty
At the next time step. Assume we know problem difficulty of items.
$\mathrm{Easier}(E[1..n],\, g) = E\left[\left\lceil n - \frac{n}{g} \right\rceil\right]$
Parameter g = 3
[Figure: E = sorted list of n easier math problems, ordered from easiest of all to hardest; LastProbSeen and the selected problem (X) are marked.]
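The two selection rules, Harder(H[1..m], g) = H[ceil(m / g)] and Easier(E[1..n], g) = E[ceil(n − n / g)], can be sketched as follows. Lists are assumed to be non-empty and sorted from easiest to hardest, with 1-based indexing as on the slides. The clamp to index 1 in `easier` is an added safeguard for g = 1, not something stated in the source.

```python
import math

def harder(harder_problems, g):
    """Pick H[ceil(m / g)] (1-based) from problems harder than the last one.

    With g = 3, this selects a problem about one third of the way into
    the list, so the step up in difficulty is moderate.
    """
    m = len(harder_problems)
    return harder_problems[math.ceil(m / g) - 1]

def easier(easier_problems, g):
    """Pick E[ceil(n - n / g)] (1-based) from problems easier than the last one."""
    n = len(easier_problems)
    index = max(1, math.ceil(n - n / g))  # clamp is an assumption for g = 1
    return easier_problems[index - 1]

next_up = harder(["h1", "h2", "h3", "h4", "h5", "h6"], g=3)    # "h2"
next_down = easier(["e1", "e2", "e3", "e4", "e5", "e6"], g=3)  # "e4"
```

In both cases the chosen problem is closer to the last problem seen than to the extreme end of the list, so difficulty changes gradually.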
68. Stanford’s Computer Science Course
Machine learning techniques were used to autonomously create a
graphical model of how students in an introductory programming
course progress through the homework assignment.
Machine learning algorithms found patterns in how students solved the Checkerboard Karel problem. These patterns predicted how well students would perform on the class midterm better than the grades students received on the assignment did. The algorithm captured a meaningful general trend about how students were solving programming problems.
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February).
Modeling how students learn to program. In Proceedings of the 43rd ACM
technical symposium on Computer Science Education (pp. 153-160). ACM.
69. Student Modeling in Computer
Programming
Bag of Words Difference: Researchers first built histograms of the different key
words used in a computer program and used the Euclidean distance between two
histograms as a naïve measure of the dissimilarity. This is akin to distance measures
of text commonly used in information retrieval systems.
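As a rough sketch of the bag-of-words measure: count keyword occurrences in each program and take the Euclidean distance between the two histograms. The keyword set and the tokenizer below are hypothetical stand-ins, not the ones used in the study.

```python
import math
import re
from collections import Counter

# Hypothetical keyword set for a Karel-style language.
KEYWORDS = {"move", "turnLeft", "putBeeper", "while", "if"}

def keyword_histogram(source, keywords=KEYWORDS):
    """Count how often each keyword appears as an identifier in source."""
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    return Counter(t for t in tokens if t in keywords)

def histogram_distance(h1, h2):
    """Euclidean distance between two keyword histograms."""
    keys = set(h1) | set(h2)
    return math.sqrt(sum((h1[k] - h2[k]) ** 2 for k in keys))

a = keyword_histogram("move(); move(); turnLeft();")
b = keyword_histogram("move(); putBeeper();")
distance = histogram_distance(a, b)
```

The measure ignores the order of statements, which is why the slide calls it naïve.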
Application Program Interface (API) Call Dissimilarity: They ran each program with
standard inputs and recorded the resulting sequence of API calls. They used
Needleman-Wunsch global DNA alignment to measure the difference between the
lists of API calls generated by the two programs.
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012,
February). Modeling how students learn to program. In Proceedings of
the 43rd ACM technical symposium on Computer Science Education
(pp. 153-160). ACM.
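The API-call comparison can be illustrated with a standard Needleman-Wunsch global alignment over two call sequences. The scoring values (match +1, mismatch −1, gap −1) and the example call names are assumptions; the scoring used in the study is not given on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (higher = more similar)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per element.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    return score[-1][-1]

calls_a = ["move", "move", "turnLeft"]
calls_b = ["move", "turnLeft"]
alignment_score = needleman_wunsch(calls_a, calls_b)
```

A lower score indicates more dissimilar call sequences. Turning the score into a dissimilarity (for example by negating it) is a further design choice not covered here.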
70. Hidden Markov Model
The first step in their student modeling process was to learn a high-level representation of how each student progressed through the Checkerboard Karel assignment. To learn this representation they modeled a student’s progress as a Hidden Markov Model (HMM) [17].
Learning an HMM: each state from the HMM becomes a node in the finite state machine (FSM), and the weight of a directed edge from one node to another provides the probability of transitioning from one state to the next.
[Figure: the Hidden Markov Model of state transitions for a given student. The node code_t denotes the code snapshot of the student at time t, and the node state_t denotes the high-level milestone that the student is in at time t. N is the number of snapshots for the student.]
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February).
Modeling how students learn to program. In Proceedings of the 43rd ACM
technical symposium on Computer Science Education (pp. 153-160). ACM.
71. Dissimilarity Matrix
Clustering on a sample of 2000 random snapshots from the training set returned a
group of well-defined snapshot clusters (see Figure 2). The value of K that maximized
silhouette score (a measure of how natural the clustering was) was 26 clusters. A
visual inspection of these clusters confirmed that snapshots which clustered together
were functionally similar pieces of code.
Dissimilarity matrix for
clustering of 2000 snapshots.
Each row and column in the
matrix represents a snapshot
and the entry at row i, column j
represents how similar
snapshot i and j are (dark
means more similar)
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012, February).
Modeling how students learn to program. In Proceedings of the 43rd ACM
technical symposium on Computer Science Education (pp. 153-160). ACM.
72. The finite set of high-level states, or milestones, that a student could be in. A state is defined by a set of snapshots where all the snapshots in the set came from the same milestone.
The transition probability of being in a state, given the state the student was in during the previous unit of time.
The emission probability of seeing a specific snapshot, given that the student is in a particular state. To calculate the emission probability we interpreted each of the states as emitting snapshots with normally distributed dissimilarities. In other words, given the dissimilarity between a particular snapshot of student code and a state’s "representative" snapshot, we can calculate the probability that the student snapshot came from a given state using a Normal distribution based on the dissimilarity.
Piech, C., Sahami, M., Koller, D., Cooper, S., & Blikstein, P. (2012,
February). Modeling how students learn to program. In Proceedings of
the 43rd ACM technical symposium on Computer Science Education
(pp. 153-160). ACM.
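A minimal sketch of the emission idea: evaluate a Normal density at the dissimilarity between a snapshot and each state’s representative snapshot. The zero mean, the shared standard deviation `sigma`, and the example dissimilarities are assumptions made for illustration.

```python
import math

def emission_density(dissimilarity, sigma=1.0):
    """Zero-mean Normal density evaluated at a dissimilarity value."""
    z = dissimilarity / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def most_likely_state(dissimilarities, sigma=1.0):
    """dissimilarities: {state: dissimilarity to that state's representative}.
    Returns the state whose emission density is highest."""
    return max(dissimilarities,
               key=lambda state: emission_density(dissimilarities[state], sigma))

# Hypothetical dissimilarities of one snapshot to three milestones.
best = most_likely_state({"milestone_1": 1.5, "milestone_2": 0.2, "milestone_3": 3.0})
```

In a full HMM these densities would be combined with the transition probabilities rather than used on their own.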
73. The landscape of solutions for “gradient descent for linear regression”
representing over 40,000 student code submissions with edges drawn between
syntactically similar submissions and colors corresponding to performance on a
battery of unit tests (red submissions passed all unit tests).
Huang, J., Piech, C., Nguyen, A., & Guibas, L. (2013, June). Syntactic and
functional variability of a million code submissions in a machine learning
mooc. In AIED 2013 Workshops Proceedings Volume (p. 25).
Stanford’s MOOC: Teaching Machine Learning topics
74. Hour of Code Challenge Modeling
How Young Students Learn to Program
75. Code.org problem solving graph of learned policy for how to solve a single open
ended programming assignment from over 1M users. Each node is a unique
partial-solution. The node 0 is the correct answer.
Chris Piech, Stanford Ph.D. student
[Figure labels: “Correct Answer”; Node: unique partial solution; Arc: next solution an expert would recommend.]
76. Improved Retention
Code.org gathered over 137 million partial
solutions. Not all students made it through the
entire Hour of Code but retention was quite high
relative to other contemporary
open access courses.
77. 63K Peer Grading for 7K students
Blue Blob: Student A
Red Circle: Students who were graded by Student A
Red Squares: Students who graded Student A
A Coursera course to teach HCI. Peer grading network of 63K peer grades
for 7K students. A single student is highlighted, red squares graded the
Chris Piech, Stanford Ph.D.
78. Lan, A. S., Studer, C., Waters, A. E., & Baraniuk, R. G. (2013). Joint
topic modeling and factor analysis of textual information and graded
response data. arXiv preprint arXiv:1305.1956.
Circles: Concepts
Squares: Questions
Edges: Strong question-concept relationship
80. Long term goal
Millions of schoolchildren will have access to what Alexander the Great enjoyed as a royal prerogative: “the personal services of a tutor as well informed as Aristotle”
Pat Suppes, Stanford University, 1966 (died Nov 2014)
“Students will have instant access to vast stores of knowledge through their computerized tutors”
82. Thank You !
Any Questions?