Blue Canary was a higher education data and analytics company based in Phoenix, Arizona, USA, acquired by Blackboard Inc. in December 2015. We worked with a university to help predict at-risk students in its undergraduate degree programs. Our model predicted attendance in a given week, since we knew that missing a week of class was a proxy for attrition. The models were trained and selected using standard efficacy measures (precision, recall, F1 score). After using the models in production for six months, we saw that those metrics on live data tracked the training metrics closely. This validated the development of our predictive models.
This presentation was part of the Practitioner Track at LAK16, delivered April 28, 2016.
LAK16 Practitioner Track presentation: "Model Accuracy: Training vs Reality"
Slide 1
Model Accuracy
Training vs Reality
Mike Sharkey & Brian Becker
Blue Canary
Delivered by Dan Rinzel
Blackboard, Inc.
#LAK16 - Practitioner Track
April 28th, 2016
Slide 3
Project Goals
Blue Canary built a predictive model for a client institution's students enrolled in their online program, to assess attrition risk.
- 7-week courses, with rolling starts every week
- Policy definition for weekly attendance: students expected to attend & post on 4 out of 7 days each week
- A strong correlation between attendance & attrition was assumed
Trained the model on data that included attendance and attrition:
- 1,456 distinct courses that ran between Jan 2013 & Aug 2014
- Mean class size (x̄) of 23 enrolled students
- 19,506 distinct students
With the model proven, we ran a live 6-month pilot:
- Rolled out to 100 faculty members teaching 1 of 3 introductory courses in the bachelor's degree program (~4,500 students)
- Enabled integrated alerts for student advisors
- Compared predictions to actual behavior
Slide 4
Data Collection Process
We collected SIS and LMS fields from the institution to get historical data for training the predictive model.
Historically, we know whether or not a student met the attendance requirements, so we have the outcomes needed to develop a model.
From there, we split the data into three buckets: 70% of the data was used to train the model, and two other buckets of 15% each were used to test and validate the model.
We then took specific fields that are important in identifying student behavior and used them to construct features. These features are the inputs to the random forest machine learning modeling process, sketched below.
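As a minimal sketch of that workflow (not the production pipeline; the feature names and DataFrame layout here are assumptions for illustration), the 70/15/15 split and random forest training could look like this in Python with scikit-learn:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Hypothetical feature columns; the real model used 18 SIS/LMS features.
FEATURES = ["incoming_gpa", "transfer_credits", "days_since_last_post",
            "posts_last_7_days", "met_prior_week_attendance"]
LABEL = "met_attendance"  # 1 = met the 4-of-7-days policy, 0 = did not

def train_attendance_model(df: pd.DataFrame):
    # 70% train; split the remaining 30% in half for test (15%) and validation (15%).
    train, holdout = train_test_split(df, test_size=0.30, random_state=42)
    test, validate = train_test_split(holdout, test_size=0.50, random_state=42)

    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(train[FEATURES], train[LABEL])

    # Score the held-out test bucket with the same efficacy measures used here.
    preds = model.predict(test[FEATURES])
    print("precision:", precision_score(test[LABEL], preds))
    print("recall:   ", recall_score(test[LABEL], preds))
    print("F1:       ", f1_score(test[LABEL], preds))
    return model, validate  # validation bucket reserved for model selection
```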
Slide 5
Data Collection Process
Features sourced from SIS data:
- Incoming GPA
- Inbound Transfer Credits
- Previous Course Grade
- Family Income
- Age
- Days since last course
- Gender
- Credits earned (% of attempted)
- Military service
- Degree Program
- # Failed/Dropped Courses
Features sourced from LMS data:
- Current Course Grade
- Met prior week attendance?
- # days with posts in the last 7
- # posts decile (main forum)
- # posts decile (all forums)
- Days since last post
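Most SIS features come directly from institutional records, while the LMS features must be derived from activity logs. As an illustrative sketch only (the `student_id` and `post_date` column names and the log layout are assumptions, not the actual schema), two of the LMS features above could be computed like this:

```python
import pandas as pd

def lms_features(posts: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Derive per-student LMS features from a forum-post log.

    posts: one row per post, with columns student_id and post_date (datetime).
    """
    recent = posts[posts["post_date"] > as_of - pd.Timedelta(days=7)]

    # "# days with posts in the last 7": distinct active days per student.
    days_with_posts = (recent.groupby("student_id")["post_date"]
                             .apply(lambda s: s.dt.normalize().nunique())
                             .rename("days_with_posts_last_7"))

    # "Days since last post": gap between the as-of date and the latest post.
    days_since_last = ((as_of - posts.groupby("student_id")["post_date"].max())
                       .dt.days
                       .rename("days_since_last_post"))

    features = pd.concat([days_with_posts, days_since_last], axis=1)
    return features.fillna({"days_with_posts_last_7": 0})
```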
Slide 6
Measuring Efficacy: Methodology
To determine the accuracy of our machine learning model, we use the numerical values from a confusion matrix to calculate precision, recall, and F1 score.
In our scenario, precision is defined on the positive side as: of the students we predicted would attend class that week, what percent actually attended?
Recall is defined as: of the students who did attend class that week, what percent did we accurately predict?
The F1 score is simply the harmonic mean of precision and recall.
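In terms of the confusion matrix counts (true positives TP, false positives FP, false negatives FN), these are the standard definitions:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```

For example, the Week 1-2 training numbers reported in the notes below (precision 84%, recall 91%) give F1 = 2(0.84)(0.91)/(0.84 + 0.91) ≈ 0.87, matching the 87% reported there.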
We went live with predictions in April 2015, fed the model with current data each day, and compared actual weekly results against the accuracy of the initial training model over a 6-month span.
Slide 8
Measuring Efficacy: Results & Lessons Learned
[Graphs comparing precision, recall, and F1 score between training and practice.]
[Figure: horizontal bar chart, "Feature drivers ranked by importance within model," with importance on a 0-0.25 scale for the Week 0-1 and Week 2-6 models. Features, from least to most important: # withdrawn courses, # failed courses, credits earned (% of attempted), degree program, military status, days since last course, gender, current class days since last post, age bracket (decade), previous course grade, salary decile, current class total posts decile, cumulative GPA, transfer credits, current class previous week # posts, current class days with posts (rolling 7 day), current class previous week attendance, and current class cumulative performance.]
Slide 9
Enabling Triage & Intervention
- Augmenting the other tools available to teachers in fully online courses
- Creating efficiencies for advisors who may have large caseloads of students to help with attrition-risk diagnosis & intervention
- Giving both groups supplemental confidence in the prediction numbers
- Providing a Create Alert call to action
Slide 12
Key Takeaways
After running the model for six months, we saw that the actual model efficacy tracked very closely with the predicted model efficacy from training. This is a positive testament to the power and validity of the model.
Additionally, the model accuracy numbers we saw (in the 75-80% range) are very much in line with the accuracy rates we have seen with models at other institutions. This adds another level of confidence for using predictive models as a diagnostic tool to address at-risk students and for turning those models into intervention-based actions.
Speaker Notes
Will the student attend/post on 4 out of the 7 days of the week?
Zero attendance for two weeks triggered an administrative auto-drop.
For this model, this was the set of 18 features that meaningfully contributed to the prediction
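As a concrete restatement of the two attendance rules above (hypothetical helper functions, not code from the product):

```python
def met_weekly_attendance(days_with_activity: int) -> bool:
    """Policy: a student must attend/post on 4 of the 7 days in a week."""
    return days_with_activity >= 4

def auto_dropped(weekly_attendance_days: list[int]) -> bool:
    """Zero attendance for two consecutive weeks meant an administrative auto-drop."""
    return len(weekly_attendance_days) >= 2 and all(
        days == 0 for days in weekly_attendance_days[-2:])
```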
Precision:
- Week 1-2 model: 84% in training, 80% in practice
- Week 3-7 model: 84% in training, 84% in practice
Recall:
- Week 1-2 model: 91% in training, 89% in practice
- Week 3-7 model: 87% in training, 84% in practice
F1 score:
- Week 1-2 model: 87% in training, 85% in practice
- Week 3-7 model: 85% in training, 84% in practice
Originally, one predictive model was built for the entire 7-week course. This presented a problem, however, because the predictors of attendance change as students progress through the course, and creating multiple models would yield higher accuracy. We therefore created seven different models, one for every week of the course, but maintaining seven models in the software proved difficult, and we realized that by combining models from certain weeks we could keep a high level of accuracy with far fewer models. We finally settled on two models (a Week 0-1 model and a Week 2-6 model; see the routing sketch after these notes), since the drivers of the model were similar within these ranges, with cumulative performance standing out as the strongest driver from Week 2 onward.
With the software and technology infrastructure available from the Blackboard acquisition, we will be able to generate and maintain a separate model for every week, so we won't be as concerned with "forcing" a breakpoint like this in the modeling, but it is illustrative.
Notice that the demographic data available before the class begins, used to make the Week 0 prediction, still provides useful drivers, including previous GPA, transfer credits, and previous course grade.
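In software terms, the two-model compromise amounts to dispatching each weekly prediction to one of two trained models based on the course week. A minimal sketch of that routing, assuming two already-trained classifiers such as the random forests from the earlier snippet:

```python
def predict_attendance(week: int, features, week_0_1_model, week_2_6_model):
    """Route a student's feature vector to the model covering the current week."""
    model = week_0_1_model if week <= 1 else week_2_6_model
    return model.predict([features])[0]  # 1 = predicted to attend this week
```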
But so what? We have a solid model with pretty high confidence, but how do we enable action based on these models?
Talking point – show the break point between the Week1-2 model and the Week3-7 model & talk about how we got there.