
Building Models Quickly Addressing Housing Overflow at Purdue - Greenplum Summit 2019

Greenplum Summit 2019
Ian Pytlarz

Published in: Software


  1. BUILDING MODELS QUICKLY: ADDRESSING HOUSING OVERFLOW AT PURDUE
     Using Greenplum & XGBoost
     March 19, 2019
  2. PURDUE UNIVERSITY: AN INDIANA INSTITUTION
     • Located in West Lafayette, IN
     • Consists of one main campus and 3 regional campuses
     • Over 40,000 students enrolled
       o ~30k Undergraduate, 10k Graduate
     • Over 200 majors offered across ten academic colleges
     • Part of the Big Ten Conference
  3. DATA SCIENCE & HIGHER EDUCATION: A WORLD OF POSSIBILITY
     • Higher education has only just begun using Data Science
     • This means lots of new paths to forge
     • From the obvious:
       o Predict grades (done)
       o Maximizing financial aid through predicting yield
     • To the complex:
       o Course recommendation engine
       o Entry essay neural network
  4. PURDUE’S IDAP SYSTEM
  5. WHAT IS IDAP? PURDUE DATA SCIENCE & ANALYTICS
     • IDAP serves as a gray box for a wide variety of data sources:
       o Traditional: Student Information
       o Ancillary Data Sources: Degree Requirements, Student Activities
       o New: Network Logs, Card Swipes, LMS Clickpath
     • Also houses a modelling pipeline with several production models
       o At Risk (<2.5 GPA first semester)
       o Course GPA (C or worse in a course)
       o Yield (Which students will attend)
     • Faculty Research
       o Secure high-compute server (Raiden)
       o Pulls data from Greenplum
  6. IDAP OVERVIEW
  7. MODEL BUILDING AND SCORING PIPELINE
     • Structured, unstructured data sources
     • Refreshed data (incoming daily/weekly/monthly updates)
     • Feature generation pipeline
       o Static features
       o Static + time-sensitive LMS features
       o Static + time-sensitive LMS + network + card logs features
     • Modeling pipeline
       o In-database parallel grid-search (XGBoost)
       o MADlib Logistic Regression
       o Sklearn AdaBoost
       o Sklearn RandomForest
     • Model selection
     • Serialize to disk
     • Scoring results
       o Student ID
       o Feature names, values, importance scores
       o Predictions
     • Cleared by IDAP Data Scientist
     • Results sent to end-users
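The grid-search, model-selection and serialize-to-disk steps of the pipeline can be sketched in plain Python. This is only an illustration under stated assumptions: a toy threshold rule stands in for XGBoost, the grid is searched sequentially rather than in parallel inside Greenplum, and the rows, labels, grid values and file name are all hypothetical.

```python
import itertools
import os
import pickle
import tempfile


def f_score(y_true, y_pred):
    """F1 for the positive class (1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def predict(params, rows):
    """Toy stand-in for a model: flag a row when one feature exceeds a threshold."""
    return [1 if row[params["feature"]] > params["threshold"] else 0 for row in rows]


def grid_search(grid, rows, labels):
    """Score every parameter combination and return the best one with its score."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = f_score(labels, predict(params, rows))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical validation rows: (days_before_start_sign, num_room_changes_last_year)
rows = [(5, 0), (40, 1), (3, 2), (60, 0), (2, 3), (55, 1)]
labels = [1, 0, 1, 0, 1, 0]  # 1 = cancelled (made-up labels)
grid = {"feature": [0, 1], "threshold": [1, 10, 50]}

best_params, best_score = grid_search(grid, rows, labels)

# Serialize the selected configuration to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "best_model.pkl")
    with open(model_path, "wb") as fh:
        pickle.dump(best_params, fh)
    with open(model_path, "rb") as fh:
        restored = pickle.load(fh)

print(best_params, round(best_score, 3))
```

In the pipeline on the slide, the candidates would be XGBoost, MADlib, and scikit-learn models rather than a threshold rule; the select-then-serialize structure is the part this sketch is meant to show.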
  8. HOUSING CANCELLATIONS: PIPELINE IN ACTION
  9. THE SITUATION: TOO MANY STUDENTS, TOO LITTLE HOUSING
     • Admission to Purdue in Fall 2018 hit historic highs
       o 8,357 students in the entering class, on top of historic high enrollment in each of the two prior years
       o Nearly 800 more new students than in Fall 2017
     • Housing was not being built quickly enough to keep up with demand
     • Hundreds more students than usual might be put into temporary and off-campus leased housing at the start of the semester
  10. THE SITUATION: TOO MANY STUDENTS, TOO LITTLE HOUSING
      • Typical Problem Amplified
        o While temporary housing is normal at many universities, the need goes up with unexpected enrollment
      • Limited, Non-Ideal Space
        o Temporary space is not unlimited, nor is it ideal for learning
      • Off-Campus Leased Housing
        o Beyond temporary space, Purdue also leases space to house excess returning students
        o This space is not campus-adjacent, and therefore also not ideal; it is also limited
  11. THE SOLUTION: BUILD A MODEL IN XGBOOST USING GREENPLUM
      • Build a model - quickly
        o The decision was made to try to predict which people coming to Purdue’s housing system would not show up
        o The goal: reduce the number of student move disruptions from temporary housing, and maximize on-campus housing space
        o From concept to execution, there were less than two months in which to build this model and put its results to use
      • Blending data
        o Housing data was not in the Greenplum system and needed to be pulled in so it could be blended with everything else needed for the model
      • Two Models
        o The work was divided into two models for two fundamentally different groups: new students and those returning to campus housing
  12. THE SOLUTION: BUILD A MODEL IN XGBOOST AND GREENPLUM
      • First Iteration
        o The model was put together mostly using features from prior student success models
      • Performance & Usage
        o Initial performance allowed us to provide a list of students sorted by how likely they were to cancel
        o This list was used to make phone calls to these students and confirm their intent to utilize campus housing

      Returning Students - Version 1
      Cancelled  Precision  Recall  F-Score  Support
      0          0.932      0.775   0.846    1833
      1          0.225      0.538   0.317    223

      New Students - Version 1
      Cancelled  Precision  Recall  F-Score  Support
      0          0.997      0.956   0.976    2765
      1          0.463      0.929   0.618    113
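The F-Score column in the Version 1 tables is the harmonic mean of precision and recall, and the call list is a ranking by predicted probability. A minimal sketch of both follows; the student IDs and probabilities are hypothetical, and `call_list` is an illustrative helper, not the function used at Purdue.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall from the "Cancelled = 1" rows of the Version 1 tables.
returning_v1 = f_score(0.225, 0.538)
new_v1 = f_score(0.463, 0.929)


def call_list(probabilities, top_n):
    """Return the top_n student IDs, highest predicted cancel probability first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [student_id for student_id, _ in ranked[:top_n]]


# Hypothetical model output: student ID -> predicted probability of cancelling.
scores = {"S001": 0.12, "S002": 0.87, "S003": 0.55, "S004": 0.91}

print(round(returning_v1, 3), round(new_v1, 3))  # 0.317 0.618, matching the tables
print(call_list(scores, 2))                      # ['S004', 'S002']
```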
  13. INITIAL SUCCESS: MORE EFFICIENT SPACE USAGE
      • Typical Year
        o Typically, rooms in the Union hotel are reserved as temporary space
        o Additionally, other temporary spaces usually house students until after October break
      • Fall 2018 Temporary Housing
        o Partly due to calling students with a high probability of cancelling, temporary housing actually saw a reduction in strain
        o Not only were all students out of temporary housing by October break, but rooms at the PMU were released prior to the start of classes
  14. SUCCESS WITH ISSUES: USEFUL, NEEDS IMPROVEMENT
      • There was a cohort of students that did not retain at Purdue, which the model missed
      • The model is highly unsure of many students
      • This was due, in part, to a bad definition of ‘returner’ and of ‘cancel’ in the model; the definitions needed to be fixed and the model retrained
  15. RETRAINING: ADDITIONAL FEATURE BUILD
      • Tuning & New Features
        o New features and further tuning of the model’s parameters massively improved the model for returning students (F-score on cancellations rose from 0.317 to 0.577)
      • Impact
        o Far more accurate model, fewer calls required to reach the students intending to cancel

      Returning Students - Version 2
      Cancelled  Precision  Recall  F-Score  Support
      0          0.961      0.938   0.949    1880
      1          0.524      0.642   0.577    201

      New Students - Version 2
      Cancelled  Precision  Recall  F-Score  Support
      0          0.996      0.965   0.980    2736
      1          0.555      0.917   0.691    132
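The "fewer calls" claim can be read off the returning-student tables. Precision is the share of flagged students who really cancel, so roughly 1 / precision calls are needed per canceller reached. The sketch below assumes every flagged student gets exactly one call and that the reported metrics carry over to the live population. The counts are approximate, since they are rebuilt from rounded figures, and the two versions were evaluated on different supports.

```python
def flagged_summary(precision, recall, support):
    """Approximate call workload implied by the metrics for the 'cancelled' class."""
    true_positives = recall * support      # cancellers the model catches
    flagged = true_positives / precision   # everyone the model flags
    return {
        "cancellers_reached": round(true_positives),
        "students_flagged": round(flagged),
        "calls_per_canceller": round(1 / precision, 2),
    }


# "Cancelled = 1" rows of the Returning Students tables (Version 1 and Version 2).
v1 = flagged_summary(0.225, 0.538, 223)
v2 = flagged_summary(0.524, 0.642, 201)

print(v1)
print(v2)
```

Under these assumptions, Version 1 flags about 533 students to reach about 120 cancellers, while Version 2 flags about 246 to reach about 129. That is about 4.4 calls per canceller for Version 1 and about 1.9 for Version 2.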
  17. NEXT STEPS: FUTURE TUNING & USAGE
      • Post-hoc Data Recording
        o In Fall 2019, housing will record which students they call and when, so that we can better match the calls with the actual results when cancellations come in after August
      • Potential Future Retraining
        o New housing is being built on campus to keep up with the growing population. Once that is online, cancellation patterns may change and require retraining
        o Otherwise, keeping up with post-hoc analysis of results should indicate when retraining is next necessary
        o Due to the setup of the model in Greenplum, retraining is quick & easy!
  18. APPENDIX
  19. IMPORTANT FEATURES: TOP FEATURES IN XGBOOST MODELING RESULTS

      New Students Model
      Rank  Feature                       Score
      1     star_registration_promptness  272
      2     hs_core_gpa                   225
      3     population                    223
      4     medianfemalebachincome        190
      5     medianmalebachincome          174
      6     hs_gpa                        167
      7     closet_rep_miles              166
      8     bach25plus                    159
      9     per_capita_income             158
      10    mast25plus                    153
      11    days_before_start_sign        120
      12    highest_satr_ebrw             100
      13    highest_satr_total            91
      14    highest_satr_math             85
      15    ap_avg                        77
      16    decision_count                75
      17    ap_cnt                        56
      18    vstar_ind                     45

      Returners Model
      Rank  Feature                       Score
      1     semester_cdfw_rate            745
      2     prior_overall_gpa             743
      3     avg_weekly_rectrac_swipes     718
      4     hs_core_gpa                   579
      5     medianfemalebachincome        563
      6     closet_rep_miles              549
      7     population                    482
      8     bach25plus                    450
      9     per_capita_income             432
      10    ap_avg                        422
      11    medianmalebachincome          413
      12    days_before_start_sign        403
      13    num_room_changes_last_year    389
      14    num_classes_registered        374
      15    highest_satr_ebrw             372
      16    mast25plus                    363
      17    hs_gpa                        345
      18    highest_satr_math             318
      19    highest_satr_total            304
      20    hs_gpa_vs_hs_inst_gpa_diff    280
      21    hs_size                       261
      22    hs_inst_gpa                   251
      23    ap_cnt                        248
      24    age                           201
      25    roomie_avg_gpa                189
      26    age_as_of_semstart            103
      27    roomie_gpa_diff               92
