2. DATA SET
• “Past Performance” from TrackMaster for races September 26, 2013 at Yonkers Raceway
• Published in advance of the race
• Cost: $1.50
• Comes in XML format; parsed using Python
• Contains 10 most recent PPs for each horse racing that day
• 12 races x 8 horses x 10 past performances = 960 records
• Variables of use: Lengths back at each quarter, final time, lead final time, gait, age (meta), track condition, track name, track length
• Created race-level, horse-race-level, and longitudinal data sets for different aspects of this analysis
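A minimal sketch of how XML like this might be flattened into horse-race-level records with Python's standard library. The element and attribute names (`card`, `race`, `horse`, `pp`, `final_time`, and so on) and all sample values are hypothetical placeholders; the actual TrackMaster schema is not shown in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical, made-up sample; the real TrackMaster schema and values differ.
SAMPLE_XML = """
<card track="Yonkers Raceway" date="2013-09-26">
  <race number="1">
    <horse name="Horse A" age="5">
      <pp condition="fast" gait="pacer" final_time="115.2"/>
      <pp condition="sloppy" gait="pacer" final_time="117.0"/>
    </horse>
    <horse name="Horse B" age="7">
      <pp condition="good" gait="pacer" final_time="116.4"/>
    </horse>
  </race>
</card>
"""


def flatten_past_performances(xml_text):
    """Return one dict per past performance (horse-race level)."""
    root = ET.fromstring(xml_text)
    records = []
    for race in root.iter("race"):
        for horse in race.iter("horse"):
            for pp in horse.iter("pp"):
                records.append({
                    "race": int(race.get("number")),
                    "horse": horse.get("name"),
                    "age": int(horse.get("age")),
                    "condition": pp.get("condition"),
                    "gait": pp.get("gait"),
                    "final_time": float(pp.get("final_time")),
                })
    return records


records = flatten_past_performances(SAMPLE_XML)

# Size of the full card described above: races x horses x past performances.
expected_full_card = 12 * 8 * 10
```

Keeping one flat record per past performance makes it straightforward to aggregate up to race level or to sort by horse and date for the longitudinal view.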
3. GAIT AND CONDITION
• Hypothesis: Gait and track condition influence race time
• Gait
• Binary: Pacers and Trotters
• Each race is one or the other
• Each horse is one or the other
• Condition
• Categorical: Fast, Good, or Sloppy
• Each race categorized into one
• Created and cleaned race-level data set
• A comparison of means showed that mean race time differs by gait and by track condition
• t-tests showed these differences are statistically significant
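As an illustration of the kind of comparison described above, here is a small sketch of a two-sample t statistic using only the Python standard library. It assumes Welch's unequal-variance form, which may not be the variant used in the original analysis, and the race times are made up for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up race-level final times, for illustration only.
pacer_times = [114.8, 115.6, 116.0, 115.2, 114.9]
trotter_times = [118.9, 119.4, 118.2, 120.1, 119.0]

t_stat, dof = welch_t(pacer_times, trotter_times)
```

A negative `t_stat` here means the first group (pacers in this made-up sample) has the lower mean time; a p-value would come from the t distribution with `dof` degrees of freedom.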
8. CORRELATION: LENGTHS BACK AT CALLS
• Some horses pull away early, others seem to wait for the last quarter to go to the front
• TrackMaster reports lengths back from lead and calls at each quarter
• Lengths are recorded as fractional numbers (to the quarter) and as parts of horse
• Nose
• Head
• Neck
• Additional complication: “costly breaks” of pace and disqualification
• Still not happy: strange lengths back for winners at final
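One way to put margins recorded as fractions and as parts of a horse on a single numeric scale is sketched below. The input strings and the numeric values assigned to nose, head, and neck are assumptions made for illustration; the actual TrackMaster encoding is not shown in these slides.

```python
from fractions import Fraction

# Assumed conversions to lengths; these values are illustrative guesses.
PART_OF_HORSE = {"nose": 0.05, "head": 0.1, "neck": 0.25}


def lengths_back(text):
    """Convert a margin such as 'Neck', '3', or '2 1/4' to a float."""
    text = text.strip().lower()
    if text in PART_OF_HORSE:
        return PART_OF_HORSE[text]
    # Sum the whole and fractional parts, e.g. '2 1/4' -> 2 + 1/4.
    return float(sum(Fraction(part) for part in text.split()))


margins = [lengths_back(s) for s in ["Nose", "Neck", "3", "2 1/4", "1/2"]]
```

Breaks of pace and disqualifications would still need separate flags, since they are not margins and should not be coerced into numbers.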
11. AGE AND SPEED
• Goal: Quantify how much horses slow down with age
• Merged metadata for each horse with past performance data
• Single-variable regression analysis of mean data set
• Found that age is not a great predictor of speed
• Age: Discrete, yet not categorical
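The single-variable regression above can be sketched as ordinary least squares with one predictor. This is a plain-Python illustration rather than the SAS procedure used in the analysis, and the ages and mean times are made up.

```python
from statistics import mean


def simple_ols(x, y):
    """Fit y = intercept + slope * x; return (slope, intercept, r_squared)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up horse-level data: age in years and mean final time.
ages = [3, 4, 5, 6, 7, 8]
mean_times = [116.0, 115.4, 116.3, 115.8, 116.5, 116.1]

slope, intercept, r_squared = simple_ols(ages, mean_times)
```

A small `r_squared` on real data would correspond to the finding that age alone explains little of the variation in speed.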
12. MULTIVARIATE REGRESSION
• Longitudinal data set
• Created dummy variables for past and present track conditions, gaits, and track sizes
• Used SAS’s “Lag” and “Last” features
• Removed disqualified races
• Modeled race time based on current race conditions and two races prior
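The lagging step above was done in SAS; the following is a rough Python analogue, not the original DATA step. It builds lag-1 and lag-2 final times within each horse and a condition dummy. The field names (`horse`, `date`, `condition`, `final_time`) and the sample rows are hypothetical.

```python
from collections import defaultdict


def add_lags(rows, n_lags=2):
    """Return copies of rows with per-horse lagged final times and a dummy."""
    by_horse = defaultdict(list)
    # Sort by horse, then date, so each history is in chronological order.
    for row in sorted(rows, key=lambda r: (r["horse"], r["date"])):
        by_horse[row["horse"]].append(row)

    out = []
    for history in by_horse.values():
        for i, row in enumerate(history):
            new_row = dict(row)
            new_row["fast"] = int(row["condition"] == "fast")  # dummy variable
            for k in range(1, n_lags + 1):
                # A lag is missing (None) until the horse has k earlier races.
                prev = history[i - k]["final_time"] if i >= k else None
                new_row[f"lag{k}_final_time"] = prev
            out.append(new_row)
    return out


# Made-up rows, deliberately out of order.
rows = [
    {"horse": "A", "date": "2013-09-12", "condition": "fast", "final_time": 116.0},
    {"horse": "A", "date": "2013-08-29", "condition": "good", "final_time": 117.2},
    {"horse": "A", "date": "2013-09-19", "condition": "sloppy", "final_time": 118.1},
    {"horse": "B", "date": "2013-09-19", "condition": "fast", "final_time": 115.5},
]

lagged = add_lags(rows)
# Only rows with both lags present can enter a two-lag model.
model_rows = [r for r in lagged if r["lag2_final_time"] is not None]
```

Grouping by horse before lagging keeps one horse's history from leaking into the next horse's first races.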
13. MULTIVARIATE REGRESSION

Control Variables
Label             Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept         104.67788            4.81142          21.76     <.0001
Lag final time    0.01412              0.03120          0.45      0.6510
Lag2 final time   0.11361              0.02975          3.82      0.0001
Pacer             -3.68185             0.21247          -17.33    <.0001
Fast              -0.77005             0.38954          -1.98     0.0484
Sloppy            0.86942              0.43605          1.99      0.0465
Age               0.05312              0.04023          1.32      0.1871
5/8 Track         -2.74052             0.20313          -13.49    <.0001
1 Track           -3.18411             0.47824          -6.66     <.0001

Variables of Interest
Label             Parameter Estimate   Standard Error   t Value   Pr > |t|
Fast lag          0.35883              0.38598          0.93      0.3528
Sloppy lag        0.48532              0.43151          1.12      0.2610
Fast lag2         0.09472              0.37245          0.25      0.7993
Sloppy lag2       -0.39904             0.42068          -0.95     0.3431
5/8 Track lag     0.14639              0.23680          0.62      0.5366
1 Track lag       0.40192              0.51792          0.78      0.4379
5/8 Track lag2    0.58564              0.21764          2.69      0.0073
1 Track lag2      0.67260              0.49172          1.37      0.1717

Final race times from previous races are not great determinants of final race time this race!
14. PREDICTION OF SEPTEMBER 26 RACES
• Used the coefficients from my multivariate regression and most recent two races for each horse
• Ranked horses by predicted race time
• My bets weren’t great, but they were better than choosing at random!
• Reason: Very low variance in race times among horses. Not enough predictive power in model, even with R^2 > 0.5
[Figure: Predicting the Winner (categories: Right, Wrong)]
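The ranking step can be sketched by plugging each horse's values into the fitted equation and sorting by predicted time. The coefficients below are the parameter estimates reported in the regression table; the dictionary keys, the two horses, and their feature values are hypothetical.

```python
# Parameter estimates from the multivariate regression table.
COEF = {
    "intercept": 104.67788,
    "lag_final_time": 0.01412,
    "lag2_final_time": 0.11361,
    "pacer": -3.68185,
    "fast": -0.77005,
    "sloppy": 0.86942,
    "age": 0.05312,
    "track_5_8": -2.74052,
    "track_1": -3.18411,
    "fast_lag": 0.35883,
    "sloppy_lag": 0.48532,
    "fast_lag2": 0.09472,
    "sloppy_lag2": -0.39904,
    "track_5_8_lag": 0.14639,
    "track_1_lag": 0.40192,
    "track_5_8_lag2": 0.58564,
    "track_1_lag2": 0.67260,
}


def predict_final_time(features):
    """Intercept plus the sum of coefficient * value for each given feature."""
    total = COEF["intercept"]
    for name, value in features.items():
        total += COEF[name] * value
    return total


# Made-up horses; omitted dummy variables are treated as zero.
horses = {
    "Horse A": {"pacer": 1, "fast": 1, "age": 5, "lag_final_time": 116.0,
                "lag2_final_time": 115.5, "fast_lag": 1, "fast_lag2": 1},
    "Horse B": {"pacer": 1, "fast": 1, "age": 8, "lag_final_time": 118.0,
                "lag2_final_time": 119.0, "fast_lag": 1, "fast_lag2": 1},
}

predicted = {name: predict_final_time(f) for name, f in horses.items()}
# Lowest predicted time first: the top of the list is the predicted winner.
ranking = sorted(predicted, key=predicted.get)
```

Because horses in the same race share gait, track, and condition, their predictions differ only through age and the lagged terms, which is consistent with the low spread in predicted times noted above.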
15. FINAL THOUGHTS
• SAS’s LAG and LAST features are great for dealing with longitudinal data
• Most work was on the DATA steps, not the PROC steps
• My model was based on only 960 occurrences, 96 horses
• With more data, might model Pacers and Trotters separately, Conditions separately
• Still want to investigate lengths back for winning horses
• Learned much about SAS and about harness racing