15. Data Processing
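Records with departure times of 10 and 109 fall outside the 600 to 2200 range, so they were judged to be outliers and deleted
Scheduled departure time data was regrouped into 16 time blocks
Excluded from the analysis variables: the actual departure time, which cannot be known in advance at prediction time, and the variable that is at nearly the same level for every flight because all flights are on the Washington DC to New York route (mean 211.87, median 214, mode 214, standard deviation 13.31)
Excluded the nominal variables tail number and flight number from the analysis variables
Excluded the flight date from the analysis variables because it is less useful for future prediction than the day of the week
Randomly split the data set into a Training set and a Validation set at a 60:40 ratio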
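The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual procedure: the function names, the sample departure times, the random seed, and the choice of half-open one-hour blocks (600 up to but not including 700, and so on) are all assumptions made for the example.

```python
import random

def is_valid_departure(hhmm, low=600, high=2200):
    """Keep only hhmm times inside the 600 to 2200 window (upper bound excluded, an assumption)."""
    return low <= hhmm < high

def time_block(hhmm):
    """Map an hhmm time such as 1545 to its one-hour block label, e.g. '1500-1600'."""
    start = (hhmm // 100) * 100
    return f"{start}-{start + 100}"

def split_60_40(rows, seed=0):
    """Shuffle the rows with a fixed seed and cut them 60:40 into training and validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 0.6)
    return shuffled[:cut], shuffled[cut:]

# Illustrative departure times; 10 and 109 are the outliers named on the slide.
departure_times = [10, 109, 600, 845, 1545, 2130]
kept = [t for t in departure_times if is_valid_departure(t)]
blocks = [time_block(t) for t in kept]

# All block labels between 600 and 2200: there are 16 of them.
all_blocks = [time_block(t) for t in range(600, 2200, 100)]

train, valid = split_60_40(range(10))
```

Fixing the seed keeps the random split reproducible, which makes the training and validation reports comparable across runs.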
16. Naïve Bayes
Conditional probabilities

Input variable        Value        Prob (ontime)   Prob (delayed)
CARRIER               CO           0.036312849     0.06122449
                      DH           0.231843575     0.306122449
                      DL           0.188081937     0.118367347
                      MQ           0.118249534     0.163265306
                      OH           0.013035382     0.012244898
                      RU           0.174115456     0.244897959
                      UA           0.016759777     0.004081633
                      US           0.22160149      0.089795918
DEST                  EWR          0.273743017     0.387755102
                      JFK          0.176908752     0.187755102
                      LGA          0.549348231     0.424489796
ORIGIN                BWI          0.057728119     0.102040816
                      DCA          0.645251397     0.502040816
                      IAD          0.297020484     0.395918367
Weather               0            1               0.930612245
                      1            0               0.069387755
DAY_WEEK              Mon          0.131284916     0.220408163
                      Tue          0.14990689      0.130612245
                      Wed          0.148044693     0.151020408
                      Thur         0.181564246     0.130612245
                      Fri          0.170391061     0.159183673
                      Sat          0.111731844     0.069387755
                      Sun          0.10707635      0.13877551
Binned_CRS_DEP_TIME   600-700      0.058659218     0.032653061
                      700-800      0.055865922     0.053061224
                      800-900      0.082867784     0.06122449
                      900-1000     0.047486034     0.016326531
                      1000-1100    0.044692737     0.032653061
                      1100-1200    0.040968343     0.016326531
                      1200-1300    0.0716946       0.065306122
                      1300-1400    0.083798883     0.048979592
                      1400-1500    0.090316574     0.146938776
                      1500-1600    0.067970205     0.085714286
                      1600-1700    0.081005587     0.07755102
                      1700-1800    0.104283054     0.13877551
                      1800-1900    0.044692737     0.028571429
                      1900-2000    0.047486034     0.089795918
                      2000-2100    0.019553073     0.024489796
                      2100-2200    0.058659218     0.081632653
Prior class probabilities
According to relative occurrences in training data

Class      Prob.
ontime     0.814253222  <-- Success Class
delayed    0.185746778
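The prior class probabilities are relative frequencies of each class in the training data. A minimal sketch, assuming the class counts 1074 (ontime) and 245 (delayed) that appear in the training error report on the performance evaluation slide; the function name `class_priors` is hypothetical.

```python
def class_priors(class_counts):
    """Return each class's share of the total number of training records."""
    total = sum(class_counts.values())
    return {label: count / total for label, count in class_counts.items()}

# Training-set class sizes taken from the error report (1074 + 245 = 1319 records).
priors = class_priors({"ontime": 1074, "delayed": 245})
```

The conditional probabilities in the table are estimated the same way, but within each class: the share of that class's training records that take a given value.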
Example: flying RU (Continental Express Airline) on a Wednesday, departing 15:00 to 16:00, from IAD to LGA (weather is good)
Ontime = 0.81 * 0.174 * 0.148 * 0.068 * 0.297 * 0.549 * 1
= 0.00023128
Delay = 0.186 * 0.245 * 0.424 * 0.396 * 0.151 * 0.0857 * 0.931
= 0.00009218
Ontime probability = 0.00023128 / (0.00023128 + 0.00009218)
= 71.5% (above the 50% cutoff value, so classified as ontime)
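The same calculation can be reproduced in Python by multiplying each class prior by the six conditional probabilities for this flight and normalizing the two products. This is a sketch using the unrounded values from the conditional probability and prior tables; the dictionary layout and the names `PRIOR`, `COND`, and `posterior` are assumptions made for illustration.

```python
from math import prod

# Prior class probabilities from the training data.
PRIOR = {"ontime": 0.814253222, "delayed": 0.185746778}

# Conditional probabilities for the example flight:
# RU, Wednesday, 1500-1600 block, IAD to LGA, Weather = 0 (good weather).
COND = {
    "ontime": {
        "CARRIER=RU": 0.174115456,
        "DAY_WEEK=Wed": 0.148044693,
        "Binned_CRS_DEP_TIME=1500-1600": 0.067970205,
        "ORIGIN=IAD": 0.297020484,
        "DEST=LGA": 0.549348231,
        "Weather=0": 1.0,
    },
    "delayed": {
        "CARRIER=RU": 0.244897959,
        "DAY_WEEK=Wed": 0.151020408,
        "Binned_CRS_DEP_TIME=1500-1600": 0.085714286,
        "ORIGIN=IAD": 0.395918367,
        "DEST=LGA": 0.424489796,
        "Weather=0": 0.930612245,
    },
}

def posterior(prior, cond):
    """Naive Bayes: score = prior * product of conditionals, then normalize over classes."""
    scores = {label: prior[label] * prod(cond[label].values()) for label in prior}
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

result = posterior(PRIOR, COND)
predicted = "ontime" if result["ontime"] >= 0.5 else "delayed"  # 0.5 cutoff for the success class
print(result, predicted)
```

Normalizing by the sum of the two products is what turns the raw scores into probabilities that can be compared against the cutoff value.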
17. Performance Evaluation
Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                 Predicted Class
Actual Class     ontime    delayed
ontime           1049      25
delayed          205       40

Error Report
Class      # Cases   # Errors   % Error
ontime     1074      25         2.33
delayed    245       205        83.67
Overall    1319      230        17.44

Validation Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5

Classification Confusion Matrix
                 Predicted Class
Actual Class     ontime    delayed
ontime           685       14
delayed          155       26

Error Report
Class      # Cases   # Errors   % Error
ontime     699       14         2.00
delayed    181       155        85.64
Overall    880       169        19.20

Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.3

Classification Confusion Matrix
                 Predicted Class
Actual Class     ontime    delayed
ontime           1074      0
delayed          228       17

Error Report
Class      # Cases   # Errors   % Error
ontime     1074      0          0.00
delayed    245       228        93.06
Overall    1319      228        17.29

Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.8

Classification Confusion Matrix
                 Predicted Class
Actual Class     ontime    delayed
ontime           672       402
delayed          83        162

Error Report
Class      # Cases   # Errors   % Error
ontime     1074      402        37.43
delayed    245       83         33.88
Overall    1319      485        36.77
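Each error report follows directly from its confusion matrix: the errors for a class are the off-diagonal counts in that class's row. The sketch below recomputes the training report at the 0.5 cutoff. The function names are hypothetical, and treating a probability exactly equal to the cutoff as a success is an assumption, not something the reports state.

```python
def classify(p_ontime, cutoff=0.5):
    """Assign the success class when its probability reaches the cutoff (>= is an assumption)."""
    return "ontime" if p_ontime >= cutoff else "delayed"

def error_report(confusion):
    """confusion[actual][predicted] holds counts; returns (# cases, # errors, % error) per class."""
    report = {}
    total_cases = total_errors = 0
    for actual, row in confusion.items():
        cases = sum(row.values())
        errors = cases - row[actual]  # everything not on the diagonal
        report[actual] = (cases, errors, round(100 * errors / cases, 2))
        total_cases += cases
        total_errors += errors
    report["Overall"] = (total_cases, total_errors,
                         round(100 * total_errors / total_cases, 2))
    return report

# Training data, cutoff 0.5 (counts from the confusion matrix in the summary report).
training_05 = {
    "ontime": {"ontime": 1049, "delayed": 25},
    "delayed": {"ontime": 205, "delayed": 40},
}
report = error_report(training_05)
```

Raising the cutoff for the success class moves records from predicted ontime to predicted delayed, which is why the delayed error rate falls and the ontime error rate rises between the 0.3, 0.5, and 0.8 reports.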
18. Performance Evaluation
[Figure: Lift chart (training dataset). x-axis: # cases; y-axis: Cumulative. Series: Cumulative Flight Status when sorted using predicted values; Cumulative Flight Status using average]
[Figure: Decile-wise lift chart (training dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]
[Figure: Lift chart (validation dataset). x-axis: # cases; y-axis: Cumulative. Series: Cumulative Flight Status when sorted using predicted values; Cumulative Flight Status using average]
[Figure: Decile-wise lift chart (validation dataset). x-axis: Deciles; y-axis: Decile mean / Global mean]
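The two chart types can be computed from a list of predicted probabilities and actual outcomes. The following is a sketch of the usual construction, not the tool's exact procedure: records are sorted by predicted probability in descending order, the cumulative curve sums the actual outcomes in that order, and the decile-wise chart divides each decile's mean outcome by the global mean. The function names and the toy data are hypothetical.

```python
def cumulative_gains(scores, actuals):
    """Cumulative count of actual successes when records are sorted by score, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    running, curve = 0, []
    for i in order:
        running += actuals[i]
        curve.append(running)
    return curve

def decile_lift(scores, actuals, n_bins=10):
    """Decile mean / global mean; assumes at least n_bins records and a nonzero global mean."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [actuals[i] for i in order]
    global_mean = sum(ranked) / len(ranked)
    lifts = []
    for b in range(n_bins):
        lo = b * len(ranked) // n_bins
        hi = (b + 1) * len(ranked) // n_bins
        chunk = ranked[lo:hi]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts

# Toy data: 20 records where the 10 highest-scored ones are the actual successes.
toy_scores = [i / 20 for i in range(20)]
toy_actuals = [1 if i >= 10 else 0 for i in range(20)]
toy_curve = cumulative_gains(toy_scores, toy_actuals)
toy_lifts = decile_lift(toy_scores, toy_actuals)
```

A decile value above 1 means that decile contains a higher share of the success class than the data set as a whole; the average line in the cumulative chart corresponds to selecting records at random.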