Analyzing Road Side Breath Test Data with WEKA

ANALYZING ROAD-SIDE BREATH
TEST DATA

GROUP MEMBERS
• Micheal Abaho
• Yogesh Shinde
• Natasha Thakur
• Mingyang Chen
• Huw Fulcher
• Kai Wang

OBJECTIVE
• To understand how attributes explain intoxication in
pulled over drivers
• Analyze the dataset
• Determine what attributes to classify intoxication with
• Perform classification using dataset
• Assess success of classification in explaining
intoxication

DATASET
• Acquired from data.gov.uk
• 2014 data on roadside breath
tests
• Approximately 300,000 records

• Reason for test
• Suspicion of Alcohol, Road Traffic Collision, Moving Traffic Violation and Other
• Month
• Jan to Dec
• Year
• 2014
• Week Type:
• Weekday and Weekend
• Time Band
• 12am-4am, 12pm-4pm, 4am-8am, 4pm-8pm, 8am-12pm, 8pm-12am and Unknown
• Age Band for Drivers
• 16-19, 20-24, 25-29, 30-39, 40-49, 50-59, 60-69, 70-98 and Other
• Gender for Drivers
• Male and Female
• Breath Alcohol Level
ATTRIBUTES

PRE-PROCESSING DATA
• Removing year
• Removing outliers
• Creating decision variable

REASON*
Intoxicated =
0.0735 * Reason=Suspicion of Alcohol +
0.0365 * Reason=Other +
-0.0428 * Reason=Moving Traffic Violatio
+ 0.1132

MONTH
Intoxicated =
-0.0453 * Month=Jan +
-0.0224 * Month=Feb +
-0.0173 * Month=Mar +
-0.0147 * Month=Apr +
-0.0086 * Month=May +
-0.0952 * Month=Jun +
-0.0189 * Month=Jul +
-0.013 * Month=Sep +
-0.0179 * Month=Oct +
-0.0295 * Month=Nov +
-0.1249 * Month=Dec
+ 0.1669

TIMEBAND*
Intoxicated =
0.1009 * TimeBand=12am-4am +
0.0733 * TimeBand=4am-8am +
-0.0368 * TimeBand=4pm-8pm
+
-0.0539 * TimeBand=12pm-4pm
+
-0.0598 * TimeBand=8am-12pm
+ 0.118

EVALUATION MEASURE
• A classifier predicts all data instances of a dataset as either
positive or negative.
• This classification (or prediction) produces four outcomes –
true positive, true negative, false positive and false negative.

WHAT IS TP,FP,FN,TN?
• True Positive (TP) – It is an instance which is correctly
predicted to belong to class.
• True Negative (TN) – It is an instance which is correctly
predicted to not belong to class.
• False Positive (FP) – It is an instance which is incorrectly
predicted to belong to class.
• False Negative (FN) – It is an instance which is incorrectly
predicted to not belong to class.

CONFUSION MATRIX
• A confusion matrix is a two by two table formed by counting of
the number of the four outcomes of a classifier that is TP, FP, TN,
FN.
Predicte
d
Class A Class B <- classified as
Observed TP FN Class A
FP TN Class B

MEASURES FROM THE CONFUSION
MATRIX
• Error rate (ERR) is calculated as the number of all incorrect predictions divided by the total
number of the dataset.
• The best error rate is 0.0, whereas the worst is 1.0.
• Accuracy (ACC) is calculated as the number of all correct predictions divided by the total number
of the dataset.
• The best accuracy is 1.0, whereas the worst is 0.0.

• True positive rate (TPR) is calculated as the number of correct positive predictions divided by the
total number of positives.
• The best sensitivity is 1.0, whereas the worst is 0.0.
• False positive rate (FPR) is calculated as the number of incorrect negative predictions divided by
the total number of negatives.
• The best false positive rate is 0.0 whereas the worst is 1.0.

• Precision (PREC) is calculated as the number of correct positive predictions divided by the total
number of positive predictions.
• The best precision is 1.0, whereas the worst is 0.0.
• Recall is proportion of actual positives that were predicted positive.
• F-measure is a harmonic mean of precision and recall.

• J48 is the improved version of C4.5
• C4.5 is a program that creates a decision tree based on a set of labelled input
data.
• First it constructs a very huge tree by considering all attribute values and
narrow down the decision rule with the help of pruning.
• Pruning reduces the size of decision trees by removing sections of the tree
that provide little power to classify instances.
• Information gain or entropy measure is used to get the best attribute to split
the Nodes.
• A tree structure is created with root node, intermediate and leaf nodes,
where Node
holds the decision and in turn decision helps to achieve our result.
CLASSIFICATION BASED ON TREES
(J48)

• Attributes: Reasons, AgeBand, TimeBand, Gender
• Object: Driver
• Class: Yes/No for intoxication.
• Test Mode:10 Fold Cross Validation
• Pruned Tree
EXPERIMENT WORK AND OUTCOME

Summary
J48:Pruned Tree
Number of Leaves :1
Size of the tree :1
No (323555.0/37379.0)
J48 CLASSIFICATION OUTPUT

Confusion Matrix
Predicted
Actual
Detailed Accuracy By Class

RULE BASED CLASSIFICATION (JRIP)
Decision Tree and Decision Table(classify rule)

• Repeated Incremental Pruning to
Produce Error Reduction (RIPPER)
• Optimized version of IREP (reduced
error pruning) a very common and
effective technique found in decision
tree algorithms

• The training data is split into a growing set and a pruning set
• Growing set: greedily adding conditions until the rule is perfect
• pruning set: delete conditions until find better rule
• Rule set generate by growing rule and pruning rule
• Optimization stage

WHICH CLASSIFICATION ALGORITHM?
• Accuracy of Classifier Both J48 and JRip
for our case is high
• Speed: Time 4.26s in JRip; 1.14 in J48
• Robustness：Noisy data/missing data
• Scalability：Size of dataset becomes
big

WHAT IT IS AND WHAT IT DOES
 Determines how a dependent variable is affected by one or more independent variables.
Dependent variable:- Is a result or something that is being predicted.
Independent variable: Predictor.
 Regression Equation (In its simplicity)
Y = a + bX + 𝒆
[ Y – (Dependent variable), X – (Ind variable)Expected value of )
 Aim is to ensure you find values of a and b such that e is small

THE REGRESSION MODEL DERIVED
y
𝑒 - error
a - intercept
X
𝑦 = 𝑎 + 𝑏𝑥 + 𝑒

LOGISTIC REGRESSION
Why this regression
1. Predictive analysis of a dichotomous dependent variable.
• E.g. for our case we are building a model that predicts whether some one
is intoxicated or not. i.e. what do factors like violating traffic rules, age-
band and time band tell us about the probability that a person is
intoxicated or not when they’re stopped by police.
2. We discover additional trends in data without having to run other tests
how each of the predictors affects the resultant dependent variable.

RESULTS AND EVALUATION – REGRESSION
MODEL
No Yes Precisio
n
Recal
l
F-
Measure
ROC
Area
No 285403 773 0.885 0.997 0.938 0.726
Yes 36923 456 0.371 0.012 0.024 0.726
Weighted
Average
0.826 0.883 0.832 0.726
Classified/Predicted
Actual
Correctly Classified Instances 285859 88.3494 %
Incorrectly Classified
Instances
37696 11.6506 %
Mean absolute error 0.1886
Root mean squared error 0.3075
Relative absolute error 92.2893 %
Root relative squared error 96.1835 %
Total Number of Instances 323555

Attribute Coefficients Odds
Reason=Suspicion of Alcohol -0.486 0.6151
Reason=Moving Traffic Violation 0.5511 1.7352
TimeBand=12am-4am -0.849 0.4278
TimeBand=4am-8am -0.6143 0.541
TimeBand=8am-12pm 0.6492 1.914
AgeBand=16-19 0.1976 1.2184
AgeBand=25-29 -0.2082 0.8121
AgeBand=70-98 0.86 2.3632
Gender=Male -0.1297 0.8784
Gender=Female 0.1268 1.1352
Intercept 2.3189
From 𝐥𝐨𝐠
𝒑
𝟏 −𝒑
= a + bX 𝐥𝐨𝐠
𝑰𝒏𝒕𝒐𝒙𝒊𝒄𝒂𝒕𝒆𝒅
𝟏 −𝒊𝒏𝒕𝒐𝒙𝒊𝒄𝒂𝒕𝒆𝒅
= 2.3189 – 0.486*(Sus_Alc) + 0.5511*(Mov_Traf) – 0.849
0.849 *
(Timeband) + …………………………..
Regression equation predicting whether some one is intoxicated or not.

CONCLUSION/WHAT WE DISCOVERED
• Four “optimal” attributes to use in classification
• J48 – Performs well but not practical
• JRip – Most accurate (Not by much) but needs
tweaking
• Regression – “Best” of the 3

CONCLUSION / RECOMMENDATION
• Test the data set for more assumptions – Normality,
Multi-collinearity,
• and Homoscedasticity.
• Transform the dataset to minimize the errors
generated from the biased number of cases
belonging to class (No – None intoxication).
• Explore further experiments including other factors
that are potential predictors of intoxication. e.g
Offences (How offensive is a person when asked to
pull-over by police).

Analyzing Road Side Breath Test Data with WEKA

Recommandé

Recommandé

Contenu connexe

En vedette

En vedette (6)

Similaire à Analyzing Road Side Breath Test Data with WEKA

Similaire à Analyzing Road Side Breath Test Data with WEKA (20)

Dernier

Dernier (20)

Analyzing Road Side Breath Test Data with WEKA

Notes de l'éditeur