ANALYZING ROAD-SIDE BREATH
TEST DATA
GROUP MEMBERS
• Micheal Abaho
• Yogesh Shinde
• Natasha Thakur
• Mingyang Chen
• Huw Fulcher
• Kai Wang
OBJECTIVE
• To understand how attributes explain intoxication in
pulled-over drivers
• Analyze the dataset
• Determine what attributes to classify intoxication with
• Perform classification using dataset
• Assess success of classification in explaining
intoxication
DATASET
• Acquired from data.gov.uk
• 2014 data on roadside breath
tests
• Approximately 300,000 records
ATTRIBUTES
• Reason for test
• Suspicion of Alcohol, Road Traffic Collision, Moving Traffic Violation and Other
• Month
• Jan to Dec
• Year
• 2014
• Week Type
• Weekday and Weekend
• Time Band
• 12am-4am, 12pm-4pm, 4am-8am, 4pm-8pm, 8am-12pm, 8pm-12am and Unknown
• Age Band for Drivers
• 16-19, 20-24, 25-29, 30-39, 40-49, 50-59, 60-69, 70-98 and Other
• Gender for Drivers
• Male and Female
• Breath Alcohol Level
EXPLORATORY ANALYSIS
PRE-PROCESSING DATA
• Removing year
• Removing outliers
• Creating decision variable
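A minimal WEKA sketch of these pre-processing steps, assuming the 2014 CSV has been converted to ARFF with attributes named Year and BreathAlcoholLevel, and assuming the UK roadside limit of 35 µg of alcohol per 100 ml of breath as the cut-off for the Intoxicated decision variable (file name, attribute names and the threshold are illustrative assumptions; outlier removal is omitted here):

```java
import java.util.Arrays;

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class PreProcess {
    public static void main(String[] args) throws Exception {
        // Load the road-side breath test data (hypothetical file name).
        Instances data = DataSource.read("breath_tests_2014.arff");

        // Remove the Year attribute: it is constant (2014) and carries no information.
        Remove remove = new Remove();
        remove.setAttributeIndices(String.valueOf(data.attribute("Year").index() + 1)); // 1-based index
        remove.setInputFormat(data);
        data = Filter.useFilter(data, remove);

        // Create the binary decision variable "Intoxicated" from the breath alcohol reading.
        // 35 µg/100 ml is the UK roadside limit (an assumption about the chosen threshold).
        Attribute intoxicated = new Attribute("Intoxicated", Arrays.asList("No", "Yes"));
        data.insertAttributeAt(intoxicated, data.numAttributes());
        int breathIdx = data.attribute("BreathAlcoholLevel").index();
        int classIdx = data.numAttributes() - 1;
        for (int i = 0; i < data.numInstances(); i++) {
            double level = data.instance(i).value(breathIdx);
            data.instance(i).setValue(classIdx, level > 35 ? "Yes" : "No");
        }
        data.setClassIndex(classIdx);
        System.out.println(data.toSummaryString());
    }
}
```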
REASON*
Intoxicated =
0.0735 * Reason=Suspicion of Alcohol +
0.0365 * Reason=Other +
-0.0428 * Reason=Moving Traffic Violation
+ 0.1132
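The per-attribute equations on this and the following slides look like WEKA LinearRegression output. A hedged sketch of how such a model could be produced, assuming the Intoxicated indicator is encoded as a numeric 0/1 attribute and only the Reason attribute is kept as a predictor (the file and attribute names are assumptions):

```java
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ReasonModel {
    public static void main(String[] args) throws Exception {
        // Assumes an ARFF where Intoxicated is numeric (0 = not over the limit, 1 = over).
        Instances data = DataSource.read("breath_tests_2014_numeric_class.arff");
        data.setClassIndex(data.attribute("Intoxicated").index());

        // Keep only Reason and the class, so the fitted equation uses a single attribute.
        Remove keep = new Remove();
        keep.setInvertSelection(true); // keep the listed indices instead of removing them
        keep.setAttributeIndices((data.attribute("Reason").index() + 1) + ","
                + (data.classIndex() + 1)); // WEKA uses 1-based indices here
        keep.setInputFormat(data);
        Instances reasonOnly = Filter.useFilter(data, keep);
        reasonOnly.setClassIndex(reasonOnly.attribute("Intoxicated").index());

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(reasonOnly);
        System.out.println(lr); // prints "Intoxicated = ... * Reason=... + intercept"
    }
}
```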
MONTH
Intoxicated =
-0.0453 * Month=Jan +
-0.0224 * Month=Feb +
-0.0173 * Month=Mar +
-0.0147 * Month=Apr +
-0.0086 * Month=May +
-0.0952 * Month=Jun +
-0.0189 * Month=Jul +
-0.013 * Month=Sep +
-0.0179 * Month=Oct +
-0.0295 * Month=Nov +
-0.1249 * Month=Dec
+ 0.1669
WEEKTYPE
TIMEBAND*
Intoxicated =
0.1009 * TimeBand=12am-4am +
0.0733 * TimeBand=4am-8am +
-0.0368 * TimeBand=4pm-8pm +
-0.0539 * TimeBand=12pm-4pm +
-0.0598 * TimeBand=8am-12pm
+ 0.118
AGE* + GENDER*
CLASSIFICATION OF THE DATASET
EVALUATION MEASURE
• A classifier predicts each data instance of a dataset as either
positive or negative.
• This classification (or prediction) produces four outcomes –
true positive, true negative, false positive and false negative.
WHAT ARE TP, FP, FN, TN?
• True Positive (TP) – an instance that is correctly predicted to belong to the class.
• True Negative (TN) – an instance that is correctly predicted to not belong to the class.
• False Positive (FP) – an instance that is incorrectly predicted to belong to the class.
• False Negative (FN) – an instance that is incorrectly predicted to not belong to the class.
CONFUSION MATRIX
• A confusion matrix is a two-by-two table formed by counting the number of
each of the four outcomes of a classifier: TP, FP, TN and FN.

              Predicted
              Class A   Class B   <- classified as
Observed      TP        FN        Class A
              FP        TN        Class B
MEASURES FROM THE CONFUSION MATRIX
• Error rate (ERR) is calculated as the number of all incorrect predictions divided by the total
number of instances in the dataset.
• The best error rate is 0.0, whereas the worst is 1.0.
• Accuracy (ACC) is calculated as the number of all correct predictions divided by the total
number of instances in the dataset.
• The best accuracy is 1.0, whereas the worst is 0.0.
• True positive rate (TPR), or sensitivity, is calculated as the number of correct positive predictions
divided by the total number of actual positives.
• The best sensitivity is 1.0, whereas the worst is 0.0.
• False positive rate (FPR) is calculated as the number of false positives (negatives incorrectly
predicted as positive) divided by the total number of actual negatives.
• The best false positive rate is 0.0, whereas the worst is 1.0.
• Precision (PREC) is calculated as the number of correct positive predictions divided by the total
number of positive predictions.
• The best precision is 1.0, whereas the worst is 0.0.
• Recall is the proportion of actual positives that were predicted positive (the same as the TPR).
• F-measure is the harmonic mean of precision and recall.
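A small self-contained sketch of these definitions, computing each measure directly from TP, FP, FN and TN counts (the counts below are made-up numbers for illustration, not from the experiments):

```java
public class ConfusionMatrixMeasures {
    public static void main(String[] args) {
        // Illustrative counts only.
        double tp = 90, fn = 10, fp = 30, tn = 870;
        double total = tp + fn + fp + tn;

        double err = (fp + fn) / total;            // error rate
        double acc = (tp + tn) / total;            // accuracy
        double tpr = tp / (tp + fn);               // true positive rate (recall, sensitivity)
        double fpr = fp / (fp + tn);               // false positive rate
        double prec = tp / (tp + fp);              // precision
        double f1 = 2 * prec * tpr / (prec + tpr); // F-measure: harmonic mean of precision and recall

        System.out.printf("ERR=%.3f ACC=%.3f TPR=%.3f FPR=%.3f PREC=%.3f F1=%.3f%n",
                err, acc, tpr, fpr, prec, f1);
    }
}
```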
J48
CLASSIFICATION BASED ON TREES (J48)
• J48 is WEKA's open-source Java implementation of the C4.5 algorithm.
• C4.5 is a program that creates a decision tree from a set of labelled input data.
• It first constructs a large tree by considering all attribute values, then narrows
the decision rules down with the help of pruning.
• Pruning reduces the size of decision trees by removing sections of the tree
that provide little power to classify instances.
• Information gain (an entropy-based measure) is used to choose the best
attribute to split on at each node.
• The result is a tree with a root node, intermediate nodes and leaf nodes, where
each node holds a decision and the chain of decisions leads to the final classification.
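A short sketch of the entropy/information-gain calculation that drives the attribute choice at each split, using a toy two-class node (the counts are illustrative, not taken from the breath test data):

```java
public class InformationGain {
    // Entropy of a two-class node with `pos` positive and `neg` negative instances.
    static double entropy(double pos, double neg) {
        double total = pos + neg;
        double e = 0.0;
        for (double c : new double[] {pos, neg}) {
            if (c > 0) {
                double p = c / total;
                e -= p * (Math.log(p) / Math.log(2));
            }
        }
        return e;
    }

    public static void main(String[] args) {
        // Parent node: 40 positives, 60 negatives (illustrative counts).
        double parent = entropy(40, 60);
        // A candidate split produces two children: (30, 10) and (10, 50).
        double child1 = entropy(30, 10), child2 = entropy(10, 50);
        double weighted = (40.0 / 100) * child1 + (60.0 / 100) * child2;
        double gain = parent - weighted; // information gain of this split
        System.out.printf("parent=%.3f weighted children=%.3f gain=%.3f%n", parent, weighted, gain);
    }
}
```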
EXPERIMENT WORK AND OUTCOME
• Attributes: Reason, AgeBand, TimeBand, Gender
• Object: Driver
• Class: Yes/No for intoxication
• Test mode: 10-fold cross-validation
• Pruned tree
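A hedged WEKA sketch of this experiment: build a pruned J48 tree on the four attributes and evaluate it with 10-fold cross-validation (the ARFF file name is an assumption; the pre-processed data is expected to contain only Reason, AgeBand, TimeBand, Gender and the Intoxicated class):

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Experiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breath_tests_preprocessed.arff");
        data.setClassIndex(data.attribute("Intoxicated").index());

        J48 tree = new J48();          // pruned tree by default
        tree.buildClassifier(data);
        System.out.println(tree);      // prints the pruned tree structure

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1)); // 10-fold CV
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());       // confusion matrix
        System.out.println(eval.toClassDetailsString()); // detailed accuracy by class
    }
}
```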
J48 CLASSIFICATION OUTPUT
Summary
J48: pruned tree
Number of leaves: 1
Size of the tree: 1
No (323555.0/37379.0)
[Confusion matrix (predicted vs. actual) and detailed accuracy by class shown as WEKA output screenshots]
JRIP
RULE BASED CLASSIFICATION (JRIP)
Decision tree and decision table (classification rules)
RULE BASED CLASSIFICATION (JRIP)
• Repeated Incremental Pruning to
Produce Error Reduction (RIPPER)
• An optimized version of IREP (Incremental
Reduced Error Pruning), a very common and
effective technique also found in decision
tree algorithms
RULE BASED CLASSIFICATION (JRIP)
• The training data is split into a growing set and a pruning set
• Growing set: conditions are greedily added until the rule is perfect
• Pruning set: conditions are deleted until a better rule is found
• The rule set is generated by repeatedly growing and pruning rules
• Optimization stage
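A minimal sketch of running JRip in WEKA under the same 10-fold cross-validation set-up, again assuming the pre-processed ARFF file described earlier:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.JRip;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class JRipExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breath_tests_preprocessed.arff");
        data.setClassIndex(data.attribute("Intoxicated").index());

        JRip ripper = new JRip();      // RIPPER: grow, prune, then optimize rules
        ripper.buildClassifier(data);
        System.out.println(ripper);    // prints the learned rule set

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new JRip(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```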
RULE OF JRIP
PERFORMANCE OF JRIP
COMPARE WITH J48
WHICH CLASSIFICATION ALGORITHM?
• Accuracy: both J48 and JRip achieve high
accuracy for our case
• Speed: training time of 4.26 s for JRip vs. 1.14 s for J48
• Robustness: handling of noisy or missing data
• Scalability: behaviour as the dataset grows large
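The training times above presumably come from WEKA's own timing output; a rough sketch of how the two classifiers could be timed programmatically on the same data (actual timings will differ by machine):

```java
import weka.classifiers.Classifier;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TimingComparison {
    // Returns the wall-clock time (in seconds) needed to train the classifier.
    static double trainSeconds(Classifier c, Instances data) throws Exception {
        long start = System.nanoTime();
        c.buildClassifier(data);
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breath_tests_preprocessed.arff");
        data.setClassIndex(data.attribute("Intoxicated").index());
        System.out.printf("J48:  %.2f s%n", trainSeconds(new J48(), data));
        System.out.printf("JRip: %.2f s%n", trainSeconds(new JRip(), data));
    }
}
```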
REGRESSION
WHAT IT IS AND WHAT IT DOES
 Determines how a dependent variable is affected by one or more independent variables.
Dependent variable: the outcome that is being predicted.
Independent variable: the predictor.
 Regression equation (in its simplest form): Y = a + bX + e,
where Y is the dependent variable, X the independent variable, a the intercept, b the slope and e the error term.
 The aim is to find values of a and b such that the error e is small.
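A tiny worked example of fitting Y = a + bX + e by least squares, to make the roles of a, b and e concrete (the five data points are made up for illustration):

```java
public class SimpleLeastSquares {
    public static void main(String[] args) {
        // Made-up data points (x, y).
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 2.9, 3.8, 5.2, 5.9};

        double n = x.length, sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < x.length; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        // Closed-form least-squares estimates: b = slope, a = intercept.
        double b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double a = (sumY - b * sumX) / n;

        // Residuals e_i = y_i - (a + b*x_i) are what the fit tries to keep small.
        for (int i = 0; i < x.length; i++) {
            double e = y[i] - (a + b * x[i]);
            System.out.printf("x=%.0f y=%.1f fitted=%.2f residual=%.2f%n",
                    x[i], y[i], a + b * x[i], e);
        }
        System.out.printf("a=%.3f b=%.3f%n", a, b);
    }
}
```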
THE REGRESSION MODEL DERIVED
[Scatter plot of y against x with the fitted line y = a + bx + e, where a is the intercept and e the error term]
LOGISTIC REGRESSION
Why this regression?
1. Predictive analysis of a dichotomous dependent variable.
• E.g. in our case we are building a model that predicts whether someone is
intoxicated or not, i.e. what do factors like violating traffic rules, age band
and time band tell us about the probability that a person is intoxicated
when they are stopped by police.
2. We can discover additional trends in the data, such as how each predictor
affects the dependent variable, without having to run other tests.
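A minimal sketch of fitting the logistic model with WEKA and evaluating it under 10-fold cross-validation, again assuming the pre-processed ARFF described earlier:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LogisticExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("breath_tests_preprocessed.arff");
        data.setClassIndex(data.attribute("Intoxicated").index());

        Logistic logistic = new Logistic();
        logistic.buildClassifier(data);
        System.out.println(logistic);  // prints coefficients and odds ratios per attribute value

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new Logistic(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
    }
}
```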
RESULTS AND EVALUATION – REGRESSION MODEL

Confusion matrix and detailed accuracy by class (rows: actual, columns: classified/predicted):

Actual          No       Yes     Precision   Recall   F-Measure   ROC Area
No              285403   773     0.885       0.997    0.938       0.726
Yes             36923    456     0.371       0.012    0.024       0.726
Weighted Avg.                    0.826       0.883    0.832       0.726

Correctly Classified Instances      285859     88.3494 %
Incorrectly Classified Instances    37696      11.6506 %
Mean absolute error                 0.1886
Root mean squared error             0.3075
Relative absolute error             92.2893 %
Root relative squared error         96.1835 %
Total Number of Instances           323555
Attribute Coefficients Odds
Reason=Suspicion of Alcohol -0.486 0.6151
Reason=Moving Traffic Violation 0.5511 1.7352
TimeBand=12am-4am -0.849 0.4278
TimeBand=4am-8am -0.6143 0.541
TimeBand=8am-12pm 0.6492 1.914
AgeBand=16-19 0.1976 1.2184
AgeBand=25-29 -0.2082 0.8121
AgeBand=70-98 0.86 2.3632
Gender=Male -0.1297 0.8784
Gender=Female 0.1268 1.1352
Intercept 2.3189
From log(p / (1 − p)) = a + bX:

log(P(Intoxicated) / (1 − P(Intoxicated)))
    = 2.3189 − 0.486*(Sus_Alc) + 0.5511*(Mov_Traf) − 0.849*(TimeBand=12am-4am) + …
Regression equation predicting whether someone is intoxicated or not.
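A short sketch of turning the fitted log-odds into a probability for one hypothetical driver, using coefficients from the table above (which dummy variables are switched on is illustrative; which class the resulting probability refers to depends on the class ordering in WEKA's actual output). It also shows that the Odds column is simply exp(coefficient):

```java
public class LogOddsToProbability {
    public static void main(String[] args) {
        // The Odds column above is exp(coefficient), e.g. exp(-0.486) ≈ 0.6151.
        System.out.printf("exp(-0.486) = %.4f%n", Math.exp(-0.486));
        System.out.printf("exp(0.5511) = %.4f%n", Math.exp(0.5511));

        // Hypothetical driver: stopped on suspicion of alcohol, between 12am and 4am,
        // aged 25-29, male. Each active dummy variable contributes its coefficient.
        double z = 2.3189          // intercept
                 + (-0.486)  * 1   // Reason=Suspicion of Alcohol
                 + (-0.849)  * 1   // TimeBand=12am-4am
                 + (-0.2082) * 1   // AgeBand=25-29
                 + (-0.1297) * 1;  // Gender=Male

        // Logistic link: p = 1 / (1 + e^(-z)) converts the log-odds z into a probability.
        double p = 1.0 / (1.0 + Math.exp(-z));
        System.out.printf("log-odds z = %.4f, probability = %.3f%n", z, p);
    }
}
```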
CONCLUSION
CONCLUSION/WHAT WE DISCOVERED
• Four “optimal” attributes to use in classification
• J48 – Performs well but not practical
• JRip – Most accurate (Not by much) but needs
tweaking
• Regression – “Best” of the 3
CONCLUSION / RECOMMENDATION
• Test the dataset for further assumptions – normality,
multi-collinearity and homoscedasticity.
• Transform the dataset to minimize the errors
caused by the imbalanced number of cases
belonging to the majority class (No – no intoxication).
• Explore further experiments that include other factors
which are potential predictors of intoxication, e.g.
Offences (how offensively a person behaves when asked
to pull over by police).
Editor's notes

• Grow one rule by greedily adding antecedents (conditions) to the rule until the rule is perfect (i.e. 100% accurate). The procedure tries every possible value of each attribute and selects the condition with the highest information gain: p(log(p/t) − log(P/T)). Then incrementally prune each rule, allowing the pruning of any final sequence of antecedents.