
Driving Behaviour as a Telematic Fingerprint

1,147 views


The objective of my final project at Metis is to categorize drivers based on their behaviour on the road: their driving style and the type of roads that they follow.
The associated challenge is to uniquely identify a driver (and hence his or her own “driving behaviour”) from the GPS log of a mobile phone located inside the car.
My idea for solving this is to experiment with topic modeling techniques, especially Latent Semantic Indexing/Analysis (LSI/LSA) and Latent Dirichlet Allocation (LDA), and to explain the observed trips by the unobserved behaviour of drivers.

Published in: Data & Analytics

Driving Behaviour as a Telematic Fingerprint

  1. NASDAG.org: Data Science in the Automotive Industry
  2. Philippe Dagher
     I am an Automotive Management Professional and a Computer Science Engineer from France, with extensive experience in managing complex projects in Supply Chain and IT, as well as starting, developing and acquiring businesses in France, Russia, the USA and the Middle East. I came to Metis to understand, learn and practice how data science is transforming the automotive business. During my projects, I focused on:
     ● Sentiment Analysis / Topic Modeling
     ● Predictive Behavior Modeling
     ● Driver Telematics
  3. Final Project @ Metis
     Objective: Categorize drivers based on their behaviour on the road: their driving style and the type of roads that they follow.
     Challenge: Uniquely identify a driver (and hence his or her own “driving behaviour”) from the GPS log of a mobile phone located inside the car.
     Idea: Experiment with topic modeling techniques, especially Latent Semantic Indexing/Analysis (LSI/LSA) and Latent Dirichlet Allocation (LDA), to explain the observed trips by the unobserved behaviour of drivers.
  4. Raw data for one trip
  5. Machine learning approach (1/2)
     ❖ Preprocess the data using statistical smoothing and compression algorithms
        ➢ Kalman Filtering
        ➢ Ramer–Douglas–Peucker
     ❖ Extract road and driving style features
        ➢ per Segment: Length, Slip Angle, Convexity, Radius
        ➢ per Meter: Speed, Accelerations (tangential and normal), Jerk, Yaw, Pauses
     ❖ Bin the output and generate the Driving Alphabet
        ➢ ex: d0, d1, d2… v0, v1, v2… a0, a1, a2… etc
     ❖ Build the Driving Vocabulary - “Driving Slides” per meter
        ➢ ex: d3L4v2n3y1
        ➢ for various preprocessing sensitivities or feature combinations (languages)
     ❖ Translate trips from GPS log into documents
        ➢ Tokenize, filter, … data is ready!
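The compression step named above, Ramer–Douglas–Peucker, can be sketched in a few lines. This is a minimal illustration, not the code used in the project: the function names, the tolerance `epsilon`, and the choice of measuring distance to the chord segment are all assumptions made for the example.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all 2-D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate chord: both endpoints coincide.
        return math.hypot(px - ax, py - ay)
    # Projection of p onto the chord, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def rdp(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        # Every intermediate point is close enough: keep only the endpoints.
        return [first, last]
    # Otherwise split at the farthest point and simplify both halves.
    left = rdp(points[: idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right


# Hypothetical GPS trace in metres: a straight stretch followed by a right-angle turn.
trace = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
simplified = rdp(trace, epsilon=0.1)  # -> [(0, 0), (2, 0), (2, 2)]
```

Each pair of consecutive points in the simplified trace is then a candidate "segment" for the per-segment features listed above.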
  6. Example of a translated trip
     d1L6Br1 d1L8Sr1 d1L5Sr2 d1L6Ur2 d2L8Ur2 d3L4Sr3 d2L5Ur3 d3L4Ur4 d3L6Sr4 d3L7Sr3 d4L4Ur5 d4L3Ur5 d4L2Ur7 d5L4Sr6 d3L3Ur5 d4L3Sr6 d5L4Ur6 d4L3Ur7 d5L9Sr5 d2L5Ur4 d3L2Ur7 d6L1Sr9 d5L0Sr9 d5L1Sr9 d5L7Ur5 d2L6Ur2 d2L3Ur5 d4L1Ur8 d5L2Ur7 d6L10Sr5 d6L8Sr5 d2L4Ur3 d3L3Ur6 d5L4Srp1
     v2a6n0j0y0p1 v1a6n0j3y0p1 v1a1n0j6y0p1 v1a11n0j6y0p1 v1a7n0j11y0p1 v1a16n0j7y0p1 v2a7n0j1y0p1 v2a6n0j2y0p1 v2a10n0j2y0p1 v3a6n1j3y0p1 v3a2n2j3y0p1 v3a5n2j3y0p1 v4a2n2j3y1p1 v4a5n2j5y1p1 v4a5n3j5y1p1 v4a4n3j1y1p1 v4a6n3j6y1p1 v4a5n4j5y1p1 v4a4n3j6y1p1 v4a5n4j0y1p1 v4a5n3j6y1p1 v4a5n2j9y1p1 v4a11n3j7y1p1 v3a2n2j7y0p1 v3a12n2j7y0p1 v2a1n1j3y0p1 v2a5n1j9y0p1 v2a11n1j9y0p1 v3a6n1j7y0p1 v3a5n1j7y0p1 v3a6n2j6y0p1 v3a6n1j34y0p1 v3a62n2j71y0p1 v8a56n11j38y2p1 v4a13n3j7y1p1 v4a4n3j4y1p1 v4a5n3j6y1p1 v4a4n2j6y1p1 v4a6n3j1y1p1 v3a5n2j2y0p1 v3a3n2j6y0p1 v3a11n1j4y0p1 v2a8n1j0y0p1 v2a7n1j7y0p1 v2a17n1j1y0p1 v2a10p1 v6a0n3j4y0p1 v6a6n3j7y0p1 v6a6n3j3y0p1 v6a1n3j3y0p1 v6a6n3j3y0p1 v6a5n2j1y0p1 v5a6n2j4y0p1 v5a6n2j3y0p1 v5a12n1j2y0p1 v4a9n1j0y0p1 v3a9n1j2y0p1 v3a5n0j3y0p1 v3a1n0j6y0p1 v3a11n0j6y0p1 v3a0n1j3y0p1 v3a6n1j0y0p1 v3a5n1j3y0p1 v3a11n0j6y0p1 v4a1n0j4y0p1 v4a6n0j3y0p1 v4a2n0j7y0p1 v4a13n0j11y0p1 v5a7n0j4y0p1 v5a1n0j0y0p1 v5a1n0j3y0p1 v5a6n0j6y0p1 v5a6n0j2y0p1 v5a2n0j7y0p1 v6a11n0j10y0p1 v6a6n0j3y0p1 v6a0n0j3y0p1 v6a5n0j6y0p1 v6a5n0j2y0p1 v6a1n0j1y0p1 v6a0n0j3y0p1 v6a6n0j7y0p1 v6a6n0j7y0p1 v6a6n0j7y0p1 v6a6n0j3y0p1 v6a0n0j2y0p1 v6a5n0j6y0p1 v6a5n0j7y0p1 v6a6n0j4y0p1 v6a0n1j3y1j3y0p1 v6a6n1j6y0p1 v6a5n1j2y0p1 v7a1n1j4y0p1 v5a3n1j1y0p1 v5a6n1j3y0p1 v5a10n1j3y0p1 v4a8n0j0y0p1 v3a8n0j0y0p1 v3a8n0j3y0p1 v2a10n0j1y0p1 v2a7n0j3y0p1 v2a6n0j7y0p1 v3a7n0j3y0p1 v2a7n0j6y0p1 v3a14n0j7y0p1 v3a4n0j4y0p1 v3a2n0j6y0p1 v3a12n0j3y0p1 v3a8n0j2y0p1 v3a5n0j0y0p1 v3a6n0j4y0p1 v4a1n0j3y0p1 v4a5n0j2y0p1 v4a1n0j0y0p1 v4a0n0j0y0p1 v4a0n0j0y0p1 v4a0n0j0y0p2 v4a1n0j3y0p1 v4a6n0j7y0p1 v4a6n0j10y0p1 v4a11n0j6y0p1 v3a2n0j0y0p1 v3a1n0j3y0p1 v3a6n0j0y0p1 v3a6n0j0y0p1 v2a5n0j2y0p1 v2a3n0j5y0p1 v2a10n0j5y0p1 v1a2n0j0y0p1 v1a1n0j3y0p1 v1a5n0j10y0p1 v1a11n0j7y0p1 v1a3n0j7y0p1 v1a12n0j7y0p1 v2a3n0j1y0p1 v2a1n0j6y0p1 v2a11n0j10y0p1 v3a6n0j10y0p1 v3a12n0j7y0p1 v4a1n0j3y0p1 v4a5n0j10y0p1 v3a11n0j6y0p1 v4a2n0j3y0p1 v4a6n0j3y0p1 v5a0n0j7y0p1 v5a12n0j8y0p1 v5a4n0j4y0p1 v5a2n3j3y0p1 v5a3n3j4y0p1 v5a6n3j7y0p1 v5a6n3j5y0p1 v5a4n3j2y0p1 v5a1n3j3y0p1 v5a6n3j2y0p1 v5a1n2j4y0p1 v5a6n2j3y0p1 v5a2n3j4y0p1 v5a6n3j2y0p1 v5a6n2j3y0p1 v4a0n2j1y0p1 v4a2n2j1y0p1 v4a0n2j4y0p1 v4a6n2j7y0p1 v5a6n2j4y0p1 v4a5n2j0y0p1 v4a5n2j2y0p1 v4a9n2j2y0p1 v5a5n2j3y0p1 v5a9n3j1y0p1 v5a9n3j1y0p1 v5a7n1j2y0p1
     d6L1v5n0y0 d6L1v4n0y0 d6L1v4n0y0 d6L1v5n0y0 d6L1v4n0y0 d6L1v4n0y0 d5L0v4n0y0 d5L0v4n0y0 d5L0v5n0y0 d5L0v4n0y0 d5L0v4n0y0 d5L0v4n0y0 d5L0v3n0y0 d5L0v3n0y0 d5L0v2n0y0 d5L0v2n0y0 d5L0v2n0y0 d5L0v2n0y0 d5L0v3n0y0 d5L0v2n0y0 d5L0v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1v3n0y0 d5L1vy1 d5L7v4n4y1 d5L7v4n3y1 d5L7v0n0y0 d5L7v0n0y0 d5L7v0n0y0 d5L7v1n0y0 d2L6v1n6y5 d2L6v2n8y6 d2L3v2n0y0 d2L3v2n0y0 d4L1v3n0y0 d4L1v3n0y0 d4L1v3n0y0 d4L1v4n0y0 d4L1v4n0y0 d4L1v4n0y0 d4L1v4n0y0 d4L1v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v5n0y0 d5L2v5n0y0 d5L2v5n0y0 d5L2v5n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v4n0y0 d5L2v5n0y0 d5L2v4n0y0 d5L2v5n0y0 d5L2v4n0y0 d d6L10v3n2y0 d6L10v4n2y0 d6L10v3n1y0 d6L10v3n1y0 d6L10v2n1y0 d6L10v2n1y0 d6L10v1n0y0 d6L10v2n0y0 d6L10v1n0y0 d6L10v1n0y0 d6L10v2n0y0 d6L10v1n0y0 d6L10v1n0y0 d6L10v1n0y0 d6L8v1n0y0 d6L8v1n0y0 d6L8v2n0y0 d6L8v2n0y0 d6L8v2n0y0 d6L8v3n1y0 d6L8v3n2y0 d6L8v3n2y0 d6L8v4n2y1 d6L8v4n2y1 d6L8v4n3y1 d6L8v4n3y1 d6L8v4n3y1 d6L8v4n4y1 d6L8v4n3y1 d6L8v4n4y1 d6L8v4n3y1 d6L8v4n2y1 d6L8v4n3y1 d6L8v3n2y0 d6L8v3n2y0 d6L8v2n1y0 d6L8v2n1y0 d6L8v2n1y0 d6L8v3n1y0 d2L5v1n3y2 d2L5v1n2y2 d3L5v1n2y1 d3L5v2n3y2 d3L5v2n4y2 d3L5v2n6y3 d3L5v2n2y1 d3L5v2n2y1 d3L5v3n4y2 d4L6v2n5y3 d4L6v2n6y3 d4L6v3n8y3 d4L6v3n7y3 d4L6v3n7y3 d4L6v2n6y3 d4L6v2n4y2 d4L6v2n3y2 d2L6v1n12y11 d2L6v1n10y10 d1L1v1n0y0 d3L3v1n1y1 d3L3v1n1y0 d3L3v1n0y0 d3L3v1n0y0 d3L3v1n0y0 d2L8v0n3y6
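Words like those in the translated trip come from binning continuous features and concatenating the bin labels. The sketch below shows that idea for two features only, speed (`v`) and the magnitude of tangential acceleration (`a`). It is a simplified illustration under stated assumptions: the bin edges are hypothetical, the input is assumed to be planar positions in metres sampled at a fixed interval, and words are produced per sample rather than per metre as in the slides.

```python
import math
from bisect import bisect_right

# Hypothetical bin edges; the actual edges used in the project are not given.
SPEED_EDGES = [2.0, 5.0, 10.0, 15.0, 20.0, 30.0]  # m/s
ACCEL_EDGES = [0.5, 1.0, 2.0, 3.0]                # m/s^2 (magnitude)


def trip_to_words(xy, dt=1.0):
    """Turn a list of (x, y) positions into words such as 'v2a4'."""
    # Speed between consecutive samples.
    speeds = [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(xy, xy[1:])
    ]
    words = []
    for s0, s1 in zip(speeds, speeds[1:]):
        accel = abs(s1 - s0) / dt
        v_bin = bisect_right(SPEED_EDGES, s1)    # index of the speed bin
        a_bin = bisect_right(ACCEL_EDGES, accel)  # index of the acceleration bin
        words.append(f"v{v_bin}a{a_bin}")
    return words


# Made-up positions: constant speed, then a sharp acceleration.
words = trip_to_words([(0, 0), (3, 0), (6, 0), (12, 0)])  # -> ['v1a0', 'v2a4']
```

Changing the bin edges or the set of concatenated features gives a different "language" for the same trip, which is what the slides call preprocessing sensitivities.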
  7. LDA: Bayesian Topic Model
     ❖ Per trip “Driving Behaviour” proportions: for each trip, select a distribution of “Driving Behaviours”
     ❖ Dirichlet parameter (Corpus): possible “Driving Behaviour” distributions for trips
     ❖ Per “Driving Slide” “Driving Behaviour” assignment: for each “Driving Slide”, select a “Driving Behaviour”
     ❖ Observed “Driving Slide”: select the actual “Driving Slide” from the selected “Driving Behaviour”
     ❖ “Driving Behaviours”: each “Driving Behaviour” is a distribution of “Driving Slides”
     ❖ “Driving Behaviour” hyperparameter: possible “Driving Slide” distributions for “Driving Behaviours”
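The generative story on this slide matches the standard LDA model. Written out in the usual textbook notation (the symbols are assumptions, since the slide's diagram is not preserved in the transcript): $\alpha$ is the Dirichlet parameter, $\eta$ the “Driving Behaviour” hyperparameter, $\theta_d$ the behaviour proportions of trip $d$, $\beta_k$ the distribution of behaviour $k$ over “Driving Slides”, $z_{d,n}$ the behaviour assigned to the $n$-th slide of trip $d$, and $w_{d,n}$ the observed slide.

```latex
\begin{align*}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Categorical}(\beta_{z_{d,n}})
\end{align*}
```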
  8. Posterior Inference in LDA
     ❖ Goal is to obtain this posterior:
        ➢ How much of “Driving Behaviour” k each trip contains, and
        ➢ the “Driving Behaviour” assignments z of the “Driving Slides”
     ❖ Which means that I need to calculate the posterior distribution of these hidden quantities given the observed “Driving Slides”
     ❖ GENSIM Library
        ➢ a Python+NumPy implementation of online LDA for inputs larger than the available RAM
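The formula on this slide is not preserved in the transcript. In standard LDA notation (an assumption here: $\theta$ for per-trip proportions, $z$ for assignments, $\beta$ for the behaviours, $w$ for the observed “Driving Slides”, $\alpha$ and $\eta$ for the Dirichlet hyperparameters), the posterior is usually written as:

```latex
\[
p(\theta, z, \beta \mid w, \alpha, \eta)
  = \frac{p(\theta, z, \beta, w \mid \alpha, \eta)}{p(w \mid \alpha, \eta)},
\qquad
p(w \mid \alpha, \eta)
  = \int\!\!\int \sum_{z} p(\theta, z, \beta, w \mid \alpha, \eta)\, d\theta\, d\beta .
\]
```

The denominator cannot be computed exactly for realistic corpora, which is why approximate inference is used in practice; Gensim's LDA relies on online variational Bayes.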
  9. Example trip in the new LDA space
  10. Validation on a Kaggle Competition
      ❖ 2,736 drivers
      ❖ 200 trips/driver
      Total: 547,200 CSV files (5.92 GB)
      Challenge: To come up with a "telematic fingerprint" capable of distinguishing whether a trip was driven by a given driver, knowing that among the 200 trips provided for each driver, a small number were not driven by him/her. Submissions are judged on the area under the ROC curve, calculated globally (all predictions together).
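To make the "global" scoring concrete, here is a minimal pure-Python sketch of the area under the ROC curve, computed as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half. The function name and the sample numbers are made up for illustration. The point is that predictions from all drivers are pooled before scoring, so scores must be comparable across drivers.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up predictions for two drivers: (label, score) per trip.
per_driver = {
    "driver_a": [(1, 0.8), (0, 0.1)],
    "driver_b": [(1, 0.35), (0, 0.4)],
}

# Global scoring: pool every prediction, then compute a single AUC.
pooled = [pair for trips in per_driver.values() for pair in trips]
labels = [y for y, _ in pooled]
scores = [s for _, s in pooled]
global_auc = roc_auc(labels, scores)  # -> 0.75
```

The quadratic pair loop is only suitable for small inputs; a rank-based formulation gives the same value on large prediction sets.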
  11. Machine learning approach (2/2)
      ❖ Transpose all trips into the new Driving Behaviours Space
      ❖ Take each trip of a selected Driver, one by one
      ❖ Build a prediction model trained on all other trips in the dataset:
         ➢ True if they belong to the selected Driver
         ➢ False if they do not belong to this Driver
      ❖ Use the trained model to predict whether the selected Trip belongs to the Driver, then ensemble several predictions obtained with various sensitivities to improve the score...
      For performance reasons, I proceed in batches of 10 or 20 selected trips and compare each batch to a limited, randomly selected number of False trips.
      Other outlier detection / clustering techniques appear to perform less well.
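The batching scheme described on this slide can be sketched as follows. This only builds the labelled training sets; the classifier itself is not named in the slides, so none is assumed here. The function names, the batch size and the number of False trips are hypothetical.

```python
import random


def make_batches(n_trips, batch_size):
    """Split trip indices 0..n_trips-1 into consecutive held-out batches."""
    idx = list(range(n_trips))
    return [idx[i:i + batch_size] for i in range(0, n_trips, batch_size)]


def build_training_set(driver_trips, other_trips, held_out, n_false, rng):
    """Label the driver's remaining trips True (1) and sampled foreign trips False (0)."""
    held = set(held_out)
    positives = [t for i, t in enumerate(driver_trips) if i not in held]
    negatives = rng.sample(other_trips, min(n_false, len(other_trips)))
    features = positives + negatives
    labels = [1] * len(positives) + [0] * len(negatives)
    return features, labels


# Made-up trip vectors in a 2-behaviour space.
rng = random.Random(0)
driver_trips = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.7, 0.3]]
other_trips = [[0.2, 0.8], [0.1, 0.9], [0.3, 0.7], [0.4, 0.6], [0.25, 0.75]]

for held_out in make_batches(len(driver_trips), batch_size=2):
    features, labels = build_training_set(
        driver_trips, other_trips, held_out, n_false=3, rng=rng
    )
    # A classifier of choice would be fit on (features, labels) here
    # and then used to score the held-out trips of this batch.
```

Holding out a batch rather than a single trip reduces the number of models to train per driver from 200 to 10 or 20, at the cost of slightly less training data per model.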
  12. Setting the infrastructure
      MongoDB to hold the 3.3 MM documents generated
      Parallel processing setup on 4 DigitalOcean Droplets with 8 CPUs each
      Gensim library, which implements three methods:
      ❖ latent semantic indexing (LSI, or LSA - A for Analysis)
      ❖ latent Dirichlet allocation (LDA)
      ❖ random projections (RP)
      It also implements online versions of each technique.
  13. Predicting
      ❖ Achieving an AUC of 0.9 on Kaggle without any ensembling technique, which confirms the robustness of my approach...
  14. Thank you
      http://nasdag.org
