Using machine learning to try
and predict taxi availability
Hari Allamraju

https://github.com/hari-allamraju
A few things …
What is this talk about?
• Examples of machine learning with Python, using taxi
availability as a real-world problem

• The talk will briefly cover the data sources, fetching the data and
setting up the schema

• Multiple algorithms will be explored to show how the results vary
and improve

• Assumes knowledge of Python and a general idea about working
with data in Python

• We will walk through the code and prediction results to understand
how the models work
What we will not do
• We will not dive into the mathematics behind all the
algorithms used - it is not necessary to know all the math
to be able to implement simple machine learning

• We will not go into any neural networks or deep learning
approaches - that would make this talk too complicated

• There are no examples of running the analysis on cloud
machine learning platforms - this keeps it simple enough
to run locally and later scale to any platform as you learn
and add more
Let’s get started!
What is machine learning?
Machine learning is the subfield of computer science that, according to Arthur
Samuel, gives "computers the ability to learn without being explicitly
programmed." Samuel, an American pioneer in the field of computer gaming
and artificial intelligence, coined the term "machine learning" in 1959 while at
IBM.
Evolved from the study of pattern recognition and computational learning
theory in artificial intelligence, machine learning explores the study and
construction of algorithms that can learn from and make predictions on data –
such algorithms overcome following strictly static program instructions by
making data-driven predictions or decisions, through building a model from
sample inputs.
Quoting Wikipedia
It’s already part of our
daily life
Wasn’t Genisys supposed to go live
sometime in 2017?
More seriously…
• We see machine learning in action whenever

• We benefit from spam detection in email

• We see targeted ads in social media and e-commerce

• We experience surge pricing or receive coupons in ride
sharing apps

• Our bank flags a suspicious credit card transaction

• We use any of the many other data-driven user experiences
Doing your own machine
learning
• Machine learning sounds very complex and daunting, but with some
mathematical background or a good technical background it is fairly easy
to get started

• You don't need large computing power if you are dealing with small data
sets

• You don't need to be a mathematical genius and come up with new
algorithms

• And you don't need a lot of complicated data either if you are not trying
to save the world

• There are tools which can give you a fairly good experience with the
basics and get you prepped to take on bigger problems later
Tools in Python
• scikit-learn is probably the simplest Python library which
you can use to get started with machine learning

• Along with numpy and matplotlib you can easily analyse
and visualise the data and the algorithms

• This talk is based on analysis done using scikit-learn
But how do we get the
data?
• Data is the key for any experiment with machine learning and
although you don't need a ton of it, you need a good enough
sample to get started

• Fortunately there are a lot of open data sets that you can use,
like the Divvy Bike Share data from Chicago or other data sets
shared by various private and government agencies.

• You could even work with a local organization which is looking to
provide data driven user experiences or make the best use of
data that they might have collected as part of their operations

• But for this talk we will use something closer to home - the
Singapore Taxi Availability data
The idea for this talk
• The idea for this talk, and for the analysis behind it, came from a
talk at the Singapore Python user group by Li Haoyi on
open data in Singapore - 

https://github.com/lihaoyi/opendata

• The talk and the above GitHub repo give a very good intro to
the open data APIs, and specifically to the taxi data available
in Singapore, which forms the basis for today's session

• Using Singapore data to try machine learning made more
sense because we are familiar with how people move around here
and with other local context
What data do we have for
Singapore?
• Taxi availability is a very real-world problem for which we wish
there was an accurate answer, so it's a good candidate for
machine learning

• The Singapore Land Transport Authority (LTA) provides an API which
we can invoke to get the locations of all free taxis in Singapore,
as a list of coordinates at any given time

• This near-real-time data (delayed by 30 seconds or so) is pretty
useful and at the same time not very complicated, so it is well
suited to an exploration with machine learning

• The API is well documented on their website - 

https://data.gov.sg/dataset/taxi-availability
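
As a rough sketch of what one API call could look like, the snippet below downloads a snapshot and pulls out the taxi coordinates. The endpoint URL and the GeoJSON-style response shape (a list of [longitude, latitude] pairs under each feature's geometry) are assumptions made for illustration, so check the data.gov.sg documentation before relying on them. The function names are hypothetical and are not taken from the talk's repository.

```python
import json
import urllib.request

# Assumed endpoint; verify against the data.gov.sg documentation.
TAXI_URL = "https://api.data.gov.sg/v1/transport/taxi-availability"


def fetch_snapshot(url=TAXI_URL, timeout=30):
    """Download one snapshot of free-taxi locations as a parsed JSON dict."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def extract_coordinates(snapshot):
    """Return a list of (longitude, latitude) tuples from the snapshot.

    Assumes a GeoJSON-style FeatureCollection where each feature holds
    a list of [longitude, latitude] pairs under geometry.coordinates.
    """
    coords = []
    for feature in snapshot.get("features", []):
        geometry = feature.get("geometry", {})
        for lon, lat in geometry.get("coordinates", []):
            coords.append((float(lon), float(lat)))
    return coords


# Offline example: a hand-made payload in the assumed shape,
# so the parsing step can be tried without network access.
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "MultiPoint",
                "coordinates": [[103.85, 1.29], [103.70, 1.35]],
            },
        }
    ],
}
print(extract_coordinates(sample))
```

To collect real data, fetch_snapshot() would be called on a schedule (for example from cron) and each response written to a file.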
Taxi availability scatter plot
Data for a period of 20 mins
Preparing the data
• The data returned from the API is JSON and was stored in flat files

• Data was collected for a period of one week, making the API call once
every 5 minutes

• The data was then parsed into a grid of approximately 20x35, grouping
the taxi coordinates into cells with the taxi count for each small area

• This data was then loaded into an SQLite DB with a simple schema that
made it easy to load into pandas data frames and eventually use in
scikit-learn

• The raw data was about 500 MB in files and about 100 MB once loaded
into the database in grid format
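
The gridding and loading steps above can be sketched roughly as follows. This is not the code from the talk's repository: the bounding box, the choice of 20 rows by 35 columns, and the table and column names are all assumptions made for illustration. It uses only the standard library so that it runs as-is.

```python
import sqlite3

# Assumed bounding box for Singapore and assumed grid orientation;
# the talk only says the grid was "approximately 20x35".
LAT_MIN, LAT_MAX = 1.20, 1.48
LON_MIN, LON_MAX = 103.60, 104.05
N_ROWS, N_COLS = 20, 35


def to_cell(lon, lat):
    """Map a coordinate to a (row, col) grid cell, or None if out of bounds."""
    if not (LAT_MIN <= lat < LAT_MAX and LON_MIN <= lon < LON_MAX):
        return None
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * N_ROWS)
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * N_COLS)
    return row, col


def count_by_cell(coords):
    """Count taxis per grid cell for one snapshot of (lon, lat) pairs."""
    counts = {}
    for lon, lat in coords:
        cell = to_cell(lon, lat)
        if cell is not None:
            counts[cell] = counts.get(cell, 0) + 1
    return counts


def store_counts(conn, timestamp, counts):
    """Write one snapshot's cell counts to a hypothetical taxi_counts table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS taxi_counts ("
        "ts TEXT, grid_row INTEGER, grid_col INTEGER, taxis INTEGER)"
    )
    conn.executemany(
        "INSERT INTO taxi_counts VALUES (?, ?, ?, ?)",
        [(timestamp, r, c, n) for (r, c), n in sorted(counts.items())],
    )
    conn.commit()


# Made-up coordinates: two taxis in one cell, one elsewhere, one out of bounds.
coords = [(103.85, 1.29), (103.851, 1.291), (103.70, 1.35), (0.0, 0.0)]
counts = count_by_cell(coords)

conn = sqlite3.connect(":memory:")
store_counts(conn, "2017-01-01T08:00:00", counts)
rows = conn.execute(
    "SELECT grid_row, grid_col, taxis FROM taxi_counts "
    "ORDER BY grid_row, grid_col"
).fetchall()
print(rows)
```

From a table like this, pandas.read_sql_query("SELECT * FROM taxi_counts", conn) would give a data frame ready for scikit-learn.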
Links to the material
• The code used to fetch the data, load to the DB and the
various utility methods we will see can be found here -
https://github.com/hari-allamraju/sg-taxidata

• There are some utility scripts that can be put into cron
straightaway

• There is a setup.py, so it can be installed from GitHub

• The notebooks we will walk through now can be found at
- https://github.com/hari-allamraju/pycon-talk-taxidata
Enough slides, let's
see the code
What did we learn from the
analysis?
• It's easy to get started with basic machine learning and
predictions with small data sets

• There is no one golden algorithm; the best choice varies with the
data sets and subsets used

• The accuracy depends largely on the data that we are able
to get and the variables we are able to capture in the data

• To improve further we would need other related information, such as
the weather, whether there was an MRT delay that day, or whether it
was a peak holiday season
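
To illustrate the "no one golden algorithm" point, here is a minimal sketch that scores three scikit-learn regressors on the same data with cross-validation. The data is synthetic: the hour-of-day pattern, the weekend effect and the noise level are invented for this example and are not the Singapore taxi data or the results shown in the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for "taxi count in one grid cell" (an assumption).
rng = np.random.default_rng(0)
n = 500
hour = rng.integers(0, 24, size=n)
weekday = rng.integers(0, 7, size=n)
taxis = (
    40
    - 15 * np.sin(hour / 24 * 2 * np.pi)  # made-up daily cycle
    - 5 * (weekday >= 5)                  # made-up weekend dip
    + rng.normal(0, 3, size=n)            # noise
)
X = np.column_stack([hour, weekday])

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 folds for each model on the same features.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, taxis, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.2f}")
```

On this invented data the tree-based models cope with the non-linear daily cycle better than the linear model, but the ranking can change with a different data set or feature set.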
Can we get 100%
accuracy?
Why not?
• Although our simple models will fit the data better as we add more
features or use better algorithms, we might overfit the model to the data
that we have. An overfitted model might not accurately predict future
data; small natural variations in the data that occur over a period of
time will not fit the rigid model and we will see errors

• There will always be a degree of error no matter how many features we
collect - for processes that involve human beings, the general variation in
behaviour creates a different result, and that cannot be measured 100%;
and for fully automated processes there might be edge cases or external
dependencies that cause different results.

• We can reduce the error as we accumulate more data to train the model
and as we learn from the results, thus minimising future errors - deep
learning and other neural network methods build on this idea, using the
prediction error as feedback to adjust the model during training
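
The overfitting point can be seen in a small sketch that compares a shallow decision tree with an unrestricted one on held-out data. The data is synthetic (an invented daily cycle plus noise), so the numbers only illustrate the effect and are not results from the taxi analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: hour of day (continuous) against a noisy taxi count.
rng = np.random.default_rng(1)
X = rng.uniform(0, 24, size=(300, 1))
y = 40 - 15 * np.sin(X[:, 0] / 24 * 2 * np.pi) + rng.normal(0, 5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# max_depth=None lets the tree grow until it memorises the training set.
results = {}
for depth in (3, None):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    results[depth] = (
        model.score(X_train, y_train),  # R^2 on data the model has seen
        model.score(X_test, y_test),    # R^2 on held-out data
    )
    print(f"max_depth={depth}: train R^2={results[depth][0]:.2f}, "
          f"test R^2={results[depth][1]:.2f}")
```

The unrestricted tree scores almost perfectly on the training data because it has memorised the noise, while its score on the held-out data is clearly lower. That gap is the sign of overfitting.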
Questions!
