SlideShare une entreprise Scribd logo
1  sur  27
‘ANATOMY OF A DATA
SCIENCE PROJECT’
ADAM SROKA, SENIOR DATA SCIENTIST
WELCOME TO THE DECEMBER SCOTLAND DATA SCIENCE & TECHNOLOGY MEETUP
RICARDO ANTUNES, DATA SCIENTIST
INCREMENTAL GROUP
- with
&
MBN ACADEMY ARE THE OFFICIAL PARTNERS OF THE
DATA LAB MSC. PLACEMENT PROGRAMME 2018/19.
IF YOU ARE PART OF THE SCOTTISH BUSINESS
COMMUNITY AND INTERESTED IN BECOMING A
POTENTIAL HOST ORGANISATION, PLEASE EMAIL ROB
AT ACADEMY@MBNSOLUTIONS.COM
Anatomy of a Data Science
Project
Adam Sroka, Senior Data Scientist
adam.sroka@incrementalgroup.co.uk
@adzsroka
Why we exist
Digital technology
is changing
everything
Sustainable
success comes
from incremental
improvements
Our mission is to
enable
government and
industry to digitally
transform, one
step at a time
THE DATA SCIENCE
HIERARCHY OF NEEDS
THE DATA SCIENCE
HIERARCHY OF NEEDS
COMPLEXITY
FROM SIMPLICITY
What makes a good project?
A few points to consider before you start
Ask yourself…
1. Will this be easy to deploy & use?
2. Will it be considerably better than existing solutions?
3. Can parts be automated, reproduced, and reused?
4. Is it easy to understand, explain, and test?
5. Is it technically interesting?
Data
Making your raw materials go further
 This is always the
first step
 Build a reusable
set of tools for
measuring quality
 DataExplorer is
great start
Quality
Build pipelines
Models
Mastering the tools of the trade
“Never use a long word
where a short one will do.”
George Orwell
 It’s easy to get excited about the new big thing
 Sometimes seems like expressing intelligence has
taken priority over delivering value
 Marginal gains at the cost of understanding and
interoperability aren’t gains at all
Complexity
 What are other people
doing?
 Are there similar problems
to yours on Kaggle?
 Do you have any biases?
 Take a quick first pass with
everything and review
What works?
Tools
A few things to make your life easier
Templates
 Figure out a template that
works for you – then stick to it
 It makes moving between and
sharing projects tolerable
 https://drivendata.github.io/co
okiecutter-data-science/
 For longer lasting projects,
strongly consider build
automation
 This will manage rebuilding
what’s needed when you
make a change
 Tools like Azure Pipelines,
AWS CodeBuild, Luigi, or
even Makefiles
Build Tools
Containers
 Package your entire workspace
into easily manageable
containers
 Makes reproducibility and sharing
simple
 Many cloud platforms allow
automatic distribution of
containers to clusters
Resources
reddit.com/user/adzsroka/m/d
atascience
datatau.com
kagglenoobs.herokuapp.com
dataelixir.com
getrevue.co/profile/
datamachina/
Thanks!
Adam Sroka, Senior Data Scientist
adam.sroka@incrementalgroup.co.uk
@adzsroka

Contenu connexe

Tendances

Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at ScaleDomino Data Lab
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseFormulatedby
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Domino Data Lab
 
Data Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itData Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itDomino Data Lab
 
Agile Analytics: The Secret to Test, Improve, Fail & Succeed Quickly.
Agile Analytics: The Secret to Test, Improve, Fail & Succeed Quickly.Agile Analytics: The Secret to Test, Improve, Fail & Succeed Quickly.
Agile Analytics: The Secret to Test, Improve, Fail & Succeed Quickly.Venveo
 
H2O World - What you need before doing predictive analysis - Keen.io
H2O World - What you need before doing predictive analysis - Keen.ioH2O World - What you need before doing predictive analysis - Keen.io
H2O World - What you need before doing predictive analysis - Keen.ioSri Ambati
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellSri Ambati
 
Reproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterReproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterDomino Data Lab
 
Putting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPutting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPeculium Crypto
 
Licensed to Analyze? Strata Data NY 2019 IADSS Session - Usama Fayyad, Hamit ...
Licensed to Analyze? Strata Data NY 2019 IADSS Session - Usama Fayyad, Hamit ...Licensed to Analyze? Strata Data NY 2019 IADSS Session - Usama Fayyad, Hamit ...
Licensed to Analyze? Strata Data NY 2019 IADSS Session - Usama Fayyad, Hamit ...IADSS
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprisemark madsen
 
Operational analytics overview
Operational analytics overviewOperational analytics overview
Operational analytics overviewpallavi pentapati
 
1555 track 1 huang_using his mac
1555 track 1 huang_using his mac1555 track 1 huang_using his mac
1555 track 1 huang_using his macRising Media, Inc.
 
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Ali Alkan
 
Supporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationSupporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationDomino Data Lab
 
H2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoH2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoSri Ambati
 
CRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsCRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsMichał Łopuszyński
 

Tendances (20)

Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Leveraged Analytics at Scale
Leveraged Analytics at ScaleLeveraged Analytics at Scale
Leveraged Analytics at Scale
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
 
Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field Managing Data Science | Lessons from the Field
Managing Data Science | Lessons from the Field
 
Data Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using itData Quality Analytics: Understanding what is in your data, before using it
Data Quality Analytics: Understanding what is in your data, before using it
 
Agile Analytics: The Secret to Test, Improve, Fail & Succeed Quickly.
Agile Analytics: The Secret to Test, Improve, Fail & Succeed Quickly.Agile Analytics: The Secret to Test, Improve, Fail & Succeed Quickly.
Agile Analytics: The Secret to Test, Improve, Fail & Succeed Quickly.
 
1645 track 3 porter
1645 track 3 porter1645 track 3 porter
1645 track 3 porter
 
H2O World - What you need before doing predictive analysis - Keen.io
H2O World - What you need before doing predictive analysis - Keen.ioH2O World - What you need before doing predictive analysis - Keen.io
H2O World - What you need before doing predictive analysis - Keen.io
 
Evaluation of big data analysis
Evaluation of big data analysisEvaluation of big data analysis
Evaluation of big data analysis
 
H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
 
Reproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with JupyterReproducible Dashboards and other great things to do with Jupyter
Reproducible Dashboards and other great things to do with Jupyter
 
Putting data science in your business a first utility feedback
Putting data science in your business a first utility feedbackPutting data science in your business a first utility feedback
Putting data science in your business a first utility feedback
 
Licensed to Analyze? Strata Data NY 2019 IADSS Session - Usama Fayyad, Hamit ...
Licensed to Analyze? Strata Data NY 2019 IADSS Session - Usama Fayyad, Hamit ...Licensed to Analyze? Strata Data NY 2019 IADSS Session - Usama Fayyad, Hamit ...
Licensed to Analyze? Strata Data NY 2019 IADSS Session - Usama Fayyad, Hamit ...
 
Operationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the EnterpriseOperationalizing Machine Learning in the Enterprise
Operationalizing Machine Learning in the Enterprise
 
Operational analytics overview
Operational analytics overviewOperational analytics overview
Operational analytics overview
 
1555 track 1 huang_using his mac
1555 track 1 huang_using his mac1555 track 1 huang_using his mac
1555 track 1 huang_using his mac
 
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
Intelligently Automating Machine Learning, Artificial Intelligence, and Data ...
 
Supporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentationSupporting innovation in insurance with randomized experimentation
Supporting innovation in insurance with randomized experimentation
 
H2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing ZhaoH2O World - Advanced Analytics at Macys.com - Daqing Zhao
H2O World - Advanced Analytics at Macys.com - Daqing Zhao
 
CRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining ProjectsCRISP-DM - Agile Approach To Data Mining Projects
CRISP-DM - Agile Approach To Data Mining Projects
 

Similaire à Anatomy of a data science project

Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the tradeFangda Wang
 
Online productivity tools - SILS20090
Online productivity tools - SILS20090Online productivity tools - SILS20090
Online productivity tools - SILS20090is20090
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Betacowork
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric TeamsData Con LA
 
Forms 2 Future - the ongoing journey into the future for Oracle based organiz...
Forms 2 Future - the ongoing journey into the future for Oracle based organiz...Forms 2 Future - the ongoing journey into the future for Oracle based organiz...
Forms 2 Future - the ongoing journey into the future for Oracle based organiz...Lucas Jellema
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists CCG
 
EclipseDay Milano 2017 - How to make Data Science appealing with open source ...
EclipseDay Milano 2017 - How to make Data Science appealing with open source ...EclipseDay Milano 2017 - How to make Data Science appealing with open source ...
EclipseDay Milano 2017 - How to make Data Science appealing with open source ...SpagoWorld
 
Pausefest: Solve your own damn problem
Pausefest: Solve your own damn problemPausefest: Solve your own damn problem
Pausefest: Solve your own damn problemMike Ojo
 
RightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale
 
OpenSistemas Corporate Presentation
OpenSistemas Corporate PresentationOpenSistemas Corporate Presentation
OpenSistemas Corporate PresentationOpenSistemas
 
Digital Transformation: A Case for Modern Workplace
Digital Transformation: A Case for Modern WorkplaceDigital Transformation: A Case for Modern Workplace
Digital Transformation: A Case for Modern WorkplaceSani Garba Consulting
 
Azure - The Best Cloud for Developers
Azure - The Best Cloud for DevelopersAzure - The Best Cloud for Developers
Azure - The Best Cloud for DevelopersInovar Tech
 
Why Should Nonprofits Care About Cloud Computing
Why Should Nonprofits Care About Cloud ComputingWhy Should Nonprofits Care About Cloud Computing
Why Should Nonprofits Care About Cloud ComputingTechSoup Global
 
Maciej Marek (Philip Morris International) - The Tools of The Trade
Maciej Marek (Philip Morris International) - The Tools of The TradeMaciej Marek (Philip Morris International) - The Tools of The Trade
Maciej Marek (Philip Morris International) - The Tools of The TradeCodiax
 
Cloud Computing Webinar
Cloud Computing WebinarCloud Computing Webinar
Cloud Computing WebinarTechSoup
 
Capgemini Ron Tolido - the 3rd Platform and Insurance
Capgemini   Ron Tolido - the 3rd Platform and InsuranceCapgemini   Ron Tolido - the 3rd Platform and Insurance
Capgemini Ron Tolido - the 3rd Platform and InsuranceEDGEteam
 
Where the Warehouse Ends: A New Age of Information Access
Where the Warehouse Ends: A New Age of Information AccessWhere the Warehouse Ends: A New Age of Information Access
Where the Warehouse Ends: A New Age of Information AccessInside Analysis
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...Denodo
 

Similaire à Anatomy of a data science project (20)

Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
 
Online productivity tools - SILS20090
Online productivity tools - SILS20090Online productivity tools - SILS20090
Online productivity tools - SILS20090
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez
 
Enabling Data centric Teams
Enabling Data centric TeamsEnabling Data centric Teams
Enabling Data centric Teams
 
Forms 2 Future - the ongoing journey into the future for Oracle based organiz...
Forms 2 Future - the ongoing journey into the future for Oracle based organiz...Forms 2 Future - the ongoing journey into the future for Oracle based organiz...
Forms 2 Future - the ongoing journey into the future for Oracle based organiz...
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists
 
EclipseDay Milano 2017 - How to make Data Science appealing with open source ...
EclipseDay Milano 2017 - How to make Data Science appealing with open source ...EclipseDay Milano 2017 - How to make Data Science appealing with open source ...
EclipseDay Milano 2017 - How to make Data Science appealing with open source ...
 
A Methodology for Building the Internet of Things
A Methodology for Building the Internet of ThingsA Methodology for Building the Internet of Things
A Methodology for Building the Internet of Things
 
Pausefest: Solve your own damn problem
Pausefest: Solve your own damn problemPausefest: Solve your own damn problem
Pausefest: Solve your own damn problem
 
RightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to CloudRightScale Roadtrip Boston: Accelerate to Cloud
RightScale Roadtrip Boston: Accelerate to Cloud
 
OpenSistemas Corporate Presentation
OpenSistemas Corporate PresentationOpenSistemas Corporate Presentation
OpenSistemas Corporate Presentation
 
Digital Transformation: A Case for Modern Workplace
Digital Transformation: A Case for Modern WorkplaceDigital Transformation: A Case for Modern Workplace
Digital Transformation: A Case for Modern Workplace
 
Azure - The Best Cloud for Developers
Azure - The Best Cloud for DevelopersAzure - The Best Cloud for Developers
Azure - The Best Cloud for Developers
 
Why Should Nonprofits Care About Cloud Computing
Why Should Nonprofits Care About Cloud ComputingWhy Should Nonprofits Care About Cloud Computing
Why Should Nonprofits Care About Cloud Computing
 
The silent project disruptor: Building AI solutions
The silent project disruptor: Building AI solutionsThe silent project disruptor: Building AI solutions
The silent project disruptor: Building AI solutions
 
Maciej Marek (Philip Morris International) - The Tools of The Trade
Maciej Marek (Philip Morris International) - The Tools of The TradeMaciej Marek (Philip Morris International) - The Tools of The Trade
Maciej Marek (Philip Morris International) - The Tools of The Trade
 
Cloud Computing Webinar
Cloud Computing WebinarCloud Computing Webinar
Cloud Computing Webinar
 
Capgemini Ron Tolido - the 3rd Platform and Insurance
Capgemini   Ron Tolido - the 3rd Platform and InsuranceCapgemini   Ron Tolido - the 3rd Platform and Insurance
Capgemini Ron Tolido - the 3rd Platform and Insurance
 
Where the Warehouse Ends: A New Age of Information Access
Where the Warehouse Ends: A New Age of Information AccessWhere the Warehouse Ends: A New Age of Information Access
Where the Warehouse Ends: A New Age of Information Access
 
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
How Data Virtualization Puts Enterprise Machine Learning Programs into Produc...
 

Dernier

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Dernier (20)

Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Anatomy of a data science project

  • 1. ‘ANATOMY OF A DATA SCIENCE PROJECT’ ADAM SROKA, SENIOR DATA SCIENTIST WELCOME TO THE DECEMBER SCOTLAND DATA SCIENCE & TECHNOLOGY MEETUP RICARDO ANTUNES, DATA SCIENTIST INCREMENTAL GROUP - with &
  • 2. MBN ACADEMY ARE THE OFFICIAL PARTNERS OF THE DATA LAB MSC. PLACEMENT PROGRAMME 2018/19. IF YOU ARE PART OF THE SCOTTISH BUSINESS COMMUNITY AND INTERESTED IN BECOMING A POTENTIAL HOST ORGANISATION, PLEASE EMAIL ROB AT ACADEMY@MBNSOLUTIONS.COM
  • 3. Anatomy of a Data Science Project Adam Sroka, Senior Data Scientist adam.sroka@incrementalgroup.co.uk @adzsroka
  • 4. Why we exist Digital technology is changing everything Sustainable success comes from incremental improvements Our mission is to enable government and industry to digitally transform, one step at a time
  • 8. What makes a good project? A few points to consider before you start
  • 9.
  • 10.
  • 11.
  • 12.
  • 13. Ask yourself… 1. Will this be easy to deploy & use? 2. Will it be considerably better than existing solutions? 3. Can parts be automated, reproduced, and reused? 4. Is it easy to understand, explain, and test? 5. Is it technically interesting?
  • 14. Data Making your raw materials go further
  • 15.
  • 16.  This is always the first step  Build a reusable set of tools for measuring quality  DataExplorer is great start Quality
  • 19. “Never use a long word where a short one will do.” George Orwell
  • 20.  It’s easy to get excited about the new big thing  Sometimes seems like expressing intelligence has taken priority over delivering value  Marginal gains at the cost of understanding and interoperability aren’t gains at all Complexity
  • 21.  What are other people doing?  Are there similar problems to yours on Kaggle?  Do you have any biases?  Take a quick first pass with everything and review What works?
  • 22. Tools A few things to make your life easier
  • 23. Templates  Figure out a template that works for you – then stick to it  It makes moving between and sharing projects tolerable  https://drivendata.github.io/co okiecutter-data-science/
  • 24.  For longer lasting projects, strongly consider build automation  This will manage rebuilding what’s needed when you make a change  Tools like Azure Pipelines, AWS CodeBuild, Luigi, or even Makefiles Build Tools
  • 25. Containers  Package your entire workspace into easily manageable containers  Makes reproducibility and sharing simple  Many cloud platforms allow automatic distribution of containers to clusters
  • 27. Thanks! Adam Sroka, Senior Data Scientist adam.sroka@incrementalgroup.co.uk @adzsroka

Notes de l'éditeur

  1. Ease of adoption – awful interfaces
  2. Is this an improvement https://medium.com/@EvanSinar/7-data-visualization-types-you-should-be-using-more-and-how-to-start-4015b5d4adf2
  3. https://aeon.co/essays/is-technology-making-the-world-indecipherable
  4. Reusability https://www.freeimageslive.co.uk/free_stock_image/building-concept-jpg
  5. https://gengo.ai/datasets/the-50-best-free-datasets-for-machine-learning/ https://digital.nhs.uk/ https://registry.opendata.aws/ https://data.europa.eu/euodp/en/data/ https://data.gov.uk/ https://www.data.gov/
  6. https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html
  7. https://icons8.com/icon/2297/gis