Slides for the presentation given at the Data Science Scotland Meetup (https://www.meetup.com/Scotland-Data-Science-Technology-Meetup/events/256269263/).
This talk aimed to give some general advice, tips, and tricks about how to run a successful data science project.
Hosted by:
Incremental Group - https://www.linkedin.com/company/incremental-group/
MBN Solutions - https://www.linkedin.com/company/mbn-recruitment-solutions/
The Data Lab - https://www.linkedin.com/company/the-data-lab-innovation-centre/
Anatomy of a data science project
1. ‘ANATOMY OF A DATA
SCIENCE PROJECT’
ADAM SROKA, SENIOR DATA SCIENTIST
WELCOME TO THE DECEMBER SCOTLAND DATA SCIENCE & TECHNOLOGY MEETUP
RICARDO ANTUNES, DATA SCIENTIST
INCREMENTAL GROUP
2. MBN ACADEMY ARE THE OFFICIAL PARTNERS OF THE
DATA LAB MSC. PLACEMENT PROGRAMME 2018/19.
IF YOU ARE PART OF THE SCOTTISH BUSINESS
COMMUNITY AND INTERESTED IN BECOMING A
POTENTIAL HOST ORGANISATION, PLEASE EMAIL ROB
AT ACADEMY@MBNSOLUTIONS.COM
3. Anatomy of a Data Science
Project
Adam Sroka, Senior Data Scientist
adam.sroka@incrementalgroup.co.uk
@adzsroka
4. Why we exist
Digital technology
is changing
everything
Sustainable
success comes
from incremental
improvements
Our mission is to
enable
government and
industry to digitally
transform, one
step at a time
8. What makes a good project?
A few points to consider before you start
13. Ask yourself…
1. Will this be easy to deploy & use?
2. Will it be considerably better than existing solutions?
3. Can parts be automated, reproduced, and reused?
4. Is it easy to understand, explain, and test?
5. Is it technically interesting?
19. “Never use a long word
where a short one will do.”
George Orwell
20. Complexity
It’s easy to get excited about the new big thing
Sometimes it seems like expressing intelligence has
taken priority over delivering value
Marginal gains at the cost of understanding and
interoperability aren’t gains at all
21. What works?
What are other people
doing?
Are there similar problems
to yours on Kaggle?
Do you have any biases?
Take a quick first pass with
everything and review
23. Templates
Figure out a template that
works for you – then stick to it
It makes moving between and
sharing projects tolerable
https://drivendata.github.io/cookiecutter-data-science/
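A minimal sketch of what a project template might look like in practice, using only the Python standard library. The folder names are an assumption, loosely modelled on the cookiecutter-data-science layout, and `create_project` is a hypothetical helper rather than part of any tool.

```python
from pathlib import Path

# Hypothetical layout, loosely modelled on the cookiecutter-data-science
# template; change the names to whatever works for you, then stick to it.
TEMPLATE_DIRS = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src",
    "models",
    "reports/figures",
]


def create_project(root):
    """Create the template folders under root and return the paths created."""
    root = Path(root)
    created = []
    for rel in TEMPLATE_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        # A .gitkeep file lets otherwise empty folders be committed to git.
        (path / ".gitkeep").touch()
        created.append(path)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n")
    return created
```

Calling `create_project("my-new-project")` gives every project the same starting shape, which is the main point of a template: anyone picking it up knows where the raw data, notebooks, and reports live.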
24. Build Tools
For longer-lasting projects,
strongly consider build
automation
This will manage rebuilding
what’s needed when you
make a change
Tools like Azure Pipelines,
AWS CodeBuild, Luigi, or
even Makefiles
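The core idea behind these tools can be sketched in a few lines of standard-library Python: rebuild an output only when it is missing or older than its inputs. This is an illustration of the Makefile-style timestamp rule, not how Luigi or the cloud services are implemented; `needs_rebuild` and `build` are hypothetical names.

```python
from pathlib import Path


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    target = Path(target)
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(Path(dep).stat().st_mtime > target_mtime for dep in dependencies)


def build(target, dependencies, action):
    """Run action() only when target is out of date; return True if it ran."""
    if needs_rebuild(target, dependencies):
        action()
        return True
    return False
```

For example, `build("data/processed/clean.csv", ["data/raw/input.csv"], clean_data)` would call a (hypothetical) `clean_data` function only after the raw file has changed. Real build tools add dependency graphs, logging, and parallelism on top of this rule.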
25. Containers
Package your entire workspace
into easily manageable
containers
Makes reproducibility and sharing
simple
Many cloud platforms allow
automatic distribution of
containers to clusters