This document discusses common problems with data science projects and provides recommendations for developing an effective data strategy and infrastructure. Key points include:
- 71% of data science projects fail because data scientists are not involved early, strategy and goals are unclear, and data quality is poor.
- Developing a data strategy involves understanding organizational goals and current infrastructure, building a roadmap for platforms and tools, and ensuring collaboration across teams.
- Data quality issues are more common than assumed and make data science work hard to test. Proper instrumentation and high-quality data are critical.
- Data must be treated as a product with dedicated teams, goals, and budgets to drive innovation and success of projects.
5. Why
• Data Scientists aren’t involved from the beginning
• No strategy
• Bad Data: more common than you think and untestable
• Pointlessness
6. Involvement
• Everything stems from this
• Goals need to be attainable
• Data needs to be accessible and formatted correctly
• You can’t conceive of what’s possible (or impossible)
8. Your Data Strategy: Diagnostic
• Diagnostic: How did we get here?
• Understanding history and how your org drives decisions is key
• What will your org’s immune system allow?
• Infrastructure: what is currently in place and how did it happen?
• Goals: How do we drive revenue or KPIs?
9. Your Data Strategy: Roadmapping
• Roadmapping: What are we going to build?
• Data Architecture?
• Platform feasible?
• Who builds what when, for how much?
• How do we ensure a low-latency feedback loop? DS is highly iterative
10. Your Data Strategy: Development
• Platform: What’s our stack?
• Storage: Where does data come from and go to, and what are the latency/throughput requirements on storage?
• Processing: Where do we transform data? Batch? Real-time? Bounds?
• Collaboration: How do we share results, data, and APIs across the org? (always forgotten)
14. Untestable
• Data Scientists spend vast amounts of time fixing data
• …and you need to be OK with that
• Unit Testing doesn’t make sense in science
• Distribution fitting, etc.
• Can only test via simulation: a whole ‘nother process
• “Simple” things take weeks to verify
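The point about testing via simulation can be illustrated with a minimal sketch: generate synthetic data from a distribution whose parameter is known, run the fitting code on it, and check that the fit recovers the parameter. The exponential distribution, the function names, and the 2% tolerance below are assumptions chosen for illustration and do not come from the slides.

```python
import random
import statistics


def fit_exponential_rate(samples):
    """Maximum-likelihood estimate of an exponential rate: 1 / sample mean."""
    return 1.0 / statistics.fmean(samples)


def simulate_recovery(true_rate, n_samples=2000, n_trials=200, seed=0):
    """Draw synthetic datasets with a known rate and fit each one."""
    rng = random.Random(seed)  # fixed seed keeps the check repeatable
    estimates = []
    for _ in range(n_trials):
        samples = [rng.expovariate(true_rate) for _ in range(n_samples)]
        estimates.append(fit_exponential_rate(samples))
    return estimates


def recovers_parameter(true_rate, rel_tol=0.02):
    """True if the average fitted rate is within rel_tol of the known truth."""
    mean_estimate = statistics.fmean(simulate_recovery(true_rate))
    return abs(mean_estimate - true_rate) / true_rate <= rel_tol


if __name__ == "__main__":
    print(recovers_parameter(true_rate=0.5))
```

Unlike a unit test with a fixed expected value, this check is statistical: the tolerance and the number of trials have to be chosen so that a correct fit passes reliably, which is part of why verification takes longer than expected.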
15. Instrumentation
• Can you even verify your instrumentation?
• Are you collecting everything?
• Collecting the right thing?
• What if only 85% of the time?
• Systematically drop at high enough traffic?
• Someone comes into the site through a different channel from an acquisition 2 years ago?
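One way to approach the questions above is to compare what the tracking pipeline recorded against an independent count, such as server-side request logs, and see whether the gap widens at high traffic. The sketch below assumes hourly counts from both sources are available; the `HourlyCounts` type, its field names, and the 95% threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HourlyCounts:
    hour: str
    server_requests: int  # independent reference count (assumed available)
    tracked_events: int   # what the instrumentation actually recorded


def capture_rate(row):
    """Fraction of reference traffic that the instrumentation captured."""
    if row.server_requests == 0:
        return 1.0
    return row.tracked_events / row.server_requests


def flag_lossy_hours(rows, min_rate=0.95):
    """Hours where the capture rate falls below the threshold."""
    return [r.hour for r in rows if capture_rate(r) < min_rate]


def loss_grows_with_traffic(rows, min_rate=0.95):
    """Rough check: are lossy hours busier on average than healthy hours?"""
    lossy = [r.server_requests for r in rows if capture_rate(r) < min_rate]
    healthy = [r.server_requests for r in rows if capture_rate(r) >= min_rate]
    if not lossy or not healthy:
        return False
    return sum(lossy) / len(lossy) > sum(healthy) / len(healthy)
```

A comparison like this only detects loss relative to the reference source; it cannot tell whether the right events are being collected in the first place.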
16. Software is Garbage
• Remember Hadoop?
• Spark?
• MLlib bugs for years
• Wrong math won’t fail unit tests
• GIGO
• JSON, weekly microversioning, schema entropy…
• This is why DS efforts are so slow to start w/o initial involvement
• Don’t build the One True Data Platform
• One of our customers had 30 DBs including a critical out-of-license DB2 box
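The schema entropy mentioned for JSON data can be made visible before any modeling starts. A minimal sketch, assuming newline-delimited JSON objects: reduce each record to its top-level field names and value types, then count how many distinct shapes appear. The function names are hypothetical, and nested structures are deliberately not inspected.

```python
import json
from collections import Counter


def signature(record):
    """Sorted (field, type name) pairs for the top-level fields of a record."""
    return tuple(sorted((key, type(value).__name__) for key, value in record.items()))


def schema_signatures(lines):
    """Count how often each distinct record shape appears in the input."""
    counts = Counter()
    for line in lines:
        counts[signature(json.loads(line))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '{"user_id": 1, "ts": "2020-01-01"}',
        '{"user_id": "1", "ts": "2020-01-02"}',
        '{"user_id": 2, "ts": "2020-01-03", "channel": "email"}',
    ]
    for shape, count in schema_signatures(sample).most_common():
        print(count, shape)
```

More than a handful of distinct signatures for what is nominally one event type is a sign that fields were added, renamed, or retyped over time.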
21. Data Must be Treated like a Product
• Build a Data Products Team
• Engineers, PMs, Design. Data Science. Not just analysts.
• KPIs, Goals, Measurability, Backlogs
• Budget
• Freedom to Innovate
• Staff of diverse backgrounds
A Data Platform will touch every part of your org