Polong Lin is a Data Scientist at IBM. He is a regular speaker on data science and develops content for free data education on bigdatauniversity.com using open data tools on datascientistworkbench.com. Polong earned his M.Sc. at the University of Tsukuba.
Polong Lin (林伯龍) / How to Approach Data Science Problems from Start to End
1. How to Approach Data Science Problems from Start to End
Polong Lin
Data Scientist
IBM Analytics, Emerging Technologies
@polonglin
@bigdatau
Taiwan Data Science Conference (台灣資料科學年會)
2. • Free online courses
• Data Science & Data Engineering
• A community initiative led by IBM
• Certificates and Badges
• > 450,000 users
What is Big Data University (BDU)?
9. • Every project begins with business understanding.
• What is the project objective?
• What are we trying to do – what is our goal?
1. Formulate a clear question
2. Define problem and solution requirements
1. Business Understanding
Flight delays: create a solution that helps users predict whether a flight on a given day will be delayed.
15. Data Preparation typically includes:
• Data cleaning
• Merging data
• Transforming data
• Feature engineering
• Text analysis
6. Data preparation
Flights are classified as “delayed” if they are more than 15 minutes late.
• Delayed? [True or False]
Does time of day for departure predict delays?
• Hour
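The two preparation steps above (labeling a flight as delayed and extracting the departure hour) could be sketched in Python roughly as follows. This is only an illustration: the field names `scheduled_departure` and `delay_min` and the three sample records are made up for the example and do not come from the talk's dataset.

```python
from datetime import datetime

# From the slide: a flight counts as "delayed" if it is more than 15 minutes late.
DELAY_THRESHOLD_MIN = 15


def prepare_flight(record):
    """Turn one raw flight record into the two fields used later for modeling."""
    # Hypothetical input format: "YYYY-MM-DD HH:MM" scheduled departure time.
    scheduled = datetime.strptime(record["scheduled_departure"], "%Y-%m-%d %H:%M")
    return {
        "hour": scheduled.hour,  # feature: hour of departure
        "delayed": record["delay_min"] > DELAY_THRESHOLD_MIN,  # target: True/False
    }


# Made-up records, for illustration only.
raw_flights = [
    {"scheduled_departure": "2016-07-14 06:05", "delay_min": 3},
    {"scheduled_departure": "2016-07-14 17:40", "delay_min": 42},
    {"scheduled_departure": "2016-07-14 21:15", "delay_min": 15},
]

prepared = [prepare_flight(r) for r in raw_flights]
for row in prepared:
    print(row)
```

Note that a flight exactly 15 minutes late is not labeled as delayed here, because the rule is strictly "more than 15".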
18. Modeling:
• Is a highly iterative process
• May involve building and testing multiple models
Modeling
Using inputs:
• Year
• Month
• Day of Month
• Hour of departure
• Distance
• Destination airport
Predict:
Delay (True/False)
Logistic Regression
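As a rough illustration of what fitting a logistic regression involves, here is a minimal sketch in plain Python using batch gradient descent. It is not the implementation used in the talk (a library would normally be used in practice), it uses only the hour of departure out of the inputs listed above to keep it short, and the eight training rows are invented for the example. The helper names `fit_logistic`, `predict_delay`, and `scale_hour` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # derivative of the log loss w.r.t. the linear score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_delay(weights, bias, x):
    """Return True if the predicted probability of a delay is at least 0.5."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


def scale_hour(hour):
    """Center and scale the hour so gradient descent behaves well."""
    return [(hour - 12) / 12]


# Invented toy data: morning flights on time, evening flights delayed.
hours = [6, 7, 8, 9, 17, 18, 19, 20]
delayed = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = fit_logistic([scale_hour(h) for h in hours], delayed)
print(predict_delay(weights, bias, scale_hour(6)))
print(predict_delay(weights, bias, scale_hour(20)))
```

The other inputs from the slide (year, month, day of month, distance, destination airport) would be added as further entries in each feature row, with the destination airport encoded numerically first, for example as one-hot columns.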
20. • Once finalized, the model is deployed into a production environment.
• May be in a limited / test environment until model is proven
• Involves additional groups, skills, and technologies
• Solution owner
• Marketing
• Application developers and designers
• IT administration
• Feedback to assess model performance
• Gathering and analysis of feedback for assessment of the model’s performance and impact
• Iterative process for model refinement and redeployment
• Accelerate through automated processes
[Diagram: Deployment, Feedback, Prediction, Interpretation, Justification, Testing]
Deployment and feedback
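One simple way to picture the feedback step is to compare the predictions a deployed model made with what actually happened, then decide whether the model needs refinement. The sketch below assumes a hypothetical log of (predicted, actual) pairs and a hypothetical accuracy threshold of 0.8; neither comes from the talk.

```python
def evaluate_feedback(log):
    """Summarize logged (predicted_delayed, actually_delayed) pairs."""
    tp = sum(1 for predicted, actual in log if predicted and actual)
    fp = sum(1 for predicted, actual in log if predicted and not actual)
    tn = sum(1 for predicted, actual in log if not predicted and not actual)
    fn = sum(1 for predicted, actual in log if not predicted and actual)
    return {
        "accuracy": (tp + tn) / len(log),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up feedback log, for illustration only.
feedback_log = [
    (True, True), (True, False), (False, False), (False, True),
    (True, True), (False, False), (False, False), (True, True),
]

# Hypothetical threshold below which the model goes back for refinement.
MIN_ACCURACY = 0.8

metrics = evaluate_feedback(feedback_log)
needs_refinement = metrics["accuracy"] < MIN_ACCURACY
print(metrics, needs_refinement)
```

Running a check like this on a schedule is one way to automate part of the refine-and-redeploy loop described above.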