The document summarizes the author's experience competing in their first Kaggle competition to predict Parkinson's disease severity scores. Some key points:
- The author placed in the top 6.94% without any prior Kaggle experience, submitting over 90 solutions and creating over 50 models.
- Effective solutions can be simple, like averaging two simple models. The winning solution ignored blood test features.
- Competitive data science requires extensive time and effort exploring many hypotheses through experiment tracking and logging.
- Asking questions of others outside data science can provide new insights. Teams are valuable for sharing experiences.
- Public leaderboards may not reflect the true state, and debugging code competition submissions is challenging. Learning from the winners' solutions is part of the competition.
5. Metric
SMAPE (+1): Symmetric Mean Absolute Percentage Error
UPDRS: Unified Parkinson's Disease Rating Scale
The goal is to predict UPDRS scores that
measure the severity of Parkinson's disease:
• UPDRS_1 - Mentation, Behavior, and Mood
• UPDRS_2 - Activities of Daily Living
• UPDRS_3 - Body Motor Functions
• UPDRS_4 - Complications of Therapy
The higher the value, the higher the severity
Predict values for the current month and for
6, 12, and 24 months later.
So, for one visit we need to predict 16 values
(4 UPDRS scores × 4 time points).
6. Results
• Team Experience on Kaggle: none
• Notebooks created: 242
• Models created: 53
• Submissions: 91
• Score result: TOP 6.94%
• PB (private leaderboard) result: TOP 15% (262nd place)
• Winning team score by competition
metric (SMAPE): 60.042
• Average score in PB: 72.278
• Team score: 69.759
• Bronze Score: 69.743
• Silver Score: 69.738
• Gold Score: 60.936
7. 1st place solution
The final solution is a simple average of two models: LightGBM (LGB) and a neural network (NN).
Both models were trained on the same features:
• Visit month
• Forecast horizon
• Target prediction month
• Indicator whether blood was taken during the visit
• Indicators whether a patient visit occurred in the 6th, 18th, and 48th
months
• Count of previous “non-annual” visits (6th or 18th)
• Index of the target (pivot the dataset to have a single target column)
The winning solution fully ignores the results of the blood tests. The
team tried hard to find any signal in this crucial piece of the data, but
unfortunately concluded that none of their approaches or models could
extract a benefit from the blood test features large enough to
distinguish it from random variation.
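The final averaging step is easy to reproduce; a minimal sketch, assuming each model's per-row predictions are already available as arrays (the numbers below are made up for illustration, not taken from the winning code):

```python
import numpy as np

# Hypothetical UPDRS forecasts from the two models for three rows of the
# pivoted dataset (single target column, one row per visit/horizon/target).
pred_lgb = np.array([10.2, 33.0, 4.8])  # LightGBM predictions
pred_nn = np.array([11.0, 31.4, 5.0])   # neural-network predictions

# The winning solution's final prediction is a plain, unweighted mean.
final_pred = (pred_lgb + pred_nn) / 2.0
print(final_pred)
```

An unweighted mean of a tree model and a neural network often helps because the two model families tend to make decorrelated errors.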
10. Lesson 3
Be prepared for the tree of
hypotheses and options to
grow indefinitely.
A system for tracking
experiment results and
logging changes will be
needed very soon.
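Even a flat CSV file is enough to start with; a minimal sketch of such an experiment log (the file name and fields are illustrative, not prescribed by Kaggle):

```python
import csv
import datetime

def log_experiment(path, model_name, features, cv_score, note=""):
    """Append one row per run: timestamp, model, feature set, CV score, note."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.datetime.now().isoformat(timespec="seconds"),
            model_name,
            ";".join(features),
            cv_score,
            note,
        ])

# One line per experiment keeps hypotheses and scores comparable weeks later.
log_experiment("experiments.csv", "lgb_v3",
               ["visit_month", "horizon"], 71.2, "baseline features")
```

A plain file beats memory: after a few dozen notebooks, the log is the only reliable record of which hypothesis produced which score.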
11. Lesson 4
You will probably
spend a lot of time
on ideas that will not
work….
But it will be an
invaluable
experience.
12. Lesson 5
Search for similar
competitions in the
past. Learn winning
techniques. Apply them.
13. Lesson 6
Don’t rely on other
people’s EDA and
automated data
analysis packages
14. Lesson 7
Ask all kinds of
questions, even the
wildest ones, about the
data and the topic of
the competition. Find
your answers. Consult
experts in the field and
relevant publications.
How many
times do I have to
lose at Kaggle to win ?
15. Lesson 8
Explain your mission
and approaches in a
Kaggle competition to
people far from data
science (your rubber
ducks). Simple questions
and explanations often
reveal valuable insights.
https://en.wikipedia.org/wiki/Rubber_duck_debugging
17. Lesson 10
If it is a ”Code
Competition” (solutions are
submitted through an API), be
prepared for a blind battle.
Getting a finished solution
to an accepted submission
via the API may take
longer than you think. ”Take a deep breath, step away from the code, sleep or go
for a walk, take your mind off it, then come back and examine
it with fresh eyes”
https://www.kaggle.com/code-competition-debugging
You’re getting an error in a code competition. Now what? Writing code that
works perfectly on unseen data is difficult, even for experts. Don't get
discouraged or feel that you're the only one stuck.
To prevent probing, Kaggle does not provide highly specific debugging
messages in code competitions (whereby Kaggle reruns your code on a
hidden dataset). Submissions that error also count towards your team’s
daily submission limit…
18. Lesson 11
Don’t give up. There will
be demotivation. Just
don’t give up and go all
the way.
19. Lesson 12
The Team is Great!
Sharing your suffering,
joys, and triumphs with
your teammates is
priceless.
20. Lesson 13
Perhaps not everyone in
your own social circle will
appreciate the level of
involvement in the
competition. It really takes
a lot of time and attention.
21. Lesson 14
The competition is not
over until you understand
the winners’ solutions.
Me starting the Kaggle
competition
Me reading winners' solutions
22. Lesson 15
Competition is first about
learning and experience,
then about winning over
yourself, and only then
about winning over
others.
23. Send your complaints, suggestions and job offers
https://www.linkedin.com/in/samvelkoch/
samvelkoch@gmail.com
Medals
Login streak
Impulse and inspiration from the previous meetup
A bit overexcited
I may be the least experienced DS in this room, but highly likely one of the most enthusiastic
The total number of patients was in the range 380-390
About 230,000 rows
Math notation and code for the metric the winning team used in their solution
SMAPE is expressed as a percentage
SMAPE has some limitations, such as sensitivity to zero values
It penalizes large errors: larger errors produce higher percentage differences, leading to a greater penalty in the SMAPE calculation
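The metric fits in a few lines; a minimal sketch of SMAPE+1 (the shift by +1 is an assumption drawn from the ”(+1)” on the metric slide, used to avoid division by zero when both true and predicted values are 0):

```python
import numpy as np

def smape_plus_1(y_true, y_pred):
    """SMAPE computed on (value + 1): 100 * mean(2|p - t| / (|t| + |p|))."""
    t = np.asarray(y_true, dtype=float) + 1.0
    p = np.asarray(y_pred, dtype=float) + 1.0
    return 100.0 * np.mean(2.0 * np.abs(p - t) / (np.abs(t) + np.abs(p)))

print(smape_plus_1([0, 10, 20], [0, 10, 20]))  # 0.0 for a perfect forecast
```

Because the absolute error sits in the numerator while the denominator grows with the magnitude of the values, an error of the same absolute size is penalized hardest on small true values.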
The final LB score was so dense that a difference of a few thousandths separated us from the medal zone. Which, by the way, is quite typical for most competitions on Kaggle
The further we progressed in our research, the more we felt that the goal of the contest, to build an effective model for predicting a patient's condition from proteins and peptides, would not be achieved. The problem is not a lack of linkage between proteins and Parkinson's, but rather the data itself and the design of the competition. First, the data lacked alpha-synuclein, which has been the subject of the most promising research in recent years. Second, the control group of healthy patients was represented by a very small sample, so the data suffered from the curse of dimensionality. Third, the organizers of the competition did not make the use of proteins and peptides mandatory for participants' solutions. All three factors were borderline foul. I'm sure the organizers had compelling reasons to make such a dataset available to the community. I sincerely hope that the community's solutions have provided researchers with answers to their questions, and that these answers have brought humanity one step closer to understanding Parkinson's disease and finding effective approaches to its prevention and treatment.
Count of previous “non-annual” visits (6th or 18th):
a simple feature that was noticed by only 18 teams
It is priceless to learn how to eliminate redundant data and to find answers in your own notebooks and in public ones