Contenu connexe Similaire à “Data Versioning: Towards Reproducibility in Machine Learning,” a Presentation from Tryolabs (20) Plus de Edge AI and Vision Alliance (20) “Data Versioning: Towards Reproducibility in Machine Learning,” a Presentation from Tryolabs1. © 2022 Tryolabs
Data Versioning
Towards Reproducibility in
Machine Learning
Nicolás Eiris
Machine Learning Engineer
Tryolabs
2. © 2022 Tryolabs
© 2022 Tryolabs
Tryolabs
2
• We build custom AI solutions
• 70+ team members
• 12+ years of experience
• Served more than 150 clients
Trusted by
3. © 2022 Tryolabs
1. Main pain points in ML workflows
2. Useful open source tool
3. Takeaways
4. References
Agenda
3
4. © 2022 Tryolabs
© 2022 Tryolabs 4
Dilemma in ML development
Building everything manually from scratch
vs. using a tool to support the
development phase (from collecting data to
deploying on the edge).
6. © 2022 Tryolabs
© 2022 Tryolabs
Standard ML workflow
6
DATA
INGESTION
EXPLORATORY
DATA ANALYSIS
DATA CLEANING
EXPERIMENTATION
& EVALUATION
MODELING FEATURE
ENGINEERING
ROLLING OUT TO
PRODUCTION
7. © 2022 Tryolabs
© 2022 Tryolabs
ML pipeline in practice
7
model
features
data
data_v2
features copy
features_2
features_3
model_1
model_1_2
model_prefinal
model_data_v2
model_2_2
model_final
UPLOAD
CODE
SETUP STORAGE
& UPLOAD DATA
SETUP CLOUD
RUNNER (GPU,
NAS, ETC.)
WAIT THE DATA &
CODE DON’T MATCH
RUN TRAIN/TEST
SCRIPT
DOWNLOAD
DATA + CODE
SYNC DATA
AND CODE
OH NO!
THE
REQUIREMENTS
HAVE CHANGED
WHERE DO I
REPORT OUTPUT
RESULTS ANYWAY?
RAGE QUIT JOB
EDA
EDA_2
EDA_3
*EDA = Exploratory data analysis
8. © 2022 Tryolabs
© 2022 Tryolabs
Main pain points in ML workflows
8
1. Reproducibility
● Teamwork
● Usually ad-hoc processes
● Productivity bottleneck
● Challenges
○ Changes in data
○ Hyperparams inconsistency
○ Randomness
○ Manual and ad-hoc
execution of experiments
9. © 2022 Tryolabs
© 2022 Tryolabs
Main pain points in ML workflows
9
1. Reproducibility
“Changes are uploaded, please
run all the notebook again.”
10. © 2022 Tryolabs
© 2022 Tryolabs
Main pain points in ML workflows
10
• Complex READMEs on how to
gather data from remote
storage
• Security and data privacy
risks
• Manual versioning of dataset
changes
2. Data sharing
11. © 2022 Tryolabs
© 2022 Tryolabs
Main pain points in ML workflows
11
2. Data sharing
“I wish I could automate this
process…”
NO STORAGE
12. © 2022 Tryolabs
© 2022 Tryolabs
Main pain points in ML workflows
12
3. Experiments
execution & tracking
● Experiments setup traceability
challenges
● Inefficient results comparison &
evaluation
● Manual process:
○ Spreadsheet
○ Github (metadata files)
○ Tracking tools (big learning curve)
13. © 2022 Tryolabs
© 2022 Tryolabs
Ideal development experience
13
Structured pipeline
composed by
interdependent steps
Sharing
experiments,
models, and results
in a simple way
Easily adding files
or directories to a
remote repository
Stop worrying
about source code
and data association
15. © 2022 Tryolabs
© 2022 Tryolabs
DVC high-level overview
15
model
features
data
7fe5fc5
Update features
d512ef1
Update dataset
and input
parameters
23811e0
Adjusting input
parameters
e7eb61f
Add the new
dataset and
features
020c55f
Adjusting input
parameters
model
features
data
model
features
data
model
features
data
model
features
data
16. © 2022 Tryolabs
© 2022 Tryolabs
DVC high-level overview
16
Time
Accuracy
87%
0
Collaborate
Deploy to
production
commit 8d7aa3d
Rollback
Cloud Local Cache
76%
17. © 2022 Tryolabs
© 2022 Tryolabs
Main features
17
● Git-compatible
● Reproducible
● Low friction branching
● Storage agnostic
● ML pipeline framework
● Language & framework agnostic
● Track failures
● Experiments & metrics tracking
18. © 2022 Tryolabs
© 2022 Tryolabs
Pipelines
18
● Pipelines composed by
interdependent steps
○ Dependencies
○ Code to execute
○ Outputs
● Additional pipeline
visualization command
dvc dag
19. © 2022 Tryolabs
© 2022 Tryolabs
Metrics differences
19
Smooth comparison
process:
numeric and graphic
visualization
20. © 2022 Tryolabs
© 2022 Tryolabs
Continuous integration
20
• Automatically check data version
• Benchmark new model against
previously deployed models
• Metrics diff & interactive plots in
Pull Requests
• Re-train & refine in the cloud
PUSH DATA + CODE
SETUP CLOUD
RUNNER FROM
CI/CD
(GPU, NAS,
ETC.)
RUN
TRAIN/TEST
SCRIPT
PUSH &
REPORT
METRICS
TABLES/GRAPH
S IN PR
COMMENTS
WIN!
SOURCE: WWW.DVC.COM
21. © 2022 Tryolabs 21
Experiments batch execution
“I can’t believe the number of hours saved by
queuing and executing experiments in parallel.”
22. © 2022 Tryolabs
© 2022 Tryolabs
UI does not have to be built from scratch
22
SOURCE: WWW.DVC.COM
● Show plots for selected
experiments
● Compare results
● Run new experiments
● Generate trend charts
24. © 2022 Tryolabs
© 2022 Tryolabs 24
Takeaways
Adopting a
development support
tool across the entire
ML workflow may be
crucial for the success
of a project.
Stop reinventing the
wheel for common ML
challenges.
Boost developer’s
productivity by
enabling them to focus
on coding.
Integrating DVC tool
favors quality
attributes such as
maintainability,
scalability, and security.
Support end-to-end
experience, from EDA
to production.
25. © 2022 Tryolabs
© 2022 Tryolabs 25
Takeaways
Reproducibility
With a couple of commands,
replicate the environment
state from other team
members (without re-executing
all the pipeline or experiment).
Data sharing
Data and source
code association
out-of-the-box, with
a wide variety of
remote storage options.
Experiments
Quickly run multiple
experiments in
parallel with various
ways of visualizing and
comparing results.
26. © 2022 Tryolabs
© 2022 Tryolabs 26
Takeaways - tool vs. from scratch
We learned that for most of the cases, using an
all-in-one framework like DVC alleviates the
work vs. manually dealing with
Reproducibility, Experimentation, and Data
sharing tasks.
27. © 2022 Tryolabs
Resources
27
DVC documentation
https://dvc.org/doc
Platform to quickly get-started with DVC
https://katacoda.com/dvc/courses/get-started
Norfair - Tryolabs object tracking open-source library
https://github.com/tryolabs/norfair
Reproducibility in machine learning
https://towardsdatascience.com/reproducible-machine-
learning-cf1841606805