Scikit-learn: the state of the union 2016

Personal point of view on scikit-learn: past, present, and future.

This talk gives a bit of history, mentions exciting developments, and offers a personal vision of the future.

  1. Scikit-learn: The state of the union. Gaël Varoquaux. Open Source Innovation Spring 2016. Personal point of view, as an opening to scikit-learn days 2016 in Paris.
  2. Section 1: Some history. Scikit-learn canal historique. G Varoquaux
  3. scikit-learn growth: users. Website users (weekly): Google Analytics. Debian popcon: ∼1% of the Debian users.
  4. scikit-learn growth: users. Website users (weekly): Google Analytics. Debian popcon: ∼1% of the Debian users. Web searches: Google Trends.
  5. scikit-learn growth: lines of code. Huge feature set. https://www.openhub.net/p/scikit-learn
  6. scikit-learn growth: contributors. 759 contributors. https://www.openhub.net/p/scikit-learn
  7. Started as David Cournapeau's failed PhD project. David then preferred improving numpy/scipy. That's David sprinting in 2011.
  8. 2009: We (Inria Parietal) need machine learning. My team takes over the development. Hire a young guy (Fabian Pedregosa). Put post-docs and PhD students on it (Alexandre Gramfort, Vincent Michel...). Work in the open. Pythonic, fast, documented.
  9. 2010: ICML MLOSS workshop (Machine Learning Open Source Software). “The examples in the tutorial are pretty, but not particularly useful for the serious user.” “For the sustainability of the project it might be better to narrow the focus...”
  10. 2011: NIPS sprint. People that I didn't know were solving my problems.
  11. 2011: NIPS sprint. People that I didn't know were solving my problems. The project took off because of the community...
  12. Section 2: Upcoming cool stuff. Upcoming 0.18 release.
  13. Less code: lines of code.
  14. Less code: Cython no longer embedded. Lines of code: generated C no longer embedded in git ⇒ opens the door to fused types (polymorphism) ⇒ multiple dtypes support in algorithms = memory saver. Arthur Mensch
  15. Faster code: better algorithmics. RandomizedPCA → PCA. Automatic choice: randomized linear algebra, power iteration (arpack), full (lapack). For large data: up to 20× speed up. https://github.com/scikit-learn/scikit-learn/issues/5243 Giorgio Patrini
  16. Faster code: better algorithmics. RandomizedPCA → PCA. Automatic choice: randomized linear algebra, power iteration (arpack), full (lapack). For large data: up to 20× speed up. https://github.com/scikit-learn/scikit-learn/issues/5243 Giorgio Patrini. Elkan's K-means. For large data: ∼2× speed up. https://github.com/scikit-learn/scikit-learn/pull/5414 Andreas Müller
  17. New cross-validation objects:
          from sklearn.cross_validation import StratifiedKFold
          cv = StratifiedKFold(y, n_folds=2)
          for train, test in cv:
              X_train = X[train]
              y_train = y[train]
      Data-independent nested-CV possible. https://github.com/scikit-learn/scikit-learn/pull/4294 Raghav R V
  18. New cross-validation objects:
          from sklearn.model_selection import StratifiedKFold
          cv = StratifiedKFold(n_folds=2)
          for train, test in cv.split(X, y):
              X_train = X[train]
              y_train = y[train]
      Data-independent ⇒ nested-CV possible. https://github.com/scikit-learn/scikit-learn/pull/4294 Raghav R V
  19. Sequential / Bayesian search CV. See hyper-parameter selection as a Bayesian optimization / noisy fit problem ⇒ choose hyper-parameters cleverly, not on a grid. Pull request stalled: https://github.com/scikit-learn/scikit-learn/pull/5491 Fabian Pedregosa, Sebastien Dubois, & Manoj Kumar
  20. Section 3: Vision(s): the future
  21. Mission statement. Enable progress via data science. Lower the costs, fewer technicalities. Machine learning for everybody and for everything.
  22. Mission statement. Enable progress via data science. Lower the costs, fewer technicalities. Machine learning for everybody and for everything. Small hardware, medium data.
  23. Deep learning. sklearn.neural_network.MLPClassifier. Architecture-specification language, GPUs, unbound technicality.
  24. Deep learning. sklearn.neural_network.MLPClassifier. Architecture-specification language, GPUs, unbound technicality: keras, caffe...
  25. AutoML. Automatic model selection. Better hyper-parameter selection. Better description and uniformization of estimators. Integrate feedback from auto-sklearn.
  26. Better, faster, stronger. Faster models: from lightning, back to sklearn; inspiration from XGBoost (the paper is out!).
  27. Better, faster, stronger. Faster models: from lightning, back to sklearn; inspiration from XGBoost (the paper is out!). Larger data: more partial_fit; online forests? Fewer copies.
  28. Scaling up (out?). I don't want java/scala: less fluid prototyping; cross-VM debugging is hard; numerics in java are slower than Lapack; need C somewhere.
  29. Scaling up (out?). I don't want java/scala. They have: coupling of distributed store to computation; distributed job management. Create new stack? Ride on this one?
  30. Scaling up (out?). I don't want java/scala. They have: coupling of distributed store to computation; distributed job management. Create new stack? Ride on this one? Blaze, Ibis, dask: require rewrite of algorithms; dask promising for ETL. New backends for joblib: parallel and storage (distributed, ssh).
  31. Sustainable growth. Reviewing is the bottleneck. User support drowns core devs. Users need stability (Airbus). Coding is not the only thing: sprints, GSoC management, tutorials...
  32. Sustainable growth. Reviewing is the bottleneck. User support drowns core devs. Users need stability (Airbus). Coding is not the only thing: sprints, GSoC management, tutorials... Structure & stability. How to organize funding and governance? Process/meetings/reports/funding proposals... = work on project. Passionate coders get a lot done unless they get drowned by meetings.
  33. @GaelVaroquaux. Funding: Inria, Nexedi, Paris-Saclay CDS, NYU CDS, GSoC
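
The point of slide 18 (a splitter that is built without data, so it can be reused inside another loop) can be sketched with a toy splitter. This is a minimal illustration under stated assumptions: `ToyKFold` is a hypothetical class written for this note, it does plain rather than stratified splitting, and it is not scikit-learn's implementation. In the released `sklearn.model_selection` API the constructor argument is spelled `n_splits`, and the sketch follows that spelling.

```python
# Toy illustration only: ToyKFold is a hypothetical class, not scikit-learn code.


class ToyKFold:
    """Splitter configured without data; indices are produced by split()."""

    def __init__(self, n_splits=2):
        self.n_splits = n_splits

    def split(self, X, y=None):
        indices = list(range(len(X)))
        # Fold k holds out every n_splits-th sample, starting at offset k.
        for k in range(self.n_splits):
            test = indices[k::self.n_splits]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            yield train, test


X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 1, 0, 1, 0, 1]

outer_cv = ToyKFold(n_splits=3)
inner_cv = ToyKFold(n_splits=2)  # reusable: it holds no reference to X or y

nested_sizes = []
for outer_train, outer_test in outer_cv.split(X, y):
    X_outer = [X[i] for i in outer_train]
    y_outer = [y[i] for i in outer_train]
    # The same inner splitter is applied to each outer training subset.
    for inner_train, inner_test in inner_cv.split(X_outer, y_outer):
        nested_sizes.append((len(outer_test), len(inner_train), len(inner_test)))

print(nested_sizes)
```

With the older design of slide 17, the splitter was bound to one `y` at construction time, so the inner loop could not be handed a different subset on each outer fold.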

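The "more partial fit" item of slide 27 refers to estimators that learn from data chunk by chunk instead of loading everything at once; scikit-learn exposes this through a `partial_fit` method on some estimators. The sketch below only illustrates that calling convention in pure Python. `RunningMoments` and its attributes are hypothetical names chosen for this note, and the update rule (Welford's one-pass algorithm) is an assumption picked for simplicity, not something taken from the slides.

```python
# Toy illustration only: RunningMoments is a hypothetical class, not scikit-learn code.


class RunningMoments:
    """Tracks the mean and population variance of a stream, chunk by chunk."""

    def __init__(self):
        self.n_samples_ = 0
        self.mean_ = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def partial_fit(self, chunk):
        # Welford's update: one pass, earlier chunks need not stay in memory.
        for value in chunk:
            self.n_samples_ += 1
            delta = value - self.mean_
            self.mean_ += delta / self.n_samples_
            self._m2 += delta * (value - self.mean_)
        return self

    @property
    def var_(self):
        return self._m2 / self.n_samples_ if self.n_samples_ else 0.0


stream = ([1.0, 2.0], [3.0, 4.0], [5.0])  # stands in for chunks read from disk
est = RunningMoments()
for chunk in stream:
    est.partial_fit(chunk)

print(est.n_samples_, est.mean_, est.var_)
```

Each call touches only the current chunk, which is what makes the pattern attractive for the "small hardware, medium data" setting of slide 22.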