
R Meetup Talk: Scaling Data Science with dgit

First talk on the need for dataset versioning, the structure of dgit, and a demo. Presented at the R data science meetup in Bangalore on March 26.


  1. Scaling Data Science with dgit
     Dr. Venkata Pingali, Founder, Scribble Data
     pingali@scribbledata.io
     https://github.com/pingali
  2. Summary
     1. Scaling the impact of data science requires increasing trust and efficiency
        a. Trust requires auditability and reproducibility of results
        b. Efficiency requires standardization and automation
     2. The dataset is a fundamental abstraction of data science
     3. dgit enables git-like management of datasets
        a. Python package, open source, MIT license
        b. Familiar git interface with modifications
     4. Call to collaborate
  3. dgit - 1-minute summary
  4. dgit - a git wrapper for datasets
     1. Python package, MIT license
     2. An application of git
     3. Beyond git - "understands" data
        a. Metadata generation and management
        b. Automatic scanning of the working directory for changes
        c. Automatic validation and materialization
        d. Dependency tracking across repos
        e. Automatic audit trails with execution
        f. Pipeline support
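The slide above treats datasets the way git treats code: hash the content, record metadata, and keep an audit trail of every change. As a rough illustration of that bookkeeping (not dgit's actual API; the names snapshot, file_sha256, and dataset_audit.jsonl are hypothetical), a minimal Python sketch might look like this:

    # Illustrative sketch only - not dgit's API. Hashes a dataset file and
    # appends a commit-like record to a local audit trail.
    import hashlib
    import json
    import os
    import time

    AUDIT_LOG = "dataset_audit.jsonl"  # hypothetical audit-trail file

    def file_sha256(path):
        """Content hash used to detect changes to a dataset file."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    def snapshot(path, message):
        """Record what changed, when, and its hash - roughly a 'commit'."""
        record = {
            "path": path,
            "sha256": file_sha256(path),
            "size_bytes": os.path.getsize(path),
            "timestamp": time.time(),
            "message": message,
        }
        with open(AUDIT_LOG, "a") as log:
            log.write(json.dumps(record) + "\n")
        return record

    # Usage (hypothetical file name):
    # snapshot("sales_2016q1.csv", "initial pull from the warehouse")

Dependency tracking and materialization would layer on top of records like these; the point is only that a dataset version becomes something you can name, diff, and audit.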
  5. Growing Pains in Data Science
  6. A Random Anonymized Slide from an Actual Presentation
     Implication: large wasted spend, poor production design, a worsening baseline
  7. Decision-maker Questions
     1. Where did the numbers come from? (Correctness, Lineage)
        a. Assumptions, models, datasets
     2. Is this an accident? Does it hold now? (Reproducibility, Retargetability)
        a. Model, dataset, and question revisions
     3. Can you get the results faster? (Efficiency)
        a. Time, effort, cost
     4. Can you also analyze X? (Extensibility)
        a. A different dataset or question
     5. Could we try X? (Dataset generation - synthetic and real)
        a. What-if scenarios, field experiments
  8. Conceptual Process
     [Process diagram: Biz, the Analytics Team, and Data Engg, linked by questions & context, data requests, datasets, model results, and storytelling.]
     All three roles could be in a single team!
  9. Business Complexity is Discovered Over Time
     • Incomplete context (history, semantics)
     • Questions not thought through
     • Continuous revisions
     (annotations over the same process diagram as slide 8)
  10. Imperfect Data Queries due to Limited Understanding
      • Dependencies not specified
      • Wrong filters
      • Known outliers
      • Narrow specification (cubes)
      (annotations over the same process diagram)
  11. Weak Process
      • Lack of protocol (email/files)
      • Missing validation checks
      • No lineage
      • No revisions
      (annotations over the same process diagram)
  12. Eagerness to Present Great Narratives
      • Wrong input dataset
      • Mistakes in the pipeline
      • Excel/ad hoc transformations
      • Model evolution
      • Continuous revision of narratives
      • Missing interpretation integrity checks (e.g. other time windows)
      • Better methodology
      (annotations over the same process diagram)
  13. Process in Reality
      Iterative, expensive, laborious
      (same process diagram)
  14. Actual Process
      Iterative, expensive, laborious
      (same process diagram)
      http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/
      "80% of .. companies strategic decision go haywire .. 'flawed' data"
  15. Desired State
      1. Trusted
         a. Every model should be auditable to the last record and step ⬅
         b. Every model should be reproducible with zero human intervention ⬅
         c. Enables the use and development of mathematical judgment
      2. Scalable
         a. Highly automated through most of the lifecycle ⬅
         b. Continuous reduction in costs ⬅
         c. Grows sublinearly with questions, datasets, and models
      3. Robust to
         a. Younger, inexperienced staff ⬅
         b. Weak processes
  16. Process with a Dataset Repository
      [Diagram: Biz, the Analytics Team, and Data Engg work against a server-side CI dataset repository (dataset rules, evaluation rules, dependencies, materialized datasets); context & questions, materialization, the model pipeline, pipeline execution, evaluation, and interpretation each produce a new dataset version (v1-v6), with slide content referencing a URN.]
      • Dataset as a mutable object with memory
      • No emails/Google Docs
      • Continuous validation by a third party (the server)
      • Separate model development and evaluation
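The "continuous validation by a third party (the server)" step above is the kind of rule a CI job could run every time a new dataset version is pushed. A minimal sketch, assuming a CSV dataset and a hypothetical check() helper (this is not dgit's validator interface):

    # Illustrative sketch only - a server-side rule that rejects a materialized
    # dataset if required columns or a minimum row count are missing.
    import csv

    def check(path, required_columns, min_rows=1):
        """Return (ok, message) for a CSV dataset."""
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            missing = set(required_columns) - set(reader.fieldnames or [])
            if missing:
                return False, "missing columns: %s" % sorted(missing)
            n_rows = sum(1 for _ in reader)
        if n_rows < min_rows:
            return False, "only %d rows, expected at least %d" % (n_rows, min_rows)
        return True, "ok (%d rows)" % n_rows

    # A CI hook could run, e.g. (hypothetical file and columns):
    # ok, msg = check("orders_v4.csv", ["order_id", "amount", "region"])

Running such checks on the server rather than on an analyst's laptop is what makes the validation "third party": the same rules apply to every version, regardless of who produced it.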
  17. dgit
  18. dgit Structure
      [Architecture diagram: the dgit CLI sits on top of the dgitcore API, which coordinates plugin managers - Repo Manager (Git), Backend (S3), Validator, Generator, and Instrumentation - with concrete plugins such as MySQL, S3, Regression, Metadata, Content, Platform, and Basic.]
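A layered structure like the one above (CLI on top of a core API that dispatches to pluggable validators, generators, backends, and instrumentation) usually implies a small plugin contract. The sketch below is an assumption about what such a contract could look like; the class names are hypothetical and not taken from the dgit codebase:

    # Illustrative sketch only - a plausible plugin contract, not dgit's.
    class Validator:
        """Base contract a validation plugin (e.g. a regression-quality
        check) might follow."""
        name = "base-validator"

        def evaluate(self, repo, files):
            """Return a list of {file, status, message} results."""
            raise NotImplementedError

    class NonEmptyValidator(Validator):
        """Trivial example plugin: flag dataset files that are empty."""
        name = "non-empty"

        def evaluate(self, repo, files):
            results = []
            for path in files:
                with open(path, "rb") as f:
                    has_content = bool(f.read(1))
                results.append({
                    "file": path,
                    "status": "pass" if has_content else "fail",
                    "message": "file has content" if has_content else "file is empty",
                })
            return results

Keeping validators, generators, and backends behind interfaces like this is what lets the demo swap in a regression-quality plugin or an SQL generator without touching the core.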
  19. Demo Goals
      1. Show an end-to-end example (command line)
         a. Simple regression
      2. Explain the structure
      3. Advanced features
         a. Validation (regression-quality plugin)
         b. Generator (SQL)
         c. Pipeline (Dora)
  20. Open Tasks
      1. dgit-specific
         a. Cleanup and stabilization
            i. Python 2/3 compatibility
            ii. Plugins for various tasks (anonymization, Hive, etc.)
         b. Testing infrastructure
         c. Integration
            i. Windows and macOS support
            ii. Support for Instabase/Dat/other services
      2. Ideas for new tools to reduce the cost and complexity of data science
  21. Speaker
      Dr. Venkata Pingali
      Founder, Scribble Data
      Former VP of Analytics, FourthLion
      IIT(B); PhD (USC)
      http://linkedin.com/in/pingali
