Proposed Talk Outline for PyCon2017

  1. NAVIGATING THE PYTHON ECOSYSTEM FOR DATA SCIENCE
     Ananth Krishnamoorthy, Ph.D.
     Outline slides for a talk at PyCon2017
  2. Summary
     • In their day-to-day jobs, data science teams and data scientists face challenges in many overlapping yet distinct areas: Reporting, Data Processing & Storage, Scientific Computing, ML Modelling, and Application Development. To succeed, data science teams, especially small ones, need a clear appreciation of how these areas bear on their success.
     • The Python ecosystem for data science has a number of tools and libraries for the various aspects of data science, including machine learning, cluster computing, and scientific computing.
     • The idea of this talk is to understand what the Python data science ecosystem offers (so that you don't reinvent it) and where the common gaps are (so that you don't go blue in the face looking for answers).
     • We describe how different tools and libraries fit into the machine learning model development and deployment workflow, and how they work (and don't work) together. The talk is intended as a landscape survey of the Python data science ecosystem, along with a mention of some common gaps that practitioners may notice as they put together a stack and/or an application for their company.
  3. Evolving Role of Data Science Teams

     The most important trait of the Analytics 3.0 era is that not only online firms, but virtually any type of firm in any industry, can participate in the data economy. Banks, industrial manufacturers, health care providers, retailers: any company in any industry that is willing to exploit the possibilities can develop data-based offerings for customers, as well as support internal decisions with big data.

     Data
     • Analytics 1.0: enterprise data; structured transactional data
     • Analytics 2.0: bring in web and social data; complex, large, semi-structured data sources
     • Analytics 3.0: GPS, mobile device, clickstream, and sensor data; unstructured, real-time, streaming

     Tools
     • Analytics 1.0: spreadsheets; BI, OLAP; ETL; on-premise servers
     • Analytics 2.0: visualization; NoSQL; Hadoop
     • Analytics 3.0: machine learning, artificial intelligence; on-demand everything; analytical apps; integrated, embedded models

     Activity
     • Analytics 1.0: the majority of analytical activity was descriptive analytics, or reporting; creating analytical models was a time-consuming "batch" process
     • Analytics 2.0: visual analytics dominates predictive and prescriptive techniques; develop products, not PowerPoints or reports
     • Analytics 3.0: analytics integral to running the business, a strategic asset; rapid and agile insight delivery; analytical tools available at the point of decision

     Source: The Rise of Analytics 3.0, Thomas H. Davenport, IIA, 2013
  4. Machine Learning vs Real-World Data Science
     (Diagram: machine learning as just one part of a larger whole, alongside Deployment, Application Development, Big Data Processing, Data Storage, and ETL)
  5. Challenges Faced by Data Science Teams
     • Requires many more competencies than can reasonably be expected from one person
     • Challenges are greater for smaller teams and smaller companies, e.g. startups
     • Challenges create dependencies on other teams, e.g. Development
     • Dependencies slow down execution and the realization of benefits
  6. Plethora of Choices
     Areas: Reporting; Data Processing & Storage; Scientific Computing; ML Modelling; Application Development
     Options: SQL, NoSQL, Graphdb, OLAP, ETL, Cluster Computing, Stream Processing, MapReduce, Charting, Statistics, Cloud, Front End, Back End, Microservices, ML, Deep Learning, Dim. Reduction, Signal Processing, Optimization, Time Series Analysis, Simulation
  7. Data Science Workflow
     ETL → Process → Model → Store → Deploy
     (Diagram maps the stages against data scientist skills; infrastructure and provisioning are left as an open question: ???)
  8. Python Ecosystem
     ETL → Process → Model → Store → Deploy
     Tools along the workflow: Odo, Blaze, Pandas, Dask, Spark, Sklearn_Pandas, Scikit-learn, Keras, Spark MLlib, Bokeh, Jupyter
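The stage-to-tool mapping above can be sketched end to end. This is a minimal illustration, not the talk's actual demo: the data is invented, and the file name `churn_model.joblib` is hypothetical. It uses pandas for the ETL/Process stages, scikit-learn for Model, and joblib persistence for Store.

```python
import pandas as pd
import joblib
from sklearn.linear_model import LogisticRegression

# ETL: load raw data (built in memory here instead of a real source)
raw = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30_000, 48_000, 82_000, 90_000, 61_000, 41_000],
    "churned": [1, 0, 0, 0, 1, 1],
})

# Process: feature engineering with plain pandas transforms
features = raw[["age"]].assign(income_k=raw["income"] / 1000)

# Model: fit a scikit-learn estimator
model = LogisticRegression().fit(features, raw["churned"])

# Store: persist the fitted model so a Deploy step can pick it up later
joblib.dump(model, "churn_model.joblib")
restored = joblib.load("churn_model.joblib")
print(restored.predict(features))
```

The point of the sketch is that each workflow stage is one library boundary; swapping pandas for Dask or Spark changes the ETL/Process lines but leaves the Model/Store lines alone.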
  9. Review of Key Tools (50% of talk time spent here; more slides to be added)
     • Jupyter
     • Pandas
     • Scikit-Learn
     • Keras / TensorFlow / Theano
     • Matplotlib / Bokeh
     • Blaze
     • Odo
     • Dask
     • pySpark
     We shall see some code snippets here to illustrate a few ideas. The goal is to know enough to pick the right components for the job at hand.
  10. Use Case 1: Small Data
      This use case illustrates Small Data, i.e. desktop / in-memory processing
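A small-data sketch of the kind this use case covers, assuming plain pandas and a toy in-memory frame invented for illustration:

```python
import pandas as pd

# Small data: everything fits comfortably in memory, so plain pandas
# (backed by NumPy arrays) is usually the simplest choice.
sales = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south"],
    "amount": [120.0, 95.5, 210.0, 80.0, 130.5],
})

# Typical in-memory operations: filter, group, aggregate
by_region = sales.groupby("region")["amount"].sum()
print(by_region)
```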
  11. Use Case 2: 'Medium' Data
      This use case illustrates Medium Data, i.e. out-of-core processing on a single machine
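One dependency-light way to sketch out-of-core processing is pandas' chunked CSV reader; Dask's dataframe API automates the same pattern behind pandas-like calls. The CSV below is synthetic and stands in for a file too large to load at once:

```python
import io
import pandas as pd

# 'Medium' data: too big for one in-memory frame, small enough for one
# machine. Stream it in chunks so only a slice is resident at a time.
csv_source = io.StringIO(
    "user,clicks\n" + "\n".join(f"u{i % 3},{i}" for i in range(10))
)

running_total = {}
for chunk in pd.read_csv(csv_source, chunksize=4):  # 4 rows per chunk
    # Aggregate each chunk, then fold the partial result into the total
    partial = chunk.groupby("user")["clicks"].sum()
    for user, clicks in partial.items():
        running_total[user] = running_total.get(user, 0) + clicks

print(running_total)
```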
  12. Use Case 3: Big Data
      This use case illustrates Big Data, i.e. cluster computing
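A hedged pySpark sketch of the cluster-computing case. Actually running it requires a Spark installation, so the imports are deferred into the function body; the CSV path and the `user`/`clicks` columns are hypothetical, chosen to mirror the out-of-core example above:

```python
# Big data: the dataset is partitioned across a cluster and pyspark's
# DataFrame API pushes the computation out to the executors.
def total_clicks_per_user(csv_path):
    """Aggregate a (hypothetical) clicks CSV with Spark."""
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # local[2] runs a two-thread local "cluster"; on a real cluster the
    # master URL would point at YARN, Mesos, or a standalone master.
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("clicks-demo")
             .getOrCreate())
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    rows = (df.groupBy("user")
              .agg(F.sum("clicks").alias("total_clicks"))
              .collect())
    spark.stop()
    return {row["user"]: row["total_clicks"] for row in rows}
```

Note that the shape of the computation (group, aggregate, collect) is the same as in the pandas versions; what changes is where it executes.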
  13. What Works
      • Sklearn's consistent API and wide variety of ML algorithms
      • Sklearn Pipelines
      • Scikit-Keras integration
      • Pandas for data analysis
      • ….
      • ….
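The consistent-API and Pipelines points can be illustrated with a minimal scikit-learn sketch on the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A Pipeline chains preprocessing and modelling behind the same
# fit/predict/score API as any single estimator, so the two steps
# are swapped or cross-validated as one unit.
X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
print(round(pipe.score(X, y), 2))
```

Because every estimator exposes the same interface, replacing `LogisticRegression` with any other classifier is a one-line change and the rest of the pipeline is untouched.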
  14. Gaps: A Data Scientist's Perspective
      • Uniform API across activities
      • Separation of data, processing, and instructions
      • A single data-structure paradigm
        - Support for in-memory, out-of-core, and distributed computing in the same paradigm, e.g. SFrame
      • ETL
        - Push heavy lifting to backend systems
        - Monitoring workflows
      • UI development
        - Bokeh
      • Deployment
        - Application
        - Web services