
Part 3: Models in Production: A Look From Beginning to End

3 Things to Learn About:

- How to uplevel your existing analytics stack with a collaborative environment that supports the latest open source languages and libraries.
- How to get better use of your core data management investments while opening up new supported tools for data science.
- How to expand data science outside of siloed environments and enable self-service data science access.

Published in: Software
  • Data science is more than just modeling. The complete data science lifecycle also includes data engineering and model deployment. This project offers a simplified yet credible example of all three elements, as implemented using Apache Spark, the Cloudera Data Science Workbench, and JPMML / OpenScoring.

  1. Models in Production: A Look From Beginning to End
     Sean Owen – Director of Data Science, Cloudera
     Sean Anderson – Product Marketing, Cloudera
     © Cloudera, Inc. All rights reserved.
  2. What Does a Data Scientist Do?
     • Data Preparation
     • Data Modeling
     • Model Deployment (maybe)
  3. Types of data science
     Exploratory (discover and quantify opportunities)
     • Team: Data scientists and analysts
     • Goal: Understand data, develop and improve models, share insights
     • Data: New and changing; often sampled
     • Environment: Local machine, sandbox cluster
     • Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …
     • End State: Reports, dashboards, PDF, MS Office
     Operational (deploy production systems)
     • Team: Data engineers, developers, SREs
     • Goal: Build and maintain applications, improve model performance, manage models in production
     • Data: Known data; full scale
     • Environment: Production clusters
     • Tools: Java/Scala, C++; IDEs; continuous integration, source control, …
     • End State: Online/production applications
  4. Typical data science workflow
     [Diagram] Stages: Data Engineering; Data Science (Exploratory); Production (Operational)
     [Diagram] Steps: Acquisition, Processing, Data Wrangling, Visualization and Analysis, Model Training & Testing, Production Data Pipelines, Batch Scoring, Online Scoring, Serving
     [Diagram] Also shown: Data Governance
  5. Common Limitations
     • Access: Secured clusters are often hard for data science professionals to connect to, either because they don’t have the right permissions or because resources are too scarce to afford them access. In addition, popular frameworks and libraries don’t read Hadoop data formats out of the box.
     • Scale: Notebook environments seldom have enough data storage for medium data, let alone big data. Data scientists are often relegated to sample data and constrained when working on distributed systems. Popular frameworks and libraries don’t easily parallelize across the cluster.
     • Developer Experience: Popular notebooks don’t work well with access engines like Spark, and package deployment and dependency management across multiple software versions are often hard to manage. Once a model is built, there is no easy path from model development to production.
  6. Introducing Cloudera Data Science Workbench
     Self-service data science for the enterprise
     Accelerates data science from development to production with:
     • Secure self-service environments for data scientists to work against Cloudera clusters
     • Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
     • Workflow automation, version control, collaboration and sharing
  7. Solving Data Science is a Full-Stack Problem
     • Leverage Big Data ✓ Hadoop
     • Enable real-time use cases ✓ Kafka, Spark Streaming
     • Provide sufficient toolset for the Data Analysts ✓ Spark, Hive, Hue
     • Provide sufficient toolset for the Data Scientists + Data Engineers ✓ Data Science Workbench
     • Provide standard data governance capabilities ✓ Navigator + Partners
     • Provide standard security across the stack ✓ Kerberos, Sentry, Record Service, KMS/KTS
     • Provide flexible deployment options ✓ Cloudera Director
     • Integrate with partner tools ✓ Rich Ecosystem
     • Provide management tools that make it easy for IT to deploy/maintain ✓ Cloudera Manager/Director
  8. ACME Occupancy Detection
     Predicting-room-occupancy-from-environmental-sensors-As A Service
     github.com/srowen/cdsw-simple-serving
  10. Three Key Roles
      • Data Engineering: Ingest sensor data at scale. Store and secure data. Clean and transform data for analysis.
      • Data Science: Explore data and build predictive model, offline. Evaluate and tune the model. Develop modeling pipeline and deliver models.
      • Model Deployment: Verify and approve model for deployment. Create and maintain model APIs. Update models in production.
  11. • Manages ingest of raw CSV data to HDFS
      • Writes Scala Spark code to ETL the data
      • Uses an IDE
      • Checks code into git
      • Adds code to Maven project
  12. Raw CSV input:
        "date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"
        "1","2015-02-04 17:51:00",23.18,27.272,426,721.25,0.00479298817650529,1
        "2","2015-02-04 17:51:59",23.15,27.2675,429.5,714,0.00478344094931065,1
        "3","2015-02-04 17:53:00",23.15,27.245,426,713.5,0.00477946352442199,1
        "4","2015-02-04 17:54:00",23.15,27.2,426,708.25,0.00477150882608175,1
      ETL code (Scala, Spark):
        import spark.implicits._  // encoder needed to map over a Dataset[String]

        // Keep the header line; strip the unnamed row-index field from every data line
        spark.read.textFile(rawInput).
          map { line =>
            if (line.startsWith("\"date\"")) {
              line
            } else {
              line.substring(line.indexOf(',') + 1)
            }
          }.
          repartition(1).
          write.text(csvInput)

        // Parse the cleaned CSV with an inferred schema, then drop the timestamp column
        spark.read.
          option("inferSchema", true).
          option("header", true).
          csv(csvInput).
          drop("date")
      Resulting table:
        Temperature   Humidity   Light   CO2      Humidity Ratio   Occupancy
        23.18         27.272     426     721.25   0.00479          1
        23.15         27.2675    429.5   714      0.00478          1
        23.15         27.245     426     713.5    0.00477          1
        23.15         27.2       426     708.25   0.00477          1
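      The line-level cleanup on slide 12 can be tried without a cluster. The following is a minimal sketch in plain Scala (no Spark), assuming the same rule as the slide's code: the header line is kept, and every data line loses its first, unnamed row-index field. The names `CsvCleanup` and `stripRowIndex` are hypothetical and do not come from the talk or its repository.

```scala
object CsvCleanup {
  // Keep the header line unchanged; for data lines, drop everything
  // up to and including the first comma (the quoted row index).
  def stripRowIndex(line: String): String =
    if (line.startsWith("\"date\"")) line
    else line.substring(line.indexOf(',') + 1)

  def main(args: Array[String]): Unit = {
    // Two lines copied from the sample on slide 12
    val raw = Seq(
      "\"date\",\"Temperature\",\"Humidity\",\"Light\",\"CO2\",\"HumidityRatio\",\"Occupancy\"",
      "\"1\",\"2015-02-04 17:51:00\",23.18,27.272,426,721.25,0.00479298817650529,1"
    )
    raw.map(stripRowIndex).foreach(println)
  }
}
```

      After this step the header's seven names line up with seven fields per row, which is what lets the later `option("header", true)` read parse the file.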
  13. • Builds, evaluates and tunes predictive models
      • Builds visualizations
      • Writes Scala, Python or R Spark code to model using MLlib, etc.
      • Uses Cloudera Data Science Workbench or similar
      • Checks code, PMML model into git
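  14. Input table:
        Temperature   Humidity   Light   CO2      Humidity Ratio   Occupancy
        23.18         27.272     426     721.25   0.00479          1
        23.15         27.2675    429.5   714      0.00478          1
        23.15         27.245     426     713.5    0.00477          1
        23.15         27.2       426     708.25   0.00477          1
      Modeling code (Scala, Spark MLlib):
        import org.apache.spark.ml.Pipeline
        import org.apache.spark.ml.classification.LogisticRegression
        import org.apache.spark.ml.feature.VectorAssembler

        // Combine every column except the label into a single feature vector
        val assembler = new VectorAssembler().
          setInputCols(training.columns.filter(_ != "Occupancy")).
          setOutputCol("featureVec")

        // Logistic regression on the assembled features, predicting Occupancy
        val lr = new LogisticRegression().
          setFeaturesCol("featureVec").
          setLabelCol("Occupancy").
          setRawPredictionCol("rawPrediction")

        val pipeline = new Pipeline().setStages(Array(assembler, lr))
      Model: LogisticRegression [regParam=0.01]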
  15. (Demo)
  19. • Validates PMML model and deploys to production
      • Uses continuous integration like Travis CI
      • Maintains REST API via OpenScoring
      • Uses an IDE
      • Checks code into git
  20. Input table:
        Temperature   Humidity   Light   CO2      Humidity Ratio   Occupancy
        23.18         27.272     426     721.25   0.00479          1
        23.15         27.2675    429.5   714      0.00478          1
        23.15         27.245     426     713.5    0.00477          1
        23.15         27.2       426     708.25   0.00477          1
      PMML model:
        <PMML version="4.3" xmlns="http://www.dmg.org/PMML-4_3">
          …
          <RegressionModel functionName="classification" normalizationMethod="softmax">
            …
            <RegressionTable intercept="16.121752149952" targetCategory="1">
              <NumericPredictor name="Temperature" coefficient="-1.239411520229105"/>
              <NumericPredictor name="Humidity" coefficient="0.040079547154413746"/>
              <NumericPredictor name="Light" coefficient="0.020182888698828436"/>
              <NumericPredictor name="CO2" coefficient="0.0060762157896669"/>
              <NumericPredictor name="HumidityRatio" coefficient="-500.42306896474247"/>
            </RegressionTable>
            …
          </RegressionModel>
        </PMML>
      REST endpoint:
        POST /model/occupancy
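      To make the PMML regression table on slide 20 concrete, here is a minimal sketch in plain Scala of the arithmetic it encodes. It copies the intercept and coefficients shown for `targetCategory="1"`. The slide elides the rest of the model, so the sketch assumes the table for the other category contributes a score of 0, in which case softmax over the two scores reduces to the logistic function. That assumption, and the names `OccupancyScore`, `linearPredictor` and `probabilityOccupied`, are illustrative only. A PMML evaluator such as the one behind OpenScoring works from the full document, including any parts not shown here.

```scala
object OccupancyScore {
  // Values copied from the RegressionTable for targetCategory="1" on slide 20
  val intercept: Double = 16.121752149952
  val coefficients: Map[String, Double] = Map(
    "Temperature"   -> -1.239411520229105,
    "Humidity"      -> 0.040079547154413746,
    "Light"         -> 0.020182888698828436,
    "CO2"           -> 0.0060762157896669,
    "HumidityRatio" -> -500.42306896474247
  )

  // Linear predictor: intercept + sum over predictors of coefficient * value
  def linearPredictor(features: Map[String, Double]): Double =
    intercept + coefficients.iterator.map { case (name, c) => c * features(name) }.sum

  // ASSUMPTION: the elided table for the other category scores 0, so
  // softmax over (z, 0) equals the logistic function 1 / (1 + e^(-z)).
  def probabilityOccupied(features: Map[String, Double]): Double =
    1.0 / (1.0 + math.exp(-linearPredictor(features)))

  def main(args: Array[String]): Unit = {
    // First data row of the sample CSV on slide 12
    val row = Map(
      "Temperature"   -> 23.18,
      "Humidity"      -> 27.272,
      "Light"         -> 426.0,
      "CO2"           -> 721.25,
      "HumidityRatio" -> 0.00479298817650529
    )
    println(f"P(Occupancy = 1) under the stated assumption: ${probabilityOccupied(row)}%.4f")
  }
}
```

      Because parts of the PMML are elided, the printed value illustrates the calculation only and should not be read as the deployed model's output.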
  21. (Demo)
  26. github.com/srowen/cdsw-simple-serving
  27. A conference for and by practicing data scientists
      Save the Date: July 20th at the Chapel
      Wrangle is a one-day, single-track community event that hosts the best and brightest in the Bay Area talking about the principles, practice, and application of Data Science across multiple data-rich industries. Join Cloudera to discuss future trends, how they can be predicted, and most importantly, how they can be anticipated.
      wrangleconf.com
  28. Thank you
