
From Insight to Action: Using Data Science to Transform Your Organization


  1. From Insight to Action: Using Data Science to Transform Your Organization. Rob Morrow, Chief Technologist, US Government. © Cloudera, Inc. All rights reserved.
  2. Deploy on any cloud infrastructure. Cloudera Director: management for IaaS-related and CDH cluster operations.
     Easy Administration
     • Dynamic cluster lifecycle management
     • ICD-503 support
     • Single pane of glass: multi-cluster view
     • Consumption-based billing and metering
     Enterprise-grade
     • Integration across Cloudera Enterprise
     • Management of CDH deployments at scale
     Flexible Deployments
     • No cloud vendor lock-in: open plugin framework for IaaS platforms
     • Scaling of provisioned clusters
     • Spot instance provisioning
  3. Enterprise Data Science Topics
     • It took Todd Lipcon 3 years to create Kudu, after 10 years of work learning and gaining trust in the open source community as a committer.
     • Government of the future: value created through interesting methods.
     • If your organization is already good at the 5,000 open source algorithms (regression, etc.), you now need a Data Science Cadre.
     • Open Source: Help Wanted.
     • Methods, not raw data.
     • Most problems are not really Data Science “Challenges”.
  4. Data Engineering and Data Science Workloads
     Data Ingestion (Kafka, Navigator, Search): Cloudera enables users to build real-time, end-to-end data pipelines to power their business. Leadership in Apache Spark and Kafka has made Cloudera a trusted resource for users who want to capture real-time, streaming, and time series data without gaps in security.
     Data Processing (Spark, Hive): Cloudera is helping users accelerate their data pipelines with leadership in technologies like Apache Spark. Data processing in Cloudera Enterprise can help cut processing windows from hours to minutes and enables faster access to data for a variety of users and skill sets.
     Data Science (Spark MLlib): Cloudera is bringing the most popular data science languages and libraries to its platform for easier collaboration, self-service exploration, and implementation at scale. Cloudera is advancing the state of distributed machine learning, enabling exploratory data science and the delivery of robust data products.
  5. Closing Gaps in Critical Skills Areas in the Govt
     Data Science: High Value, Low Frequency
     • Only a small set of problems (~5%) require direct Data Science expertise
     • Domain-general, algorithm-specific
     • Very high expertise
     Characterized by
     • Spark/Python expertise
     • Advanced algorithms
     • Hypothesis testing
     Automation/Workload
     • Per-task/algorithm automation
     Data Analysis: High Frequency, Self-Service
     • The “other” 95% of problems
     • More domain-specific
     Characterized by
     • Tools with UIs (DataRobot)
     • “Exploratory” data investigation
     Automation/Workload
     • Easily automated
     Data Science “Unicorns” are even more valuable in the Govt. So how do you scale them out?
  6. Two Data Science Use Cases: improving decisions vs. improving products
     Decision Science (improving business decisions)
     • User: Data scientists and analysts
     • Data: New and changing; often sampled
     • Environment: Local machine, sandbox cluster
     • Tools: R, Python, SAS/SPSS, SQL; notebooks; data wrangling/discovery tools, …
     • Goal: Understand data, develop and improve models, share results
     • Production: Hosted/scheduled reports or dashboards
     Data Products (improving products for customers)
     • User: Data engineers, developers, SREs
     • Data: Known data; full scale
     • Environment: Production clusters
     • Tools: Java/Scala, C++; IDEs; continuous integration, source control, …
     • Goal: Build and maintain applications, improve model performance, manage models in production
     • Production: Online applications
  7. Ingest: The Foundation of Hadoop’s Potential
     Data can come from a variety of “siloed” sources
     ▪ Existing databases
     ▪ Sensor data
     ▪ Server logs
     ▪ Chat transcripts
     Value of data is multiplied when combined and correlated with other data
     ▪ “40% value improvement from combining data from multiple IoT sources” (McKinsey Global Institute)
  8. Data Processing: Leverage the right processing for your job
     Data may require unique processing characteristics
     ▪ Batch
     ▪ Streaming
     ▪ Real-time
     Hadoop arose to address one, and the ecosystem has since evolved to answer the rest.
     ▪ “We’re doubling down on Spark. We invested earliest, and we’ve invested most, in making Hadoop enterprise-grade” (Mike Olson)
  9. Data Science: A Unified Platform to Accelerate Data Science from Exploration to Production
     Data scientists need to use data to…
     ▪ Explore
     ▪ Model
     ▪ Test
     The field of data science blends math and statistics knowledge with advanced computer knowledge.
     ▪ “Data Scientist: Person who is better at statistics than any software engineer and better at software engineering than any statistician” (Josh Wills)
  10. MLlib: Collection of mainstream machine learning algorithms built on Spark, including:
     • Classifiers: logistic regression, boosted trees, random forests, etc.
     • Clustering: k-means, Latent Dirichlet Allocation (LDA)
     • Recommender Systems: Alternating Least Squares
     • Dimensionality Reduction: Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
     • Feature Engineering & Selection: TF-IDF, Word2Vec, Normalizer, etc.
     • Statistical Functions: Chi-Squared Test, Pearson Correlation, etc.
  11. Data Science Track Info (Location: Severn)
     • Matrix Decomposition at Scale. Juliet Hougland, Data Scientist, Cloudera
     • Large-scale Agent-Based Modeling and Simulation on High-Performance Computers. Dr. Robert Axtell, George Mason University
     • Random Decision Forests at Scale. Todd Boetticher, Solutions Consultant, Cloudera
  12. Recommended Training for Data Engineering
     • Learn how to identify which tool is the right one to use in a given situation, and gain hands-on experience using those tools.
     • Cloudera University’s three-day course helps participants understand what data scientists do, the problems they solve, and the tools and techniques they use.
     • Learn how to increase the ROI from big data investments by delivering faster time to insight for your organization.
     Courses: Apache Spark and Hadoop; Data Science on Hadoop; Cloudera Search
  13. Thank you
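
The MLlib slide lists Pearson correlation among its statistical functions. As a rough illustration of what that statistic measures, here is a minimal single-machine sketch in plain Python. It is not MLlib's API and does nothing distributed; the function name `pearson_correlation` and the sample values are hypothetical, chosen only for this example.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration only; this is not an MLlib call.
    """
    if len(xs) != len(ys):
        raise ValueError("both sequences must have the same length")
    if len(xs) < 2:
        raise ValueError("at least two observations are required")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cov / math.sqrt(var_x * var_y)


# Made-up sample data: the second series is exactly twice the first,
# so the two are perfectly positively correlated.
sensor_a = [1.0, 2.0, 3.0, 4.0]
sensor_b = [2.0, 4.0, 6.0, 8.0]
r = pearson_correlation(sensor_a, sensor_b)
```

A coefficient near 1 or -1 indicates a strong linear relationship and a value near 0 indicates little linear relationship. On a cluster the same quantity would be computed over partitioned data rather than in-memory lists.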