Data Science as a Service: Intersection of Cloud Computing and Data Science

Dr. Pouria Amirian explains data science, walks through the steps in a data science workflow, and shows some experiments in AzureML. He also discusses big data issues in a data science project and solutions to them.

  1. Data Science as a Service: Intersection of Cloud Computing and Data Science. Dr. Pouria Amirian (Pouria.Amirian@ndm.ox.ac.uk), Big Data Project Coordinator, The Global Health Network, University of Oxford
  2. Outline
     - Data Science
       - What data science is
       - Steps in a data science project
     - Experiments
       - Using AzureML
     - Big Data issues
       - In a data science project
       - Methods in analysis
  3. What is Data Science?
     - Practice of obtaining useful insights from data
     - 3 Vs of Big Data: Volume, Variety, Velocity (+ other Vs)
     - It applies to large-volume data (volume)
     - It applies to semi-structured and unstructured data (variety)
     - It sometimes applies to real-time or fast-changing data (velocity)
     - It also applies to small and traditional static data
  4. Data Science as a team sport: Math, Statistical Learning, Linguistics, Machine Learning, Signal Processing, Programming, Storage/Data Structure, Operations Research, Distributed and High Performance Computing
  5. Data Science from an analytics point of view
     - Analytics spectrum:
       - Descriptive: What happened?
       - Diagnostic: Why did it happen?
       - Predictive: What will happen?
       - Prescriptive: What should I do?
  6. Data Science vs. Business Intelligence
     - Analytics spectrum: Descriptive (What happened?), Diagnostic (Why did it happen?), Predictive (What will happen?), Prescriptive (What should I do?)
     - Traditional BI (label on the spectrum diagram)
  7. Why is it so popular? Why does it matter?
     - A) More available and usable data
     - McKinsey: Organizations that use data science to make decisions are more productive and deliver higher ROI
     - Gartner: Organizations that invest in modern data infrastructure will outperform their peers by up to 20%
  8. Why is it so popular? Why does it matter?
     - B) Increased awareness of machine learning techniques
     - A subset of machine learning algorithms is now more widely understood, since these algorithms have been tried and tested by early adopters such as Netflix and Amazon (recommendation engines).
     - While many people may not know the details of the algorithms used, they increasingly understand their research and business value.
  9. Why is it so popular? Why does it matter?
     - C) More accurate analysis
     - The large volumes of data being collected also enable you to build more accurate predictive models.
     - The larger the sample size, the smaller the margin of error. This in turn increases the accuracy of predictions from your model.
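As a rough illustration of the sample-size point on slide 9, the sketch below computes the approximate 95% margin of error for an estimated proportion using the normal approximation z * sqrt(p(1 - p) / n). The function name, the worst-case p = 0.5, and the sample sizes are illustrative assumptions and do not come from the talk.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion p estimated from n samples.

    Uses the normal approximation; z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Illustrative sample sizes (assumed, not from the slides).
    for n in (100, 1_000, 10_000, 100_000):
        print(f"n={n:>7}: +/- {margin_of_error(n):.4f}")
```

The margin shrinks with the square root of n, so 100 times more data gives a margin about 10 times smaller. A larger sample reduces sampling error only; it does not correct for biased data collection.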
  10. Why is it so popular? Why does it matter?
      - D) Faster and cheaper computation
      - Today, a smartphone's processor is up to five times more powerful than that of a desktop computer 20 years ago.
      - The price of computation has decreased
      - The capacity of computation has increased
      - Dramatic gains in technology, productivity, innovation, etc.
  11. The Data Science Workflow
      - Steps: Problem Definition -> Data Collection and Preparation -> Model Development -> Model Deployment -> Performance Improvement
      - Annotations on the diagram: Critical, Very Important, Time Consuming, Fun :D, Iterative, Cumbersome :(, Critical
  12. The Data Science Workflow
      - Steps: Problem Definition -> Data Collection and Preparation -> Model Development -> Model Deployment -> Performance Improvement
      - Details shown against the steps, in slide order:
        - Domain knowledge
        - Separation of concerns
        - Prioritize each problem
        - Selection of the right data
        - Data transformation
        - Missing values
        - Exploratory analysis
        - Right algorithm
        - Test accuracy
        - Test other algorithms
        - Validate
        - Turning the data scientist's model into developer code (R to C#)
        - Monitor the performance of the deployed model
        - Re-training the model
        - Re-deploying the model
        - Re-monitoring
  13. The Data Science Workflow: Big Data Issues (I)
      - Steps: Problem Definition -> Data Collection and Preparation -> Model Development -> Model Deployment -> Performance Improvement
  14. Solutions to overcome the big data issues
      - 1. Use advanced research computing (http://www.arc.ox.ac.uk/)
  15. Solutions to overcome the big data issues
      - 2. Create and use a Hadoop cluster
      - Open source (Apache)
      - It is based on two components: HDFS and MapReduce
  16. MapReduce
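The MapReduce slide is a diagram only, so here is a minimal single-process sketch of the idea using the classic word-count example. It imitates the map, shuffle, and reduce phases in plain Python; it does not use Hadoop, and the function names and sample input are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory collection of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Sample input is made up for this sketch.
    print(word_count(["big data", "data science as a service"]))
```

In a real cluster the map and reduce functions run in parallel on different machines, and the framework performs the shuffle across the network; only the two user-supplied functions resemble what is shown here.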
  17. HortonWorks
  18. Cloudera
  19. MapR
  20. Open source won't prevent vendor lock-in!
  21. Third solution: Microsoft's cloud computing
  22. AzureML (Azure Machine Learning)
      - Azure ML provides an easy-to-use and powerful set of cloud-based data transformation and machine learning tools.
      - AzureML Studio (or Studio for short)
      - It has many modules for data transformation, analysis, visualization, ...
      - It supports R and Python
      - It is under heavy development
      - www.studio.azureml.net
  23. AzureML Workflow: Data Input -> Data Transformation (Project) -> Split Data (training and test) -> Learning Algorithm -> Train the Learning Algorithm -> Validate the Algorithm (Score) -> Evaluate Model Performance
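The workflow on slide 23 can be sketched outside the Studio as a chain of small functions, one per stage. This is a plain-Python stand-in and not AzureML code: the data is synthetic (made up for illustration, not the talk's automobile dataset), the learning algorithm is assumed to be a one-feature least-squares line, and all function names are hypothetical.

```python
import math
import random


def load_data(n=200, seed=0):
    """Data input: synthetic rows, invented for this sketch."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        engine_size = rng.uniform(60, 330)
        price = 5000 + 90 * engine_size + rng.gauss(0, 1500)
        rows.append({"id": i, "engine_size": engine_size, "price": price})
    return rows


def project(rows, feature, label):
    """Data transformation: keep only the feature and label columns."""
    return [(row[feature], row[label]) for row in rows]


def split(pairs, train_fraction=0.75, seed=0):
    """Split data into a training set and a test set."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(pairs):
    """Train: fit y = intercept + slope * x by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def score(model, pairs):
    """Score: apply the trained model to held-out rows."""
    intercept, slope = model
    return [intercept + slope * x for x, _ in pairs]


def evaluate(predictions, pairs):
    """Evaluate: root mean squared error on the held-out rows."""
    squared = [(p - y) ** 2 for p, (_, y) in zip(predictions, pairs)]
    return math.sqrt(sum(squared) / len(squared))


def run_experiment():
    pairs = project(load_data(), "engine_size", "price")
    train_set, test_set = split(pairs)
    model = train(train_set)
    rmse = evaluate(score(model, test_set), test_set)
    return model, rmse, len(train_set), len(test_set)


if __name__ == "__main__":
    (intercept, slope), rmse, n_train, n_test = run_experiment()
    print(f"train={n_train} test={n_test} slope={slope:.1f} rmse={rmse:.0f}")
```

Keeping scoring and evaluation on rows the model never saw during training is the point of the split stage; evaluating on the training rows would overstate accuracy.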
  24. First experiment: predicting the price of a car (experiment name: AutomobileFullModuleModel02-03-2015)
  25. Second experiment: using R in ML Studio (experiment name: AutomobileRTransformation02-03-2015)
  26. Third experiment: comparing two models (experiment name: AutomobileFullModuleTwoModel02-03-2015)
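The transcript does not say which two models the third experiment compares. As a generic sketch of the idea, the code below scores two assumed candidates, a mean-value baseline and a least-squares line, on the same held-out rows and reports the error of each. The (engine size, price) pairs are invented for illustration.

```python
import math

# Made-up (engine_size, price) pairs; not the talk's dataset.
TRAIN = [(70, 6500), (90, 7800), (110, 10200), (130, 12900),
         (150, 15100), (180, 18500), (210, 24000), (260, 31000)]
TEST = [(100, 9000), (140, 13500), (200, 22000)]


def fit_baseline(pairs):
    """Model A: always predict the mean training price."""
    mean_y = sum(y for _, y in pairs) / len(pairs)
    return lambda x: mean_y


def fit_line(pairs):
    """Model B: least-squares line through the training pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def rmse(model, pairs):
    """Root mean squared error of a model on the given pairs."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in pairs) / len(pairs))


def compare(train, test):
    """Fit both models on the same training rows and score both on the same test rows."""
    models = {"baseline": fit_baseline(train), "line": fit_line(train)}
    return {name: rmse(model, test) for name, model in models.items()}


if __name__ == "__main__":
    for name, error in compare(TRAIN, TEST).items():
        print(f"{name}: RMSE = {error:.0f}")
```

Both models must see identical training and test rows; otherwise a difference in error could come from the split and not from the models.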
  27. Fourth experiment: creating a web service
      - Very easy, just a few clicks!
      - Make: bmw
      - Engine-size: 164
      - Horse-power: 121
      - Highway-mpg: 25
      - Its actual price is 24,565
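Once an experiment is published, a client sends the input values from slide 27 as an HTTP request. The sketch below only builds such a request and does not send it. The URL and API key are placeholders, and the payload layout (`Inputs`, `input1`, `ColumnNames`, `Values`, `GlobalParameters`) is an assumed shape; the sample code on the deployed service's own help page is the authoritative source for the real format.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with the values shown for your own service.
SERVICE_URL = "https://example.invalid/score"
API_KEY = "YOUR-API-KEY"


def build_request(features: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying one row of features.

    The payload layout is an assumption made for this sketch.
    """
    payload = {
        "Inputs": {
            "input1": {
                "ColumnNames": list(features.keys()),
                "Values": [[str(value) for value in features.values()]],
            }
        },
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        SERVICE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_request(
        {"make": "bmw", "engine-size": 164, "horse-power": 121, "highway-mpg": 25}
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To call a real deployed service: urllib.request.urlopen(request)
```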
  28. Tips
      - Data input can come from a variety of data interfaces, including HTTP connections (any file-sharing service such as Dropbox, Google Drive, or OneDrive), SQL Azure, and Hive Query.
      - You can use functionality in all supported R packages (410)
      - You can write your own utility functions and upload them as another module
      - It is under heavy development
        - Two weeks ago the process for web service publication changed
        - Two months ago there was no support for Python
        - Two months ago around 400 R packages were supported
        - ...
  29. Big Data Issues (II)
      - High-dimensional data or wide data
      - Using various methods requires knowledge of those methods
      - Traditional methods are not efficient enough (unstable)
      - Least squares, for example
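One way to see the instability of least squares mentioned on slide 29 is with two nearly identical feature columns, a situation that wide data makes common. In the sketch below a change of at most 0.01 in the targets moves the ordinary least-squares coefficients from about (1, 1) to about (-9, 11). Ridge regression is not mentioned in the talk; it appears here only as a familiar point of comparison, and the data and penalty value are invented.

```python
def fit(rows, targets, ridge=0.0):
    """Solve the 2-feature normal equations (no intercept), with an optional ridge penalty."""
    a = sum(x1 * x1 for x1, _ in rows) + ridge
    b = sum(x1 * x2 for x1, x2 in rows)
    d = sum(x2 * x2 for _, x2 in rows) + ridge
    e = sum(x1 * y for (x1, _), y in zip(rows, targets))
    f = sum(x2 * y for (_, x2), y in zip(rows, targets))
    det = a * d - b * b  # close to zero when the two columns are nearly collinear
    return (d * e - b * f) / det, (a * f - b * e) / det


def max_shift(coef_a, coef_b):
    """Largest absolute change in any coefficient between two fits."""
    return max(abs(p - q) for p, q in zip(coef_a, coef_b))


# Made-up data: the second column is the first plus a tiny wiggle.
ROWS = [(1.0, 1.001), (2.0, 1.999), (3.0, 3.001), (4.0, 3.999), (5.0, 5.001)]
CLEAN = [x1 + x2 for x1, x2 in ROWS]
# Perturb each target by 10 * (x2 - x1), which is at most 0.01 in size.
NOISY = [y + 10 * (x2 - x1) for (x1, x2), y in zip(ROWS, CLEAN)]

if __name__ == "__main__":
    ols_clean, ols_noisy = fit(ROWS, CLEAN), fit(ROWS, NOISY)
    ridge_clean, ridge_noisy = fit(ROWS, CLEAN, ridge=0.1), fit(ROWS, NOISY, ridge=0.1)
    print("least squares:", ols_clean, "->", ols_noisy)
    print("ridge (0.1):  ", ridge_clean, "->", ridge_noisy)
    print("shift, least squares:", max_shift(ols_clean, ols_noisy))
    print("shift, ridge:        ", max_shift(ridge_clean, ridge_noisy))
```

The penalty trades a small bias for much lower sensitivity to noise, which is the general motivation for regularized methods on high-dimensional data.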
  30. Advantages of AzureML
      - Solutions can be quickly deployed as web services.
      - Models run in a highly scalable cloud environment.
      - You can use the R and Python languages for solution-specific functionality.
      - It generates minimal code for consuming the web service in R and Python (and C#).
      - It can be run from anywhere.
  31. "Big Data is not about Data. The value in big data is in Analytics." (Gary King). Thanks for your attention. Time for Q&A.
