Introduction to Data Science

  1. Data Science: Introduction to Data Science
  2. How It Works: live online class; class recording in the LMS; 24/7 post-class support; module-wise quiz; project work on a large database; verifiable certificate. (www.edureka.in/data-science)
  3. Topics for the Day
     - Big Data
     - Big Data Scenarios
     - Big Data Challenges
     - Introduction to Data Science
     - Data Science: Components
     - Types of Data Scientists
     - Data Science: Core Components
     - Use-Cases
     - Introduction to Hadoop and R
     - R and Hadoop Integration
     - Machine Learning with Mahout
     - References
  4. Objectives: At the end of this module, you will be able to
     - Understand Big Data and its challenges
     - Implement Big Data in real-time scenarios
     - List and explain the components and prospects of Data Science
     - Learn the implementation of Hadoop on Big Data
     - Analyze some real-world use-cases with the help of the R programming language
     - Understand machine learning concepts
  5. Data Science
  6. Big Data
  7. What is Big Data? Lots of data (terabytes or petabytes). Systems and enterprises generate huge amounts of data, from terabytes up to petabytes of information. Source: http://www.today.mccombs.utexas.edu/2012/04/the-big-data-machine
  8. Big Data Scenarios. Source: http://www.clker.com/clipart-13967.html
  9. Big Data Scenarios: Sports. Source: http://www.espncricinfo.com/
  10. Big Data Scenarios: Sports. Sports teams use data to track ticket sales and even team strategies. Advertising and marketing agencies track social media to understand responsiveness to campaigns, promotions, and other advertising media. Source: http://www.espncricinfo.com/
  11. Big Data Scenarios: Hospital Care. Source: http://www.majorprojects.vic.gov.au/our-projects/our-past-projects/austin-hospital
  12. Big Data Scenarios: Hospital Care. Hospitals analyze medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge. The hospital can then intervene in the hope of preventing another costly hospital stay. A medical diagnostics company analyzed millions of lines of data to develop the first non-intrusive test for predicting coronary artery disease. To do so, researchers at the company analyzed over 100 million gene samples to ultimately identify the 23 primary predictive genes for coronary artery disease.
  13. Big Data Scenarios: Amazon.com. Source: http://wp.streetwise.co/wp-content/uploads/2012/08/Amazon-Recommendations.png
  14. Big Data Scenarios: Amazon.com. Amazon has an unrivalled bank of data on online consumer purchasing behaviour that it can mine from its 152 million customer accounts. Amazon also uses Big Data to monitor, track, and secure the 1.5 billion items in its retail store, which are held in its 200 fulfilment centres around the world. Amazon stores the product catalogue data in S3. S3 can write, read, and delete objects of up to 5 TB each. The catalogue stored in S3 receives more than 50 million updates a week, and every 30 minutes all data received is crunched and reported back to the different warehouses and the website. Source: http://wp.streetwise.co/wp-content/uploads/2012/08/Amazon-Recommendations.png
  15. Big Data Scenarios: Netflix. Source: http://smhttp.23575.nexcesscdn.net/80ABE1/sbmedia/blog/wp-content/uploads/2013/03/netflix-in-asia.png
  16. Big Data Scenarios: Netflix. Netflix uses 1 petabyte to store the videos for streaming. BitTorrent Sync has transferred over 30 petabytes of data since its pre-alpha release in January 2013. The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of the 3D CGI effects. One petabyte of average MP3-encoded songs (for mobile, roughly one megabyte per minute) would require about 2,000 years to play. Source: http://smhttp.23575.nexcesscdn.net/80ABE1/sbmedia/blog/wp-content/uploads/2013/03/netflix-in-asia.png
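The playback figure on the Netflix slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes decimal units (1 PB = 10^9 MB), the slide's rate of one megabyte per minute, and a 365.25-day year. The function name `playback_years` is made up for this example.

```python
# Rough check of the "one petabyte of MP3s" claim (assumptions: decimal
# units, 1 MB of audio per minute, 365.25 days per year).
MB_PER_PETABYTE = 10**9
MINUTES_PER_YEAR = 60 * 24 * 365.25


def playback_years(total_mb: float, mb_per_minute: float = 1.0) -> float:
    """Return how many years it takes to play total_mb of audio."""
    minutes = total_mb / mb_per_minute
    return minutes / MINUTES_PER_YEAR


years = playback_years(MB_PER_PETABYTE)
print(f"{years:.0f} years")  # roughly 1,900 years, i.e. on the order of 2,000
```

Under these assumptions the result lands a little above 1,900 years, which agrees with the slide's rounded figure.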
  17. Big Data Scenarios: The Large Hadron Collider. Source: http://www.crowdsourcing.org/article/-nasa-tries-to-free-creativity-with-big-data-challenge/19984
  18. Big Data Scenarios: The Large Hadron Collider. The experiments in the Large Hadron Collider produce about 15 petabytes of data per year, which are distributed over the Worldwide LHC Computing Grid. One petabyte is enough to store the DNA of the entire population of the USA, and then clone it twice. Source: http://en.wikipedia.org/wiki/Large_Hadron_Collider
  19. IBM's Definition: Big Data Characteristics. Volume, Velocity, Variety. Example data sources: web logs, audio, images, videos, sensor data. Source: http://www-01.ibm.com/software/data/bigdata/
  20. IBM's Definition: Big Data Characteristics. The 3 Vs of Big Data:
     - Volume: terabytes; records; transactions; tables, files
     - Velocity: batch; near time; real time; streams
     - Variety: structured; unstructured; semi-structured; all of the above
     Source: http://www-01.ibm.com/software/data/bigdata/
  21. What about "Veracity"? Source: http://whatsthebigdata.files.wordpress.com/2013/11/batman-on-big-data.jpg
  22. Annie's Introduction. Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
  23. Annie's Question. Map each of the following to its type (structured, unstructured, or semi-structured):
     - XML files
     - Word docs, PDF files, text files
     - E-mail body
     - Data from enterprise systems (ERP, CRM, etc.)
  24. Annie's Answer.
     - XML files: semi-structured data
     - Word docs, PDF files, text files: unstructured data
     - E-mail body: unstructured data
     - Data from enterprise systems (ERP, CRM, etc.): structured data
  25. Big Data: Challenges. Source: http://spinnakr.com/blog/wp-content/uploads/2013/08/Using-Big-Data-.jpg
  26. Big Data: Challenges
     - Data security and privacy
     - High variety of information
     - High veracity of data
     - Data acquisition
     - High velocity of processed data
     - Information search and analytics
     - High volume of data
     - Information storage and analytics
  27. (Cartoon.) Source: http://thesocietypages.org/sociologylens/files/2013/09/BIgDataDilbert_Cartoon.jpg
  28. Data Science. Source: http://escience.washington.edu/blog/uw-berkeley-nyu-collaborate-378m-data-science-initiative
  29. Data Science. "More data usually beats better algorithms." Example: recommending movies or music based on past preferences.
  30. Data Science. No matter how sophisticated your algorithm is, it can often be beaten simply by having more data (and a less sophisticated algorithm). Good news: Big Data is here. Bad news: we are struggling to store and analyze it.
  31. Data Science: Components. Source: http://abstrusegoose.com/55
  32. Data Science: Components. Visualization, advanced computing, domain expertise, statistics, data engineering.
  33. Data Science: Prospects
  34. Types of Data Scientists. Clustering the ways in which Data Scientists handle data yields the following four categories:
     - Data Businesspeople are the product- and profit-focused data scientists. They are leaders, managers, and entrepreneurs, but with a technical bent. A common educational path is an engineering degree paired with an MBA.
     - Data Creatives are eclectic jacks-of-all-trades, able to work with a broad range of data and tools. They may think of themselves as artists or hackers, and excel at visualization and open source technologies.
     - Data Developers focus on writing software for analytic, statistical, and machine learning tasks, often in production environments. They often have computer science degrees and often work with so-called "big data".
     - Data Researchers apply their scientific training, and the tools and techniques they learned in academia, to organizational data. They may have PhDs, and their creative applications of mathematical tools yield valuable insights and products.
     Source: http://datacommunitydc.org/blog/2013/06/there-is-more-than-one-kind-of-data-scientist/
  35. Relationships: Four Categories and the Five Skill Groups. Source: http://datacommunitydc.org/blog/wp-content/uploads/2012/08/SkillsSelfIDMosaic-edit-500px.png
  36. Data Science: Core Components
     - Data architecture (tool: Hadoop)
     - Machine learning (tool: Mahout)
     - Analytics (tool: R)
  37. Use-Cases
  38. No One Knows How to Use It
  39. Use-Case Implementation: Techniques Used. A problem, a dataset, analysis, results.
  40. Use-Case Implementation: Process Flow Diagram
     - Understanding the problem statement and defining the solution
     - Understanding the Machine Learning algorithm to be used
     - Implementing the Machine Learning algorithm in R on the smaller dataset
     - Exploring ways to integrate R with Hadoop
     - Implementing Machine Learning in Hadoop on Big Data
     - Visualisation of the analysis
  41. Media Use-Case. Domain of the dataset: Communications and Media. The algorithm is not limited to this domain, however; the technique is useful in any domain that requires organizing documents to improve retrieval and support browsing. Problem statement: a top media company wants to browse through the popular news from a collection that appeared on the Reuters newswire in 1987. Clustering (grouping) the documents based on their contents will make the analysis easier. Figure: the Reuters-21578 data set composition.
  42. Media Use-Case: K-means Clustering
     - Dataset: Communications and Media documents, to be clustered based on their contents
     - Machine Learning implementation: K-means clustering can be implemented on this dataset
     - R implementation: first we will understand the implementation of the technique in R on a smaller dataset
     - Hadoop implementation: then we will understand how to achieve document clustering on Big Data using Mahout libraries on Hadoop
     - Result: documents clustered (grouped) by content
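To make the idea of clustering documents by content concrete, here is a minimal K-means sketch in plain Python. It is not the course's R or Mahout implementation: the four "documents" are made-up term-count vectors over a hypothetical vocabulary, and the first `k` points are used as starting centroids to keep the example deterministic.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=10):
    """Tiny K-means: returns (cluster index per point, final centroids)."""
    centroids = [tuple(p) for p in points[:k]]  # naive, deterministic init
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


# Hypothetical term counts per document for the words: oil, price, wheat, crop
docs = [(3, 2, 0, 0), (0, 0, 2, 3), (2, 3, 0, 1), (0, 1, 3, 2)]
assignments, centroids = kmeans(docs, k=2)
print(assignments)  # documents 0 and 2 share one cluster, 1 and 3 the other
```

A real document-clustering pipeline would first turn text into weighted term vectors and would choose starting centroids more carefully; this sketch only shows the assign-and-update loop.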
  43. Market Basket Use-Case. Domain of the dataset: Products and Retail. The algorithm is not limited to this domain, however; the technique can be applied wherever we want to discover co-occurrence relationships among activities. Problem statement: Market Basket Analysis. A retail outlet wants to understand the purchase behavior of a buyer. This information will enable the retailer to understand the buyer's needs. The analysis might tell a retailer that customers often purchase shampoo and conditioner together, so putting both items on promotion at the same time would not create a significant increase in profit, while a promotion involving just one of the items would likely drive sales of the other. Example rule: 98% of people who purchased items A and B also purchased item C.
  44. Market Basket Use-Case: Association Rule Mining
     - Dataset: Product and Retail dataset
     - Machine Learning implementation: the technique used is Affinity Analysis, or Association Rule Mining
     - R implementation: understand the implementation of the technique on a smaller dataset
     - Hadoop implementation: understand how to achieve the same on Big Data using Mahout libraries on Hadoop
     - Result: Market Basket Analysis
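A rule such as "people who bought A and B also bought C" is usually scored by its support and confidence. The sketch below computes both on a handful of invented baskets; the data and function names are illustrative assumptions, not part of the course material.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions)


def support(transactions, items):
    """Fraction of all transactions that contain `items`."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)


# Made-up baskets for illustration only.
baskets = [
    {"shampoo", "conditioner", "soap"},
    {"shampoo", "conditioner"},
    {"shampoo", "soap"},
    {"bread", "milk"},
    {"shampoo", "conditioner", "milk"},
]

rule_support = support(baskets, {"shampoo", "conditioner"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"shampoo"}, {"conditioner"})  # 3 of 4 shampoo baskets
print(rule_support, rule_confidence)
```

Association rule mining algorithms search for all rules whose support and confidence exceed chosen thresholds; the two measures above are the building blocks of that search.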
  45. Health Care Use-Case. Domain of the dataset: Life Science and Health Care. The algorithm is not limited to this domain, however; the technique can be applied wherever we want to forecast the occurrence of an event on the basis of certain conditions. Problem statement: a health care organization wants to forecast the onset of diabetes mellitus in Indians using a set of patient attributes as input, such as:
     - Plasma glucose concentration
     - Diastolic blood pressure
     - Triceps skin fold thickness, etc.
     Source: http://www.thenewstribe.com/2013/11/15/diabetes-is-killing-one-patient-every-six-seconds/
  46. Health Care Use-Case: Parallel Processing
     - Dataset: Life Science and Health Care dataset with some attributes of patients as input
     - Machine Learning implementation: the technique used is Affinity Analysis or Association Rule Mining
     - R implementation: understand the basic implementation of the technique on a smaller dataset using R; achieve parallel processing of the same algorithm using a parallel processing library provided by Revolution R
     - Hadoop implementation: understand how to achieve the same on Big Data using Mahout libraries on Hadoop
     - Result: forecast the onset of diabetes mellitus in Indians
  47. Social Media Use-Case. Domain of the dataset: Social Media. The algorithm is not limited to this domain, however; the technique can be applied wherever we want to put documents into categories without going through the contents of every document. Problem statement: a social media research firm wants to know the trends of topics discussed on Twitter. For easy analysis it wants to classify them into the following categories:
     - apparel (clothes, shoes, watches, …)
     - art (book, DVD, music, …)
     - camera
     - event (travel, concert, …)
     - health (beauty, spa, …)
     - home (kitchen, furniture, garden, …)
     - tech (computer, laptop, tablet, …)
     Source: http://www.mobigyaan.com/images/stories/Miscellaneous/mobigyaan-twitter-chat.jpg
  48. Social Media Use-Case: Naïve Bayes Classifier
     - Dataset: Social Media dataset
     - Machine Learning implementation: the technique used is the Naïve Bayes Classifier
     - R implementation: understand the basic implementation of the technique on a smaller dataset using R
     - Hadoop implementation: understand how to achieve the same on Big Data using Mahout libraries on Hadoop
     - Result: categorical classification of the tweets
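As a rough illustration of how a Naïve Bayes classifier assigns a tweet to a category, the sketch below trains a multinomial model with add-one (Laplace) smoothing on four invented tweets. The training data, the two labels, and the helper names are assumptions made for this example; the course itself uses R and Mahout.

```python
from collections import Counter, defaultdict
from math import log


def train(examples):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-posterior for `text`."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = log(class_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training tweets for two of the categories named on the slide.
tweets = [
    ("new laptop and tablet deals", "tech"),
    ("my laptop battery died", "tech"),
    ("bought new shoes and a watch", "apparel"),
    ("love these running shoes", "apparel"),
]
model = train(tweets)
print(classify("cheap tablet and laptop", model))  # tech
print(classify("red shoes", model))                # apparel
```

The "naïve" part is the assumption that words are independent given the category, which is what lets the score be a simple sum of per-word log probabilities.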
  49. Data Science: Core Components. In the rest of this class, we will briefly introduce Hadoop, R, and Machine Learning in turn. These topics are covered in depth in their respective modules during the course.
  50. Introduction to Hadoop
  51. Introduction to Hadoop
     - Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
     - It is an open-source data management framework with scale-out storage and distributed processing.
     - In 2004, Google published a paper on a process called MapReduce.
     - The MapReduce framework provides a parallel processing model and an associated implementation to process huge amounts of data.
     - An implementation of the MapReduce framework was then adopted by an Apache open source project named Hadoop.
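The MapReduce model described above can be sketched on a single machine in a few lines. This is only a conceptual illustration, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` are made up here, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big insight", "data science"])
print(counts)  # {'big': 2, 'data': 2, 'insight': 1, 'science': 1}
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is the property Hadoop exploits.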
  52. Hadoop Key Characteristics. Scalable, Reliable, Economical, Flexible, Robust Ecosystem.
  53. Hadoop Core Components
     - HDFS cluster: Name Node (admin node) and Data Nodes
     - MapReduce engine: Job Tracker and Task Trackers
     - Each Data Node is paired with a Task Tracker
  54. Annie's Question. Hadoop is a framework that allows for the distributed processing of:
     - Small data sets
     - Large data sets
  55. Annie's Answer. Large data sets. Hadoop can also process small data sets; however, to experience its true power one needs data in the terabyte range, because this is where an RDBMS takes hours and fails, whereas Hadoop does the same in a couple of minutes.
  56. To set up Hadoop on your system, follow the "Hadoop Installation Guide" in the LMS.
  57. Analytics with R
  58. Analytics with R. Source: http://www.r-project.org/
  59. R: Characteristics
     - R is open source and free.
     - R has lots of packages and multiple ways of doing the same thing.
     - By default, R stores data in RAM.
     - R has the most advanced graphics. You need much better programming skills.
     - R has GUIs that make learning easier.
     - Customization needs the command line.
     - R can connect to many databases and data types.
  60. Comparing R and others. Source: http://r4stats.com/articles/popularity/
  61. Comparing R with Base SAS* / SAS Stat*
     - R is open source and free. Base SAS*, SAS/Stat*, SAS/ET*, SAS/OR*, and SAS/Graph* are relatively expensive because of annual licenses.
     - Open source R has support from email lists, Twitter, and Stack Overflow. SAS Institute* products have dedicated support and extensive documentation.
     - R is slower on the desktop than Base SAS for datasets of about 4-5 GB. By default R stores data in RAM, so we can use the cloud.
     - R has much better graphics. You need much better programming skills.
     - You can create custom functions in R easily. Customization needs the command line.
     - R has multiple GUIs that are free. SAS GUIs are more expensive.
     *Copyright © 2012 SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513, USA. All rights reserved.
  62. Annie's Question. R provides support in terms of: 1. Dedicated support and documentation 2. Email lists, Twitter, etc.
  63. Annie's Answer. Answer: 2. Email lists, Twitter, etc.
  64. Annie's Question. Custom functions can be easily created in: 1. SAS 2. R
  65. Annie's Answer. Answer: 2. R
  66. Annie's Question. Most of the functions in R are written in:
     - Java
     - R
     - C
     - Fortran
  67. Annie's Answer. Most of the user-visible functions in R are written in R. For efficiency, the user can interface to procedures written in C, C++, or FORTRAN.
  68. Introduction to the R Programming Language
     - History; Evolution; Current State
     - Open Source; Free; Widely Recognized
     - Official Website; R Core; Creators; R Journal
     Source: www.r-project.org/about.html
  69. R and Hadoop Integration
     - R and Hadoop are a natural match in Big Data analytics and visualization.
     - One of the best-known R packages supporting Hadoop functionality is RHadoop.
     - RHadoop was developed by Revolution Analytics.
     - RHadoop is a collection of three R packages: rmr, rhdfs, and rhbase. The rmr package provides Hadoop MapReduce functionality in R, rhdfs provides HDFS file management in R, and rhbase provides HBase database management from within R.
  70. To set up R on your system, follow the "R Installation Guide" in the LMS under module 1.
  71. Machine Learning
  72. Machine Learning: Mahout. Machine Learning is a class of algorithms that is data-driven: unlike "normal" algorithms, it is the data that "tells" what the "good answer" is. Example: a hypothetical non-machine-learning algorithm for face recognition in images would try to define what a face is (a round, skin-colored disk with dark areas where you expect the eyes, etc.). A machine learning algorithm would have no such coded definition but would "learn by examples": you show it several images of faces and non-faces, and a good algorithm will eventually learn to predict whether or not an unseen image is a face. Source: http://endthelie.com/2012/08/24/fbi-sharing-facial-recognition-software-with-police-departments-across-america/
  73. Mahout Overview
     - Mahout is about scalable Machine Learning.
     - Mahout has functionality for many of today's common machine learning tasks.
     - Machine Learning is all over the web today.
     - MapReduce magic in action.
  74. Machine Learning: LinkedIn Recommendations. Hadoop and MapReduce magic in action. Write intelligent applications using Apache Mahout. Source: https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  75. Annie's Question. Mahout algorithms for clustering, classification, and collaborative filtering are implemented on top of Apache Hadoop using:
     - Flume
     - MapReduce
     - Sqoop
     - Hive
  76. Annie's Answer. Mahout algorithms are implemented on top of Apache Hadoop using the Map/Reduce paradigm.
  77. Assignment. 1. Install R with the help of the "R Installation Steps" guide in the LMS. This step-by-step guide will help you install and set up R on your system.
  78. Agenda for Next Class. In the next class you will be able to
     - Understand what R is
     - Describe why R is used
     - Implement R programming concepts
     - Learn data import techniques
     - Analyze the processing of data
  79. Pre-work. Go through the "R Essentials for Data Science" section in the LMS. Watch the recordings in that section to gain an understanding of the R environment.
  80. What's Within the LMS?
  81. What's Within the LMS? Recording of the class, presentation, quiz.
  82. What's Within the LMS? Assignment, installation guide, pre-work.
  83. References
     - http://www.today.mccombs.utexas.edu/2012/04/the-big-data-machine
     - http://www.espncricinfo.com/
     - http://www.majorprojects.vic.gov.au/our-projects/our-past-projects/austin-hospital
     - http://wp.streetwise.co/wp-content/uploads/2012/08/Amazon-Recommendations.png
     - http://smhttp.23575.nexcesscdn.net/80ABE1/sbmedia/blog/wp-content/uploads/2013/03/netflix-in-asia.png
     - http://www.crowdsourcing.org/article/-nasa-tries-to-free-creativity-with-big-data-challenge/19984
     - http://whatsthebigdata.files.wordpress.com/2013/11/batman-on-big-data.jpg
     - http://spinnakr.com/blog/wp-content/uploads/2013/08/Using-Big-Data-.jpg
     - http://thesocietypages.org/sociologylens/files/2013/09/BIgDataDilbert_Cartoon.jpg
     - http://abstrusegoose.com/55
     - http://www.thenewstribe.com/2013/11/15/diabetes-is-killing-one-patient-every-six-seconds/
     - http://www.mobigyaan.com/images/stories/Miscellaneous/mobigyaan-twitter-chat.jpg
     - http://www.r-project.org/
     - http://endthelie.com/2012/08/24/fbi-sharing-facial-recognition-software-with-police-departments-across-america/