Colloquium(7)_DataScience:ShivShaktiGhosh&MohitGarg

Huge amounts of data are being collected everywhere - when we browse the web, go to the doctor's clinic, visit the supermarket, tweet or watch a movie. This plethora of data is dealt with under a new realm called Data Science. Data Science is now recognized as a highly critical, growing area with impact across many sectors including science, government, finance, health care, social networks, manufacturing, advertising, retail,
and others. This colloquium will try to provide an overview of this emerging field as well as clarify some of its bits and pieces.

  1. 1. DATA SCIENCE Colloquium (7) MS(LIS) 2013-2015 Indian Statistical Institute Documentation Research and Training Centre
  2. 2. ● Data Science is a newly emerging field dedicated to analyzing and manipulating data to derive insights and build data products. It combines skill-sets ranging from computer science, to mathematics, to art. (www.kaggle.com)
  3. 3. ● Data science implies a focus involving data and, by extension, statistics, or the systematic study of the organization, properties, and analysis of data and its role in inference, including our confidence in the inference. (D. J. Patil) ● In simple words, it is a process that extracts information/knowledge from huge amounts of data.
  4. 4. Evolution • 1900 - Statistics • 1960 - “Data Mining” • 2006 - Google Analytics appears • 2007 - Business/Data/Predictive Analytics • 2012 - Big Data surge • 2013 - Data Science • 2015 - ??
  5. 5. ● Data is growing at a very high pace (exponentially). ● According to IBM, 2.5 exabytes - that's 2.5 billion gigabytes (GB) - of data was generated every day in 2012. About 75% of data is unstructured, coming from sources such as text, voice and video.
  6. 6. ● In 2012 it reached 2.8 zettabytes, and IDC forecasts that we will generate 40 zettabytes (ZB) by 2020, which is the equivalent of 5,200 GB of data for every man, woman and child on Earth. ● 90% of all the data in the world today has been created in the past few years.
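
A quick arithmetic check of the 5,200 GB figure (a sketch in Python; the ~7.6 billion world-population estimate for 2020 is our assumption, not from the slides):

    total_bytes = 40 * 10**21              # 40 ZB, decimal units
    population = 7.6 * 10**9               # assumed ~2020 world population
    gb_per_person = total_bytes / population / 10**9
    print(round(gb_per_person))            # ~5263 GB, close to the quoted 5,200 GB
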
  7. 7. Sub-topics and speakers: 1. What is Data Science - Sandip Das; 2. Data Scientist - Anwesha Bhattacharya; 3. Applications of Data Science - Manasa Rath; 4. Workflow of Data Science - Dibakar Sen; 5. Challenges in Workflow of Data Science - Jayanta Kr. Nayek; 6. Tools and Technology - Tanmay & Manash; 7. Machine Learning in Data Science - Samhati Soor; 8. Conclusion - Shiv Shakti Ghosh
  8. 8. References ● http://bit.ly/1gyRYcM ● http://bit.ly/SdJ2OU ● http://bit.ly/RzrZ9k ● http://bit.ly/1pwlEY4 ● http://bit.ly/1pwlUq6
  9. 9. What is Data Science Sandip Das
  10. 10. DATA SCIENCE DATA SCIENCE
  11. 11. Data What kind of data might you collect?
  12. 12. Data  How many lily pads  Measure the size (in inches) of the lily pads  How many small, medium or large lily pads  How many frogs
  13. 13. What is Data?  It is something you want to know.  A collection of facts.  Facts and statistics collected together for reference or analysis.  Data as the plural form of datum; as pieces of information; and as a collection of object-units that are distinct from one another.  Data are undifferentiated observations of facts in terms of words, numbers, symbols, etc.
  14. 14. What is Data? Computer data is information processed or stored by a computer. This information may be in the form of text documents, images, audio clips, software programs, or other types of data. Computer data may be processed by the computer's CPU and is stored in files and folders on the computer's hard disk.
  15. 15. Science  The systematic observation of natural events and conditions in order to discover facts about them and to formulate laws and principles based on these facts.  Science involves more than the gaining of knowledge. It is about gaining a deeper and often useful understanding of the world.
  16. 16. Science is an art of  Discovering what we don't know from data  Obtaining predictive, actionable insight from data  Creating data products that have business impact  Building confidence in decisions that drive business value
  17. 17. Data science  According to computer scientist Peter Naur, "The science of dealing with data, once they have been established"  Data Science is the scientific study of the creation, validation and transformation of data to create meaning.  Data science is the study of the generalizable extraction of knowledge from data.
  18. 18. Multidisciplinary Approach
  19. 19. Domain Expertise  Domain expertise is proficiency, with special knowledge or skills, in a particular area or topic.  Domain expertise includes knowing what problems are important to solve and knowing what sufficient answers look like. Domain experts understand what the customers of their knowledge want to know.
  20. 20. Data Engineering It is the "data" part of data science. It involves: acquiring, ingesting, transforming, storing, and retrieving data.
  21. 21. Scientific Method It is the process for acquiring new knowledge by applying the principles of reasoning on empirical evidence derived from testing hypotheses through repeatable experiments.
  22. 22. Statistics & Mathematics Statistics (along with mathematics) is the cerebral part of Data Science. They collect, organize, analyse and interpret data.
  23. 23. Advanced Computing Advanced computing is the heavy lifting of data science. It consists of software design and programming languages.
  24. 24. Visualization  It is the pretty face of data science.  A good visualization is the result of a creative process that composes an abstraction of the data in an informative and aesthetically interesting form.
  25. 25. Hacker mindset  Hacking is modifying one's own computer system, including building, rebuilding, modifying and creating software, electronic hardware or peripherals, in order to make it better, make it faster, or give it added features.  Data science hacking involves inventing new models and exploring.
  26. 26. References ● http://bit.ly/1jZR0WA ● http://bit.ly/1pwmV1m ● http://bit.ly/1tkKyKG ● http://bit.ly/1ntd13L ● http://bit.ly/1wi9t5Z
  27. 27. Data Scientist Anwesha Bhattacharya (& I am not a data scientist)
  28. 28. Who is a data scientist? ● A practitioner of data science is called a data scientist.(~Wikipedia) ● Data scientists use technology and skills to increase awareness, clarity and direction for those working with data. (http://www.datascientists.net)
  29. 29. Why do we need data scientists? ● Firstly, there is more data than we can consume. We require a data scientist who can look at the data and say, "This is important. Check out this one." ● They are the people who can understand and provide meaning to the piles and piles of data that are collected. "Big data" is the buzzword that represents those piles. ● Minimise the disruptions that are encountered while dealing with data. ● Present data with an awareness of the consequences of presenting that data.
  30. 30. Data Scientist aims
  31. 31. Types of Data Scientists Data scientists can be broadly classified into two categories: Product-focused data scientists. Business Intelligence style of data scientists. There are roughly 4 to 5 groups in each category.
  32. 32. Product-focused Data Scientists  Data Researcher The professionals in this category come from the academic world and have in-depth backgrounds in statistics or the physical or social sciences. This type of data scientist often holds a PhD but is weakly skilled in machine learning, programming or business.  Data Developer These guys tend to concentrate on the technical issues that come with handling data. They are strong in programming and machine learning but weak in business and statistics skills.  Data Creatives These are the guys who make something innovative out of mountains of data. They are strongly skilled in machine learning, Big Data, programming and other skills needed to handle massive data.  Data Business People They represent the business side and are responsible for making vital business decisions through data analytics techniques. They are a blend of business and technical proficiency.
  33. 33. Business Intelligence based Data Scientists ● Quantitative, exploratory Data Scientists Quantitative, exploratory data scientists are inclined to have PhDs and use theory to comprehend behaviour. By combining theory and exploratory research, these data scientists improve products. ● Operational Data Scientists Operational data scientists frequently work in finance, sales or operations teams in an organization. Their role is to analyse the performance, responses and behaviour of a process, to improve the organization's strategy and efficiency. ● Product Data Scientists Product data scientists fit into product management or engineering. Their job is to understand the way users make use of a product and use that knowledge to fine-tune the product. ● Marketing Data Scientists Marketing data scientists focus on the user base, evaluate performance and work on improving efficiency, pretty much like the standard marketing guy. ● Research Data Scientists Research data scientists create insights from a data set.
  34. 34. Profile of a Data Scientist ● They love data ● Have an investigative mindset ● Goal of work: finding patterns in data and building data-driven products ● Are practitioners, not theorists ● Have "hands-on" skills ● Have domain expertise ● Team players ● Technically focused ● Versatile communication and collaboration skills ● Curiosity for exploring and experimenting with data ● Sceptical people, likely to ask a lot of questions about the viability of a given solution and whether it will really work.
  35. 35. Required skills ● Data mining - Computational process of discovering patterns in large data sets. The analysis step of the "Knowledge Discovery in Databases". ● Programming - The act of instructing computers to perform tasks. ● Algorithms - Step-by-step procedure for calculations used for analysis of data. ● Statistics – The collection, organization, analysis, interpretation and presentation of data. ● NLP - Interactions between computers and human languages. ● Machine learning - The science of getting computers to act without being explicitly programmed. ● Distributed systems – The components located on networked computers communicate and coordinate their actions by passing messages. ● Visualization - The creation and study of the visual representation of data, communicate both abstract and concrete ideas. ● .........
  36. 36. What Does a Data Scientist Do? 10 Things [most] Data Scientists Do: 1) Ask Good Questions. What is what? We don't know! We'd like to know. 2) Explore Data & Generate Hypotheses. Run Experiments 3) Scoop, Scrape & Sample Data 4) Tame Data 5) Discover the Unknowns 6) Model Data. Model Algorithms. 7) Understand Data Relationships 8) Tell the Machine How to Learn from Data 9) Create Data Products that Deliver Actionable Insight 10) Communicate the Results Using Visualizations and Presentations
  37. 37. DIKUW (the Data → Information → Knowledge → Understanding → Wisdom ladder, running from PAST to FUTURE):
- Data (D): raw; numbers, letters, symbols
- Information (I): what; description, context, relationships
- Knowledge (K): how to; extract, test, instruction
- Understanding (U): why; cause & effect, proved, known unknowns
- Wisdom (W): when; prediction, what's best, unknown unknowns
Roles along the ladder: Data Engineer → Data Analyst → Data Miner → Data Scientist
  38. 38. Data Scientist vs Data Analyst
Data Scientist: familiarity with database systems, e.g. MySQL; better to be familiar with Java, Python; should have a clear understanding of various analytical functions (median, rank, etc.) and how to use them on data sets; perfection in mathematics, statistics, correlation, data mining, etc.; deep statistical insights and machine learning.
Data Analyst: familiarity with data warehousing and business intelligence concepts; in-depth exposure to SQL and analytics; strong understanding of Hadoop-based analytics; perfection regarding the tools and components of data architecture; proficiency in decision making.
● Data analysis has generally been used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries. ● Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what's going on.
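
To make the "analytical functions" entry concrete, a minimal sketch of median and rank on a toy data set (pandas and all column names are our assumptions; the slides name only the functions):

    import pandas as pd

    sales = pd.DataFrame({"region": ["N", "N", "S", "S", "S"],
                          "amount": [120, 80, 200, 150, 150]})
    print(sales["amount"].median())        # 150.0
    # Rank amounts within each region, analogous to a SQL window function
    sales["rank"] = sales.groupby("region")["amount"].rank(ascending=False)
    print(sales)
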
  39. 39. Challenges of a data scientist ● Red tape - no access allowed ● Unknown need - what's the organization's goal? ● Terminology - what's a wonkulator? ● Real-world data - messy, noisy, missing ● Analysis distrust - ...but I don't like that result
  40. 40. References ● Zhukov, Leonid. Data Scientists. Higher School of Economics. National Research University. ● http://bit.ly/1kduMvA ● http://bit.ly/1orF9DL ● http://bit.ly/1tMBBvQ ● http://bit.ly/1kJ9gU8 ● http://bit.ly/TS9H5e ● http://bit.ly/1jZR0WA
  41. 41. APPLICATIONS of DATA SCIENCE by Manasa Rath
  42. 42. Reaching to Data Science
  43. 43. APPLICATIONS agriculture pharmacy energy retail tourism realestate import-export finance business services
  44. 44. Applications in the Education sector - Survey done by the Pearson group to improve learning software and course materials for better quality and efficacy in learning - Tools used are Python, R and Google BigQuery
  45. 45. Data Science in the Healthcare Industry - Consider a group that has been diagnosed with Type 2 diabetes, where some subset of this group has developed complications. - We would like to know whether there is any pattern to the complications and whether the probability of a complication can be predicted and therefore acted upon. (Healthcare-use database snippet)
  46. 46. Extracting Interesting Patterns of Health Outcomes from the Healthcare System - Is the pattern robust and predictive? OBSERVATIONS: What is the incidence of complications of Type 2 diabetes for people over 37 who are on more than six medications?
  47. 47. Remarks - When predictive accuracy becomes a primary objective, the computer tends to play a significant role in model building and decision making. - This shows an integrated skill set spanning mathematics, statistics, AI, databases and optimization, along with a deep understanding of the craft of problem formulation to engineer effective solutions.
  48. 48. Applications in Social Networking sites
  49. 49. Key Points - The ability to interpret unstructured data and integrate it with numbers further increases our ability to extract useful knowledge in real time and act on it
  50. 50. References 1. Data Science and Prediction by Vasant Dhar: http://bit.ly/1tiRvMr
  51. 51. Workflow of Data Science Dibakar Sen
  52. 52. Workflow of Data Science ● The workflow process consists of three major activities: - Organising - Packaging - Delivering
  53. 53. Workflow Phases [diagram: Understanding of data / Evaluation]
  54. 54. Understanding of Data - set objectives or goals - set data fields - data collection procedure
  55. 55. Preparation Phase [diagram: Understanding of data / Evaluation]
  56. 56. Preparation Phase ● Acquire data The obvious first step in any data science workflow is to acquire the data to analyze. Data can be acquired from a variety of sources, e.g.: - Existing data can be used (e.g., U.S. Census data sets). - Data can be automatically generated by computer software. - Data can be manually entered into a spreadsheet or text file by a human, for instance through a survey.
  57. 57. Preparation Phase ● Reformat and clean data - Before analysis begins, we need to verify that the data are accurate and that the variables are well named and properly labeled. - We have to store the data in the desired format. - Verify the sample and variables: Do the variables have the correct values? Are missing data coded appropriately? Are the data internally consistent? Is the sample size correct? etc. - Programmers reformat and clean data either by writing scripts or by manually editing the data in, say, a spreadsheet.
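
A minimal sketch of those verification steps in Python with pandas (the file and column names are hypothetical; the slides do not prescribe a tool):

    import pandas as pd

    df = pd.read_csv("survey.csv")             # hypothetical raw data file

    # Verify the sample and variables
    print(df.shape)                            # is the sample size correct?
    print(df.dtypes)                           # are variables well named and typed?

    # Do the variables have the correct values?
    print((~df["age"].between(0, 120)).sum(), "age values out of range")

    # Are missing data coded appropriately? Recode a sentinel to a real missing value
    df["income"] = df["income"].replace(-999, pd.NA)
    print(df.isna().sum())                     # missing entries per column

    df.to_csv("survey_clean.csv", index=False) # store in the desired format
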
  58. 58. Analysis Phase [diagram: Understanding of data / Evaluation]
  59. 59. Analysis Phase ● Data Analysis - The core activity of data science is the analysis phase: writing, executing, and refining computer programs to analyze and obtain insights from data. - Different "scripting" languages such as Python, Perl, R, and MATLAB are used to analyse the data. However, data scientists also use compiled languages such as C, C++, and Fortran when appropriate.
  60. 60. ● In the analysis phase, the programmer engages in a repeated iteration cycle of editing scripts, executing to produce output files, inspecting the output files to gain insights and discover mistakes, debugging, and re-editing.
  61. 61. Reflection/Evaluation Phase [diagram: Understanding of data / Evaluation]
  62. 62. Reflection / Evaluation Phase While the analysis phase involves programming, the reflection phase involves thinking and communicating about the outputs of analyses. After inspecting a set of output files, a data scientist might perform the following types of reflection: - Take notes - Hold meetings - Make comparisons and explore alternatives
  63. 63. Dissemination Phase [diagram: Understanding of data / Evaluation]
  64. 64. Dissemination Phase The final phase of data science is disseminating results. Prepare reports in order to communicate findings to the appropriate audience. Results are most commonly in the form of written reports such as internal memos, slideshow presentations, business/policy white papers, or academic research publications. ● Beyond presenting results in written form, some data scientists also want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems.
  65. 65. References ● http://bit.ly/1jZcx2I ● http://bit.ly/1jZeTyN ● http://bit.ly/1hbQuWx
  66. 66. Challenges in Workflow of Data Science Jayanta Kr. Nayek
  67. 67. Preparation Phase Acquire data: - Keeping track of provenance: where each piece of data comes from and whether it is still up to date. - Data management: programmers must assign names to data files that they create or download and then organize those files into directories. When they create or download new versions of those files, they must make sure to assign proper filenames to all versions and keep track of their differences. - Storage: sometimes there is so much data that it cannot fit on a single hard drive, so it must be stored on remote servers.
  68. 68. Preparation Phase Reformat and clean data: - A related problem is that raw data often contains semantic errors (an error in logic or arithmetic that must be detected at run time), missing entries, or inconsistent formatting, so it needs to be "cleaned" prior to analysis. - Data integration: combining data residing in different sources and providing users with a unified view of these data. - Heterogeneous data: data integration involves synchronizing huge quantities of variable, heterogeneous data resulting from internal legacy systems (an old method, technology, computer system, or application program; "of, relating to, or being a previous or outdated computer system") that vary in data format. Legacy systems may have been created around flat file, network, or hierarchical databases.
  69. 69. Preparation Phase ● Data Integration Problems: - Unanticipated costs: labor costs for initial planning, evaluation, programming and additional data acquisition; software and hardware purchases; unanticipated technology changes/advances; and both labor and the direct costs of data storage and maintenance. - Lack of data management expertise: the support required to engage and convey to everyone in the agency the need for, and benefits of, data integration is unlikely to flow from leaders who lack awareness of or commitment to those benefits.
  70. 70. Preparation Phase Data transmission: - The physical transfer of data over a point-to-point or point-to-multipoint communication channel. - Cloud data storage is popularly used as cloud technologies develop. - We know that network bandwidth capacity is the bottleneck in cloud and distributed systems, especially when the volume of communication is large. - On the other side, cloud storage also leads to data security problems, such as the requirement of data integrity checking.
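
One standard way to meet the integrity-checking requirement mentioned above is to compare checksums before and after transfer; a minimal sketch using Python's hashlib (the file names are hypothetical):

    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        # Stream the file in chunks so large files do not exhaust memory
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # After uploading, the stored copy should hash to the same value
    if sha256_of("dataset.bin") == sha256_of("downloaded_copy.bin"):
        print("intact")
    else:
        print("corrupted in transit")
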
  71. 71. Analysis Phase - Data inconsistency and incompleteness: a number of data preprocessing techniques, including data cleaning, data integration, data transformation and data reduction, can be applied to remove noise and correct inconsistencies. - Scalability: the biggest and most important challenge is scalability when we deal with Big Data analysis. In the last few decades, researchers have paid more attention to accelerating analysis algorithms to cope with increasing volumes of data and to speeding up processors following Moore's Law. - Data curation: data curation is aimed at data discovery and retrieval, data quality assurance, value addition, reuse and preservation over time. The existing database management tools are unable to process Big Data that grow so large and complex.
  72. 72. Analysis Phase - Timeliness: for real-time Big Data applications, like navigation, social networks, finance, biomedicine, astronomy, intelligent transport systems, and the Internet of Things, timeliness is the top priority. How can we guarantee the timeliness of the response when the volume of data to be processed is very large? - File and metadata management: repeatedly editing and executing scripts while iterating on experiments produces numerous output files, such as intermediate data, textual reports, tables, and graphical visualizations. However, doing so leads to data management problems due to the abundance of files and the fact that programmers often later forget their own ad-hoc naming conventions. - Data security: firstly, the size of Big Data is extremely large, challenging the protection approaches; secondly, it also leads to a much heavier security workload.
  73. 73. Analysis Phase - Absolute running times: scripts might take a long time to terminate, either due to large amounts of data being processed or the algorithms being slow. - Incremental running times: scripts might take a long time to terminate after minor incremental code edits made while iterating on analyses, which wastes time re-computing almost the same results as previous runs. - Crashes from errors: scripts might crash prematurely due to errors in either the code or inconsistencies in data sets. Programmers often need to endure several rounds of debugging before their scripts can terminate with useful results.
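
A common mitigation for the incremental-running-time problem is to cache expensive intermediate results on disk so that unchanged steps are not re-run; a minimal sketch with pickle (the file name and the slow step are hypothetical):

    import os
    import pickle

    def cached(path, compute):
        # Load a saved result if present; otherwise compute and save it
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        result = compute()
        with open(path, "wb") as f:
            pickle.dump(result, f)
        return result

    # The slow step is paid only once across repeated script runs
    totals = cached("totals.pkl", lambda: sum(range(10**7)))
    print(totals)
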
  74. 74. Reflection Phase ● Take notes: Since notes are a form of data, the usual data management problems arise in notetaking, most notably how to organize notes and link them with the context in which they were originally written. ● Make comparisons and explore alternatives: Data scientists must organize, manage, and compare these graphs to gain insights and ideas for what alternative hypotheses to explore.
  75. 75. Dissemination Phase - Functionalities: to convey information easily by providing the knowledge hidden in complex and large-scale data sets, both aesthetic form and functionality are necessary. Current tools mostly have poor performance in functionality and response time. - Scalability: it is particularly difficult to conduct data visualization (whose main objective is to represent knowledge more intuitively and effectively by using different graphs) because of the large size and high dimensionality of Big Data.
  76. 76. Dissemination Phase ● Difficult to distribute research code: Some data scientists also want to distribute their software so that colleagues can reproduce their experiments or play with their prototype systems. It is difficult to distribute research code in a form that other people can easily execute on their own computers. ● Difficult to reproduce the results: It is even difficult to reproduce the results of one's own experiments a few months or years in the future, since one's own operating system and software inevitably get upgraded in some incompatible manner such that the original code no longer runs.
  77. 77. References ● Chen, Philip C. L. and Zhang, Chun-Yang. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences. Elsevier. Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Macau, China. ● http://bit.ly/1jZcx2I ● http://1.usa.gov/SNspKm
  78. 78. TECHNOLOGY and Tools for DATA SCIENCE TANMAY MONDAL & MANASH KUMAR
  79. 79. We need ● Organise Data ● Analyse Data ● Package and Deliver Data
  80. 80. Data Science Tools  Language − Java, R, Python, ...  Databases/Data Warehouses − Apache Cassandra, Apache HBase, MongoDB, ....  Data Mining − RapidMiner/RapidAnalytics, Orange, Weka, ....  File Systems − Gluster, Hadoop Distributed File System, ...
  81. 81. Data Science Tools  Big Data Search − Lucene, Solr, ...  Data Aggregation and Transfer − Sqoop, Flume, ....  Miscellaneous Big Data Tools – Hadoop, Avro, Zookeeper, ...  ......................
  82. 82. What is Hadoop? ● Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. [diagram: nodes in a Hadoop cluster]
  83. 83. Why Hadoop? • Handles enormous data volumes. • Cost-effective. • Scalable. • Fault tolerant.
  84. 84. Origin of Hadoop • Google introduced two key technologies for handling Big Data: the Google File System (a distributed file system technology) in 2003 and MapReduce (a framework for a distributed compute model) in 2004. • Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. • In February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. • First release of Apache Hadoop in September 2007.
  85. 85. When should we go for Hadoop ?  Data is too huge  Unstructured data  Parallelism  Processes are independent  Need better scalability
  86. 86. The Hadoop Ecosystem ● HDFS - Hadoop Distributed File System. ● MapReduce - A distributed framework for executing work in parallel. ● Hive - A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. ● Pig - A high-level platform for creating MapReduce programs used with Hadoop. ● HBase - A non-relational, distributed database system. ● ..........
  87. 87. The Major Components of Hadoop  Hadoop uses its own distributed file system, HDFS, which makes data available to multiple computing nodes.  Hadoop uses MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
  88. 88. HDFS  A hierarchical, UNIX-like file system for data storage that splits large files into blocks.  Stores file blocks across many nodes in a cluster.  Distributes and replicates blocks to different nodes.  Has a master/slave architecture.
  89. 89. HDFS Architecture
  90. 90. HDFS ... NameNode  Runs on a single node as a master process  Holds file metadata (which blocks are where)  Directs client access to files in HDFS SecondaryNameNode  Maintains a copy of the NameNode metadata DataNode ● Stores data in the local file system ● Periodically sends a report of all existing blocks to the NameNode
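
To make the block splitting and replication concrete, a small back-of-the-envelope sketch (the 128 MB block size and replication factor 3 are common HDFS defaults assumed here; Hadoop 1.x defaulted to 64 MB):

    import math

    block_size = 128 * 1024**2      # assumed default block size, in bytes
    replication = 3                  # assumed default replication factor

    file_size = 1 * 1024**3          # a hypothetical 1 GB file
    blocks = math.ceil(file_size / block_size)
    print(blocks)                    # 8 blocks tracked by the NameNode
    print(blocks * replication)      # 24 block replicas spread across DataNodes
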
  91. 91. WHAT IS MAP REDUCE? MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster
  92. 92. Map Reduce Paradigm Data processing system with two key phases: Map - perform a map function on input key/value pairs to generate intermediate key/value pairs. Reduce - perform a reduce function on intermediate key/value groups to generate output key/value pairs.
  93. 93. Map Reduce Daemons • JobTracker (Master) - Monitors job and task progress - Manages MapReduce jobs - Gives tasks to different nodes • TaskTracker (Slave) - Creates individual map and reduce tasks - Reports task status to JobTracker - Runs on the same node as the DataNode service
  94. 94. Hadoop MapReduce Components - Map Phase: Input Format, Record Reader, Mapper, Combiner - Reduce Phase: Shuffle, Sort, Reducer, Output Format
  95. 95. How does MapReduce work? ➢ The runtime partitions the input and provides it to different Map instances ➢ Map (key, value) → (key', value') ➢ The runtime collects the (key', value') pairs and distributes them to several Reduce functions so that each Reduce function gets the pairs with the same key'. ➢ Map and Reduce are user-written functions in Java
  96. 96. WORD COUNT IN MAP REDUCE
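
Since the word-count slide is only an image, here is a minimal pure-Python simulation of the same Map → shuffle/sort → Reduce flow (a sketch of the programming model, not actual Hadoop API code):

    from itertools import groupby
    from operator import itemgetter

    def map_fn(key, line):                  # Map: (key, value) -> (key', value') pairs
        for word in line.split():
            yield word, 1

    def reduce_fn(word, counts):            # Reduce: (key', values) -> output pairs
        yield word, sum(counts)

    lines = ["the quick brown fox", "the lazy dog", "the fox"]

    pairs = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]  # Map phase
    pairs.sort(key=itemgetter(0))           # shuffle & sort: group pairs by key
    for word, group in groupby(pairs, key=itemgetter(0)):                   # Reduce phase
        for k, v in reduce_fn(word, (count for _, count in group)):
            print(k, v)                     # e.g. "the 3", "fox 2", ...

In real Hadoop the same two functions would be written in Java as a Mapper and a Reducer, with the framework handling partitioning, shuffling and fault tolerance across the cluster.
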
  97. 97. Validation of data extraction and load into the EDW (Enterprise Data Warehouse) Once the map-reduce process is completed and the data output files are generated, the data is moved to an enterprise data warehouse or any other transactional system, depending on the requirement.
  98. 98. USERS OF HADOOP Yahoo! -  More than 100,000 CPUs in 40,000 computers running Hadoop  Produces data that was used in every Yahoo! Web search query Facebook -  In 2010 Facebook claimed that they had the largest Hadoop cluster in the world with 21 PB of storage.  On June 13, 2012 they announced the data had grown to 100 PB.  Each (commodity) node has 8 cores and 12 TB of storage
  99. 99. USERS OF HADOOP Adobe - Adobe uses Apache Hadoop and Apache HBase in several areas from social services to structured data storage and processing for internal use. Currently have about 30 nodes running HDFS Ebay - 532 nodes cluster (8 * 532 cores, 5.3PB) Heavy usage of Java MapReduce, Apache Pig, Apache Hive, Apache HBase Using it for Search optimization and Research.
  100. 100. Twitter  We use Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. GBIF (Global Biodiversity Information Facility)  Nonprofit organization that focuses on making scientific data on biodiversity available via the Internet  18 nodes running a mix of Apache Hadoop and Apache HBase
  101. 101. University of Glasgow  30-node cluster (Xeon Quad Core 2.4 GHz, 4 GB RAM, 1 TB/node storage). Used to facilitate information retrieval research & experimentation, particularly for TREC. Greece.com  Using Apache Hadoop for analyzing data for millions of images, log analysis, data mining
  102. 102. References http://bit.ly/1km1e46  http://bit.ly/Rzuzfz  http://yhoo.it/1pheFVK  Big data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja.
  103. 103. Machine Learning Samhati Soor
  104. 104. What is it? Learning is a process of knowledge acquisition with a specific purpose. Machine learning is the study of how to use computers to simulate human learning activities. [diagram: Training Set → Learning Algorithm → Hypothesis; Input → Hypothesis → Predicted Output, with Feedback]
  105. 105. Why is Machine Learning Possible? Mass storage - more data available. Higher performance of computers - larger memory for handling the data, and greater computational power for calculation and even online learning.
  106. 106. Basic Structure of a Machine Learning System [diagram: external environment, corpus study, knowledge representation, execution - the machine learning model]
  107. 107. The Goal of Machine Learning is... to create a predictive model that is indistinguishable from a correct model. [diagram: without logic vs. with logic]
  108. 108. Two Phases Machine learning methods are broken into two phases: Training Application
  109. 109. Types of Machine Learning Main types: 1. Supervised learning 2. Unsupervised learning 3. Reinforcement learning Other types: 1. Semi-supervised learning 2. Time-series forecasting 3. Anomaly detection 4. Active learning
  110. 110. The Main Research Work on Machine Learning Field Task-oriented research Cognitive simulation Theoretical analysis
  111. 111. Data Science and Machine Learning  If we are giving the computer rules and/or algorithms to automatically search through our data to "learn" how to recognize patterns and make complex decisions (such as identifying spam emails), we are implementing machine learning. In data science, data scientists use both statistical techniques and machine learning algorithms for identifying patterns and structure in data.
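
A minimal sketch of the spam example above; scikit-learn and the toy data are our choices, not named in the slides:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = ["win a free prize now", "meeting at 3pm tomorrow",
              "free money click now", "lunch with the project team"]
    labels = [1, 0, 1, 0]                   # 1 = spam, 0 = not spam (toy data)

    vec = CountVectorizer()
    X = vec.fit_transform(emails)           # learn a bag-of-words representation
    clf = MultinomialNB().fit(X, labels)    # the machine learns word patterns

    test = vec.transform(["free prize meeting"])
    print(clf.predict(test))                # the model decides; no hand-written rule
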
  112. 112. Role of Machine Learning in Data Science https://doubleclix.wordpress.com/category/data-science/
  113. 113. A Simple Implementation Suppose we have a model consisting of the likelihood of the coin landing heads (a prior over θ), while the data consist of the results of N coin flips. We are observing some data. Our goal is to determine the model from the data, i.e. we will find the probability of the desired model given the data, p(model|data).
  114. 114. Using conditional probability:
p(data|model) = p(data and model) / p(model)   --(1)
p(model|data) = p(data and model) / p(data)    --(2)
From (1) and (2) we get p(data|model) * p(model) = p(model|data) * p(data), which implies:
p(model|data) = p(data|model) * p(model) / p(data)
posterior = likelihood * prior / evidence
  115. 115. The likelihood distribution describes the likelihood of the data given the model; it reflects our assumptions about how the data was generated. The prior distribution describes our assumptions about the model before observing the data. The posterior distribution describes our knowledge of the model, incorporating both the data and the prior. The evidence is useful in model selection.
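
A worked sketch of the coin example via a grid approximation of Bayes' rule (pure Python; the 7-heads-in-10-flips data and the uniform prior are our assumptions):

    # Candidate models: theta = P(heads), on a grid from 0 to 1
    thetas = [i / 100 for i in range(101)]
    prior = [1 / len(thetas)] * len(thetas)         # uniform p(model)

    h, n = 7, 10                                    # data: 7 heads in 10 flips
    # p(data | model); the binomial coefficient is omitted since it cancels
    likelihood = [t**h * (1 - t)**(n - h) for t in thetas]

    evidence = sum(l * p for l, p in zip(likelihood, prior))          # p(data)
    posterior = [l * p / evidence for l, p in zip(likelihood, prior)] # p(model | data)

    best = max(range(len(thetas)), key=lambda i: posterior[i])
    print(thetas[best])                             # 0.7, the posterior mode
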
  116. 116. Working Method of a Predictive Modeler and a Data Scientist A predictive modeler may use machine learning approach to predict a value or likelihood of an outcome, given a number of input variables. A data scientist applies these same approaches on large data sets, writing code and using software adapted to work on big data.
  117. 117. Prospective The available library of statistical and machine learning algorithms for evaluating and learning from big data is growing, but it is not yet as comprehensive as the algorithms available for the non-distributed world. The algorithms vary by product, so it is important to understand what is and is not available. Moreover, not all algorithms familiar to the statistician and data miner are easily converted to the distributed computing environment. The bottom line is that, while fitting models on big data has the potential benefit of greater predictive power, some of the costs are a loss of flexibility in algorithm choices and/or extensive programming time.
  118. 118. References ● Machine Learning and Data Mining, Lecture Notes CSC 411/D11, Computer Science Department, University of Toronto, version: February 6, 2012 ● The Discipline of Machine Learning, Tom M. Mitchell, July 2006, CMU-ML-06-108, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 ● Statistical Machine Learning, Nic Schraudolph ● http://bit.ly/1oFt1ws ● http://bit.ly/1oFtNty
  119. 119. Conclusion Shiv Shakti Ghosh
  120. 120. Research Areas Cloud computing Databases and Database Management Systems Natural language processing Signal Processing Computer vision
  121. 121. Cloud computing Cloud computing involves distributed computing over a network, where a program or application may run on many connected computers at the same time. It specifically refers to a server connected through a communication network such as the Internet, an intranet, a local area network (LAN) or wide area network (WAN).
  122. 122. Issues Privacy -The increased use of cloud computing services such as Gmail and Google Docs has pressed the issue of privacy concerns. The greater use of cloud computing services has given access to a plethora of data which has the immense risk of data being disclosed either accidentally or deliberately.
  123. 123. Contd.. Legal - Certain legal issues arise with cloud computing, including trademark infringement, security concerns and the sharing of proprietary data resources. Vendor lock-in - Cloud computing is still relatively new, and standards are still being developed. Many cloud platforms and services are built on the specific standards, tools and protocols developed by a particular vendor for its particular cloud offering. This is a major challenge for interoperability.
  124. 124. Research areas open interoperation across cloud solutions at IaaS, PaaS and SaaS levels managing multi tenancy at large scale and in heterogeneous environments dynamic and seamless elasticity from private clouds to public clouds for unusual and/or infrequent requirements data management in a cloud environment, taking the technical and legal constraints into consideration
  125. 125. Databases & DBMS A database is an organized collection of data. The data are typically organized in a way that supports processes requiring this information. Database management systems (DBMSs) are specially designed software applications that interact with the user, other applications, and the database itself to capture and analyze data.
  126. 126. Issues Data definition – Defining new data structures for a database, removing data structures from the database, modifying the structure of existing data. Update – Inserting, modifying, and deleting data. Retrieval – Obtaining information either for end-user queries and reports or for processing by applications. Administration – Registering and monitoring users, enforcing data security, monitoring performance, maintaining data integrity, dealing with concurrency control, and recovering information if the system fails.
  127. 127. Research areas Research activity includes theory and development of prototypes and models. Notable research topics include, the atomic transaction concept and related concurrency control techniques, query languages and query optimization methods, RAID, and more.
  128. 128. NLP Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction.
  129. 129. Human-level natural language processing is an AI problem that is equivalent to making computers as intelligent as people. NLP's future is therefore tied closely to the development of AI in general. As natural language understanding improves, computers will be able to learn from the information online and apply what they have learned in the real world. In the future, humans may not need to code programs, but will dictate to a computer in a natural human language, and the computer will understand and act upon the instructions.
  130. 130. Signal Processing Signal processing is an area of Systems Engineering, Electrical Engineering and applied mathematics that deals with operations on or analysis of analog as well as digitized signals, representing time-varying or spatially varying physical quantities. Signals of interest can include sound, electromagnetic radiation, images, and sensor readings, for example biological measurements such as electrocardiograms, control system signals, telecommunication transmission signals, and many others.
  131. 131. Computer vision Computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information. A theme in the development of this field has been to duplicate the abilities of human vision by electronically perceiving and understanding an image.
  132. 132. Data Science Higher Education Programmes in 2014
- Indiana University, Indiana, US: Online Certificate in Data Science (January 2014); the program consists of 12 credits, including cloud computing, data management and data analysis.
- University of California, Berkeley: Master of Information and Data Science program.
- Saint Peter's University, US: Master of Science in Data Science program; the curriculum will include topics such as decision analysis and optimization, predictive modeling, data mining and visualization.
- Worcester Polytechnic Institute, Worcester, Massachusetts, US: Master of Science in Data Science program.
- University of Virginia, US: Master of Science in Data Science; a professional program to prepare students for the use of data analysis in major industries such as health care, business, and science.
  133. 133. Conferences on Data Science 2014 International Conference on Data Science and Engineering (26-28 August 2014). Hosted by the School of Computer Science Studies, Cochin University of Science & Technology; co-sponsored by IEEE Kerala. DataEDGE Conference: A new vision for data science (May 8-9, 2014, Berkeley, CA). Discussions will be on the way organizations are using data to address business and social issues, the challenges of working with data at scale, and the most pressing questions and debates facing data scientists today.
  134. 134. O'Reilly Strata is organising three conferences: New York (October 15-17, 2014): discussions on the complex issues and opportunities brought to business by big data, data science, and pervasive computing. Barcelona, Spain (November 19-21, 2014): discussions on big data analytics. San Jose, CA (February 18-20, 2015).
  135. 135. ASE (Academy of Science and Engineering) is organising three conferences: Stanford University, CA, USA (May 27-31, 2014); Tsinghua University, Beijing, China (August 4-7, 2014); Harvard University, Cambridge, MA, US (December 15-19, 2014). IEEE International Conference on Big Data Science and Engineering (Tsinghua University, Beijing, China, 24-26 Sept. 2014). The 2014 International Conference on Data Science and Advanced Analytics (October 30 - November 1, 2014, Shanghai, China).
  136. 136. Journals of Data Science Journal of Data Science - an international journal devoted to applications of statistical methods at large; the online version is free, the hard-copy version costs 300 USD/year. CODATA Data Science Journal - published by CODATA. EPJ Data Science - a SpringerOpen journal. International Journal of Data Science - Inderscience Publishers.
  137. 137. References http://bit.ly/1omFc3B http://bit.ly/1jZbP5F http://bit.ly/1mCBzqv http://oreil.ly/1jZc4O0 http://bit.ly/1mnyJRe http://bit.ly/1tMzzvx http://bit.ly/1pwnZlN http://bit.ly/1iq0y9a https://bitly.com/

Editor's notes

  • Data never sleeps
  • This classification shows that any group of people can be put in one of these categories. The right type of data scientist can be chosen based on the organization's requirements
    Before choosing the type of data scientist you want to become, consider the skills required or the skills you already possess to proceed in the appropriate direction.
    So who are you gonna be?? A programmer, a statistician, a marketer, a business lead or a jack of all trades??
  • Department of Statistics, Columbia University, New York + Department of Statistics and Information Science, Catholic Fu-jen University, Taipei + Data Mining Center, Renmin University of China, Beijing
