Introduction to Data Science


  1. An Introduction to Data Science. Anoop V.S, Ph.D Research Scholar, Data Engineering Lab, Indian Institute of Information Technology and Management - Kerala (IIITM-K), Thiruvananthapuram, India. anoop.res15@iiitmk.ac.in. March 10, 2017.
  3. Why should you attend this talk? Companies have recognized the immense business value that data can deliver. This has created a huge demand for skilled professionals in data-related jobs around the world. Job profiles such as Data Scientist, Data Analyst, Big Data Engineer and Statistician are heavily sought after by companies. Not only are these roles handsomely paid, but a career in analytics has much more to promise. After the U.S., India has the largest demand for analytics / big data / data science professionals. Amid such demand, people find it confusing to select the job profile with the best future.
  4. How much can a Data Science professional earn?
  5. Which cities offer high salaries?
  6. Data Scientist - the SEXIEST JOB OF THE 21st CENTURY! It requires a mixture of multidisciplinary skills at the intersection of mathematics, statistics, computer science, communication and business. Finding a Data Scientist is hard! Finding people who understand who a Data Scientist is, is equally hard!! The trend is expected to accelerate in the coming years as data from mobile sensors, sophisticated instruments, the web, and more, grows. It is predicted that in 2020 the world will generate 50 times the amount of data it generated in 2011.
  7. What skills are needed?
  8. So, what really is Data Science? (1) Asking questions (formulating hypotheses), the answers to which solve known problems or unearth unknown solutions that in turn drive business value. (2) Defining the data needed, or working with an existing data set, and employing (computer science based) tools to collect, store and explore such data, generally in huge volume and variety. (3) Identifying the type of analysis needed to get to the answers and performing it by implementing various algorithms/tools, often in a distributed and parallel architecture. (4) Communicating the insights gathered from the analysis as simple stories/visualizations/dashboards that a non-data scientist can understand and build a conversation on. (5) Building a higher-level abstraction that does steps 2-3-4 autonomously, analyzing and acting on new data as it is fed to the system.
  9. Summing up in an image
  10. Leading by example. Two of the most famous companies in the world, Amazon and Facebook, use analytics and Big Data to shape their products, services and delivery. Amazon uses analytics to curate products on its customers' homepages based on their previous purchases and browsing habits. Facebook uses analytics to fill your news feed with updates from the people you interact with the most, content from sites you frequent, and products you have checked out on other sites.
  11. Types of analytics. Descriptive Analytics, which uses data aggregation and data mining to provide insight into the past and answer: "What has happened?" Predictive Analytics, which uses statistical models and forecasting techniques to understand the future and answer: "What could happen?" Prescriptive Analytics, which uses optimization and simulation algorithms to advise on possible outcomes and answer: "What should we do?"
  12. Descriptive Analytics: Insight into the past. Descriptive analytics, or descriptive statistics, does exactly what the name implies: it describes, or summarizes, raw data and turns it into something that humans can interpret. It is analytics that describes the past, where the past is any point in time at which an event occurred, whether one minute ago or one year ago. Descriptive analytics is useful because it allows us to learn from past behaviors and understand how they might influence future outcomes. Common examples are reports that provide historical insights into a company's production, financials, operations, sales, inventory and customers.
  13. Predictive Analytics: Understanding the future. Predictive analytics has its roots in the ability to "predict" what might happen. It provides companies with actionable insights based on data. It is important to remember that no statistical algorithm can predict the future with 100% certainty, because the foundation of predictive analytics is probability. Companies use these statistics to forecast what might happen in the future. Predictive analytics can be used throughout the organization, from forecasting customer behavior and purchasing patterns to identifying trends in sales activities.
  14. Prescriptive Analytics: Advice on possible outcomes. The relatively new field of prescriptive analytics allows users to "prescribe" a number of different possible actions and guides them towards a solution. At its best, prescriptive analytics predicts not only what will happen but also why it will happen, and provides recommendations for actions that take advantage of the predictions. Prescriptive analytics uses a combination of techniques and tools such as business rules, algorithms, machine learning and computational modelling procedures. These techniques are applied to input from many different data sets, including historical and transactional data, real-time data feeds, and big data.
  15. Now into some basics - What is Data / Information / Knowledge? Data is unprocessed facts and figures without any added interpretation or analysis: "The price of crude oil is $80 per barrel." Information is data that has been interpreted so that it has meaning for the user: "The price of crude oil has risen from $70 to $80 per barrel" gives meaning to the data and so is information to someone who tracks oil prices. Knowledge is a combination of information, experience and insight that may benefit the individual or the organisation: "When crude oil prices go up by $10 per barrel, it's likely that petrol prices will rise by Rs. 20 per litre" is knowledge.
  16. Relationship of Data, Information and Intelligence
  17. Categories of Data - A quick view. Structured Data covers all data that can be stored in an SQL database, in tables with rows and columns. It has relational keys and can be easily mapped into pre-designed fields. Today, such data is the most processed in development and the simplest kind of information to manage. Semi-structured Data does not reside in a relational database but does have some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database. Unstructured Data represents around 80% of data. It often includes text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives around unstructured data.
  18. Big Data - in recent news
  19. Big Data - in recent news
  20. Big Data - in recent news
  21. Big Data - in recent news
  22. Do you know? "90% of the world's data was generated in the last few years."!!! Big data really does mean big data: it is a collection of large datasets that cannot be processed using traditional computing techniques. Big data is not merely data; rather, it has become a complete subject, which involves various tools, techniques and frameworks. What comes under Big Data? Black box data, social media data, stock exchange data, power grid data, transport data, search engine data, etc.
  23. 3Vs of Big Data. Volume: Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden. Velocity: Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.
  24. Who uses Big Data? Banking - It is important to understand customers and boost their satisfaction, and equally important to minimize risk and fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step ahead of the game with advanced analytics. Education - Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big data, they can identify at-risk students, make sure students are making adequate progress, and implement a better system for evaluation and support. Government - When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to managing utilities, running agencies, dealing with traffic congestion or preventing crime.
  25. Who uses Big Data? Health care - Patient records. Treatment plans. Prescription information. When it comes to health care, everything needs to be done quickly, accurately and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed effectively, health care providers can uncover hidden insights that improve patient care. Manufacturing - More and more manufacturers are working in an analytics-based culture, which means they can solve problems faster and make more agile business decisions. Retail - Retailers need to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business.
  26. Operational vs. Analytical Big Data. Operational Big Data: these technologies provide operational features to run real-time, interactive workloads that ingest and store data. MongoDB is a top technology for operational Big Data applications, with over 10 million downloads of its open source software. Analytical Big Data: these technologies, on the other hand, are useful for retrospective, sophisticated analytics of your data. Hadoop is the most popular example of an Analytical Big Data technology. But picking an operational vs. analytical Big Data solution isn't the right way to think about the challenge. They are complementary technologies, and you likely need both to develop a complete Big Data solution.
  27. Traditional vs. Google's solution. In the traditional approach, an enterprise will have a computer to store and process big data. Here data will be stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be written to interact with the database, process the required data and present it to the users for analysis. Limitations: a computer and its database server can store and process only a limited volume of data, so this approach becomes a bottleneck as the data grows.
  28. Google's solution. Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset. Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an open source project called Hadoop in 2005. Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis on huge amounts of data.
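
The divide, assign and collect steps described on slide 28 can be sketched with a word count, the usual introductory MapReduce example. This is a minimal single-machine sketch, not Hadoop's API: a thread pool stands in for the networked computers, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks, workers=2):
    # Each chunk goes to a separate worker; the thread pool is only a
    # stand-in for the cluster nodes a real framework would use.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


# Hypothetical input, already split into two parts.
chunks = ["big data is big", "data science uses data"]
counts = word_count(chunks)
print(counts)
```

The map step never looks at more than one chunk, which is what lets the work be spread over many machines; only the shuffle and reduce steps need to see results from all of them.
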
  29. How does MapReduce work?
  30. Machine Learning - Learning from DATA! Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. The iterative aspect of machine learning is important because, as models are exposed to new data, they are able to adapt independently. They learn from previous computations to produce reliable, repeatable decisions and results. While many machine learning algorithms have been around for a long time, the ability to automatically apply complex mathematical calculations to big data, over and over, faster and faster, is a recent development.
  31. Here are a few widely publicized examples of machine learning applications you may be familiar with. The heavily hyped, self-driving Google car? The essence of machine learning. Online recommendation offers such as those from Amazon and Netflix? Machine learning applications for everyday life. Knowing what customers are saying about you on Twitter? Machine learning combined with linguistic rule creation. Fraud detection? One of the more obvious, important uses in our world today.
  32. How to learn from DATA? (1) Supervised Learning: (a) we have training data with correct answers; (b) use the training data to prepare the algorithm; (c) then apply it to data without correct answers. (2) Unsupervised Learning: (a) no training data; (b) throw data into the algorithm; (c) hope it makes some kind of sense out of the data.
  33. Some types of learning algorithms. Prediction: predicting a variable from data. Classification: assigning records to predefined groups. Clustering: splitting records into groups based on similarity. Association Learning: seeing what often appears together with what. Issues with learning: data is usually noisy in some way; inductive bias - the shape of the algorithm we choose may not fit the data at all, which may induce under-fitting or over-fitting.
  34. Testing our model and treating missing values. When a model is used for real problems, testing it is crucial. Testing means splitting your dataset into training data (used as input to the algorithm) and test data (used for evaluation only). You need to compute some measure of performance, such as precision / recall or root mean square error. Usually there are missing values in the dataset, and these cause problems for many machine learning algorithms. They can be handled by removing all records with NULL values, using a default value, estimating a replacement value, etc.
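
The steps on slide 34 can be sketched in plain Python: two of the listed missing-value treatments (dropping records, estimating a replacement), a train/test split, and root mean square error as the performance measure. The records, the helper names and the mean-predicting baseline "model" are all hypothetical, chosen only to keep the sketch short.

```python
import math
import random

# Hypothetical records: (hours_studied, exam_score); None marks a missing value.
records = [(1, 52), (2, None), (3, 61), (4, 66),
           (5, None), (6, 74), (7, 79), (8, 85)]


def drop_missing(rows):
    """Treatment 1: remove every record that contains a missing value."""
    return [row for row in rows if None not in row]


def fill_with_mean(rows):
    """Treatment 2: estimate a replacement (here, the mean of known scores)."""
    known = [score for _, score in rows if score is not None]
    mean = sum(known) / len(known)
    return [(hours, mean if score is None else score) for hours, score in rows]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows, then hold back a fraction for evaluation only."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


clean = drop_missing(records)
filled = fill_with_mean(records)
train, test = train_test_split(filled)

# Baseline "model" (an assumption for illustration): always predict the
# mean score seen in the training data, then score it on the test data.
baseline = sum(score for _, score in train) / len(train)
error = rmse([score for _, score in test], [baseline] * len(test))
print(f"train={len(train)} test={len(test)} rmse={error:.2f}")
```

The test rows play no part in computing `baseline`, which is the point of the split: the error is measured on data the model has not seen.
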
  35. Top 10 Machine Learning Algorithms. Machine learning algorithms are expected to replace 25% of the jobs across the world in the next 10 years!!! Naive Bayes Classifier Algorithm, K Means Clustering Algorithm, Support Vector Machine Algorithm, Apriori Algorithm, Linear Regression, Logistic Regression, Artificial Neural Networks, Random Forests, Decision Trees, Nearest Neighbours.
  36. Naive Bayes Classifier Algorithm. When to use the Naive Bayes Classifier Algorithm? If you have a moderate or large training data set. If the instances have several attributes. Given the classification parameter, the attributes which describe the instances should be conditionally independent. Applications of the Naive Bayes Classifier Algorithm: Sentiment Analysis - It is used at Facebook to analyse status updates expressing positive or negative emotions. Document Categorization - Google uses document classification to index documents and find relevancy scores. Google Mail uses the Naive Bayes algorithm to classify your emails as Spam or Not Spam.
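
To make the spam example on slide 36 concrete, here is a minimal multinomial Naive Bayes sketch with add-one (Laplace) smoothing. It is not how any mail provider implements filtering; the four training messages, the labels and the function names are hypothetical, and words are treated as conditionally independent given the label, as the slide requires.

```python
import math
from collections import Counter

# Tiny hypothetical training set of (message, label) pairs.
training = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
]


def train(examples):
    """Count messages per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in examples:
        words = text.split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify_message(text, model):
    """Return the label with the highest log posterior for the message."""
    label_counts, word_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total_messages)  # log prior P(label)
        words_in_label = sum(word_counts[label].values())
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            likelihood = (word_counts[label][word] + 1) / (
                words_in_label + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training)
print(classify_message("free money", model))
print(classify_message("meeting tomorrow", model))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on longer messages.
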
  37. K Means Clustering Algorithm. K-means is a popular unsupervised machine learning algorithm for cluster analysis. The algorithm operates on a given data set with a pre-defined number of clusters, k. The output of the K Means algorithm is k clusters, with the input data partitioned among them. Applications of the K Means Clustering Algorithm: it is used by most search engines, like Yahoo and Google, to cluster web pages by similarity and identify the relevance rate of search results. This helps search engines reduce the computational time for users.
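
The partitioning described on slide 37 can be sketched as the usual assign-then-update loop. This is a simplified illustration under stated assumptions: the first k points are used as starting centroids (real implementations usually randomise this), the number of iterations is fixed, and the six 2-D points are made up.

```python
import math


def k_means(points, k, iterations=10):
    """Partition points into k clusters; returns (centroids, clusters)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(axis) / len(cluster)
                                     for axis in zip(*cluster))
    return centroids, clusters


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0),
          (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Note that k has to be chosen before the algorithm runs, which matches the "pre-defined number of clusters" on the slide.
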
  38. Support Vector Machine Learning Algorithm. Support Vector Machine (SVM) is a supervised machine learning algorithm for classification or regression problems. The dataset teaches the SVM about the classes so that it can classify any new data. It works by finding a line (hyperplane) that separates the training data set into classes. SVM offers best classification performance (accuracy) on the training data. Applications of the Support Vector Machine Learning Algorithm: SVM is commonly used for stock market forecasting by various financial institutions. It can be used to compare the performance of a stock relative to other stocks in the same sector. This relative comparison helps guide investment decisions based on the classifications made by the SVM learning algorithm.
  39. Apriori Machine Learning Algorithm. The Apriori algorithm is an unsupervised machine learning algorithm that generates association rules from a given data set. An association rule implies that if an item A occurs, then item B also occurs with a certain probability. Most of the association rules generated are in the IF-THEN format. For example, IF people buy an iPad THEN they also buy an iPad case to protect it. The algorithm is easy to implement and can be parallelized easily. Applications of the Apriori Machine Learning Algorithm: detecting adverse drug reactions, market basket analysis, auto-complete applications.
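
The "certain probability" attached to a rule on slide 39 is usually reported as support and confidence. The sketch below computes both for the iPad rule and shows only the first two levels of Apriori-style candidate generation, not the full algorithm; the five shopping baskets, the 0.5 support threshold and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical shopping baskets.
transactions = [
    {"ipad", "ipad case"},
    {"ipad", "ipad case", "charger"},
    {"ipad", "charger"},
    {"charger"},
    {"ipad", "ipad case"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule IF antecedent THEN consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


def frequent_singles_and_pairs(baskets, min_support):
    """First two Apriori levels: frequent single items, then frequent pairs."""
    items = sorted({item for basket in baskets for item in basket})
    singles = [frozenset([item]) for item in items
               if support({item}, baskets) >= min_support]
    # Apriori principle: a pair can only be frequent if both of its items
    # are frequent, so candidates are built from frequent singles alone.
    candidates = [a | b for a, b in combinations(singles, 2)]
    pairs = [c for c in candidates if support(c, baskets) >= min_support]
    return singles, pairs


rule_support = support({"ipad", "ipad case"}, transactions)
rule_confidence = confidence({"ipad"}, {"ipad case"}, transactions)
singles, pairs = frequent_singles_and_pairs(transactions, min_support=0.5)
print(rule_support, rule_confidence, pairs)
```

In this made-up data, three of the four baskets containing an iPad also contain a case, so the rule holds for most but not all iPad buyers.
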
  40. Linear Regression Machine Learning Algorithm. The Linear Regression algorithm shows the relationship between 2 variables and how a change in one variable impacts the other. It shows the impact on the dependent variable of changing the independent variable. It is one of the most interpretable machine learning algorithms, making it easy to explain to others. It is one of the most widely used machine learning techniques and runs fast. Applications of the Linear Regression Machine Learning Algorithm: Estimating Sales - Linear Regression finds great use in business, for sales forecasting based on trends. Risk Assessment - Linear Regression helps assess the risk involved in the insurance or financial domain. A health insurance company can do a linear regression analysis on the number of claims per customer against age.
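
The claims-against-age example on slide 40 can be sketched with ordinary least squares for a single independent variable. The figures below are invented and deliberately lie on a straight line so the fitted values are easy to check by hand; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: customer age vs. average number of claims per year.
ages = [25, 35, 45, 55, 65]
claims = [1.0, 1.5, 2.0, 2.5, 3.0]

slope, intercept = fit_line(ages, claims)
predicted_at_50 = slope * 50 + intercept
print(f"claims = {slope:.3f} * age + {intercept:.3f}")
print(f"predicted claims at age 50: {predicted_at_50:.2f}")
```

The slope is what makes the model easy to explain: it states how much the dependent variable (claims) changes for each additional year of the independent variable (age).
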
  41. Decision Tree Machine Learning Algorithm. A decision tree is a graphical representation that uses a branching methodology to show all possible outcomes of a decision, based on certain conditions. In a decision tree, each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a particular class label. The classification rules are represented by the paths from the root to the leaf nodes. Applications of the Decision Tree Machine Learning Algorithm: decision trees are among the popular machine learning algorithms that find great use in finance for option pricing. Decision tree algorithms are used by banks to classify loan applicants by their probability of defaulting on payments.
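
The node, branch and leaf structure on slide 41 can be sketched as nested dictionaries. The tree below is hand-built for illustration rather than learned from data, and its attributes, thresholds and risk labels are hypothetical; the sketch only shows how a record is routed from the root to a leaf and how that path reads as a classification rule.

```python
# Hypothetical hand-built tree for loan applicants. Internal nodes test an
# attribute against a threshold; leaf nodes carry a class label.
tree = {
    "attribute": "income",
    "threshold": 30000,
    "below": {"label": "high risk"},
    "at_or_above": {
        "attribute": "existing_debt",
        "threshold": 10000,
        "below": {"label": "low risk"},
        "at_or_above": {"label": "medium risk"},
    },
}


def classify_applicant(node, applicant, path=None):
    """Walk from the root to a leaf; return (label, tests passed on the way)."""
    path = [] if path is None else path
    if "label" in node:  # leaf node: the class label is the answer
        return node["label"], path
    value = applicant[node["attribute"]]
    branch = "below" if value < node["threshold"] else "at_or_above"
    sign = "<" if branch == "below" else ">="
    path.append(f"{node['attribute']} {sign} {node['threshold']}")
    return classify_applicant(node[branch], applicant, path)


label, rule = classify_applicant(tree, {"income": 45000, "existing_debt": 2000})
print(label, "because", " AND ".join(rule))
```

Each root-to-leaf path is one IF-THEN rule, which is why decision trees are easy to inspect and explain.
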
  42. The Best Machine Learning Libraries in Python. Python is one of the best languages you can use to learn (and implement) machine learning techniques, for a few reasons. It's simple - Python is becoming the language of choice among new programmers thanks to its simple syntax and huge community. It's powerful - Just because something is simple doesn't mean it isn't capable. Python is also one of the most popular languages among data scientists and web programmers. Its community has created libraries to do just about anything you want, including machine learning. Lots of ML libraries - There are tons of machine learning libraries already written for Python. You can choose one of the hundreds of libraries based on your use case, skill, and need for customization.
  43. The Best Machine Learning Libraries in Python - contd. TensorFlow - a high-level neural network library that helps you program your network architectures while avoiding the low-level details. scikit-learn - The scikit-learn library is definitely one of the most popular ML libraries out there, if not the most popular, among all languages. It has a huge number of features for data mining and data analysis, making it a top choice for researchers and developers alike. Theano - a machine learning library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays, which can be a point of frustration for some developers in other libraries.
  44. The Best Machine Learning Libraries in Python - contd. Pylearn2 - Most of Pylearn2's functionality is actually built on top of Theano, so it has a pretty solid base. Pyevolve - Pyevolve provides a great framework to build and execute genetic algorithms and neural networks. Pattern - This is more of a "full suite" library, as it provides not only some ML algorithms but also tools to help you collect and analyze data. The data mining portion helps you collect data from web services like Google, Twitter, and Wikipedia. The nice thing about including these tools is how easy it becomes to both collect and train on data in the same program.
  45. Machine Learning & Big Data Analytics - The perfect marriage. TWO orthogonal aspects!! Big Data - handling massive data volumes! Analytics / Machine Learning - learning insights from data! They can be combined to give accurate, effective analysis!!!
  46. Books I recommend for Machine Learning
  47. Books I recommend for Big Data, Machine Learning
  48. Thank you for not yawning! Questions?
