Data science and business analytics

About
Evolution of data, data science, business analytics, applications, the relationship between AI, ML, DL and data science, tools for data science, the life cycle of data science with a case study, algorithms for data science, data science research areas, and the future of data science.


Data science and business analytics

  1. Data Science and Business Analytics
     Dr. M. Inbavalli, Vice Principal & Head, Research Department of Computer Science, Marudhar Kesari Jain College for Women, Vaniyambadi-635751
  2. Overview
     • Evolution of Data
     • Data Science
     • Business Analytics
     • Applications
     • AI, ML, DL and Data Science: Relationship
     • Tools for Data Science
     • Life cycle of data science with case study
     • Algorithms for Data Science
     • Data Science Research Areas
     • Future of Data Science
  3. Data All Around
     • Data has become the most abundant thing today
     • Explosion of data in pretty much every domain
     • Lots of data is being collected and warehoused
     • Web data, e-commerce
     • Financial transactions, bank/credit transactions
     • Online trading and purchasing
     • Social networks
  4. Data All Around
     • Sensing devices and sensor networks that can monitor everything 24/7, from temperature to pollution to vital signs
     • Increasingly sophisticated smartphones
     • The Internet and social networks make it easy to publish data
     • Scientific experiments and simulations produce astronomical volumes of data
     • Internet of Things (IoT)
     • Datafication: taking all aspects of life and turning them into data (e.g., what you like or enjoy has been turned into a stream of your "likes")
  5. Data Science: Why all the excitement?
  6. How Much Data Do We Have?
     • Data volumes are expected to grow much larger
     • Over 2.5 quintillion bytes of data are created every single day
  7. How Much Data Do We Have?
     • What can you do with traffic prediction data?
     • Crowdsourcing + physical modeling + sensing + data assimilation (from the Institute for Transportation Studies)
  8. How to handle that data?
     • Data is just like crude oil. It is valuable, but if unrefined it cannot really be used. Oil has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; likewise, data must be broken down and analyzed for it to have value.
     • How do we extract interesting, actionable insights and scientific knowledge?
  9. Data Science: why the excitement?
     • Data science uses computer science, statistics, machine learning, visualization and human-computer interaction to collect, clean, integrate, analyze, visualize and interact with data in order to create data products.
     • Turn data into data products.
  10. Data Science: why the excitement?
      • Theories and techniques from many fields and disciplines are used to investigate and analyze large amounts of data to help decision makers in many areas such as science, engineering, economics, politics, finance, and education
      • Computer Science: pattern recognition, visualization, data warehousing, high-performance computing, databases, AI
      • Mathematics: mathematical modeling
      • Statistics: statistical and stochastic modeling, probability
      • Data science (DS) is a multidisciplinary field of study whose goal is to address the challenges of big data
  11. Data Science: why the excitement? (cont.)
      • Data science is a blend of tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data.
      • It focuses on statistical modeling, machine learning, management and analysis of data sets, and data acquisition.
      • Data science makes use of several statistical procedures.
      • These procedures range from data transformations and data modeling to statistical operations (descriptive and inferential statistics) and machine learning modeling.
      • To get predictive responses from the models, it is essential to understand the underlying patterns in the data. Optimization techniques can then be used to meet the business requirements of the user.
  12. Data Science: why the excitement? (cont.)
      • Using various statistical tools, a data scientist develops models. These models help clients in the decision-making process and also support demand generation initiatives.
      • Data science also covers:
        • Data integration
        • Distributed architecture
        • Automating machine learning
        • Data visualization
        • Dashboards and BI
        • Data engineering
        • Deployment in production mode
        • Automated, data-driven decisions
  13. Example: Search
      • Google's revenue from marketing is around $50 bn/year, 97% of the company's revenue.
      • Sponsored search uses an auction: a pure competition among marketers trying to win access to consumers.
      • In other words, it is a competition between models of consumers (their likelihood of responding to the ad) and of determining the right bid for the item.
      • There are around 30 billion search requests a month, and perhaps a trillion events of history across search providers.
      • Google AdWords and AdSense
  14. Data Science Applications
      • Transaction databases → recommender systems (Netflix), fraud detection (security and privacy)
      • Wireless sensor data → smart home, real-time monitoring, Internet of Things
      • Text data, social media data → product reviews and consumer satisfaction (Facebook, Twitter, LinkedIn), e-discovery
      • Software log data → automatic troubleshooting (Splunk)
      • Genotype and phenotype data → Epic, 23andMe, patient-centered care, personalized medicine
  15. Other Applications
      • Banking: make smarter decisions through fraud detection, management of customer data, risk modeling, real-time predictive analytics, customer segmentation, etc.
      • Fraud detection covers credit card, insurance, and accounting fraud.
      • Banks can analyze customers' investment patterns and cycles and suggest offers that suit them.
      • Risk modeling through data science lets banks assess their overall performance.
      • In real-time and predictive analytics, banks use machine learning algorithms to improve their analytics strategy.
  16. Other Applications
      • Customer sentiment analysis techniques can boost social media interaction, improve feedback and analyze customer reviews.
      • Manufacturing: IoT enables companies to predict potential problems, monitor systems and analyze the continuous stream of data.
      • Uber uses data science for price optimization and to provide a better experience to its customers. Using powerful predictive tools, it predicts the price based on parameters such as weather patterns, availability of transport, customers, etc.
  17. Data
      • Measurable units of information gathered or captured from the activity of people, places and things.
      • Data is generated from different sources such as financial logs, text files, multimedia forms, sensors, and instruments.
      • We need to understand:
        • which data to use
        • how to organize the data, and so on
      • Both structured and unstructured data must be prepared for the analytics team to use for model building.
      • Types of data:
        • Relational data (tables/transactions/legacy data)
        • Text data (Web)
        • Semi-structured data (XML)
        • Graph data: social networks, Semantic Web (RDF), …
        • Streaming data
  18. What do we do with the data?
      • Aggregation and statistics
        • Data warehousing and OLAP
      • Indexing, searching, and querying
        • Keyword-based search
        • Pattern matching (XML/RDF)
      • Knowledge discovery
        • Data mining
        • Statistical modeling
      • Example: data science
        • Companies learn your secrets, shopping patterns, and preferences
        • E.g., can we know whether a child likes animation games, even if they don't want us to know?
      • Building and maintaining a data warehouse is a key skill that a data engineer must have.
  19. Data Engineering and Business Analytics
      • Data engineers build pipelines that extract data from multiple sources and then transform it to make it usable.
      • Business analytics (BA) is the practice of iterative, methodical exploration of an organization's data, with an emphasis on statistical analysis. Business analytics is used by companies committed to data-driven decision-making.
      • BA activities must be anchored to a strategically relevant business question that is to be answered using data analysis.
  20. Data Science and Business Analytics
      • Data science or analytics is the process of deriving insights from data in order to make optimal decisions.
      • Data science and analytics techniques include basic statistics, regression, simulation and optimization modeling, data mining and machine learning, text analytics, artificial intelligence and visualization.
      • Data science focuses on data modelling and data warehousing to track the ever-growing data set. The information extracted through data science applications is used to guide business processes and reach organisational goals.
  21. Databases vs. Data Science
      • Data volume: modest (databases); massive (data science)
      • Examples: bank records, personnel records, census, medical records (databases); online clicks, GPS logs, tweets, building sensor readings (data science)
      • Priorities: consistency, error recovery, auditability (databases); speed, availability, query richness (data science)
      • Structured: strongly, with a schema (databases); weakly or not at all, e.g. text (data science)
      • Properties: transactions, ACID* (databases); CAP* theorem (2/3), eventual consistency (data science)
      • Realizations: SQL (databases); NoSQL: MongoDB, CouchDB, HBase, Cassandra, Riak, Memcached, Apache River, … (data science)
  22. Business Intelligence (BI) vs. Data Science
      • Data sources: structured, usually SQL and often a data warehouse (BI); both structured and unstructured, e.g. logs, cloud data, SQL, NoSQL, text (data science)
      • Approach: statistics and visualization (BI); statistics, machine learning, graph analysis, Natural Language Processing (NLP) (data science)
      • Focus: past and present (BI); present and future (data science)
      • Tools: Pentaho, Microsoft BI, QlikView, R (BI); RapidMiner, BigML, Weka, R (data science)
  23. Data Science vs. ML vs. AI
      • Tools
        • Data Science: SAS, Tableau, Apache Spark, MATLAB, SQL
        • ML: Amazon Lex, IBM Watson Studio, Microsoft Azure ML Studio
        • AI: TensorFlow, scikit-learn, Keras, Amazon Lex, Google Cloud Platform, DataRobot
      • Basis
        • Data Science deals with structured and unstructured data.
        • Machine Learning uses statistical models.
        • Artificial Intelligence uses logic and decision trees.
      • Popular examples
        • Data Science: fraud detection and healthcare analysis
        • ML: recommendation systems such as Spotify, and facial recognition
        • AI: chatbots and voice assistants
      • Main applications
        • Data Science: credit card fraud, ATM theft, disease prediction, pattern identification, etc.
        • ML: online recommender systems, Google search algorithms, Facebook auto friend-tagging suggestions, etc.
        • AI: Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc.
  24. Relationship between Data Science, Artificial Intelligence and Machine Learning
      • Machine Learning for Predictive Reporting
        • Studies transactional data to make valuable predictions
        • Also known as supervised learning
        • Implemented to suggest the most effective courses of action for any company
      • Machine Learning for Pattern Discovery
        • Sets parameters in various data reports
        • Unsupervised learning, where there are no pre-decided parameters
      • Artificial Intelligence represents an action-planned feedback of perception: Perception > Planning > Action > Feedback of Perception
      • Data Science uses different parts of this pattern or loop to solve specific problems
  25. Relationship between Data Science, AI and ML (cont.)
      • In the first step, Perception, data scientists try to identify patterns with the help of the data.
      • In Planning, there are two aspects:
        • Finding all possible solutions
        • Finding the best solution among all solutions
      • Machine learning is hard to grasp as a standalone subject; it is best understood in the context of its environment.
      • AI is the tool that helps data science get results and solutions for specific problems. However, machine learning is what helps in achieving that goal.
      • Example: Google's search engine is a product of data science. It uses predictive analysis, a system used by artificial intelligence, to deliver intelligent results to users.
  26. Tools for Data Science
      • Reporting and Business Intelligence
      • Predictive Modelling and Machine Learning
      • Artificial Intelligence
      • Data Science Tools for Big Data (Volume)
        • Data from 1 GB to 10 GB: traditional tools such as Excel, Access, SQL, etc.
        • More than 10 GB: Hadoop, Hive
      • Tools for Handling Variety
  27. Tools for Handling Variety
      • Voluminous
      • Customer feedback may vary in length, sentiment, and other factors.
      • Examples of SQL databases are Oracle, MySQL and SQLite, whereas NoSQL includes popular databases such as MongoDB, Cassandra, etc.
      • These NoSQL databases are seeing wide adoption because of their ability to scale and handle dynamic data.
  28. Tools for Handling Velocity
      • Velocity is the speed at which the data is captured.
      • It includes both real-time and non-real-time data.
      • Examples of real-time data:
        • Sensor data collected by self-driving cars (automatic actions)
        • CCTV
        • Stock trading
        • Fraud detection for credit card transactions
        • Network data: social media (Facebook, Twitter, etc.)
      • Tools:
        • Apache Kafka: real-time data pipelines
        • Apache Storm: can process up to 1 million tuples per second and is highly scalable
        • Amazon Kinesis: licensed and powerful
        • Apache Flink: high performance, fault tolerance, and efficient memory management
  29. Tools by Category
      • Reporting and BI tools: Excel, QlikView, Tableau, MicroStrategy, Power BI, Google Analytics, Dundas, Sisense, etc.
      • Predictive analytics and machine learning tools: Python, R, Apache Spark, Julia, Jupyter Notebooks
      • Frameworks for deep learning: TensorFlow, PyTorch, Keras and Caffe
      • AI tools: AutoKeras, Google Cloud AutoML, IBM Watson, DataRobot, H2O's Driverless AI, and Amazon's Lex
      • Licensed: SAS, SPSS, MATLAB
  30. Lifecycle of Data Science
  31. Role of a Data Scientist
      • Identifying the data-analytics problems that offer the greatest opportunities to the organization
      • Determining the correct data sets and variables
      • Collecting large sets of structured and unstructured data from disparate sources
      • Cleaning and validating the data to ensure accuracy, completeness, and uniformity
      • Devising and applying models and algorithms to mine the stores of big data
      • Analyzing the data to identify patterns and trends
      • Interpreting the data to discover solutions and opportunities
      • Communicating findings to stakeholders using visualization and other means
  32. Phases 1 to 3
      • Phase 1: Discovery
        • Understand the various specifications, requirements, priorities and required budget.
        • This phase requires the ability to ask the right questions.
        • Frame the business problem and formulate initial hypotheses (IH) to test.
      • Phase 2: Data preparation
        • Data cleaning, transformation, and visualization. This helps you spot outliers and establish relationships between the variables. R can be used for this.
      • Phase 3: Model planning
        • Determine the methods and techniques used to draw the relationships between variables.
        • These relationships set the base for the algorithms in the next phase.
        • Apply Exploratory Data Analysis (EDA) using various statistical formulas and visualization tools.
  33. Tools for Model Planning
      • R has a complete set of modeling capabilities and provides a good environment for building interpretive models.
      • SQL Analysis Services can perform in-database analytics using common data mining functions and basic predictive models.
      • SAS/ACCESS can be used to access data from Hadoop and is used for creating repeatable and reusable model flow diagrams.
  34. Phase 4: Model building
      • Develop datasets for training and testing purposes.
      • Use various learning techniques such as classification, association and clustering to build the model.
      • Examples:
        1. Classification (decision trees)
        2. Clustering (K-means, Fuzzy C-means, Hierarchical Clustering, DBSCAN)
        3. Association rules
        4. Advanced supervised machine learning algorithms (Naive Bayes, k-NN, SVM)
        5. Intro to ensemble learning algorithms (Random Forest, Gradient Boosting)
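
The first bullet of Phase 4, developing separate training and testing datasets, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `train_test_split`, the 25% test fraction and the seed are assumptions made for the example, not something the slides specify.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists.

    The name, default fraction and seed are illustrative assumptions.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(20))                      # stand-in for 20 labelled examples
train, test = train_test_split(records, test_fraction=0.25, seed=42)
print(len(train), len(test))
```

Holding the test rows out of model fitting is what lets the later phases judge the model on data it has not seen.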
  35. Phases 5 and 6
      • Phase 5: Operationalize
        • Analyze the data to identify patterns and trends
        • Interpret the results
        • Deliver final reports, briefings, code and technical documents
        • Run a pilot project
      • Phase 6: Communicate results
        • Identify all the key findings, communicate them to the stakeholders and determine whether the results of the project are a success or a failure
  36. Foundations
      • Basic statistics
        1. Random variables, sampling
        2. Distributions and statistical measures
        3. Hypothesis testing
      • Overview of linear algebra
        1. Linear algebra and matrix computations
        2. Functions, derivatives, convexity
      • Modeling techniques: regression
        1. Mathematical modeling process
        2. Linear regression
        3. Logistic regression
      • Data visualization and visual analytics
        1. Visual analytics
        2. Visualizations in Python and visual analytics in IBM Watson Analytics
  37. Techniques
      • Data visualization and visual analytics
        1. Visual analytics
        2. Visualizations in Python and visual analytics in IBM Watson Analytics
      • Data mining and machine learning
        1. Classification (decision trees)
        2. Clustering (K-means, Fuzzy C-means, Hierarchical Clustering, DBSCAN)
        3. Association rules
        4. Advanced supervised machine learning algorithms (Naive Bayes, k-NN, SVM)
        5. Intro to ensemble learning algorithms (Random Forest, Gradient Boosting)
      • Simulation modeling
        1. Random number generation
        2. Monte Carlo simulations
        3. Simulation in IPython
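
To make the simulation modeling items concrete, here is a small Monte Carlo sketch built on random number generation. The choice of problem (estimating π from random points in the unit square) and the names `estimate_pi`, `n_samples` and `seed` are assumptions for illustration; the slides do not name a specific simulation.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square.

    The fraction of points that land inside the quarter circle of
    radius 1 approximates pi / 4.
    """
    rng = random.Random(seed)                  # seeded generator for repeatable runs
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples


approx = estimate_pi(100_000, seed=1)
print(approx)
```

The estimate tightens slowly as the sample count grows, which is typical of Monte Carlo methods: the error shrinks roughly with the square root of the number of samples.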
  38. Real-time example
      • Case study: diabetes prevention
      • What if we could predict the occurrence of diabetes and take appropriate measures beforehand to prevent it?
      • You can refer to the sample data below.
      • Step 1: Discovery
      • Attributes:
        • npreg: number of times pregnant
        • glucose: plasma glucose concentration
        • bp: blood pressure
        • skin: triceps skinfold thickness
        • bmi: body mass index
        • ped: diabetes pedigree function
        • age: age
        • income: income
  39. Step 2: Data Preparation
      • Once we have the data, we need to clean and prepare it for analysis.
      • The data has a lot of inconsistencies, such as missing values, blank columns, abrupt values and incorrect data formats, which need to be cleaned.
      • We have organized the data into a single table under different attributes, making it look more structured.
  40. Step 2 (cont.)
      • This data has a lot of inconsistencies.
      • In the column npreg, "one" is written in words, whereas it should be in numeric form, i.e. 1.
      • In the column bp, one of the values is 6600, which is impossible (at least for humans), as blood pressure cannot reach such a huge value.
      • The income column is blank and also makes no sense for predicting diabetes. It is therefore redundant and should be removed from the table.
      • Clean and preprocess this data by removing the outliers, filling in the null values and normalizing the data types. This is data preprocessing.
      • Finally, we get clean data that can be used for analysis.
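
The cleaning steps in Step 2 can be sketched in a few lines of standard-library Python. Everything specific here is an assumption made for the example: the sample rows are invented (only the "one" and 6600 problems come from the slide), the 300 cutoff for blood pressure is an arbitrary plausibility bound, and filling gaps with the column median is one common choice among several.

```python
import statistics

# Hypothetical sample rows; only the "one" and 6600 issues are taken from the slide.
raw_rows = [
    {"npreg": "6", "glucose": "148", "bp": "72", "income": ""},
    {"npreg": "one", "glucose": "85", "bp": "66", "income": ""},
    {"npreg": "8", "glucose": "183", "bp": "6600", "income": ""},
    {"npreg": "1", "glucose": "", "bp": "64", "income": ""},
]

WORD_TO_NUMBER = {"zero": 0, "one": 1, "two": 2, "three": 3}  # illustrative subset


def to_number(text):
    """Normalize the data type: spelled-out words and digit strings become floats."""
    text = text.strip().lower()
    if text == "":
        return None                            # blank cell means missing value
    if text in WORD_TO_NUMBER:
        return float(WORD_TO_NUMBER[text])
    return float(text)


def clean(rows, bp_max=300.0):
    """Drop the income column, null out impossible bp values, fill gaps with medians."""
    cleaned = []
    for row in rows:
        record = {k: to_number(v) for k, v in row.items() if k != "income"}
        if record["bp"] is not None and record["bp"] > bp_max:
            record["bp"] = None                # treat the outlier as missing
        cleaned.append(record)
    for column in cleaned[0]:
        known = [r[column] for r in cleaned if r[column] is not None]
        fill = statistics.median(known)
        for r in cleaned:
            if r[column] is None:
                r[column] = fill
    return cleaned


cleaned = clean(raw_rows)
for record in cleaned:
    print(record)
```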
  41. Step 3: Model Planning
      • Load the data into the analytical sandbox and apply various statistical functions.
      • R has functions such as describe, which gives us the number of missing values and unique values.
      • We can also use the summary function, which gives us statistical information such as the mean, median, range, min and max values.
      • Then we use visualization techniques such as histograms, line graphs and box plots to get a fair idea of the distribution of the data.
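
The slide describes R's describe and summary functions. As a rough Python analogue, and not a reproduction of what those R functions print, the same quantities can be computed with the standard `statistics` module. The `summarize` helper and the glucose values are hypothetical.

```python
import statistics


def summarize(values):
    """Return counts and basic statistics for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
    }


glucose = [148, 85, 183, None, 89, 137, 85]    # hypothetical readings
summary = summarize(glucose)
for name, value in summary.items():
    print(f"{name}: {value}")
```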
  42. Step 4: Model Building
      • Use a supervised learning technique to build a model here.
  43. Step 5: Deliver the Model
      • Check with sample data.
      • Data:
        ○ Data tables and data types
        ○ Operations on tables
        ○ Basic plotting
        ○ Tidy data / the ER model
        ○ Relational operations
        ○ SQL wrangling
        ○ Data acquisition (load and scrape)
        ○ EDA vis / grammar of graphics
        ○ Data cleaning (text, dates)
        ○ EDA: summary statistics
        ○ Data analysis with optimization (derivatives)
        ○ Data transformations
        ○ Missing data
  44. Modeling
        ○ Univariate probability and statistics
        ○ Hypothesis testing
        ○ Multivariate probability and statistics (joint and conditional probability, Bayes' theorem)
        ○ Data analysis with geometry (vectors, inner products, gradients and matrices)
        ○ Linear regression
        ○ Logistic regression
        ○ Gradient descent (batch and stochastic)
        ○ Trees and random forests
        ○ K-NN
        ○ Naïve Bayes
        ○ Clustering
        ○ PCA
  45. Sample Algorithms for Data Science Analytics: Regression
      • The most popular technique for regression is least squares. This method calculates the best-fitting line.
      • It is based on historical data.
      • Examples:
        • Weather forecasting
        • Assessing risk
      • Tools:
        • TensorFlow and PyTorch
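
As a sketch of how least squares finds the best-fitting line, the closed-form solution for a single input variable is slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The function name `fit_line` and the toy data are assumptions for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of co-deviations of x and y.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]                          # toy data lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy historical data the points will not sit exactly on a line, and the same formulas return the line that minimizes the sum of squared vertical distances.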
  46. Logistic Regression
      • Logistic regression is similar to linear regression, but it is used when the output is binary (i.e. when the outcome can have only two possible values). The prediction for this final output is a non-linear S-shaped function called the logistic function, g().
      • Figure: graph of a logistic regression curve showing the probability of passing an exam versus hours of studying
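
The logistic function g(z) = 1 / (1 + e^(-z)) maps any real number to a probability between 0 and 1. The sketch below applies it to the exam example from the slide; the `weight` and `bias` values are made up for illustration and are not fitted to any data.

```python
import math


def logistic(z):
    """S-shaped logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_pass(hours, weight, bias, threshold=0.5):
    """Return (probability of passing, binary decision) for a number of study hours."""
    p = logistic(weight * hours + bias)
    return p, p >= threshold


# Hypothetical coefficients; a real model would learn them from data.
p_four, pass_four = predict_pass(4.0, weight=1.5, bias=-4.0)
p_one, pass_one = predict_pass(1.0, weight=1.5, bias=-4.0)
print(round(p_four, 3), pass_four)
print(round(p_one, 3), pass_one)
```

The threshold turns the probability into the binary outcome the slide refers to; 0.5 is a conventional default rather than a requirement.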
  47. Decision Trees and Naive Bayes
      • Decision Trees
        • Decision trees can be used for both regression and classification tasks.
        • Categorical variable decision tree: predict whether a customer will pay their renewal premium with an insurance company (yes/no).
        • Continuous variable decision tree: predict customer income based on occupation, product, and various other variables.
        • Examples: C4.5, CART
      • Naive Bayes
        • A classification technique
        • It measures the probability of each class, and the conditional probability for each class given values of x. It is often used for classification problems with a binary yes/no outcome.
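
The Naive Bayes description above (class probabilities combined with per-class conditional probabilities) can be sketched as a tiny word-count classifier. This is a simplified multinomial variant with add-one smoothing; the function names, the training messages and the decision to skip unseen words are assumptions for the example.

```python
import math
from collections import Counter, defaultdict


def train(documents):
    """documents: list of (list_of_words, label). Returns the counts the classifier needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)              # log P(class)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue                                   # ignore unseen words
            count = word_counts[label][word]
            # Add-one (Laplace) smoothing keeps unseen word/class pairs above zero.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training messages for illustration.
training = [
    ("win free prize now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting schedule for monday".split(), "ham"),
    ("project report attached".split(), "ham"),
]
model = train(training)
print(predict(model, "free prize".split()))
print(predict(model, "monday meeting report".split()))
```

The "naive" part is the assumption that words are independent given the class, which is what allows the per-word probabilities to be multiplied (here, added in log space).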
  48. Naive Bayes (cont.) and Other Algorithms
      • Examples: text classification, spam filtering, sentiment analysis, recommendation systems
      • Types: Gaussian Naive Bayes, Multinomial Naive Bayes, Bernoulli Naive Bayes
      • SVM
      • KNN
      • K-means
      • Dimensionality reduction
  49. ANN
      • Feed-forward networks: multilayer perceptrons
      • Convolutional neural networks: classification, object detection, or even image segmentation
      • Hierarchical object extractors
  50. What do Data Scientists do?
      • National security
      • Cyber security
      • Business analytics
      • Engineering
      • Healthcare
      • And more …
  51. A Data Scientist must possess
      • Mathematics and applied mathematics
      • Applied statistics / data analysis
      • Solid programming skills (R, Python, Julia, SQL)
      • Data mining
      • Database storage and management
      • Machine learning and discovery
  52. Data Science Research Areas
      • Machine learning
      • Artificial intelligence
      • Deep learning
      • Databases
      • Statistics
      • Optimization
      • Natural language processing
      • Computer vision
      • Speech processing
      • Privacy
      • Ethics
      • Energy consumption
      • Cloud computing
      • IoT
      • Cloud
      • Social media
      • Blockchain, etc.
  53. Future of Data Science and Analytics
  54. Thank You