Introduction to data science

  1. Introduction to Data Science
     Prepared by Mahir Mahtab Haque
  2. What is Data Science?
     - It is a set of methodologies for taking in the thousands of forms of data available to us today and using them to draw meaningful conclusions.
     - Purpose of data science:
       - Describe the current state of an organization or process
       - Detect anomalous events
       - Diagnose the causes of events and behaviors
       - Predict future events
     - Data science workflow:
       - Collect data from various sources (surveys, web traffic results, geo-tagged social media posts, financial transactions, etc.). Once the data has been collected, store it in a safe and accessible way.
       - Prepare the raw data, also known as "cleaning" the data: find missing or duplicate values and convert the data into a more organized format (a short pandas sketch follows this slide).
       - Explore and visualize the cleaned data by building dashboards to track how the data changes over time, or by comparing two sets of data.
       - Run experiments and predictions on the data, for example building a system that forecasts temperature changes or running a test to find which web page acquires more customers.
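A minimal sketch of the "prepare the raw data" step in Python with pandas. The file name and columns (survey_results.csv, age, response, submitted_at) are hypothetical stand-ins, not from the slides.

```python
# Sketch of a cleaning pass, assuming a hypothetical survey_results.csv.
import pandas as pd

df = pd.read_csv("survey_results.csv")

# Drop duplicate rows and inspect missing values per column.
df = df.drop_duplicates()
print(df.isna().sum())

# Fill or drop missing values (column names are illustrative).
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["response"])

# Convert into a more organized format: parse dates, tidy column names.
df["submitted_at"] = pd.to_datetime(df["submitted_at"])
df.columns = df.columns.str.lower().str.replace(" ", "_")
```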
  3. Three exciting areas of data science
     - Machine learning:
       - Starts with a well-defined question (What is the probability that this transaction is fraudulent?)
       - Gathers some data to analyze (old transactions labeled as fraudulent or valid)
       - Brings in additional new data to make predictions (new credit card transactions)
     - Internet of Things (IoT):
       - Refers to gadgets that are not standard computers but still have the ability to transmit data.
       - Includes smart watches, internet-connected home security systems, electronic toll collection systems, building energy management systems, etc.
       - IoT is a great source of data for data science projects.
     - Deep learning:
       - A sub-field of machine learning in which multiple layers of algorithms called "neurons" work together to draw complex conclusions (a minimal Keras sketch follows this slide).
       - Deep learning requires much more training data (the records of data used to build an algorithm) than a traditional machine learning model, but it can also learn relationships that traditional models cannot.
       - Deep learning is used to solve data-intensive problems such as image classification or language understanding.
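To make the "multiple layers of neurons" idea concrete, here is a minimal sketch of a small deep model, assuming TensorFlow's Keras API is installed. The layer sizes and the synthetic training data are purely illustrative.

```python
# A tiny multi-layer ("deep") model; architecture and data are toy examples.
import numpy as np
from tensorflow import keras

X = np.random.rand(1000, 20)          # 1,000 training records, 20 features
y = (X.sum(axis=1) > 10).astype(int)  # made-up binary label

model = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),    # first layer of "neurons"
    keras.layers.Dense(16, activation="relu"),    # second layer
    keras.layers.Dense(1, activation="sigmoid"),  # output: a probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
```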
  4. Data science roles and tools
     - Data Engineer
       - Responsibilities: Controls the flow of data by building custom data pipelines and storage systems. Designs infrastructure so that data is not only collected, but also easy to obtain and process.
       - Focus area: Data collection and storage.
       - Tools: SQL for storing and organizing data; Java, Scala or Python for processing data; Shell on the command line to automate and run tasks.
     - Data Analyst
       - Responsibilities: Describes the data by exploring it and creating visualizations and dashboards. To do this, they first need to clean the data.
       - Focus area: Data preparation & exploration and visualization.
       - Tools: SQL for querying data (using existing databases to retrieve and aggregate relevant data); spreadsheets for simple analyses on small quantities of data; Tableau, Power BI or Looker to create dashboards and share analyses; Python or R for cleaning and analyzing data.
     - Data Scientist
       - Responsibilities: Finds new insights from data and uses traditional machine learning for prediction and forecasting.
       - Focus area: Data preparation, exploration and visualization & experimentation and prediction.
       - Tools: Proficiency in SQL, Python or R; data science libraries that provide reusable code for common data science tasks.
     - Machine Learning Scientist
       - Responsibilities: Very similar to a Data Scientist, but predicts what is likely to be true from what we already know. Uses training data to classify larger, unrulier data, whether that is classifying images that contain a car or creating a chatbot.
       - Focus area: Data preparation, exploration and visualization & experimentation and prediction.
       - Tools: Python or R to create predictive models; popular machine learning libraries such as TensorFlow to run powerful deep learning algorithms.
  5. Step 1: Data collection & storage
     - Vast amounts of data are generated daily, from surfing the internet to paying by card in a shop. The companies behind the services we use collect this data internally and use it to make data-driven decisions.
     - There are also many free, open data sources, meaning the data can be freely used, shared and built on by anyone.
     - Company data sources:
       - Web events
       - Customer data
       - Survey data
       - Logistics data
       - Financial transactions
     - Open data sources:
       - Public data APIs (application programming interfaces): Twitter, Wikipedia, Yahoo! Finance, Google Maps (a request sketch follows this slide).
       - Public records: international organizations such as the World Bank, UN and WTO; national statistical offices; government agencies.
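A small sketch of pulling open data from a public API with Python's requests library. It targets the Wikipedia REST summary endpoint; the exact URL shape may change over time, so treat it as illustrative.

```python
# Fetch a page summary from Wikipedia's public REST API.
import requests

url = "https://en.wikipedia.org/api/rest_v1/page/summary/Data_science"
response = requests.get(url, timeout=10)
response.raise_for_status()  # raise an error for a non-2xx status

data = response.json()
print(data["title"])
print(data.get("extract", "")[:200])  # first 200 characters of the summary
```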
  6. Types of data
     - Quantitative data: Data that can be counted, measured and expressed using numbers.
     - Qualitative data: Data that is descriptive and conceptual; something that can be observed, not measured.
     - Image data: An image is made up of pixels, which contain information about color and intensity. Typically, the pixels are stored in computer memory.
     - Text data: Emails, documents, reviews, social media posts, etc. These can be stored and analyzed to find relevant insights.
     - Geospatial data: Data with a location component, especially useful for navigation apps like Google Maps or Waze.
     - Network data: Data consisting of people or things in a network and the relationships between them.
  7. Data storage and retrieval
     - When storing data, there are three important things to consider:
       - Where to store the data
       - What kind of data we are storing
       - How to retrieve the data from storage
     - Location:
       - On-premises cluster, i.e., data stored across many different computers.
       - Cloud storage (Microsoft Azure, Amazon Web Services, Google Cloud), which can also carry out data analytics, machine learning and deep learning.
     - Types of data storage:
       - Unstructured data (email, text, video and audio, web pages, social media messages) is stored in a document database.
       - Tabular data is stored in a relational database.
     - Data retrieval (each type of database has its own query language):
       - Document databases mainly use NoSQL (Not Only SQL).
       - Relational databases use SQL (Structured Query Language); a small round-trip sketch follows this slide.
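A toy store-and-retrieve round trip for tabular data, using SQLite (which ships with Python) as a stand-in for a production relational database; the table and values are made up.

```python
# Store rows in a relational table, then retrieve them with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 95.0)],
)

# SQL is the query language for relational (tabular) data.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
):
    print(region, total)
conn.close()
```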
  8. Data pipelines
     - Pipelines move data through defined stages, e.g., from data ingestion through an API to loading the data into a database.
     - A key feature is that pipelines automate this movement:
       - Rather than manually running programs to collect and store data, a data engineer schedules tasks, whether hourly, daily or triggered by an event.
       - Because of this automation, data pipelines need to be monitored. Alerts can be generated automatically, for instance if 95% of storage capacity has been reached or if an API is responding with an error.
       - Data pipelines are especially important when working with lots of data from different sources.
     - There is no set way to make a pipeline; pipelines are highly customized depending on your data, storage options and the ultimate use of the data.
     - ETL (extract, transform, load) is a popular framework for data pipelines; a minimal sketch follows this slide.
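A minimal ETL sketch under stated assumptions: the API endpoint (api.example.com) and its fields are hypothetical, and a real pipeline would be scheduled by a tool such as cron or Airflow rather than run by hand.

```python
# Extract from a (hypothetical) API, transform with pandas, load into SQLite.
import sqlite3
import pandas as pd
import requests

def extract() -> list[dict]:
    # Hypothetical endpoint; replace with a real data source.
    return requests.get("https://api.example.com/events", timeout=10).json()

def transform(records: list[dict]) -> pd.DataFrame:
    df = pd.DataFrame(records).drop_duplicates()
    df["ts"] = pd.to_datetime(df["ts"])  # assumes a "ts" timestamp field
    return df

def load(df: pd.DataFrame) -> None:
    with sqlite3.connect("warehouse.db") as conn:
        df.to_sql("events", conn, if_exists="append", index=False)

if __name__ == "__main__":
    load(transform(extract()))  # a scheduler would trigger this hourly/daily
```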
  9. Steps 2 & 3: Data preparation, exploratory data analysis & visualization
     - Data preparation:
       - Skipping this step may lead to errors down the road, such as incorrect results that throw off your algorithms.
       - Tidy data is a way of presenting a matrix of data, with observations as rows and variables as columns.
     - Exploratory data analysis (EDA):
       - The process of exploring the data, formulating hypotheses about it and assessing its main characteristics, with a strong emphasis on visualization. It takes place after data preparation, but the two can overlap (a short pandas sketch follows this slide).
     - Visualization:
       - Dashboards group all relevant information in one place, making it easier to gather insights and act on them.
       - Business intelligence tools let you clean, explore and visualize data and build dashboards without any programming knowledge. Examples: Tableau, Looker, Power BI.
       - Note: make your visualizations interactive and use filters.
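A quick EDA pass over tidy data with pandas; sales.csv and its region/amount columns are hypothetical, and the plot call assumes matplotlib is installed.

```python
# Explore a tidy dataset: rows are observations, columns are variables.
import pandas as pd

df = pd.read_csv("sales.csv")

df.info()                           # column types and missing values
print(df.describe())                # summary statistics
print(df["region"].value_counts())  # category frequencies

# Visualize a hypothesis: do average sales differ by region?
df.groupby("region")["amount"].mean().plot(kind="bar")
```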
  10. Step 4: Running experiments and predictions
     - A/B testing (also known as champion/challenger testing) is used to make a choice between two options. These experiments help drive decisions and draw conclusions. Generally, they begin with a question and a hypothesis, then data collection, followed by a statistical test and its interpretation.
     - A/B testing steps:
       - Select a metric to track
       - Calculate the sample size
       - Run the experiment
       - Check for significance (the result is likely not due to chance, given the statistical assumptions made)
     - Case study: which is the better title for the blog post?
       - Form a question: Does the title of blog post A or blog post B result in more clicks?
       - Form a hypothesis: The titles of blog posts A and B result in the same number of clicks.
       - Collect data: 50% of users see title A and 50% see title B; track click-through rate until the sample size has been reached.
       - Test the hypothesis with a statistical test (t-test, z-test, ANOVA, chi-square test): Is the difference in the titles' click-through rates significant? A worked sketch follows this slide.
       - Interpret results: Choose a title, or ask more questions and design another experiment.
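A worked significance check for the blog-title case study, using a chi-square test from SciPy; the click counts below are invented for illustration.

```python
# Chi-square test on a 2x2 table of clicks vs. no-clicks per title.
from scipy.stats import chi2_contingency

#          clicks, no-clicks (out of 10,000 users shown each title)
title_a = [320, 9680]
title_b = [402, 9598]

chi2, p_value, dof, expected = chi2_contingency([title_a, title_b])
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Difference in click-through rates is significant: pick the winner.")
else:
    print("No significant difference: ask more questions, run another test.")
```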
  11. Time-series forecasting
     - What is a statistical model?
       - Represents a real-world process with statistics.
       - Captures mathematical relationships between variables, including random variables.
       - Is based on statistical assumptions and historical data.
     - Predictive modeling: a subcategory of modeling used for prediction.
       - Process: feed a new input (e.g., a future date) into a predictive model (e.g., a model of unemployment) to get an output (a prediction of next month's unemployment rate).
       - Predictive models range from a simple linear equation with an x and y variable to a very complicated deep learning algorithm.
     - Time-series data: a series of data points sequenced by time, e.g., daily stock prices or gas prices over the years.
       - It often takes the form of rates, such as monthly unemployment rates or a patient's heart rate during surgery.
       - Time-series data is usually plotted as a line graph.
       - Seasonality occurs when there are repeating patterns related to time, such as months or weeks.
       - Time-series data is used in predictive modeling to predict metrics at future dates, which is known as forecasting. We can build predictive models using time-series data from past years or decades, combining statistical and machine learning methods (a simple trend-fitting sketch follows this slide).
       - A confidence interval says the model is X% sure that the future value will fall within a given range.
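A deliberately simple forecasting sketch: fit a linear trend to a made-up monthly rate series with NumPy and extrapolate one month ahead. Real forecasting models would also account for seasonality and report confidence intervals.

```python
# Fit y = slope*x + intercept to past months, then predict the next month.
import numpy as np

rate = np.array([5.1, 5.0, 4.9, 4.9, 4.8, 4.7, 4.7, 4.6])  # toy monthly rates
months = np.arange(len(rate))

slope, intercept = np.polyfit(months, rate, deg=1)
next_month = len(rate)
print(f"Forecast for month {next_month}: {slope * next_month + intercept:.2f}")
```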
  12. Supervised machine learning
     - Machine learning: a set of methods for making predictions based on existing data.
     - Supervised machine learning: a subset of machine learning where the existing data has a specific structure, i.e., it has labels and features.
       - Labels are what we want to predict.
       - Features are data that might predict the label.
     - Abilities of supervised machine learning: recommendation systems, diagnosing biomedical images, recognizing hand-written digits, predicting customer churn.
     - Case study: customer churn prediction.
       - A customer will either stay subscribed or is likely to cancel their subscription (churn).
       - Gather training data to build the model, i.e., historical customer data where some customers maintained their subscriptions while others churned. We want to predict the label for each customer (churned/subscribed), so we need features about each customer that might affect the label (age, gender, date of last purchase, household income). Machine learning can analyze many features simultaneously.
       - Use these labels and features to train a model to make predictions on new data (a scikit-learn sketch follows this slide).
       - It is good practice not to allocate all your historical data to training. The withheld data is called a test set, and it is used to evaluate the efficacy of the model.
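A minimal supervised-learning sketch for the churn case study with scikit-learn; customers.csv and its column names are hypothetical stand-ins for the features and label described above.

```python
# Train on labeled history, hold out a test set, evaluate the model.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")
X = df[["age", "days_since_last_purchase", "household_income"]]  # features
y = df["churned"]                                                # label

# Withhold 20% of the historical data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```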
  13. Unsupervised learning
     - Clustering: a set of machine learning algorithms that divide data into categories called clusters.
       - Clusters help us see patterns in messy datasets.
       - Machine learning scientists use clustering to divide customers into segments, images into categories, or behaviors into typical and anomalous.
       - Clustering belongs to a broader category within machine learning called unsupervised learning. Unlike supervised learning, which uses data with features and labels, unsupervised learning uses data with only features. These features are basically measurements.
       - Some clustering algorithms require us to define how many clusters we want to create. The number of clusters we ask for, a choice driven by our hypothesis, greatly affects how the algorithm segments the data (a k-means sketch follows this slide).
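A clustering sketch with scikit-learn's k-means on synthetic measurements. Note that the data has features only (no labels), and the number of clusters requested is our own modeling choice.

```python
# Divide unlabeled measurements into clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 2))  # toy measurements, two per observation

# n_clusters is a hypothesis-driven choice; it shapes the segmentation.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
print(kmeans.labels_[:10])      # cluster assignment for the first 10 points
print(kmeans.cluster_centers_)  # one center per requested cluster
```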
  14. THANK YOU!
