
So your boss says you need to learn data science

Interested in data science but confused by all the terminology? Not sure where to start? This presentation breaks down the concepts and the terminology.



  1. So your boss wants you to learn data science – Susan Ibach, susan@aigaming.com, @HockeyGeekGirl
  2. Data science has become a buzzword: "I think we need to do data science!"
  3. When your boss walks up to you and says "we need to do data science," where do you start? Which platform to use? Data science? Big data? AI? ML?
  4. What is a data scientist? Advanced math skills + subject matter expertise + data engineering skills
  5. Follow the 7 steps to data science success
  6. Step 1: Identify your problem and the data needed to define it
  7. What insights might help solve or define the problem? Example: an airline wants to prevent flight delays
  8. Different insights require different tools:
     SELECT COUNT(*) FROM FLIGHTS WHERE ACTUAL_ARR_TIME > SCHED_ARR_TIME
     SELECT COUNT(*) FROM FLIGHTS WHERE ACTUAL_ARR_TIME > SCHED_ARR_TIME AND YEAR(DEP_DATE) BETWEEN 1997 AND 2017
  9. Data science tools include:
     • Data mining – gain insights from data ("those who bought this also bought", keyword extraction)
     • Machine learning – make predictions (who will need hospitalization from the flu? how many copies of this book will I sell?)
     • Deep learning – for complex data processed in layers (is there a bird in this photo? will this person get cancer?)
  10. Do we need artificial intelligence? AI is when a computer completes a task that normally requires human intelligence: answering questions from a customer, recognizing the content of a photo, understanding human speech. We use data science to analyze and recognize patterns and responses so we can do AI.
  11. Step 2: Collect data
  12. What data would help you determine which flights are most likely to be delayed next week?
  13. Where do I get all that data? Relational databases, BLOB storage, NoSQL databases, data warehouses, flat files, open source data, sensors
  14. When does data become "big data"? High volume, high velocity, high variety
  15. Step 3: Prepare data
  16. Your data will need clean-up/prep:
      Flight # | Dep Date    | Sched Dep Time | Dep Airport | Dep Delay
      041      | 15-dec-2016 | 09:20          | YYZ         | 253:26
      386      | 15-dec-2016 | 15:20          | YYZ         |
      415      | 15-dec-2016 | 19:15          | YYZ         | 0:02
      415      | 15-dec-2016 | 19:15          | YYZ         | 0:02

      Date       | Airport | Wind      | Precipitation | Precipitation Type
      15/12/2016 | Pearson | NNE 5 MPH | 150 mm        | Snow
      15/12/2016 | Dulles  | SW 18 MPH | 7 mm          | Rain
      15/12/2016 | Reagan  | SW 18 MPH | 7 mm          | Rain

      Common problems: missing values, duplicate rows, different data formats, decomposition, outliers, scaling
  17. What tools might you use for data prep? Start with what you already know (Excel, SQL); write your own code (Python pandas library, R); or use third-party products (Experian, Paxata, Alteryx, SAP Lumira, Teradata Data Lab, Knowledge Works, Datameer)
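The clean-up problems from the flight table (missing values, a duplicated row, non-standard date strings) can be sketched with the pandas library the slide mentions. This is a minimal illustration, not the deck's own code; the column names and the choice to fill missing delays with 0 are assumptions for the example.

```python
import pandas as pd

# Illustrative flight data mirroring the slide's problems: a missing delay,
# a duplicated row, and dates stored as "15-dec-2016" strings.
flights = pd.DataFrame({
    "flight": ["041", "386", "415", "415"],
    "dep_date": ["15-dec-2016", "15-dec-2016", "15-dec-2016", "15-dec-2016"],
    "dep_airport": ["YYZ", "YYZ", "YYZ", "YYZ"],
    "dep_delay_min": [253.4, None, 0.03, 0.03],
})

flights = flights.drop_duplicates()                            # duplicate rows
flights["dep_delay_min"] = flights["dep_delay_min"].fillna(0)  # missing values (0 is an assumption)
flights["dep_date"] = pd.to_datetime(flights["dep_date"],
                                     format="%d-%b-%Y")        # normalize the date format
```

Each line maps to one of the clean-up categories on the previous slide; real prep would also handle outliers and scaling.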
  18. If you have big data, preparing and pulling together your data will require a LOT of storage and processing power
  19. Step 4: Identify the data that influences outcomes
  20. Which fields ("features") might help us predict if a flight will be late (the "label")? Are there any fields we can decompose to get more information?
      Flight # | Dep Date    | Sched Dep Time | Dep Airport | Dep Delay
      041      | 15-dec-2016 | 09:20          | YYZ         | 253:26
      386      | 15-dec-2016 | 15:20          | YYZ         |
      415      | 15-dec-2016 | 19:15          | YYZ         | 0:02
      415      | 15-dec-2016 | 19:15          | YYZ         | 0:02

      Date       | Airport | Wind      | Precipitation | Precipitation Type
      15/12/2016 | Pearson | NNE 5 MPH | 15 cm         | Snow
      15/12/2016 | Dulles  | SW 18 MPH | 7 mm          | Rain
      15/12/2016 | Reagan  | SW 18 MPH | 7 mm          | Rain
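Decomposition means splitting one raw field into several features a model can use; for example, a departure date hides a day-of-week, and a scheduled time hides an hour. A small pandas sketch (the column names and example dates are illustrative, not from the deck):

```python
import pandas as pd

# Two illustrative flights: one weekday, one weekend departure.
flights = pd.DataFrame({
    "dep_date": pd.to_datetime(["2016-12-15", "2016-12-17"]),
    "sched_dep_time": ["09:20", "19:15"],
})

# Decompose the date and time into candidate features.
flights["day_of_week"] = flights["dep_date"].dt.day_name()
flights["dep_hour"] = flights["sched_dep_time"].str.split(":").str[0].astype(int)
```

A model can learn from "Thursday at 9" vs "Saturday at 19" where it could not learn much from a raw timestamp string.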
  21. Which fields ("features") help us predict if a picture contains a dog or cat (the "label")? Pixel1Color, Pixel2Color, Pixel3Color, … Pixel9036Color
  22. Break out the deep learning (GPUs, storage): pixel → edge → shape → cat
  23. Step 5: Pick the right algorithm
  24. What are you trying to predict?
      Prediction                                 | Algorithm         | Example
      Predict continuous values                  | Regression        | Predict what time a flight will land
      Predict what category something falls into | Classification    | Predict if a flight will be late or on time
      Detect unusual data points                 | Anomaly detection | Predict if a credit card transaction is fraudulent; predict if a runner cheated on a marathon
  25. Supervised vs unsupervised
      Type         | Definition                                                                       | Example
      Supervised   | You have existing data with known inputs and known outputs to help make predictions | When I try to predict if a flight next week will be late, I know which flights have been late in the past
      Unsupervised | You have input data but no known outcomes in your data                            | When I try to predict if a runner cheated on a marathon, I don't have a history of runners who cheated in the past
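The regression/classification split above dictates which estimator you reach for. A hedged scikit-learn sketch on toy data (the features, values, and specific estimators are illustrative; many other algorithms fit each category):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy features for past flights: [departure hour, precipitation in mm].
X = [[9, 150], [15, 0], [19, 7], [7, 0], [18, 120]]

# Regression: predict a continuous value (delay in minutes).
delay_minutes = [253, 0, 2, 0, 180]
reg = LinearRegression().fit(X, delay_minutes)

# Classification: predict a category (late = 1, on time = 0).
was_late = [1, 0, 0, 0, 1]
clf = LogisticRegression().fit(X, was_late)

print(reg.predict([[17, 90]]))  # a number of minutes
print(clf.predict([[17, 90]]))  # a class, 0 or 1
```

Same features, two different questions: the continuous question gets a regressor, the categorical one a classifier.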
  26. Step 6: Train your model
  27. Once you have your data and your algorithm, you can train and create your predictive model
  28. There are lots of tools to choose from: Python, R, scikit-learn (built on NumPy, SciPy, and matplotlib), Azure Machine Learning Service, Cognitive Toolkit/TensorFlow (deep learning)
  29. Step 7: Test your model
  30. You need to know the accuracy of your model! Feed flights with known outcomes into the trained model and compare its predictions to what actually happened:
      Flt #406, Air Canada, April 1, 2016, 3:15 PM, YYZ-YVR → predicted Late: No, actual Late: Yes
      Flt #351, West Jet, April 12, 2016, 8:01 AM, YOW-YYZ → predicted Late: No, actual Late: No
      Flt #141, Delta, Sep 25, 2016, 1:45 PM, HND-SEA → predicted Late: Yes, actual Late: Yes
      Two of three predictions correct: 66.6% accuracy
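The standard way to get an honest accuracy number is to hold back some labeled rows that the model never sees during training, exactly as the slide does with its three flights. A sketch in scikit-learn (toy data; the features, classifier choice, and split size are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy labeled history: [departure hour, precipitation mm] -> late (1) or not (0).
X = [[9, 150], [15, 0], [19, 7], [7, 0], [18, 120], [6, 0], [20, 90], [12, 5]]
y = [1, 0, 0, 0, 1, 0, 1, 0]

# Hold back 3 flights the model will never train on.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy = fraction of held-back flights predicted correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.0%}")
```

If you test on the same rows you trained on, the score is flattering but meaningless; the held-back set is what tells you whether to go back to step 1.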
  31. 31. What do I do if my accuracy is lousy? Go back to step 1
  32. For additional information: Appendix A – What is Hadoop anyway? Appendix B – What cloud tools exist to help with data science? Appendix C – Lexicon
  33. Thank you! Questions? Susan Ibach, susan@aigaming.com, @HockeyGeekGirl
  34. 34. Appendix A – What is Hadoop anyway? It’s a tool for analyzing Big Data
  35. Hadoop is an open-source framework: based on Java; distributed processing of large datasets across clusters of computers; distributed storage and computation across clusters of computers; scales from a single server to thousands of machines
  36. Hadoop components
      • Hadoop Common – Java libraries used by Hadoop to abstract the file system and OS
      • Hadoop YARN – framework for job scheduling and managing cluster resources
      • HDFS – distributed file system for access to application data (distributed storage)
        • Based on the Google File System (GFS)
        • Hadoop can run on other file systems (local FS, HFTP FS, S3 FS) but usually uses HDFS
        • A file in HDFS is split into blocks stored in DataNodes; the NameNode maps blocks to DataNodes
      • MapReduce – the programming model for parallel processing of large data sets (distributed computation)
        • Map data into key/value pairs (tuples)
        • Reduce data tuples into smaller sets of tuples
        • Input/output stored in the file system
        • JobTracker and TaskTrackers schedule and monitor tasks and re-execute failed tasks
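The map/reduce idea above (map data into key/value tuples, then reduce groups of tuples into smaller aggregates) can be sketched in plain Python with the classic word count. This just illustrates the two phases on one machine; a real Hadoop job distributes them across the cluster's DataNodes:

```python
from itertools import groupby
from operator import itemgetter

lines = ["late flight late", "on time flight"]

# Map phase: emit a (word, 1) tuple for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort phase: bring tuples with the same key together.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each key.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'flight': 2, 'late': 2, 'on': 1, 'time': 1}
```

Because each mapper only sees its own lines and each reducer only sees one key's tuples, both phases parallelize naturally, which is the whole point of the model.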
  37. Hadoop components (continued)
      • Hive – similar to SQL; hides the complexity of MapReduce programming by generating a MapReduce job
      • Pig (Pig Latin) – high-level data flow language for parallel computation & ETL
      • HBase – scalable distributed non-relational database that supports structured data storage for large tables (billions of rows × millions of columns)
      • Spark – compute engine for Hadoop data used for ETL, machine learning, stream/real-time processing and graph computation (gradually replacing MapReduce because it is faster for iterative algorithms)
  38. How does Hadoop work?
      • The user submits a job to Hadoop: the location of input and output files, Java classes containing the map and reduce functions, and job configuration parameters
      • Hadoop submits the job to the JobTracker, which distributes it to the slaves, schedules tasks and monitors them
      • TaskTrackers execute the tasks, and output is stored in output files on the file system
  39. Why is it popular?
      • Allows users to quickly write and test distributed systems
      • Automatically distributes data and work across the machines and exploits the parallelism of CPU cores
      • Does not rely on hardware for fault tolerance and high availability
      • Servers can be added or removed dynamically
      • Open source and runs on many platforms since it is Java based
  40. 40. Appendix B – What cloud tools exist to help with data science?
  41. Microsoft Azure
      • Interact with it – Cortana, Bot Framework: type messages, talk, send images or video and get answers
      • See it – Power BI: visualize data with heat maps, graphs and charts
      • Stream it – Stream Analytics: monitor data as it arrives and act on it in real time
      • Learn it – Azure Machine Learning, Microsoft R Server: analyze past data to find patterns you can use to predict outcomes for new data
      • Relate it – SQL Data Warehouse, SQL DB, Document DB, Blob storage: store related data together using the best data store for the job
      • Store it – Data Lake: a data store that can handle data of any size, shape or speed
      • Collect it – Event Hubs: collect data from sources such as IoT sensors that send large amounts of data in short periods of time
      • Move it – Data Factory: move data from one place to another, transforming it as it moves
      • Document it – Data Catalog: document all your data sources
      • Use it – Cognitive Services: pre-trained models available for use
      • Scale it – HDInsight & Azure Databricks: create clusters for Hadoop or Spark (Databricks for Spark)
  42. Google Cloud Platform
      • Prepare it – Cloud Dataprep: prepare your data for analysis
      • Train it – BigQuery ML, BigQuery GIS: train machine learning models
      • Store it – BigQuery, GCP data lake: data warehouse
      • Scale it – Cloud Dataproc: spin up clusters for Hadoop and Spark
      • Stream it – Cloud Pub/Sub, Cloud Dataflow: ingest events in real time
      • Use it – prepackaged AI solutions: pre-trained models available for use
  43. IBM
      • Scale it – Analytics Engine: build and deploy clusters for Hadoop and Spark
      • Access it – InfoSphere Information Server on Cloud: extract, transform & load data, plus data standardization
      • Stream it – Streaming Analytics: monitor data as it arrives and act on it in real time
      • Train it or use it – IBM Watson: train your own models or leverage pre-trained models for features such as speech to text, natural language processing, and image analysis
      • Collect it – Watson IoT Platform: connect devices and analyze the associated data
      • Analyze it – Deep Learning: design and deploy deep learning models using neural networks
      • Prepare it – IBM Data Refinery: data preparation tool
  44. AWS
      • Store it – data lakes, Redshift: store your data
      • Move it – Lake Formation: get data into your data lake
      • Stream it – streaming analytics: monitor data as it arrives and act on it in real time
      • Collect it – Amazon Kinesis, IoT Core: collect, process and analyze real-time data, including data from IoT devices
      • Document it – Glue: create a catalog of your data that is searchable and queryable by users
      • Analyze it – Athena: analyze your data
      • Scale it – EMR, Deep Learning AMIs: scale using Hadoop and Spark
      • See it – QuickSight: visualizations and dashboards
      • Use it – application services: pre-trained models ready for use
      • Train it – Deep Learning AMIs, SageMaker: tools to help you build and train models
  45. Appendix C – Lexicon: buzzwords and tools
  46. Amazon Redshift – data warehouse infrastructure
      Ambari – web-based tool for managing Apache Hadoop clusters: provision, manage and monitor your Hadoop clusters
      Avro – a data serialization system (like XML or JSON)
      Apache Hadoop – distributed storage and processing of big data; splits files into large blocks, distributes them across nodes in a cluster, then transfers packaged code to the nodes so the data is processed in parallel, for faster processing
      Apache Flink – open source stream processing framework to help you move data from your sensors and applications to your data stores and applications
      Apache Storm – open source realtime computation system; Storm does for realtime processing what Hadoop does for batch processing
      Azure Databricks – platform for managing and deploying Spark at scale
      Azure Data Lake Analytics – allows you to write queries against data in a wide variety of data stores
      Azure Notebooks – basically Jupyter notebooks on Azure, supporting Python, F# and R
      Azure SQL Data Warehouse – data warehouse infrastructure
      Caffe – deep learning framework
      Cassandra – NoSQL database
      Cognitive Toolkit (CNTK) – Microsoft's deep learning toolkit (competes with Google TensorFlow) for training machine learning models; provides APIs you call with Python
      CouchDB – NoSQL database
      Chukwa – data collection system for managing large distributed systems
      H2O – open source deep learning platform (competes with TensorFlow and Cognitive Toolkit)
      Hadoop Distributed File System (HDFS) – the distributed file system used by Hadoop; great for horizontal scalability (does not support insert, update & delete)
      Hadoop MapReduce – programming model used to process data; provides horizontal scalability
      Hadoop YARN – platform for managing resources and scheduling in Hadoop clusters
      HDInsight – Microsoft Azure service used to spin up Hadoop clusters to help analyze big data with Hadoop, Spark, HBase, R Server, Storm, etc.
  47. Hive – data warehouse infrastructure
      HBase – scalable distributed non-relational database that supports structured data storage for large tables (billions of rows × millions of columns)
      Jupyter Notebooks – web applications that let you create shareable interactive documents containing text, equations, code, and data visualizations; very useful for data scientists to explore and manipulate data sets and to share results; usable for data cleaning and transformation, machine learning, and data visualization; supports Python, R, Julia, and Scala, and can run on a Spark cluster
      Kafka – distributed publisher/subscriber messaging system; used in the extraction step of ETL for high-volume, high-velocity data flows
      MapReduce – a two-stage algorithm for processing large datasets: data is split across a Hadoop cluster, the map function breaks data into key/value pairs (e.g. individual words in a text file), and the reduce function combines the mapped data (e.g. total counts of each word); MapReduce functions can be written in Java, Python, C# or Pig
      MATLAB – tools for machine learning; build models
      MongoDB – NoSQL database
      MySQL – relational database
      scikit-learn – tools for data mining and data analysis built on Python (NumPy, SciPy and matplotlib)
      Spark – compute engine for Hadoop data used for ETL, machine learning, stream processing and graph computation (starting to replace MapReduce because Spark is faster)
      Sqoop – used for transferring data between structured databases and Hadoop
      TensorFlow – Google's deep learning toolkit; an open source software library for training machine learning models that lets you deploy computation across one or more CPUs or GPUs with a single API; provides APIs you call from Python
      Torch – computing framework for machine learning algorithms that puts GPUs first (good for deep learning)
      TPU – tensor processing unit; a custom-built ASIC designed for high performance when running models rather than training them; second-generation Google TPUs are available on Google Compute Engine
      Tez – data flow programming framework built on YARN; runs projects like Hive and Pig, and is starting to replace MapReduce as the execution engine on Hadoop because it can process data in a single job instead of multiple jobs
      ZooKeeper – high performance coordination service for distributed applications
  48. Programming languages and libraries
      Scala – libraries and tools for performing data analysis
      Python – pandas (for exploring data and data preparation: e.g. missing values, joins, string manipulation)
      NumPy – fundamental package for scientific computing with Python
      SciPy – numerical routines for numerical integration and optimization
      matplotlib – for graphing, charting and visualizing data sets or query results
      Keras – deep learning library for building your own neural networks
      R – language for statistics (linear and nonlinear modelling, classification, clustering) and graphics
      Julia – numerical computing language that supports parallel execution; based on C
      Mahout – scalable machine learning and data mining library
      Pig (Pig Latin) – high-level data flow language for parallel computation & ETL
      HiveQL – similar to SQL; hides the complexity of MapReduce programming by generating a MapReduce job
      U-SQL – data language used by Azure Data Lake to query across data sources
