SlideShare une entreprise Scribd logo
1  sur  42
Télécharger pour lire hors ligne
CS639:
Data Management for
Data Science
Lecture 1: Intro to Data Science and Course Overview
Theodoros Rekatsinas
1
2
3
Big science is data driven.
Big science is data driven.
4
Increasingly many companies see
themselves as data driven.
Increasingly many companies see
themselves as data driven.
5
Even more “traditional” companies…
The world is increasingly
driven by data…
6
This class teaches the basics of
how to use & manage data to
obtain useful insights.
Today’s Lecture
1. Motivation, Admin, & Setup
2. Introduction to Data Science
7
1. Motivation, Admin & Setup
8
Section 1
What you will learn about in this section
1. Motivation for studying Data Science
2. Administrative structure
3. Course logistics
9
Section 1
10
Section 1
Data Analysis has always been around
R.A. Fisher “Correlation does not imply causation”
Peter Luhn
A pioneer in hash coding and full text processing
Coined the term “business intelligence”
11
Section 1
Data Analysis has always been around
Introduced the box-plot
Also introduced the term “bit” (later used by Shannon)
The time of data-driven AI
John Tukey
Tom Mitchell
12
Section 1
Data Analysis has always been around
Essays on Scientific Discovery based on data-intensive science
Father of ACID
(requirements for reliable transaction processing)
Known for AI programming
(at Google)
By Jim Gray
Peter Norvig
13
Section 1
Data Analysis is popular
14
Section 1
Why should you study data science?
• Mercenary-make more $$$:
• Startups need data science talent right away = low employee #
• Massive industry…
• Intellectual:
• Science: data poor to data rich
• No idea how to handle the data!
• Fundamental ideas to/from all of CS:
• Systems, theory, AI, logic, stats, analysis….
15
Section 1
What this course is (and is not)
• Discuss the fundamentals data management for data science
workflows
• How to represent and store data
• How to extract and prepare data for analysis
• How to analyze data and how to visualize and communicate insights
• You will learn how to design data science pipelines
• This is not a databases, systems, or machine learning class. We will
touch many topics covered in these classes but we will not go into
details.
Who we are…
Instructor (me) Theo Rekatsinas
• Faculty in the Computer Sciences and part of the UW-Database Group
• Research: data integration and cleaning, statistical analytics, and machine
learning.
• thodrek@cs.wisc.edu
• Office hours: MWF after class @CS 4361
16
Section 1
17
Section 1
Frank Zou
Huawei
Wang
Communication w/ Course Staff
• Piazza https://piazza.com/wisc/spring2019/cs639
• Class mailing list: compsci639-4-s19@lists.wisc.edu
• Office hours: Listed on the website
• Also By appointment!
18
Section 1
The goal is to get you
to answer each
other’s questions so
you can benefit and
learn from each other.
The goal is to get you
to answer each
other’s questions so
you can benefit and
learn from each other.
Course Website:
https://thodrek.github.io/cs639_spring19/
19
Section 1
Course Github:
https://github.com/thodrek/cs639_spring19
20
Lectures
• Lecture slides cover essential material
• This is your best reference.
• We will have pointers as needed
• Recommended textbooks listed on website
• Try to cover same thing in many ways: Lecture, lecture slides,
homework, exams (no shock)
• Attendance makes your life easier…
Section 1
Graded Elements
• Six Programming assignments (45%)
• Focus on different aspects of data science
• Midterm (20%)
• Final exam (35%)
21
Dates are posted on
the website!!!
Dates are posted on
the website!!!
Section 1
What is expected from you
• Attend lectures
• If you don’t, it’s at your own peril
• Be active and think critically
• Ask questions, post comments on forums
• Do programming projects
• Start early and be honest
• Study for tests and exams
22
Section 1
Programming Assignments
• Six programming assignments
• Python plus Jupiter notebooks
• These are individual assignments
• Submission via Canvas
• ~1 week per programming assignments
• Ask questions, post comments on forums
• Start early!
• You have late days. The policy is described on the website
23
Section 1
Programming setup for class
24
1. For all assignments we will use the provided Virtual Machine.
• Ubuntu + necessary python libraries
• Link provided on the website
• We will not provide support for any other platform (you can still use your own machine)
2. To deploy and run the provided VM:
1. Download and Install Virtual Box: https://www.virtualbox.org/wiki/Downloads
2. Download the Class VM from:
https://www.dropbox.com/s/xjvj3jlaurzjfas/cs639_vm.ova.zip?dl=0
3. Import the VM by following the instructions here:
https://blogs.oracle.com/oswald/importing-a-vdi-in-virtualbox
4. Run the VM and login with the following credentials:
1. Username: CS639_DS_USER
2. Password: cs639_ds_user
3. Come to office hours if you need help with installation!
Please help out your
peers by posting issues
/ solutions on Piazza!
Please help out your
peers by posting issues
/ solutions on Piazza!
Section 1
2. Introduction to Data Science
25
Section 2
What you will learn about in this section
1. What is Data Science?
2. Data Science workflows
3. What should a Data Scientist know?
4. Overview of lecture coverage
26
Section 2
Data Science is an emerging field
• https://www.oreilly.com/ideas/what-is-data-science
27
Section 2
Data Science products
28
Section 2
One definition of data science
29
Section 2
Data science is a broad field that refers to the collective
processes, theories, concepts, tools and technologies that
enable the review, analysis and extraction of valuable
knowledge and information from raw data.
Source: Techopedia
Data science is not databases
30
Section 2
Databases Data Science
Data Value “Precious” “Cheap”
Data Volume Modest Massive
Examples Bank records,
Personnel records,
Census,
Medical records
Online clicks,
GPS logs,
Tweets,
Building sensor readings
Priorities Consistency,
Error recovery,
Auditability
Speed,
Availability,
Query richness
Structured Strongly (Schema) Weakly or none (Text)
Properties Transactions, ACID* CAP* theorem (2/3),
eventual consistency
Realizations SQL NoSQL:
Riak, Memcached,
MongoDB, CouchDB,
Hbase, Cassandra,…
ACID = Atomicity, Consistency, Isolation and Durability CAP = Consistency, Availability, Partition Tolerance
Data science is not databases
31
Section 2
Databases Data Science
Querying the past Querying the future
Business intelligence (BI) is the transformation of raw data into meaningful and
useful information for business analysis purposes. BI can handle enormous
amounts of unstructured data to help identify, develop and otherwise create new
strategic business opportunities - Wikipedia
Data science workflow
32
Section 2
https://cacm.acm.org/blogs/blog-cacm/169199-data-science-
workflow-overview-and-challenges/fulltext
Data science workflow
33
Section 2
Data science workflow
34
Section 2
Digging Around
in Data
Hypothesize
Model
Large Scale
Exploitation
Evaluate
Interpret
Clean,
prep
What is hard about Data Science
35
Section 2
• Overcoming assumptions
• Making ad-hoc explanations of data patterns
• Overgeneralizing
• Communication
• Not checking enough (validate models, data pipeline
integrity, etc.)
• Using statistical tests correctly
• Prototype  Production transitions
• Data pipeline complexity (who do you ask?)
What is hard about Data Science
36
Section 2
What is hard about Data Science
37
Section 2
What are Data Scientists really doing?
38
Section 2
https://visit.figure-eight.com/rs/416-ZBE-
142/images/CrowdFlower_DataScienceReport_2016.pdf
Lectures: 1st part – Data Storage
1. Overview: What is data science
• Lectures 1-3
2. Foundations: Relational data models & SQL
• Lectures 4-7
• How to manipulate data with SQL, a declarative language
• reduced expressive power but the system can do more for you
• Query optimization
3. MapReduce and NoSQL systems: MapReduce, KeyValue Stores,
Graph DBs
• Lectures 8-14
• Dealing with massive amounts of data and non-relational data
39
Section 2
Lectures: 2nd part – Predictive analytics
4. Statistical Reasoning: Inference, Sampling Bayesian Methods
• Lectures 15-17
• How to reason about patterns in data
5. Machine Learning: Decision Trees, Evaluation of ML models, Ensembles
• Lectures 18-22
• Overview of different ML paradigms
6. Optimization: How to train ML models (efficiently)?
• Lecture 23
• Loss Functions, Optimization via Gradient Descent
• Stochastic Gradient Descent (SGD), Parallel SGD
40
Section 2
Lectures: 3rd part – Data Integration
7. Information Extraction: Named entity recognition and relation extraction
• Lecture 24
• How to identify entities of interest in unstructured data?
• How to find relationships between them?
8. Data Integration: Combine information from different data sources
• Lecture 25
• How to find if a real-world entity is mentioned in different sources?
• How to align data from different sources?
9. Data Cleaning: Remove errors and noise from data
• Lecture 26
• How can we detect errors in data to be used for analytics?
• How can we fix these errors automatically?
41
Section 2
Lectures: 4th part – Communicating Insights
10. Data Visualization: Creating data charts that convey interesting
findings
• Lectures 27-29
• How to convey insights most effectively?
• How to explore raw data?
11. Data Privacy: Sharing sensitive information
• Lecture 30-32
• How can we share sensitive data?
• How can we perform analytics on sensitive data?
42
Section 2

Contenu connexe

Similaire à Lecture_1_Intro.pdf

Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huwekineheshete
 
Large scale computing
Large scale computing Large scale computing
Large scale computing Bhupesh Bansal
 
Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)SayyedYusufali
 
Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)SayyedYusufali
 
Data science training in hydpdf converted (1)
Data science training in hydpdf  converted (1)Data science training in hydpdf  converted (1)
Data science training in hydpdf converted (1)SayyedYusufali
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification courseKumarNaik21
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Trieu Nguyen
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdfAyele40
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?DIGITALSAI1
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification courseKumarNaik21
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)SayyedYusufali
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabadVamsiNihal
 
Data science training in Hyderabad
Data science  training in HyderabadData science  training in Hyderabad
Data science training in Hyderabadsaitejavella
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training HyderabadNithinsunil1
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabadVamsiNihal
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)SayyedYusufali
 

Similaire à Lecture_1_Intro.pdf (20)

Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Data Scientists
 Data Scientists Data Scientists
Data Scientists
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)Data science training in hyd ppt converted (1)
Data science training in hyd ppt converted (1)
 
Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)Data science training in hyd pdf converted (1)
Data science training in hyd pdf converted (1)
 
Data science training in hydpdf converted (1)
Data science training in hydpdf  converted (1)Data science training in hydpdf  converted (1)
Data science training in hydpdf converted (1)
 
Lecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptxLecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptx
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)Concepts, use cases and principles to build big data systems (1)
Concepts, use cases and principles to build big data systems (1)
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf1. Overview_of_data_analytics (1).pdf
1. Overview_of_data_analytics (1).pdf
 
Which institute is best for data science?
Which institute is best for data science?Which institute is best for data science?
Which institute is best for data science?
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 
Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
Data science training in Hyderabad
Data science  training in HyderabadData science  training in Hyderabad
Data science training in Hyderabad
 
Data science training Hyderabad
Data science training HyderabadData science training Hyderabad
Data science training Hyderabad
 
Data science online training in hyderabad
Data science online training in hyderabadData science online training in hyderabad
Data science online training in hyderabad
 
Data science training in hyd ppt (1)
Data science training in hyd ppt (1)Data science training in hyd ppt (1)
Data science training in hyd ppt (1)
 

Plus de paijitk

IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdf
IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdfIDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdf
IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdfpaijitk
 
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdf
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdfลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdf
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdfpaijitk
 
Chapter_01wht.pdf
Chapter_01wht.pdfChapter_01wht.pdf
Chapter_01wht.pdfpaijitk
 
functions2-200924082810.pdf
functions2-200924082810.pdffunctions2-200924082810.pdf
functions2-200924082810.pdfpaijitk
 
stringsinpython-181122100212.pdf
stringsinpython-181122100212.pdfstringsinpython-181122100212.pdf
stringsinpython-181122100212.pdfpaijitk
 
Python-review1.pdf
Python-review1.pdfPython-review1.pdf
Python-review1.pdfpaijitk
 
Functions_21_22.pdf
Functions_21_22.pdfFunctions_21_22.pdf
Functions_21_22.pdfpaijitk
 
Functions_19_20.pdf
Functions_19_20.pdfFunctions_19_20.pdf
Functions_19_20.pdfpaijitk
 
03-Data-Exploration.en.th.pptx
03-Data-Exploration.en.th.pptx03-Data-Exploration.en.th.pptx
03-Data-Exploration.en.th.pptxpaijitk
 
02-Lifecycle.en.th.pptx
02-Lifecycle.en.th.pptx02-Lifecycle.en.th.pptx
02-Lifecycle.en.th.pptxpaijitk
 
01. Introduction.en.th.pptx
01. Introduction.en.th.pptx01. Introduction.en.th.pptx
01. Introduction.en.th.pptxpaijitk
 
Lecture_2_Stats.pdf
Lecture_2_Stats.pdfLecture_2_Stats.pdf
Lecture_2_Stats.pdfpaijitk
 

Plus de paijitk (12)

IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdf
IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdfIDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdf
IDIO2020_B3_DataViz_EdoraFNguyenJ_acc.pdf
 
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdf
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdfลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdf
ลำดับการโพสต์วันปฐมนิเทศ ญาณสาสมาธิ (ออนไ.pdf
 
Chapter_01wht.pdf
Chapter_01wht.pdfChapter_01wht.pdf
Chapter_01wht.pdf
 
functions2-200924082810.pdf
functions2-200924082810.pdffunctions2-200924082810.pdf
functions2-200924082810.pdf
 
stringsinpython-181122100212.pdf
stringsinpython-181122100212.pdfstringsinpython-181122100212.pdf
stringsinpython-181122100212.pdf
 
Python-review1.pdf
Python-review1.pdfPython-review1.pdf
Python-review1.pdf
 
Functions_21_22.pdf
Functions_21_22.pdfFunctions_21_22.pdf
Functions_21_22.pdf
 
Functions_19_20.pdf
Functions_19_20.pdfFunctions_19_20.pdf
Functions_19_20.pdf
 
03-Data-Exploration.en.th.pptx
03-Data-Exploration.en.th.pptx03-Data-Exploration.en.th.pptx
03-Data-Exploration.en.th.pptx
 
02-Lifecycle.en.th.pptx
02-Lifecycle.en.th.pptx02-Lifecycle.en.th.pptx
02-Lifecycle.en.th.pptx
 
01. Introduction.en.th.pptx
01. Introduction.en.th.pptx01. Introduction.en.th.pptx
01. Introduction.en.th.pptx
 
Lecture_2_Stats.pdf
Lecture_2_Stats.pdfLecture_2_Stats.pdf
Lecture_2_Stats.pdf
 

Dernier

Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdfQucHHunhnh
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 

Dernier (20)

Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 

Lecture_1_Intro.pdf

  • 1. CS639: Data Management for Data Science Lecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1
  • 2. 2
  • 3. 3 Big science is data driven. Big science is data driven.
  • 4. 4 Increasingly many companies see themselves as data driven. Increasingly many companies see themselves as data driven.
  • 6. The world is increasingly driven by data… 6 This class teaches the basics of how to use & manage data to obtain useful insights.
  • 7. Today’s Lecture 1. Motivation, Admin, & Setup 2. Introduction to Data Science 7
  • 8. 1. Motivation, Admin & Setup 8 Section 1
  • 9. What you will learn about in this section 1. Motivation for studying Data Science 2. Administrative structure 3. Course logistics 9 Section 1
  • 10. 10 Section 1 Data Analysis has always been around R.A. Fisher “Correlation does not imply causation” Peter Luhn A pioneer in hash coding and full text processing Coined the term “business intelligence”
  • 11. 11 Section 1 Data Analysis has always been around Introduced the box-plot Also introduced the term “bit” (later used by Shannon) The time of data-driven AI John Tukey Tom Mitchell
  • 12. 12 Section 1 Data Analysis has always been around Essays on Scientific Discovery based on data-intensive science Father of ACID (requirements for reliable transaction processing) Known for AI programming (at Google) By Jim Gray Peter Norvig
  • 14. 14 Section 1 Why should you study data science? • Mercenary-make more $$$: • Startups need data science talent right away = low employee # • Massive industry… • Intellectual: • Science: data poor to data rich • No idea how to handle the data! • Fundamental ideas to/from all of CS: • Systems, theory, AI, logic, stats, analysis….
  • 15. 15 Section 1 What this course is (and is not) • Discuss the fundamentals data management for data science workflows • How to represent and store data • How to extract and prepare data for analysis • How to analyze data and how to visualize and communicate insights • You will learn how to design data science pipelines • This is not a databases, systems, or machine learning class. We will touch many topics covered in these classes but we will not go into details.
  • 16. Who we are… Instructor (me) Theo Rekatsinas • Faculty in the Computer Sciences and part of the UW-Database Group • Research: data integration and cleaning, statistical analytics, and machine learning. • thodrek@cs.wisc.edu • Office hours: MWF after class @CS 4361 16 Section 1
  • 18. Communication w/ Course Staff • Piazza https://piazza.com/wisc/spring2019/cs639 • Class mailing list: compsci639-4-s19@lists.wisc.edu • Office hours: Listed on the website • Also By appointment! 18 Section 1 The goal is to get you to answer each other’s questions so you can benefit and learn from each other. The goal is to get you to answer each other’s questions so you can benefit and learn from each other.
  • 19. Course Website: https://thodrek.github.io/cs639_spring19/ 19 Section 1 Course Github: https://github.com/thodrek/cs639_spring19
  • 20. 20 Lectures • Lecture slides cover essential material • This is your best reference. • We will have pointers as needed • Recommended textbooks listed on website • Try to cover same thing in many ways: Lecture, lecture slides, homework, exams (no shock) • Attendance makes your life easier… Section 1
  • 21. Graded Elements • Six Programming assignments (45%) • Focus on different aspects of data science • Midterm (20%) • Final exam (35%) 21 Dates are posted on the website!!! Dates are posted on the website!!! Section 1
  • 22. What is expected from you • Attend lectures • If you don’t, it’s at your own peril • Be active and think critically • Ask questions, post comments on forums • Do programming projects • Start early and be honest • Study for tests and exams 22 Section 1
  • 23. Programming Assignments • Six programming assignments • Python plus Jupiter notebooks • These are individual assignments • Submission via Canvas • ~1 week per programming assignments • Ask questions, post comments on forums • Start early! • You have late days. The policy is described on the website 23 Section 1
  • 24. Programming setup for class 24 1. For all assignments we will use the provided Virtual Machine. • Ubuntu + necessary python libraries • Link provided on the website • We will not provide support for any other platform (you can still use your own machine) 2. To deploy and run the provided VM: 1. Download and Install Virtual Box: https://www.virtualbox.org/wiki/Downloads 2. Download the Class VM from: https://www.dropbox.com/s/xjvj3jlaurzjfas/cs639_vm.ova.zip?dl=0 3. Import the VM by following the instructions here: https://blogs.oracle.com/oswald/importing-a-vdi-in-virtualbox 4. Run the VM and login with the following credentials: 1. Username: CS639_DS_USER 2. Password: cs639_ds_user 3. Come to office hours if you need help with installation! Please help out your peers by posting issues / solutions on Piazza! Please help out your peers by posting issues / solutions on Piazza! Section 1
  • 25. 2. Introduction to Data Science 25 Section 2
  • 26. What you will learn about in this section 1. What is Data Science? 2. Data Science workflows 3. What should a Data Scientist know? 4. Overview of lecture coverage 26 Section 2
  • 27. Data Science is an emerging field • https://www.oreilly.com/ideas/what-is-data-science 27 Section 2
  • 29. One definition of data science 29 Section 2 Data science is a broad field that refers to the collective processes, theories, concepts, tools and technologies that enable the review, analysis and extraction of valuable knowledge and information from raw data. Source: Techopedia
  • 30. Data science is not databases 30 Section 2 Databases Data Science Data Value “Precious” “Cheap” Data Volume Modest Massive Examples Bank records, Personnel records, Census, Medical records Online clicks, GPS logs, Tweets, Building sensor readings Priorities Consistency, Error recovery, Auditability Speed, Availability, Query richness Structured Strongly (Schema) Weakly or none (Text) Properties Transactions, ACID* CAP* theorem (2/3), eventual consistency Realizations SQL NoSQL: Riak, Memcached, MongoDB, CouchDB, Hbase, Cassandra,… ACID = Atomicity, Consistency, Isolation and Durability CAP = Consistency, Availability, Partition Tolerance
  • 31. Data science is not databases 31 Section 2 Databases Data Science Querying the past Querying the future Business intelligence (BI) is the transformation of raw data into meaningful and useful information for business analysis purposes. BI can handle enormous amounts of unstructured data to help identify, develop and otherwise create new strategic business opportunities - Wikipedia
  • 32. Data science workflow 32 Section 2 https://cacm.acm.org/blogs/blog-cacm/169199-data-science- workflow-overview-and-challenges/fulltext
  • 34. Data science workflow 34 Section 2 Digging Around in Data Hypothesize Model Large Scale Exploitation Evaluate Interpret Clean, prep
  • 35. What is hard about Data Science 35 Section 2 • Overcoming assumptions • Making ad-hoc explanations of data patterns • Overgeneralizing • Communication • Not checking enough (validate models, data pipeline integrity, etc.) • Using statistical tests correctly • Prototype  Production transitions • Data pipeline complexity (who do you ask?)
  • 36. What is hard about Data Science 36 Section 2
  • 37. What is hard about Data Science 37 Section 2
  • 38. What are Data Scientists really doing? 38 Section 2 https://visit.figure-eight.com/rs/416-ZBE- 142/images/CrowdFlower_DataScienceReport_2016.pdf
  • 39. Lectures: 1st part – Data Storage 1. Overview: What is data science • Lectures 1-3 2. Foundations: Relational data models & SQL • Lectures 4-7 • How to manipulate data with SQL, a declarative language • reduced expressive power but the system can do more for you • Query optimization 3. MapReduce and NoSQL systems: MapReduce, KeyValue Stores, Graph DBs • Lectures 8-14 • Dealing with massive amounts of data and non-relational data 39 Section 2
  • 40. Lectures: 2nd part – Predictive analytics 4. Statistical Reasoning: Inference, Sampling Bayesian Methods • Lectures 15-17 • How to reason about patterns in data 5. Machine Learning: Decision Trees, Evaluation of ML models, Ensembles • Lectures 18-22 • Overview of different ML paradigms 6. Optimization: How to train ML models (efficiently)? • Lecture 23 • Loss Functions, Optimization via Gradient Descent • Stochastic Gradient Descent (SGD), Parallel SGD 40 Section 2
  • 41. Lectures: 3rd part – Data Integration 7. Information Extraction: Named entity recognition and relation extraction • Lecture 24 • How to identify entities of interest in unstructured data? • How to find relationships between them? 8. Data Integration: Combine information from different data sources • Lecture 25 • How to find if a real-world entity is mentioned in different sources? • How to align data from different sources? 9. Data Cleaning: Remove errors and noise from data • Lecture 26 • How can we detect errors in data to be used for analytics? • How can we fix these errors automatically? 41 Section 2
  • 42. Lectures: 4th part – Communicating Insights 10. Data Visualization: Creating data charts that convey interesting findings • Lectures 27-29 • How to convey insights most effectively? • How to explore raw data? 11. Data Privacy: Sharing sensitive information • Lecture 30-32 • How can we share sensitive data? • How can we perform analytics on sensitive data? 42 Section 2