Welcome to CMPSC-310!
Introduction to Data Science
What Is Data Science?
Extraction of knowledge from data (also known as
knowledge discovery and data mining, KDD).
Data science :=
Computer science (for data structures,
algorithms, visualization, big data support, general
programming) +
Statistics (for regressions and inference) +
Domain knowledge (for asking questions and
interpreting results).
Data, Information, Knowledge, etc.
(by David Somerville @smrvl)
Data Science and Other Disciplines: BI
Business Intelligence engineers traditionally make tools for others to analyze
data with. BI engineers do not analyze the data themselves. Data scientists both
build the tools and use them to analyze the data. If you are a software engineer,
you need to learn statistical modeling and how to communicate results. You will
need to work with these datasets yourself and use them to make decisions.
Data Science and Other Disciplines: STATS
Statisticians are traditionally content with the assumption (condition) that all their
data will fit in main memory at the same time. Statisticians traditionally used
math or created new math to squeeze as much information as possible from small
numbers of observations or features. Data scientists recognize the need to use
and create math to handle analyses in data-poor environments but will use and
create new software engineering tools to handle very large datasets, and they
recognize that some of the models are the same in both cases. To be a data
scientist, you need to learn to deal with data that does not fit in memory,
because it is no longer safe to assume that it will.
Data Science and Other Disciplines: DB
Database programmers and administrators bring useful skills to data science
but they are traditionally focused on one data model: relational. Handling
graphs of nodes and edges (e.g., PageRank), images, video, and text, as well as
SQL when appropriate, is more like data science. You need to deal with
unstructured data to be a data scientist.
Data Science and Other Disciplines: Visualization
Visualization experts and business analysts bring useful skills but are
traditionally not concerned with massive scale, such as hundreds or thousands of
machines. If you are a business analyst, you need to learn about algorithms and
tradeoffs at large scale. With cloud computing and with algorithms, you may get an
answer, but it may cost more or less than it did 5 years ago. It is no longer safe
to throw your trust over the wall to some algorithm, or to your staff to run some
algorithm. You will need to internalize the tradeoffs of choosing one model or
another yourself.
Data Science and Other Disciplines: ML
Machine learning is similar to data science, but it is only a small fraction of it.
Data science also covers getting, cleaning, and exploring data, and making
interactive visualizations and data products for yourself and for others to use
(e.g., data-driven language translators and spellcheckers), in addition to doing ML.
Topics
● Numeric data analysis
● Signal processing
● Text data analysis (information/document/text retrieval, natural language
processing)
● Statistical inference
● Databases (information integration)
● Complex network analysis
● Data visualization
Define the Question of Study
● Descriptive: Describe a set of data.
● Exploratory: Find new relationships.
● Inferential: Use a small data sample to describe a bigger population. Based
on statistics.
● Predictive: Use data on some objects to predict values for another object.
● Causal: Does one variable affect another variable? Based on statistics.
Correlation != Causation.
● Mechanistic: Exactly how does one variable affect another variable? Based
on deep domain knowledge.
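The difference between a descriptive and an inferential question can be sketched in a few lines of Python. This is only an illustration: the synthetic population, the function names `describe` and `estimate_population_mean`, and the normal-approximation interval are all assumptions made for the example, not part of the course material.

```python
import math
import random
import statistics


def describe(values):
    """Descriptive: summarize only the data we actually have."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def estimate_population_mean(sample, z=1.96):
    """Inferential: use a small sample to say something about a bigger population.

    Returns a normal-approximation interval for the population mean
    (roughly 95% coverage for z=1.96, assuming a reasonably large random sample).
    """
    mean = statistics.mean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * stderr, mean + z * stderr


random.seed(310)
# Synthetic, purely illustrative population.
population = [random.gauss(50, 10) for _ in range(10_000)]
sample = random.sample(population, 100)

summary = describe(sample)
low, high = estimate_population_mean(sample)
print(summary)
print(f"Population mean is estimated to lie between {low:.1f} and {high:.1f}")
```

The descriptive answer makes no claim beyond the 100 sampled values; the inferential answer makes a claim about all 10,000, and that claim is only as good as the sampling assumptions behind it.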
Get and Clean Data
1. Define the ideal data set
Determine what data you can access
2. Obtain the data
Raw data vs processed data. Always use raw data, but process it once; record all
processing steps
3. Clean the data
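One way to "process it once; record all processing steps" is to keep the raw data untouched and have a single cleaning function return both the cleaned rows and a log of what it did. The sketch below assumes a tiny hypothetical CSV (`RAW_CSV`) and hypothetical cleaning rules; real data would be read from a file and would need its own rules.

```python
import csv
import io

# Hypothetical raw data; in practice this would be read from a file.
RAW_CSV = """name,age
Alice, 34
bob,
Carol,29
"""


def clean(raw_text):
    """Return (cleaned rows, log of every processing step applied)."""
    log = []
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    log.append(f"read {len(rows)} raw rows")

    # Normalize whitespace and capitalization.
    for row in rows:
        row["name"] = row["name"].strip().title()
        row["age"] = row["age"].strip()
    log.append("stripped whitespace; title-cased names")

    # Drop rows that lack an age.
    kept = [row for row in rows if row["age"]]
    log.append(f"dropped {len(rows) - len(kept)} rows with missing age")

    # Convert the remaining ages to integers.
    for row in kept:
        row["age"] = int(row["age"])
    log.append("converted age to int")
    return kept, log


rows, log = clean(RAW_CSV)
for step in log:
    print(step)
```

Because the raw text is never modified, the cleaning can be rerun from scratch at any time, and the log doubles as documentation for the Methods section of a report.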
Explore Data
● Exploratory data analysis
● Model data and predict
● Interpret results
● Challenge results
● Present results to the data sponsor
Create Reproducible Code
● Don't do things by hand: teach the computer! All things done by hand must be
precisely documented
● Don't use interactive GUI tools (no history!)
● Use version control software (Git/GitHub)
● Avoid intermediate files, unless they are hard to build (in which case cache
them)
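Caching a hard-to-build intermediate result can be sketched as follows. The helper name `cached`, the JSON file format, and the stand-in `expensive_step` are assumptions for illustration; a temporary directory is used so that the example leaves no files behind in a real project.

```python
import json
import tempfile
from pathlib import Path


def cached(path, build):
    """Return the JSON content of `path`, building and saving it if absent."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    result = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result


calls = []  # counts how many times the expensive step actually runs


def expensive_step():
    """Stand-in for a slow download or computation (hypothetical)."""
    calls.append(1)
    return {"squares": [n * n for n in range(5)]}


cache_file = Path(tempfile.mkdtemp()) / "cache" / "squares.json"
first = cached(cache_file, expensive_step)   # builds the result and writes the file
second = cached(cache_file, expensive_step)  # reads the file; nothing is rebuilt
print(first == second, len(calls))
```

Deleting the cache file forces a rebuild, so the cached copy never becomes the only record of how the result was produced.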
Report Structure
● Project report
○ Abstract: A brief description of the project.
○ Introduction.
○ Methods.
○ Results.
○ Conclusion.
● Code
○ Well-commented scripts that can be executed without any command line parameters or
interaction.
Suggested Directory Structure
● data – for the input data, if needed
● cache – for the previously downloaded data
● results – for numerical results
● code – for the Python script(s)
● doc – for the report and figures
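The layout above could be created with a short script such as the one below. The function name `make_project` and the project name `my_project` are hypothetical, and the example builds the tree under a temporary directory rather than in a real working directory.

```python
import tempfile
from pathlib import Path

# Directory names taken from the suggested structure.
PROJECT_DIRS = ["data", "cache", "results", "code", "doc"]


def make_project(root):
    """Create the suggested subdirectories under `root` and return it as a Path."""
    root = Path(root)
    for name in PROJECT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


project = make_project(Path(tempfile.mkdtemp()) / "my_project")
print(sorted(p.name for p in project.iterdir()))
# prints ['cache', 'code', 'data', 'doc', 'results']
```

Using `exist_ok=True` makes the script safe to rerun, which fits the requirement that scripts execute without any interaction.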
Data Acquisition Pipeline