Data Science With Python
Mosky
Data Science
➤ = Extract knowledge or insights from data.
➤ Data science includes:
➤ Visualization
➤ Statistics
➤ Machine learning
➤ Deep learning
➤ Big data
➤ And related methods
➤ ≈ Data mining
2
We will introduce visualization, statistics, and machine learning.
3
➤ It's kind of outdated, but still contains a lot of keywords.
➤ MrMimic/data-scientist-roadmap – GitHub
➤ Becoming a Data Scientist – Curriculum via Metromap
5
Statistics vs. Machine Learning
➤ Machine learning = statistics - checking of assumptions 😆
➤ But it does solve more problems.
➤ Statistics constructs more solid inferences.
➤ Machine learning constructs more interesting predictions.
6
Probability, Descriptive Statistics, and Inferential Statistics
7
[Diagram labels: Population, Sample, Probability, Descriptive Statistics, Inferential Statistics]
Machine Learning vs. Deep Learning
➤ Deep learning is the most renowned part of machine learning.
➤ A.k.a. “AI”.
➤ Deep learning uses artificial neural networks (NNs).
➤ Which are especially good at:
➤ Computer vision (CV) 👀
➤ Natural language processing (NLP) 📖
➤ Machine translation
➤ Speech recognition
➤ Too costly for simple problems.
8
Big Data
➤ The “size” is a constantly moving target.
➤ As of 2012, it ranged from 10n TB to n PB, a 100x span.
➤ Has high 3Vs:
➤ Volume, amount of data.
➤ Velocity, speed of data in and out.
➤ Variety, range of data types and sources.
➤ A practical definition:
➤ A single computer can't process it in a reasonable time.
➤ Distributed computing is a big deal.
9
Today,
➤ “Models” are mathematical models.
➤ “Statistical models” emphasize inferences.
➤ “Machine learning models” emphasize predictions.
➤ “Deep learning” and “big data” are gigantic subfields.
➤ We won't introduce them.
➤ But the learning resources are listed at the end.
10
Mosky
➤ Python Charmer at Pinkoi.
➤ Has spoken at
➤ PyCons in TW, MY, KR, JP, SG, HK, COSCUPs, and TEDx, etc.
➤ Countless hours spent teaching Python.
➤ Owns the Python packages:
➤ ZIPCodeTW, MoSQL, Clime, etc.
➤ http://mosky.tw/
11
The Outline
➤ “Data”
➤ The Analysis Steps
➤ Visualization
➤ Preprocessing
➤ Dimensionality Reduction
➤ Statistical Models
➤ Machine Learning Models
➤ Keep Learning
12
The Packages
➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
➤ Or
➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
13
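After installing, it may help to confirm that every package can be found before opening the notebooks. The following is a minimal sketch, not part of the original notebooks: `find_missing` and `MODULES` are hypothetical names, and the import names are assumptions that differ from the pip names in places (`sklearn` for scikit-learn, `IPython` for ipython, `jupyter_core` as a stand-in for the jupyter metapackage).

```python
from importlib.util import find_spec

# Import names assumed for the packages listed above. Some differ from the
# pip names: scikit-learn -> sklearn, ipython -> IPython, and jupyter_core is
# used here as a stand-in for the jupyter metapackage.
MODULES = [
    "jupyter_core", "numpy", "scipy", "sympy", "matplotlib",
    "IPython", "pandas", "seaborn", "statsmodels", "sklearn",
]


def find_missing(modules):
    """Return the top-level module names that cannot be found here."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with the pip or conda command above.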
Common Jupyter Notebook Shortcuts
Esc         Edit mode → command mode.
Ctrl-Enter  Run the cell.
B           Insert cell below.
D, D        Delete the current cell.
M           To Markdown cell.
Cmd-/       Comment the code.
H           Show keyboard shortcuts.
P           Open the command palette.
14
Checkpoint: The Packages
➤ Open 00_preface_the_packages.ipynb up.
➤ Run it.
➤ The notebooks are available at https://github.com/moskytw/data-science-with-python
15
“Data”
“Data”
➤ = Variables
➤ = Dimensions
➤ = Labels + Features
17
Data in Different Types
Discrete
  Nominal           e.g., {male, female}
  Ordinal (Ranked)  ↑ & can be ordered, e.g., {great > good > fair}
Continuous
  Interval          ↑ & distance is meaningful, e.g., temperatures
  Ratio             ↑ & 0 is meaningful, e.g., weights
18
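The data type affects how a variable can be turned into numbers. As a rough sketch in plain Python (the function names and category lists are made up for illustration), an ordinal value can be mapped to its rank, while a nominal value is better one-hot encoded so that no order is implied:

```python
ORDER = ["fair", "good", "great"]  # ordinal: the order carries meaning


def encode_ordinal(value, order=ORDER):
    """Map an ordinal category to its rank (0 = lowest)."""
    return order.index(value)


def encode_nominal(value, categories):
    """One-hot encode a nominal category, so that no order is implied."""
    return [1 if value == c else 0 for c in categories]


rank = encode_ordinal("good")
one_hot = encode_nominal("female", ["male", "female"])
```

Interval and ratio variables are already numeric, so they usually need scaling rather than encoding.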
Data in the X-Y Form
y                            x
dependent variable           independent variable
response variable            explanatory variable
regressand                   regressor
endogenous variable | endog  exogenous variable | exog
outcome                      design
label                        feature
19
➤ Confounding variables:
➤ May affect y, but are not included in x.
➤ May lead to erroneous conclusions, “garbage in, garbage out”.
➤ Controlling, e.g., fix the environment.
➤ Randomizing, e.g., choose by computer.
➤ Matching, e.g., order by gender and then assign group.
➤ Statistical control, e.g., BMI to remove height effect.
➤ Double-blind, even triple-blind trials.
20
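Randomizing "by computer" can be as simple as shuffling the subjects and cutting the list in half. The sketch below assumes two groups of nearly equal size; `randomize_groups` is a hypothetical helper, and the subject IDs are made up.

```python
import random


def randomize_groups(subjects, seed=None):
    """Shuffle subjects and split them into two groups of nearly equal size."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(subjects)  # copy, so the caller's sequence is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Ten made-up subject IDs, assigned to control and treatment at random.
control, treatment = randomize_groups(range(10), seed=42)
```

Because the assignment does not depend on any property of the subjects, confounding variables tend to be spread evenly over both groups.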
Get the Data
➤ Logs
➤ Existing datasets
➤ The Datasets Package – StatsModels
➤ Kaggle
➤ Experiments
21
The Analysis Steps
The Three Steps
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
23
1. Define Assumption
➤ Specify a feasible objective.
➤ Rather than “Use AI to get the moon!”
➤ Write a formal assumption.
➤ “The users will buy 1% of the items from our recommendation.” rather than “The users will love our recommendation!”
➤ Note the dangerous gaps.
➤ “All the items from the recommendation are free!”
➤ “Correlation does not imply causation.”
➤ Consider the next actions.
➤ “Release to 100% of users.” rather than “So great!”
24
2. Validate Assumption
➤ Collect potential data.
➤ List possible methods.
➤ A plot, a median, or even a mean may be good enough.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Evaluate the metrics of methods with data.
25
3. Validated Assumption?
➤ Yes → Congrats! Report fully and take the actions! 🎉
➤ No → Check:
➤ The hypotheses of methods.
➤ The confounding variables in data.
➤ The formality of assumption.
➤ The feasibility of objective.

26
Iterate Fast While Industry Changes Rapidly
➤ Resolve the small problems first.
➤ Resolve the problems with a high impact-to-effort ratio first.
➤ One week to get a quick result and improve, rather than one year to get the maybe-the-best result.
➤ Fail fast!
27
Checkpoint: Pick a Method
➤ Think of an interesting problem.
➤ E.g., revenue is higher, but is it random?
➤ Pick one method from the cheatsheets.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Remember the three analysis steps.
28
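For a question like "revenue is higher, but is it random?", one method that needs few assumptions is a permutation test. This is a sketch of one possible choice, not the method the cheatsheets necessarily recommend; the function name and the revenue figures are made up for illustration.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Estimate a one-sided p-value for mean(b) > mean(a) by shuffling labels."""
    rng = random.Random(seed)
    observed = _mean(b) - _mean(a)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the group labels were assigned at random
        diff = _mean(pooled[len(a):]) - _mean(pooled[:len(a)])
        if diff >= observed:
            count += 1
    # Add-one smoothing, so the estimated p-value is never exactly zero.
    return (count + 1) / (n_permutations + 1)


# Made-up daily revenue before and after a change.
before = [100, 102, 98, 101, 99, 103]
after = [110, 112, 108, 111, 109, 113]
p_value = permutation_test(before, after)
```

A small p-value suggests that the difference is unlikely to come from random labeling alone. It still says nothing about confounding variables.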
Visualization
Visualization
➤ Make Data Colorful – Plotting
➤ 01_1_visualization_plotting.ipynb
➤ In a Statistical Way – Descriptive Statistics
➤ 01_2_visualization_descriptive_statistics.ipynb
30
Checkpoint: Plot the Variables
➤ Star98
➤ star98_df = sm.datasets.star98.load_pandas().data
➤ Fair
➤ fair_df = sm.datasets.fair.load_pandas().data
➤ Howell1
➤ howell1_df = pd.read_csv('dataset_howell1.csv', sep=';')
➤ Or your own datasets.
➤ Plot the variables that interest you.
31
Preprocessing
Feed the Data That Models Like
33
➤ Preprocess data for:
➤ Hard requirements, e.g.,
➤ corpus → vectors
➤ “What kind of news will be voted down on PTT?”
➤ Soft requirements (hypotheses), e.g.,
➤ t-test: better when samples are normally distributed.
➤ SVM: better when features range from -1 to 1.
➤ More representative features, e.g., total price / units.
➤ Note that different models have different tastes.
Preprocessing
➤ The Dishes – Containers
➤ 02_1_preprocessing_containers.ipynb
➤ A Cooking Method – Standardization
➤ 02_2_preprocessing_standardization.ipynb
➤ Watch Out for Poisonous Data Points – Removing Outliers
➤ 02_3_preprocessing_removing_outliers.ipynb
34
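Standardization and range scaling are short enough to write by hand, which may make the idea clearer before using a library. This sketch uses only the standard library; the function names and the height values are made up, and the notebook may well use a different implementation.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mu) / sigma for v in values]


def scale_to_range(values, low=-1.0, high=1.0):
    """Linearly map values so that the minimum becomes low and the maximum high."""
    v_min, v_max = min(values), max(values)
    if v_min == v_max:
        raise ValueError("cannot rescale a constant variable")
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up values
z_scores = standardize(heights_cm)
scaled = scale_to_range(heights_cm)
```

`scale_to_range` matches the "features range from -1 to 1" preference mentioned for SVM, while `standardize` gives z-scores.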
Checkpoint: Preprocess the Variables
➤ Try to standardize and compare.
➤ Try to trim the outliers.
35
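One common way to trim outliers is Tukey's rule, which drops the points that lie more than 1.5 × IQR outside the quartiles. The sketch below assumes that rule. The notebook may use a different criterion, and the function name and sample values are made up.

```python
from statistics import quantiles


def trim_outliers(values, k=1.5):
    """Keep only the values within [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


data = [10, 12, 11, 13, 12, 11, 10, 95]  # made-up values; 95 looks suspicious
trimmed = trim_outliers(data)
```

Comparing a plot or the mean before and after trimming shows how much a single extreme point can move the result.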
Dimensionality Reduction
The Model Sicks Up!
➤ Let's reduce the variables.
➤ Feed a subset → feature selection.
➤ Feature selection using SelectFromModel – Scikit-Learn
➤ Feed a transformation → feature extraction.
➤ PCA, FA, etc.
➤ Another definition: non-numbers → numbers.
37
Dimensionality Reduction
➤ Principal Component Analysis
➤ 03_1_dimensionality_reduction_principal_component_analysis.ipynb
➤ Factor Analysis
➤ 03_2_dimensionality_reduction_factor_analysis.ipynb
38
Checkpoint: Reduce the Variables
➤ Try to apply PCA (or FA) to all the variables to get better components.
➤ Then plot the n-dimensional data on a 2-dimensional plane.
39
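To see what PCA computes, the first principal component can be found by power iteration on the covariance matrix. This is a teaching sketch in plain Python under stated assumptions: the largest eigenvalue is distinct and the start vector is not orthogonal to the component. In practice a library routine such as `sklearn.decomposition.PCA` is the usual choice, and the function name and data here are made up.

```python
import math


def first_principal_component(rows, n_iter=100):
    """Return a unit vector along the first principal component of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # degenerate case: no variance along the current vector
        v = [x / norm for x in w]
    return v


# Made-up 2-D points lying exactly on the line y = 2x.
rows = [(float(x), 2.0 * x) for x in range(5)]
component = first_principal_component(rows)
```

For these points the component lies along the direction (1, 2), the line that holds all the variance. Projecting each centered row onto the first two such components gives the coordinates for a 2-dimensional plot.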
Statistical Models
Statistical Models
➤ Identify Boring or Interesting – Hypothesis Testing
➤ 04_1_statistical_models_hypothesis_testings.ipynb
➤ “Hypothesis Testing With Python”
➤ Identify X-Y Relationships – Regression
➤ 04_2_statistical_models_regression_anova.ipynb
41
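The simplest X-Y relationship is a straight line fitted by ordinary least squares. The following sketch shows the closed-form solution for one x variable. It is an illustration with made-up names and data, not the notebook's code, which presumably relies on StatsModels.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if s_xx == 0:
        raise ValueError("x must take at least two distinct values")
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variance of y explained by the fitted line."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # made-up data lying exactly on y = 1 + 2x
intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
```

A statistical package adds what this sketch leaves out, such as standard errors, p-values, and checks of the model's assumptions.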
More Regression Models
➤ If y is not linear,
➤ Logit or Poisson Regression | Generalized Linear Models, GLMs
➤ If y is correlated,
➤ Linear Mixed Models, LMMs | Generalized Estimating Equation, GEE
➤ If x has multicollinearity,
➤ Lasso or Ridge Regression
➤ If error term is heteroscedastic,
➤ Weighted Least Squares, WLS | Generalized Least Squares, GLS
➤ If x is a time series, i.e., predicting x₀ from x₋₁ rather than y from x,
➤ Autoregressive Integrated Moving Average, ARIMA
42
Checkpoint: Apply a Statistical Method
➤ Try to apply the analysis steps with a statistical method.
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
43
Machine Learning Models
Machine Learning Models
➤ Apple or Orange? – Classification
➤ 05_1_machine_learning_models_classification.ipynb
➤ Without Labels – Clustering
➤ 05_2_machine_learning_models_clustering.ipynb
➤ Predict the Values – Regression
➤ Who Are the Best? – Model Selection
➤ sklearn.model_selection.GridSearchCV
45
Confusion matrix, where A = 00₂ = C[0, 0]
                 predicted - (AC)   predicted + (BD)
actual - (AB)    true - (A)         false + (B)
actual + (CD)    false - (C)        true + (D)
(Two-letter labels are totals, e.g., AC = A + C.)
46
Common “rates” in confusion matrix
➤ precision = D / BD
➤ recall = D / CD
➤ sensitivity = D / CD = recall = observed power
➤ specificity = A / AB = observed confidence level
➤ false positive rate = B / AB = observed α
➤ false negative rate = C / CD = observed β
47
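The rates above translate directly into code. In this sketch the arguments follow the A, B, C, D cells of the confusion matrix (true -, false +, false -, true +). The function name and the counts are made up, and division by zero for empty rows or columns is not handled.

```python
def confusion_rates(a, b, c, d):
    """Compute common rates from A (true -), B (false +), C (false -), D (true +)."""
    return {
        "precision": d / (b + d),
        "recall": d / (c + d),  # same as sensitivity
        "specificity": a / (a + b),
        "false_positive_rate": b / (a + b),
        "false_negative_rate": c / (c + d),
    }


rates = confusion_rates(a=50, b=10, c=5, d=35)  # made-up counts
```

Note that specificity and the false positive rate always sum to 1, as do recall and the false negative rate.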
Ensemble Models
➤ Bagging
➤ Train N independent models and average their outputs.
➤ e.g., the random forest models.
➤ Boosting
➤ Train N models sequentially; the n-th model learns from the (n-1)-th model's errors.
➤ e.g., gradient tree boosting.
48
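Bagging can be sketched in a few lines: draw bootstrap samples, fit one model per sample, and average the predictions. The base model below is a simple least-squares line, chosen only to keep the example short. A random forest uses decision trees instead, and all names and data here are made up.

```python
import random


def fit_line(xs, ys):
    """Least-squares line; falls back to a constant if all x values are equal."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        return y_bar, 0.0
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    return y_bar - slope * x_bar, slope


def bagging_predict(xs, ys, x_new, n_models=25, seed=0):
    """Average the predictions of models fitted on bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        intercept, slope = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(intercept + slope * x_new)
    return sum(predictions) / n_models


xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # made-up, noise-free data
prediction = bagging_predict(xs, ys, x_new=5)
```

With noisy data the individual models disagree, and averaging them reduces the variance of the prediction. Boosting differs in that each model is fitted to the errors left by the previous ones.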
Checkpoint: Apply a Machine Learning Method
➤ Try to apply the analysis steps with an ML method.
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
49
Keep Learning
Keep Learning
➤ Statistics
➤ Seeing Theory
➤ Biological Statistics
➤ scipy.stats + StatsModels
➤ Research Methods
➤ Machine Learning
➤ Scikit-learn Tutorials
➤ Stanford CS229
➤ Hsuan-Tien Lin
➤ Deep Learning
➤ TensorFlow | PyTorch
➤ Stanford CS231n
➤ Stanford CS224n
➤ Big Data
➤ Dask
➤ Hive
➤ Spark
➤ HBase
➤ AWS
51
The Facts
➤ ∵
➤ You can't learn everything in data science!
➤ ∴
➤ “Let's learn to do” ❌
➤ “Let's do to learn” ✅
52
The Learning Flow
1. Ask a question.
➤ “How to tell the differences confidently?”
2. Explore the references.
➤ “T-test, ANOVA, ...”
3. Digest into an answer.
➤ Explore in a breadth-first way.
➤ Write the code.
➤ Make it work, make it right, finally make it fast.
53
Recap
➤ Let's do to learn, not learn to do.
➤ What is your objective?
➤ For the objective, what is your assumption?
➤ For the assumption, what method may validate it?
➤ For the method, how will you evaluate it with data?
➤ Q & A
54

Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
PLE-statistics document for primary schs
PLE-statistics document for primary schsPLE-statistics document for primary schs
PLE-statistics document for primary schs
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 

Data Science With Python

  • 1. Data Science With Python Mosky
  • 2. Data Science ➤ = Extract knowledge or insights from data. ➤ Data science includes: ➤ Visualization ➤ Statistics ➤ Machine learning ➤ Deep learning ➤ Big data ➤ And related methods ➤ ≈ Data mining
  • 3. Data Science ➤ = Extract knowledge or insights from data. ➤ Data science includes: ➤ Visualization ➤ Statistics ➤ Machine learning ➤ Deep learning ➤ Big data ➤ And related methods ➤ ≈ Data mining ➤ We will introduce visualization, statistics, and machine learning.
  • 4.
  • 5. ➤ It's kind of outdated, but it still contains a lot of keywords. ➤ MrMimic/data-scientist-roadmap – GitHub ➤ Becoming a Data Scientist – Curriculum via Metromap
  • 6. Statistics vs. Machine Learning ➤ Machine learning = statistics - checking of assumptions 😆 ➤ But it does resolve more problems. ➤ Statistics constructs more solid inferences. ➤ Machine learning constructs more interesting predictions.
  • 7. Probability, Descriptive Statistics, and Inferential Statistics ➤ [Diagram: a Population and a Sample, connected by Probability, Descriptive Statistics, and Inferential Statistics.]
  • 8. Machine Learning vs. Deep Learning ➤ Deep learning is the most renowned part of machine learning. ➤ A.k.a. “AI”. ➤ Deep learning uses artificial neural networks (NNs). ➤ Which are especially good at: ➤ Computer vision (CV) 👀 ➤ Natural language processing (NLP) 📖 ➤ Machine translation ➤ Speech recognition ➤ Too costly for simple problems.
  • 9. Big Data ➤ The “size” threshold is constantly moving. ➤ As of 2012, it ranges from 10n TB to n PB, a 100x spread. ➤ Has high 3Vs: ➤ Volume, amount of data. ➤ Velocity, speed of data in and out. ➤ Variety, range of data types and sources. ➤ A practical definition: ➤ A single computer can't process it in a reasonable time. ➤ Distributed computing is a big deal.
  • 10. Today, ➤ “Models” are the math models. ➤ “Statistical models” emphasize inferences. ➤ “Machine learning models” emphasize predictions. ➤ “Deep learning” and “big data” are gigantic subfields. ➤ We won't introduce them. ➤ But the learning resources are listed at the end.
  • 11. Mosky ➤ Python Charmer at Pinkoi. ➤ Has spoken at PyCons in TW, MY, KR, JP, SG, HK, COSCUPs, TEDx, etc. ➤ Countless hours teaching Python. ➤ Owns the Python packages ZIPCodeTW, MoSQL, Clime, etc. ➤ http://mosky.tw/
  • 12. The Outline ➤ “Data” ➤ The Analysis Steps ➤ Visualization ➤ Preprocessing ➤ Dimensionality Reduction ➤ Statistical Models ➤ Machine Learning Models ➤ Keep Learning
  • 13. The Packages ➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn ➤ Or ➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
  • 14. Common Jupyter Notebook Shortcuts ➤ Esc: edit mode → command mode. ➤ Ctrl-Enter: run the cell. ➤ B: insert cell below. ➤ D, D: delete the current cell. ➤ M: to Markdown cell. ➤ Cmd-/: comment the code. ➤ H: show keyboard shortcuts. ➤ P: open the command palette.
  • 15. Checkpoint: The Packages ➤ Open 00_preface_the_packages.ipynb. ➤ Run it. ➤ The notebooks are available at https://github.com/moskytw/data-science-with-python.
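Before opening the notebook, a quick check that the packages are installed can save a confusing traceback. The sketch below uses only the standard library. The list of import names is an assumed mapping from the pip names on the previous slide (scikit-learn imports as `sklearn`, ipython as `IPython`; Jupyter is launched from the command line, so it is left out), and `missing_packages` is a hypothetical helper, not something from the notebooks.

```python
from importlib.util import find_spec

# Import names, which differ from some pip names (assumed mapping):
# scikit-learn -> sklearn, ipython -> IPython.
IMPORT_NAMES = [
    "numpy", "scipy", "sympy", "matplotlib",
    "IPython", "pandas", "seaborn", "statsmodels", "sklearn",
]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(IMPORT_NAMES)
    print("Missing:", missing if missing else "none")
```

Anything listed as missing can be installed with the pip or conda command above.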
  • 17. “Data” ➤ = Variables ➤ = Dimensions ➤ = Labels + Features
  • 18. Data in Different Types ➤ Discrete ➤ Nominal, e.g., {male, female}. ➤ Ordinal (ranked): ↑ & can be ordered, e.g., {great > good > fair}. ➤ Continuous ➤ Interval: ↑ & distance is meaningful, e.g., temperatures. ➤ Ratio: ↑ & 0 is meaningful, e.g., weights.
  • 19. Data in the X-Y Form ➤ y: dependent variable, response variable, regressand, endogenous variable (endog), outcome, label. ➤ x: independent variable, explanatory variable, regressor, exogenous variable (exog), design, feature.
  • 20. ➤ Confounding variables: ➤ Variables that may affect y but are not among x. ➤ May lead to erroneous conclusions, “garbage in, garbage out”. ➤ Controlling, e.g., fix the environment. ➤ Randomizing, e.g., choose by computer. ➤ Matching, e.g., order by gender and then assign group. ➤ Statistical control, e.g., BMI to remove height effect. ➤ Double-blind, even triple-blind trials.
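The "choose by computer" idea for randomizing can be sketched in a few lines. This is a minimal illustration, assuming two groups of (nearly) equal size; `randomize_groups` and the subject IDs are hypothetical names made up for the example.

```python
import random


def randomize_groups(subjects, seed=None):
    """Shuffle the subjects and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subjects)      # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


subjects = [f"user_{i}" for i in range(10)]  # hypothetical subject IDs
control, treatment = randomize_groups(subjects, seed=42)
```

Because the computer decides who lands in which group, unmeasured variables should be spread roughly evenly across both groups rather than lining up with the treatment.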
  • 21. Get the Data ➤ Logs ➤ Existing datasets ➤ The Datasets Package – StatsModels ➤ Kaggle ➤ Experiments
  • 23. The Three Steps 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 24. 1. Define Assumption ➤ Specify a feasible objective. ➤ Not “Use AI to get the moon!” ➤ Write a formal assumption. ➤ “The users will buy 1% of the items from our recommendation.” rather than “The users will love our recommendation!” ➤ Note the dangerous gaps. ➤ “All the items from recommendation are free!” ➤ “Correlation does not imply causation.” ➤ Consider the next actions. ➤ “Release to 100% of users.” rather than “So great!”
  • 25. 2. Validate Assumption ➤ Collect potential data. ➤ List possible methods. ➤ A plot, a median, or even a mean may be good enough. ➤ Selecting Statistical Tests – Bates College ➤ Choosing a statistical test – HBS ➤ Choosing the right estimator – Scikit-Learn ➤ Evaluate the metrics of the methods with data.
  • 26. 3. Validated Assumption? ➤ Yes → Congrats! Report fully and take the actions! 🎉 ➤ No → Check: ➤ The hypotheses of the methods. ➤ The confounding variables in the data. ➤ The formality of the assumption. ➤ The feasibility of the objective.
  • 27. Iterate Fast While Industry Changes Rapidly ➤ Resolve the small problems first. ➤ Resolve the problems with a high impact-to-effort ratio first. ➤ Take one week to get a quick result and improve, rather than one year to get the maybe-best result. ➤ Fail fast!
  • 28. Checkpoint: Pick a Method ➤ Think of an interesting problem. ➤ E.g., revenue is higher, but is it random? ➤ Pick one method from the cheatsheets. ➤ Selecting Statistical Tests – Bates College ➤ Choosing a statistical test – HBS ➤ Choosing the right estimator – Scikit-Learn ➤ Remember the three analysis steps.
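For the "revenue is higher, but is it random?" example, one possible method is a permutation test. It is not named on the slide and is used here only because it needs no distributional assumptions and fits in the standard library. The sketch assumes two lists of hypothetical daily revenue figures, `before` and `after`; `permutation_p_value` is an illustrative helper.

```python
import random
from statistics import mean


def permutation_p_value(before, after, n_permutations=10_000, seed=0):
    """One-sided p-value: how often does a random relabeling of the pooled
    data give a mean difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = mean(after) - mean(before)
    pooled = list(before) + list(after)
    n_after = len(after)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Treat the first n_after shuffled values as a fake "after" group.
        diff = mean(pooled[:n_after]) - mean(pooled[n_after:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)
```

A small p-value suggests the increase is unlikely to come from random relabeling alone; it says nothing about confounding variables, which still need to be checked separately.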
  • 30. Visualization ➤ Make Data Colorful – Plotting ➤ 01_1_visualization_plotting.ipynb ➤ In a Statistical Way – Descriptive Statistics ➤ 01_2_visualization_descriptive_statistics.ipynb
  • 31. Checkpoint: Plot the Variables ➤ Star98 ➤ star98_df = sm.datasets.star98.load_pandas().data ➤ Fair ➤ fair_df = sm.datasets.fair.load_pandas().data ➤ Howell1 ➤ howell1_df = pd.read_csv('dataset_howell1.csv', sep=';') ➤ Or your own datasets. ➤ Plot the variables that interest you.
  • 33. Feed the Data That Models Like ➤ Preprocess data for: ➤ Hard requirements, e.g., ➤ corpus → vectors ➤ “What kind of news will be voted down on PTT?” ➤ Soft requirements (hypotheses), e.g., ➤ t-test: better when samples are normally distributed. ➤ SVM: better when features range from -1 to 1. ➤ More representative features, e.g., total price / units. ➤ Note that different models have different tastes.
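Two of the ideas on this slide, building a more representative feature (total price / units) and rescaling a feature to the -1 to 1 range, can be sketched with plain lists. The data and the helper name `scale_to_unit_range` are made up for illustration.

```python
def scale_to_unit_range(values):
    """Linearly map values so the minimum becomes -1 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to spread out
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


total_prices = [100.0, 250.0, 90.0]  # hypothetical order totals
units = [4, 10, 3]                   # hypothetical item counts

# A more representative feature: the price per unit.
unit_prices = [p / u for p, u in zip(total_prices, units)]

# Rescaled to [-1, 1], the range the slide says an SVM prefers.
scaled = scale_to_unit_range(unit_prices)
```

In a real pipeline the minimum and maximum would be computed on the training data only and reused for new data, so that the scaling does not leak information from the test set.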
  • 34. Preprocessing ➤ The Dishes – Containers ➤ 02_1_preprocessing_containers.ipynb ➤ A Cooking Method – Standardization ➤ 02_2_preprocessing_standardization.ipynb ➤ Watch Out for Poisonous Data Points – Removing Outliers ➤ 02_3_preprocessing_removing_outliers.ipynb
  • 35. Checkpoint: Preprocess the Variables ➤ Try to standardize and compare. ➤ Try to trim the outliers.
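As a starting point for this checkpoint, here is a minimal standard-library sketch of both steps. It assumes z-score standardization and Tukey's 1.5 × IQR rule for outliers, which are common choices but not necessarily the ones used in the notebooks. The height values are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population standard deviation.
    Assumes the values are not all identical (the deviation must be non-zero)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def trim_outliers(values, k=1.5):
    """Keep the values within k * IQR of the first and third quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]


heights = [150, 152, 155, 158, 160, 162, 165, 300]  # hypothetical; 300 looks like a typo
trimmed = trim_outliers(heights)
z_scores = standardize(trimmed)
```

Trimming before standardizing matters here: a single extreme value inflates the standard deviation and squeezes every other z-score toward zero.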
  • 37. The Model Sicks Up! ➤ Let's reduce the variables. ➤ Feed a subset → feature selection. ➤ Feature selection using SelectFromModel – Scikit-Learn ➤ Feed a transformation → feature extraction. ➤ PCA, FA, etc. ➤ Another definition of feature extraction: non-numbers → numbers.
  • 38. Dimensionality Reduction ➤ Principal Component Analysis ➤ 03_1_dimensionality_reduction_principal_component_analysis.ipynb ➤ Factor Analysis ➤ 03_2_dimensionality_reduction_factor_analysis.ipynb
  • 39. Checkpoint: Reduce the Variables ➤ Try applying PCA (or FA) to all variables to get better components. ➤ Then plot the n-dimensional data onto a 2-dimensional plane.
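To make the PCA step less of a black box, the sketch below finds the first principal component by power iteration on the sample covariance matrix, using only the standard library. It is a teaching sketch under stated assumptions (small dense data, a dominant first component); `first_principal_component` is a hypothetical helper, and in practice a library implementation such as `sklearn.decomposition.PCA` is the usual choice.

```python
import math
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (unit direction, projected scores) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the direction to get 1-D scores.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


rows = [(x, 2 * x) for x in range(5)]  # hypothetical data lying on the line y = 2x
direction, scores = first_principal_component(rows)
```

For the 2-dimensional plot in the checkpoint, a second component is needed as well; one way to get it is to subtract the first component's contribution from the data and repeat.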
  • 41. Statistical Models ➤ Identify Boring or Interesting – Hypothesis Testing ➤ 04_1_statistical_models_hypothesis_testings.ipynb ➤ “Hypothesis Testing With Python” ➤ Identify X-Y Relationships – Regression ➤ 04_2_statistical_models_regression_anova.ipynb
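As a taste of what a hypothesis test computes, here is a minimal sketch of Welch's two-sample t statistic and its approximate degrees of freedom, written with the standard library only. Welch's variant is an assumption made for the example (it does not require equal variances); the notebook may use a different test. `welch_t` is a hypothetical helper.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    # Squared standard errors of the two sample means.
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


group_a = [1, 2, 3, 4, 5]  # hypothetical measurements
group_b = [2, 3, 4, 5, 6]
t_stat, dof = welch_t(group_a, group_b)
```

Turning the statistic into a p-value needs the t distribution, which the standard library lacks; `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the whole test in one call.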
  • 42. More Regression Models ➤ If y is not linear, ➤ Logit or Poisson Regression | Generalized Linear Models, GLMs ➤ If y is correlated, ➤ Linear Mixed Models, LMMs | Generalized Estimating Equation, GEE ➤ If x has multicollinearity, ➤ Lasso or Ridge Regression ➤ If the error term is heteroscedastic, ➤ Weighted Least Squares, WLS | Generalized Least Squares, GLS ➤ If x is a time series (predict x₀ from x₋₁ rather than y from x), ➤ Autoregressive Integrated Moving Average, ARIMA
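To illustrate one entry from this list, the sketch below fits a straight line by weighted least squares in closed form. It assumes a single x variable and known weights (typically the inverse of each observation's error variance); with equal weights it reduces to ordinary least squares. `weighted_least_squares` is a hypothetical helper; StatsModels provides a full implementation as `sm.WLS`.

```python
def weighted_least_squares(x, y, w):
    """Fit y ≈ intercept + slope * x by minimizing sum(w_i * residual_i ** 2)."""
    total = sum(w)
    # Weighted means of x and y.
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted cross-deviation and weighted squared deviation.
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


x = [0, 1, 2, 3]
y = [1, 3, 5, 7]   # hypothetical data lying exactly on y = 1 + 2x
w = [1, 2, 3, 4]   # hypothetical weights: larger means more trusted
intercept, slope = weighted_least_squares(x, y, w)
```

Down-weighting the noisy observations keeps them from dominating the fit, which is the point of WLS when the error variance is not constant.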
  • 43. Checkpoint: Apply a Statistical Method ➤ Try to apply the analysis steps with a statistical method. 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 45. Machine Learning Models ➤ Apple or Orange? – Classification ➤ 05_1_machine_learning_models_classification.ipynb ➤ Without Labels – Clustering ➤ 05_2_machine_learning_models_clustering.ipynb ➤ Predict the Values – Regression ➤ Who Are the Best? – Model Selection ➤ sklearn.model_selection.GridSearchCV
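  • 46. Confusion matrix, where A = C[0, 0] ➤ Rows are actual classes, columns are predicted classes: ➤ actual −, predicted −: A (true −) ➤ actual −, predicted +: B (false +) ➤ actual +, predicted −: C (false −) ➤ actual +, predicted +: D (true +) ➤ Totals: AB = A + B (actual −), CD = C + D (actual +), AC = A + C (predicted −), BD = B + D (predicted +).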
  • 47. Common “rates” in confusion matrix ➤ precision = D / BD ➤ recall = D / CD ➤ sensitivity = D / CD = recall = observed power ➤ specificity = A / AB = observed confidence level ➤ false positive rate = B / AB = observed α ➤ false negative rate = C / CD = observed β
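The rates above can be computed directly from the four cells. The sketch below follows the slide's lettering and assumes that the two-letter names are marginal totals (AB = A + B, CD = C + D, BD = B + D) and that labels are coded 0 for − and 1 for +. The function names and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Count A (true -), B (false +), C (false -), D (true +) for 0/1 labels."""
    a = b = c = d = 0
    for y, y_hat in zip(actual, predicted):
        if y == 0 and y_hat == 0:
            a += 1
        elif y == 0 and y_hat == 1:
            b += 1
        elif y == 1 and y_hat == 0:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rates(a, b, c, d):
    """Compute the common rates from the four cells of the confusion matrix."""
    return {
        "precision": d / (b + d),
        "recall": d / (c + d),               # also called sensitivity
        "specificity": a / (a + b),
        "false_positive_rate": b / (a + b),
        "false_negative_rate": c / (c + d),
    }


actual = [0, 0, 0, 0, 1, 1, 1, 1]      # hypothetical ground truth
predicted = [0, 0, 0, 1, 0, 1, 1, 1]   # hypothetical model output
cells = confusion_counts(actual, predicted)
result = rates(*cells)
```

`sklearn.metrics.confusion_matrix` returns the same cells as a 2 × 2 array indexed as C[actual, predicted], which is where the C[0, 0] notation comes from.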
  • 48. Ensemble Models ➤ Bagging ➤ Train N independent models and average their outputs. ➤ E.g., random forest models. ➤ Boosting ➤ Train N models sequentially; the n-th model learns from the errors of the (n−1)-th. ➤ E.g., gradient tree boosting.
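Bagging can be sketched without any library. In the example below each base model is a deliberately simple 1-nearest-neighbor regressor trained on a bootstrap resample, and the ensemble averages their predictions. A random forest uses decision trees as base models instead; the nearest-neighbor model is swapped in only to keep the sketch short. All names and data are hypothetical.

```python
import random
from statistics import mean


def fit_nearest_neighbor(xs, ys):
    """A simple base model: predict the y of the closest training x."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def fit_bagging(xs, ys, n_models=25, seed=0):
    """Train each base model on a bootstrap resample and average their outputs."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(
            fit_nearest_neighbor([xs[i] for i in idx], [ys[i] for i in idx])
        )
    return lambda x: mean(model(x) for model in models)


xs = list(range(10))
ys = [2 * x for x in xs]       # hypothetical training data
predict = fit_bagging(xs, ys)
estimate = predict(4.2)
```

Each resample leaves out some points, so the individual models disagree slightly; averaging them smooths out that variance, which is the main benefit of bagging.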
  • 49. Checkpoint: Apply a Machine Learning Method ➤ Try to apply the analysis steps with an ML method. 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 51. Keep Learning ➤ Statistics ➤ Seeing Theory ➤ Biological Statistics ➤ scipy.stats + StatsModels ➤ Research Methods ➤ Machine Learning ➤ Scikit-learn Tutorials ➤ Stanford CS229 ➤ Hsuan-Tien Lin ➤ Deep Learning ➤ TensorFlow | PyTorch ➤ Stanford CS231n ➤ Stanford CS224n ➤ Big Data ➤ Dask ➤ Hive ➤ Spark ➤ HBase ➤ AWS
  • 52. The Facts ➤ ∵ ➤ You can't learn everything in data science! ➤ ∴ ➤ “Let's learn to do” ❌ ➤ “Let's do to learn” ✅
  • 53. The Learning Flow 1. Ask a question. ➤ “How to tell the differences confidently?” 2. Explore the references. ➤ “T-test, ANOVA, ...” 3. Digest into an answer. ➤ Explore in a breadth-first way. ➤ Write the code. ➤ Make it work, make it right, and finally make it fast.
  • 54. Recap ➤ Let's do to learn, not learn to do. ➤ What is your objective? ➤ For the objective, what is your assumption? ➤ For the assumption, what method may validate it? ➤ For the method, how will you evaluate it with data? ➤ Q & A