Beginner's Guide to Data Science
Data Science is: Popular
Lots of Data => Lots of Analysis => Lots of Jobs
Universities: Starting new multidisciplinary programs
Industry: Cottage industry evolving for online and training courses
Goal of this Talk:
● Hear it from people who do it, and learn what they do
● Use it for further learning and specialization
2
Data is: Big!
● 2.5 quintillion (10^18) bytes of data are generated every day!
● Everything around you collects/generates data
● Social media sites
● Business transactions
● Location-based data
● Sensors
● Digital photos, videos
● Consumer behaviour (online and store transactions)
● More data is publicly available
● Database technology is advancing
● Cloud based & mobile applications are widespread
Source: IBM http://www-01.ibm.com/software/data/bigdata/
3
If I have data, I will know :)
Everyone wants better predictability, forecasting, customer satisfaction, market
differentiation, prevention, great user experience, ...
● How can I price a particular product?
● What can I recommend that online customers buy after they buy X, Y or Z?
● How can we discover market segments and group customers into them?
● What will customers buy in the upcoming holiday season? (What should we stock?)
● What is the price point for customer retention for subscriptions?
4
Data Science is: making sense of Data
● Multidisciplinary study of data collections for analysis, prediction, learning and
prevention.
● Utilized in a wide variety of industries.
● Involves both structured and unstructured data sources.
5
Data Science is: multidisciplinary
● Statisticians
● Mathematicians
● Computer Scientists in
○ Data mining
○ Artificial Intelligence & Machine Learning
○ Systems Development and Integration
○ Database development
○ Analytics
● Domain Experts
○ Medical experts
○ Geneticists
○ Finance, Business, Economy experts
○ etc.
6
[Figure: data science workflow diagram. Box labels: Start; What is the question?; What type of data is needed?; Plan; Data Acquisition; Data Quality Analysis; Reformatting & Imputing Data; Clean Data; Scripts; Data Analysis; Explore the Data; Feature Engineering; Modeling; Feature Selection; Model Selection; Results Evaluation; Deployment and optimization; Deployment; Maintenance; Optimization]
7
Data Acquisition Stage
● Once the data scientist has identified the problem she is trying to solve, she
must assess:
● What type of data is available
● What data might be required but is not currently collected
● Is it available from other units of the company?
● Does she need to crawl/buy data from third parties?
● How much data is needed? (Data volume)
● How to access the data?
● Is the data private?
● Is it legally OK to use the data?
9
Data Acquisition Stage
● Data may not exist
● Sources of data may be public or private
● Not all sources of data may be suitable for processing
● Data are often incomplete and dirty
● Data consolidation and cleanup are essential
○ Pieces of data may be in different sources
○ Formats may not match/may be incompatible
○ Unstructured data may need to be accounted for
10
Data Acquisition Stage -- Example
Example: Analyzing the online customer experience may require collecting a lot of data, such as
● clicks
● conversions
● add-to-cart rate
● dwell time
● average order value
● foot traffic
● bounce rate
● exits and time to purchase
11
Data Acquisition: Type and Source of Data
● Time spent on a page, browsing and/or search history
○ Website Logs
● User and Inventory Data
○ Transaction databases
● Social Engagement
○ Social Networks (Yelp, Twitter,...)
● Customer Support
○ Call Logs, Emails
● Gas prices, competitors, news, stock prices, etc.
○ RSS Feeds, News Sites, Wikipedia,...
● Training Data?
○ CrowdFlower, Mechanical Turk
12
Data Acquisition: Storage and Access
● Where the data resides
○ Cloud or Computing Clusters
● Storage System
○ SQL, NoSQL, File System
○ SQL: MySQL, Oracle, MS SQL Server, ...
○ NoSQL: MongoDB, Cassandra, Couchbase, HBase, Hive, ...
○ Text Indexing: Solr, Elasticsearch, ...
● Data Processing Frameworks:
○ Hadoop, Spark, Storm, etc.
13
Data Acquisition: Data Integration
Data integration involves combining data residing in different sources and providing users with a unified view of these data. (Wikipedia)
● Schema Mapping
● Record Matching
● Data Cleaning
[Figure: Data Sources 1 to 4 feed an ETL step that loads a Data Warehouse]
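The three integration steps listed above (schema mapping, record matching, data cleaning) can be sketched with pandas. This is only an illustration: the two source tables, their column names and the choice to fill missing click counts with 0 are assumptions, not part of the original talk.

```python
import pandas as pd

# Two hypothetical sources describing the same customers with different schemas.
crm = pd.DataFrame({
    "cust_id": [1, 2, 3],
    "full_name": ["Ann Lee", "Bob Ray", "Cy Dow"],
})
web = pd.DataFrame({
    "CustomerID": [2, 3, 4],
    "clicks": [10, None, 7],
})

# Schema mapping: rename source-specific columns to one shared schema.
web_mapped = web.rename(columns={"CustomerID": "cust_id"})

# Record matching: join records that refer to the same customer.
unified = crm.merge(web_mapped, on="cust_id", how="outer")

# Data cleaning: make missing values explicit and consistent
# (filling with 0 / "unknown" is an assumption made for this sketch).
unified["clicks"] = unified["clicks"].fillna(0).astype(int)
unified["full_name"] = unified["full_name"].fillna("unknown")

print(unified)
```

Real record matching is usually harder than an exact key join, because the same entity may appear under slightly different identifiers or spellings in each source.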
14
Data Cleaning
● Data are often incomplete or incorrect.
○ Typos: e.g., text data in numeric fields
○ Missing values: some fields may not be collected for some of the examples
○ Impossible data combinations: e.g., gender = MALE, pregnant = TRUE
○ Out-of-range values: e.g., age = 1000
● Garbage in, garbage out
● Tools: scripting and visualization
Figure ref: https://thedailyomnivore.net/2015/12/02/
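The error types listed on this slide can each be flagged with a short script. The sketch below uses pandas on a made-up table; the column names and the plausible age range of 0 to 120 are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw records containing the kinds of errors listed above.
raw = pd.DataFrame({
    "age": ["34", "abc", "1000", None],
    "gender": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "pregnant": [True, False, True, False],
})

# Typos: text in a numeric field becomes NaN instead of raising an error.
age = pd.to_numeric(raw["age"], errors="coerce")

issues = pd.DataFrame({
    "not_numeric": raw["age"].notna() & age.isna(),
    "missing": raw["age"].isna(),
    "out_of_range": (age < 0) | (age > 120),   # assumed plausible range
    "impossible": (raw["gender"] == "MALE") & raw["pregnant"],
})

print(issues)
print(issues.sum())   # number of rows flagged per check
```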
15
Analysis - Data Preparation
● Univariate Analysis: Analyze/explore variables one by one
● Bivariate Analysis: Explore relationship between variables
● Coverage, missing values: treating unknown values
● Outliers: detect and treat values that are distant from other observations
● Feature Engineering: variable transformations and creation of new, better variables from raw features
Commonly used tools:
● SQL
● R: plyr, reshape, ggplot2, data.table
● Python: NumPy, Pandas, SciPy, matplotlib
17
Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one
- Continuous variable: explore central tendency and spread of the values
  - Summary statistics
    - mean, median, min, max
    - IQR, standard deviation, variance, quartiles
  - Visualize: histograms, boxplots
18
Analysis - Exploratory Analysis
Walmart Store Sales Forecasting Data, Kaggle
Summary statistics for “Temperature”:
Min.    1st Qu.  Median  Mean    3rd Qu.  Max.     Std Dev.
-7.29   45.90    60.71   59.36   73.88    102.00   18.68
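Summary statistics like these can be computed in a few lines with NumPy. The readings below are made up for illustration and are not the Walmart data; the sample (ddof=1) form of the standard deviation is also an assumption.

```python
import numpy as np

# Hypothetical temperature readings (not the Kaggle data set).
values = np.array([45.0, 52.5, 60.0, 61.5, 70.0, 74.0, 88.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
summary = {
    "min": values.min(),
    "q1": q1,
    "median": median,
    "mean": values.mean(),
    "q3": q3,
    "max": values.max(),
    "iqr": q3 - q1,
    "std": values.std(ddof=1),   # sample standard deviation
    "var": values.var(ddof=1),   # sample variance
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```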
19
Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one
- Categorical variable: frequency tables
  - Count and count %
  - Visualize: bar charts
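A frequency table with counts and percentages can be sketched with pandas as follows. The variable name and its levels are hypothetical.

```python
import pandas as pd

# Hypothetical categorical variable (one store type per record).
store_type = pd.Series(["A", "B", "A", "C", "A", "B"], name="store_type")

freq = pd.DataFrame({
    "count": store_type.value_counts(),
    "percent": store_type.value_counts(normalize=True) * 100,
})

print(freq)
```

If matplotlib is installed, `freq["count"].plot.bar()` would draw the corresponding bar chart.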
20
Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables
- Continuous to continuous variables: correlation measures the strength and direction of a linear relationship
- Visualize with scatterplots, because the relationship may not be linear
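The warning that a relationship may not be linear is worth a small experiment. In the sketch below (synthetic data, chosen for illustration), y depends on x in both cases, but the Pearson correlation only detects the linear one.

```python
import numpy as np

x = np.linspace(-3, 3, 61)

y_linear = 2 * x + 1      # perfectly linear relationship
y_curved = x ** 2         # strong relationship, but not a linear one

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"linear: r = {r_linear:.3f}")
print(f"curved: r = {r_curved:.3f}")
```

The curved case gives a correlation of essentially zero even though y is fully determined by x, which is why a scatterplot should accompany the coefficient.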
21
Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables
- Categorical to categorical variables: crosstab table
  - Visualize: stacked bar charts
- Continuous to categorical variables:
  - Visualize: boxplots, histograms for each level (category)
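Both kinds of bivariate summary can be sketched with pandas. The data frame and its column names are invented for this example.

```python
import pandas as pd

# Hypothetical records: two categorical variables and one continuous one.
df = pd.DataFrame({
    "segment": ["new", "new", "returning", "returning", "returning", "new"],
    "channel": ["web", "app", "web", "web", "app", "web"],
    "order_value": [20.0, 35.0, 50.0, 65.0, 40.0, 25.0],
})

# Categorical vs. categorical: counts for every combination of levels.
table = pd.crosstab(df["segment"], df["channel"])
print(table)

# Continuous vs. categorical: summarize the continuous variable per level.
per_level = df.groupby("segment")["order_value"].agg(["count", "median", "mean"])
print(per_level)
```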
22
Analysis - Correlation vs Causation
Correlation ⇏ causation!
To prove causation:
● Randomized controlled experiments
● Hypothesis testing, A/B testing
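As one possible illustration of hypothesis testing in an A/B test, the sketch below runs a permutation test on made-up conversion data. The group sizes, conversion counts and number of permutations are all assumptions; a permutation test is only one of several valid tests for this setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical A/B test outcomes: 1 = converted, 0 = did not convert.
group_a = np.array([1] * 30 + [0] * 170)   # 15% conversion
group_b = np.array([1] * 50 + [0] * 150)   # 25% conversion

observed = group_b.mean() - group_a.mean()

# If the variant had no effect, the group labels would be exchangeable,
# so we shuffle them and see how often a difference this large appears.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_permutations = 10_000
count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(pooled)
    diff = shuffled[n_a:].mean() - shuffled[:n_a].mean()
    if abs(diff) >= abs(observed) - 1e-12:   # tolerance for float rounding
        count += 1

p_value = count / n_permutations
print(f"observed difference = {observed:.3f}, p-value = {p_value:.4f}")
```

Because users are assigned to the groups at random, a small p-value here supports a causal effect of the variant, which a plain correlation in observational data would not.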
24
Analysis - Feature Engineering
Create new features from existing raw features: discretize, bin
Transform variables
Create new categorical variables when a feature has too many levels, levels that rarely occur, or one
level that almost always occurs
Handle extremely skewed data and outliers
Imputation: filling in missing data
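Three of the techniques above (binning, transforming a skewed variable, and merging rare levels) can be sketched with pandas. The data, the bin edges and the "fewer than 2 occurrences" rule for rare levels are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw features.
df = pd.DataFrame({
    "age": [19, 23, 37, 45, 52, 68],
    "income": [1_000, 20_000, 35_000, 50_000, 80_000, 2_000_000],
    "city": ["Oslo", "Oslo", "Rome", "Oslo", "Lima", "Kyiv"],
})

# Discretize / bin a continuous feature (bin edges are an assumption).
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 50, 120],
                         labels=["young", "middle", "senior"])

# Transform an extremely skewed feature to reduce the pull of large values.
df["log_income"] = np.log1p(df["income"])

# Merge levels that rarely occur into a single "other" level.
counts = df["city"].value_counts()
rare = counts[counts < 2].index
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "other")

print(df)
```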
25
Analysis - Missing Values
Missing values are unknown values of a feature.
They matter because they may lead to biased models or to incorrect estimates and conclusions.
Some ML algorithms accept missing values: for example, some tree-based models treat
missing values as a separate branch, while many other algorithms require a complete
dataset. Therefore, we can
● omit: remove records with missing values and use the remaining data
● impute: replace missing values with an estimate, such as the mean/median/mode of the
existing data, the values of the most similar data points (KNN), or the output of more
complex algorithms like Random Forest
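The two options, omit and impute, can be sketched with pandas as follows. The table is hypothetical, and only the simple median/mode strategy is shown; the KNN and Random Forest approaches mentioned above are not implemented here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in both columns.
df = pd.DataFrame({
    "temperature": [60.0, np.nan, 70.0, 65.0, np.nan],
    "store_type": ["A", "B", None, "A", "A"],
})

# Omit: keep only the rows without missing values.
complete_rows = df.dropna()

# Impute: median for the numeric column, mode for the categorical column.
imputed = df.copy()
imputed["temperature"] = imputed["temperature"].fillna(df["temperature"].median())
imputed["store_type"] = imputed["store_type"].fillna(df["store_type"].mode()[0])

print(complete_rows)
print(imputed)
```

Omitting keeps only 2 of the 5 rows in this example, which shows how quickly available data can shrink when several columns have gaps.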
26
Analysis - Outliers
Outliers are values distant from other observations, such as values more than about three
standard deviations away from the mean, values below the 5th or above the 95th percentile,
or values more than 1.5 × IQR below the first quartile or above the third quartile.
Visualization methods like boxplots, histograms and scatterplots help to spot them.
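The three rules of thumb can be applied directly with NumPy. The sample below is synthetic, with one deliberately extreme value; the thresholds follow the rules named on the slide.

```python
import numpy as np

# Hypothetical sample: twenty ordinary values and one extreme value.
values = np.array([10.0, 11.0, 12.0, 13.0] * 5 + [95.0])

# Rule 1: more than three standard deviations away from the mean.
z_scores = (values - values.mean()) / values.std(ddof=1)
z_outliers = values[np.abs(z_scores) > 3]

# Rule 2: below the 5th or above the 95th percentile.
p5, p95 = np.percentile(values, [5, 95])
pct_outliers = values[(values < p5) | (values > p95)]

# Rule 3: more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers, pct_outliers, iqr_outliers)
```

Note that the three-standard-deviation rule can miss outliers in very small samples, because the extreme value itself inflates the standard deviation.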
27
Analysis - Outliers
Some algorithms, such as regression, are sensitive to outliers, which can cause high error
variance and bias in the estimated values.
Treatment options: delete, cap, transform, or impute them as with missing values.
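The four treatment options can be sketched side by side with NumPy. The sample is synthetic, and using the 1.5 × IQR fences as the cap and the median as the imputed value are assumptions; other choices are equally reasonable.

```python
import numpy as np

# Hypothetical sample with one extreme value.
values = np.array([10.0, 11.0, 12.0, 13.0] * 5 + [95.0])

q1, q3 = np.percentile(values, [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
is_outlier = (values < lower) | (values > upper)

deleted = values[~is_outlier]                 # delete the outliers
capped = np.clip(values, lower, upper)        # cap at the IQR fences
transformed = np.log(values)                  # compress large values
imputed = np.where(is_outlier,                # replace with the median
                   np.median(values[~is_outlier]), values)

print(len(deleted), capped.max(), transformed.max(), imputed[-1])
```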
28
Predictive data modeling
Prediction, that is the end goal of many data science adventures!
Data on consumer behaviour is collected:
● to predict future consumer behaviour and to take action accordingly
Examples:
● Recommendation systems (Netflix, Pandora, Amazon, etc.)
● Online user behaviour is used to predict the best targeted ads
● Customer purchase histories are used to determine how to price, stock,
market and display future products.
30
Machine learning
● Machine Learning is the study of algorithms that improve their performance at
some task with example data or past experience
○ The foundations of many ML algorithms lie in statistics and optimization theory
○ Role of computer science: efficient algorithms to
■ Solve the optimization problem
■ Represent and evaluate data models for inference
● A wide variety of off-the-shelf algorithms is available today. Just pick a library
and go! (Is it really that easy?)
○ Short answer: no. Long answer: model selection and tuning require a deeper understanding.
31
Machine learning - basics
Machine learning systems are made up of 3 major parts, which are:
● Model: the system that makes predictions.
● Parameters: the signals or factors used by the model to form its decisions.
● Learner: the system that adjusts the parameters, and in turn the model, by looking at differences in predictions versus actual outcomes.
Ref: http://marketingland.com/how-machine-learning-works-150366
32
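The model / parameters / learner split can be made concrete with a tiny linear model trained by gradient descent. This is a minimal sketch on synthetic, noise-free data; the learning rate and iteration count are assumptions chosen so that this particular example converges.

```python
import numpy as np

# Hypothetical training data generated from y = 3x + 2 (no noise).
x = np.linspace(0, 1, 50)
y = 3 * x + 2

# Parameters: the values the model uses to form its predictions.
w, b = 0.0, 0.0

def model(x, w, b):
    """Model: turns inputs into predictions using the current parameters."""
    return w * x + b

# Learner: adjusts the parameters by looking at the differences between
# predictions and actual outcomes (gradient descent on mean squared error).
learning_rate = 0.5
for _ in range(2000):
    error = model(x, w, b) - y
    w -= learning_rate * 2 * np.mean(error * x)
    b -= learning_rate * 2 * np.mean(error)

print(f"w = {w:.3f}, b = {b:.3f}")
```

The learner recovers parameters close to the values that generated the data, which is the sense in which the system "learns" from examples.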
Machine learning application examples
● Association Analysis
○ Basket analysis: Find the probability that somebody
who buys X also buys Y
● Supervised Learning
○ Classification: Spam filter, language prediction,
customer/visit type prediction
○ Regression: Pricing
○ Recommendation
● Unsupervised Learning
○ Given a database of customer data, automatically
discover market segments and group customers into
different market segments
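The basket analysis example above, estimating the probability that somebody who buys X also buys Y, reduces to counting. The baskets below are made up; the function names `confidence` and `support` follow common association-rule terminology rather than anything in the original slides.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
    {"bread", "butter", "milk"},
]

def confidence(baskets, x, y):
    """Estimate P(y in basket | x in basket) from the observed baskets."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

def support(baskets, x, y):
    """Fraction of all baskets that contain both x and y."""
    return sum(1 for b in baskets if x in b and y in b) / len(baskets)

print(confidence(baskets, "bread", "butter"))   # 3 of 4 bread baskets
print(support(baskets, "bread", "butter"))      # 3 of 5 baskets overall
```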
33
Model selection and generalization
● Learning is an ill-posed problem; data is not sufficient to find a unique solution
● There is a trade-off between three factors:
○ Model complexity
○ Training set size
○ Generalization error (expected error on new data)
● Overfitting and underfitting problems
Ref: http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
34
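The trade-off between model complexity and generalization error can be explored by fitting polynomials of increasing degree to noisy samples. Everything in this sketch is an assumption chosen for illustration: the sine-shaped target, the noise level, the sample sizes and the degrees compared.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

# Hypothetical noisy samples of an unknown function.
x_train = np.linspace(0, 1, 15)
y_train = true_fn(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0.02, 0.98, 200)
y_test = true_fn(x_test) + rng.normal(0, 0.2, x_test.size)

errors = {}
for degree in (1, 3, 9):   # increasing model complexity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Training error can only go down as the degree grows, so it cannot be used to pick the model; the error on unseen data is what reveals underfitting (too simple) or overfitting (too complex).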
Generalization error and cross-validation
● Measuring the generalization error is a major challenge in data mining and machine learning
● To estimate generalization error, we need data unseen during training. We could split the data as
○ Training set (50%)
○ Validation set (25%) (optional, for selecting ML algorithm parameters)
○ Test (publication) set (25%)
● How to avoid selection bias: k-fold cross-validation
Figure ref: https://www.quora.com/I-train-my-system-based-on-the-10-fold-cross-validation-framework-Now-it-gives-me-10-different-models-Which-model-to-select-as-a-representative
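A minimal sketch of k-fold cross-validation with NumPy follows. The helper `k_fold_indices`, the synthetic data and the choice of a straight-line model are assumptions for illustration; libraries such as scikit-learn provide ready-made equivalents.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Hypothetical data: y is roughly linear in x.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

fold_errors = []
for train_idx, val_idx in k_fold_indices(len(x), k=5):
    # Fit on the training folds only, then score on the held-out fold.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
    predictions = slope * x[val_idx] + intercept
    fold_errors.append(np.mean((predictions - y[val_idx]) ** 2))

cv_error = np.mean(fold_errors)
print(f"per-fold MSE: {np.round(fold_errors, 3)}")
print(f"cross-validated MSE: {cv_error:.3f}")
```

Averaging over the k held-out folds gives an error estimate that does not depend on one lucky or unlucky split.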
35
Deep Learning
● Neural networks (NNs) have been around for decades, but they just weren't "deep" enough. NNs with
several hidden layers are called deep neural networks (DNNs).
● Unlike many ML approaches, deep learning attempts to model high-level abstractions in data.
● Deep learning is best suited to input spaces that are locally structured (spatially or temporally),
as opposed to arbitrary input features
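To make "several hidden layers" concrete, the sketch below runs a forward pass through a small fully connected network in NumPy. The layer sizes, the ReLU activation and the random (untrained) weights are assumptions; no learning takes place here.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# 4 inputs -> three hidden layers of 8 units -> 1 output (sizes are arbitrary).
layer_sizes = [4, 8, 8, 8, 1]
weights = [rng.normal(0, 0.5, (n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Pass a batch through every layer; hidden layers use ReLU."""
    activation = x
    for w, b in zip(weights[:-1], biases[:-1]):
        activation = relu(activation @ w + b)
    return activation @ weights[-1] + biases[-1]   # linear output layer

batch = rng.normal(size=(5, 4))   # 5 hypothetical examples, 4 features each
output = forward(batch)
print(output.shape)
```

Each hidden layer re-represents the output of the previous one, which is how deeper networks can build up higher-level abstractions from raw inputs.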
36
Deployment, maintenance and optimization
● Deployed solutions might include:
○ A trained data model (model + parameters)
○ Routines for data input and prediction
○ (Optional) Routines for model improvement (through feedback, the deployed system can improve itself)
○ (Optional) Routines for training
● Once the model has been deployed in production, it is time for regular
maintenance and operations.
● The optimization phase could be triggered by failing performance, by the need to
add new data sources and retrain the model, or by the wish to deploy improved
versions of the model based on better algorithms.
Ref: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A234092
38
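A deployed solution as "model + parameters" plus a prediction routine can be sketched with the standard library alone. The JSON file format, the field names and the linear model are assumptions for illustration; real deployments typically use a model-serving framework or the serialization format of the ML library in use.

```python
import json
import os
import tempfile

def predict(params, x):
    """Prediction routine: apply the stored linear model to one input."""
    return params["w"] * x + params["b"]

# Hypothetical result of training: model type plus learned parameters.
trained = {"model": "linear", "w": 3.0, "b": 2.0, "version": 1}

# Training side: persist the trained model.
path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w") as f:
    json.dump(trained, f)

# Production side: load the model once, then serve predictions.
with open(path) as f:
    deployed = json.load(f)

print(predict(deployed, 4.0))
```

Storing a version field alongside the parameters makes it easier to roll out, compare and roll back retrained models during the optimization phase.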
Recap - Software Toolbox of Data Scientists:
● Database
○ SQL
○ NoSQL languages for target databases
● Programming Languages and Libraries
○ Python (due to the availability of libraries for data management): scikit-learn, PyML, pandas
○ R
○ General programming languages such as Java for gluing different systems together
○ C/C++: mlpack, dlib
● Tools: Orange, Weka, MATLAB
● Vendor-specific platforms for data analytics (such as Adobe Marketing Cloud, etc.)
● Hive
● Spark
39
Conclusion: It takes a team
Must haves:
- Programming and Scripting skills
- Statistics and data analysis skills
- Machine learning skills
Necessary but not sufficient:
- Database management skills
- Distributed computing skills
Domain knowledge may make or break a system: If you do not realize a type of
data is essential, the results will not be very useful
40
41
THANK YOU

Contenu connexe

Tendances

1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalitiesRajendran
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining ProcessMarc Berman
 
Importance of Data Mining
Importance of Data MiningImportance of Data Mining
Importance of Data MiningScottperrone
 
Data mining presentation by Wasiful Alam Fahim
Data mining presentation by Wasiful Alam FahimData mining presentation by Wasiful Alam Fahim
Data mining presentation by Wasiful Alam FahimWasiful Alam Fahim
 
Data Analytics and Big Data on IoT
Data Analytics and Big Data on IoTData Analytics and Big Data on IoT
Data Analytics and Big Data on IoTShivam Singh
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-stepsShesha R
 
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]ssuser23e4f31
 

Tendances (20)

Image Analytics for Retail
Image Analytics for RetailImage Analytics for Retail
Image Analytics for Retail
 
AlogoAnalytics Company Presentation
AlogoAnalytics Company PresentationAlogoAnalytics Company Presentation
AlogoAnalytics Company Presentation
 
1.2 steps and functionalities
1.2 steps and functionalities1.2 steps and functionalities
1.2 steps and functionalities
 
Image Analytics In Healthcare
Image Analytics In HealthcareImage Analytics In Healthcare
Image Analytics In Healthcare
 
Basic Overview of Data Mining
Basic Overview of Data MiningBasic Overview of Data Mining
Basic Overview of Data Mining
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining Process
 
Importance of Data Mining
Importance of Data MiningImportance of Data Mining
Importance of Data Mining
 
Data mining presentation by Wasiful Alam Fahim
Data mining presentation by Wasiful Alam FahimData mining presentation by Wasiful Alam Fahim
Data mining presentation by Wasiful Alam Fahim
 
Machine Learning For Stock Broking
Machine Learning For Stock BrokingMachine Learning For Stock Broking
Machine Learning For Stock Broking
 
Chatbots: Automated Conversational Model using Machine Learning
Chatbots: Automated Conversational Model using Machine LearningChatbots: Automated Conversational Model using Machine Learning
Chatbots: Automated Conversational Model using Machine Learning
 
Data Analytics and Big Data on IoT
Data Analytics and Big Data on IoTData Analytics and Big Data on IoT
Data Analytics and Big Data on IoT
 
Data Mining
Data MiningData Mining
Data Mining
 
Ghhh
GhhhGhhh
Ghhh
 
Machine Learning in ICU mortality prediction
Machine Learning in ICU mortality predictionMachine Learning in ICU mortality prediction
Machine Learning in ICU mortality prediction
 
Data analytics
Data analyticsData analytics
Data analytics
 
Apply (Big) Data Analytics & Predictive Analytics to Business Application
Apply (Big) Data Analytics & Predictive Analytics to Business ApplicationApply (Big) Data Analytics & Predictive Analytics to Business Application
Apply (Big) Data Analytics & Predictive Analytics to Business Application
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 
Data analytcis-first-steps
Data analytcis-first-stepsData analytcis-first-steps
Data analytcis-first-steps
 
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
Data Analytics Life Cycle [EMC² - Data Science and Big data analytics]
 
Data Science
Data ScienceData Science
Data Science
 

Similaire à Data science guide

Kp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptxKp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptxCloudBusiness2
 
Data analytics career path
Data analytics career pathData analytics career path
Data analytics career pathRubikal
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAjaved75
 
Big Data overview
Big Data overviewBig Data overview
Big Data overviewalexisroos
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxsumitkumar600840
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huwekineheshete
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in MalaysiaAhmed Elmalla
 
Data Science Training in Chandigarh h
Data Science Training in Chandigarh    hData Science Training in Chandigarh    h
Data Science Training in Chandigarh hasmeerana605
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceMahir Haque
 
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Rohit Dubey
 
Data science and business analytics
Data  science and business analyticsData  science and business analytics
Data science and business analyticsInbavalli Valli
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptxXanGwaps
 
Guide for a Data Scientist
Guide for a Data ScientistGuide for a Data Scientist
Guide for a Data ScientistRohit Dubey
 

Similaire à Data science guide (20)

Kp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptxKp-Data Analytics-ts.pptx
Kp-Data Analytics-ts.pptx
 
Data analytics career path
Data analytics career pathData analytics career path
Data analytics career path
 
Data Analytics Career Paths
Data Analytics Career PathsData Analytics Career Paths
Data Analytics Career Paths
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
Big Data overview
Big Data overviewBig Data overview
Big Data overview
 
Data Science in Python.pptx
Data Science in Python.pptxData Science in Python.pptx
Data Science in Python.pptx
 
Data Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptxData Science Introduction: Concepts, lifecycle, applications.pptx
Data Science Introduction: Concepts, lifecycle, applications.pptx
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
data mining
data miningdata mining
data mining
 
Lesson1.2.pptx.pdf
Lesson1.2.pptx.pdfLesson1.2.pptx.pdf
Lesson1.2.pptx.pdf
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
 
Data Science Training in Chandigarh h
Data Science Training in Chandigarh    hData Science Training in Chandigarh    h
Data Science Training in Chandigarh h
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
Data Science Job ready #DataScienceInterview Question and Answers 2022 | #Dat...
 
Data science and business analytics
Data  science and business analyticsData  science and business analytics
Data science and business analytics
 
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx351315535-Module-1-Intro-to-Data-Science-pptx.pptx
351315535-Module-1-Intro-to-Data-Science-pptx.pptx
 
Big data Analytics
Big data AnalyticsBig data Analytics
Big data Analytics
 
Guide for a Data Scientist
Guide for a Data ScientistGuide for a Data Scientist
Guide for a Data Scientist
 
Data Science and Analysis.pptx
Data Science and Analysis.pptxData Science and Analysis.pptx
Data Science and Analysis.pptx
 

Plus de gokulprasath06

Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisgokulprasath06
 
K-means Clustering Algorithm with Matlab Source code
K-means Clustering Algorithm with Matlab Source codeK-means Clustering Algorithm with Matlab Source code
K-means Clustering Algorithm with Matlab Source codegokulprasath06
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processinggokulprasath06
 
Introduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's GuideIntroduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's Guidegokulprasath06
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginnersgokulprasath06
 
Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN)Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN)gokulprasath06
 

Plus de gokulprasath06 (7)

Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
K-means Clustering Algorithm with Matlab Source code
K-means Clustering Algorithm with Matlab Source codeK-means Clustering Algorithm with Matlab Source code
K-means Clustering Algorithm with Matlab Source code
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Introduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's GuideIntroduction to Data Mining - A Beginner's Guide
Introduction to Data Mining - A Beginner's Guide
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginners
 
Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN)Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN)
 
OLAP
OLAPOLAP
OLAP
 

Dernier

Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一z xss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证
在线办理WLU毕业证罗瑞尔大学毕业证成绩单留信学历认证nhjeo1gg
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
办美国加州大学伯克利分校毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
SWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptxSWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptxviniciusperissetr
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 

Dernier (20)


Data science guide

  • 1. Beginner's Guide to Data Science
  • 2. Data Science is: Popular Lots of Data => Lots of Analysis => Lots of Jobs Universities: Starting new multidisciplinary programs Industry: Cottage industry evolving for online and training courses Goal of this Talk: ● Hear it from people who do it and what they do ● Use it for further learning and specialization
  • 3. Data is: Big! ● 2.5 quintillion (10^18) bytes of data are generated every day! ● Everything around you collects/generates data ● Social media sites ● Business transactions ● Location-based data ● Sensors ● Digital photos, videos ● Consumer behaviour (online and store transactions) ● More data is publicly available ● Database technology is advancing ● Cloud-based & mobile applications are widespread Source: IBM http://www-01.ibm.com/software/data/bigdata/ Lots of Data => Lots of Analysis => Lots of Jobs
  • 4. If I have data, I will know :) Everyone wants better predictability, forecasting, customer satisfaction, market differentiation, prevention, great user experience, ... ● How can I price a particular product? ● What can I recommend online customers to buy after buying X, Y or Z? ● How can we discover market segments and group customers into them? ● What will customers buy in the upcoming holiday season? (what to stock?) ● What is the price point for customer retention for subscriptions?
  • 5. Data Science is: making sense of Data Lots of Data => Lots of Analysis => Lots of Jobs ● Multidisciplinary study of data collections for analysis, prediction, learning and prevention. ● Utilized in a wide variety of industries. ● Involves both structured and unstructured data sources.
  • 6. Data Science is: multidisciplinary ● Statisticians ● Mathematicians ● Computer Scientists in ○ Data mining ○ Artificial Intelligence & Machine Learning ○ Systems Development and Integration ○ Database development ○ Analytics ● Domain Experts ○ Medical experts ○ Geneticists ○ Finance, Business, Economy experts ○ etc.
  • 7. [Figure: data science workflow flowchart. Stages: Data Acquisition, Data Analysis, Modeling, Deployment and optimization. Boxes: Start; Plan; What is the question?; What type of data is needed?; Data Quality Analysis; Reformatting & Imputing Data; Clean Data; Explore the Data; Feature Engineering; Feature Selection; Model Selection; Results Evaluation; Deployment; Maintenance; Optimization. Several steps are labeled "Scripts".]
  • 8. [Figure: the same data science workflow flowchart as on slide 7.]
  • 9. Data Acquisition Stage ● Once the data scientist has identified the problem she is trying to solve, she must assess: ● What type of data is available ● What might be required but is not currently collected ● Is it available from other units of the company? ● Does she need to crawl/buy data from third parties? ● How much data is needed? (Data volume) ● How to access the data? ● Is the data private? ● Is it legally OK to use the data?
  • 10. Data Acquisition Stage ● Data may not exist ● Sources of data may be public or private ● Not all sources of data may be suitable for processing ● Data are often incomplete and dirty ● Data consolidation and cleanup are essential ○ Pieces of data may be in different sources ○ Formats may not match/may be incompatible ○ Unstructured data may need to be accounted for
  • 11. Data Acquisition Stage -- Example Example: Online customer experience may require collecting lots of data such as ● clicks ● conversions ● add-to-cart rate ● dwell time ● average order value ● foot traffic ● bounce rate ● exits and time to purchase
  • 12. Data Acquisition: Type and Source of Data ● Time spent on a page, browsing and/or search history ○ Website Logs ● User and Inventory Data ○ Transaction databases ● Social Engagement ○ Social Networks (Yelp, Twitter,...) ● Customer Support ○ Call Logs, Emails ● Gas prices, competitors, news, Stock Prices, etc. ○ RSS Feeds, News Sites, Wikipedia,... ● Training Data? ○ CrowdFlower, Mechanical Turk
  • 13. Data Acquisition: Storage and Access ● Where the data resides ○ Cloud or Computing Clusters ● Storage System ○ SQL, NoSQL, File System ○ SQL: MySQL, Oracle, MS SQL Server,... ○ NoSQL: MongoDB, Cassandra, Couchbase, HBase, Hive, ... ○ Text Indexing: Solr, ElasticSearch,... ● Data Processing Frameworks: ○ Hadoop, Spark, Storm, etc.
  • 14. Data Acquisition: Data Integration Data integration involves combining data residing in different sources and providing users with a unified view of these data. (Wikipedia) ● Schema Mapping ● Record Matching ● Data Cleaning [Figure: Data Source 1, Data Source 2, Data Source 3 and Data Source 4 feed an ETL step that loads a Data Warehouse.]
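The three integration steps on this slide can be sketched in a few lines. This is only an illustration: the two sources, their field names (`cust_id`, `full_name`, `customer`, `total_spent`) and the choice of email as the matching key are assumptions made up for the example, not something taken from the talk.

```python
# Two hypothetical sources that describe customers with different schemas.
crm_rows = [
    {"cust_id": 1, "full_name": "Ada Lovelace", "email": "ADA@example.com "},
    {"cust_id": 2, "full_name": "Alan Turing", "email": "alan@example.com"},
]
shop_rows = [
    {"customer": "ada@example.com", "total_spent": "120.50"},
    {"customer": "grace@example.com", "total_spent": "80"},
]


def normalize_email(value):
    """Data cleaning: trim whitespace and lowercase so keys are comparable."""
    return value.strip().lower()


def integrate(crm, shop):
    """Map both schemas onto one record layout and match records by email."""
    unified = {}
    for row in crm:
        key = normalize_email(row["email"])
        unified[key] = {"email": key, "name": row["full_name"], "total_spent": None}
    for row in shop:
        key = normalize_email(row["customer"])
        record = unified.setdefault(
            key, {"email": key, "name": None, "total_spent": None}
        )
        record["total_spent"] = float(row["total_spent"])  # text -> number
    return sorted(unified.values(), key=lambda r: r["email"])


warehouse = integrate(crm_rows, shop_rows)
for record in warehouse:
    print(record)
```

Records that appear in only one source are kept with `None` in the fields the other source would have supplied, which makes the gaps visible to the later cleaning stage.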
  • 15. Data Cleaning ● Data are often incomplete or incorrect. ○ Typos: e.g., text data in numeric fields ○ Missing Values: some fields may not be collected for some of the examples ○ Impossible Data combinations: e.g., gender = MALE, pregnant = TRUE ○ Out-of-Range Values: e.g., age = 1000 ● Garbage In Garbage Out ● Scripting, Visualization Figure ref: https://thedailyomnivore.net/2015/12/02/
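A small validation script makes the four problem types on this slide concrete. The record layout and the plausible age range of 0 to 120 are assumptions chosen for illustration.

```python
def find_problems(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    age = record.get("age")
    if age is None or age == "":
        problems.append("missing age")
    else:
        try:
            age_value = float(age)
        except (TypeError, ValueError):
            # Typo: text data in a numeric field.
            problems.append("age is not numeric")
        else:
            # Assumed plausible range; adjust to the domain.
            if not 0 <= age_value <= 120:
                problems.append("age out of range")
    if record.get("gender") == "MALE" and record.get("pregnant") is True:
        problems.append("impossible combination: male and pregnant")
    return problems


records = [
    {"age": 34, "gender": "FEMALE", "pregnant": True},
    {"age": "thirty", "gender": "MALE", "pregnant": False},
    {"age": None, "gender": "FEMALE", "pregnant": False},
    {"age": 1000, "gender": "MALE", "pregnant": True},
]
report = [find_problems(r) for r in records]
for record, problems in zip(records, report):
    print(record, "->", problems or "ok")
```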
  • 16. [Figure: the data science workflow flowchart from slide 7, shown again (Data Acquisition, Data Analysis, Modeling, Deployment and optimization), here with a "Deploy Models" box in place of "Deployment".]
  • 17. Analysis - Data Preparation ● Univariate Analysis: Analyze/explore variables one by one ● Bivariate Analysis: Explore relationship between variables ● Coverage, missing values: treating unknown values ● Outliers: detect and treat values that are distant from other observations ● Feature Engineering: Variable transformations and creation of new, better variables from raw features Commonly used tools: ● SQL ● R: plyr, reshape, ggplot2, data.table ● Python: NumPy, Pandas, SciPy, matplotlib
  • 18. Analysis - Exploratory Analysis Univariate Analysis: Analyze/explore variables one by one - Continuous variable: explore central tendency and spread of the values - Summary statistics - mean, median, min, max - IQR, standard deviation, variance, quartile - Visualize Histograms, Boxplots
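The summary statistics listed on this slide can be computed with Python's standard `statistics` module. The temperature values below are made up for the example; quartile values depend on the interpolation method, and `method="inclusive"` is just one reasonable choice.

```python
import statistics

# Made-up sample of a continuous variable (e.g., temperatures).
temperatures = [45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0]

# Quartiles: the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(temperatures, n=4, method="inclusive")

summary = {
    "min": min(temperatures),
    "q1": q1,
    "median": statistics.median(temperatures),
    "mean": statistics.mean(temperatures),
    "q3": q3,
    "max": max(temperatures),
    "iqr": q3 - q1,                                  # spread of the middle 50%
    "stdev": statistics.stdev(temperatures),         # sample standard deviation
    "variance": statistics.variance(temperatures),   # sample variance
}
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```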
  • 19. Analysis - Exploratory Analysis Walmart Store Sales Forecasting Data, Kaggle Summary statistics for “Temperature”: Min. -7.29, 1st Qu. 45.90, Median 60.71, Mean 59.36, 3rd Qu. 73.88, Max. 102.00, Std Dev. 18.68
  • 20. Analysis - Exploratory Analysis Univariate Analysis: Analyze/explore variables one by one - Categorical Variable: frequency tables - Count and count % - Visualize Bar charts
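For a categorical variable, a frequency table with count and count % takes only a few lines. The store-type labels here are hypothetical sample data.

```python
from collections import Counter

# Hypothetical categorical variable: the type of each store.
store_types = ["A", "B", "A", "C", "A", "B"]

counts = Counter(store_types)
total = sum(counts.values())

# One row per level: (level, count, count %), most frequent first.
frequency_table = [
    (level, n, round(100 * n / total, 1)) for level, n in counts.most_common()
]
for level, n, percent in frequency_table:
    print(f"{level}: {n} ({percent}%)")
```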
  • 21. Analysis - Exploratory Analysis Bivariate Analysis: Explore relationship between variables - Continuous to continuous variables: Correlation measures the strength and direction of a linear relationship - Visualize Scatterplots -> relationship may not be linear
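The caveat that a relationship may not be linear is easy to demonstrate. The sketch below implements the Pearson correlation coefficient by hand and applies it to two made-up data sets: one perfectly linear, one perfectly quadratic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
parabola = [x * x for x in xs]     # y depends on x, but not linearly

r_linear = pearson(xs, linear)
r_parabola = pearson(xs, parabola)
print(f"linear:   r = {r_linear:.3f}")
print(f"parabola: r = {r_parabola:.3f}")
```

The quadratic data is fully determined by `x`, yet its correlation is zero, which is why the slide recommends looking at a scatterplot rather than relying on the coefficient alone.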
  • 22. Analysis - Exploratory Analysis Bivariate Analysis: Explore relationship between variables - Categorical to categorical variables -> crosstab table - Visualize Stacked bar charts - Continuous to categorical variables -> visualize Boxplots and Histograms for each level (category)
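A crosstab for two categorical variables is a count of each pair of levels. The observations below (store type and week type) are invented for the example.

```python
from collections import Counter

# Hypothetical observations: (store type, week type).
observations = [
    ("A", "holiday"), ("A", "regular"), ("B", "regular"),
    ("A", "regular"), ("B", "holiday"), ("B", "regular"),
]

pair_counts = Counter(observations)
row_levels = sorted({row for row, _ in observations})
col_levels = sorted({col for _, col in observations})

# crosstab[row][col] = number of observations with that combination.
crosstab = {
    row: {col: pair_counts[(row, col)] for col in col_levels}
    for row in row_levels
}

print("type".ljust(6) + "".join(col.ljust(10) for col in col_levels))
for row in row_levels:
    print(row.ljust(6) + "".join(str(crosstab[row][col]).ljust(10) for col in col_levels))
```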
  • 23. Analysis - Correlation vs Causation Correlation ⇏ causation!
  • 24. Analysis - Correlation vs Causation Correlation ⇏ causation! To establish causation: ● Randomized controlled experiments ● Hypothesis testing, A/B testing
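As one possible illustration of hypothesis testing in an A/B test, the sketch below runs a two-proportion z-test on conversion counts. The test choice and the visitor and conversion numbers are assumptions for the example; the talk does not prescribe a specific test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Made-up experiment: users randomly assigned to variant A or B.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Random assignment is what lets a significant difference be read as an effect of the variant rather than a mere correlation.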
  • 25. Analysis - Feature Engineering ● Create new features from existing raw features: discretize, bin ● Transform variables ● Create new categorical variables: when there are too many levels, levels that rarely occur, or one level that almost always occurs ● Extremely skewed data: outliers ● Imputation: filling in missing data
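Two of these ideas, binning a continuous variable and merging rarely occurring levels of a categorical one, can be sketched as follows. The bin boundaries (40 and 70), the `min_count` threshold and the `OTHER` label are arbitrary choices for illustration.

```python
from collections import Counter


def bin_temperature(value):
    """Discretize a continuous value into one of three assumed bins."""
    if value < 40:
        return "cold"
    if value < 70:
        return "mild"
    return "hot"


def collapse_rare_levels(values, min_count=2, other="OTHER"):
    """Replace levels seen fewer than min_count times with a single level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


temps = [35.2, 61.0, 88.4, 40.0]
temp_bins = [bin_temperature(t) for t in temps]

cities = ["NYC", "NYC", "LA", "Boise", "LA", "Fargo"]
collapsed = collapse_rare_levels(cities)

print(temp_bins)
print(collapsed)
```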
  • 26. Analysis - Missing Values Missing values are unknown values of a feature. They matter because they may lead to biased models or incorrect estimations and conclusions. Some ML algorithms accept missing values: for example, some tree-based models treat missing values as a separate branch, while many other algorithms require a complete dataset. Therefore, we can ● omit: remove missing values and use available data ● impute: replace missing values with estimates such as the mean/median/mode of the existing data, the values of the most similar data points (KNN), or the output of more complex algorithms like Random Forest
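The omit and impute options can be compared on a tiny made-up column, with `None` marking a missing value. Only mean and median imputation are shown; KNN or model-based imputation would need more machinery than fits here.

```python
import statistics

# Made-up feature column; None marks a missing value.
raw = [12.0, None, 15.0, 11.0, None, 50.0]

# Option 1: omit the missing values and use what is available.
observed = [v for v in raw if v is not None]

# Option 2: impute with a summary of the observed values.
mean_value = statistics.mean(observed)
median_value = statistics.median(observed)
mean_imputed = [mean_value if v is None else v for v in raw]
median_imputed = [median_value if v is None else v for v in raw]

print("omitted:       ", observed)
print("mean imputed:  ", mean_imputed)
print("median imputed:", median_imputed)
```

The single large value pulls the mean well above the median here, so the two imputations differ noticeably; that sensitivity is one reason the median is often preferred for skewed data.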
  • 27. Analysis - Outliers Outliers are values distant from other observations, such as values more than about three standard deviations away from the mean, values in the top or bottom 5 percent, or values more than 1.5 IQR below the first quartile or above the third quartile. Visualization methods like Boxplots, Histograms and Scatterplots help
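Two of the rules of thumb on this slide, the 1.5 IQR rule and the three-standard-deviation rule, can be written as small functions. The data and the quartile interpolation method are assumptions for the example.

```python
import statistics

# Made-up sample with one suspicious value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]


def iqr_outliers(data, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


def zscore_outliers(data, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(data)
    stdev = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > threshold * stdev]


print("IQR rule:    ", iqr_outliers(values))
print("3-sigma rule:", zscore_outliers(values))
```

Both rules flag the same value in this sample, but the extreme value itself inflates the mean and standard deviation, so the three-sigma rule only barely catches it. Quartiles are much less affected, which makes the IQR rule the more robust of the two on small samples.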
  • 28. Analysis - Outliers Some algorithms, such as regression, are sensitive to outliers, which can cause high error variance and bias in the estimated values. Remedies: delete, cap, transform, or impute them as for missing values.
  • 29. [Figure: the data science workflow flowchart from slide 7, shown again (Data Acquisition, Data Analysis, Modeling, Deployment and optimization).]
  • 30. Predictive data modeling Prediction is the end goal of many data science adventures! Data on consumer behaviour is collected: ● to predict future consumer behaviour and to take action accordingly Examples: ● Recommendation systems (Netflix, Pandora, Amazon, etc.) ● Online user behaviour is used to predict best targeted ads ● Customer purchase histories are used to determine how to price, stock, market and display future products.
  • 31. Machine learning ● Machine Learning is the study of algorithms that improve their performance at some task with example data or past experience ○ The foundations of many ML algorithms lie in statistics and optimization theory ○ Role of Computer science: Efficient algorithms to ■ Solve the optimization problem ■ Represent and evaluate data models for inference ● A wide variety of off-the-shelf algorithms are available today. Just pick a library and go! (is it really that easy?) ○ Short answer: no. Long answer: model selection and tuning require deeper understanding.
  • 32. Machine learning - basics Machine learning systems are made up of 3 major parts, which are: ● Model: the system that makes predictions. ● Parameters: the signals or factors used by the model to form its decisions. ● Learner: the system that adjusts the parameters, and in turn the model, by looking at differences in predictions versus actual outcomes. Ref: http://marketingland.com/how-machine-learning-works-150366
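The three parts can be seen in a deliberately tiny example: a straight-line model, two parameters (weight and bias), and a gradient-descent learner that adjusts them from prediction errors. Gradient descent, the learning rate and the training data are choices made for this sketch, not something the slide specifies.

```python
def model(x, params):
    """Model: makes a prediction from an input and the current parameters."""
    weight, bias = params
    return weight * x + bias


def learner(xs, ys, params, learning_rate=0.05, epochs=2000):
    """Learner: adjusts the parameters to shrink prediction error."""
    weight, bias = params
    n = len(xs)
    for _ in range(epochs):
        # Difference between predictions and actual outcomes.
        errors = [model(x, (weight, bias)) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error for each parameter.
        grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_bias = 2 * sum(errors) / n
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Made-up training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

trained = learner(xs, ys, params=(0.0, 0.0))
print("parameters:", trained)
print("prediction for x = 10:", model(10.0, trained))
```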
  • 33. Machine learning application examples ● Association Analysis ○ Basket analysis: Find the probability that somebody who buys X also buys Y ● Supervised Learning ○ Classification: Spam filter, language prediction, customer/visit type prediction ○ Regression: Pricing ○ Recommendation ● Unsupervised Learning ○ Given a database of customer data, automatically discover market segments and group customers into different market segments
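The basket-analysis probability can be estimated directly from transaction data as a conditional frequency, usually called the confidence of the rule "X => Y". The baskets below are invented for the example.

```python
# Made-up shopping baskets; each basket is a set of items.
baskets = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]


def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    return sum(1 for b in baskets if items <= b) / len(baskets)


def confidence(baskets, x, y):
    """Estimated P(buys y | buys x): share of x-baskets that also hold y."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)


print("support(bread, butter):", support(baskets, {"bread", "butter"}))
print("P(butter | bread):     ", confidence(baskets, "bread", "butter"))
```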
  • 34. Model selection and generalization ● Learning is an ill-posed problem; data is not sufficient to find a unique solution ● There is a trade-off between three factors: ○ Model complexity ○ Training set size ○ Generalization error (expected error on new data) ● Overfitting and underfitting problems Ref: http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
  • 35. Generalization error and cross-validation ● Measuring the generalization error is a major challenge in data mining and machine learning ● To estimate generalization error, we need data unseen during training. We could split the data as ○ Training set (50%) ○ Validation set (25%) (optional, for selecting ML algorithm parameters) ○ Test (publication) set (25%) ● How to avoid selection bias: k-fold cross-validation Figure ref: https://www.quora.com/I-train-my-system-based-on-the-10-fold-cross-validation-framework-Now-it-gives-me-10-different-models-Which-model-to-select-as-a-representative
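Both splitting schemes from this slide can be sketched with the standard library. The function names, the fixed random seed and the interleaved fold assignment are illustrative choices; libraries such as scikit-learn provide ready-made equivalents.

```python
import random


def train_validation_test_split(items, seed=0):
    """Shuffle, then split 50% / 25% / 25% as suggested on the slide."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n // 2
    validation_end = train_end + n // 4
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],
    )


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) so each item is tested once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        ]
        yield train_idx, test_idx


train_set, validation_set, test_set = train_validation_test_split(range(20))
print(len(train_set), len(validation_set), len(test_set))

splits = list(k_fold_indices(n_items=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In k-fold cross-validation the model is trained k times, each time scoring on the held-out fold, and the k scores are averaged to estimate the generalization error. In practice the data should be shuffled before folds are assigned.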
  • 36. Deep Learning ● Neural networks (NNs) have been around for decades, but they just weren’t “deep” enough. NNs with several hidden layers are called deep neural networks (DNNs). ● Unlike many ML approaches, deep learning attempts to model high-level abstractions in data. ● Deep learning is best suited to cases where the input space is locally structured (spatial or temporal), as opposed to arbitrary input features
  • 37. [Figure: the data science workflow flowchart from slide 7, shown again (Data Acquisition, Data Analysis, Modeling, Deployment and optimization).]
  • 38. Deployment, maintenance and optimization ● Deployed solutions might include: ○ A trained data model (model + parameters) ○ Routines for inputting and prediction ○ (Optional) Routines for model improvement (through feedback, a deployed system can improve itself) ○ (Optional) Routines for training ● Once the model has been deployed in production, it is time for regular maintenance and operations. ● The optimization phase could be triggered by failing performance, by the need to add new data sources and retrain the model, or by the deployment of improved versions of the model based on better algorithms. Ref: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A234092
  • 39. Recap - Software Toolbox of Data Scientists: ● Database ○ SQL ○ NoSQL languages for target databases ● Programming Languages and Libraries ○ Python (due to availability of libraries for data management): scikit-learn, pyML, pandas ○ R ○ General programming languages such as Java for gluing different systems ○ C/C++: mlpack, dlib ● Tools: Orange, Weka, Matlab ● Vendor Specific Platforms for data analytics (such as Adobe Marketing Cloud, etc.) ● Hive ● Spark
  • 40. Conclusion: It takes a team Must haves: - Programming and Scripting skills - Statistics and data analysis skills - Machine learning skills Necessary but not sufficient: - Database management skills - Distributed computing skills Domain knowledge may make or break a system: If you do not realize a type of data is essential, the results will not be very useful.