Building machine learning muscle in your team & transitioning them to do machine learning at scale. We also discuss Spark & other relevant technologies.
2. Visit : www.zekeLabs.com for more details
THANK YOU
Let us know how we can help your organization upskill its
employees to stay updated in the ever-evolving IT industry.
Get in touch:
www.zekeLabs.com | +91-8095465880 | info@zekeLabs.com
7. What is not Machine Learning ?
● Rule Based Approach
● Legacy Systems
8. What is Machine Learning ?
● Solve prediction problem
● Logic is learned from examples & not by rules
Diagram labels: Training Data, Learning Algorithm, Prediction Function or Trained Model, Input Data
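A minimal sketch of "logic is learned from examples & not by rules", assuming a made-up one-feature dataset. The hand-written rule hard-codes its threshold, while the learning algorithm picks the threshold from the training data and returns a prediction function. All names and numbers here are illustrative, not from the slides.

```python
from typing import Callable, List, Tuple

# Hypothetical training data: (feature value, label).
training_data: List[Tuple[float, int]] = [
    (1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (9.0, 1),
]


def rule_based(x: float) -> int:
    """Not machine learning: the threshold is written by hand."""
    return 1 if x > 5.0 else 0


def learn_threshold(data: List[Tuple[float, int]]) -> Callable[[float], int]:
    """Learning algorithm: keep the threshold with the fewest training errors."""
    values = sorted(x for x, _ in data)
    # Candidate thresholds are midpoints between neighbouring feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold: float) -> int:
        return sum((1 if x > threshold else 0) != y for x, y in data)

    best = min(candidates, key=errors)
    # The returned closure plays the role of the trained model.
    return lambda x: 1 if x > best else 0


model = learn_threshold(training_data)
print(model(2.5), model(8.0))
```

If the data changes, only `training_data` changes; the rule-based version would need its code edited by hand.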
9. Types of Machine Learning
● Supervised - Task Driven
● Unsupervised - Data Driven
● Reinforcement - Environment Driven
10. Spam Mail Detection
● Input - Mail
● Output - Spam or Ham
● Supervised Machine Learning
● Binary Classification Problem
11. Predicting Lift Failure
● Input - Sensor Data
● Output - Failure time
● Supervised Machine Learning
● Regression Problem
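To make the regression framing concrete, here is a small sketch that fits a straight line from one sensor reading to time until failure using ordinary least squares. The readings, the feature name and the linear relationship are assumptions for illustration; real lift data would need more features and a validated model.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: vibration level -> days until failure.
vibration = [1.0, 2.0, 3.0, 4.0]
days_to_failure = [90.0, 70.0, 50.0, 30.0]

slope, intercept = fit_line(vibration, days_to_failure)

# Predict a continuous value (not a class) for a new sensor reading.
predicted = slope * 2.5 + intercept
print(round(predicted, 1))
```

The output is a number on a continuous scale, which is what separates this from the spam example's two-class output.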
19. Module 2
Machine Learning Pipeline
● Understanding Machine Learning Pipeline
● User Story - Automating customer support
● Implementation
● User Story - Fast Query Chatbots
● Implementation
21. Machine Learning Pipeline - Business Understanding
● Business understanding includes clarity on what you are trying to achieve.
● Machine learning needs enough data; with very small datasets it is rarely feasible.
● Consolidate the data pipeline to channel a continuous flow of data.
● Sources include web scraping, data lake access, REST APIs etc.
22. Machine Learning Pipeline - Data Wrangling
● Production data is never clean.
● It needs a major effort (around 70% of total effort) to make it ready for the next stage.
● Transform & map data from its raw format into a format ready for the next stage.
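A short sketch of what wrangling can look like in practice: stray whitespace, missing values, bad types and duplicate rows are cleaned before the data moves on. The CSV content and column names are hypothetical.

```python
import csv
import io
from typing import Dict, List, Optional

# Hypothetical raw export with typical production problems.
RAW = """customer_id,age,city
 101 ,34,Bangalore
102,,  pune
103,abc,Mumbai
101,34,Bangalore
"""


def parse_age(text: str) -> Optional[int]:
    """Return the age as an int, or None when it is missing or not numeric."""
    text = text.strip()
    return int(text) if text.isdigit() else None


def wrangle(raw: str) -> List[Dict[str, object]]:
    seen = set()
    rows: List[Dict[str, object]] = []
    for record in csv.DictReader(io.StringIO(raw)):
        customer_id = record["customer_id"].strip()
        if customer_id in seen:  # drop duplicate customers
            continue
        seen.add(customer_id)
        rows.append({
            "customer_id": int(customer_id),
            "age": parse_age(record["age"]),
            "city": record["city"].strip().title(),
        })
    return rows


clean = wrangle(RAW)
for row in clean:
    print(row)
```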
23. Machine Learning Pipeline - Data Visualization
● Visualization makes it easy to grasp difficult concepts
● Find useful patterns in the data
● Interactively drill down into charts for deeper details
24. Machine Learning Pipeline - Data Preprocessing
Feature Extraction
Vectors - Fixed length array of numbers
● Text documents
● Image files
● CSV
● Audio
● Video
● Time Series data
● Many more ...
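As an illustration of turning one of these inputs into a fixed-length array of numbers, the sketch below builds a bag-of-words vector for text. Whitespace tokenization and the sample sentences are simplifying assumptions.

```python
from typing import Dict, List


def build_vocabulary(docs: List[str]) -> Dict[str, int]:
    """Assign each distinct token a fixed position in the vector."""
    vocab: Dict[str, int] = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(doc: str, vocab: Dict[str, int]) -> List[int]:
    """Count tokens into a vector whose length equals the vocabulary size."""
    vector = [0] * len(vocab)
    for token in doc.lower().split():
        index = vocab.get(token)
        if index is not None:  # tokens unseen at training time are ignored
            vector[index] += 1
    return vector


docs = ["free offer free", "meeting at noon"]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vectors)
```

Every document, whatever its length, ends up as a vector of the same size, which is what the learning algorithms in the next stage expect.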
25. Machine Learning Pipeline - Model Training
Learning Algorithm
Regression/Trees/SVM/Naive Bayes/Neural Networks
Prediction Function or Trained Model
26. Machine Learning Pipeline - Learning Algorithms
● Linear Regression
● Logistic Regression
● Naive Bayes
● Nearest Neighbors
● Decision Trees
● Ensemble Methods
● Clustering
● Support Vector Machines
● Neural Networks
● CNN
● RNN
● GAN
28. Machine Learning Pipeline - Model Validation
● Training different learning methods gives you different trained models.
● Also, each model has a huge number of possible configurations (hyper-parameters).
● Finding the best model among all possibilities & the best configuration for it is done as a part of Model Validation.
● If results are not satisfactory, one has to go back in the chain & fix a few things.
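A minimal sketch of model validation, assuming a toy nearest-neighbours classifier and a single hyper-parameter k. Each candidate configuration is scored on held-out validation data and the best one is kept. The data, including one deliberately noisy training point, is made up.

```python
from collections import Counter
from typing import Dict, List, Tuple

Point = Tuple[float, int]  # (feature, label)


def knn_predict(train: List[Point], x: float, k: int) -> int:
    """Majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train: List[Point], valid: List[Point], k: int) -> float:
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)


# Hypothetical data; (3.4, 1) is a noisy label inside the class-0 region.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (3.4, 1),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
valid = [(2.5, 0), (3.5, 0), (6.5, 1), (8.5, 1)]

# Try several hyper-parameter values and keep the best validation score.
scores: Dict[int, float] = {k: accuracy(train, valid, k) for k in (1, 3, 8)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Here k=1 follows the noisy point, k=8 ignores the input entirely, and k=3 scores best on the validation set. In real projects the same loop is usually run with cross-validation rather than a single split.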
31. 1. Implementation : Customer Service Industry
● Reduce manual effort of classifying reviews.
● Channelizing data from Web server to Analytics Engine.
● Getting data ready for visualization.
● Historical data shows past trends.
● Visualization of trend
● Text needs to be tokenized & vectorized
● Different models were trained: Naive Bayes, SGD Classifier
● Choose the best model with best hyper-parameter
● Naive Bayes (MultinomialNB) was chosen & put in deployment
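The steps in this implementation can be sketched with scikit-learn roughly as follows. The reviews, labels and the tiny train/test split are hypothetical, and this is not the original project's code: it only shows tokenizing and vectorizing text, training the two model types named on the slide, and picking the better scorer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled reviews: 1 = complaint, 0 = praise.
train_texts = [
    "refund not received very poor support",
    "app crashes on every login",
    "delivery was late and the box was damaged",
    "great service and quick delivery",
    "love the new app update",
    "support team was friendly and helpful",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["poor app crashes again", "quick and helpful support"]
test_labels = [1, 0]

# Each candidate tokenizes and vectorizes the text, then trains a classifier.
candidates = {
    "MultinomialNB": make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0)),
    "SGDClassifier": make_pipeline(CountVectorizer(), SGDClassifier(random_state=0)),
}

scores = {}
for name, model in candidates.items():
    model.fit(train_texts, train_labels)
    scores[name] = model.score(test_texts, test_labels)  # accuracy on held-out text

best_name = max(scores, key=scores.get)
print(scores, best_name)
```

With so little data the scores are not meaningful; on a real corpus one would tune hyper-parameters such as `alpha` with cross-validation (for example via `GridSearchCV`) before choosing a model to deploy.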
33. 2. Implementation : Fast Query Chatbots
● Reduce manual effort understanding the text query
● Waiting for BI has a long turnaround time
● We are trying to do this using chatbot
● Getting data ready for visualization.
● Historical data shows past trends
● Visualization of trend of text & sql
● Text cannot be used for ML; needs to be tokenized & vectorized
● Deep learning models with different layer configuration
● Choosing the best model with best hyper-parameter
● Model with best config was chosen & put in deployment
35. Module 3
Data Challenges
● Optimal data size
● Identify data sources
● Identify what is useful in data
● Cleaning data to extract useful information
● Tools & Libraries to clean & extract useful information
36. Optimal Data size for AI product
● Expectation from a predictor - Moderate Bias & Moderate Variance.
● Predictor validation is important.
● More data makes the model better, but only up to a limit.
37. Identify Data Sources
● No specific order in identifying problem statement & data sources.
● Innovation in this space can happen both ways - Top-Down & Bottom-Up.
● Data can be historical batch data stored in RDBMS & NoSQL DBs.
● Live streamed data using Kafka.
40. Tools vs Libraries
● Data cleaning tools are available in the market.
● Why don’t they work in the long run?
● Data cleaning libraries are available.
● Why are more and more enterprises embracing libraries?
42. Spark vs Other technologies
● Big Data Compute Framework
● Do data cleaning at scale with unbounded performance
● Talk to different data sources
43. Module 4
Machine Learning Pipeline at Scale
● Machine Learning Pipeline using Spark
● Spark - A very social technology
● Spark for Big Data Cleaning & Wrangling
● Spark for building ML models at Scale
● Validation & monitoring of models
● Deployment through a REST interface using Apache Livy
47. Preprocessing Data at Scale
● Scaling
● CountVectorizer
● Binning
● … many things can be done at scale using Spark
48. Training Models using Spark
● Distributed Model Training using Spark
● Regression
● Classification
● Clustering
● Recommendation Engine
49. Building Data Pipeline in Spark
● Spark provides in-built Transformers & Estimators.
● Pipeline can be built to connect transformers & estimators.
● Machine Learning Pipeline can be automated.
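In Spark ML, an Estimator's fit() produces a fitted Transformer, and a Pipeline chains such stages in order. The sketch below illustrates that contract in plain Python so it runs without a cluster. It is not Spark's API: the class names are hypothetical, and the scaling and binning stages are just simple stand-ins.

```python
from typing import List


class MinMaxScalerModel:
    """Fitted transformer: remembers the min/max seen during fit."""

    def __init__(self, lo: float, hi: float) -> None:
        self.lo, self.hi = lo, hi

    def transform(self, data: List[float]) -> List[float]:
        return [(x - self.lo) / (self.hi - self.lo) for x in data]


class MinMaxScalerEstimator:
    """Estimator: learns parameters from data and returns a transformer."""

    def fit(self, data: List[float]) -> MinMaxScalerModel:
        return MinMaxScalerModel(min(data), max(data))


class Binner:
    """Plain transformer: nothing to learn, maps [0, 1] values to bin indices."""

    def __init__(self, n_bins: int) -> None:
        self.n_bins = n_bins

    def transform(self, data: List[float]) -> List[int]:
        return [max(0, min(int(x * self.n_bins), self.n_bins - 1)) for x in data]


class PipelineModel:
    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, data: list) -> list:
        for stage in self.stages:
            data = stage.transform(data)
        return data


class Pipeline:
    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, data: list) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):  # estimators are fitted first
                stage = stage.fit(data)
            data = stage.transform(data)  # output feeds the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


model = Pipeline([MinMaxScalerEstimator(), Binner(4)]).fit([10.0, 20.0, 30.0, 50.0])
print(model.transform([10.0, 30.0, 50.0]))
```

Because the fitted pipeline is a single object, the same sequence of steps is applied to training data and to new data, which is what makes the pipeline automatable.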
51. Module 5
Knowing the Unknowns
● Implementing Transformers & Estimators on Spark
● Deep Learning using Spark
● Are models retrainable?
● The skilling journey
● Introducing Apache Beam
53. What is Deep Learning ?
● Specialized Learning Technique.
● Instead of us hand-picking features for learning, this technique derives
the important features itself.
● Objective is to learn the best derived features for prediction.
● It is loosely inspired by the way our brain learns.
● Very useful for natural language, computer vision, audio, video etc.
54. Do you always need Deep Learning ?
● More data is required for Deep Learning
● More Compute Power
● Models less interpretable
“Don’t kill a mosquito with a cannon ball”
Don’t use Deep Learning if you don’t need to
55. Deep Learning using Spark
● Which one to choose - Distributed TensorFlow or DL using Spark?
● Libraries like spark-dl & elephas
56. Are models re-trainable ?
● Online learning models in scikit-learn - SGDClassifier, Multinomial Naive Bayes
● Spark ML models are not online learning models
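A small sketch of what "re-trainable" means for a scikit-learn online learner: the model is updated batch by batch with `partial_fit` instead of being retrained from scratch. The mail texts and labels are hypothetical. `HashingVectorizer` is used here as an assumption because it needs no fitting, so new batches can be vectorized without revisiting old data.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: maps tokens to a fixed number of columns by hashing.
vectorizer = HashingVectorizer(n_features=2 ** 10, alternate_sign=False)
model = SGDClassifier(random_state=0)

# Hypothetical mail arriving in batches over time (1 = spam, 0 = ham).
batches = [
    (["free prize now", "meeting at ten"], [1, 0]),
    (["win free cash", "lunch tomorrow"], [1, 0]),
]

for texts, labels in batches:
    features = vectorizer.transform(texts)
    # The full set of classes must be declared, since a batch may not contain all of them.
    model.partial_fit(features, labels, classes=[0, 1])

prediction = model.predict(vectorizer.transform(["free cash prize"]))
print(prediction)
```

`MultinomialNB` exposes the same `partial_fit` method, so it can be swapped in with the same loop.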
58. Apache Beam - Probably our next webinar
● Apache Beam is an evolution of the Dataflow model created by Google to
process massive amounts of data.
● The name Beam (Batch + strEAM) comes from the idea of having a unified
model for both batch and stream data processing.
● Programs written using Beam can be executed on different processing
frameworks such as Spark, Flink etc. (via runners) using a set of different IOs.
64. Imp : Advice to executives about AI
● Everybody should embrace the modern capabilities of AI; on the other hand, they
should also think about business-specific problems. Not every tool that the AI
community develops will suit them.
● The biggest challenge is people change, not technology change; the biggest gap
now is people who can map technology to business problems.
● Insourcing vs outsourcing. Building Team vs using enterprise solutions.
● AI will change everything in next few decades. Be a part of it.
65. Challenges - Data & Security
● Volume of data - Machine learning on small data is rarely feasible.
● Accessibility of data - Important data is not accessible & may be in encrypted format.
66. Compute, Storage & Network Power
● AI products need data gathered from sensors, servers etc.
● Once gathered, data needs to be stored for further processing.
● Learning algorithms & data processing activities need a lot of compute power.
67. Infrastructure for development
● Finding the best model is an iterative process.
● More experiments lead to a better model.
● Hyper-parameter Tuning
● Scaled infrastructure for developers is important.
68. Infrastructure for deployment
● Speedy Deployment.
● Easy deployment
● Fluctuating Demand.
● Need of Elastic infrastructure.
● Cost optimization.
70. Cost optimization:
● Use Open Source alternatives
● Infrastructure optimization
● Don’t reinvent the wheel
71. Module 3
Impact of AI
● Will AI benefit humans ?
● AI in human computer interaction
● Impact of AI on business
● Impact on workplace
● Impact on society
72. AI benefits for humans - social, environmental
● Predicting diseases
● 60% of people would prefer AI assistance over humans as financial advisors
or tax preparers
● 71% of people believe that AI will help humans solve complex problems and
help them live more enriched lives
79. Impact of artificial intelligence on society
● People are averse to the idea of availing annual health check-ups at home with a robotic smart kit (77%) or having chatbot assistant teachers in universities/colleges that lower the cost of overall tuition (61%).
● Responsible AI ensures that its workings are aligned to ethical standards and social norms pertinent within its scope of operations.
● Explainable AI is responsible for building AI models with accountability and the ability to describe or depict why a certain decision was made by the algorithm.
80. Module 4
Identify right tools
● Programming Language
● Open source libraries
● Infrastructure Optimizations
● Other alternatives
82. Why Python makes life easy ?
● Easy to learn for ETL developers
● Integrates very well with other technologies
● Full-stack development -
○ Dashboard using Bokeh,
○ Web application using Django,
○ Machine learning models using scikit-learn,
○ Scaling using PySpark
87. Monolithic Infrastructure - Preallocated Infra
Model Training
● Developers request access whenever required
● Might incur delay in peak working hours.
● Idle in non-working hours
Model Interfacing
● Idle in non-peak hours.
● May fall short in spikes.
● Pay even if infra is not used
88. Serverless Infrastructure - Elastic Allocation
Model Training
● No preallocation
● Pay only for what you use
● Absolutely no idle time for infra
● No wait time for developers
Model Interfacing
● Allocate infra only when required
● Scales down during non-peak hours
● Improved customer experience even in peak hours
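The trade-off between preallocated and elastic infrastructure can be made concrete with a little arithmetic. Every number below is hypothetical (a 24-hour demand profile in instances, a fixed capacity of 6, and the same unit price for both options, which real providers may not offer); the point is only to show where idle cost and shortfall come from.

```python
# Hypothetical demand per hour of the day, in number of instances needed.
hourly_demand = [0] * 8 + [4] * 4 + [10] * 2 + [4] * 4 + [0] * 6
assert len(hourly_demand) == 24

PRICE_PER_INSTANCE_HOUR = 1.0  # assumed equal for both models
PREALLOCATED_CAPACITY = 6      # fixed number of instances, always on

# Preallocated: pay for full capacity every hour, used or not.
preallocated_cost = PREALLOCATED_CAPACITY * len(hourly_demand) * PRICE_PER_INSTANCE_HOUR
idle_instance_hours = sum(max(PREALLOCATED_CAPACITY - d, 0) for d in hourly_demand)
shortfall_hours = sum(1 for d in hourly_demand if d > PREALLOCATED_CAPACITY)

# Elastic: pay only for the instances actually needed in each hour.
elastic_cost = sum(hourly_demand) * PRICE_PER_INSTANCE_HOUR

print(preallocated_cost, idle_instance_hours, shortfall_hours, elastic_cost)
```

Under these assumptions the preallocated setup pays for 100 idle instance-hours and still falls short during the 2 peak hours, while the elastic setup pays only for the 52 instance-hours used.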
89. Serverless Infrastructure Solutions
● OpenFaaS (open-source Functions as a Service)
● AWS Lambda
● Google Cloud Functions
● Azure Functions
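As a rough idea of what serving a model from one of these platforms looks like, here is a sketch shaped like an AWS Lambda Python handler behind an HTTP endpoint. The `predict` function is a hypothetical keyword-based stand-in for a trained model, and the event layout (a JSON string under `"body"`) assumes an API Gateway style proxy request.

```python
import json
from typing import Any, Dict

# Hypothetical stand-in for a trained model; a real function would load
# a serialized model once, outside the handler, so warm invocations reuse it.
SPAM_WORDS = {"free", "prize", "winner"}


def predict(text: str) -> str:
    tokens = set(text.lower().split())
    return "spam" if tokens & SPAM_WORDS else "ham"


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Entry point invoked by the platform for each request."""
    body = json.loads(event.get("body") or "{}")
    label = predict(body.get("text", ""))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"label": label}),
    }


# Local invocation with a fake event, useful for testing before deployment.
response = lambda_handler({"body": json.dumps({"text": "Free prize inside"})}, None)
print(response["body"])
```

The platform creates and removes instances of this function as request volume changes, which is the elastic allocation described on the previous slide.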
90. Distributed Machine Learning using Spark
● Apache Spark is a distributed data processing framework.
● Many machine learning algorithms are implemented in Spark.
● Most of the APIs are similar to those of scikit-learn
● Scaled ETL & Machine Learning can be done using Spark
92. Module 5
Build AI Team
● Adoption of AI
● Skills
● Hiring or upskilling
● Upskilling workforce