The document is about Edureka's Data Science Certification Training course. It covers the following key topics:
- An introduction to machine learning and how it works. Common machine learning techniques like supervised and unsupervised learning are discussed.
- Cluster analysis and k-means clustering are explained in detail as important unsupervised learning algorithms. K-means clustering partitions observations into k clusters where each observation belongs to the cluster with the nearest mean.
- A demo of k-means clustering is shown on a Netflix movie dataset to group movies based on characteristics and increase business. Testimonials from past learners praise the quality of Edureka's data science training.
www.edureka.co/data-science
Edureka’s Data Science Certification Training
What Will You Learn Today?
1. Introduction to Machine Learning
2. Cluster analysis
3. Types of clustering
4. Introduction to k-means clustering
5. How does k-means clustering work?
6. Demo in R: Netflix use-case
What is Machine learning?
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without
being explicitly programmed.
[Figure: machine learning workflow with the stages Training Data, Learn Algorithm, Build Model, Perform and Feedback]
ML Use Case – Google self-driving car
Google’s self-driving car is a smart, driverless car.
It collects data from its environment through sensors.
It makes decisions such as when to speed up, when to slow down, when to overtake and when to turn.
Types of Machine Learning
• Supervised learning
• Unsupervised learning
Supervised learning
Feed the classifier a training data set with predefined labels.
It will learn to categorize data under the appropriate label.
Example: When and where should I buy a house?
House features:
• Area crime rate
• Bedrooms
• Distance to HQ
• Area (in sq. ft)
• Locality
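As a rough illustration of learning from labelled data, here is a minimal Python sketch. The slides do not name a specific classifier, so a 1-nearest-neighbour rule is used only as a stand-in. The two features, their values and the "buy"/"skip" labels are all invented for this example.

```python
import math

# Hypothetical labelled training data: (bedrooms, distance to HQ in km) -> label.
# Every value here is made up purely for illustration.
TRAINING = [
    ((3, 2.0), "buy"),
    ((4, 3.5), "buy"),
    ((1, 20.0), "skip"),
    ((2, 25.0), "skip"),
]

def predict(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(TRAINING, key=lambda item: math.dist(item[0], features))
    return nearest[1]

print(predict((3, 4.0)))   # a house close to HQ
print(predict((2, 18.0)))  # a house far from HQ
```

The point is only that the labels in the training set drive the prediction. A real model would need many more examples and features on comparable scales.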
Unsupervised learning
An image of fruits is first fed into the system.
The system identifies the different fruits using features such as color and size, and groups them accordingly.
When a new fruit is shown, the system analyzes its features and places it in the group of items with similar features.
What is Clustering?
Clustering means grouping objects based on the information found in the data that describes the objects or their relationships.
The goal is that objects in one group should be similar to each other but different from objects in another group.
It deals with finding a structure in a collection of unlabeled data.
Some examples of clustering methods are:
• K-means clustering
• Fuzzy c-means clustering
• Hierarchical clustering
Clustering Use Cases
Marketing
Discovering distinct groups in customer databases, such as customers who make a lot of long-distance calls.
Insurance
Identifying groups of crop insurance policy holders with a high average claim rate. Farmers may destroy crops when it is “profitable”.
Land use
Identifying areas of similar land use in a GIS database.
Seismic studies
Identifying probable areas for oil/gas exploration based on seismic data.
Types of Clustering
Exclusive Clustering
• An item belongs exclusively to one cluster, not several.
• K-means does this sort of exclusive clustering.
Overlapping Clustering
• An item can belong to multiple clusters.
• Its degree of association with each cluster is known.
• Fuzzy c-means does this sort of overlapping clustering.
Hierarchical Clustering
• When two clusters have a parent-child relationship or a tree-like structure, it is hierarchical clustering.
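The difference between exclusive and overlapping clustering can be sketched in Python as follows. The function names, the example centroids and the `fuzzifier` parameter are assumptions introduced for this illustration. The membership rule is the usual fuzzy c-means formula, which the slides do not spell out.

```python
import math

def hard_assignment(point, centroids):
    """Exclusive clustering: index of the single nearest centroid."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def fuzzy_memberships(point, centroids, fuzzifier=2.0):
    """Overlapping clustering: degree of association with every centroid.

    Uses the standard fuzzy c-means membership formula; the degrees sum to 1.
    """
    dists = [math.dist(point, c) for c in centroids]
    zeros = dists.count(0.0)
    if zeros:  # the point sits exactly on one or more centroids
        return [1.0 / zeros if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (fuzzifier - 1.0)
    return [1.0 / sum((d_j / d_k) ** power for d_k in dists) for d_j in dists]

# Hypothetical centroids and point, chosen only for illustration.
centroids = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
print(hard_assignment(point, centroids))
print(fuzzy_memberships(point, centroids))
```

With exclusive clustering the point gets exactly one label. With overlapping clustering it gets one weight per cluster, larger for closer centroids.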
K-means clustering
k-means clustering is one of the simplest unsupervised learning algorithms for solving clustering problems.
It divides the entire dataset into k clusters.
k-means clustering requires the following two inputs:
1. k = number of clusters
2. Training set of m observations = {x1, x2, x3, ..., xm}
[Figure: a total population divided into Group 1, Group 2, Group 3 and Group 4]
Example - Google News
Various news URLs related to Trump and Modi are grouped under one section.
K-means clustering automatically groups news stories about the same topic into one of a pre-defined number of clusters.
Example
I need to find specific locations to build schools in this area so that the students don’t have to travel far.
[Figure: plot of the students’ locations in the area]
How does k-means work?
1. Choose number of clusters
2. Initialization
3. Cluster assignment
4. Move centroid
5. Optimization
6. Convergence
How k-means works – Choose number of clusters
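The WSS (within-cluster sum of squares) is defined as the sum of the squared distances between each member of a cluster and its centroid.
Mathematically:
$$\mathrm{WSS} = \sum_{i=1}^{m} \lVert p(i) - q(i) \rVert^{2}$$
where p(i) = data point i
q(i) = closest centroid to data point i
m = number of data points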
The idea of the elbow method is to choose the k beyond which the WSS decreases only slightly as k increases, i.e. the “elbow” of the WSS-versus-k curve.
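To make the WSS definition and the elbow method concrete, here is a minimal Python sketch. It is an illustration, not the code from the R demo: the helper names `wss` and `kmeans`, the fixed random seed and the nine sample points are all assumptions made for this example.

```python
import math
import random

def wss(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

def kmeans(points, k, seed=0, max_iter=100):
    """Compact k-means used only to produce centroids for each k."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen observations
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids

# Hypothetical 2-D data with three visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
for k in range(1, 6):
    print(k, round(wss(points, kmeans(points, k)), 2))
```

Plotting the printed WSS values against k gives the elbow curve. The k at which the curve flattens is the candidate number of clusters.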
How k-means works – Initialization
[Figure: scatter plot of the data points (X-axis vs. Y-axis) with the cluster centroids marked]
Randomly initialize k points called the cluster centroids.
Here, k = 2
The value of k (number of clusters) can be determined from the elbow curve.
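A minimal Python sketch of the initialization step follows. The slides only say that k points are chosen at random. Picking them from the observations themselves is one common option and is an assumption here, as are the function name `init_centroids` and the sample data.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k different observations at random to serve as the initial centroids."""
    rng = random.Random(seed)  # a seed makes the choice reproducible
    return rng.sample(points, k)

# Hypothetical 2-D observations, invented for illustration.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(init_centroids(points, k=2, seed=42))
```

Because the starting centroids are random, different seeds can lead to different final clusters, which is why k-means is often run several times.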
How k-means works – Cluster assignment
Compute the distance between each data point and each of the initialized cluster centroids.
Each data point is assigned to its nearest centroid, which divides the data points into two groups.
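The assignment step can be sketched in Python as below. The function name `assign_clusters` and the sample points and centroids are hypothetical. Euclidean distance is assumed, since the slides do not name a distance measure.

```python
import math

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

# Hypothetical data and two starting centroids (k = 2).
points = [(0, 0), (1, 0), (9, 9), (10, 10)]
centroids = [(0, 1), (9, 10)]
print(assign_clusters(points, centroids))
```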
How k-means works – Move centroid
Compute the mean of the blue dots.
Reposition the blue cluster centroid to this mean.
Compute the mean of the orange dots.
Reposition the orange cluster centroid to this mean.
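The centroid update can be sketched in Python as follows. The function name `move_centroids` and the sample values are assumptions. Keeping the old centroid for an empty cluster is also an assumption, since the slides do not cover that case.

```python
def move_centroids(points, labels, centroids):
    """Reposition each centroid to the mean of the points assigned to it."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            # Average each coordinate separately across the cluster members.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            updated.append(old)  # assumption: an empty cluster keeps its old centroid
    return updated

# Hypothetical points with labels 0 ("blue") and 1 ("orange").
points = [(0, 0), (2, 0), (10, 10), (12, 14)]
labels = [0, 0, 1, 1]
print(move_centroids(points, labels, [(5, 5), (6, 6)]))
```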
How k-means works – Optimization
Repeat the previous two steps until the cluster centroids stop changing position.
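Putting assignment and centroid update into a loop gives the following minimal Python sketch. It takes the starting centroids as an argument so that the run is reproducible. The function name `kmeans`, the `max_iter` safety limit and the sample data are assumptions made for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for iteration in range(1, max_iter + 1):
        # Cluster assignment: nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move centroid: mean of the members (an empty cluster keeps its centroid).
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(a) / len(members) for a in zip(*members)))
            else:
                new_centroids.append(old)
        if new_centroids == centroids:  # converged: no centroid changed position
            return centroids, labels, iteration
        centroids = new_centroids
    return centroids, labels, max_iter

# Hypothetical data with two obvious groups and two starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
final_centroids, labels, iterations = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids, labels, iterations)
```

The loop stops either when no centroid moves or when `max_iter` is reached, whichever comes first.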
How k-means works – Convergence
Finally, the k-means clustering algorithm converges.
It divides the data points into two clusters, clearly visible in orange and blue.
Problem Statement
Challenge: Netflix wanted to increase its business by showing the most popular movies on its website.
Solution: Netflix decided to group the movies based on budget, gross and Facebook likes.
Approach: Netflix took an IMDb dataset of 5000 values and applied k-means clustering to group it.
But how would I know which set of movies to show and which not to show?
Output
We got three clusters based on budget and gross.
Let’s see how good these clusters are.
Running the command cl gives the following output:
Within cluster sum of squares by cluster:
(between_SS / total_SS = 72.4 %)
The higher this percentage, the better the model, because more of the total variation in the data is explained by the clusters.
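The ratio reported above can be reproduced with a short calculation, sketched here in Python rather than R. It assumes that between_SS equals total_SS minus the summed within-cluster sums of squares, which is how R's kmeans output defines it. The function name `explained_ratio` and the four sample points are hypothetical.

```python
import math

def explained_ratio(points, labels):
    """between_SS / total_SS for a given cluster assignment."""
    def mean(rows):
        return tuple(sum(axis) / len(rows) for axis in zip(*rows))

    def sum_sq(rows, centre):
        return sum(math.dist(r, centre) ** 2 for r in rows)

    total_ss = sum_sq(points, mean(points))  # spread around the overall mean
    within_ss = 0.0
    for j in set(labels):
        members = [p for p, label in zip(points, labels) if label == j]
        within_ss += sum_sq(members, mean(members))  # spread around each cluster mean
    between_ss = total_ss - within_ss
    return between_ss / total_ss

# Hypothetical data: two tight groups that are far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]
print(f"{explained_ratio(points, labels):.1%}")
```

A ratio close to 100 % means the clusters are tight relative to the overall spread of the data. The ratio also rises as k grows, so it is most useful for comparing models with the same k.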
Output
Further, let’s relate the cluster assignment to individual characteristics such as director Facebook likes (column 5) and movie Facebook likes (column 28). Cluster 2 has the most movie likes as well as the most director likes.
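One simple way to relate cluster assignments to a characteristic is to average that characteristic within each cluster, sketched below in Python. The labels and like counts are invented for illustration and are not taken from the IMDb dataset. The function name `mean_by_cluster` is also hypothetical.

```python
from collections import defaultdict

def mean_by_cluster(labels, values):
    """Average of `values` within each cluster label."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        totals[label] += value
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

# Hypothetical cluster labels and movie Facebook likes (made-up numbers).
labels = [1, 2, 2, 3, 1, 2]
movie_likes = [100, 9000, 11000, 500, 300, 10000]

averages = mean_by_cluster(labels, movie_likes)
top_cluster = max(averages, key=averages.get)
print(averages, top_cluster)
```

The cluster with the highest average is the natural candidate for the set of movies to promote.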
Course Details
Go to www.edureka.co/data-science
Get Edureka Certified in Data Science Today!
What our learners have to say about us!
Shravan Reddy says- “I would like to recommend any one who
wants to be a Data Scientist just one place: Edureka. Explanations
are clean, clear, easy to understand. Their support team works
very well.. I took the Data Science course and I'm going to take
Machine Learning with Mahout and then Big Data and Hadoop”.
Gnana Sekhar says - “Edureka Data science course provided me a very
good mixture of theoretical and practical training. LMS pre recorded
sessions and assignments were very good as there is a lot of
information in them that will help me in my job. Edureka is my
teaching GURU now...Thanks EDUREKA.”
Balu Samaga says - “It was a great experience to undergo and get
certified in the Data Science course from Edureka. Quality of the
training materials, assignments, project, support and other
infrastructures are a top notch.”