2. Agenda
● What is Data Science?
● Domains - Need of Data Science
● Data Life Cycle
● Data Science Sub-Domains
● Why Python for Data Science?
● Python - Modules in Data science
○ Introduction to Pandas
○ Introduction to NumPy
○ Introduction to Matplotlib
○ Introduction to Seaborn
● What is Machine Learning?
4. What is Data Science?
Data Science is the field of study that combines domain expertise,
programming skills, and knowledge of math and statistics to extract
meaningful insights from data.
Analysts and business users in turn translate these insights into
tangible business value.
7. Domains - Need of Data Science
● E-commerce
○ Recommendation systems, customer sentiment analysis,
inventory management, improved customer service
● Healthcare
○ Castlight: helps customers and clients choose an appropriate plan
● Finance
○ Chatbots, call-center automation, paperwork automation
● And more
8. Why Python for Data Science
● Easy to learn
○ The language of choice for introductory courses in 8 of the top 10 US
computer science programs
● Full featured
○ Not just a statistics language: it has full capabilities for data
acquisition, cleaning, databases, high-performance computing, and
more
● Strong data science libraries
○ Pandas, NumPy, Matplotlib, SciPy, Seaborn, NLTK, scikit-learn, and
more
10. What is Anaconda?
● Essentially a large (~400 MB) Python installation
● Contains everything you need for data engineering, analytics, and
machine learning
● Unless you have a special reason not to, you should just install and use
it.
11. Introduction to Pandas
What is Pandas?
Pandas is a Python library for data analysis and data manipulation. Its
DataFrame is the Python counterpart of R's data.frame.
Key Features of Pandas
● It has APIs for loading data from different file formats into memory
(Excel, TSV, CSV, databases, etc.).
● Data is structured in rows and columns.
● Data retrieval is similar to SQL: it supports common operations such
as group-by, joins, and views.
● Merging of data from multiple datasets.
● Supports much time-series functionality: datetimes, time zones,
business days, holidays, etc.
● Boolean indexing
● Fancy indexing
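A few of the features above can be sketched in code. This is a minimal illustration, not part of the original slides: the order and customer data, the column names, and the dates are all hypothetical, and the CSV is inlined so no external file is needed.

```python
import io

import pandas as pd

# Hypothetical sample data, inlined as CSV text instead of a real file.
csv_text = """order_id,customer,amount
1,alice,120.0
2,bob,35.5
3,alice,60.0
4,carol,15.0
"""

# Loading: read_csv accepts a path or any file-like object.
orders = pd.read_csv(io.StringIO(csv_text))

# A second (hypothetical) dataset to merge with.
customers = pd.DataFrame(
    {"customer": ["alice", "bob", "carol"], "city": ["Pune", "Delhi", "Mumbai"]}
)

# Boolean indexing: keep only the rows where the condition is True.
large_orders = orders[orders["amount"] > 50]

# Fancy indexing: pick rows and columns by explicit lists of labels.
picked = orders.loc[[0, 3], ["customer", "amount"]]

# SQL-like group-by: total amount per customer.
totals = orders.groupby("customer")["amount"].sum()

# SQL-like join: attach each customer's city to their orders.
merged = orders.merge(customers, on="customer", how="left")

# Time series: three business days starting on a Friday (skips the weekend).
business_days = pd.bdate_range("2024-01-05", periods=3)

print(large_orders)
print(totals)
print(merged)
print(business_days)
```

Each operation returns a new object and leaves `orders` unchanged, which makes it easy to chain steps while exploring data.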
12. Core Data Structures of Pandas
● DataFrames
● Series
Core Operations
Create Select Insert Map
Join Sort Clean ApplyMap
View Update Filter Append
Group Summarise Confirm Rotate
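As a rough sketch of how the two data structures relate, and of a few of the core operations (create, select, insert, filter, sort), consider the example below. The item names and numbers are made up for illustration.

```python
import pandas as pd

# Create: a DataFrame is a table with a row index and named columns.
items = pd.DataFrame(
    {"price": [250, 40, 90], "stock": [5, 100, 20]},
    index=["keyboard", "cable", "mouse"],
)

# Select: a single column of a DataFrame is a Series (a 1-D labeled array).
price = items["price"]

# Select: one row by its index label.
mouse_row = items.loc["mouse"]

# Insert: add a new column computed from existing ones.
items["value"] = items["price"] * items["stock"]

# Filter: keep rows that satisfy a condition.
well_stocked = items[items["stock"] > 10]

# Sort: order rows by a column.
by_price = items.sort_values("price")

print(type(price).__name__)
print(by_price)
```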
38. Introduction to NumPy
● NumPy is widely used in scientific computing
● 3 main benefits of using a NumPy array over a list
○ Less memory
○ Faster
○ More convenient
● Broadcasting allows universal functions (ufuncs) to deal in a meaningful
way with arrays that do not have exactly the same shape.
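The points above can be illustrated with a short sketch. The sizes and values are arbitrary, and the list memory figure is only an estimate (the list's own size plus the size of each integer object it references).

```python
import sys

import numpy as np

values = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# Less memory: the array stores raw 8-byte integers in one contiguous buffer,
# while the list stores pointers to separate Python int objects.
array_bytes = arr.nbytes
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Convenient (and typically faster): one vectorized expression replaces a loop.
doubled_list = [v * 2 for v in values]
doubled_arr = arr * 2

# Broadcasting: a (2, 3) matrix plus a (3,) row. The row is applied to every
# row of the matrix without being copied explicitly.
matrix = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])
shifted = matrix + row               # [[10, 21, 32], [13, 24, 35]]

print(array_bytes, list_bytes)
print(shifted)
```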
43. Introduction to Matplotlib
A picture is worth a thousand words. Matplotlib is a Python library for
creating 2D graphs and plots from Python scripts, with MATLAB-like graphs and
visualizations.
Its pyplot module makes plotting easy by providing features to control line
styles, font properties, axis formatting, etc. It supports a wide variety of
graphs and plots, including histograms, bar charts, power spectra, and error
charts. Used along with NumPy, it provides an effective open-source
alternative to MATLAB.
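A minimal pyplot sketch of the ideas above: one line plot with a custom line style and one histogram. The data is synthetic, and the figure is written to an in-memory buffer (an assumption made here so the script runs without a display or a file on disk).

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only.
x = np.linspace(0, 2 * np.pi, 100)
rng = np.random.default_rng(0)
samples = rng.normal(size=500)

fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with explicit line style, color, and axis labels.
ax_line.plot(x, np.sin(x), linestyle="--", color="tab:blue", label="sin(x)")
ax_line.set_xlabel("x")
ax_line.set_ylabel("sin(x)")
ax_line.legend()

# Histogram of the random samples, split into 20 bins.
ax_hist.hist(samples, bins=20)
ax_hist.set_title("Histogram of samples")

fig.tight_layout()

# Render to PNG in memory; pass a file name instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In an interactive session, `plt.show()` would display the figure instead of saving it.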
51. Introduction to Seaborn
Seaborn is a Python data visualization library based on Matplotlib. It
provides a high-level interface for drawing attractive and informative
statistical graphics.
Important features of Seaborn
● Built-in themes for styling Matplotlib graphics
● Fitting and visualizing linear regression models
● Plotting statistical time series data
● Works well with NumPy and Pandas data structures
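As a small sketch of two of these features (a built-in theme and a fitted linear regression drawn from a Pandas DataFrame), the example below uses synthetic data. The column names `hours` and `score` are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; no display needed
import numpy as np
import pandas as pd
import seaborn as sns

# Apply one of Seaborn's built-in themes to all subsequent plots.
sns.set_theme(style="whitegrid")

# Synthetic data: score grows roughly linearly with hours, plus noise.
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, size=50)
df = pd.DataFrame(
    {"hours": hours, "score": 40 + 5 * hours + rng.normal(scale=4, size=50)}
)

# Scatter plot plus a fitted regression line, read straight from the DataFrame.
# ci=None skips the bootstrapped confidence band to keep the example fast.
ax = sns.regplot(data=df, x="hours", y="score", ci=None)
ax.set_title("Score vs. hours (synthetic data)")
```

The returned `ax` is an ordinary Matplotlib Axes, so any Matplotlib call can be used to adjust the result further.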
60. Machine Learning
● What is Machine Learning
● Types of Machine Learning
● Supervised and Unsupervised Learning.
● Use Cases
○ Linear Regression (Supervised)
○ K-Means (Unsupervised)
○ Sentiment Analysis
61. What is Machine Learning
Machine Learning is a subset of Artificial Intelligence (AI) that
gives machines the ability to learn automatically and improve
from experience without being explicitly programmed.
63. Linear Regression (Supervised)
Linear Regression is a supervised machine learning algorithm that performs a
regression task: it models a target value as a linear function of one or
more independent variables. It is mostly used for finding relationships
between variables and for forecasting.
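A minimal sketch of the idea, fitting a straight line by ordinary least squares with NumPy's `polyfit`. The data points are made up and lie exactly on a line so the result is easy to check; in practice a library such as scikit-learn is a common choice for this task.

```python
import numpy as np

# Hypothetical training data: x is the independent variable, y the target.
# The points lie exactly on y = 3x + 2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 8.0, 11.0, 14.0, 17.0])

# Fit a degree-1 polynomial (a straight line) by least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Forecast the target for an unseen input.
x_new = 6.0
y_pred = slope * x_new + intercept

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for x = {x_new}: {y_pred:.2f}")
```

The fitted slope describes the relationship between the variables (how much y changes per unit of x), and the same line is used for forecasting.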
64. K-Means (Unsupervised)
K-means clustering is a type of unsupervised learning, which is used when
you have unlabeled data (i.e., data without defined categories or groups).
The goal of this algorithm is to find groups in the data, with the number of
groups represented by the variable K. The algorithm works iteratively to
assign each data point to one of K groups based on the features that are
provided. Data points are clustered based on feature similarity. The results of
the K-means clustering algorithm are:
● The centroids of the K clusters, which can be used to label new data
● Labels for the training data (each data point is assigned to a single cluster)
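The iterative procedure described above can be sketched from scratch with NumPy. This is an illustrative implementation, not a library's: the function names `assign` and `kmeans`, the random initialization, and the sample points are all assumptions made for this example.

```python
import numpy as np


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    # distances has shape (n_points, k) thanks to broadcasting.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical unlabeled data: two visibly separate groups.
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
    [8.0, 8.0], [9.0, 9.0], [8.5, 9.5],
])

centroids, labels = kmeans(points, k=2)

# The centroids can label new data: a point near the origin joins the
# same cluster as the first group.
new_label = assign(np.array([[0.0, 0.0]]), centroids)[0]

print(centroids)
print(labels, new_label)
```

The two return values correspond to the two results listed above: `centroids` for labeling new data and `labels` for the training data.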