A MAJOR PROJECT REPORT ON
“TELECOM CHURN BASED ON ML”
In partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
Submitted by
Hemanth Pasula (16911A1246), D Vedanth (16911A1210), G.Yashwanth venkat Samrat (16911A1217)
Under the Esteemed Guidance of
M.Suresh Babu
Assistant Professor
DEPARTMENT OF INFORMATION TECHNOLOGY
VIDYA JYOTHI INSTITUTE OF TECHNOLOGY
(Accredited by NBA, Approved by AICTE, Autonomous VJIT Hyderabad)
Aziz Nagar Gate, C.B.Post, Chilkur Road, Hyderabad – 500075
2019 - 2020
VIDYA JYOTHI INSTITUTE OF TECHNOLOGY
(Accredited by NBA, Approved by AICTE, Autonomous VJIT Hyderabad)
Aziz Nagar Gate, C.B.Post, Chilkur Road, Hyderabad - 500075
DEPARTMENT OF INFORMATION TECHNOLOGY
CERTIFICATE
This is to certify that the Project Report on “TELECOM CHURN BASED ON ML” is a
bonafide work by D Vedanth(16911A1210), Hemanth Pasula(16911A1246),
G.Yashwanth venkat Samrat (16911A1217) in partial fulfillment of the requirement for
the award of the degree of Bachelor of Technology in “INFORMATION
TECHNOLOGY” VJIT Hyderabad during the year 2019 - 2020.
Project Guide: M.Suresh Babu, M.Tech, Assistant Professor
Head of the Department: Mr. B. Srinivasulu, M.E., Professor
External Examiner
VIDYA JYOTHI INSTITUTE OF TECHNOLOGY
(Accredited by NBA, Approved by AICTE, Autonomous VJIT Hyderabad)
Aziz Nagar Gate, C.B.Post, Chilkur Road, Hyderabad - 500075
2019 - 2020
DECLARATION
We, D Vedanth(16911A1210), Hemanth Pasula(16911A1246), G.Yashwanth
venkat Samrat (16911A1217) hereby declare that Project Report entitled “TELECOM
CHURN BASED ON ML”, is submitted in the partial fulfillment of the requirement for
the award of Bachelor of Technology in Information Technology to Vidya Jyothi
Institute of Technology, Autonomous VJIT - Hyderabad, is an authentic work and has not
been submitted to any other university or institute for the degree.
D Vedanth (16911A1210)
Hemanth Pasula (16911A1246)
G.Yashwanth venkat Samrat(16911A1217)
ACKNOWLEDGEMENT
It is a great pleasure to express our deepest sense of gratitude and indebtedness to
our internal guide M.Suresh Babu, Assistant Professor, Department of IT, VJIT, for having
been a source of constant inspiration, precious guidance and generous assistance during the
project work. We deem it a privilege to have worked under her able guidance. Without her
close monitoring and valuable suggestions, this work would not have taken this shape. We feel
that this help is irreplaceable and unforgettable.
We wish to express our sincere thanks to Dr. P. Venugopal Reddy, Director VJIT, for
providing the college facilities for the completion of the project. We are profoundly thankful to
Mr.B.Srinivasulu, Professor and Head of Department of IT, for his cooperation and
encouragement. Finally, we thank all the faculty members, supporting staff of IT Department and
friends for their kind co-operation and valuable help for completing the project.
TEAM MEMBERS
D Vedanth (16911A1210)
Hemanth Pasula (16911A1246)
G.Yashwanth venkat Samrat (16911A1217)
CONTENTS
1. Introduction
1.1 Motivation 9
1.2 Problem definition 9
1.3 Objective of Project 9
1.3.1 General Objective 9
1.3.2 Specific Objective 10
1.4 Limitations of Project 10
2. LITERATURE SURVEY
2.1 Introduction 11
3. ANALYSIS
3.1 Introduction 12
3.2 Software Requirement Specification 12
3.2.1 Functional requirements 12
3.2.2 Non-Functional requirements 14
3.2.3 Software requirement 14
3.2.4 Hardware requirement 23
3.3 Content diagram of Project 25
4. DESIGN
4.1 Introduction 26
4.2 UML Diagram 42
4.2.1 Class diagram 42
4.2.2 Use case diagram 43
4.2.3 Activity diagram 44
4.2.4 Sequence diagram 45
4.2.5 Collaboration diagram 46
4.2.6 Component diagram 47
4.2.7 Deployment diagram 48
5. IMPLEMENTATION METHODS & RESULTS 49
5.1 Introduction 49
5.2 Explanation of Key functions 50
5.3 Screenshots or output screens 71
6. TESTING & VALIDATION
6.1 Introduction 78
6.2 Design of test cases and scenarios 88
6.3 Validation 88
6.4 Conclusion 90
7. CONCLUSION AND FUTURE ENHANCEMENTS
Project Conclusion 90
Future Enhancement 90
REFERENCES 91
LIST OF FIGURES PAGE NO
Fig-1 UML Views 28
Fig-2 Merging data sets 50
Fig-3 Imputing missing values 52
Fig-4 Dividing into training set 52
Fig-5 Correlation Matrix 53
Fig-6 Histogram 55
Fig-7 Unit testing life cycle 79
Fig-8 White box testing 80
Fig-9 Black box testing 82
Fig-10 Grey box testing 83
Fig-11 Grey box testing feature 84
Fig-12 Software verification 89
LIST OF OUTPUT SCREEN SHOTS PAGE NO
Fig-1 Structure of data 71
Fig-2 Histogram 72
Fig-3 Box plot 72
Fig-4 Graph plot 73
Fig-5 Conversion of data types 73
Fig-6 Missing value 74
Fig-7 Converting the data sets 1 and 0 74
Fig-8 Naive Bayesian 75
Fig-9 Optimal cut-off point 75
Fig-10 Correlation matrix 76
Fig-11 ROC curve 76
Fig-12 Customers who stay 77
Fig-13 Customers who leave 77
Chapter-1
Introduction
1.1 Motivation
In the present project, we try to help service providers avoid losing their customers. With the
help of this work, we can better understand customers and help the service providers know
their customers better.
1.2 Problem definition
Telecommunication is the exchange of signs, signals, messages, words, writings, images and
sounds or information of any nature by wire, radio, optical or
other electromagnetic systems. Telecommunication occurs when the exchange
of information between communication participants includes the use of technology. It is transmitted
through a transmission medium, such as over physical media, for example, over electrical cable, or
via electromagnetic radiation through space such as radio or light. Such transmission paths are often
divided into communication channels which afford the advantages of multiplexing. Since
the Latin term communication is considered the social process of information exchange, the
term telecommunications is often used in its plural form because it involves many different
technologies.
The churn rate, also known as the rate of attrition or customer churn, is the rate at
which customers stop doing business with an entity. It is most commonly expressed as the
percentage of service subscribers who discontinue their subscriptions within a given time
period. It is also the rate at which employees leave their jobs within a certain period. For a
company to expand its clientele, its growth rate (measured by the number of new customers)
must exceed its churn rate.
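As a simple illustration of this definition (the numbers below are hypothetical and not taken from the project data), the churn rate for a period can be computed as follows:

def churn_rate(subscribers_at_start: int, subscribers_lost: int) -> float:
    """Churn rate: the fraction of subscribers who discontinue in a given period."""
    return subscribers_lost / subscribers_at_start

# A provider that starts a quarter with 5,000 subscribers and loses 150 of them
# has a quarterly churn rate of 3%.
print(f"{churn_rate(5000, 150):.1%}")  # 3.0%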
Machine learning (ML) is the study of computer algorithms that improve automatically
through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build
a mathematical model based on sample data, known as "training data", in order to make predictions
or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a
wide variety of applications, such as email filtering and computer vision, where it is difficult or
infeasible to develop conventional algorithms to perform the needed tasks.
1.3 Objective of Project
1.3.1 General Objective
Customer attrition, also known as customer churn, customer turnover, or customer
defection, is the loss of clients or customers.
Telephone service companies, Internet service providers, pay TV companies, insurance
firms, and alarm monitoring services, often use customer attrition analysis and customer
attrition rates as one of their key business metrics because the cost of retaining an existing
customer is far less than acquiring a new one. Companies from these sectors often have
customer service branches which attempt to win back defecting clients, because recovered
long-term customers can be worth much more to a company than newly recruited clients.
1.3.2 Specific Objective
Companies usually make a distinction between voluntary churn and involuntary
churn. Voluntary churn occurs due to a decision by the customer to switch to another
company or service provider, involuntary churn occurs due to circumstances such as a
customer's relocation to a long-term care facility, death, or the relocation to a distant
location. In most applications, involuntary reasons for churn are excluded from the analytical
models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to
factors of the company-customer relationship which companies control, such as how billing
interactions are handled or how after-sales help is provided.
Predictive analytics uses churn prediction models that predict customer churn by assessing
a customer's propensity, or risk, to churn. Since these models generate a small prioritized list of
potential defectors, they are effective at focusing customer retention marketing programs on
the subset of the customer base who are most vulnerable to churn.
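A minimal sketch of how such a prioritized list could be produced with scikit-learn is shown below; the data is synthetic and the feature and identifier names are assumptions for illustration, not the project's actual ones.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data standing in for real customer features (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                         # four hypothetical usage features
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)  # synthetic churn labels

model = LogisticRegression().fit(X, y)

# Rank customers by predicted churn propensity; the top of the list is the
# subset a retention campaign would target first.
scores = pd.DataFrame({
    "customer_id": np.arange(500),
    "churn_propensity": model.predict_proba(X)[:, 1],
})
priority_list = scores.sort_values("churn_propensity", ascending=False).head(50)
print(priority_list.head())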
1.4 Limitations of Project
Acquiring data about a customer is difficult. These days it is very hard to know what a
person is thinking, so we can only judge with the help of the data in front of us, which may be
only 80-90% reliable. Therefore, we should make sure that our predictions are good.
Chapter-2
LITERATURE SURVEY
2.1 Introduction
LOGISTIC REGRESSION
Problem Statement :
"You have a telecom firm which has collected data of all its customers"
The main types of attributes are :
1.Demographics (age, gender etc.)
2.Services availed (internet packs purchased, special offers etc)
3.Expenses (amount of recharge done per month etc.)
Based on all this past information, you want to build a model which will predict whether a
particular customer will churn or not.
So the variable of interest, i.e. the target variable here is ‘Churn’ which will tell us whether
or not a particular customer has churned. It is a binary variable: 1 means that the customer
has churned and 0 means the customer has not churned.
With 21 predictor variables we need to predict whether a particular customer will switch to
another telecom provider or not.
The data sets were taken from the Kaggle website.
DATA PROCEDURE
Import the required libraries
1. Importing all datasets
2. Merging all datasets based on condition ("customer_id ")
3. Data Cleaning - checking the null values
4. Check for the missing values and replace them
5. Model building
• Binary encoding
• One hot encoding
• Creating dummy variables and removing the extra columns
6. Feature selection using RFE - Recursive Feature Elimination
7. Getting the predicted values on train set
8. Creating a new column predicted with 1 if churn > 0.5 else 0
9. Create a confusion matrix on train set and test
10. Check the overall accuracy
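The following is a hedged sketch of this procedure in Python with pandas and scikit-learn. The file names, the column names (customer_id, churn) and the number of features kept by RFE are assumptions for illustration, not the exact values used in the project.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix, accuracy_score

# 1-2. Import the datasets and merge them on the customer identifier.
customers = pd.read_csv("customer_data.csv")
services = pd.read_csv("internet_data.csv")
churn = pd.read_csv("churn_data.csv")
df = customers.merge(services, on="customer_id").merge(churn, on="customer_id")

# 3-4. Data cleaning: check for nulls and fill missing numeric values.
print(df.isnull().sum())
df = df.fillna(df.median(numeric_only=True))

# 5. Encoding: map binary Yes/No columns to 1/0, create dummy variables for the
#    remaining categoricals, and drop the identifier column.
df = df.drop(columns=["customer_id"]).replace({"Yes": 1, "No": 0})
df = pd.get_dummies(df, drop_first=True)

X = df.drop(columns=["churn"])
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 6. Feature selection with Recursive Feature Elimination over logistic regression.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=15).fit(X_train, y_train)

# 7-8. Predicted probabilities on the train set, labelled 1 where churn probability > 0.5.
train_pred = (rfe.predict_proba(X_train)[:, 1] > 0.5).astype(int)

# 9-10. Confusion matrix and overall accuracy on the train and test sets.
print(confusion_matrix(y_train, train_pred))
print("Train accuracy:", accuracy_score(y_train, train_pred))
print("Test accuracy:", accuracy_score(y_test, rfe.predict(X_test)))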
Chapter-3
ANALYSIS
3.1 Introduction
The field of chemistry uses analysis in at least three ways: to identify the components
of a particular chemical compound (qualitative analysis), to identify the proportions of
components in a mixture (quantitative analysis), and to break down chemical processes and
examine chemical reactions between elements of matter. For an example of its use, analysis
of the concentration of elements is important in managing a nuclear reactor, so nuclear
scientists will analyse neutron activation to develop discrete measurements within vast
samples. A matrix can have a considerable effect on the way a chemical analysis is
conducted and the quality of its results. Analysis can be done manually or with a device.
Chemical analysis is an important element of national security among the major world
powers.
3.2 Software Requirement Specification
3.2.1 Functional requirements
Functional requirements may involve calculations, technical details, data
manipulation and processing, and other specific functionality that define what a system is
supposed to accomplish. Behavioural requirements describe all the cases where the system
uses the functional requirements, these are captured in use cases. Functional requirements are
supported by non-functional requirements (also known as "quality requirements"), which
impose constraints on the design or implementation (such as performance requirements,
security, or reliability). Generally, functional requirements are expressed in the form "system
must do requirement," while non-functional requirements take the form "system shall be
requirement." The plan for implementing functional requirements is detailed in the system
design, whereas non-functional requirements are detailed in the system architecture.
Customer data: contains all data related to the customer's services and contract
information, in addition to all offers, packages, and services subscribed to by the customer.
Furthermore, it also contains information generated from the CRM system, such as all customer
GSMs, type of subscription, birthday, gender, place of residence and more.
Network logs data: contains the internal sessions related to internet, calls, and SMS
for each transaction in the telecom operator, such as the time needed to open an internet
session and the call ending status. It can indicate whether a session dropped due to an error in
the internal network.
Call detail records (CDRs): contain all charging information about calls, SMS,
MMS, and internet transactions made by customers. This data source is generated as text files,
is large, and holds a lot of detailed information. We spent a lot of time understanding it and
learning its sources and storage format. In addition, these records must be linked to the
detailed customer data stored in relational databases.
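As an illustration of how these three sources might be combined into one analytical table, here is a hedged pandas sketch; every file, column, and aggregation name below is an assumption, since the operator's real schemas are not reproduced in this report.

import pandas as pd

# Illustrative file and column names only; real operator schemas differ.
crm = pd.read_csv("crm_customers.csv")          # subscriptions, demographics, GSMs
sessions = pd.read_csv("network_logs.csv")      # internet/call/SMS session records
cdrs = pd.read_csv("cdrs.txt", sep="|")         # charging records exported as text files

# Aggregate the transactional sources to one row per customer before joining.
session_feats = sessions.groupby("customer_id").agg(
    dropped_sessions=("ended_in_error", "sum"),
    total_sessions=("session_id", "count"),
).reset_index()
usage_feats = cdrs.groupby("customer_id").agg(
    total_charge=("charge", "sum"),
    total_transactions=("record_id", "count"),
).reset_index()

# One table keyed by customer_id, ready for churn modelling.
df = (crm.merge(session_feats, on="customer_id", how="left")
         .merge(usage_feats, on="customer_id", how="left"))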
The Quality of the System is maintained in such a way so that it can be very user
friendly to all the users.
The software quality attributes are assumed as under:
1.Accurate and hence reliable.
In simpler terms, given a set of data points from repeated measurements of the same
quantity, the set can be said to be accurate if their average is close to the true value of the
quantity being measured, while the set can be said to be precise if the values are close to
each other. In the first, more common definition of "accuracy" above, the two concepts are
independent of each other, so a particular set of data can be said to be either accurate, or
precise, or both, or neither.
In everyday language, we use the word reliable to mean that something is dependable
and that it will behave predictably every time.
2.Secured.
Security is freedom from, or resilience against, potential harm caused by others.
The word entered the English language in the 16th century. It is derived from the Latin
securus, meaning freedom from anxiety: se (without) + cura (care, anxiety).
3.Fast speed.
In everyday use and in kinematics, the speed of an object is the magnitude of the
change of its position; it is thus a scalar quantity. The average speed of an object in an
interval of time is the distance travelled by the object divided by the duration of the
interval; the instantaneous speed is the limit of the average speed as the duration of the time
interval approaches zero.
4.Compatibility.
Compatibility is a state in which two things are able to exist or occur together without
problems or conflict.
3.2.2 Non-Functional requirements
In systems engineering and requirements engineering, a non-functional
requirement (NFR) is a requirement that specifies criteria that can be used to judge the
operation of a system, rather than specific behaviours. They are contrasted with functional
requirements that define specific behaviour or functions. The plan for
implementing functional requirements is detailed in the system design. The plan for
implementing non-functional requirements is detailed in the system architecture, because
they are usually architecturally significant requirements.
3.2.3 Software requirement
1.Python
Introduction
Python is an interpreted, high-level, general-purpose programming language. Created
by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace. Its language
constructs and object-oriented approach aim to help programmers write clear, logical code
for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming
paradigms, including structured (particularly, procedural), object-oriented, and functional
programming. Python is often described as a "batteries included" language due to its
comprehensive standard library.
Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0,
released in 2000, introduced features like list comprehensions and a garbage collection
system capable of collecting reference cycles. Python 3.0, released in 2008, was a major
revision of the language that is not completely backward-compatible, and much Python 2
code does not run unmodified on Python 3.
The Python 2 language was officially discontinued in 2020 (first planned for 2015), and
"Python 2.7.18 is the last Python 2.7 release and therefore the last Python 2 release." No
more security patches or other improvements will be released for it. With Python 2's
end-of-life, only Python 3.5.x and later are supported.
Python interpreters are available for many operating systems. A global community of
programmers develops and maintains CPython, an open source reference implementation.
A non-profit organization, the Python Software Foundation, manages and directs resources
for Python and CPython development.
History of Python
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired
by SETL), capable of exception handling and interfacing with the Amoeba operating
system. Its implementation began in December 1989. Van Rossum shouldered sole
responsibility for the project, as the lead developer, until 12 July 2018, when he announced
his "permanent vacation" from his responsibilities as Python's Benevolent Dictator For Life,
a title the Python community bestowed upon him to reflect his long-term commitment as the
project's chief decision-maker. He now shares his leadership as a member of a five-person
steering council. In January 2019, active Python core developers elected Brett Cannon, Nick
Coghlan, Barry Warsaw, Carol Willing and Van Rossum to a five-member "Steering
Council" to lead the project.
Python 2.0 was released on 16 October 2000 with many major new features, including
a cycle-detecting garbage collector and support for Unicode.
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is
not completely backward-compatible. Many of its major features were backported to the
Python 2.6.x and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which
automates (at least partially) the translation of Python 2 code to Python 3.
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern
that a large body of existing code could not easily be forward-ported to Python 3.
2.Anaconda 3
Introduction
Anaconda is a free and open-source distribution of the Python and R programming
languages for scientific computing (data science, machine learning applications, large-scale
data processing, predictive analytics, etc.), that aims to simplify package management and
deployment. The distribution includes data-science packages suitable for Windows, Linux,
and macOS. It is developed and maintained by Anaconda, Inc., which was founded by Peter
Wang and Travis Oliphant in 2012. As an Anaconda, Inc. product, it is also known
as Anaconda Distribution or Anaconda Individual Edition, while other products from the
company are Anaconda Team Edition and Anaconda Enterprise Edition, which are both not
free.
Package versions in Anaconda are managed by the package management system conda. This
package manager was spun out as a separate open-source package as it ended up being useful
on its own and for other things than Python. There is also a small, bootstrap version of
Anaconda called Miniconda, which includes only conda, Python, the packages they depend
on, and a small number of other packages.
History of Anaconda 3
Anaconda distribution comes with 1,500 packages selected from PyPI as well as
the conda package and virtual environment manager. It also includes a GUI, Anaconda
Navigator, as a graphical alternative to the command line interface (CLI).
The big difference between conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science and the
reason conda exists.
When pip installs a package, it automatically installs any dependent Python packages
without checking if these conflict with previously installed packages.
It will install a package
and any of its dependencies regardless of the state of the existing installation.
Because of this,
a user with a working installation of, for example, Google Tensorflow, can find that it stops
working having used pip to install a different package that requires a different version of the
dependent numpy library than the one used by Tensorflow. In some cases, the package may
appear to work but produce different results in detail.
In contrast, conda analyses the current environment including everything currently installed,
and, together with any version limitations specified (e.g. the user may wish to have
Tensorflow version 2.0 or higher), works out how to install a compatible set of
dependencies, and shows a warning if this cannot be done.
Open source packages can be individually installed from the Anaconda repository, Anaconda
Cloud (anaconda.org), or the user's own private repository or mirror, using the conda install command.
Anaconda, Inc. compiles and builds the packages available in the Anaconda repository itself,
and provides binaries for Windows 32/64 bit, Linux 64 bit and MacOS 64-bit. Anything
available on PyPI may be installed into a conda environment using pip, and conda will keep
track of what it has installed itself and what pip has installed.
Custom packages can be made using the conda build command, and can be shared with others by
uploading them to Anaconda Cloud, PyPI or other repositories.
The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python
3.7. However, it is possible to create new environments that include any version of Python
packaged with conda.
In short, conda analyses your current environment (everything you have installed, plus any
version limitations you specify, e.g. you only want TensorFlow >= 2.0) and figures out how to
install compatible dependencies, or it tells you that what you want cannot be done. pip, by
contrast, will just install the package you specify and any dependencies, even if that breaks
other packages.
Conda allows users to easily install different versions of binary software packages and any
required libraries appropriate for their computing platform. Also, it allows users to switch
between package versions and download and install updates from a software repository.
Conda is written in the Python programming language, but can manage projects containing
code written in any language (e.g., R), including multi-language projects. Conda can
install Python, while similar Python-based cross-platform package managers (such
as wheel or pip) cannot.
A popular conda channel for bioinformatics software is Bioconda, which provides multiple
software distributions for computational biology. In fact, the conda package and environment
manager is included in all versions of Anaconda, Miniconda and Anaconda Repository.
Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface included in the Anaconda distribution
that allows users to launch applications and manage conda packages, environments and
channels without using command-line commands. Navigator can search for packages on Anaconda
Cloud or in a local Anaconda Repository, install them in an environment, run the packages and
update them. It is available for Windows, macOS and Linux.
The following applications are available by default in Navigator.
Conda is an open source, language-agnostic package and environment management system that
installs, runs, and updates packages and their dependencies. It was created for Python
programs, but it can package and distribute software for any language (e.g., R), where R is
a programming language and free software environment for statistical computing and graphics
supported by the R Foundation for Statistical Computing. The R language is widely used
among statisticians and data miners for developing statistical software and for data analysis.
Polls, data mining surveys, and studies of scholarly literature databases show substantial
increases in its popularity; as of February 2020, R ranks 13th in the TIOBE index, a measure of
the popularity of programming languages.
A GNU package, the official R software environment is written primarily in C, FORTRAN,
and R itself (thus, it is partially self-hosting) and is freely available under the GNU General
Public License. Pre-compiled executables are provided for various operating systems.
Although R has a command line interface, there are several third-party graphical user
interfaces, such as RStudio, an integrated development environment, and Jupyter, a notebook
interface.
Anaconda Cloud:
Anaconda Cloud is a package management service by Anaconda where you can find, access,
store and share public and private notebooks, environments, and conda and PyPI
packages. Cloud hosts useful Python packages, notebooks and environments for a wide
variety of applications. You do not need to log in, or to have a Cloud account, to search for
public packages or to download and install them.
Project Jupyter:
Project Jupyter is a nonprofit organization created to "develop open-source software, open-
standards, and services for interactive computing across dozens of programming languages".
Spun-off from IPython in 2014 by Fernando Pérez, Project Jupyter supports execution
environments in several dozen languages. Project Jupyter's name is a reference to the three
core programming languages supported by Jupyter, which are Julia, Python and R, and also a
homage to Galileo's notebooks recording the discovery of the moons of Jupiter. Project
Jupyter has developed and supported the interactive computing products Jupyter Notebook,
JupyterHub, and JupyterLab, the next-generation version of Jupyter Notebook.
In 2014, Fernando Pérez announced a spin-off project from IPython called Project Jupyter.
IPython continued to exist as a Python shell and a kernel for Jupyter, while the notebook and
other language-agnostic parts of IPython moved under the Jupyter name. Jupyter is
language-agnostic and supports execution environments (also known as kernels) in several
dozen languages, among which are Julia, R, Haskell, Ruby, and of course Python (via the
IPython kernel).
In 2015, GitHub and Project Jupyter announced native rendering of the Jupyter notebook
file format (.ipynb files) on the GitHub platform.
Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based computational environment
for creating Jupyter notebook documents. The "notebook" term can colloquially make
reference to many different entities, mainly the Jupyter web application, Jupyter Python web
server, or Jupyter document format depending on context. A Jupyter Notebook document is a
JSON document, following a versioned schema, and containing an ordered list of
input/output cells which can contain code, text, mathematics, plots and rich media, usually
ending with the ".ipynb" extension.
A Jupyter Notebook can be converted to a number of open source output formats
(HTML, presentation slides, PDF, Markdown, Python) through "Download As" in the
web interface, via the nbconvert library or "jupyter nbconvert" command line interface in a
shell. To simplify visualisation of Jupyter notebook documents on the web, the library is
provided as a service through NbViewer which can take a URL to any publicly available
notebook document, convert it to HTML on the fly and display it to the user.
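Since a notebook document is just versioned JSON, it can be inspected with the Python standard library alone. A minimal sketch follows; the file name example.ipynb is an assumption.

import json

# A notebook document is plain JSON following the nbformat schema.
with open("example.ipynb", encoding="utf-8") as f:
    nb = json.load(f)

print("nbformat version:", nb["nbformat"])
for i, cell in enumerate(nb["cells"]):
    source = "".join(cell["source"])   # cell source is stored as a list of lines
    print(f"cell {i}: {cell['cell_type']}, {len(source)} characters")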
Jupyter Notebook interface
Jupyter Notebook can connect to many kernels to allow programming in many languages.
By default, Jupyter Notebook ships with the IPython kernel. As of the 2.3 release (October
2014), there are currently 49 Jupyter-compatible kernels for many programming languages,
including Python, R, Julia and Haskell.
The Notebook interface was added to IPython in the 0.12 release (December 2011), renamed
to Jupyter notebook in 2015 (IPython 4.0 – Jupyter 1.0). Jupyter Notebook is similar to the
notebook interface of other programs such as Maple, Mathematica, and SageMath, a
computational interface style that originated with Mathematica in the 1980s.
According
to The Atlantic, Jupyter interest overtook the popularity of the Mathematica notebook
interface in early 2018.
Jupyter kernels
A Jupyter kernel is a program responsible for handling various types of requests (code
execution, code completions, inspection), and providing a reply. Kernels talk to the other
components of Jupyter using ZeroMQ over the network, and thus can be on the same or
remote machines. Unlike many other Notebook-like interfaces, in Jupyter, kernels are not
aware that they are attached to a specific document, and can be connected to many clients at
once. Usually kernels allow execution of only a single language, but there are a couple of
exceptions.
By default, Jupyter ships with IPython as its default kernel, with a reference implementation via
the ipykernel wrapper. Kernels for many languages, of varying quality and features, are
available.
JupyterHub
JupyterHub is a multi-user server for Jupyter Notebooks. It is designed to support many users
by spawning, managing, and proxying many singular Jupyter Notebook servers. While
JupyterHub requires managing servers, third-party services like Jupyo provide an alternative
to JupyterHub by hosting and managing multi-user Jupyter notebooks in the cloud.
JupyterLab
JupyterLab is the next-generation user interface for Project Jupyter. It offers all the familiar
building blocks of the classic Jupyter Notebook (notebook, terminal, text editor, file browser,
rich outputs, etc.) in a flexible and powerful user interface. The first stable release was
announced on February 20, 2018.
Web application
Tomcat has also added user- as well as system-based web application enhancements to support
deployment across a variety of environments. It also tries to manage sessions as well as
applications across the network.
A number of additional components may be used with Apache Tomcat. These components may
be built by users should they need them, or they can be downloaded from one of the mirrors.
Features:
Tomcat 7.x implements the Servlet 3.0 and JSP 2.2 specifications. It requires Java version
1.6, although previous versions have run on Java 1.1 through 1.5. Versions 5 through 6 saw
improvements in garbage collection, JSP parsing, performance and scalability. Native
wrappers, known as "Tomcat Native", are available for Microsoft Windows and Unix for
platform integration. Tomcat 8.x implements the Servlet 3.1 and JSP 2.3 specifications.
Apache Tomcat 8.5.x is intended to replace 8.0.x and includes new features pulled forward
from Tomcat 9.0.x. The minimum Java version and implemented specification versions
remain unchanged.
WINDOWS 10
Windows 10 is a series of personal computer operating systems produced by Microsoft as
part of its Windows NT family of operating systems. It is the successor to Windows 8.1, and
was released to manufacturing on July 15, 2015, and broadly released for retail sale on July
29, 2015. Windows 10 receives new builds on an ongoing basis, which are available at no
additional cost to users, in addition to additional test builds of Windows 10 which are
available to Windows Insiders. Devices in enterprise environments can receive these updates
at a slower pace, or use long-term support milestones that only receive critical updates, such
as security patches, over their ten-year lifespan of extended support.
One of Windows 10's most notable features is support for universal apps, an expansion of
the Metro-style apps first introduced in Windows 8. Universal apps can be designed to run
across multiple Microsoft product families with nearly identical code—
including PCs, tablets, smartphones, embedded systems, Xbox One, Surface Hub and Mixed
Reality. The Windows user interface was revised to handle transitions between a mouse-
oriented interface and a touchscreen-optimized interface based on available input devices—
particularly on 2-in-1 PCs, both interfaces include an updated Start menu which incorporates
elements of Windows 7's traditional Start menu with the tiles of Windows 8. Windows 10
also introduced the Microsoft Edge web browser, a virtual desktop system, a window and
desktop management feature called Task View, support for fingerprint and face
recognition login, new security features for enterprise environments, and DirectX 12.
Windows 10 received mostly positive reviews upon its original release in July 2015. Critics
praised Microsoft's decision to provide a desktop-oriented interface in line with previous
versions of Windows, contrasting the tablet-oriented approach of 8, although Windows 10's
touch-oriented user interface mode was criticized for containing regressions upon the touch-
oriented interface of Windows 8. Critics also praised the improvements to Windows 10's
bundled software over Windows 8.1, Xbox Live integration, as well as the functionality and
capabilities of the Cortana personal assistant and the replacement of Internet Explorer with
Edge. However, media outlets have been critical of changes to operating system behaviours,
including mandatory update installation, privacy concerns over data collection performed by
the OS for Microsoft and its partners and the adware-like tactics used to promote the
operating system on its release.
Although Microsoft's goal to have Windows 10 installed on over a billion devices within
three years of its release had failed, it still had an estimated usage share of 60% of all the
Windows versions on traditional PCs, and thus 47% of traditional PCs were running
Windows 10 by September 2019. Across all platforms (PC, mobile, tablet and console), 35%
of devices run some version of Windows (Windows 10 or older).
Microsoft Excel
Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android
and iOS. It features calculation, graphing tools, pivot tables, and a macro programming
language called Visual Basic for Applications. It has been a very widely applied spreadsheet
for these platforms, especially since version 5 in 1993, and it has replaced Lotus 1-2-3 as the
industry standard for spreadsheets. Excel forms part of the Microsoft Office suite of
software.
Number of rows and columns: Versions of Excel up to 7.0 had a limitation in the size of their
data sets of 16K (2^14 = 16,384) rows. Versions 8.0 through 11.0 could handle 64K
(2^16 = 65,536) rows and 256 (2^8, last column labelled 'IV') columns. Version 12.0 onwards,
including the current version 16.x, can handle over 1M (2^20 = 1,048,576) rows and
16,384 (2^14, last column labelled 'XFD') columns.
Microsoft Excel up until 2007 version used a proprietary binary file format called Excel
Binary File Format (.XLS) as its primary format. Excel 2007 uses Office Open XML as its
primary file format, an XML-based format that followed after a previous XML-based format
called "XML Spreadsheet" ("XMLSS"), first introduced in Excel 2002.
Although supporting and encouraging the use of new XML-based formats as replacements,
Excel 2007 remained backwards-compatible with the traditional, binary formats. In addition,
most versions of Microsoft Excel can read CSV, DBF, SYLK, DIF, and other legacy
formats. Support for some older file formats was removed in Excel 2007. The file formats
were mainly from DOS-based programs.
Microsoft originally marketed a spreadsheet program called Multiplan in 1982. Multiplan
became very popular on CP/M systems, but on MS-DOS systems it lost popularity to Lotus
1-2-3. Microsoft released the first version of Excel for the Macintosh on September 30,
1985, and the first Windows version was 2.05 (to synchronize with the Macintosh version
2.2) in November 1987. Lotus was slow to bring 1-2-3 to Windows and by the early 1990s,
Excel had started to outsell 1-2-3 and helped Microsoft achieve its position as a leading PC
software developer. This accomplishment solidified Microsoft as a valid competitor and
showed its future of developing GUI software. Microsoft maintained its advantage with
regular new releases, every two years or so.
Excel 2.0 is the first version of Excel for the Intel platform. Versions prior to 2.0 were only
available on the Apple Macintosh.
Excel 2.0 (1987) The first Windows version was labeled "2" to correspond to the Mac
version. This included a run-time version of Windows.
BYTE in 1989 listed Excel for Windows as among the "Distinction" winners of the BYTE
Awards. The magazine stated that the port of the "extraordinary" Macintosh version
"shines", with a user interface as good as or better than the original.
Excel 3.0 (1990) Included toolbars, drawing capabilities, outlining, add-in support, 3D
charts, and many more new features.
Also, an easter egg in Excel 4.0 reveals a hidden animation of a dancing set of numbers 1
through 3, representing Lotus 1-2-3, which was then crushed by an Excel logo.
Excel 5.0 (1993) With version 5.0, Excel has included Visual Basic for Applications (VBA),
a programming language based on Visual Basic which adds the ability to automate tasks in
Excel and to provide user-defined functions (UDF) for use in worksheets. VBA is a powerful
addition to the application and includes a fully featured integrated development
environment (IDE). Macro recording can produce VBA code replicating user actions, thus
allowing simple automation of regular tasks. VBA allows the creation of forms and
in-worksheet controls to communicate with the user. The language supports use (but not
creation) of ActiveX (COM) DLL's; later versions add support for class modules allowing
the use of basic object-oriented programming techniques.
The automation functionality provided by VBA made Excel a target for macro viruses. This
caused serious problems until antivirus products began to detect these
viruses. Microsoft belatedly took steps to prevent the misuse by adding the ability to disable
macros completely, to enable macros when opening a workbook or to trust all macros signed
using a trusted certificate.
Versions 5.0 to 9.0 of Excel contain various Easter eggs, including a "Hall of Tortured
Souls", although since version 10 Microsoft has taken measures to eliminate such
undocumented features from their products.
5.0 was released in a 16-bit x86 version for Windows 3.1 and later in a 32-bit version for NT
3.51 (x86/Alpha/PowerPC).
Excel 12.0 (2007) was included in Office 2007. This release was a major upgrade from the previous version.
Similar to other updated Office products, Excel in 2007 used the new Ribbon menu system.
This was different from what users were used to, and was met with mixed reactions. One
study reported fairly good acceptance by users except highly experienced users and users of
word processing applications with a classical WIMP interface, but was less convinced in
terms of efficiency and organization. However, an online survey reported that a majority of
respondents had a negative opinion of the change, with advanced users being "somewhat
more negative" than intermediate users, and users reporting a self-estimated reduction in
productivity.
Added functionality included the SmartArt set of editable business diagrams. Also added
was an improved management of named variables through the Name Manager, and much-
improved flexibility in formatting graphs, which allow (x, y) coordinate labeling and lines of
arbitrary weight. Several improvements to pivot tables were introduced.
Also like other office products, the Office Open XML file formats were introduced,
including .xlsm for a workbook with macros and .xlsx for a workbook without macros.
Specifically, many of the size limitations of previous versions were greatly increased. To
illustrate, the number of rows was now 1,048,576 (2^20) and the number of columns 16,384
(2^14; the far-right column is labelled XFD). This changes what is a valid A1 reference versus a named range. This
version made more extensive use of multiple cores for the calculation of spreadsheets;
however, VBA macros are not handled in parallel and XLL add-ins were only executed in
parallel if they were thread-safe and this was indicated at registration.
3.2.4 Hardware requirement
Server-side hardware
1.Hardware recommended by all software needed
2.Communication hardware to serve client requests
Client-side hardware
1.Hardware recommended by respective client’s operating system and web browser.
2. Communication hardware to communicate with the server.
Others
An Intel Core 2 Duo CPU, 4 GB of RAM (minimum), a hard disk of any capacity, a 105-key
keyboard, an optical mouse and a colour display are required.
Operating System
As our system is platform-independent, there are no operating system constraints for using
this software. Windows and Linux operating systems are considered to be
extremely stable platforms. One of the advantages of using the Linux operating system is the
speed at which performance enhancements and security patches are made available since it is
an open source product.
Web Server
The available web server’s platforms are IIS and Apache. Since Apache is open source,
software security patches tend to be released faster. One of the downsides to IIS is its
vulnerability to virus attacks, something that Apache rarely has problems with. Apache
offers PHP, a language considered to be the C of the web. The next part is the end-user
environment. As explained before, we assume that the end users are not very familiar with
computers and are not very interested in moving their activities from the manual system to the
automated system.
Web Browser
Let's play word association, just like when a psychologist asks you what comes to mind
when you hear certain words: What do you think when you hear the words "Opera. Safari.
Chrome. Firefox."
If you think of the Broadway play version of "The Lion King," maybe it is time to see a
psychologist. However, if you said, "Internet browsers," you're spot on. That's because the
leading Internet Browsers are:
 Google Chrome
 Mozilla Firefox
 Apple Safari
 Microsoft Internet Explorer
 Microsoft Edge
 Opera
 Maxthon
And that order pretty much lines up with how they're ranked in terms of market share and
popularity...today. Browsers come and go. Ten years ago, Netscape Navigator was a well-
known browser: Netscape is long gone today. Another, called Mosaic, is considered the first
modern browser—it was discontinued in 1997.
So, what exactly is a browser?
Definition:
A browser, short for web browser, is the software application (a program) that you're using
right now to search for, reach and explore websites. Whereas Excel is a program for
spreadsheets and Word® a program for writing documents, a browser is a program for
Internet exploring (which is where that name came from).
Browsers don't get talked about much. A lot of people simply click on the "icon" on our
computers that take us to the Internet—and that's as far as it goes. And in a way, that's
enough. Most of us simply get in a car and turn the key...we don't know what kind of engine
we have or what features it has...it takes us where we want to go. That's why when it comes
to computers:
 There are some computer users that can't name more than one or two browsers
 Many of them don't know they can switch to another browser for free
 There are some who go to Google's webpage to "google" a topic and think that
Google is their browser.
3.3 Content diagram of Project
Chapter-4
DESIGN
4.1 Introduction
The UML may be used to visualize, specify, construct and document the artifacts of a
software-intensive system.
The UML is only a language and so is just one part of a software development method. The
UML is process independent, although optimally it should be used in a process that is use
case driven, architecture-centric, iterative, and incremental.
What is UML:
The Unified Modeling Language (UML) is a robust notation that we can use to build OOAD
models. It is called so since it was the unification of ideas from the different methodologies
of the three amigos: Booch, Rumbaugh and Jacobson.
The UML is the standard language for visualizing, specifying, constructing, and
documenting the artifacts of a software-intensive system. It can be used with all processes,
throughout the development life cycle, and across different implementation technologies.
 The UML combines the best from:
o Data Modeling concepts
o Business Modeling (work flow)
o Object Modeling
o Component Modeling
 The UML may be used to:
o Display the boundary of a system and its major functions using use cases and
actors
o Illustrate use case realizations with interaction diagrams
o Represent a static structure of a system using class diagrams
o Model the behavior of objects with state transition diagrams
o Reveal the physical implementation architecture with component &
deployment diagrams
o Extend the functionality of the system with stereotypes
Evolution of UML
One of the methods was the Object Modeling Technique (OMT), devised by James
Rumbaugh and others at General Electric. It consists of a series of models - use case, object,
dynamic, and functional - that combine to give a full view of a system.
The Booch method was devised by Grady Booch and developed the practice of analyzing a
system as a series of views. It emphasizes analyzing the system from both a macro
development view and micro development view and it was accompanied by a very detailed
notation.
The Object-Oriented Software Engineering (OOSE) method was devised by Ivar Jacobson
and focused on the analysis of system behavior. It advocated that at each stage of the process
there should be a check to see that the requirements of the user were being met.
The UML is composed of three different parts:
1. Model elements 2. Diagrams 3. Views
 Model Elements:
o The model elements represent basic object-oriented concepts such as classes, objects,
and relationships.
o Each model element has a corresponding graphical symbol to represent it in the
diagrams.
 Diagrams:
o Diagrams portray different combinations of model elements.
o For example, the class diagram represents a group of classes and the relationships,
such as association and inheritance, between them.
o The UML provides nine types of diagram - use case, class, object, state chart,
sequence, collaboration, activity, component, and deployment
Views:
o Views provide the highest level of abstraction for analyzing the system.
o Each view is an aspect of the system that is abstracted to a number of related UML
diagrams.
o Taken together, the views of a system provide a picture of the system in its entirety.
o In the UML, the five main views of the system are: User, Structural, Behavioral,
Implementation, and Environment.
Fig-1 UML Views
In addition to model elements, diagrams, and views, the UML provides mechanisms for
adding comments, information, or semantics to diagrams. And it provides mechanisms to
adapt or extend itself to a particular method, software system, or organization.
Extensions of UML:
o Stereotypes can be used to extend the UML notational elements
o Stereotypes may be used to classify and extend associations, inheritance
relationships, classes, and components
Examples:
 Class stereotypes : boundary, control, entity
 Inheritance stereotypes : extend
 Dependency stereotypes : uses
 Component stereotypes : subsystem
Use Case Diagrams:
The use case diagram presents an outside view of the system
The use-case model consists of use-case diagrams.
o The use-case diagrams illustrate the actors, the use cases, and their relationships.
o Use cases also require a textual description (use case specification), as the visual
diagrams can't contain all of the information that is necessary.
o The customers, the end-users, the domain experts, and the developers all have an
input into the development of the use-case model.
o Creating a use-case model involves the following steps:
1. defining the system
2. identifying the actors and the use cases
3. describing the use cases
4. defining the relationships between use cases and actors.
5. defining the relationships between the use cases
Use Case:
Definition: is a sequence of actions a system performs that yields an observable result of
value to a particular actor.
Use Case Naming:
Use Case should always be named in business terms, picking the words from the vocabulary
of the particular domain for which we are modeling the system. It should be meaningful to
the user because use case analysis is always done from the user's perspective. Names are
usually verbs or short verb phrases.
Use Case Specification shall document the following:
o Brief Description
o Precondition
o Main Flow
o Alternate Flow
o Exceptional flows
o Post Condition
o Special Requirements
Notation of use case
Actor:
Definition: someone or something outside the system that interacts with the
system
o An actor is external - it is not actually part of what we are building, but an interface
needed to support or use it.
o It represents anything that interacts with the system.
Notation for Actor
Relation: Two important types of relation used in Use Case Diagram
 Include: An Include relationship shows behavior that is common to
one or more use cases (Mandatory).
Include relation results when we extract the common sub flows and
make it a use case
Extend: An extend relationship shows optional behavior (Optional).
An extend relation usually results when we add a more specialized
feature to an already existing one; we say that use case B extends the
functionality of use case A.
A system boundary rectangle separates the system from the external actors.
An extend relationship indicates that one use case is a variation of another. Extend notation
is a dotted line, labeled <<extend>>, with an arrow toward the base case. The extension
point, which determines when the extended case is appropriate, is written inside the base
case.
Class Diagrams:
Class diagram shows the existence of classes and their relationships in the structural view of
a system.
UML modeling elements in class diagram:
o Classes and their structure and behavior
o Relationships
 Association
 Aggregation
 Composition
 Dependency
 Generalization / Specialization (inheritance relationships)
 Multiplicity and navigation indicators
 Role names
A class describes properties and behavior of a type of object.
o Classes are found by examining the objects in sequence and collaboration diagram
o A class is drawn as a rectangle with three compartments
o Classes should be named using the vocabulary of the domain. Naming standards
should be created, e.g., all classes are singular nouns starting with a capital letter
o The behavior of a class is represented by its operations. Operations may be found by
examining interaction diagrams
o The structure of a class is represented by its attributes. Attributes may be found by
examining class definitions, the problem requirements, and by applying domain
knowledge
Notation:
Class information: visibility and scope
The class notation is a 3-piece rectangle with the class name, attributes, and operations.
Attributes and operations can be labeled according to access and scope.
Relationship:
Association:
o Association represents the physical or conceptual connection between two or more
objects
o An association is a bi-directional connection between classes
o An association is shown as a line connecting the related classes
Aggregation:
o An aggregation is a stronger form of relationship where the relationship is between a
whole and its parts
o It is entirely conceptual and does nothing more than distinguishing a ‘whole’ from
the ‘part’.
o It doesn’t link the lifetime of the whole and its parts
o An aggregation is shown as a line connecting the related classes with a diamond next
to the class representing the whole
Composition:
Composition is a form of aggregation with strong ownership and coincident
lifetime as the part of the whole.
Multiplicity
o Multiplicity defines how many objects participate in relationships
o Multiplicity is the number of instances of one class related to ONE instance of the
other class
This table gives the most common multiplicities
For each association and aggregation, there are two multiplicity decisions to make:
one for each end of the relationship
Although associations and aggregations are bi-directional by default, it is often
desirable to restrict navigation to one direction
If navigation is restricted, an arrowhead is added to indicate the direction of the
navigation
Dependency:
o A dependency relationship is a weaker form of relationship showing a relationship
between a client and a supplier where the client does not have semantic knowledge of
the supplier.
o A dependency is shown as a dashed line pointing from the client to the supplier.
Generalization / Specialization (inheritance relationships):
o It is a relationship between a general thing (called the super class or the
parent) and a more specific kind of that thing (called the subclass(es) or the
child).
o An association class is an association that also has class properties (or a class
has association properties)
o A constraint is a semantic relationship among model elements that specifies
conditions and propositions that must be maintained as true: otherwise, the
system described by the model is invalid.
o An interface is a specifier for the externally visible operations of a class
without specification of internal structure. An interface is formally equivalent
to abstract class with no attributes and methods, only abstract operations.
o A qualifier is an attribute or set of attributes whose values serve to partition
the set of instances associated with an instance across an association.
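To make these relationship kinds concrete, the following hypothetical Python classes (not part of the project's own code) illustrate association, aggregation, composition, inheritance and multiplicity:

class Engine:                          # part in a composition
    def __init__(self, power_kw: int):
        self.power_kw = power_kw

class Wheel:                           # part in an aggregation
    pass

class Person:                          # associated class
    pass

class Vehicle:                         # superclass (generalization)
    def __init__(self, wheels):
        self.engine = Engine(90)       # composition: the Vehicle creates and owns its Engine
        self.wheels = wheels           # aggregation: the wheels exist independently of the Vehicle

class Car(Vehicle):                    # specialization / inheritance
    def __init__(self, wheels, driver: Person):
        super().__init__(wheels)
        self.driver = driver           # association: a conceptual link between Car and Person

# Multiplicity: this one Car instance is related to exactly four Wheel instances.
car = Car(wheels=[Wheel() for _ in range(4)], driver=Person())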
Interaction Diagrams:
o Interaction diagrams are used to model the dynamic behavior of the system
o Interaction diagrams help us to identify the classes and their methods
o Interaction diagrams describe how use cases are realized as interactions
among objects
o They show classes, objects, actors and the messages between them needed to achieve the
functionality of a use case
There are two types of interaction diagrams:
1. Sequence Diagram
2. Collaboration Diagram
1. Sequence Diagram:
A sequence diagram depicts the interaction between objects in sequential order, i.e. the
order in which these interactions take place. The terms event diagram or event scenario are
also used to refer to a sequence diagram. Sequence diagrams describe how and in what
order the objects in a system function. These diagrams are widely used by business analysts and
software developers to document and understand requirements for new and existing systems.
Sequence Diagram Notations:
1. Actors – An actor in a UML diagram represents a type of role where it interacts with
the system and its objects. It is important to note here that an actor is always outside the
scope of the system we aim to model using the UML diagram.
We use actors to depict various roles including human users and other external
subjects. We represent an actor in a UML diagram using a stick person notation. We
can have multiple actors in a sequence diagram.
For example, the user in a seat reservation system is shown as an actor; it
exists outside the system and is not a part of the system.
2. Lifelines – A lifeline is a named element which depicts an individual participant in a
sequence diagram. So basically, each instance in a sequence diagram is represented by
a lifeline. Lifeline elements are located at the top in a sequence diagram. The standard
in UML for naming a lifeline is Instance Name : Class Name.
3. Messages – Communication between objects is depicted using messages. The messages
appear in a sequential order on the lifeline. We represent messages using arrows.
Lifelines and messages form the core of a sequence diagram.
Messages can be broadly classified into the following
 Synchronous messages – A synchronous message waits for a reply before the
interaction can move forward. The sender waits until the receiver has completed
the processing of the message. The caller continues only when it knows that the
receiver has processed the previous message i.e. it receives a reply message. A
large number of calls in object-oriented programming are synchronous. We use a
solid arrow head to represent a synchronous message.
 Asynchronous Messages – An asynchronous message does not wait for a reply
from the receiver. The interaction moves forward irrespective of the receiver
processing the previous message or not. We use an open (line) arrowhead to represent
an asynchronous message.
 Create message – We use a Create message to instantiate a new object in the
sequence diagram. There are situations when a particular message call requires
the creation of an object. It is represented by a dashed arrow labelled with the
word "create" to specify that it is the create message symbol.
For example, the creation of a new order on an e-commerce website would
require a new object of the Order class to be created.
 Delete Message – We use a Delete Message to delete an object. When an object
is deallocated memory or is destroyed within the system, we use the Delete
Message symbol. It destroys the occurrence of the object in the system. It is
represented by an arrow terminating with an X.
For example, in the scenario below, when the order is received by the user, the
object of the Order class can be destroyed.
 . Self-Message – Certain scenarios might arise where the object needs to send a
message to itself. Such messages are called Self Messages and are
represented with a U-shaped arrow
 Reply Message – Reply messages are used to show the message being sent from
the receiver to the sender. We represent a return/reply message using an open
arrowhead with a dotted line. The interaction moves forward only when a reply
message is sent by the receiver.
2. Collaboration diagram:
It shows the interaction of the objects and also the group of all messages sent or received
by an object. This allows us to see the complete set of services that an object must
provide.
The following collaboration diagram realizes a scenario of reserving a copy of a book in a
library.
 Difference between the Sequence Diagram and Collaboration Diagram
o Sequence diagrams emphasize the temporal aspect of a scenario - they
focus on time.
o Collaboration diagrams emphasize the spatial aspect of a scenario - they focus
on how objects are linked.
Activity Diagram:
An activity diagram is essentially a fancy flowchart. An activity diagram
focuses on the flow of activities involved in a single process and shows
how those activities depend on one another.
1. initial State – The starting state before an activity takes place is depicted using the
initial state.
2. Action or Activity State – An activity represents execution of an action on objects or
by objects. We represent an activity using a rectangle with rounded corners. Basically
any action or event that takes place is represented using an activity.
3. Action Flow or Control flows – Action flows or Control flows are also referred to as
paths and edges. They are used to show the transition from one activity state to another.
4. Decision node and Branching – When we need to make a decision before deciding the
flow of control, we use the decision node.
5. Guards – A Guard refers to a statement written next to a decision node on an arrow
sometimes within square brackets.
6. Fork – Fork nodes are used to split the flow into concurrent activities.
7. Join – Join nodes are used to bring concurrent activities back together into one flow. A join
notation has two or more incoming edges and one outgoing edge.
8. Merge or Merge Event – Scenarios arise when activities which are not being executed
concurrently have to be merged. We use the merge notation for such scenarios. We can
merge two or more activities into one if the control proceeds onto the next activity
irrespective of the path chosen.
9. Swim lanes – Swim lanes group related activities into one column or one row. Swim
lanes can be vertical or horizontal, and they add modularity to the activity diagram. It is not
mandatory to use swim lanes, but they usually give more clarity to the diagram.
It is similar to creating a function in a program: not mandatory, but a
recommended practice.
Component Diagram:
Describes the organization of, and dependencies between, software implementation components.
Components are distributable physical units, e.g. source code and object code.
Deployment Diagram:
Describes the configuration of processing resource elements and the mapping of software
implementation components onto them.
It contains components (e.g. object code, source code) and nodes (e.g. printer, database, client
machine).
4.2 UML Diagram
4.2.1 Class diagram
Diagram 1: Class diagram for user and admin
4.2.2 Use case diagram
Diagram 2: Use case diagram for admin
4.2.3 Activity diagram
Diagram 3: Activity diagram for admin
4.2.4 Sequence diagram
Diagram 4: Sequence diagram for admin
4.2.5 Collaboration diagram
Diagram 5: Collaboration diagram for admin
4.2.6 Component diagram
Diagram 6: Component diagram for admin
4.2.7 Deployment diagram
Diagram 7: Deployment diagram for local server
Chapter-5
IMPLEMENTATION METHODS & RESULTS
5.1 Introduction :
We implement our project with Anaconda 3 and Jupyter Notebook. Anaconda is a free and
open-source distribution of the Python and R programming languages for scientific computing
(data science, machine learning applications, large-scale data processing, predictive analytics,
etc.) that aims to simplify package management and deployment. Package versions are managed
by the package management system conda. The Anaconda distribution includes data-science
packages suitable for Windows, Linux, and macOS.
The Anaconda distribution comes with 1,500 packages selected from PyPI as well as the conda
package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as
a graphical alternative to the command line interface (CLI).
Conda analyzes your current environment, everything you have installed, and any version
limitations you specify (e.g. you only want tensorflow >= 2.0), and figures out how to install
compatible dependencies, or it tells you that what you want cannot be done.
pip, by contrast, will just install the package you specify and any dependencies, even if that
breaks other packages.
Conda allows users to easily install different versions of binary software packages and any
required libraries appropriate for their computing platform. Also, it allows users to switch
between package versions and download and install updates from a software repository.
Conda is written in the Python programming language, but can manage projects containing
code written in any language (e.g., R), including multi-language projects. Conda can
install Python, while similar Python-based cross-platform package managers (such
as wheel or pip) cannot.
A popular conda channel for bioinformatics software is Bioconda, which provides multiple
software distributions for computational biology. In fact, the conda package and environment
manager is included in all versions of Anaconda, Miniconda and Anaconda Repository.
The big difference between conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science and the
reason conda exists.
When pip installs a package, it automatically installs any dependent Python packages
without checking whether these conflict with previously installed packages. It will install a package
and any of its dependencies regardless of the state of the existing installation. Because of this,
a user with a working installation of, for example, Google TensorFlow can find that it stops
working after using pip to install a different package that requires a different version of the
dependent numpy library than the one used by TensorFlow. In some cases, the package may
appear to work but produce different results in detail.
5.2 Explanation of Key functions
1.Importing and Merging Data
Python code in one module gains access to the code in another module by the process
of importing it. The import statement is the most common way of invoking the import
machinery, but it is not the only way. Functions such as importlib.import_module() and the
built-in __import__() can also be used to invoke the import machinery.
The import statement combines two operations; it searches for the named module, then it
binds the result of that search to a name in the local scope. The search operation of
the import statement is defined as a call to the __import__() function, with the appropriate
arguments. The return value of __import__() is used to perform the name binding operation of
the import statement. See the import statement documentation for the exact details of that name
binding operation.
Contrary to when you merge new cases, merging in new variables requires the IDs
for each case in the two files to be the same, but the variable names should be different. In
this scenario, which is sometimes referred to as augmenting your data (or in SQL, “joins”) or
merging data by columns (i.e. you’re adding new columns of data to each row), you’re
adding in new variables with information for each existing case in your data file. As with
merging new cases where not all variables are present, the same thing applies if you merge in
new variables where some cases are missing – these should simply be given blank values.
Fig-2 Merging data sets
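As a minimal sketch of this step (assuming the three CSV files used later in this report, all sharing a customerID column; the file paths are illustrative), the pandas merge looks like this:
import pandas as pd
churn_data = pd.read_csv("churn_data.csv")        # illustrative paths
customer_data = pd.read_csv("customer_data.csv")
internet_data = pd.read_csv("internet_data.csv")
# inner join on the common ID: each merge adds new columns for every existing case
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')
telecom = pd.merge(df_1, internet_data, how='inner', on='customerID')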
2.Data volume
Since we don’t know in advance which features could be useful to predict churn, we had to work
on all the data that reflect customer behavior in general. We used data sets related to
calls, SMS, MMS, and the internet, with all related information such as complaints, network
data, IMEI, charging, and others. The data contained transactions for all customers during the
nine months before the prediction baseline. The size of this data was more than 70 terabytes,
and we couldn’t perform the needed feature engineering phase using traditional databases.
3.Data variety
The data used in this research is collected from multiple systems and databases. Each source
generates data in a different type of file: structured, semi-structured (XML, JSON) or
unstructured (CSV, text). Dealing with these kinds of data types is very hard without a big
data platform, whereas with one we can work on all the previous data types without making any
modification or transformation. By using the big data platform, we no longer have any
problem with the size of these data or the format in which the data are represented.
4.Unbalanced data set
The generated data set was unbalanced since it is a special case of the classification problem
where the distribution of a class is not usually homogeneous with other classes. The
dominant class is called the basic class, and the other is called the secondary class. The data
set is unbalanced if one of its categories is 10% or less compared to the other one [18].
Although machine learning algorithms are usually designed to improve accuracy by reducing
error, not all of them take class balance into account, and that may give bad results [18].
In general, classes need to be balanced so that they are given the same importance during
training.
We found that Syria Tel data set was unbalanced since the percentage of the secondary class
that represents churn customers is about 5% of the whole data set.
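A quick sketch of checking this imbalance (assuming a binary Churn column coded as 1 for churn and 0 otherwise, as in the data set prepared below); class weighting is one optional remedy:
# proportion of each class; the churn class is the minority
print(telecom['Churn'].value_counts(normalize=True))
# optional remedy: weight classes inversely to their frequency during training
from sklearn.linear_model import LogisticRegression
logreg_balanced = LogisticRegression(class_weight='balanced')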
5.Extensive features
The collected data had a very large number of columns, since there is a column for each service, product, and
offer related to calls, SMS, MMS, and the internet, in addition to columns related to personal
and demographic information. If we use all these data sources, the number of
columns for each customer before the data is processed will exceed ten thousand.
6.Missing values
There is a representation of each service and product for each customer. Missing values may
occur because not all customers have the same subscription. Some of them may have a
number of services and others may have something different. In addition, there are some
columns related to system configurations and these columns have only null value for all
customers.
Fig-3 Imputing missing values
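A minimal sketch of inspecting and handling missing values with pandas (column names follow the notebook below; dropping rows and median imputation are both shown, the second only as a commented alternative):
# number and percentage of nulls per column
print(telecom.isnull().sum())
print(round(100 * telecom.isnull().sum() / len(telecom.index), 2))
# drop the rows whose TotalCharges could not be parsed ...
telecom = telecom[~telecom['TotalCharges'].isnull()]
# ... or impute them with the median instead:
# telecom['TotalCharges'] = telecom['TotalCharges'].fillna(telecom['TotalCharges'].median())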
7.Checking the Churn Rate
The most basic way you can calculate churn for a given month is to check how many
customers you have at the start of the month and how many of those customers leave by the
end of the month. Once you have both numbers, divide the number of customers that left by
the number of customers at the start of the month.
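For example, a small sketch of this arithmetic (the customer counts are made-up numbers):
customers_at_start = 1000   # hypothetical customers at the start of the month
customers_lost = 50         # hypothetical customers who left during the month
churn_rate = customers_lost / customers_at_start
print("Monthly churn rate: {:.1%}".format(churn_rate))   # prints 5.0%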
8.Splitting Data into Training and Test Sets
As noted earlier, the data we use is usually split into training data and test data. The training
set contains a known output, and the model learns on this data in order to generalize to
other data later on. We keep the test data set (or subset) in order to test our model’s predictions
on data it has not seen.
Fig-4 Dividing into training and testing sets
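A sketch of this split with scikit-learn, mirroring the 70/30 proportions used in the notebook below (random_state is just an arbitrary seed for reproducibility):
from sklearn.model_selection import train_test_split
X = telecom.drop(['Churn', 'customerID'], axis=1)   # feature matrix
y = telecom['Churn']                                # target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=100)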
9.Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables.
Each cell in the table shows the correlation between two variables. A correlation matrix is
used to summarize data, as an input into a more advanced analysis, and as a diagnostic for
advanced analyses.
Typically, a correlation matrix is “square”, with the same variables shown in the rows and
columns. I've shown an example below. This shows correlations between the stated
importance of various things to people. The line of 1.00s going from the top left to the
bottom right is the main diagonal, which shows that each variable always perfectly
correlates with itself. This matrix is symmetrical: the correlations shown above
the main diagonal are a mirror image of those below the main diagonal.
Fig-5 Correlation Matrix
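A minimal sketch of computing and visualising such a matrix for the prepared dataframe (seaborn's heatmap is one convenient option; newer pandas versions may require numeric_only=True in corr()):
import matplotlib.pyplot as plt
import seaborn as sns
corr = telecom.corr()             # pairwise correlation coefficients
plt.figure(figsize=(20, 10))
sns.heatmap(corr, annot=True)     # 1.00 appears along the main diagonal
plt.show()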
10.Feature Selection Using RFE
RFE stands for Recursive Feature Elimination. It selects features by recursively fitting a
model (here, logistic regression), ranking the features by importance, removing the weakest
feature, and repeating the process until the desired number of features remains. In this
project, RFE is used to narrow the large set of dummy and numeric variables down to the 13
predictors used to build the final churn model.
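A sketch of RFE with scikit-learn, keeping the 13 predictors used later in this report (recent scikit-learn versions spell the argument n_features_to_select; older versions also accept it positionally, as in the notebook below):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=13)   # keep the 13 strongest predictors
rfe = rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 marks a selected feature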
11.Making Predictions
Once the logistic regression model has been trained on the training set, it is used to make
predictions on the test set. For each customer, the model outputs the probability of churn;
a probability cutoff is then applied to convert these probabilities into hard class labels
(1 for churn, 0 for no churn). A default cutoff of 0.5 is used first, and later in this report the
cutoff is tuned (to about 0.3) by examining accuracy, sensitivity and specificity at different
thresholds. The predicted labels are then compared against the actual churn values of the
test customers to judge how well the model generalizes to unseen data.
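A short sketch of this step, assuming the fitted logistic regression model (logsk) and the selected columns (col) from the notebook below:
# predicted probability of churn for each test customer
churn_prob = logsk.predict_proba(X_test[col])[:, 1]
# apply a probability cutoff to obtain hard 0/1 class labels
predicted = (churn_prob > 0.5).astype(int)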
12.Model Evaluation
Model evaluation is an integral part of the model development process. It helps to
find the model that best represents our data and to judge how well the chosen model will work in the
future. To avoid overfitting, evaluation is performed on a test set (not seen by the model) to measure
model performance.
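A minimal evaluation sketch on the held-out test set, assuming y_test and the predicted labels from the previous step:
from sklearn import metrics
print(metrics.confusion_matrix(y_test, predicted))   # rows: actual, columns: predicted
print(metrics.accuracy_score(y_test, predicted))     # overall accuracy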
13.ROC Curve
A receiver operating characteristic curve, or ROC curve, is a graphical plot that
illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is
varied.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings. The true-positive rate is also known
as sensitivity, recall or probability of detection in machine learning. The false-positive rate is
also known as probability of false alarm and can be calculated as (1 − specificity). It can also
be thought of as a plot of the power as a function of the Type I Error of the decision rule
(when the performance is calculated from just a sample of the population, it can be thought
of as estimators of these quantities). The ROC curve is thus the sensitivity or recall as a
function of fall-out. In general, if the probability distributions for both detection and false
alarm are known, the ROC curve can be generated by plotting the cumulative distribution
function (area under the probability distribution from minus infinity to the discrimination
threshold) of the detection probability on the y-axis versus the cumulative distribution
function of the false-alarm probability on the x-axis.
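A sketch of plotting the ROC curve from the predicted churn probabilities (probability scores, not hard labels, should be passed so the threshold can vary):
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, churn_prob)
auc = roc_auc_score(y_test, churn_prob)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], 'k--')        # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()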
Histogram:
A histogram is a display of statistical information that uses rectangles to show the frequency
of data items in successive numerical intervals of equal size. In the most common form of
histogram, the independent variable is plotted along the horizontal axis and the dependent
variable is plotted along the vertical axis. The data appears as colored or shaded rectangles of
variable area.
The illustration, below, is a histogram showing the results of a final exam given to a
hypothetical class of students. Each score range is denoted by a bar of a certain color. If this
histogram were compared with those of classes from other years that received the same test
from the same professor, conclusions might be drawn about intelligence changes among
students over the years. Conclusions might also be drawn concerning the improvement or
decline of the professor's teaching ability with the passage of time. If this histogram were
compared with those of other classes in the same semester who had received the same final
exam but who had taken the course from different professors, one might draw conclusions
about the relative competence of the professors.
Fig-6 Histogram
Some histograms are presented with the independent variable along the vertical axis and the
dependent variable along the horizontal axis. That format is less common than the one shown
here.
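A small matplotlib sketch of this kind of histogram, using made-up exam scores:
import matplotlib.pyplot as plt
scores = [52, 61, 64, 68, 71, 73, 75, 78, 80, 83, 86, 90, 94]   # hypothetical final-exam scores
plt.hist(scores, bins=5, edgecolor='black')   # frequency of scores per interval
plt.xlabel('Score range')
plt.ylabel('Number of students')
plt.show()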
5.3 Screenshots or output screens
1.Code
# In[1]:
import pandas as pd
import numpy as np
# In[2]:
churn_data = pd.read_csv("OneDrive/Desktop/project datasets/churn_data.csv")
customer_data = pd.read_csv("OneDrive/Desktop/project datasets/customer_data.csv")
internet_data = pd.read_csv("OneDrive/Desktop/project datasets/internet_data.csv")
# In[3]:
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')
# In[4]:
telecom = pd.merge(df_1, internet_data, how='inner', on='customerID')
# In[5]:
telecom.head()
# In[6]:
telecom.describe()
# In[7]:
telecom.info()
# In[8]:
telecom['PhoneService'] = telecom['PhoneService'].map({'Yes': 1, 'No': 0})
telecom['PaperlessBilling'] = telecom['PaperlessBilling'].map({'Yes': 1, 'No': 0})
telecom['Churn'] = telecom['Churn'].map({'Yes': 1, 'No': 0})
telecom['Partner'] = telecom['Partner'].map({'Yes': 1, 'No': 0})
telecom['Dependents'] = telecom['Dependents'].map({'Yes': 1, 'No': 0})
# In[9]:
# Creating a dummy variable for the variable 'Contract' and dropping the first one.
cont = pd.get_dummies(telecom['Contract'],prefix='Contract',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,cont],axis=1)
# Creating a dummy variable for the variable 'PaymentMethod' and dropping the first one.
pm = pd.get_dummies(telecom['PaymentMethod'],prefix='PaymentMethod',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,pm],axis=1)
# Creating a dummy variable for the variable 'gender' and dropping the first one.
gen = pd.get_dummies(telecom['gender'],prefix='gender',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,gen],axis=1)
# Creating a dummy variable for the variable 'MultipleLines' and dropping the first one.
ml = pd.get_dummies(telecom['MultipleLines'],prefix='MultipleLines')
# dropping MultipleLines_No phone service column
ml1 = ml.drop(['MultipleLines_No phone service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ml1],axis=1)
# Creating a dummy variable for the variable 'InternetService' and dropping the first one.
iser = pd.get_dummies(telecom['InternetService'],prefix='InternetService',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,iser],axis=1)
# Creating a dummy variable for the variable 'OnlineSecurity'.
os = pd.get_dummies(telecom['OnlineSecurity'],prefix='OnlineSecurity')
os1= os.drop(['OnlineSecurity_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,os1],axis=1)
# Creating a dummy variable for the variable 'OnlineBackup'.
ob =pd.get_dummies(telecom['OnlineBackup'],prefix='OnlineBackup')
ob1 =ob.drop(['OnlineBackup_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ob1],axis=1)
# Creating a dummy variable for the variable 'DeviceProtection'.
dp =pd.get_dummies(telecom['DeviceProtection'],prefix='DeviceProtection')
dp1 = dp.drop(['DeviceProtection_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,dp1],axis=1)
# Creating a dummy variable for the variable 'TechSupport'.
ts =pd.get_dummies(telecom['TechSupport'],prefix='TechSupport')
ts1 = ts.drop(['TechSupport_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ts1],axis=1)
# Creating a dummy variable for the variable 'StreamingTV'.
st =pd.get_dummies(telecom['StreamingTV'],prefix='StreamingTV')
st1 = st.drop(['StreamingTV_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,st1],axis=1)
# Creating a dummy variable for the variable 'StreamingMovies'.
sm =pd.get_dummies(telecom['StreamingMovies'],prefix='StreamingMovies')
sm1 = sm.drop(['StreamingMovies_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,sm1],axis=1)
# In[10]:
telecom['MultipleLines'].value_counts()
# In[11]:
telecom = telecom.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetService',
                        'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                        'TechSupport', 'StreamingTV', 'StreamingMovies'], 1)
# In[12]:
telecom['TotalCharges'] = pd.to_numeric(telecom['TotalCharges'],errors='coerce')
telecom['tenure'] = telecom['tenure'].astype(float).astype(int)
# In[13]:
telecom.info()
# In[14]:
num_telecom = telecom[['tenure','MonthlyCharges','SeniorCitizen','TotalCharges']]
# In[15]:
num_telecom.describe(percentiles=[.25,.5,.75,.90,.95,.99])
# In[16]:
telecom.isnull().sum()
# In[17]:
round(100*(telecom.isnull().sum()/len(telecom.index)), 2)
# In[18]:
telecom = telecom[~np.isnan(telecom['TotalCharges'])]
# In[19]:
round(100*(telecom.isnull().sum()/len(telecom.index)), 2)
# In[20]:
df = telecom[['tenure','MonthlyCharges','TotalCharges']]
# In[21]:
normalized_df=(df-df.mean())/df.std()
# In[22]:
telecom = telecom.drop(['tenure','MonthlyCharges','TotalCharges'], 1)
# In[23]:
telecom = pd.concat([telecom,normalized_df],axis=1)
# In[24]:
telecom
# In[25]:
churn = (sum(telecom['Churn'])/len(telecom['Churn'].index))*100
# In[26]:
telecom
# In[27]:
churn
# In[28]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.metrics import mean_squared_error
# In[29]:
features = telecom.drop(['Churn','customerID'],axis=1).values
target = telecom['Churn'].values
# In[30]:
features_train,target_train=features[20:30],target[20:30]
features_test,target_test=features[40:50],target[40:50]
print(features_test)
# In[31]:
model = GaussianNB()
model.fit(features_train,target_train)
print(model.predict(features_test))
# In[32]:
from sklearn.model_selection import train_test_split
# In[33]:
# Putting feature variable to X
X = telecom.drop(['Churn','customerID'],axis=1)
# Putting response variable to y
y = telecom['Churn']
# In[34]:
y.head()
# In[35]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,
train_size=0.7,test_size=0.3,random_state=100)
# In[36]:
import statsmodels.api as sm
# In[37]:
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()
# In[38]:
# Importing matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
# In[39]:
# Let's see the correlation matrix
plt.figure(figsize = (20,10)) # Size of the figure
sns.heatmap(telecom.corr(),annot = True)
# In[40]:
X_test2 = X_test.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No',
                       'DeviceProtection_No','TechSupport_No','StreamingTV_No',
                       'StreamingMovies_No'],1)
X_train2 = X_train.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No',
                         'DeviceProtection_No','TechSupport_No','StreamingTV_No',
                         'StreamingMovies_No'],1)
# In[41]:
plt.figure(figsize = (20,10))
sns.heatmap(X_train2.corr(),annot = True)
# In[42]:
logm2 = sm.GLM(y_train,(sm.add_constant(X_train2)), family = sm.families.Binomial())
logm2.fit().summary()
# In[43]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
from sklearn.feature_selection import RFE
rfe = RFE(logreg, 13) # running RFE with 13 variables as output
rfe = rfe.fit(X,y)
print(rfe.support_) # Printing the boolean results
print(rfe.ranking_) # Printing the ranking
# In[44]:
col = ['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year',
'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic',
'InternetService_No',
'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
# In[45]:
# Let's run the model using the selected variables
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk = LogisticRegression()
logsk.fit(X_train[col], y_train)
# In[46]:
#Comparing the model with StatsModels
logm4 = sm.GLM(y_train,(sm.add_constant(X_train[col])), family =
sm.families.Binomial())
logm4.fit().summary()
# In[47]:
# UDF for calculating vif value
def vif_cal(input_data, dependent_col):
    vif_df = pd.DataFrame(columns=['Var', 'Vif'])
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.OLS(y, x).fit().rsquared
        vif = round(1/(1-rsq), 2)
        vif_df.loc[i] = [xvar_names[i], vif]
    return vif_df.sort_values(by='Vif', axis=0, ascending=False, inplace=False)
# In[48]:
telecom.columns
['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year',
'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic',
'InternetService_No',
'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
# In[49]:
# Calculating Vif value
vif_cal(input_data=telecom.drop(['customerID','SeniorCitizen', 'Partner', 'Dependents',
                                 'PaymentMethod_Credit card (automatic)','PaymentMethod_Mailed check',
                                 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No',
                                 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes',
                                 'TechSupport_No','StreamingTV_No','StreamingTV_Yes','StreamingMovies_Yes',
                                 'MonthlyCharges'], axis=1), dependent_col='Churn')
# In[50]:
col = ['PaperlessBilling', 'Contract_One year', 'Contract_Two year',
'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic',
'InternetService_No',
'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
# In[51]:
logm5 = sm.GLM(y_train,(sm.add_constant(X_train[col])), family =
sm.families.Binomial())
logm5.fit().summary()
# In[52]:
# Calculating Vif value
vif_cal(input_data=telecom.drop(['customerID','PhoneService','SeniorCitizen', 'Partner', 'Dependents',
                                 'PaymentMethod_Credit card (automatic)','PaymentMethod_Mailed check',
                                 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No',
                                 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes',
                                 'TechSupport_No','StreamingTV_No','StreamingTV_Yes','StreamingMovies_Yes',
                                 'MonthlyCharges'], axis=1), dependent_col='Churn')
# In[53]:
# Let's run the model using the selected variables
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk = LogisticRegression()
logsk.fit(X_train[col], y_train)
# In[54]:
y_pred = logsk.predict_proba(X_test[col])
# In[55]:
y_pred_df = pd.DataFrame(y_pred)
# In[56]:
y_pred_1 = y_pred_df.iloc[:,[1]]
# In[57]:
y_pred_1.head()
# In[58]:
y_test_df = pd.DataFrame(y_test)
# In[59]:
y_test_df = pd.DataFrame(y_test)
# In[60]:
# Removing index for both dataframes to append them side by side
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
# In[61]:
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)
# In[62]:
y_pred_final= y_pred_final.rename(columns={ 1 : 'Churn_Prob'})
# In[63]:
# Rearranging the columns
y_pred_final = y_pred_final.reindex(['CustID','Churn','Churn_Prob'], axis=1)
# In[64]:
y_pred_final.head()
# In[65]:
# Creating new column 'predicted' with 1 if Churn_Prob>0.5 else 0
y_pred_final['predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.5 else 0)
# In[66]:
y_pred_final.head()
# In[67]:
from sklearn import metrics
# In[68]:
help(metrics.confusion_matrix)
# In[69]:
# Confusion matrix
confusion = metrics.confusion_matrix( y_pred_final.Churn, y_pred_final.predicted )
confusion
# In[70]:
metrics.accuracy_score( y_pred_final.Churn, y_pred_final.predicted)
# In[71]:
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# In[72]:
TP / float(TP+FN)
# In[73]:
TN / float(TN+FP)
# In[74]:
# Calculate false positive rate - predicting churn when the customer has not churned
print(FP/ float(TN+FP))
# In[75]:
# positive predictive value
print (TP / float(TP+FP))
# In[76]:
# Negative predictive value
print (TN / float(TN+ FN))
# In[77]:
def draw_roc(actual, probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs,
                                             drop_intermediate=False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(6, 4))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
    return fpr, tpr, thresholds
# In[78]:
draw_roc(y_pred_final.Churn, y_pred_final.predicted)
# In[79]:
# Let's create columns with different probability cutoffs
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_pred_final[i] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > i else 0)
y_pred_final.head()
# In[80]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
# TP = confusion[1,1] # true positive
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_pred_final.Churn, y_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.3 else
0)
y_pred_final.head()
#Let's check the overall accuracy.
metrics.accuracy_score( y_pred_final.Churn, y_pred_final.final_predicted)
FinalChurn=telecom.filter(['customerID','Churn'])
total_rows = FinalChurn[FinalChurn.Churn == 1]
total_rows
value = len(total_rows)
print("number of customers who have churned is " + str(value))
2. Output screenshots
Fig-1 Structure of Data
Fig-2 Histogram
Fig-3 Box Plot
Fig-4 Graph plot
Fig-5 Conversion of data types
Fig-6 Missing Values
Fig-7 Converting the data sets into 0 and 1
Fig-8 Naive Bayesian
Fig-9 Optimal cut-off point
Fig-10 Correlation matrix
Fig-11 ROC Curve
Fig-12 Customers who stay
Fig-13 Customers who leave
Chapter-6
TESTING & VALIDATION
6.1 Introduction:
Functional vs. Non-functional Testing
The goal of utilizing numerous testing methodologies in your development process is to
make sure your software can successfully operate in multiple environments and across
different platforms. These can typically be broken down into functional and non-
functional testing. Functional testing involves testing the application against the business
requirements. It incorporates all test types designed to guarantee that each part of a piece of
software behaves as expected, using the use cases provided by the design team or business
analyst. These testing methods are usually conducted in order and include:
 Unit testing
 Integration testing
 System testing
 Acceptance testing
Non-functional testing methods incorporate all test types focused on the operational aspects
of a piece of software. These include:
 Performance testing
 Security testing
 Usability testing
 Compatibility testing
The key to releasing high-quality software that can be easily adopted by your end users is to
build a robust testing framework that implements both functional and non-functional
software testing methodologies.
Unit Testing
Unit testing is the first level of testing and is often performed by the developers themselves.
It is the process of ensuring individual components of a piece of software at the code level
are functional and work as they were designed to. Developers in a test-driven environment
will typically write and run the tests prior to the software or feature being passed over to the
test team. Unit testing can be conducted manually, but automating the process will speed up
delivery cycles and expand test coverage. Unit testing will also make debugging easier
because finding issues earlier means they take less time to fix than if they were discovered
later in the testing process. TestLeft is a tool that allows advanced testers and developers to
shift left with the fastest test automation tool embedded in any IDE.
Fig-7 Unit Testing Life cycle
Unit Testing Techniques:
 White Box Testing - used to test the behavior of each function
 Black Box Testing - used to test the user interface, inputs and outputs
 Gray Box Testing - combines both approaches to execute tests and assess risks
White box testing:
The box testing approach of software testing consists of black box testing and white box
testing. We discuss here white box testing, which is also known as glass box testing,
structural testing, clear box testing, open box testing and transparent box testing. It
tests the internal coding and infrastructure of software, focusing on checking predefined inputs
against expected and desired outputs. It is based on the inner workings of an application and
revolves around internal structure testing. In this type of testing, programming skills are
required to design test cases. The primary goal of white box testing is to focus on the flow of
inputs and outputs through the software and to strengthen the security of the software.
The term 'white box' is used because of the internal perspective of the system. The clear box
or white box or transparent box name denote the ability to see through the software's outer
shell into its inner workings.
Test cases for white box testing are derived from the design phase of the software
development lifecycle. Data flow testing, control flow testing, path testing, branch testing, and
statement and decision coverage are all techniques used in white box testing as guidelines
to create error-free software.
Fig-8 White Box Testing
White box testing follows a set of working steps to make testing manageable and to make it
easy to understand what the next task is. There are some basic steps to perform white box testing.
Generic steps of white box testing:
o Design all test scenarios and test cases, and prioritize them by priority
number.
o This step involves the study of code at runtime to examine the resource utilization,
not accessed areas of the code, time taken by various methods and operations and so
on.
o In this step testing of internal subroutines takes place. Internal subroutines such as
nonpublic methods, interfaces are able to handle all types of data appropriately or
not.
o This step focuses on testing of control statements like loops and conditional
statements to check the efficiency and accuracy for different data inputs.
o In the last step white box testing includes security testing to check all possible
security loopholes by looking at how the code handles security
Reasoning of white box testing:
o It identifies internal security holes.
o It checks the flow of inputs through the code.
o It checks the functionality of conditional statements and loops.
o It tests functions, objects, and statements at an individual level.
Advantages of white box testing:
o White box testing optimizes code so hidden errors can be identified.
o Test cases of white box testing can be easily automated.
o This testing is more thorough than other testing approaches as it covers all code
paths.
o It can be started in the SDLC phase even without GUI.
Disadvantages of white box testing:
o White box testing is very time-consuming when it comes to large-scale
programming applications.
o White box testing is expensive and complex.
o It can still let errors reach production, because the tests are based on the existing code and
cannot detect missing functionality.
o White box testing needs professional programmers who have a detailed knowledge
and understanding of the programming language and implementation.
Black box testing:
Black box testing is a technique of software testing which examines the functionality of
software without peering into its internal structure or coding. The primary source of black
box testing is a specification of requirements that is stated by the customer.
In this method, the tester selects a function and gives it input values to examine its functionality,
and checks whether the function gives the expected output or not. If the function produces the
correct output, it passes the test; otherwise it fails. The test team reports the
result to the development team and then tests the next function. After testing
all functions, if there are severe problems, the software is given back to the development team
for correction.
Fig-9 Black Box Testing
Generic steps of black box testing:
o Examine the requirements and specifications of the system.
o Choose valid inputs (positive test scenarios) and invalid inputs (negative test scenarios)
and determine the expected outputs for them.
o Construct and execute the test cases.
o Compare the actual outputs with the expected outputs.
o Report any defects to the development team and retest after they are fixed.
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 

Dernier (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 

Here are the key steps to build a logistic regression model to predict customer churn:

1. Data Understanding: Get details of the dataset such as variables, data types and missing values, and check for any issues.
2. Data Preparation: Handle missing data and outliers, encode categorical variables, and split the data into train and test sets.
3. Feature Selection: Identify the important predictor variables that influence churn and remove redundant/irrelevant variables.
4. Model Building: Fit a logistic regression on the training data. It predicts the probability of churn (0 or 1):

log(p / (1 - p)) = B0 + B1*X1 + B2*X2 + ... + Bn*Xn
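As an illustration of step 4, here is a minimal sketch in Python using pandas and scikit-learn. The file name "telecom_churn.csv" and the column names are assumptions for illustration, not the project's actual dataset or code.

# Minimal sketch: fitting a logistic regression churn model with scikit-learn.
# File name and column names below are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("telecom_churn.csv")          # hypothetical dataset
X = df[["tenure", "monthly_charges"]]          # assumed numeric predictor columns
y = df["churn"]                                # 1 = churned, 0 = retained

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The fitted model is exactly log(p/(1-p)) = B0 + B1*X1 + B2*X2
print("B0 (intercept):", model.intercept_[0])
print("B1..Bn (coefficients):", model.coef_[0])
print("P(churn) for the first test row:", model.predict_proba(X_test)[0, 1])

The printed intercept and coefficients correspond directly to B0 and B1..Bn in the equation above, so the probability of churn for any customer can be read off from the fitted log-odds.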

• 5. CONTENTS
1. Introduction
   1.1 Motivation 9
   1.2 Problem definition 9
   1.3 Objective of Project 9
      1.3.1 General Objective 9
      1.3.2 Specific Objective 10
   1.4 Limitations of Project 10
2. LITERATURE SURVEY
   2.1 Introduction 11
3. ANALYSIS
   3.1 Introduction 12
   3.2 Software Requirement Specification 12
      3.2.1 Functional requirements 12
      3.2.2 Non-Functional requirements 14
      3.2.3 Software requirement 14
      3.2.4 Hardware requirement 23
   3.3 Content diagram of Project 25
4. DESIGN
   4.1 Introduction 26
   4.2 UML Diagram 42
      4.2.1 Class diagram 42
      4.2.2 Use case diagram 43
      4.2.3 Activity diagram 44
      4.2.4 Sequence diagram 45
      4.2.5 Collaboration diagram 46
      4.2.6 Component diagram 47
      4.2.7 Deployment diagram 48
5. IMPLEMENTATION METHODS & RESULTS 49
   5.1 Introduction 49
   5.2 Explanation of Key functions 50
   5.3 Screenshots or output screens 71
• 6.
6. TESTING & VALIDATION
   6.1 Introduction 78
   6.2 Design of test cases and scenarios 88
   6.3 Validation 88
   6.4 Conclusion 90
7. CONCLUSION AND FUTURE ENHANCEMENTS
   First Paragraph - Project Conclusion 90
   Second Paragraph - Future Enhancement 90
REFERENCES 91
   1. Author Name, Title of Paper/Book, Publisher's Name, Year of Publication
   2. Full URL Address
• 7. LIST OF FIGURES (PAGE NO)
Fig-1 UML Views 28
Fig-2 Merging data sets 50
Fig-3 Imputing missing values 52
Fig-4 Dividing into training set 52
Fig-5 Correlation Matrix 53
Fig-6 Histogram 55
Fig-7 Unit testing life cycle 79
Fig-8 White box testing 80
Fig-9 Black box testing 82
Fig-10 Grey box testing 83
Fig-11 Grey box testing feature 84
Fig-12 Software verification 89
• 8. LIST OF OUTPUT SCREENSHOTS (PAGE NO)
Fig-1 Structure of data 71
Fig-2 Histogram 72
Fig-3 Box plot 72
Fig-4 Graph plot 73
Fig-5 Conversion of data types 73
Fig-6 Missing values 74
Fig-7 Converting the data sets to 1 and 0 74
Fig-8 Naive Bayesian 75
Fig-9 Optimal cut-off point 75
Fig-10 Correlation matrix 76
Fig-11 ROC curve 76
Fig-12 Customers who stay 77
Fig-13 Customers who leave 77
• 9. Chapter-1 Introduction

1.1 Motivation
In the present project we are trying to help service providers avoid losing their customers. With its help we can better understand the customers and let the service providers know their customers.

1.2 Problem definition
Telecommunication is the exchange of signs, signals, messages, words, writings, images, sounds or information of any nature by wire, radio, optical or other electromagnetic systems. Telecommunication occurs when the exchange of information between communication participants includes the use of technology. It is transmitted through a transmission medium, such as over physical media, for example over electrical cable, or via electromagnetic radiation through space, such as radio or light. Such transmission paths are often divided into communication channels, which afford the advantages of multiplexing. Since the Latin term communicatio is considered the social process of information exchange, the term telecommunications is often used in its plural form because it involves many different technologies.

The churn rate, also known as the rate of attrition or customer churn, is the rate at which customers stop doing business with an entity. It is most commonly expressed as the percentage of service subscribers who discontinue their subscriptions within a given time period. It is also the rate at which employees leave their jobs within a certain period. For a company to expand its clientele, its growth rate (measured by the number of new customers) must exceed its churn rate.

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.

1.3 Objective of Project
1.3.1 General Objective
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers. Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as one of their key business metrics, because the cost of retaining an existing customer is far less than that of acquiring a new one. Companies from these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.

1.3.2 Specific Objective
• 10. Companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs due to a decision by the customer to switch to another company or service provider; involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided. Predictive analytics uses churn prediction models that predict customer churn by assessing each customer's propensity or risk to churn. Since these models generate a small prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base who are most vulnerable to churn.

1.4 Limitations of Project
Acquiring the data of a customer is difficult. These days it is very difficult to know what a person is thinking, so we can only judge with the help of the data in front of us, which may be only 80-90% reliable. Therefore, we should make sure that our predictions are good.
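To make the churn-rate and growth-rate definitions from section 1.2 concrete, here is a small illustrative calculation in Python; the subscriber counts are made-up numbers, not figures from the project's dataset.

# Illustrative only: subscriber counts below are made-up numbers.
subscribers_at_start = 10000      # subscribers at the start of the period
subscribers_lost = 450            # subscribers who discontinued during the period
new_subscribers = 700             # subscribers acquired during the period

churn_rate = subscribers_lost / subscribers_at_start * 100
growth_rate = new_subscribers / subscribers_at_start * 100

print(f"Churn rate:  {churn_rate:.1f}%")    # 4.5%
print(f"Growth rate: {growth_rate:.1f}%")   # 7.0%
print("Clientele expanding:", growth_rate > churn_rate)

In this made-up example the growth rate (7.0%) exceeds the churn rate (4.5%), so the clientele expands; if the two numbers were reversed, the customer base would shrink.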
• 11. Chapter-2 LITERATURE SURVEY

2.1 Introduction

LOGISTIC REGRESSION
Problem Statement: "You have a telecom firm which has collected data of all its customers."
The main types of attributes are:
1. Demographics (age, gender, etc.)
2. Services availed (internet packs purchased, special offers, etc.)
3. Expenses (amount of recharge done per month, etc.)

Based on all this past information, you want to build a model which will predict whether a particular customer will churn or not. So the variable of interest, i.e. the target variable, here is 'Churn', which will tell us whether or not a particular customer has churned. It is a binary variable: 1 means that the customer has churned and 0 means the customer has not churned. With 21 predictor variables we need to predict whether a particular customer will switch to another telecom provider or not. The data sets were taken from the Kaggle website.

DATA PROCEDURE
Import the required libraries
1. Importing all datasets
2. Merging all datasets based on condition ("customer_id")
3. Data cleaning - checking the null values
4. Check for the missing values and replace them
5. Model building
   • Binary encoding
   • One-hot encoding
   • Creating dummy variables and removing the extra columns
6. Feature selection using RFE - Recursive Feature Elimination
7. Getting the predicted values on the train set
8. Creating a new column 'predicted' with 1 if churn probability > 0.5 else 0
9. Create a confusion matrix on the train set and test set
10. Check the overall accuracy
(An illustrative code sketch of steps 5-10 follows this list.)
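The following is a minimal sketch of steps 5-10 above. It assumes an already merged and cleaned CSV file; the file name "merged_churn_data.csv", the column name "Churn", and the choice of 15 RFE features are illustrative assumptions, not the project's exact code.

# Sketch of steps 5-10: dummy variables, RFE, 0.5 threshold, confusion matrix, accuracy.
# File name, column names and the number of RFE features are assumptions.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("merged_churn_data.csv")                         # hypothetical merged dataset
X = pd.get_dummies(df.drop(columns=["Churn"]), drop_first=True)   # dummies, drop the extra columns
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Recursive Feature Elimination keeps the strongest predictors (15 is an arbitrary choice here)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=15)
rfe.fit(X_train, y_train)
selected = X_train.columns[rfe.support_]

model = LogisticRegression(max_iter=1000).fit(X_train[selected], y_train)

# predicted = 1 if churn probability > 0.5 else 0
train_pred = (model.predict_proba(X_train[selected])[:, 1] > 0.5).astype(int)
test_pred = (model.predict_proba(X_test[selected])[:, 1] > 0.5).astype(int)

print(confusion_matrix(y_train, train_pred))           # confusion matrix on the train set
print(confusion_matrix(y_test, test_pred))             # confusion matrix on the test set
print("Overall accuracy:", accuracy_score(y_test, test_pred))

The 0.5 cut-off used here matches step 8; in practice the optimal cut-off point can be tuned, as the output screenshots (Fig-9) suggest was done in the project.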
• 12. Chapter-3 ANALYSIS

3.1 Introduction
The field of chemistry uses analysis in at least three ways: to identify the components of a particular chemical compound (qualitative analysis), to identify the proportions of components in a mixture (quantitative analysis), and to break down chemical processes and examine chemical reactions between elements of matter. For an example of its use, analysis of the concentration of elements is important in managing a nuclear reactor, so nuclear scientists will analyse neutron activation to develop discrete measurements within vast samples. A matrix can have a considerable effect on the way a chemical analysis is conducted and the quality of its results. Analysis can be done manually or with a device. Chemical analysis is an important element of national security among the major world powers with materials.

3.2 Software Requirement Specification

3.2.1 Functional requirements
Functional requirements may involve calculations, technical details, data manipulation and processing, and other specific functionality that define what a system is supposed to accomplish. Behavioural requirements describe all the cases where the system uses the functional requirements; these are captured in use cases. Functional requirements are supported by non-functional requirements (also known as "quality requirements"), which impose constraints on the design or implementation (such as performance requirements, security, or reliability). Generally, functional requirements are expressed in the form "system must do requirement", while non-functional requirements take the form "system shall be requirement". The plan for implementing functional requirements is detailed in the system design, whereas non-functional requirements are detailed in the system architecture.

Customer data
It contains all data related to the customer's services and contract information, in addition to all offers, packages, and services subscribed to by the customer. Furthermore, it also contains information generated from the CRM system, such as all customer GSMs, type of subscription, birthday, gender, location of residence and more.

Network logs data
Contains the internal sessions related to internet, calls, and SMS for each transaction in the telecom operator, such as the time needed to open a session for the internet and the call ending status. It can indicate if a session dropped due to an error in the internal network.

Call detail records "CDRs"
Contain all charging information about calls, SMS, MMS, and internet transactions made by customers. This data source is generated as text files. This data has a large size and there is a lot of detailed information in it. We spent a lot of time understanding it and getting to know its sources and storage format. In addition to these records, the data must be linked to the detailed data stored in relational databases that contain detailed information about the customer.
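As a hedged illustration of how these three sources could be linked on a common customer identifier, here is a small pandas sketch; the file names, column names and aggregations are assumptions for illustration, not the operator's actual schema.

# Illustrative sketch: joining customer, network-log and CDR extracts on customer_id.
# File names, column names and aggregations are assumptions.
import pandas as pd

customers = pd.read_csv("customer_data.csv")      # CRM: subscription type, gender, location, ...
network_logs = pd.read_csv("network_logs.csv")    # session records: drops, setup time, ...
cdrs = pd.read_csv("cdr_records.csv")             # charging records: calls, SMS, MMS, internet

# Aggregate the transactional sources to one row per customer before joining
cdr_summary = cdrs.groupby("customer_id").agg(
    total_calls=("call_id", "count"),
    total_charge=("charge", "sum"),
).reset_index()

net_summary = network_logs.groupby("customer_id").agg(
    dropped_sessions=("dropped", "sum"),          # assumes a boolean 'dropped' column
).reset_index()

merged = (customers
          .merge(net_summary, on="customer_id", how="left")
          .merge(cdr_summary, on="customer_id", how="left"))
print(merged.shape)

Aggregating the high-volume CDR and session data to one row per customer before the join keeps the merged table at customer level, which is the granularity needed for churn modelling.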
• 13. The quality of the system is maintained in such a way that it is very user friendly to all users. The software quality attributes are assumed as under:

1. Accurate and hence reliable.
In simpler terms, given a set of data points from repeated measurements of the same quantity, the set can be said to be accurate if their average is close to the true value of the quantity being measured, while the set can be said to be precise if the values are close to each other. In the first, more common definition of "accuracy" above, the two concepts are independent of each other, so a particular set of data can be said to be either accurate, or precise, or both, or neither. In everyday language, we use the word reliable to mean that something is dependable and that it will behave predictably every time.

2. Secured.
Security is freedom from, or resilience against, potential harm caused by others. The word entered the English language in the 16th century. It is derived from the Latin securus, meaning freedom from anxiety: se (without) + cura (care, anxiety).

3. Fast speed.
In everyday use and in kinematics, the speed of an object is the magnitude of the change of its position; it is thus a scalar quantity. The average speed of an object in an interval of time is the distance travelled by the object divided by the duration of the interval; the instantaneous speed is the limit of the average speed as the duration of the time interval approaches zero.

4. Compatibility.
A state in which two things are able to exist or occur together without problems or conflict.

3.2.2 Non-Functional requirements
• 14. In systems engineering and requirements engineering, a non-functional requirement (NFR) is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviours. They are contrasted with functional requirements that define specific behaviour or functions. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture, because they are usually architecturally significant requirements.

3.2.3 Software requirement

1. Python

Introduction
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3. The Python 2 language was officially discontinued in 2020 (first planned for 2015), and "Python 2.7.18 is the last Python 2.7 release and therefore the last Python 2 release." No more security patches or other improvements will be released for it. With Python 2's end of life, only Python 3.5.x and later are supported. Python interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, an open-source reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.

History of Python
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by SETL), capable of exception handling and interfacing with the Amoeba operating system. Its implementation began in December 1989. Van Rossum shouldered sole responsibility for the project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation" from his responsibilities as Python's Benevolent Dictator For Life, a title the Python community bestowed upon him to reflect his long-term commitment as the
• 15. project's chief decision-maker. He now shares his leadership as a member of a five-person steering council. In January 2019, active Python core developers elected Brett Cannon, Nick Coghlan, Barry Warsaw, Carol Willing and Van Rossum to a five-member "Steering Council" to lead the project.

Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-detecting garbage collector and support for Unicode. Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not completely backward-compatible. Many of its major features were backported to the Python 2.6.x and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3. Python 2.7's end-of-life date was initially set at 2015, then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.

2. Anaconda 3

Introduction
Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. The distribution includes data-science packages suitable for Windows, Linux, and macOS. It is developed and maintained by Anaconda, Inc., which was founded by Peter Wang and Travis Oliphant in 2012. As an Anaconda, Inc. product, it is also known as Anaconda Distribution or Anaconda Individual Edition, while other products from the company are Anaconda Team Edition and Anaconda Enterprise Edition, both of which are not free.

Package versions in Anaconda are managed by the package management system conda. This package manager was spun out as a separate open-source package as it ended up being useful on its own and for things other than Python. There is also a small, bootstrap version of Anaconda called Miniconda, which includes only conda, Python, the packages they depend on, and a small number of other packages.

History of Anaconda 3
Anaconda distribution comes with 1,500 packages selected from PyPI as well as the conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command line interface (CLI).

The big difference between conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason conda exists. When pip installs a package, it automatically installs any dependent Python packages without checking if these conflict with previously installed packages. It will install a package and any of its dependencies regardless of the state of the existing installation. Because of this, a user with a working installation of, for example, Google TensorFlow, can find that it stops working after having used pip to install a different package that requires a different version of the
• 16. dependent numpy library than the one used by TensorFlow. In some cases, the package may appear to work but produce different results in detail.

In contrast, conda analyses the current environment, including everything currently installed, and, together with any version limitations specified (e.g. the user may wish to have TensorFlow version 2.0 or higher), works out how to install a compatible set of dependencies, and shows a warning if this cannot be done.

Open-source packages can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or the user's own private repository or mirror, using the conda install command. Anaconda, Inc. compiles and builds the packages available in the Anaconda repository itself, and provides binaries for Windows 32/64 bit, Linux 64 bit and macOS 64-bit. Anything available on PyPI may be installed into a conda environment using pip, and conda will keep track of what it has installed itself and what pip has installed. Custom packages can be made using the conda build command, and can be shared with others by uploading them to Anaconda Cloud, PyPI or other repositories.

The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python 3.7. However, it is possible to create new environments that include any version of Python packaged with conda.

Anaconda distribution comes with 1,500 packages selected from PyPI as well as the conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command line interface (CLI). Here we define conda as a tool that analyses your current environment, everything you have installed, and any version limitations you specify (e.g. you only want TensorFlow >= 2.0), and figures out how to install compatible dependencies, or it will tell you that what you want cannot be done. pip, by contrast, will just install the package you specify and any dependencies, even if that breaks other packages.

Conda allows users to easily install different versions of binary software packages and any required libraries appropriate for their computing platform. Also, it allows users to switch between package versions and download and install updates from a software repository. Conda is written in the Python programming language, but can manage projects containing code written in any language (e.g., R), including multi-language projects. Conda can install Python, while similar Python-based cross-platform package managers (such as wheel or pip) cannot. A popular conda channel for bioinformatics software is Bioconda, which provides multiple software distributions for computational biology. In fact, the conda package and environment manager is included in all versions of Anaconda, Miniconda and Anaconda Repository.

The big difference between conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason conda exists. When pip installs a package, it automatically installs any dependent Python packages without checking if these conflict with previously installed packages. It will install a package and any of its dependencies regardless of the state of the existing installation. Because of this, a user with a working installation of, for example, Google TensorFlow, can find that it stops working after having used pip to install a different package that requires a different version of the dependent numpy library than the one used by TensorFlow.
In some cases, the package may appear to work but produce different results in detail.
• 17. In contrast, conda analyses the current environment, including everything currently installed, and, together with any version limitations specified (e.g. the user may wish to have TensorFlow version 2.0 or higher), works out how to install a compatible set of dependencies, and shows a warning if this cannot be done. Open-source packages can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or your own private repository or mirror, using the conda install command. Anaconda, Inc. compiles and builds the packages available in the Anaconda repository itself, and provides binaries for Windows 32/64 bit, Linux 64 bit and macOS 64-bit. Anything available on PyPI may be installed into a conda environment using pip, and conda will keep track of what it has installed itself and what pip has installed. Custom packages can be made using the conda build command, and can be shared with others by uploading them to Anaconda Cloud, PyPI or other repositories. The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python 3.7. However, it is possible to create new environments that include any version of Python packaged with conda.

Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface included in the Anaconda distribution that allows users to launch applications and manage conda packages, environments and channels without using command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run the packages and update them. It is available for Windows, macOS and Linux. The following applications are available by default in Navigator.

Conda is an open-source, language-agnostic package and environment management system that installs, runs, and updates packages and their dependencies. It was created for Python programs, but it can package and distribute software for any language (e.g., R), where R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, data-mining surveys, and studies of scholarly literature databases show substantial increases in popularity; as of February 2020, R ranks 13th in the TIOBE index, a measure of popularity of programming languages. A GNU package, the official R software environment is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License. Pre-compiled executables are provided for various operating systems. Although R has a command line interface, there are several third-party graphical user interfaces, such as RStudio, an integrated development environment, and Jupyter, a notebook interface. The conda package and environment manager is included in all versions of Anaconda, Miniconda, and Anaconda Repository.

Anaconda Cloud
Anaconda Cloud is a package management service by Anaconda where you can find, access, store and share public and private notebooks, environments, and conda and PyPI packages. Cloud hosts useful Python packages, notebooks and environments for a wide
• 18. variety of applications. You do not need to log in, or to have a Cloud account, to search for public packages or to download and install them.

Project Jupyter
Project Jupyter is a nonprofit organization created to "develop open-source software, open standards, and services for interactive computing across dozens of programming languages". Spun off from IPython in 2014 by Fernando Pérez, Project Jupyter supports execution environments in several dozen languages. Project Jupyter's name is a reference to the three core programming languages supported by Jupyter, which are Julia, Python and R, and also a homage to Galileo's notebooks recording the discovery of the moons of Jupiter. Project Jupyter has developed and supported the interactive computing products Jupyter Notebook, JupyterHub, and JupyterLab, the next-generation version of Jupyter Notebook.

In 2014, Fernando Pérez announced a spin-off project from IPython called Project Jupyter. IPython continued to exist as a Python shell and a kernel for Jupyter, while the notebook and other language-agnostic parts of IPython moved under the Jupyter name. Jupyter is language agnostic and supports execution environments (also called kernels) in several dozen languages, among which are Julia, R, Haskell, Ruby, and of course Python (via the IPython kernel). In 2015, GitHub and the Jupyter Project announced native rendering of the Jupyter notebook file format (.ipynb files) on the GitHub platform.

Jupyter Notebook
Jupyter Notebook (formerly IPython Notebooks) is a web-based computational environment for creating Jupyter notebook documents. The "notebook" term can colloquially refer to many different entities, mainly the Jupyter web application, the Jupyter Python web server, or the Jupyter document format, depending on context. A Jupyter Notebook document is a JSON document, following a versioned schema, containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media, usually ending with the ".ipynb" extension. A Jupyter Notebook can be converted to a number of open-source output formats (HTML, presentation slides, PDF, Markdown, Python) through "Download As" in the web interface, via the nbconvert library, or with the "jupyter nbconvert" command line interface in a shell. To simplify visualisation of Jupyter notebook documents on the web, the library is provided as a service through NbViewer, which can take a URL to any publicly available notebook document, convert it to HTML on the fly and display it to the user.

Jupyter Notebook interface
Jupyter Notebook can connect to many kernels to allow programming in many languages. By default, Jupyter Notebook ships with the IPython kernel. As of the 2.3 release (October 2014), there are 49 Jupyter-compatible kernels for many programming languages, including Python, R, Julia and Haskell. The Notebook interface was added to IPython in the 0.12 release (December 2011) and renamed to Jupyter Notebook in 2015 (IPython 4.0 - Jupyter 1.0). Jupyter Notebook is similar to the notebook interface of other programs such as Maple, Mathematica, and SageMath, a computational interface style that originated with Mathematica in the 1980s. According to The Atlantic, Jupyter interest overtook the popularity of the Mathematica notebook interface in early 2018.
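To make the notebook document format described above concrete, here is a small hedged sketch using the nbformat library to build a minimal .ipynb file programmatically; the file name and cell contents are illustrative only.

# Sketch: creating a minimal Jupyter notebook document with nbformat.
# The resulting file is plain JSON with an ordered list of cells, as described above.
import nbformat

nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell("# Telecom churn - exploratory notebook"))
nb.cells.append(nbformat.v4.new_code_cell("import pandas as pd\nprint(pd.__version__)"))

with open("churn_example.ipynb", "w", encoding="utf-8") as f:   # illustrative file name
    nbformat.write(nb, f)

# The saved file could then be converted, for example to HTML, with:
#   jupyter nbconvert --to html churn_example.ipynb

Opening the saved file in a text editor shows the versioned JSON schema directly, which is why notebooks can be rendered by services such as NbViewer and GitHub without running a kernel.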
• 19. Jupyter kernels
A Jupyter kernel is a program responsible for handling various types of requests (code execution, code completions, inspection) and providing a reply. Kernels talk to the other components of Jupyter using ZeroMQ over the network, and thus can be on the same or remote machines. Unlike many other notebook-like interfaces, in Jupyter, kernels are not aware that they are attached to a specific document, and can be connected to many clients at once. Usually kernels allow execution of only a single language, but there are a couple of exceptions. By default, Jupyter ships with IPython as the default kernel and a reference implementation via the ipykernel wrapper. Kernels for many languages, with varying quality and features, are available.

JupyterHub
JupyterHub is a multi-user server for Jupyter Notebooks. It is designed to support many users by spawning, managing, and proxying many singular Jupyter Notebook servers. While JupyterHub requires managing servers, third-party services like Jupyo provide an alternative to JupyterHub by hosting and managing multi-user Jupyter notebooks in the cloud.

JupyterLab
JupyterLab is the next-generation user interface for Project Jupyter. It offers all the familiar building blocks of the classic Jupyter Notebook (notebook, terminal, text editor, file browser, rich outputs, etc.) in a flexible and powerful user interface. The first stable release was announced on February 20, 2018.

Web application
It has also added user-based as well as system-based web application enhancements to support deployment across a variety of environments. It also tries to manage sessions as well as applications across the network. Tomcat is building additional components. A number of additional components may be used with Apache Tomcat. These components may be built by users should they need them, or they can be downloaded from one of the mirrors.

Features: Tomcat 7.x implements the Servlet 3.0 and JSP 2.2 specifications. It requires Java version 1.6, although previous versions have run on Java 1.1 through 1.5. Versions 5 through 6 saw improvements in garbage collection, JSP parsing, performance and scalability. Native wrappers, known as "Tomcat Native", are available for Microsoft Windows and Unix for platform integration. Tomcat 8.x implements the Servlet 3.1 and JSP 2.3 specifications. Apache Tomcat 8.5.x is intended to replace 8.0.x and includes new features pulled forward from Tomcat 9.0.x. The minimum Java version and implemented specification versions remain unchanged.

WINDOWS 10
• 20. Windows 10 is a series of personal computer operating systems produced by Microsoft as part of its Windows NT family of operating systems. It is the successor to Windows 8.1, and was released to manufacturing on July 15, 2015, and broadly released for retail sale on July 29, 2015. Windows 10 receives new builds on an ongoing basis, which are available at no additional cost to users, in addition to additional test builds of Windows 10 which are available to Windows Insiders. Devices in enterprise environments can receive these updates at a slower pace, or use long-term support milestones that only receive critical updates, such as security patches, over their ten-year lifespan of extended support.

One of Windows 10's most notable features is support for universal apps, an expansion of the Metro-style apps first introduced in Windows 8. Universal apps can be designed to run across multiple Microsoft product families with nearly identical code, including PCs, tablets, smartphones, embedded systems, Xbox One, Surface Hub and Mixed Reality. The Windows user interface was revised to handle transitions between a mouse-oriented interface and a touchscreen-optimized interface based on available input devices, particularly on 2-in-1 PCs; both interfaces include an updated Start menu which incorporates elements of Windows 7's traditional Start menu with the tiles of Windows 8. Windows 10 also introduced the Microsoft Edge web browser, a virtual desktop system, a window and desktop management feature called Task View, support for fingerprint and face recognition login, new security features for enterprise environments, and DirectX 12.

Windows 10 received mostly positive reviews upon its original release in July 2015. Critics praised Microsoft's decision to provide a desktop-oriented interface in line with previous versions of Windows, contrasting the tablet-oriented approach of Windows 8, although Windows 10's touch-oriented user interface mode was criticized for containing regressions upon the touch-oriented interface of Windows 8. Critics also praised the improvements to Windows 10's bundled software over Windows 8.1, Xbox Live integration, as well as the functionality and capabilities of the Cortana personal assistant and the replacement of Internet Explorer with Edge. However, media outlets have been critical of changes to operating system behaviours, including mandatory update installation, privacy concerns over data collection performed by the OS for Microsoft and its partners, and the adware-like tactics used to promote the operating system on its release.

Although Microsoft's goal to have Windows 10 installed on over a billion devices within three years of its release had failed, it still had an estimated usage share of 60% of all Windows versions on traditional PCs, and thus 47% of traditional PCs were running Windows 10 by September 2019. Across all platforms (PC, mobile, tablet and console), 35% of devices run some kind of Windows, whether Windows 10 or older.

Microsoft Excel
Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android and iOS. It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications. It has been a very widely applied spreadsheet
• 21. for these platforms, especially since version 5 in 1993, and it has replaced Lotus 1-2-3 as the industry standard for spreadsheets. Excel forms part of the Microsoft Office suite of software.

Number of rows and columns
Versions of Excel up to 7.0 had a limitation in the size of their data sets of 16K (2^14 = 16,384) rows. Versions 8.0 through 11.0 could handle 64K (2^16 = 65,536) rows and 256 columns (2^8, up to label 'IV'). Version 12.0 onwards, including the current version 16.x, can handle over 1M (2^20 = 1,048,576) rows and 16,384 (2^14, up to label 'XFD') columns.

Microsoft Excel up until the 2007 version used a proprietary binary file format called Excel Binary File Format (.XLS) as its primary format. Excel 2007 uses Office Open XML as its primary file format, an XML-based format that followed after a previous XML-based format called "XML Spreadsheet" ("XMLSS"), first introduced in Excel 2002. Although supporting and encouraging the use of new XML-based formats as replacements, Excel 2007 remained backwards-compatible with the traditional binary formats. In addition, most versions of Microsoft Excel can read CSV, DBF, SYLK, DIF, and other legacy formats. Support for some older file formats was removed in Excel 2007. The file formats were mainly from DOS-based programs.

Microsoft originally marketed a spreadsheet program called Multiplan in 1982. Multiplan became very popular on CP/M systems, but on MS-DOS systems it lost popularity to Lotus 1-2-3. Microsoft released the first version of Excel for the Macintosh on September 30, 1985, and the first Windows version was 2.05 (to synchronize with the Macintosh version 2.2) in November 1987. Lotus was slow to bring 1-2-3 to Windows, and by the early 1990s Excel had started to outsell 1-2-3 and helped Microsoft achieve its position as a leading PC software developer. This accomplishment solidified Microsoft as a valid competitor and showed its future of developing GUI software. Microsoft maintained its advantage with regular new releases, every two years or so. Excel 2.0 is the first version of Excel for the Intel platform. Versions prior to 2.0 were only available on the Apple Macintosh.

Excel 2.0 (1987)
The first Windows version was labeled "2" to correspond to the Mac version. This included a run-time version of Windows. BYTE in 1989 listed Excel for Windows as among the "Distinction" winners of the BYTE Awards. The magazine stated that the port of the "extraordinary" Macintosh version "shines", with a user interface as good as or better than the original.

Excel 3.0 (1990)
Included toolbars, drawing capabilities, outlining, add-in support, 3D charts, and many more new features. Also, an easter egg in Excel 4.0 reveals a hidden animation of a dancing set of numbers 1 through 3, representing Lotus 1-2-3, which was then crushed by an Excel logo.
• 22. Excel 5.0 (1993)
With version 5.0, Excel included Visual Basic for Applications (VBA), a programming language based on Visual Basic which adds the ability to automate tasks in Excel and to provide user-defined functions (UDFs) for use in worksheets. VBA is a powerful addition to the application and includes a fully featured integrated development environment (IDE). Macro recording can produce VBA code replicating user actions, thus allowing simple automation of regular tasks. VBA allows the creation of forms and in-worksheet controls to communicate with the user. The language supports use (but not creation) of ActiveX (COM) DLLs; later versions added support for class modules, allowing the use of basic object-oriented programming techniques.

The automation functionality provided by VBA made Excel a target for macro viruses. This caused serious problems until antivirus products began to detect these viruses. Microsoft belatedly took steps to prevent the misuse by adding the ability to disable macros completely, to enable macros when opening a workbook, or to trust all macros signed using a trusted certificate. Versions 5.0 to 9.0 of Excel contain various Easter eggs, including a "Hall of Tortured Souls", although since version 10 Microsoft has taken measures to eliminate such undocumented features from its products. Version 5.0 was released in a 16-bit x86 version for Windows 3.1 and later in a 32-bit version for NT 3.51 (x86/Alpha/PowerPC).

Excel 2007 (12.0)
Included in Office 2007, this release was a major upgrade from the previous version. Similar to other updated Office products, Excel in 2007 used the new Ribbon menu system. This was different from what users were used to, and was met with mixed reactions. One study reported fairly good acceptance by users, except highly experienced users and users of word processing applications with a classical WIMP interface, but was less convinced in terms of efficiency and organization. However, an online survey reported that a majority of respondents had a negative opinion of the change, with advanced users being "somewhat more negative" than intermediate users, and users reporting a self-estimated reduction in productivity.

Added functionality included the SmartArt set of editable business diagrams. Also added was improved management of named variables through the Name Manager, and much-improved flexibility in formatting graphs, which allow (x, y) coordinate labeling and lines of arbitrary weight. Several improvements to pivot tables were introduced. Also, like other Office products, the Office Open XML file formats were introduced, including .xlsm for a workbook with macros and .xlsx for a workbook without macros.

Specifically, many of the size limitations of previous versions were greatly increased. To illustrate, the number of rows was now 1,048,576 (2^20) and the number of columns was 16,384 (2^14; the far-right column is XFD). This changes what is a valid A1 reference versus a named range. This version made more extensive use of multiple cores for the calculation of spreadsheets; however, VBA macros are not handled in parallel, and XLL add-ins were only executed in parallel if they were thread-safe and this was indicated at registration.

3.2.4 Hardware requirement
• 23.
Server-side hardware
1. Hardware recommended by all software needed
2. Communication hardware to serve client requests

Client-side hardware
1. Hardware recommended by the respective client's operating system and web browser
2. Communication hardware to communicate with the server

Others
An Intel Core 2 Duo CPU, a RAM of 4 GB (minimum), a hard disk of any capacity, a 105-key keyboard, an optical mouse and a colour display are required.

Operating System
As our system is platform independent, there are no operating system constraints on using this software. Windows and Linux operating systems are considered to be extremely stable platforms. One of the advantages of using the Linux operating system is the speed at which performance enhancements and security patches are made available, since it is an open-source product.

Web Server
The available web server platforms are IIS and Apache. Since Apache is open source, software security patches tend to be released faster. One of the downsides to IIS is its vulnerability to virus attacks, something that Apache rarely has problems with. Apache offers PHP, a language considered to be the "C" of the web. The next part is the end-user environment. Earlier we explained that we assume the end users are not very aware of computers and are not very interested in moving their activities from the manual system to the automated system.

Web Browser
Let's play word association, just like when a psychologist asks you what comes to mind when you hear certain words: what do you think when you hear the words "Opera. Safari. Chrome. Firefox."? If you think of the Broadway play version of "The Lion King," maybe it is time to see a psychologist. However, if you said "Internet browsers," you're spot on. That's because the leading Internet browsers are:
• Google Chrome
• Mozilla Firefox
• Apple Safari
• Microsoft Internet Explorer
• 24.
• Microsoft Edge
• Opera
• Maxthon

And that order pretty much lines up with how they're ranked in terms of market share and popularity... today. Browsers come and go. Ten years ago, Netscape Navigator was a well-known browser; Netscape is long gone today. Another, called Mosaic, is considered the first modern browser; it was discontinued in 1997. So, what exactly is a browser?

Definition: A browser, short for web browser, is the software application (a program) that you're using right now to search for, reach and explore websites. Whereas Excel is a program for spreadsheets and Word a program for writing documents, a browser is a program for Internet exploring (which is where that name came from).

Browsers don't get talked about much. A lot of people simply click on the "icon" on their computers that takes them to the Internet, and that's as far as it goes. And in a way, that's enough. Most of us simply get in a car and turn the key... we don't know what kind of engine we have or what features it has... it takes us where we want to go. That's why when it comes to computers:
• There are some computer users that can't name more than one or two browsers
• Many of them don't know they can switch to another browser for free
• There are some who go to Google's webpage to "google" a topic and think that Google is their browser
  • 26. 26 Chapter-4 DESIGN 4.1 Introduction The UML may be used to visualize, specify, construct and document the artifacts of a software-intensive system. The UML is only a language and so is just one part of a software development method. The UML is process independent, although optimally it should be used in a process that is use case driven, architecture-centric, iterative, and incremental. What is UML: The Unified Modeling Language (UML) is a robust notation that we can use to build OOAD models. It is called so since it was the unification of ideas from the methodologies of the three amigos – Booch, Rumbaugh and Jacobson. The UML is the standard language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system. It can be used with all processes, throughout the development life cycle, and across different implementation technologies.  The UML combines the best from: o Data Modeling concepts o Business Modeling (work flow) o Object Modeling o Component Modeling  The UML may be used to: o Display the boundary of a system and its major functions using use cases and actors o Illustrate use case realizations with interaction diagrams o Represent a static structure of a system using class diagrams o Model the behavior of objects with state transition diagrams o Reveal the physical implementation architecture with component & deployment diagrams o Extend the functionality of the system with stereotypes Evolution of UML
  • 27. 27 One of the methods was the Object Modeling Technique (OMT), devised by James Rumbaugh and others at General Electric. It consists of a series of models - use case, object, dynamic, and functional - that combine to give a full view of a system. The Booch method was devised by Grady Booch and developed the practice of analyzing a system as a series of views. It emphasizes analyzing the system from both a macro development view and a micro development view, and it was accompanied by a very detailed notation. The Object-Oriented Software Engineering (OOSE) method was devised by Ivar Jacobson and focused on the analysis of system behavior. It advocated that at each stage of the process there should be a check to see that the requirements of the user were being met. The UML is composed of three different parts: 1. Model elements 2. Diagrams 3. Views  Model Elements: o The model elements represent basic object-oriented concepts such as classes, objects, and relationships. o Each model element has a corresponding graphical symbol to represent it in the diagrams.  Diagrams: o Diagrams portray different combinations of model elements. o For example, the class diagram represents a group of classes and the relationships, such as association and inheritance, between them. o The UML provides nine types of diagram - use case, class, object, state chart, sequence, collaboration, activity, component, and deployment.  Views: o Views provide the highest level of abstraction for analyzing the system. o Each view is an aspect of the system that is abstracted to a number of related UML diagrams. o Taken together, the views of a system provide a picture of the system in its entirety. o In the UML, the five main views of the system are: User, Structural, Behavioral, Implementation and Environment.
  • 28. 28 Fig-1 UML Views In addition to model elements, diagrams, and views, the UML provides mechanisms for adding comments, information, or semantics to diagrams. And it provides mechanisms to adapt or extend itself to a particular method, software system, or organization. Extensions of UML: o Stereotypes can be used to extend the UML notational elements o Stereotypes may be used to classify and extend associations, inheritance relationships, classes, and components Examples:  Class stereotypes : boundary, control, entity  Inheritance stereotypes : extend  Dependency stereotypes : uses  Component stereotypes : subsystem Use Case Diagrams: The use case diagram presents an outside view of the system The use-case model consists of use-case diagrams. o The use-case diagrams illustrate the actors, the use cases, and their relationships. o Use cases also require a textual description (use case specification), as the visual diagrams can't contain all of the information that is necessary. o The customers, the end-users, the domain experts, and the developers all have an input into the development of the use-case model. o Creating a use-case model involves the following steps:
  • 29. 29 1. defining the system 2. identifying the actors and the use cases 3. describing the use cases 4. defining the relationships between use cases and actors. 5. defining the relationships between the use cases Use Case: Definition: is a sequence of actions a system performs that yields an observable result of value to a particular actor. Use Case Naming: Use Case should always be named in business terms, picking the words from the vocabulary of the particular domain for which we are modeling the system. It should be meaningful to the user because use case analysis is always done from user’s perspective. It will be usually verbs or short verb phrases. Use Case Specification shall document the following: o Brief Description o Precondition o Main Flow o Alternate Flow o Exceptional flows o Post Condition o Special Requirements Notation of use case Actor: Definition: someone or something outside the system that interacts with the system o An actor is external - it is not actually part of what we are building, but an interface needed to support or use it. o It represents anything that interacts with the system. Notation for Actor
  • 30. 30 Relation: Two important types of relation used in Use Case Diagram  Include: An Include relationship shows behavior that is common to one or more use cases (Mandatory). Include relation results when we extract the common sub flows and make it a use case Extend: An extend relationship shows optional behavior (Optional) Extend relation results usually when we add a bit more specialized feature to the already existing one, we say the use case B extends its functionality to use case A. A system boundary rectangle separates the system from the external actors. An extend relationship indicates that one use case is a variation of another. Extend notation is a dotted line, labeled <<extend>>, and with an arrow toward the base case. The extension points, which determines when the extended case is appropriate, is written inside the base case. Class Diagrams: Class diagram shows the existence of classes and their relationships in the structural view of a system. UML modeling elements in class diagram: o Classes and their structure and behavior o Relationships  Association  Aggregation  Composition  Dependency  Generalization / Specialization (inheritance relationships)  Multiplicity and navigation indicators  Role names A class describes properties and behavior of a type of object. o Classes are found by examining the objects in sequence and collaboration diagram o A class is drawn as a rectangle with three compartments o Classes should be named using the vocabulary of the domain Naming standards should be created e.g., all classes are singular nouns starting with a capital letter
  • 31. 31 o The behavior of a class is represented by its operations. Operations may be found by examining interaction diagrams o The structure of a class is represented by its attributes. Attributes may be found by examining class definitions, the problem requirements, and by applying domain knowledge Notation: Class information: visibility and scope The class notation is a 3-piece rectangle with the class name, attributes, and operations. Attributes and operations can be labeled according to access and scope. Relationship: Association: o Association represents the physical or conceptual connection between two or more objects o An association is a bi-directional connection between classes o An association is shown as a line connecting the related classes Aggregation: o An aggregation is a stronger form of relationship where the relationship is between a whole and its parts o It is entirely conceptual and does nothing more than distinguishing a ‘whole’ from the ‘part’. o It doesn’t link the lifetime of the whole and its parts
  • 32. 32 o An aggregation is shown as a line connecting the related classes with a diamond next to the class representing the whole Composition: Composition is a form of aggregation with strong ownership and coincident lifetime as the part of the whole. Multiplicity o Multiplicity defines how many objects participate in relationships o Multiplicity is the number of instances of one class related to ONE instance of the other class This table gives the most common multiplicities For each association and aggregation, there are two multiplicity decisions to make: one for each end of the relationship Although associations and aggregations are bi-directional by default, it is often desirable to restrict navigation to one direction If navigation is restricted, an arrowhead is added to indicate the direction of the navigation
  • 33. 33 Dependency: o A dependency relationship is a weaker form of relationship showing a relationship between a client and a supplier where the client does not have semantic knowledge of the supplier. o A dependency is shown as a dashed line pointing from the client to the supplier. Generalization / Specialization (inheritance relationships): o It is a relationship between a general thing (called the super class or the parent) and a more specific kind of that thing (called the subclass(es) or the child). o An association class is an association that also has class properties (or a class has association properties) o A constraint is a semantic relationship among model elements that specifies conditions and propositions that must be maintained as true: otherwise, the system described by the model is invalid.
  • 34. 34 o An interface is a specifier for the externally visible operations of a class without specification of internal structure. An interface is formally equivalent to abstract class with no attributes and methods, only abstract operations. o A qualifier is an attribute or set of attributes whose values serve to partition the set of instances associated with an instance across an association. Interaction Diagrams: o Interaction diagrams is used to model the dynamic behavior of the system o Interaction diagram helps us to identify the classes and its methods o Interaction diagrams describe how use cases are realized as interactions among objects o Show classes, objects, actors and messages between them to achieve the functionality of a Use Case There are two types of interaction diagrams: 1. Sequence Diagram 2. Collaboration Diagram
  • 35. 35 1. Sequence Diagram: A sequence diagram simply depicts interaction between objects in a sequential order i.e. the order in which these interactions take place. We can also use the terms event diagrams or event scenarios to refer to a sequence diagram. Sequence diagrams describe how and in what order the objects in a system function. These diagrams are widely used by businessmen and software developers to document and understand requirements for new and existing systems. Sequence Diagram Notations: 1. Actors – An actor in a UML diagram represents a type of role where it interacts with the system and its objects. It is important to note here that an actor is always outside the scope of the system we aim to model using the UML diagram. We use actors to depict various roles including human users and other external subjects. We represent an actor in a UML diagram using a stick person notation. We can have multiple actors in a sequence diagram. For example – Here the user in seat reservation system is shown as an actor where it exists outside the system and is not a part of the system. 2. Lifelines – A lifeline is a named element which depicts an individual participant in a sequence diagram. So basically, each instance in a sequence diagram is represented by a lifeline. Lifeline elements are located at the top in a sequence diagram. The standard in UML for naming a lifeline follows the following format – Instance Name: Class
  • 36. 36 Name 3. Messages – Communication between objects is depicted using messages. The messages appear in a sequential order on the lifeline. We represent messages using arrows. Lifelines and messages form the core of a sequence diagram. Messages can be broadly classified into the following  Synchronous messages – A synchronous message waits for a reply before the interaction can move forward. The sender waits until the receiver has completed the processing of the message. The caller continues only when it knows that the
  • 37. 37 receiver has processed the previous message i.e. it receives a reply message. A large number of calls in object-oriented programming are synchronous. We use a solid arrow head to represent a synchronous message.  Asynchronous Messages – An asynchronous message does not wait for a reply from the receiver. The interaction moves forward irrespective of the receiver processing the previous message or not. We use a lined arrow head to represent an asynchronous message.  Create message – We use a Create message to instantiate a new object in the sequence diagram. There are situations when a particular message call requires the creation of an object. It is represented with a dotted arrow and create word labelled on it to specify that it is the create Message symbol. For example – The creation of a new order on a e-commerce website would require a new object of Order class to be created.  Delete Message – We use a Delete Message to delete an object. When an object is deallocated memory or is destroyed within the system, we use the Delete Message symbol. It destroys the occurrence of the object in the system. It is represented by an arrow terminating with a x. For example – In the scenario below when the order is received by the user, the object of order class can be destroyed.  . Self-Message – Certain scenarios might arise where the object needs to send a message to itself. Such messages are called Self Messages and are represented with a U-shaped arrow  Reply Message – Reply messages are used to show the message being sent from the receiver to the sender. We represent a return/reply message using an open
  • 38. 38 arrowhead with a dotted line. The interaction moves forward only when a reply message is sent by the receiver. 2. Collaboration diagram: It shows the interaction of the objects and also the group of all messages sent or received by an object. This allows us to see the complete set of services that an object must provide. The following collaboration diagram realizes a scenario of reserving a copy of book in a library  Difference between the Sequence Diagram and Collaboration Diagram o Sequence diagrams emphasize the temporal aspect of a scenario - they focus on time. o Collaboration diagrams emphasize the spatial aspect of a scenario - they focus on how objects are linked. Activity Diagram: An activity diagram is essentially a fancy flowchart. Activity diagrams. An activity diagram focuses on the flow of activities involved in a single process. The activity diagram shows how those activities depend on one another. 1. initial State – The starting state before an activity takes place is depicted using the initial state. 2. Action or Activity State – An activity represents execution of an action on objects or by objects. We represent an activity using a rectangle with rounded corners. Basically any action or event that takes place is represented using an activity. 3. Action Flow or Control flows – Action flows or Control flows are also referred to as paths and edges. They are used to show the transition from one activity state to another.
  • 39. 39 4. Decision node and Branching – When we need to make a decision before deciding the flow of control, we use the decision node. 5. Guards – A Guard refers to a statement written next to a decision node on an arrow sometimes within square brackets. 6.Fork – Fork nodes are used to support concurrent activities.
  • 40. 40 Join – Join nodes are used to support concurrent activities converging into one. For join notations we have two or more incoming edges and one outgoing edge. 6. Merge or Merge Event – Scenarios arise when activities which are not being executed concurrently have to be merged. We use the merge notation for such scenarios. We can merge two or more activities into one if the control proceeds onto the next activity irrespective of the path chosen. 7. Swim lanes – We use swim lanes for grouping related activities in one column. Swim lanes group related activities into one column or one row. Swim lanes can be vertical and horizontal. Swim lanes are used to add modularity to the activity diagram. It is not mandatory to use swim lanes. They usually give more clarity to the activity diagram. It’s similar to creating a function in a program. It’s not mandatory to do so, but it is a recommended practice.
  • 41. 41 Component Diagram: Describe organization and dependency between the software implementation components. Components are distributable physical units - e.g. source code, object code. Deployment Diagram: Describe the configuration of processing resource elements and the mapping of software implementation components onto them. Contain components - e.g. object code, source code and nodes (e.g. printer, database, client machine). 4.2 UML Diagram
  • 42. 42 4.2.1 Class diagram Diagram:1 class diagram for user and admin
  • 43. 43 4.2.2 Use case diagram Diagram:2 use case diagram for admin
  • 44. 44 4.2.3 Activity diagram Diagram:3 Activity diagram for admin
  • 45. 45 4.2.4 Sequence diagram Diagram:4 sequence diagram for admin
  • 46. 46 4.2.5 Collaboration diagram Diagram:5 collaboration diagram for admin 4.2.6 Component diagram
  • 48. 48 4.2.7 Deployment diagram Diagram:7 deployment diagram for local server
  • 49. 49 Chapter-5 IMPLEMENTATION METHODS & RESULTS 5.1 Introduction: We implement our project using Anaconda 3 and Jupyter. Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. Package versions are managed by the package management system conda. The Anaconda distribution includes data-science packages suitable for Windows, Linux, and macOS. It comes with 1,500 packages selected from PyPI as well as the conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command line interface (CLI). Conda analyzes your current environment, everything you have installed and any version limitations you specify (e.g. you only want tensorflow >= 2.0), and figures out how to install compatible dependencies; or it will tell you that what you want cannot be done. pip, by contrast, will just install the package you specify and any dependencies, even if that breaks other packages. Conda allows users to easily install different versions of binary software packages and any required libraries appropriate for their computing platform. It also allows users to switch between package versions and to download and install updates from a software repository. Conda is written in the Python programming language, but can manage projects containing code written in any language (e.g., R), including multi-language projects. Conda can install Python itself, while similar Python-based cross-platform package managers (such as wheel or pip) cannot. A popular conda channel for bioinformatics software is Bioconda, which provides multiple software distributions for computational biology. The conda package and environment manager is included in all versions of Anaconda, Miniconda and Anaconda Repository. The big difference between conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason conda exists. When pip installs a package, it automatically installs any dependent Python packages without checking whether these conflict with previously installed packages; it will install a package and any of its dependencies regardless of the state of the existing installation. Because of this, a user with a working installation of, for example, Google TensorFlow can find that it stops working after using pip to install a different package that requires a different version of the dependent numpy library than the one used by TensorFlow. In some cases, the package may appear to work but produce different results in detail.
  • 50. 50 5.2 Explanation of Key functions 1. Importing and Merging Data Python code in one module gains access to the code in another module by the process of importing it. The import statement is the most common way of invoking the import machinery, but it is not the only way: functions such as importlib.import_module() and the built-in __import__() can also be used. The import statement combines two operations; it searches for the named module, then it binds the results of that search to a name in the local scope. The search operation of the import statement is defined as a call to the __import__() function, with the appropriate arguments, and the return value of that call is used to perform the name binding operation of the import statement. See the import statement documentation for the exact details of that name binding operation. Contrary to when you merge new cases, merging in new variables requires the IDs for each case in the two files to be the same, but the variable names should be different. In this scenario, which is sometimes referred to as augmenting your data (or in SQL, "joins") or merging data by columns (i.e. you're adding new columns of data to each row), you're adding in new variables with information for each existing case in your data file. As with merging new cases where not all variables are present, the same thing applies if you merge in new variables where some cases are missing – these should simply be given blank values. Fig-2 Merging data sets 2. Data volume Since we don't know in advance which features could be useful to predict churn, we had to work on all the data that reflect customer behavior in general. We used data sets related to
  • 51. 51 calls, SMS, MMS, and the internet with all related information like complaints, network data, IMEI, charging, and other. The data contained transactions for all customers during nine months before the prediction baseline. The size of this data was more than 70 Terabyte, and we couldn’t perform the needed feature engineering phase using traditional databases. 3.Data variety The data used in this research is collected from multiple systems and databases. Each source generates the data in a different type of files as structured, semi-structured (XML-JSON) or unstructured (CSV-Text). Dealing with these kinds of data types is very hard without big data platform since we can work on all the previous data types without making any modification or transformation. By using the big data platform, we no longer have any problem with the size of these data or the format in which the data are represented. 4.Unbalanced data set The generated data set was unbalanced since it is a special case of the classification problem where the distribution of a class is not usually homogeneous with other classes. The dominant class is called the basic class, and the other is called the secondary class. The data set is unbalanced if one of its categories is 10% or less compared to the other one [18]. Although machine learning algorithms are usually designed to improve accuracy by reducing error, not all of them take into account the class balance, and that may give bad results [18]. In general, classes are considered to be balanced in order to be given the same importance in training. We found that Syria Tel data set was unbalanced since the percentage of the secondary class that represents churn customers is about 5% of the whole data set. 5.Extensive features The collected data was full of columns, since there is a column for each service, product, and offer related to calls, SMS, MMS, and internet, in addition to columns related to personnel and demographic information. If we need to use all these data sources the number of columns for each customer before the data being processed will exceed ten thousand columns. 6.Missing values There is a representation of each service and product for each customer. Missing values may occur because not all customers have the same subscription. Some of them may have a number of services and others may have something different. In addition, there are some columns related to system configurations and these columns have only null value for all customers.
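To make the class-balance check and missing-value handling described above concrete, the short pandas sketch below uses a small hypothetical frame; the column names and values are illustrative only, not the project's actual data set.

import numpy as np
import pandas as pd

# hypothetical customer-level data with a churn label and a few service columns
df = pd.DataFrame({
    "customerID": ["C1", "C2", "C3", "C4"],
    "Churn": [0, 1, 0, 0],
    "TotalCharges": [29.85, np.nan, 108.15, 840.10],
    "MMS_count": [np.nan, 4.0, 7.0, 2.0],
})

# class balance: the churn class is the minority ("secondary") class
print(df["Churn"].value_counts(normalize=True))

# count and percentage of missing values per column
print(df.isnull().sum())
print(round(100 * df.isnull().sum() / len(df), 2))

# drop rows missing a key numeric field, fill the remaining gaps with 0
df = df[~df["TotalCharges"].isnull()]
df["MMS_count"] = df["MMS_count"].fillna(0)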
  • 52. 52 Fig-3 Inputting Missing values 7. Checking the Churn Rate The most basic way you can calculate churn for a given month is to check how many customers you have at the start of the month and how many of those customers leave by the end of the month. Once you have both numbers, divide the number of customers that left by the number of customers at the start of the month. 8. Splitting Data into Training and Test Sets As mentioned before, the data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data in order to generalize to other data later on. We keep a separate test data set (or subset) in order to test the model's predictions on data it has not seen. Fig-4 Dividing the data into training and testing sets 9. Correlation Matrix A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. Typically, a correlation matrix is "square", with the same variables shown in the rows and columns. An example is shown below; it shows correlations between the stated importance of various things to people. The line of 1.00s going from the top left to the bottom right is the main diagonal, which shows that each variable always perfectly
  • 53. 53 correlates with itself. The matrix is symmetrical, with the correlations shown above the main diagonal being a mirror image of those below the main diagonal. Fig-5 Correlation Matrix 10. Feature Selection Using RFE Recursive Feature Elimination (RFE) is a feature-selection technique that works backwards from the full set of predictors: an estimator is fitted, the features are ranked by importance (for example by the size of their coefficients), the weakest features are removed, and the process is repeated until only the desired number of features remains. In this project, RFE is run with a logistic regression estimator to retain the 13 most informative variables, and the selected columns are then used to build the final churn model. This is useful here because the engineered data set contains many dummy variables, and reducing them keeps the model simpler and less prone to overfitting. 11. Making Predictions Once the model has been trained on the training set, it is used to make predictions on the unseen test set. For churn prediction, the logistic regression model outputs, for each customer, the probability of churning (Churn_Prob). A probability cutoff is then applied to convert these probabilities into class labels: customers whose predicted probability exceeds the cutoff are labelled as likely churners (1) and the rest as non-churners (0). An initial cutoff of 0.5 is used, and later an optimal cutoff (around 0.3) is chosen by comparing accuracy, sensitivity and specificity across a range of thresholds.
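Tying points 8, 10 and 11 together, the following is a minimal, self-contained sketch of splitting the data, selecting features with RFE and converting predicted churn probabilities into labels. It uses toy data from make_classification and an illustrative feature count and cutoff; it mirrors, but is not identical to, the project code listed in section 5.3.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# toy, imbalanced stand-in for the engineered churn features (not the project data)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=0)

# 70/30 train-test split, as used in the project code
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)

# recursive feature elimination: keep the 13 strongest features for logistic regression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=13)
rfe.fit(X_train, y_train)
X_train_sel, X_test_sel = rfe.transform(X_train), rfe.transform(X_test)

# fit on the selected features and convert churn probabilities to 0/1 with a cutoff
model = LogisticRegression(max_iter=1000).fit(X_train_sel, y_train)
churn_prob = model.predict_proba(X_test_sel)[:, 1]
predicted = (churn_prob > 0.5).astype(int)  # cutoff later tuned (e.g. 0.3) on sensitivity/specificity
print(predicted[:10])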
  • 54. 54 12. Model Evaluation Model evaluation is an integral part of the model development process. It helps to find the model that best represents our data and to estimate how well the chosen model will work in the future. To avoid overfitting, evaluation methods such as hold-out validation and cross-validation use a test set (not seen by the model) to evaluate model performance. 13. ROC Curve A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as the probability of false alarm and can be calculated as (1 − specificity). The ROC curve can also be thought of as a plot of the power as a function of the Type I error of the decision rule (when the performance is calculated from just a sample of the population, it can be thought of as an estimator of these quantities). The ROC curve is thus the sensitivity or recall as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (the area under the probability distribution from minus infinity to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis. Histogram: A histogram is a display of statistical information that uses rectangles to show the frequency of data items in successive numerical intervals of equal size. In the most common form of histogram, the independent variable is plotted along the horizontal axis and the dependent variable is plotted along the vertical axis. The data appear as colored or shaded rectangles of variable area. The illustration below is a histogram showing the results of a final exam given to a hypothetical class of students. Each score range is denoted by a bar of a certain color. If this histogram were compared with those of classes from other years that received the same test from the same professor, conclusions might be drawn about intelligence changes among students over the years. Conclusions might also be drawn concerning the improvement or decline of the professor's teaching ability with the passage of time. If this histogram were compared with those of other classes in the same semester who had received the same final exam but who had taken the course from different professors, one might draw conclusions about the relative competence of the professors.
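As a small illustration of points 12 and 13 (and of plotting a histogram of predicted probabilities), the sketch below scores hypothetical predictions with scikit-learn and matplotlib. The labels and scores are randomly generated stand-ins introduced only for this example, not project results.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                            # hypothetical churn labels
y_score = np.clip(0.35 * y_true + 0.65 * rng.random(200), 0, 1)  # hypothetical churn probabilities

# histogram of predicted churn probabilities
plt.hist(y_score, bins=10)
plt.xlabel("Predicted churn probability")
plt.ylabel("Number of customers")
plt.show()

# ROC curve: true positive rate vs false positive rate across thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label="ROC curve (AUC = %.2f)" % roc_auc_score(y_true, y_score))
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()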
  • 55. 55 Fig-6 Histogram Some histograms are presented with the independent variable along the vertical axis and the dependent variable along the horizontal axis. That format is less common than the one shown here. 5.3 Screenshots or output screens 1.Code # In[1]: import pandas as pd import numpy as np # In[2]: churn_data = pd.read_csv("OneDrive/Desktop/project datasets/churn_data.csv") customer_data = pd.read_csv("OneDrive/Desktop/project datasets/customer_data.csv") internet_data = pd.read_csv("OneDrive/Desktop/project datasets/internet_data.csv") # In[3]:
  • 56. 56 df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID') # In[4]: telecom = pd.merge(df_1, internet_data, how='inner', on='customerID') # In[5]: telecom.head() # In[6]: telecom.describe() # In[7]: telecom.info() # In[8]: telecom['PhoneService'] = telecom['PhoneService'].map({'Yes': 1, 'No': 0}) telecom['PaperlessBilling'] = telecom['PaperlessBilling'].map({'Yes': 1, 'No': 0}) telecom['Churn'] = telecom['Churn'].map({'Yes': 1, 'No': 0}) telecom['Partner'] = telecom['Partner'].map({'Yes': 1, 'No': 0}) telecom['Dependents'] = telecom['Dependents'].map({'Yes': 1, 'No': 0}) # In[9]: # Creating a dummy variable for the variable 'Contract' and dropping the first one. cont = pd.get_dummies(telecom['Contract'],prefix='Contract',drop_first=True) #Adding the results to the master dataframe telecom = pd.concat([telecom,cont],axis=1) # Creating a dummy variable for the variable 'PaymentMethod' and dropping the first one. pm = pd.get_dummies(telecom['PaymentMethod'],prefix='PaymentMethod',drop_first=True) #Adding the results to the master dataframe telecom = pd.concat([telecom,pm],axis=1)
  • 57. 57 # Creating a dummy variable for the variable 'gender' and dropping the first one. gen = pd.get_dummies(telecom['gender'],prefix='gender',drop_first=True) #Adding the results to the master dataframe telecom = pd.concat([telecom,gen],axis=1) # Creating a dummy variable for the variable 'MultipleLines' and dropping the first one. ml = pd.get_dummies(telecom['MultipleLines'],prefix='MultipleLines') # dropping MultipleLines_No phone service column ml1 = ml.drop(['MultipleLines_No phone service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,ml1],axis=1) # Creating a dummy variable for the variable 'InternetService' and dropping the first one. iser = pd.get_dummies(telecom['InternetService'],prefix='InternetService',drop_first=True) #Adding the results to the master dataframe telecom = pd.concat([telecom,iser],axis=1) # Creating a dummy variable for the variable 'OnlineSecurity'. os = pd.get_dummies(telecom['OnlineSecurity'],prefix='OnlineSecurity') os1= os.drop(['OnlineSecurity_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,os1],axis=1) # Creating a dummy variable for the variable 'OnlineBackup'. ob =pd.get_dummies(telecom['OnlineBackup'],prefix='OnlineBackup') ob1 =ob.drop(['OnlineBackup_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,ob1],axis=1) # Creating a dummy variable for the variable 'DeviceProtection'. dp =pd.get_dummies(telecom['DeviceProtection'],prefix='DeviceProtection') dp1 = dp.drop(['DeviceProtection_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,dp1],axis=1) # Creating a dummy variable for the variable 'TechSupport'. ts =pd.get_dummies(telecom['TechSupport'],prefix='TechSupport') ts1 = ts.drop(['TechSupport_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,ts1],axis=1) # Creating a dummy variable for the variable 'StreamingTV'. st =pd.get_dummies(telecom['StreamingTV'],prefix='StreamingTV') st1 = st.drop(['StreamingTV_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,st1],axis=1) # Creating a dummy variable for the variable 'StreamingMovies'. sm =pd.get_dummies(telecom['StreamingMovies'],prefix='StreamingMovies') sm1 = sm.drop(['StreamingMovies_No internet service'],1) #Adding the results to the master dataframe
  • 58. 58 telecom = pd.concat([telecom,sm1],axis=1) # In[10]: telecom['MultipleLines'].value_counts() # In[11]: telecom = telecom.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'], 1) # In[12]: telecom['TotalCharges'] = pd.to_numeric(telecom['TotalCharges'],errors='coerce') telecom['tenure'] = telecom['tenure'].astype(float).astype(int) # In[13]: telecom.info() # In[14]: num_telecom = telecom[['tenure','MonthlyCharges','SeniorCitizen','TotalCharges']] # In[15]: num_telecom.describe(percentiles=[.25,.5,.75,.90,.95,.99]) # In[16]: telecom.isnull().sum() # In[17]:
  • 59. 59 round(100*(telecom.isnull().sum()/len(telecom.index)), 2) # In[18]: telecom = telecom[~np.isnan(telecom['TotalCharges'])] # In[19]: round(100*(telecom.isnull().sum()/len(telecom.index)), 2) # In[20]: df = telecom[['tenure','MonthlyCharges','TotalCharges']] # In[21]: normalized_df=(df-df.mean())/df.std() # In[22]: telecom = telecom.drop(['tenure','MonthlyCharges','TotalCharges'], 1) # In[23]: telecom = pd.concat([telecom,normalized_df],axis=1) # In[24]: telecom # In[25]:
  • 60. 60 churn = (sum(telecom['Churn'])/len(telecom['Churn'].index))*100 # In[26]: telecom # In[27]: churn # In[28]: import pandas as pd import matplotlib.pyplot as plt from sklearn.naive_bayes import GaussianNB import numpy as np from sklearn.metrics import mean_squared_error # In[29]: features = telecom.drop(['Churn','customerID'],axis=1).values target = telecom['Churn'].values # In[30]: features_train,target_train=features[20:30],target[20:30] features_test,target_test=features[40:50],target[40:50] print(features_test) # In[31]: model = GaussianNB() model.fit(features_train,target_train) print(model.predict(features_test)) # In[32]:
  • 61. 61 from sklearn.model_selection import train_test_split # In[33]: # Putting feature variable to X X = telecom.drop(['Churn','customerID'],axis=1) # Putting response variable to y y = telecom['Churn'] # In[34]: y.head() # In[35]: # Splitting the data into train and test X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=100) # In[36]: import statsmodels.api as sm # In[37]: # Logistic regression model logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial()) logm1.fit().summary() # In[38]: # Importing matplotlib and seaborn import matplotlib.pyplot as plt import seaborn as sns get_ipython().run_line_magic('matplotlib', 'inline')
  • 62. 62 # In[39]: # Let's see the correlation matrix plt.figure(figsize = (20,10)) # Size of the figure sns.heatmap(telecom.corr(),annot = True) # In[40]: X_test2 = X_test.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No','DeviceProtection _No','TechSupport_No','StreamingTV_No','StreamingMovies_No'],1) X_train2 = X_train.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No','DeviceProtectio n_No','TechSupport_No','StreamingTV_No','StreamingMovies_No'],1) # In[41]: plt.figure(figsize = (20,10)) sns.heatmap(X_train2.corr(),annot = True) # In[42]: logm2 = sm.GLM(y_train,(sm.add_constant(X_train2)), family = sm.families.Binomial()) logm2.fit().summary() # In[43]: from sklearn.linear_model import LogisticRegression logreg = LogisticRegression() from sklearn.feature_selection import RFE rfe = RFE(logreg, 13) # running RFE with 13 variables as output rfe = rfe.fit(X,y) print(rfe.support_) # Printing the boolean results print(rfe.ranking_) # Printing the ranking # In[44]: col = ['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year',
  • 63. 63 'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges'] # In[45]: # Let's run the model using the selected variables from sklearn.linear_model import LogisticRegression from sklearn import metrics logsk = LogisticRegression() logsk.fit(X_train[col], y_train) # In[46]: #Comparing the model with StatsModels logm4 = sm.GLM(y_train,(sm.add_constant(X_train[col])), family = sm.families.Binomial()) logm4.fit().summary() # In[47]: # UDF for calculating vif value def vif_cal(input_data, dependent_col): vif_df = pd.DataFrame( columns = ['Var', 'Vif']) x_vars=input_data.drop([dependent_col], axis=1) xvar_names=x_vars.columns for i in range(0,xvar_names.shape[0]): y=x_vars[xvar_names[i]] x=x_vars[xvar_names.drop(xvar_names[i])] rsq=sm.OLS(y,x).fit().rsquared vif=round(1/(1-rsq),2) vif_df.loc[i] = [xvar_names[i], vif] return vif_df.sort_values(by = 'Vif', axis=0, ascending=False, inplace=False) # In[48]: telecom.columns ['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year', 'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
  • 64. 64 # In[49]: # Calculating Vif value vif_cal(input_data=telecom.drop(['customerID','SeniorCitizen', 'Partner', 'Dependents', 'PaymentMethod_Credit card (automatic)','PaymentMethod_Mailed check', 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No', 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes', 'TechSupport_No','StreamingTV_No','StreamingTV_Yes','StreamingMovies_Yes', 'MonthlyCharges'], axis=1), dependent_col='Churn') # In[50]: col = ['PaperlessBilling', 'Contract_One year', 'Contract_Two year', 'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges'] # In[51]: logm5 = sm.GLM(y_train,(sm.add_constant(X_train[col])), family = sm.families.Binomial()) logm5.fit().summary() # In[52]: # Calculating Vif value vif_cal(input_data=telecom.drop(['customerID','PhoneService','SeniorCitizen', 'Partner', 'Dependents', 'PaymentMethod_Credit card (automatic)','PaymentMethod_Mailed check', 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No', 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes', 'TechSupport_No','StreamingTV_No','StreamingTV_Yes','StreamingMovies_Yes', 'MonthlyCharges'], axis=1), dependent_col='Churn') # In[53]:
  • 65. 65 # Let's run the model using the selected variables from sklearn.linear_model import LogisticRegression from sklearn import metrics logsk = LogisticRegression() logsk.fit(X_train[col], y_train) # In[54]: y_pred = logsk.predict_proba(X_test[col]) # In[55]: y_pred_df = pd.DataFrame(y_pred) # In[56]: y_pred_1 = y_pred_df.iloc[:,[1]] # In[57]: y_pred_1.head() # In[58]: y_test_df = pd.DataFrame(y_test) # In[59]: y_test_df = pd.DataFrame(y_test) # In[60]: # Removing index for both dataframes to append them side by side y_pred_1.reset_index(drop=True, inplace=True)
  • 66. 66 y_test_df.reset_index(drop=True, inplace=True) # In[61]: y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1) # In[62]: y_pred_final= y_pred_final.rename(columns={ 1 : 'Churn_Prob'}) # In[63]: # Rearranging the columns y_pred_final = y_pred_final.reindex(['CustID','Churn','Churn_Prob'], axis=1) # In[64]: y_pred_final.head() # In[65]: # Creating new column 'predicted' with 1 if Churn_Prob>0.5 else 0 y_pred_final['predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.5 else 0) # In[66]: y_pred_final.head() # In[67]: from sklearn import metrics # In[68]:
  • 67. 67 help(metrics.confusion_matrix) # In[69]: # Confusion matrix confusion = metrics.confusion_matrix( y_pred_final.Churn, y_pred_final.predicted ) confusion # In[70]: metrics.accuracy_score( y_pred_final.Churn, y_pred_final.predicted) # In[71]: TP = confusion[1,1] # true positive TN = confusion[0,0] # true negatives FP = confusion[0,1] # false positives FN = confusion[1,0] # false negatives # In[72]: TP / float(TP+FN) # In[73]: TN / float(TN+FP) # In[74]: # Calculate false postive rate - predicting churn when customer does not have churned print(FP/ float(TN+FP)) # In[75]: # positive predictive value print (TP / float(TP+FP))
  • 68. 68 # In[76]: # Negative predictive value print (TN / float(TN+ FN)) # In[77]: def draw_roc( actual, probs ): fpr, tpr, thresholds = metrics.roc_curve( actual, probs, drop_intermediate = False ) auc_score = metrics.roc_auc_score( actual, probs ) plt.figure(figsize=(6, 4)) plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score ) plt.plot([0, 1], [0, 1], 'k--') plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.05]) plt.xlabel('False Positive Rate or [1 - True Negative Rate]') plt.ylabel('True Positive Rate') plt.title('Receiver operating characteristic example') plt.legend(loc="lower right") plt.show() return fpr, tpr, thresholds # In[78]: draw_roc(y_pred_final.Churn, y_pred_final.predicted) # In[79]: # Let's create columns with different probability cutoffs numbers = [float(x)/10 for x in range(10)] for i in numbers: y_pred_final[i]= y_pred_final.Churn_Prob.map( lambda x: 1 if x > i else 0) y_pred_final.head() # In[80]: # Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
  • 69. 69 cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci']) from sklearn.metrics import confusion_matrix # TP = confusion[1,1] # true positive # TN = confusion[0,0] # true negatives # FP = confusion[0,1] # false positives # FN = confusion[1,0] # false negatives num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9] for i in num: cm1 = metrics.confusion_matrix( y_pred_final.Churn, y_pred_final[i] ) total1=sum(sum(cm1)) accuracy = (cm1[0,0]+cm1[1,1])/total1 speci = cm1[0,0]/(cm1[0,0]+cm1[0,1]) sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1]) cutoff_df.loc[i] =[ i ,accuracy,sensi,speci] print(cutoff_df) # Let's plot accuracy sensitivity and specificity for various probabilities. cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci']) y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.3 else 0) y_pred_final.head() #Let's check the overall accuracy. metrics.accuracy_score( y_pred_final.Churn, y_pred_final.final_predicted) FinalChurn=telecom.filter(['customerID','Churn']) total_rows = FinalChurn[FinalChurn.Churn == 1] total_rows value = len(total_rows) print("number of people who are correctly using network is"+str(value))
  • 73. 73 Fig-6 Missing Values Fig-7 Converting the data sets into 0 and 1
  • 74. 74 Fig-8 Naive Bayesian Fig -9 Optimal cut-off point
  • 76. 76 Fig-12 Customers who stay Fig-13 Customers who leave
  • 77. 77 Chapter-6 TESTING & VALIDATION 6.1 Introduction: Functional vs. Non-functional Testing The goal of utilizing numerous testing methodologies in your development process is to make sure your software can successfully operate in multiple environments and across different platforms. These can typically be broken down between functional and non-functional testing. Functional testing involves testing the application against the business requirements. It incorporates all test types designed to guarantee each part of a piece of software behaves as expected, by using use cases provided by the design team or business analyst. These testing methods are usually conducted in order and include:  Unit testing  Integration testing  System testing  Acceptance testing Non-functional testing methods incorporate all test types focused on the operational aspects of a piece of software. These include:  Performance testing  Security testing  Usability testing  Compatibility testing The key to releasing high-quality software that can be easily adopted by your end users is to build a robust testing framework that implements both functional and non-functional software testing methodologies. Unit Testing Unit testing is the first level of testing and is often performed by the developers themselves. It is the process of ensuring that individual components of a piece of software are functional at the code level and work as they were designed to. Developers in a test-driven environment will typically write and run the tests prior to the software or feature being passed over to the test team. Unit testing can be conducted manually, but automating the process will speed up delivery cycles and expand test coverage. Unit testing will also make debugging easier, because finding issues earlier means they take less time to fix than if they were discovered later in the testing process. TestLeft is a tool that allows advanced testers and developers to shift left with the fastest test automation tool embedded in any IDE.
  • 78. 78 Fig-7 Unit Testing Life cycle Unit Testing Techniques:  White Box Testing - used to test each function's behavior  Black Box Testing - using which the user interface, inputs and outputs are tested  Gray Box Testing - used to execute tests, risks and assessment methods. White box testing: The box testing approach of software testing consists of black box testing and white box testing. We are discussing here white box testing, which is also known as glass box testing, structural testing, clear box testing, open box testing and transparent box testing. It tests the internal coding and infrastructure of the software, focusing on checking predefined inputs against expected and desired outputs. It is based on the inner workings of an application and revolves around internal structure testing. In this type of testing, programming skills are required to design test cases. The primary goal of white box testing is to focus on the flow of inputs and outputs through the software and on strengthening the security of the software.
  • 79. 79 The term 'white box' is used because of the internal perspective of the system. The clear box or white box or transparent box name denote the ability to see through the software's outer shell into its inner workings. Test cases for white box testing are derived from the design phase of the software development lifecycle. Data flow testing, control flow testing, path testing, branch testing, statement and decision coverage all these techniques used by white box testing as a guideline to create an error-free software. Fig-8 White Box Testing White box testing follows some working steps to make testing manageable and easy to understand what the next task to do. There are some basic steps to perform white box testing. Generic steps of white box testing: o Design all test scenarios, test cases and prioritize them according to high priority number. o This step involves the study of code at runtime to examine the resource utilization, not accessed areas of the code, time taken by various methods and operations and so on. o In this step testing of internal subroutines takes place. Internal subroutines such as nonpublic methods, interfaces are able to handle all types of data appropriately or not. o This step focuses on testing of control statements like loops and conditional statements to check the efficiency and accuracy for different data inputs. o In the last step white box testing includes security testing to check all possible security loopholes by looking at how the code handles security Reasoning of white box testing: o It identifies internal security holes. o To check the way of input inside the code. o Check the functionality of conditional loops. o To test function, object, and statement at an individual level.
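To connect this to the project's language, the snippet below is a small white-box-style unit test written with pytest: the tests exercise both branches of the function directly. The churn_rate helper is a hypothetical function introduced only for illustration and is not part of the project code.

import pytest

def churn_rate(customers_at_start, customers_lost):
    """Hypothetical helper: monthly churn rate as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start

# white-box unit tests (run with `pytest`): one per branch of the helper
def test_churn_rate_normal_case():
    assert churn_rate(200, 10) == 5.0

def test_churn_rate_rejects_bad_input():
    with pytest.raises(ValueError):
        churn_rate(0, 10)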
  • 80. 80 Advantages of white box testing: o White box testing optimizes code so hidden errors can be identified. o Test cases of white box testing can be easily automated. o This testing is more thorough than other testing approaches as it covers all code paths. o It can be started in the SDLC phase even without GUI. Disadvantages of white box testing: o White box testing is too much time consuming when it comes to large-scale programming applications. o White box testing is much expensive and complex. o It can lead to production error because it is not detailed by the developers. o White box testing needs professional programmers who have a detailed knowledge and understanding of programming language and implementation. Black box testing: Black box testing is a technique of software testing which examines the functionality of software without peering into its internal structure or coding. The primary source of black box testing is a specification of requirements that is stated by the customer. In this method, tester selects a function and gives input value to examine its functionality, and checks whether the function is giving expected output or not. If the function produces correct output, then it is passed in testing, otherwise failed. The test team reports the result to the development team and then tests the next function. After completing testing of all functions if there are severe problems, then it is given back to the development team for correction. Fig-9 Black Box Testing Generic step of black box testing: