A MAJOR PROJECT REPORT ON
“TELECOM CHURN BASED ON ML”
In partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
Submitted by
Hemanth Pasula (16911A1246), D Vedanth (16911A1210), G.Yashwanth venkat Samrat (16911A1217)
Under the Esteemed Guidance of
M.Suresh Babu
Assistant Professor
DEPARTMENT OF INFORMATION TECHNOLOGY
VIDYA JYOTHI INSTITUTE OF TECHNOLOGY
(Accredited by NBA, Approved by AICTE, Autonomous VJIT Hyderabad)
Aziz Nagar Gate, C.B.Post, Chilkur Road, Hyderabad – 500075
2019 - 2020
VIDYA JYOTHI INSTITUTE OF TECHNOLOGY
(Accredited by NBA, Approved by AICTE, Autonomous VJIT Hyderabad)
Aziz Nagar Gate, C.B.Post, Chilkur Road, Hyderabad - 500075
DEPARTMENT OF INFORMATION TECHNOLOGY
CERTIFICATE
This is to certify that the Project Report on “TELECOM CHURN BASED ON ML” is a
bonafide work by D Vedanth(16911A1210), Hemanth Pasula(16911A1246),
G.Yashwanth venkat Samrat (16911A1217) in partial fulfillment of the requirement for
the award of the degree of Bachelor of Technology in “INFORMATION
TECHNOLOGY” VJIT Hyderabad during the year 2019 - 2020.
Project Guide: M.Suresh Babu, M.Tech, Assistant Professor
Head of the Department: Mr. B. Srinivasulu, M.E., Professor
External Examiner
VIDYA JYOTHI INSTITUTE OF TECHNOLOGY
(Accredited by NBA, Approved by AICTE, Autonomous VJIT Hyderabad)
Aziz Nagar Gate, C.B.Post, Chilkur Road, Hyderabad - 500075
2019 - 2020
DECLARATION
We, D Vedanth(16911A1210), Hemanth Pasula(16911A1246), G.Yashwanth
venkat Samrat (16911A1217) hereby declare that Project Report entitled “TELECOM
CHURN BASED ON ML”, is submitted in the partial fulfillment of the requirement for
the award of Bachelor of Technology in Information Technology to Vidya Jyothi
Institute of Technology, Autonomous VJIT - Hyderabad, is an authentic work and has not
been submitted to any other university or institute for the degree.
D Vedanth (16911A1210)
Hemanth Pasula (16911A1246)
G.Yashwanth venkat Samrat(16911A1217)
ACKNOWLEDGEMENT
It is a great pleasure to express our deepest sense of gratitude and indebtedness to
our internal guide M.Suresh Babu, Assistant Professor, Department of IT, VJIT, for having
been a source of constant inspiration, precious guidance and generous assistance during the
project work. We deem it a privilege to have worked under her able guidance. Without her
close monitoring and valuable suggestions, this work would not have taken this shape. We feel
that this help is irreplaceable and unforgettable.
We wish to express our sincere thanks to Dr. P. Venugopal Reddy, Director VJIT, for
providing the college facilities for the completion of the project. We are profoundly thankful to
Mr.B.Srinivasulu, Professor and Head of Department of IT, for his cooperation and
encouragement. Finally, we thank all the faculty members, supporting staff of IT Department and
friends for their kind co-operation and valuable help for completing the project.
TEAM MEMBERS
D Vedanth (16911A1210)
Hemanth Pasula (16911A1246)
G.Yashwanth venkat Samrat (16911A1217)
CONTENTS
1. Introduction
1.1 Motivation 9
1.2 Problem definition 9
1.3 Objective of Project 9
1.3.1 General Objective 9
1.3.2 Specific Objective 10
1.4 Limitations of Project 10
2. LITERATURE SURVEY
2.1 Introduction 11
3. ANALYSIS
3.1 Introduction 12
3.2 Software Requirement Specification 12
3.2.1 Functional requirements 12
3.2.2 Non-Functional requirements 14
3.2.3 Software requirement 14
3.2.4 Hardware requirement 23
3.3 Content diagram of Project 25
4. DESIGN
4.1 Introduction 26
4.2 UML Diagram 42
4.2.1 Class diagram 42
4.2.2 Use case diagram 43
4.2.3 Activity diagram 44
4.2.4 Sequence diagram 45
4.2.5 Collaboration diagram 46
4.2.6 Component diagram 47
4.2.7 Deployment diagram 48
5. IMPLEMENTATION METHODS & RESULTS 49
5.1 Introduction 49
5.2 Explanation of Key functions 50
5.3 Screenshots or output screens 71
6. TESTING & VALIDATION
6.1 Introduction 78
6.2 Design of test cases and scenarios 88
6.3 Validation 88
6.4 Conclusion 90
7. CONCLUSION AND FUTURE ENHANCEMENTS
Project Conclusion 90
Future Enhancement 90
REFERENCES 91
LIST OF FIGURES PAGE NO
Fig-1 UML Views 28
Fig-2 Merging data sets 50
Fig-3 Imputing missing values 52
Fig-4 Dividing into training set 52
Fig-5 Correlation Matrix 53
Fig-6 Histogram 55
Fig-7 Unit testing life cycle 79
Fig-8 White box testing 80
Fig-9 Black box testing 82
Fig-10 Grey box testing 83
Fig-11 Grey box testing feature 84
Fig-12 Software verification 89
LIST OF OUTPUT SCREEN SHOTS PAGE NO
Fig-1 Structure of data 71
Fig-2 Histogram 72
Fig-3 Box plot 72
Fig-4 Graph plot 73
Fig-5 Conversion of data types 73
Fig-6 Missing value 74
Fig-7 Converting the data sets 1 and 0 74
Fig-8 Naive Bayesian 75
Fig-9 Optimal cut-off point 75
Fig-10 Correlation matrix 76
Fig-11 ROC curve 76
Fig-12 Customers who stay 77
Fig-13 Customers who leave 77
Chapter-1
Introduction
1.1 Motivation
In the present project, we try to help service providers avoid losing their customers. With the
help of this work, we can better understand customers and help the service providers know
their customers better.
1.2 Problem definition
Telecommunication is the exchange of signs, signals, messages, words, writings, images and
sounds or information of any nature by wire, radio, optical or
other electromagnetic systems. Telecommunication occurs when the exchange
of information between communication participants includes the use of technology. It is transmitted
through a transmission medium, such as over physical media, for example, over electrical cable, or
via electromagnetic radiation through space such as radio or light. Such transmission paths are often
divided into communication channels which afford the advantages of multiplexing. Since
the Latin term communication is considered the social process of information exchange, the
term telecommunications is often used in its plural form because it involves many different
technologies.
The churn rate, also known as the rate of attrition or customer churn, is the rate at
which customers stop doing business with an entity. It is most commonly expressed as the
percentage of service subscribers who discontinue their subscriptions within a given time
period. It is also the rate at which employees leave their jobs within a certain period. For a
company to expand its clientele, its growth rate (measured by the number of new customers)
must exceed its churn rate.
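As a simple illustration of this definition (the numbers below are hypothetical and not taken from the project data), the churn rate for a period can be computed as follows:

def churn_rate(subscribers_at_start: int, subscribers_lost: int) -> float:
    """Churn rate: the fraction of subscribers who discontinue in a given period."""
    return subscribers_lost / subscribers_at_start

# A provider that starts a quarter with 5,000 subscribers and loses 150 of them
# has a quarterly churn rate of 3%.
print(f"{churn_rate(5000, 150):.1%}")  # 3.0%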
Machine learning (ML) is the study of computer algorithms that improve automatically
through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build
a mathematical model based on sample data, known as "training data", in order to make predictions
or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a
wide variety of applications, such as email filtering and computer vision, where it is difficult or
infeasible to develop conventional algorithms to perform the needed tasks.
1.3 Objective of Project
1.3.1 General Objective
Customer attrition, also known as customer churn, customer turnover, or customer
defection, is the loss of clients or customers.
Telephone service companies, Internet service providers, pay TV companies, insurance
firms, and alarm monitoring services, often use customer attrition analysis and customer
attrition rates as one of their key business metrics because the cost of retaining an existing
customer is far less than acquiring a new one. Companies from these sectors often have
customer service branches which attempt to win back defecting clients, because recovered
long-term customers can be worth much more to a company than newly recruited clients.
1.3.2 Specific Objective
Companies usually make a distinction between voluntary churn and involuntary
churn. Voluntary churn occurs due to a decision by the customer to switch to another
company or service provider, involuntary churn occurs due to circumstances such as a
customer's relocation to a long-term care facility, death, or the relocation to a distant
location. In most applications, involuntary reasons for churn are excluded from the analytical
models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to
factors of the company-customer relationship which companies control, such as how billing
interactions are handled or how after-sales help is provided.
Predictive analytics uses churn prediction models that predict customer churn by assessing
a customer's propensity, or risk, to churn. Since these models generate a small prioritized list of
potential defectors, they are effective at focusing customer retention marketing programs on
the subset of the customer base who are most vulnerable to churn.
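A minimal sketch of how such a prioritized list could be produced with scikit-learn is shown below; the data is synthetic and the feature and identifier names are assumptions for illustration, not the project's actual ones.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data standing in for real customer features (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                         # four hypothetical usage features
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)  # synthetic churn labels

model = LogisticRegression().fit(X, y)

# Rank customers by predicted churn propensity; the top of the list is the
# subset a retention campaign would target first.
scores = pd.DataFrame({
    "customer_id": np.arange(500),
    "churn_propensity": model.predict_proba(X)[:, 1],
})
priority_list = scores.sort_values("churn_propensity", ascending=False).head(50)
print(priority_list.head())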
1.4 Limitations of Project
Acquiring data about a customer is difficult. These days it is very hard to know what a
person is thinking, so we can only judge with the help of the data in front of us, which may be
only 80-90% reliable. Therefore, we should make sure that our predictions are good.
Chapter-2
LITERATURE SURVEY
2.1 Introduction
LOGISTIC REGRESSION
Problem Statement :
"You have a telecom firm which has collected data of all its customers"
The main types of attributes are :
1.Demographics (age, gender etc.)
2.Services availed (internet packs purchased, special offers etc)
3.Expenses (amount of recharge done per month etc.)
Based on all this past information, you want to build a model which will predict whether a
particular customer will churn or not.
So the variable of interest, i.e. the target variable here is ‘Churn’ which will tell us whether
or not a particular customer has churned. It is a binary variable: 1 means that the customer
has churned and 0 means the customer has not churned.
With 21 predictor variables we need to predict whether a particular customer will switch to
another telecom provider or not.
The data sets were taken from the Kaggle website.
DATA PROCEDURE
Import the required libraries
1. Importing all datasets
2. Merging all datasets based on condition ("customer_id ")
3. Data Cleaning - checking the null values
4. Check for the missing values and replace them
5. Model building
• Binary encoding
• One hot encoding
• Creating dummy variables and removing the extra columns
6. Feature selection using RFE - Recursive Feature Elimination
7. Getting the predicted values on train set
8. Creating a new column predicted with 1 if churn > 0.5 else 0
9. Create a confusion matrix on train set and test
10. Check the overall accuracy
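The following is a hedged sketch of this procedure in Python with pandas and scikit-learn. The file names, the column names (customer_id, churn) and the number of features kept by RFE are assumptions for illustration, not the exact values used in the project.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.metrics import confusion_matrix, accuracy_score

# 1-2. Import the datasets and merge them on the customer identifier.
customers = pd.read_csv("customer_data.csv")
services = pd.read_csv("internet_data.csv")
churn = pd.read_csv("churn_data.csv")
df = customers.merge(services, on="customer_id").merge(churn, on="customer_id")

# 3-4. Data cleaning: check for nulls and fill missing numeric values.
print(df.isnull().sum())
df = df.fillna(df.median(numeric_only=True))

# 5. Encoding: map binary Yes/No columns to 1/0, create dummy variables for the
#    remaining categoricals, and drop the identifier column.
df = df.drop(columns=["customer_id"]).replace({"Yes": 1, "No": 0})
df = pd.get_dummies(df, drop_first=True)

X = df.drop(columns=["churn"])
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 6. Feature selection with Recursive Feature Elimination over logistic regression.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=15).fit(X_train, y_train)

# 7-8. Predicted probabilities on the train set, labelled 1 where churn probability > 0.5.
train_pred = (rfe.predict_proba(X_train)[:, 1] > 0.5).astype(int)

# 9-10. Confusion matrix and overall accuracy on the train and test sets.
print(confusion_matrix(y_train, train_pred))
print("Train accuracy:", accuracy_score(y_train, train_pred))
print("Test accuracy:", accuracy_score(y_test, rfe.predict(X_test)))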
Chapter-3
ANALYSIS
3.1 Introduction
The field of chemistry uses analysis in at least three ways: to identify the components
of a particular chemical compound (qualitative analysis), to identify the proportions of
components in a mixture (quantitative analysis), and to break down chemical processes and
examine chemical reactions between elements of matter. For an example of its use, analysis
of the concentration of elements is important in managing a nuclear reactor, so nuclear
scientists will analyse neutron activation to develop discrete measurements within vast
samples. A matrix can have a considerable effect on the way a chemical analysis is
conducted and the quality of its results. Analysis can be done manually or with a device.
Chemical analysis is an important element of national security among the major world
powers.
3.2 Software Requirement Specification
3.2.1 Functional requirements
Functional requirements may involve calculations, technical details, data
manipulation and processing, and other specific functionality that define what a system is
supposed to accomplish. Behavioural requirements describe all the cases where the system
uses the functional requirements, these are captured in use cases. Functional requirements are
supported by non-functional requirements (also known as "quality requirements"), which
impose constraints on the design or implementation (such as performance requirements,
security, or reliability). Generally, functional requirements are expressed in the form "system
must do requirement," while non-functional requirements take the form "system shall be
requirement." The plan for implementing functional requirements is detailed in the system
design, whereas non-functional requirements are detailed in the system architecture.
Customer data: contains all data related to the customer's services and contract
information, in addition to all offers, packages, and services subscribed to by the customer.
Furthermore, it also contains information generated from the CRM system, such as all customer
GSMs, type of subscription, birthday, gender, place of residence and more.
Network logs data: contains the internal sessions related to internet, calls, and SMS
for each transaction in the telecom operator, such as the time needed to open an internet
session and the call ending status. It can indicate whether a session dropped due to an error in
the internal network.
Call detail records (CDRs): contain all charging information about calls, SMS,
MMS, and internet transactions made by customers. This data source is generated as text files,
is large, and holds a lot of detailed information. We spent a lot of time understanding it and
learning its sources and storage format. In addition, these records must be linked to the
detailed customer data stored in relational databases.
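As an illustration of how these three sources might be combined into one analytical table, here is a hedged pandas sketch; every file, column, and aggregation name below is an assumption, since the operator's real schemas are not reproduced in this report.

import pandas as pd

# Illustrative file and column names only; real operator schemas differ.
crm = pd.read_csv("crm_customers.csv")          # subscriptions, demographics, GSMs
sessions = pd.read_csv("network_logs.csv")      # internet/call/SMS session records
cdrs = pd.read_csv("cdrs.txt", sep="|")         # charging records exported as text files

# Aggregate the transactional sources to one row per customer before joining.
session_feats = sessions.groupby("customer_id").agg(
    dropped_sessions=("ended_in_error", "sum"),
    total_sessions=("session_id", "count"),
).reset_index()
usage_feats = cdrs.groupby("customer_id").agg(
    total_charge=("charge", "sum"),
    total_transactions=("record_id", "count"),
).reset_index()

# One table keyed by customer_id, ready for churn modelling.
df = (crm.merge(session_feats, on="customer_id", how="left")
         .merge(usage_feats, on="customer_id", how="left"))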
The Quality of the System is maintained in such a way so that it can be very user
friendly to all the users.
The software quality attributes are assumed as under:
1.Accurate and hence reliable.
In simpler terms, given a set of data points from repeated measurements of the same
quantity, the set can be said to be accurate if their average is close to the true value of the
quantity being measured, while the set can be said to be precise if the values are close to
each other. In the first, more common definition of "accuracy" above, the two concepts are
independent of each other, so a particular set of data can be said to be either accurate, or
precise, or both, or neither.
In everyday language, we use the word reliable to mean that something is dependable
and that it will behave predictably every time.
2.Secured.
Security is freedom from, or resilience against, potential harm caused by others.
The word entered the English language in the 16th century. It is derived from the Latin
securus, meaning freedom from anxiety: se (without) + cura (care, anxiety).
3.Fast speed.
In everyday use and in kinematics, the speed of an object is the magnitude of the
change of its position; it is thus a scalar quantity. The average speed of an object in an
interval of time is the distance travelled by the object divided by the duration of the
interval; the instantaneous speed is the limit of the average speed as the duration of the time
interval approaches zero.
4.Compatibility.
Compatibility is a state in which two things are able to exist or occur together without
problems or conflict.
3.2.2 Non-Functional requirements
In systems engineering and requirements engineering, a non-functional
requirement (NFR) is a requirement that specifies criteria that can be used to judge the
operation of a system, rather than specific behaviours. They are contrasted with functional
requirements that define specific behaviour or functions. The plan for
implementing functional requirements is detailed in the system design. The plan for
implementing non-functional requirements is detailed in the system architecture, because
they are usually architecturally significant requirements.
3.2.3 Software requirement
1.Python
Introduction
Python is an interpreted, high-level, general-purpose programming language. Created
by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace. Its language
constructs and object-oriented approach aim to help programmers write clear, logical code
for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming
paradigms, including structured (particularly, procedural), object-oriented, and functional
programming. Python is often described as a "batteries included" language due to its
comprehensive standard library.
Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0,
released in 2000, introduced features like list comprehensions and a garbage collection
system capable of collecting reference cycles. Python 3.0, released in 2008, was a major
revision of the language that is not completely backward-compatible, and much Python 2
code does not run unmodified on Python 3.
The Python 2 language was officially discontinued in 2020 (first planned for 2015), and
"Python 2.7.18 is the last Python 2.7 release and therefore the last Python 2 release." No
more security patches or other improvements will be released for it. With Python 2's
end-of-life, only Python 3.5.x and later are supported.
Python interpreters are available for many operating systems. A global community of
programmers develops and maintains CPython, an open source reference implementation.
A non-profit organization, the Python Software Foundation, manages and directs resources
for Python and CPython development.
History of Python
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde
& Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired
by SETL), capable of exception handling and interfacing with the Amoeba operating
system. Its implementation began in December 1989. Van Rossum shouldered sole
responsibility for the project, as the lead developer, until 12 July 2018, when he announced
his "permanent vacation" from his responsibilities as Python's Benevolent Dictator For Life,
a title the Python community bestowed upon him to reflect his long-term commitment as the
project's chief decision-maker. He now shares his leadership as a member of a five-person
steering council. In January 2019, active Python core developers elected Brett Cannon, Nick
Coghlan, Barry Warsaw, Carol Willing and Van Rossum to a five-member "Steering
Council" to lead the project.
Python 2.0 was released on 16 October 2000 with many major new features, including
a cycle-detecting garbage collector and support for Unicode.
Python 3.0 was released on 3 December 2008. It was a major revision of the language that is
not completely backward-compatible. Many of its major features were backported to the
Python 2.6.x and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which
automates (at least partially) the translation of Python 2 code to Python 3.
Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern
that a large body of existing code could not easily be forward-ported to Python 3.
2.Anaconda 3
Introduction
Anaconda is a free and open-source distribution of the Python and R programming
languages for scientific computing (data science, machine learning applications, large-scale
data processing, predictive analytics, etc.), that aims to simplify package management and
deployment. The distribution includes data-science packages suitable for Windows, Linux,
and macOS. It is developed and maintained by Anaconda, Inc., which was founded by Peter
Wang and Travis Oliphant in 2012. As an Anaconda, Inc. product, it is also known
as Anaconda Distribution or Anaconda Individual Edition, while other products from the
company are Anaconda Team Edition and Anaconda Enterprise Edition, which are both not
free.
Package versions in Anaconda are managed by the package management system conda. This
package manager was spun out as a separate open-source package as it ended up being useful
on its own and for other things than Python. There is also a small, bootstrap version of
Anaconda called Miniconda, which includes only conda, Python, the packages they depend
on, and a small number of other packages.
History of Anaconda 3
Anaconda distribution comes with 1,500 packages selected from PyPI as well as
the conda package and virtual environment manager. It also includes a GUI, Anaconda
Navigator, as a graphical alternative to the command line interface (CLI).
The big difference between conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science and the
reason conda exists.
When pip installs a package, it automatically installs any dependent Python packages
without checking if these conflict with previously installed packages.
It will install a package
and any of its dependencies regardless of the state of the existing installation.
Because of this,
a user with a working installation of, for example, Google Tensorflow, can find that it stops
working having used pip to install a different package that requires a different version of the
dependent numpy library than the one used by Tensorflow. In some cases, the package may
appear to work but produce different results in detail.
In contrast, conda analyses the current environment including everything currently installed,
and, together with any version limitations specified (e.g. the user may wish to have
Tensorflow version 2.0 or higher), works out how to install a compatible set of
dependencies, and shows a warning if this cannot be done.
Open source packages can be individually installed from the Anaconda repository, Anaconda
Cloud (anaconda.org), or the user's own private repository or mirror, using the conda install command.
Anaconda, Inc. compiles and builds the packages available in the Anaconda repository itself,
and provides binaries for Windows 32/64 bit, Linux 64 bit and MacOS 64-bit. Anything
available on PyPI may be installed into a conda environment using pip, and conda will keep
track of what it has installed itself and what pip has installed.
Custom packages can be made using the conda build command, and can be shared with others by
uploading them to Anaconda Cloud, PyPI or other repositories.
The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python
3.7. However, it is possible to create new environments that include any version of Python
packaged with conda.
In short, conda analyses your current environment (everything you have installed, plus any
version limitations you specify, e.g. you only want TensorFlow >= 2.0) and figures out how to
install compatible dependencies, or it tells you that what you want cannot be done. pip, by
contrast, will just install the package you specify and any dependencies, even if that breaks
other packages.
Conda allows users to easily install different versions of binary software packages and any
required libraries appropriate for their computing platform. Also, it allows users to switch
between package versions and download and install updates from a software repository.
Conda is written in the Python programming language, but can manage projects containing
code written in any language (e.g., R), including multi-language projects. Conda can
install Python, while similar Python-based cross-platform package managers (such
as wheel or pip) cannot.
A popular conda channel for bioinformatics software is Bioconda, which provides multiple
software distributions for computational biology. In fact, the conda package and environment
manager is included in all versions of Anaconda, Miniconda and Anaconda Repository.
Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface included in the Anaconda distribution
that allows users to launch applications and manage conda packages, environments and
channels without using command-line commands. Navigator can search for packages on Anaconda
Cloud or in a local Anaconda Repository, install them in an environment, run the packages and
update them. It is available for Windows, macOS and Linux.
The following applications are available by default in Navigator.
Conda is an open source, language-agnostic package and environment management system that
installs, runs, and updates packages and their dependencies. It was created for Python
programs, but it can package and distribute software for any language (e.g., R), where R is
a programming language and free software environment for statistical computing and graphics
supported by the R Foundation for Statistical Computing. The R language is widely used
among statisticians and data miners for developing statistical software and for data analysis.
Polls, data mining surveys, and studies of scholarly literature databases show substantial
increases in its popularity; as of February 2020, R ranks 13th in the TIOBE index, a measure of
the popularity of programming languages.
A GNU package, the official R software environment is written primarily in C, FORTRAN,
and R itself (thus, it is partially self-hosting) and is freely available under the GNU General
Public License. Pre-compiled executables are provided for various operating systems.
Although R has a command line interface, there are several third-party graphical user
interfaces, such as RStudio, an integrated development environment, and Jupyter, a notebook
interface.
Anaconda Cloud:
Anaconda Cloud is a package management service by Anaconda where you can find, access,
store and share public and private notebooks, environments, and conda and PyPI
packages. Cloud hosts useful Python packages, notebooks and environments for a wide
variety of applications. You do not need to log in, or to have a Cloud account, to search for
public packages or to download and install them.
Project Jupyter:
Project Jupyter is a nonprofit organization created to "develop open-source software, open-
standards, and services for interactive computing across dozens of programming languages".
Spun-off from IPython in 2014 by Fernando Pérez, Project Jupyter supports execution
environments in several dozen languages. Project Jupyter's name is a reference to the three
core programming languages supported by Jupyter, which are Julia, Python and R, and also a
homage to Galileo's notebooks recording the discovery of the moons of Jupiter. Project
Jupyter has developed and supported the interactive computing products Jupyter Notebook,
JupyterHub, and JupyterLab, the next-generation version of Jupyter Notebook.
In 2014, Fernando Pérez announced a spin-off project from IPython called Project Jupyter.
IPython continued to exist as a Python shell and a kernel for Jupyter, while the notebook and
other language-agnostic parts of IPython moved under the Jupyter name. Jupyter is
language-agnostic and supports execution environments (also known as kernels) in several
dozen languages, among which are Julia, R, Haskell, Ruby, and of course Python (via the
IPython kernel).
In 2015, GitHub and Project Jupyter announced native rendering of the Jupyter notebook
file format (.ipynb files) on the GitHub platform.
Jupyter Notebook
Jupyter Notebook (formerly IPython Notebook) is a web-based computational environment
for creating Jupyter notebook documents. The "notebook" term can colloquially make
reference to many different entities, mainly the Jupyter web application, Jupyter Python web
server, or Jupyter document format depending on context. A Jupyter Notebook document is a
JSON document, following a versioned schema, and containing an ordered list of
input/output cells which can contain code, text, mathematics, plots and rich media, usually
ending with the ".ipynb" extension.
A Jupyter Notebook can be converted to a number of open source output formats
(HTML, presentation slides, PDF, Markdown, Python) through "Download As" in the
web interface, via the nbconvert library or "jupyter nbconvert" command line interface in a
shell. To simplify visualisation of Jupyter notebook documents on the web, the library is
provided as a service through NbViewer which can take a URL to any publicly available
notebook document, convert it to HTML on the fly and display it to the user.
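Since a notebook document is just versioned JSON, it can be inspected with the Python standard library alone. A minimal sketch follows; the file name example.ipynb is an assumption.

import json

# A notebook document is plain JSON following the nbformat schema.
with open("example.ipynb", encoding="utf-8") as f:
    nb = json.load(f)

print("nbformat version:", nb["nbformat"])
for i, cell in enumerate(nb["cells"]):
    source = "".join(cell["source"])   # cell source is stored as a list of lines
    print(f"cell {i}: {cell['cell_type']}, {len(source)} characters")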
Jupyter Notebook interface
Jupyter Notebook can connect to many kernels to allow programming in many languages.
By default, Jupyter Notebook ships with the IPython kernel. As of the 2.3 release (October
2014), there are currently 49 Jupyter-compatible kernels for many programming languages,
including Python, R, Julia and Haskell.
The Notebook interface was added to IPython in the 0.12 release (December 2011), renamed
to Jupyter notebook in 2015 (IPython 4.0 – Jupyter 1.0). Jupyter Notebook is similar to the
notebook interface of other programs such as Maple, Mathematica, and SageMath, a
computational interface style that originated with Mathematica in the 1980s.
According
to The Atlantic, Jupyter interest overtook the popularity of the Mathematica notebook
interface in early 2018.
Jupyter kernels
A Jupyter kernel is a program responsible for handling various types of requests (code
execution, code completions, inspection), and providing a reply. Kernels talk to the other
components of Jupyter using ZeroMQ over the network, and thus can be on the same or
remote machines. Unlike many other Notebook-like interfaces, in Jupyter, kernels are not
aware that they are attached to a specific document, and can be connected to many clients at
once. Usually kernels allow execution of only a single language, but there are a couple of
exceptions.
By default, Jupyter ships with IPython as its default kernel, with a reference implementation via
the ipykernel wrapper. Kernels for many languages, of varying quality and features, are
available.
JupyterHub
JupyterHub is a multi-user server for Jupyter Notebooks. It is designed to support many users
by spawning, managing, and proxying many singular Jupyter Notebook servers. While
JupyterHub requires managing servers, third-party services like Jupyo provide an alternative
to JupyterHub by hosting and managing multi-user Jupyter notebooks in the cloud.
JupyterLab
JupyterLab is the next-generation user interface for Project Jupyter. It offers all the familiar
building blocks of the classic Jupyter Notebook (notebook, terminal, text editor, file browser,
rich outputs, etc.) in a flexible and powerful user interface. The first stable release was
announced on February 20, 2018.
Web application
Tomcat has also added user- as well as system-based web application enhancements to support
deployment across a variety of environments. It also tries to manage sessions as well as
applications across the network.
A number of additional components may be used with Apache Tomcat. These components may
be built by users should they need them, or they can be downloaded from one of the mirrors.
Features:
Tomcat 7.x implements the Servlet 3.0 and JSP 2.2 specifications. It requires Java version
1.6, although previous versions have run on Java 1.1 through 1.5. Versions 5 through 6 saw
improvements in garbage collection, JSP parsing, performance and scalability. Native
wrappers, known as "Tomcat Native", are available for Microsoft Windows and Unix for
platform integration. Tomcat 8.x implements the Servlet 3.1 and JSP 2.3 specifications.
Apache Tomcat 8.5.x is intended to replace 8.0.x and includes new features pulled forward
from Tomcat 9.0.x. The minimum Java version and implemented specification versions
remain unchanged.
WINDOWS 10
Windows 10 is a series of personal computer operating systems produced by Microsoft as
part of its Windows NT family of operating systems. It is the successor to Windows 8.1, and
was released to manufacturing on July 15, 2015, and broadly released for retail sale on July
29, 2015. Windows 10 receives new builds on an ongoing basis, which are available at no
additional cost to users, in addition to additional test builds of Windows 10 which are
available to Windows Insiders. Devices in enterprise environments can receive these updates
at a slower pace, or use long-term support milestones that only receive critical updates, such
as security patches, over their ten-year lifespan of extended support.
One of Windows 10's most notable features is support for universal apps, an expansion of
the Metro-style apps first introduced in Windows 8. Universal apps can be designed to run
across multiple Microsoft product families with nearly identical code—
including PCs, tablets, smartphones, embedded systems, Xbox One, Surface Hub and Mixed
Reality. The Windows user interface was revised to handle transitions between a mouse-
oriented interface and a touchscreen-optimized interface based on available input devices—
particularly on 2-in-1 PCs, both interfaces include an updated Start menu which incorporates
elements of Windows 7's traditional Start menu with the tiles of Windows 8. Windows 10
also introduced the Microsoft Edge web browser, a virtual desktop system, a window and
desktop management feature called Task View, support for fingerprint and face
recognition login, new security features for enterprise environments, and DirectX 12.
Windows 10 received mostly positive reviews upon its original release in July 2015. Critics
praised Microsoft's decision to provide a desktop-oriented interface in line with previous
versions of Windows, contrasting the tablet-oriented approach of 8, although Windows 10's
touch-oriented user interface mode was criticized for containing regressions upon the touch-
oriented interface of Windows 8. Critics also praised the improvements to Windows 10's
bundled software over Windows 8.1, Xbox Live integration, as well as the functionality and
capabilities of the Cortana personal assistant and the replacement of Internet Explorer with
Edge. However, media outlets have been critical of changes to operating system behaviours,
including mandatory update installation, privacy concerns over data collection performed by
the OS for Microsoft and its partners and the adware-like tactics used to promote the
operating system on its release.
Although Microsoft's goal to have Windows 10 installed on over a billion devices within
three years of its release had failed, it still had an estimated usage share of 60% of all the
Windows versions on traditional PCs, and thus 47% of traditional PCs were running
Windows 10 by September 2019. Across all platforms (PC, mobile, tablet and console), 35%
of devices run some version of Windows (Windows 10 or older).
Microsoft Excel
Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android
and iOS. It features calculation, graphing tools, pivot tables, and a macro programming
language called Visual Basic for Applications. It has been a very widely applied spreadsheet
for these platforms, especially since version 5 in 1993, and it has replaced Lotus 1-2-3 as the
industry standard for spreadsheets. Excel forms part of the Microsoft Office suite of
software.
Number of rows and columns: Versions of Excel up to 7.0 had a limitation in the size of their
data sets of 16K (2^14 = 16,384) rows. Versions 8.0 through 11.0 could handle 64K
(2^16 = 65,536) rows and 256 (2^8, last column labelled 'IV') columns. Version 12.0 onwards,
including the current version 16.x, can handle over 1M (2^20 = 1,048,576) rows and
16,384 (2^14, last column labelled 'XFD') columns.
Microsoft Excel up until 2007 version used a proprietary binary file format called Excel
Binary File Format (.XLS) as its primary format. Excel 2007 uses Office Open XML as its
primary file format, an XML-based format that followed after a previous XML-based format
called "XML Spreadsheet" ("XMLSS"), first introduced in Excel 2002.
Although supporting and encouraging the use of new XML-based formats as replacements,
Excel 2007 remained backwards-compatible with the traditional, binary formats. In addition,
most versions of Microsoft Excel can read CSV, DBF, SYLK, DIF, and other legacy
formats. Support for some older file formats was removed in Excel 2007. The file formats
were mainly from DOS-based programs.
Microsoft originally marketed a spreadsheet program called Multiplan in 1982. Multiplan
became very popular on CP/M systems, but on MS-DOS systems it lost popularity to Lotus
1-2-3. Microsoft released the first version of Excel for the Macintosh on September 30,
1985, and the first Windows version was 2.05 (to synchronize with the Macintosh version
2.2) in November 1987. Lotus was slow to bring 1-2-3 to Windows and by the early 1990s,
Excel had started to outsell 1-2-3 and helped Microsoft achieve its position as a leading PC
software developer. This accomplishment solidified Microsoft as a valid competitor and
showed its future of developing GUI software. Microsoft maintained its advantage with
regular new releases, every two years or so.
Excel 2.0 is the first version of Excel for the Intel platform. Versions prior to 2.0 were only
available on the Apple Macintosh.
Excel 2.0 (1987) The first Windows version was labeled "2" to correspond to the Mac
version. This included a run-time version of Windows.
BYTE in 1989 listed Excel for Windows as among the "Distinction" winners of the BYTE
Awards. The magazine stated that the port of the "extraordinary" Macintosh version
"shines", with a user interface as good as or better than the original.
Excel 3.0 (1990) Included toolbars, drawing capabilities, outlining, add-in support, 3D
charts, and many more new features.
Also, an easter egg in Excel 4.0 reveals a hidden animation of a dancing set of numbers 1
through 3, representing Lotus 1-2-3, which was then crushed by an Excel logo.
Excel 5.0 (1993) With version 5.0, Excel has included Visual Basic for Applications (VBA),
a programming language based on Visual Basic which adds the ability to automate tasks in
Excel and to provide user-defined functions (UDF) for use in worksheets. VBA is a powerful
addition to the application and includes a fully featured integrated development
environment (IDE). Macro recording can produce VBA code replicating user actions, thus
allowing simple automation of regular tasks. VBA allows the creation of forms and
in-worksheet controls to communicate with the user. The language supports use (but not
creation) of ActiveX (COM) DLL's; later versions add support for class modules allowing
the use of basic object-oriented programming techniques.
The automation functionality provided by VBA made Excel a target for macro viruses. This
caused serious problems until antivirus products began to detect these
viruses. Microsoft belatedly took steps to prevent the misuse by adding the ability to disable
macros completely, to enable macros when opening a workbook or to trust all macros signed
using a trusted certificate.
Versions 5.0 to 9.0 of Excel contain various Easter eggs, including a "Hall of Tortured
Souls", although since version 10 Microsoft has taken measures to eliminate such
undocumented features from their products.
5.0 was released in a 16-bit x86 version for Windows 3.1 and later in a 32-bit version for NT
3.51 (x86/Alpha/PowerPC).
Excel 12.0 (2007) was included in Office 2007. This release was a major upgrade from the previous version.
Similar to other updated Office products, Excel in 2007 used the new Ribbon menu system.
This was different from what users were used to, and was met with mixed reactions. One
study reported fairly good acceptance by users except highly experienced users and users of
word processing applications with a classical WIMP interface, but was less convinced in
terms of efficiency and organization. However, an online survey reported that a majority of
respondents had a negative opinion of the change, with advanced users being "somewhat
more negative" than intermediate users, and users reporting a self-estimated reduction in
productivity.
Added functionality included the SmartArt set of editable business diagrams. Also added
was an improved management of named variables through the Name Manager, and much-
improved flexibility in formatting graphs, which allow (x, y) coordinate labeling and lines of
arbitrary weight. Several improvements to pivot tables were introduced.
Also like other office products, the Office Open XML file formats were introduced,
including .xlsm for a workbook with macros and .xlsx for a workbook without macros.
Specifically, many of the size limitations of previous versions were greatly increased. To
illustrate, the number of rows was now 1,048,576 (2^20) and the number of columns 16,384
(2^14; the far-right column is labelled XFD). This changes what is a valid A1 reference versus a named range. This
version made more extensive use of multiple cores for the calculation of spreadsheets;
however, VBA macros are not handled in parallel and XLL add-ins were only executed in
parallel if they were thread-safe and this was indicated at registration.
3.2.4 Hardware requirement
Server-side hardware
1.Hardware recommended by all software needed
2.Communication hardware to serve client requests
Client-side hardware
1.Hardware recommended by respective client’s operating system and web browser.
2. Communication hardware to communicate with the server.
Others
An Intel Core 2 Duo CPU, 4 GB of RAM (minimum), a hard disk of any capacity, a 105-key
keyboard, an optical mouse and a colour display are required.
Operating System
As our system is platform-independent, there are no operating system constraints for using
this software. Windows and Linux operating systems are considered to be
extremely stable platforms. One of the advantages of using the Linux operating system is the
speed at which performance enhancements and security patches are made available since it is
an open source product.
Web Server
The available web server’s platforms are IIS and Apache. Since Apache is open source,
software security patches tend to be released faster. One of the downsides to IIS is its
vulnerability to virus attacks, something that Apache rarely has problems with. Apache
offers PHP, a language considered to be the C of the web. The next part is the end-user
environment. As explained before, we assume that the end users are not very familiar with
computers and are not very interested in moving their activities from the manual system to the
automated system.
Web Browser
Let's play word association, just like when a psychologist asks you what comes to mind
when you hear certain words: What do you think when you hear the words "Opera. Safari.
Chrome. Firefox."
If you think of the Broadway play version of "The Lion King," maybe it is time to see a
psychologist. However, if you said, "Internet browsers," you're spot on. That's because the
leading Internet Browsers are:
 Google Chrome
 Mozilla Firefox
 Apple Safari
 Microsoft Internet Explorer
 Microsoft Edge
 Opera
 Maxthon
And that order pretty much lines up with how they're ranked in terms of market share and
popularity...today. Browsers come and go. Ten years ago, Netscape Navigator was a well-
known browser: Netscape is long gone today. Another, called Mosaic, is considered the first
modern browser—it was discontinued in 1997.
So, what exactly is a browser?
Definition:
A browser, short for web browser, is the software application (a program) that you're using
right now to search for, reach and explore websites. Whereas Excel is a program for
spreadsheets and Word® a program for writing documents, a browser is a program for
Internet exploring (which is where that name came from).
Browsers don't get talked about much. A lot of people simply click on the "icon" on our
computers that take us to the Internet—and that's as far as it goes. And in a way, that's
enough. Most of us simply get in a car and turn the key...we don't know what kind of engine
we have or what features it has...it takes us where we want to go. That's why when it comes
to computers:
 There are some computer users that can't name more than one or two browsers
 Many of them don't know they can switch to another browser for free
 There are some who go to Google's webpage to "google" a topic and think that
Google is their browser.
3.3 Content diagram of Project
Chapter-4
DESIGN
4.1 Introduction
The UML may be used to visualize, specify, construct and document the artifacts of a
software-intensive system.
The UML is only a language and so is just one part of a software development method. The
UML is process independent, although optimally it should be used in a process that is use
case driven, architecture-centric, iterative, and incremental.
What is UML:
The Unified Modeling Language (UML) is a robust notation that we can use to build OOAD
models. It is called so since it was the unification of ideas from the different methodologies
of the three amigos: Booch, Rumbaugh and Jacobson.
The UML is the standard language for visualizing, specifying, constructing, and
documenting the artifacts of a software-intensive system. It can be used with all processes,
throughout the development life cycle, and across different implementation technologies.
 The UML combines the best from:
o Data Modeling concepts
o Business Modeling (work flow)
o Object Modeling
o Component Modeling
 The UML may be used to:
o Display the boundary of a system and its major functions using use cases and
actors
o Illustrate use case realizations with interaction diagrams
o Represent a static structure of a system using class diagrams
o Model the behavior of objects with state transition diagrams
o Reveal the physical implementation architecture with component &
deployment diagrams
o Extend the functionality of the system with stereotypes
Evolution of UML
One of the methods was the Object Modeling Technique (OMT), devised by James
Rumbaugh and others at General Electric. It consists of a series of models - use case, object,
dynamic, and functional - that combine to give a full view of a system.
The Booch method was devised by Grady Booch and developed the practice of analyzing a
system as a series of views. It emphasizes analyzing the system from both a macro
development view and micro development view and it was accompanied by a very detailed
notation.
The Object-Oriented Software Engineering (OOSE) method was devised by Ivar Jacobson
and focused on the analysis of system behavior. It advocated that at each stage of the process
there should be a check to see that the requirements of the user were being met.
The UML is composed of three different parts:
1. Model elements 2. Diagrams 3. Views
 Model Elements:
o The model elements represent basic object-oriented concepts such as classes, objects,
and relationships.
o Each model element has a corresponding graphical symbol to represent it in the
diagrams.
 Diagrams:
o Diagrams portray different combinations of model elements.
o For example, the class diagram represents a group of classes and the relationships,
such as association and inheritance, between them.
o The UML provides nine types of diagram - use case, class, object, state chart,
sequence, collaboration, activity, component, and deployment
Views:
o Views provide the highest level of abstraction for analyzing the system.
o Each view is an aspect of the system that is abstracted to a number of related UML
diagrams.
o Taken together, the views of a system provide a picture of the system in its entirety.
o In the UML, the five main views of the system are: User, Structural, Behavioral,
Implementation, and Environment.
Fig-1 UML Views
In addition to model elements, diagrams, and views, the UML provides mechanisms for
adding comments, information, or semantics to diagrams. And it provides mechanisms to
adapt or extend itself to a particular method, software system, or organization.
Extensions of UML:
o Stereotypes can be used to extend the UML notational elements
o Stereotypes may be used to classify and extend associations, inheritance
relationships, classes, and components
Examples:
 Class stereotypes : boundary, control, entity
 Inheritance stereotypes : extend
 Dependency stereotypes : uses
 Component stereotypes : subsystem
Use Case Diagrams:
The use case diagram presents an outside view of the system
The use-case model consists of use-case diagrams.
o The use-case diagrams illustrate the actors, the use cases, and their relationships.
o Use cases also require a textual description (use case specification), as the visual
diagrams can't contain all of the information that is necessary.
o The customers, the end-users, the domain experts, and the developers all have an
input into the development of the use-case model.
o Creating a use-case model involves the following steps:
1. defining the system
2. identifying the actors and the use cases
3. describing the use cases
4. defining the relationships between use cases and actors.
5. defining the relationships between the use cases
Use Case:
Definition: is a sequence of actions a system performs that yields an observable result of
value to a particular actor.
Use Case Naming:
Use Case should always be named in business terms, picking the words from the vocabulary
of the particular domain for which we are modeling the system. It should be meaningful to
the user because use case analysis is always done from the user's perspective. Names are
usually verbs or short verb phrases.
Use Case Specification shall document the following:
o Brief Description
o Precondition
o Main Flow
o Alternate Flow
o Exceptional flows
o Post Condition
o Special Requirements
Notation of use case
Actor:
Definition: someone or something outside the system that interacts with the
system
o An actor is external - it is not actually part of what we are building, but an interface
needed to support or use it.
o It represents anything that interacts with the system.
Notation for Actor
Relation: Two important types of relation used in Use Case Diagram
 Include: An Include relationship shows behavior that is common to
one or more use cases (Mandatory).
Include relation results when we extract the common sub flows and
make it a use case
Extend: An extend relationship shows optional behavior (Optional).
An extend relation usually results when we add a more specialized
feature to an already existing one; we say that use case B extends the
functionality of use case A.
A system boundary rectangle separates the system from the external actors.
An extend relationship indicates that one use case is a variation of another. Extend notation
is a dotted line, labeled <<extend>>, with an arrow toward the base case. The extension
point, which determines when the extended case is appropriate, is written inside the base
case.
Class Diagrams:
Class diagram shows the existence of classes and their relationships in the structural view of
a system.
UML modeling elements in class diagram:
o Classes and their structure and behavior
o Relationships
 Association
 Aggregation
 Composition
 Dependency
 Generalization / Specialization (inheritance relationships)
 Multiplicity and navigation indicators
 Role names
A class describes properties and behavior of a type of object.
o Classes are found by examining the objects in sequence and collaboration diagram
o A class is drawn as a rectangle with three compartments
o Classes should be named using the vocabulary of the domain. Naming standards
should be created, e.g., all classes are singular nouns starting with a capital letter
o The behavior of a class is represented by its operations. Operations may be found by
examining interaction diagrams
o The structure of a class is represented by its attributes. Attributes may be found by
examining class definitions, the problem requirements, and by applying domain
knowledge
Notation:
Class information: visibility and scope
The class notation is a 3-piece rectangle with the class name, attributes, and operations.
Attributes and operations can be labeled according to access and scope.
Relationship:
Association:
o Association represents the physical or conceptual connection between two or more
objects
o An association is a bi-directional connection between classes
o An association is shown as a line connecting the related classes
Aggregation:
o An aggregation is a stronger form of relationship where the relationship is between a
whole and its parts
o It is entirely conceptual and does nothing more than distinguishing a ‘whole’ from
the ‘part’.
o It doesn’t link the lifetime of the whole and its parts
o An aggregation is shown as a line connecting the related classes with a diamond next
to the class representing the whole
Composition:
Composition is a form of aggregation with strong ownership and coincident
lifetime as the part of the whole.
Multiplicity
o Multiplicity defines how many objects participate in relationships
o Multiplicity is the number of instances of one class related to ONE instance of the
other class
This table gives the most common multiplicities
For each association and aggregation, there are two multiplicity decisions to make:
one for each end of the relationship
Although associations and aggregations are bi-directional by default, it is often
desirable to restrict navigation to one direction
If navigation is restricted, an arrowhead is added to indicate the direction of the
navigation
Dependency:
o A dependency relationship is a weaker form of relationship showing a relationship
between a client and a supplier where the client does not have semantic knowledge of
the supplier.
o A dependency is shown as a dashed line pointing from the client to the supplier.
Generalization / Specialization (inheritance relationships):
o It is a relationship between a general thing (called the super class or the
parent) and a more specific kind of that thing (called the subclass(es) or the
child).
o An association class is an association that also has class properties (or a class
has association properties)
o A constraint is a semantic relationship among model elements that specifies
conditions and propositions that must be maintained as true: otherwise, the
system described by the model is invalid.
o An interface is a specifier for the externally visible operations of a class
without specification of internal structure. An interface is formally equivalent
to abstract class with no attributes and methods, only abstract operations.
o A qualifier is an attribute or set of attributes whose values serve to partition
the set of instances associated with an instance across an association.
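To make these relationship kinds concrete, the following hypothetical Python classes (not part of the project's own code) illustrate association, aggregation, composition, inheritance and multiplicity:

class Engine:                          # part in a composition
    def __init__(self, power_kw: int):
        self.power_kw = power_kw

class Wheel:                           # part in an aggregation
    pass

class Person:                          # associated class
    pass

class Vehicle:                         # superclass (generalization)
    def __init__(self, wheels):
        self.engine = Engine(90)       # composition: the Vehicle creates and owns its Engine
        self.wheels = wheels           # aggregation: the wheels exist independently of the Vehicle

class Car(Vehicle):                    # specialization / inheritance
    def __init__(self, wheels, driver: Person):
        super().__init__(wheels)
        self.driver = driver           # association: a conceptual link between Car and Person

# Multiplicity: this one Car instance is related to exactly four Wheel instances.
car = Car(wheels=[Wheel() for _ in range(4)], driver=Person())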
Interaction Diagrams:
o Interaction diagrams are used to model the dynamic behavior of the system
o Interaction diagrams help us to identify the classes and their methods
o Interaction diagrams describe how use cases are realized as interactions
among objects
o They show classes, objects, actors and the messages between them needed to achieve the
functionality of a use case
There are two types of interaction diagrams:
1. Sequence Diagram
2. Collaboration Diagram
1. Sequence Diagram:
A sequence diagram depicts the interaction between objects in sequential order, i.e. the
order in which these interactions take place. The terms event diagram or event scenario are
also used to refer to a sequence diagram. Sequence diagrams describe how and in what
order the objects in a system function. These diagrams are widely used by business analysts and
software developers to document and understand requirements for new and existing systems.
Sequence Diagram Notations:
1. Actors – An actor in a UML diagram represents a type of role where it interacts with
the system and its objects. It is important to note here that an actor is always outside the
scope of the system we aim to model using the UML diagram.
We use actors to depict various roles including human users and other external
subjects. We represent an actor in a UML diagram using a stick person notation. We
can have multiple actors in a sequence diagram.
For example, the user in a seat reservation system is shown as an actor; it
exists outside the system and is not a part of the system.
2. Lifelines – A lifeline is a named element which depicts an individual participant in a
sequence diagram. So basically, each instance in a sequence diagram is represented by
a lifeline. Lifeline elements are located at the top in a sequence diagram. The standard
in UML for naming a lifeline is Instance Name : Class Name.
3. Messages – Communication between objects is depicted using messages. The messages
appear in a sequential order on the lifeline. We represent messages using arrows.
Lifelines and messages form the core of a sequence diagram.
Messages can be broadly classified into the following
 Synchronous messages – A synchronous message waits for a reply before the
interaction can move forward. The sender waits until the receiver has completed
the processing of the message. The caller continues only when it knows that the
receiver has processed the previous message i.e. it receives a reply message. A
large number of calls in object-oriented programming are synchronous. We use a
solid arrow head to represent a synchronous message.
 Asynchronous Messages – An asynchronous message does not wait for a reply
from the receiver. The interaction moves forward irrespective of the receiver
processing the previous message or not. We use an open (line) arrowhead to represent
an asynchronous message.
 Create message – We use a Create message to instantiate a new object in the
sequence diagram. There are situations when a particular message call requires
the creation of an object. It is represented by a dashed arrow labelled with the
word "create" to specify that it is the create message symbol.
For example, the creation of a new order on an e-commerce website would
require a new object of the Order class to be created.
 Delete Message – We use a Delete Message to delete an object. When an object
is deallocated memory or is destroyed within the system, we use the Delete
Message symbol. It destroys the occurrence of the object in the system. It is
represented by an arrow terminating with an X.
For example, in the scenario below, when the order is received by the user, the
object of the Order class can be destroyed.
 . Self-Message – Certain scenarios might arise where the object needs to send a
message to itself. Such messages are called Self Messages and are
represented with a U-shaped arrow
 Reply Message – Reply messages are used to show the message being sent from
the receiver to the sender. We represent a return/reply message using an open
arrowhead with a dotted line. The interaction moves forward only when a reply
message is sent by the receiver.
2. Collaboration diagram:
It shows the interaction of the objects and also the group of all messages sent or received
by an object. This allows us to see the complete set of services that an object must
provide.
The following collaboration diagram realizes a scenario of reserving a copy of a book in a
library.
 Difference between the Sequence Diagram and Collaboration Diagram
o Sequence diagrams emphasize the temporal aspect of a scenario - they
focus on time.
o Collaboration diagrams emphasize the spatial aspect of a scenario - they focus
on how objects are linked.
Activity Diagram:
An activity diagram is essentially a fancy flowchart. An activity diagram
focuses on the flow of activities involved in a single process and shows
how those activities depend on one another.
1. initial State – The starting state before an activity takes place is depicted using the
initial state.
2. Action or Activity State – An activity represents execution of an action on objects or
by objects. We represent an activity using a rectangle with rounded corners. Basically
any action or event that takes place is represented using an activity.
3. Action Flow or Control flows – Action flows or Control flows are also referred to as
paths and edges. They are used to show the transition from one activity state to another.
4. Decision node and Branching – When we need to make a decision before deciding the
flow of control, we use the decision node.
5. Guards – A Guard refers to a statement written next to a decision node on an arrow
sometimes within square brackets.
6. Fork – Fork nodes are used to split the flow into concurrent activities.
7. Join – Join nodes are used to bring concurrent activities back together into one flow. A join
notation has two or more incoming edges and one outgoing edge.
8. Merge or Merge Event – Scenarios arise when activities which are not being executed
concurrently have to be merged. We use the merge notation for such scenarios. We can
merge two or more activities into one if the control proceeds onto the next activity
irrespective of the path chosen.
9. Swim lanes – Swim lanes group related activities into one column or one row. Swim
lanes can be vertical or horizontal, and they add modularity to the activity diagram. It is not
mandatory to use swim lanes, but they usually give more clarity to the diagram.
It is similar to creating a function in a program: not mandatory, but a
recommended practice.
Component Diagram:
Describes the organization of, and dependencies between, software implementation components.
Components are distributable physical units, e.g. source code and object code.
Deployment Diagram:
Describes the configuration of processing resource elements and the mapping of software
implementation components onto them.
It contains components (e.g. object code, source code) and nodes (e.g. printer, database, client
machine).
4.2 UML Diagram
4.2.1 Class diagram
Diagram 1: Class diagram for user and admin
4.2.2 Use case diagram
Diagram 2: Use case diagram for admin
4.2.3 Activity diagram
Diagram 3: Activity diagram for admin
4.2.4 Sequence diagram
Diagram 4: Sequence diagram for admin
4.2.5 Collaboration diagram
Diagram 5: Collaboration diagram for admin
4.2.6 Component diagram
Diagram 6: Component diagram for admin
4.2.7 Deployment diagram
Diagram 7: Deployment diagram for local server
Chapter-5
IMPLEMENTATION METHODS & RESULTS
5.1 Introduction :
We implement our project with Anaconda 3 and Jupyter Notebook. Anaconda is a free and
open-source distribution of the Python and R programming languages for scientific computing
(data science, machine learning applications, large-scale data processing, predictive analytics,
etc.) that aims to simplify package management and deployment. Package versions are managed
by the package management system conda. The Anaconda distribution includes data-science
packages suitable for Windows, Linux, and macOS.
The Anaconda distribution comes with 1,500 packages selected from PyPI as well as the conda
package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as
a graphical alternative to the command line interface (CLI).
Conda analyzes your current environment, everything you have installed, and any version
limitations you specify (e.g. you only want tensorflow >= 2.0), and figures out how to install
compatible dependencies, or it tells you that what you want cannot be done.
pip, by contrast, will just install the package you specify and any dependencies, even if that
breaks other packages.
Conda allows users to easily install different versions of binary software packages and any
required libraries appropriate for their computing platform. Also, it allows users to switch
between package versions and download and install updates from a software repository.
Conda is written in the Python programming language, but can manage projects containing
code written in any language (e.g., R), including multi-language projects. Conda can
install Python, while similar Python-based cross-platform package managers (such
as wheel or pip) cannot.
A popular conda channel for bioinformatics software is Bioconda, which provides multiple
software distributions for computational biology. In fact, the conda package and environment
manager is included in all versions of Anaconda, Miniconda and Anaconda Repository.
The big difference between conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science and the
reason conda exists.
When pip installs a package, it automatically installs any dependent Python packages
without checking whether these conflict with previously installed packages. It will install a package
and any of its dependencies regardless of the state of the existing installation. Because of this,
a user with a working installation of, for example, Google TensorFlow can find that it stops
working after using pip to install a different package that requires a different version of the
dependent numpy library than the one used by TensorFlow. In some cases, the package may
appear to work but produce different results in detail.
5.2 Explanation of Key functions
1.Importing and Merging Data
Python code in one module gains access to the code in another module by the process
of importing it. The import statement is the most common way of invoking the import
machinery, but it is not the only way. Functions such as importlib.import_module() and the
built-in __import__() can also be used to invoke the import machinery.
The import statement combines two operations; it searches for the named module, then it
binds the result of that search to a name in the local scope. The search operation of
the import statement is defined as a call to the __import__() function, with the appropriate
arguments. The return value of __import__() is used to perform the name binding operation of
the import statement. See the import statement documentation for the exact details of that name
binding operation.
Contrary to when you merge new cases, merging in new variables requires the IDs
for each case in the two files to be the same, but the variable names should be different. In
this scenario, which is sometimes referred to as augmenting your data (or in SQL, “joins”) or
merging data by columns (i.e. you’re adding new columns of data to each row), you’re
adding in new variables with information for each existing case in your data file. As with
merging new cases where not all variables are present, the same thing applies if you merge in
new variables where some cases are missing – these should simply be given blank values.
Fig-2 Merging data sets
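As a minimal sketch of this step (assuming the three CSV files used later in this report, all sharing a customerID column; the file paths are illustrative), the pandas merge looks like this:
import pandas as pd
churn_data = pd.read_csv("churn_data.csv")        # illustrative paths
customer_data = pd.read_csv("customer_data.csv")
internet_data = pd.read_csv("internet_data.csv")
# inner join on the common ID: each merge adds new columns for every existing case
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')
telecom = pd.merge(df_1, internet_data, how='inner', on='customerID')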
2.Data volume
Since we don’t know in advance which features could be useful to predict churn, we had to work
on all the data that reflect customer behavior in general. We used data sets related to
calls, SMS, MMS, and the internet, with all related information such as complaints, network
data, IMEI, charging, and others. The data contained transactions for all customers during the
nine months before the prediction baseline. The size of this data was more than 70 terabytes,
and we couldn’t perform the needed feature engineering phase using traditional databases.
3.Data variety
The data used in this research is collected from multiple systems and databases. Each source
generates data in a different type of file: structured, semi-structured (XML, JSON) or
unstructured (CSV, text). Dealing with these kinds of data types is very hard without a big
data platform, whereas with one we can work on all the previous data types without making any
modification or transformation. By using the big data platform, we no longer have any
problem with the size of these data or the format in which the data are represented.
4.Unbalanced data set
The generated data set was unbalanced since it is a special case of the classification problem
where the distribution of a class is not usually homogeneous with other classes. The
dominant class is called the basic class, and the other is called the secondary class. The data
set is unbalanced if one of its categories is 10% or less compared to the other one [18].
Although machine learning algorithms are usually designed to improve accuracy by reducing
error, not all of them take class balance into account, and that may give bad results [18].
In general, classes need to be balanced so that they are given the same importance during
training.
We found that Syria Tel data set was unbalanced since the percentage of the secondary class
that represents churn customers is about 5% of the whole data set.
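A quick sketch of checking this imbalance (assuming a binary Churn column coded as 1 for churn and 0 otherwise, as in the data set prepared below); class weighting is one optional remedy:
# proportion of each class; the churn class is the minority
print(telecom['Churn'].value_counts(normalize=True))
# optional remedy: weight classes inversely to their frequency during training
from sklearn.linear_model import LogisticRegression
logreg_balanced = LogisticRegression(class_weight='balanced')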
5.Extensive features
The collected data had a very large number of columns, since there is a column for each service, product, and
offer related to calls, SMS, MMS, and the internet, in addition to columns related to personal
and demographic information. If we use all these data sources, the number of
columns for each customer before the data is processed will exceed ten thousand.
6.Missing values
There is a representation of each service and product for each customer. Missing values may
occur because not all customers have the same subscription. Some of them may have a
number of services and others may have something different. In addition, there are some
columns related to system configurations and these columns have only null value for all
customers.
Fig-3 Imputing missing values
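A minimal sketch of inspecting and handling missing values with pandas (column names follow the notebook below; dropping rows and median imputation are both shown, the second only as a commented alternative):
# number and percentage of nulls per column
print(telecom.isnull().sum())
print(round(100 * telecom.isnull().sum() / len(telecom.index), 2))
# drop the rows whose TotalCharges could not be parsed ...
telecom = telecom[~telecom['TotalCharges'].isnull()]
# ... or impute them with the median instead:
# telecom['TotalCharges'] = telecom['TotalCharges'].fillna(telecom['TotalCharges'].median())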
7.Checking the Churn Rate
The most basic way you can calculate churn for a given month is to check how many
customers you have at the start of the month and how many of those customers leave by the
end of the month. Once you have both numbers, divide the number of customers that left by
the number of customers at the start of the month.
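For example, a small sketch of this arithmetic (the customer counts are made-up numbers):
customers_at_start = 1000   # hypothetical customers at the start of the month
customers_lost = 50         # hypothetical customers who left during the month
churn_rate = customers_lost / customers_at_start
print("Monthly churn rate: {:.1%}".format(churn_rate))   # prints 5.0%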
8.Splitting Data into Training and Test Sets
As noted earlier, the data we use is usually split into training data and test data. The training
set contains a known output, and the model learns on this data in order to generalize to
other data later on. We keep the test data set (or subset) in order to test our model’s predictions
on data it has not seen.
Fig-4 Dividing into training and testing sets
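A sketch of this split with scikit-learn, mirroring the 70/30 proportions used in the notebook below (random_state is just an arbitrary seed for reproducibility):
from sklearn.model_selection import train_test_split
X = telecom.drop(['Churn', 'customerID'], axis=1)   # feature matrix
y = telecom['Churn']                                # target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3, random_state=100)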
9.Correlation Matrix
A correlation matrix is a table showing correlation coefficients between variables.
Each cell in the table shows the correlation between two variables. A correlation matrix is
used to summarize data, as an input into a more advanced analysis, and as a diagnostic for
advanced analyses.
Typically, a correlation matrix is “square”, with the same variables shown in the rows and
columns. I've shown an example below. This shows correlations between the stated
importance of various things to people. The line of 1.00s going from the top left to the
bottom right is the main diagonal, which shows that each variable always perfectly
correlates with itself. This matrix is symmetrical: the correlations shown above
the main diagonal are a mirror image of those below the main diagonal.
Fig-5 Correlation Matrix
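A minimal sketch of computing and visualising such a matrix for the prepared dataframe (seaborn's heatmap is one convenient option; newer pandas versions may require numeric_only=True in corr()):
import matplotlib.pyplot as plt
import seaborn as sns
corr = telecom.corr()             # pairwise correlation coefficients
plt.figure(figsize=(20, 10))
sns.heatmap(corr, annot=True)     # 1.00 appears along the main diagonal
plt.show()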
10.Feature Selection Using RFE
RFE stands for Recursive Feature Elimination. It selects features by recursively fitting a
model (here, logistic regression), ranking the features by importance, removing the weakest
feature, and repeating the process until the desired number of features remains. In this
project, RFE is used to narrow the large set of dummy and numeric variables down to the 13
predictors used to build the final churn model.
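A sketch of RFE with scikit-learn, keeping the 13 predictors used later in this report (recent scikit-learn versions spell the argument n_features_to_select; older versions also accept it positionally, as in the notebook below):
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=13)   # keep the 13 strongest predictors
rfe = rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 marks a selected feature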
11.Making Predictions
Once the logistic regression model has been trained on the training set, it is used to make
predictions on the test set. For each customer, the model outputs the probability of churn;
a probability cutoff is then applied to convert these probabilities into hard class labels
(1 for churn, 0 for no churn). A default cutoff of 0.5 is used first, and later in this report the
cutoff is tuned (to about 0.3) by examining accuracy, sensitivity and specificity at different
thresholds. The predicted labels are then compared against the actual churn values of the
test customers to judge how well the model generalizes to unseen data.
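A short sketch of this step, assuming the fitted logistic regression model (logsk) and the selected columns (col) from the notebook below:
# predicted probability of churn for each test customer
churn_prob = logsk.predict_proba(X_test[col])[:, 1]
# apply a probability cutoff to obtain hard 0/1 class labels
predicted = (churn_prob > 0.5).astype(int)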
12.Model Evaluation
Model evaluation is an integral part of the model development process. It helps to
find the model that best represents our data and to judge how well the chosen model will work in the
future. To avoid overfitting, evaluation is performed on a test set (not seen by the model) to measure
model performance.
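A minimal evaluation sketch on the held-out test set, assuming y_test and the predicted labels from the previous step:
from sklearn import metrics
print(metrics.confusion_matrix(y_test, predicted))   # rows: actual, columns: predicted
print(metrics.accuracy_score(y_test, predicted))     # overall accuracy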
13.ROC Curve
A receiver operating characteristic curve, or ROC curve, is a graphical plot that
illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is
varied.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive
rate (FPR) at various threshold settings. The true-positive rate is also known
as sensitivity, recall or probability of detection in machine learning. The false-positive rate is
also known as probability of false alarm and can be calculated as (1 − specificity). It can also
be thought of as a plot of the power as a function of the Type I Error of the decision rule
(when the performance is calculated from just a sample of the population, it can be thought
of as estimators of these quantities). The ROC curve is thus the sensitivity or recall as a
function of fall-out. In general, if the probability distributions for both detection and false
alarm are known, the ROC curve can be generated by plotting the cumulative distribution
function (area under the probability distribution from minus infinity to the discrimination
threshold) of the detection probability on the y-axis versus the cumulative distribution
function of the false-alarm probability on the x-axis.
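A sketch of plotting the ROC curve from the predicted churn probabilities (probability scores, not hard labels, should be passed so the threshold can vary):
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
fpr, tpr, thresholds = roc_curve(y_test, churn_prob)
auc = roc_auc_score(y_test, churn_prob)
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], 'k--')        # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()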
Histogram:
A histogram is a display of statistical information that uses rectangles to show the frequency
of data items in successive numerical intervals of equal size. In the most common form of
histogram, the independent variable is plotted along the horizontal axis and the dependent
variable is plotted along the vertical axis. The data appears as colored or shaded rectangles of
variable area.
The illustration, below, is a histogram showing the results of a final exam given to a
hypothetical class of students. Each score range is denoted by a bar of a certain color. If this
histogram were compared with those of classes from other years that received the same test
from the same professor, conclusions might be drawn about intelligence changes among
students over the years. Conclusions might also be drawn concerning the improvement or
decline of the professor's teaching ability with the passage of time. If this histogram were
compared with those of other classes in the same semester who had received the same final
exam but who had taken the course from different professors, one might draw conclusions
about the relative competence of the professors.
Fig-6 Histogram
Some histograms are presented with the independent variable along the vertical axis and the
dependent variable along the horizontal axis. That format is less common than the one shown
here.
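A small matplotlib sketch of this kind of histogram, using made-up exam scores:
import matplotlib.pyplot as plt
scores = [52, 61, 64, 68, 71, 73, 75, 78, 80, 83, 86, 90, 94]   # hypothetical final-exam scores
plt.hist(scores, bins=5, edgecolor='black')   # frequency of scores per interval
plt.xlabel('Score range')
plt.ylabel('Number of students')
plt.show()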
5.3 Screenshots or output screens
1.Code
# In[1]:
import pandas as pd
import numpy as np
# In[2]:
churn_data = pd.read_csv("OneDrive/Desktop/project datasets/churn_data.csv")
customer_data = pd.read_csv("OneDrive/Desktop/project datasets/customer_data.csv")
internet_data = pd.read_csv("OneDrive/Desktop/project datasets/internet_data.csv")
# In[3]:
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')
# In[4]:
telecom = pd.merge(df_1, internet_data, how='inner', on='customerID')
# In[5]:
telecom.head()
# In[6]:
telecom.describe()
# In[7]:
telecom.info()
# In[8]:
telecom['PhoneService'] = telecom['PhoneService'].map({'Yes': 1, 'No': 0})
telecom['PaperlessBilling'] = telecom['PaperlessBilling'].map({'Yes': 1, 'No': 0})
telecom['Churn'] = telecom['Churn'].map({'Yes': 1, 'No': 0})
telecom['Partner'] = telecom['Partner'].map({'Yes': 1, 'No': 0})
telecom['Dependents'] = telecom['Dependents'].map({'Yes': 1, 'No': 0})
# In[9]:
# Creating a dummy variable for the variable 'Contract' and dropping the first one.
cont = pd.get_dummies(telecom['Contract'],prefix='Contract',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,cont],axis=1)
# Creating a dummy variable for the variable 'PaymentMethod' and dropping the first one.
pm = pd.get_dummies(telecom['PaymentMethod'],prefix='PaymentMethod',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,pm],axis=1)
# Creating a dummy variable for the variable 'gender' and dropping the first one.
gen = pd.get_dummies(telecom['gender'],prefix='gender',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,gen],axis=1)
# Creating a dummy variable for the variable 'MultipleLines' and dropping the first one.
ml = pd.get_dummies(telecom['MultipleLines'],prefix='MultipleLines')
# dropping MultipleLines_No phone service column
ml1 = ml.drop(['MultipleLines_No phone service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ml1],axis=1)
# Creating a dummy variable for the variable 'InternetService' and dropping the first one.
iser = pd.get_dummies(telecom['InternetService'],prefix='InternetService',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,iser],axis=1)
# Creating a dummy variable for the variable 'OnlineSecurity'.
os = pd.get_dummies(telecom['OnlineSecurity'],prefix='OnlineSecurity')
os1= os.drop(['OnlineSecurity_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,os1],axis=1)
# Creating a dummy variable for the variable 'OnlineBackup'.
ob =pd.get_dummies(telecom['OnlineBackup'],prefix='OnlineBackup')
ob1 =ob.drop(['OnlineBackup_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ob1],axis=1)
# Creating a dummy variable for the variable 'DeviceProtection'.
dp =pd.get_dummies(telecom['DeviceProtection'],prefix='DeviceProtection')
dp1 = dp.drop(['DeviceProtection_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,dp1],axis=1)
# Creating a dummy variable for the variable 'TechSupport'.
ts =pd.get_dummies(telecom['TechSupport'],prefix='TechSupport')
ts1 = ts.drop(['TechSupport_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ts1],axis=1)
# Creating a dummy variable for the variable 'StreamingTV'.
st =pd.get_dummies(telecom['StreamingTV'],prefix='StreamingTV')
st1 = st.drop(['StreamingTV_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,st1],axis=1)
# Creating a dummy variable for the variable 'StreamingMovies'.
sm =pd.get_dummies(telecom['StreamingMovies'],prefix='StreamingMovies')
sm1 = sm.drop(['StreamingMovies_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,sm1],axis=1)
# In[10]:
telecom['MultipleLines'].value_counts()
# In[11]:
telecom = telecom.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetService',
                        'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                        'TechSupport', 'StreamingTV', 'StreamingMovies'], 1)
# In[12]:
telecom['TotalCharges'] = pd.to_numeric(telecom['TotalCharges'],errors='coerce')
telecom['tenure'] = telecom['tenure'].astype(float).astype(int)
# In[13]:
telecom.info()
# In[14]:
num_telecom = telecom[['tenure','MonthlyCharges','SeniorCitizen','TotalCharges']]
# In[15]:
num_telecom.describe(percentiles=[.25,.5,.75,.90,.95,.99])
# In[16]:
telecom.isnull().sum()
# In[17]:
round(100*(telecom.isnull().sum()/len(telecom.index)), 2)
# In[18]:
telecom = telecom[~np.isnan(telecom['TotalCharges'])]
# In[19]:
round(100*(telecom.isnull().sum()/len(telecom.index)), 2)
# In[20]:
df = telecom[['tenure','MonthlyCharges','TotalCharges']]
# In[21]:
normalized_df=(df-df.mean())/df.std()
# In[22]:
telecom = telecom.drop(['tenure','MonthlyCharges','TotalCharges'], 1)
# In[23]:
telecom = pd.concat([telecom,normalized_df],axis=1)
# In[24]:
telecom
# In[25]:
churn = (sum(telecom['Churn'])/len(telecom['Churn'].index))*100
# In[26]:
telecom
# In[27]:
churn
# In[28]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.metrics import mean_squared_error
# In[29]:
features = telecom.drop(['Churn','customerID'],axis=1).values
target = telecom['Churn'].values
# In[30]:
features_train,target_train=features[20:30],target[20:30]
features_test,target_test=features[40:50],target[40:50]
print(features_test)
# In[31]:
model = GaussianNB()
model.fit(features_train,target_train)
print(model.predict(features_test))
# In[32]:
from sklearn.model_selection import train_test_split
# In[33]:
# Putting feature variable to X
X = telecom.drop(['Churn','customerID'],axis=1)
# Putting response variable to y
y = telecom['Churn']
# In[34]:
y.head()
# In[35]:
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,
train_size=0.7,test_size=0.3,random_state=100)
# In[36]:
import statsmodels.api as sm
# In[37]:
# Logistic regression model
logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()
# In[38]:
# Importing matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')
# In[39]:
# Let's see the correlation matrix
plt.figure(figsize = (20,10)) # Size of the figure
sns.heatmap(telecom.corr(),annot = True)
# In[40]:
X_test2 = X_test.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No',
                       'DeviceProtection_No','TechSupport_No','StreamingTV_No',
                       'StreamingMovies_No'],1)
X_train2 = X_train.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No',
                         'DeviceProtection_No','TechSupport_No','StreamingTV_No',
                         'StreamingMovies_No'],1)
# In[41]:
plt.figure(figsize = (20,10))
sns.heatmap(X_train2.corr(),annot = True)
# In[42]:
logm2 = sm.GLM(y_train,(sm.add_constant(X_train2)), family = sm.families.Binomial())
logm2.fit().summary()
# In[43]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
from sklearn.feature_selection import RFE
rfe = RFE(logreg, 13) # running RFE with 13 variables as output
rfe = rfe.fit(X,y)
print(rfe.support_) # Printing the boolean results
print(rfe.ranking_) # Printing the ranking
# In[44]:
col = ['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year',
'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic',
'InternetService_No',
'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
# In[45]:
# Let's run the model using the selected variables
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk = LogisticRegression()
logsk.fit(X_train[col], y_train)
# In[46]:
#Comparing the model with StatsModels
logm4 = sm.GLM(y_train,(sm.add_constant(X_train[col])), family =
sm.families.Binomial())
logm4.fit().summary()
# In[47]:
# UDF for calculating vif value
def vif_cal(input_data, dependent_col):
    vif_df = pd.DataFrame(columns=['Var', 'Vif'])
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.OLS(y, x).fit().rsquared
        vif = round(1/(1-rsq), 2)
        vif_df.loc[i] = [xvar_names[i], vif]
    return vif_df.sort_values(by='Vif', axis=0, ascending=False, inplace=False)
# In[48]:
telecom.columns
['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year',
'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic',
'InternetService_No',
'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
# In[49]:
# Calculating Vif value
vif_cal(input_data=telecom.drop(['customerID','SeniorCitizen', 'Partner', 'Dependents',
                                 'PaymentMethod_Credit card (automatic)','PaymentMethod_Mailed check',
                                 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No',
                                 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes',
                                 'TechSupport_No','StreamingTV_No','StreamingTV_Yes','StreamingMovies_Yes',
                                 'MonthlyCharges'], axis=1), dependent_col='Churn')
# In[50]:
col = ['PaperlessBilling', 'Contract_One year', 'Contract_Two year',
'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic',
'InternetService_No',
'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
# In[51]:
logm5 = sm.GLM(y_train,(sm.add_constant(X_train[col])), family =
sm.families.Binomial())
logm5.fit().summary()
# In[52]:
# Calculating Vif value
vif_cal(input_data=telecom.drop(['customerID','PhoneService','SeniorCitizen', 'Partner', 'Dependents',
                                 'PaymentMethod_Credit card (automatic)','PaymentMethod_Mailed check',
                                 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No',
                                 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes',
                                 'TechSupport_No','StreamingTV_No','StreamingTV_Yes','StreamingMovies_Yes',
                                 'MonthlyCharges'], axis=1), dependent_col='Churn')
# In[53]:
# Let's run the model using the selected variables
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk = LogisticRegression()
logsk.fit(X_train[col], y_train)
# In[54]:
y_pred = logsk.predict_proba(X_test[col])
# In[55]:
y_pred_df = pd.DataFrame(y_pred)
# In[56]:
y_pred_1 = y_pred_df.iloc[:,[1]]
# In[57]:
y_pred_1.head()
# In[58]:
y_test_df = pd.DataFrame(y_test)
# In[59]:
y_test_df = pd.DataFrame(y_test)
# In[60]:
# Removing index for both dataframes to append them side by side
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
# In[61]:
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)
# In[62]:
y_pred_final= y_pred_final.rename(columns={ 1 : 'Churn_Prob'})
# In[63]:
# Rearranging the columns
y_pred_final = y_pred_final.reindex(['CustID','Churn','Churn_Prob'], axis=1)
# In[64]:
y_pred_final.head()
# In[65]:
# Creating new column 'predicted' with 1 if Churn_Prob>0.5 else 0
y_pred_final['predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.5 else 0)
# In[66]:
y_pred_final.head()
# In[67]:
from sklearn import metrics
# In[68]:
help(metrics.confusion_matrix)
# In[69]:
# Confusion matrix
confusion = metrics.confusion_matrix( y_pred_final.Churn, y_pred_final.predicted )
confusion
# In[70]:
metrics.accuracy_score( y_pred_final.Churn, y_pred_final.predicted)
# In[71]:
TP = confusion[1,1] # true positive
TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
# In[72]:
TP / float(TP+FN)
# In[73]:
TN / float(TN+FP)
# In[74]:
# Calculate false positive rate - predicting churn when the customer has not churned
print(FP/ float(TN+FP))
# In[75]:
# positive predictive value
print (TP / float(TP+FP))
# In[76]:
# Negative predictive value
print (TN / float(TN+ FN))
# In[77]:
def draw_roc(actual, probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs,
                                             drop_intermediate=False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(6, 4))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
    return fpr, tpr, thresholds
# In[78]:
draw_roc(y_pred_final.Churn, y_pred_final.predicted)
# In[79]:
# Let's create columns with different probability cutoffs
numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_pred_final[i] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > i else 0)
y_pred_final.head()
# In[80]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
# TP = confusion[1,1] # true positive
# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_pred_final.Churn, y_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
# Let's plot accuracy sensitivity and specificity for various probabilities.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.3 else
0)
y_pred_final.head()
#Let's check the overall accuracy.
metrics.accuracy_score( y_pred_final.Churn, y_pred_final.final_predicted)
FinalChurn=telecom.filter(['customerID','Churn'])
total_rows = FinalChurn[FinalChurn.Churn == 1]
total_rows
value = len(total_rows)
print("number of customers who have churned is " + str(value))
2. Output screenshots
Fig-1 Structure of Data
Fig-2 Histogram
Fig-3 Box Plot
Fig-4 Graph plot
Fig-5 Conversion of data types
Fig-6 Missing Values
Fig-7 Converting the data sets into 0 and 1
Fig-8 Naive Bayesian
Fig-9 Optimal cut-off point
Fig-10 Correlation matrix
Fig-11 ROC Curve
Fig-12 Customers who stay
Fig-13 Customers who leave
Chapter-6
TESTING & VALIDATION
6.1 Introduction:
Functional vs. Non-functional Testing
The goal of utilizing numerous testing methodologies in your development process is to
make sure your software can successfully operate in multiple environments and across
different platforms. These can typically be broken down into functional and non-
functional testing. Functional testing involves testing the application against the business
requirements. It incorporates all test types designed to guarantee that each part of a piece of
software behaves as expected, using the use cases provided by the design team or business
analyst. These testing methods are usually conducted in order and include:
 Unit testing
 Integration testing
 System testing
 Acceptance testing
Non-functional testing methods incorporate all test types focused on the operational aspects
of a piece of software. These include:
 Performance testing
 Security testing
 Usability testing
 Compatibility testing
The key to releasing high-quality software that can be easily adopted by your end users is to
build a robust testing framework that implements both functional and non-functional
software testing methodologies.
Unit Testing
Unit testing is the first level of testing and is often performed by the developers themselves.
It is the process of ensuring individual components of a piece of software at the code level
are functional and work as they were designed to. Developers in a test-driven environment
will typically write and run the tests prior to the software or feature being passed over to the
test team. Unit testing can be conducted manually, but automating the process will speed up
delivery cycles and expand test coverage. Unit testing will also make debugging easier
because finding issues earlier means they take less time to fix than if they were discovered
later in the testing process. TestLeft is a tool that allows advanced testers and developers to
shift left with the fastest test automation tool embedded in any IDE.
Fig-7 Unit Testing Life cycle
Unit Testing Techniques:
 White Box Testing - used to test the behavior of each function
 Black Box Testing - used to test the user interface, inputs and outputs
 Gray Box Testing - combines both approaches to execute tests and assess risks
White box testing:
The box testing approach of software testing consists of black box testing and white box
testing. We discuss here white box testing, which is also known as glass box testing,
structural testing, clear box testing, open box testing and transparent box testing. It
tests the internal coding and infrastructure of software, focusing on checking predefined inputs
against expected and desired outputs. It is based on the inner workings of an application and
revolves around internal structure testing. In this type of testing, programming skills are
required to design test cases. The primary goal of white box testing is to focus on the flow of
inputs and outputs through the software and to strengthen the security of the software.
The term 'white box' is used because of the internal perspective of the system. The clear box
or white box or transparent box name denote the ability to see through the software's outer
shell into its inner workings.
Test cases for white box testing are derived from the design phase of the software
development lifecycle. Data flow testing, control flow testing, path testing, branch testing, and
statement and decision coverage are all techniques used in white box testing as guidelines
to create error-free software.
Fig-8 White Box Testing
White box testing follows a set of working steps to make testing manageable and to make it
easy to understand what the next task is. There are some basic steps to perform white box testing.
Generic steps of white box testing:
o Design all test scenarios and test cases, and prioritize them by priority
number.
o This step involves the study of code at runtime to examine the resource utilization,
not accessed areas of the code, time taken by various methods and operations and so
on.
o In this step testing of internal subroutines takes place. Internal subroutines such as
nonpublic methods, interfaces are able to handle all types of data appropriately or
not.
o This step focuses on testing of control statements like loops and conditional
statements to check the efficiency and accuracy for different data inputs.
o In the last step white box testing includes security testing to check all possible
security loopholes by looking at how the code handles security
Reasoning of white box testing:
o It identifies internal security holes.
o It checks the flow of inputs through the code.
o It checks the functionality of conditional statements and loops.
o It tests functions, objects, and statements at an individual level.
Advantages of white box testing:
o White box testing optimizes code so hidden errors can be identified.
o Test cases of white box testing can be easily automated.
o This testing is more thorough than other testing approaches as it covers all code
paths.
o It can be started in the SDLC phase even without GUI.
Disadvantages of white box testing:
o White box testing is very time-consuming when it comes to large-scale
programming applications.
o White box testing is expensive and complex.
o It can still let errors reach production, because the tests are based on the existing code and
cannot detect missing functionality.
o White box testing needs professional programmers who have a detailed knowledge
and understanding of the programming language and implementation.
Black box testing:
Black box testing is a technique of software testing which examines the functionality of
software without peering into its internal structure or coding. The primary source of black
box testing is a specification of requirements that is stated by the customer.
In this method, the tester selects a function and gives it input values to examine its functionality,
and checks whether the function gives the expected output or not. If the function produces the
correct output, it passes the test; otherwise it fails. The test team reports the
result to the development team and then tests the next function. After testing
all functions, if there are severe problems, the software is given back to the development team
for correction.
Fig-9 Black Box Testing
Generic steps of black box testing:
o Examine the requirements and specifications of the system.
o Choose valid inputs (positive test scenarios) and invalid inputs (negative test scenarios)
and determine the expected outputs for them.
o Construct and execute the test cases.
o Compare the actual outputs with the expected outputs.
o Report any defects to the development team and retest after they are fixed.
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 

Dernier (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
★ CALL US 9953330565 ( HOT Young Call Girls In Badarpur delhi NCR
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 

Here are the key steps to build a logistic regression model to predict customer churn:

1. Data Understanding: Get details of the dataset such as variables, data types and missing values, and check for any issues.
2. Data Preparation: Handle missing data and outliers, encode categorical variables, and split the data into train and test sets.
3. Feature Selection: Identify the important predictor variables that influence churn and remove redundant/irrelevant variables.
4. Model Building: Fit a logistic regression on the training data. It predicts the probability of churn (0 or 1):

log(p / (1 - p)) = B0 + B1*X1 + B2*X2 + ... + Bn*Xn
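As an illustration of step 4, here is a minimal sketch in Python using pandas and scikit-learn. The file name "telecom_churn.csv" and the column names are assumptions for illustration, not the project's actual dataset or code.

# Minimal sketch: fitting a logistic regression churn model with scikit-learn.
# File name and column names below are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("telecom_churn.csv")          # hypothetical dataset
X = df[["tenure", "monthly_charges"]]          # assumed numeric predictor columns
y = df["churn"]                                # 1 = churned, 0 = retained

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The fitted model is exactly log(p/(1-p)) = B0 + B1*X1 + B2*X2
print("B0 (intercept):", model.intercept_[0])
print("B1..Bn (coefficients):", model.coef_[0])
print("P(churn) for the first test row:", model.predict_proba(X_test)[0, 1])

The printed intercept and coefficients correspond directly to B0 and B1..Bn in the equation above, so the probability of churn for any customer can be read off from the fitted log-odds.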

• 5. CONTENTS
1. Introduction
   1.1 Motivation 9
   1.2 Problem definition 9
   1.3 Objective of Project 9
      1.3.1 General Objective 9
      1.3.2 Specific Objective 10
   1.4 Limitations of Project 10
2. LITERATURE SURVEY
   2.1 Introduction 11
3. ANALYSIS
   3.1 Introduction 12
   3.2 Software Requirement Specification 12
      3.2.1 Functional requirements 12
      3.2.2 Non-Functional requirements 14
      3.2.3 Software requirement 14
      3.2.4 Hardware requirement 23
   3.3 Content diagram of Project 25
4. DESIGN
   4.1 Introduction 26
   4.2 UML Diagram 42
      4.2.1 Class diagram 42
      4.2.2 Use case diagram 43
      4.2.3 Activity diagram 44
      4.2.4 Sequence diagram 45
      4.2.5 Collaboration diagram 46
      4.2.6 Component diagram 47
      4.2.7 Deployment diagram 48
5. IMPLEMENTATION METHODS & RESULTS 49
   5.1 Introduction 49
   5.2 Explanation of Key functions 50
   5.3 Screenshots or output screens 71
• 6.
6. TESTING & VALIDATION
   6.1 Introduction 78
   6.2 Design of test cases and scenarios 88
   6.3 Validation 88
   6.4 Conclusion 90
7. CONCLUSION AND FUTURE ENHANCEMENTS
   First Paragraph - Project Conclusion 90
   Second Paragraph - Future Enhancement 90
REFERENCES 91
   1. Author Name, Title of Paper/Book, Publisher's Name, Year of Publication
   2. Full URL Address
• 7. LIST OF FIGURES (PAGE NO)
Fig-1 UML Views 28
Fig-2 Merging data sets 50
Fig-3 Imputing missing values 52
Fig-4 Dividing into training set 52
Fig-5 Correlation Matrix 53
Fig-6 Histogram 55
Fig-7 Unit testing life cycle 79
Fig-8 White box testing 80
Fig-9 Black box testing 82
Fig-10 Grey box testing 83
Fig-11 Grey box testing feature 84
Fig-12 Software verification 89
• 8. LIST OF OUTPUT SCREENSHOTS (PAGE NO)
Fig-1 Structure of data 71
Fig-2 Histogram 72
Fig-3 Box plot 72
Fig-4 Graph plot 73
Fig-5 Conversion of data types 73
Fig-6 Missing values 74
Fig-7 Converting the data sets to 1 and 0 74
Fig-8 Naive Bayesian 75
Fig-9 Optimal cut-off point 75
Fig-10 Correlation matrix 76
Fig-11 ROC curve 76
Fig-12 Customers who stay 77
Fig-13 Customers who leave 77
• 9. Chapter-1 Introduction

1.1 Motivation
In the present project we are trying to help service providers avoid losing their customers. With its help we can better understand the customers and let the service providers know their customers.

1.2 Problem definition
Telecommunication is the exchange of signs, signals, messages, words, writings, images, sounds or information of any nature by wire, radio, optical or other electromagnetic systems. Telecommunication occurs when the exchange of information between communication participants includes the use of technology. It is transmitted through a transmission medium, such as over physical media, for example over electrical cable, or via electromagnetic radiation through space, such as radio or light. Such transmission paths are often divided into communication channels, which afford the advantages of multiplexing. Since the Latin term communicatio is considered the social process of information exchange, the term telecommunications is often used in its plural form because it involves many different technologies.

The churn rate, also known as the rate of attrition or customer churn, is the rate at which customers stop doing business with an entity. It is most commonly expressed as the percentage of service subscribers who discontinue their subscriptions within a given time period. It is also the rate at which employees leave their jobs within a certain period. For a company to expand its clientele, its growth rate (measured by the number of new customers) must exceed its churn rate.

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.

1.3 Objective of Project
1.3.1 General Objective
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers. Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as one of their key business metrics, because the cost of retaining an existing customer is far less than that of acquiring a new one. Companies from these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.

1.3.2 Specific Objective
• 10. Companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs due to a decision by the customer to switch to another company or service provider; involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided. Predictive analytics uses churn prediction models that predict customer churn by assessing each customer's propensity or risk to churn. Since these models generate a small prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base who are most vulnerable to churn.

1.4 Limitations of Project
Acquiring the data of a customer is difficult. These days it is very difficult to know what a person is thinking, so we can only judge with the help of the data in front of us, which may be only 80-90% reliable. Therefore, we should make sure that our predictions are good.
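To make the churn-rate and growth-rate definitions from section 1.2 concrete, here is a small illustrative calculation in Python; the subscriber counts are made-up numbers, not figures from the project's dataset.

# Illustrative only: subscriber counts below are made-up numbers.
subscribers_at_start = 10000      # subscribers at the start of the period
subscribers_lost = 450            # subscribers who discontinued during the period
new_subscribers = 700             # subscribers acquired during the period

churn_rate = subscribers_lost / subscribers_at_start * 100
growth_rate = new_subscribers / subscribers_at_start * 100

print(f"Churn rate:  {churn_rate:.1f}%")    # 4.5%
print(f"Growth rate: {growth_rate:.1f}%")   # 7.0%
print("Clientele expanding:", growth_rate > churn_rate)

In this made-up example the growth rate (7.0%) exceeds the churn rate (4.5%), so the clientele expands; if the two numbers were reversed, the customer base would shrink.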
• 11. Chapter-2 LITERATURE SURVEY

2.1 Introduction

LOGISTIC REGRESSION
Problem Statement: "You have a telecom firm which has collected data of all its customers."
The main types of attributes are:
1. Demographics (age, gender, etc.)
2. Services availed (internet packs purchased, special offers, etc.)
3. Expenses (amount of recharge done per month, etc.)

Based on all this past information, you want to build a model which will predict whether a particular customer will churn or not. So the variable of interest, i.e. the target variable, here is 'Churn', which will tell us whether or not a particular customer has churned. It is a binary variable: 1 means that the customer has churned and 0 means the customer has not churned. With 21 predictor variables we need to predict whether a particular customer will switch to another telecom provider or not. The data sets were taken from the Kaggle website.

DATA PROCEDURE
Import the required libraries
1. Importing all datasets
2. Merging all datasets based on condition ("customer_id")
3. Data cleaning - checking the null values
4. Check for the missing values and replace them
5. Model building
   • Binary encoding
   • One-hot encoding
   • Creating dummy variables and removing the extra columns
6. Feature selection using RFE - Recursive Feature Elimination
7. Getting the predicted values on the train set
8. Creating a new column 'predicted' with 1 if churn probability > 0.5 else 0
9. Create a confusion matrix on the train set and test set
10. Check the overall accuracy
(An illustrative code sketch of steps 5-10 follows this list.)
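The following is a minimal sketch of steps 5-10 above. It assumes an already merged and cleaned CSV file; the file name "merged_churn_data.csv", the column name "Churn", and the choice of 15 RFE features are illustrative assumptions, not the project's exact code.

# Sketch of steps 5-10: dummy variables, RFE, 0.5 threshold, confusion matrix, accuracy.
# File name, column names and the number of RFE features are assumptions.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("merged_churn_data.csv")                         # hypothetical merged dataset
X = pd.get_dummies(df.drop(columns=["Churn"]), drop_first=True)   # dummies, drop the extra columns
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Recursive Feature Elimination keeps the strongest predictors (15 is an arbitrary choice here)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=15)
rfe.fit(X_train, y_train)
selected = X_train.columns[rfe.support_]

model = LogisticRegression(max_iter=1000).fit(X_train[selected], y_train)

# predicted = 1 if churn probability > 0.5 else 0
train_pred = (model.predict_proba(X_train[selected])[:, 1] > 0.5).astype(int)
test_pred = (model.predict_proba(X_test[selected])[:, 1] > 0.5).astype(int)

print(confusion_matrix(y_train, train_pred))           # confusion matrix on the train set
print(confusion_matrix(y_test, test_pred))             # confusion matrix on the test set
print("Overall accuracy:", accuracy_score(y_test, test_pred))

The 0.5 cut-off used here matches step 8; in practice the optimal cut-off point can be tuned, as the output screenshots (Fig-9) suggest was done in the project.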
• 12. Chapter-3 ANALYSIS

3.1 Introduction
The field of chemistry uses analysis in at least three ways: to identify the components of a particular chemical compound (qualitative analysis), to identify the proportions of components in a mixture (quantitative analysis), and to break down chemical processes and examine chemical reactions between elements of matter. For an example of its use, analysis of the concentration of elements is important in managing a nuclear reactor, so nuclear scientists will analyse neutron activation to develop discrete measurements within vast samples. A matrix can have a considerable effect on the way a chemical analysis is conducted and the quality of its results. Analysis can be done manually or with a device. Chemical analysis is an important element of national security among the major world powers with materials.

3.2 Software Requirement Specification

3.2.1 Functional requirements
Functional requirements may involve calculations, technical details, data manipulation and processing, and other specific functionality that define what a system is supposed to accomplish. Behavioural requirements describe all the cases where the system uses the functional requirements; these are captured in use cases. Functional requirements are supported by non-functional requirements (also known as "quality requirements"), which impose constraints on the design or implementation (such as performance requirements, security, or reliability). Generally, functional requirements are expressed in the form "system must do requirement", while non-functional requirements take the form "system shall be requirement". The plan for implementing functional requirements is detailed in the system design, whereas non-functional requirements are detailed in the system architecture.

Customer data
It contains all data related to the customer's services and contract information, in addition to all offers, packages, and services subscribed to by the customer. Furthermore, it also contains information generated from the CRM system, such as all customer GSMs, type of subscription, birthday, gender, location of residence and more.

Network logs data
Contains the internal sessions related to internet, calls, and SMS for each transaction in the telecom operator, such as the time needed to open a session for the internet and the call ending status. It can indicate if a session dropped due to an error in the internal network.

Call detail records "CDRs"
Contain all charging information about calls, SMS, MMS, and internet transactions made by customers. This data source is generated as text files. This data has a large size and there is a lot of detailed information in it. We spent a lot of time understanding it and getting to know its sources and storage format. In addition to these records, the data must be linked to the detailed data stored in relational databases that contain detailed information about the customer.
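As a hedged illustration of how these three sources could be linked on a common customer identifier, here is a small pandas sketch; the file names, column names and aggregations are assumptions for illustration, not the operator's actual schema.

# Illustrative sketch: joining customer, network-log and CDR extracts on customer_id.
# File names, column names and aggregations are assumptions.
import pandas as pd

customers = pd.read_csv("customer_data.csv")      # CRM: subscription type, gender, location, ...
network_logs = pd.read_csv("network_logs.csv")    # session records: drops, setup time, ...
cdrs = pd.read_csv("cdr_records.csv")             # charging records: calls, SMS, MMS, internet

# Aggregate the transactional sources to one row per customer before joining
cdr_summary = cdrs.groupby("customer_id").agg(
    total_calls=("call_id", "count"),
    total_charge=("charge", "sum"),
).reset_index()

net_summary = network_logs.groupby("customer_id").agg(
    dropped_sessions=("dropped", "sum"),          # assumes a boolean 'dropped' column
).reset_index()

merged = (customers
          .merge(net_summary, on="customer_id", how="left")
          .merge(cdr_summary, on="customer_id", how="left"))
print(merged.shape)

Aggregating the high-volume CDR and session data to one row per customer before the join keeps the merged table at customer level, which is the granularity needed for churn modelling.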
• 13. The quality of the system is maintained in such a way that it is very user friendly to all users. The software quality attributes are assumed as under:

1. Accurate and hence reliable.
In simpler terms, given a set of data points from repeated measurements of the same quantity, the set can be said to be accurate if their average is close to the true value of the quantity being measured, while the set can be said to be precise if the values are close to each other. In the first, more common definition of "accuracy" above, the two concepts are independent of each other, so a particular set of data can be said to be either accurate, or precise, or both, or neither. In everyday language, we use the word reliable to mean that something is dependable and that it will behave predictably every time.

2. Secured.
Security is freedom from, or resilience against, potential harm caused by others. The word entered the English language in the 16th century. It is derived from the Latin securus, meaning freedom from anxiety: se (without) + cura (care, anxiety).

3. Fast speed.
In everyday use and in kinematics, the speed of an object is the magnitude of the change of its position; it is thus a scalar quantity. The average speed of an object in an interval of time is the distance travelled by the object divided by the duration of the interval; the instantaneous speed is the limit of the average speed as the duration of the time interval approaches zero.

4. Compatibility.
A state in which two things are able to exist or occur together without problems or conflict.

3.2.2 Non-Functional requirements
• 14. In systems engineering and requirements engineering, a non-functional requirement (NFR) is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviours. They are contrasted with functional requirements that define specific behaviour or functions. The plan for implementing functional requirements is detailed in the system design. The plan for implementing non-functional requirements is detailed in the system architecture, because they are usually architecturally significant requirements.

3.2.3 Software requirement

1. Python

Introduction
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3. The Python 2 language was officially discontinued in 2020 (first planned for 2015), and "Python 2.7.18 is the last Python 2.7 release and therefore the last Python 2 release." No more security patches or other improvements will be released for it. With Python 2's end of life, only Python 3.5.x and later are supported. Python interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, an open-source reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.

History of Python
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by SETL), capable of exception handling and interfacing with the Amoeba operating system. Its implementation began in December 1989. Van Rossum shouldered sole responsibility for the project, as the lead developer, until 12 July 2018, when he announced his "permanent vacation" from his responsibilities as Python's Benevolent Dictator For Life, a title the Python community bestowed upon him to reflect his long-term commitment as the
• 15. project's chief decision-maker. He now shares his leadership as a member of a five-person steering council. In January 2019, active Python core developers elected Brett Cannon, Nick Coghlan, Barry Warsaw, Carol Willing and Van Rossum to a five-member "Steering Council" to lead the project.

Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-detecting garbage collector and support for Unicode. Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not completely backward-compatible. Many of its major features were backported to the Python 2.6.x and 2.7.x version series. Releases of Python 3 include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3. Python 2.7's end-of-life date was initially set at 2015, then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.

2. Anaconda 3

Introduction
Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. The distribution includes data-science packages suitable for Windows, Linux, and macOS. It is developed and maintained by Anaconda, Inc., which was founded by Peter Wang and Travis Oliphant in 2012. As an Anaconda, Inc. product, it is also known as Anaconda Distribution or Anaconda Individual Edition, while other products from the company are Anaconda Team Edition and Anaconda Enterprise Edition, both of which are not free.

Package versions in Anaconda are managed by the package management system conda. This package manager was spun out as a separate open-source package as it ended up being useful on its own and for things other than Python. There is also a small, bootstrap version of Anaconda called Miniconda, which includes only conda, Python, the packages they depend on, and a small number of other packages.

History of Anaconda 3
Anaconda distribution comes with 1,500 packages selected from PyPI as well as the conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command line interface (CLI).

The big difference between conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason conda exists. When pip installs a package, it automatically installs any dependent Python packages without checking if these conflict with previously installed packages. It will install a package and any of its dependencies regardless of the state of the existing installation. Because of this, a user with a working installation of, for example, Google TensorFlow, can find that it stops working after having used pip to install a different package that requires a different version of the
• 16. dependent numpy library than the one used by TensorFlow. In some cases, the package may appear to work but produce different results in detail.

In contrast, conda analyses the current environment, including everything currently installed, and, together with any version limitations specified (e.g. the user may wish to have TensorFlow version 2.0 or higher), works out how to install a compatible set of dependencies, and shows a warning if this cannot be done.

Open-source packages can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or the user's own private repository or mirror, using the conda install command. Anaconda, Inc. compiles and builds the packages available in the Anaconda repository itself, and provides binaries for Windows 32/64 bit, Linux 64 bit and macOS 64-bit. Anything available on PyPI may be installed into a conda environment using pip, and conda will keep track of what it has installed itself and what pip has installed. Custom packages can be made using the conda build command, and can be shared with others by uploading them to Anaconda Cloud, PyPI or other repositories.

The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python 3.7. However, it is possible to create new environments that include any version of Python packaged with conda.

Anaconda distribution comes with 1,500 packages selected from PyPI as well as the conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command line interface (CLI). Here we define conda as a tool that analyses your current environment, everything you have installed, and any version limitations you specify (e.g. you only want TensorFlow >= 2.0), and figures out how to install compatible dependencies, or it will tell you that what you want cannot be done. pip, by contrast, will just install the package you specify and any dependencies, even if that breaks other packages.

Conda allows users to easily install different versions of binary software packages and any required libraries appropriate for their computing platform. Also, it allows users to switch between package versions and download and install updates from a software repository. Conda is written in the Python programming language, but can manage projects containing code written in any language (e.g., R), including multi-language projects. Conda can install Python, while similar Python-based cross-platform package managers (such as wheel or pip) cannot. A popular conda channel for bioinformatics software is Bioconda, which provides multiple software distributions for computational biology. In fact, the conda package and environment manager is included in all versions of Anaconda, Miniconda and Anaconda Repository.

The big difference between conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason conda exists. When pip installs a package, it automatically installs any dependent Python packages without checking if these conflict with previously installed packages. It will install a package and any of its dependencies regardless of the state of the existing installation. Because of this, a user with a working installation of, for example, Google TensorFlow, can find that it stops working after having used pip to install a different package that requires a different version of the dependent numpy library than the one used by TensorFlow.
In some cases, the package may appear to work but produce different results in detail.
• 17. In contrast, conda analyses the current environment, including everything currently installed, and, together with any version limitations specified (e.g. the user may wish to have TensorFlow version 2.0 or higher), works out how to install a compatible set of dependencies, and shows a warning if this cannot be done. Open-source packages can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or your own private repository or mirror, using the conda install command. Anaconda, Inc. compiles and builds the packages available in the Anaconda repository itself, and provides binaries for Windows 32/64 bit, Linux 64 bit and macOS 64-bit. Anything available on PyPI may be installed into a conda environment using pip, and conda will keep track of what it has installed itself and what pip has installed. Custom packages can be made using the conda build command, and can be shared with others by uploading them to Anaconda Cloud, PyPI or other repositories. The default installation of Anaconda2 includes Python 2.7 and Anaconda3 includes Python 3.7. However, it is possible to create new environments that include any version of Python packaged with conda.

Anaconda Navigator
Anaconda Navigator is a desktop graphical user interface included in the Anaconda distribution that allows users to launch applications and manage conda packages, environments and channels without using command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run the packages and update them. It is available for Windows, macOS and Linux. The following applications are available by default in Navigator.

Conda is an open-source, language-agnostic package and environment management system that installs, runs, and updates packages and their dependencies. It was created for Python programs, but it can package and distribute software for any language (e.g., R), where R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, data-mining surveys, and studies of scholarly literature databases show substantial increases in popularity; as of February 2020, R ranks 13th in the TIOBE index, a measure of popularity of programming languages. A GNU package, the official R software environment is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License. Pre-compiled executables are provided for various operating systems. Although R has a command line interface, there are several third-party graphical user interfaces, such as RStudio, an integrated development environment, and Jupyter, a notebook interface. The conda package and environment manager is included in all versions of Anaconda, Miniconda, and Anaconda Repository.

Anaconda Cloud
Anaconda Cloud is a package management service by Anaconda where you can find, access, store and share public and private notebooks, environments, and conda and PyPI packages. Cloud hosts useful Python packages, notebooks and environments for a wide
• 18. variety of applications. You do not need to log in, or to have a Cloud account, to search for public packages or to download and install them.

Project Jupyter
Project Jupyter is a nonprofit organization created to "develop open-source software, open standards, and services for interactive computing across dozens of programming languages". Spun off from IPython in 2014 by Fernando Pérez, Project Jupyter supports execution environments in several dozen languages. Project Jupyter's name is a reference to the three core programming languages supported by Jupyter, which are Julia, Python and R, and also a homage to Galileo's notebooks recording the discovery of the moons of Jupiter. Project Jupyter has developed and supported the interactive computing products Jupyter Notebook, JupyterHub, and JupyterLab, the next-generation version of Jupyter Notebook.

In 2014, Fernando Pérez announced a spin-off project from IPython called Project Jupyter. IPython continued to exist as a Python shell and a kernel for Jupyter, while the notebook and other language-agnostic parts of IPython moved under the Jupyter name. Jupyter is language agnostic and supports execution environments (also called kernels) in several dozen languages, among which are Julia, R, Haskell, Ruby, and of course Python (via the IPython kernel). In 2015, GitHub and the Jupyter Project announced native rendering of the Jupyter notebook file format (.ipynb files) on the GitHub platform.

Jupyter Notebook
Jupyter Notebook (formerly IPython Notebooks) is a web-based computational environment for creating Jupyter notebook documents. The "notebook" term can colloquially refer to many different entities, mainly the Jupyter web application, the Jupyter Python web server, or the Jupyter document format, depending on context. A Jupyter Notebook document is a JSON document, following a versioned schema, containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media, usually ending with the ".ipynb" extension. A Jupyter Notebook can be converted to a number of open-source output formats (HTML, presentation slides, PDF, Markdown, Python) through "Download As" in the web interface, via the nbconvert library, or with the "jupyter nbconvert" command line interface in a shell. To simplify visualisation of Jupyter notebook documents on the web, the library is provided as a service through NbViewer, which can take a URL to any publicly available notebook document, convert it to HTML on the fly and display it to the user.

Jupyter Notebook interface
Jupyter Notebook can connect to many kernels to allow programming in many languages. By default, Jupyter Notebook ships with the IPython kernel. As of the 2.3 release (October 2014), there are 49 Jupyter-compatible kernels for many programming languages, including Python, R, Julia and Haskell. The Notebook interface was added to IPython in the 0.12 release (December 2011) and renamed to Jupyter Notebook in 2015 (IPython 4.0 - Jupyter 1.0). Jupyter Notebook is similar to the notebook interface of other programs such as Maple, Mathematica, and SageMath, a computational interface style that originated with Mathematica in the 1980s. According to The Atlantic, Jupyter interest overtook the popularity of the Mathematica notebook interface in early 2018.
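To make the notebook document format described above concrete, here is a small hedged sketch using the nbformat library to build a minimal .ipynb file programmatically; the file name and cell contents are illustrative only.

# Sketch: creating a minimal Jupyter notebook document with nbformat.
# The resulting file is plain JSON with an ordered list of cells, as described above.
import nbformat

nb = nbformat.v4.new_notebook()
nb.cells.append(nbformat.v4.new_markdown_cell("# Telecom churn - exploratory notebook"))
nb.cells.append(nbformat.v4.new_code_cell("import pandas as pd\nprint(pd.__version__)"))

with open("churn_example.ipynb", "w", encoding="utf-8") as f:   # illustrative file name
    nbformat.write(nb, f)

# The saved file could then be converted, for example to HTML, with:
#   jupyter nbconvert --to html churn_example.ipynb

Opening the saved file in a text editor shows the versioned JSON schema directly, which is why notebooks can be rendered by services such as NbViewer and GitHub without running a kernel.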
• 19. Jupyter kernels
A Jupyter kernel is a program responsible for handling various types of requests (code execution, code completions, inspection) and providing a reply. Kernels talk to the other components of Jupyter using ZeroMQ over the network, and thus can be on the same or remote machines. Unlike many other notebook-like interfaces, in Jupyter, kernels are not aware that they are attached to a specific document, and can be connected to many clients at once. Usually kernels allow execution of only a single language, but there are a couple of exceptions. By default, Jupyter ships with IPython as the default kernel and a reference implementation via the ipykernel wrapper. Kernels for many languages, with varying quality and features, are available.

JupyterHub
JupyterHub is a multi-user server for Jupyter Notebooks. It is designed to support many users by spawning, managing, and proxying many singular Jupyter Notebook servers. While JupyterHub requires managing servers, third-party services like Jupyo provide an alternative to JupyterHub by hosting and managing multi-user Jupyter notebooks in the cloud.

JupyterLab
JupyterLab is the next-generation user interface for Project Jupyter. It offers all the familiar building blocks of the classic Jupyter Notebook (notebook, terminal, text editor, file browser, rich outputs, etc.) in a flexible and powerful user interface. The first stable release was announced on February 20, 2018.

Web application
It has also added user-based as well as system-based web application enhancements to support deployment across a variety of environments. It also tries to manage sessions as well as applications across the network. Tomcat is building additional components. A number of additional components may be used with Apache Tomcat. These components may be built by users should they need them, or they can be downloaded from one of the mirrors.

Features: Tomcat 7.x implements the Servlet 3.0 and JSP 2.2 specifications. It requires Java version 1.6, although previous versions have run on Java 1.1 through 1.5. Versions 5 through 6 saw improvements in garbage collection, JSP parsing, performance and scalability. Native wrappers, known as "Tomcat Native", are available for Microsoft Windows and Unix for platform integration. Tomcat 8.x implements the Servlet 3.1 and JSP 2.3 specifications. Apache Tomcat 8.5.x is intended to replace 8.0.x and includes new features pulled forward from Tomcat 9.0.x. The minimum Java version and implemented specification versions remain unchanged.

WINDOWS 10
• 20. Windows 10 is a series of personal computer operating systems produced by Microsoft as part of its Windows NT family of operating systems. It is the successor to Windows 8.1, and was released to manufacturing on July 15, 2015, and broadly released for retail sale on July 29, 2015. Windows 10 receives new builds on an ongoing basis, which are available at no additional cost to users, in addition to additional test builds of Windows 10 which are available to Windows Insiders. Devices in enterprise environments can receive these updates at a slower pace, or use long-term support milestones that only receive critical updates, such as security patches, over their ten-year lifespan of extended support.

One of Windows 10's most notable features is support for universal apps, an expansion of the Metro-style apps first introduced in Windows 8. Universal apps can be designed to run across multiple Microsoft product families with nearly identical code, including PCs, tablets, smartphones, embedded systems, Xbox One, Surface Hub and Mixed Reality. The Windows user interface was revised to handle transitions between a mouse-oriented interface and a touchscreen-optimized interface based on available input devices, particularly on 2-in-1 PCs; both interfaces include an updated Start menu which incorporates elements of Windows 7's traditional Start menu with the tiles of Windows 8. Windows 10 also introduced the Microsoft Edge web browser, a virtual desktop system, a window and desktop management feature called Task View, support for fingerprint and face recognition login, new security features for enterprise environments, and DirectX 12.

Windows 10 received mostly positive reviews upon its original release in July 2015. Critics praised Microsoft's decision to provide a desktop-oriented interface in line with previous versions of Windows, contrasting the tablet-oriented approach of Windows 8, although Windows 10's touch-oriented user interface mode was criticized for containing regressions upon the touch-oriented interface of Windows 8. Critics also praised the improvements to Windows 10's bundled software over Windows 8.1, Xbox Live integration, as well as the functionality and capabilities of the Cortana personal assistant and the replacement of Internet Explorer with Edge. However, media outlets have been critical of changes to operating system behaviours, including mandatory update installation, privacy concerns over data collection performed by the OS for Microsoft and its partners, and the adware-like tactics used to promote the operating system on its release.

Although Microsoft's goal to have Windows 10 installed on over a billion devices within three years of its release had failed, it still had an estimated usage share of 60% of all Windows versions on traditional PCs, and thus 47% of traditional PCs were running Windows 10 by September 2019. Across all platforms (PC, mobile, tablet and console), 35% of devices run some kind of Windows, whether Windows 10 or older.

Microsoft Excel
Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android and iOS. It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications. It has been a very widely applied spreadsheet
• 21. for these platforms, especially since version 5 in 1993, and it has replaced Lotus 1-2-3 as the industry standard for spreadsheets. Excel forms part of the Microsoft Office suite of software.

Number of rows and columns
Versions of Excel up to 7.0 had a limitation in the size of their data sets of 16K (2^14 = 16,384) rows. Versions 8.0 through 11.0 could handle 64K (2^16 = 65,536) rows and 256 columns (2^8, up to label 'IV'). Version 12.0 onwards, including the current version 16.x, can handle over 1M (2^20 = 1,048,576) rows and 16,384 (2^14, up to label 'XFD') columns.

Microsoft Excel up until the 2007 version used a proprietary binary file format called Excel Binary File Format (.XLS) as its primary format. Excel 2007 uses Office Open XML as its primary file format, an XML-based format that followed after a previous XML-based format called "XML Spreadsheet" ("XMLSS"), first introduced in Excel 2002. Although supporting and encouraging the use of new XML-based formats as replacements, Excel 2007 remained backwards-compatible with the traditional binary formats. In addition, most versions of Microsoft Excel can read CSV, DBF, SYLK, DIF, and other legacy formats. Support for some older file formats was removed in Excel 2007. The file formats were mainly from DOS-based programs.

Microsoft originally marketed a spreadsheet program called Multiplan in 1982. Multiplan became very popular on CP/M systems, but on MS-DOS systems it lost popularity to Lotus 1-2-3. Microsoft released the first version of Excel for the Macintosh on September 30, 1985, and the first Windows version was 2.05 (to synchronize with the Macintosh version 2.2) in November 1987. Lotus was slow to bring 1-2-3 to Windows, and by the early 1990s Excel had started to outsell 1-2-3 and helped Microsoft achieve its position as a leading PC software developer. This accomplishment solidified Microsoft as a valid competitor and showed its future of developing GUI software. Microsoft maintained its advantage with regular new releases, every two years or so. Excel 2.0 is the first version of Excel for the Intel platform. Versions prior to 2.0 were only available on the Apple Macintosh.

Excel 2.0 (1987)
The first Windows version was labeled "2" to correspond to the Mac version. This included a run-time version of Windows. BYTE in 1989 listed Excel for Windows as among the "Distinction" winners of the BYTE Awards. The magazine stated that the port of the "extraordinary" Macintosh version "shines", with a user interface as good as or better than the original.

Excel 3.0 (1990)
Included toolbars, drawing capabilities, outlining, add-in support, 3D charts, and many more new features. Also, an easter egg in Excel 4.0 reveals a hidden animation of a dancing set of numbers 1 through 3, representing Lotus 1-2-3, which was then crushed by an Excel logo.
• 22. Excel 5.0 (1993)
With version 5.0, Excel included Visual Basic for Applications (VBA), a programming language based on Visual Basic which adds the ability to automate tasks in Excel and to provide user-defined functions (UDFs) for use in worksheets. VBA is a powerful addition to the application and includes a fully featured integrated development environment (IDE). Macro recording can produce VBA code replicating user actions, thus allowing simple automation of regular tasks. VBA allows the creation of forms and in-worksheet controls to communicate with the user. The language supports use (but not creation) of ActiveX (COM) DLLs; later versions added support for class modules, allowing the use of basic object-oriented programming techniques.

The automation functionality provided by VBA made Excel a target for macro viruses. This caused serious problems until antivirus products began to detect these viruses. Microsoft belatedly took steps to prevent the misuse by adding the ability to disable macros completely, to enable macros when opening a workbook, or to trust all macros signed using a trusted certificate. Versions 5.0 to 9.0 of Excel contain various Easter eggs, including a "Hall of Tortured Souls", although since version 10 Microsoft has taken measures to eliminate such undocumented features from its products. Version 5.0 was released in a 16-bit x86 version for Windows 3.1 and later in a 32-bit version for NT 3.51 (x86/Alpha/PowerPC).

Excel 2007 (12.0)
Included in Office 2007, this release was a major upgrade from the previous version. Similar to other updated Office products, Excel in 2007 used the new Ribbon menu system. This was different from what users were used to, and was met with mixed reactions. One study reported fairly good acceptance by users, except highly experienced users and users of word processing applications with a classical WIMP interface, but was less convinced in terms of efficiency and organization. However, an online survey reported that a majority of respondents had a negative opinion of the change, with advanced users being "somewhat more negative" than intermediate users, and users reporting a self-estimated reduction in productivity.

Added functionality included the SmartArt set of editable business diagrams. Also added was improved management of named variables through the Name Manager, and much-improved flexibility in formatting graphs, which allow (x, y) coordinate labeling and lines of arbitrary weight. Several improvements to pivot tables were introduced. Also, like other Office products, the Office Open XML file formats were introduced, including .xlsm for a workbook with macros and .xlsx for a workbook without macros.

Specifically, many of the size limitations of previous versions were greatly increased. To illustrate, the number of rows was now 1,048,576 (2^20) and the number of columns was 16,384 (2^14; the far-right column is XFD). This changes what is a valid A1 reference versus a named range. This version made more extensive use of multiple cores for the calculation of spreadsheets; however, VBA macros are not handled in parallel, and XLL add-ins were only executed in parallel if they were thread-safe and this was indicated at registration.

3.2.4 Hardware requirement
• 23.
Server-side hardware
1. Hardware recommended by all software needed
2. Communication hardware to serve client requests

Client-side hardware
1. Hardware recommended by the respective client's operating system and web browser
2. Communication hardware to communicate with the server

Others
An Intel Core 2 Duo CPU, a RAM of 4 GB (minimum), a hard disk of any capacity, a 105-key keyboard, an optical mouse and a colour display are required.

Operating System
As our system is platform independent, there are no operating system constraints on using this software. Windows and Linux operating systems are considered to be extremely stable platforms. One of the advantages of using the Linux operating system is the speed at which performance enhancements and security patches are made available, since it is an open-source product.

Web Server
The available web server platforms are IIS and Apache. Since Apache is open source, software security patches tend to be released faster. One of the downsides to IIS is its vulnerability to virus attacks, something that Apache rarely has problems with. Apache offers PHP, a language considered to be the "C" of the web. The next part is the end-user environment. Earlier we explained that we assume the end users are not very aware of computers and are not very interested in moving their activities from the manual system to the automated system.

Web Browser
Let's play word association, just like when a psychologist asks you what comes to mind when you hear certain words: what do you think when you hear the words "Opera. Safari. Chrome. Firefox."? If you think of the Broadway play version of "The Lion King," maybe it is time to see a psychologist. However, if you said "Internet browsers," you're spot on. That's because the leading Internet browsers are:
• Google Chrome
• Mozilla Firefox
• Apple Safari
• Microsoft Internet Explorer
• 24.
• Microsoft Edge
• Opera
• Maxthon

And that order pretty much lines up with how they're ranked in terms of market share and popularity... today. Browsers come and go. Ten years ago, Netscape Navigator was a well-known browser; Netscape is long gone today. Another, called Mosaic, is considered the first modern browser; it was discontinued in 1997. So, what exactly is a browser?

Definition: A browser, short for web browser, is the software application (a program) that you're using right now to search for, reach and explore websites. Whereas Excel is a program for spreadsheets and Word a program for writing documents, a browser is a program for Internet exploring (which is where that name came from).

Browsers don't get talked about much. A lot of people simply click on the "icon" on their computers that takes them to the Internet, and that's as far as it goes. And in a way, that's enough. Most of us simply get in a car and turn the key... we don't know what kind of engine we have or what features it has... it takes us where we want to go. That's why when it comes to computers:
• There are some computer users that can't name more than one or two browsers
• Many of them don't know they can switch to another browser for free
• There are some who go to Google's webpage to "google" a topic and think that Google is their browser
  • 26. 26 Chapter-4 DESIGN 4.1 Introduction The UML may be used to visualize, specify, construct and document the artifacts of a software-intensive system. The UML is only a language and so is just one part of a software development method. The UML is process independent, although optimally it should be used in a process that is use case driven, architecture-centric, iterative, and incremental. What is UML: The Unified Modeling Language (UML) is a robust notation that we can use to build OOAD models. It is called so since it was the unification of ideas from the methodologies of the three amigos – Booch, Rumbaugh and Jacobson. The UML is the standard language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system. It can be used with all processes, throughout the development life cycle, and across different implementation technologies.  The UML combines the best from: o Data Modeling concepts o Business Modeling (work flow) o Object Modeling o Component Modeling  The UML may be used to: o Display the boundary of a system and its major functions using use cases and actors o Illustrate use case realizations with interaction diagrams o Represent a static structure of a system using class diagrams o Model the behavior of objects with state transition diagrams o Reveal the physical implementation architecture with component & deployment diagrams o Extend the functionality of the system with stereotypes Evolution of UML
  • 27. 27 One of the methods was the Object Modeling Technique (OMT), devised by James Rumbaugh and others at General Electric. It consists of a series of models - use case, object, dynamic, and functional - that combine to give a full view of a system. The Booch method was devised by Grady Booch and developed the practice of analyzing a system as a series of views. It emphasizes analyzing the system from both a macro development view and a micro development view, and it was accompanied by a very detailed notation. The Object-Oriented Software Engineering (OOSE) method was devised by Ivar Jacobson and focused on the analysis of system behavior. It advocated that at each stage of the process there should be a check to see that the requirements of the user were being met. The UML is composed of three different parts: 1. Model elements 2. Diagrams 3. Views  Model Elements: o The model elements represent basic object-oriented concepts such as classes, objects, and relationships. o Each model element has a corresponding graphical symbol to represent it in the diagrams.  Diagrams: o Diagrams portray different combinations of model elements. o For example, the class diagram represents a group of classes and the relationships, such as association and inheritance, between them. o The UML provides nine types of diagram - use case, class, object, state chart, sequence, collaboration, activity, component, and deployment.  Views: o Views provide the highest level of abstraction for analyzing the system. o Each view is an aspect of the system that is abstracted to a number of related UML diagrams. o Taken together, the views of a system provide a picture of the system in its entirety. o In the UML, the five main views of the system are: User, Structural, Behavioral, Implementation and Environment.
  • 28. 28 Fig-1 UML Views In addition to model elements, diagrams, and views, the UML provides mechanisms for adding comments, information, or semantics to diagrams. And it provides mechanisms to adapt or extend itself to a particular method, software system, or organization. Extensions of UML: o Stereotypes can be used to extend the UML notational elements o Stereotypes may be used to classify and extend associations, inheritance relationships, classes, and components Examples:  Class stereotypes : boundary, control, entity  Inheritance stereotypes : extend  Dependency stereotypes : uses  Component stereotypes : subsystem Use Case Diagrams: The use case diagram presents an outside view of the system The use-case model consists of use-case diagrams. o The use-case diagrams illustrate the actors, the use cases, and their relationships. o Use cases also require a textual description (use case specification), as the visual diagrams can't contain all of the information that is necessary. o The customers, the end-users, the domain experts, and the developers all have an input into the development of the use-case model. o Creating a use-case model involves the following steps:
  • 29. 29 1. defining the system 2. identifying the actors and the use cases 3. describing the use cases 4. defining the relationships between use cases and actors. 5. defining the relationships between the use cases Use Case: Definition: is a sequence of actions a system performs that yields an observable result of value to a particular actor. Use Case Naming: Use Case should always be named in business terms, picking the words from the vocabulary of the particular domain for which we are modeling the system. It should be meaningful to the user because use case analysis is always done from user’s perspective. It will be usually verbs or short verb phrases. Use Case Specification shall document the following: o Brief Description o Precondition o Main Flow o Alternate Flow o Exceptional flows o Post Condition o Special Requirements Notation of use case Actor: Definition: someone or something outside the system that interacts with the system o An actor is external - it is not actually part of what we are building, but an interface needed to support or use it. o It represents anything that interacts with the system. Notation for Actor
  • 30. 30 Relation: Two important types of relation used in Use Case Diagram  Include: An Include relationship shows behavior that is common to one or more use cases (Mandatory). Include relation results when we extract the common sub flows and make it a use case Extend: An extend relationship shows optional behavior (Optional) Extend relation results usually when we add a bit more specialized feature to the already existing one, we say the use case B extends its functionality to use case A. A system boundary rectangle separates the system from the external actors. An extend relationship indicates that one use case is a variation of another. Extend notation is a dotted line, labeled <<extend>>, and with an arrow toward the base case. The extension points, which determines when the extended case is appropriate, is written inside the base case. Class Diagrams: Class diagram shows the existence of classes and their relationships in the structural view of a system. UML modeling elements in class diagram: o Classes and their structure and behavior o Relationships  Association  Aggregation  Composition  Dependency  Generalization / Specialization (inheritance relationships)  Multiplicity and navigation indicators  Role names A class describes properties and behavior of a type of object. o Classes are found by examining the objects in sequence and collaboration diagram o A class is drawn as a rectangle with three compartments o Classes should be named using the vocabulary of the domain Naming standards should be created e.g., all classes are singular nouns starting with a capital letter
  • 31. 31 o The behavior of a class is represented by its operations. Operations may be found by examining interaction diagrams o The structure of a class is represented by its attributes. Attributes may be found by examining class definitions, the problem requirements, and by applying domain knowledge Notation: Class information: visibility and scope The class notation is a 3-piece rectangle with the class name, attributes, and operations. Attributes and operations can be labeled according to access and scope. Relationship: Association: o Association represents the physical or conceptual connection between two or more objects o An association is a bi-directional connection between classes o An association is shown as a line connecting the related classes Aggregation: o An aggregation is a stronger form of relationship where the relationship is between a whole and its parts o It is entirely conceptual and does nothing more than distinguishing a ‘whole’ from the ‘part’. o It doesn’t link the lifetime of the whole and its parts
  • 32. 32 o An aggregation is shown as a line connecting the related classes with a diamond next to the class representing the whole Composition: Composition is a form of aggregation with strong ownership and coincident lifetime as the part of the whole. Multiplicity o Multiplicity defines how many objects participate in relationships o Multiplicity is the number of instances of one class related to ONE instance of the other class This table gives the most common multiplicities For each association and aggregation, there are two multiplicity decisions to make: one for each end of the relationship Although associations and aggregations are bi-directional by default, it is often desirable to restrict navigation to one direction If navigation is restricted, an arrowhead is added to indicate the direction of the navigation
  • 33. 33 Dependency: o A dependency relationship is a weaker form of relationship showing a relationship between a client and a supplier where the client does not have semantic knowledge of the supplier. o A dependency is shown as a dashed line pointing from the client to the supplier. Generalization / Specialization (inheritance relationships): o It is a relationship between a general thing (called the super class or the parent) and a more specific kind of that thing (called the subclass(es) or the child). o An association class is an association that also has class properties (or a class has association properties) o A constraint is a semantic relationship among model elements that specifies conditions and propositions that must be maintained as true: otherwise, the system described by the model is invalid.
  • 34. 34 o An interface is a specifier for the externally visible operations of a class without specification of internal structure. An interface is formally equivalent to abstract class with no attributes and methods, only abstract operations. o A qualifier is an attribute or set of attributes whose values serve to partition the set of instances associated with an instance across an association. Interaction Diagrams: o Interaction diagrams is used to model the dynamic behavior of the system o Interaction diagram helps us to identify the classes and its methods o Interaction diagrams describe how use cases are realized as interactions among objects o Show classes, objects, actors and messages between them to achieve the functionality of a Use Case There are two types of interaction diagrams: 1. Sequence Diagram 2. Collaboration Diagram
  • 35. 35 1. Sequence Diagram: A sequence diagram simply depicts interaction between objects in a sequential order i.e. the order in which these interactions take place. We can also use the terms event diagrams or event scenarios to refer to a sequence diagram. Sequence diagrams describe how and in what order the objects in a system function. These diagrams are widely used by businessmen and software developers to document and understand requirements for new and existing systems. Sequence Diagram Notations: 1. Actors – An actor in a UML diagram represents a type of role where it interacts with the system and its objects. It is important to note here that an actor is always outside the scope of the system we aim to model using the UML diagram. We use actors to depict various roles including human users and other external subjects. We represent an actor in a UML diagram using a stick person notation. We can have multiple actors in a sequence diagram. For example – Here the user in seat reservation system is shown as an actor where it exists outside the system and is not a part of the system. 2. Lifelines – A lifeline is a named element which depicts an individual participant in a sequence diagram. So basically, each instance in a sequence diagram is represented by a lifeline. Lifeline elements are located at the top in a sequence diagram. The standard in UML for naming a lifeline follows the following format – Instance Name: Class
  • 36. 36 Name 3. Messages – Communication between objects is depicted using messages. The messages appear in a sequential order on the lifeline. We represent messages using arrows. Lifelines and messages form the core of a sequence diagram. Messages can be broadly classified into the following  Synchronous messages – A synchronous message waits for a reply before the interaction can move forward. The sender waits until the receiver has completed the processing of the message. The caller continues only when it knows that the
  • 37. 37 receiver has processed the previous message i.e. it receives a reply message. A large number of calls in object-oriented programming are synchronous. We use a solid arrow head to represent a synchronous message.  Asynchronous Messages – An asynchronous message does not wait for a reply from the receiver. The interaction moves forward irrespective of the receiver processing the previous message or not. We use a lined arrow head to represent an asynchronous message.  Create message – We use a Create message to instantiate a new object in the sequence diagram. There are situations when a particular message call requires the creation of an object. It is represented with a dotted arrow and create word labelled on it to specify that it is the create Message symbol. For example – The creation of a new order on a e-commerce website would require a new object of Order class to be created.  Delete Message – We use a Delete Message to delete an object. When an object is deallocated memory or is destroyed within the system, we use the Delete Message symbol. It destroys the occurrence of the object in the system. It is represented by an arrow terminating with a x. For example – In the scenario below when the order is received by the user, the object of order class can be destroyed.  . Self-Message – Certain scenarios might arise where the object needs to send a message to itself. Such messages are called Self Messages and are represented with a U-shaped arrow  Reply Message – Reply messages are used to show the message being sent from the receiver to the sender. We represent a return/reply message using an open
  • 38. 38 arrowhead with a dotted line. The interaction moves forward only when a reply message is sent by the receiver. 2. Collaboration diagram: It shows the interaction of the objects and also the group of all messages sent or received by an object. This allows us to see the complete set of services that an object must provide. The following collaboration diagram realizes a scenario of reserving a copy of book in a library  Difference between the Sequence Diagram and Collaboration Diagram o Sequence diagrams emphasize the temporal aspect of a scenario - they focus on time. o Collaboration diagrams emphasize the spatial aspect of a scenario - they focus on how objects are linked. Activity Diagram: An activity diagram is essentially a fancy flowchart. Activity diagrams. An activity diagram focuses on the flow of activities involved in a single process. The activity diagram shows how those activities depend on one another. 1. initial State – The starting state before an activity takes place is depicted using the initial state. 2. Action or Activity State – An activity represents execution of an action on objects or by objects. We represent an activity using a rectangle with rounded corners. Basically any action or event that takes place is represented using an activity. 3. Action Flow or Control flows – Action flows or Control flows are also referred to as paths and edges. They are used to show the transition from one activity state to another.
  • 39. 39 4. Decision node and Branching – When we need to make a decision before deciding the flow of control, we use the decision node. 5. Guards – A Guard refers to a statement written next to a decision node on an arrow sometimes within square brackets. 6.Fork – Fork nodes are used to support concurrent activities.
  • 40. 40 Join – Join nodes are used to support concurrent activities converging into one. For join notations we have two or more incoming edges and one outgoing edge. 6. Merge or Merge Event – Scenarios arise when activities which are not being executed concurrently have to be merged. We use the merge notation for such scenarios. We can merge two or more activities into one if the control proceeds onto the next activity irrespective of the path chosen. 7. Swim lanes – We use swim lanes for grouping related activities in one column. Swim lanes group related activities into one column or one row. Swim lanes can be vertical and horizontal. Swim lanes are used to add modularity to the activity diagram. It is not mandatory to use swim lanes. They usually give more clarity to the activity diagram. It’s similar to creating a function in a program. It’s not mandatory to do so, but it is a recommended practice.
  • 41. 41 Component Diagram: Describe organization and dependency between the software implementation components. Components are distributable physical units - e.g. source code, object code. Deployment Diagram: Describe the configuration of processing resource elements and the mapping of software implementation components onto them. Contain components - e.g. object code, source code and nodes (e.g. printer, database, client machine). 4.2 UML Diagram
  • 42. 42 4.2.1 Class diagram Diagram:1 class diagram for user and admin
  • 43. 43 4.2.2 Use case diagram Diagram:2 use case diagram for admin
  • 44. 44 4.2.3 Activity diagram Diagram:3 Activity diagram for admin
  • 45. 45 4.2.4 Sequence diagram Diagram:4 sequence diagram for admin
  • 46. 46 4.2.5 Collaboration diagram Diagram:5 collaboration diagram for admin 4.2.6 Component diagram
  • 48. 48 4.2.7 Deployment diagram Diagram:7 deployment diagram for local server
  • 49. 49 Chapter-5 IMPLEMENTATION METHODS & RESULTS 5.1 Introduction: We implement our project using Anaconda 3 and Jupyter. Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment. Package versions are managed by the package management system conda. The Anaconda distribution includes data-science packages suitable for Windows, Linux, and macOS. It comes with 1,500 packages selected from PyPI as well as the conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command line interface (CLI). Conda analyzes your current environment, everything you have installed and any version limitations you specify (e.g. you only want tensorflow >= 2.0), and figures out how to install compatible dependencies; or it will tell you that what you want cannot be done. pip, by contrast, will just install the package you specify and any dependencies, even if that breaks other packages. Conda allows users to easily install different versions of binary software packages and any required libraries appropriate for their computing platform. It also allows users to switch between package versions and to download and install updates from a software repository. Conda is written in the Python programming language, but can manage projects containing code written in any language (e.g., R), including multi-language projects. Conda can install Python itself, while similar Python-based cross-platform package managers (such as wheel or pip) cannot. A popular conda channel for bioinformatics software is Bioconda, which provides multiple software distributions for computational biology. The conda package and environment manager is included in all versions of Anaconda, Miniconda and Anaconda Repository. The big difference between conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason conda exists. When pip installs a package, it automatically installs any dependent Python packages without checking whether these conflict with previously installed packages; it will install a package and any of its dependencies regardless of the state of the existing installation. Because of this, a user with a working installation of, for example, Google TensorFlow can find that it stops working after using pip to install a different package that requires a different version of the dependent numpy library than the one used by TensorFlow. In some cases, the package may appear to work but produce different results in detail.
  • 50. 50 5.2 Explanation of Key functions 1. Importing and Merging Data Python code in one module gains access to the code in another module by the process of importing it. The import statement is the most common way of invoking the import machinery, but it is not the only way: functions such as importlib.import_module() and the built-in __import__() can also be used. The import statement combines two operations; it searches for the named module, then it binds the results of that search to a name in the local scope. The search operation of the import statement is defined as a call to the __import__() function, with the appropriate arguments, and the return value of that call is used to perform the name binding operation of the import statement. See the import statement documentation for the exact details of that name binding operation. Contrary to when you merge new cases, merging in new variables requires the IDs for each case in the two files to be the same, but the variable names should be different. In this scenario, which is sometimes referred to as augmenting your data (or in SQL, "joins") or merging data by columns (i.e. you're adding new columns of data to each row), you're adding in new variables with information for each existing case in your data file. As with merging new cases where not all variables are present, the same thing applies if you merge in new variables where some cases are missing – these should simply be given blank values. Fig-2 Merging data sets 2. Data volume Since we don't know in advance which features could be useful to predict churn, we had to work on all the data that reflect customer behavior in general. We used data sets related to
  • 51. 51 calls, SMS, MMS, and the internet with all related information like complaints, network data, IMEI, charging, and other. The data contained transactions for all customers during nine months before the prediction baseline. The size of this data was more than 70 Terabyte, and we couldn’t perform the needed feature engineering phase using traditional databases. 3.Data variety The data used in this research is collected from multiple systems and databases. Each source generates the data in a different type of files as structured, semi-structured (XML-JSON) or unstructured (CSV-Text). Dealing with these kinds of data types is very hard without big data platform since we can work on all the previous data types without making any modification or transformation. By using the big data platform, we no longer have any problem with the size of these data or the format in which the data are represented. 4.Unbalanced data set The generated data set was unbalanced since it is a special case of the classification problem where the distribution of a class is not usually homogeneous with other classes. The dominant class is called the basic class, and the other is called the secondary class. The data set is unbalanced if one of its categories is 10% or less compared to the other one [18]. Although machine learning algorithms are usually designed to improve accuracy by reducing error, not all of them take into account the class balance, and that may give bad results [18]. In general, classes are considered to be balanced in order to be given the same importance in training. We found that Syria Tel data set was unbalanced since the percentage of the secondary class that represents churn customers is about 5% of the whole data set. 5.Extensive features The collected data was full of columns, since there is a column for each service, product, and offer related to calls, SMS, MMS, and internet, in addition to columns related to personnel and demographic information. If we need to use all these data sources the number of columns for each customer before the data being processed will exceed ten thousand columns. 6.Missing values There is a representation of each service and product for each customer. Missing values may occur because not all customers have the same subscription. Some of them may have a number of services and others may have something different. In addition, there are some columns related to system configurations and these columns have only null value for all customers.
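To make the class-balance check and missing-value handling described above concrete, the short pandas sketch below uses a small hypothetical frame; the column names and values are illustrative only, not the project's actual data set.

import numpy as np
import pandas as pd

# hypothetical customer-level data with a churn label and a few service columns
df = pd.DataFrame({
    "customerID": ["C1", "C2", "C3", "C4"],
    "Churn": [0, 1, 0, 0],
    "TotalCharges": [29.85, np.nan, 108.15, 840.10],
    "MMS_count": [np.nan, 4.0, 7.0, 2.0],
})

# class balance: the churn class is the minority ("secondary") class
print(df["Churn"].value_counts(normalize=True))

# count and percentage of missing values per column
print(df.isnull().sum())
print(round(100 * df.isnull().sum() / len(df), 2))

# drop rows missing a key numeric field, fill the remaining gaps with 0
df = df[~df["TotalCharges"].isnull()]
df["MMS_count"] = df["MMS_count"].fillna(0)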
  • 52. 52 Fig-3 Inputting Missing values 7. Checking the Churn Rate The most basic way you can calculate churn for a given month is to check how many customers you have at the start of the month and how many of those customers leave by the end of the month. Once you have both numbers, divide the number of customers that left by the number of customers at the start of the month. 8. Splitting Data into Training and Test Sets As mentioned before, the data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data in order to generalize to other data later on. We keep a separate test data set (or subset) in order to test the model's predictions on data it has not seen. Fig-4 Dividing the data into training and testing sets 9. Correlation Matrix A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. Typically, a correlation matrix is "square", with the same variables shown in the rows and columns. An example is shown below; it shows correlations between the stated importance of various things to people. The line of 1.00s going from the top left to the bottom right is the main diagonal, which shows that each variable always perfectly
  • 53. 53 correlates with itself. The matrix is symmetrical, with the correlations shown above the main diagonal being a mirror image of those below the main diagonal. Fig-5 Correlation Matrix 10. Feature Selection Using RFE Recursive Feature Elimination (RFE) is a feature-selection technique that works backwards from the full set of predictors: an estimator is fitted, the features are ranked by importance (for example by the size of their coefficients), the weakest features are removed, and the process is repeated until only the desired number of features remains. In this project, RFE is run with a logistic regression estimator to retain the 13 most informative variables, and the selected columns are then used to build the final churn model. This is useful here because the engineered data set contains many dummy variables, and reducing them keeps the model simpler and less prone to overfitting. 11. Making Predictions Once the model has been trained on the training set, it is used to make predictions on the unseen test set. For churn prediction, the logistic regression model outputs, for each customer, the probability of churning (Churn_Prob). A probability cutoff is then applied to convert these probabilities into class labels: customers whose predicted probability exceeds the cutoff are labelled as likely churners (1) and the rest as non-churners (0). An initial cutoff of 0.5 is used, and later an optimal cutoff (around 0.3) is chosen by comparing accuracy, sensitivity and specificity across a range of thresholds.
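Tying points 8, 10 and 11 together, the following is a minimal, self-contained sketch of splitting the data, selecting features with RFE and converting predicted churn probabilities into labels. It uses toy data from make_classification and an illustrative feature count and cutoff; it mirrors, but is not identical to, the project code listed in section 5.3.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# toy, imbalanced stand-in for the engineered churn features (not the project data)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2], random_state=0)

# 70/30 train-test split, as used in the project code
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=100)

# recursive feature elimination: keep the 13 strongest features for logistic regression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=13)
rfe.fit(X_train, y_train)
X_train_sel, X_test_sel = rfe.transform(X_train), rfe.transform(X_test)

# fit on the selected features and convert churn probabilities to 0/1 with a cutoff
model = LogisticRegression(max_iter=1000).fit(X_train_sel, y_train)
churn_prob = model.predict_proba(X_test_sel)[:, 1]
predicted = (churn_prob > 0.5).astype(int)  # cutoff later tuned (e.g. 0.3) on sensitivity/specificity
print(predicted[:10])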
  • 54. 54 12. Model Evaluation Model evaluation is an integral part of the model development process. It helps to find the model that best represents our data and to estimate how well the chosen model will work in the future. To avoid overfitting, evaluation methods such as hold-out validation and cross-validation use a test set (not seen by the model) to evaluate model performance. 13. ROC Curve A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as the probability of false alarm and can be calculated as (1 − specificity). The ROC curve can also be thought of as a plot of the power as a function of the Type I error of the decision rule (when the performance is calculated from just a sample of the population, it can be thought of as an estimator of these quantities). The ROC curve is thus the sensitivity or recall as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (the area under the probability distribution from minus infinity to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis. Histogram: A histogram is a display of statistical information that uses rectangles to show the frequency of data items in successive numerical intervals of equal size. In the most common form of histogram, the independent variable is plotted along the horizontal axis and the dependent variable is plotted along the vertical axis. The data appear as colored or shaded rectangles of variable area. The illustration below is a histogram showing the results of a final exam given to a hypothetical class of students. Each score range is denoted by a bar of a certain color. If this histogram were compared with those of classes from other years that received the same test from the same professor, conclusions might be drawn about intelligence changes among students over the years. Conclusions might also be drawn concerning the improvement or decline of the professor's teaching ability with the passage of time. If this histogram were compared with those of other classes in the same semester who had received the same final exam but who had taken the course from different professors, one might draw conclusions about the relative competence of the professors.
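As a small illustration of points 12 and 13 (and of plotting a histogram of predicted probabilities), the sketch below scores hypothetical predictions with scikit-learn and matplotlib. The labels and scores are randomly generated stand-ins introduced only for this example, not project results.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                            # hypothetical churn labels
y_score = np.clip(0.35 * y_true + 0.65 * rng.random(200), 0, 1)  # hypothetical churn probabilities

# histogram of predicted churn probabilities
plt.hist(y_score, bins=10)
plt.xlabel("Predicted churn probability")
plt.ylabel("Number of customers")
plt.show()

# ROC curve: true positive rate vs false positive rate across thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label="ROC curve (AUC = %.2f)" % roc_auc_score(y_true, y_score))
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()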
  • 55. 55 Fig-6 Histogram Some histograms are presented with the independent variable along the vertical axis and the dependent variable along the horizontal axis. That format is less common than the one shown here. 5.3 Screenshots or output screens 1.Code # In[1]: import pandas as pd import numpy as np # In[2]: churn_data = pd.read_csv("OneDrive/Desktop/project datasets/churn_data.csv") customer_data = pd.read_csv("OneDrive/Desktop/project datasets/customer_data.csv") internet_data = pd.read_csv("OneDrive/Desktop/project datasets/internet_data.csv") # In[3]:
  • 56. 56 df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID') # In[4]: telecom = pd.merge(df_1, internet_data, how='inner', on='customerID') # In[5]: telecom.head() # In[6]: telecom.describe() # In[7]: telecom.info() # In[8]: telecom['PhoneService'] = telecom['PhoneService'].map({'Yes': 1, 'No': 0}) telecom['PaperlessBilling'] = telecom['PaperlessBilling'].map({'Yes': 1, 'No': 0}) telecom['Churn'] = telecom['Churn'].map({'Yes': 1, 'No': 0}) telecom['Partner'] = telecom['Partner'].map({'Yes': 1, 'No': 0}) telecom['Dependents'] = telecom['Dependents'].map({'Yes': 1, 'No': 0}) # In[9]: # Creating a dummy variable for the variable 'Contract' and dropping the first one. cont = pd.get_dummies(telecom['Contract'],prefix='Contract',drop_first=True) #Adding the results to the master dataframe telecom = pd.concat([telecom,cont],axis=1) # Creating a dummy variable for the variable 'PaymentMethod' and dropping the first one. pm = pd.get_dummies(telecom['PaymentMethod'],prefix='PaymentMethod',drop_first=True) #Adding the results to the master dataframe telecom = pd.concat([telecom,pm],axis=1)
  • 57. 57 # Creating a dummy variable for the variable 'gender' and dropping the first one. gen = pd.get_dummies(telecom['gender'],prefix='gender',drop_first=True) #Adding the results to the master dataframe telecom = pd.concat([telecom,gen],axis=1) # Creating a dummy variable for the variable 'MultipleLines' and dropping the first one. ml = pd.get_dummies(telecom['MultipleLines'],prefix='MultipleLines') # dropping MultipleLines_No phone service column ml1 = ml.drop(['MultipleLines_No phone service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,ml1],axis=1) # Creating a dummy variable for the variable 'InternetService' and dropping the first one. iser = pd.get_dummies(telecom['InternetService'],prefix='InternetService',drop_first=True) #Adding the results to the master dataframe telecom = pd.concat([telecom,iser],axis=1) # Creating a dummy variable for the variable 'OnlineSecurity'. os = pd.get_dummies(telecom['OnlineSecurity'],prefix='OnlineSecurity') os1= os.drop(['OnlineSecurity_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,os1],axis=1) # Creating a dummy variable for the variable 'OnlineBackup'. ob =pd.get_dummies(telecom['OnlineBackup'],prefix='OnlineBackup') ob1 =ob.drop(['OnlineBackup_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,ob1],axis=1) # Creating a dummy variable for the variable 'DeviceProtection'. dp =pd.get_dummies(telecom['DeviceProtection'],prefix='DeviceProtection') dp1 = dp.drop(['DeviceProtection_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,dp1],axis=1) # Creating a dummy variable for the variable 'TechSupport'. ts =pd.get_dummies(telecom['TechSupport'],prefix='TechSupport') ts1 = ts.drop(['TechSupport_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,ts1],axis=1) # Creating a dummy variable for the variable 'StreamingTV'. st =pd.get_dummies(telecom['StreamingTV'],prefix='StreamingTV') st1 = st.drop(['StreamingTV_No internet service'],1) #Adding the results to the master dataframe telecom = pd.concat([telecom,st1],axis=1) # Creating a dummy variable for the variable 'StreamingMovies'. sm =pd.get_dummies(telecom['StreamingMovies'],prefix='StreamingMovies') sm1 = sm.drop(['StreamingMovies_No internet service'],1) #Adding the results to the master dataframe
  • 58. 58 telecom = pd.concat([telecom,sm1],axis=1) # In[10]: telecom['MultipleLines'].value_counts() # In[11]: telecom = telecom.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies'], 1) # In[12]: telecom['TotalCharges'] = pd.to_numeric(telecom['TotalCharges'],errors='coerce') telecom['tenure'] = telecom['tenure'].astype(float).astype(int) # In[13]: telecom.info() # In[14]: num_telecom = telecom[['tenure','MonthlyCharges','SeniorCitizen','TotalCharges']] # In[15]: num_telecom.describe(percentiles=[.25,.5,.75,.90,.95,.99]) # In[16]: telecom.isnull().sum() # In[17]:
  • 59. 59 round(100*(telecom.isnull().sum()/len(telecom.index)), 2) # In[18]: telecom = telecom[~np.isnan(telecom['TotalCharges'])] # In[19]: round(100*(telecom.isnull().sum()/len(telecom.index)), 2) # In[20]: df = telecom[['tenure','MonthlyCharges','TotalCharges']] # In[21]: normalized_df=(df-df.mean())/df.std() # In[22]: telecom = telecom.drop(['tenure','MonthlyCharges','TotalCharges'], 1) # In[23]: telecom = pd.concat([telecom,normalized_df],axis=1) # In[24]: telecom # In[25]:
  • 60. 60 churn = (sum(telecom['Churn'])/len(telecom['Churn'].index))*100 # In[26]: telecom # In[27]: churn # In[28]: import pandas as pd import matplotlib.pyplot as plt from sklearn.naive_bayes import GaussianNB import numpy as np from sklearn.metrics import mean_squared_error # In[29]: features = telecom.drop(['Churn','customerID'],axis=1).values target = telecom['Churn'].values # In[30]: features_train,target_train=features[20:30],target[20:30] features_test,target_test=features[40:50],target[40:50] print(features_test) # In[31]: model = GaussianNB() model.fit(features_train,target_train) print(model.predict(features_test)) # In[32]:
  • 61. 61 from sklearn.model_selection import train_test_split # In[33]: # Putting feature variable to X X = telecom.drop(['Churn','customerID'],axis=1) # Putting response variable to y y = telecom['Churn'] # In[34]: y.head() # In[35]: # Splitting the data into train and test X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.7,test_size=0.3,random_state=100) # In[36]: import statsmodels.api as sm # In[37]: # Logistic regression model logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial()) logm1.fit().summary() # In[38]: # Importing matplotlib and seaborn import matplotlib.pyplot as plt import seaborn as sns get_ipython().run_line_magic('matplotlib', 'inline')
  • 62. 62 # In[39]: # Let's see the correlation matrix plt.figure(figsize = (20,10)) # Size of the figure sns.heatmap(telecom.corr(),annot = True) # In[40]: X_test2 = X_test.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No','DeviceProtection _No','TechSupport_No','StreamingTV_No','StreamingMovies_No'],1) X_train2 = X_train.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No','DeviceProtectio n_No','TechSupport_No','StreamingTV_No','StreamingMovies_No'],1) # In[41]: plt.figure(figsize = (20,10)) sns.heatmap(X_train2.corr(),annot = True) # In[42]: logm2 = sm.GLM(y_train,(sm.add_constant(X_train2)), family = sm.families.Binomial()) logm2.fit().summary() # In[43]: from sklearn.linear_model import LogisticRegression logreg = LogisticRegression() from sklearn.feature_selection import RFE rfe = RFE(logreg, 13) # running RFE with 13 variables as output rfe = rfe.fit(X,y) print(rfe.support_) # Printing the boolean results print(rfe.ranking_) # Printing the ranking # In[44]: col = ['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year',
  • 63. 63 'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges'] # In[45]: # Let's run the model using the selected variables from sklearn.linear_model import LogisticRegression from sklearn import metrics logsk = LogisticRegression() logsk.fit(X_train[col], y_train) # In[46]: #Comparing the model with StatsModels logm4 = sm.GLM(y_train,(sm.add_constant(X_train[col])), family = sm.families.Binomial()) logm4.fit().summary() # In[47]: # UDF for calculating vif value def vif_cal(input_data, dependent_col): vif_df = pd.DataFrame( columns = ['Var', 'Vif']) x_vars=input_data.drop([dependent_col], axis=1) xvar_names=x_vars.columns for i in range(0,xvar_names.shape[0]): y=x_vars[xvar_names[i]] x=x_vars[xvar_names.drop(xvar_names[i])] rsq=sm.OLS(y,x).fit().rsquared vif=round(1/(1-rsq),2) vif_df.loc[i] = [xvar_names[i], vif] return vif_df.sort_values(by = 'Vif', axis=0, ascending=False, inplace=False) # In[48]: telecom.columns ['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year', 'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
  • 64. 64 # In[49]: # Calculating Vif value vif_cal(input_data=telecom.drop(['customerID','SeniorCitizen', 'Partner', 'Dependents', 'PaymentMethod_Credit card (automatic)','PaymentMethod_Mailed check', 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No', 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes', 'TechSupport_No','StreamingTV_No','StreamingTV_Yes','StreamingMovies_Yes', 'MonthlyCharges'], axis=1), dependent_col='Churn') # In[50]: col = ['PaperlessBilling', 'Contract_One year', 'Contract_Two year', 'PaymentMethod_Electronic check','MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges'] # In[51]: logm5 = sm.GLM(y_train,(sm.add_constant(X_train[col])), family = sm.families.Binomial()) logm5.fit().summary() # In[52]: # Calculating Vif value vif_cal(input_data=telecom.drop(['customerID','PhoneService','SeniorCitizen', 'Partner', 'Dependents', 'PaymentMethod_Credit card (automatic)','PaymentMethod_Mailed check', 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No', 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes', 'TechSupport_No','StreamingTV_No','StreamingTV_Yes','StreamingMovies_Yes', 'MonthlyCharges'], axis=1), dependent_col='Churn') # In[53]:
  • 65. 65 # Let's run the model using the selected variables from sklearn.linear_model import LogisticRegression from sklearn import metrics logsk = LogisticRegression() logsk.fit(X_train[col], y_train) # In[54]: y_pred = logsk.predict_proba(X_test[col]) # In[55]: y_pred_df = pd.DataFrame(y_pred) # In[56]: y_pred_1 = y_pred_df.iloc[:,[1]] # In[57]: y_pred_1.head() # In[58]: y_test_df = pd.DataFrame(y_test) # In[59]: y_test_df = pd.DataFrame(y_test) # In[60]: # Removing index for both dataframes to append them side by side y_pred_1.reset_index(drop=True, inplace=True)
  • 66. 66 y_test_df.reset_index(drop=True, inplace=True) # In[61]: y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1) # In[62]: y_pred_final= y_pred_final.rename(columns={ 1 : 'Churn_Prob'}) # In[63]: # Rearranging the columns y_pred_final = y_pred_final.reindex(['CustID','Churn','Churn_Prob'], axis=1) # In[64]: y_pred_final.head() # In[65]: # Creating new column 'predicted' with 1 if Churn_Prob>0.5 else 0 y_pred_final['predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.5 else 0) # In[66]: y_pred_final.head() # In[67]: from sklearn import metrics # In[68]:
  • 67. 67 help(metrics.confusion_matrix) # In[69]: # Confusion matrix confusion = metrics.confusion_matrix( y_pred_final.Churn, y_pred_final.predicted ) confusion # In[70]: metrics.accuracy_score( y_pred_final.Churn, y_pred_final.predicted) # In[71]: TP = confusion[1,1] # true positive TN = confusion[0,0] # true negatives FP = confusion[0,1] # false positives FN = confusion[1,0] # false negatives # In[72]: TP / float(TP+FN) # In[73]: TN / float(TN+FP) # In[74]: # Calculate false postive rate - predicting churn when customer does not have churned print(FP/ float(TN+FP)) # In[75]: # positive predictive value print (TP / float(TP+FP))
  • 68. 68 # In[76]: # Negative predictive value print (TN / float(TN+ FN)) # In[77]: def draw_roc( actual, probs ): fpr, tpr, thresholds = metrics.roc_curve( actual, probs, drop_intermediate = False ) auc_score = metrics.roc_auc_score( actual, probs ) plt.figure(figsize=(6, 4)) plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score ) plt.plot([0, 1], [0, 1], 'k--') plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.05]) plt.xlabel('False Positive Rate or [1 - True Negative Rate]') plt.ylabel('True Positive Rate') plt.title('Receiver operating characteristic example') plt.legend(loc="lower right") plt.show() return fpr, tpr, thresholds # In[78]: draw_roc(y_pred_final.Churn, y_pred_final.predicted) # In[79]: # Let's create columns with different probability cutoffs numbers = [float(x)/10 for x in range(10)] for i in numbers: y_pred_final[i]= y_pred_final.Churn_Prob.map( lambda x: 1 if x > i else 0) y_pred_final.head() # In[80]: # Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
  • 69. 69 cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci']) from sklearn.metrics import confusion_matrix # TP = confusion[1,1] # true positive # TN = confusion[0,0] # true negatives # FP = confusion[0,1] # false positives # FN = confusion[1,0] # false negatives num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9] for i in num: cm1 = metrics.confusion_matrix( y_pred_final.Churn, y_pred_final[i] ) total1=sum(sum(cm1)) accuracy = (cm1[0,0]+cm1[1,1])/total1 speci = cm1[0,0]/(cm1[0,0]+cm1[0,1]) sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1]) cutoff_df.loc[i] =[ i ,accuracy,sensi,speci] print(cutoff_df) # Let's plot accuracy sensitivity and specificity for various probabilities. cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci']) y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.3 else 0) y_pred_final.head() #Let's check the overall accuracy. metrics.accuracy_score( y_pred_final.Churn, y_pred_final.final_predicted) FinalChurn=telecom.filter(['customerID','Churn']) total_rows = FinalChurn[FinalChurn.Churn == 1] total_rows value = len(total_rows) print("number of people who are correctly using network is"+str(value))
  • 73. 73 Fig-6 Missing Values Fig-7 Converting the data sets into 0 and 1
  • 74. 74 Fig-8 Naive Bayesian Fig -9 Optimal cut-off point
  • 76. 76 Fig-12 Customers who stay Fig-13 Customers who leave
  • 77. 77 Chapter-6 TESTING & VALIDATION 6.1 Introduction: Functional vs. Non-functional Testing The goal of utilizing numerous testing methodologies in your development process is to make sure your software can successfully operate in multiple environments and across different platforms. These can typically be broken down between functional and non-functional testing. Functional testing involves testing the application against the business requirements. It incorporates all test types designed to guarantee each part of a piece of software behaves as expected, by using use cases provided by the design team or business analyst. These testing methods are usually conducted in order and include:  Unit testing  Integration testing  System testing  Acceptance testing Non-functional testing methods incorporate all test types focused on the operational aspects of a piece of software. These include:  Performance testing  Security testing  Usability testing  Compatibility testing The key to releasing high-quality software that can be easily adopted by your end users is to build a robust testing framework that implements both functional and non-functional software testing methodologies. Unit Testing Unit testing is the first level of testing and is often performed by the developers themselves. It is the process of ensuring that individual components of a piece of software are functional at the code level and work as they were designed to. Developers in a test-driven environment will typically write and run the tests prior to the software or feature being passed over to the test team. Unit testing can be conducted manually, but automating the process will speed up delivery cycles and expand test coverage. Unit testing will also make debugging easier, because finding issues earlier means they take less time to fix than if they were discovered later in the testing process. TestLeft is a tool that allows advanced testers and developers to shift left with the fastest test automation tool embedded in any IDE.
  • 78. 78 Fig-7 Unit Testing Life cycle Unit Testing Techniques:  White Box Testing - used to test each function's behavior  Black Box Testing - using which the user interface, inputs and outputs are tested  Gray Box Testing - used to execute tests, risks and assessment methods. White box testing: The box testing approach of software testing consists of black box testing and white box testing. We are discussing here white box testing, which is also known as glass box testing, structural testing, clear box testing, open box testing and transparent box testing. It tests the internal coding and infrastructure of the software, focusing on checking predefined inputs against expected and desired outputs. It is based on the inner workings of an application and revolves around internal structure testing. In this type of testing, programming skills are required to design test cases. The primary goal of white box testing is to focus on the flow of inputs and outputs through the software and on strengthening the security of the software.
  • 79. 79 The term 'white box' is used because of the internal perspective of the system. The clear box or white box or transparent box name denote the ability to see through the software's outer shell into its inner workings. Test cases for white box testing are derived from the design phase of the software development lifecycle. Data flow testing, control flow testing, path testing, branch testing, statement and decision coverage all these techniques used by white box testing as a guideline to create an error-free software. Fig-8 White Box Testing White box testing follows some working steps to make testing manageable and easy to understand what the next task to do. There are some basic steps to perform white box testing. Generic steps of white box testing: o Design all test scenarios, test cases and prioritize them according to high priority number. o This step involves the study of code at runtime to examine the resource utilization, not accessed areas of the code, time taken by various methods and operations and so on. o In this step testing of internal subroutines takes place. Internal subroutines such as nonpublic methods, interfaces are able to handle all types of data appropriately or not. o This step focuses on testing of control statements like loops and conditional statements to check the efficiency and accuracy for different data inputs. o In the last step white box testing includes security testing to check all possible security loopholes by looking at how the code handles security Reasoning of white box testing: o It identifies internal security holes. o To check the way of input inside the code. o Check the functionality of conditional loops. o To test function, object, and statement at an individual level.
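To connect this to the project's language, the snippet below is a small white-box-style unit test written with pytest: the tests exercise both branches of the function directly. The churn_rate helper is a hypothetical function introduced only for illustration and is not part of the project code.

import pytest

def churn_rate(customers_at_start, customers_lost):
    """Hypothetical helper: monthly churn rate as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * customers_lost / customers_at_start

# white-box unit tests (run with `pytest`): one per branch of the helper
def test_churn_rate_normal_case():
    assert churn_rate(200, 10) == 5.0

def test_churn_rate_rejects_bad_input():
    with pytest.raises(ValueError):
        churn_rate(0, 10)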
  • 80. 80 Advantages of white box testing: o White box testing optimizes code so hidden errors can be identified. o Test cases of white box testing can be easily automated. o This testing is more thorough than other testing approaches as it covers all code paths. o It can be started in the SDLC phase even without GUI. Disadvantages of white box testing: o White box testing is too much time consuming when it comes to large-scale programming applications. o White box testing is much expensive and complex. o It can lead to production error because it is not detailed by the developers. o White box testing needs professional programmers who have a detailed knowledge and understanding of programming language and implementation. Black box testing: Black box testing is a technique of software testing which examines the functionality of software without peering into its internal structure or coding. The primary source of black box testing is a specification of requirements that is stated by the customer. In this method, tester selects a function and gives input value to examine its functionality, and checks whether the function is giving expected output or not. If the function produces correct output, then it is passed in testing, otherwise failed. The test team reports the result to the development team and then tests the next function. After completing testing of all functions if there are severe problems, then it is given back to the development team for correction. Fig-9 Black Box Testing Generic step of black box testing: