This document compares Python and R for use in data science. Both languages are popular among data scientists, though Python has broader usage among professional developers overall. Python is a general purpose language while R is specialized for statistical computing. Both have extensive libraries for data manipulation, analysis, and visualization. The best choice depends on factors like familiarity, project requirements, and team preferences as both are capable of most data science tasks.
2. PYTHON VS. R
The comparison of Python and R has been a hot
topic in industry circles for years. R has been
around for more than two decades and is specialized
for statistical computing and graphics. Python is
a general-purpose programming language that has
many uses, including data science and statistics.
MANY BEGINNERS HAVE THE SAME
QUESTION IN MIND: WHICH OF THESE
TWO GREAT LANGUAGES SHOULD I
PICK FOR GETTING STARTED WITH
DATA SCIENCE?
3. PYTHON
Released in 1991, Python has built itself a strong reputation for
being an incredibly simple language to get started with and do
almost anything you could imagine. It powers websites, backend
services, native desktop applications, image processing systems,
machine learning pipelines, data transform systems, and more.
It is very well known for its simplicity, making it one of the most
accessible programming languages for anyone to utilize.
4. ADVANTAGES OF PYTHON
ONE
It has a syntax very similar to native
English, so similar that most well-written
scripts make sense when read out loud.
TWO
It has a great community around it. For
any problem you get stuck with, there are
probably hundreds of other people who asked
the same question and got answers online.
THREE
It has a huge number of third-party
modules and libraries for any application
you can think of.
FOUR
There is a very large data science community
around the language, which means there are
many tools and libraries for data science
problems.
FIVE
It supports both object-oriented programming
and procedural programming paradigms, which
gives you the freedom to choose depending on
your needs.
With all of these advantages, it is no wonder that Python is one of the most popular
languages in the industry. It is also used among huge tech companies like Google,
Dropbox, Netflix, Stripe and Instagram, according to Ncube.
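To make the points about readable syntax and the choice between procedural and object-oriented styles more concrete, here is a minimal sketch. The average function, the Sample class and the temperature readings are invented for illustration and do not come from the original post.

```python
from dataclasses import dataclass


# Procedural style: a plain function operating on plain data.
def average(values):
    return sum(values) / len(values)


# Object-oriented style: the same logic attached to the data it belongs to.
@dataclass
class Sample:
    values: list

    def average(self):
        return sum(self.values) / len(self.values)


temperatures = [18.5, 21.0, 19.5]  # made-up readings, for illustration only

# Keywords such as "for", "in" and "if" keep this line close to plain English.
warm_days = [t for t in temperatures if t > 19]

procedural_result = average(temperatures)
object_result = Sample(temperatures).average()
```

Both styles produce the same number here; which one fits better usually depends on how much state and behavior need to travel together.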
5. R Project
R Project is a GNU project that consists of the R language, the runtime and the utilities to build
applications with them. R is the interpreted language used in this environment. The language is
specialized around statistical computing and graphics, meaning that it fits into many data science
problems straight away and simplifies data science projects with built-in tooling and third party
libraries around it.
6. ADVANTAGES OF R
ONE
It has many libraries and tools specialized for data operations. The language and these tools allow you to
modify your data structures easily, transform them into more efficient structures or clean them up for your
specific use-cases.
TWO
There are many very popular packages and libraries, such as tidyverse, which takes care of data manipulation
and visualization end to end. These libraries allow you to get started easily with your data science tasks
without writing all the algorithms from scratch.
THREE
It has a very well-designed IDE called RStudio. Integrated with the language itself, RStudio provides
syntax highlighting, code completion, integrated help, documentation, data visualization, and debuggers,
allowing you to develop your R projects without leaving your screen.
FOUR
The team behind R has been strongly focused on ensuring that the tools will work on all platforms, and
thanks to those efforts R can run on Windows, macOS and Unix-like operating systems.
FIVE
It has tooling for building web-based dashboards for data analysis and visualization, such as Shiny,
which allows building interactive web apps directly from R.
Along with these advantages and its widespread usage in the data science community, R
stands as a strong alternative to Python in data science projects.
7. COMPARISON: PYTHON VS. R
Since both of the languages offer similar advantages on paper, other factors might impact which of the
languages you decide to go with.
POPULARITY
Both of the languages are popular in the data science community. However,
when it comes to picking a language to add to your toolchain and experience,
it might make sense to pick one that is popular in the industry and may allow
you to transition to different positions within your area of expertise.
According to Stack Overflow’s 2019 Developer Survey, Python is the 4th most
popular programming language among 72,525 professional developers, and has
recently become more popular than Java. In the same survey, R is in the 16th position.
8. One thing to keep in mind regarding these survey results is that they
represent the developer community on Stack Overflow. This data is
obviously not specific to data scientists. However, it may help in
understanding the current situation in the industry better.
Looking at salaries worldwide in the same survey, Python and R seem
to stand at around the same point among 55,639 participants, with R
being slightly better on average.
In addition to the survey results, you can see when
looking at the Stack Overflow Trends that Python
is more popular than R in terms of the number of
questions asked.
...
9. Throughout the whole developer community, Python seems to be more popular than R. However, it is
important to keep in mind that Python is a general-purpose programming language while R is specialized
for statistical computing, which means this comparison is not apples-to-apples when it comes to their
popularity among data scientists.
For a better understanding in terms of data science, we can have a look at the 2019 Kaggle User Survey.
In fact, they have a specific page on the dashboard for Python vs R.
As seen in the Kaggle data, Python is used more widely in the data science community than R, although
both of the languages have an impressive amount of usage.
10. PYTHON LIBRARIES
NUMPY
NumPy is a fundamental package that implements various data manipulation operations on top of
array data structures. It contains highly efficient implementations of these data structures, as
well as common functionality for many statistical computing tasks, and allows speeding up many
complex tasks.
PANDAS
Pandas is a powerful and easy-to-use open-source library for tabular data manipulation tasks. It
contains efficient data structures that are very suitable for working with labeled data intuitively.
MATPLOTLIB
Matplotlib is a library for creating static or interactive data visualizations. Thanks to its
simplicity, you can create highly detailed graphs with a few lines of Python code.
SCIKIT-LEARN
As one of the most popular libraries in the Python ecosystem, scikit-learn contains tools built on
top of NumPy and SciPy that are focused on various machine learning tasks, such as classification,
regression, and clustering.
TENSORFLOW
Initially developed and open-sourced by Google, TensorFlow is a highly popular open-source library
for developing and training machine learning and deep learning models.
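As a rough illustration of the NumPy and pandas descriptions above, the following sketch shows a vectorized array operation and a grouped aggregation on labeled tabular data. It assumes both libraries are installed; the measurements, column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented for illustration only.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])

# Vectorized arithmetic: the whole array is converted without an explicit loop.
heights_m = heights_cm / 100.0
mean_height_m = heights_m.mean()

# Labeled, tabular data: columns are addressed by name rather than by position.
scores = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b"],
        "score": [1.0, 3.0, 10.0, 20.0],
    }
)

# Group rows by the "team" label and average the "score" column per group.
mean_by_team = scores.groupby("team")["score"].mean()
```

The same pattern (load into an array or data frame, then apply whole-column operations) carries over to most day-to-day analysis work with these two libraries.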
11. R LIBRARIES
TIDYVERSE
Tidyverse is a collection of R packages designed for data science. It includes many popular
libraries, to name a few: ggplot2 for data visualization, dplyr for intuitive data manipulation
and readr for reading rectangular data from various sources.
GGPLOT2
Ggplot2 is a library focused on declaratively building data visualizations based on the book
The Grammar of Graphics.
DPLYR
Dplyr is a library for working with tabular data easily, both in memory and out of memory.
DATA.TABLE
Similar to dplyr, data.table is a package designed for data manipulation with an expressive
syntax. It implements efficient data filtering, selecting and shaping options that allow you to
get your data in the shape you need before feeding it into your models.
CARET
Caret is a collection of tools and functions that are specialized for predictive models and
machine learning, as well as data manipulation and pre-processing.
SHINY
Shiny is a package that allows you to build highly interactive web pages from R and build
dashboards easily.
Looking at the number of libraries and the functionality of those packages, it seems like both of the languages have
similar packages that simplify many data science tasks. All in all, for many tasks, when something is doable in Python,
it is doable in R with very similar effort.
12. WHEN TO USE PYTHON
A
If you are looking to get into programming in general and want something that
may be used in other areas of software development such as web development,
then Python, being a general-purpose programming language, is a better choice.
B
If you need to do ad-hoc analyses and occasionally share them with other data
scientists / technical people, it might be good to use Python along with Jupyter
Notebooks.
C
If you need to develop APIs to expose your models or will need other software to
interact with your models, it might be helpful for you to invest in Python and its
huge tooling around all kinds of programming tasks. You can expose your models
with a very simple API with Flask or FastAPI, or you can build full-blown
production-ready web applications with Django.
D
Python is also easy to get started with, and it is installed on many systems by
default. Over the years, however, it has evolved into different versions with
different setups, so it is non-trivial to set up a well-functioning data science
stack on your computer.
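The point above about exposing a model through a simple API can be sketched as follows. To stay dependency-free, this sketch swaps in the standard library's http.server in place of Flask or FastAPI, which would normally take over the routing and request parsing. The predict function is a hypothetical stand-in for a trained model, and the "features" field name and port are assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: returns the mean of the inputs."""
    return sum(features) / len(features)


def handle_payload(raw_body):
    """Decode a JSON request body, run the model, and return (status, response dict)."""
    try:
        payload = json.loads(raw_body)
        features = [float(x) for x in payload["features"]]
        if not features:
            raise ValueError("empty feature list")
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing "features" key, or non-numeric values.
        return 400, {"error": str(exc)}
    return 200, {"prediction": predict(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        encoded = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(encoded)))
        self.end_headers()
        self.wfile.write(encoded)


def serve(port=8000):
    """Start the server on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling serve() starts the endpoint locally. Keeping the model call in handle_payload, separate from the HTTP handler, makes it straightforward to move the same logic into a Flask or FastAPI route later.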
13. WHEN TO USE R
A
If you are familiar with other scientific programming languages like MATLAB, it
might be easier for you to learn R and get productive with it. There are many
similarities between those languages, especially with vector operations and the
general mindset about matrix operations rather than procedural methods.
B
If you are looking for ways to build quick dashboards for non-technical
stakeholders and internal usage, it might be a good idea to utilize R with the
amazing Shiny library.
C
If you’d prefer to have all your packages handy and mainly focus on your analysis
for your decision-making, and are looking for the simplest setup to get started
with, R might be the go-to tool. Thanks to RStudio and its integrated features,
going from raw data to analysis with visualizations without leaving your window is
very easy.
14. Just like any other problem, the solution mostly depends on the requirements of the problem.
There is no right answer to this question other than “it depends”. Both of these languages are
very powerful, and regardless of which one of them you invest your time in, if you are looking
for a career in data science in the long term, there is no wrong answer. Learning either of these
two languages will pay off in the future one way or another. Instead of falling into analysis
paralysis, just pick one and move on with your work. It is well understood that both of these
languages are capable of dealing with the majority of data science problems, and the rest boils
down to the methodology, capabilities of the team and the resources at hand, which are mostly
independent of the language.
Original blog post here.