Introduction to Data Science fundamentals
1. Introduction to Data Science
Lecture#1
Program: BS(DS)-Fall 2019
Instructor: Konpal Darakshan
2. Books:
1. Doing Data Science: Straight Talk from the Frontline
by Cathy O'Neil and Rachel Schutt.
2. The R Primer by Claus Thorn Ekstrom.
Other:
1-An Introduction to Data Science by Jeffrey M. Stanton and
Jeffrey S. Saltz.
2-Learn R for Applied Statistics: With Data Visualizations,
Regressions, and Statistics by Eric Goh Ming Hui.
3-Practical Statistics for Data Scientists: 50 Essential Concepts by
Andrew Bruce and Peter C. Bruce.
4-Data Analysis for the Life Sciences with R by Michael I. Love and
Rafael A. Irizarry.
5-R Programming for Data Science by Roger D. Peng.
3. Marking Scheme
• Exams
Final Exam 40 Marks
1st Hourly 15 Marks
2nd Hourly 15 Marks
• Sessional Marks
Lab Manual 5 Marks
Presentation 5 Marks
Assignments 10 Marks
Quizzes 10 Marks
4. Chapter # 01
Introduction: What is Data Science?
• Big Data and Data Science hype
• Getting past the hype
• Why now?
• Datafication
• Current landscape of perspectives
• Data Science Jobs
• What is a Data Scientist?
-In Academia
-In Industry
5. Basic Terminologies
• Data
• It can be
-generated
-collected
-retrieved.
• Simulation
• Similarity Measures
• Data Structures
• Algorithms
6. • Data: raw facts with no meaning attached.
• Information: learning from facts.
• Knowledge: practical understanding of a subject.
• Understanding: the ability to absorb knowledge and learn to reason.
• Wisdom: the quality of having experience and good judgment; ability to
think and foresee.
• Validity: ways to confirm truth.
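As a rough illustration of the first two terms above, the sketch below treats a list of raw numbers as data and a summary of them as information. The readings and the 30-degree threshold are made-up values used only for illustration.

```python
from statistics import mean

# Data: raw facts with no meaning attached (hypothetical daily temperatures, in °C).
readings = [31.5, 29.0, 33.2, 30.1, 34.8]

# Information: what we learn from the facts once they are summarized.
average = mean(readings)
hottest = max(readings)
hot_days = sum(1 for r in readings if r > 30.0)  # 30.0 is an arbitrary threshold

print(f"average={average:.2f}, hottest={hottest}, days above 30: {hot_days}")
```

Knowledge, understanding, and wisdom are not captured by the code; they come from interpreting such summaries in context.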
7. • Cross-sectional data: observations taken at a single point in time, with no time dimension.
• Temporal data: observations recorded over time, i.e. time series.
• Spatial data: considers location, e.g. coordinate determination in touchscreen phones.
• Temporal cum Spatial (GIS): considers how location-based data changes with the passage of
time, e.g. population density.
• Scales of Measurement
There are 4 scales of measurement:
• Nominal: classifies data into categories with no order, e.g. male/female.
• Ordinal: puts data in order and can be numerical or non-numerical, e.g. time of
day (dawn, morning, noon, afternoon, evening, night).
• Interval: differences between values are meaningful, but there is no true zero, e.g. temperature.
• Ratio: has a true zero, so ratios of values are meaningful, e.g. weight, height, number of children.
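The four scales differ in which summaries make sense. The sketch below is one possible illustration in Python; all sample values and variable names are hypothetical, and the choice of summary per scale follows the usual textbook convention rather than anything specific to these slides.

```python
from statistics import mode

# Hypothetical sample values, one list per scale of measurement.
gender = ["male", "female", "female", "male", "female"]      # nominal
time_of_day = ["dawn", "noon", "night", "noon", "evening"]   # ordinal
temperature_c = [20.0, 25.0, 30.0]                           # interval
weight_kg = [50.0, 75.0, 100.0]                              # ratio

# Nominal: only counting and the most frequent category (mode) make sense.
most_common_gender = mode(gender)

# Ordinal: categories have an order, so a median is meaningful once ranked.
order = ["dawn", "morning", "noon", "afternoon", "evening", "night"]
ranks = sorted(order.index(t) for t in time_of_day)
median_time = order[ranks[len(ranks) // 2]]

# Interval: differences are meaningful, but there is no true zero.
temperature_range = max(temperature_c) - min(temperature_c)

# Ratio: a true zero exists, so "twice as heavy" is a valid statement.
weight_ratio = max(weight_kg) / min(weight_kg)

print(most_common_gender, median_time, temperature_range, weight_ratio)
```

Note that saying 30 °C is "1.5 times as hot" as 20 °C would be misleading, which is why the ratio is computed only for the weights.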
8. Big Data and Data Science Hype:
Reasons to be skeptical about data science:
• Is data science only the stuff going on at companies like Google, Facebook, and
other tech companies?
• There’s a distinct lack of respect for the researchers in academia and industry
labs who have been working on this kind of stuff for years, and whose work is
based on decades of earlier work.
• The hype is crazy. In general, hype masks reality and increases the
noise-to-signal ratio.
• Statisticians already feel that they are studying and working on the “Science of
Data.”
9. Getting Past the Hype
• Rachel’s experience going from getting a PhD in statistics to
working at Google. In her words:
10. We have a couple replies to this:
• Sure, there’s a difference between industry and academia. But does it really
have to be that way? Why do many courses in school have to be so intrinsically
out of touch with reality?
• Even so, the gap doesn’t represent simply a difference between industry
statistics and academic statistics. The general experience of data scientists
is that, at their job, they have access to a larger body of knowledge and
methodology, as well as a process, which we now define as the data
science process, that has foundations in both statistics and computer
science.
Around all the hype, in other words, there is a ring
of truth: this is something new.
11. • We have massive amounts of data about many aspects of our lives. What people
might not know is that the “datafication” of our offline behavior has started
as well.
• On the Internet, this means Amazon recommendation systems, friend
recommendations on Facebook, film and music recommendations, and so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and
assessments coming out of places like Knewton and Khan Academy.
• In government, this means policies based on data.
Why Now?
12. • In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor
Mayer-Schoenberger wrote an article called “The Rise of Big Data”. In it they
discuss the concept of datafication.
They define datafication as a process of “taking all aspects of
life and turning them into data.”
• They follow up their definition in the article with a line that speaks volumes
about their perspective:
Once we datafy things, we can transform their purpose and
turn the information into new forms of value.
Datafication
13. Examples:
• We quantify friendships with “likes”.
• Google’s augmented-reality glasses datafy the gaze.
• Twitter datafies stray thoughts.
• LinkedIn datafies professional networks.
• When we “like” someone or something online, we are intending to be
datafied.
• When we browse the Web, we are datafied unintentionally through cookies.
• When we walk around in a store, or even on the street, we are being
datafied via sensors, cameras, or Google glasses.
• Datafication ranges from willingly taking part in a social media experiment
to all-out surveillance and stalking.
But it’s all datafication.
14. For example,
• On Quora there’s a discussion from 2010 about “What is Data Science?”, and here’s
Metamarkets CEO Mike Driscoll’s answer:
Data science, as it’s practiced, is a blend of Red-Bull-fueled hacking
and espresso-inspired statistics.
• Driscoll then refers to Drew Conway’s Venn diagram of data science from 2010.
Current landscape of perspectives
15. • Nathan Yau’s 2009 post, “Rise of the Data Scientist”, which includes:
1. Statistics (traditional analysis you’re used to thinking about)
2. Data munging (parsing, scraping, and formatting data)
3. Visualization (graphs, tools, etc.)
• ASA President Nancy Geller’s 2011 Amstat News article, “Don’t shun the ‘S’
word”, in which she defends statistics.
• DJ Patil and Jeff Hammerbacher, then at LinkedIn and Facebook, respectively,
coined the term “data scientist” in 2008.
• Wikipedia finally gained an entry on data science in 2012.
16. • In 2001, William Cleveland wrote a position paper about data
science called “Data Science: An action plan to expand the field of
statistics.”
• Harvard Business Review declared data scientist to be the
“Sexiest Job of the 21st Century”.
So data science existed before data scientists? Is
this semantics, or does it make sense?
17. Data Science Jobs
• For three years running, data science has been dubbed “the best job in
America.” According to Stack Overflow, it is one of the highest paying
jobs in the software sector.
• The GDPR increased the reliance companies have on data scientists due
to the need for real-time analytics and storing data responsibly.
• There are 465 job openings in New York City alone for data scientists.
• LinkedIn recently picked data scientist as its most promising career of
2019. One of the reasons it got the top spot was that the average salary
for people in the role is $130,000.
• The January report from Indeed, one of the top job sites, showed a 29%
increase in demand for data scientists year over year and a 344%
increase since 2013 -- a dramatic upswing. But while demand -- in the
form of job postings -- continues to rise sharply, searches by job
seekers skilled in data science grew at a slower pace (14%), suggesting a
gap between supply and demand.
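The percentages quoted from the Indeed report can be turned into multipliers to make the gap concrete. The sketch below only restates the slide's figures; it assumes the 14% growth in searches is also a year-over-year figure, which the slide implies but does not state outright.

```python
# Figures quoted in the slide (Indeed report).
postings_growth_yoy = 0.29         # job postings, year over year
searches_growth_yoy = 0.14         # job-seeker searches (assumed year over year)
postings_growth_since_2013 = 3.44  # a 344% increase

# A 344% increase means postings are 1 + 3.44 = 4.44 times the 2013 level.
multiplier_since_2013 = 1 + postings_growth_since_2013

# Relative change in postings per search over the year:
# (1 + 0.29) / (1 + 0.14) - 1, roughly 13%.
postings_per_search_change = (1 + postings_growth_yoy) / (1 + searches_growth_yoy) - 1

print(f"{multiplier_since_2013:.2f}x since 2013")
print(f"postings per search up by about {postings_per_search_change:.1%}")
```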
18. The growth in data scientist job postings on Indeed, from December 2016 to December 2018
20. OK, So What Is a Data Scientist, Really?
Perhaps the most concrete approach is to define data science by its usage.
• In Academia
• An academic data scientist is a scientist, trained in anything from social science to
biology, who works with large amounts of data, and must grapple with
computational problems posed by the structure, size, messiness, and the
complexity and nature of the data, while simultaneously solving a real-world
problem.
• In Industry
More generally, a data scientist is someone who knows how to
• design experiments,
• collect, clean, and munge data (skills that are also necessary for
understanding biases in the data and for debugging logging output from code),
• do exploratory data analysis, which combines visualization and data sense,
• find patterns and build models and algorithms,
• use analyses for decision making.
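The steps listed above can be sketched end to end on a toy problem. This is a minimal illustration, not a prescribed workflow: the records (hours studied, exam score) are invented, and a hand-written least-squares line stands in for "building a model".

```python
from statistics import mean

# Collect: hypothetical raw records of (hours studied, exam score) as strings.
raw = [("2", "55"), ("4", "65"), ("", "70"), ("6", "75"), ("8", "n/a"), ("10", "95")]

# Clean and munge: drop records with missing or non-numeric fields.
def to_float(text):
    try:
        return float(text)
    except ValueError:
        return None

parsed = [(to_float(a), to_float(b)) for a, b in raw]
clean = [(x, y) for x, y in parsed if x is not None and y is not None]

# Explore: simple summaries before any modelling.
xs = [x for x, _ in clean]
ys = [y for _, y in clean]
x_bar, y_bar = mean(xs), mean(ys)

# Model: ordinary least-squares line y = intercept + slope * x.
slope = sum((x - x_bar) * (y - y_bar) for x, y in clean) / sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

# Decide: use the fitted line to answer a question, e.g. expected score for 7 hours.
predicted = intercept + slope * 7

print(f"kept {len(clean)} of {len(raw)} records")
print(f"score = {intercept:.1f} + {slope:.1f} * hours; 7 hours -> {predicted:.1f}")
```

In practice most of the effort goes into the collecting and cleaning steps, which is why they appear before any modelling here.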
22. Data Engineers are the data professionals who prepare the “big data”
infrastructure to be analyzed by Data Scientists.
A Data Analyst is someone who merely curates meaningful insights from data.
A data scientist is a professional with the capabilities to gather large amounts of
data to analyze and synthesize the information into actionable plans for companies
and other organizations.
What Is a Data Scientist