2. Agenda
• What is Big Data?
• What is Data Science?
• Who are Data Scientists?
• What do Data Scientists do?
• What are the job prospects for Data Scientists?
• How happy are Data Scientists with their jobs?
• Becker’s BS in Data Science
• Becker’s Big Data Analytics concentration
4. How much data do we use?
• Every day, people send 150 billion new email messages
• Every 4 minutes, a terabyte of data (72 hours of video) is uploaded to YouTube
• Facebook’s databases ingest 500 terabytes of new data per day
• The CERN Large Hadron Collider generates 1 petabyte per second
• Sensors from a Boeing 787 jet create 40 terabytes of data per hour
• An Oil & Gas off-shore rig operation generates 8 terabytes a day
• A self-driving car generates 1 gigabyte per second
• General Electric gas turbines generate 500 gigabytes per day
• The proposed Square Kilometer Array telescope will generate an exabyte of data per day
• 90% of the data in the world today has been created in the last two years alone
• 80% of data captured today is unstructured
4,000,000,000,000,000,000,000 bytes = 4 zettabytes
Scale of prefixes: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta
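The byte count above can be checked with a short sketch. The helper name `humanize_bytes` is an assumption made for illustration, and the sketch uses decimal (SI) prefixes, where each step up is a factor of 1,000.

```python
# Decimal (SI) prefixes; each one is a factor of 1,000 larger than the previous.
PREFIXES = ["", "Kilo", "Mega", "Giga", "Tera", "Peta", "Exa", "Zetta"]


def humanize_bytes(n: int) -> str:
    """Express a byte count with the largest prefix that fits (hypothetical helper)."""
    value = float(n)
    index = 0
    # Divide by 1,000 until the value drops below 1,000 or the prefixes run out.
    while value >= 1000 and index < len(PREFIXES) - 1:
        value /= 1000
        index += 1
    return f"{value:g} {PREFIXES[index]}bytes"


print(humanize_bytes(4_000_000_000_000_000_000_000))  # 4 Zettabytes
print(humanize_bytes(500 * 10**12))                   # 500 Terabytes
```

The same helper can be applied to the other figures on this slide, for example Facebook's 500 terabytes per day.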
5. How much data do we use?
According to IBM, 90% of the data in the world today was created in the last 2
years alone.
“Big Data: Getting Ready For The 2013 Big Bang”, Forbes Magazine, May 1, 2013
4,000,000,000,000,000,000,000 bytes
7. Definition of Big Data
• Big Data – tools that process and analyze
complex data at speeds and scales that were
previously not cost-effective.
8. History of Big Data
• c. 18,000 BCE: Humans use tally sticks to record data for the first time, tracking trading activity and recording inventory
• c. 2,400 BCE: The abacus is developed and the first libraries are built in Babylonia
• c. 300 BCE: The Library of Alexandria is the world’s largest storage center
• c. 200-100 BCE: The Antikythera mechanism, the first mechanical computer, is developed in Greece
• 1663: John Graunt conducts the first statistical analysis experiments to curb the spread of bubonic plague in Europe
• 1865: The term “Business Intelligence” is first used
• 1928: Fritz Pfleumer creates a method of storing data magnetically, which forms the basis of modern digital data storage
• 1965: The US government plans the world’s first data center to store 742 million tax returns and 175 million sets of fingerprints on magnetic tape
• 1969: Birth of the ARPANET, which later led to the creation of the Internet (October 29, 1969, 22:30)
• 1970: The relational database model is developed by IBM mathematician Edgar F. Codd. Everyone, not just computer scientists, can now use databases
• 1989: Early use of the term Big Data in a magazine article by Erik Larson
• 1991: Birth of the WWW. Anyone can upload their own data
9. History of Big Data
• 1996: The price of digital storage makes it more cost-effective than paper
• 1997: Google launches the world’s most popular search engine
• 1997: First use of the term Big Data in an academic paper
• 2001: The 3 Vs of Big Data (Volume, Velocity and Variety) are defined by Doug Laney
• 2005: Hadoop, an open source Big Data framework, is developed
• 2009: The average US company with over 1,000 employees is storing more than 200 TB of data, according to the McKinsey Global Institute
• 2010: Every two days, as much data is being created as was created from the beginning of human civilization to the year 2003 (Eric Schmidt, Google)
• 2011: By 2018, the US will face a shortfall of 140,000 to 190,000 data scientists (McKinsey)
• 2014: Mobile internet use overtakes desktop for the first time
• 2015: The Internet of Things is being adopted by industries
• 2020: Some 30 billion objects may be connected to the Internet of Things
12. 4 V’s of Big Data
• Volume – a Terabyte? a Petabyte? More?...
• Variety – a Web Log? A Twitter feed? A
YouTube video?
• Velocity – New data comes every hour?
Minute? Second?
• Veracity – how much do I trust this data?
40%? 100%? 0%?
13. History of Big Data
IBM delivers an HDD, weighing
over a ton, storing 5 MB of data
(September, 1956)
17. Unstructured Data
• Refers to information that does not have a
pre-defined data model or is not organized in
a pre-defined manner.
• Examples: social network feeds, customer
reviews or comments, YouTube videos, etc.
18. Structured Data
• Refers to information that has a
pre-defined data model and is organized in
a pre-defined manner.
24. What is Data Science?
*http://en.wikipedia.org/wiki/Data_science
• 1960-The term "data science" (originally used interchangeably with
"datalogy") has existed for over thirty years and was used initially as a
substitute for computer science by Peter Naur in 1960.
• 2002-The International Council for Science: Committee on Data for
Science and Technology started the Data Science Journal
• 2004-Usama Fayyad became the first CDO at Yahoo.
• 2008-DJ Patil and Jeff Hammerbacher coined the term “data scientist”
to define their jobs at LinkedIn and Facebook, respectively
26. What is Data Science?
Math & Statistics
• Discrete
• Finite
• Linear Algebra
• Multivariate
Computer Science
• Programming
• Business Intelligence
Soft Skills
• Oral Communications
• Creativity
• Project Management
• Team play
• Presentation
28. Data Science vs Data Analytics vs …
• Business Intelligence – covers data analysis and relies heavily on aggregation, focusing on business information
• Statistics – the study of collection, analysis, interpretation, presentation and organization of data.
• Data Mining – a technique that focuses on modeling and knowledge discovery for predictive rather than prescriptive
purposes
• Data Analytics – a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful
information, suggesting conclusions, and supporting decision-making.
• Business Analytics - practices for continuous iterative exploration and investigation of past business performance to gain
insight and drive business planning
– Descriptive Analytics – analyzes the past performance and understands that performance by mining historical data to look for the
reasons behind past success or failure
– Predictive Analytics - encompasses a variety of statistical techniques from predictive modeling, machine learning, and data mining that
analyze current and historical facts to make predictions about future or otherwise unknown events.
– Prescriptive Analytics - automatically synthesizes big data, multiple disciplines of mathematical sciences and computational sciences,
and business rules, to make predictions and then suggests decision options to take advantage of the predictions.
• Data Science – an interdisciplinary field about processes and systems to extract knowledge or insights from data in various
forms, either structured or unstructured, which is a continuation of some of the data analysis fields, such as statistics, data
mining, and predictive analytics.
• https://en.wikipedia.org/wiki/Data_science
• https://en.wikipedia.org/wiki/Data_analysis
33. What Do Data Scientists Do?
In a nutshell a data scientist creates data products. This can mean a lot of
things but we can generalize as having the ability to create interfaces for
people and machines that use data of any kind.
Responsibilities vary a lot. It can be running experiments, creating
interfaces using machine learning, providing insights from complex
datasets.
Data scientists work with hypotheses. For instance, the experiments we run
at Miniclip are becoming full-fledged randomised controlled trials but I think
that is the most similar case. I believe the term scientist appeared when
data itself became a field of study. The way machine learning treats data is
highly empirical. The process of improving and validating a model, while
not using the traditional statistical methods of scientific research is,
nevertheless, highly empiric, skeptic and pragmatic. Sometimes more than
some papers that are published.
• Quora http://qr.ae/RUWYc8
34. What Do Data Scientists Do?
• “There are multiple communities of data scientists throughout the
Amazon offices which are easily approachable”
• “They mostly work on the vertical like ad space optimization or marketing.
People have in depth understanding of domain and some of the best
minds in the industry”
• “There is a Data Science Toolkit, which contains almost every kind of tool
for Data Scientists… Biggest data warehouse (Datanet) to play with,
Extended internal wiki of almost every possible topic in the universe of
Data; mentorship of data science wizards”
– Quora, http://qr.ae/RUPSv4
35. What Do Data Scientists Do?
• Netflix Prize – was an open competition for
the best collaborative filtering algorithm to
predict user rating for films, based on previous
ratings without any other information about
users or films.
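A minimal sketch of the idea behind collaborative filtering, assuming a simple user-based approach: a user's unknown rating is estimated from the ratings of other users, weighted by how similar their past ratings are. The ratings, names, and function names below are made up for illustration; this is not the method used by any Netflix Prize team.

```python
from math import sqrt
from typing import Optional

# Made-up ratings (user -> film -> stars); illustrative only, not Netflix Prize data.
ratings = {
    "ann": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 4, "Brazil": 5, "Casablanca": 2, "Dune": 5},
    "cat": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Dune": 1},
}


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity computed over the films both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[f] * b[f] for f in shared)
    norm_a = sqrt(sum(a[f] ** 2 for f in shared))
    norm_b = sqrt(sum(b[f] ** 2 for f in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user: str, film: str) -> Optional[float]:
    """Similarity-weighted average of the other users' ratings for the film."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or film not in other_ratings:
            continue
        weight = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[film]
        weight_total += weight
    # None means nobody comparable has rated the film.
    return weighted_sum / weight_total if weight_total else None


# ann's tastes track bob's more closely than cat's, so the estimate
# for "Dune" leans toward bob's rating of 5 rather than cat's rating of 1.
print(predict_rating("ann", "Dune"))
```

Only past ratings are used, with no other information about users or films, which matches the constraint of the competition.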
36. What Do Data Scientists Do?
On 9/21/2009, $1 million was awarded to the BellKor’s Pragmatic Chaos team, which improved
prediction accuracy by 10.06%
37. What Do Data Scientists Do?
• We work on core ML, on computer vision, on computational photography and on language
technologies.
• In computer vision we have a system that processes every single image and video uploaded
to Facebook, totaling well over 1B items per day. We predict the content of an image for
example in order to generate captions for the blind, or to automatically detect and take down
offensive content, improve media search results, automate visual captcha among many other
use cases.
• In language technology, one thing we are trying to do is eliminate language barriers on
Facebook. In order to do this we translate over 2B posts every single day, with over 1800
language directions representing more than 40 unique languages.
• In core ML, we focus on researching and shipping large scale and realtime ML/AI algorithms
for some of the biggest ML applications in the world. Whenever a user logs into Facebook,
these models are used to rank news feed stories (1B users every day, 1.5K stories per user
per day on average), ads, search results (1B+ queries a day), trending news, friend
recommendations and even rank notifications that a user receives, or rank the comments on
a post.
– Quora (http://qr.ae/RZ3JBx)
38. What Do Data Scientists Do?
• There are multiple analytics teams at Facebook
• A team of Data Scientists works on Ads and is probably the largest and most centralized
analytics team at Facebook
• Our goal is to come up with data backed insights which will result in informing the product
road-map or move key metrics that our product teams track. We sometimes also build
infrastructure (less common in my world) that are used by other Data Scientists and
engineers. We work in close concert with Engineering and Product and we often wear
Engineering or Product management hats in addition to our Data Scientist responsibilities.
We spend our time in:
– Analyzing and designing experiments to optimize product features or move key metrics
– Data mining/analysis to come up with business opportunities to pursue or product
feature suggestions or sometimes to understand metric movements.
– Building production ML models (though this is mostly done by SW Engineering)
• The multidisciplinary nature of the role, access to one of the largest troves of data, brilliant
colleagues and ability to create a huge impact in a very short time period make this an
exciting job.
– Quora (http://qr.ae/RUPJbx)
39. What Do Data Scientists Do?
• Predicting the past – let's say you want to determine the gender of Jason Lemkin.
If you are a human, that's easy (hint: he's a man). If you are a computer, it is more
difficult. But you might have a large dataset of genders and first names and see
that 99% of Jasons are men so your algorithm says he is a man. This would be
much more difficult with me ("Auren" is a more gender neutral name) and so you
might not be confident enough to make a gender pronouncement and thus might
need more data (like doing natural language processing on articles about me that
refer to me as "he" and "him").
• Predicting the future – figuring out what posts should be shown to the right
person.
– Quora: http://qr.ae/RUgn33
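The "predicting the past" example can be sketched as a lookup with a confidence threshold. The records, the 60/40 split for "Auren", the 0.95 threshold, and the function name are all assumptions made for illustration; only the 99% figure for "Jason" comes from the quote.

```python
from collections import Counter
from typing import Optional

# Made-up (first name, gender) records; a real dataset would be far larger.
records = (
    [("Jason", "male")] * 99 + [("Jason", "female")] * 1
    + [("Auren", "male")] * 6 + [("Auren", "female")] * 4
)


def predict_gender(name: str, threshold: float = 0.95) -> Optional[str]:
    """Return the majority gender for a name, or None when confidence is too low."""
    counts = Counter(gender for n, gender in records if n == name)
    total = sum(counts.values())
    if total == 0:
        return None  # name never seen: no basis for a prediction
    gender, count = counts.most_common(1)[0]
    return gender if count / total >= threshold else None


print(predict_gender("Jason"))  # male (99% of the sample records)
print(predict_gender("Auren"))  # None (60% is below the 0.95 threshold)
```

A `None` result is the signal that more data is needed, such as the pronouns used in articles about the person.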
40. What Do Data Scientists Do?
• Airbnb wrangles a lot of data—roughly 11 petabytes. Much of it, such as a guest’s
lodging preferences and whether a host likes to be continuously booked or prefers
having a few days free between visitations, helps the online marketplace’s search
algorithm determine the most likely match between guest and host.
• Preferences of this sort fall into one of four data categories:
– Behavioral, which describes user behavior as they interact with the Airbnb website;
– Dimensional, which covers user attributes including access device used, language and location;
– Sentiment, which reflects lodging reviews, ratings and survey results;
– Imputed, which infers user behaviors, such as “this guest always travels to big cities, whereas this
other guest always travels to small coastal towns.”
• To collect, process and analyze all this data, Airbnb relies on a team of about 100
people. These include around 20 engineers who support the computing
infrastructure and Newman's 80-person data science team.
– http://www.information-management.com/news/big-data-analytics/how-airbnb-uses-big-data-to-better-match-guests-rooms-10028582-1.html
41. What Do Data Scientists Do?
• Data captured through all its channels – text message, Twitter, Pebble, Android, Amazon Echo – to name
just a fraction – is fed into the Domino’s Information Management Framework. There it’s combined with
enrichment data from a large number of third party sources such as the United States Postal Service as
well as geocode information, demographic and competitor data, to allow in depth customer segmentation.
• “We have the ability to not only look at a consumer as an individual and assess their buying patterns, but
also look at the multiple consumers residing within a household, understand who is the dominant buyer,
who reacts to our coupons, and, foremost, understand how they react to the channel that they’re coming
to us on.”
– http://www.forbes.com/sites/bernardmarr/2016/04/06/big-data-driven-decision-making-at-dominos-pizza/#5c668fd4647f
42. What Do Data Scientists Do?
(Finance)
Source: Hortonworks
43. What Do Data Scientists Do?
(Government)
• Fraud, Waste and Abuse (FWA)
– Fraud and Abuse occur when there are loopholes
created by complex interactions between business
controls, regulatory requirements and day-to-day
processes. Recognizing these control point loopholes
is hard, and manual review is difficult.
Source: KPMG
45. What Do Data Scientists Do?
(Government)
• FWA in Other Sectors
Source: KPMG
46. What Do Data Scientists Do?
(Game industry)
• Data Analysts/Scientists in Games are
concerned with how to:
– Engage the gamer
– Monetize the gamer
47. What Do Data Scientists Do?
(Game industry)
• Pre-launch data simulation
– Simulating loot drop rules and preferences in Call
of Duty before launching the game
Source: Activision
48. What Do Data Scientists Do?
(Game industry)
• In-Game analytics:
– Why are people leaving?
– Investigating churn, building a churn prediction
model and influencing behavior before players quit
Source: Activision
49. What Do Data Scientists Do?
(Game industry)
• Game Feature Research
Source: Activision
50. What Do Data Scientists Do?
(Non profit)
Use-case: DataKind.org
Source: DataKind
51. What are the job prospects?
[By 2018] “The United States alone faces a shortage of
140,000 to 190,000 people with deep analytical skills
as well as 1.5 million managers and analysts
to analyze big data and make decisions
based on their findings.”
• http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation
52. What are the job prospects?
• http://www.indeed.com/salary?q1=%22Data+Scientist%22&l1=
53. What are the job prospects?
• https://www.glassdoor.com/Best-Jobs-in-America-LST_KQ0,20.htm
54. What are the job prospects?
• https://www.dezyre.com/article/data-scientist-salary-report-of-100-top-tech-companies-/218
55. How Happy Are Data Scientists?
Machine Learning Developers are Happy!
StackOverflow survey
56. Bachelor of Science in Data Science
• Building Foundations
• 120 credits
• Foundations in:
– Math
– Statistics and Multivariate Statistics
– Machine Learning
– Computer Programming
– Practicum