This is my presentation on the topic "Data Science - An Emerging Stream of Science with Its Spreading Reach & Impact". I have compiled statistics and data from a range of sources. It may be useful for students and for anyone else interested in this field of study.
5. There's certainly a lot of it!
Data, data everywhere… (chart, logarithmic scale)
Scale markers: 1 TB = 1000 GB; 1 Petabyte = 1000 TB; 1 Exabyte; 1 Zettabyte
Data produced each year:
• 2002: 5 EB
• 2006: 161 EB
• 2009: 800 EB
• 2011: 1.8 ZB
• 2015: 8.0 ZB
Human brain's capacity: 14 PB
100 years of HD video + audio: 60 PB
Also marked on the chart: 120 PB
References
(brain) 14 PB: http://www.quora.com/Neuroscience-1/How-much-data-can-the-human-brain-store
(2002) 5 EB: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm
(2006) 161 EB: http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf
(2009) 800 EB: http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf
(2011) 1.8 ZB: http://www.emc.com/leadership/programs/digital-universe.htm
(2015) 8 ZB: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
(life in video) 60 PB: in 4320p resolution, extrapolated from 16 MB for 1:21 of 640x480 video (w/sound); almost certainly a gross overestimate, as sleep can be compressed significantly!
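To make the units on this slide concrete, here is a minimal Python sketch that converts the quoted yearly volumes to bytes and shows where each would sit on a logarithmic axis. It assumes decimal (SI) prefixes, following the slide's "1 TB = 1000 GB" convention; the names `UNITS`, `to_bytes` and `yearly` are illustrative only.

```python
import math

# Decimal (SI) prefixes, matching the slide's "1 TB = 1000 GB" convention.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

def to_bytes(value, unit):
    """Convert a value expressed in the given unit to bytes."""
    return value * UNITS[unit]

# Yearly data volumes quoted on the slide: (year, value, unit).
yearly = [(2002, 5, "EB"), (2006, 161, "EB"), (2009, 800, "EB"),
          (2011, 1.8, "ZB"), (2015, 8.0, "ZB")]

first = to_bytes(yearly[0][1], yearly[0][2])
last = to_bytes(yearly[-1][1], yearly[-1][2])
growth_factor = last / first
orders_of_magnitude = math.log10(growth_factor)

for year, value, unit in yearly:
    # log10 of the byte count is the position on a logarithmic axis.
    print(year, f"{value} {unit}", round(math.log10(to_bytes(value, unit)), 2))
print(f"2002 to 2015 growth: {growth_factor:.0f}x "
      f"({orders_of_magnitude:.1f} orders of magnitude)")
```

On these figures the yearly volume grows by a factor of 1600 between 2002 and 2015, which is why a logarithmic scale is needed to show all the points on one chart.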
6. Data has become a resource that needs to be carefully stored, processed,
analyzed, visualized and presented securely wherever it is required.
7. Growing Need for Analytics
DATA HARNESSING
• Data is generated. Companies store each piece of information generated during business operations and customer interactions.
• Data is analyzed. Learning from the data is used in decision making and process optimization.
DATA VOLUMES (in trillion GB)
• 2010: 1.2
• 2012: 2.4
• 2015: 7.9
DID YOU KNOW? Generation of Large Amount of Data from Business Transactions
• Number of transactions every year: 4 billion
• Number of stores: 900
• Number of SKUs: 10,000 to 1 lakh (100,000)
10. Fourth Paradigm of Science
Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science:
• Thousands of years: Empirical (अनुभवजन्य)
• Few hundreds of years: Theoretical (सैद्धांतिक)
• Last fifty years: Computational (गणनात्मक), “Query the world”
• Last twenty years: eScience (Data Science), “Download the world”
12. What is Data Science
• Data Science is a multi-disciplinary field that uses scientific
methods, processes, algorithms and systems to
extract knowledge and insights from structured and
unstructured data.
• Data Science is a "concept to unify statistics, data analysis,
machine learning and their related methods" in order to
"understand and analyze actual phenomena" with data. It
employs techniques and theories drawn from many fields within
the context of mathematics, statistics, computer science,
and information science.
• The availability of high-capacity networks, low-cost computers and
storage devices as well as the widespread adoption of hardware
virtualization, service-oriented
architecture and autonomic and utility computing has led to growth
in cloud computing.
14. Data Science : A Definition
Data Science is the science which uses computer science, statistics and
machine learning, visualization and human-computer interactions to:
1. Collect
2. Clean
3. Integrate
4. Analyze
5. Visualize
6. Interact
with data to create data products.
The objective of Data Science is to “Turn Data into Data Products”.
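The steps listed on this slide can be sketched as a tiny pipeline. This is a toy illustration using only the Python standard library and made-up records, not a real data product; every function name and data value is hypothetical, and the "Interact" step is left out because it would need a user interface.

```python
from collections import Counter

def collect():
    # Hypothetical raw records from two sources (made-up illustrative data).
    store_a = [" Apple ", "banana", None, "apple"]
    store_b = ["Banana", "cherry", ""]
    return store_a, store_b

def clean(records):
    # Drop missing or empty values and normalize whitespace and case.
    return [r.strip().lower() for r in records if r and r.strip()]

def integrate(*sources):
    # Merge the cleaned sources into one dataset.
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def analyze(dataset):
    # Count how often each item occurs.
    return Counter(dataset)

def visualize(counts):
    # A text bar chart stands in for a real plot.
    return [f"{item:<8}{'#' * n}" for item, n in sorted(counts.items())]

raw_a, raw_b = collect()
counts = analyze(integrate(clean(raw_a), clean(raw_b)))
chart = visualize(counts)
print("\n".join(chart))
```

Each stage takes the previous stage's output, so the stages can be swapped out independently, for example replacing `visualize` with a plotting library.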
15. Traditionally, the data we had was mostly structured and small in size,
so it could be analyzed using simple BI tools. Today, by contrast, most
data is unstructured or semi-structured. The data trends in the image on
this slide show that by 2020, more than 80% of the data will be
unstructured.
22. What is Analytics?
Data on its own is useless unless you can make sense of it!
The scientific process of transforming data into insight for making
better decisions, offering new opportunities for a competitive
advantage.
24. Types of Analytics
• Descriptive analytics: mining data to provide business insights. What has happened?
• Predictive analytics: predicting the future based on historical patterns. What could happen?
• Prescriptive analytics: enabling smart decisions based on data. What should we do?
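The three questions on this slide can be illustrated on one small dataset. The sketch below uses made-up monthly sales figures, a hand-written least-squares line for the predictive step, and a toy stocking rule for the prescriptive step. All numbers and names are assumptions chosen for illustration.

```python
# Hypothetical monthly sales (units); made-up numbers for illustration.
sales = [100, 110, 120, 130, 140, 150]
months = list(range(len(sales)))
n = len(sales)

# Descriptive: what has happened? Summarize the history.
average = sum(sales) / n

# Predictive: what could happen? Fit a least-squares line y = a + b*x
# and extrapolate one month ahead.
mean_x = sum(months) / n
mean_y = average
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
     / sum((x - mean_x) ** 2 for x in months))
a = mean_y - b * mean_x
forecast = a + b * n  # predicted sales for the next month

# Prescriptive: what should we do? A toy rule comparing forecast to stock.
stock = 140
action = "reorder" if forecast > stock else "hold"

print(f"average={average}, forecast={forecast}, action={action}")
```

Real prescriptive systems usually optimize over many possible actions; the single threshold here only shows where the decision step sits relative to the other two.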
25. Types of Analytics
• Prescriptive Analytics: advice on possible outcomes
• Predictive Analytics: understanding the future
• Descriptive Analytics: insight into the past
Examples:
• Why do airline prices change every hour?
• How do grocery cashiers know to hand you coupons you might actually use?
• How does Netflix frequently recommend just the right movie?
26. Business Intelligence (BI) vs. Data Science
Features | Business Intelligence (BI) | Data Science
Data Sources | Structured (usually SQL, often Data Warehouse) | Both structured and unstructured (logs, cloud data, SQL, NoSQL, text)
Approach | Statistics and Visualization | Statistics, Machine Learning, Graph Analysis, Natural Language Processing (NLP)
Focus | Past and Present | Present and Future
Tools | Pentaho, Microsoft BI, QlikView, R | RapidMiner, BigML, Weka, R
28. Interest in the term “Data Science” since December 2013
(source: Google Trends)
A hype bag-of-words. Let’s not focus on buzzwords, but on what the
underlying technologies can actually solve.
30. Contrast: Databases
Feature | Databases | Data Science
Data Value | “Precious” | “Cheap”
Data Volume | Modest | Massive
Examples | Bank records, personnel records, census, medical records | Online clicks, GPS logs, tweets, building sensor readings
Priorities | Consistency, error recovery, auditability | Speed, availability, query richness
Structured | Strongly (schema) | Weakly or none (text)
Properties | Transactions, ACID* | CAP* theorem (2/3), eventual consistency
Realizations | SQL | NoSQL: MongoDB, CouchDB, HBase, Cassandra, Riak, Memcached, Apache River, …
* ACID = Atomicity, Consistency, Isolation and Durability
* CAP = Consistency, Availability, Partition Tolerance
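As a small illustration of the "Transactions, ACID" entry, the sketch below uses Python's built-in `sqlite3` module to show atomicity: a transfer either applies both updates or neither. The `accounts` table, the names and the amounts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # credit to bob is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the credit to `bob` is executed first and then undone by the rollback, which is the "all or nothing" behaviour the slide contrasts with eventual consistency in NoSQL systems.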
31. Contrast: Machine Learning
Data Science
• Explore many models, build and tune hybrids
• Understand empirical properties of models
• Develop/use tools that can handle massive datasets
• Take action!
Machine Learning
• Develop new (individual) models
• Prove mathematical properties of models
• Improve/validate on a few, relatively clean, small datasets
• Publish a paper
33. The first war: Terminology
• Analyzing data has a long history!
• There have been many terms that have been used to describe such
endeavors:
• Statistics
• Artificial Intelligence
• Machine learning
• Data analytics
• Since I happen to work in a “Data Science” program perhaps I may be
allowed the indulgence of using that terminology…
34. The Case for Business Analytics
BUSINESS NEED
• The business environment today is more complex than ever before.
• Businesses are expected to be diligently responsive to the increasing
demands of customers, various stakeholders and even regulators.
SOLUTION
• Organizations have been turning to the use of analytics.
• More than 83% of global CIOs surveyed by IBM in 2010 singled out
Business Intelligence and Analytics as one of their visionary plans for
enhancing competitiveness.
GOAL
In most cases the primary objective of an organization that seeks to turn to
analytics is:
• Revenue/Profit growth
• Optimize expenditure
35. Data Analysis Has Been Around for a While…
R.A. Fisher
Howard Dresner
Peter Luhn
W.E. Deming
36. Experiments, observations, and numerical simulations in many
areas of science and business are currently generating terabytes of
data, and in some cases are on the verge of generating petabytes
and beyond. Analyses of the information contained in these data
sets have already led to major breakthroughs in fields ranging from
genomics to astronomy and high-energy physics and to the
development of new information-based industries.
- Frontiers in Massive Data Analysis, National Research Council of the National Academies
Given a large mass of data, we can by judicious selection
construct perfectly plausible unassailable theories—all of
which, some of which, or none of which may be right.
- Paul Arnold Srere
37. The ability to take data—to be able to understand it, to process it, to
extract value from it, to visualize it, to communicate it—that’s going
to be a hugely important skill in the next decades, not only at the
professional level but even at the educational level for elementary
school kids, for high school kids, for college kids. Because now we
really do have essentially free and ubiquitous data. So the
complementary scarce factor is the ability to understand that data
and extract value from it.
-Hal Varian, Google's Chief Economist, http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers
My personal goal: Getting students to be able to
think critically about data.
38. What is Big Data?
There are many examples of "data", but what makes some of it “big”? The classic
definition revolves around the three V’s: volume, velocity, and variety.
Volume: There is just a lot of it being generated all the time. Things get
interesting, and “big”, when you can’t fit it all on one computer anymore.
That is why ideas such as MapReduce and Hadoop exist: they all revolve
around being able to process data that grows from terabytes to petabytes
to exabytes.
Velocity: Data is being generated very quickly. Can you even store it all? If
not, then what do you get rid of and what do you keep?
Variety: The data types you mention all take different shapes. What does it
mean to store them so that you can play with or compare them?
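To give a feel for the MapReduce idea mentioned under Volume, here is a single-process word-count sketch in plain Python. It only mimics the map, shuffle and reduce phases; it is not the Hadoop API, and the function names and sample text are assumptions for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Emit a (word, 1) pair for every word in one chunk of text.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Group values by key, as a framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

# Each string stands in for a block of a file held on a different machine.
chunks = ["big data is big", "data science uses data"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Because `map_phase` looks at one chunk at a time and `reduce_phase` at one key at a time, both phases could run in parallel on many machines, which is what lets the approach scale beyond a single computer.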
39. Big Data Everywhere!
BIG DATA: Data that is TOO LARGE & TOO COMPLEX for conventional data tools
to capture, store and analyze.
The 3V’s of Big Data: VOLUME, VARIETY, VELOCITY
• Shares traded on US stock markets each day: 7 billion
• Data generated in one flight from NY to London: 10 terabytes
• Number of tweets per day on Twitter: 400 million
• Number of ‘Likes’ each day on Facebook: 3 billion
90% OF THE WORLD’S DATA WAS GENERATED IN THE LAST TWO YEARS
www.imarticus.org
42. Is Big Data the same as Data Science?
Are Big Data and Data Science the same thing?
I wouldn't say so...
Data Science can be done on small data sets.
And not everything done using Big Data would necessarily be called Data Science.
But there certainly is a substantial overlap!
[Figure: Big Data and Data Science shown as overlapping sets]
43. Perspective Of Big Data's Growth
• Worldwide Big Data market revenues for software and services are projected to
increase from $42B in 2018 to $103B in 2027, attaining a Compound Annual
Growth Rate (CAGR) of 10.48% according to Wikibon.
• According to an Accenture study, 79% of enterprise executives agree that
companies that do not embrace Big Data will lose their competitive position and
could face extinction. Even more, 83%, have pursued Big Data projects to seize a
competitive edge.
• Forrester predicts the global Big Data software market will be worth $31B this
year, growing 14% from the previous year. The entire global software market is
forecast to be worth $628B in revenue, with $302B from applications.
• 59% of executives say Big Data at their company would be improved through the
use of AI according to PwC.
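The compound annual growth rate quoted from Wikibon can be checked with the standard CAGR formula, (end / start)^(1 / years) - 1. The short sketch below assumes nine compounding periods (2018 to 2027); the function name `cagr` is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10%)."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures from the Wikibon projection: $42B (2018) to $103B (2027).
rate = cagr(42, 103, 2027 - 2018)
print(f"CAGR: {rate:.2%}")
```

With nine periods the result comes out at roughly 10.5%, consistent with the 10.48% on the slide.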
51. Future Trends
Technologies and industries to watch in the near future:
• Progressive Web Apps (PWAs): A mixture of mobile and web apps.
• Blockchain & Fintech: Meta-model building, reliable trading & credit scoring.
• Healthcare: Diagnosis by medical imaging (computer vision & ML).
• AR/VR: Sport analysis, business cards (image tracking), real-life gaming
(Hado).
• AI speech assistants, smarter chatbot integrations.
• Smart Supply Chain: Digital twins (IoT sensors).
• 5G: Big data, mobile cloud computing, scalable IoT & network function
virtualisation (NFV).
• 3D Printing: Prefabrication efficiency, defect detection, predictive ML
maintenance.
• Dark Data: Information that is yet to become available in digital format.
• Quantum Computing: Cutting data processing times into fractions.
53. Thank You!
Dr. Sunil Kr Pandey
Professor & Director (IT & UG)
Institute of Technology & Science
Mohan Nagar, Ghaziabad
Email: sunilpandey@its.edu.in