3. KILO, MEGA, GIGA, TERA, PETA, EXA
ZETTA = 10^21 BYTES
…An organization employing 1,000 knowledge workers loses $5.7 million annually just in time wasted having to reformat information as they move among applications. Not finding information costs that same organization an additional $5.3 million a year.
Source: IDC
Over 95% of the digital universe is "unstructured data", meaning its content can't be truly represented by its field in a record, such as name, address, or date of last transaction. In organizations, unstructured data accounts for more than 80% of all information.
Source: IDC
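The prefix ladder in this slide's title can be spelled out as powers of ten. The snippet below is a minimal sketch, assuming the decimal (SI) meaning of each prefix, which is what "ZETTA = 10^21 BYTES" implies. The names `SI_PREFIXES` and `bytes_for` are illustrative and do not come from the slides.

```python
# Decimal (SI) prefixes named in the slide title, mapped to their power of ten.
# Assumption: decimal prefixes (10^3 steps), not binary ones (2^10 steps).
SI_PREFIXES = {
    "kilo": 3,
    "mega": 6,
    "giga": 9,
    "tera": 12,
    "peta": 15,
    "exa": 18,
    "zetta": 21,
}


def bytes_for(prefix: str, count: int = 1) -> int:
    """Return the number of bytes in `count` units of the given prefix."""
    return count * 10 ** SI_PREFIXES[prefix]


if __name__ == "__main__":
    # Print the whole ladder, one prefix per line.
    for name, power in SI_PREFIXES.items():
        print(f"1 {name}byte = 10^{power} bytes")

    # How many terabytes fit in one zettabyte?
    print(bytes_for("zetta") // bytes_for("tera"))
```

Each step up the ladder multiplies by 1,000, so one zettabyte is a billion terabytes.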
4. WHY DATA SCIENCE?
Available data on a scale millions of times larger than 20
years ago: customer transactions; environmental sensor
outputs; genetic and epigenetic sequences; web documents;
digital images and audio
Heterogeneous data sets, with different representations and
formats; mixtures of structured and unstructured data;
some, little, or no metadata; distributed across systems
Chaotic information life cycle, where little time and effort is
spent deciding what should be kept and what can be discarded
Diverse and/or legacy infrastructure: mainframes running
COBOL connected by high-speed networks to sensor arrays
running Linux
5. CRITICAL QUESTIONS
How will global climate change affect sea levels in major
coastal metropolitan areas worldwide?
Does genetic screening reduce cancer mortality for adults
between the ages of 50 and 59?
What gene sequences in cereal grains are associated with
greater crop yields in arid environments?
How can we reduce false positives in automated airline
baggage scans without reducing accuracy?
What Internet data can be mined as predictive of firm
creation among startups that provide new jobs?
6. “BIG DATA” PROVIDES ANSWERS
Water sustainability
Climate analysis and prediction
Energy through fusion
CO2 sequestration
Hazard analysis and management
Cancer detection and therapy
Drug design and development
Advanced materials analysis
New combustion systems
Virtual product design
In silico semiconductor design
NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report,
March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
7. “All grand challenges face barriers due to challenges in software, in data management and visualization, and in coordinating the work of diverse communities that must work together to develop new models and algorithms, and to evaluate outputs as a basis for critical decisions.”
NSF Advisory Committee for Cyberinfrastructure, Taskforce for Grand Challenges, Final Report, March 2011. http://www.nsf.gov/od/oci/taskforces/TaskForceReport_GrandChallenges.pdf
8. Knowledge Development for Industry, Education, Government, Research
Domain Experts:
Expertise in specific subject areas
Limited opportunity to master technology skills
Proliferation of big data & new technology
Need for knowledge and information managers
Infrastructure Professionals:
Rapid pace of IT development
Limited expertise in domain areas
Specialized knowledge of HW, FW, MW, SW
Communication challenges
Data Scientists:
Information Organization & Visualization
Information Analysis
Solution Integration
Digital Curation
Data Scientists: Transforming Data Into Decisions
9. A DEFINITION OF A DATA SCIENTIST
A data scientist uses deep expertise in the
management, transformation, and analysis of large,
heterogeneous data sets to:
Help infrastructure experts with the architecture of hardware
and software to manage big data challenges
Help domain experts and decision makers reduce the data
deluge into usable knowledge, visualizations, and
presentations
Help institutions and organizations control and curate data
throughout the information lifecycle
Editor's notes
Facebook friend connections worldwide, a network diagram of the Enron email set, a comparison of similar gene sequences between humans, chimps, and macaques