https://portal.futuregrid.org
Big Data Applications & Analytics Motivation:
Big Data and the Cloud; Centerpieces
of the Future Economy
January 5 2014
Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Introduction
Abstract
• There is an endlessly growing amount of data as we record every
transaction between people and the environment (whether shopping
or on a social networking site) while smart phones, smart
homes, ubiquitous cities, smart power grids, and intelligent vehicles
deploy sensors recording even more.
• Science with satellites and accelerators is giving data on transactions
of particles and photons at the microscopic scale.
• These data are and will be stored in immense clouds with co-located
storage and computing that perform "analytics" that transform data
into information and then to wisdom and decisions; data mining finds
the proverbial knowledge diamonds in the data rough.
• This disruptive transformation is driving the economy and creating
millions of jobs in the emerging area of "data science".
• We discuss this revolution and its implications for universities and
society
Some Trends
The Data Deluge is a clear trend across Commercial
(Amazon, e-commerce), Community (Facebook, Search)
and Scientific applications
Smaller (INTEL/ARM/AMD) chips drive
Multicore (i.e. more computing) on shared servers
Smaller Light weight clients from smartphones, tablets to
sensors (i.e. more clients)
Clouds with cheaper, greener, easier to use IT for
applications
New jobs associated with new curricula
Clouds as a distributed system (changing a classic CS course)
Data Science (new area)
48 technologies are listed in this year’s hype cycle, the highest number in the last ten years. The year
2008 had the lowest (27)
Gartner Says: We are at an interesting moment — a time when the scenarios we’ve been
talking about for a long time are almost becoming reality.
Private Cloud Computing is off the chart
http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
Note the number of “analytics” areas
http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
Issues of Importance
• Economic Imperative: There are a lot of data and a lot of
jobs
• Computing Model: Industry adopted clouds which are
attractive for data analytics
• Research Model: 4th Paradigm; From Theory to Data driven
science?
• Research/Business opportunities in advancing computing
technologies and algorithms
• Research/Business opportunities in X-Informatics: applying
4th paradigm (more here!)
• Development in Data Science Education: opportunities at
universities
Data Deluge
Meeker/Wu May 29 2013 Internet Trends D11 Conference
Zettabyte ~ 10^10 × Typical Local Storage (100 Gigabytes)
Zettabyte = 1000 Exabytes
Exabyte = 1000 Petabytes
Petabyte = 1000 Terabytes
Terabyte = 1000 Gigabytes
Gigabyte = 1000 Megabytes
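The unit ladder above can be checked with a few lines of arithmetic. This is a minimal sketch in Python; the names `UNITS` and `to_bytes` are illustrative and not from the slides, and it assumes the decimal (factor of 1000) convention listed above.

```python
# Decimal storage units as listed on the slide: each step is a factor of 1000.
UNITS = ["Megabyte", "Gigabyte", "Terabyte", "Petabyte", "Exabyte", "Zettabyte"]


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes (1 Megabyte = 10**6 bytes)."""
    power = UNITS.index(unit) + 2  # Megabyte is 1000**2 bytes
    return value * 1000 ** power


local_storage = to_bytes(100, "Gigabyte")  # typical local storage from the slide
zettabyte = to_bytes(1, "Zettabyte")

# How many 100 Gigabyte local disks make up one Zettabyte?
disks_per_zettabyte = zettabyte // local_storage
print(disks_per_zettabyte)  # prints 10000000000, i.e. 10**10
```

The result agrees with the slide's statement that a Zettabyte is roughly 10^10 typical local disks.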
20 hours
Meeker/Wu May 29 2013 Internet Trends D11 Conference
http://cs.metrostate.edu/~sbd/ Oracle
“Taming the Big Data Tidal Wave” 2012
(Bill Franks, Chief Analytics Officer Teradata)
• Web Data (“the original big data”)
– Analyze customer web browsing of e-commerce site to see topics looked at etc.
• Auto Insurance (telematics monitoring driving)
– Equip cars with sensors
• Text data in multiple industries
– Sentiment analysis, identify common issues (as in eBay lamp example), Natural Language processing
• Time and location (GPS) data
– Track trucks (delivery), vehicles (track), people (tell them about nearby goodies)
• Retail and manufacturing: RFID
– Asset and inventory management
• Utility industry: Smart Grid
– Sensors allow dynamic optimization of power
• Gaming industry: Casino Chip tracking (RFID)
– Track individual players, detect fraud, identify patterns
• Industrial engines and equipment: sensor data
– See GE engine
• Video games: telemetry
– This is like monitoring web browsing, but monitors actions in a game instead
• Telecommunication and other industries: Social Network data
– Connections make this big data.
– Use connections to find new customers with similar interests
Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
MM = Million
Some Science/Technical Data sizes
• LHC Particle Physics 15 petabytes per year
• Radiology 69 petabytes per year
• Square Kilometer Array Telescope will be 0.5
zettabytes per year raw data in ~2022
• Earth Observation becoming ~4 petabytes per year
• Earthquake Science – few terabytes total today
• PolarGrid Radar studies of glaciers – 100s of
terabytes/year
• Exascale simulation data dumps – ~0.1 zettabyte
per year
Need cost effective
Computing!
Sequence every newborn by 2019: 100 petabytes/year
http://www.genome.gov/sequencingcosts/
The Long Tail of Science
80-20 rule: 20% of users generate 80% of data but not necessarily 80% of knowledge
Collectively “long tail” science is generating a lot of data
Estimated at over 1PB per year and it is growing fast.
CSTI Meeting. October 2012
Dennis Gannon
Data Intensive Activities
• Particle Physics LHC (bag of events of particles)
• Information Retrieval or web search (bag of words)
• e-commerce (bag of items with properties or users with rankings)
• Social Networking (bag of people with links & properties)
• Health Informatics (bag of health records, gene sequences)
• Sensors – web cams, self driving cars etc. (bag of pixels)
• Using
• Statistics (Histograms, Chisq)
• Deep Learning (Machine Learning)
• Image Analysis (including internet uploaded images)
• Recommender Engines (Bag of Ratings or properties)
• Patterns or Anomaly detection in graphs (linked data)
• On Clouds using MapReduce etc.
Bag=Space
Big Data Ecosystem in One Sentence
Use Clouds running Data Analytics Collaboratively
processing Big Data to solve problems in
X-Informatics (or e-X)
X = Astronomy, Biology, Biomedicine, Business, Chemistry, Climate,
Crisis, Earth Science, Energy, Environment, Finance, Health,
Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar,
Security, Sensor, Social, Sustainability, Wealth and Wellness with
more fields (physics) defined implicitly
Spans Industry and Science (research)
Education: Data Science see recent New York Times articles
http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/
Social Informatics
Visual & Decision Informatics
Jobs
Jobs v. Countries
http://www.microsoft.com/en-us/news/features/2012/mar12/03-05CloudComputingJobs.aspx
McKinsey Institute on Big Data Jobs
• There will be a shortage of talent necessary for organizations to take
advantage of big data. By 2018, the United States alone could face a
shortage of 140,000 to 190,000 people with deep analytical skills as well as
1.5 million managers and analysts with the know-how to use the analysis of
big data to make effective decisions.
• Informatics aimed at 1.5 million jobs. Computer Science covers the 140,000
to 190,000
http://www.mckinsey.com/mgi/publications/big_data/index.asp.
Tom Davenport Harvard Business School
http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html Nov 2012
Meeker/Wu May 29 2013 Internet Trends D11 Conference
Industry Trends
Meeker/Wu May 29 2013 Internet Trends D11 Conference
Computing Model
Industry adopted clouds which are
attractive for data analytics
For the last 5 years Cloud Computing, and for the last 2 years Big Data, have been Transformational
Note in 2013 Big Data moves to the 5-10 year slot
Amazon Cloud AWS making money
• It took Amazon Web Services (AWS) eight years to hit
$650 million in revenue, according to Citigroup in
2010.
• Just three years later, Macquarie Capital analyst Ben
Schachter estimates that AWS will top $3.8 billion in
2013 revenue, up from $2.1 billion in 2012
(estimated), valuing the AWS business at $19 billion.
• First public cloud computing supplier building on many
cloud systems used to run Amazon, Google, Bing, eBay
….
Physically Clouds are Clear
• A bunch of computers in an efficient data center
with an excellent Internet connection
• They were produced to meet the needs of public-facing
Web 2.0 e-Commerce/Social Networking
sites
• They can be considered as “optimal giant data
center” plus internet connection
• Note enterprises use private clouds that are
giant data centers but not optimized for Internet
access
The Microsoft Cloud is Built on Data Centers
Quincy, WA Chicago, IL San Antonio, TX Dublin, Ireland Generation 4 DCs
~100 Globally Distributed Data Centers
Range in size from “edge” facilities to megascale (100K to 1M servers)
CSTI Meeting.
October 2012
Dennis Gannon
Build giant data centers with 100,000’s of computers;
~ 200-1000 to a shipping container with Internet access
Data Centers Clouds &
Economies of Scale
Range in size from “edge”
facilities to megascale.
Economies of scale
Approximate costs for a small size
center (1K servers) and a larger,
50K server center.
Each data center is
11.5 times
the size of a football field
Technology | Cost in small-sized Data Center | Cost in Large Data Center | Ratio
Network | $95 per Mbps/month | $13 per Mbps/month | 7.1
Storage | $2.20 per GB/month | $0.40 per GB/month | 5.7
Administration | ~140 servers/Administrator | >1000 servers/Administrator | 7.1
2 Google warehouses of computers on
the banks of the Columbia River, in
The Dalles, Oregon
Such centers use 20MW-200MW
(Future) each with 150 watts per CPU
Save money from large size,
positioning with cheap power and
access with Internet
http://research.microsoft.com/en-us/people/barga/sc09_cloudcomp_tutorial.pdf
Virtualization made several things more
convenient
• Virtualization = abstraction; run a job – you know not
where
• Virtualization = use hypervisor to support “images”
– Allows you to define complete job as an “image” – OS +
application
• Efficient packing of multiple applications into one
server as they don’t interfere (much) with each other
if in different virtual machines
• They do interfere if run as two jobs on the same machine since,
for example, they must use the same OS and the same OS
services
• Also the security model between VMs is more robust than
between processes
Microsoft Server Consolidation
• http://research.microsoft.com/pubs/78813/AJ18_EN.pdf
• Typical data center CPU has 9.75% utilization
• Take 5000 SQL servers and rehost on virtual machines with 6:1
consolidation
60% saving
The Google gmail example
• http://www.google.com/green/pdfs/google-green-computing.pdf
• Clouds win by efficient resource use and efficient data centers
Business Type | Number of users | # servers | IT Power per user | PUE (Power Usage Effectiveness) | Total Power per user | Annual Energy per user
Small | 50 | 2 | 8W | 2.5 | 20W | 175 kWh
Medium | 500 | 2 | 1.8W | 1.8 | 3.2W | 28.4 kWh
Large | 10000 | 12 | 0.54W | 1.6 | 0.9W | 7.6 kWh
Gmail (Cloud) | | | < 0.22W | 1.16 | < 0.25W | < 2.2 kWh
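The last two columns of the Gmail comparison appear to follow from the earlier ones: total power per user is IT power per user times PUE, and annual energy is that power sustained for a year. The sketch below checks this arithmetic in Python. The function names are illustrative, and running at constant power for 8760 hours is an assumption, not something the slide states.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, assuming constant power all year


def total_power_w(it_power_w, pue):
    """Total facility power per user = IT power per user x PUE."""
    return it_power_w * pue


def annual_energy_kwh(power_w):
    """Energy used in one year at constant power, in kWh."""
    return power_w * HOURS_PER_YEAR / 1000.0


# (business type, IT power per user in W, PUE) taken from the table
rows = [("Small", 8.0, 2.5), ("Medium", 1.8, 1.8), ("Large", 0.54, 1.6)]
for name, it_power, pue in rows:
    p = total_power_w(it_power, pue)
    print(f"{name}: {p:.2f} W, {annual_energy_kwh(p):.1f} kWh/year")
```

Under these assumptions the computed values agree with the table's 175, 28.4 and 7.6 kWh figures to within rounding.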
Clouds Offer From different points of view
• Features from NIST:
– On-demand service (elastic);
– Broad network access;
– Resource pooling;
– Flexible resource allocation;
– Measured service
• Economies of scale in performance and electrical power (Green IT)
• Powerful new software models
– Platform as a Service is not an alternative to Infrastructure as a
Service – it is instead an incredible value add
– Amazon is as much PaaS as Azure
• They are cheaper than classic clusters unless the latter are 100% utilized
BPM = Business Process management
IaaS: Hardware e.g. Server
PaaS: Systems Services e.g. MapReduce, Database
SaaS: Applications e.g. Recommender System, Clustering
BPaaS: Particular Application Set
Research Model
4th Paradigm; From Theory to Data
driven science?
http://www.wired.com/wired/issue/16-07 September 2008
The 4 paradigms of Scientific Research
1. Theory
2. Experiment or Observation
• E.g. Newton observed apples falling to design his theory of
mechanics
3. Simulation of theory or model (Supercomputers)
4. Data-driven (Big Data) or The Fourth Paradigm: Data-Intensive
Scientific Discovery (aka Data Science)
• http://research.microsoft.com/en-us/collaboration/fourthparadigm/ A free book
• More data; fewer models
Anand Rajaraman is Senior Vice President at Walmart Global
eCommerce, where he heads up the newly created
@WalmartLabs,
More data usually beats better algorithms
Here's how the competition works. Netflix has provided a large
data set that tells you how nearly half a million people have rated
about 18,000 movies. Based on these ratings, you are asked to
predict the ratings of these users for movies in the set that they
have not rated. The first team to beat the accuracy of Netflix's
proprietary algorithm by a certain margin wins a prize of $1
million!
Different student teams in my class adopted different approaches
to the problem, using both published algorithms and novel ideas.
Of these, the results from two of the teams illustrate a broader
point. Team A came up with a very sophisticated algorithm using
the Netflix data. Team B used a very simple algorithm, but they
added in additional data beyond the Netflix set: information
about movie genres from the Internet Movie Database (IMDB).
Guess which team did better?
http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
20120117berkeley1.pdf Jeff Hammerbacher
Data Science Process
DIKW Process
• Data becomes
• Information becomes
• Knowledge becomes
• Wisdom or Decisions
– Community acceptance of results or approach
important here
– Volume of bits&bytes decreases as we proceed
down DIKW pipeline
Workflow through multiple filter/discovery clouds
Raw Data → Data → Information → Knowledge → Wisdom → Decisions
SS: Sensor or Data Interchange Service
Diagram components: Database, Storage Cloud, Compute Cloud, Filter Clouds, Discovery Clouds, Distributed Grid, Hadoop Cluster, Another Cloud, Another Service, Another Grid
Data Deluge is also Information/Knowledge/Wisdom/Decision Deluge?
Example of Google Maps/Navigation
• Data comes from traditional maps (US
Geological Survey), Satellites (overlays) and
street cams
• Information is presented by basic Google
Maps web page
• Knowledge is a particular optimized route
• Decisions (Wisdom) come from deciding to
drive a particular route
Physics-Informatics
Looking for Higgs Particle
with Large Hadron Collider LHC
The LHC produces some 15 petabytes of data per year of all varieties, with the exact value
depending on the duty factor of the accelerator (which is reduced simply to cut electricity cost but also
due to malfunction of one or more of the many complex systems) and experiments. The raw
data produced by experiments is processed on the LHC Computing Grid, which has some
200,000 cores arranged in a three-level structure. Tier-0 is CERN itself, Tier-1 are national
facilities and Tier-2 are regional systems. For example, one LHC experiment (CMS) has 7 Tier-1
and 50 Tier-2 facilities.
This analysis (raw data → reconstructed data → AOD
and TAGS → Physics) is performed on the multi-tier
LHC Computing Grid. Note that every event can be
analyzed independently so that many events can be
processed in parallel with some concentration
operations such as those to gather entries in a
histogram. This implies that both Grid and Cloud
solutions work with this type of data, with Grids
currently being the only implementation.
Higgs Event
http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf
Note LHC lies in a tunnel 27 kilometres (17 mi) in circumference
ATLAS Expt
http://www.interactions.org/cms/?pid=1032811
The inside of the RHIC (Relativistic Heavy Ion
Collider) tunnel, a 2.4-mile high-tech particle
racetrack at Brookhaven National Laboratory.
Model
http://www.quantumdiaries.org/2012/09/07/why-particle-detectors-need-a-trigger/atlasmgg/
Personal Note
• As a naïve undergraduate in 1964, I was told by a professor, who later left the
university to enter the church, that bumps like those pictured
were particles. I was amazed and found this more intriguing than anything else I
had heard about so I decided to do a PhD in particle physics.
• I later decided computing was moving faster than physics, so I went into Informatics
• Also I was alarmed by size and time scale of physics activities
• Note ATLAS is 45 metres long, 25 metres in diameter, and weighs about 7,000
tons. The experiment is a collaboration involving roughly 3,000 physicists at 175
institutions in 38 countries
• US version of LHC, Superconducting Super Collider (SSC) discussed in 1983 was
cancelled in 1993 after $2B spent
http://www.sciencedirect.com/science/article/pii/S037026931200857X
Recommender Systems
Overview of Many Informatics Areas
• In many cases, one needs personalized matching of
items to people or perhaps collections of items to
collections of people
• People to products: Online and Offline Commerce
• People to People: Social Networking
• People to Jobs or Employers: Job Sites
• People+Queries to the Web: Information Retrieval
(search as in Bing/Google)
Recommender Systems in more detail
• A large number of online and offline commerce activities plus
basic Internet site personalization rely on “recommender
systems”
• Given real-time action by user, immediately suggest new actions
(as in Amazon buy recommendations on web)
• Based on past actions of users (and others) suggest movies to look
at, restaurants to eat at, events to go to, books and music to buy
• Based on mix of explicit user choice and grouping of internet sites,
present customized Google News page
• Given sales statistics, decide on discounts at “real” supermarkets
and placement of related (by analysis of buying habits) products
• Identify possible colleagues at Social Networking sites like
LinkedIn
• Identify matches between employers and employees at sites like
CareerBuilder and Monster
Everything is an Optimization Problem?
• Fit Model to Data
– Higgs + Background
• Match User to Jobs or Books or Other Users?
• Classification is optimizing assignment of members of an
ontology (list of categories) to data
• Typically minimize some function (or maximize negative of
function)
– Interesting feature of these problems is ingenious choice of
function
– Note Physics minimizes (free) energy
• Often involves thinking of people and/or items as points in
a space (not always a traditional vector space)
– Space called “bag” in “bag of words” model for information
retrieval
http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial
Netflix on Personalization
Netflix on Recommendations
April 2013: The last two quarters have each brought more than 2 million new
streaming subscriber signups. That gives Netflix a current total of nearly 29.2 million
subscribers
http://www.ifi.uzh.ch/ce/teaching/spring2012/16-Recommender-Systems_Slides.pdf
Note Netflix and others run tests all the time on subsets of customers
Netflix on Data Science
Distances in Funny Spaces I
• In user-based collaborative filtering, we can think of users in a space
of dimension N where there are N items and M users.
– Let i run over items and u over users
• Then each user is represented as a vector U_i(u) in “item-space”
where ratings are vector components. We are looking for users u, u’
that are near each other in this space as measured by some distance
between U_i(u) and U_i(u’)
• If u and u’ rate all items then these are “real” vectors but almost
always each rates only a small fraction of items and the number
in common is even smaller
• The “Pearson coefficient” is just one distance measure that can be
used
– Only sum over i rated by both u and u’
Last.fm uses this for songs as does Amazon, Netflix
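The Pearson coefficient restricted to co-rated items can be sketched in Python as below. This is an illustration and not the method of any particular site. The function name, the dict-based rating format and the choice to take means over the shared items only are all assumptions. The coefficient is a similarity in [-1, 1], so a distance could be taken as, for example, 1 minus this value.

```python
from math import sqrt


def pearson(ratings_u, ratings_v):
    """Pearson correlation between two users over the items both have rated.

    ratings_u, ratings_v: dicts mapping item id -> rating.
    Returns 0.0 when fewer than two items are shared or a variance is zero.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    n = len(common)
    if n < 2:
        return 0.0
    # Means are taken over the shared items only (one common convention).
    mean_u = sum(ratings_u[i] for i in common) / n
    mean_v = sum(ratings_v[i] for i in common) / n
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / sqrt(var_u * var_v)


# Hypothetical ratings: only items "a", "b" and "c" are rated by both users.
alice = {"a": 5, "b": 3, "c": 1, "d": 4}
bob = {"a": 4, "b": 2, "c": 0, "e": 5}
similarity = pearson(alice, bob)
```

Here the two users rank the shared items identically, so the coefficient is 1 even though their absolute ratings differ.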
Distances in Funny Spaces II
• In item-based collaborative filtering, we can think of items in a space
of dimension M where there are N items and M users.
– Let i run over items and u over users
• Then each item is represented as a vector R_u(i) in “user-space”
where ratings are vector components. We are looking for items i, i’
that are near each other in this space as measured by some distance
between R_u(i) and R_u(i’)
• If i and i’ are rated by all users then these are “real” vectors but almost
always they are each only rated by a small fraction of users and the
number in common is even smaller
• The “Cosine measure” is just one distance measure that can be used
– Only sum over users u rating both i and i’
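A minimal Python sketch of the cosine measure for item-based filtering follows. The function name and the dict format are assumptions. Following the note above, both the dot product and the norms are summed only over users who rated both items, which is one of several conventions in use.

```python
from math import sqrt


def cosine_similarity(item_i, item_j):
    """Cosine of the angle between two item vectors in "user-space".

    item_i, item_j: dicts mapping user id -> rating of that item.
    Only users who rated both items contribute to the sums.
    """
    common = set(item_i) & set(item_j)
    if not common:
        return 0.0
    dot = sum(item_i[u] * item_j[u] for u in common)
    norm_i = sqrt(sum(item_i[u] ** 2 for u in common))
    norm_j = sqrt(sum(item_j[u] ** 2 for u in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


# Hypothetical ratings: users "u1" and "u2" rated both movies.
movie_x = {"u1": 4, "u2": 2, "u3": 5}
movie_y = {"u1": 2, "u2": 1, "u4": 3}
similarity = cosine_similarity(movie_x, movie_y)
```

The shared ratings (4, 2) and (2, 1) are parallel vectors, so the cosine is 1 up to floating-point rounding.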
Distances in Funny Spaces III
• In content based recommender systems, we can think of items in a
space of dimension M where there are N items and M properties.
– Let i run over items and p over properties
• Then each item is represented as a vector P_p(i) in “property-space”
where values of properties are vector components. We are looking
for items i, i’ that are near each other in this space as measured by
some distance between P_p(i) and P_p(i’)
• Properties could be “reading age” or “character highlighted” or
“author” for books
• Properties can be genre or artist for songs and video
• Properties can characterize pixel structure for images used in face
recognition, driving etc.
Pandora uses this for songs (Music Genome) as does Amazon, Netflix
Do we need “real” spaces?
• Much of (eCommerce/LifeStyle) Informatics involves “points”
– Events in LHC analysis
– Users (people) or items (books, jobs, music, other people)
• These points can be thought of as being in a “space” or “bag”
– Set of all books
– Set of all physics reactions
– Set of all Internet users
• However as in recommender systems where a given user
only rates some items, we don’t know “full position”
• However we can nearly always define a useful distance
d(a,b) between points
• Always d(a,b) >= 0
• Usually d(a,b) = d(b,a)
• Rarely d(a,b) + d(b,c) >= d(a,c) (Triangle Inequality)
Using Distances
• The simplest way to use distances is “nearest
neighbor algorithms” – given one point, find a set
of points near it – cut off by number of identified
nearby points and/or distance to initial point
– Here point is either user or item
• Another approach is to divide the space into regions
(topics, latent factors) consisting of nearby points
– This is clustering
– Other algorithms such as Gaussian mixture models,
Latent Semantic Analysis or Latent Dirichlet Allocation
use a more sophisticated model
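The nearest neighbor idea needs only a distance function, not a full vector space. The Python sketch below is a brute-force illustration with both cut-offs mentioned above (number of neighbours and maximum distance). All names are hypothetical, and the distance used, mean absolute rating difference on shared items, is just one possible choice.

```python
def nearest_neighbors(query, points, distance, k=None, max_distance=None):
    """Return (distance, name) pairs for the points closest to `query`.

    points: dict mapping name -> point; distance: any function d(a, b) >= 0.
    k limits the number of neighbours; max_distance limits how far they may be.
    """
    scored = sorted((distance(query, p), name) for name, p in points.items())
    if max_distance is not None:
        scored = [(d, name) for d, name in scored if d <= max_distance]
    if k is not None:
        scored = scored[:k]
    return scored


def l1_on_shared(a, b):
    """Hypothetical distance: mean absolute rating difference on shared items."""
    shared = set(a) & set(b)
    if not shared:
        return float("inf")  # no overlap: treat as infinitely far apart
    return sum(abs(a[i] - b[i]) for i in shared) / len(shared)


users = {
    "bob": {"a": 5, "b": 1},
    "carol": {"a": 2, "c": 4},
    "dave": {"d": 3},
}
alice = {"a": 5, "b": 2, "c": 4}
neighbours = nearest_neighbors(alice, users, l1_on_shared, k=2, max_distance=2.0)
```

With these toy ratings the result is bob at distance 0.5 and carol at 1.5; dave shares no items with alice and is cut off by the distance limit.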
Web Search
Information Retrieval
“Web Data Analytics”
• Get the digital data (from web or from scanning)
• Need to crawl web (? Solved “engineering” problem)
• Preprocess data to get searchable things (words,
positions)
• Form Inverted Index mapping words to documents
• Typically use TF-IDF (term frequency, Inverse
Document frequency) to quantify importance of word
match
• Rank relevance of documents: PageRank
• Lots of technology for advertising, “reverse
engineering”, “preventing reverse engineering”
• Clustering of documents into topics (as in Google
News)
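The preprocessing, inverted index and TF-IDF steps listed above can be sketched on a toy corpus in Python. The documents, the function names and the particular TF-IDF variant (term frequency normalized by document length, logarithmic inverse document frequency) are assumptions; real search engines use many refinements, and PageRank is not included here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: document id -> text.
docs = {
    "d1": "big data on the cloud",
    "d2": "cloud computing for science",
    "d3": "big science data analysis",
}

# Preprocess: split each document into lowercase words.
tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}

# Inverted index: word -> {document id -> term frequency in that document}.
index = defaultdict(dict)
for doc_id, words in tokens.items():
    for word, count in Counter(words).items():
        index[word][doc_id] = count


def tf_idf(word, doc_id):
    """Term frequency times log inverse document frequency."""
    postings = index.get(word, {})
    if doc_id not in postings:
        return 0.0
    tf = postings[doc_id] / len(tokens[doc_id])
    idf = math.log(len(docs) / len(postings))
    return tf * idf


def search(query):
    """Rank documents by the summed TF-IDF score of the query words."""
    scores = Counter()
    for word in query.lower().split():
        for doc_id in index.get(word, {}):
            scores[doc_id] += tf_idf(word, doc_id)
    return [doc_id for doc_id, _ in scores.most_common()]
```

For the query "cloud computing", document d2 ranks ahead of d1 because it matches both words and "computing" is rarer in the corpus than "cloud".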
Size of face proportional to PageRank
Deepak Agarwal & Bee-Chung Chen @ ICML’11
Modern Recommendation Systems (from Yahoo)
• Goal (Function to Optimize – Long Term dollars)
– Serve the right item to a user in a given context to optimize long-
term business objectives
• A scientific discipline that involves
– Large scale Machine Learning & Statistics
• Offline Models (capture global & stable characteristics)
• Online Models (incorporates dynamic components)
• Explore/Exploit (active and adaptive experimentation)
– Multi-Objective Optimization
• Click-rates (CTR), Engagement, advertising revenue, diversity, etc
– Inferring user interest
• Constructing User Profiles
– Natural Language Processing to understand content
• Topics, “aboutness”, entities, follow-up of something, breaking news,…
http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
Recommend applications
Recommend search queries
Recommend news article
Recommend packages:
Image
Title, summary
Links to other pages
Pick 4 out of a pool of K
K = 20 ~ 40
Dynamic
Routes traffic to other pages
http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
Some examples from content optimization
• Simple version
– I have a content module on my page, content inventory is obtained
from a third party source which is further refined through editorial
oversight. Can I algorithmically recommend content on this
module? I want to improve overall click-rate (CTR) on this module
• More advanced
– I got X% lift in CTR. But I have additional information on other
downstream utilities (e.g. advertising revenue). Can I increase
downstream utility without losing too many clicks?
• Highly advanced
– There are multiple modules running on my webpage. How do I
perform a simultaneous optimization?
http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
Cloud Applications in Research
Science Computing Environments
• Large Scale Supercomputers – Multicore nodes linked by high
performance low latency network
– Increasingly with GPU enhancement
– Suitable for highly parallel simulations
• High Throughput Systems such as European Grid Initiative EGI or
Open Science Grid OSG typically aimed at pleasingly parallel jobs
– Can use “cycle stealing”
– Classic example is LHC data analysis
• Grids federate resources as in EGI/OSG or enable convenient access
to multiple backend systems including supercomputers
• Use Services (SaaS)
– Portals make access convenient and
– Workflow integrates multiple processes into a single job
Clouds HPC and Grids
• Synchronization/communication Performance
Grids > Clouds > Classic HPC Systems
• Clouds naturally execute Grid workloads effectively but are a less
clear fit for closely coupled HPC applications
• Classic HPC machines as MPI engines offer highest possible
performance on closely coupled problems
• The 4 forms of MapReduce/MPI
1) Map Only – pleasingly parallel
2) Classic MapReduce as in Hadoop; single Map followed by reduction with
fault tolerant use of disk
3) Iterative MapReduce used for data mining such as Expectation Maximization
in clustering etc.; caches data in memory between iterations and supports the
large collective communication (Reduce, Scatter, Gather, Multicast) used in
data mining
4) Classic MPI! Supports small point to point messaging efficiently as used in
partial differential equation solvers
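The iterative MapReduce pattern of item 3 can be illustrated with a single-process Python sketch. K-means on one-dimensional data is used here as a simpler stand-in for Expectation Maximization. The function names are hypothetical, and no Hadoop or MPI runtime is involved: the list comprehension stands in for parallel maps, and the data stays in memory between iterations.

```python
from collections import defaultdict


def kmeans_map(point, centers):
    """Map: emit (index of nearest center, (point, 1)) for one data point."""
    nearest = min(range(len(centers)), key=lambda c: abs(point - centers[c]))
    return nearest, (point, 1)


def kmeans_reduce(pairs, centers):
    """Reduce: sum points and counts per center, then return the new centers."""
    sums = defaultdict(lambda: [0.0, 0])
    for center_id, (value, count) in pairs:
        sums[center_id][0] += value
        sums[center_id][1] += count
    # A center that attracted no points keeps its old position.
    return [
        sums[c][0] / sums[c][1] if c in sums else centers[c]
        for c in range(len(centers))
    ]


def iterative_kmeans(points, centers, iterations=10):
    """Repeat map + reduce; `points` stay in memory between iterations."""
    for _ in range(iterations):
        pairs = [kmeans_map(p, centers) for p in points]  # independent maps
        new_centers = kmeans_reduce(pairs, centers)       # collective step
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = iterative_kmeans(points, centers=[0.0, 5.0])
```

On this toy data the centers settle at 2.0 and 11.0 after two updates. In a classic MapReduce run each iteration would reread the points from disk, which is the cost that iterative variants avoid by caching.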
What Applications work in Clouds
• Pleasingly (moving to modestly) parallel applications of all sorts
with roughly independent data or spawning independent
simulations
– Long tail of science and integration of distributed sensors
• Commercial and Science Data analytics that can use MapReduce
(some of such apps) or its iterative variants (most other data
analytics apps)
• Which science applications are using clouds?
– Venus-C (Azure in Europe): 27 applications not using Scheduler,
Workflow or MapReduce (except roll your own)
– 50% of applications on FutureGrid are from Life Science
– Locally Lilly corporation is commercial cloud user (for drug
discovery) but not IU Biology
• But overall very little science use of clouds yet
Parallelism over Users and Usages
• “Long tail of science” can be an important usage mode of clouds.
• In some areas like particle physics and astronomy, i.e. “big science”,
there are just a few major instruments generating now petascale
data driving discovery in a coordinated fashion.
• In other areas such as genomics and environmental science, there
are many “individual” researchers with distributed collection and
analysis of data whose total data and processing needs can match
the size of big science.
• Clouds can provide convenient, scalable resources for this important
aspect of science.
• Can be map only use of MapReduce if different usages naturally
linked e.g. exploring docking of multiple chemicals or alignment of
multiple DNA sequences
– Collecting together or summarizing multiple “maps” is a simple Reduction
Internet of Things and the Cloud
• It is projected that there will be 24-75 billion devices on the Internet
by 2020. Most will be small sensors that send streams of information
into the cloud where it will be processed and integrated with other
streams and turned into knowledge that will help our lives in a
multitude of small and big ways.
• The cloud will become increasingly important as a controller of and
resource provider for the Internet of Things.
• As well as today’s use for smart phone and gaming console support,
“Intelligent River” “smart homes and grid” and “ubiquitous cities”
build on this vision and we could expect a growth in cloud
supported/controlled robotics.
• Some of these “things” will be supporting science
• Natural parallelism over “things”
• “Things” are distributed and so form a Grid
Sensors (Things) as a Service
Sensors as a Service
Sensor Processing as a Service (could use MapReduce)
A larger sensor ………
Output Sensor
https://sites.google.com/site/opensourceiotcloud/ Open Source Sensor (IoT) Cloud
[Slides 92-94: charts from Meeker/Wu, Internet Trends, D11 Conference, May 29 2013]
Parallel Computing and
MapReduce
Classic Parallel Computing
• HPC: Typically SPMD (Single Program Multiple Data) “maps”, usually
processing particles or mesh points, interspersed with a multitude of
low-latency messages supported by specialized networks such as
InfiniBand and technologies like MPI
– Often run large capability jobs with 100K (going to 1.5M) cores on the same job
– National DoE/NSF/NASA facilities run at 100% utilization
– Fault fragile and cannot tolerate “outlier maps” taking longer than others
• Clouds: MapReduce has asynchronous maps, typically processing data
points, with results saved to disk. A final reduce phase integrates results
from different maps
– Fault tolerant and does not require map synchronization
– Map-only is a useful special case
• HPC + Clouds: Iterative MapReduce caches results between
“MapReduce” steps and supports SPMD parallel computing with
large messages, as seen in parallel kernels (linear algebra) in clustering
and other data mining
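The iterative MapReduce idea above can be sketched with a one-dimensional k-means clustering loop. This is an illustration under stated assumptions, not the runtime used in the talk: the data partitions stay cached in memory between iterations, each map emits partial sums for its partition, and the reduce combines them into new centroids. All function names and the sample data are hypothetical.

```python
def kmeans_map(partition, centroids):
    """Map: assign each cached point to its nearest centroid; emit partial [sum, count]."""
    partial = {i: [0.0, 0] for i in range(len(centroids))}
    for x in partition:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        partial[nearest][0] += x
        partial[nearest][1] += 1
    return partial


def kmeans_reduce(partials, centroids):
    """Reduce: combine the partial sums from all maps into new centroids."""
    updated = []
    for i, old in enumerate(centroids):
        total = sum(p[i][0] for p in partials)
        count = sum(p[i][1] for p in partials)
        # Keep the old centroid if no point was assigned to it.
        updated.append(total / count if count else old)
    return updated


def iterative_mapreduce(partitions, centroids, max_iters=20, tol=1e-9):
    """Repeat map and reduce over the same cached partitions until centroids settle."""
    for _ in range(max_iters):
        partials = [kmeans_map(part, centroids) for part in partitions]
        updated = kmeans_reduce(partials, centroids)
        if max(abs(a - b) for a, b in zip(updated, centroids)) < tol:
            return updated
        centroids = updated
    return centroids


# Hypothetical data: two partitions, as if held by two map workers.
PARTITIONS = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
final = iterative_mapreduce(PARTITIONS, centroids=[0.0, 5.0])
print(final)
```

Only the small centroid list moves between steps, while the large point data never leaves its partition; that is the property that caching between “MapReduce” steps is meant to exploit.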
MapReduce “File/Data Repository” Parallelism
[Figure: diagram with labels “Instruments”, “Disks”, “Map1 Map2 Map3”, “Communication”, “Reduce”, “Portals/Users”, and “Iterative MapReduce” (repeated Map and Reduce stages)]
Map = (data parallel) computation reading and writing data
Reduce = Collective/Consolidation phase, e.g. forming multiple
global sums as in a histogram
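As a small illustration of the Map and Reduce definitions above, the following sketch builds a histogram: each map computes partial bin counts over its own data partition, and the reduce forms one global sum per bin. The partitions, bin width, and function names are assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def histogram_map(partition, bin_width):
    """Map: build a partial histogram over one data partition."""
    return Counter(int(x // bin_width) for x in partition)


def histogram_reduce(partials):
    """Reduce: add the partial counts bin by bin (one global sum per bin)."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical data split across three "disks".
PARTITIONS = [[0.5, 1.2, 3.7], [1.9, 2.1, 3.3], [0.1, 3.9]]

partials = [histogram_map(p, bin_width=1.0) for p in PARTITIONS]
histogram = histogram_reduce(partials)
print(sorted(histogram.items()))
```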
Sam’s Problem
http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
• Sam thought of “drinking” the apple
– He used a [image] to cut the [image] and a [image] to make juice.
Creative Sam
• Implemented a parallel version of his innovation
Each input to a map is a list of <key, value> pairs,
e.g. (<a, [image]>, <o, [image]>, <p, [image]>, …)
Each output of slice is a list of <key, value> pairs,
e.g. (<a’, [image]>, <o’, [image]>, <p’, [image]>)
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a
list of these, depending on the grouping/hashing
mechanism), e.g. <ao, ([image] …)>
Reduced into a list of values
The idea of MapReduce in Data Intensive Computing
A list of <key, value> pairs is mapped into another
list of <key, value> pairs, which gets grouped by
the key and reduced into a list of values
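The map, group, and reduce steps just described can be sketched in a few lines. This is a single-process illustration of the data flow only, assuming hypothetical `slice_fruit` and `blend` functions that echo the fruit example; a real MapReduce runtime would distribute the map and reduce calls across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each <key, value> pair, group by key, then reduce each group."""
    # Map: each input pair yields zero or more output pairs.
    mapped = [pair for key, value in records for pair in mapper(key, value)]
    # Group: collect the values that share a key into a value-list.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: each <key, value-list> becomes one output value.
    return [reducer(key, values) for key, values in sorted(groups.items())]


def slice_fruit(key, fruit):
    """Hypothetical mapper: cut one fruit into two slices destined for the juice."""
    return [("juice", f"{fruit}-slice") for _ in range(2)]


def blend(key, slices):
    """Hypothetical reducer: combine all slices for a key into one value."""
    return f"{key} from {len(slices)} slices"


FRUIT = [("a", "apple"), ("o", "orange"), ("p", "pear")]
output = map_reduce(FRUIT, slice_fruit, blend)
print(output)
```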
Data Science Education
Opportunities at Universities
see recent New York Times articles
http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/
Data Science Education
• Broad range of topics, from policy and curation to
applications and algorithms, programming models,
data systems, statistics, and a broad range of CS
subjects such as Clouds, Programming, and HCI
• Plenty of jobs and a broader range of possibilities
than computational science, but similar cosmic
issues
– What type of degree (Certificate, minor, track, “real”
degree)
– What implementation (department, interdisciplinary
group supporting education and research program)
At Indiana University
• Have a proposal to set up certificates and a Masters
degree in data science
• Joint between 3 units in the School of Informatics and
Computing (Computer Science, Informatics,
Information & Library Science) and the Statistics
department in COAS College
• Looking at a version with Kelley with a business data
analytics flavor
• Attractive to offer online, as few universities have
this and so there is a potentially large audience outside IU
[Slides 103-104: charts from Meeker/Wu, Internet Trends, D11 Conference, May 29 2013]
Massive Open Online Courses (MOOC)
• MOOC’s are very “hot” these days with Udacity and Coursera as
start-ups; perhaps over 100,000 participants
• Relevant to Data Science as this is a new field with few courses at
most universities
• Typical model is a collection of short prerecorded segments (talking
head over PowerPoint) of length 3-15 minutes
– This is the boredom limit: http://blog.coursera.org/post/49750392396/on-the-topic-of-boredom
• These “lesson objects” can be viewed as “songs”
• Google Course Builder (python open source) builds customizable
MOOC’s as “playlists” of “songs”
• Course Builder tells you to capture all material as “lesson objects”
• We are aiming to build a repository of many “songs”, used in many
ways: tutorials, classes …
http://x-informatics.appspot.com/course
Complete end of July
• Seven ~10-minute lesson objects in this lecture
• IU wants us to add closed captions if used in a real course
Customizable MOOC’s I
• We could teach one class to 100,000 students or 2,000
classes to 50 students
• The 2,000-class choice has several useful features
– One can use the usual (electronic) mentoring/grading technology
– One can customize each of the 2,000 classes for a particular audience
given their level and interests
– One can even allow students to customize; that’s what one does
in making playlists in iTunes
• Both models can be supported by a repository of lesson
objects (10-15 minute video segments) in the cloud
• The teacher can choose from existing lesson objects and
add their own to produce a new customized course with
new lessons contributed back to repository
Science Cloud MOOC Repository
http://iucloudsummerschool.appspot.com/preview
Unit ~1 hour with ~6 lessons,
Total 115 lesson objects
Customizable MOOC’s II
• The 3-15 minute video-over-PowerPoint format of MOOC lesson
objects is easy to re-use
• Qiu (IU) and Hayden (ECSU, Elizabeth City State University,
a small Historically Black University (HBCU)) will customize a
module
– Starting with Qiu’s cloud computing course at IU
– Adding material on use of Cloud Computing in Remote Sensing
(area covered by ECSU course)
• This is a model for adding cloud curricula material to a wide
set of universities where faculty are not able to teach it
• Defining how to support computing labs associated with
MOOC’s with clouds or VM’s on clients
– Appliances scale as they are downloaded to the student’s client
[Figure callouts:]
• Can of course build many different interfaces
• Songs stored on YouTube
• Songs prepared with Adobe Presenter on laptop
http://cloudmooc.soic.indiana.edu/
Two limits where MOOC’s are Compelling
• High volume courses (CS/Ph/Chem/Bio101…)
where the scalability of MOOC’s makes them
attractive to reach a lot of students
• Niche areas where there is some student
interest but either no faculty expertise or not
enough students to justify traditional courses
– Offer to many institutions simultaneously
[Slide 115: chart from Meeker/Wu, Internet Trends, D11 Conference, May 29 2013]
Conclusions
• Clouds are here to stay and one should plan on exploiting them
• Data Intensive studies in business and research continue to grow in
importance
– Data Analytics: Everything is an optimization problem in a funny space
• Growing employment opportunities in clouds and data related
activities and so popular with students
– Enabling many of the most important companies from Facebook/Google to
General Electric
• Need community discussion of data science education
– Agree on curricula; is such a degree attractive?
• MOOC’s interesting for
– Disseminating new curricula
– Managing course fragments that can be assembled into custom
courses for particular interdisciplinary students
Big Data Ecosystem in One Sentence
Use Clouds running Data Analytics
Collaboratively processing Big Data to solve
problems in X-Informatics educated in data
science
X = Astronomy, Biology, Biomedicine, Business, Chemistry,
Climate, Crisis, Earth Science, Energy, Environment, Finance,
Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology,
Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and
Wellness with more fields (physics) defined implicitly
Spans Industry and Science (research)

Matching Data Intensive Applications and Hardware/Software Architectures
 
Big Data and Clouds: Research and Education
Big Data and Clouds: Research and EducationBig Data and Clouds: Research and Education
Big Data and Clouds: Research and Education
 
Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...Comparing Big Data and Simulation Applications and Implications for Software ...
Comparing Big Data and Simulation Applications and Implications for Software ...
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 



Big Data Analytics and Cloud Computing Drive Future Economy

  • 1. https://portal.futuregrid.org Big Data Applications & Analytics Motivation: Big Data and the Cloud; Centerpieces of the Future Economy January 5 2014 Geoffrey Fox gcf@indiana.edu http://www.infomall.org School of Informatics and Computing Digital Science Center Indiana University Bloomington
  • 3. Abstract
      • There is an endlessly growing amount of data as we record every transaction between people and the environment (whether shopping or on a social networking site), while smart phones, smart homes, ubiquitous cities, smart power grids, and intelligent vehicles deploy sensors recording even more.
      • Science with satellites and accelerators is giving data on transactions of particles and photons at the microscopic scale.
      • These data are and will be stored in immense clouds with co-located storage and computing that perform "analytics" that transform data into information and then to wisdom and decisions; data mining finds the proverbial knowledge diamonds in the data rough.
      • This disruptive transformation is driving the economy and creating millions of jobs in the emerging area of "data science".
      • We discuss this revolution and its implications for universities and society.
  • 4. Some Trends
      • The Data Deluge is a clear trend from Commercial (Amazon, e-commerce), Community (Facebook, Search) and Scientific applications
      • Smaller (INTEL/ARM/AMD) chips drive multicore (i.e. more computing) on shared servers
      • Smaller lightweight clients, from smartphones and tablets to sensors (i.e. more clients)
      • Clouds with cheaper, greener, easier to use IT for applications
      • New jobs associated with new curricula
          – Clouds as a distributed system (changing a classic CS course)
          – Data Science (new area)
  • 5. 48 technologies are listed in this year’s hype cycle, the highest number in the last ten years; 2008 had the lowest (27). Gartner says: “We are at an interesting moment, a time when the scenarios we’ve been talking about for a long time are almost becoming reality.”
  • 6. Private Cloud Computing is off the chart http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
  • 9. Note number of “analytics” areas http://public.brighttalk.com/resource/core/19507/august_21_hype_cycle_fenn_lehong_29685.pdf
  • 10. Issues of Importance
      • Economic Imperative: There are a lot of data and a lot of jobs
      • Computing Model: Industry adopted clouds which are attractive for data analytics
      • Research Model: 4th Paradigm; From Theory to Data driven science?
      • Research/Business opportunities in advancing computing technologies and algorithms
      • Research/Business opportunities in X-Informatics: applying 4th paradigm (more here!)
      • Development in Data Science Education: opportunities at universities
  • 12. Meeker/Wu May 29 2013 Internet Trends D11 Conference
      • Zettabyte ~ 10^10 Typical Local Storage (100 Gigabytes)
      • Zettabyte = 1000 Exabytes
      • Exabyte = 1000 Petabytes
      • Petabyte = 1000 Terabytes
      • Terabyte = 1000 Gigabytes
      • Gigabyte = 1000 Megabytes
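The unit ladder on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units as listed on the slide; the names `BYTES` and `ratio` are hypothetical helpers, not part of the talk.

```python
# Decimal (SI) storage units as listed on the slide: each step is a factor of 1000.
UNITS = ["megabyte", "gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte"]
BYTES = {name: 1000 ** (i + 2) for i, name in enumerate(UNITS)}  # megabyte = 10^6 bytes


def ratio(big, small, count=1):
    """Number of lots of `count` `small` units that fit in one `big` unit."""
    return BYTES[big] // (count * BYTES[small])


# "Typical local storage" is taken as 100 gigabytes, as on the slide.
local_disks_per_zettabyte = ratio("zettabyte", "gigabyte", count=100)
print(f"{local_disks_per_zettabyte:.0e} local disks per zettabyte")
```

Under these assumptions one zettabyte corresponds to 10^10 local disks of 100 gigabytes each, which is the figure quoted on the slide.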
  • 13. Meeker/Wu May 29 2013 Internet Trends D11 Conference 20 hours
  • 14. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 16. “Taming the Big Data Tidal Wave” 2012 (Bill Franks, Chief Analytics Officer Teradata)
      • Web Data (“the original big data”)
          – Analyze customer web browsing of e-commerce site to see topics looked at etc.
      • Auto Insurance (telematics monitoring driving)
          – Equip cars with sensors
      • Text data in multiple industries
          – Sentiment analysis, identify common issues (as in eBay lamp example), Natural Language processing
      • Time and location (GPS) data
          – Track trucks (delivery), vehicles (track), people (tell them nearby goodies)
      • Retail and manufacturing: RFID
          – Asset and inventory management
      • Utility industry: Smart Grid
          – Sensors allow dynamic optimization of power
      • Gaming industry: Casino Chip tracking (RFID)
          – Track individual players, detect fraud, identify patterns
      • Industrial engines and equipment: sensor data
          – See GE engine
      • Video games: telemetry
          – This is like monitoring web browsing but rather monitor actions in a game
      • Telecommunication and other industries: Social Network data
          – Connections make this big data.
          – Use connections to find new customers with similar interests
  • 17. Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html
  • 18. Ruh VP Software GE http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html MM = Million
  • 19. Some Science/Technical Data sizes
      • LHC Particle Physics: 15 petabytes per year
      • Radiology: 69 petabytes per year
      • Square Kilometer Array Telescope will be 0.5 zettabytes per year raw data in ~2022
      • Earth Observation becoming ~4 petabytes per year
      • Earthquake Science: few terabytes total today
      • PolarGrid Radar studies of glaciers: 100s of terabytes/year
      • Exascale simulation data dumps: ~0.1 zettabyte per year
  • 20. Need cost effective Computing! Sequencing every newborn by 2019 would give 100 petabytes/year http://www.genome.gov/sequencingcosts/
  • 21. The Long Tail of Science
      • 80-20 rule: 20% of users generate 80% of the data but not necessarily 80% of the knowledge
      • Collectively, “long tail” science is generating a lot of data, estimated at over 1 PB per year and growing fast.
      • CSTI Meeting. October 2012 Dennis Gannon
  • 22. Data Intensive Activities (Bag = Space)
      • Particle Physics LHC (bag of events of particles)
      • Information Retrieval or web search (bag of words)
      • e-commerce (bag of items with properties or users with rankings)
      • Social Networking (bag of people with links & properties)
      • Health Informatics (bag of health records, gene sequences)
      • Sensors: web cams, self driving cars etc. (bag of pixels)
      • Using
          – Statistics (Histograms, Chisq)
          – Deep Learning (Machine Learning)
          – Image Analysis (including internet uploaded images)
          – Recommender Engines (Bag of Ratings or properties)
          – Patterns or Anomaly detection in graphs (linked data)
      • On Clouds using MapReduce etc.
  • 23. Big Data Ecosystem in One Sentence
      • Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X)
      • X = Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness, with more fields (physics) defined implicitly
      • Spans Industry and Science (research)
      • Education: Data Science; see recent New York Times articles http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/
  • 27. McKinsey Institute on Big Data Jobs
      • There will be a shortage of talent necessary for organizations to take advantage of big data. By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
      • Informatics is aimed at the 1.5 million jobs. Computer Science covers the 140,000 to 190,000.
      • http://www.mckinsey.com/mgi/publications/big_data/index.asp.
  • 28. Tom Davenport Harvard Business School http://fisheritcenter.haas.berkeley.edu/Big_Data/index.html Nov 2012
  • 29. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 30. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 32. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 33. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 34. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 35. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 36. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 37. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 38. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 39. Meeker/Wu May 29 2013 Internet Trends D11 Conference
  • 40. Computing Model: Industry adopted clouds which are attractive for data analytics
  • 41. Cloud Computing (for the last 5 years) and Big Data (for the last 2 years) are Transformational. Note that in 2013 Big Data moves to the 5-10 year slot.
  • 42. Amazon Cloud AWS making money
      • It took Amazon Web Services (AWS) eight years to hit $650 million in revenue, according to Citigroup in 2010.
      • Just three years later, Macquarie Capital analyst Ben Schachter estimates that AWS will top $3.8 billion in 2013 revenue, up from $2.1 billion in 2012 (estimated), valuing the AWS business at $19 billion.
      • First public cloud computing supplier, building on many cloud systems used to run Amazon, Google, Bing, eBay ….
  • 43. Physically Clouds are Clear
      • A bunch of computers in an efficient data center with an excellent Internet connection
      • They were produced to meet the needs of public-facing Web 2.0 e-Commerce/Social Networking sites
      • They can be considered as an “optimal giant data center” plus an Internet connection
      • Note that enterprises use private clouds, which are giant data centers but not optimized for Internet access
  • 44. The Microsoft Cloud is Built on Data Centers
      • Quincy, WA; Chicago, IL; San Antonio, TX; Dublin, Ireland; Generation 4 DCs
      • ~100 Globally Distributed Data Centers
      • Range in size from “edge” facilities to megascale (100K to 1M servers)
      • Build giant data centers with 100,000s of computers; ~200-1000 to a shipping container with Internet access
      • CSTI Meeting. October 2012 Dennis Gannon
  • 45. Data Centers Clouds & Economies of Scale
      • Range in size from “edge” facilities to megascale.
      • Economies of scale: approximate costs for a small size center (1K servers) and a larger, 50K server center.

      Technology       Cost in small-sized Data Center   Cost in Large Data Center     Ratio
      Network          $95 per Mbps/month                $13 per Mbps/month            7.1
      Storage          $2.20 per GB/month                $0.40 per GB/month            5.7
      Administration   ~140 servers/Administrator        >1000 servers/Administrator   7.1

      • 2 Google warehouses of computers on the banks of the Columbia River, in The Dalles, Oregon. Each data center is 11.5 times the size of a football field.
      • Such centers use 20MW-200MW (Future) each with 150 watts per CPU
      • Save money from large size, positioning with cheap power and access with Internet
      • http://research.microsoft.com/en-us/people/barga/sc09_cloudcomp_tutorial.pdf
  • 46. Virtualization made several things more convenient
      • Virtualization = abstraction; run a job, you know not where
      • Virtualization = use hypervisor to support “images”
          – Allows you to define a complete job as an “image”: OS + application
      • Efficient packing of multiple applications into one server, as they don’t interfere (much) with each other if in different virtual machines
      • They interfere if put as two jobs in the same machine, as for example they must have the same OS and same OS services
      • Also the security model between VMs is more robust than between processes
  • 47. Microsoft Server Consolidation
      • http://research.microsoft.com/pubs/78813/AJ18_EN.pdf
      • Typical data center CPU has 9.75% utilization
      • Take 5000 SQL servers and rehost on virtual machines with 6:1 consolidation
      • 60% saving
  • 48. The Google gmail example
      • http://www.google.com/green/pdfs/google-green-computing.pdf
      • Clouds win by efficient resource use and efficient data centers

      Business Type   Number of users   # servers   IT Power per user   PUE (Power Usage Effectiveness)   Total Power per user   Annual Energy per user
      Small           50                2           8W                  2.5                               20W                    175 kWh
      Medium          500               2           1.8W                1.8                               3.2W                   28.4 kWh
      Large           10000             12          0.54W               1.6                               0.9W                   7.6 kWh
      Gmail (Cloud)                                 < 0.22W             1.16                              < 0.25W                < 2.2 kWh
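The columns of the Gmail table are related by simple arithmetic, which the following sketch reproduces. It uses the standard definition of PUE (total facility power divided by IT power) and assumes, as an illustration, that the annual energy column is the total power drawn continuously for 8760 hours; the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years (assumption)


def total_power_w(it_power_w, pue):
    """Total power per user: IT power scaled by Power Usage Effectiveness."""
    return it_power_w * pue


def annual_energy_kwh(power_w):
    """Energy used in one year by a constant load of `power_w` watts."""
    return power_w * HOURS_PER_YEAR / 1000.0


# (IT power per user in watts, PUE) for the first three rows of the table.
rows = {"Small": (8.0, 2.5), "Medium": (1.8, 1.8), "Large": (0.54, 1.6)}
for name, (it_w, pue) in rows.items():
    p = total_power_w(it_w, pue)
    print(name, round(p, 2), "W,", round(annual_energy_kwh(p), 1), "kWh/year")
```

Under these assumptions the computed values agree with the table to within rounding, which suggests that this is how the last two columns were derived.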
  • 49. Clouds Offer From different points of view
      • Features from NIST:
          – On-demand service (elastic);
          – Broad network access;
          – Resource pooling;
          – Flexible resource allocation;
          – Measured service
      • Economies of scale in performance and electrical power (Green IT)
      • Powerful new software models
          – Platform as a Service is not an alternative to Infrastructure as a Service; it is instead an incredible value added
          – Amazon is as much PaaS as Azure
      • They are cheaper than classic clusters unless the latter are 100% utilized
  • 50. BPM = Business Process Management
      • IaaS: Hardware, e.g. Server
      • PaaS: Systems Services, e.g. MapReduce, Database
      • SaaS: Applications, e.g. Recommender System, Clustering
      • BPaaS: Particular Application Set
  • 51. Research Model: 4th Paradigm; From Theory to Data driven science?
  • 53. The 4 paradigms of Scientific Research
      1. Theory
      2. Experiment or Observation
          • E.g. Newton observed apples falling to design his theory of mechanics
      3. Simulation of theory or model (Supercomputers)
      4. Data-driven (Big Data) or The Fourth Paradigm: Data-Intensive Scientific Discovery (aka Data Science)
          • http://research.microsoft.com/en-us/collaboration/fourthparadigm/ A free book
          • More data; fewer models
  • 54. Anand Rajaraman is Senior Vice President at Walmart Global eCommerce, where he heads up the newly created @WalmartLabs.
      More data usually beats better algorithms
      “Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. The first team to beat the accuracy of Netflix's proprietary algorithm by a certain margin wins a prize of $1 million! Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?”
      http://anand.typepad.com/datawocky/2008/03/more-data-usual.html
      20120117berkeley1.pdf Jeff Hammerbacher
  • 56. DIKW Process
      • Data becomes
      • Information becomes
      • Knowledge becomes
      • Wisdom or Decisions
          – Community acceptance of results or approach is important here
          – Volume of bits and bytes decreases as we proceed down the DIKW pipeline
  • 57. [Figure: Workflow through multiple filter/discovery clouds. Raw Data → Data → Information → Knowledge → Wisdom → Decisions. Components shown: Database, Storage Cloud, Compute Cloud, Filter Clouds, Discovery Clouds, Distributed Grid, Hadoop Cluster, Another Cloud, Another Service, Another Grid. SS: Sensor or Data Interchange Service.] Data Deluge is also Information/Knowledge/Wisdom/Decision Deluge?
  • 58. Example of Google Maps/Navigation
      • Data comes from traditional maps (US Geological Survey), Satellites (overlays) and street cams
      • Information is presented by the basic Google Maps web page
      • Knowledge is a particular optimized route
      • Decisions (Wisdom) come from deciding to drive a particular route
  • 60. The LHC produces some 15 petabytes of data per year of all varieties, with the exact value depending on the duty factor of the accelerator (which is reduced simply to cut electricity cost but also due to malfunction of one or more of the many complex systems) and experiments. The raw data produced by experiments is processed on the LHC Computing Grid, which has some 200,000 cores arranged in a three-level structure. Tier-0 is CERN itself, Tier-1 are national facilities and Tier-2 are regional systems. For example, one LHC experiment (CMS) has 7 Tier-1 and 50 Tier-2 facilities. This analysis (raw data → reconstructed data → AOD and TAGS → Physics) is performed on the multi-tier LHC Computing Grid. Note that every event can be analyzed independently, so that many events can be processed in parallel, with some concentration operations such as those to gather entries in a histogram. This implies that both Grid and Cloud solutions work with this type of data, with Grids being the only implementation today.
      Higgs Event; ATLAS Expt
      Note LHC lies in a tunnel 27 kilometres (17 mi) in circumference
      http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf
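The pattern described here, independent per-event analysis followed by a gathering step into a histogram, can be sketched as follows. This is an illustration only and not LHC software: the event records, the `mass` field and the bin width are hypothetical, and a thread pool stands in for the grid or cloud workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

BIN_WIDTH = 10.0  # hypothetical histogram bin width, arbitrary units


def analyze(event):
    """Per-event step: map one event to a histogram bin, independently of all others."""
    return int(event["mass"] // BIN_WIDTH)


def histogram(events, workers=4):
    """Analyze events in parallel, then gather the bin indices into one histogram."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bins = pool.map(analyze, events)  # independent work, any order of execution
        return Counter(bins)              # the "concentration" (gather) operation


# Hypothetical toy events; real events carry far richer records.
events = [{"mass": m} for m in (91.2, 125.3, 126.1, 88.7, 124.9)]
hist = histogram(events)
print(sorted(hist.items()))
```

Because no event depends on another, the same structure carries over to many machines; only the final gather needs communication.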
  • 61. http://www.interactions.org/cms/?pid=1032811 The inside of the RHIC (Relativistic Heavy Ion Collider) tunnel, a 2.4-mile high-tech particle racetrack at Brookhaven National Laboratory.
  • 63. Personal Note
      • As a naïve undergraduate in 1964, I was told by a Professor, who later left the university to enter the church, that bumps like [figure] were particles. I was amazed and found this more intriguing than anything else I had heard about, so I decided to do a PhD in particle physics.
      • I later decided computing was moving faster than physics, so I went into Informatics
      • Also I was alarmed by the size and time scale of physics activities
      • Note ATLAS is 45 metres long, 25 metres in diameter, and weighs about 7,000 tons. The experiment is a collaboration involving roughly 3,000 physicists at 175 institutions in 38 countries
      • The US version of the LHC, the Superconducting Super Collider (SSC) discussed in 1983, was cancelled in 1993 after $2B was spent
  • 66. Overview of Many Informatics Areas
      • In many cases, one needs personalized matching of items to people or perhaps collections of items to collections of people
      • People to products: Online and Offline Commerce
      • People to People: Social Networking
      • People to Jobs or Employers: Job Sites
      • People+Queries to the Web: Information Retrieval (search as in Bing/Google)
  • 67. Recommender Systems in more detail
      • A large number of online and offline commerce activities plus basic Internet site personalization rely on “recommender systems”
      • Given real-time action by user, immediately suggest new actions (as in Amazon buy recommendations on web)
      • Based on past actions of users (and others) suggest movies to look at, restaurants to eat at, events to go to, books and music to buy
      • Based on mix of explicit user choice and grouping of internet sites, present customized Google News page
      • Given sales statistics, decide on discounts at “real” supermarkets and placement of related (by analysis of buying habits) products
      • Identify possible colleagues at Social Networking sites like LinkedIn
      • Identify matches between employers and employees at sites like CareerBuilder and Monster
  • 68. Everything is an Optimization Problem? • Fit Model to Data – Higgs + Background • Match User to Jobs or Books or Other Users? • Classification is optimizing the assignment of members of an ontology (list of categories) to data • Typically minimize some function (or maximize the negative of the function) – An interesting feature of these problems is the ingenious choice of function – Note Physics minimizes (free) energy • Often involves thinking of people and/or items as points in a space (not always a traditional vector space) – The space is called a “bag” in the “bag of words” model for information retrieval
  • 71. http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial April 2013: The last two quarters have each brought more than 2 million new streaming subscriber signups. That gives Netflix a current total of nearly 29.2 million subscribers
  • 73. Netflix on Data Science http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial • Note Netflix and others run tests all the time on subsets of customers
  • 74. Distances in Funny Spaces I • In user-based collaborative filtering, we can think of users in a space of dimension N where there are N items and M users. – Let i run over items and u over users • Then each user is represented as a vector Ui(u) in “item-space” where ratings are vector components. We are looking for users u and u’ that are near each other in this space as measured by some distance between Ui(u) and Ui(u’) • If u and u’ rate all items then these are “real” vectors, but almost always each rates only a small fraction of items and the number in common is even smaller • The “Pearson coefficient” is just one distance (similarity) measure that can be used – Only sum over items i rated by both u and u’ • Last.fm uses this for songs, as do Amazon and Netflix
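The Pearson coefficient restricted to co-rated items can be sketched as follows. This is a minimal illustration, not any particular site's implementation: the function name `pearson`, the dictionary layout (item name to rating), the sample users, and the choice to return 0.0 when fewer than two items overlap are all assumptions made for the example.

```python
from math import sqrt

def pearson(ratings_u, ratings_v):
    """Pearson correlation between two users, summing only over co-rated items."""
    common = set(ratings_u) & set(ratings_v)
    n = len(common)
    if n < 2:
        return 0.0  # assumption: too little overlap is treated as "no evidence"
    mean_u = sum(ratings_u[i] for i in common) / n
    mean_v = sum(ratings_v[i] for i in common) / n
    cov = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    var_u = sum((ratings_u[i] - mean_u) ** 2 for i in common)
    var_v = sum((ratings_v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0  # a user who gives identical ratings has no defined correlation
    return cov / sqrt(var_u * var_v)

# Hypothetical users; each rates only some of the items.
alice = {"book_a": 5, "book_b": 3, "book_c": 1}
bob = {"book_a": 4, "book_b": 2, "book_c": 0, "book_d": 5}
carol = {"book_a": 1, "book_b": 3, "book_c": 5}

print(pearson(alice, bob))    # same tastes on the shared items
print(pearson(alice, carol))  # opposite tastes on the shared items
```

A value near +1 means the two users rank their shared items the same way, and a value near -1 means they rank them in opposite order. To use it as a distance, one common choice is 1 minus the coefficient.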
  • 75. Distances in Funny Spaces II • In item-based collaborative filtering, we can think of items in a space of dimension M where there are N items and M users. – Let i run over items and u over users • Then each item is represented as a vector Ru(i) in “user-space” where ratings are vector components. We are looking for items i and i’ that are near each other in this space as measured by some distance between Ru(i) and Ru(i’) • If i and i’ are rated by all users then these are “real” vectors, but almost always each is rated by only a small fraction of users and the number in common is even smaller • The “Cosine measure” is just one distance measure that can be used – Only sum over users u rating both i and i’
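The cosine measure over users who rated both items can be sketched in the same style. The function name `cosine`, the dictionary layout (user name to rating), and the sample songs are assumptions for illustration only.

```python
from math import sqrt

def cosine(item_i, item_j):
    """Cosine similarity of two items, summing only over users who rated both."""
    common = set(item_i) & set(item_j)
    if not common:
        return 0.0  # assumption: no shared raters means no measurable similarity
    dot = sum(item_i[u] * item_j[u] for u in common)
    norm_i = sqrt(sum(item_i[u] ** 2 for u in common))
    norm_j = sqrt(sum(item_j[u] ** 2 for u in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Hypothetical items; each is rated by only some of the users.
song_x = {"u1": 4, "u2": 2}
song_y = {"u1": 2, "u2": 1, "u3": 5}
song_z = {"u1": 1, "u2": 2}

print(cosine(song_x, song_y))  # ratings are proportional on the shared users
print(cosine(song_x, song_z))
```

Note that user u3 is ignored when comparing `song_x` and `song_y`, because only users who rated both items enter the sums.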
  • 76. Distances in Funny Spaces III • In content-based recommender systems, we can think of items in a space of dimension M where there are N items and M properties. – Let i run over items and p over properties • Then each item is represented as a vector Pp(i) in “property-space” where values of properties are vector components. We are looking for items i and i’ that are near each other in this space as measured by some distance between Pp(i) and Pp(i’) • Properties could be “reading age” or “character highlighted” or “author” for books • Properties can be genre or artist for songs and video • Properties can characterize pixel structure for images used in face recognition, driving etc. • Pandora uses this for songs (Music Genome), as do Amazon and Netflix
  • 77. Do we need “real” spaces? • Much of (eCommerce/LifeStyle) Informatics involves “points” – Events in LHC analysis – Users (people) or items (books, jobs, music, other people) • These points can be thought of as being in a “space” or “bag” – Set of all books – Set of all physics reactions – Set of all Internet users • As in recommender systems, where a given user only rates some items, we often don’t know the “full position” of a point • However, we can nearly always define a useful distance d(a,b) between points • Always d(a,b) >= 0 • Usually d(a,b) = d(b,a) • Rarely d(a,b) + d(b,c) >= d(a,c) (Triangle Inequality)
  • 78. Using Distances • The simplest way to use distances is “nearest neighbor algorithms” – given one point, find a set of points near it – cut off by the number of identified nearby points and/or the distance to the initial point – Here a point is either a user or an item • Another approach is to divide the space into regions (topics, latent factors) consisting of nearby points – This is clustering – There are also other algorithms, like Gaussian mixture models, Latent Semantic Analysis or Latent Dirichlet Allocation, which use a more sophisticated model
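A nearest neighbor search with both kinds of cut-off (number of neighbors and distance) can be sketched as below. This is a brute-force illustration under stated assumptions: the names `nearest_neighbors`, `k`, and `max_distance` are invented for the example, the sample points are made up, and Euclidean distance is only a default; any function d(a,b) can be passed in, which matters when the points do not live in a traditional vector space.

```python
from math import sqrt

def euclidean(a, b):
    """Ordinary Euclidean distance; used here only as a default."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, points, k=None, max_distance=None, distance=euclidean):
    """Return names of points near `query`, closest first.

    Cut off by number of neighbors (k) and/or by distance (max_distance).
    """
    scored = sorted((distance(query, p), name) for name, p in points.items())
    if max_distance is not None:
        scored = [(d, name) for d, name in scored if d <= max_distance]
    if k is not None:
        scored = scored[:k]
    return [name for _, name in scored]

# Hypothetical points; each could stand for a user or an item.
points = {"a": (0.0, 1.0), "b": (3.0, 4.0), "c": (1.0, 1.0), "d": (10.0, 10.0)}

print(nearest_neighbors((0.0, 0.0), points, k=2))
print(nearest_neighbors((0.0, 0.0), points, max_distance=5.0))
```

The scan visits every point, which is fine for a sketch; large systems typically need an index or approximation to avoid comparing against all points.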
  • 80. “Web Data Analytics” • Get the digital data (from the web or from scanning) • Need to crawl the web (a solved “engineering” problem?) • Preprocess data to get searchable things (words, positions) • Form an Inverted Index mapping words to documents • Typically use TF-IDF (Term Frequency, Inverse Document Frequency) to quantify the importance of a word match • Rank relevance of documents: PageRank • Lots of technology for advertising, “reverse engineering” and “preventing reverse engineering” • Clustering of documents into topics (as in Google News)
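The inverted index and TF-IDF steps above can be sketched on a toy corpus. This is a simplified illustration, not a search engine: the function names, the whitespace tokenizer, the three sample documents, and the specific TF-IDF variant (term count divided by document length, times the natural log of total documents over documents containing the word) are all assumptions; many other weightings are in use.

```python
from collections import defaultdict
from math import log

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def tf_idf(word, doc_id, index, docs):
    """One simple TF-IDF variant: (count / doc length) * ln(N / docs containing word)."""
    postings = index.get(word, {})
    if doc_id not in postings:
        return 0.0
    tf = len(postings[doc_id]) / len(docs[doc_id].split())
    idf = log(len(docs) / len(postings))
    return tf * idf

# Hypothetical three-document corpus.
docs = {
    "d1": "big data in the cloud",
    "d2": "the cloud and the grid",
    "d3": "data mining finds diamonds",
}
index = build_inverted_index(docs)

print(sorted(index["cloud"]))             # documents containing "cloud"
print(index["the"]["d2"])                 # positions of "the" in d2
print(tf_idf("grid", "d2", index, docs))  # rare word, so a relatively high weight
```

A word found in every document gets an IDF of zero under this variant, which is the intended effect: very common words say little about which document matches a query.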
  • 81. Size of face proportional to PageRank
  • 82. Modern Recommendation Systems (from Yahoo) • Goal (Function to Optimize – Long Term dollars) – Serve the right item to a user in a given context to optimize long-term business objectives • A scientific discipline that involves – Large scale Machine Learning & Statistics • Offline Models (capture global & stable characteristics) • Online Models (incorporate dynamic components) • Explore/Exploit (active and adaptive experimentation) – Multi-Objective Optimization • Click-rates (CTR), Engagement, advertising revenue, diversity, etc – Inferring user interest • Constructing User Profiles – Natural Language Processing to understand content • Topics, “aboutness”, entities, follow-up of something, breaking news,… • Source: Deepak Agarwal & Bee-Chung Chen @ ICML’11 http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
  • 83. [Figure labels] • Recommend applications • Recommend search queries • Recommend news articles • Recommend packages: image; title, summary; links to other pages • Pick 4 out of a pool of K, K = 20 ~ 40, dynamic • Routes traffic to other pages • Source: Deepak Agarwal & Bee-Chung Chen @ ICML’11 http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
  • 84. Some examples from content optimization • Simple version – I have a content module on my page; content inventory is obtained from a third party source and further refined through editorial oversight. Can I algorithmically recommend content on this module? I want to improve the overall click-rate (CTR) on this module • More advanced – I got X% lift in CTR. But I have additional information on other downstream utilities (e.g. advertising revenue). Can I increase downstream utility without losing too many clicks? • Highly advanced – There are multiple modules running on my webpage. How do I perform a simultaneous optimization? • Source: Deepak Agarwal & Bee-Chung Chen @ ICML’11 http://pages.cs.wisc.edu/~beechung/icml11-tutorial/
  • 86. Science Computing Environments • Large Scale Supercomputers – Multicore nodes linked by a high performance, low latency network – Increasingly with GPU enhancement – Suitable for highly parallel simulations • High Throughput Systems such as the European Grid Initiative (EGI) or Open Science Grid (OSG), typically aimed at pleasingly parallel jobs – Can use “cycle stealing” – Classic example is LHC data analysis • Grids federate resources as in EGI/OSG or enable convenient access to multiple backend systems including supercomputers • Use Services (SaaS) – Portals make access convenient – Workflow integrates multiple processes into a single job
  • 87. Clouds, HPC and Grids • Synchronization/communication performance, ordered by overhead: Grids > Clouds > Classic HPC Systems • Clouds naturally execute Grid workloads effectively, but their fit is less clear for closely coupled HPC applications • Classic HPC machines as MPI engines offer the highest possible performance on closely coupled problems • The 4 forms of MapReduce/MPI 1) Map Only – pleasingly parallel 2) Classic MapReduce as in Hadoop; single Map followed by reduction with fault tolerant use of disk 3) Iterative MapReduce, used for data mining such as Expectation Maximization in clustering etc.; caches data in memory between iterations and supports the large collective communication (Reduce, Scatter, Gather, Multicast) used in data mining 4) Classic MPI! Supports small point to point messaging efficiently, as used in partial differential equation solvers
  • 88. What Applications work in Clouds • Pleasingly (moving to modestly) parallel applications of all sorts with roughly independent data or spawning independent simulations – Long tail of science and integration of distributed sensors • Commercial and Science Data analytics that can use MapReduce (some such apps) or its iterative variants (most other data analytics apps) • Which science applications are using clouds? – Venus-C (Azure in Europe): 27 applications not using Scheduler, Workflow or MapReduce (except roll your own) – 50% of applications on FutureGrid are from Life Science – Locally, Lilly corporation is a commercial cloud user (for drug discovery) but IU Biology is not • But overall very little science use of clouds yet
  • 89. Parallelism over Users and Usages • “Long tail of science” can be an important usage mode of clouds. • In some areas like particle physics and astronomy, i.e. “big science”, there are just a few major instruments, now generating petascale data, driving discovery in a coordinated fashion. • In other areas such as genomics and environmental science, there are many “individual” researchers with distributed collection and analysis of data whose total data and processing needs can match the size of big science. • Clouds can provide scalable, convenient resources for this important aspect of science. • Can be map only use of MapReduce if different usages are naturally linked, e.g. exploring docking of multiple chemicals or alignment of multiple DNA sequences – Collecting together or summarizing multiple “maps” is a simple Reduction
  • 90. Internet of Things and the Cloud • It is projected that there will be 24-75 billion devices on the Internet by 2020. Most will be small sensors that send streams of information into the cloud, where it will be processed and integrated with other streams and turned into knowledge that will help our lives in a multitude of small and big ways. • The cloud will become increasingly important as a controller of and resource provider for the Internet of Things. • As well as today’s use for smart phone and gaming console support, “Intelligent River”, “smart homes and grid” and “ubiquitous cities” build on this vision, and we could expect a growth in cloud supported/controlled robotics. • Some of these “things” will be supporting science • Natural parallelism over “things” • “Things” are distributed and so form a Grid
  • 91. Sensors (Things) as a Service • [Figure labels: Sensors as a Service; Sensor Processing as a Service (could use MapReduce); A larger sensor ………; Output Sensor] • Open Source Sensor (IoT) Cloud https://sites.google.com/site/opensourceiotcloud/
  • 92. [Figure] Meeker/Wu, May 29 2013, Internet Trends, D11 Conference
  • 93. [Figure] Meeker/Wu, May 29 2013, Internet Trends, D11 Conference
  • 94. [Figure] Meeker/Wu, May 29 2013, Internet Trends, D11 Conference
  • 96. Classic Parallel Computing • HPC: Typically SPMD (Single Program Multiple Data) “maps”, typically processing particles or mesh points, interspersed with a multitude of low latency messages supported by specialized networks such as Infiniband and technologies like MPI – Often run large capability jobs with 100K (going to 1.5M) cores on the same job – National DoE/NSF/NASA facilities run at 100% utilization – Fault fragile and cannot tolerate “outlier maps” taking longer than others • Clouds: MapReduce has asynchronous maps, typically processing data points with results saved to disk. A final reduce phase integrates results from different maps – Fault tolerant and does not require map synchronization – Map only is a useful special case • HPC + Clouds: Iterative MapReduce caches results between “MapReduce” steps and supports SPMD parallel computing with large messages, as seen in parallel kernels (linear algebra) in clustering and other data mining
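The iterative MapReduce pattern can be illustrated with a single-process sketch. K-means clustering on one-dimensional data is used here purely as an example of an iterative data mining kernel; it is a choice made for this illustration, and the function names and sample data are hypothetical. The point list stays in memory across iterations (the "cached" data), while only the small list of centers changes from step to step.

```python
from collections import defaultdict

def kmeans_step(points, centers):
    """One MapReduce-style step of 1-D k-means."""
    # Map: each point emits (index of nearest center, point).
    groups = defaultdict(list)
    for x in points:
        nearest = min(range(len(centers)), key=lambda c: abs(x - centers[c]))
        groups[nearest].append(x)
    # Reduce: each center becomes the mean of its assigned points
    # (a center with no points keeps its old position).
    return [
        sum(groups[c]) / len(groups[c]) if groups[c] else centers[c]
        for c in range(len(centers))
    ]

def iterative_kmeans(points, centers, max_iterations=20):
    """Repeat the map/reduce step until the centers stop changing."""
    for _ in range(max_iterations):
        new_centers = kmeans_step(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers

# Hypothetical data with two obvious clusters.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
final_centers = iterative_kmeans(points, [0.0, 5.0])
print(final_centers)
```

In a real iterative MapReduce runtime the map tasks would run in parallel on partitions of the points, and the reduce would be a collective operation that broadcasts the new centers back to all maps.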
  • 97. MapReduce “File/Data Repository” Parallelism • Map = (data parallel) computation reading and writing data • Reduce = Collective/Consolidation phase, e.g. forming multiple global sums as in a histogram • [Diagram labels: Instruments; Disks; Map1, Map2, Map3; Reduce; Communication; Portals/Users; Iterative MapReduce with repeated Map and Reduce stages]
  • 98. Sam’s Problem • Sam thought of “drinking” the apple • He used a [image] to cut the [image] and a [image] to make juice. http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
  • 99. Creative Sam • Implemented a parallel version of his innovation • Each input to a map is a list of <key, value> pairs, e.g. (<a, [image]>, <o, [image]>, <p, [image]>, …) • Each output of slice is a list of <key, value> pairs, e.g. (<a’, [image]>, <o’, [image]>, <p’, [image]>) • Grouped by key • Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <ao, ([image] …)> • Reduced into a list of values • The idea of MapReduce in Data Intensive Computing: a list of <key, value> pairs is mapped into another list of <key, value> pairs, which gets grouped by the key and reduced into a list of values
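The map, group-by-key, and reduce flow described on this slide can be sketched in a few lines of single-process Python. Word counting is used as a stand-in for Sam's fruit; the names `map_reduce`, `word_mapper`, and `count_reducer` and the sample records are assumptions made for this illustration and do not correspond to any specific framework's API.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over <key, value> pairs, group outputs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        # Map: each input pair produces a list of <key, value> pairs.
        for out_key, out_value in mapper(key, value):
            # Group: collect all values that share an output key.
            groups[out_key].append(out_value)
    # Reduce: each <key, value-list> is collapsed into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(doc_id, text):
    """Emit <word, 1> for every word in the document."""
    return [(word, 1) for word in text.lower().split()]

def count_reducer(word, counts):
    """Sum the 1s emitted for a word."""
    return sum(counts)

# Hypothetical input: a list of <document id, text> pairs.
records = [("d1", "apple orange apple"), ("d2", "orange pear")]
counts = map_reduce(records, word_mapper, count_reducer)
print(counts)
```

In a distributed runtime the mapper calls run in parallel on different machines and the grouping step is done by hashing keys to reducers, but the data flow is the same as in this sketch.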
  • 100. Data Science Education Opportunities at Universities • See recent New York Times articles http://datascience101.wordpress.com/2013/04/13/new-york-times-data-science-articles/
  • 101. Data Science Education • Broad range of topics from policy to curation to applications and algorithms, programming models, data systems, statistics, and a broad range of CS subjects such as Clouds, Programming, HCI • Plenty of jobs and a broader range of possibilities than computational science, but similar cosmic issues – What type of degree (Certificate, minor, track, “real” degree) – What implementation (department, interdisciplinary group supporting education and research program)
  • 102. At Indiana University • Have a proposal to set up certificates and a Masters degree in data science • Joint between 3 units in the School of Informatics and Computing (Computer Science, Informatics, Information & Library Science) and the Statistics department in the COAS College • Looking at a version with Kelley with a Business data analytics flavor • Attractive to offer online, as few universities have this and so there is a potentially large audience outside IU
  • 103. [Figure] Meeker/Wu, May 29 2013, Internet Trends, D11 Conference
  • 104. [Figure] Meeker/Wu, May 29 2013, Internet Trends, D11 Conference
  • 105. Massive Open Online Courses (MOOC) • MOOC’s are very “hot” these days with Udacity and Coursera as start-ups; perhaps over 100,000 participants • Relevant to Data Science as this is a new field with few courses at most universities • Typical model is a collection of short prerecorded segments (talking head over PowerPoint) of length 3-15 minutes – This is the boredom limit http://blog.coursera.org/post/49750392396/on-the-topic-of-boredom • These “lesson objects” can be viewed as “songs” • Google Course Builder (Python, open source) builds customizable MOOC’s as “playlists” of “songs” • Tells you to capture all material as “lesson objects” • We are aiming to build a repository of many “songs”, used in many ways – tutorials, classes …
  • 107. • Seven ~10-minute lesson objects in this lecture • IU wants us to closed-caption them if used in a real course
  • 108. Customizable MOOC’s I • We could teach one class to 100,000 students or 2,000 classes to 50 students • The 2,000 class choice has useful features – One can use the usual (electronic) mentoring/grading technology – One can customize each of the 2,000 classes for a particular audience given their level and interests – One can even allow students to customize – that’s what one does in making playlists in iTunes • Both models can be supported by a repository of lesson objects (10-15 minute video segments) in the cloud • The teacher can choose from existing lesson objects and add their own to produce a new customized course, with new lessons contributed back to the repository
  • 110. Customizable MOOC’s II • The 3-15 minute video over PowerPoint of MOOC lesson objects is easy to re-use • Qiu (IU) and Hayden (ECSU, Elizabeth City State University, a small HBCU, Historically Black University) will customize a module – Starting with Qiu’s cloud computing course at IU – Adding material on use of Cloud Computing in Remote Sensing (area covered by ECSU course) • This is a model for adding cloud curricula material to a wide set of universities where faculty are not able to teach it • Defining how to support computing labs associated with MOOC’s with clouds or VM’s on clients – Appliances scale as they download to the student’s client
  • 111. • Can of course build many different interfaces • Songs stored on YouTube • Songs prepared with Adobe Presenter on Laptop http://cloudmooc.soic.indiana.edu/
  • 112. Two limits where MOOC’s are Compelling • High volume courses (CS/Ph/Chem/Bio101…) where the scalability of MOOC’s makes them attractive to reach a lot of students • Niche areas where there is some student interest but either no faculty expertise or not enough students to justify traditional courses – Offer to many institutions simultaneously
  • 113. [Figure] Meeker/Wu, May 29 2013, Internet Trends, D11 Conference
  • 115. Conclusions • Clouds are here to stay and one should plan on exploiting them • Data Intensive studies in business and research continue to grow in importance – Data Analytics: Everything is an optimization problem in a funny space • Growing employment opportunities in clouds and data related activities, and so popular with students – Enabling many of the most important companies from Facebook/Google to General Electric • Need community discussion of data science education – Agree on curricula; is such a degree attractive? • MOOC’s interesting for – Disseminating new curricula – Managing course fragments that can be assembled into custom courses for particular interdisciplinary students
  • 116. Big Data Ecosystem in One Sentence • Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics, educated in data science • X = Astronomy, Biology, Biomedicine, Business, Chemistry, Climate, Crisis, Earth Science, Energy, Environment, Finance, Health, Intelligence, Lifestyle, Marketing, Medicine, Pathology, Policy, Radar, Security, Sensor, Social, Sustainability, Wealth and Wellness, with more fields (physics) defined implicitly • Spans Industry and Science (research)