5/10/2012




                           A Berkeley View of Big Data

                                     Anthony D. Joseph
                                        UC Berkeley


                                   EDUSERV Symposium
                                      10 May 2012




                                    Who Am I?
                    • Research:
                       – Internet-scale systems (RAD Lab, AMP Lab)
                       – Security (DETERlab Testbed)
                       – Adversarial machine learning (SecML)

                    • Teaching (undergrad/grad): operating
                      systems and systems, security, networking

                    Disclaimer: I don’t speak for UC or our
                     research sponsors




AMPLab Overview -
franklin@cs.berkeley.edu                                                    1




                                 Big Data is Massive…
                  • Facebook:
                     – 130TB/day: user logs
                     – 200-400TB/day: 83 million pictures
                     – >40 Billion photos

                  • Google: > 25 PB/day processed data

                  • Data generated by LHC: 1 PB/sec

                  • Total data created in 2010: 1 ZettaByte
                    (1,000,000 PB)/year
                     – ~60% increase every year




                                      …and Diverse…
                     • Walmart
                           – >1 million customer
                             transactions/hr
                           – >2.5 PByte customer DB

                     • Human genome sequencing
                           – Analyzing 3 billion base pairs
                           – Ten years for first one (2003)
                           – Today, less than one week







                                     …and Novel…
                   • Analyzing data from user behavior vs user input

                   • USGS TED
                     – Twitter-based Earthquake Detector




                   • Google Trends: “nowcasting”
                         – http://www.google.org/flutrends/
                         – US 2009 “Cash for Clunkers” program success
                         – US State unemployment rates




                               …and Grows Bigger…
                  • More and more devices



                  • More and more people




                  • Cheaper and cheaper storage
                     – ~50% increase in GB/$ every year




                                     …and Bigger!
                  • Log everything!
                     – Don’t always know what question you’ll need
                       to answer


                  • Stored data
                    growing faster
                    than both
                    available
                    storage and GB/$
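
To make the gap concrete, here is a small sketch that compounds the two growth rates quoted in these slides (~60% per year for data created, ~50% per year for GB/$). It assumes both rates stay constant year over year, which the slides do not claim; the function names and the five-year horizon are illustrative only.

```python
def compound(rate: float, years: int) -> float:
    """Growth factor after `years` at a constant annual `rate` (0.6 means +60%)."""
    return (1.0 + rate) ** years


def relative_cost(data_rate: float, gb_per_dollar_rate: float, years: int) -> float:
    """Cost of storing all the data, relative to year 0.

    Data volume grows by `data_rate` per year while each dollar buys
    `gb_per_dollar_rate` more capacity per year (both assumed constant).
    """
    return compound(data_rate, years) / compound(gb_per_dollar_rate, years)


if __name__ == "__main__":
    # Print year, data growth factor, and relative storage bill.
    for year in range(6):
        print(year, round(compound(0.60, year), 2),
              round(relative_cost(0.60, 0.50, year), 3))
```

Under these assumptions the bill for keeping everything rises every year even though storage keeps getting cheaper, because the ratio of the two growth factors is above 1.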





                            Which Big Data to Keep?

                    • Hard to decide what to delete




                         – Thankless decision: people know only when you
                           are wrong!
                         – “Climate Research Unit (CRU) scientists admit
                           they threw away key data used in global warming
                           calculations”




                          Data Retention Requirements

                     • New NSF data retention requirements
                           – Proposals submitted after 18 January 2011
                             must include a “Data Management Plan”
                           – Have to keep all data (including metadata) for
                             3 years after research award conclusion
                           – Institutional/org considerations:
                             • Opportunity to invest in pooled storage: campus,
                               systemwide, regional, …
                             • Typical cost: 8TB chunks at $1.44/GB/year for
                               collaborative space and $0.17/GB/year for archive
                               space
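
As a rough worked example of the figures on this slide, the sketch below prices one 8TB chunk at the two quoted rates and over the 3-year retention window. It assumes decimal units (1 TB = 1000 GB) and a flat rate with no discounts; the function names are made up for illustration.

```python
GB_PER_TB = 1000  # assumption: decimal units (1 TB = 1000 GB)


def annual_cost_usd(terabytes: float, usd_per_gb_year: float) -> float:
    """Yearly cost of holding `terabytes` at a flat per-GB rate."""
    return terabytes * GB_PER_TB * usd_per_gb_year


def retention_cost_usd(terabytes: float, usd_per_gb_year: float, years: int = 3) -> float:
    """Cost of keeping the data for `years` (3 years per the NSF rule quoted above)."""
    return annual_cost_usd(terabytes, usd_per_gb_year) * years


if __name__ == "__main__":
    print(f"collaborative: ${annual_cost_usd(8, 1.44):,.2f}/year")
    print(f"archive:       ${annual_cost_usd(8, 0.17):,.2f}/year")
    print(f"archive, 3 yr: ${retention_cost_usd(8, 0.17):,.2f}")
```

With these assumptions one chunk comes to about $11,520/year in collaborative space versus about $1,360/year in archive space.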




                              Big Data Isn’t Always Big


                           Data that is expensive to manage,
                            and hard to extract value from


                  • You don’t need to be big to have a big data problem!
                     – Inadequate tools to analyze data
                     – Data management may dominate infrastructure cost






                              Big Data is not Cheap!
                  • Storing and managing 1PB
                    data: $500K-$1M/year
                     – Facebook: 200 PB/year

                  • “Typical” cloud-based
                    service startup (e.g.,
                    Conviva)
                     – Log storage dominates
                       infrastructure cost

                  [Chart: share of infrastructure cost (0-100%), 2007-2010,
                   storage cluster vs. other; ~1PB storage capacity]




                     Hard to Extract Value from Data!
                  • Data is
                     – Diverse, variety of sources
                     – Uncurated, no schema, inconsistent semantics, syntax
                     – Integration a huge challenge

                  • No easy way to get answers that are
                     – High-quality
                     – Timely

                  • Challenge: maximize value from data by getting
                    best possible answers




                     Requires Multifaceted Approach
                     • Three dimensions to improve data
                       analysis
                           – Improving scale, efficiency, and quality of
                             algorithms (Algorithms)
                           – Scaling up datacenters (Machines)
                           – Leverage human activity and intelligence
                             (People)


                     • Need to adaptively and flexibly combine all
                       three dimensions




                                  The State of the Art
                     • Today’s apps: fixed point in solution space

                     [Diagram: three axes (Algorithms, Machines, People), with
                      Watson/IBM and search marked as points]

                     Need techniques to dynamically pick best
                     operating point







                      What Is the Big Data Problem?
                     • For two main reasons:
                           – the more data the greater chance to find any
                             pattern you’d like to find
                              • the more rows in a table, the more columns
                              • the more columns, the more hypotheses that can
                                be considered
                              • indeed, the number of hypotheses grows
                                exponentially in the number of columns
                           – the more data the less likely a sophisticated
                             ML algorithm will run in an acceptable time
                             frame
                              • and then we have to back off to cheaper
                                algorithms that may be more error-prone
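
The first reason can be illustrated with a little arithmetic. The sketch below counts hypotheses as non-empty subsets of columns (one simple way to get exponential growth; the slide does not say how hypotheses are counted) and computes the chance that at least one of m independent tests of true null hypotheses passes at significance level alpha. The independence assumption and the function names are mine.

```python
def candidate_hypotheses(columns: int) -> int:
    """Number of non-empty column subsets: grows as 2**columns."""
    return 2 ** columns - 1


def prob_spurious_hit(tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `tests` independent true-null tests
    looks significant at level `alpha`."""
    return 1.0 - (1.0 - alpha) ** tests


if __name__ == "__main__":
    for cols in (5, 10, 20):
        m = candidate_hypotheses(cols)
        print(cols, m, round(prob_spurious_hit(m), 4))
```

Even at 10 columns there are 1023 candidate subsets, so finding some "pattern" by chance is close to certain unless the analysis corrects for it.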




                       A Formulation of the Problem

                     • Given an inferential goal and a fixed
                       computational budget, provide a guarantee
                       (supported by an algorithm and an analysis) that
                       the quality of inference will increase
                       monotonically as data accrue (without bound)
                           – This is far from being achieved in the current state of
                             the literature!
                     • It can be achieved by building a scalable system
                       that blends statistical and computational design
                       principles








                                    Big Data in the US
                     • Many Fortune 1000+ companies with huge write
                       once, read none big data collections
                           – For all the reasons I’ve already outlined…

                     • US Government agencies in same situation
                           – New R&D funding

                     • Many companies developing proprietary solutions

                     • Very active open source big data tools community
                           – Broad international participation
                           – Data Without Borders helping non-profits through pro
                             bono data collection, analysis, and visualization




                            Significant USG Investment
                     • 29 March 2012
                           – US federal agencies announced more than
                             $200 million in new commitments
                           – Dept of Defense, Dept of Homeland Security,
                             Dept of Energy, Veterans Administration, Office
                             of Scientific and Technical Information, Health
                             and Human Services, Food and Drug Admin,
                             National Archives & Records Admin, National
                             Aeronautics & Space Admin, National Institutes
                             of Health, National Science Foundation,
                             National Security Agency, US Geological
                             Survey








                     Active Open Source Community

                     • On-going development of several elements
                       of Big Data analysis pipeline
                           •   Apache Hadoop (MapReduce)
                           •   Hive
                           •   Apache Pig
                           •   R / Octave
                     • Much more is needed!
                           • E.g., new analysis environments





                                        The AMP Lab
                        Make sense of data at scale by tightly
                    integrating algorithms, machines, and people
                     [Diagram: three axes (Algorithms, Machines, People), with
                      Watson/IBM and search marked as points]








                               AMP Faculty and Sponsors
                     • Faculty
                           –   Alex Bayen (mobile sensing platforms)
                           –   Armando Fox (systems)
                           –   Michael Franklin (databases): Director
                           –   Michael Jordan (machine learning): Co-director
                           –   Anthony Joseph (security & privacy)
                           –   Randy Katz (systems)
                           –   David Patterson (systems)
                           –   Ion Stoica (systems): Co-director
                           –   Scott Shenker (networking)
                     • Sponsors:







                                             Algorithms
                     • State-of-art Machine Learning (ML)
                       algorithms do not scale
                           – Prohibitive to process all data points
                     [Plot: estimate vs. # of data points, converging to the
                      true answer. How do you know when to stop?]








                                        Algorithms
                     • Given any problem, data and a budget
                           – Immediate results with continuous improvement
                           – Calibrate answer: provide error bars
                     [Plot: estimate vs. # of data points, converging to the
                      true answer. Error bars on every answer!]




                                        Algorithms
                     • Given any problem, data and a time budget
                           – Immediate results with continuous improvement
                           – Calibrate answer: provide error bars
                     [Plot: estimate vs. # of data points (time), converging to
                      the true answer. Stop when error smaller than a given
                      threshold]
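
A minimal sketch of the "error bars on every answer, stop below a threshold" idea, using the simplest possible case: estimating a mean from a stream. This is not the AMP Lab's algorithm. It assumes a normal-approximation 95% interval (z = 1.96) and ignores the statistical effect of checking the interval repeatedly; `estimate_mean`, `threshold`, and `min_points` are illustrative names.

```python
import math
import random


def estimate_mean(stream, threshold: float, z: float = 1.96, min_points: int = 30):
    """Consume points until the interval half-width drops below `threshold`.

    Returns (estimate, half_width, points_used). The running mean and
    variance are maintained with Welford's online update, so an answer
    with an error bar is available after every point.
    """
    n, mean, m2 = 0, 0.0, 0.0
    half_width = math.inf
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_points:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width < threshold:
                break  # error bar is tight enough: stop early
    return mean, half_width, n


if __name__ == "__main__":
    rng = random.Random(0)
    points = (rng.gauss(10.0, 2.0) for _ in range(100_000))
    est, err, used = estimate_mean(points, threshold=0.1)
    print(f"estimate {est:.3f} +/- {err:.3f} after {used} points")
```

The point of the sketch is that the loop stops long before the stream is exhausted once the error bar is small enough, instead of processing every data point.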








                                              Algorithms
                     • Given any problem, data and a time budget
                               – Automatically pick the best algorithm
                     [Plot: estimate vs. time for a simple and a sophisticated
                      algorithm, both approaching the true answer. Regions along
                      the time axis: error too high, pick sophisticated, pick simple]
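
One possible reading of this slide, sketched in code: given a time budget and an error target, try the candidates in order of preference (simplest first) and take the first one predicted to meet the target, or report that the error is too high. The error models below are invented placeholders, not measurements, and `pick_algorithm` is a hypothetical helper rather than anything from the AMP stack.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

# Maps a time budget in seconds to a predicted error (assumed known up front).
ErrorModel = Callable[[float], float]


def pick_algorithm(candidates: Sequence[Tuple[str, ErrorModel]],
                   budget_s: float, max_error: float) -> Optional[str]:
    """Return the first candidate (in preference order) whose predicted error
    at `budget_s` is within `max_error`; None means "error too high"."""
    for name, predicted_error in candidates:
        if predicted_error(budget_s) <= max_error:
            return name
    return None


# Hypothetical error curves: the sophisticated algorithm reaches a given
# error in less time, the simple one is preferred whenever it suffices.
CANDIDATES = [
    ("simple", lambda t: 1.0 / math.sqrt(t)),
    ("sophisticated", lambda t: 0.3 / math.sqrt(t)),
]

if __name__ == "__main__":
    for budget in (4.0, 25.0, 400.0):
        print(budget, pick_algorithm(CANDIDATES, budget, max_error=0.1))
```

With these made-up curves, a short budget yields no acceptable answer, a medium budget selects the sophisticated algorithm, and a long budget falls back to the simple one.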




                                               Machines
                     • “The datacenter as a computer” still in its
                       infancy
                               – Special purpose clusters, e.g., Hadoop cluster
                               – Highly variable performance
                               – Hard to program
                               – Hard to debug



                                                                  =?





                                                                   Machines
                      • Make datacenter a real computer!


                  • Share datacenter between multiple cluster computing
                    apps
                  • Provide new abstractions and services

                  [Diagram: AMP stack: Datacenter “OS” (e.g., Mesos), on top of
                   the existing stack: Node OS (e.g. Linux), Node OS
                   (e.g. Windows), …, Node OS (e.g. Linux)]




                                                                   Machines
                      • Make datacenter a real computer!


                  • Support existing cluster computing apps

                  [Diagram: Hadoop, MPI, Hive, Hypertable, Cassandra, … running
                   on the AMP stack: Datacenter “OS” (e.g., Mesos), on top of
                   the existing stack: Node OS (e.g. Linux), Node OS
                   (e.g. Windows), …, Node OS (e.g. Linux)]








                                                                   Machines
                      • Make datacenter a real computer!
                  • Support interactive and iterative data analysis
                    (e.g., ML algorithms)…

                  [Diagram: Hadoop, MPI, Hive, Hypertable, Cassandra, …, plus
                   Spark, PIQL (predictive & insightful query language), and
                   SCADS (consistency adjustable data store), running on the
                   AMP stack: Datacenter “OS” (e.g., Mesos), on top of the
                   existing stack: Node OS (e.g. Linux), Node OS
                   (e.g. Windows), …, Node OS (e.g. Linux)]




                                                                   Machines
                      • Make datacenter a real computer!

                  • Applications, tools
                     – Advanced ML algorithms
                     – Interactive data mining
                     – Collaborative visualization

                  [Diagram: applications and tools on top of Hadoop, MPI, Hive,
                   Hypertable, Cassandra, …, Spark, PIQL, and SCADS, running on
                   the AMP stack: Datacenter “OS” (e.g., Mesos), on top of the
                   existing stack: Node OS (e.g. Linux), Node OS
                   (e.g. Windows), …, Node OS (e.g. Linux)]








                                               People
                     • Humans can make sense of messy data!








                                               People
                  • Make people an integrated part of
                    the system!
                     – Leverage human activity
                     – Leverage human intelligence
                       (crowdsourcing):
                           • Curate and clean dirty data
                           • Answer imprecise questions
                           • Test and improve algorithms

                  [Diagram: people and Machines + Algorithms, linked by arrows
                   labeled “data, activity”, “Questions”, and “Answers”]

                  • Challenge
                     – Inconsistent answer quality in all
                       dimensions (e.g., type of question,
                       time, cost)
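
One common way to cope with inconsistent crowd answers is to ask several workers the same question and aggregate. The sketch below uses plain majority vote with an agreement cutoff; it is a generic illustration, not the approach described in this talk, and `aggregate_answers` and `min_agreement` are hypothetical names.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def aggregate_answers(answers: Sequence[Hashable],
                      min_agreement: float = 0.6) -> Tuple[Optional[Hashable], float]:
    """Majority vote over redundant crowd answers.

    Returns (label, agreement). The label is None when the most common
    answer has less than `min_agreement` of the votes, which signals
    that the question should be re-asked or escalated.
    """
    if not answers:
        return None, 0.0
    label, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return (label if agreement >= min_agreement else None), agreement


if __name__ == "__main__":
    print(aggregate_answers(["cat", "cat", "dog"]))
    print(aggregate_answers(["cat", "dog"]))
```

Raising `min_agreement` trades cost and time (more workers per question) for answer quality, which is the tension the Challenge bullet points at.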








                                  Real Applications
                  • Mobile Millennium Project
                     – Alex Bayen, Civil and Environmental
                       Engineering, UC Berkeley
                  • Microsimulation of urban
                    development
                     – Paul Waddell, College of
                       Environmental Design, UC Berkeley
                  • Crowd based opinion formation
                     – Ken Goldberg, Industrial
                       Engineering and Operations
                       Research, UC Berkeley
                  • Personalized Sequencing
                     – Taylor Sittler, UCSF




                           Personalized Sequencing








                                       The AMP Lab
                        Make sense of data at scale by tightly
                    integrating algorithms, machines, and people
                     [Diagram: three axes (Algorithms, Machines, People), with
                      Microsimulation, Mobile Millennium, and Sequencing marked]




                                    Big Data in 2020
                                       Are you prepared?
                   • To create a new generation of big data scientists
                   • For ML to become an engineering discipline
                   • For people to be deeply integrated in big data
                     analysis pipeline
                   • Will your institution
                       – offer a big data curriculum touching all fields?
                       – have hired cross-disciplinary faculty?
                       – have invested in (pooled) storage infrastructure?
                       – have invested in public/private clouds?
                       – have built inter/intra campus networks?








                                          Summary
                     • Goal: Tame Big Data Problem
                           – Get results with right quality at the right time
                     • Approach: Holistic integration of
                       Algorithms, Machines, and People
                     • Huge research issues across many
                       domains




                     37




AMPLab Overview -
franklin@cs.berkeley.edu                                                              19

Contenu connexe

Tendances

XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your RoleJay Gendron
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchUniversity of Washington
 
Online Communities in Citizen Science
Online Communities in Citizen ScienceOnline Communities in Citizen Science
Online Communities in Citizen ScienceAndrea Wiggins
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsJon Voss
 
Data-Ed Online: Practical Applications for Data Warehousing, Analytics, BI, a...
Data-Ed Online: Practical Applications for Data Warehousing, Analytics, BI, a...Data-Ed Online: Practical Applications for Data Warehousing, Analytics, BI, a...
Data-Ed Online: Practical Applications for Data Warehousing, Analytics, BI, a...Data Blueprint
 
Mexico talk foster march 2012
Mexico talk foster march 2012Mexico talk foster march 2012
Mexico talk foster march 2012Ian Foster
 
Beyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeBeyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeEric Kansa
 

Tendances (13)

XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
Online Communities in Citizen Science
Online Communities in Citizen ScienceOnline Communities in Citizen Science
Online Communities in Citizen Science
 
What matters ?
What matters ?What matters ?
What matters ?
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & Museums
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Wither OWL
Wither OWLWither OWL
Wither OWL
 
Data-Ed Online: Practical Applications for Data Warehousing, Analytics, BI, a...
Data-Ed Online: Practical Applications for Data Warehousing, Analytics, BI, a...Data-Ed Online: Practical Applications for Data Warehousing, Analytics, BI, a...
Data-Ed Online: Practical Applications for Data Warehousing, Analytics, BI, a...
 
Mexico talk foster march 2012
Mexico talk foster march 2012Mexico talk foster march 2012
Mexico talk foster march 2012
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Beyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeBeyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional Practice
 

Anthony Joseph

  • 1. 5/10/2012

    A Berkeley View of Big Data
    Anthony D. Joseph
    UC Berkeley
    EDUSERV Symposium
    10 May 2012

    Who Am I?
    • Research:
       – Internet-scale systems (RAD Lab, AMP Lab)
       – Security (DETERlab Testbed)
       – Adversarial machine learning (SecML)
    • Teaching (undergrad/grad): operating systems and systems, security, networking
    Disclaimer: I don’t speak for UC or our research sponsors

    AMPLab Overview - franklin@cs.berkeley.edu
  • 2. Big Data is Massive…
    • Facebook:
       – 130TB/day: user logs
       – 200-400TB/day: 83 million pictures
       – >40 Billion photos
    • Google: > 25 PB/day processed data
    • Data generated by LHC: 1 PB/sec
    • Total data created in 2010: 1.ZettaByte (1,000,000 PB)/year
       – ~60% increase every year

    …and Diverse…
    • Walmart
       – >1 million customer transactions/hr
       – >2.5 PByte customer DB
    • Human genome sequencing
       – Analyzing 3 billion base pairs
       – Ten years for first one (2003)
       – Today, less than one week
  • 3. …and Novel…
    • Analyzing data from user behavior vs user input
    • USGS TED – Twitter-based Earthquake Detector
    • Google Trends: “nowcasting”
       – http://www.google.org/flutrends/
       – US 2009 “Cash for Clunkers” program success
       – US State unemployment rates

    …and Grows Bigger…
    • More and more devices
    • More and more people
    • Cheaper and cheaper storage
       – ~50% increase in GB/$ every year
  • 4. …and Bigger!
    • Log everything!
       – Don’t always know what question you’ll need to answer
    • Stored data growing faster than both available storage and GB/$

    Which Big Data to Keep?
    • Hard to decide what to delete
       – Thankless decision: people know only when you are wrong!
       – “Climate Research Unit (CRU) scientists admit they threw away key data used in global warming calculations”
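The two growth figures quoted on the slides (data growing ~60% per year, GB/$ improving ~50% per year) can be compounded to see why spend on storage keeps rising even as storage gets cheaper. The sketch below assumes both rates stay constant and normalizes year 0 to a cost of 1; the function name and the constant-rate model are illustrative assumptions, not taken from the talk.

```python
def relative_storage_cost(years, data_growth=0.60, gb_per_dollar_growth=0.50):
    """Cost of storing everything, relative to year 0, after `years` years.

    Assumes constant annual growth rates (an illustrative simplification).
    """
    data_volume = (1 + data_growth) ** years            # how much more data
    gb_per_dollar = (1 + gb_per_dollar_growth) ** years  # how much cheaper per GB
    return data_volume / gb_per_dollar


if __name__ == "__main__":
    for year in (0, 1, 5, 10):
        print(year, round(relative_storage_cost(year), 3))
```

Under these assumed rates the bill grows by a factor of 1.6/1.5 each year, so cheaper storage slows the increase in cost but does not reverse it.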
  • 5. Data Retention Requirements
    • New NSF data retention requirements
       – Proposals submitted after 18 January 2011 must include a “Data Management Plan”
       – Have to keep all data (including metadata) for 3 years after research award conclusion
       – Institutional/org considerations:
          • Opportunity to invest in pooled storage: campus, systemwide, regional, …
          • Typical cost: 8TB chunks at $1.44/GB/year for collaborative space and $0.17/GB/year for archive space

    Big Data Isn’t Always Big
    Data that is expensive to manage, and hard to extract value from
    • You don’t need to be big to have a big data problem!
       – Inadequate tools to analyze data
       – Data management may dominate infrastructure cost
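The per-GB rates on the data retention slide translate into a yearly bill per 8TB chunk as follows. This is a minimal sketch: it assumes decimal terabytes (1 TB = 1000 GB) and a flat rate over the 3-year retention window, and the function and dictionary names are hypothetical.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes

# Rates quoted on the slide, in USD per GB per year.
RATES_USD_PER_GB_YEAR = {"collaborative": 1.44, "archive": 0.17}


def annual_cost_usd(terabytes, tier):
    """Yearly cost of `terabytes` of storage in the given tier."""
    return terabytes * GB_PER_TB * RATES_USD_PER_GB_YEAR[tier]


def retention_cost_usd(terabytes, tier, years=3):
    """Cost of keeping the data for the retention period (flat rate assumed)."""
    return annual_cost_usd(terabytes, tier) * years


if __name__ == "__main__":
    for tier in RATES_USD_PER_GB_YEAR:
        print(tier, annual_cost_usd(8, tier), retention_cost_usd(8, tier))
```

With these assumptions, one 8TB chunk costs on the order of $11.5K per year in collaborative space versus about $1.4K per year in archive space, which is why the choice of tier matters for a retention plan.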
  • 6. Big Data is not Cheap!
    • Storing and managing 1PB data: $500K-$1M/year
       – Facebook: 200 PB/year
    • “Typical” cloud-based service startup (e.g., Conviva)
       – Log storage dominates infrastructure cost
    [Chart: infrastructure cost share (0% to 100%) by year, 2007 to 2010, split into “Storage cluster” and “Other”; annotation: ~1PB storage capacity]

    Hard to Extract Value from Data!
    • Data is
       – Diverse, variety of sources
       – Uncurated, no schema, inconsistent semantics, syntax
       – Integration a huge challenge
    • No easy way to get answers that are
       – High-quality
       – Timely
    • Challenge: maximize value from data by getting best possible answers
  • 7. Requires Multifaceted Approach
    • Three dimensions to improve data analysis
       – Improving scale, efficiency, and quality of algorithms (Algorithms)
       – Scaling up datacenters (Machines)
       – Leverage human activity and intelligence (People)
    • Need to adaptively and flexibly combine all three dimensions

    The State of the Art
    • Today’s apps: fixed point in solution space
    [Diagram: axes Algorithms, Machines, People; labeled points: Watson/IBM, search]
    Need techniques to dynamically pick best operating point
  • 8. What Is the Big Data Problem?
    • For two main reasons:
       – the more data the greater chance to find any pattern you’d like to find
          • the more rows in a table, the more columns
          • the more columns, the more hypotheses that can be considered
          • indeed, the number of hypotheses grows exponentially in the number of columns
       – the more data the less likely a sophisticated ML algorithm will run in an acceptable time frame
          • and then we have to back off to cheaper algorithms that may be more error-prone

    A Formulation of the Problem
    • Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)
       – This is far from being achieved in the current state of the literature!
    • It can be achieved by building a scalable system that blends statistical and computational design principles
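The first reason on the slide (more columns means exponentially more hypotheses, and so more chances to find a pattern that is not real) can be illustrated with pure noise. In the sketch below every column is random and unrelated to the target, yet the strongest correlation found can only go up as more columns are searched. The setup (50 rows, Gaussian noise, counting hypotheses as subsets of columns) is an assumption chosen for illustration, not something specified in the talk.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def count_subset_hypotheses(num_columns):
    """Number of column subsets that could be tested: exponential in columns."""
    return 2 ** num_columns


def strongest_spurious_correlation(target, columns):
    """Largest |correlation| between the target and any single column."""
    return max(abs(pearson(column, target)) for column in columns)


rng = random.Random(0)  # fixed seed so the run is repeatable
ROWS = 50
target = [rng.gauss(0, 1) for _ in range(ROWS)]
columns = [[rng.gauss(0, 1) for _ in range(ROWS)] for _ in range(200)]

if __name__ == "__main__":
    for k in (5, 50, 200):
        best = strongest_spurious_correlation(target, columns[:k])
        print(k, count_subset_hypotheses(k), round(best, 3))
```

Because each larger search includes the smaller one, the best spurious correlation never decreases as columns are added, which is the effect the slide warns about.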
  • 9. Big Data in the US
    • Many Fortune 1000+ companies with huge write once, read none big data collections
       – For all the reasons I’ve already outlined…
    • US Government agencies in same situation
       – New R&D funding
    • Many companies developing proprietary solutions
    • Very active open source big data tools community
       – Broad international participation
       – Data Without Borders helping non-profits through pro bono data collection, analysis, and visualization

    Significant USG Investment
    • 29 March 2012
       – US federal agencies announced more than $200 million in new commitments
       – Dept of Defense, Dept of Homeland Security, Dept of Energy, Veterans Administration, Office of Scientific and Technical Information, Health and Human Services, Food and Drug Admin, National Archives & Records Admin, National Aeronautics & Space Admin, National Institutes of Health, National Science Foundation, National Security Agency, US Geological Survey
  • 10. Active Open Source Community
    • On-going development of several elements of Big Data analysis pipeline
       – Apache Hadoop (MapReduce)
       – Hive
       – Apache Pig
       – R / Octave
    • Much more is needed!
       – E.g., new analysis environments

    The AMP Lab
    Make sense of data at scale by tightly integrating algorithms, machines, and people
    [Diagram: axes Algorithms, Machines, People; labeled points: Watson/IBM, search]
  • 11.
    AMP Faculty and Sponsors
      • Faculty
          – Alex Bayen (mobile sensing platforms)
          – Armando Fox (systems)
          – Michael Franklin (databases): Director
          – Michael Jordan (machine learning): Co-director
          – Anthony Joseph (security & privacy)
          – Randy Katz (systems)
          – David Patterson (systems)
          – Ion Stoica (systems): Co-director
          – Scott Shenker (networking)
      • Sponsors: [sponsor logos]
    Algorithms
      • State-of-the-art Machine Learning (ML) algorithms do not scale
          – Prohibitive to process all data points
      [Figure: Estimate vs. # of data points, with the true answer marked. How do you know when to stop?]
  • 12.
    Algorithms
      • Given any problem, data and a budget
          – Immediate results with continuous improvement
          – Calibrate answer: provide error bars
      [Figure: Estimate vs. # of data points, with the true answer marked. Error bars on every answer!]
    Algorithms
      • Given any problem, data and a time budget
          – Immediate results with continuous improvement
          – Calibrate answer: provide error bars
      [Figure: Estimate vs. # of data points and time, with the true answer marked. Stop when error smaller than a given threshold]
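The idea on the two "Algorithms" slides above (immediate results, error bars on every answer, stop once the error falls below a threshold) can be sketched for the simplest possible inferential goal, estimating a mean. This is a minimal sketch under stated assumptions, not the AMP Lab's algorithm: the function name, the batch size, the sample cap standing in for the budget, and the normal-approximation 95% interval are all choices made for this example.

```python
import math
import random


def estimate_mean_with_error_bars(data, error_threshold, batch_size=100,
                                  max_samples=None, z=1.96, seed=0):
    """Return a list of (samples_used, estimate, half_width) after each batch.

    Samples are drawn without replacement in random order. Processing stops
    when the error bar half-width drops to error_threshold or when the
    sample budget is exhausted, whichever comes first.
    """
    if not data:
        raise ValueError("data must not be empty")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    budget = len(data) if max_samples is None else min(max_samples, len(data))
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)

    count, mean, m2 = 0, 0.0, 0.0  # Welford running mean and squared deviations
    history = []
    while count < budget:
        batch = order[count:min(count + batch_size, budget)]
        for index in batch:
            count += 1
            delta = data[index] - mean
            mean += delta / count
            m2 += delta * (data[index] - mean)
        if count > 1:
            # Assumption: normal approximation, no finite-population correction.
            half_width = z * math.sqrt(m2 / (count - 1) / count)
        else:
            half_width = math.inf
        history.append((count, mean, half_width))  # an answer is available now
        if half_width <= error_threshold:
            break
    return history


if __name__ == "__main__":
    rng = random.Random(1)
    values = [rng.gauss(10.0, 2.0) for _ in range(10000)]
    for used, estimate, half_width in estimate_mean_with_error_bars(values, 0.2):
        print(f"{used:5d} samples: {estimate:.3f} +/- {half_width:.3f}")
```

Each entry in the returned history is a usable answer with its own error bar, so a caller can report the first one immediately and refine it as later batches arrive.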
  • 13.
    Algorithms
      • Given any problem, data and a time budget
          – Automatically pick the best algorithm
      [Figure: Estimate vs. time for a simple and a sophisticated algorithm, with the true answer marked; annotations: "error too high", "pick sophisticated", "pick simple"]
    Machines
      • “The datacenter as a computer” still in its infancy
          – Special purpose clusters, e.g., Hadoop cluster
          – Highly variable performance
          – Hard to program
          – Hard to debug
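One hedged reading of "automatically pick the best algorithm" on the Algorithms slide above is a selector that chooses the most accurate candidate whose predicted running time fits the time budget, falling back to a simpler algorithm when the sophisticated one would take too long. The `Candidate` class, its cost models, and the error figures below are hypothetical and exist only to make the selection rule concrete.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A hypothetical algorithm description used only for this sketch."""
    name: str
    predicted_seconds: Callable[[int], float]  # cost model: data points -> seconds
    predicted_error: float                     # expected error if run to completion


def pick_algorithm(candidates: Sequence[Candidate], num_points: int,
                   time_budget_seconds: float) -> Optional[Candidate]:
    """Return the lowest-error candidate that fits the budget, or None."""
    feasible = [c for c in candidates
                if c.predicted_seconds(num_points) <= time_budget_seconds]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.predicted_error)


if __name__ == "__main__":
    simple = Candidate("simple", lambda n: 1e-6 * n, predicted_error=0.10)
    sophisticated = Candidate("sophisticated", lambda n: 1e-9 * n * n,
                              predicted_error=0.02)
    for budget in (0.5, 10.0, 2000.0):
        choice = pick_algorithm([simple, sophisticated], 1_000_000, budget)
        print(budget, choice.name if choice else "nothing fits")
```

A real system would have to learn the cost and error models rather than take them as given; the sketch assumes they are known in advance.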
  • 14.
    Machines
      • Make datacenter a real computer!
      • Share datacenter between multiple cluster computing apps
      • Provide new abstractions and services
      [Figure: layered stack. AMP stack: Datacenter “OS” (e.g., Mesos). Existing stack: Node OS (e.g. Linux), Node OS (e.g. Windows), …, Node OS (e.g. Linux)]
    Machines
      • Make datacenter a real computer!
      • Support existing cluster computing apps: Hive, Cassandra, Hypertable, MPI, Hadoop, …
      [Figure: layered stack. Apps: Hive, Cassandra, Hypertable, MPI, Hadoop, …. AMP stack: Datacenter “OS” (e.g., Mesos). Existing stack: Node OS (e.g. Linux), Node OS (e.g. Windows), …, Node OS (e.g. Linux)]
  • 15.
    Machines
      • Make datacenter a real computer!
      • PIQL: predictive & insightful query language
      • Spark: support interactive and iterative data analysis (e.g., ML algorithms)
      • SCADS: consistency-adjustable data store
      [Figure: layered stack. Apps: Hive, Cassandra, Hypertable, MPI, Hadoop, …, PIQL, Spark, SCADS. AMP stack: Datacenter “OS” (e.g., Mesos). Existing stack: Node OS (e.g. Linux), Node OS (e.g. Windows), …, Node OS (e.g. Linux)]
    Machines
      • Make datacenter a real computer!
      • Applications, tools
          • Advanced ML algorithms
          • Interactive data mining
          • Collaborative visualization
      [Figure: layered stack. Apps: Hive, Cassandra, Hypertable, MPI, Hadoop, …, PIQL, Spark, SCADS. AMP stack: Datacenter “OS” (e.g., Mesos). Existing stack: Node OS (e.g. Linux), Node OS (e.g. Windows), …, Node OS (e.g. Linux)]
  • 16.
    People
      • Humans can make sense of messy data!
    People
      • Make people an integrated part of the system!
          – Leverage human activity
          – Leverage human intelligence (crowdsourcing):
              • Curate and clean dirty data
              • Answer imprecise questions
              • Test and improve algorithms
      • Challenge
          – Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)
      [Figure: Machines + Algorithms send Questions to people; people return data, activity, and Answers]
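The challenge named on the "People" slide above, inconsistent answer quality, is commonly handled by asking several people the same question and combining their answers. The sketch below uses plain majority voting with an agreement score; this is a generic technique chosen for illustration, not a method described in the slides, and the function name and the 0.7 threshold are assumptions.

```python
from collections import Counter


def aggregate_answers(answers, min_agreement=0.7):
    """Combine redundant crowd answers to one question by majority vote.

    Returns (winning_answer, agreement, accepted). agreement is the share of
    workers who gave the winning answer; accepted is False when agreement is
    below min_agreement, signalling that more answers may be needed.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    winner, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return winner, agreement, agreement >= min_agreement


if __name__ == "__main__":
    print(aggregate_answers(["cat", "cat", "dog"]))
    print(aggregate_answers(["cat", "cat", "cat", "cat", "dog"]))
```

Asking more workers raises cost and latency, which is the trade-off across question type, time, and cost that the slide points to.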
  • 17.
    Real Applications
      • Mobile Millennium Project
          – Alex Bayen, Civil and Environmental Engineering, UC Berkeley
      • Microsimulation of urban development
          – Paul Waddell, College of Environmental Design, UC Berkeley
      • Crowd-based opinion formation
          – Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley
      • Personalized Sequencing
          – Taylor Sittler, UCSF
    Personalized Sequencing
  • 18.
    The AMP Lab
      Make sense of data at scale by tightly integrating algorithms, machines, and people
      [Figure labels: Algorithms, Machines, People; applications shown: Microsimulation, Mobile Millennium, Sequencing]
    Big Data in 2020
      Are you prepared?
      • To create a new generation of big data scientists
      • For ML to become an engineering discipline
      • For people to be deeply integrated in big data analysis pipeline
      • Will your institution
          – offer a big data curriculum touching all fields?
          – have hired cross-disciplinary faculty?
          – have invested in (pooled) storage infrastructure?
          – have invested in public/private clouds?
          – have built inter/intra campus networks?
  • 19.
    Summary
      • Goal: Tame Big Data Problem
          – Get results with right quality at the right time
      • Approach: Holistic integration of Algorithms, Machines, and People
      • Huge research issues across many domains