Introduction to Big Data

          Dr. Putchong Uthayopas
  Department of Computer Engineering,
Faculty of Engineering, Kasetsart University.
                pu@ku.ac.th
Agenda
•   Introduction and Motivation
•   Big Data Characteristics
•   Big Data Technology
•   Using Big Data
•   Trends
Introduction and Motivation
We are living in the world of Data

(Word cloud of data sources: social media, video surveillance, mobile sensors, smart grids, geophysical exploration, medical imaging, gene sequencing.)
Some Data Sizes
• ~40 × 10⁹ web pages at ~300 kilobytes each = 10 petabytes
• YouTube: 48 hours of video uploaded per minute; in two months of 2010, more video was uploaded than NBC, ABC, and CBS broadcast in total (~2.5 petabytes per year uploaded)
• LHC: 15 petabytes per year
• Radiology: 69 petabytes per year
• Square Kilometre Array telescope: will produce 100 terabits/second
• Earth observation: becoming ~4 petabytes per year
• Earthquake science: a few terabytes total today
• PolarGrid: hundreds of terabytes/year
• Exascale simulation data dumps: terabytes/second

http://www.touchagency.com/free-twitter-infographic/
Information as an Asset
• The cloud will enable ever-larger data to be easily collected and used
• People will deposit information into the cloud
  – a bank or personal warehouse for data
• New technology will emerge
  – Larger, scalable storage technology
  – Innovative and complex data analysis/visualization for
    multimedia data
  – Security technology to ensure privacy
• The cloud will become mankind's intelligence and memory!
“Data is the new oil.”
Andreas Weigend, Stanford (ex-Amazon)


Data is more like soup: it's
messy, and you don't know
what's in it….
The Coming of the Data Deluge
• In the past, most scientific disciplines could be described
  as small data, or even data poor. Most experiments or studies
  had to contend with just a few hundred or a few thousand
  data points.
• Now, thanks to massively complex new instruments and
  simulators, many disciplines are generating correspondingly
  massive data sets that are described as big data, or data rich.
   – Consider the Large Hadron Collider, which will eventually generate
     about 15 petabytes of data per year. A petabyte is about a million
     gigabytes, so that qualifies as a full-fledged data deluge.

    The Coming Data Deluge: As science becomes more data intensive, so does our language
    BY PAUL MCFEDRIES / IEEE SPECTRUM, FEBRUARY 2011
“Herculean” and “Heroic”: particle physics data
Scale: an explosion of data




     http://www.phgfoundation.org/reports/10364/



“A single sequencer can now generate in a day what it took 10
years to collect for the Human Genome Project”
Creating a connectome
• Neuroscientists have set the goal of creating a connectome, a
  complete map of the brain's neural circuitry.
   – An image of a cubic-millimeter chunk of the brain would comprise
     about 1 petabyte of data (at a 5-nanometer resolution).
   – There are about a million cubic millimeters of neural matter to map,
     making a total of about a thousand exabytes (an exabyte is about a
     thousand petabytes).
   – This qualifies as what Jim Gray once called an exaflood of data.

    The Coming Data Deluge: As science becomes more data intensive, so does our language
    BY PAUL MCFEDRIES / IEEE SPECTRUM, FEBRUARY 2011
The new model is for the data to be captured by
instruments or generated by simulations before
being processed by software and for the resulting
information or knowledge to be stored in computers.
Scientists only get to look at their data fairly late in
this pipeline. The techniques and technologies for
such data-intensive science are so different that it is
worth distinguishing data-intensive science from
computational science as a new, fourth paradigm for
scientific exploration.
—Jim Gray, computer scientist
•   The White House today announced a
    $200 million big-data initiative to
    create tools to improve scientific
    research by making sense of the huge
    amounts of data now available.
•   Grants and research programs are
    geared at improving the core
    technologies around managing and
    processing big data sets, speeding up
    scientific research with big data, and
    encouraging universities to train more
    data scientists and engineers.
•   The emergent field of data science is
    changing the direction and speed of
    scientific research by letting people
    fine-tune their inquiries by tapping
    into giant data sets.
•   Medical research, for example, is
    moving from broad-based treatments
    to highly targeted pharmaceutical
    testing for a segment of the
    population or people with specific
    genetic markers.
So, what is big data?
Big Data
“Big data is data that exceeds the processing
capacity of conventional database systems. The
data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures. To gain
value from this data, you must choose an
alternative way to process it.”




       Reference: “What is big data? An introduction to the big data
       landscape.”, Edd Dumbill, http://radar.oreilly.com/2012/01/what-is-big-
       data.html
Amazon View of Big Data

 'Big data' refers to a collection of tools, techniques
  and technologies which make it easy to work with
 data at any scale. These distributed, scalable tools
  provide flexible programming models to navigate
   and explore data of any shape and size, from a
                   variety of sources.
The Value of Big Data
• Analytical use
  – Big data analytics can reveal insights hidden
    previously by data too costly to process.
     • peer influence among customers, revealed by analyzing
       shoppers’ transactions, social and geographical data.
  – Being able to process every item of data in reasonable
    time removes the troublesome need for sampling and
    promotes an investigative approach to data.
• Enabling new products.
  – Facebook has been able to craft a highly personalized
    user experience and create a new kind of advertising
    business
Big Data Characteristics
3 Characteristics of Big Data

Volume     • Volumes of data are larger than those conventional
             relational database infrastructures can cope with.

Velocity   • The rate at which data flows in is much faster.
             • Mobile events and interactions by users
             • Video, image, and audio from users

Variety    • The source data is diverse and doesn't fall into neat
             relational structures, e.g. text from social networks,
             image data, or a raw feed directly from a sensor source.
Big Data Challenge
Volume
• How to process data so big that it cannot be moved or stored.

Velocity
• A lot of data arrives so fast that it cannot all be stored, such as web
  usage logs, Internet traffic, and mobile messages. Stream processing is
  needed to filter out unused data or extract knowledge in real time.

Variety
• So many types of unstructured data formats make conventional
  databases ineffective.
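The Velocity challenge above calls for filtering a stream as it arrives rather than storing everything. A minimal sketch in Python (the event format and predicate are hypothetical, not from the slides): events are examined once, uninteresting ones are dropped immediately, and only a bounded window of recent matches is kept in memory.

```python
from collections import deque

def stream_filter(events, keep, window=3):
    """Filter an incoming event stream, keeping only a bounded sliding
    window of the most recent events that pass the predicate."""
    window_buf = deque(maxlen=window)  # bounded memory: old events fall off
    for event in events:
        if keep(event):                # drop uninteresting events immediately
            window_buf.append(event)
            yield event, list(window_buf)

# Hypothetical mobile-usage log: (user, action) pairs.
log = [("u1", "click"), ("u2", "scroll"), ("u1", "purchase"),
       ("u3", "click"), ("u2", "purchase")]

purchases = [e for e, _ in stream_filter(iter(log), lambda e: e[1] == "purchase")]
print(purchases)  # [('u1', 'purchase'), ('u2', 'purchase')]
```

The point of the sketch is that memory use is constant no matter how long the stream runs, which is what makes real-time filtering of unbounded data feasible.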
Big Data Technology
From “Data Driven Discovery in Science:The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
What is needed for big data
•   Your data
•   Storage infrastructure
•   Computing infrastructure
•   Middleware to handle BIG Data
•   Data Analysis
    – Statistical analysis
    – Data Mining
• People
How to deal with big data
• Integration of
    –   Storage
    –   Processing
    –   Analysis algorithms
    –   Visualization
• Pipeline: a massive data stream passes through stream processing,
  into storage, and is then processed for analysis and visualization.
How can we store and process
           massive data?
• Beyond the capability of a single server
• Basic infrastructure
   – Cluster of servers
   – High-speed interconnect
   – High-speed storage cluster
• Incoming data is spread across the server farm
• Processing is quickly distributed to the farm
• Results are collected and sent back
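Spreading incoming data across a server farm is commonly done by hashing each record's key to a node. A minimal sketch (the server names are hypothetical): the same key always maps to the same node, so later lookups need no central index.

```python
import hashlib

SERVERS = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical farm

def place(key: str) -> str:
    """Deterministically map a record key to one server in the farm."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[digest % len(SERVERS)]

# Incoming records are spread across the farm.
records = ["user:1", "user:2", "sensor:42", "video:9"]
placement = {k: place(k) for k in records}
assert place("user:1") == placement["user:1"]  # placement is stable
```

Note that this naive modulo scheme reshuffles most keys when the farm size changes; production systems typically use consistent hashing for that reason.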
NoSQL (Not Only SQL)
• Next-generation databases mostly addressing some
  of these points:
   – Non-relational, distributed, open-source,
     and horizontally scalable
   – Used to handle huge amounts of data
   – The original intention was modern web-scale
     databases

 Reference: http://nosql-database.org/
MongoDB
•   MongoDB is a general purpose, open-
    source database.
•   MongoDB features:
     –   Document data model with dynamic
         schemas
     –   Full, flexible index support and rich queries
     –   Auto-Sharding for horizontal scalability
     –   Built-in replication for high availability
     –   Text search
     –   Advanced security
     –   Aggregation Framework and MapReduce
     –   Large media storage with GridFS
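The "document data model with dynamic schemas" above means records in one collection are JSON-like documents that need not share fields. A tiny illustration in plain Python (the documents and the `find` helper are stand-ins, not MongoDB's actual implementation, though the query-by-example style mirrors MongoDB's `find()`):

```python
# Documents in one collection need not share a schema (dynamic schemas).
collection = [
    {"_id": 1, "name": "Ann", "tags": ["gridfs", "sharding"]},
    {"_id": 2, "name": "Bob", "age": 31},          # extra field, no 'tags'
    {"_id": 3, "name": "Eve", "tags": ["hadoop"]},
]

def find(coll, query):
    """Return documents whose fields match every key/value in `query`
    (a tiny stand-in for a document database's query-by-example)."""
    return [doc for doc in coll
            if all(doc.get(k) == v for k, v in query.items())]

print(find(collection, {"name": "Bob"}))  # [{'_id': 2, 'name': 'Bob', 'age': 31}]
```

Because no schema is enforced up front, adding a new field to new documents requires no migration of existing ones, which is what makes the model attractive for fast-changing web data.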
What is Hadoop?

- Hadoop (Apache Hadoop) is an open-source software framework

- Supports data-intensive distributed applications

- Developed by the Apache Software Foundation

- Derived from Google's MapReduce and Google File
  System (GFS) papers

- Implemented in Java
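The MapReduce model that Hadoop implements can be sketched in a few lines of Python. This is a single-process simulation of the three phases (map, shuffle, reduce) on the classic word-count example, not Hadoop's distributed runtime:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Mapper: emit (word, 1) for every word in one input split.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would
    # when routing mapper output to reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all counts for one word.
    return key, sum(values)

docs = ["big data big tools", "data deluge"]
mapped = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'deluge': 1}
```

In Hadoop, each mapper runs on the node holding its input split and only the shuffled key groups cross the network, which is how the framework moves computation to the data rather than the reverse.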
Overview

(Diagram: a master node coordinating many worker nodes.)

HDFS
Google Cloud Platform
• App Engine
   – mobile and web apps
• Cloud SQL
   – MySQL on the cloud
• Cloud Storage
   – data storage
• BigQuery
   – data analysis
• Google Compute Engine
   – processing of large data
Amazon
• Amazon EC2
  – Computation service using VMs
• Amazon DynamoDB
  – Large, scalable NoSQL database
  – Fully distributed, shared-nothing architecture
• Amazon Elastic MapReduce (Amazon EMR)
  – Hadoop-based analysis engine
  – Can be used to analyse data from DynamoDB
Issues
• The I/O capability of a single computer is limited; how do
  we handle massive data?
• Big data cannot be moved
  – Careful planning must be done to handle big data
  – Processing capability must be there from the start
Using Big Data
WHAT FACEBOOK KNOWS

Cameron Marlow calls himself Facebook's "in-house
sociologist." He and his team can analyze essentially
all the information the site gathers.

http://www.facebook.com/data
Study of Human Society
• Facebook, in collaboration with the University
  of Milan, conducted an experiment that involved
  – the entire social network as of May 2011
  – more than 10 percent of the world's population.
• Analyzing the 69 billion friend connections
  among those 721 million people showed that
  – four intermediary friends are usually enough to
    introduce anyone to a random stranger.
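The "degrees of separation" behind that finding is a shortest-path computation over the friendship graph. A small sketch with breadth-first search on a toy graph (the graph itself is invented for illustration; Facebook's actual computation used approximate algorithms at a vastly larger scale):

```python
from collections import deque

def degrees(friends, start, goal):
    """Shortest chain length between two users via BFS over friendships."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        user, dist = frontier.popleft()
        if user == goal:
            return dist
        for friend in friends.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, dist + 1))
    return None  # not connected

# Toy friendship graph (undirected edges listed both ways).
friends = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(degrees(friends, "a", "d"))  # 3
```

Averaging this distance over all pairs of 721 million users is exactly the kind of computation that is infeasible on one machine and motivates the distributed tooling discussed earlier.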
The Links of Love
•   Often young women specify that
    they are “in a relationship” with
    their “best friend forever”.
     – Roughly 20% of all relationships for
       the 15-and-under crowd are
       between girls.
     – This number dips to 15% for 18-
       year-olds and is just 7% for 25-year-
       olds.
•   Among anonymous US users who were
    over 18 at the start of the
    relationship,
     – the average of the shortest number
       of steps to get from any one U.S.
       user to any other individual is 16.7.
     – This is much higher than the 4.74
       steps you'd need to go from any
       Facebook user to another through
       friendship, as opposed to romantic,
       ties.

(Graph showing the relationships of anonymous US users who were
over 18 at the start of the relationship.)

http://www.facebook.com/notes/facebook-data-team/the-links-of-love/10150572088343859
Why?
• Facebook can improve the user experience
  – make useful predictions about users' behavior
  – make better guesses about which ads you might
    be more or less open to at any given time
• Right before Valentine's Day this year, a blog
  post from the Data Science Team listed the
  songs most popular with people who had
  recently signaled on Facebook that they had
  entered or left a relationship.
How does Facebook handle big data?
• Facebook built its data storage system using open-
  source software called Hadoop.
   – Hadoop spreads data across many machines inside a
     data center.
   – Facebook uses Hive, an open-source layer that acts as a
     translation service, making it possible to query vast Hadoop
     data stores using relatively simple code.
• Much of Facebook's data resides in one Hadoop store
  more than 100 petabytes (a million gigabytes) in size,
  says Sameet Agarwal, a director of engineering at
  Facebook who works on data infrastructure, and the
  quantity is growing exponentially. "Over the last few
  years we have more than doubled in size every year."
Google Flu
•   A pattern emerges when all the flu-
    related search queries are added
    together.
•   We compared our query counts with
    traditional flu surveillance systems
    and found that many search queries
    tend to be popular exactly when flu
    season is happening.
•   By counting how often we see these
    search queries, we can estimate how
    much flu is circulating in different
    countries and regions around the
    world.

http://www.google.org/flutrends/about/how.html
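The core idea (estimate flu activity from query volume by calibrating against surveillance data) can be sketched as a one-parameter least-squares fit. The weekly numbers below are invented for illustration, and the actual Google Flu Trends model was considerably more sophisticated:

```python
# Hypothetical weekly data: flu-related query counts vs. reported cases.
queries = [120, 150, 300, 280, 90]
cases   = [240, 310, 615, 560, 175]

# Fit cases ~ a * queries by least squares (closed form for a single slope
# through the origin: a = sum(x*y) / sum(x*x)).
a = sum(q * c for q, c in zip(queries, cases)) / sum(q * q for q in queries)

def estimate(query_count):
    """Estimate circulating flu from the current week's query volume."""
    return a * query_count

print(round(estimate(200)))  # 405
```

The appeal is timeliness: query counts are available immediately, while traditional surveillance reports lag by a week or two.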
From “Data Driven Discovery in Science:The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
Preparing for Big Data
• Understanding and preparing your data
    – To effectively analyse and, more importantly, cross-analyse your data sets (this is often
      where the most insightful results come from), you need a rigorous knowledge of
      what data you have.
• Getting staff up to scratch
    –   Finding people with data analysis experience
• Defining the business objectives
    –   Once the end goal has been decided, a strategy can be created for implementing big data analytics
        to support the delivery of this goal.
• Sourcing the right suppliers and technology
    –   In terms of storage, hardware, and data warehousing, you will need to make a range of decisions to
        make sure you have all the capabilities and functionality required to meet your big data needs.

http://www.thebigdatainsightgroup.com/site/article/preparing-big-data-revolution
Trends
Trends
• A move toward large and scalable virtual
  infrastructure
  – Providing computing services
  – Providing basic storage services
  – Providing scalable large databases
     • NoSQL
  – Providing analysis services
• All these services have to come together
  – Big data cannot be moved!
Issues
• Security
   – Will you let important data accumulate outside your
     organization?
       • If it is not important data, why analyze it?
   – Who owns the data? If you discontinue the service, is the data
     destroyed properly?
   – Protection in a multi-tenant environment
• Big data cannot be moved easily
   – Processing has to be near the data; you just cannot ship data around.
       • So you finally have to select the same cloud for your processing. Is it
         available, easy, fast?
• New learning and development costs
   – Need new programming, porting?
   – Are the tools mature enough?
When to use big data on the cloud
• When data is already on the cloud
  – Virtual organizations
  – Cloud-based SaaS services
• For startups
  –   CAPEX to OPEX
  –   No need to maintain large infrastructure
  –   Focus on scalability and pay as you go
  –   Data is on the cloud anyway
• For experimental projects
  – Pilots for new services
Summary
• Big data is coming.
  – It is changing the way we do science
  – Big data is being accumulated anyway
  – Knowledge is power.
     • Better understand your customers so you can offer
       better service
• Tools and technology are available
  – Still developing fast
Thank you

 
Data-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and CloudData-intensive bioinformatics on HPC and Cloud
Data-intensive bioinformatics on HPC and Cloud
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 

Plus de Putchong Uthayopas (13)

Education in Disrupted World
Education in Disrupted WorldEducation in Disrupted World
Education in Disrupted World
 
Portrait Photography
Portrait PhotographyPortrait Photography
Portrait Photography
 
MOOC Wunca Talk
MOOC Wunca TalkMOOC Wunca Talk
MOOC Wunca Talk
 
Future of the cloud
Future of the cloud Future of the cloud
Future of the cloud
 
10 things
10 things10 things
10 things
 
IT trends for co-creation
IT trends for co-creationIT trends for co-creation
IT trends for co-creation
 
Cloud Computing: A New Trend in IT
Cloud Computing: A New Trend in ITCloud Computing: A New Trend in IT
Cloud Computing: A New Trend in IT
 
Learning Life and Photography
Learning Life and PhotographyLearning Life and Photography
Learning Life and Photography
 
What is Cloud Computing ?
What is Cloud Computing ?What is Cloud Computing ?
What is Cloud Computing ?
 
Simple Introduction to Cloud for Users
Simple Introduction to Cloud for UsersSimple Introduction to Cloud for Users
Simple Introduction to Cloud for Users
 
The Building of Thai Grid
The Building of Thai GridThe Building of Thai Grid
The Building of Thai Grid
 
Current Trends in HPC
Current Trends in HPCCurrent Trends in HPC
Current Trends in HPC
 
Project Evaluation
Project EvaluationProject Evaluation
Project Evaluation
 

Dernier

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajanpragatimahajan3
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 

Dernier (20)

Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 

Big Data

  • 1. Introduction to Big Data Dr. Putchong Uthayopas Department of Computer Engineering, Faculty of Engineering, Kasetsart University. pu@ku.ac.th
  • 2. Agenda • Introduction and Motivation • Big Data Characteristics • Big Data Technology • Using Big Data • Trends
  • 4. We are living in the world of Data: Video Surveillance, Social Media, Mobile Sensors, Gene Sequencing, Smart Grids, Geophysical Exploration, Medical Imaging
  • 5. Some data sizes: ~40 × 10^9 web pages at ~300 kilobytes each = ~10 petabytes. YouTube: 48 hours of video uploaded per minute; in 2 months in 2010, more video was uploaded than NBC, ABC and CBS ever broadcast (~2.5 petabytes per year uploaded?). LHC: 15 petabytes per year. Radiology: 69 petabytes per year. The Square Kilometre Array telescope will produce 100 terabits/second. Earth observation: approaching ~4 petabytes per year. Earthquake science: a few terabytes total today. PolarGrid: hundreds of terabytes/year. Exascale simulation data dumps: terabytes/second.
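The web-page figure above is a back-of-envelope product, which can be sanity-checked directly (a sketch; the slide's numbers are rough estimates, not measurements):

```python
# Rough check of the slide's estimate: ~40 x 10^9 pages at ~300 KB each.
pages = 40e9            # ~40 billion web pages
bytes_per_page = 300e3  # ~300 kilobytes each

petabytes = pages * bytes_per_page / 1e15
print(f"~{petabytes:.0f} PB")  # ~12 PB, i.e. on the order of 10 petabytes
```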
  • 8. Information as an Asset • The cloud will enable larger and larger data sets to be easily collected and used • People will deposit information into the cloud – a bank, a personal warehouse • New technology will emerge – Larger and scalable storage technology – Innovative and complex data analysis/visualization for multimedia data – Security technology to ensure privacy • The cloud will become mankind's intelligence and memory!
  • 9. “Data is the new oil.” Andreas Weigend, Stanford (ex Amazon). Data is more like soup – it’s messy and you don’t know what’s in it…
  • 10. The Coming of Data Deluge • In the past, most scientific disciplines could be described as small data, or even data poor. Most experiments or studies had to contend with just a few hundred or a few thousand data points. • Now, thanks to massively complex new instruments and simulators, many disciplines are generating correspondingly massive data sets that are described as big data, or data rich. – Consider the Large Hadron Collider, which will eventually generate about 15 petabytes of data per year. A petabyte is about a million gigabytes, so that qualifies as a full-fledged data deluge. The Coming Data Deluge: As science becomes more data intensive, so does our language BY PAUL MCFEDRIES / IEEE SPECTRUM FEBRUARY 2011
  • 12. Scale: an explosion of data http://www.phgfoundation.org/reports/10364/ “A single sequencer can now generate in a day what it took 10 years to collect for the Human Genome Project”
  • 13. Creating a connectome • Neuroscientists have set the goal of creating a connectome, a complete map of the brain's neural circuitry. – An image of a cubic millimeter chunk of the brain would comprise about 1 petabyte of data (at a 5-nanometer resolution). – There are about a million cubic millimeters of neural matter to map, making a total of about a thousand exabytes (an exabyte is about a thousand petabytes). – This qualifies as what Jim Gray once called an exaflood of data. The Coming Data Deluge: As science becomes more data intensive, so does our language BY PAUL MCFEDRIES / IEEE SPECTRUM FEBRUARY 2011
  • 14. The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration. —Jim Gray, computer scientist
  • 18. The White House today announced a $200 million big-data initiative to create tools to improve scientific research by making sense of the huge amounts of data now available. • Grants and research programs are aimed at improving the core technologies for managing and processing big data sets, speeding up scientific research with big data, and encouraging universities to train more data scientists and engineers. • The emergent field of data science is changing the direction and speed of scientific research by letting people fine-tune their inquiries by tapping into giant data sets. • Medical research, for example, is moving from broad-based treatments to highly targeted pharmaceutical testing for a segment of the population or people with specific genetic markers.
  • 20. So, what is big data?
  • 21. Big Data “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.” Reference: “What is big data? An introduction to the big data landscape.”, Edd Dumbill, http://radar.oreilly.com/2012/01/what-is-big-data.html
  • 22. Amazon View of Big Data 'Big data' refers to a collection of tools, techniques and technologies which make it easy to work with data at any scale. These distributed, scalable tools provide flexible programming models to navigate and explore data of any shape and size, from a variety of sources.
  • 23. The Value of Big Data • Analytical use – Big data analytics can reveal insights previously hidden by data that was too costly to process, e.g. peer influence among customers, revealed by analyzing shoppers’ transactions, social and geographical data. – Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data. • Enabling new products – Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business.
  • 25. 3 Characteristics of Big Data • Volume – Volumes of data are larger than those conventional relational database infrastructures can cope with. • Velocity – The rate at which data flows in is much faster: mobile events and interactions by users; video, image and audio from users. • Variety – The source data is diverse and doesn’t fall into neat relational structures, e.g. text from social networks, image data, a raw feed directly from a sensor source.
  • 26. Big Data Challenge • Volume – How to process data so big that it cannot be moved or stored. • Velocity – A lot of data arrives so fast that it cannot all be stored, e.g. web usage logs, Internet and mobile messages. Stream processing is needed to filter out unused data or extract some knowledge in real time. • Variety – So many types of unstructured data formats make conventional databases useless.
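The velocity point above can be sketched in a few lines: process events one at a time, discard the records you do not need, and keep only a running aggregate instead of the raw stream. The event fields (`host`, `level`) are invented for illustration:

```python
# Minimal stream-processing sketch: filter a stream of log events and
# keep a running count instead of storing the raw data.
from collections import Counter

def process_stream(events):
    """Keep only error events and count them per host on the fly."""
    errors_per_host = Counter()
    for event in events:          # events may be an unbounded generator
        if event["level"] != "ERROR":
            continue              # drop data we do not need to store
        errors_per_host[event["host"]] += 1
    return errors_per_host

stream = iter([
    {"host": "web1", "level": "INFO"},
    {"host": "web1", "level": "ERROR"},
    {"host": "web2", "level": "ERROR"},
    {"host": "web1", "level": "ERROR"},
])
print(process_stream(stream))  # Counter({'web1': 2, 'web2': 1})
```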
  • 28. From “Data Driven Discovery in Science:The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
  • 29. What is needed for big data • Your data • Storage infrastructure • Computing infrastructure • Middleware to handle BIG Data • Data Analysis – Statistical analysis – Data Mining • People
  • 30. How to deal with big data • Integration of – Storage – Processing – Analysis algorithms – Visualization (diagram: a pipeline for processing massive data, with stream processing feeding storage, processing, analysis and visualization)
  • 31. How can we store and process massive data • Beyond the capability of a single server • Basic infrastructure – Cluster of servers – High-speed interconnect – High-speed storage cluster • Incoming data will be spread across the server farm • Processing is quickly distributed to the farm • Results are collected and sent back
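The scatter/process/gather pattern described in that slide can be sketched as follows. A real system spreads shards over a server farm; here a local process pool stands in for the cluster, and the sharding scheme is an illustrative choice:

```python
# Sketch of scatter/process/gather: split the data into shards, let each
# worker process its shard, then combine the partial results.
from concurrent.futures import ProcessPoolExecutor

def shard(data, n):
    """Split data into n roughly equal shards (round-robin)."""
    return [data[i::n] for i in range(n)]

def process_shard(numbers):
    """Per-shard work: here, just a partial sum."""
    return sum(numbers)

if __name__ == "__main__":
    data = list(range(1_000))
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = pool.map(process_shard, shard(data, 4))
    print(sum(partials))  # 499500, same as summing on one machine
```

The same split/combine shape underlies MapReduce-style systems; only the per-shard work and the combine step change.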
  • 32. NoSQL (Not Only SQL) • Next-generation databases mostly addressing some of these points: – being non-relational, distributed, open-source and horizontally scalable – used to handle huge amounts of data – the original intention was modern web-scale databases. Reference: http://nosql-database.org/
  • 33. MongoDB • MongoDB is a general-purpose, open-source database. • MongoDB features: – Document data model with dynamic schemas – Full, flexible index support and rich queries – Auto-sharding for horizontal scalability – Built-in replication for high availability – Text search – Advanced security – Aggregation Framework and MapReduce – Large media storage with GridFS
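The document model with dynamic schemas can be imitated in plain Python to show the idea: documents in one collection need not share fields, and queries match on field values. This is a sketch of the concept only, not the pymongo API:

```python
# Sketch of a document data model: dicts as documents, no fixed schema.
users = [
    {"_id": 1, "name": "Ann", "tags": ["admin"]},
    {"_id": 2, "name": "Bob", "age": 30},  # different fields: dynamic schema
]

def find(collection, query):
    """Return documents whose fields match every key/value in query."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

print(find(users, {"name": "Bob"}))  # [{'_id': 2, 'name': 'Bob', 'age': 30}]
```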
  • 34. What is Hadoop? - Hadoop (Apache Hadoop) is an open-source software framework - supports data-intensive distributed applications - developed by the Apache Software Foundation - derived from Google's MapReduce and Google File System (GFS) papers - implemented in Java
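Hadoop implements Google's MapReduce model. The canonical word-count example can be sketched in pure Python to show the three phases; this illustrates the model only, not Hadoop's Java API:

```python
# Pure-Python sketch of MapReduce word count: map emits (key, 1) pairs,
# the shuffle groups pairs by key, and reduce sums each group.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data moves fast"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 2, 'is': 1, 'moves': 1, 'fast': 1}
```

In a real Hadoop job, the map and reduce functions run in parallel on many machines, and the shuffle moves data between them over the network.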
  • 35. Overview (diagram: master/worker architecture)
  • 36. HDFS
  • 37. Google Cloud Platform • App engines – mobile and web app • Cloud SQL – MySQL on the cloud • Cloud Storage – Data storage • Big Query – Data analysis • Google Compute Engine – Processing of large data
  • 38. Amazon • Amazon EC2 – Computation service using VMs • Amazon DynamoDB – Large, scalable NoSQL database – Fully distributed shared-nothing architecture • Amazon Elastic MapReduce (Amazon EMR) – Hadoop-based analysis engine – Can be used to analyse data from DynamoDB
  • 39. Issues • The I/O capability of a single computer is limited; how do we handle massive data? • Big data cannot be moved – Careful planning must be done to handle big data – Processing capability must be there from the start
  • 41. WHAT FACEBOOK KNOWS Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers. http://www.facebook.com/data
  • 42. Study of Human Society • Facebook, in collaboration with the University of Milan, conducted an experiment that involved – the entire social network as of May 2011 – more than 10 percent of the world's population. • Analyzing the 69 billion friend connections among those 721 million people showed that – four intermediary friends are usually enough to introduce anyone to a random stranger.
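The "four intermediary friends" finding is a shortest-path measurement over the friend graph. Breadth-first search computes that path length; the toy graph below is invented for illustration (Facebook's actual computation used approximate algorithms at far larger scale):

```python
# BFS sketch of "degrees of separation" on a friend graph.
from collections import deque

def degrees_of_separation(graph, start, goal):
    """Length of the shortest friend chain between two users, or -1."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        user, dist = queue.popleft()
        if user == goal:
            return dist
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, dist + 1))
    return -1

friends = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(degrees_of_separation(friends, "a", "d"))  # 3
```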
  • 43. The Links of Love • Often young women specify that they are “in a relationship” with their “best friend forever”. – Roughly 20% of all relationships for the 15-and-under crowd are between girls. – This number dips to 15% for 18-year-olds and is just 7% for 25-year-olds. • For anonymous US users who were over 18 at the start of the relationship – the average of the shortest number of steps to get from any one U.S. user to any other individual is 16.7. – This is much higher than the 4.74 steps you’d need to go from any Facebook user to another through friendship, as opposed to romantic, ties. Graph shows the relationships of anonymous US users who were over 18 at the start of the relationship. http://www.facebook.com/notes/facebook-data-team/the-links-of-love/10150572088343859
  • 44. Why? • Facebook can improve users' experience – make useful predictions about users' behavior – make better guesses about which ads you might be more or less open to at any given time • Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship.
  • 45. How does Facebook handle Big Data? • Facebook built its data storage system using open-source software called Hadoop. – Hadoop spreads the data across many machines inside a data center. – They use Hive, open-source software that acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code. • Much of Facebook's data resides in one Hadoop store more than 100 petabytes (a million gigabytes) in size, says Sameet Agarwal, a director of engineering at Facebook who works on data infrastructure, and the quantity is growing exponentially. "Over the last few years we have more than doubled in size every year."
  • 46. Google Flu • A pattern emerges when all the flu-related search queries are added together. • We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening. • By counting how often we see these search queries, we can estimate how much flu is circulating in different countries and regions around the world. http://www.google.org/flutrends/about/how.html
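The counting idea behind Flu Trends can be sketched in a few lines: tally how often flu-related queries appear per region and use the rate as a crude activity signal. The query terms, regions, and log format below are invented for illustration; the real system used far more sophisticated term selection and modeling:

```python
# Toy sketch of estimating flu activity from search-query counts.
from collections import Counter

FLU_TERMS = {"flu symptoms", "fever remedy", "influenza"}

def flu_query_rate(query_log):
    """Fraction of each region's queries that are flu-related."""
    total, flu = Counter(), Counter()
    for region, query in query_log:
        total[region] += 1
        if query in FLU_TERMS:
            flu[region] += 1
    return {r: flu[r] / total[r] for r in total}

log = [("US", "flu symptoms"), ("US", "pizza"), ("TH", "influenza"),
       ("TH", "flu symptoms"), ("TH", "weather")]
print(flu_query_rate(log))
```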
  • 47. From “Data Driven Discovery in Science:The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
  • 48. From “Data Driven Discovery in Science:The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
  • 56. Preparing for Big Data • Understanding and preparing your data – To effectively analyse and, more importantly, cross-analyse your data sets – this is often where the most insightful results come from – you need to have a rigorous knowledge of what data you have. • Getting staff up to scratch – Finding people with data analysis experience. • Defining the business objectives – Once the end goal has been decided, a strategy can be created for implementing big data analytics to support the delivery of this goal. • Sourcing the right suppliers and technology – In terms of storage, hardware, and data warehousing, you will need to make a range of decisions to make sure you have all the capabilities and functionality required to meet your big data needs. http://www.thebigdatainsightgroup.com/site/article/preparing-big-data-revolution
  • 58. Trends • A move toward large and scalable virtual infrastructure – Providing computing services – Providing basic storage services – Providing scalable large databases (NoSQL) – Providing analysis services • All these services have to come together – Big data cannot be moved!
  • 59. Issues • Security – Will you let important data accumulate outside your organization? • If it is not important data, why analyze it? – Who owns the data? If you discontinue the service, is the data destroyed properly? – Protection in a multi-tenant environment • Big data cannot be moved easily – Processing has to be nearby; you just cannot ship data around • So you finally have to select the same cloud for your processing. Is it available, easy, fast? • New learning and development costs – Need new programming, porting? – Are the tools mature enough?
  • 60. When to use big data on the cloud • When data is already on the cloud – Virtual organization – Cloud-based SaaS service • For startups – CAPEX to OPEX – No need to maintain large infrastructure – Focus on scalability and pay as you go – Data is on the cloud anyway • For experimental projects – Pilots for new services
  • 61. Summary • Big data is coming – Changing the way we do science – Big data is being accumulated anyway – Knowledge is power • Better understand your customers so you can offer better services • Tools and technology are available – Still being developed fast