4. We are living in the world of Data
Video Surveillance
Social Media
Mobile Sensors
Gene Sequencing
Smart Grids
Geophysical Exploration
Medical Imaging
5. Some Data sizes
~40 × 10^9 Web pages at ~300 kilobytes each = ~12 petabytes (on the order of 10 petabytes)
YouTube: 48 hours of video uploaded per minute;
in 2 months of 2010, more video was uploaded than the combined total of NBC, ABC, and CBS
~2.5 petabytes per year uploaded?
LHC 15 petabytes per year
Radiology 69 petabytes per year
Square Kilometer Array Telescope will generate 100
terabits/second
Earth Observation becoming ~4 petabytes per year
Earthquake Science – few terabytes total today
PolarGrid – 100’s terabytes/year
Exascale simulation data dumps – terabytes/second
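The Web-size estimate above can be checked with quick arithmetic. This is a minimal sketch assuming decimal (SI) units; the variable names are illustrative. The product comes out near 12 petabytes, the same order as the ~10 petabytes quoted.

```python
# Back-of-envelope check of two figures from the list, assuming SI units.
KILOBYTE = 10**3
PETABYTE = 10**15

web_pages = 40 * 10**9            # ~40 billion pages (figure from the slide)
bytes_per_page = 300 * KILOBYTE   # ~300 kilobytes each (figure from the slide)

total_petabytes = web_pages * bytes_per_page / PETABYTE
print(f"Estimated Web size: {total_petabytes:.0f} petabytes")

# 100 terabits/second expressed in terabytes/second (8 bits per byte).
ska_terabytes_per_second = 100 / 8
print(f"SKA data rate: {ska_terabytes_per_second} terabytes/second")
```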
8. Information as an Asset
• Cloud will enable larger and larger data to be
easily collected and used
• People will deposit information into the cloud
– Bank, personal warehouse
• New technology will emerge
– Larger and scalable storage technology
– Innovative and complex data analysis/visualization for
multimedia data
– Security technology to ensure privacy
• The cloud will be mankind's intelligence and memory!
9. “Data is the new oil.”
Andreas Weigend, Stanford (ex Amazon)
Data is more like soup – it’s
messy and you don’t know
what’s in it….
10. The Coming of Data Deluge
• In the past, most scientific disciplines could be described
as small data, or even data poor. Most experiments or studies
had to contend with just a few hundred or a few thousand
data points.
• Now, thanks to massively complex new instruments and
simulators, many disciplines are generating correspondingly
massive data sets that are described as big data, or data rich.
– Consider the Large Hadron Collider, which will eventually generate
about 15 petabytes of data per year. A petabyte is about a million
gigabytes, so that qualifies as a full-fledged data deluge.
The Coming Data Deluge: As science becomes more data intensive, so does our language
BY PAUL MCFEDRIES / IEEE SPECTRUM, FEBRUARY 2011
12. Scale: an explosion of data
http://www.phgfoundation.org/reports/10364/
“A single sequencer can now generate in a day what it took 10
years to collect for the Human Genome Project”
13. Creating a connectome
• Neuroscientists have set the goal of creating a connectome, a
complete map of the brain's neural circuitry.
– An image of a cubic-millimeter chunk of the brain would comprise
about 1 petabyte of data (at a 5-nanometer resolution).
– There are about a million cubic millimeters of neural matter to map,
making a total of about a thousand exabytes (an exabyte is about a
thousand petabytes).
– That qualifies as what Jim Gray once called an exaflood of data.
The Coming Data Deluge: As science becomes more data intensive, so does our language
BY PAUL MCFEDRIES / IEEE SPECTRUM, FEBRUARY 2011
14. The new model is for the data to be captured by
instruments or generated by simulations before
being processed by software and for the resulting
information or knowledge to be stored in computers.
Scientists only get to look at their data fairly late in
this pipeline. The techniques and technologies for
such data-intensive science are so different that it is
worth distinguishing data-intensive science from
computational science as a new, fourth paradigm for
scientific exploration.
—Jim Gray, computer scientist
18. • The White House today announced a
$200 million big-data initiative to
create tools to improve scientific
research by making sense of the huge
amounts of data now available.
• Grants and research programs are
geared at improving the core
technologies around managing and
processing big data sets, speeding up
scientific research with big data, and
encouraging universities to train more
data scientists and engineers.
• The emergent field of data science is
changing the direction and speed of
scientific research by letting people
fine-tune their inquiries by tapping
into giant data sets.
• Medical research, for example, is
moving from broad-based treatments
to highly targeted pharmaceutical
testing for a segment of the
population or people with specific
genetic markers.
21. Big Data
“Big data is data that exceeds the processing
capacity of conventional database systems. The
data is too big, moves too fast, or doesn’t fit the
strictures of your database architectures. To gain
value from this data, you must choose an
alternative way to process it.”
Reference: “What is big data? An introduction to the big data
landscape”, Edd Dumbill, http://radar.oreilly.com/2012/01/what-is-big-data.html
22. Amazon View of Big Data
'Big data' refers to a collection of tools, techniques
and technologies which make it easy to work with
data at any scale. These distributed, scalable tools
provide flexible programming models to navigate
and explore data of any shape and size, from a
variety of sources.
23. The Value of Big Data
• Analytical use
– Big data analytics can reveal insights previously hidden
in data that was too costly to process.
• Example: peer influence among customers, revealed by analyzing
shoppers’ transactions and social and geographical data.
– Being able to process every item of data in reasonable
time removes the troublesome need for sampling and
promotes an investigative approach to data.
• Enabling new products.
– Facebook has been able to craft a highly personalized
user experience and create a new kind of advertising
business
25. 3 Characteristics of Big Data
Volume
• Volumes of data are larger than those conventional
relational database infrastructures can cope with.
Velocity
• The rate at which data flows in is much faster.
• Mobile events and interactions by users.
• Video, image, and audio from users.
Variety
• The source data is diverse and doesn’t fall into neat
relational structures, e.g. text from social networks,
image data, a raw feed directly from a sensor source.
26. Big Data Challenge
Volume
• How to process data so big that it cannot be moved or stored.
Velocity
• A lot of data arrives so fast that it cannot all be stored, such as Web
usage logs, Internet traffic, and mobile messages. Stream processing is needed
to filter out unneeded data or extract knowledge in real time.
Variety
• There are so many types of unstructured data formats that conventional
databases are of little use.
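The Velocity point can be sketched in a few lines: records are filtered and summarized one at a time as they arrive, so only a small running summary is kept, not the raw stream. This is a minimal illustration, not a production stream processor. The record layout (`path`, `status`) and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def keep_errors(records: Iterable[Dict]) -> Iterator[Dict]:
    """Filter step: drop records we do not need, one at a time."""
    for record in records:
        if record["status"] >= 500:
            yield record


def count_by_path(records: Iterable[Dict]) -> Counter:
    """Extraction step: keep only a small running summary."""
    counts = Counter()
    for record in records:
        counts[record["path"]] += 1
    return counts


def log_stream() -> Iterator[Dict]:
    """Stand-in for an unbounded Web usage log (toy values)."""
    yield {"path": "/home", "status": 200}
    yield {"path": "/search", "status": 500}
    yield {"path": "/home", "status": 503}
    yield {"path": "/search", "status": 502}


if __name__ == "__main__":
    # Generators are chained, so no stage holds the whole stream in memory.
    summary = count_by_path(keep_errors(log_stream()))
    print(summary.most_common())
```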
28. From “Data Driven Discovery in Science: The Fourth Paradigm”, Alex Szalay, Johns Hopkins University
29. What is needed for big data
• Your data
• Storage infrastructure
• Computing infrastructure
• Middleware to handle BIG Data
• Data Analysis
– Statistical analysis
– Data Mining
• People
30. How to deal with big data
• Integration of
– Storage
– Processing
– Analysis Algorithm
– Visualization
[Diagram labels: Massive Data, Data Stream, Stream Processing, Storage, Processing, Analysis, Visualize]
31. How can we store and process
massive data
• Beyond the capability of a single server
• Basic infrastructure
– Cluster of servers
– High-speed interconnect
– High-speed storage cluster
• Incoming data is spread across the server farm
• Processing is quickly distributed to the farm
• Results are collected and sent back
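The spread / process / collect pattern above can be sketched on one machine, with worker threads standing in for servers. This is a minimal sketch under that assumption. The function names are hypothetical, and a real cluster would also have to handle network transfer and failures.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def partition(items: List[int], n_workers: int) -> List[List[int]]:
    """Spread incoming data across workers (round-robin)."""
    return [items[i::n_workers] for i in range(n_workers)]


def process_chunk(chunk: List[int]) -> int:
    """Work done locally on each 'server': here, a partial sum."""
    return sum(chunk)


def scatter_gather(items: List[int], n_workers: int = 4) -> int:
    """Distribute the work, then collect and combine the partial results."""
    chunks = partition(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(scatter_gather(list(range(1, 101))))  # 5050
```

The combine step works here because addition can be applied to partial results in any order. Analyses without that property need a different way of merging results.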
32. NoSQL (Not Only SQL)
• Next Generation Databases mostly addressing some
of the points:
– being non-relational, distributed, open-source, and
horizontally scalable.
– Used to handle a huge amount of data
– The original intention has been modern web-scale
databases.
Reference: http://nosql-database.org/
33. MongoDB
• MongoDB is a general-purpose, open-source
database.
• MongoDB features:
– Document data model with dynamic
schemas
– Full, flexible index support and rich queries
– Auto-Sharding for horizontal scalability
– Built-in replication for high availability
– Text search
– Advanced security
– Aggregation Framework and MapReduce
– Large media storage with GridFS
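Two of the features above, dynamic schemas and sharding, can be illustrated without a database server. The sketch below is a toy in-memory model and not MongoDB's implementation or API. `ToyShardedCollection` and `shard_for` are hypothetical names, and hashing the `_id` field is only one possible way to choose a shard.

```python
import hashlib
from typing import Any, Dict, List

Document = Dict[str, Any]


def shard_for(key: str, n_shards: int) -> int:
    """Map a shard-key value to a shard index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


class ToyShardedCollection:
    """Hypothetical in-memory stand-in for a sharded document collection."""

    def __init__(self, n_shards: int) -> None:
        self.shards: List[Dict[str, Document]] = [{} for _ in range(n_shards)]

    def insert(self, doc: Document) -> None:
        shard = shard_for(doc["_id"], len(self.shards))
        self.shards[shard][doc["_id"]] = doc

    def find_by_id(self, doc_id: str) -> Document:
        # Only the shard that owns the key is consulted.
        return self.shards[shard_for(doc_id, len(self.shards))][doc_id]


if __name__ == "__main__":
    users = ToyShardedCollection(n_shards=3)
    # Dynamic schema: documents in one collection need not share fields.
    users.insert({"_id": "u1", "name": "Ann", "tags": ["admin"]})
    users.insert({"_id": "u2", "name": "Bob", "address": {"city": "Springfield"}})
    print(users.find_by_id("u2"))
```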
34. What is Hadoop?
- Hadoop, or Apache Hadoop
- Open-source software framework
- Supports data-intensive distributed applications
- Developed by the Apache Software Foundation
- Derived from Google's MapReduce and Google File
System (GFS) papers
- Implemented in Java
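The MapReduce programming model that Hadoop derives from can be sketched as three steps: map, shuffle, and reduce. The word count below runs in a single process and only illustrates the model. It does not use Hadoop's Java API, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real cluster the map and reduce calls would run on many machines, with the framework handling the shuffle between them.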
37. Google Cloud Platform
• App Engine
– mobile and web app
• Cloud SQL
– MySQL on the cloud
• Cloud Storage
– Data storage
• BigQuery
– Data analysis
• Google Compute Engine
– Processing of large data
38. Amazon
• Amazon EC2
– Computation Service using VM
• Amazon DynamoDB
– Large, scalable NoSQL database
– Fully distributed, shared-nothing architecture
• Amazon Elastic MapReduce (Amazon EMR)
– Hadoop based analysis engine
– Can be used to analyse data from DynamoDB
39. Issues
• The I/O capability of a single computer is limited;
how can massive data be handled?
• Big data cannot be moved
– Careful planning is needed to handle big data
– Processing capability must be in place from the start
41. WHAT FACEBOOK KNOWS
Cameron Marlow calls himself Facebook's "in-house
sociologist." He and his team can analyze essentially
all the information the site gathers.
http://www.facebook.com/data
42. Study of Human Society
• Facebook, in collaboration with the University
of Milan, conducted an experiment that involved
– the entire social network as of May 2011
– more than 10 percent of the world's population.
• Analyzing the 69 billion friend connections
among those 721 million people showed that
– four intermediary friends are usually enough to
introduce anyone to a random stranger.
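What is being measured here is the average shortest-path length between pairs of users in the friendship graph. The number of intermediary friends is one less than the number of hops. The sketch below computes it exhaustively with breadth-first search on a toy graph. The graph and function names are hypothetical, and an exhaustive search like this would not scale to hundreds of millions of users.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]


def distances_from(graph: Graph, start: str) -> Dict[str, int]:
    """Breadth-first search: hop count from start to each reachable person."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph[person]:
            if friend not in dist:
                dist[friend] = dist[person] + 1
                queue.append(friend)
    return dist


def average_separation(graph: Graph) -> float:
    """Mean hop count over all ordered pairs of connected, distinct people."""
    total = pairs = 0
    for person in graph:
        for other, hops in distances_from(graph, person).items():
            if other != person:
                total += hops
                pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Toy friendship chain: a - b - c - d
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(round(average_separation(chain), 2))
```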
43. The links of Love
• Often young women specify that
they are “in a relationship” with
their “best friend forever”.
– Roughly 20% of all relationships for
the 15-and-under crowd are
between girls.
– This number dips to 15% for 18-
year-olds and is just 7% for 25-year-
olds.
• Among anonymous US users who were
over 18 at the start of the
relationship:
– the average shortest number
of steps to get from any one U.S.
user to any other individual through romantic ties is 16.7.
– This is much higher than the 4.74
steps you’d need to go from any
Facebook user to another through
friendship, as opposed to romantic,
ties.
Graph showing the relationships of anonymous US users who were
over 18 at the start of the relationship.
http://www.facebook.com/notes/facebook-data-team/the-links-of-love/10150572088343859
44. Why?
• Facebook can improve users' experience
– make useful predictions about users' behavior
– make better guesses about which ads you might
be more or less open to at any given time
• Right before Valentine's Day this year a blog
post from the Data Science Team listed the
songs most popular with people who had
recently signaled on Facebook that they had
entered or left a relationship
45. How does Facebook handle Big Data?
• Facebook built its data storage system using open-source
software called Hadoop.
– Hadoop spreads the data across many machines inside a
data center.
– Hive, an open-source tool that acts as a translation service,
makes it possible to query vast Hadoop data stores using
relatively simple code.
• Much of Facebook's data resides in one Hadoop store
more than 100 petabytes in size (a petabyte is about a million gigabytes),
says Sameet Agarwal, a director of engineering at
Facebook who works on data infrastructure, and the
quantity is growing exponentially. "Over the last few
years we have more than doubled in size every year."
46. Google Flu
• A pattern emerges when all the flu-
related search queries are added
together.
• We compared our query counts with
traditional flu surveillance systems
and found that many search queries
tend to be popular exactly when flu
season is happening.
• By counting how often we see these
search queries, we can estimate how
much flu is circulating in different
countries and regions around the
world.
http://www.google.org/flutrends/about/how.html
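One simple way to compare query counts with a surveillance series is a correlation coefficient. The source says only that the two were compared, so the use of Pearson correlation here is an assumption. The weekly numbers are invented toy values for illustration and are not real data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented toy values, NOT real data: weekly flu-related query counts
    # and weekly cases reported by a traditional surveillance system.
    weekly_query_counts = [120, 150, 310, 480, 400, 200]
    weekly_reported_cases = [10, 14, 30, 52, 41, 18]
    r = pearson(weekly_query_counts, weekly_reported_cases)
    print(f"correlation: {r:.2f}")
```

A coefficient near 1 means the two series rise and fall together, which is the kind of pattern described above.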
56. Preparing for Big Data
• Understanding and preparing your data
– To effectively analyse and, more importantly, cross-analyse your data sets – this is often
where the most insightful results come from – you need to have a rigorous knowledge of
what data you have.
• Getting staff up to scratch
– Finding people with data analysis experience
• Defining the business objectives
– Once the end goal has been decided then a strategy can be created for implementing big data analytics
to support the delivery of this goal
• Sourcing the right suppliers and technology
– In terms of storage, hardware, and data warehousing, you will need to make a range of decisions to
make sure you have all the capabilities and functionality required to meet your big data needs.
http://www.thebigdatainsightgroup.com/site/article/preparing-big-data-revolution
58. Trends
• A move toward large and scalable Virtual
Infrastructure
– Providing computing service
– Providing basic storage service
– Providing large, scalable databases
• NoSQL
– Providing analysis service
• All these services have to come together
– Big data cannot be moved!
59. Issues
• Security
– Will you let important data accumulate outside your
organization?
• If the data is not important, why analyze it?
– Who owns the data? If you discontinue the service, is the data
destroyed properly?
– Protection in a multi-tenant environment
• Big data cannot be moved easily
– Processing has to be near the data; you cannot simply ship data around
• So you ultimately have to select the same cloud for your processing. Is it
available, easy, fast?
• New learning and development costs
– Is new programming or porting needed?
– Are the tools mature enough?
60. When to use Big data on the Cloud
• When data is already on the cloud
– Virtual organization
– Cloud based SaaS Service
• For startups
– CAPEX to OPEX
– No need to maintain large infrastructure
– Focus on scalability and pay as you go
– Data is on the cloud anyway
• For experimental projects
– Pilot for new services
61. Summary
• Big data is coming.
– Changing the way we do science
– Big data is being accumulated anyway
– Knowledge is power.
• Understand your customers better so you can offer
better service
• Tools and technology are available
– They are still developing fast