2. What is Data
The word data is the plural of datum, from the Latin dare,
"to give"; a datum is literally "something given".
Data as an abstract concept can be viewed as the
lowest level of abstraction from
which information and then knowledge are derived.
Information in raw or unorganized form (such as
letters, numbers, or symbols) that refers to,
or represents, conditions, ideas, or objects. Data is
limitless and present everywhere in the universe.
Computers: Symbols or signals that are input,
stored, and processed by a computer, for output as
usable information.
3. Types of Data
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can only scan the data once
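The single-scan constraint on streaming data can be illustrated with a minimal sketch. It assumes a hypothetical stream of numeric readings (the `sensor_readings` generator and its values are invented for illustration) and uses Welford's running update, one common single-pass technique, to compute the mean and variance without storing or re-reading the data.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance) after one pass over stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


def sensor_readings() -> Iterator[float]:
    """Hypothetical stream: each value is produced once and never stored."""
    for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
        yield value


count, mean, variance = running_stats(sensor_readings())
print(count, mean, variance)
```

Only three numbers are kept in memory no matter how long the stream runs, which is the property a one-scan algorithm needs.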
4. Big Data
Definition
Big data is a massive volume of both structured and
unstructured data that is so large that it's difficult to
process with traditional database and software
techniques.
Big data is the term for a collection of data sets so
large and complex that it becomes difficult to
process using on-hand database management tools
or traditional data processing applications.
Big data is data whose scale, diversity, and
complexity require new architecture, techniques,
algorithms, and analytics to manage it and extract
value and hidden knowledge from it…
5. Walmart handles more than 1 million customer
transactions every hour.
Facebook handles 40 billion photos from its
user base.
Decoding the human genome originally took 10
years to process; now it can be achieved in one
week.
Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100TB/month
(3/2009)
Facebook has 2.5 PB of user data + 15TB/day
(4/2009)
eBay has 6.5 PB of user data + 50TB/day (5/2009)
Where is the
Big Data?
6. Data Units
Big Data is Data growing
faster than Moore’s law
1 Byte - 8 Bits
1 Kilobyte (KB) - 10^3 Bytes
1 Megabyte (MB) - 10^6 Bytes
1 Gigabyte (GB) - 10^9 Bytes
1 Terabyte (TB) - 10^12 Bytes
7. Big Big Big
Data
Petabyte(PB) - 10^15 Bytes
Exabyte (EB) - 10^18 Bytes
Zettabyte(ZB) - 10^21 Bytes
Yottabyte (YB) - 10^24 Bytes
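Xenottabyte (XB) - 10^27 Bytes (informal name; the SI prefix for 10^27 is ronna)
Shilentnobyte (SB) - 10^30 Bytes (informal name; the SI prefix for 10^30 is quetta)
Domegrottebyte (DB) - 10^33 Bytes (informal name; no SI prefix exists for 10^33)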
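The decimal unit ladder above can be turned into a small conversion helper. This is only a sketch: the function names `to_bytes` and `human_readable` are made up for illustration, and the table is limited to the units from Byte through Yottabyte.

```python
# Decimal (power-of-1000) units, in increasing order.
UNITS = ["Bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value: int, unit: str) -> int:
    """Convert a whole number of the given unit into bytes."""
    exponent = 3 * UNITS.index(unit)  # KB -> 10^3, MB -> 10^6, ...
    return value * 10 ** exponent


def human_readable(n_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000
    return f"{value:g} {UNITS[-1]}"  # not reached; keeps the return type total


print(human_readable(to_bytes(20, "PB")))
print(human_readable(to_bytes(15, "TB") * 365))
```

The second call scales a daily volume of 15 TB to a year, the kind of back-of-the-envelope arithmetic these units are meant to support.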
10. Variety
Various formats, types, and structures
Text, numerical, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
Static data vs. streaming data
A single application can be
generating/collecting many types of data
11. Velocity
Data is being generated fast and needs to be
processed fast
Online Data Analytics
Late decisions mean missed opportunities
Examples
E-Promotions: Based on your current location,
your purchase history, and what you like, send
promotions right now for the store next to you
Healthcare monitoring: sensors monitor
your activities and body; any abnormal
measurements require immediate reaction
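The healthcare monitoring example can be sketched as a filter that reacts to each reading as it arrives instead of waiting for a batch. Everything specific here is an assumption for illustration: the function name `abnormal_readings`, the (timestamp, value) tuple shape, and the heart-rate thresholds.

```python
from typing import Iterable, Iterator, Tuple

Reading = Tuple[int, float]  # (timestamp in seconds, measured value)


def abnormal_readings(
    stream: Iterable[Reading], low: float, high: float
) -> Iterator[Reading]:
    """Yield each reading outside [low, high] as soon as it is seen."""
    for timestamp, value in stream:
        if value < low or value > high:
            yield timestamp, value


# Hypothetical heart-rate readings in beats per minute.
heart_rate = [(0, 72.0), (1, 75.0), (2, 190.0), (3, 38.0), (4, 80.0)]

for timestamp, value in abnormal_readings(heart_rate, low=40.0, high=180.0):
    print(f"t={timestamp}s: abnormal reading {value}")
```

Because the function is a generator, an alert can be raised on the first abnormal value without waiting for the rest of the stream.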
16. Who’s
Generating Big
Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
17. Implementation
of Big Data
Parallel DBMS technologies
Proposed in the late eighties
Matured over the last two decades
Multi-billion dollar industry: Proprietary
DBMS Engines intended as Data
Warehousing solutions for very large
enterprises
Map Reduce
pioneered by Google
popularized by Yahoo! (Hadoop)
19. Comparison
MapReduce
Data-parallel programming model
An associated parallel and distributed implementation for commodity clusters
Popularized by open-source Hadoop
Used by Yahoo!, Facebook, Amazon, and the list is growing …
Parallel DBMS technologies
Popularly used for more than two decades
Research Projects: Gamma, Grace, …
Commercial: Multi-billion dollar industry but access to only a privileged few
Relational Data Model
Indexing
Familiar SQL interface
Advanced query optimization
Well understood and studied
20. MapReduce
Advantages
Automatic Parallelization:
Depending on the size of RAW INPUT DATA
instantiate multiple MAP tasks
Similarly, depending upon the number of
intermediate <key, value> partitions
instantiate multiple REDUCE tasks
Run-time:
Data partitioning
Task scheduling
Handling machine failures
Managing inter-machine communication
Completely transparent to the programmer / analyst
/ end user
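The map, partition, and reduce phases listed above can be sketched with a word count, the customary introductory example. This is a single-process simulation for illustration only, not Hadoop's API: the names `map_task`, `shuffle`, `reduce_task`, and `word_count` are hypothetical, and a real framework would run the tasks on different machines and handle scheduling and failures itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_task(chunk: str) -> Iterator[Tuple[str, int]]:
    """One MAP task: emit a <word, 1> pair for every word in its input chunk."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(
    pairs: Iterable[Tuple[str, int]], num_partitions: int
) -> List[Dict[str, List[int]]]:
    """Group intermediate pairs by key and assign each key to one partition."""
    partitions: List[Dict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_partitions)
    ]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions


def reduce_task(partition: Dict[str, List[int]]) -> Dict[str, int]:
    """One REDUCE task: sum the values collected for each key in its partition."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(chunks: List[str], num_partitions: int = 2) -> Dict[str, int]:
    """Run one map task per input chunk and one reduce task per partition."""
    intermediate = [pair for chunk in chunks for pair in map_task(chunk)]
    result: Dict[str, int] = {}
    for partition in shuffle(intermediate, num_partitions):
        result.update(reduce_task(partition))
    return result


print(word_count(["big data is big", "data is everywhere"]))
```

The programmer writes only `map_task` and `reduce_task`; the number of map tasks follows from the number of input chunks and the number of reduce tasks from the number of partitions, which mirrors the automatic parallelization described on the slide.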
22. Why Hadoop
Big Data analytics and the Apache Hadoop
open source project are rapidly emerging as
the preferred solution to address business &
technology trends that are disrupting
traditional data management & processing
25. Challenge in
Big Data
Big Data Integration is Multidisciplinary
Less than 10% of the Big Data world is genuinely
relational
Meaningful data integration in the real, messy,
schema-less and complex Big Data world of databases and
the semantic web, using multidisciplinary and
multi-technology methods
The Linked Open Data Ripper
Mapping, Ranking, Visualization, Key Matching,
Snappiness
Demonstrate the Value of Semantics: let data integration
drive DBMS technology
Large volumes of heterogeneous data, like linked data
and RDF
26. Provocations
for Big Data
1. Automating Research Changes the Definition of
Knowledge
2. Claims to Objectivity and Accuracy are
Misleading
3. Bigger Data are not always Better data
4. Not all Data are equivalent
5. Just because it is accessible doesn’t make it
ethical
6. Limited access to big data creates new digital
divides
30. Why are they
collecting all
this data?
Target Marketing
To send you catalogs for exactly
the merchandise you typically
purchase.
To suggest medications that
precisely match your medical
history.
To “push” television channels to
your set instead of your
“pulling” them in.
To send advertisements on
those channels just for us!
Targeted Information
To know what you need before
you even know you need it
based on past purchasing
habits!
To notify you of your expiring
driver’s license or credit cards
or last refill on a Rx, etc.
To give you turn-by-turn
directions to a shelter in case of
emergency.
31. Future
Enhancement
Smartphones and tablets outsold desktop and
laptop computers in 2011. There are more
Smartphones in the U.S. in 2012 than people!
The phone in your pocket has more programmable
memory, more storage and more capability than
several large IBM computers.
It takes dozens of microprocessors running 100 million
lines of code to get a premium car out of the
driveway, and this software is only going to get more
complex. In fact, the cost of software and electronics
accounts for 30-40% of the price.
32. Conclusion
Big Data and Big Data Analytics – Not Just for Large
Organizations
It Is Not Just About Building Bigger Databases
Moving Processing to the Data Source Yields Big Dividends
Choose the Most Appropriate Big Data Scenario
Complete data scenarios, whereby entire data sets can
be properly managed and factored into analytical
processing, complete with in-database or in-memory
processing and grid technologies.
Targeted data scenarios that use analytics and data
management tools to determine the right data to feed
into analytic models, for situations where using the entire
data set isn’t technically feasible or adds little value.
33. Closing
Thought
Big data is not just about helping an organization be
more successful – to market more effectively or improve
business operations.
High-performance analytics is designed to support
big data initiatives, with in-memory, in-database and grid
computing options.
Those organizations can benefit from cloud computing,
where big data analytics is delivered as a service and IT
resources can be quickly adjusted to meet changing
business demands.
On-demand delivery gives customers the option to push
big data analytics to a hosted service, greatly reducing the time, capital
expense and maintenance associated with on-premises
deployments.