7. copyright: Sixth Sense Advisors Inc @2012 7
Big Data
Big Data can be defined as data that grows in volume, velocity, variety and complexity at an
unprecedented pace. This growth and complexity create challenges for capture, storage,
management, analysis and visualization with the typical BI tool stack.
Tapping into the data
Business
• Today: Structured data used today
• Big Data: Big Data existing across the enterprise that can be made available to business
Infrastructure
• Today: Today we do Big or Small compute with Small and Large structured data sets
• Big Data: Big Data will mean Big or Small compute with Big data sets, not always available in structured or semi-structured formats
Analytics
• Analytics is the key technique for analyzing, visualizing and monetizing Big Data
• The field of analytics is resurging with the advent of Big Data
  • Social Analytics
  • Sensor Analytics
  • Text Analytics
  • Deep Data Mining
• Analytics needs metadata for integration
• Applications
  • Fraud Detection
  • Campaign Optimization
  • Demand and Supply Optimization
  • Forecast Optimization
Long Tail
The Old Way (Pareto Principle, or 80/20 rule)
The New Way (with a bigger, longer tail)
When Web 2.0 is applied…
Source: http://en.wikipedia.org/wiki/The_Long_Tail
2008 US Presidential Elections
$32 million raised from 275,000 people
who gave $100 or less
Long Tail Example
Web 2.0 significantly increases
total value contributed/received
by aggregating the long tail of
smaller value donors.
[Figure: long-tail curve with a 20% marker]
• High $ value donors: small constellation
• Low $ value donors: larger constellation
Source: http://en.wikipedia.org/wiki/The_Long_Tail
BIG
Data
What do we collect
• Facebook has an average of 30 billion pieces of content added every month
• YouTube receives 24 hours of video every minute
• 5 billion mobile phones were in use in 2010
• A leading retailer in the UK collects 1.5 billion pieces of information to adjust prices and promotions
• Amazon.com: 30% of sales come from its recommendation engine
• A Boeing jet engine produces 20 TB per hour for engineers to examine in real time to make improvements
Why DWBI Fails Repeatedly
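Lost value = Sum(Latencies) + Opportunity Cost
[Graph: Business Value (vertical axis) against Time (horizontal axis). Starting from the Business Situation, value is lost across three successive latencies:]
• Data Latency: ends when data is ready
• Analysis Latency: ends when information is available
• Decision Latency: ends when the decision is made
Together these make up the Action time, or Action distance.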
Base Graph Courtesy – Dr. Richard Hackathorn
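The action time in the graph is the sum of the three latencies. A minimal sketch of that arithmetic follows; the class name, field names and hour values are hypothetical and chosen only for illustration, not taken from the original slide.

```python
from dataclasses import dataclass


@dataclass
class Latencies:
    """Hypothetical latency figures, in hours, for one business situation."""
    data: int       # business situation -> data is ready
    analysis: int   # data is ready -> information is available
    decision: int   # information is available -> decision is made

    def action_time(self) -> int:
        # Action time (action distance) is the sum of the three latencies.
        return self.data + self.analysis + self.decision

    def largest(self) -> str:
        # Name of the latency that contributes most to the action time.
        parts = {"data": self.data, "analysis": self.analysis, "decision": self.decision}
        return max(parts, key=parts.get)


example = Latencies(data=24, analysis=8, decision=16)  # illustrative values only
total_hours = example.action_time()
bottleneck = example.largest()
```

Knowing which latency dominates shows where shortening the pipeline would recover the most value.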
The Data Landscape
[Diagram: components of the data landscape, linked by Data Transformation]
• Transactional Systems (several)
• ODS (one per transactional system)
• Enterprise Datawarehouse
• Datamarts & Analytical Databases (several)
• Reports
• Dashboards
• Analytic Models
• Other Applications
ACID Kills
• Atomic – All of the work in a transaction completes
(commit) or none of it completes
• Consistent – A transaction transforms the database
from one consistent state to another consistent state.
Consistency is defined in terms of constraints.
• Isolated – The results of any changes made during a
transaction are not visible until the transaction has
committed.
• Durable – The results of a committed transaction
survive failures
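The atomicity and consistency bullets can be illustrated with a small sketch using Python's built-in sqlite3 module. The accounts table, the CHECK constraint and the transfer helper are assumptions made for this example; they are not part of the original slide.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # If this debit violates the CHECK constraint, the credit above is undone.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"  # the consistency constraint
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)       # succeeds: balances become 70 and 80
failed = transfer(conn, 1, 2, 500)  # debit would go negative, whole transaction rolls back
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

The second transfer credits account 2 first, then fails on the debit; the rollback removes the credit as well, which is the "all or none" behaviour described above.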
BIG Data Scenarios: Examples
To: Bob.Collins@bankwithus.com
Dear Mr. Collins,
This email is in reference to my bank account which has
been efficiently handled by your bank for more than five
years. There has been no problem till date until last week
the situation went out of the hand.
I have deposited one of my high amount cheque to my
bank account no: 65656512 which was to be credited
same day but due to your staff carelessness it wasn’t
done and because of this negligence my reputation in the
market has been tarnished. Furthermore I had issued one
payment cheque to the party which was showing
bounced due to “Insufficient balance” just because my
cheque didn’t make on time.
My relationship with your bank has matured with the time
and it’s a shame to tell you about this kind of services are
not acceptable when it is question of somebody’s
reputation. I hope you got my point and I am attaching a
copy of the same for further rapid procedures and remit
into my account in a day.
Yours sincerely
Daniel Carter
Ph: 564-009-2311
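A minimal sketch of how a few structured fields might be pulled out of free text like the email above. The regular expressions, the complaint term list and the function name are assumptions for illustration only; real text analytics would need far more than keyword matching.

```python
import re

# Excerpt of the email above, used as sample input.
EMAIL = (
    "I have deposited one of my high amount cheque to my "
    "bank account no: 65656512 which was to be credited "
    "same day but due to your staff carelessness it wasn't "
    "done and because of this negligence my reputation in the "
    "market has been tarnished. Ph: 564-009-2311"
)

# Hypothetical vocabulary that hints at a complaint.
COMPLAINT_TERMS = ("carelessness", "negligence", "bounced", "tarnished")


def extract_fields(text):
    """Return the account number, phone number and complaint terms found in text."""
    account = re.search(r"account no:\s*(\d+)", text)
    phone = re.search(r"Ph:\s*(\d{3}-\d{3}-\d{4})", text)
    lowered = text.lower()
    return {
        "account": account.group(1) if account else None,
        "phone": phone.group(1) if phone else None,
        "complaint_terms": [term for term in COMPLAINT_TERMS if term in lowered],
    }


fields = extract_fields(EMAIL)
```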
BIG Data Text Example
• We will often imply additional information in spoken language by the way we place stress on words.
• The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.
• "I never said she stole my money" (stress on "I") - Someone else said it, but I didn't.
• "I never said she stole my money" (stress on "never") - I simply didn't ever say it.
• "I never said she stole my money" (stress on "said") - I might have implied it in some way, but I never explicitly said it.
• "I never said she stole my money" (stress on "she") - I said someone took it; I didn't say it was she.
• "I never said she stole my money" (stress on "stole") - I just said she probably borrowed it.
• "I never said she stole my money" (stress on "my") - I said she stole someone else's money.
• "I never said she stole my money" (stress on "money") - I said she stole something, but not my money.
• Depending on which word the speaker stresses, this sentence could have several distinct meanings.
Example Source: Wikipedia
Pattern Detection
Clustering Techniques
• K-Means
• Maximin
• Agglomerative
• Divisive
• Regression
Classification Techniques
• Naive Bayes
• Neural Networks
• Back Propagation
• Recursively Splitting
• K-Nearest Neighbor
• Minimum Distance
Reduction Techniques
• Backward Elimination
• Forward Selection
• Attribute Removal
• Principal Components
Utilities
• Accuracy Measures
• Range Filters
• K-Fold Cross Validation
• Merge & Subset
• Vector Magnitude
Examples
• Text: OCR, machine, digital
• Face recognition, verification, retrieval
• Fingerprint recognition
• Speech recognition
• Medical diagnosis: X-Ray, EKG analysis
• Machine diagnostics data
• Geological data
• Automated Target Recognition (ATR)
• Image segmentation and analysis (recognition from aerial or satellite photographs)
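As one illustration of the clustering techniques listed, here is a minimal K-Means sketch on one-dimensional data. The function name, sample points and starting centroids are hypothetical, and a production implementation would also handle multi-dimensional data and convergence checks.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain K-Means on 1-D data: assign each point to its nearest centroid, then re-average."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assignment step: index of the closest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # illustrative data
centroids, clusters = k_means_1d(points, centroids=[1.0, 12.0])
```

With these sample points the two centroids settle on the means of the low group and the high group after the first iteration.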
Performance
Re-Engineering a Ferrari Engine in a Yugo does not make the fastest
race car.
Current Data Management Platform (RDBMS + ETL + BI)
+ New Data Types
+ New Volume
+ New Analytics
+ New Data Retention
+ New Data Workloads
Result:
• Poor performance
• Failed programs
Big Data and You
• You need to write data quickly and reliably
• Incoming data streams are different in type, size, complexity
• But writing it to disk or memory is not the ultimate goal
• You need to validate data in real time
• You need to count and aggregate as you write
• You need to analyze in real time as well as later, even if "seconds later" is already historical
• You need to scale up and scale out on demand
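The "validate in real time" and "count and aggregate as you write" points can be sketched as a write path that checks each event and updates running aggregates before anything else happens. The class name, event fields and sample events below are hypothetical.

```python
from collections import defaultdict


class StreamAggregator:
    """Validates each incoming event and keeps running counts and totals per type."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.totals = defaultdict(float)
        self.rejected = 0

    def write(self, event):
        key = event.get("type")
        value = event.get("value")
        # Validate at write time: type must be text, value must be a real number.
        valid_value = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not isinstance(key, str) or not valid_value:
            self.rejected += 1
            return False
        # Aggregate as we write, so totals are available without a later batch pass.
        self.counts[key] += 1
        self.totals[key] += value
        return True


agg = StreamAggregator()
events = [
    {"type": "click", "value": 1},
    {"type": "purchase", "value": 20.5},
    {"type": "click", "value": 2},
    {"type": "purchase", "value": "bad"},  # fails validation
]
results = [agg.write(e) for e in events]
```

Persisting the accepted events to disk or memory is left out here; the sketch only covers validation and aggregation on the write path.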
Data Warehouse Appliance
• A Data Warehouse (DW) Appliance is an integrated set of servers, storage, OS, database and interconnect specifically preconfigured and tuned for the rigors of data warehousing.
• DW appliances offer an attractive price / performance value proposition and are frequently a fraction of the cost of traditional data warehouse solutions.
Features
• High Availability
• Standard SQL Interface
• Advanced Compression
• MPP
• Leverages existing BI, ETL and OLTP investments
• Hadoop & MapReduce Interface / Embedded
• Minimal disk I/O bottleneck; simultaneously load & query
• Auto Database Management
Hadoop & RDBMS Analogy
RDBMS, a sports car:
• refined
• has a lot of features
• accelerates very fast
• pricey
• expensive to maintain
Hadoop, a cargo train:
• rough
• missing a lot of luxury
• slow to accelerate
• carries almost anything
• moves a lot of stuff very efficiently
* Original Slide Author: Amr Awadallah, Cloudera
Map Reduce
• Technique for indexing and searching large data volumes
• Two Phases, Map and Reduce
• Map
  • Extract sets of Key-Value pairs from underlying data
  • Potentially in Parallel on multiple machines
• Reduce
  • Merge and sort sets of Key-Value pairs
  • Results may be useful for other searches
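The two phases can be sketched in a single process with a word count. This is an illustration of the idea only, not Hadoop's API; the function names and sample documents are hypothetical, and in a real cluster the map calls would run in parallel on multiple machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: extract (key, value) pairs from the underlying data, here (word, 1)."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key so each key can be reduced independently."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: merge the values for each key and emit the keys in sorted order."""
    return {key: sum(values) for key, values in sorted(grouped.items())}


documents = ["big data big compute", "small data"]  # illustrative input
# Each document could be mapped on a different machine; here it is done sequentially.
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

The resulting key-to-count mapping is the kind of merged, sorted output that later searches can reuse.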
Big Data Challenges
• Integration with the EDW is still an open issue: Big Data reduces to small metrics, and this
translates into the same current-state issues faced with EDW data
• Big Data requires a lot of taxonomy processing, especially
in content-related search
• Several applications need high-performance memory architectures
because the processing is compute-intensive, for example
image processing of brain scans
• Technology is improving by the day, but integration and
deployment are becoming equally complex.