This document provides an overview of WSO2 and its offerings for building big data solutions. WSO2 provides open source components for building complete cloud platforms and is recognized by Gartner and Forrester as a leader in application infrastructure. The document discusses why big data is challenging, given the volume and speed at which data is generated today. WSO2 products such as BAM and CEP help customers address the full data lifecycle for big data use cases: collection, storage, processing, and analytics. The document also outlines an example big data architecture built from WSO2 components together with other technologies such as Cassandra.
Building Your Big Data Solution with WSO2
1. Learn with WSO2 - Building your Big Data Solution
Srinath Perera
Director of Research
WSO2 Inc.
2. About WSO2
• Providing the only complete open source componentized
cloud platform
– Dedicated to removing all the stumbling blocks to enterprise agility
– Enabling you to focus on business logic and business value
• Recognized by leading analyst firms as visionaries and
leaders
– Gartner cites WSO2 as visionaries in all 3 categories of
application infrastructure
– Forrester places WSO2 in top 2 for API Management
• Global corporation with offices in USA, UK & Sri Lanka
– 200+ employees and growing
• Business model of selling comprehensive support &
maintenance for our products
4. Consider a day in your life
• What is the best road to take?
• Will there be any bad weather?
• What is the best way to invest the money?
• Should I take that loan?
• Can I optimize my day?
• Is there a way to do this faster?
• What have others done in similar cases?
• Which product should I buy?
5. What people have wanted (through the ages)
• To know (what happened?)
• To explain (why did it happen?)
• To predict (what will happen?)
6. What is Big Data?
• There is a lot of data available
– E.g. Internet of Things
• We have computing power
• We have technology
• The goal is the same
– To know
– To explain
– To predict
• The challenge is the full lifecycle
8. Data Avalanche / Moore's law of data
• We are now collecting large amounts of data and converting them to digital form
• 90% of the data in the world today was created within the past two years.
• The amount of data we have doubles very fast
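The "90% in two years" figure can be turned into a rough doubling time. The sketch below assumes steady exponential growth, which is an assumption made here for illustration and not a claim from the slide; the function name is hypothetical.

```python
import math


def doubling_time_years(recent_fraction: float, window_years: float) -> float:
    """Doubling time implied by 'recent_fraction of all data is newer than
    window_years', assuming steady exponential growth (an assumption)."""
    # Ratio of total data now to total data window_years ago.
    growth_factor = 1.0 / (1.0 - recent_fraction)
    return window_years * math.log(2) / math.log(growth_factor)


years = doubling_time_years(0.90, 2.0)
print(f"Implied doubling time: {years:.2f} years ({years * 12:.1f} months)")
```

If 90% of the data is newer than two years, the total grew about tenfold in that window, which under this model corresponds to a doubling time of well under a year.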
9. In real life, most data is big
• The web handles millions of activities per second, and generates huge volumes of server logs.
• Social networks: e.g. Facebook has 800 million active users and 40 billion photos from its user base.
• There are >4 billion phones, and >25% are smartphones. There are billions of RFID tags.
• Observational and sensor data
– Weather radars, balloons
– Environmental sensors
– Telescopes
– Complex physics simulations
10. Why is Big Data hard?
• How to store? Assuming 1 TB per computer, it takes 1000 computers to store 1 PB
• How to move? Assuming a 10 Gb network, it takes 2 hours to copy 1 TB, or 83 days to copy 1 PB
• How to search? Assuming each record is 1 KB and one machine can process 1000 records per second, it takes 277 CPU days to process 1 TB and 785 CPU years to process 1 PB
• How to process?
– How to convert algorithms to work at large scale
– How to create new algorithms
Image credit: http://www.susanica.com/photo/9
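The storage, transfer and scan estimates above are back-of-the-envelope arithmetic. The sketch below recomputes them so the inputs can be varied; decimal units (1 TB = 10^12 bytes), ideal link throughput and the helper names are assumptions made here, not part of the slide.

```python
TB = 10**12  # bytes, decimal units (assumption)
PB = 10**15

HOUR, DAY, YEAR = 3600, 86400, 365 * 86400


def machines_needed(size_bytes: int, disk_bytes: int = TB) -> int:
    """Number of machines needed if each one stores disk_bytes."""
    return -(-size_bytes // disk_bytes)  # ceiling division


def transfer_seconds(size_bytes: int, link_bits_per_sec: float) -> float:
    """Ideal copy time over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_sec


def scan_seconds(size_bytes: int, record_bytes: int = 1000,
                 records_per_sec: int = 1000) -> float:
    """Time for one machine to read every record once."""
    return size_bytes / record_bytes / records_per_sec


print(f"Machines for 1 PB: {machines_needed(PB)}")
for gbps in (1, 10):
    rate = gbps * 10**9
    print(f"{gbps:>2} Gb/s: 1 TB in {transfer_seconds(TB, rate) / HOUR:.2f} h, "
          f"1 PB in {transfer_seconds(PB, rate) / DAY:.1f} days")
print(f"Scan 1 TB: {scan_seconds(TB) / HOUR:.0f} CPU hours")
print(f"Scan 1 PB: {scan_seconds(PB) / YEAR:.1f} CPU years")
```

Under these idealised inputs, a 2-hour copy of 1 TB corresponds to roughly 1 Gb/s of effective throughput, and scanning 1 TB takes about 278 CPU hours, not days. The figures quoted above are therefore best read as order-of-magnitude illustrations.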
11. Why is it hard (Contd.)?
• A system built of many computers
• That handles lots of data
• Running complex logic
• This pushes us to the frontier of distributed systems and databases
• More data does not mean there is a simple model
• Some models can be as complex as the system
Image credit: http://www.flickr.com/photos/mariachily/5250487136, licensed CC
13. WSO2 Offerings
• Two tools
– WSO2 BAM for storing and processing
– WSO2 CEP for real-time processing
• These tools cover the whole processing lifecycle for your Big Data, with the help of a few other products as needed.
– WSO2 Storage Server
– WSO2 User Experience Server
15. Sensors
• Built-in sensors in WSO2 products
• Event logs
– Click streams, emails, chat, search, tweets, transactions …
• Custom sensors
– Video surveillance, cash flows, traffic, surveillance, smart grid, production line, RFID (e.g. Walmart), GPS sensors, mobile phones, Internet of Things
Image credits: http://www.flickr.com/photos/imuttoo/4257813689/ by Ian Muttoo, http://www.flickr.com/photos/eastcapital/4554220770/, http://www.flickr.com/photos/patdavid/4619331472/ by Pat David, copyright CC
16. Collecting Data
• Data is collected at sensors and sent to the big data system via events or flat files
• Event streams: we name events by their content/originator
• Get data through
– Point to point
– Event bus
• E.g. Data Bridge
– A Thrift-based transport we built that handles about 400k events/sec
17. Storing Data
• Historically we used databases
– Scale is a challenge: replication, sharding
• Scalable options
– NoSQL (Cassandra, HBase) [if data is structured]
    • Column families gaining ground
– Distributed file systems (e.g. HDFS) [if data is unstructured]
• NewSQL
– In-memory computing, VoltDB
• Specialized data structures
– Graph databases, data structure servers
Image credit: http://www.flickr.com/photos/keso/363133967/
18. Storing Data (Contd.)
• WSO2 offerings (WSO2 Storage Server)
– Small structured data: keep in relational databases
– Large structured data: Cassandra
– Large unstructured data: HDFS
19. Making Sense of Data
• To know (what happened?)
– Basic analytics + visualizations (min, max, average, histogram, distributions …)
– Interactive drill-down
• To explain (why)
– Data mining, classification, building models, clustering
• To forecast
– Neural networks, decision models
20. Making Sense of Data (Contd.)
• Batch processing – WSO2 BAM
– Hive scripts
– MapReduce jobs
• Real-time processing – CEP
– Event Query Language
• These two are platforms; you still need to program your use case.
21. To know (what happened?)
• Mainly analytics
– Min, max, average, correlation, histograms
– Might join and group data in many ways
• Implemented with MapReduce or queries
• Data is often presented with visualizations
• Examples
– Forensics
– Assessments
– Historical data/reports/trends
Image credit: http://www.flickr.com/photos/isriya/2967310333/
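On a small data set, the basic analytics listed above (min, max, average, histogram) fit in a few lines. The following is a minimal single-machine sketch; the sample values and the `summarize` helper are hypothetical, and at scale the same aggregates would be expressed as Hive queries or MapReduce jobs.

```python
from collections import Counter
from statistics import mean


def summarize(values, bucket_width):
    """Return min, max, average and a fixed-width histogram of values."""
    # Each value falls into the bucket that starts at the nearest lower
    # multiple of bucket_width.
    histogram = Counter((v // bucket_width) * bucket_width for v in values)
    return {
        "min": min(values),
        "max": max(values),
        "average": mean(values),
        "histogram": dict(sorted(histogram.items())),
    }


# Hypothetical response times in milliseconds.
response_ms = [120, 85, 430, 95, 210, 180, 90, 305]
summary = summarize(response_ms, bucket_width=100)
for name, value in summary.items():
    print(name, value)
```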
22. To Explain (Patterns)
• Correlation
– Scatter plot, statistical correlation
• Data mining (detecting patterns)
– Clustering and classification
– Finding similar items
– Finding hubs and authorities in a graph
– Finding frequent item sets
– Making recommendations
• Apache Mahout
Image credits: http://www.flickr.com/photos/eriwst/2987739376/ and http://www.flickr.com/photos/focx/5035444779/
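As one concrete instance of "statistical correlation", the sketch below computes the Pearson correlation coefficient of two series in plain Python. The choice of Pearson and the function name are assumptions made for illustration; the slides do not say which measure was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3], [3, 2, 1]))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.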
23. To Predict: Forecasts and Models
• Trying to build a model for the data
• Theoretically or empirically
– Analytical models (e.g. physics)
– Neural networks
– Reinforcement learning
– Unsupervised learning (clustering, dimensionality reduction, kernel methods)
• Examples
– Translation
– Weather forecast models
– Building profiles of users
– Traffic models
– Economic models
• A lot of domain-specific work
Image credit: http://misterbijou.blogspot.com/2010_09_01_archive.html
24. Information Visualization
• Presenting information
– To end users
– To decision makers
– To scientists
• Interactive exploration
• Sending alerts
• WSO2 UES
– Jaggery based
• BAM/CEP can work with most other UI tools
Image credit: http://www.flickr.com/photos/stevefaeembra/3604686097/
25. WSO2 UES
• Dashboards and Store
• Build your own UIs with Jaggery
26. MapReduce/Hadoop
• First introduced by Google, and used as the processing model for their architecture
• Implemented by open source projects like Apache Hadoop and Spark
• Users write two functions: map and reduce
• The framework handles details like distributed processing, fault tolerance, load balancing, etc.
• Widely used, and one of the catalysts of Big Data
// Pseudocode: word count
void map(ctx, k, v){
    tokens = v.split();
    for t in tokens
        ctx.emit(t, 1);      // emit (word, 1) for each token
}
void reduce(ctx, k, values[]){
    count = 0;
    for v in values
        count = count + v;   // sum the 1s emitted for this word
    ctx.emit(k, count);
}
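The word-count pseudocode above can be tried on a single machine with a toy driver that plays the role of the framework. This is a minimal in-memory sketch: `run_mapreduce` is a hypothetical stand-in for what Hadoop does across a cluster, not a Hadoop API.

```python
from collections import defaultdict


def map_fn(key, value):
    """Emit (word, 1) for every token in a line of text."""
    for token in value.split():
        yield token, 1


def reduce_fn(key, values):
    """Sum the counts emitted for one word."""
    yield key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = {}
    for key in sorted(groups):
        for out_key, out_value in reduce_fn(key, groups[key]):
            output[out_key] = out_value
    return output


lines = [(0, "big data is big"), (1, "data in motion")]
counts = run_mapreduce(lines, map_fn, reduce_fn)
print(counts)
```

In a real framework the grouping step runs across many machines, which is where the fault tolerance and load balancing mentioned above come in. The two user-written functions keep the same shape.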
28. Data on the Move
• The idea is to process data as it is received, in a streaming fashion
• Used when we need
– Very fast output
– Lots of events (a few 100k to millions)
– Processing without storing (e.g. too much data)
• Two main technologies
– Stream processing (e.g. Storm, http://storm-project.net/)
– Complex Event Processing (CEP), http://wso2.com/products/complex-event-processor/
29. Complex Event Processing (CEP)
• Sees inputs as event streams, which are queried with an SQL-like language
• Supports filters, windows, joins, patterns and sequences
from p=PINChangeEvents#win.time(3600) join
t=TransactionEvents[p.custid=custid][amount>10000]
#win.time(3600)
return t.custid, t.amount;
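To make the semantics of the query above concrete, here is a plain-Python sketch of a windowed join between PIN-change events and large transactions. It assumes the window argument means 3600 seconds and that events arrive in timestamp order. The class and method names are hypothetical and are not part of any WSO2 API.

```python
from collections import deque

WINDOW_SECONDS = 3600  # assumed meaning of win.time(3600)
MIN_AMOUNT = 10000     # filter threshold taken from the query


class PinChangeTransactionJoin:
    """Joins PIN changes with large transactions of the same customer
    when both fall inside the time window."""

    def __init__(self):
        self.pin_changes = deque()   # (timestamp, custid)
        self.transactions = deque()  # (timestamp, custid, amount)

    def _expire(self, now):
        # Drop events that have slid out of either window.
        for window in (self.pin_changes, self.transactions):
            while window and now - window[0][0] > WINDOW_SECONDS:
                window.popleft()

    def on_pin_change(self, ts, custid):
        self._expire(ts)
        self.pin_changes.append((ts, custid))
        return [(c, amount) for (_, c, amount) in self.transactions
                if c == custid]

    def on_transaction(self, ts, custid, amount):
        self._expire(ts)
        if amount <= MIN_AMOUNT:
            return []  # filtered out before entering the window
        self.transactions.append((ts, custid, amount))
        return [(custid, amount) for (_, c) in self.pin_changes
                if c == custid]


join = PinChangeTransactionJoin()
join.on_pin_change(0, "c1")
print(join.on_transaction(600, "c1", 25000))
```

A CEP engine evaluates the declarative query in much the same spirit, handling the window bookkeeping that is written out by hand here.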
31. Case Study 1: Tracing a Business Process
• A business process is built using many services
• Track and trace each step, and analyze to understand how to optimize
• E.g. sales pipeline
32. Some Queries
• Conversion rate?
• How many deals are in the pipeline each month?
• Average size of the deals?
• Average time a deal takes?
• Can we spot large deals early?
• Which is better: going for a few large deals or many small ones?
• Were there any delays on our side?
33. Hive: Average Size of the Deal
• Hive uses an SQL-like syntax.
• Easy to understand and learn
hive> LOAD DATA ..
hive> SELECT avg(value) FROM LEAD_ACTIVITY
      WHERE action = 'closedWon' GROUP BY month;
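For readers who want to check what the query computes, the same aggregation can be written in plain Python over a handful of rows. The sample rows are hypothetical; the field names mirror the columns used in the query.

```python
from collections import defaultdict


def average_deal_size_by_month(activities):
    """Average value of 'closedWon' rows, grouped by month."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum, count]
    for row in activities:
        if row["action"] != "closedWon":
            continue  # the WHERE clause
        bucket = totals[row["month"]]
        bucket[0] += row["value"]
        bucket[1] += 1
    # The GROUP BY plus avg(): one average per month.
    return {month: total / count
            for month, (total, count) in sorted(totals.items())}


# Hypothetical sample of the LEAD_ACTIVITY table.
lead_activity = [
    {"month": 1, "action": "closedWon", "value": 1000},
    {"month": 1, "action": "closedWon", "value": 3000},
    {"month": 1, "action": "open", "value": 9999},
    {"month": 2, "action": "closedWon", "value": 500},
]
print(average_deal_size_by_month(lead_activity))
```

Hive compiles the declarative query into distributed jobs, so the appeal of the SQL-like form is that none of this grouping logic has to be written by hand.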
35. How many deals in the pipeline? (Contd.)
// Pseudocode: count deals per month
void map(ctx, k, v){
    Deal deal = parse(v);
    int month = getMonth(deal.time);
    ctx.emit(month, 1);      // emit (month, 1) for each deal
}
void reduce(ctx, k, values[]){
    count = 0;
    for v in values
        count = count + v;   // sum the deals seen for this month
    ctx.emit(k, count);
}
36. Case Study 2: DEBS Challenge
• Event processing challenge
• Real football game, with sensors in the players' shoes and in the ball
• Events arrive at 15 kHz
• Event format
– Sensor ID, TS, x, y, z, v, a
• Queries
– Running stats
– Ball possession
– Heat map of activity
– Shots at goal
37. Example: Detect Ball Possession
• Possession is the time from when a player hits the ball until someone else hits it or it goes out of the ground
from Ball#window.length(1) as b join
     Players#window.length(1) as p
     unidirectional
     on debs:getDistance(b.x, b.y, b.z,
                         p.x, p.y, p.z) < 1000
     and b.a > 55
select ...
insert into hitStream

from old = hitStream,
     b = hitStream[old.pid != pid],
     n = hitStream[b.pid == pid]*,
     (e1 = hitStream[b.pid != pid]
      or e2 = ballLeavingHitStream)
select ...
insert into BallPossessionStream
Image credit: http://www.flickr.com/photos/glennharper/146164820/
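The two queries above can be approximated in plain Python to show the logic: first decide whether a ball event counts as a hit by a player, then turn the sequence of hits into possession intervals. The thresholds come from the query; the units (millimetres for distance), the function names and the input shapes are assumptions made for this sketch.

```python
import math

HIT_DISTANCE = 1000    # threshold from the query; millimetres assumed
HIT_ACCELERATION = 55  # threshold from the query; sensor units


def is_hit(ball_pos, ball_accel, player_pos):
    """A hit: the ball is close to the player and accelerating sharply."""
    return (math.dist(ball_pos, player_pos) < HIT_DISTANCE
            and ball_accel > HIT_ACCELERATION)


def possession_intervals(hits, out_of_play):
    """Turn hits [(ts, player_id)] and ball-leaving timestamps into
    closed possession intervals [(player_id, start_ts, end_ts)]."""
    events = sorted(
        [(ts, pid) for ts, pid in hits] + [(ts, None) for ts in out_of_play],
        key=lambda event: event[0],
    )
    intervals = []
    owner, start = None, None
    for ts, pid in events:
        if pid == owner:
            continue  # same player hits again: possession continues
        if owner is not None:
            intervals.append((owner, start, ts))  # someone else hit, or ball left
        owner, start = pid, ts
    return intervals


hits = [(0, "A"), (5, "A"), (12, "B"), (30, "A")]
print(possession_intervals(hits, out_of_play=[20]))
```

A possession that is still open when the input ends is not emitted, mirroring the pattern query, which only fires once the terminating event arrives.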
38. Conclusions
• What is Big Data?
• Big Data architecture
– Collecting data
– Storing data
– Processing data
• WSO2 offerings
• Case studies
40. Engage with WSO2
• Helping you get the most out of your deployments
• From project evaluation and inception to development
and going into production, WSO2 is your partner in
ensuring 100% project success