This document discusses how to scale the WSO2 BAM platform to handle billions of requests and terabytes of data. It describes scaling the major BAM components: the data receiver, data storage, the analyzer engine, and the dashboard. The data receiver uses Apache Thrift for efficient data transfer, Cassandra provides scalable data storage, the analyzer engine leverages Hadoop and Hive for distributed processing, and ZooKeeper coordinates scheduled tasks. Together these let BAM deployments grow from a single node to fully distributed, highly available setups.
1. Scaling Up WSO2 BAM for Billions of Requests and Terabytes of Data
Buddhika Chamith
Software Engineer – WSO2 BAM
2. Business Activity Monitoring
“The aggregation, analysis, and presentation of real-time information about activities inside organizations and involving customers and partners.”
- Gartner
3. Aggregation
● Capturing data
● Data storage
● What data to capture?
4. Analysis
● Data operations
● Building KPIs
● Operate on large amounts of historic data or new data
● Building BI
5. Presentation
● Visualizing KPIs/BI
● Custom Dashboards
● Visualization tools
● Not just dashboards!
8. Data Agents
● Push data to BAM (see the sketch after this list)
● Collecting
  ● Service data
  ● Mediation data
  ● Logs, etc.
● Various interceptors are used
  ● Axis2 Handlers
  ● Synapse Mediators
  ● Tomcat Valves
  ● Log4j Appenders
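Whatever the interceptor, a data agent ultimately boils down to publishing events. Here is a minimal sketch using the Thrift data publisher; the class and method names follow the WSO2 databridge agent API (DataPublisher), but the receiver URL, credentials, and the stream definition are assumptions to be adapted to your setup.

    import org.wso2.carbon.databridge.agent.thrift.DataPublisher;

    public class ServiceStatsAgent {
        public static void main(String[] args) throws Exception {
            // Connect to the BAM data receiver over Thrift (default port 7611).
            DataPublisher publisher =
                    new DataPublisher("tcp://localhost:7611", "admin", "admin");

            // Define the event stream once; events are persisted against it.
            String streamId = publisher.defineStream(
                    "{ 'name':'org.example.service.stats', 'version':'1.0.0'," +
                    "  'metaData':   [ {'name':'host','type':'STRING'} ]," +
                    "  'payloadData':[ {'name':'operation','type':'STRING'}," +
                    "                  {'name':'responseTime','type':'LONG'} ] }");

            // Publish one event as meta, correlation, and payload arrays.
            publisher.publish(streamId,
                    new Object[]{ "node1.example.com" },
                    null,
                    new Object[]{ "getQuote", 42L });

            publisher.stop();
        }
    }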
10. Apache Thrift
● An RPC framework
● With a pluggable architecture for mixing different transports with different protocols
● Has multiple language bindings (Java, C++, Python, Perl, C#, etc.)
● We mainly use the Java binding
11. Not Just Performance...
● Load balancing
● Failover
● All available within a Java SDK library.
● You can use it too.
12. Data Receiver
● Capture and transfer data to subscribed sinks.
● Not just the database.
● Can be clustered.
● Load balancing is handled on the client side (see the sketch below).
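To make "client-side load balancing" concrete, here is a hypothetical round-robin wrapper over several receiver nodes. The agent SDK already ships an equivalent load-balancing publisher, so this hand-rolled version only illustrates the idea; the DataPublisher usage follows the same assumed API as the earlier sketch.

    import org.wso2.carbon.databridge.agent.thrift.DataPublisher;

    // Illustrative only: shows what "client side" means in practice.
    public class RoundRobinPublisher {
        private final DataPublisher[] receivers;
        private int next = 0;

        public RoundRobinPublisher(DataPublisher... receivers) {
            this.receivers = receivers;
        }

        // Spread events across receiver nodes; no server-side balancer needed.
        public void publish(String streamId, Object[] meta,
                            Object[] correlation, Object[] payload)
                throws Exception {
            DataPublisher target = receivers[next];
            next = (next + 1) % receivers.length;
            target.publish(streamId, meta, correlation, payload);
        }
    }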
14. Data Storage
● Apache Cassandra
● A NoSQL column-family implementation
● Scalable, HA, and no SPOF
● Very high write throughput and good read throughput
● Tunable consistency with data replication
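Tunable consistency is easiest to see from client code. A sketch using the DataStax Java driver (chosen purely for illustration; BAM's own persistence layer differs, and the contact point, keyspace, and table are assumptions):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class ConsistencyDemo {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("bam_keyspace")) {
                // With replication factor 3, QUORUM waits for 2 of 3 replicas:
                // the write survives a node failure without blocking on all 3.
                SimpleStatement write = new SimpleStatement(
                        "INSERT INTO events (key, payload) VALUES (?, ?)",
                        "evt-1", "...");
                write.setConsistencyLevel(ConsistencyLevel.QUORUM);
                session.execute(write);
            }
        }
    }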
19. Analyzer Engine
● Idea: distribute processing to multiple nodes to run in parallel
● Obvious choice: Hadoop
● Uses the MapReduce programming paradigm
20. MapReduce
● Process multiple data chunks in parallel at the Mappers.
● Aggregate map outputs with the same key at the Reducers and store the result.
● Let's think of a useful example (see below).
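Taking up the slide's invitation: a minimal Hadoop job that counts requests per service operation. Mappers parse log chunks in parallel, and all counts for one operation meet at a single reducer. The CSV input layout and field positions are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RequestCount {
        // Mapper: each chunk of log lines is processed in parallel.
        public static class OpMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assumed input: CSV lines of timestamp,operation,responseTime
                String[] fields = line.toString().split(",");
                if (fields.length > 1) {
                    ctx.write(new Text(fields[1]), ONE);
                }
            }
        }

        // Reducer: all counts for the same operation arrive at one reducer.
        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text op, Iterable<LongWritable> counts,
                                  Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable c : counts) sum += c.get();
                ctx.write(op, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "request-count");
            job.setJarByClass(RequestCount.class);
            job.setMapperClass(OpMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }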
21. Hadoop Components
● Job Tracker (schedules MapReduce jobs)
● Name Node (holds the HDFS namespace)
● Secondary Name Node (checkpoints the namespace)
● Task Trackers (run the map and reduce tasks)
● Data Nodes (store HDFS blocks)
22. It's Cool, But...
● Do we need to have a Hadoop cluster in order to try out BAM?
● Are we supposed to code Hadoop jobs to get BAM to summarize something?
● Answers:
1) No.
2) No. OK, maybe very rarely at best.
23. Apache Hive
● You write SQL. (Almost.)
● Let Hive convert it to MapReduce jobs.
● So Hive does two things
  ● Provides an abstraction over Hadoop MapReduce
  ● Submits the analytic jobs to Hadoop
● Hive may spawn a Hadoop JVM locally or delegate to a Hadoop cluster
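To show what such a script looks like, here is a sketch that submits a summarization query over the Hive JDBC interface. The driver class, HiveServer URL, and table and column names are assumptions that depend on the Hive version; inside BAM, the task framework submits scripts like this for you.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveSummarizer {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = con.createStatement()) {
                // Looks like SQL; Hive compiles it into MapReduce jobs.
                stmt.execute(
                    "INSERT OVERWRITE TABLE service_stats_summary " +
                    "SELECT operation, COUNT(*) AS requests, " +
                    "       AVG(response_time) AS avg_time " +
                    "FROM service_stats GROUP BY operation");
            }
        }
    }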
26. Task Framework
● Runs Hive scripts periodically
● Schedules can be specified as cron expressions or predefined templates
● Handles task failover in case of node failure
● Uses ZooKeeper for coordination
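As a concrete example (assuming standard Quartz-style cron syntax for the expressions), a schedule of

    0 0/30 * * * ?

would run its Hive script every 30 minutes, on the hour and half hour.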
27. ZooKeeper
● Can be run separately or embedded within BAM
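To show the coordination pattern this enables, here is a minimal sketch of ZooKeeper-based leader election, the standard recipe behind task failover. The znode paths are assumptions (and the parent path must already exist); this illustrates the pattern, not BAM's internal code.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class TaskCoordinator {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

            // The ephemeral node disappears if this node dies, so another
            // node becomes the lowest-numbered member and takes over.
            String me = zk.create("/bam/tasks/member-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);

            List<String> members = zk.getChildren("/bam/tasks", false);
            Collections.sort(members);
            boolean leader = me.endsWith(members.get(0));
            System.out.println(leader ? "This node runs the tasks"
                                      : "Standing by for failover");
        }
    }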