This document discusses how to scale the WSO2 BAM platform to handle billions of requests and terabytes of data. It describes scaling the major BAM components: the data receiver, data storage, the analyzer engine, and the dashboard. The data receiver uses Apache Thrift for efficient data transfer, Cassandra provides scalable data storage, the analyzer engine leverages Hadoop and Hive for distributed processing, and ZooKeeper coordinates scheduled tasks. Together these let BAM deployments grow from a single node to fully distributed, highly available setups.
1. Scaling Up WSO2 BAM for Billions of Requests and Terabytes of Data
Buddhika Chamith
Software Engineer – WSO2 BAM
2. Business Activity Monitoring
“The aggregation, analysis, and presentation of real-time information about activities inside organizations and involving customers and partners.”
- Gartner
3. Aggregation
● Capturing data
● Data storage
● What data to capture?
4. Analysis
● Data operations
● Building KPIs
● Operate on large amounts of historic data or new data
● Building BI
5. Presentation
● Visualizing KPIs/BI
● Custom Dashboards
● Visualization tools
● Not just dashboards!
8. Data Agents
● Push data to BAM (see the sketch after this list)
● Collecting
  ● Service data
  ● Mediation data
  ● Logs, etc.
● Various interceptors are used
  ● Axis2 Handlers
  ● Synapse Mediators
  ● Tomcat Valves
  ● Log4j Appenders
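Whatever the interceptor, a data agent ultimately boils down to publishing events. Here is a minimal sketch using the Thrift data publisher; the class and method names follow the WSO2 databridge agent API (DataPublisher), but the receiver URL, credentials, and the stream definition are assumptions to be adapted to your setup.

    import org.wso2.carbon.databridge.agent.thrift.DataPublisher;

    public class ServiceStatsAgent {
        public static void main(String[] args) throws Exception {
            // Connect to the BAM data receiver over Thrift (default port 7611).
            DataPublisher publisher =
                    new DataPublisher("tcp://localhost:7611", "admin", "admin");

            // Define the event stream once; events are persisted against it.
            String streamId = publisher.defineStream(
                    "{ 'name':'org.example.service.stats', 'version':'1.0.0'," +
                    "  'metaData':   [ {'name':'host','type':'STRING'} ]," +
                    "  'payloadData':[ {'name':'operation','type':'STRING'}," +
                    "                  {'name':'responseTime','type':'LONG'} ] }");

            // Publish one event as meta, correlation, and payload arrays.
            publisher.publish(streamId,
                    new Object[]{ "node1.example.com" },
                    null,
                    new Object[]{ "getQuote", 42L });

            publisher.stop();
        }
    }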
10. Apache Thrift
● An RPC framework
● With a pluggable architecture for mixing different transports with different protocols
● Has multiple language bindings (Java, C++, Python, Perl, C#, etc.)
● We mainly use the Java binding
11. Not Just Performance...
● Load balancing
● Failover
● All available within a Java SDK library.
● You can use it too.
12. Data Receiver
● Capture and transfer data to subscribed sinks.
● Not just the database.
● Can be clustered.
● Load balancing is handled on the client side (see the sketch below).
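To make "client-side load balancing" concrete, here is a hypothetical round-robin wrapper over several receiver nodes. The agent SDK already ships an equivalent load-balancing publisher, so this hand-rolled version only illustrates the idea; the DataPublisher usage follows the same assumed API as the earlier sketch.

    import org.wso2.carbon.databridge.agent.thrift.DataPublisher;

    // Illustrative only: shows what "client side" means in practice.
    public class RoundRobinPublisher {
        private final DataPublisher[] receivers;
        private int next = 0;

        public RoundRobinPublisher(DataPublisher... receivers) {
            this.receivers = receivers;
        }

        // Spread events across receiver nodes; no server-side balancer needed.
        public void publish(String streamId, Object[] meta,
                            Object[] correlation, Object[] payload)
                throws Exception {
            DataPublisher target = receivers[next];
            next = (next + 1) % receivers.length;
            target.publish(streamId, meta, correlation, payload);
        }
    }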
14. Data Storage
● Apache Cassandra
● A NoSQL column-family implementation
● Scalable, HA, and no SPOF
● Very high write throughput and good read throughput
● Tunable consistency with data replication
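Tunable consistency is easiest to see from client code. A sketch using the DataStax Java driver (chosen purely for illustration; BAM's own persistence layer differs, and the contact point, keyspace, and table are assumptions):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class ConsistencyDemo {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("bam_keyspace")) {
                // With replication factor 3, QUORUM waits for 2 of 3 replicas:
                // the write survives a node failure without blocking on all 3.
                SimpleStatement write = new SimpleStatement(
                        "INSERT INTO events (key, payload) VALUES (?, ?)",
                        "evt-1", "...");
                write.setConsistencyLevel(ConsistencyLevel.QUORUM);
                session.execute(write);
            }
        }
    }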
19. Analyzer Engine
● Idea: distribute processing to multiple nodes to run in parallel
● Obvious choice: Hadoop
● Uses the MapReduce programming paradigm
20. MapReduce
● Process multiple data chunks in parallel at the Mappers.
● Aggregate map outputs with the same key at the Reducers and store the result.
● Let's think of a useful example (see below).
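Taking up the slide's invitation: a minimal Hadoop job that counts requests per service operation. Mappers parse log chunks in parallel, and all counts for one operation meet at a single reducer. The CSV input layout and field positions are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RequestCount {
        // Mapper: each chunk of log lines is processed in parallel.
        public static class OpMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assumed input: CSV lines of timestamp,operation,responseTime
                String[] fields = line.toString().split(",");
                if (fields.length > 1) {
                    ctx.write(new Text(fields[1]), ONE);
                }
            }
        }

        // Reducer: all counts for the same operation arrive at one reducer.
        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text op, Iterable<LongWritable> counts,
                                  Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable c : counts) sum += c.get();
                ctx.write(op, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "request-count");
            job.setJarByClass(RequestCount.class);
            job.setMapperClass(OpMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }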
21. Hadoop Components
● Job Tracker (schedules MapReduce jobs)
● Name Node (holds the HDFS namespace)
● Secondary Name Node (checkpoints the namespace)
● Task Trackers (run the map and reduce tasks)
● Data Nodes (store HDFS blocks)
22. It's Cool, But...
● Do we need to have a Hadoop cluster in order to try out BAM?
● Are we supposed to code Hadoop jobs to get BAM to summarize something?
● Answers:
1) No.
2) No. OK, maybe very rarely at best.
23. Apache Hive
● You write SQL. (Almost.)
● Let Hive convert it to MapReduce jobs.
● So Hive does two things
  ● Provides an abstraction over Hadoop MapReduce
  ● Submits the analytic jobs to Hadoop
● Hive may spawn a Hadoop JVM locally or delegate to a Hadoop cluster
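To show what such a script looks like, here is a sketch that submits a summarization query over the Hive JDBC interface. The driver class, HiveServer URL, and table and column names are assumptions that depend on the Hive version; inside BAM, the task framework submits scripts like this for you.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveSummarizer {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "hive", "");
                 Statement stmt = con.createStatement()) {
                // Looks like SQL; Hive compiles it into MapReduce jobs.
                stmt.execute(
                    "INSERT OVERWRITE TABLE service_stats_summary " +
                    "SELECT operation, COUNT(*) AS requests, " +
                    "       AVG(response_time) AS avg_time " +
                    "FROM service_stats GROUP BY operation");
            }
        }
    }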
26. Task Framework
● Runs Hive scripts periodically
● Schedules can be specified as cron expressions or predefined templates
● Handles task failover in case of node failure
● Uses ZooKeeper for coordination
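As a concrete example (assuming standard Quartz-style cron syntax for the expressions), a schedule of

    0 0/30 * * * ?

would run its Hive script every 30 minutes, on the hour and half hour.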
27. ZooKeeper
● Can be run separately or embedded within BAM
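To show the coordination pattern this enables, here is a minimal sketch of ZooKeeper-based leader election, the standard recipe behind task failover. The znode paths are assumptions (and the parent path must already exist); this illustrates the pattern, not BAM's internal code.

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class TaskCoordinator {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

            // The ephemeral node disappears if this node dies, so another
            // node becomes the lowest-numbered member and takes over.
            String me = zk.create("/bam/tasks/member-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);

            List<String> members = zk.getChildren("/bam/tasks", false);
            Collections.sort(members);
            boolean leader = me.endsWith(members.get(0));
            System.out.println(leader ? "This node runs the tasks"
                                      : "Standing by for failover");
        }
    }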