This document summarizes a presentation about distributed caching technologies from key-value stores to in-memory data grids. It discusses the memory hierarchy and how software caches can improve performance by reducing data access latency and offloading storage. Different caching patterns like cache-aside, read-through, write-through and write-behind are explained. Popular caching products including Memcached, Redis, Cassandra and data grids are overviewed. Advanced concepts covered include data distribution, replication, consistency protocols and use cases.
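Of the caching patterns named above, cache-aside is the simplest to illustrate. The following is a minimal sketch, not taken from the presentation: the `CacheAside` class, its `misses` counter, and the dictionary standing in for the backing store are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Cache-aside: the caller checks the cache first and loads from the
    backing store only on a miss, then populates the cache itself."""

    def __init__(self, load_from_store: Callable[[str], Optional[str]]):
        self._cache: Dict[str, str] = {}
        self._load = load_from_store   # hypothetical loader for the slow store
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:         # cache hit: no trip to the store
            return self._cache[key]
        self.misses += 1               # cache miss: fall back to the store
        value = self._load(key)
        if value is not None:
            self._cache[key] = value   # populate the cache for later reads
        return value

    def invalidate(self, key: str) -> None:
        """Drop a cached entry, e.g. after the store has been updated."""
        self._cache.pop(key, None)


# A plain dict stands in for the slow backing store (an assumption).
store = {"user:1": "Ada"}
cache = CacheAside(store.get)
first = cache.get("user:1")    # miss: loaded from the store
second = cache.get("user:1")   # hit: served from the cache
```

In read-through and write-through caching the cache itself would call the store; here the application code owns that logic, which is what distinguishes cache-aside.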
Communication between microservices is inherently unreliable. These integration points may produce cascading failures, slow responses, and service outages. We will walk through stability patterns such as timeouts, circuit breakers, and bulkheads, and discuss how they improve the stability of microservices.
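To make the circuit breaker pattern concrete, here is a minimal sketch under stated assumptions: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are illustrative choices, not details from the talk. The breaker fails fast while open and lets a single trial call through once the timeout has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None                      # None means closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping a remote call as `breaker.call(fetch, url)` (with a hypothetical `fetch`) keeps a failing dependency from tying up callers, which is how the pattern limits cascading failures.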
Redis is an in-memory key-value store that is often used as a database, cache, and message broker. It supports various data structures like strings, hashes, lists, sets, and sorted sets. While data is stored in memory for fast access, Redis can also persist data to disk. It is widely used by companies like GitHub, Craigslist, and Engine Yard to power applications with high performance needs.
MongoDB is an open-source, document-oriented database that provides high performance and horizontal scalability. It uses a document-model where data is organized in flexible, JSON-like documents rather than rigidly defined rows and tables. Documents can contain multiple types of nested objects and arrays. MongoDB is best suited for applications that need to store large amounts of unstructured or semi-structured data and benefit from horizontal scalability and high performance.
From cache to in-memory data grid. Introduction to Hazelcast. - Taras Matyashovsky
This presentation:
* covers basics of caching and popular cache types
* explains evolution from simple cache to distributed, and from distributed to IMDG
* does not describe the use of NoSQL solutions for caching
* is not intended as a product comparison or as promotion of Hazelcast as the best solution
This document provides an overview and introduction to NoSQL databases. It begins with an agenda that explores key-value, document, column family, and graph databases. For each type, 1-2 specific databases are discussed in more detail, including their origins, features, and use cases. Key databases mentioned include Voldemort, CouchDB, MongoDB, HBase, Cassandra, and Neo4j. The document concludes with references for further reading on NoSQL databases and related topics.
This presentation briefly describes key features of Apache Cassandra. It was held at the Apache Cassandra Meetup in Vienna in January 2014. You can access the meetup here: http://www.meetup.com/Vienna-Cassandra-Users/
FoundationDB is a next-generation database that aims to provide high performance transactions at massive scale through a distributed design. It addresses limitations of NoSQL databases by providing a transactional, fault-tolerant foundation using tools like the Flow programming language. FoundationDB has demonstrated high performance that exceeds other NoSQL databases, and provides ease of scaling, building abstractions, and operation through its transactional design and automated partitioning. The goal is to solve challenges of state management so developers can focus on building applications.
Netflix’s Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. At this scale, output committers that create extra copies or can’t handle task failures are no longer practical. This talk will explain the problems that the available committers cause when writing to S3, and show how Netflix solved the committer problem.
In this session, you’ll learn:
– Some background about Spark at Netflix
– About output committers, and how both Spark and Hadoop handle failures
– How HDFS and S3 differ, and why HDFS committers don’t work well
– A new output committer that uses the S3 multi-part upload API
– How you can use this new committer in your Spark applications to avoid duplicating data
Implementing MongoDB at Shutterfly (Kenny Gorman) - MongoSF
Shutterfly implemented MongoDB to address problems with their existing metadata storage architecture using an Oracle database, including slow time to market, high costs, performance issues, and lack of scalability. They developed a new data architecture using MongoDB for its simple API, open source software, and ability to partition and distribute data. Initial results showed a 500% improvement in costs and a 900% improvement in performance and latency.
This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.
MongoDB WiredTiger Internals: Journey To Transactions - Mydbops
MongoDB adopted transactions (ACID properties) in MongoDB 4.0. This talk focuses on the internals of how MongoDB implemented the ACID properties with the WiredTiger engine. WiredTiger opens up more future possibilities for MongoDB. This tech talk was presented at the Mydbops Database Meetup on 27-04-2019 by Manosh Malai, Senior DevOps/NoSQL Consultant with Mydbops, and Ranjith, Database Administrator with Mydbops.
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce - Cloudera, Inc.
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
The document discusses compaction in RocksDB, an embedded key-value storage engine. It describes the two compaction styles in RocksDB: level style compaction and universal style compaction. Level style compaction stores data in multiple levels and performs compactions by merging files from lower to higher levels. Universal style compaction keeps all files in level 0 and performs compactions by merging adjacent files in time order. The document provides details on the compaction process and configuration options for both styles.
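The core of any such compaction is a merge of sorted runs in which the newest version of each key wins. The sketch below illustrates only that idea and is not RocksDB's implementation: the `compact` function and the `Run` type are hypothetical, and it assumes every run that could hold an older version of a key is part of the merge, so delete markers can safely be dropped.

```python
from typing import Dict, List, Optional, Tuple

# A run is a sorted list of (key, value) pairs; a value of None marks a delete.
Run = List[Tuple[str, Optional[str]]]


def compact(runs_newest_first: List[Run]) -> Run:
    """Merge runs into one sorted run, keeping the newest version of each key."""
    merged: Dict[str, Optional[str]] = {}
    for run in runs_newest_first:
        for key, value in run:
            # setdefault keeps the first value seen, i.e. the newest version.
            merged.setdefault(key, value)
    # Drop delete markers; valid only because all older runs were merged.
    live = [(k, v) for k, v in merged.items() if v is not None]
    return sorted(live, key=lambda kv: kv[0])


newer = [("a", "2"), ("b", None)]            # "b" was deleted recently
older = [("a", "1"), ("b", "x"), ("c", "3")]
result = compact([newer, older])
```

A production engine would stream the merge instead of building a dictionary in memory; the dictionary keeps the example short.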
Introduction to memcached, a caching service designed for optimizing performance and scaling in the web stack, seen from perspective of MySQL/PHP users. Given for 2nd year students of professional bachelor in ICT at Kaho St. Lieven, Gent.
Almost Perfect Service Discovery and Failover with ProxySQL and Orchestrator - Jean-François Gagné
Of course there is no such thing as perfect service discovery, and we will see why in the talk. However, the way ProxySQL is deployed in this case minimizes the risk for split-brains, and this is why I qualify it as almost perfect. But let’s step back a little...
MySQL alone is not a high availability solution. To provide resilience to primary failure, other components need to be integrated with MySQL. At MessageBird, these additional components are ProxySQL and Orchestrator. In this talk, we describe how ProxySQL is architected to provide close to perfect service discovery and how this, combined with Orchestrator, allows for automatic failover. The talk presents the details of the integration of MySQL, ProxySQL and Orchestrator in Google Cloud (and it would be easy to re-implement a similar architecture at other cloud vendors or on-premises). We will also cover lessons learned from the 2 years this architecture has been in production. Come to this talk to learn more about MySQL high availability, ProxySQL and Orchestrator.
Optimizing MariaDB for maximum performance - MariaDB plc
When it comes to optimizing the performance of a database, DBAs have to look at everything from the OS to the network. In this session, MariaDB Enterprise Architect Manjot Singh shares best practices for getting the most out of MariaDB. He highlights recommended OS settings, important configuration and tuning parameters, options for improving replication and clustering performance and features such as query result caching.
This document introduces HBase, an open-source, non-relational, distributed database modeled after Google's BigTable. It describes what HBase is, how it can be used, and when it is applicable. Key points include that HBase stores data in columns and rows accessed by row keys, integrates with Hadoop for MapReduce jobs, and is well-suited for large datasets, fast random access, and write-heavy applications. Common use cases involve log analytics, real-time analytics, and message-centric systems.
In this presentation, Raghavendra BM of Valuebound discusses the basics of MongoDB, an open-source document database and a leading NoSQL database.
Power of the Log: LSM & Append Only Data Structures - confluent
LSM trees provide an efficient way to structure databases by organizing data sequentially in logs. They optimize for write performance by batching writes together sequentially on disk. To optimize reads, data is organized into levels and bloom filters and caching are used to avoid searching every file. This log-structured approach works well for many systems by aligning with how hardware is optimized for sequential access. The immutability of appended data also simplifies concurrency. This log-centric approach can be applied beyond databases to distributed systems as well.
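The Bloom filters mentioned above let a read skip any file that definitely does not contain a key. Below is a minimal sketch, assuming SHA-256 with a per-hash prefix as the hash family and one byte per bit for readability; the class name and default sizes are illustrative and are not taken from any particular storage engine.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


file_filter = BloomFilter()
file_filter.add("apple")
```

A reader would call `might_contain` for each file's filter and open only the files that answer `True`, accepting an occasional wasted lookup in exchange for skipping most files.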
Storm is a distributed and fault-tolerant realtime computation system. It was created at BackType/Twitter to analyze tweets, links, and users on Twitter in realtime. Storm provides scalability, reliability, and ease of programming. It uses components like Zookeeper, ØMQ, and Thrift. A Storm topology defines the flow of data between spouts that read data and bolts that process data. Storm guarantees processing of all data through its reliability APIs and guarantees no data loss even during failures.
Running MariaDB in multiple data centers - MariaDB plc
The document discusses running MariaDB across multiple data centers. It begins by outlining the need for multi-datacenter database architectures to provide high availability, disaster recovery, and continuous operation. It then describes topology choices for different use cases, including traditional disaster recovery, geo-synchronous distributed architectures, and how technologies like MariaDB Master/Slave and Galera Cluster work. The rest of the document discusses answering key questions when designing a multi-datacenter topology, trade-offs to consider, architecture technologies, and pros and cons of different approaches.
This document provides an overview of NoSQL databases and compares them to relational databases. It discusses the different types of NoSQL databases including key-value stores, document databases, wide column stores, and graph databases. It also covers some common concepts like eventual consistency, CAP theorem, and MapReduce. While NoSQL databases provide better scalability for massive datasets, relational databases offer more mature tools and strong consistency models.
C* Summit 2013: The World's Next Top Data Model by Patrick McFadin - DataStax Academy
The document provides an overview and examples of data modeling techniques for Cassandra. It discusses four use cases - shopping cart data, user activity tracking, log collection/aggregation, and user form versioning. For each use case, it describes the business needs, issues with a relational database approach, and provides the Cassandra data model solution with examples in CQL. The models showcase techniques like de-normalizing data, partitioning, clustering, counters, maps and setting TTL for expiration. The presentation aims to help attendees properly model their data for Cassandra use cases.
This document provides an overview of distributed caching solutions and summarizes key points about local caching, replicated caching, and distributed caching. It discusses common use cases for distributed caching and outlines some popular open source Java caching frameworks like EHCache, Infinispan, HazelCast, Memcached, and Terracotta Server. The document also includes examples of EHCache configuration and an overview of BigMemory, EHCache's off-heap memory solution.
Building an Oracle Grid with Oracle VM on Dell Blade Servers and EqualLogic i... - Lindsey Aitchison
Having tested and validated Oracle Grid reference configurations, Dell Global Solutions engineers share their insight of how best to set up and implement this computing resource to enable networked computers to share on-demand resource pools.
This document provides an overview of Cassandra, a decentralized structured storage model. Some key points:
- Cassandra is a distributed database designed to handle large amounts of data across commodity servers. It provides high availability with no single point of failure.
- Cassandra's data model is based on Dynamo and BigTable, with data distributed across nodes through consistent hashing. It uses a column-based data structure with rows, columns, column families and supercolumns.
- Cassandra was originally developed at Facebook to address issues of high write throughput and latency for their inbox search feature, which now stores over 50TB of data across 150 nodes.
- Other large companies using Cassandra include Netflix, eBay
The document provides information about Couchbase, a NoSQL database. It discusses Couchbase's key-value data model and how data is stored and accessed. The main architectural components are nodes, clusters, buckets, and documents. Data is accessed via reads, writes, views, and N1QL queries. Couchbase provides scalability and high performance through its caching architecture and append-only disk writes.
NoSQL is not a buzzword anymore. The array of non-relational technologies has found wide-scale adoption even in non-Internet-scale focus areas. With the advent of the Cloud, the churn has increased even more, yet there is no crystal clear guidance on adoption techniques and architectural choices surrounding the plethora of options available. This session initiates you into the whys and wherefores, architectural patterns, caveats and techniques that will augment your decision making process and boost your perception of architecting scalable, fault-tolerant and distributed solutions.
In this session, we'll discuss architectural, design and tuning best practices for building rock solid and scalable Alfresco Solutions. We'll cover the typical use cases for highly scalable Alfresco solutions, like massive injection and high concurrency, also introducing 3.3 and 3.4 Transfer / Replication services for building complex high availability enterprise architectures.
Cassandra is used for real-time bidding in online advertising. It processes billions of bid requests per day with low latency requirements. Segment data, which assigns product or service affinity to user groups, is stored in Cassandra to reduce calculations and allow users to be bid on sooner. Tuning the cache size and understanding the active dataset helps optimize performance.
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab - CloudxLab
1) NoSQL databases are non-relational and schema-free, providing alternatives to SQL databases for big data and high availability applications.
2) Common NoSQL database models include key-value stores, column-oriented databases, document databases, and graph databases.
3) The CAP theorem states that a distributed data store can only provide two out of three guarantees around consistency, availability, and partition tolerance.
This document discusses how to design and deliver scalable and resilient web services. It begins by describing typical web architectures that do not scale well and can have performance issues. It then introduces Windows Server AppFabric Caching as a solution to address these issues. AppFabric Caching provides an in-memory distributed cache that can scale across servers and processes. It allows caching data in a shared cache across web servers, services and clients. This improves performance and scalability over traditional caching approaches. The document concludes by covering how to deploy, use and administer AppFabric Caching.
RadFS is a modification of HDFS that aims to improve random access performance through caching and pooling of file handles. It implements all interactions with DataNodes as stateless positioned reads. This reduces server load and allows connections and threads to be reused. Benchmark results show RadFS provides faster random reads than HDFS, though caching adds overhead and the checksum implementation requires two reads per operation. Further work is needed to optimize checksumming and implement pipelining for improved streaming performance.
Beyond The Data Grid: Coherence, Normalisation, Joins and Linear Scalability - Ben Stopford
In 2009 RBS set out to build a single store of trade and risk data that all applications in the bank could use. This talk discusses a number of novel techniques that were developed as part of this work. Based on Oracle Coherence the ODC departs from the trend set by most caching solutions by holding its data in a normalised form making it both memory efficient and easy to change. However it does this in a novel way that supports most arbitrary queries without the usual problems associated with distributed joins. We'll be discussing these patterns as well as others that allow linear scalability, fault tolerance and millisecond latencies.
Using Distributed In-Memory Computing for Fast Data Analysis - ScaleOut Software
This is an overview of how distributed data grids can enable sharing across web servers and virtual cloud environments to enable scalability and high availability. It also covers how distributed data grids are highly useful for running MapReduce analysis across large data sets.
Yaroslav Nedashkovsky - "Data Engineering in Information Security: how to col...Lviv Startup Club
This document discusses the system architecture for collecting, storing, and processing terabytes of data from viruses. It describes using Cassandra to store variety of data sources in a scalable way, PostgreSQL for some relational data, AWS Kinesis and Spark Streaming for streaming and processing data in real-time, and providing a REST API to access insights. The overall goal is to collect petabytes of data and gain insights through analytics.
This document discusses cache and consistency in NoSQL databases. It introduces distributed caching using Memcached to improve performance and reduce load on database servers. It discusses using consistent hashing to partition and replicate data across servers while maintaining consistency. Paxos is presented as an efficient algorithm for maintaining consistency during updates in a distributed system in a more flexible way than traditional 2PC and 3PC approaches.
This document discusses experiments conducted to determine the optimal hardware and software configurations for building a cost-efficient Swift object storage cluster with expected performance. It describes testing different configurations for proxy and storage nodes under small and large object upload workloads. The results show that for small object uploads, high-CPU instances performed best for storage nodes while either high-CPU or high-end instances worked well for proxies. For large object uploads, large instances were most cost-effective for storage nodes and high-end instances remained suitable for proxies. The findings provide guidance on right-sizing hardware based on workload characteristics.
- A key objective of computer systems is achieving high performance at low cost, measured by price/performance ratio.
- Processor performance depends on how fast instructions can be fetched from memory and executed.
- Caches improve performance by storing recently accessed data from main memory closer to the processor, reducing access time compared to main memory. This can increase hit rates but requires managing cache misses and write policies.
Sizing MongoDB on AWS with Wired Tiger-Patrick and Vigyan-FinalVigyan Jain
This document provides guidance on sizing MongoDB deployments on AWS for optimal performance. It discusses key considerations for capacity planning like testing workloads, measuring performance, and adjusting over time. Different AWS services like compute-optimized instances and storage options like EBS are reviewed. Best practices for WiredTiger like sizing cache, effects of compression and encryption, and monitoring tools are covered. The document emphasizes starting simply and scaling based on business needs and workload profiling.
Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...Ontico
1. The document discusses a distributed, scalable, and highly reliable data storage system for virtual machines and other uses called HighLoad++.
2. It proposes using a simplified design that focuses on core capabilities like data replication and recovery to achieve both low costs and high performance.
3. The design splits data into chunks that are replicated across multiple servers and includes metadata servers to track the location and versions of chunks to enable eventual consistency despite failures.
This document provides an overview of Apache Cassandra including its history, architecture, data modeling concepts, and how to install and use it with Python. Key points include that Cassandra is a distributed, scalable NoSQL database designed without single points of failure. It discusses Cassandra's architecture including nodes, datacenters, clusters, commit logs, memtables, and SSTables. Data modeling concepts explained are keyspaces, column families, and designing for even data distribution and minimizing reads. The document also provides examples of creating a keyspace, reading data using Python driver, and demoing data clustering.
2. Memory Hierarchy
Registers (R): <1ns
L1: ~4 cycles, ~1ns
L2: ~10 cycles, ~3ns
L3: ~42 cycles, ~15ns
DRAM: >65ns
Flash / SSD / USB
HDD
Tapes, Remote systems, etc
[Figure axis labels: Cost; Storage term]
2 Max A. Alexejev
3. Software caches
Improve response times by reducing data access latency
Offload persistent storages
Only work for IO-bound applications!
4. Caches and data location
[Diagram: cache types by data location: Local, Remote, Shared, Distributed, Hierarchical. Other labels in the diagram: Consistency protocol, Distribution algorithm.]
5. Ok, so how do we grow beyond one node?
Data replication
6. Pros and cons of replication
Pro
• Best read performance (for local replicated caches)
• Fault tolerant cache (both local and remote)
• Can be smart: replicate only part of the CRUD cycle
Con
• Poor write performance
• Additional network load
• Can scale only vertically: limited by single machine size
• In case of master-master replication, requires a complex consistency protocol
7. Ok, so how do we grow beyond one node?
Data distribution
8. Pros and cons of data distribution
Pro
• Can scale horizontally beyond single machine size
• Read and write performance scales horizontally
Con
• No fault tolerance for cached data
• Increased read latency (due to network round-trip and serialization expenses)
9. What do high-load applications need from cache?
• Low latency
• Linear horizontal scalability
• Distributed cache
10. Cache access patterns: Cache Aside
[Diagram: Client, Cache, DB]
For reading data:
1. Application asks for some data for a given key
2. Check the cache
3. If data is in the cache, return it to the user
4. If data is not in the cache, fetch it from the DB, put it in the cache, return it to the user.
For writing data:
1. Application writes some new data or updates existing.
2. Write it to the cache
3. Write it to the DB.
Overall:
• Increases read performance
• Offloads DB reads
• Introduces race conditions for writes
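The cache-aside steps on this slide can be sketched in a few lines of Python. This is only an illustration: the class name `CacheAside`, the `db_reads` counter and the use of plain dicts for both the cache and the DB are assumptions made for the example, not part of the presentation.

```python
class CacheAside:
    """Application-managed cache in front of a backing store (cache-aside)."""

    def __init__(self, db):
        self.db = db          # plain dict standing in for the database
        self.cache = {}       # plain dict standing in for the cache
        self.db_reads = 0     # hypothetical counter: how often the DB was read

    def read(self, key):
        # Read steps 2-3: check the cache and return on a hit.
        if key in self.cache:
            return self.cache[key]
        # Read step 4: on a miss, fetch from the DB, populate the cache, return.
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # The application itself writes to both places. Two concurrent
        # writers may interleave these two steps, which is the write race
        # the slide mentions.
        self.cache[key] = value
        self.db[key] = value


store = CacheAside(db={"user:1": "Ann"})
first = store.read("user:1")    # miss: goes to the DB
second = store.read("user:1")   # hit: served from the cache
store.write("user:2", "Bob")
```

Note that all cache logic lives in the application here; the cache itself stays a passive store.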
11. Cache access patterns: Read Through
[Diagram: Client, Cache, DB]
For reading data:
1. Application asks for some data for a given key
2. Check the cache
3. If data is in the cache, return it to the user
4. If data is not in the cache, the cache itself fetches it from the DB, saves the retrieved value and returns it to the user.
Overall:
• Reduces read latency
• Offloads read load from underlying storage
• May have blocking behavior, thus helping with the dog-pile effect
• Requires “smarter” cache nodes
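A minimal Python sketch of read-through with blocking behavior follows. The names `ReadThroughCache`, `loader` and `loads` are hypothetical, and a single coarse lock is assumed for brevity; a real cache would more likely block per key.

```python
import threading


class ReadThroughCache:
    """Cache that loads missing entries itself (read-through)."""

    def __init__(self, loader):
        self._loader = loader        # function key -> value, e.g. a DB query
        self._data = {}
        self._lock = threading.Lock()
        self.loads = 0               # hypothetical counter of backend loads

    def get(self, key):
        # Concurrent readers of a missing key block on the lock, so the
        # backend is hit once instead of once per reader (no dog-pile).
        with self._lock:
            if key not in self._data:
                self.loads += 1
                self._data[key] = self._loader(key)
            return self._data[key]


db = {"answer": 42}
cache = ReadThroughCache(loader=db.get)

# Eight threads ask for the same cold key at the same time.
threads = [threading.Thread(target=cache.get, args=("answer",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The application only ever calls `get`; the knowledge of how to reach the DB moves into the cache, which is why the slide calls for "smarter" cache nodes.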
12. Cache access patterns: Write Through
[Diagram: Client, Cache, DB]
For writing data:
1. Application writes some new data or updates existing.
2. Write it to the cache
3. Cache will then synchronously write it to the DB.
Overall:
• Slightly increases write latency
• Provides natural invalidation
• Removes race conditions on writes
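As a rough sketch of write-through in Python (class and attribute names are assumptions for illustration, with a dict standing in for the DB):

```python
class WriteThroughCache:
    """Cache that synchronously forwards every write to the backing store."""

    def __init__(self, db):
        self._db = db      # plain dict standing in for the database
        self._data = {}

    def put(self, key, value):
        # Step 2: write to the cache.
        self._data[key] = value
        # Step 3: the cache writes to the DB before put() returns, so the
        # caller pays the DB latency on every write.
        self._db[key] = value

    def get(self, key):
        return self._data.get(key)


db = {}
cache = WriteThroughCache(db)
cache.put("user:1", "Ann")
cache.put("user:1", "Anna")   # the update replaces the cached entry in place
```

Because every write passes through the cache, the cached entry is always the latest written value, which is the "natural invalidation" noted on the slide.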
13. Cache access patterns: Write Behind
[Diagram: Client, Cache, DB]
For writing data:
1. Application writes some new data or updates existing.
2. Write it to the cache
3. Cache adds the write request to its internal queue.
4. Later, cache asynchronously flushes the queue to the DB on a periodic basis and/or when the queue size reaches a certain limit.
Overall:
• Dramatically reduces write latency at the price of an inconsistency window
• Provides write batching
• May provide update deduplication
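The write-behind queue can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: the names `WriteBehindCache`, `max_queue` and `flushes` are hypothetical, the flush is triggered only by queue size, and the periodic timer-based flush is left out.

```python
class WriteBehindCache:
    """Cache that queues writes and flushes them to the store in batches."""

    def __init__(self, db, max_queue=3):
        self._db = db                # plain dict standing in for the database
        self._data = {}
        self._pending = {}           # key -> latest value; repeated writes
                                     # to one key collapse (deduplication)
        self._max_queue = max_queue
        self.flushes = 0             # hypothetical counter of batch writes

    def put(self, key, value):
        self._data[key] = value      # visible to cache readers immediately
        self._pending[key] = value   # the DB only sees it after a flush
        if len(self._pending) >= self._max_queue:
            self.flush()

    def get(self, key):
        return self._data.get(key)

    def flush(self):
        if not self._pending:
            return
        self._db.update(self._pending)   # one batched write
        self._pending.clear()
        self.flushes += 1


db = {}
cache = WriteBehindCache(db, max_queue=3)
cache.put("a", 1)
cache.put("a", 2)                        # deduplicated with the previous write
cache.put("b", 1)
db_during_window = dict(db)              # still empty: inconsistency window
cache.put("c", 1)                        # third distinct key triggers a flush
```

Four `put` calls result in a single batched DB write of three entries, at the cost of the DB lagging behind the cache until the flush.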
14. A variety of products on the market…
Memcached
Hazelcast
Cassandra
GigaSpaces
Redis
Terracotta
Oracle Coherence
Infinispan
MongoDB
Riak
EhCache
…
15. Lets sort em out!
KV caches: Memcached, Ehcache, …
NoSQL: Redis, Cassandra, MongoDB, …
Data Grids: Oracle Coherence, GemFire, GigaSpaces, GridGain, Hazelcast, Infinispan
Some products are really hard to sort – like Terracotta in both DSO and Express modes.
16. Why don’t we have any distributed in-memory RDBMS?
Master – MultiSlaves configuration
• Is, in fact, an example of replication
• Helps with read distribution, but does not help with writes
• Does not scale beyond single master
Horizontal partitioning (sharding)
• Helps with reads and writes for datasets with good data affinity
• Does not work nicely with join semantics (i.e., there are no distributed joins)
17. Key-Value caches
• Memcached and EHCache are good examples to look at
• Keys and values are arbitrary binary (serializable) entities
• Basic operations are put(K,V), get(K), replace(K,V), remove(K)
• May provide group operations like getAll(…) and putAll(…)
• Some operations provide atomicity guarantees (CAS, inc/dec)
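To make the operation set concrete, here is a small in-process Python sketch of such an interface. It is not the API of Memcached or EHCache; the class `KVCache` and its method names are assumptions, and the compare-and-set compares values directly, whereas real products typically compare a version token.

```python
import threading


class KVCache:
    """Toy key-value cache with the basic, group and atomic operations."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()   # makes each operation atomic

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def replace(self, key, value):
        """Store the value only if the key already exists."""
        with self._lock:
            if key not in self._data:
                return False
            self._data[key] = value
            return True

    def remove(self, key):
        with self._lock:
            return self._data.pop(key, None)

    def get_all(self, keys):
        with self._lock:
            return {k: self._data[k] for k in keys if k in self._data}

    def compare_and_set(self, key, expected, new):
        """CAS: update only if the current value equals the expected one."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True

    def incr(self, key, delta=1):
        """Atomic increment; a missing key is treated as 0."""
        with self._lock:
            self._data[key] = self._data.get(key, 0) + delta
            return self._data[key]


kv = KVCache()
kv.put("hits", 10)
won = kv.compare_and_set("hits", 10, 11)    # succeeds: value was 10
lost = kv.compare_and_set("hits", 10, 12)   # fails: value is now 11
```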
18. Memcached
• Developed for LiveJournal in 2003
• Has client libraries in PHP, Java, Ruby, Python and many others
• Nodes are independent and don’t communicate with each other
19. EHCache
• Initially named “Easy Hibernate Cache”
• Java-centric, mature product with open-source and commercial editions
• Open-source version provides only replication capabilities; distributed caching requires a commercial license for both EHCache and Terracotta TSA
20. NoSQL Systems
A whole bunch of different products with both persistent and non-persistent storage options. Let’s call them storages and caches, accordingly.
Built to provide good horizontal scalability
Try to fill the feature gap between pure KV and full-blown RDBMS
21. Case study: Redis
Written in C, supported by VMWare
Client libraries for C, C#, Java, Scala, PHP, Erlang, etc
Single-threaded async impl
Has configurable persistence
Works with K-V pairs, where K is a string and V may be either number, string or Object (JSON)
Provides 5 interfaces for: strings, hashes, lists, sets, sorted sets
Supports transactions
Example commands:
hset users:goku powerlevel 9000
hget users:goku powerlevel
22. Use cases: Redis
Good for fixed lists, tagging, ratings, counters, analytics and queues (pub-sub messaging)
Has Master – MultiSlave replication support. Master node is currently a SPOF.
Distributed Redis was named “Redis Cluster” and is currently under development
23. Case study: Cassandra
• Written in Java, developed at Facebook.
• Inspired by Amazon Dynamo replication mechanics, but uses column-based data model.
• Good for logs processing, index storage, voting, jobs storage etc.
• Bad for transactional processing.
• Want to know more? Ask Alexey!
24. In-Memory Data Grids
New generation of caching products, trying to combine benefits of replicated and distributed schemes.
25. IMDG: Evolution
Data Grids
• Reliable storage and live data balancing among grid nodes
Computational Grids
• Reliable jobs execution, scheduling and load balancing
Both lines lead to the Modern IMDG.
26. IMDG: Caching concepts
• Implements KV cache interface
• Provides indexed search by values
• Provides reliable distributed locks interface
• Caching scheme – partitioned or distributed, may be specified per cache or cache service
• Provides events subscription for entries (change notifications)
• Configurable fault tolerance for distributed schemes (HA)
• Equal data (and read/write load) distribution among grid nodes
• Live data redistribution when nodes are going up or down – no data loss, no clients termination
• Supports read-through (RT), write-through (WT) and write-behind (WB) caching patterns and hierarchical caches (near caching)
• Supports atomic computations on grid nodes
27. IMDG: Under the hood
• All data is split into a number of sections, called partitions.
• A partition, rather than an entry, is the atomic unit of data migration when the grid rebalances. The number of partitions is fixed for the cluster lifetime.
• Indexes are distributed among grid nodes.
• Clients may or may not be part of the grid cluster.
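The role of a fixed partition count during rebalancing can be illustrated with a toy Python model. Everything here is an assumption made for the sketch: the partition count of 12, the CRC32 hash, the node names, and the greedy rebalancing strategy; real grids use their own algorithms. For simplicity the sketch also assumes the partition count is divisible by the number of nodes.

```python
import zlib

PARTITION_COUNT = 12   # hypothetical value; fixed for the cluster lifetime


def partition_of(key):
    """A key always maps to the same partition, whatever the cluster size."""
    return zlib.crc32(key.encode("utf-8")) % PARTITION_COUNT


def assign(nodes):
    """Initial round-robin partition -> node ownership table."""
    return {p: nodes[p % len(nodes)] for p in range(PARTITION_COUNT)}


def rebalance(owners, nodes):
    """Return a new ownership table, moving whole partitions only.

    Partitions are taken from departed or overloaded nodes and handed to
    the currently least loaded node; all other partitions stay in place.
    """
    target = PARTITION_COUNT // len(nodes)
    counts = {n: 0 for n in nodes}
    for owner in owners.values():
        if owner in counts:
            counts[owner] += 1
    new_owners = dict(owners)
    for p in sorted(new_owners):
        owner = new_owners[p]
        if owner not in counts or counts[owner] > target:
            receiver = min(nodes, key=lambda n: counts[n])
            if owner in counts:
                counts[owner] -= 1
            new_owners[p] = receiver
            counts[receiver] += 1
    return new_owners


three = assign(["A", "B", "C"])
four = rebalance(three, ["A", "B", "C", "D"])        # node D joins
moved = [p for p in three if three[p] != four[p]]    # partitions that migrated
```

In this model a joining node receives three whole partitions while the other nine stay where they were, and no key ever changes its partition number.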
28. IMDG Under the hood: Requests routing
For get() and put() requests:
1. The cluster member that makes a request calculates the key hash code.
2. Partition number is calculated using this hash code.
3. Node is identified by partition number.
4. Request is then routed to the identified node, executed, and results are sent back to the client member that initiated the request.
For filter queries:
1. The cluster member initiating the request sends it to all storage-enabled nodes in the cluster.
2. Query is executed on every node using distributed indexes and partial results are sent to the requesting member.
3. Requesting member merges partial results locally.
4. Final result set is returned from the filter method.
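Both routing paths can be sketched in one small in-process Python model. The `Grid` class, the CRC32 hash and the node names are hypothetical; nodes are plain dicts rather than remote processes, and a full scan on each node stands in for the distributed index lookup.

```python
import zlib


class Grid:
    """Toy grid: keys are routed by partition, queries are scatter-gathered."""

    def __init__(self, node_names, partition_count):
        self.partition_count = partition_count
        # Partition table: partition number -> owning node.
        self.owners = {p: node_names[p % len(node_names)]
                       for p in range(partition_count)}
        # Each node's local storage.
        self.storage = {name: {} for name in node_names}

    def node_for(self, key):
        key_hash = zlib.crc32(key.encode("utf-8"))    # step 1: key hash code
        partition = key_hash % self.partition_count   # step 2: partition number
        return self.owners[partition]                 # step 3: owning node

    def put(self, key, value):
        self.storage[self.node_for(key)][key] = value   # step 4: route, execute

    def get(self, key):
        return self.storage[self.node_for(key)].get(key)

    def filter(self, predicate):
        # Steps 1-2: every storage node evaluates the query on its own data.
        partials = [{k: v for k, v in data.items() if predicate(v)}
                    for data in self.storage.values()]
        # Steps 3-4: the requesting member merges the partial results.
        merged = {}
        for part in partials:
            merged.update(part)
        return merged


grid = Grid(["node-a", "node-b", "node-c"], partition_count=8)
for i in range(20):
    grid.put(f"item:{i}", i)
evens = grid.filter(lambda v: v % 2 == 0)
```

A get() or put() touches exactly one node, while a filter query touches all of them, which is why key-based access scales better than queries.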
29. IMDG: Advanced use-cases
Messaging
Map-Reduce calculations
Cluster-wide singleton
And more…
30. GC tuning for large grid nodes
An easy way to go: rolling restarts of storage-enabled cluster nodes. Cannot be used in every project.
A complex way to go: fine-tune the CMS collector to ensure that it always keeps up with cleaning garbage concurrently under normal production workload.
An expensive way to go: use OffHeap storages provided by some vendors (Oracle, Terracotta) and use direct memory buffers available to the JVM.
31. IMDG: Market players
Oracle Coherence: commercial, free for evaluation use.
GigaSpaces: commercial.
GridGain: commercial.
Hazelcast: open-source.
Infinispan: open-source.
32. Terracotta
A company behind EHCache, Quartz and Terracotta Server Array.
Acquired by Software AG.
33. Terracotta Server Array
All data is split in a number of sections, called stripes.
Stripes consist of 2 or more Terracotta nodes. One of them is Active node, others have Passive status.
All data is distributed among stripes and replicated inside stripes.
Open Source limitation: only one stripe. Such setup will support HA, but will not distribute cache data. I.e., it is not horizontally scalable.