A major problem in computer science is performing element counts, distinct-element counts, and presence checks on enormous streams of data, in real time, at scale, without breaking the bank or over-complicating our apps. In this talk, we'll learn how to address these problems using probabilistic data structures. We'll learn what a probabilistic data structure is, dig into the guts of the Count-Min Sketch, Bloom Filter, and HyperLogLog data structures, and finally talk about using them in our apps at scale.
Real-time Analytics with Trino and Apache Pinot - Xiang Fu
Trino Summit 2021:
Overview of the Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's real-time analytics, giving you the best of both worlds.
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan... - DataStax
Successfully running Apache Cassandra in production often means knowing what configuration settings to change and which ones to leave as default. Over the years the cassandra.yaml file has grown to provide a number of settings that can improve stability and performance. While the file contains plenty of helpful comments, there is more to be said about the settings and when to change them.
In this talk Edward Capriolo, Consultant at The Last Pickle, will break down the parameters in the configuration files, looking at those that are essential to getting started, those that impact performance, those that improve availability, the exotic ones, and the ones that should not be played with. This talk is ideal for anyone from someone setting up Cassandra for the first time to people with deployments in production who wonder what the more exotic configuration options do.
About the Speaker
Edward Capriolo - Consultant, The Last Pickle
Long time Apache Cassandra user, big data enthusiast.
Ray Serve: A new scalable machine learning model serving library on Ray - Simon Mo
When a machine learning model needs to be served for interactive use cases, the model is either wrapped inside a Flask server or deployed using external services like SageMaker. Both methods come with flaws. In this talk, you will learn how Ray Serve uses Ray to address the limitations of current approaches and enable scalable model serving.
How Zalando runs Kubernetes clusters at scale on AWS - AWS re:Invent - Henning Jacobs
Many clusters, many problems? Having many clusters has benefits: reduced blast radius, less vertical scaling of cluster components, and a natural trust boundary. In this session, Zalando shows its approach for running 140+ clusters on AWS, how it does continuous delivery for its cluster infrastructure, and how it created open-source tooling to manage cost efficiency and improve developer experience. The company openly shares its failures and the learnings collected during three years of Kubernetes in production.
AWS re:Invent session OPN211 on 2019-12-05
Cloud Spanner is the first and only relational database service that is both strongly consistent and horizontally scalable. With Cloud Spanner you enjoy all the traditional benefits of a relational database: ACID transactions, relational schemas (and schema changes without downtime), SQL queries, high performance, and high availability. But unlike any other relational database service, Cloud Spanner scales horizontally, to hundreds or thousands of servers, so it can handle the highest of transactional workloads.
The document discusses routed networks in OpenStack Neutron. It describes how routed networks implement layer 3 connectivity while allowing scalability by associating subnets to network segments. Key points include new Neutron APIs for segments and ports in routed networks, integration with the Nova scheduler, and options for implementing distributed virtual routing with features like floating IPs, multiple availability zones, and BGP routing.
DynamoDB and Schema Design
The document discusses DynamoDB schema design and core concepts. It covers how to model different types of relationships in DynamoDB, using primary keys, secondary indexes, and global secondary indexes. It also discusses techniques for optimizing performance and minimizing costs, such as using projections, sparse indexes, and sharding indexes. The document provides an overview of DynamoDB components and new transactional APIs.
The document discusses Hadoop, an open-source software framework that allows distributed processing of large datasets across clusters of computers. It describes Hadoop as having two main components - the Hadoop Distributed File System (HDFS) which stores data across infrastructure, and MapReduce which processes the data in a parallel, distributed manner. HDFS provides redundancy, scalability, and fault tolerance. Together these components provide a solution for businesses to efficiently analyze the large, unstructured "Big Data" they collect.
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori... - CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2L6bZbn
This CloudxLab Introduction to Spark Streaming & Apache Kafka tutorial helps you to understand Spark Streaming and Kafka in detail. Below are the topics covered in this tutorial:
1) Spark Streaming - Workflow
2) Use Cases - E-commerce, Real-time Sentiment Analysis & Real-time Fraud Detection
3) Spark Streaming - DStream
4) Word Count Hands-on using Spark Streaming
5) Spark Streaming - Running Locally Vs Running on Cluster
6) Introduction to Apache Kafka
7) Apache Kafka Hands-on on CloudxLab
8) Integrating Spark Streaming & Kafka
9) Spark Streaming & Kafka Hands-on
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
Choose Your Weapon: Comparing Spark on FPGAs vs GPUs - Databricks
Today, general-purpose CPU clusters are the most widely used environment for data analytics workloads. Recently, acceleration solutions employing field-programmable hardware have emerged providing cost, performance and power consumption advantages. Field programmable gate arrays (FPGAs) and graphics processing units (GPUs) are two leading technologies being applied. GPUs are well-known for high-performance dense-matrix, highly regular operations such as graphics processing and matrix manipulation. FPGAs are flexible in terms of programming architecture and are adept at providing performance for operations that contain conditionals and/or branches. These architectural differences have significant performance impacts, which manifest all the way up to the application layer. It is therefore critical that data scientists and engineers understand these impacts in order to inform decisions about if and how to accelerate.
This talk will characterize the architectural aspects of the two hardware types as applied to analytics, with the ultimate goal of informing the application programmer. Recently, both GPUs and FPGAs have been applied to Apache SparkSQL, via services on Amazon Web Services (AWS) cloud. These solutions’ goal is providing Spark users high performance and cost savings. We first characterize the key aspects of the two hardware platforms. Based on this characterization, we examine and contrast the sets and types of SparkSQL operations they accelerate well, how they accelerate them, and the implications for the user’s application. Finally, we present and analyze a performance comparison of the two AWS solutions (one FPGA-based, one GPU-based). The tests employ the TPC-DS (decision support) benchmark suite, a widely used performance test for data analytics.
You’re ready to make your applications more responsive, scalable, fast and secure. Then it’s time to get started with NGINX. In this webinar, you will learn how to install NGINX from a package or from source onto a Linux host. We’ll then look at some common operating system tunings you could make to ensure your NGINX install is ready for prime time.
View full webinar on demand at http://nginx.com/resources/webinars/installing-tuning-nginx/
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang - Databricks
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of the Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by David Anderson
Log analysis challenges include searching logs across multiple services and servers. The ELK stack provides a solution with Logstash to centralize log collection, Elasticsearch for storage and search, and Kibana for visualization. Logstash uses input, filter, and output plugins to collect, parse, and forward logs. Example configurations show using stdin and filters to parse OpenStack logs before outputting to Elasticsearch and Kibana for analysis and dashboards.
Amazon EC2 provides a broad selection of instance types to accommodate a diverse mix of workloads. In this session, we provide an overview of the Amazon EC2 instance platform, key platform features, and the concept of instance generations. We dive into the current generation design choices of the different instance families, including the General Purpose, Compute Optimized, Storage Optimized, Memory Optimized, and GPU instance families. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
The document discusses techniques for storing time series data at scale in a time series database (TSDB). It describes storing 16 bytes of data per sample by compressing timestamps and values. It proposes organizing data into blocks, chunks, and files to handle high churn rates. An index structure uses unique IDs and sorted label mappings to enable efficient queries over millions of time series and billions of samples. Benchmarks show the TSDB can handle over 100,000 samples/second while keeping memory, CPU and disk usage low.
The document provides an overview of Kubernetes networking concepts including single pod networking, pod to pod communication, service discovery and load balancing, external access patterns, network policies, Istio service mesh, multi-cluster networking, and best practices. It covers topics such as pod IP addressing, communication approaches like L2, L3, overlays, services, ingress controllers, network policies, multi-cluster use cases and deployment options.
Why Kubernetes? Cloud Native and Developer Experience at Zalando - Enterprise... - Henning Jacobs
Kubernetes has established itself as the de facto standard for cloud-native platforms. But why? What are the advantages and pitfalls in practice? Using Zalando as an example, Henning Jacobs shows how Kubernetes serves as infrastructure for 1200+ developers, which aspects make Kubernetes unique despite its complexity, and what this means for the developer experience.
Introduction to Big Data & Hadoop Architecture - Module 1 - Rohit Agrawal
Learning Objectives - In this module, you will understand what Big Data is, the limitations of existing solutions to the Big Data problem, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of file writes and reads.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... - Databricks
In the Big Data field, Spark SQL is an important data processing module that lets Apache Spark work with structured, row-based data in a majority of operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can bring not only better performance but also lower power consumption by accelerating the CPU-intensive segments of an application.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
Ozone is an object store for Hadoop. Ozone solves the small file problem of HDFS, allowing users to store trillions of files in Ozone and access them as if they were on HDFS. Ozone plugs into existing Hadoop deployments seamlessly, and programs like Hive, LLAP, and Spark work without any modifications. This talk looks at the architecture, reliability, and performance of Ozone.
In this talk, we will also explore the Hadoop distributed storage layer, a block storage layer that makes this scaling possible, and how we plan to use it for scaling HDFS.
We will demonstrate how to install an Ozone cluster, how to create volumes, buckets, and keys, how to run Hive and Spark against HDFS and Ozone file systems using federation, so that users don’t have to worry about where the data is stored. In other words, a full user primer on Ozone will be part of this talk.
Speakers
Anu Engineer, Software Engineer, Hortonworks
Xiaoyu Yao, Software Engineer, Hortonworks
This document discusses 10 techniques for persistence on macOS beyond using LaunchAgents. It provides an overview of techniques such as modifying shell startup files, using Pluggable Authentication Modules (PAM), Hammerspoon, preference panes, screen savers, color pickers, periodic scripts, Terminal preferences, emond (the event monitor daemon), and folder actions. For each technique, it describes how it works and how it could be detected and monitored for malicious use. The conclusion is that while malware most commonly uses LaunchAgents, there are many unexplored persistence options that blue teams need to prepare for.
Stage Level Scheduling Improving Big Data and AI Integration - Databricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API with GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
This document provides an overview of probabilistic data structures including Bloom filters, Cuckoo filters, Count-Min sketch, the majority algorithm, linear counting, LogLog, HyperLogLog, locality-sensitive hashing, MinHash, and SimHash. It discusses how Bloom filters use hash functions to insert elements into a bit array and can tell whether an element is likely present or definitely not present. It also explains how Count-Min sketch tracks frequencies in a stream by using hash functions to increment counters and taking the minimum value across the corresponding hash table cells. Finally, it summarizes how HyperLogLog estimates cardinality by tracking the maximum number of leading zeros in hashed values.
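As a concrete companion to that last point, here is a toy C# estimator in the LogLog/HyperLogLog family. It is a sketch under stated assumptions, not a production implementation: the bucket count, the FNV-1a hash, and the omission of HyperLogLog's small-range (linear counting) correction are all simplifications.

```csharp
using System;

// Toy HyperLogLog-style estimator (illustration only): hash each item,
// use the low B bits to pick a bucket, and keep the maximum "rank"
// (position of the first set bit in the remaining bits) seen per bucket.
class ToyHyperLogLog
{
    const int B = 10;            // 2^10 = 1024 buckets, ~3% standard error
    const int M = 1 << B;
    readonly byte[] buckets = new byte[M];

    public void Add(string item)
    {
        ulong h = Fnv1a64(item);
        int bucket = (int)(h & (ulong)(M - 1)); // low B bits choose the bucket
        ulong rest = h >> B;

        // Rank = position of the lowest set bit; statistically equivalent
        // to the paper's leading-zero count for a uniform hash.
        byte rank = 1;
        while (rank <= 64 - B && (rest & 1) == 0) { rank++; rest >>= 1; }
        if (rank > buckets[bucket]) buckets[bucket] = rank;
    }

    public double Estimate()
    {
        double alpha = 0.7213 / (1 + 1.079 / M); // bias correction for large M
        double sum = 0;
        foreach (byte b in buckets) sum += Math.Pow(2.0, -b);
        return alpha * M * (double)M / sum;      // harmonic-mean estimate
    }

    static ulong Fnv1a64(string s)               // simple non-cryptographic hash
    {
        ulong h = 14695981039346656037UL;
        foreach (char c in s) { h ^= c; h *= 1099511628211UL; }
        return h;
    }
}
```

After adding n distinct items, Estimate() typically lands within a few percent of n (roughly 1.04/sqrt(1024), about 3% standard error for 1,024 buckets) while storing only about a kilobyte of state.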
The document provides guidance on how to succeed in coding interviews. It explains that employers use coding interviews to assess applicants' problem solving, algorithm design, and coding skills. It advises applicants to ask questions, think out loud, take time to design an approach before coding, test code thoroughly, and be prepared to discuss code efficiency. Examples of common coding interview questions and how to approach them are also provided. The document aims to help applicants feel comfortable with problem types and perform well in coding interviews.
This document discusses various topics related to writing Verilog code that will efficiently synthesize to digital logic circuits, including:
- For loops can synthesize if the number of iterations is fixed. Loops will be "unrolled" during synthesis.
- Generate statements allow conditional or parameterized instantiation of modules before simulation.
- Coding styles like using multi-cycle designs or sharing logic units can reduce area compared to naively written code.
- Gated clocks require careful handling to avoid timing issues between clock domains.
- Conditional assignments and always blocks require complete cases or default assignments to avoid unintended latches.
- Examples compare the area and delay of different implementations of multipliers,
In this talk I will try to explain the memory internals of Python and discover how it handles memory management and object creation.
The idea is to explain how objects are created and deleted in Python and how the garbage collector (gc) functions to automatically release memory when the object taking the space is no longer in use.
I will review the main mechanisms for memory allocation and how the garbage collector works in conjunction with the memory manager for reference counting of Python objects.
Finally, I will comment on best practices for memory management, such as writing efficient code in Python scripts.
In this talk I will try to explain the memory internals of Python and discover how it handles memory management and object creation. The idea is to explain how objects are created and deleted in Python and how the garbage collector (gc) functions to automatically release memory when the object taking the space is no longer in use.
Also I will review the main mechanisms for memory allocation and how the garbage collector works in conjunction with the memory manager for reference counting of Python objects.
Finally, I will comment on best practices for memory management, such as writing efficient code.
These could be the main talking points:
- Introduction to memory management
- Garbage collection and reference counting with Python
- Review of the gc module for configuring the Python garbage collector
- Best practices for memory management
Basics in Algorithms and Data Structures - Eman Magdy
The document discusses data structures and algorithms. It notes that good programmers focus on data structures and their relationships, while bad programmers focus on code. It then provides examples of different data structures like trees and binary search trees, and algorithms for searching, inserting, deleting, and traversing tree structures. Key aspects covered include the time complexity of different searching algorithms like sequential search and binary search, as well as how to implement operations like insertion and deletion on binary trees.
Python for Scientific Computing - Ricardo Cruz (rpmcruz)
This document discusses Python for scientific computing. It provides notes on NumPy, the fundamental package for scientific computing in Python. NumPy allows vectorized mathematical operations on multidimensional arrays in a simple and efficient manner. The notes cover common NumPy operations and syntax as compared to MATLAB and R. Pandas is also introduced as a package for data manipulation and analysis based on the concept of data frames from R. Examples are given of generating fake data to demonstrate modeling capabilities in Python.
The document discusses different representations of signed integers in binary, including sign-magnitude, one's complement, and two's complement. It also covers binary addition and subtraction using two's complement representation. Memory addressing and memory operations like load and store are explained. Different instruction formats (one-address, two-address, and three-address) are introduced along with examples.
Probabilistic data structures. Part 2. Cardinality - Andrii Gakhov
The book "Probabilistic Data Structures and Algorithms in Big Data Applications" is now available at Amazon and from local bookstores. More details at https://pdsa.gakhov.com
In the presentation, I described common data structures and algorithms to estimate the number of distinct elements in a set (cardinality), such as Linear Counting, HyperLogLog, and HyperLogLog++. Each approach comes with the math behind it and simple examples to clarify the theory.
Too Much Data? - Just Sample, Just Hash, ... - Andrii Gakhov
Code & Supply | Pittsburgh Meetup | May 31, 2019
Probabilistic Data Structures and Algorithms (PDSA) is a common name for data structures based on different hashing techniques. They have been incorporated into Spark SQL. They are also used by Amazon Redshift and Google BigQuery, Redis and Elasticsearch, and many others. Consequently, PDSA is not just an interesting academic topic.
Book "Probabilistic Data Structures and Algorithms for Big Data Applications" (ISBN: 978-3748190486) https://pdsa.gakhov.com
This document provides an overview and summary of the HyperMinHash sketch, which is a probabilistic data structure that can estimate the size of a dataset and the size of unions between datasets using only logarithmic space. It discusses how counting problems like "how many users like cats or dogs" can be solved efficiently using sketching techniques instead of needing to process all the raw data. The HyperMinHash sketch works by hashing elements and tracking the largest number of leading zeros observed in each "bucket" to estimate sizes, and sketches can be merged to estimate unions between sets. It enables queries like set operations and similarity to be performed quickly and cheaply on large datasets.
Melbourne, Australia Global Day of Code Retreat 2018 gdcr18 - Event Slides - Victoria Schiffer
On 17th November, the Global Day of CodeRetreat took place across the globe. In Melbourne, 28 developers rocked up to code and learn together. Here are Melbourne's slides, including the chosen constraints for the day.
Thanks to my awesome co-hosts, co-facilitators and co-organisers Tomasz Janowski (REA), Luke McCarthy (ThoughtWorks), Ren Koh (SEEK), Terence Duong (SEEK), Seamus Kearney (SEEK), Birgitta Mabbett (SEEK) and Kelly Benson (ThoughtWorks) for a great team effort to making the day successful and heaps of fun!
The document summarizes the Count-Min Sketch streaming algorithm. It uses a two-dimensional array and d independent hash functions to estimate item frequencies in a data stream using sublinear space. It works by incrementing the appropriate counters in each row when an item arrives. The estimated frequency of an item is the minimum value across the rows. Analysis shows that for an array width w proportional to 1/ε, the estimate will be within an additive error of ε times the total frequency with high probability.
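To make that scheme concrete, here is a minimal C# Count-Min Sketch, an illustration only: the default width/depth values and the double-hashing trick used to derive the d hash functions are assumptions for the example, not recommendations.

```csharp
using System;
using System.Collections.Generic;

// Minimal Count-Min Sketch: a depth x width grid of counters.
// Each row has its own hash function; Add increments one cell per row,
// and Estimate takes the minimum across rows.
class CountMinSketch
{
    readonly long[,] table;
    readonly int width, depth;

    // In the usual analysis, width ~ e/epsilon and depth ~ ln(1/delta).
    public CountMinSketch(int width = 2000, int depth = 5)
    {
        this.width = width;
        this.depth = depth;
        table = new long[depth, width];
    }

    public void Add(string item, long count = 1)
    {
        foreach (var (row, col) in Cells(item))
            table[row, col] += count;
    }

    // Never underestimates; may overestimate due to hash collisions.
    public long Estimate(string item)
    {
        long min = long.MaxValue;
        foreach (var (row, col) in Cells(item))
            min = Math.Min(min, table[row, col]);
        return min;
    }

    IEnumerable<(int, int)> Cells(string item)
    {
        // Derive depth hash functions from two base hashes
        // (the Kirsch-Mitzenmacher double-hashing trick). Note that
        // string.GetHashCode() is randomized per process in .NET, which
        // is fine for a sketch that lives within a single run.
        int h1 = item.GetHashCode();
        int h2 = Fnv1a(item);
        for (int row = 0; row < depth; row++)
        {
            long combined = (uint)h1 + (long)row * (uint)h2;
            yield return (row, (int)(combined % width));
        }
    }

    static int Fnv1a(string s) // simple non-cryptographic 32-bit hash
    {
        unchecked
        {
            int h = (int)2166136261;
            foreach (char c in s) h = (h ^ c) * 16777619;
            return h;
        }
    }
}
```

With width proportional to 1/ε, Estimate returns the true count plus at most ε times the total stream frequency with high probability, which is exactly the guarantee quoted above.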
This document provides an overview of the Python programming language in under 90 minutes. It covers Python basics like Hello World, variables, data types, objects, functions, conditionals, and more. The goal is to teach readers enough Python to read, write, and understand basic Python programs in a short period of time. It also provides references to additional resources like the author's book for learning Python in more depth.
This document provides an overview of the Python programming language in under 90 minutes. It discusses Python basics like Hello World, variables, data types, objects, functions, conditionals, and more. The goal is to teach readers enough Python to read, write, and understand basic Python programs in a short period of time. It also recommends the book "Treading on Python Volume 1" which covers the content of this talk in more detail.
This document discusses algorithms and their analysis. It covers linear search, binary search, sorting algorithms, and analyzing time complexity using Big O, Big Theta, and Big Omega notation. Binary search runs in logarithmic time O(log n) while linear search takes linear time O(n). A brute-force traveling salesperson algorithm takes factorial time O(n!) and is considered inefficient.
The document discusses generative adversarial networks (GANs) and their applications to image-to-image translation tasks. It introduces DiscoGAN and CycleGAN, two models that use GANs to translate between image domains without paired training data. The key differences between DiscoGAN and CycleGAN are discussed, including architectural choices like the number of layers, normalization methods, and losses used. CycleGAN adds reconstruction losses and uses residual blocks, while DiscoGAN has a simpler architecture with fewer layers.
Lecture 4: Frequent Itemsets, Association Rules. Evaluation. Beyond Apriori (ppt, pdf)
Chapter 6 from the book “Introduction to Data Mining” by Tan, Steinbach, Kumar.
Chapter 6 from the book Mining Massive Datasets by Anand Rajaraman and Jeff Ullman.
Probabilistic Data Structures and Approximate Solutions - Oleksandr Pryymak
Probabilistic and approximate data structures can provide scalable solutions when exact answers are not required. They trade accuracy for speed and efficiency. Approaches like sampling, hashing, cardinality estimation, and probabilistic databases allow analyzing large datasets while controlling error rates. Example techniques discussed include Bloom filters, locality-sensitive hashing, count-min sketches, HyperLogLog, and feature hashing for machine learning. The talk provided code examples and comparisons of these probabilistic methods.
The document discusses register allocation techniques in LLVM. It introduces the register allocation problem and describes LLVM's template method for register allocation. It then discusses LLVM's basic greedy register allocation approach, which prioritizes assigning registers to live intervals with higher spill costs. Finally, it describes LLVM's greedy register allocation in more detail, including techniques for live interval splitting such as local, instruction, region and block splitting using algorithms like Hopfield networks.
Steve Lorello presented on Redis, dispelling common myths and demonstrating how to store and query complex objects. The key points covered included:
1. Redis can provide ACID transactions through individual commands and scripting. Data is always consistent on a single instance and eventually consistent with replication.
2. Redis ensures durability through append-only files and snapshotting databases to disk.
3. Complex objects can be stored in Redis as hashes, structured blobs, or JSON for flexible querying.
4. Secondary indexes like sorted sets or RediSearch enable finding objects by values through queries.
5. Redis OM and LINQ simplify .NET applications by providing declarative indexes and query capabilities.
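Point 4's sorted-set indexing pattern is easy to show in C#. Here is a rough sketch using the StackExchange.Redis client; the key names and the person fields are made up for the example:

```csharp
using System;
using StackExchange.Redis;

class SecondaryIndexDemo
{
    static void Main()
    {
        // Assumes a Redis server listening on localhost:6379.
        var muxer = ConnectionMultiplexer.Connect("localhost");
        IDatabase db = muxer.GetDatabase();

        // Store two "complex objects" as hashes.
        db.HashSet("person:1", new HashEntry[] { new("name", "Ada"), new("age", 36) });
        db.HashSet("person:2", new HashEntry[] { new("name", "Bob"), new("age", 52) });

        // Secondary index: a sorted set scoring each person's key by age.
        db.SortedSetAdd("person:index:age", "person:1", 36);
        db.SortedSetAdd("person:index:age", "person:2", 52);

        // Query the index: everyone aged 30-40, then fetch each hash.
        foreach (RedisValue key in db.SortedSetRangeByScore("person:index:age", 30, 40))
        {
            RedisValue name = db.HashGet((string)key, "name");
            Console.WriteLine($"{key}: {name}"); // person:1: Ada
        }
    }
}
```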
An Introduction to Redis for .NET Developers.pdf - Stephen Lorello
Redis is an in-memory data structure store that can be used as a database, cache, or message broker. It supports various data structures like strings, hashes, lists, sets, sorted sets, streams and geospatial indexes. Redis is single-threaded, runs on a single server, and is very fast. It provides different deployment modes like standalone, sentinel-based replication for high availability, and clustering for horizontal scaling. The StackExchange.Redis client is recommended for .NET developers to connect and interact with Redis.
Redis is a popular, powerful, and—most of all—FAST database. Developers worldwide use Redis as a Cache, a session store, a message bus, and even as their regular database. In this session, we'll introduce Redis, and learn some of the key design patterns that you can use with Redis. Then we'll go over how Redis fits into the ecosystem, and how to use it in our applications—including some of the major gotchas to avoid.
Indexing, searching, and aggregation with RediSearch and .NET - Stephen Lorello
This document discusses indexing, searching, and aggregation in Redis using RediSearch and .NET. It provides an introduction to Redis data structures and building secondary indices. It then covers using RediSearch to define schemas, query data through full text search and filters, and perform aggregations through grouping, reductions, and applying functions. RediSearch provides an easier way to index and query Redis compared to building secondary indices in vanilla Redis.
ASP.NET and its cousin ASP.NET Core have long been dominant and intuitive tools for us to build our Web Applications. There's always friction when moving between the front and back ends, though, as we need to leave our comfortable .NET environment to add the logic to our web pages. We've had a hard dependency on Javascript, and all the flaws and warts that go along with it. With the advent of Blazor, we can start to move past this. In this talk, we'll be exploring Blazor Server and Blazor WebAssembly and discussing how we can begin building the Frontends of our web applications without Javascript!
This document provides an overview of computer vision and facial detection using .NET. It discusses digital images, convolution and edge detection using Sobel filters. It also covers convolutional neural networks and their limitations. Facial detection is demonstrated using Viola-Jones detection, integral images and cascading classifiers. Finally, it shows how to integrate facial detection with the Vonage Video API in a WPF application by intercepting video frames and running detection on each one before rendering.
The document provides an agenda and overview for an introduction to computer vision in .NET. The agenda includes discussing what digital images and computer vision are, demonstrating Hello OpenCV in .NET, covering convolution and edge detection, facial detection, facial detection with the Vonage Video API, and feature tracking and image projection. The document shares code examples and resources for working with computer vision libraries like OpenCV and Emgu CV in .NET.
Hand Rolled Applicative User Validation Code Kata - Philip Schwarz
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather, to provide a small, rough-and-ready exercise to reinforce your muscle memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
Measures in SQL (SIGMOD 2024, Santiago, Chile) - Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Graspan: A Big Data System for Big Code Analysis - Aftab Hussain
We built a disk-based parallel graph system, Graspan, that uses a novel edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations.
These analyses were used to augment the existing checkers; these augmented checkers found 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
- Accepted in ASPLOS ‘17, Xi’an, China.
- Featured in the tutorial, Systemized Program Analyses: A Big Data Perspective on Static Analysis Scalability, ASPLOS ‘17.
- Invited for presentation at SoCal PLS ‘16.
- Invited for poster presentation at PLDI SRC ‘16.
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
Microservice Teams - How the cloud changes the way we work - Sven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and its challenges), and established platform and enablement teams.
UI5con 2024 - Keynote: Latest News about UI5 and its Ecosystem - Peter Muessig
Learn about the latest innovations in and around OpenUI5/SAPUI5: UI5 Tooling, UI5 linter, UI5 Web Components, Web Components Integration, UI5 2.x, UI5 GenAI.
Recording:
https://www.youtube.com/live/MSdGLG2zLy8?si=INxBHTqkwHhxV5Ta&t=0
Workshop - Innovating with Generative AI and Knowledge Graphs - Neo4j
Workshop - Innovating with Generative AI and Knowledge Graphs
Go beyond the AI hype and discover practical techniques for using AI responsibly across your organization's data. Explore how to use knowledge graphs to increase accuracy, transparency, and explainability in generative AI systems. You will leave with hands-on experience combining data relationships with LLMs to bring domain-specific context and improve reasoning.
Bring your laptop and we will walk you through setting up your own generative AI stack, with practical, coded examples to get you started in minutes.
Zoom is a comprehensive platform designed to connect individuals and teams efficiently. With its user-friendly interface and powerful features, Zoom has become a go-to solution for virtual communication and collaboration. It offers a range of tools, including virtual meetings, team chat, VoIP phone systems, online whiteboards, and AI companions, to streamline workflows and enhance productivity.
OpenMetadata Community Meeting - 5th June 2024 (OpenMetadata)
The OpenMetadata Community Meeting was held on June 5th, 2024. In this meeting, we discussed the data quality capabilities that are integrated with the Incident Manager, providing a complete solution for your data observability needs. Watch the end-to-end demo of the data quality features.
* How to run your own data quality framework
* What is the performance impact of running data quality frameworks
* How to run the test cases in your own ETL pipelines
* How the Incident Manager is integrated
* Get notified with alerts when test cases fail
Watch the meeting recording here - https://www.youtube.com/watch?v=UbNOje0kf6E
DDS Security Version 1.2 was adopted in 2024. This revision strengthens support for long-running systems, adding new cryptographic algorithms, certificate revocation, and hardening against DoS attacks.
E-commerce Development Services (Hornet Dynamics)
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
1. Count-min sketch to Infinity:
Using Probabilistic Data Structures to Solve Presence, Counting, and Distinct
Count Problems in .NET
presented by: Steve Lorello - Developer Advocate @Redis
2. Agenda
● What are Probabilistic Data Structures?
● Set Membership problems
● Bloom Filters
● Counting problems
● Count-Min-Sketch
● Distinct Count problems
● HyperLogLog
● Using Probabilistic Data Structures with Redis
3. What are Probabilistic Data Structures?
● Class of specialized data structures
● Tackle specific problems
● Use probability to give fast, approximate answers
4. Probabilistic Data Structures Examples
Name             | Problem Solved                     | Optimization
Bloom Filter     | Presence                           | Space, Insertion Time, Lookup Time
Quotient Filter  | Presence                           | Space, Insertion Time, Lookup Time
Skip List        | Ordering and Searching             | Insertion Time, Search Time
HyperLogLog      | Set Cardinality                    | Space, Insertion Time, Lookup Time
Count-Min Sketch | Counting occurrences in large sets | Space, Insertion Time, Lookup Time
Cuckoo Filter    | Presence                           | Space, Insertion Time, Lookup Time
Top-K            | Keeping track of top records       | Space, Insertion Time, Lookup Time
17. Bloom Filter
● Specialized probabilistic data structure for presence checks
● Can say an element has definitely not been added
● Can say an element has probably been added
● Uses a fixed set of K hash functions
● Represented as a 1D array of bits
● All operations are O(1) time complexity
● Space complexity is O(n) bits (a minimal sketch follows)
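Since the talk targets .NET, here is a minimal C# sketch of the filter just described. The class shape and the seeded-hash trick are illustrative assumptions, not the speaker's code; a production filter would use a stable hash such as MurmurHash.

using System;
using System.Collections;

class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _k; // number of hash functions

    public BloomFilter(int sizeInBits, int k)
    {
        _bits = new BitArray(sizeInBits);
        _k = k;
    }

    // Simulate K independent hashes by combining the item with a seed.
    // (HashCode.Combine is randomized per process, fine for a sketch only.)
    private int Index(string item, int seed) =>
        (int)((uint)HashCode.Combine(item, seed) % (uint)_bits.Length);

    public void Add(string item)
    {
        for (int i = 0; i < _k; i++) _bits[Index(item, i)] = true;
    }

    // false => definitely never added; true => probably added.
    public bool MightContain(string item)
    {
        for (int i = 0; i < _k; i++)
            if (!_bits[Index(item, i)]) return false;
        return true;
    }
}

The query walkthrough on the next slides is exactly what MightContain does: check each hashed bit in turn and bail out with a definite "no" at the first unset bit.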
26. Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
27. Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
H3(fizzle) = 2 - bit 2 is set—maybe?
28. Example Query username ‘fizzle’
Bloom Filter k = 3
bit 0 1 2 3 4 5 6 7 8 9
state 0 0 1 0 0 1 0 0 1 0
H1(fizzle) = 8 - bit 8 is set—maybe?
H3(fizzle) = 2 - bit 2 is set—maybe?
H2(fizzle) = 4 - bit 4 is not set—definitely not.
29. False Positives and Optimal K
● This algorithm will never give you false negatives, but it can report false positives
● You can minimize the false positive rate by choosing K well
● Let c = hash-table-size / num-records (bits per stored record)
● Optimal K = c * ln(2)
● This yields a false positive rate of 0.6185^c, which is quite small for reasonable c (checked numerically below)
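A quick numeric check of these formulas (the numbers are illustrative, not from the talk). With 10 bits of filter per stored record, c = 10:

using System;

double c = 10.0;                                 // bits per stored record
double optimalK = c * Math.Log(2);               // ≈ 6.93, so use K = 7
double falsePositiveRate = Math.Pow(0.6185, c);  // ≈ 0.0082
Console.WriteLine($"K = {Math.Round(optimalK)}, fp rate ≈ {falsePositiveRate:P2}");

So roughly 7 hash functions and a false positive rate under 1%.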
31. What’s a Counting Problem?
● How many times does an individual element occur in a stream?
● Easy to do on small-to-mid-sized streams of data
● e.g. counting views on YouTube
● Nearly impossible to scale exact counting to enormous data sets
32. Naive Approach: Hash Table
● A hash table of counters
● Look up the name in the hash table; instead of storing the record, store an integer
● On insert, increment the integer
● On query, check the integer (sketched below)
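A sketch of that naive counter in C# (hypothetical names, for illustration):

using System;
using System.Collections.Generic;

var counts = new Dictionary<string, long>();

void Increment(string key) =>
    counts[key] = counts.TryGetValue(key, out var n) ? n + 1 : 1;

long Query(string key) => counts.TryGetValue(key, out var n) ? n : 0;

Increment("Gangnam Style");
Increment("Gangnam Style");
Console.WriteLine(Query("Gangnam Style")); // 2

This works beautifully until the key space explodes, which is exactly the problem the next slides walk through.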
33. Pros and Cons
Pros
● Straightforward
● Guaranteed accuracy (if storing the whole object)
Cons
● O(n) space complexity even in the best case
● O(n) worst-case time complexity
● Scales poorly (think billions of unique records)
● If relying on only a single hash, very vulnerable to collisions and overcounts
34. Naive Approach: Relational DB
● Issue a query to a traditional relational database asking for a count of records where some condition holds
SELECT COUNT(*) FROM views
WHERE name = 'Gangnam Style';
Linear time complexity: O(n)
Linear space complexity: O(n)
35. What’s the Problem with a Billion Unique Records?
● Each unique record needs its own space in a hash table or row in an RDBMS (perhaps several rows across multiple tables)
● Taxing on memory for a hash table; with a billion counters:
○ 8-bit integers? 1 GB
○ 16-bit? 2 GB
○ 32-bit? 4 GB
○ 64-bit? 8 GB
● Maintaining such large data structures in a typical program’s memory isn’t feasible
● In a relational database, it’s stored on disk
37. Count-Min Sketch
● Specialized data structure for keeping counts on very large streams of data
● Similar to a Bloom filter in concept: a multi-hashed record
● 2D array of counters
● Sublinear space complexity
● Constant time complexity
● Never undercounts, sometimes overcounts (sketched below)
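A minimal C# sketch of the structure just described. The width/depth parameters and the seeded hash are assumptions for illustration; in practice you would size the sketch from your error tolerance.

using System;

class CountMinSketch
{
    private readonly long[,] _counters; // depth rows x width columns
    private readonly int _width, _depth;

    public CountMinSketch(int width, int depth)
    {
        _width = width;
        _depth = depth;
        _counters = new long[depth, width];
    }

    private int Col(string item, int row) =>
        (int)((uint)HashCode.Combine(item, row) % (uint)_width);

    public void Increment(string item)
    {
        for (int r = 0; r < _depth; r++)
            _counters[r, Col(item, r)]++;
    }

    // Take the minimum across rows: collisions only inflate counters,
    // so the smallest cell is closest to the truth and never too low.
    public long Query(string item)
    {
        long min = long.MaxValue;
        for (int r = 0; r < _depth; r++)
            min = Math.Min(min, _counters[r, Col(item, r)]);
        return min;
    }
}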
52. Complexities
Type      | Worst Case
Space     | Sublinear
Increment | O(1)
Query     | O(1)
Delete    | Not Available
53. CMS Pros and Cons
Pros
● Extremely fast: O(1)
● Super compact: sublinear space
● Impossible to undercount
Cons
● Incidence of overcounting: all results are approximations
54. When to Use a Count Min Sketch?
● Counting many unique instances
● When Approximation is fine
● When counts are likely to be skewed (think YouTube video views)
56. Set Cardinality
● Counting the distinct elements inserted into a set
● Easier on smaller data sets
● For exact counts, you must preserve all unique elements
● Scales very poorly
58. HyperLogLog
● Probabilistic data structure to count distinct elements
● Space complexity is effectively O(1)
● Time complexity is O(1)
● Can handle billions of elements with less than 2 kB of memory
● Can scale up as needed: HLL space stays effectively constant, though you may want to increase its size for enormous cardinalities
59. HyperLogLog Walkthrough
● Initialize an array of registers of size 2^P (where P is some constant, usually around 16-18)
● When an item is inserted:
○ Hash the item
○ Determine the register to update: i = the left P bits of the item’s hash
○ Set registers[i] to the index of the rightmost 1 in the binary representation of the hash, keeping the maximum value seen so far (a code sketch follows this list)
● When querying:
○ Compute the harmonic mean of the registers that have been set
○ Multiply by a constant determined by the size of P
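A C# sketch of the insert path (the 32-bit hash input and the max update are assumptions consistent with standard HLL, not the speaker's code; it reproduces the ‘bar’ example on the next slide):

using System;
using System.Numerics;

class HyperLogLogSketch
{
    private const int P = 16;
    private readonly byte[] _registers = new byte[1 << P]; // 65,536 registers

    // Assumes a 32-bit hash has already been computed for the item.
    public void Insert(uint hash)
    {
        int i = (int)(hash >> (32 - P));                  // left P bits pick the register
        int rank = BitOperations.TrailingZeroCount(hash); // index of the rightmost 1
        _registers[i] = Math.Max(_registers[i], (byte)rank);
    }
}

For H(bar) = 3103595182 this yields i = 47357 and rank = 1, matching the walkthrough below.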
60. Example: Insert Username ‘bar’ P = 16
H(bar) = 3103595182
● 1011 1000 1111 1101 0001 1010 1010 1110
● Take first 16 bits -> 1011 1000 1111 1101 -> 47357 = register index
● Index of rightmost 1 = 1
● registers[47357] = 1
61. Get Cardinality
● Calculate the harmonic mean of only the set registers
Only one register is set: 47357 -> 1
ceiling(0.673 * 1 * 1/(2^1)) = 1
Cardinality = 1
63. What Is Redis Stack?
● Grouping of modules for Redis, including RedisBloom
● Adds additional functionality, e.g. probabilistic data structures, to Redis (example below)
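From .NET you can drive all three structures through Redis Stack. Here is a sketch using StackExchange.Redis's generic Execute; the key names are made up for illustration, and the official NRedisStack client offers typed wrappers for the same commands.

using StackExchange.Redis;

var mux = ConnectionMultiplexer.Connect("localhost:6379");
var db = mux.GetDatabase();

// Bloom filter: presence checks
db.Execute("BF.ADD", "usernames", "fizzle");
bool maybe = (bool)db.Execute("BF.EXISTS", "usernames", "fizzle"); // true = probably there

// Count-Min Sketch: counting
db.Execute("CMS.INITBYDIM", "views", 2000, 7);         // width, depth
db.Execute("CMS.INCRBY", "views", "Gangnam Style", 1);
var approxCount = db.Execute("CMS.QUERY", "views", "Gangnam Style");

// HyperLogLog: distinct counts (built into core Redis)
db.HyperLogLogAdd("visitors", "bar");
long distinct = db.HyperLogLogLength("visitors");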
81. Does Tom Exist? H1(Tom)=1 H2(Tom)=4 H3(Tom) = 5
bit 0 1 2 3 4 5 6 7 8 9
state 1 0 0 0 1 1 1 0 0 0
H1(Tom) points to bit 1, which is 0, so no, Tom definitely does not exist!
82. Does Liam Exist? H1(Liam)=0 H2(Liam) = 4 H3(Liam) = 6
bit 0 1 2 3 4 5 6 7 8 9
state 1 0 0 0 1 1 1 0 0 0
All of Liam's hashed bits are 1, so we report YES (probably)
83. 2-Choice Hashing
● Use two hash functions instead of one
● Store at the index with the lowest load (smallest linked list)
● The maximum chain length drops from roughly log(n) in a traditional chained hash table to log(log(n)) with high probability, so lookups are nearly constant
● The benefit stops at 2 hashes; additional hash functions don't help
● Still O(n) space complexity (see the sketch below)
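A sketch of the two-choice insert rule in C# (class and names are illustrative assumptions):

using System;
using System.Collections.Generic;

class TwoChoiceTable
{
    private readonly List<string>[] _buckets;

    public TwoChoiceTable(int size)
    {
        _buckets = new List<string>[size];
        for (int i = 0; i < size; i++) _buckets[i] = new List<string>();
    }

    private int H(string item, int seed) =>
        (int)((uint)HashCode.Combine(item, seed) % (uint)_buckets.Length);

    public void Insert(string item)
    {
        int a = H(item, 1), b = H(item, 2);
        // Place the item in whichever candidate bucket is less loaded.
        var target = _buckets[a].Count <= _buckets[b].Count ? _buckets[a] : _buckets[b];
        target.Add(item);
    }

    // Lookups must probe both candidate buckets.
    public bool Contains(string item) =>
        _buckets[H(item, 1)].Contains(item) || _buckets[H(item, 2)].Contains(item);
}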
85. Hash Table
● Ubiquitous data structure for storing associated data, e.g. Map, Dictionary, Dict
● A set of keys associated with an array of values
● Run a hash function on the key to find the position in the array to store the value
Source: wikipedia
86. Hash Collisions
● Hash functions can produce the same output for different keys, creating a collision
● Collision resolution is handled either sequentially (probing) or with a linked list (chaining), as sketched below
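A minimal chain-hashing sketch in C# to make the collision handling concrete (illustrative only):

using System.Collections.Generic;

var buckets = new LinkedList<(string Key, string Value)>[8];

void Put(string key, string value)
{
    int i = (int)((uint)key.GetHashCode() % (uint)buckets.Length);
    buckets[i] ??= new LinkedList<(string Key, string Value)>();
    buckets[i].AddLast((key, value)); // a collision simply extends the chain
}

string? Get(string key)
{
    int i = (int)((uint)key.GetHashCode() % (uint)buckets.Length);
    if (buckets[i] != null)
        foreach (var (k, v) in buckets[i])
            if (k == key) return v; // walking the chain is the O(n) worst case
    return null;
}

The chain walk in Get is where the worst-case O(n) in the next slide's table comes from.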
87. Hash Table Complexity - with chain hashing
Type   | Amortized | Worst Case
Space  | O(n)      | O(n)
Insert | O(1)      | O(n)
Lookup | O(1)      | O(n)
Delete | O(1)      | O(n)