...by Raphaël Javaux, March 25, 2015.
Copy made to allow downloading. The original can be found here: https://docs.google.com/presentation/d/1A87M-e3u1uXlsQddGsI60EZXHVqiehaNy_SCfj7hUeQ/pub?start=false&loop=false&delayms=3000&slide=id.p
This course covers string manipulation and regular expressions. The first part presents Python's str class and the operations that can be performed on str objects. The second part covers regular expressions, which make it possible to validate strings or to extract substrings that match a given pattern. It closes by showing how to use Python's re module.
Sasi, cassandra on the full text search ride at Voxxed Day Belgrade 2016 (Duyhai Doan)
The document discusses Apache Cassandra's SASI (SSTable Attached Secondary Index). It provides a 5 minute introduction to Cassandra, introduces SASI and how it follows the SSTable lifecycle, describes how SASI works at the cluster level for distributed queries and indexing, and details the local read/write process including data structures and query planning. Some benchmarks are shown for full table scans on a large dataset using SASI with Spark. The key advantages and use cases for SASI are discussed along with its limitations compared to dedicated search engines.
Fast track to getting started with DSE Max @ ING (Duyhai Doan)
This document provides an overview of Apache Spark and Apache Cassandra and how they can be used together. It begins with introductions to Spark, describing its core concepts like RDDs and transformations. It then introduces Cassandra and covers concepts like data distribution and token ranges. The remainder discusses the Spark Cassandra connector, covering how it allows reading and writing Cassandra data from Spark and maintaining data locality. It also discusses use cases, failure handling, and cross-datacenter/cluster operations.
This document discusses using Spark with Apache Cassandra for various use cases including loading data from various sources, performing analytics, and sanitizing, validating, and transforming data. It provides examples of using Spark jobs to import data, clean data, perform schema migrations, and run analytics queries. It also covers aspects of the connector architecture like data locality, failure handling, and cross data center operations. The document concludes with discussing a benchmark that used Spark and Cassandra to perform parallel data ingestion and top-K queries on 3.2 billion rows of data with SASI indices.
The presentation introduces KillrChat, a scalable messaging app built using Cassandra to demonstrate denormalization. It discusses the technology stack including Cassandra, Spring Boot, and AngularJS. It then covers the data models and solutions for various entities like users, chat rooms, and messages to handle concurrent modifications using lightweight transactions. Real-time features are implemented with WebSockets. The presentation provides a hands-on exercise for attendees and highlights how to build a real application with the Cassandra ecosystem.
This document summarizes a presentation about the KillrChat messaging application. KillrChat is a scalable messaging app built using AngularJS, Spring, and Cassandra. It demonstrates de-normalization and provides an exercise for attendees to work with user and chat room management, as well as chat messages. The document outlines the architecture, data models, and solutions for handling concurrent requests to avoid inconsistencies through the use of lightweight transactions in Cassandra.
The document provides an introduction to Cassandra presented by Duy Hai Doan. It discusses Cassandra's history, key features including linear scalability, availability, support for multiple data centers, operational simplicity, and analytics capabilities. It also covers Cassandra architecture including the cluster layer based on Dynamo and data-store layer based on BigTable, data distribution, replication, consistency levels, and the write path. The data model of using the last write to resolve conflicts is explained along with CQL basics and modeling one-to-many relationships with clustered tables.
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008.
- Key features of Cassandra including linear scalability, continuous availability, ability to span multiple data centers, and operational simplicity.
- A high-level overview of Cassandra's architecture including its use of Dynamo and BigTable papers for the cluster and data storage layers.
- Concepts related to Cassandra's data model including data distribution, token ranges, replication, write path, and "last write wins" consistency.
This document summarizes Cassandra drivers and tools. It discusses the Java driver architecture including connection pooling, load balancing policies, and automatic paging. It also demonstrates Cassandra Unit for testing, the Java driver object mapper module, and Achilles object mapper with features like dirty checking. Live coding examples are provided for these tools.
The document describes the KillrChat application, which is a scalable chat application built with AngularJS, Cassandra, and Spring Boot. It discusses the application architecture including using Cassandra for distributed data storage and scaling out via a message broker. It also summarizes the key components of the application including controllers, services, REST resources, directives, and how data is distributed in Cassandra.
This document provides an introduction to Cassandra including:
1) An overview of Cassandra's key architecture including its linear scalability, continuous availability across data centers, and operational simplicity.
2) A discussion of Cassandra's data model including its use of Last Write Wins for conflict resolution and examples of modeling one-to-many relationships using clustered tables.
3) Details on Cassandra's consistency levels and how they impact availability and durability of writes and reads.
Cassandra and Spark, closing the gap between no sql and analytics codemotio... (Duyhai Doan)
This document discusses how Spark and Cassandra can be used together. It begins with an introduction to Spark and Cassandra individually, explaining their architectures and key features. It then details the Spark-Cassandra connector, describing how Cassandra tables can be exposed as Spark RDDs and DataFrames. Various use cases for Spark and Cassandra are presented, including data cleaning, schema migration, and analytics. The document emphasizes the importance of data locality when performing joins and writes between Spark and Cassandra. Code examples are provided for common tasks like data cleaning, migration, and analytics.
This document summarizes a presentation about using Spark with Apache Cassandra. It discusses using Spark jobs to load and transform data in Cassandra for purposes such as data import, cleaning, schema migration and analytics. It also covers aspects of the connector architecture like data locality, failure handling and cross-cluster operations. Examples are given of using Spark and Cassandra together for parallel data ingestion and top-K queries on a large dataset.
Datastax day 2016 introduction to apache cassandra (Duyhai Doan)
This document provides an overview of Apache Cassandra and discusses its key features. It describes how Cassandra distributes and replicates data across multiple nodes for continuous availability and linear scalability. It also covers Cassandra's consistency model and how consistency levels can be tuned to balance availability and durability. The document lists Cassandra's features like collections, user-defined types, materialized views, and JSON support for flexible data modeling.
Spark cassandra integration, theory and practice (Duyhai Doan)
This document discusses Spark and Cassandra integration. It begins with an introduction to Spark, describing it as a general data processing framework that is faster than Hadoop. It then discusses the Cassandra database and its data distribution using token ranges. The document provides examples of using the Spark/Cassandra connector for reading and writing data between Spark and Cassandra, including techniques for ensuring data locality. It discusses best practices for cluster deployment and handling failures while maintaining data locality. Finally, it presents some use cases for using Spark/Cassandra including data cleaning, schema migration, and analytics.
This document discusses Libon's migration of contact data from an SQL database to Cassandra. It began with billions of contact records stored relationally in Oracle. Performance became unpredictable at scale. Tuning Oracle helped, but new challenges such as high availability and multi-datacenter support remained. The migration strategy involved writing to both databases, migrating the old data, and switching fully to Cassandra with no downtime and a safe rollback path. The business code was refactored by modifying services and repositories to work with the new Cassandra data model while keeping the existing tests passing.
This document discusses Cassandra and the Datastax Academy. It provides examples of companies using Cassandra as infrastructure including ING, Netflix, Sony, and Microsoft. It also discusses the increasing SQL support in Cassandra, such as user defined functions, materialized views, and secondary indexes. The document notes that skills in Cassandra are in high demand but difficult to find. It promotes the Datastax Academy as a free solution to this problem, offering self-paced courses, instructor-led training, and O'Reilly certification to boost careers.
Cassandra nice use cases and worst anti patterns no sql-matters barcelona (Duyhai Doan)
This document summarizes a presentation on Cassandra use cases and anti-patterns. It discusses several anti-patterns to avoid such as queue-like designs, intensive updates on the same column, and designing around a dynamic schema. It also provides examples of good use cases such as rate limiting, anti-fraud detection, and account validation. The document contains an agenda, descriptions of each anti-pattern and their level of failure, as well as explanations and demonstrations of the example use cases.
There are a few options for performing more complex queries in Cassandra beyond the restrictions of the WHERE clause:
1. Denormalize/duplicate data across tables to allow querying on different columns. For example, have one table keyed on user ID and another keyed on message date to allow filtering by date (see the sketch after this list).
2. Offload complex queries to an external search index like Solr or Elasticsearch that can handle full-text and complex queries, and keep Cassandra as the system of record.
3. Use Spark/Hive on Cassandra to run queries across the cluster and leverage their more powerful query engines.
4. Consider a different database if your queries require joins or complex WHERE clauses, or simply don't map well to Cassandra's data model.
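To make option 1 concrete, here is a minimal sketch using the DataStax Python driver. The keyspace chat and the tables messages_by_user and messages_by_day are hypothetical names chosen for illustration; the point is only that the same write goes to both tables so that each query pattern reads from a single partition.

```python
from cassandra.cluster import Cluster

# Hypothetical keyspace and tables: the same message is stored under two partition keys,
#   messages_by_user(user_id, msg_id, body)  and  messages_by_day(day, msg_id, user_id, body)
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("chat")

insert_by_user = session.prepare(
    "INSERT INTO messages_by_user (user_id, msg_id, body) VALUES (?, ?, ?)")
insert_by_day = session.prepare(
    "INSERT INTO messages_by_day (day, msg_id, user_id, body) VALUES (?, ?, ?, ?)")

def save_message(user_id, msg_id, day, body):
    # Denormalization: write the message twice, once per query pattern,
    # so "messages of a user" and "messages of a day" are each a single-partition query.
    session.execute(insert_by_user, (user_id, msg_id, body))
    session.execute(insert_by_day, (day, msg_id, user_id, body))
```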
Cassandra 3 new features @ Geecon Krakow 2016 (Duyhai Doan)
Duyhai Doan gave a presentation on new features in Cassandra 3.0, including materialized views, user defined functions, user defined aggregates, and the new SASI full text search index. Materialized views allow pre-computing common queries to improve performance. User defined functions and aggregates enable pushing computation to the server. The new SASI index provides improved full text search capabilities in Cassandra.
Real time data processing with spark & cassandra @ NoSQLMatters 2015 Paris (Duyhai Doan)
This document provides an overview of Spark and its integration with Cassandra for real-time data processing. It begins with introductions of the speaker and Datastax. It then discusses what Spark and Cassandra are, including their architectures and key characteristics like Spark being fast, easy to use, and supporting multiple languages. The document demonstrates basic Spark code and how RDDs work. It covers the Spark and Cassandra connectors and how they provide locality-aware joins. It also discusses use cases and deployment options. Finally, it considers future improvements like leveraging Solr for local filtering to improve data locality during joins.
Apache zeppelin the missing component for the big data ecosystem (Duyhai Doan)
Duy Hai Doan presented Apache Zeppelin, an open-source web-based notebook that allows users to interact with data. Zeppelin provides a front-end GUI and display system for data analysis tools and uses interpreters to connect to back-end systems like Spark, Cassandra, and Flink. Doan demonstrated Zeppelin's notebook interface, display options, and how users can write their own interpreters to connect new systems to Zeppelin. Future plans for Zeppelin include improving usability, adding authentication and authorization, and developing more interpreters and visualizations.
This document provides an overview of DataStax Enterprise, a database platform for cloud applications. It discusses key features of DataStax Enterprise including that it is certified for production, offers automatic management services for configuration and administration through OpsCenter, and provides 24/7 expert support. The document also summarizes various DataStax Enterprise technologies and capabilities like advanced replication, tiered storage, security features, and integration with search, analytics, and graph databases.
Sasi, cassandra on full text search ride (Duyhai Doan)
This document discusses SASI (SSTable Attached Secondary Index), a new secondary index for Apache Cassandra that follows the SSTable lifecycle. It describes how SASI works, including its in-memory and on-disk structures. It also covers SASI's query planning optimizations and provides some benchmark results showing SASI's performance improvements over full scans. While SASI is not as full-featured as search engines, it can cover many search use cases within Cassandra.
This document provides an overview of big data concepts for a new project in 2017. It discusses distributed systems theories like time ordering, latency, failure and consensus. It also covers data sharding, replication, and the CAP theorem. Key points include how latency is affected by network delays, the different failure modes, and that the CAP theorem forces a distributed system to trade consistency against availability when a network partition occurs.
Big data 101 for beginners riga dev days (Duyhai Doan)
This document provides an overview and introduction to big data concepts for a new project in 2017. It discusses distributed systems theories like time ordering, latency, failure modes, and consensus protocols. It also covers data sharding and replication techniques. The document explains the CAP theorem and how it relates to consistency and availability. Finally, it discusses different distributed systems architectures like master/slave versus masterless designs.
Datastax day 2016 : Cassandra data modeling basics (Duyhai Doan)
This document discusses data modeling with Apache Cassandra. It covers:
1. The objectives of data modeling like reducing query latency and avoiding disasters
2. Choosing the right partition key which is the main entry point for queries and helps distribute data
3. Using clustering columns to simulate one-to-many relationships and enable sorting and range queries
4. Other critical details like avoiding huge partitions, sub-partitioning techniques, and how deletes create tombstones
This document discusses Apache Cassandra and its features and use cases. It provides an overview of Cassandra's key characteristics like massive scalability, extreme availability, and rich data modeling. Example use cases mentioned include messaging, collections/playlists, fraud detection, recommendations, and IoT sensor data. New features introduced in Cassandra in 2016 are also summarized, such as delete by range, materialized views, atomic UDT updates, a new SASI index, and support for GROUP BY queries.
Spark zeppelin-cassandra at synchrotron (Duyhai Doan)
This document discusses using Spark, Cassandra, and Zeppelin for storing and aggregating metrics data from a particle accelerator project called HDB++. It provides an overview of the HDB++ project, which previously used MySQL but now stores its data in Cassandra. It describes the Spark jobs that are run to load metrics data from Cassandra and generate statistics that are written back to Cassandra. It also demonstrates visualizing the data using Zeppelin and discusses some tricks and traps to be aware of when using this stack.
This document provides an introduction to Cassandra including:
- Datastax is a company that contributes to Apache Cassandra and sells Datastax Enterprise.
- Cassandra was created at Facebook and is now open source software with the current version being 3.2.
- Cassandra's key features include linear scalability, continuous availability, multi-datacenter support, operational simplicity, and Spark integration.
This document discusses user-defined functions and materialized views in Cassandra. It provides information on how to create user-defined functions and user-defined aggregates, including the syntax and best practices. It also covers how user-defined functions and aggregates are executed. The document then discusses materialized views, including why they are useful and how they work at a high level. It provides the syntax for creating materialized views and describes how updates are handled.
Apache zeppelin, the missing component for the big data ecosystem (Duyhai Doan)
Apache Zeppelin is a web-based notebook that allows users to interact with data via interpreters like Spark, SQL, and Cassandra. It provides a GUI for data scientists to write code and visualizations in notebooks. Zeppelin has a modular architecture that allows new interpreters to be easily added. It also includes features like scheduling, sharing, and exporting of notebooks.
Distributed algorithms for big data @ GeeCon (Duyhai Doan)
This document discusses distributed algorithms for big data. It begins with an overview of HyperLogLog for estimating cardinality and counting distinct elements in a large data set. It then explains how HyperLogLog works by using a hash function to distribute the data across buckets and applying the LogLog algorithm to each bucket before taking the harmonic mean. The document also covers Paxos for distributed consensus, explaining the phases of prepare, promise, accept and learn to reach agreement in the presence of failures.
Spark cassandra connector. API, Best Practices and Use-Cases (Duyhai Doan)
- The document discusses Spark/Cassandra connector API, best practices, and use cases.
- It describes the connector architecture including support for Spark Core, SQL, and Streaming APIs. Data is read from and written to Cassandra tables mapped as RDDs.
- Best practices around data locality, failure handling, and cross-region/cluster operations are covered. Locality is important for performance.
- Use cases include data cleaning, schema migration, and analytics like joins and aggregation. The connector allows processing and analytics on Cassandra data with Spark.
13. @doanduyhai #AlgosBigData
Simplified LogLog algorithm
1) Pick a hash function H with a very good distribution
2) For every observed element (login, article_id, uuid …), apply H
3) Convert the hash into a binary sequence
4) From the binary sequences, derive the cardinality
0111010010101…
0010010010001…
1010111001100…
…
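To make steps 1–4 concrete, here is a minimal Python sketch (my own illustration, not from the deck): each element is hashed, the first p bits of the hash select a bucket, and each bucket keeps the highest "rank" (position of the first 1-bit) it has seen. The following slides show how the cardinality is derived from those ranks.

```python
import hashlib

def rank(bits):
    """1-based position of the first '1' in a bit string; 0 if there is none."""
    pos = bits.find("1")
    return pos + 1 if pos >= 0 else 0

def bucket_ranks(elements, p=4):
    """Track, for each of 2**p buckets, the highest rank observed (hypothetical helper)."""
    maxima = [0] * (2 ** p)
    for element in elements:
        digest = hashlib.sha1(str(element).encode()).digest()
        bits = format(int.from_bytes(digest[:8], "big"), "064b")
        bucket = int(bits[:p], 2)                      # first p bits select the bucket
        maxima[bucket] = max(maxima[bucket], rank(bits[p:]))
    return maxima
```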
14. @doanduyhai #AlgosBigData
LogLog intuition
Equal probability:
50% of the sequences start with 0xxxxx
50% of the sequences start with 1xxxxx
1/4 of the sequences start with 00xxxxx
1/4 of the sequences start with 01xxxxx
1/4 of the sequences start with 10xxxxx
1/4 of the sequences start with 11xxxxx
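A quick empirical check of this equal-probability claim (again my own illustration): hash many distinct values and count how often each k-bit prefix occurs; every prefix should appear with frequency close to 2^-k.

```python
import hashlib
from collections import Counter

def prefix_frequencies(n=100_000, k=2):
    """Frequency of each k-bit prefix over n hashed values; expect about 2**-k each."""
    counts = Counter()
    for i in range(n):
        digest = hashlib.sha1(str(i).encode()).digest()
        bits = format(int.from_bytes(digest[:8], "big"), "064b")
        counts[bits[:k]] += 1
    return {prefix: count / n for prefix, count in sorted(counts.items())}

print(prefix_frequencies())   # the four 2-bit prefixes each come out close to 0.25
```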
34. @doanduyhai #AlgosBigData
HyperLogLog, the maths
Replace the x_i in the formula with the M_i:

H(M_i) = b \left( \sum_{i=1}^{b} M_i^{-1} \right)^{-1}

Replace the M_i in the formula with 2^{max(r_i)}:

H(M_i) = b \left( \sum_{i=1}^{b} 2^{-\max(r_i)} \right)^{-1}
35. @doanduyhai #AlgosBigData
HyperLogLog, the maths
Substituting into the initial formula n ≈ b · H(M_i):

n \approx \alpha_b \, b^2 \left( \sum_{i=1}^{b} 2^{-\max(r_i)} \right)^{-1}

n = estimated cardinality
b = number of buckets
max(r_i) = max rank in each bucket
\alpha_b = corrective constant
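Combining this estimator with the bucket_ranks() sketch from slide 13 gives a minimal end-to-end estimate. This is my own sketch: alpha = 0.673 is the standard bias-correction constant for 16 buckets, and the small/large-range corrections of the full HyperLogLog algorithm are deliberately left out.

```python
def hyperloglog_estimate(maxima, alpha=0.673):
    """n ≈ alpha_b * b^2 * (sum over buckets of 2^-max(r_i))^-1, range corrections omitted."""
    b = len(maxima)
    return alpha * b * b / sum(2.0 ** -r for r in maxima)

# Example usage with the bucket_ranks() sketch above:
# print(hyperloglog_estimate(bucket_ranks(range(50_000))))
```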
50. @doanduyhai #AlgosBigData
Paxos phase 1: prepare
The Proposer picks a number n (from an always-increasing sequence)
It sends prepare(n) to a quorum of Acceptors
(diagram: Proposer/Leader → prepare(n) → Acceptor)
51. @doanduyhai #AlgosBigData
Paxos phase 1: promise
Each acceptor, upon receiving prepare(n):
• if it has already accepted an accept(m, val_m) from another proposer with m ≤ n
☞ returns promise(n, (m, val_m))
(diagram: Acceptor → promise(n, (m, val_m)) → Proposer/Leader)
52. @doanduyhai #AlgosBigData
Paxos phase 1: promise
Each acceptor, upon receiving prepare(n):
• if it has not yet accepted any proposal (accept(?,?)), or if it has returned a promise with m < n
☞ returns promise(n, ∅) AND promises not to accept any further prepare(m) or accept(m,?) with m < n
(diagram: Acceptor → promise(n, ∅) → Proposer/Leader)
53. @doanduyhai #AlgosBigData
Paxos phase 1: promise
Each acceptor, upon receiving prepare(n):
• if it has already made a promise/accepted a value with m > n
☞ ignores the request. It may also send back a Nack (optimization)
(diagram: Acceptor → Nack → Proposer/Leader)
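Slides 51–53 describe exactly what state an acceptor has to keep. A compact Python sketch of that logic (an illustration with hypothetical names, not code from the deck) could be:

```python
class Acceptor:
    def __init__(self):
        self.promised_n = None        # highest n this acceptor has promised
        self.accepted = None          # (m, val_m) of the last accepted proposal, if any

    def on_prepare(self, n):
        # Slide 53: already promised/accepted something with m > n -> ignore (or Nack)
        if self.promised_n is not None and self.promised_n > n:
            return ("nack", self.promised_n)
        # Slides 51-52: promise never to accept anything numbered below n from now on
        self.promised_n = n
        if self.accepted is not None:
            return ("promise", n, self.accepted)   # report the in-progress (m, val_m)
        return ("promise", n, None)                # promise(n, ∅)
```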
54. @doanduyhai #AlgosBigData
Paxos phase 1: objectives
Goals of phase 1:
• discover any in-progress proposal so it can be driven forward
• block any older proposal that has not yet completed
55. @doanduyhai #AlgosBigData
Paxos phase 2: accept
The leader receives several promise(n, (m_i, val_i)):
• if all the received pairs (m_i, val_i) are empty (promise(∅, ∅)), the leader may send accept(n, val) with a val of its own choosing
• otherwise it extracts all the pairs (m_i, val_i), keeps the val_i with the largest m_i, AND sends accept(n, val_max(m_i)) to the quorum of acceptors
(diagram: Proposer/Leader → accept(n, val_max) → Acceptor)
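The leader's value-selection rule fits in a few lines; this sketch (a hypothetical helper, with quorum counting and retries omitted) only captures the rule itself:

```python
def choose_value(promises, own_value):
    """promises: one entry per acceptor, either (m_i, val_i) or None for promise(n, ∅)."""
    carried = [p for p in promises if p is not None]
    if not carried:
        return own_value                           # free to propose our own value
    _, val = max(carried, key=lambda p: p[0])      # keep val_i with the largest m_i
    return val
```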
56. @doanduyhai #AlgosBigData
Paxos phase 2: accepted
Each acceptor, upon receiving accept(n, val):
• if it has made no promise with m > n, returns accepted()
• otherwise, ignores the request
(diagram: Acceptor → accepted(n, val) → Proposer/Leader)
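Continuing the acceptor sketch from phase 1, the handling of accept(n, val) could look like this (again an illustration, not the deck's code):

```python
def on_accept(acceptor, n, val):
    """accept(n, val) handling for the Acceptor sketch above (illustration only)."""
    if acceptor.promised_n is not None and acceptor.promised_n > n:
        return None                                # promised to a newer proposer: ignore
    acceptor.promised_n = n
    acceptor.accepted = (n, val)                   # remember the accepted proposal
    return ("accepted", n, val)
```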
57. @doanduyhai #AlgosBigData
Paxos phase 2: learn
Each acceptor, after sending accepted():
• sends the chosen value val to a list of learners (durable storage)
Consensus is reached, and its value is val!
This defines one round of Paxos
58. @doanduyhai #AlgosBigData
Limits of theoretical Paxos
Once the value val has been chosen, it can no longer be changed!
val must be reset for another round of Paxos
Multi-Paxos
• several Paxos rounds run in parallel
• each server can in turn act as Proposer, Acceptor & Learner
Fast-Paxos, Egalitarian-Paxos, etc.
59. @doanduyhai #AlgosBigData
Conflict situations
The latest arrival drives an "in-progress" proposal forward
(sequence diagram over acceptors a1…a5; legend: message received / message sent. A first proposer sends prepare(n1) to a quorum and gets promise(∅) back from three acceptors, then sends accept(n1, a). A second proposer then sends prepare(n2): the acceptor that already accepted (n1, a) answers promise(n2, (n1, a)), the two others answer promise(∅). The second proposer therefore sends propose(n2, a), which gets accepted: the newcomer carries the in-progress value a forward. Two acceptors fail (☠) during the exchange.)