Big Data != Hadoop
But in practice, often Hadoop…!
However, Hadoop = a vast ecosystem
Recap Hadoop's history: v1, then v2.
Data is passed via RPC between the Mapper and the Reducer.
No placement logic in the DataNode: it is the NameNode that knows where blocks live, via the block reports it receives.
The fsimage file contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file or directory's metadata and contains such information as the file's replication level, modification and access times, access permissions, block size, and the blocks a file is made up of. For directories, the modification time, permissions, and quota metadata are stored.
The fsimage file does not record the datanodes on which the blocks are stored. Instead the namenode keeps this mapping in memory, which it constructs by asking the datanodes for their block lists when they join the cluster and periodically afterward to ensure the namenode's block mapping is up-to-date.
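To make the split concrete, here is a minimal Java sketch using the public HDFS FileSystem API; it reads both the per-file metadata persisted in the fsimage and the block locations the namenode keeps only in memory. The file path is hypothetical and a configured fs.defaultFS is assumed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsMetadata {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS points at the cluster (e.g. via core-site.xml).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/example.txt"); // hypothetical path
            FileStatus status = fs.getFileStatus(file);

            // Metadata the NameNode persists (fsimage + edit log):
            System.out.println("replication: " + status.getReplication());
            System.out.println("block size:  " + status.getBlockSize());
            System.out.println("modified:    " + status.getModificationTime());

            // Block-to-DataNode mapping, held in NameNode memory only,
            // rebuilt from the DataNodes' block reports:
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(loc); // offset, length, hosts
            }
            fs.close();
        }
    }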
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.
In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that the two nodes both have access to a directory on a shared storage device (e.g., an NFS mount from a NAS). This restriction will likely be relaxed in future versions.
When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby node is constantly watching this directory for edits, and as it sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the administrator must configure at least one fencing method for the shared storage. During a failover, if it cannot be verified that the previous Active node has relinquished its Active state, the fencing process is responsible for cutting off the previous Active's access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new Active to safely proceed with failover.
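As an illustration, here is a sketch of the main HA-related hdfs-site.xml properties, set programmatically on a Hadoop Configuration for compactness; the nameservice name, host names, mount path, and key file are assumptions.

    import org.apache.hadoop.conf.Configuration;

    public class HaConfigSketch {
        public static Configuration haConf() {
            Configuration conf = new Configuration();
            // One logical nameservice backed by two NameNodes (names are assumptions).
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
            // Shared edits directory on the NFS mount both NameNodes can reach.
            conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/filer/ha-edits");
            // At least one fencing method is mandatory to avoid split-brain.
            conf.set("dfs.ha.fencing.methods", "sshfence");
            conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
            return conf;
        }
    }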
The Logstash ecosystem is made up of 4 components:
Shipper, which sends events to Logstash.
Broker and Indexer, which receive and index the events.
Search and Storage, which make it possible to search and store the events.
Web Interface, a web UI called Kibana.
An RDD is an abstraction over a collection whose operations are executed in a distributed fashion while remaining tolerant to hardware failures. The processing we write thus appears to run inside our JVM, but it is split up to run on several nodes. If a node is lost, the framework automatically restarts the affected sub-task on another node, without impacting the result.
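A minimal Spark sketch of that idea in Java; a local master is assumed so it can run standalone.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[2]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // The code reads like operations on a local collection...
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // ...but map/reduce are split into tasks run on the cluster's nodes;
            // a lost partition is recomputed from its lineage on another node.
            int sum = numbers.map(x -> x * x).reduce(Integer::sum);

            System.out.println("sum of squares = " + sum);
            sc.stop();
        }
    }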
Does not support INSERT
Supports INSERT
Does not support INSERT
Only INSERT by providing a table
Spouts – sources of streams in a computation (e.g. a Twitter API)
Bolts – process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases.
A Storm cluster is composed of a set of nodes running a Supervisor daemon. The supervisor daemons talk to a single master node running a daemon called Nimbus. The Nimbus daemon is responsible for assigning work and managing resources in the cluster.
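A minimal sketch of a Storm topology wiring a toy spout to a bolt; Storm 1.x-style org.apache.storm packages are assumed, and the spout/bolt here are made up for illustration.

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class TopologySketch {
        // A toy spout standing in for a real source such as the Twitter API.
        public static class WordSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final String[] words = {"storm", "samza", "flink"};
            private int i = 0;
            public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
                this.collector = collector;
            }
            public void nextTuple() {
                Utils.sleep(100); // throttle the toy source
                collector.emit(new Values(words[i++ % words.length]));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("word"));
            }
        }

        // A bolt that processes the input stream (here it just prints tuples).
        public static class PrintBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                System.out.println(input.getStringByField("word"));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {}
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new WordSpout(), 1);
            builder.setBolt("print", new PrintBolt(), 2).shuffleGrouping("words");
            // In production, Nimbus assigns this work across the Supervisor nodes;
            // LocalCluster simulates the cluster in-process for development.
            new LocalCluster().submitTopology("sketch", new Config(), builder.createTopology());
        }
    }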
Storm uses ZeroMQ for non-durable communication between bolts, which enables extremely low latency transmission of tuples. Samza does not have an equivalent mechanism, and always writes task output to a stream.
Samza is made up of three layers:
A streaming layer.
An execution layer.
A processing layer.
Samza provides out-of-the-box support for all three layers:
Streaming: Kafka.
Execution: YARN.
Processing: Samza API.
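Concretely, the processing layer amounts to implementing Samza's StreamTask interface. A minimal sketch, where the "kafka" system name and output topic are assumptions:

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    // A task consumes messages from an input stream (Kafka) and can emit
    // messages to an output stream; YARN runs the task containers.
    public class UppercaseTask implements StreamTask {
        // The "kafka" system and topic name are assumptions for this sketch.
        private static final SystemStream OUTPUT = new SystemStream("kafka", "uppercased");

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            String message = (String) envelope.getMessage();
            collector.send(new OutgoingMessageEnvelope(OUTPUT, message.toUpperCase()));
        }
    }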
Storm and Samza are fairly similar. Both systems provide many of the same high-level features: a partitioned stream model, a distributed execution environment, an API for stream processing, fault tolerance, Kafka integration, etc.
Storm and Samza use different words for similar concepts: spouts in Storm are similar to stream consumers in Samza, bolts are similar to tasks, and tuples are similar to messages in Samza. Storm also has some additional building blocks which don’t have direct equivalents in Samza.
Delivery guarantees: currently only at-least-once, but support for exactly-once semantics is planned.
Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr powers the search and navigation features of many of the world's largest internet sites.
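A minimal SolrJ sketch of a full-text query with highlighting and a facet; a SolrJ 6+-style builder is assumed, and the URL, collection, and field names are made up for illustration.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SolrSketch {
        public static void main(String[] args) throws Exception {
            // URL and collection name are assumptions.
            HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

            // Full-text query with highlighting and a facet on a hypothetical field.
            SolrQuery query = new SolrQuery("title:hadoop");
            query.setHighlight(true).addHighlightField("title");
            query.addFacetField("category");

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
            solr.close();
        }
    }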
The attachment type is provided as a plugin extension. It uses Apache Tika behind the scenes.
Kudu is an open source storage engine for structured data which supports low-latency random access together with efficient analytical access patterns. Kudu distributes data using horizontal partitioning and replicates each partition using Raft consensus, providing low mean-time-to-recovery and low tail latencies. Kudu is designed within the context of the Hadoop ecosystem and supports many modes of access via tools such as Cloudera Impala, Apache Spark, and MapReduce.
Structured storage in the Hadoop ecosystem has typically been achieved in two ways: for static data sets, data is typically stored on HDFS using binary data formats such as Apache Avro or Apache Parquet. However, neither HDFS nor these formats has any provision for updating individual records, or for efficient random access. Mutable data sets are typically stored in semi-structured stores such as Apache HBase or Apache Cassandra. These systems allow for low-latency record-level reads and writes, but lag far behind the static file formats in terms of sequential read throughput for applications such as SQL-based analytics or machine learning.
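A minimal sketch of Kudu's random-access write path using the 1.x Java client; the master address, table, and column names are assumptions, and the table is assumed to already exist.

    import org.apache.kudu.client.Insert;
    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduSession;
    import org.apache.kudu.client.KuduTable;

    public class KuduSketch {
        public static void main(String[] args) throws Exception {
            // Master address and table/column names are assumptions.
            KuduClient client =
                new KuduClient.KuduClientBuilder("kudu-master.example.com:7051").build();
            KuduTable table = client.openTable("metrics");
            KuduSession session = client.newSession();

            // Low-latency random write: a single-row insert.
            Insert insert = table.newInsert();
            insert.getRow().addString("host", "web01");
            insert.getRow().addLong("value", 42L);
            session.apply(insert);

            session.close();
            client.shutdown();
        }
    }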
Ambari positions itself as an alternative to Chef or Puppet among generic solutions, and to Cloudera Manager in the Hadoop world.
Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Flink includes several APIs for creating applications that use the Flink engine:
DataSet API for static data embedded in Java, Scala, and Python,
DataStream API for unbounded streams embedded in Java and Scala, and
Table API with a SQL-like expression language embedded in Java and Scala.
Flink also bundles libraries for domain-specific use cases:
Machine Learning library, and
Gelly, a graph processing API and library.
You can integrate Flink easily with other well-known open source systems both for data input and output as well as deployment.
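A minimal DataStream API sketch in Java; the bounded toy source stands in for a real unbounded one such as Kafka.

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // A bounded toy source; in practice this would be a connector (e.g. Kafka).
            env.fromElements("big", "data", "flink")
               .map((MapFunction<String, String>) String::toUpperCase)
               .print();

            // The dataflow engine handles distribution, communication, and fault tolerance.
            env.execute("flink-sketch");
        }
    }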
Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.