BIG DATA TO SMART DATA
PROCESS SCENARIO
REAL-TIME
ANALYTICS
CHAKER ALLAOUI
TABLE OF CONTENTS
INTRO
01 • Input Data  02 • Big Data  03 • JSON Format  04 • Apache Zookeeper
05 • Apache Storm  06 • Apache Kafka  07 • Apache Cassandra  08 • Apache Spark
09 • SparkSQL  10 • SparkUI  11 • D3  12 • SMART DATA
Process Scenario
PROCESS SCENARIO
Big Data to SMART Data
Creative
process
Analytical
process
Big Data refers to the explosion in the volume of digital data collected by private individuals, public actors and IT applications serving user communities on a planetary scale. A few more or less familiar examples suffice: Google, with its search engine and its services; the so-called social networks, such as Facebook and its billion users who post images, texts and conversations; image and photo sharing and broadcasting sites such as Flickr; community sites (blogs, forums, wikis); government agencies and their
Process Scenario: Analytical & Creative
digitized exchanges. At the center of all these data vacuum cleaners we find the Internet and the Web, with their capacity to federate billions of users in the digital space, but also a profusion of sensors of all kinds accumulating scientific data at an unprecedented rate (satellite imagery, for example). To stay with the Web: all the messages, documents, images and videos found there are collected by applications which, in exchange for the services they provide, accumulate immense data banks. For Google, Facebook or Amazon we are talking about millions of servers, housed in enormous data centers which, moreover, consume a non-negligible share of the electricity produced. And the movement only seems to be accelerating.
Hence the need to present a system that combines these multiple sources. The idea is therefore to build a complete scenario of a process that transforms this mass of data into exploitable, presentable data, in order to ease its management and enable decision-support computing to analyze and federate it.
The solution is built from open-source software, most of it coming from Apache projects.
[Process diagram] Components and concepts: InputStream, WebSockets, TCP/UDP, JSON, Message, Streaming, Storm, Tuples/Zookeeper, Kafka, Topic, Queue/Zookeeper, Cassandra, Cassandra Driver, ORM, Database Storage, Spark, RDD/SparkSQL, MapReduce, RAM/File System, Aggregation, Analytical transformations, D3, Dashboard, Business Intelligence, Big Data
BIG DATA TO SMART DATA PROCESS SCENARIO
Input Data
WebSockets
TCP/UDP
InputStream
Data
WebSockets
The WebSocket protocol aims to provide a full-duplex communication channel over a TCP socket between browsers and Web servers. The growing interactivity of Web applications, made possible by improved browser performance, quickly made it necessary to develop techniques for bidirectional communication between the Web application and the server processes. Techniques based on the client calling the "XMLHttpRequest" object, using HTTP requests with a long TTL kept open by the server for a later reply to the client, allowed Ajax to mitigate this lack.
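As a minimal client-side sketch in Java, assuming a JSR 356 implementation (such as Tyrus) is on the classpath and a hypothetical echo endpoint is available at ws://localhost:8080/echo, the full-duplex channel can be used as follows:

import java.net.URI;
import javax.websocket.ClientEndpoint;
import javax.websocket.ContainerProvider;
import javax.websocket.OnMessage;
import javax.websocket.OnOpen;
import javax.websocket.Session;
import javax.websocket.WebSocketContainer;

@ClientEndpoint
public class EchoClient {

    @OnOpen
    public void onOpen(Session session) throws Exception {
        // the channel is full-duplex: we can write as soon as it opens
        session.getBasicRemote().sendText("hello");
    }

    @OnMessage
    public void onMessage(String message) {
        // messages pushed by the server arrive here, without polling
        System.out.println("Received: " + message);
    }

    public static void main(String[] args) throws Exception {
        WebSocketContainer container = ContainerProvider.getWebSocketContainer();
        // ws://localhost:8080/echo is a hypothetical endpoint used for illustration
        container.connectToServer(EchoClient.class, URI.create("ws://localhost:8080/echo"));
        Thread.sleep(5000); // keep the JVM alive long enough to receive the reply
    }
}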
TCP/UDP
UDP is a "connectionless" protocol: when a machine A sends packets to a machine B, the flow is unidirectional. The data is transmitted without notifying the recipient (machine B), and the recipient receives the data without sending any acknowledgement back to the sender (machine A).
TCP is connection-oriented: when a machine A sends data to a machine B, machine B is notified of the arrival of the data and confirms its good reception with an acknowledgement.
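A minimal Java sketch of the difference, assuming a hypothetical receiver listening on port 9999 of the local machine: the UDP datagram is fired without any handshake or acknowledgement, while the TCP socket first establishes a connection.

import java.io.OutputStream;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.Socket;

public class TcpUdpSender {
    public static void main(String[] args) throws Exception {
        byte[] payload = "hello".getBytes("UTF-8");

        // UDP: connectionless, the datagram is sent without notifying machine B
        // and without any acknowledgement coming back to machine A.
        DatagramSocket udp = new DatagramSocket();
        udp.send(new DatagramPacket(payload, payload.length,
                InetAddress.getByName("127.0.0.1"), 9999));
        udp.close();

        // TCP: connection-oriented, the connect() handshake notifies machine B
        // and the transmitted bytes are acknowledged at the transport level.
        Socket tcp = new Socket("127.0.0.1", 9999);
        OutputStream out = tcp.getOutputStream();
        out.write(payload);
        out.flush();
        tcp.close();
    }
}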
InputStream
An InputStream supports these processes of sending and receiving data. Streams always process data sequentially, and they can be divided into several categories: character-processing streams and byte-processing streams.
The incoming streams can be web services, or data feeds coming from social networks such as the Twitter API.
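A minimal Java sketch of sequential stream consumption, assuming a hypothetical JSON resource served at http://localhost/data.json:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

public class StreamReaderExample {
    public static void main(String[] args) throws Exception {
        // hypothetical source; any web service or socket can expose an InputStream
        InputStream in = new URL("http://localhost/data.json").openStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        try {
            String line;
            // streams are consumed sequentially, one record after the other
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            reader.close();
        }
    }
}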
Data: WebSockets-TCP/UDP-InputStream
Binary Data
Structured Data
Unstructured Data
API Interface
http://fr.wikipedia.org (translation)
Read Content: cURL
To read the content of a Web source, we will use cURL (Client URL Request Library), a command-line interface intended to retrieve the content of a resource accessible over a network. The resource is designated by means of a URL and must be of a type supported by the software; it can thus be used as a REST API client.
The library notably supports the protocols FTP, FTPS, HTTP, HTTPS, TFTP, SCP, SFTP, Telnet, DICT, FILE and LDAP. Writing can be done over HTTP using the POST and PUT commands.
cURL is a generic interface that lets us handle a data stream.
The attached example reads the content of a JSON file present on a server on localhost. The idea is to understand how streaming software processes information received from a remote server.
Example: curl reads content using the GET method:
> curl -i -H "Accept: application/json" -H "Content-Type: application/json" http://www.host.com/data.json
cURL command: JSON Stream
BIG DATA
Volume-Value
Velocity-Variety
The initial definition, given by McKinsey and Company in 2011, focused first on the technological question, with the famous 3V rule:
a large Volume of data, a wide Variety of that data, and a processing Velocity sometimes approaching real time.
These technologies were supposed to answer the explosion of data in the digital landscape, the "data deluge". These qualifiers later evolved, with a more economic vision carried by a fourth V:
Big Data goes hand in hand with the development of analytical applications, which process the data to extract meaning from it. These analyses are called Big Analytics, or "data crunching". They apply distributed-computation methods to complex quantitative data.
In 2001, a META Group research paper defined the stakes inherent in data growth as being three-dimensional: complex analyses do indeed answer the so-called "3V" rule (volume, velocity and variety). This model is still widely used today to describe the phenomenon.
The average annual worldwide growth rate of the Big Data technology and services market over the period 2011-2016 should be 31.7%. This market should thus reach 23.8 billion dollars in 2016 (according to IDC, March 2013). Big Data should also represent 8% of European GDP (GROSS DOMESTIC PRODUCT) in 2020.
Organize Data
Centralize Data
Combine Data
Support Decision
Big Data
http://recherche.cnam.fr/equipes/sciences-industrielles-et-technologies-de-l-information/big-data-decodage-et-analyse-des-enjeux-661198.kjsp
The Value
Volume: a relative dimension. As Lev Manovich noted in 2011, Big Data used to be defined as "data sets large enough to require supercomputers", but it quickly became possible to use standard software on desktop computers to analyze or co-analyze vast data sets. The volume of stored data is growing rapidly: the digital data created worldwide is said to have grown from 1.2 zettabytes per year in 2010 to 1.8 zettabytes in 2011.
Velocity: represents the frequency at which data is generated, captured, shared and updated. Growing data flows must be analyzed in near real time to meet the needs of time-sensitive processes. For example, the systems deployed by stock exchanges and trading firms must be able to process the data before a new generation cycle has begun, with the risk that humans lose much of their control of the system when the main operators become "robots" able to issue buy or sell orders on the order of the nanosecond, without having all the criteria needed for medium- and long-term analysis.
Then 2.8 zettabytes in 2012, with an expected 40 zettabytes by 2039. For example, Twitter generated, in January 2013, 7 terabytes of data per day.
Variety: the volume of Big Data confronts data centers with a real challenge: the variety of the data. It is not traditional relational data; this data is raw, semi-structured or even unstructured. It is complex data coming from the Web, in text and image formats (image mining). It can be public (Open Data, the Web of data), geo-demographic per block (IP addresses), or the property of consumers, which makes it hard to exploit with traditional data-management tools in order to get the best out of it.
http://fr.wikipedia.org/wiki/Big_data (translation)
JSON
JavaScript
Object
Notation
JSON: format
JSON (JavaScript Object Notation) is a lightweight data-
interchange format. It is easy for humans to read and
write. It is easy for machines to parse and generate. It is
based on a subset of the JavaScript Programming
Language, Standard ECMA-262 3rd Edition - December
1999. JSON is a text format that is completely language
independent but uses conventions that are familiar to
programmers of the C-family of languages, including C,
C++, C#, Java, JavaScript, Perl, Python, and many others.
These properties make JSON an ideal data-interchange language.
JSON is built on two structures:
A collection of name/value pairs. In various
languages, this is realized as an object, record,
structure, dictionary, hash table, keyed list, or
associative array.
An ordered list of values. In most languages,
this is realized as an array, vector, list, or
sequence.
These are universal data structures. Virtually
all modern programming languages support
them in one form or another. It makes sense
that a data format that is interchangeable with
programming languages also be based on these
structures.
In JSON, they take on these forms:
An object is an unordered set of name/value
pairs. An object begins with { (left brace) and
ends with } (right brace). Each name is
followed by : (colon) and the name/value pairs
are separated by , (comma).
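As a small illustration (with hypothetical values), a JSON document combining both structures, an object whose members include an array, might look like this:

{
  "user": { "nickName": "chaker", "followers": 120 },
  "tags": ["big data", "real-time", "analytics"],
  "active": true
}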
Structured Data
Universal
Fast Processing
Rich Components
http://json.org/
JSON: structure of data
APACHE ZOOKEEPER
Working with distributed systems
APACHE ZOOKEEPER
Apache ZooKeeper is a framework for federating communication between distributed systems. It works by providing a memory space shared by all the instances of a same set of servers. This memory space is hierarchical, in the style of a file system made of directories and files. It is distributed: it consists of several machines connected to one another which solve a problem together, as opposed to a centralized system, a single big machine which takes everything under its own responsibility. Consider the case of Google, for which no single machine could handle all the requests. The simple fact of having several machines that must work together is a source of problems, among which:
Fault tolerance: if a machine in the network breaks down, what do we do? If it is the only one to hold an important piece of information, that information is lost. In this case the issue is addressed through redundancy, the information being duplicated on several machines. Consistency of the information, particularly when it is duplicated: the goal is to make the value of a piece of data independent of its source; we want to avoid each machine carrying a different version of the information we need. Load distribution: how to manage resources well, so that a single machine is not over-solicited while the others sit idle.
How is a request issued by a client handled, and by whom? And how do we guarantee that the result is the right one, independently of the machine that handled the request? This is the so-called consensus problem: we say there is consensus if two machines having the same initial data give the same result for the same processing.
Distributed
Local Implements
Cloud Implements
Orchestrator
http://blog.zenika.com/index.php?post/2014/10/13/Apache-Zookeeper-facilitez-vous-les-systemes-distribues2
Distributed systems
Communication between the instances, or machines, takes place in asynchronous mode. For example, you may want to send a message to a set of instances (a cluster) to launch a processing job. You also need to know whether an instance is operational. A commonly used method of communicating is a queue, which makes it possible to send messages directly or by topic, and to read them asynchronously. Communication in cluster mode works exactly like communication in local mode between several processes, and goes through coordination of the shared use of resources. Concretely, when a node writes to a file, it is preferable that it place a lock on that file, visible to all the other nodes.
A distributed file system does not want to know that a given file lives on a given instance. It wants to be under the illusion that it is manipulating a single file system, exactly as on a single machine. The management of resource distribution, and of what to do when an instance does not answer, should not need to be known.
A naming service: to present it, admittedly in a rough way, we would like a kind of Map<String, Object> distributed over the network, which all the instances can use. A typical example would be JNDI. The purpose of the naming service is to provide distributed access to objects.
A robust request-processing system is a distributed architecture that ingests real-time data. In an advanced setup we do not want to lose the slightest piece of data, such as stock-exchange orders; the idea is to have processing that retrieves the data of an instance before it fails or drops out of the network. ZooKeeper's approach is to elect a leader at every system startup; the leader ensures information sharing and decides when a piece of information must be persisted in the system to keep track of it, by taking into account the vote of every znode, or more exactly the acknowledgements needed to protect a piece of information: a number greater than n/2, where n is the number of nodes.
ZOOKEEPER: ZNodes
> .\bin\zkServer.cmd
> .\bin\zkCli.cmd 127.0.0.1:2181
Zookeeper nodes list
Zookeeper daemon
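As a minimal sketch of the shared, hierarchical memory space, assuming the standalone server started above is listening on 127.0.0.1:2181 and that the hypothetical znode /demo does not exist yet, the ZooKeeper Java client can create and read a znode as follows:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        // connect to the standalone server started with zkServer.cmd
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 5000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // create a persistent znode in the shared, hierarchical memory space
        // (assumes /demo does not already exist)
        zk.create("/demo", "hello".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // any instance of the cluster can now read the same value
        byte[] data = zk.getData("/demo", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}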
APACHE STORM
Real-Time Streaming
Apache Storm is a distributed, fault-tolerant, real-time computation system. Originally developed by the company BackType, the project became open source after the company's acquisition by Twitter. It is available under the Eclipse Public License 1.0. Moreover, Storm entered the Apache Foundation's incubation process a few months ago. To continuously process several data streams, Storm relies on the definition of a topology.
A topology takes the form of a directed graph in which: Streams, symbolized by the arcs, are unbounded sequences of tuples. A tuple is a list of named values and is the data model used by Storm.
Spouts, the root nodes of the graph, designate the sources of streams. These can be, for example, a sequence of tweets emitted via the Twitter API, a stream of logs, or data read directly from a database.
Finally, bolts are the nodes that consume the sequences of tuples emitted by one or more other nodes. Their role is to perform various operations (filters, aggregations, joins, reads/writes to and from a database, etc.) and, if needed, to emit in turn a new sequence of tuples.
Storm: concepts
Analytics
Big Data
Distributed
Topology Design
http://blog.zenika.com/index.php?post/2014/01/31/Storm-Ajouter-du-temps-reel-a-votre-BigData (translation)
Stream grouping answers the following question: when a tuple is emitted, to which bolt instances must it be routed? In other words, it is a matter of specifying how streams are partitioned among the various instances of a same spout or bolt component. For that purpose, Storm supplies a set of predefined groupings, of which here are the main ones:
* Shuffle grouping: tuples are randomly distributed among the various bolt instances so that each receives an equal number of tuples.
* Fields grouping: the stream is partitioned according to one or several fields.
Streams: grouping, worker, executor
* All grouping: the stream is replicated to all the instances. This method should be used with caution because it generates as many streams as there are instances.
* Global grouping: the whole stream is redirected to the same instance. When there are several instances for the same bolt, the stream is redirected to the one with the smallest identifier.
When a topology is submitted to Storm, Storm distributes all the processing implemented by your components across the cluster. Each component is then executed in parallel on one or several machines. For example, Storm may execute a topology made of one spout and two bolts. For every topology, Storm manages a set of different entities:
A "worker process" is a JVM running on a machine of the cluster. Its role is to coordinate the execution of one or several components (spouts or bolts) belonging to the same topology. (The number of workers associated with a topology can change over time.)
An "executor" is a thread launched by a "worker process". It is in charge of executing one or several "tasks" for a specific bolt or spout. (The number of executors ...)
Storm cluster: properties
Storm: spouts, bolts
Storm: Topology analytics
Storm: JSON Receiver spouts
Parse JSON Stream with Storm Java API
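The JSON receiver topology itself appears as a screenshot in the original slides; as a hedged sketch of the same idea with the Storm Java API (pre-Apache backtype.storm packages), using a hypothetical spout that emits one JSON line per tuple and a hypothetical bolt that consumes it through a shuffle grouping:

import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class JsonTopology {

    // Hypothetical spout: in the real scenario it would read the JSON stream
    // (for instance from Kafka); here it emits a fixed line for illustration.
    public static class JsonLineSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            collector.emit(new Values("{\"nickName\":\"chaker\",\"message\":\"hello\"}"));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("json"));
        }
    }

    // Hypothetical bolt: consumes the tuples and would parse/route the JSON.
    public static class PrinterBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("json"));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("json-spout", new JsonLineSpout(), 1);
        builder.setBolt("json-printer", new PrinterBolt(), 2)
               .shuffleGrouping("json-spout"); // tuples evenly distributed among bolt instances
        // run the topology inside the JVM; a real deployment would use StormSubmitter
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("json-topology", new Config(), builder.createTopology());
    }
}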
APACHE KAFKA
Producer-Consumer of Content
Apache published Kafka 0.8, the first major version of Kafka since the project became a top-level project of the Apache Software Foundation. Apache Kafka is a publish-subscribe message-oriented system implemented as a distributed transactional log, suited to the consumption of both online and offline messages. It is a message-oriented system originally developed at LinkedIn for the collection and distribution of high volumes of events and log data with low latency. The latest version includes intra-cluster replication and support for multiple data directories. Log files can be rotated by age, and log levels can be set dynamically via JMX. A performance-testing tool has been added, to help address concerns and look for performance improvements. Kafka is a distributed, partitioned and replicated commit-log service.
Producers publish messages to Kafka topics; consumers subscribe to these topics and consume the messages. A server in a Kafka cluster is called a broker. For every topic, the Kafka cluster maintains partitions for load scaling, parallelism and fault tolerance.
Live Transmit
Distributed
Memory
Traceability
http://www.infoq.com/fr/news/2014/01/apache-afka-messaging-system (translation)
Apache Kafka
The position in the stream is checked and controlled by the consumer. A typical consumer will process the next message in the list, although it can consume messages in any order, because the Kafka cluster keeps all published messages for a configurable period of time. This makes consumers very cheap: they can come and go without much impact on the cluster, and it allows disconnected consumers such as Hadoop clusters. Producers are able to choose the topic, and the partition within the topic, to which a message is published. Consumers assign themselves a consumer group name, and every message is delivered to one consumer in every subscribed consumer group. If all the consumers have different groups, then messages are broadcast to every consumer. Kafka can be used as a traditional message middleware. It offers high throughput and has native partitioning, replication and fault-tolerance capabilities, which makes it a good solution for
large-scale message-processing applications. Kafka can also be used for high-volume website activity tracking: site activity can be published and processed in real time, or loaded into a Hadoop data warehouse or an offline system. Kafka can also be used as a log-aggregation solution: instead of working with files, logs can be treated as streams of messages. Kafka is used at LinkedIn, where it handles more than 10 billion writes per day with a sustained load of around 172,000 messages per second, with massive use of multi-subscriber support.
Kafka Topic
APACHE KAFKA: producer - topic - consumer
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> bin/kafka-topics.sh --list --zookeeper localhost:2181
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
toto
Hello World
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
toto
Hello World
APACHE KAFKA: JSON format transmit
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaProducer {
    private static kafka.javaapi.producer.Producer<Integer, String> producer = null;
    private static String topic = "topic_message";
    private static Properties props = new Properties();

    public KafkaProducer() { }

    // Reads a JSON file line by line and publishes each line as a message on the topic.
    public void sendMessage() {
        BufferedReader br = null;
        try {
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("metadata.broker.list", "127.0.0.1:9092");
            producer = new kafka.javaapi.producer.Producer<Integer, String>(new ProducerConfig(props));
            String sCurrentLine;
            br = new BufferedReader(new FileReader("data.json"));
            while ((sCurrentLine = br.readLine()) != null) {
                producer.send(new KeyedMessage<Integer, String>(topic, sCurrentLine));
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null)
                    br.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
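On the consuming side, a minimal sketch using the Kafka 0.8 high-level consumer API, assuming the topic_message topic fed by the producer above and a hypothetical consumer group named json_consumers:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class KafkaJsonConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "127.0.0.1:2181"); // brokers are discovered via ZooKeeper
        props.put("group.id", "json_consumers");          // hypothetical consumer group name
        props.put("auto.offset.reset", "smallest");       // start from the beginning of the topic
        ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put("topic_message", 1);            // one stream (thread) for the topic
        Map<String, List<KafkaStream<byte[], byte[]>>> streams = consumer.createMessageStreams(topicCountMap);

        ConsumerIterator<byte[], byte[]> it = streams.get("topic_message").get(0).iterator();
        while (it.hasNext()) {
            // each message is one JSON line published by KafkaProducer
            System.out.println(new String(it.next().message()));
        }
    }
}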
APACHE CASSANDRA
NoSQL DataBase
Apache Cassandra is a database from the very fashionable NoSQL family. It is classified among column-oriented stores, just like HBase, Apache Accumulo and BigTable. It was originally developed by Facebook engineers for their in-house needs before being made available to the general public as open source.
CHARACTERISTICS
· Fault tolerant: the data of a node (a node is a Cassandra instance) is automatically replicated to other nodes (different machines). So if a node goes down, its data remains available through the other nodes. The replication factor indicates the number of nodes on which a piece of data is replicated. Cassandra's architecture also defines a cluster as a group of at least two nodes, and a data center as a set of co-located clusters; Cassandra can ensure replication across different data centers. Nodes that have fallen can be replaced with no service downtime.
· Decentralized: in a cluster, all the nodes are equal. There is no notion of master or slave, no process in charge of their management, and no bottleneck at the network level. The GOSSIP protocol ensures the discovery, the location and the collection of all the information about the state of the nodes of a cluster.
· Rich data model: the data model proposed by Cassandra, based on the key/value notion, makes it possible to cover numerous use cases in the Web world.
· Elastic: scalability is linear. Write and read throughput increase linearly when a new server is added to the cluster. Moreover, Cassandra guarantees that there will be no system downtime nor interruption at the application level.
Analytics
Storage
Distributed
Memory
APACHE Cassandra: NoSQL DataBase
http://www.infoq.com/fr/articles/modele-stockage-physique-cassandra-details (translation)
In the world of NoSQL databases, we often hear about the CAP theorem. This theorem establishes 3 properties that can be traded off when configuring a distributed database:
Consistency (C)
Availability (A)
Partition tolerance, i.e. tolerance to network cuts (P)
The theorem postulates that any distributed database can only guarantee 2 of these 3 properties, never all 3 at the same time. In theory, we can therefore choose the following pairs:
a. Consistency and availability (CA), thus not tolerant to partitions.
b. Consistency and partition tolerance (CP), thus not 100% available.
c. Availability and partition tolerance (AP), thus not 100% consistent.
That is the theory. In practice, we realize that the P parameter is more or less imposed: network cuts happen, they are inevitable. As a result, the choice really comes down to CP or AP. Cassandra clearly chooses AP, for fault tolerance and absolute availability. In return, Cassandra sacrifices absolute consistency (in the ACID sense of the term) in favor of eventual consistency, that is, a strong consistency obtained after the data has converged.
CAP Theorem
http://www.infoq.com/fr/articles/cap-twelve-years-later-how-the-rules-have-changed
CQLSH
CQL stands for Cassandra Query Language, now in its fourth version. The first version was an experimental attempt to introduce a query language for Cassandra. The second version of CQL was designed to query wide rows, but was not flexible enough to adapt to all the modelling types that exist in Apache Cassandra. It is, however, recommended to use instead a secondary index positioned on the column containing the desired information. Indeed, using the Ordered-Partitioners strategy has the following consequences: sequential writes can create hotspots (if the application tries to write or update a sequential set of rows, the writes will not be distributed across the cluster); an increased overhead for administering load balancing in the cluster (administrators must manually compute token ranges to distribute them in the cluster); and uneven load distribution for multiple column families.
The CQLSH interface is written in Python and therefore requires a Python installation (version 2.7 or higher) to benefit from this interface of direct communication with the Cassandra database. The query language in version 4 is very similar to SQL2: several terms are the same, but their uses differ; a primary key in Cassandra, for example, is not equivalent to its relational counterpart.
Example:
CREATE TABLE developer(
developer_id bigint,
firstname text,
lastname text,
age int,
task varchar,
PRIMARY KEY(developer_id));
CQLSH 4.1 python interface
Apache Cassandra is a system for managing a large quantity of data in a distributed way. That data can be structured, semi-structured or entirely unstructured. Cassandra was designed to be highly scalable across a large number of servers while presenting no Single Point Of Failure (SPOF). Cassandra provides a dynamic data schema to offer maximum flexibility and performance. But to understand this tool well, its vocabulary must first be assimilated:
- Cluster: a cluster is a group of nodes that communicate with each other to manage the data.
- Keyspace: the equivalent of a database in the world of relational databases. Note that it is possible to have several keyspaces on the same server.
- Column: a column consists of a name, a value and a timestamp.
- Row: columns are grouped into rows. A row is represented by a key and a value.
It is possible to configure partitioning for a column family by specifying that it should be managed with the Ordered-Partitioners strategy. This mode can indeed be of interest if we wish to retrieve a range of rows between two values (something which is not possible when the MD5 hash of the row keys is used).
C* Cluster and system architecture
C* cluster: nodes configuration
Cassandra Cluster Configuration with OpsCenter
C* Storage Data: Model
import java.io.Serializable;
import java.util.Date;
import java.util.UUID;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.apache.commons.lang3.builder.EqualsBuilder;
import org.apache.commons.lang3.builder.HashCodeBuilder;
import org.apache.commons.lang3.builder.ToStringBuilder;
import org.apache.commons.lang3.builder.ToStringStyle;
import org.easycassandra.Index;
@Entity(name = "tweet")
public class Tweet implements Serializable {
@Id
private UUID id;
@Index
@Column(name = "nickName")
private String nickName;
@Column(name = "message")
private String message;
@Column(name = "time")
private Date time;
public UUID getId() {return id; }
public void setId(UUID id) {this.id = id;}
public String getNickName() {return nickName; }
public void setNickName(String nickName) {
this.nickName = nickName;}
public String getMessage() {return message;}
public void setMessage(String message) {
this.message = message;}
public Date getTime() {return time;}
public void setTime(Date time) {this.time = time; }
@Override
public boolean equals(Object obj) {if(obj instanceof Tweet) {
Tweet other = Tweet.class.cast(obj);
return new EqualsBuilder().append(id, other.id).isEquals();}
return false;}
@Override
public int hashCode() {
return new HashCodeBuilder().append(id).toHashCode();}
@Override
public String toString() {
return ToStringBuilder.reflectionToString(this,
ToStringStyle.SHORT_PREFIX_STYLE);
}}
C* Storage Data: Repository-Request
import java.util.List;
import java.util.UUID;
import org.easycassandra.persistence.cassandra.Persistence;
public class TweetRepository {
private Persistence persistence;
public List<Tweet> findByIndex(String nickName) {
return persistence.findByIndex("nickName", nickName,
Tweet.class);
}
{
this.persistence =
CassandraManager.INSTANCE.getPersistence();
}
public void save(Tweet tweet) {
persistence.insert(tweet);
}
public Tweet findOne(UUID uuid) {
return persistence.findByKey(uuid, Tweet.class);
}
}
import java.util.Date;
import java.util.UUID;
public class App {
    public static void main(String[] args) {
        TweetRepository personService = new TweetRepository();
        Tweet tweet = new Tweet();
        UUID uuid = UUID.randomUUID();
        tweet.setId(uuid);
        tweet.setNickName("allaoui chaker");
        tweet.setMessage("test cassandra mapping");
        tweet.setTime(new Date());
        personService.save(tweet);
        // retrieve the tweet by the same key it was stored under
        Tweet findTweet = personService.findOne(uuid);
        System.out.println(findTweet);
    }
}
C* Storage and find Data by ORM
EasyCassandra ORM Mapping Data into Cassandra
APACHE SPARK
Big Data Eco-System
APACHE SPARK ECO-SYSTEM
Apache Spark is an open-source Big Data processing framework built for sophisticated analyses and designed for speed and ease of use. It was originally developed by AMPLab at UC Berkeley in 2009 and open-sourced as an Apache project in 2010. Spark presents several advantages compared to other Big Data and MapReduce technologies such as Hadoop and Storm. First, Spark offers a complete and unified framework to meet Big Data processing needs for data sets that are diverse by nature (text, graph, etc.) as well as by type of source.
Next, Spark allows applications on Hadoop clusters to run up to 100 times faster in memory and 10 times faster on disk. It lets you quickly write applications in Java, Scala or Python and includes a set of more than 80 high-level operators. Furthermore, it can be used interactively to query data from a shell. Finally, besides the Map and Reduce operations, Spark supports SQL queries and data streaming, and offers machine-learning features.
Analytics
Distributed
Local Implements
Cloud Implements
Apache Spark: SQL, Streaming, ML, GraphX
http://www.infoq.com/fr/articles/apache-spark-introduction (translation)
RDD
Resilient Distributed Datasets (based on Matei Zaharia's research paper), or RDDs, are a concept at the heart of the Spark framework. You can think of an RDD as a table in a database. It can hold any type of data and is stored by Spark across various partitions. RDDs allow computations to be rearranged and processing to be optimized. They are also fault tolerant, because an RDD knows how to recreate and recompute its data set. An RDD is immutable: to obtain a modified version of an RDD, you must apply a transformation to it, which returns a new RDD while the original remains unchanged. An RDD supports two types of operations: transformations and actions.
Actions/Transformations
Transformations: transformations do not return a single value; they return a new RDD. Nothing is evaluated when a transformation function is called; the function just takes an RDD and returns a new RDD. Transformation functions are, for example, map() and filter(), described below.
Actions: actions evaluate and return a new value. When an action function is called on an RDD object, all the data-processing queries are computed and the result is returned. Actions are, for example, reduce, collect, count, first, take and countByKey.
Transformations: the two most common transformations you will probably use are map() and filter(). The map() transformation takes in a function and applies it to every element in the RDD, with the result of the function becoming the new value of each element in the resulting RDD. The filter() transformation takes in a function and returns an RDD that contains only the elements that pass the filter function.
Actions: the most common action on basic RDDs you will probably use is reduce(), which takes a function that operates on two elements of your RDD's element type and produces a new element of the same type. A simple example of such a function is +, which we can use to sum up our RDD. With reduce(), we can easily compute that sum.
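A minimal Java sketch of this behavior, chaining two transformations (map, filter) and triggering the computation with the reduce action; the names and values are illustrative only:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

public class RddOperations {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddOperations").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // transformations are lazy: nothing is computed yet, each call returns a new RDD
        JavaRDD<Integer> squares = numbers.map(new Function<Integer, Integer>() {
            public Integer call(Integer x) { return x * x; }
        });
        JavaRDD<Integer> evenSquares = squares.filter(new Function<Integer, Boolean>() {
            public Boolean call(Integer x) { return x % 2 == 0; }
        });

        // reduce is an action: it triggers the computation and returns a value
        int sum = evenSquares.reduce(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) { return a + b; }
        });
        System.out.println("Sum of even squares: " + sum); // 4 + 16 = 20
        sc.stop();
    }
}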
Spark: cluster manager
A Spark cluster consists of a master and one or several workers. The cluster must be started and remain active to be able to execute applications. The master's sole responsibility is the management of the cluster; it therefore does not execute MapReduce code. The workers, on the other hand, are the executors: they are the ones that bring resources to the cluster, namely memory and processing cores.
To execute a job on a Spark cluster, an application must be submitted, whose processing is driven by a driver. Two execution modes are possible:
- Client mode: the driver is created on the machine that submits the application.
- Cluster mode: the driver is created inside the cluster.
Communication within the cluster: the workers establish bidirectional communication with the master: the worker connects to the master to open a channel in one direction, then the master connects to the worker to open a channel in the other direction. It is therefore necessary that the various nodes of the cluster can reach each other correctly (DNS resolution). Communication between nodes uses the Akka framework; this is useful to know in order to identify the log lines dealing with exchanges between nodes.
The nodes of the cluster (the master as well as the workers) also expose a Web interface allowing the state of the cluster and the progress of jobs to be monitored. Each node therefore opens two ports:
- A port for internal communication: port 7077 by default for the master, a random port for the workers.
- A port for the Web interface: port 8080 by default for the master, port 8081 by default for the workers.
Creating a Spark cluster makes it possible to harness the processing power of several machines. Its implementation is relatively simple; care simply needs to be taken to make the master resilient, for example by using ZooKeeper. Moreover, executing an application does not require modifying the processing code; the main constraint is to make sure that the data is read and written from distributed systems.
http://www.infoq.com/fr/articles/apache-spark-introduction
SparkSQL: SchemaRDD
Spark SQL works with structured and semi-structured data. Structured data is any data that has a schema, that is, a known set of fields for each record. When you have this type of data, Spark SQL makes it both easier and more efficient to load and query. In particular, Spark SQL provides three main capabilities:
- It can load data from a variety of structured sources: JSON, Hive…
- It lets you query the data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as business-intelligence tools like Tableau.
- When used inside a Spark program, Spark SQL provides rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, to expose custom functions in SQL, and more. Many jobs are easier to write using this combination. To implement these capabilities, Spark SQL provides a special type of RDD called a SchemaRDD.
A data-preparation step is necessary so that the SQL interpreter knows about the data. The RDD concept is reused and simply needs to be enriched with a schema; the class manipulated becomes SchemaRDD. Two options exist for building a SchemaRDD:
- by using the generic Row type and describing the schema manually;
- by using custom types and letting Spark SQL discover the schema by reflection.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.jsonFile("./data/data.json")
people.printSchema()
people.registerTempTable("people")
people.count()
people.first()
val r=sqlContext.sql("select * from people where nom='allaoui'").collect()
SparkSQL/UI: SqlContext jobs statistics
SQLContext: Scala
SparkSQL/UI: SqlContext stages statistics
import org.apache.commons.lang.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;
import com.datastax.spark.connector.japi.CassandraRow;
public class FirstRDD {
public static void main(String[] args) {
SparkConf conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
.setAppName("CassandraConnection").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext("local[*]", "test", conf);
JavaRDD<String> cassandraRowsRDD = javaFunctions(sc).cassandraTable("java_api","products")
.map(new Function<CassandraRow, String>() {
@Override
public String call(CassandraRow cassandraRow) throws Exception {
return cassandraRow.toString();
}});
System.out.println("--------------- SELECT CASSANDRA ROWS ---------------");
System.out.println("Data as CassandraRows: n" + StringUtils.join(cassandraRowsRDD.toArray(), "n"));
}}
SparkSQL with cassandraRDD
Aggregate data with cassandraRDD and sparkSQL queries
SparkConf conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1").setAppName("SparkCassandra")
.setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext("local[*]", "CassandraActionsSpark", conf);
JavaRDD<String> cassandraRowsRDD = javaFunctions(sc).cassandraTable("network", "storage")
.map(new Function<CassandraRow, String>() {
@Override
public String call(CassandraRow cassandraRow) throws Exception {
return cassandraRow.toString();
}});
System.out.println("StringUtils.join(cassandraRowsRDD.toArray()));
System.out.println(cassandraRowsRDD.first());
System.out.println(cassandraRowsRDD.collect());
System.out.println(cassandraRowsRDD.id());
System.out.println(cassandraRowsRDD.countByValue());
System.out.println(cassandraRowsRDD.name());
System.out.println(cassandraRowsRDD.count());
System.out.println(cassandraRowsRDD.partitions());
Spark Actions with CassandraRowRDD
Actions functions with cassandraRowsRDD and sparkSQL queries
package scala.example
object Person {
  var firstname: String = "CHAKER"
  var lastname: String = "ALLAOUI"
  def show() {
    println(" Firstname : " + firstname + " Lastname : " + lastname)
  }
}

object Main {
  def main(args: Array[String]) {
    println("Show Infos")
    Person.show()
  }
}
SparkUI: Monitoring
val textFile = sc.textFile("DATA.txt")
textFile.count()
textFile.first()
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
textFile.filter(line => line.contains("Spark")).count()
There are several ways to monitor Spark applications, of which SparkUI is the pillar, alongside external instrumentation. Each SparkContext launches a Web UI, by default on port 4040, which shows useful information about the application. This includes:
A list of scheduler stages and tasks.
A summary of RDD sizes and memory usage.
Environment information.
Information about the running executors.
You can access this interface simply by opening "http://hostname:4040" in a Web browser. If multiple SparkContexts are running on the same host, they will be bound to successive ports beginning with 4040 (4041, 4042, ...).
SparkContext: Scala
Note that by default this information is only available for the lifetime of the application. To view it in SparkUI afterwards and take advantage of the monitoring interface, set "spark.eventLog.enabled" to "true". This configures Spark to log the events that contain the information needed to reconstruct the application's activity in SparkUI and to visualize the persisted data.
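A minimal sketch of that configuration from Java, assuming a hypothetical event-log directory /tmp/spark-events that already exists:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MonitoredContext {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("MonitoredContext")
                .setMaster("local[*]")
                .set("spark.eventLog.enabled", "true")           // persist events after the application ends
                .set("spark.eventLog.dir", "/tmp/spark-events"); // hypothetical directory, must exist
        JavaSparkContext sc = new JavaSparkContext(conf);
        // jobs executed here are visible at http://localhost:4040 while the context is alive
        sc.stop();
    }
}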
SparkUI: SparkContext jobs statistics
SparkUI: SparkContext stages statistics
import java.lang.Math
val textFile = sc.textFile("DATA.txt")
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
MapReduce: map, key, value, reduce
Spark offers an alternative to MapReduce because it executes jobs in micro-batches at intervals of five seconds or less, a kind of fusion between batch and (near) real time. It also provides more stability than other real-time processing tools grafted onto Hadoop, such as Twitter Storm.
The software can be used for a wide variety of purposes, such as near-real-time data analysis and, thanks to a software library, many more jobs involving in-depth computation for machine learning and graph processing. With Spark, developers can simplify the complexity of MapReduce code and write data-analysis queries in Java, Scala or Python, using a set of 80 high-level routines.
With this version 1.0 of Spark, Apache now offers a stable API, which developers can use to interact with their own applications. Another novelty of version 1.0 is the Spark SQL component for accessing structured data, allowing it to be queried alongside unstructured data during an analytical operation.
Apache Spark is of course compatible with the HDFS file system (Hadoop Distributed File System), as well as with other components such as YARN (Yet Another Resource Negotiator) and the distributed database HBase. The University of California, more precisely Berkeley's AMP lab (Algorithms, Machines and People), is at the origin of the development of Spark, which the Apache Foundation adopted as a project in June 2013. IT companies such as Cloudera, Pivotal, IBM, Intel and MapR have already begun to integrate Spark into their Hadoop distributions.
SparkContext MapReduce: Scala
http://www.lemondeinformatique.fr/actualites/lire-la-fondation-apache-reveille-hadoop-avec-spark-57639.html
SparkUI: MapReduce statistics
Data-Driven Documents
Data Graphical Design
Data-Driven Documents
Presentational
Big Data
Design
Rich Components
The exponential production of data by computer systems is a well-established fact, and this reality feeds the Big Data phenomenon. Statistical or predictive analysis has to call on the art of visual data representation to give meaning to the data and understand it better. Data visualization is therefore called to take a growing place, in proportion to the volume of data produced by information systems.
As such, we are absolutely convinced that the D3 library, the subject of this section, will take its full place, and not only because of its aesthetic qualities. Created by Mike Bostock, D3 is often presented as a graphics library, while its acronym, D3 for Data-Driven Documents, shows that it is first of all, like jQuery, a JavaScript library that facilitates manipulation of the DOM tree. D3 implements routines for loading external data, with the JSON, XML, CSV and text formats natively supported. The developer writes the logic transforming the data into HTML or SVG elements to obtain a representation of it.
The representation can thus take the form of a table (HTML elements) as well as a curve (SVG elements). D3 therefore makes it possible to produce Data-Driven Documents, several examples of which are available on www.d3js.org.
JSON File presentation: DENDROGRAM
D3.js (or D3, for Data-Driven Documents) is a JavaScript graphics library that allows digital data to be displayed in a graphical, dynamic form. It is an important tool for compliance with W3C standards, using the common SVG, JavaScript and CSS technologies for data visualization. D3 is the official successor of the earlier Protovis framework. Unlike other libraries, it allows greater control over the final visual result. Its development became popular in 2011, upon the release of version 2.0.0 in August 2011; by August 2012 the library had reached version 2.10.0.
Embedded in an HTML page, the D3.js JavaScript library uses pre-built JavaScript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects or tooltips to them. These objects can also be styled on a large scale using the well-known CSS language.
Furthermore, large data sets with associated values can feed the JavaScript functions to generate conditional and/or rich graphical documents. These documents are most of the time graphs.
The data can come in numerous formats, most often JSON, CSV or GeoJSON.
Data analysis is thus the process of examining and interpreting data in order to develop answers to questions. The main stages of the analysis process consist in delimiting the subjects of analysis, determining the availability of appropriate data, deciding on the methods that should be used to answer the questions of interest, applying those methods, and evaluating, summarizing and communicating the results.
InputStream presentation: Pie Chart
Graphical presentation possible with D3
SMART DATA
Decisional Force
SMART Data
SMART DATA
Presentational
Design
Support Decision
Smart
Today, Big Data is for marketing managers both an incredible source of data on consumers and an incredible challenge to be met. "Digital" marketing strategies now take into account texts, conversations, behavior, etc., in an environment where the volume of this information to be handled grows exponentially. It would therefore be totally illusory to imagine managing all of this data. The stake of digital marketing is thus now the intelligent management of Big Data: identifying, classifying and exploiting the significant consumer information that allows marketing professionals to set up their strategies.
Smart Data is the process that turns raw data into highly qualified information about each consumer. The objective is to have a 360° view of the customer, based on information collected through suitable marketing mechanisms, whether classic or innovative (quizzes, social networks, purchases at checkout, use of mobile applications, geolocation, etc.). To get there, companies equip themselves with cross-channel marketing platforms capable of storing and analyzing every piece of information in order to "push" the right message at the right moment to every consumer. The end goal is not only to attract new customers but above all to increase their satisfaction and loyalty by anticipating their needs.
This means, among other things, establishing a real dialogue with each customer and effectively measuring the brand's marketing and commercial performance.
Targeting finely according to several criteria while respecting customer preferences, and managing the customization, relevance and consistency of cross-channel messages delivered by e-mail, mail, Web and call center, have become imperatives that Smart Data finally makes it possible to tackle effectively.
Let us forget the "Big" and focus on the "Smart", because the relevance of marketing strategies will always depend on the quality of customer data.
SMART DATA: Data Transformations
Data Sources: WebSockets, TCP/UDP, InputStream
Data Movement: Apache Storm, Apache Kafka
Data Storage: Apache Cassandra, Apache Spark
Data Presentation: Data-Driven Documents
Embedded in an HTML Web page, the D3.js JavaScript library uses pre-built JavaScript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects or tooltips to them. These objects can also be styled on a large scale using the well-known CSS language. Furthermore, large data sets with associated values can feed the JavaScript functions to generate conditional and rich graphical documents.
SMART DATA: Process Scenario
Synthesis
BI: Business Intelligence
Keywords
WebSockets – HTTP – TCP/UDP – InputStream – TwitterStream – WebServices – JSON – Data – Big Data – SMART DATA –
Process Big Data – Business Intelligence – Data Storage – Data Sources – Data Presentation – Data Mining – Data
Exploration – Apache Storm – Apache Zookeeper – Apache Kafka – Apache Cassandra – Apache Spark – SparkSQL –
SparkUI – D3 – Data-Driven Documents – Storm cluster – StormUI – Storm Topology – Zookeeper cluster – Distributed
server – Topics – Message – Queue – Data Transmit – OpsCenter – EasyCassandra – Keyspace – Column – CQLSH –
CassandraColumn – CassandraRow – Cassandra cluster – Storage Data – Aggregation – RDD – SchemaRDD – Spark Actions
– Spark Transformations – Spark cluster – MapReduce – Jobs – Stages – Executors – Data Transformations – SMART – Apache
Links
Apache Storm
https://storm.apache.org
Apache Zookeeper
https://zookeeper.apache.org
Apache Kafka
http://kafka.apache.org/
Apache Cassandra
http://cassandra.apache.org
Data-Driven Documents
http://d3js.org/
Apache Spark
https://spark.apache.org
Idea Create Refine
Contact
Visit my profile on LinkedIn
Visit my website
http://tn.linkedin.com/in/chakerallaoui
http://allaoui-chaker.github.io
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...confluent
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCEcsandit
 
What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us? Andrea Volpini
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 
Azure + WP7 - CodePaLOUsa
Azure + WP7 - CodePaLOUsaAzure + WP7 - CodePaLOUsa
Azure + WP7 - CodePaLOUsaSam Basu
 
Reactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/SubscribeReactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/SubscribeSumant Tambe
 

Similaire à Big Data to SMART Data : Process Scenario (20)

Addressing dm-cloud
Addressing dm-cloudAddressing dm-cloud
Addressing dm-cloud
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Ss eb29
Ss eb29Ss eb29
Ss eb29
 
Mobility & Data Strategies
Mobility & Data StrategiesMobility & Data Strategies
Mobility & Data Strategies
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
 
Web of Things (wiring web objects with Node-RED)
Web of Things (wiring web objects with Node-RED)Web of Things (wiring web objects with Node-RED)
Web of Things (wiring web objects with Node-RED)
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi and Eri...
 
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
How To Use Kafka and Druid to Tame Your Router Data (Rachel Pedreschi, Imply ...
 
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCENETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
NETWORK TRAFFIC ANALYSIS: HADOOP PIG VS TYPICAL MAPREDUCE
 
What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us?
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Azure + WP7 - CodePaLOUsa
Azure + WP7 - CodePaLOUsaAzure + WP7 - CodePaLOUsa
Azure + WP7 - CodePaLOUsa
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Reactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/SubscribeReactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/Subscribe
 

Dernier

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Dernier (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Big Data to SMART Data : Process Scenario

• 7. Data: WebSockets, TCP/UDP, InputStream. The WebSocket protocol provides a full-duplex communication channel over a single TCP socket between browsers and web servers. The growing interactivity of web applications, made possible by better browser performance, quickly created a need for bidirectional communication between the web application and the server process; techniques based on client-side calls to the "XMLHttpRequest" object, using long-lived HTTP requests held by the server for a later answer (Ajax), partially compensated for this lack. TCP/UDP: UDP is a connectionless protocol; when machine A sends packets to machine B, the flow is one-way. The data is transmitted without notifying the recipient (machine B), and the recipient receives it without sending an acknowledgement back to the sender (machine A). TCP is connection-oriented: when machine A sends data to machine B, machine B is notified of the arrival of the data and confirms good reception with an acknowledgement. InputStream: input streams support these sending and receiving processes. Streams always process data sequentially and fall into several categories: character streams and byte streams. Stream sources can be web services or data feeds coming from social networks such as the Twitter API. Keywords: Binary Data, Structured Data, Unstructured Data, API Interface. Source: http://fr.wikipedia.org (translated)
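As a minimal sketch of the sequential reading described above (the file name data.json is only an example; any InputStream, such as a socket or HTTP response body, would work the same way), a stream can be consumed line by line in plain Java:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamReadExample {
    public static void main(String[] args) throws IOException {
        // Any InputStream works here: a file, a socket, or an HTTP response body.
        try (InputStream in = new FileInputStream("data.json");
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            // Streams are consumed sequentially, one record (here: one line) at a time.
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}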
• 8. Read Content: cURL. To read the contents of a web source, we use cURL (Client URL Request Library), a command-line interface for retrieving the contents of a resource reachable over a network. The resource is identified by a URL and must be of a type supported by the tool, so cURL can also be used as a REST API client. The library supports, among others, the FTP, FTPS, HTTP, HTTPS, TFTP, SCP, SFTP, Telnet, DICT, FILE and LDAP protocols; writes can be made over HTTP with the POST and PUT methods. cURL is a generic interface for handling a data flow. The example below reads the contents of a JSON file hosted on a localhost server; the idea is to see how the streaming software treats information received from a remote server. Example, reading content with the GET method:
> curl -i -H "Accept: application/json" -H "Content-Type: application/json" http://www.host.com/data.json
cURL command: JSON Stream
• 10. The Value. The initial definition, given by the McKinsey and Company consultancy in 2011, focused first on the technological question, with the famous 3V rule: a large Volume of data, a wide Variety of the same data, and a processing Velocity sometimes approaching real time. These technologies were meant to answer the explosion of data in the digital landscape (the "data deluge"). These qualifiers then evolved towards a more economic vision carried by a 4th V, the Value: Big Data comes with the development of analytics-oriented applications, which process the data to extract meaning from it. These analyses are called Big Analytics or "data crunching"; they apply distributed-computation methods to complex quantitative data. In 2001, a META Group research paper defined the stakes inherent to the growth of data as three-dimensional: complex analyses indeed follow the so-called "3V" rule (volume, velocity and variety). This model is still widely used today to describe the phenomenon. The average annual worldwide growth rate of the Big Data technology and services market over the 2011-2016 period should be 31.7%; this market should thus reach 23.8 billion dollars in 2016 (according to IDC, March 2013). Big Data should also represent 8% of European GDP in 2020. Keywords: Organize Data, Centralize Data, Combine Data, Support Decision. Source: http://recherche.cnam.fr/equipes/sciences-industrielles-et-technologies-de-l-information/big-data-decodage-et-analyse-des-enjeux-661198.kjsp
• 11. Volume: a relative dimension. Big Data, as Lev Manovich noted in 2011, formerly meant "data sets large enough to require supercomputers", but it quickly became possible to use standard software on desktop computers to analyse or co-analyse vast data sets. The volume of stored data is growing rapidly: the digital data created worldwide is said to have grown from 1.2 zettabytes per year in 2010 to 1.8 zettabytes in 2011, then 2.8 zettabytes in 2012, and to be heading towards 40 zettabytes in 2039. Velocity: represents both the frequency at which data is generated and captured and the frequency at which it is shared and updated. Growing flows of data must be analysed in real time to meet the needs of time-sensitive processes. For example, the systems put in place by stock exchanges and companies must be able to process the data before a new generation cycle begins, with the risk that humans lose a large part of the control of the system when the main operators become "robots" able to issue buy or sell orders on the scale of a nanosecond, without having all the analysis criteria for the medium and long term. (Twitter, for example, generated 7 … per day in January 2013.) Variety: the volume of Big Data confronts data centers with a real challenge: the variety of the data. It is not traditional relational data; this data is raw, semi-structured or even unstructured. It is complex data coming from the Web, in text and image formats (image mining). It can be public (Open Data, Web of data), geo-demographic by block (IP addresses), or belong to consumers, which makes it hard to exploit with traditional data-management tools in order to get the best out of it. Source: http://fr.wikipedia.org/wiki/Big_data (translated)
• 13. JSON format. JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition, December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange format. JSON is built on two structures: a collection of name/value pairs, which in various languages is realized as an object, record, structure, dictionary, hash table, keyed list, or associative array; and an ordered list of values, which in most languages is realized as an array, vector, list, or sequence. These are universal data structures: virtually all modern programming languages support them in one form or another, so it makes sense that a data format interchangeable with programming languages is based on them. In JSON they take the following forms: an object is an unordered set of name/value pairs; an object begins with { (left brace) and ends with } (right brace); each name is followed by : (colon) and the name/value pairs are separated by , (comma). Keywords: Structured Data, Universal, Speed Treatments, Rich Components. Source: http://json.org/
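As a small illustration of the two structures described above (assuming the Jackson library is available on the classpath; the field names are invented for the example), a JSON object can be parsed and navigated in Java:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonExample {
    public static void main(String[] args) throws Exception {
        // A name/value collection (object) containing an ordered list of values (array).
        String json = "{ \"nickName\": \"allaoui chaker\", \"tags\": [\"big data\", \"real-time\"] }";
        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(json);
        System.out.println(root.get("nickName").asText());    // object member access
        System.out.println(root.get("tags").get(0).asText()); // array element access
    }
}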
  • 15. APACHE ZOOKEEPER Working with distributed systems
• 16. APACHE ZOOKEEPER. Apache ZooKeeper is a framework for coordinating communication between distributed systems; it works by providing a memory space shared by all the instances of the same set of servers. This memory space is hierarchical, in the style of a file system made of directories and files. It is distributed: it consists of several machines connected to each other that solve a problem together, as opposed to a centralized system, i.e. one big single machine that handles everything itself; consider the case of Google, for which no single machine could handle all the requests. The simple fact of having several machines that must work together is a source of problems, among which: Fault tolerance: if a machine in the network fails, what do we do? If it is the only one carrying an important piece of information, that information is lost; in this case the question is handled through redundancy, with the information duplicated on several machines. Consistency of the information, especially when it is duplicated: the goal is to make the value of a piece of data independent of its source; we want to avoid every machine carrying a different version of information that we need. Load distribution: how to manage resources so that a single machine is not overloaded while the others sit idle; how is a request issued by a client handled, who handles it, and how do we guarantee that the result is the right one, independently of the machine that processed the request. This is the so-called consensus problem: there is consensus if two machines holding the same initial data give the same result for the same processing. Keywords: Distributed, Local Implements, Cloud Implements, Orchestrator. Source: http://blog.zenika.com/index.php?post/2014/10/13/Apache-Zookeeper-facilitez-vous-les-systemes-distribues2
• 17. Distributed systems. Communication between the instances, or machines, takes place in asynchronous mode. For example, you want to send a message to a set of instances (a cluster) to launch a processing job; you also need to know whether an instance is up. A common way to communicate is a queue, which allows messages to be sent directly or by topic and read asynchronously. Communication in cluster mode works exactly like local communication between several processes, and relies on coordination in the use of shared resources. Concretely, when a node writes to a file, it should place a lock on that file that is visible to all the other nodes. A distributed file system does not want to know that a given file lives on a given instance; it wants to keep the illusion of manipulating a single file system, exactly as on a single machine, without needing to know how the distribution of resources is managed or what to do when an instance does not answer. A naming service: to present it, admittedly roughly, we would like a kind of Map<String, Object> distributed over the network that all the instances can use; a typical example would be JNDI. The purpose of the naming service is to provide distributed access to objects. A robust request-processing system is a distributed architecture that receives real-time data. In an advanced setup we do not want to lose the slightest piece of data, such as stock-exchange orders; the idea is to perform a processing step that retrieves the data of an instance before it fails or leaves the network. ZooKeeper's approach is to elect a leader at every system start-up; the leader ensures the sharing of information and the conditions under which a piece of information must be persisted in the system to keep track of it, taking into account the acknowledgement of each znode, i.e. a piece of information is considered protected once it is acknowledged by more than n/2 nodes, where n is the number of nodes.
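A minimal sketch of the ZooKeeper Java client, assuming a ZooKeeper server running locally on 127.0.0.1:2181; the znode path /smartdata and its payload are only examples:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to the local ZooKeeper instance (session timeout: 3 seconds).
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 3000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("Event: " + event.getState());
            }
        });
        // Create a persistent znode carrying a small payload, then read it back.
        String path = "/smartdata";
        if (zk.exists(path, false) == null) {
            zk.create(path, "hello".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}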
• 18. ZOOKEEPER: ZNodes.
Zookeeper daemon:
> .\bin\zkServer.cmd
Zookeeper nodes list:
> .\bin\zkCli.cmd 127.0.0.1:2181
• 20. Storm: concepts. Apache Storm is a distributed, fault-tolerant, real-time computation system. Originally developed by the company BackType, the project became open source after the company's acquisition by Twitter. It is available under the Eclipse Public License 1.0, and Storm entered the Apache Foundation's incubation process a few months ago. To process several data flows continuously, Storm relies on the definition of a topology. A topology takes the shape of a directed graph in which: Streams, represented by the arcs, are unbounded sequences of tuples; a tuple is a list of named values, which is the data model used by Storm. Spouts, the root nodes of the graph, are the sources of streams; this can be, for example, a sequence of tweets emitted via the Twitter API, a flow of logs, or data read directly from a database. Finally, bolts are the nodes that consume the sequences of tuples emitted by one or several other nodes. Their role is to perform various operations (filters, aggregations, joins, reads/writes to and from a database, etc.) and, if needed, to emit a new sequence of tuples in their turn. Keywords: Analytics, Big Data, Distributed, Topology Design. Source: http://blog.zenika.com/index.php?post/2014/01/31/Storm-Ajouter-du-temps-reel-a-votre-BigData (translated)
• 21. Streams: grouping, worker, executor. Stream grouping answers the following question: when a tuple is emitted, towards which bolt instances must it be routed? In other words, it specifies how a stream is partitioned among the various instances of the same spout or bolt component. For that purpose, Storm provides a set of predefined groupings, whose main definitions are: * Shuffle grouping: tuples are randomly distributed to the various bolt instances so that each one receives an equal number of tuples. * Fields grouping: the stream is partitioned according to one or several fields. * All grouping: the stream is replicated to all instances; this method must be used with care because it generates as many flows as there are instances. * Global grouping: the whole stream is redirected to the same instance; if there are several instances for the same bolt, the stream is redirected to the one with the smallest identifier. When a topology is submitted to Storm, Storm distributes all the processing implemented by your components across the cluster; every component is then executed in parallel on one or several machines. Storm executes a topology composed of a spout and two bolts, and for every topology it manages a set of distinct entities: a "worker process" is a JVM running on a machine of the cluster, whose role is to coordinate the execution of one or several components (spouts or bolts) belonging to the same topology (the number of workers associated with a topology can change over time); an "executor" is a thread launched by a worker process, in charge of executing one or several "tasks" for a specific bolt or spout (the number of executors can also vary).
• 25. Storm: JSON receiver spouts. Parse a JSON stream with the Storm Java API.
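The slide above only shows a screenshot, so here is a minimal, self-contained sketch of the idea: a spout emitting JSON messages and a bolt parsing them, wired with a shuffle grouping. It assumes Storm 2.x packages and signatures (org.apache.storm; older releases contemporary with these slides used backtype.storm and raw Map parameters) plus Jackson for the JSON parsing; the JsonSpout, ParseJsonBolt classes and the sample message are invented for the example:

import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonTopology {

    // Example spout: emits a fixed JSON message once per second (stands in for a real feed).
    public static class JsonSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("{\"nickName\":\"allaoui chaker\",\"message\":\"hello\"}"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("json"));
        }
    }

    // Example bolt: parses the JSON tuple and emits one of its fields downstream.
    public static class ParseJsonBolt extends BaseBasicBolt {
        private final ObjectMapper mapper = new ObjectMapper();

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            try {
                JsonNode node = mapper.readTree(input.getStringByField("json"));
                collector.emit(new Values(node.get("message").asText()));
            } catch (Exception e) {
                // In a real topology the tuple would be reported as failed here.
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("message"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("json-spout", new JsonSpout(), 1);
        // Shuffle grouping: tuples are evenly and randomly distributed over the bolt instances.
        builder.setBolt("parse-bolt", new ParseJsonBolt(), 2).shuffleGrouping("json-spout");

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("json-topology", new Config(), builder.createTopology());
    }
}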
• 27. Apache Kafka. Apache published Kafka 0.8, the first major version of Kafka since the project became a top-level project of the Apache Software Foundation. Apache Kafka is a publish-subscribe messaging system implemented as a distributed, transactional commit log, suitable for consuming messages both online and offline. It was originally developed at LinkedIn for collecting and delivering high volumes of event and log data with low latency. The latest version includes intra-cluster replication and support for multiple data directories. Log files can be rotated by age, and log levels can be set dynamically via JMX. A performance-testing tool has been added to help address performance concerns and look for improvements. Kafka is a distributed, partitioned and replicated commit-log service. Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages. A server in a Kafka cluster is called a broker. For every topic, the Kafka cluster maintains partitions for scaling, parallelism and fault tolerance. Keywords: Live Transmit, Distributed Memory, Traceability. Source: http://www.infoq.com/fr/news/2014/01/apache-afka-messaging-system (translated)
• 28. Kafka Topic. Consumption is controlled by the consumer: a typical consumer will process the next message in the list, although it can consume messages in any order, because the Kafka cluster keeps all published messages for a configurable period of time. This makes consumers very cheap: they can come and go without much impact on the cluster, and it supports disconnected consumers such as Hadoop clusters. Producers are able to choose the topic, and the partition within the topic, to which a message is published. Consumers assign themselves a consumer-group name, and every message is delivered to one consumer in each subscribed consumer group; if all consumers belong to different groups, then messages are broadcast to every consumer. Kafka can be used as a traditional message middleware: it offers high throughput and has native partitioning, replication and fault-tolerance capabilities, which makes it a good solution for large-scale message-processing applications. Kafka can also be used for high-volume website activity tracking: site activity can be published and processed in real time, or loaded into a Hadoop or offline data-warehouse system. Kafka can also be used as a log-aggregation solution: instead of working with files, logs can be treated as streams of messages. Kafka is used at LinkedIn, where it handles more than 10 billion writes per day with a sustained load approaching 172,000 messages per second, with massive use of multi-subscriber support.
• 29. APACHE KAFKA: producer - topic - consumer
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> bin/kafka-topics.sh --list --zookeeper localhost:2181
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
toto
Hello World
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
toto
Hello World
• 30. APACHE KAFKA: JSON format transmit

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaProducer {

    private static Producer<Integer, String> producer = null;
    private static final String topic = "topic_message";
    private static final Properties props = new Properties();

    public KafkaProducer() {
    }

    public void sendMessage() {
        BufferedReader br = null;
        try {
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            props.put("metadata.broker.list", "127.0.0.1:9092");
            producer = new Producer<Integer, String>(new ProducerConfig(props));
            // Read the JSON file line by line and publish each line as a message on the topic.
            br = new BufferedReader(new FileReader("data.json"));
            String sCurrentLine;
            while ((sCurrentLine = br.readLine()) != null) {
                producer.send(new KeyedMessage<Integer, String>(topic, sCurrentLine));
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) br.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}
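For completeness, a minimal consumer sketch matching the 0.8-era high-level consumer API used by the producer above (the consumer group name json_consumers is invented for the example; the topic topic_message is reused from the producer):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class KafkaJsonConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // The high-level consumer of Kafka 0.8 coordinates itself through ZooKeeper.
        props.put("zookeeper.connect", "127.0.0.1:2181");
        props.put("group.id", "json_consumers");
        props.put("auto.offset.reset", "smallest");

        ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        String topic = "topic_message";
        // Ask for a single stream (thread) for the topic.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                consumer.createMessageStreams(Collections.singletonMap(topic, 1));
        KafkaStream<byte[], byte[]> stream = streams.get(topic).get(0);

        // Iterate over the messages as they arrive and print the JSON payload.
        ConsumerIterator<byte[], byte[]> it = stream.iterator();
        while (it.hasNext()) {
            System.out.println(new String(it.next().message()));
        }
    }
}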
• 32. APACHE Cassandra: NoSQL Database. Apache Cassandra is a database of the very popular NoSQL family. It is classified among column-oriented databases, just like HBase, Apache Accumulo and BigTable. It was originally developed by Facebook engineers for their in-house needs before being made available to the general public as open source. Characteristics: Fault tolerance: the data of a node (a node is a Cassandra instance) is automatically replicated to other nodes (different machines), so if a node goes down, its data remains available through the other nodes; the GOSSIP protocol ensures the discovery, location and collection of all information about the state of the nodes of a cluster. The replication factor indicates the number of nodes on which a piece of data is replicated. Cassandra's architecture also defines a cluster as a group of at least two nodes, and a data center as a set of distributed clusters; Cassandra allows replication across different data centers, and failed nodes can be replaced without service downtime. Decentralized: in a cluster, all nodes are equal; there is no notion of master or slave, no process in sole charge of management, and no bottleneck at the network level. Rich data model: the data model proposed by Cassandra, based on the key/value notion, supports numerous use cases in the Web world. Elastic: scalability is linear; read and write throughput increases linearly when a new server is added to the cluster, and Cassandra guarantees no system downtime or interruption at the application level. Keywords: Analytics, Storage, Distributed Memory. Source: http://www.infoq.com/fr/articles/modele-stockage-physique-cassandra-details (translated)
• 33. CAP Theorem. In the world of NoSQL databases, we often hear about the CAP theorem. This theorem establishes three properties that a distributed database can be tuned for: Consistency (C), Availability (A), and Partition tolerance (P), i.e. tolerance to failures and network splits. The theorem states that for any distributed database, only two of these three properties can be guaranteed at the same time, never all three. In theory, we can thus choose among the following pairs: a. Consistency and availability (CA), thus not partition tolerant. b. Consistency and partition tolerance (CP), thus not 100% available. c. Availability and partition tolerance (AP), thus not fully consistent. That is the theory. In practice, we realize that the parameter P is more or less imposed: network splits do happen, inevitably, so the choice ultimately comes down to CP or AP. Cassandra clearly chooses AP, for fault tolerance and absolute availability. In return, Cassandra sacrifices absolute consistency (in the ACID sense of the term) in favour of eventual consistency, that is a strong consistency obtained after the data has converged. Source: http://www.infoq.com/fr/articles/cap-twelve-years-later-how-the-rules-have-changed
• 34. CQLSH. CQL stands for Cassandra Query Language, and the current version is 4. The first version was an experimental attempt to introduce a query language for Cassandra. The second version of CQL was designed to query wide rows but was not flexible enough to cover all the modelling styles that exist in Apache Cassandra. It is generally recommended to use a secondary index placed on the column holding the desired information rather than relying on ordered partitioning. Indeed, using the Ordered-Partitioner strategy has the following consequences: sequential writes can create hotspots (if the application writes or updates a sequential set of rows, the writes are not distributed across the cluster); increased overhead for administering load balancing in the cluster (administrators must manually compute token ranges to spread them across the cluster); and uneven load distribution across multiple column families. The CQLSH shell is written in Python and therefore requires Python 2.7 or later; it provides a direct command interface to the Cassandra database. The query language, in version 4, is very similar to SQL2: several terms are the same, but their meaning differs; a primary key in Cassandra, for example, is not equivalent to a relational primary key. Example:
CREATE TABLE developer(
  developer_id bigint,
  firstname text,
  lastname text,
  age int,
  task varchar,
  PRIMARY KEY(developer_id));
CQLSH 4.1 python interface
• 35. C* Cluster and system architecture. Apache Cassandra is a system for managing a large quantity of data in a distributed way; the data can be structured, semi-structured or entirely unstructured. Cassandra was designed to be highly scalable across a large number of servers while presenting no Single Point Of Failure (SPOF). Cassandra provides a dynamic data schema to offer maximum flexibility and performance. To understand this tool well, it is necessary first to assimilate the vocabulary: - Cluster: a group of nodes that communicate with one another to manage data. - Keyspace: the equivalent of a database in the world of relational databases; note that several keyspaces can exist on the same server. - Column: a column consists of a name, a value and a timestamp. - Row: columns are grouped into rows; a row is identified by a key and holds values. Partitioning can be configured per column family by specifying that it should be managed with the Ordered-Partitioner strategy. This mode can indeed be of interest when we want to retrieve a range of rows between two values, something that is not possible when the MD5 hash of the row keys is used.
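A minimal connection sketch with the DataStax Java driver (com.datastax.driver.core), assuming a Cassandra node running locally on 127.0.0.1 and reusing the developer table from the CQL example above; the keyspace name demo and the inserted values are invented for the example:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraDriverExample {
    public static void main(String[] args) {
        // Connect to the local Cassandra node.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // Create a keyspace and the developer table from the CQLSH example, then query it.
        session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class':'SimpleStrategy', 'replication_factor':1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.developer("
                + "developer_id bigint PRIMARY KEY, firstname text, lastname text, age int, task varchar)");
        session.execute("INSERT INTO demo.developer(developer_id, firstname, lastname, age, task) "
                + "VALUES (1, 'chaker', 'allaoui', 30, 'big data')");

        ResultSet rs = session.execute("SELECT * FROM demo.developer");
        for (Row row : rs) {
            System.out.println(row.getString("firstname") + " " + row.getString("lastname"));
        }
        cluster.close();
    }
}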
• 36. C* cluster: nodes configuration. Cassandra cluster configuration with OpsCenter.
• 37. C* Storage Data: Model

import java.io.Serializable;
import java.util.Date;
import java.util.UUID;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;

import org.apache.commons.lang3.builder.EqualsBuilder;
import org.apache.commons.lang3.builder.HashCodeBuilder;
import org.apache.commons.lang3.builder.ToStringBuilder;
import org.apache.commons.lang3.builder.ToStringStyle;
import org.easycassandra.Index;

@Entity(name = "tweet")
public class Tweet implements Serializable {

    @Id
    private UUID id;

    @Index
    @Column(name = "nickName")
    private String nickName;

    @Column(name = "message")
    private String message;

    @Column(name = "time")
    private Date time;

    public UUID getId() { return id; }
    public void setId(UUID id) { this.id = id; }
    public String getNickName() { return nickName; }
    public void setNickName(String nickName) { this.nickName = nickName; }
    public String getMessage() { return message; }
    public void setMessage(String message) { this.message = message; }
    public Date getTime() { return time; }
    public void setTime(Date time) { this.time = time; }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof Tweet) {
            Tweet other = Tweet.class.cast(obj);
            return new EqualsBuilder().append(id, other.id).isEquals();
        }
        return false;
    }

    @Override
    public int hashCode() {
        return new HashCodeBuilder().append(id).toHashCode();
    }

    @Override
    public String toString() {
        return ToStringBuilder.reflectionToString(this, ToStringStyle.SHORT_PREFIX_STYLE);
    }
}
• 38. C* Storage Data: Repository-Request

import java.util.List;
import java.util.UUID;

import org.easycassandra.persistence.cassandra.Persistence;

public class TweetRepository {

    private Persistence persistence;

    {
        this.persistence = CassandraManager.INSTANCE.getPersistence();
    }

    public List<Tweet> findByIndex(String nickName) {
        return persistence.findByIndex("nickName", nickName, Tweet.class);
    }

    public void save(Tweet tweet) {
        persistence.insert(tweet);
    }

    public Tweet findOne(UUID uuid) {
        return persistence.findByKey(uuid, Tweet.class);
    }
}

import java.util.Date;
import java.util.UUID;

public class App {
    public static void main(String[] args) {
        TweetRepository personService = new TweetRepository();
        Tweet tweet = new Tweet();
        UUID uuid = UUID.randomUUID();
        tweet.setId(uuid);
        tweet.setNickName("allaoui chaker");
        tweet.setMessage("test cassandra mapping");
        tweet.setTime(new Date());
        personService.save(tweet);
        // Read the tweet back using the same key it was saved with.
        Tweet findTweet = personService.findOne(uuid);
        System.out.println(findTweet);
    }
}
• 39. C*: storing and finding data with the ORM
  • 40. EasyCassandra ORM Mapping Data into Cassandra
  • 41. APACHE SPARK Big Data Eco-System
• 43. Apache Spark: SQL, Streaming, ML, GraphX. Apache Spark is an open-source Big Data processing framework built for sophisticated analytics and designed for speed and ease of use. It was originally developed in 2009 by AMPLab at UC Berkeley and open-sourced as an Apache project in 2010. Spark has several advantages compared with other Big Data and MapReduce technologies such as Hadoop and Storm. First, Spark offers a complete, unified framework to meet Big Data processing needs for data sets that are diverse in nature (text, graph, etc.) as well as in source type. Then, Spark lets applications on Hadoop clusters run up to 100 times faster in memory and 10 times faster on disk. It lets you quickly write applications in Java, Scala or Python and ships with a set of more than 80 high-level operators. Furthermore, it can be used interactively to query data from a shell. Finally, besides Map and Reduce operations, Spark supports SQL queries and data streaming and offers machine-learning features. Keywords: Analytics, Distributed, Local Implements, Cloud Implements. Source: http://www.infoq.com/fr/articles/apache-spark-introduction (translated)
• 44. RDD. Resilient Distributed Datasets (based on Matei Zaharia's research paper), or RDDs, are a concept at the heart of the Spark framework. You can see an RDD as a table in a database: it can hold any type of data and is stored by Spark across various partitions. RDDs allow computations to be rearranged and processing to be optimized. They are also fault tolerant, because an RDD knows how to recreate and recompute its data set. RDDs are immutable: to obtain a modification of an RDD, you apply a transformation to it, which returns a new RDD while the original remains unchanged. An RDD supports two types of operations: transformations and actions.
• 45. Actions/Transformations. Transformations: transformations do not return a single value; they return a new RDD. Nothing is evaluated when a transformation function is called: the function just takes an RDD and returns a new RDD. Transformation functions include, for example, map() and filter(), detailed below. Actions: actions evaluate and return a value. When an action function is called on an RDD, all the queued data-processing requests are computed and the result is returned. Actions include, for example, reduce, collect, count, first, take and countByKey. Transformations in practice: the two most common transformations you will probably use are map() and filter(). The map() transformation takes a function and applies it to every element of the RDD, the result of the function becoming the new value of each element in the resulting RDD. The filter() transformation takes a function and returns an RDD containing only the elements that pass the filter function. Actions in practice: the most common action on basic RDDs is reduce(), which takes a function operating on two elements of the RDD's type and producing a new element of the same type. A simple example of such a function is +, which we can use to sum the elements of an RDD.
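A minimal sketch in Spark's Java API (assuming Java 8 lambdas and a local master; the numeric values are invented for the example) illustrating the lazy transformations and the reduce action described above:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddExample").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> numbers = sc.parallelize(values);

        // Transformations are lazy: nothing is computed yet.
        JavaRDD<Integer> doubled = numbers.map(x -> x * 2);
        JavaRDD<Integer> even = doubled.filter(x -> x % 2 == 0);

        // The reduce action triggers the actual computation and returns a value.
        int sum = even.reduce((a, b) -> a + b);
        System.out.println("sum = " + sum);

        sc.stop();
    }
}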
• 46. Spark: cluster manager. A Spark cluster consists of a master and one or several workers. The cluster must be started and remain active to be able to execute applications. The master's only responsibility is managing the cluster, so it does not execute MapReduce code. The workers, on the other hand, are the executors: they are the ones bringing resources to the cluster, namely memory and processing cores. To execute a processing job on a Spark cluster, an application must be submitted, and its processing is driven by a driver. Two execution modes are possible: - Client mode: the driver is created on the machine that submits the application. - Cluster mode: the driver is created inside the cluster. Communication within the cluster: the workers establish a bidirectional communication with the master: the worker connects to the master to open a channel in one direction, then the master connects to the worker to open a channel in the opposite direction. The various nodes of the cluster must therefore be able to reach one another correctly (DNS resolution). Communication between nodes uses the Akka framework; this is useful to know in order to identify the log lines covering exchanges between nodes. The nodes of the cluster (master and workers) also expose a web interface for monitoring the state of the cluster and the progress of jobs. Every node therefore opens two ports: - A port for internal communication: 7077 by default for the master, a random port for the workers. - A port for the web interface: 8080 by default for the master, 8081 by default for the workers. Creating a Spark cluster harnesses the processing power of several machines, and its implementation is relatively simple: you simply have to take care to make the master fault tolerant by using ZooKeeper. Moreover, executing an application requires no modification of the processing code; the main constraint is to make sure that the data is read and written from distributed systems.
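As a small illustration of pointing an application at a cluster (the host name master-host and the jar path are placeholders, not values from these slides), the master URL can be set in code or left to spark-submit:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ClusterConfigExample {
    public static void main(String[] args) {
        // Standalone cluster: point the driver at the master's internal port (7077 by default).
        // For local tests, "local[*]" uses every core of the current machine instead.
        SparkConf conf = new SparkConf()
                .setAppName("ClusterConfigExample")
                .setMaster("spark://master-host:7077");

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Connected, default parallelism = " + sc.defaultParallelism());
        sc.stop();

        // Alternatively, leave setMaster() out and choose the mode at submission time, e.g.:
        //   spark-submit --master spark://master-host:7077 --deploy-mode cluster \
        //       --class ClusterConfigExample app.jar
    }
}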
• 48. SparkSQL: SchemaRDD Spark works with structured and semi-structured data. Structured data is any data that has a schema, that is, a known set of fields for each record. When you have this type of data, Spark SQL makes it both easier and more efficient to load and query. In particular, Spark SQL provides three main capabilities: - It can load data from a variety of structured sources: JSON, Hive… - It lets you query the data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as business intelligence tools like Tableau. - When used inside a Spark program, SparkSQL provides rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more. Many jobs are easier to write using this combination. To implement these capabilities, SparkSQL provides a special type of RDD called SchemaRDD. A data-preparation step is needed so that the SQL interpreter knows the data: the RDD concept is reused and simply enriched with a schema, and the resulting class is SchemaRDD. Two options exist to build a SchemaRDD: - Using the generic Row type and describing the schema manually. - Using custom types and letting Spark SQL discover the schema by reflection (a short sketch of this option follows).
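A minimal sketch of the reflection option against the Spark 1.x SchemaRDD API described here; the file people.txt and the Person fields are placeholders, not values from this document:

case class Person(nom: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD                  // implicit RDD -> SchemaRDD conversion

// Build an RDD of case-class instances: Spark SQL infers the schema by reflection
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")                 // expose the SchemaRDD to SQL
sqlContext.sql("SELECT nom FROM people WHERE age >= 18").collect().foreach(println)

The alternative is to build an RDD of generic Row objects and attach a schema described by hand.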
• 49. SparkSQL/UI: SQLContext jobs statistics (SQLContext: Scala)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)       // SQL context on top of the SparkContext
val people = sqlContext.jsonFile("./data/data.json")            // load a JSON file as a SchemaRDD
people.printSchema()                                            // show the schema inferred from the JSON
people.registerTempTable("people")                              // expose it as a temporary SQL table
people.count()
people.first()
val r = sqlContext.sql("select * from people where nom='allaoui'").collect()
• 51. SparkSQL with cassandraRDD
import org.apache.commons.lang.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.*;
import com.datastax.spark.connector.japi.CassandraRow;

public class FirstRDD {
  public static void main(String[] args) {
    // Spark configuration pointing to a local Cassandra node
    SparkConf conf = new SparkConf(true)
        .set("spark.cassandra.connection.host", "127.0.0.1")
        .setAppName("CassandraConnection")
        .setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext("local[*]", "test", conf);

    // Read the table java_api.products and map each CassandraRow to its String form
    JavaRDD<String> cassandraRowsRDD = javaFunctions(sc).cassandraTable("java_api", "products")
        .map(new Function<CassandraRow, String>() {
          @Override
          public String call(CassandraRow cassandraRow) throws Exception {
            return cassandraRow.toString();
          }
        });

    System.out.println("--------------- SELECT CASSANDRA ROWS ---------------");
    System.out.println("Data as CassandraRows: \n" + StringUtils.join(cassandraRowsRDD.toArray(), "\n"));
  }
}
  • 52. Aggregate data with cassandraRDD and sparkSQL queries
• 53. Spark Actions with CassandraRowRDD
SparkConf conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", "127.0.0.1")
    .setAppName("SparkCassandra")
    .setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext("local[*]", "CassandraActionsSpark", conf);

// Read the table network.storage as an RDD of Strings
JavaRDD<String> cassandraRowsRDD = javaFunctions(sc).cassandraTable("network", "storage")
    .map(new Function<CassandraRow, String>() {
      @Override
      public String call(CassandraRow cassandraRow) throws Exception {
        return cassandraRow.toString();
      }
    });

// A few Spark actions on the resulting RDD
System.out.println(StringUtils.join(cassandraRowsRDD.toArray()));
System.out.println(cassandraRowsRDD.first());
System.out.println(cassandraRowsRDD.collect());
System.out.println(cassandraRowsRDD.id());
System.out.println(cassandraRowsRDD.countByValue());
System.out.println(cassandraRowsRDD.name());
System.out.println(cassandraRowsRDD.count());
System.out.println(cassandraRowsRDD.partitions());
• 54. Action functions with cassandraRowsRDD and SparkSQL queries
• 55. SparkUI: Monitoring
Scala example:
package scala.example

object Person {
  var firstname: String = "CHAKER"
  var lastname: String  = "ALLAOUI"

  def show(): Unit = {
    println("Firstname : " + firstname + " Lastname : " + lastname)
  }
}

object Main {
  def main(args: Array[String]): Unit = {
    println("Show Infos")
    Person.show()          // a Scala object is a singleton: no "new" is needed
  }
}

SparkContext: Scala
val textFile = sc.textFile("DATA.txt")
textFile.count()
textFile.first()
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
textFile.filter(line => line.contains("Spark")).count()

There are several ways to monitor Spark applications, of which SparkUI is the pillar, alongside external instrumentation. Each SparkContext launches a Web UI, by default on port 4040, which shows useful information about the application. This includes: a list of scheduler stages and tasks; a summary of RDD sizes and memory usage; environment information; information about the running executors. You can access this interface simply by opening "http://hostname:4040" in a Web browser. If multiple SparkContexts run on the same host, they are available on successive ports starting with 4040, then 4041, 4042, and so on. Note that this information is only available for the lifetime of the application by default. To see SparkUI afterwards and take advantage of the monitoring interface, set "spark.eventLog.enabled" to "true". This configures Spark to record the events containing the information needed to monitor Spark jobs in SparkUI and to visualize the persisted data (a configuration sketch follows).
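A hedged configuration sketch for the event log; the application name and the directory path are placeholders, and the directory must exist before the application starts:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MonitoredApp")                              // placeholder application name
  .set("spark.eventLog.enabled", "true")                   // persist UI events beyond the application's lifetime
  .set("spark.eventLog.dir", "file:///tmp/spark-events")   // placeholder log directory
val sc = new SparkContext(conf)
// While the application runs, the live UI stays reachable at http://<driver-host>:4040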
• 58. MapReduce: map, key, value, reduce (SparkContext MapReduce: Scala)
import java.lang.Math
val textFile = sc.textFile("DATA.txt")
// map each line to its number of words, then reduce to the longest line
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
// classic word count: flatMap to words, map to (word, 1) pairs, reduce by key
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()

Spark offers an alternative to MapReduce because it executes jobs in micro-batches, at intervals of five seconds or less: a kind of fusion between batch and (near) real time (a micro-batch sketch follows this slide). It also provides more stability than other real-time processing tools, such as Twitter Storm grafted onto Hadoop. The software can be used for a wide variety of purposes, such as real-time data analysis and, thanks to a software library, more computation-heavy jobs involving machine learning and graph processing. With Spark, developers can simplify the complexity of MapReduce code and write data-analysis queries in Java, Scala or Python, using a set of 80 high-level operators. With version 1.0 of Spark, Apache now offers a stable API, which developers can use to interact with their own applications. Another novelty of version 1.0 is the Spark SQL component for accessing structured data, allowing this data to be queried alongside unstructured data during an analytical operation. Apache Spark is of course compatible with the HDFS file system (Hadoop Distributed File System), as well as with other components such as YARN (Yet Another Resource Negotiator) and the distributed database HBase. The University of California, and more precisely the AMP Lab (Algorithms, Machines and People) at Berkeley, is at the origin of the development of Spark, which the Apache Foundation adopted as a project in June 2013. IT companies such as Cloudera, Pivotal, IBM, Intel and MapR have already begun to integrate Spark into their Hadoop distributions.
http://www.lemondeinformatique.fr/actualites/lire-la-fondation-apache-reveille-hadoop-avec-spark-57639.html
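An illustrative micro-batch sketch using Spark Streaming; the 5-second interval echoes the text above, while the socket source localhost:9999 is a placeholder, not a value from this document:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[*]").setAppName("MicroBatchWordCount")
val ssc = new StreamingContext(conf, Seconds(5))            // one micro-batch every 5 seconds
val lines = ssc.socketTextStream("localhost", 9999)         // placeholder TCP source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                              // word counts for each 5-second batch
ssc.start()
ssc.awaitTermination()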
• 61. Data-Driven Documents Presentational Big Data Design Rich Components The exponential production of data by computer systems is a well-established fact, and this reality feeds the Big Data phenomenon. Statistical or predictive analysis has to call on the art of visual data representation to give the data meaning and understand it better. Data visualization is set to take a growing place, in proportion to the volume of data produced by information systems. As such, we are absolutely convinced that the D3 library, the subject of this article, will take its full place, and not only because of its esthetic qualities. Created by Mike Bostock, D3 is often presented as a graphics library, whereas its acronym (D3 for Data-Driven Documents) shows that it is first of all, like jQuery, a Javascript library that facilitates the manipulation of the DOM tree. D3 implements routines for loading external data, with the JSON, XML, CSV and text formats supported natively. The developer writes the logic that transforms the data into HTML or SVG elements to obtain a representation of it. Thus the representation can take the form of a table (HTML elements) as well as a curve (SVG elements). D3 therefore makes it possible to produce Data-Driven Documents, several examples of which are available on the site www.d3js.org. JSON File presentation: DENDROGRAM
• 62. D3.js (or D3 for Data-Driven Documents) is a Javascript graphics library that allows the display of digital data in a graphic and dynamic form. It is an important tool for conforming to W3C standards, using the common SVG, Javascript and CSS technologies for data visualization. D3 is the official successor of the earlier Protovis framework. Unlike other libraries, it allows fuller control over the final visual result. Its development became popular in 2011, with the release of version 2.0.0; by August 2012 the library had reached version 2.10.0. Integrated into an HTML page, the Javascript library D3.js uses pre-built Javascript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects or tooltips to them. These objects can also be styled on a large scale by means of the well-known CSS language. Furthermore, large databases with associated values can feed the Javascript functions to generate conditional and/or rich graphic documents. These documents are most often graphs. The databases can come in numerous formats, most often JSON, CSV or GeoJSON. Data analysis, then, is the process of examining and interpreting data in order to develop answers to questions. The main stages of the analysis process consist in identifying the subjects of analysis, determining the availability of appropriate data, deciding on the methods to use to answer the questions of interest, applying the methods, and evaluating, summarizing and communicating the results. InputStream presentation: Pie Chart
• 65. SMART Data SMART DATA Presentational Design Support Decision Smart Today, Big Data is for marketing managers both an incredible source of data on consumers and, at the same time, an incredible challenge to be met. "Digital" marketing strategies now take into account texts, conversations, behavior, etc. in an environment where the volume of this information to be handled grows exponentially. It would thus be completely unrealistic to imagine managing all of this data. The stake of digital marketing is therefore now the intelligent management of Big Data to identify, classify and exploit the significant consumer information that allows marketing professionals to set up their strategies. Smart Data is the process that turns raw data into highly qualified information on each consumer. The objective is to have a 360° view of customers, based on information collected through suitable marketing mechanisms, whether classic or innovative (quizzes, social networks, purchases at checkout, use of mobile applications, geolocation, etc.). To get there, companies equip themselves with cross-channel marketing platforms capable of storing and analyzing every piece of information in order to "push" the right message at the best moment to every consumer. The final goal is not only to attract new customers but above all to increase their satisfaction and their loyalty by anticipating their needs. That means, among other things, establishing a real dialogue with each customer and effectively measuring the marketing and commercial performance of the brand. Targeting finely according to several criteria while respecting customer preferences, and managing the customization, relevance and coherence of cross-channel messages delivered by e-mail, mail, Web and call center, have become imperatives that Smart Data finally makes it possible to tackle effectively. Let us forget "Big" and focus on "Smart", because the relevance of marketing strategies will always depend on the quality of customer data.
• 66. SMART DATA: Data Transformations Data Sources: WebSockets, TCP/UDP, InputStream Data Movement: Apache Storm, Apache Kafka Data Storage: Apache Cassandra, Apache Spark Data Presentation: Data-Driven Documents Integrated into an HTML Web page, the Javascript library D3.js uses pre-built Javascript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects or tooltips to them. These objects can also be styled on a large scale by means of the well-known CSS language. Furthermore, large databases with associated values can feed the Javascript functions to generate conditional and rich graphic documents. SMART DATA: Process Scenario Synthesis
• 68. Keywords WebSockets – HTTP – TCP/UDP – InputStream – TwitterStream – WebServices – JSON – Data – Big Data – SMART DATA – Process Big Data – Business Intelligence – Data Storage – Data Sources – Data Presentation – Data Mining – Data Exploration – Apache Storm – Apache Zookeeper – Apache Kafka – Apache Cassandra – Apache Spark – SparkSQL – SparkUI – D3 – Data-Driven Documents – Storm cluster – StormUI – Storm Topology – Zookeeper cluster – Distributed server – Topics – Message – Queue – Data Transmit – OpsCenter – EasyCassandra – Keyspace – Column – CQLSH – CassandraColumn – CassandraRow – Cassandra cluster – Storage Data – Aggregation – RDD – SchemaRDD – Spark Actions – Spark Transformations – Spark cluster – MapReduce – Jobs – Stages – Executors – Data Transformations – SMART – Apache
• 69. Links Apache Storm https://storm.apache.org Apache Zookeeper https://zookeeper.apache.org Apache Kafka http://kafka.apache.org/ Apache Cassandra http://cassandra.apache.org Data-Driven Documents http://d3js.org/ Apache Spark https://spark.apache.org
• 70. Idea Create Refine Contact Visit my profile on LinkedIn Visit my website http://tn.linkedin.com/in/chakerallaoui http://allaoui-chaker.github.io