Big Data to SMART Data: Process Scenario
Scenario of the implementation of a process that transforms raw data into exploitable, representative data, covering stream processing, distributed systems, messaging, storage in a NoSQL environment, and management within a Big Data ecosystem with graphical data visualization, using the following technologies:
Apache Storm, Apache Zookeeper, Apache Kafka, Apache Cassandra, Apache Spark and Data-Driven Documents.
4. Creative process / Analytical process
Big Data refers to the explosion in the volume of digitized data collected by private individuals, public actors, and IT applications whose user communities are on the scale of the planet. A few more or less familiar examples suffice: Google, with its search engine and its services; the so-called social networks, such as Facebook and its billion users posting images, texts, and exchanges; image- and photo-sharing and broadcasting sites such as Flickr; community sites (blogs, forums, wikis); and administrative services and their digitized exchanges.
Process Scenario: Analytical & Creative
At the center of all these data vacuums we find the Internet, the Web, and its capacity to federate billions of users in the digital space, but also a profusion of sensors of all kinds, accumulating scientific data at an unprecedented rate (satellite images, for example). To stay with the Web: all the messages, all the documents, all the images and videos there are collected by applications which, in exchange for the services they supply, accumulate immense data banks. We speak of millions of servers for Google, Facebook, or Amazon, stored in immense warehouses which, moreover, consume a non-negligible share of the electricity produced. And the movement seems to keep accelerating.
Hence the need to present a system for combining these multiple sources. The idea is thus to create a complete scenario of a process that transforms this mass of data into exploitable, presentable data, to facilitate its management and to put decision-support computing to work to analyze and federate these data.
The solution consists of open-source software, the majority of which comes from Apache projects.
7. Data
WebSockets
The WebSocket protocol aims to provide a full-duplex communication channel over a TCP socket for browsers and Web servers. The growing interactivity of Web applications, following the improvement of browser performance, quickly made it necessary to develop bidirectional communication techniques between the Web application and the server process. Techniques based on the client calling the XMLHttpRequest object, using HTTP requests with a long TTL held open by the server for a deferred answer to the client, had allowed Ajax to mitigate this lack.
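For illustration, a hedged sketch of a full-duplex exchange using the WebSocket client built into Java 11+ (well after the era of this deck), called from Scala; the endpoint ws://localhost:8080/stream is a hypothetical local server:

import java.net.URI
import java.net.http.{HttpClient, WebSocket}
import java.util.concurrent.CompletionStage

object WsDemo {
  def main(args: Array[String]): Unit = {
    val listener = new WebSocket.Listener {
      // The server can push frames at any time on the same channel.
      override def onText(ws: WebSocket, data: CharSequence,
                          last: Boolean): CompletionStage[_] = {
        println("received: " + data)
        ws.request(1) // ask for the next incoming message
        null
      }
    }
    val ws = HttpClient.newHttpClient()
      .newWebSocketBuilder()
      .buildAsync(URI.create("ws://localhost:8080/stream"), listener)
      .join()
    ws.sendText("hello", true) // the client pushes on the same open channel
    Thread.sleep(2000)         // crude wait for asynchronous replies
  }
}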
TCP/UDP
UDP is a "connectionless" protocol: when a machine A sends packets to a machine B, the flow is unidirectional. The data transmission takes place without warning the addressee (machine B), and the addressee receives the data without sending an acknowledgement of receipt to the transmitter (machine A).
TCP is connection-oriented: when a machine A sends data to a machine B, machine B is notified of the arrival of the data and attests to its good reception with an acknowledgement of receipt.
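A minimal sketch contrasting the two modes with the standard java.net classes from Scala; the hosts and ports are illustrative assumptions (the TCP connect presumes a listener on port 9998):

import java.net.{DatagramPacket, DatagramSocket, InetAddress, Socket}

object TcpUdpDemo {
  def main(args: Array[String]): Unit = {
    val host = InetAddress.getByName("localhost")
    val payload = "hello".getBytes("UTF-8")
    // UDP: the datagram leaves without any handshake or acknowledgement;
    // machine B is not warned and sends nothing back.
    val udp = new DatagramSocket()
    udp.send(new DatagramPacket(payload, payload.length, host, 9999))
    udp.close()
    // TCP: the constructor performs the connection handshake; delivery is
    // acknowledged at the transport level and bytes arrive in order.
    val tcp = new Socket(host, 9998)
    tcp.getOutputStream.write(payload)
    tcp.getOutputStream.flush()
    tcp.close()
  }
}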
InputStream
Input streams enable these processes of sending and receiving data. Streams always process data sequentially, and can be divided into several categories: character-processing streams and byte-processing streams.
Stream sources can be web services, or data flows coming from social networks such as the Twitter API.
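A minimal sketch of sequential stream consumption in Scala, wrapping a byte stream in a character stream; the URL mirrors the localhost JSON source used later with cURL and is an illustrative assumption:

import java.io.{BufferedReader, InputStreamReader}
import java.net.URL

object StreamReadDemo {
  def main(args: Array[String]): Unit = {
    // openStream() returns a byte-oriented InputStream.
    val in = new URL("http://localhost/data.json").openStream()
    // InputStreamReader turns it into a character-oriented stream.
    val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
    try {
      // Streams are consumed sequentially: each readLine() advances the cursor.
      Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
    } finally reader.close()
  }
}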
Data: WebSockets - TCP/UDP - InputStream
Binary Data
Structured Data
Unstructured Data
API Interface
http://fr.wikipedia.org (translated)
8. Read Content: cURL
To read the contents of a Web source, we will use cURL (Client URL Request Library), a command-line interface intended to retrieve the contents of a resource accessible over a network. The resource is designated by means of a URL and must be of a type supported by the software; cURL can thus be used as a REST API client.
The library supports, in particular, the FTP, FTPS, HTTP, HTTPS, TFTP, SCP, SFTP, Telnet, DICT, FILE and LDAP protocols. Writing over HTTP is possible with the POST and PUT commands.
cURL is a generic interface allowing us to handle a data flow.
The example below shows the reading of the contents of a JSON file present on a localhost server. The idea is to see how streaming software treats information received from a remote server.
Example: curl reads content with the GET method:
> curl -i -H "Accept: application/json" -H "Content-Type: application/json" http://www.host.com/data.json
cURL command: JSON Stream
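Writing follows the same pattern; a hedged sketch of sending a JSON document with the POST method (the payload is an illustrative assumption):
> curl -X POST -H "Content-Type: application/json" -d '{"nickName":"allaoui","message":"hello"}' http://www.host.com/data.json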
10. The initial definition, given by the consulting firm McKinsey and Company in 2011, focused at first on the technological question, with the famous rule of the 3 Vs: a big Volume of data, an important Variety of the same data, and a processing speed (Velocity) sometimes close to real time. These technologies were supposed to answer the explosion of data in the digital landscape, the "data deluge". These qualifiers have since evolved, with a more economic vision carried by a 4th V: the Value.
Big Data comes along with the development of applications with an analytical aim, which process data to draw meaning from it. These analyses are called Big Analytics, or "data grinding". They address complex quantitative data with distributed-computation methods. In 2001, a research paper by the META Group defined the stakes inherent in the growth of data as being three-dimensional: complex analyses indeed answer the so-called "3V" rule (Volume, Velocity, Variety). This model is still widely used today to describe the phenomenon.
The average annual worldwide growth rate of the Big Data technology and services market over the period 2011-2016 should be 31.7%; this market should thus reach 23.8 billion dollars in 2016 (according to IDC, March 2013). Big Data should also represent 8% of the European GDP in 2020.
Organize Data
Centralize Data
Combine Data
Support Decision
Big Data
The Value
http://recherche.cnam.fr/equipes/sciences-industrielles-et-technologies-de-l-information/big-data-decodage-et-analyse-des-enjeux-661198.kjsp
11. Volume: a relative dimension. Big Data, as Lev Manovich noted in 2011, formerly designated "data sets large enough to require supercomputers", but it quickly became possible to use standard software on desktop computers to analyze or co-analyze vast data sets. The volume of stored data is growing rapidly: the digital data created in the world is estimated to have gone from 1.2 zettabytes per year in 2010 to 1.8 zettabytes in 2011, then 2.8 zettabytes in 2012, and should amount to 40 zettabytes by 2020. As an example, in January 2013 Twitter was generating 7 terabytes of data every day.
Velocity: represents the frequency at which data are generated, captured, shared, and updated. Growing flows of data must be analyzed in near real time to meet the needs of time-sensitive processes. For example, the systems put in place by stock exchanges and companies must be capable of processing these data before a new generation cycle begins, with the risk for humans of losing a large part of the control of the system when the main operators become "robots" capable of issuing buy or sell orders at the nanosecond scale, without having all the analysis criteria for the medium and long term.
Variety: the volume of Big Data puts data centers in front of a real challenge: the variety of the data. This is not traditional relational data; these data are raw, semi-structured, or even unstructured. They are complex data coming from the Web, in text and image formats (image mining). They can be public (Open Data, Web of data), demographic-geographic by block (IP addresses), or the property of consumers. All of which makes them difficult to exploit with traditional data-management tools to get the best out of them.
http://fr.wikipedia.org/wiki/Big_data (translated)
13. JSON: format
JSON (JavaScript Object Notation) is a lightweight data-
interchange format. It is easy for humans to read and
write. It is easy for machines to parse and generate. It is
based on a subset of the JavaScript Programming
Language, Standard ECMA-262 3rd Edition - December
1999. JSON is a text format that is completely language
independent but uses conventions that are familiar to
programmers of the C-family of languages, including C,
C++, C#, Java, JavaScript, Perl, Python, and many others.
These properties make JSON an ideal data-interchange language.
JSON is built on two structures:
A collection of name/value pairs. In various
languages, this is realized as an object, record,
structure, dictionary, hash table, keyed list, or
associative array.
An ordered list of values. In most languages,
this is realized as an array, vector, list, or
sequence.
These are universal data structures. Virtually
all modern programming languages support
them in one form or another. It makes sense
that a data format that is interchangeable with
programming languages also be based on these
structures.
In JSON, they take on these forms:
An object is an unordered set of name/value
pairs. An object begins with { (left brace) and
ends with } (right brace). Each name is
followed by : (colon) and the name/value pairs
are separated by , (comma).
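A minimal illustration combining the two structures, an object containing a nested object and an array (the field names are chosen only for the example):
{
  "tweet": {
    "id": 1,
    "nickName": "allaoui",
    "tags": ["bigdata", "json"]
  }
}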
Structured Data
Universal
Fast Processing
Rich Components
http://json.org/
16. APACHE ZOOKEEPER
Apache Zookeeper is a framework for federating communication between distributed systems. It works by supplying a memory space shared by all the instances of the same ensemble of servers. This memory space is hierarchical, in the style of a file system composed of directories and files. A distributed system consists of several machines connected to each other, which solve a problem together, as opposed to a centralized system, a single big machine that takes everything under its responsibility. Consider the case of Google, for which no single machine could handle all the requests. The simple fact of having several machines that must work together is a source of problems, among which:
Fault tolerance: if a machine in the network breaks down, what do we do? If it is the only one carrying an important piece of information, that information is lost. In this case, we settle the question with redundancy, the information being duplicated on several machines.
Consistency of the information, in particular when it is duplicated. The purpose is to make the value of a datum independent of its source: we want to avoid every machine carrying a different version of a piece of information we need.
Load distribution: how to manage resources well, so as to avoid one machine being over-solicited while the others sit idle. How is a request emitted by a client handled, and by whom? And how do we guarantee that the result is the right one, independently of the machine that handled the question? This is the so-called consensus problem: we say there is consensus if two machines having the same initial data give the same result for the same processing.
Distributed
Local Implements
Cloud Implements
Orchestrator
http://blog.zenika.com/index.php?post/2014/10/13/Apache-Zookeeper-facilitez-vous-les-systemes-distribues2
17. Distributed systems
Communication between the instances, or machines, takes place in asynchronous mode. For example, you want to send a message to a set of instances (a cluster) to launch a processing job; you also need to know whether an instance is operational. A commonly used method of communicating is a queue, which lets you send messages directly or by topic, and read them asynchronously. Communication in cluster mode takes place exactly like local communication between several processes, and goes through coordination in the use of shared resources. Concretely, when a node writes to a file, it should take a lock, visible to all the other nodes, on said file.
A distributed file system does not want to know that such-and-such a file is on such-and-such an instance. It wants to be under the illusion of manipulating a single file system, exactly as on a single machine. The management of the distribution of resources, including what to do if an instance does not answer, should not need to be known.
A naming service: to present it, admittedly in a rough way, we would like a kind of Map<String, Object> distributed over the network, which all the instances can use; a typical example would be JNDI. The purpose of the naming service is to provide a distributed system of access to objects.
A robust request-processing system is a distributed architecture which welcomes real-time data. In an advanced structure we do not want to lose the slightest datum, such as stock-exchange orders; the idea is to perform processing that retrieves the data of an instance before it breaks down or leaves the network. Zookeeper's approach is to elect a leader at every system startup; the leader ensures the sharing of information and decides the cases in which a piece of information must be persisted in the system to keep track of it, by taking into account the opinion of the znodes, or more exactly requiring, before a piece of information is protected, a number of answers greater than n/2, where n is the number of nodes.
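A minimal, hedged sketch of this shared hierarchical space, using the ZooKeeper Java client from Scala; the connection string and the /demo path are illustrative assumptions:

import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ZkDemo {
  def main(args: Array[String]): Unit = {
    // Connect to the ensemble; the watcher receives session events.
    val zk = new ZooKeeper("localhost:2181", 3000, new Watcher {
      override def process(event: WatchedEvent): Unit =
        println("event: " + event.getState)
    })
    // Create a persistent znode carrying a small payload.
    zk.create("/demo", "hello".getBytes("UTF-8"),
      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    // Every instance connected to the ensemble sees the same data.
    println(new String(zk.getData("/demo", false, null), "UTF-8"))
    zk.close()
  }
}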
20. Apache Storm is a distributed, fault-tolerant, real-time computation system. Originally developed by the company BackType, the project became open source after the company's acquisition by Twitter. It is available under the Eclipse Public License 1.0 and, moreover, entered the incubation process of the Apache Foundation a few months ago. To continuously handle several flows of data, Storm relies on the definition of a topology.
A topology takes the shape of a directed graph in which:
Streams, symbolized by the arcs, are unlimited sequences of tuples. A tuple is a list of named values which represents the data model used by Storm.
Spouts, the root nodes of the graph, designate the sources of streams. They can be, for example, a sequence of tweets emitted via the Twitter API, a flow of logs, or data read directly from a database.
Finally, bolts are the nodes which consume the sequences of tuples emitted by one or several nodes. Their role is to carry out various operations (filters, aggregations, joins, reads from and writes to a database, etc.) and, if needed, to emit in their turn a new sequence of tuples.
Storm: concepts
Analytics
Big Data
Distributed
Topology Design
http://blog.zenika.com/index.php?post/2014/01/31/Storm-Ajouter-du-temps-reel-a-votre-BigData (translated)
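To make these concepts concrete, a minimal, hedged sketch of a topology using Storm's Java API from Scala, with the pre-Apache backtype.storm packages of that era; TweetSpout and PrinterBolt are hypothetical components written for this illustration, not classes from the Storm distribution:

import java.util.{Map => JMap}
import backtype.storm.{Config, LocalCluster}
import backtype.storm.spout.SpoutOutputCollector
import backtype.storm.task.TopologyContext
import backtype.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import backtype.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
import backtype.storm.tuple.{Fields, Tuple, Values}

// Hypothetical spout: the root of the graph, emitting an unlimited stream of tuples.
class TweetSpout extends BaseRichSpout {
  private var out: SpoutOutputCollector = _
  override def open(conf: JMap[_, _], ctx: TopologyContext,
                    collector: SpoutOutputCollector): Unit = { out = collector }
  // A real spout would poll a source (Twitter API, logs, a database...).
  override def nextTuple(): Unit = out.emit(new Values("hello storm"))
  override def declareOutputFields(d: OutputFieldsDeclarer): Unit =
    d.declare(new Fields("message"))
}

// Hypothetical bolt: consumes each tuple and prints its "message" field.
class PrinterBolt extends BaseBasicBolt {
  override def execute(t: Tuple, c: BasicOutputCollector): Unit =
    println(t.getStringByField("message"))
  override def declareOutputFields(d: OutputFieldsDeclarer): Unit = ()
}

object TopologyDemo {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("tweets", new TweetSpout)
    builder.setBolt("printer", new PrinterBolt).shuffleGrouping("tweets")
    // LocalCluster runs the topology in-process, which is convenient for testing.
    new LocalCluster().submitTopology("demo", new Config, builder.createTopology())
  }
}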
21. Stream grouping answers the following question: when a tuple is emitted, towards which bolt instances must it be routed? In other words, it is a matter of specifying how flows are partitioned between the various instances of the same spout or bolt component. For that purpose, Storm supplies a set of predefined groupings, the main ones being:
* Shuffle grouping: tuples are randomly distributed across the various bolt instances in such a way that each receives an equal number of tuples.
* Fields grouping: the flow is partitioned according to one or several fields.
* All grouping: the flow is replicated towards all the instances. This method is to be used with caution because it generates as many flows as there are instances.
* Global grouping: the whole flow is redirected towards the same instance. In the case where there are several instances for the same bolt, the flow is redirected towards the one having the smallest identifier.
Streams: grouping, worker, executor
When a topology is submitted to Storm, it distributes all the processing implemented by your components across the cluster; every component is then executed in parallel on one or several machines. Say Storm executes a topology composed of a spout and two bolts. For every topology, Storm manages a set of distinct entities:
A "worker process" is a JVM running on a machine of the cluster. Its role is to coordinate the execution of one or several components (spouts or bolts) belonging to the same topology. (The number of workers associated with a topology can change over time.)
An "executor" is a thread launched by a worker process. It is in charge of executing one or several "tasks" for a specific bolt or spout. (The number of executors for a component can likewise be changed over time.)
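A hedged sketch of how these groupings are declared when wiring a topology, reusing the hypothetical TweetSpout and PrinterBolt from the previous slide (the field name "message" is the one they declare):

import backtype.storm.topology.TopologyBuilder
import backtype.storm.tuple.Fields

val builder = new TopologyBuilder
builder.setSpout("tweets", new TweetSpout)
// Shuffle grouping: tuples are spread evenly across the bolt's four instances.
builder.setBolt("printer", new PrinterBolt, 4).shuffleGrouping("tweets")
// Fields grouping: tuples with the same "message" value reach the same instance.
builder.setBolt("counter", new PrinterBolt, 4).fieldsGrouping("tweets", new Fields("message"))
// All grouping: every instance receives a copy of every tuple.
builder.setBolt("audit", new PrinterBolt, 2).allGrouping("tweets")
// Global grouping: the whole stream goes to the instance with the lowest id.
builder.setBolt("sink", new PrinterBolt).globalGrouping("tweets")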
27. Apache Kafka
Apache published Kafka 0.8, the first major version of Kafka since the project became a top-level project of the Apache Software Foundation. Apache Kafka is a publish-subscribe messaging system implemented as a distributed transactional commit log, suited to both online and offline message consumption. It is a messaging system originally developed at LinkedIn for collecting and distributing high volumes of event and log data with low latency. The latest version includes intra-cluster replication and support for multiple data directories. Log files can be rotated by age, and log levels can be set dynamically via JMX. A performance-test tool was added, to help address performance concerns and look for improvements.
Kafka is a distributed, partitioned, and replicated commit-log service. Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages. A server in a Kafka cluster is called a broker. For every topic, the Kafka cluster maintains partitions, for load increase, parallelism, and fault tolerance.
Live Transmit
Distributed
Memory
Traceability
http://www.infoq.com/fr/news/2014/01/apache-afka-messaging-system (translated)
Apache Kafka
28. The system is checked and controlled by the consumer. A typical consumer will process the next message in the list, although it can consume messages in any order, because the Kafka cluster keeps all published messages for a configurable period of time. This makes consumers very cheap: they can come and go without much impact on the cluster, and it allows for disconnected consumers such as Hadoop clusters. Producers are able to choose the topic, and the partition within the topic, in which to publish a message. Consumers assign themselves a consumer-group name, and every message is delivered to one consumer in every subscribed consumer group. If all the consumers have different groups, then messages are broadcast to every consumer.
Kafka can be used as a traditional message middleware. It offers high throughput and has native partitioning, replication, and fault-tolerance capabilities, which makes it a good solution for large-scale message-processing applications. Kafka can also be used for high-volume Web site activity tracking: the site's activity can be published and handled in real time, or loaded into a Hadoop data-warehouse system for offline processing. Kafka can also be used as a log-aggregation solution: instead of working with files, logs can be treated as flows of messages. Kafka is used at LinkedIn, where it manages more than 10 billion writes a day with a steady load that borders 172,000 messages per second. There is massive use of its multi-subscriber support.
Kafka Topic
29. APACHE KAFKA: producer - topic - consumer
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> bin/kafka-topics.sh --list --zookeeper localhost:2181
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
toto
Hello World
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
toto
Hello World
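For comparison with the console scripts, a minimal sketch of publishing to the same "test" topic programmatically from Scala, assuming the Java producer client shipped from Kafka 0.8.2 onward:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerDemo {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // The broker address mirrors the console example above.
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    // The broker appends the message to one of the topic's partitions.
    producer.send(new ProducerRecord[String, String]("test", "Hello World"))
    producer.close()
  }
}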
32. APACHE Cassandra: NoSQL DataBase
Apache Cassandra is a database of the very fashionable NoSQL family. It is classified among the column-oriented stores, just like HBase, Apache Accumulo, and BigTable. The base was originally developed by Facebook engineers for their in-house needs before being made available to the general public as open source.
CHARACTERISTICS
· Fault tolerance: the data of a node (a node is a Cassandra instance) are automatically replicated to other nodes (different machines). So, if a node is out of order, its data remain available through the other nodes. The replication factor designates the number of nodes where a datum is replicated. Besides, Cassandra's architecture defines a cluster as a group of at least two nodes, and a data center as a set of delocalized clusters; Cassandra can thus ensure replication across different data centers. Nodes that have gone down can be replaced without any unavailability of the service.
· Decentralized: in a cluster, all nodes are equal. There is no notion of master or slave, no process which would have the management in its sole charge, and no bottleneck at the level of the network layer. The GOSSIP protocol ensures the discovery, the location, and the collection of all the information on the state of the nodes of a cluster.
· Rich data model: the data model proposed by Cassandra, based on the notion of key/value, allows numerous use cases of the Web world to be covered.
· Elastic: scalability is linear; write and read throughput increase linearly when a new server is added to the cluster. Moreover, Cassandra ensures that there will be no unavailability of the system nor interruption at the application level.
Analytics
Storage
Distributed
Memory
http://www.infoq.com/fr/articles/modele-stockage-physique-cassandra-details (translated)
33. CAP Theorem
In the world of NoSQL databases, we often hear about the CAP theorem. This theorem establishes 3 properties we can play with to configure a distributed database:
Consistency (C)
Availability (A)
Partition tolerance, i.e. tolerance to network cuts (P)
The theorem postulates that any distributed database can choose only 2 of these 3 properties, never all 3 at the same time. In theory, we can thus choose the following pairs:
a. Consistency and availability (CA), thus not tolerant to partitions.
b. Consistency and partition tolerance (CP), thus not 100% available (A is sacrificed).
c. Availability and partition tolerance (AP), thus not 100% consistent (C is sacrificed).
That is the theory. In practice, we realize that the parameter P is more or less imposed: network cuts happen, they are inevitable. As a result, the choice comes down to CP or AP. Cassandra clearly chooses AP, for absolute fault tolerance and availability. In return, Cassandra sacrifices absolute consistency (in the ACID sense of the term) in favor of eventual consistency, that is, a strong consistency obtained after convergence of the data.
CAP Theorem
http://www.infoq.com/fr/articles/cap-twelve-years-later-how-the-rules-have-changed
34. CQLSH
CQL means Cassandra Query Language, and we are at version 4. The first version was an experimental attempt to introduce a query language for Cassandra. The second version of CQL was designed to query wide rows, but was not flexible enough to adapt to all the modelling types which exist in Apache Cassandra.
However, it is recommended to use instead a secondary index positioned on the column containing the sought information. Indeed, using the Ordered-Partitioner strategy has the following consequences: sequential writing can create hotspots (if the application tries to write or update a sequential set of rows, the writes will not be distributed across the cluster); an increased overhead for administering load balancing in the cluster (administrators have to compute the token ranges manually to distribute them across the cluster); and uneven load distribution for multiple column families.
The CQLSH interface is written in Python, and thus requires a Python installation, version 2.7 or higher, to benefit from this interface of direct communication with the Cassandra database. The query language in version 4 is very similar to SQL2: several terms are the same, but their uses differ; a primary key in Cassandra, for example, is not equivalent to its relational counterpart.
Example:
CREATE TABLE developer(
developer_id bigint,
firstname text,
lastname text,
age int,
task varchar,
PRIMARY KEY(developer_id));
CQLSH 4.1 python interface
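As a short, hedged follow-up in CQL, inserting and reading back a row of this table (the values are illustrative):
INSERT INTO developer (developer_id, firstname, lastname, age, task)
VALUES (1, 'Chaker', 'Allaoui', 28, 'streaming');
SELECT * FROM developer WHERE developer_id = 1;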
35. Apache Cassandra is a system for managing a large quantity of data in a distributed way; the data can be structured, semi-structured, or completely unstructured. Cassandra was designed to be highly scalable across a large number of servers while presenting no Single Point Of Failure (SPOF). Cassandra supplies a dynamic data schema to offer maximum flexibility and performance. But to understand this tool well, it is first of all necessary to assimilate the vocabulary:
- Cluster: a cluster is a group of nodes which communicate with each other for the management of data.
- Keyspace: the equivalent of a database in the world of relational databases. Note that it is possible to have several keyspaces on the same server.
- Column: a column consists of a name, a value and a timestamp.
- Row: columns are grouped in rows. A row is represented by a key and a value.
It is possible to configure the partitioning of a column family by specifying that we want it managed with an Ordered-Partitioner strategy. This mode can indeed be of interest if we wish to retrieve a range of rows between two values (something which is not possible if the MD5 hash of the row keys is used).
C* Cluster and system architecture
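A minimal CQL sketch of this vocabulary, creating a keyspace with a replication factor and a column family inside it (the names and the SimpleStrategy choice are illustrative assumptions):
CREATE KEYSPACE demo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
USE demo;
CREATE TABLE tweet (
id uuid PRIMARY KEY,
nickName text,
message text,
time timestamp);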
36. C* cluster: nodes configuration
Cassandra Cluster Configuration with OpsCenter
37. C* Storage Data: Model
import java.io.Serializable;
import java.util.Date;
import java.util.UUID;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.apache.commons.lang3.builder.EqualsBuilder;
import org.apache.commons.lang3.builder.HashCodeBuilder;
import org.apache.commons.lang3.builder.ToStringBuilder;
import org.apache.commons.lang3.builder.ToStringStyle;
import org.easycassandra.Index;
@Entity(name = "tweet")
public class Tweet implements Serializable {
@Id
private UUID id;
@Index
@Column(name = "nickName")
private String nickName;
@Column(name = "message")
private String message;
@Column(name = "time")
private Date time;
public UUID getId() {return id; }
public void setId(UUID id) {this.id = id;}
public String getNickName() {return nickName; }
public void setNickName(String nickName) {
this.nickName = nickName;}
public String getMessage() {return message;}
public void setMessage(String message) {
this.message = message;}
public Date getTime() {return time;}
public void setTime(Date time) {this.time = time; }
@Override
public boolean equals(Object obj) {if(obj instanceof Tweet) {
Tweet other = Tweet.class.cast(obj);
return new EqualsBuilder().append(id, other.id).isEquals();}
return false;}
@Override
public int hashCode() {
return new HashCodeBuilder().append(id).toHashCode();}
@Override
public String toString() {
return ToStringBuilder.reflectionToString(this,
ToStringStyle.SHORT_PREFIX_STYLE);
}}
38. C* Storage Data: Repository-Request
import java.util.List;
import java.util.UUID;
import org.easycassandra.persistence.cassandra.Persistence;
public class TweetRepository {
private Persistence persistence;
public TweetRepository() {
this.persistence = CassandraManager.INSTANCE.getPersistence();
}
public void save(Tweet tweet) {
persistence.insert(tweet);
}
public Tweet findOne(UUID uuid) {
return persistence.findByKey(uuid, Tweet.class);
}
public List<Tweet> findByIndex(String nickName) {
return persistence.findByIndex("nickName", nickName, Tweet.class);
}
}
import java.util.Date;
import java.util.UUID;
public class App {
public static void main(String[] args) {
TweetRepository tweetRepository = new TweetRepository();
Tweet tweet = new Tweet();
UUID uuid = UUID.randomUUID();
tweet.setId(uuid);
tweet.setNickName("allaoui chaker");
tweet.setMessage("test cassandra mapping");
tweet.setTime(new Date());
tweetRepository.save(tweet);
// Read the tweet back by the key it was saved under.
Tweet findTweet = tweetRepository.findOne(uuid);
System.out.println(findTweet);
}
}
43. Apache Spark is an open-source Big Data processing framework built to perform sophisticated analyses, designed for speed and ease of use. It was originally developed by AMPLab at UC Berkeley in 2009 and open-sourced as an Apache project in 2010. Spark presents several advantages compared with other Big Data and MapReduce technologies such as Hadoop and Storm.
First, Spark proposes a complete, unified framework to meet Big Data processing needs for data sets that are diverse by nature (text, graphs) as well as by type of source.
Then, Spark allows applications on Hadoop clusters to be executed up to 100 times faster in memory and 10 times faster on disk. It lets you quickly write applications in Java, Scala, or Python and includes a set of more than 80 high-level operators. Furthermore, it can be used interactively to query the data from a shell.
Finally, besides Map and Reduce operations, Spark supports SQL queries and data streaming, and proposes machine-learning features.
Analytics
Distributed
Local Implements
Cloud Implements
Apache Spark: SQL, Streaming, ML, GraphX
http://www.infoq.com/fr/articles/apache-spark-introduction (translated)
44. RDD
Resilient Distributed Datasets (based on Matei Zaharia's research paper), or RDDs, are a concept at the heart of the Spark framework. You can see an RDD as a table in a database. It can carry any type of data and is stored by Spark across various partitions. RDDs allow computations to be rearranged and processing to be optimized. They are also fault tolerant, because an RDD knows how to recreate and recompute its data set. RDDs are immutable: to obtain a modification of an RDD, you have to apply a transformation to it, which returns a new RDD while the original remains unchanged. An RDD supports two types of operations: transformations and actions.
45. Actions/Transformations
Transformations: transformations do not return a single value; they return a new RDD. Nothing is evaluated when we call a transformation function: the function just takes an RDD and returns a new RDD. Transformation functions are, for example, map() and filter().
Actions: actions evaluate and return a new value. When an action function is called on an RDD object, all the data-processing queries are computed and the result is returned. Actions are, for example, reduce, collect, count, first, take, and countByKey.
The two most common transformations you will probably use are map() and filter(). The map() transformation takes in a function and applies it to every element in the RDD, the result of the function becoming the new value of each element in the resulting RDD. The filter() transformation takes in a function and returns an RDD containing only the elements that pass the filter function.
The most common action on basic RDDs that you will probably use is reduce(), which takes a function operating on two elements of the type in your RDD and produces a new element of the same type. A simple example of such a function is +, which we can use to sum our RDD. With reduce(), we can easily aggregate it.
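A minimal spark-shell sketch of this lazy/eager split, in the same style as the deck's other Scala examples (the numbers are illustrative):

val nums = sc.parallelize(1 to 10)
// Transformations: nothing is computed yet; each call just returns a new RDD.
val squares = nums.map(n => n * n)
val evens = squares.filter(n => n % 2 == 0)
// Action: triggers the actual computation and returns a value to the driver.
val total = evens.reduce((a, b) => a + b)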
46. Spark: cluster manager
A Spark cluster consists of a master and one or several workers. The cluster must be started and remain active to be able to execute applications. The master's only responsibility is the management of the cluster; it does not execute MapReduce code. The workers, on the other hand, are the executors: they are the ones who bring resources to the cluster, namely memory and processing cores.
To execute a processing job on a Spark cluster, you submit an application whose processing is piloted by a driver. Two execution modes are possible:
- Client mode: the driver is created on the machine which submits the application.
- Cluster mode: the driver is created inside the cluster.
Communication within the cluster: the workers establish bidirectional communication with the master. The worker connects to the master to open a channel in one direction, then the master connects to the worker to open a channel in the reverse direction. The various nodes of the cluster must therefore be able to reach each other correctly (DNS resolution). Communication between nodes uses the Akka framework, which is useful to know when identifying the log lines dealing with exchanges between nodes.
The cluster's nodes (master as well as workers) also expose a Web interface allowing the state of the cluster and the progress of processing jobs to be watched. Every node thus opens two ports:
- A port for internal communication: port 7077 by default for the master, a random port for the workers.
- A port for the Web interface: port 8080 by default for the master, port 8081 by default for the workers.
Creating a Spark cluster makes the processing power of several machines available. Its implementation is relatively simple; you will simply need to take care to make the master resilient by using ZooKeeper. Moreover, executing an application requires no modification of the processing code. The main constraint is to make sure that the data are read and written from distributed systems.
48. SparkSQL: SchemaRDD
Spark works with structured and semi-structured data. Structured data is any data that has a schema, that is, a known set of fields for every record. When you have this type of data, Spark SQL makes it both easier and more efficient to load and query. In particular, Spark SQL supplies three main capabilities:
- It can load data from a variety of structured sources: JSON, Hive, and more.
- It lets you query the data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as business-intelligence tools like Tableau.
- When used inside a Spark program, Spark SQL supplies rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables and to expose custom functions in SQL. Many jobs are easier to write using this combination. To implement these capabilities, Spark SQL supplies a special type of RDD called SchemaRDD.
A data-preparation stage is necessary to let the SQL interpreter know the data. The concept of the RDD is reused and simply needs to be enriched with a schema; the handled class becomes SchemaRDD. Two options exist to build a SchemaRDD:
- By using the generic type Row and describing the schema manually.
- By using custom types and letting Spark SQL discover the schema by reflection.
49. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.jsonFile("./data/data.json")
people.printSchema()
people.registerTempTable("people")
people.count()
people.first()
val r=sqlContext.sql("select * from people where nom='allaoui'").collect()
SparkSQL/UI: SqlContext jobs statistics
SQLContext: Scala
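The example above loads a self-describing JSON source. A hedged sketch of the reflection option, assuming a Spark 1.1/1.2-era shell where SchemaRDD is still the handled class; the file ./data/people.txt and its comma-separated fields are illustrative assumptions:

case class Person(nom: String, age: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// The implicit conversion turns an RDD of case classes into a SchemaRDD,
// the schema being discovered by reflection on the Person fields.
import sqlContext.createSchemaRDD
val people = sc.textFile("./data/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
sqlContext.sql("SELECT nom FROM people WHERE age >= 18").collect().foreach(println)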
55. package scala.example
object Person {
  var firstname: String = "CHAKER"
  var lastname: String = "ALLAOUI"
  def show() {
    println(" Firstname : " + firstname + " Lastname : " + lastname)
  }
}
object Main {
  def main(args: Array[String]) {
    println("Show Infos")
    Person.show()
  }
}
SparkUI: Monitoring
val textFile = sc.textFile("DATA.txt")
textFile.count()
textFile.first()
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
textFile.filter(line => line.contains("Spark")).count()
There are several ways to monitor Spark applications, of which SparkUI is the pillar, alongside external instrumentation. Each SparkContext launches a Web UI, by default on port 4040, which shows useful information about the application. This includes:
A list of scheduler stages and tasks.
A summary of RDD sizes and memory usage.
Environment information.
Information about the running executors.
You can access this interface by simply opening "http://hostname:4040" in a browser. If multiple SparkContexts are running on the same host, they will be available on successive ports beginning with 4040: 4041, 4042, and so on.
SparkContext: Scala
Note that this information is only available for the lifetime of the application by default. To see SparkUI's display after the fact and take advantage of the monitoring interface, set "spark.eventLog.enabled" to "true". This configures Spark to record the events containing the information necessary to monitor Spark's activity in SparkUI and to visualize the persisted data.
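A minimal sketch of enabling this from application code; the event-log directory is an illustrative assumption, and the same keys can equally be set in spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("monitoring-demo")
  // Persist scheduler events so SparkUI can be replayed after the application ends.
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark-events")
val sc = new SparkContext(conf)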
58. import java.lang.Math
val textFile = sc.textFile("DATA.txt")
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts.collect()
MapReduce: map, key, value, reduce
Spark offers an alternative to MapReduce because it executes jobs in micro-batches, with intervals of five seconds or less: a kind of fusion between batch and (near) real time. It also supplies more stability than other real-time processing tools grafted onto Hadoop, such as Twitter Storm.
The software can be used for a big variety of purposes, such as real-time data analysis and, thanks to a software library, bigger jobs for in-depth computations involving machine learning and graph processing. With Spark, developers can simplify the complexity of MapReduce code and write data-analysis queries in Java, Scala, or Python, using a set of 80 high-level routines.
With version 1.0 of Spark, Apache now proposes a stable API, which developers can use to interact with their own applications. Another novelty of version 1.0 is the Spark SQL component for accessing structured data, allowing the data to be queried alongside unstructured data during an analytical operation.
Apache Spark is of course compatible with the HDFS file system (Hadoop Distributed File System), as well as with other components such as YARN (Yet Another Resource Negotiator) and the distributed database HBase. The University of California, more exactly Berkeley's AMP Lab (Algorithms, Machines and People), is at the origin of the development of Spark, which the Apache foundation adopted as a project in June 2013. IT companies such as Cloudera, Pivotal, IBM, Intel and MapR have already begun integrating Spark into their Hadoop distributions.
SparkContext MapReduce: Scala
http://www.lemondeinformatique.fr/actualites/lire-la-fondation-apache-reveille-hadoop-avec-spark-57639.html
61. Data-Driven Documents
Presentational
Big Data
Design
Rich Components
The exponential production of data by computer systems is a well-established fact, and this reality feeds the Big Data phenomenon. Statistical and predictive analysis has to call on the art of visual data representation to give the data meaning and understand it better. Data visualization is called to take a growing place, in proportion to the volume of data produced by information systems.
As such, we are absolutely convinced that the D3 library, the object of this article, will take its full place, and not only because of its esthetic qualities. Created by Mike Bostock, D3 is often presented as a graphics library, while its acronym (D3, for Data-Driven Documents) shows that it is first of all, following the example of jQuery, a Javascript library facilitating the manipulation of a DOM tree. D3 implements routines for loading external data, with native support for the JSON, XML, CSV, and text formats. The developer writes the logic that transforms the data into HTML or SVG elements to obtain a representation of it.
So the representation can take the shape of a table (HTML elements) as well as a curve (SVG elements). D3 thus allows Data-Driven Documents to be produced, several models of which are available on the site www.d3js.org.
JSON File presentation: DENDROGRAM
62. D3.js (or D3, for Data-Driven Documents) is a Javascript graphics library which allows the display of digital data in graphic, dynamic form. It is an important tool for conforming to W3C standards, using the common SVG, Javascript, and CSS technologies for data visualization. D3 is the official successor of the earlier Protovis framework. Contrary to other libraries, it allows fuller control of the final visual result. Its development was popularized in 2011, upon the release of version 2.0 in August 2011; by August 2012 the library had reached version 2.10.0.
Integrated into an HTML page, the D3.js Javascript library uses pre-built Javascript functions to select elements, create SVG objects, style them, or add transitions, dynamic effects, or tooltips to them. These objects can also be styled on a large scale by means of the famous CSS language. Furthermore, big databases with associated values can feed the Javascript functions to generate conditional and/or rich graphic documents; these documents are most of the time graphs. Databases can come in numerous formats, most often JSON, CSV, or GeoJSON.
So, data analysis is the process which consists of examining and interpreting data to develop answers to questions. The main stages of the analysis process consist of delimiting the subjects of analysis, determining the availability of appropriate data, deciding on the methods suitable for answering the questions of interest, applying the methods, and evaluating, summarizing, and communicating the results.
InputStream presentation: Pie Chart
65. SMART Data
Presentational
Design
Support Decision
Smart
Today, Big Data is for marketing managers at once an incredible source of data on consumers and an incredible challenge to be met. "Digital" marketing strategies now take into account texts, conversations, behavior, and so on, in an environment where the volume of this information grows exponentially. It would thus be totally illusory to imagine managing all of these data. The stake in digital marketing is therefore now the intelligent management of Big Data: identifying, classifying, and exploiting the significant consumer information that allows marketing professionals to set up their strategies.
Smart Data is the process which turns raw data into ultra-qualified information about each consumer. The objective is a 360° vision of the customer, based on information collected through adapted marketing mechanisms, whether classic or innovative (quizzes, social networks, purchases at checkout, use of mobile applications, geolocation, etc.). To get there, companies equip themselves with cross-channel marketing platforms capable of storing and analyzing all this information, to "push" the right message at the best moment for every consumer. The final goal is not only to win new customers, but especially to increase their satisfaction and their loyalty by anticipating their needs.
That means, among other things, establishing a real dialogue with each customer and effectively measuring the brand's marketing and commercial performance.
Targeting finely according to several criteria while respecting customer preferences, and managing the customization, relevance, and coherence of cross-channel messages delivered by e-mail, mail, Web, and call center, have become imperatives which Smart Data finally allows us to tackle in an effective way.
Let us forget "Big" and take an interest in "Smart", because the relevance of marketing strategies will always depend on the quality of customer data.
66. SMART DATA: Data Transformations
Data Sources: WebSockets, TCP/UDP, InputStream
Data Movement: Apache Storm, Apache Kafka
Data Storage: Apache Cassandra, Apache Spark
Data Presentation: Data-Driven Documents
SMART DATA: Process Scenario
Synthesis