Adios hadoop, Hola Spark! T3chfest 2015

Adiós Hadoop
Hola Spark!

@dhiguero
dhiguero@stratio.com
Daniel Higuero

Agenda
•  Introduction
•  Spark
§  Basic concepts
§  Ecosystem

VIEWER DISCRETION IS ADVISED
All elephants are innocent until proven guilty in a court of development.

Opinions expressed are solely my own and do not express the views or opinions of my employer.

Timeline (2002 to 2015)
#t3chfest2015
o  Google GFS paper
o  Google MapReduce paper
o  Hive
o  HBase
o  Hadoop: 1 TB, 910 nodes, < 4 min
o  Spark alpha-0.1
o  Spark 0.7
o  Hadoop: 103 TB, 2100 nodes, 72 min
o  Spark: 100 TB, 206 nodes, 23 min
o  Spark 1.2+

Introduction
o  What is Spark?
o  A parallel processing framework
o  History

https://spark.apache.org/
Apache Software Foundation

Map-reduce
o  Functional programming concept
o  Popularized by Google

(map 'list (lambda (x) (+ x 10)) '(1 2 3 4))
=> (11 12 13 14)

(reduce #'+ '(1 2 3 4))
=> 10

Jeff Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." OSDI (2004)

Map-Reduce
[Diagram: input data -> Map (x4) -> Reduce (x3) -> result]
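The data flow in the diagram can be sketched in a single process with plain Scala collections, without Hadoop or Spark. This is only an illustration under stated assumptions, not how either framework is implemented: `MapReduceSketch`, `run`, `mappers` and `reducers` are hypothetical names, and routing keys to reduce tasks by hash is an assumption.

```scala
object MapReduceSketch {
  // Hypothetical driver: `mappers` and `reducers` mirror the Map and Reduce boxes.
  def run[K, V](input: Seq[String], mappers: Int, reducers: Int)(
      mapFn: String => Seq[(K, V)])(reduceFn: (V, V) => V): Map[K, V] = {
    require(mappers > 0 && reducers > 0, "need at least one map and one reduce task")

    // Split the input into roughly equal chunks, one per map task.
    val chunkSize = math.max(1, math.ceil(input.size.toDouble / mappers).toInt)
    val mapOutputs: List[Seq[(K, V)]] =
      input.grouped(chunkSize).toList.map(chunk => chunk.flatMap(mapFn))

    // Shuffle (assumed hash routing): every pair goes to the reduce task
    // selected by its key, so all values of a key meet in the same task.
    val shuffled: Map[Int, List[(K, V)]] =
      mapOutputs.flatten.groupBy { case (k, _) => Math.floorMod(k.hashCode, reducers) }

    // Each reduce task combines the values of its keys; the partial results are merged.
    shuffled.values.flatMap { pairs =>
      pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(reduceFn)) }
    }.toMap
  }
}
```

For example, `MapReduceSketch.run[String, Int](lines, 4, 3)(line => line.split(" ").toSeq.map(w => (w, 1)))(_ + _)` would count words using four map chunks and three reduce tasks.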

Map-Reduce
val wordCounts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)

Input text (Source: Wikipedia):
Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms

After flatMap, Array[String]:
Apache, Spark, is, an, open-source, cluster, …

After map, Array[(String, Int)]:
(Apache, 1), (Spark, 1), (is, 1), …, (Spark, 1), (is, 1), …

After reduceByKey, Array[(String, Int)]:
(Apache, 1), (Spark, 2), (is, 2), …, (to, 4), (the, 1), …
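To try the word-count pipeline above without a Spark installation, the same three steps can be approximated on local collections. This is a sketch under assumptions: `WordCountSketch` and `wordCounts` are hypothetical names, the input is a local `Seq[String]` rather than an RDD, and Spark's `reduceByKey` is emulated with `groupBy` plus `reduce`.

```scala
object WordCountSketch {
  // Same three steps as the slide, on a local Seq[String] instead of a Spark RDD.
  def wordCounts(lines: Seq[String]): Map[String, Int] = {
    // Step 1: one entry per space-separated token.
    val words: Seq[String] = lines.flatMap(line => line.split(" "))
    // Step 2: pair every token with an initial count of 1.
    val pairs: Seq[(String, Int)] = words.map(word => (word, 1))
    // Step 3: reduceByKey is Spark-specific; here it is emulated by grouping
    // the pairs by word and summing the counts of each group.
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).reduce(_ + _)) }
  }
}
```

Because the split is on single spaces, tokens such as `Spark's` and `Berkeley.` count as separate words, which is consistent with the slide showing (Spark, 2) for the Wikipedia text.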

Advantages of Spark
o  Greater flexibility in defining transformations
o  Less use of disk storage
o  Better use of memory
o  Fault tolerance
o  Community traction

RDD
o  Basic abstraction in Spark
o  Contains the transformations to be applied to a data set
•  Immutable
•  Lazy evaluation
•  On failure, the state can be recovered
•  Control over persistence and partitioning
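The properties listed above (immutable, lazily evaluated, recoverable by re-running the recorded transformations) can be illustrated with a toy class. `ToyRdd` and `ToyRddDemo` are hypothetical names and this is not Spark's RDD implementation: there is no partitioning, persistence or distribution here, only the idea that transformations are recorded and executed when an action is called.

```scala
// Hypothetical toy class, not Spark's RDD: it only records how to compute its data.
final class ToyRdd[A](compute: () => Seq[A]) {
  // Transformations return a new ToyRdd and run nothing (immutable and lazy).
  def map[B](f: A => B): ToyRdd[B] = new ToyRdd[B](() => compute().map(f))
  def filter(p: A => Boolean): ToyRdd[A] = new ToyRdd[A](() => compute().filter(p))
  // Action: only here is the recorded chain of transformations executed.
  // Calling it again recomputes everything from the source, which is the
  // same idea that lets lost data be rebuilt after a failure.
  def collect(): Seq[A] = compute()
}

object ToyRddDemo {
  // Returns (source loads before the action, collected values, source loads after).
  def run(): (Int, Seq[Int], Int) = {
    var loads = 0
    val source = new ToyRdd[Int](() => { loads += 1; Seq(1, 2, 3, 4) })
    val result = source.map(_ * 10).filter(_ > 15) // nothing is computed yet
    val loadsBeforeAction = loads
    val values = result.collect()                  // triggers the computation
    (loadsBeforeAction, values, loads)
  }
}
```

In this sketch the source is not read until `collect()` is called, so defining transformations is cheap and only actions pay for the computation.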

Spark ecosystem
[Diagram, © databricks]

Spark core engine
o  Provides the basic abstractions and handles scheduling
[Diagram labels: RDD, DAG scheduling, cluster manager, threads, block manager, task scheduling, worker]

Spark Streaming
o  Turns a streaming source into a set of mini-batches
•  Window definition
§  Time-based

Spark Streaming
[Diagram: batch0 to batch7 along a time axis, grouped with Window = 5]
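The windowing idea in the diagram can be sketched on an in-memory list of mini-batches. This is an illustration only and does not use the Spark Streaming API: `WindowSketch` and `windows` are hypothetical names, and the window is assumed to advance by one batch at a time, which the slide does not state.

```scala
object WindowSketch {
  // Hypothetical helper: merge every run of `size` consecutive mini-batches
  // into one window, advancing by one batch each time (assumed slide interval).
  // With fewer than `size` batches, a single partial window is returned.
  def windows[A](batches: Seq[Seq[A]], size: Int): List[Seq[A]] =
    batches.sliding(size).map(window => window.flatten).toList
}
```

With the eight batches of the diagram and a window of 5, this sketch yields four windows: batch0 to batch4, batch1 to batch5, batch2 to batch6, and batch3 to batch7.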

MLlib
o  Machine Learning library
o  Useful abstractions for computation
•  Vectors, sparse matrices
o  Implementations of well-known algorithms
•  Classification, regression, collaborative filtering and clustering
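To give an idea of what a sparse abstraction buys, here is a minimal sparse vector in plain Scala. It is a sketch, not MLlib's vector classes: `SparseVectorSketch`, `SparseVec` and `dot` are hypothetical names, and storing the non-zero entries in a `Map` keyed by index is an assumption made for brevity.

```scala
object SparseVectorSketch {
  // Hypothetical minimal sparse vector: only non-zero entries are stored, keyed by index.
  final case class SparseVec(size: Int, entries: Map[Int, Double]) {
    // Dot product that visits only the stored entries of this vector;
    // indices missing from `other` contribute zero.
    def dot(other: SparseVec): Double = {
      require(size == other.size, "vectors must have the same dimension")
      entries.iterator.map { case (i, v) => v * other.entries.getOrElse(i, 0.0) }.sum
    }
  }
}
```

The cost of `dot` depends on the number of stored entries rather than on the vector dimension, which is the point of a sparse representation.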

SparkSQL
o  SQL access layer for running operations on RDDs
o  DataFrame (formerly SchemaRDD)

val people = sqlContext.parquetFile("...")
val department = sqlContext.parquetFile("...")

// people("age") is a column expression; a bare "age" > 30 would compare
// a String with an Int and not compile in Scala.
people.filter(people("age") > 30)
  .join(department, people("deptId") === department("id"))
  .groupBy(department("name"), people("gender"))

© databricks
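The filter, join and group-by steps of the query above can be mimicked on local collections to see what each one does. This is a sketch under assumptions, not the DataFrame API: `SqlSketch`, `Person`, `Department` and `countByDeptAndGender` are hypothetical names, the fields are guessed from the column names on the slide, and the final row count per group is an added aggregation that the slide does not show.

```scala
object SqlSketch {
  // Hypothetical record types standing in for the two Parquet-backed tables.
  final case class Person(name: String, age: Int, gender: String, deptId: Int)
  final case class Department(id: Int, name: String)

  // filter (age > 30) -> join (deptId == id) -> group by (department name, gender).
  // Counting the rows of each group is an assumption made for illustration.
  def countByDeptAndGender(
      people: Seq[Person],
      departments: Seq[Department]): Map[(String, String), Int] = {
    val joined: Seq[(String, String)] = for {
      p <- people if p.age > 30
      d <- departments if p.deptId == d.id
    } yield (d.name, p.gender)
    joined.groupBy(identity).map { case (key, rows) => (key, rows.size) }
  }
}
```

The nested loop here is a naive join that compares every person with every department; it is only meant to show the semantics of the three operations.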

First steps

$ wget http://www.apache.org/.../spark-1.2.0-bin-hadoop2.4.tgz
$ tar xvzf spark-1.2.0-bin-hadoop2.4.tgz
$ cd spark-1.2.0-bin-hadoop2.4
$ cp conf/spark-env.sh.template conf/spark-env.sh
$ ./bin/spark-shell
…
15/02/09 15:47:50 INFO HttpServer: Starting HTTP Server
15/02/09 15:47:50 INFO Utils: Successfully started service 'HTTP class server' on port 60416.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
scala>

http://localhost:4040

WE ARE HIRING!
Java, Scala, Ping pong, Nerf, Big Data, Spark, Hadoop, Cassandra, MongoDB, NoSQL, Passion

BIG DATA
CHILD'S PLAY
Acknowledgements: This work has been partially funded by
the Spanish Ministry of Economy and Competitiveness under
grant PTQ-13-05997
