2. HOW DO WE HANDLE EVER-GROWING DATA THAT HAS BECOME BIG DATA?
3. Basics of Spark
Core API
Cluster Managers
Spark Maintenance
Libraries
- SQL
- Streaming
- MLlib
- GraphX
Troubleshooting / Future of Spark
AGENDA
5. Readability
Expressiveness
Fast
Testability
Interactive
Fault Tolerant
Unify Big Data
Spark officially set a new record in large-scale sorting. Spark can run
computations on disk as well as make use of cached data in memory.
WHY SPARK? TINIER CODE LEADS TO ...
6. MapReduce has a very narrow scope, essentially limited to batch
processing.
Each new class of problem needed a new API to solve it.
EXPLOSION OF MAPREDUCE
10. The most basic abstraction in Spark.
Spark operations fall into two main categories:
Transformations [lazily evaluated, only storing the intent]
Actions
val textFile = sc.textFile("file:///spark/README.md")
textFile.first // action
RDD [RESILIENT DISTRIBUTED DATASET]
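As a minimal sketch of the two categories (assuming a spark-shell session where sc is already defined, and the same hypothetical README path as above):

val textFile = sc.textFile("file:///spark/README.md")
val sparkLines = textFile.filter(_.contains("Spark")) // transformation: lazy, only records the intent
sparkLines.count()                                    // action: triggers the actual computation, returns a Long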
15. A collection of elements partitioned across the nodes of the
cluster that can be operated on in parallel…
At the user level it looks like an ordinary list or array, but it is
processed in parallel to cut computation time, with fault tolerance
handled transparently.
An RDD is immutable.
Transformations are lazy and stored in a DAG.
Actions trigger DAGs.
A DAG is a directed acyclic graph of tasks.
Each action triggers a fresh execution of the graph, as sketched below.
RDD
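Because every action re-runs the DAG, caching an intermediate RDD avoids recomputation. A minimal sketch, again assuming spark-shell and a hypothetical file path:

val words = sc.textFile("file:///spark/README.md")
  .flatMap(_.split(" ")) // transformation, recorded in the DAG
  .cache()               // mark for in-memory caching on first computation
words.count()            // first action: executes the DAG and fills the cache
words.distinct().count() // second action: reuses cached words instead of re-reading the file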
24. collect()
count()
take(num)
takeOrdered(num)(ordering)
reduce(function)
aggregate(zeroValue)(seqOp, combOp)
foreach(function)
Each action returns its own result type.
saveAsObjectFile(path)
saveAsTextFile(path) // saves as a text file
External connectors:
foreach(T => Unit) // one object at a time
- foreachPartition(Iterator[T] => Unit) // one partition at a time
ACTIONS IN SPARK
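A quick tour of these actions on a toy RDD (a sketch for a spark-shell session; the data is made up):

val nums = sc.parallelize(1 to 10, numSlices = 2)
nums.collect()                             // Array(1, 2, ..., 10) gathered on the driver
nums.count()                               // 10
nums.take(3)                               // Array(1, 2, 3)
nums.takeOrdered(3)(Ordering[Int].reverse) // three largest: Array(10, 9, 8)
nums.reduce(_ + _)                         // 55
nums.aggregate(0)(_ + _, _ + _)            // 55: seqOp inside partitions, combOp across them
nums.foreachPartition(it => println(s"partition size: ${it.size}")) // runs on the executors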
32. Spark SQL is Apache Spark's module for working with
structured or semi-structured data.
It is also meant for users who are not big data specialists:
"As Spark continues to grow, we want to enable wider
audiences beyond “Big Data” engineers to leverage the power
of distributed processing."
Databricks blog (http://bit.ly/17NM70s)
SPARK SQL
33. Seamlessly mix SQL queries with Spark programs:
Spark SQL lets you query structured data inside Spark programs,
using either SQL or a familiar DataFrame API.
Connect to any data source the same way.
It executes SQL queries.
We can read data from an existing Hive installation using
Spark SQL.
When we run SQL from within another programming language we
get the result back as a Dataset/DataFrame.
SPARK SQL FEATURES
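For instance, the same query can be written in SQL or with the DataFrame API. A sketch assuming a Spark 2.x spark-shell, where spark is the SparkSession; people.json is a hypothetical input file:

val people = spark.read.json("people.json")
people.createOrReplaceTempView("people")
// The same result two ways: plain SQL ...
val adultsSql = spark.sql("SELECT name FROM people WHERE age >= 18")
// ... or the DataFrame API
val adultsDf = people.filter(people("age") >= 18).select("name")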
35. DataFrames and SQL provide a common way to access a variety
of data sources, including Hive, Avro, Parquet, ORC, JSON, and
JDBC. You can even join data across these sources.
Run SQL or HiveQL queries on existing warehouses. [Hive
integration]
Connect through JDBC or ODBC. [Standard connectivity]
It is included with Spark.
DATAFRAMES
36. Introduced in the Spark 1.3 release. It is a distributed collection of
data organized into named columns. Conceptually it is equivalent to a
table in a relational database or a data frame in R/Python.
We can create a DataFrame from:
Structured data files
Tables in Hive
External databases
An existing RDD (see the sketch below)
SPARK DATAFRAME IS
DataFrame = SchemaRDD (its name before Spark 1.3)
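For example, building a DataFrame from an existing RDD (a minimal sketch; Person and the sample rows are made up, and spark-shell provides spark and sc):

case class Person(name: String, age: Int)
import spark.implicits._
val rdd = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 28)))
val df = rdd.toDF() // schema inferred from the case class fields
df.printSchema()
df.show()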
39. Hive
Parquet
JSON
Avro
Amazon Redshift
CSV
Others
Spark SQL is recommended as the starting point for any Spark
application, as it adds:
Predicate pushdown
Column pruning (both sketched below)
and it can mix SQL with the RDD API.
SPARK SQL DATA SOURCES
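To see what predicate pushdown and column pruning buy you, consider this sketch (events.parquet is a hypothetical dataset with many columns):

val events = spark.read.parquet("events.parquet")
// Only the country and user columns are read (column pruning), and the
// country filter can be pushed down into the Parquet scan itself.
events.filter(events("country") === "NL").select("user").show()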
41. Big & fast data
Gigabytes per second
Real-time fraud detection
Marketing
Spark Streaming makes it easy to build scalable, fault-tolerant
streaming applications.
SPARK STREAMING
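The classic example is a streaming word count (a sketch; assumes an existing SparkContext sc, and localhost:9999 is a placeholder socket source):

import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(10)) // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()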
42. SPARK STREAMING COMPETITORS
Streaming data sources:
• Kafka
• Flume
• Twitter
• Hadoop HDFS
• Others (live logs, system telemetry data, IoT device data, etc.)
44. MLlib is a standard component of Spark providing machine
learning primitives on top of Spark.
SPARK MLLIB
45. MATLAB
R
EASY TO USE BUT NOT SCALABLE
MAHOUT
GRAPHLAB
SCALABLE, BUT AT THE COST OF EASE OF USE
org.apache.spark.mllib
RDD-based algorithms
org.apache.spark.ml
Pipeline API built on top of DataFrames
SPARK MLLIB COMPETITION
46. Loading the data
Extracting features
Training the model
Testing the model
The new pipeline API allows tuning, testing, and early failure
detection (see the sketch below).
MACHINE LEARNING FLOW
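A minimal sketch of that flow with the pipeline API (training is assumed to be a DataFrame with text and label columns):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training) // one call runs every stage in order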
47. Algorithms
Classification, e.g. naive Bayes
Regression
- Linear
- Logistic
Collaborative filtering by ALS (alternating least squares)
Clustering by k-means (sketched below)
Dimensionality reduction by SVD (singular value decomposition)
Feature extraction and transformations
- TF-IDF: term frequency - inverse document frequency
ALGORITHMS IN MLLIB
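As one concrete example, k-means clustering with the RDD-based API (a sketch; kmeans_data.txt is a hypothetical file of space-separated numeric features, one point per line):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
val points = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache() // MLlib makes multiple passes over the data
val model = KMeans.train(points, 2, 20) // k = 2, maxIterations = 20
println(s"cost: ${model.computeCost(points)}")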
49. The Word2Vec algorithm:
This algorithm takes an input text and outputs a set of vectors
representing a dictionary of words [to see word similarity].
We cache the RDDs because MLlib makes multiple passes over
the same data, so the in-memory cache can reduce processing
time a lot.
Breeze is the numerical processing library used inside Spark;
it gives the ability to perform mathematical operations on vectors.
MLLIB DEMO
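A sketch of what such a demo looks like with MLlib's Word2Vec (text8 stands in for whatever corpus the demo actually used):

import org.apache.spark.mllib.feature.Word2Vec
// fit() makes several passes over the input, which is why caching matters.
val corpus = sc.textFile("text8").map(_.split(" ").toSeq).cache()
val model = new Word2Vec().fit(corpus)
model.findSynonyms("spark", 5).foreach { case (word, sim) =>
  println(s"$word -> $sim")
}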
51. GraphX is Apache Spark's API for graphs and graph-parallel
computation.
Page ranking
Producing evaluations
It can be used in genetic analysis
ALGORITHMS
PageRank
Connected components
Label propagation
SVD++
Strongly connected components
Triangle count
GRAPHX - FROM A TABLE-STRUCTURED TO A GRAPH-STRUCTURED WORLD
53. Vertices (the "joints" of the graph) each have a unique ID.
Each vertex can have properties of a user-defined type and can store
metadata.
ARCHITECTURE
54. Arrows are the relations, known as edges; they can store metadata,
and each endpoint is identified by a vertex ID of type Long.
A graph is built of two RDDs: one containing the collection of
edges and one containing the collection of vertices.
55. Another component is the edge triplet: an object that exposes
the relation between each edge and its vertices, containing all the
information for each connection.
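Putting slides 53-55 together, a minimal GraphX sketch (the people and "follows" edges are made up):

import org.apache.spark.graphx.{Edge, Graph}
// Vertex IDs are Longs; vertex and edge properties are user-defined types.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)
// Triplets expose each edge together with both endpoint properties.
graph.triplets.foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))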