This document provides an agenda and slides for a presentation on introducing big data concepts using open source tools. The presentation covers ingesting and analyzing sample data using Spark SQL, including joining datasets to count the number of books by author. It also demonstrates basic machine learning by loading sample revenue data, applying data quality rules to correct anomalies, and using linear regression to predict revenue for a party of 40 guests. The goal is to make big data concepts accessible to audiences of all experience levels.
3. News…
๏ Director of Engineering for WeExperience
๏ Hiring a team of talented engineers to work with us
๏ Front end
๏ Mobile
๏ Back end & data
๏ AI
๏ Shoot a tweet at @jgperrin
9. ๏ What is Big Data?
๏ What is Spark?
๏ What can I do with Spark?
๏ What is an app, anyway?
๏ Install a bunch of software
๏ A first example
๏ Understand what just happened
๏ Another example, slightly more complex, because you are now ready
๏ But now, sincerely what just happened?
๏ Let’s do AI!
๏ Going further
Agenda
10. Biiiiiiiig Data: 3 Vs, 4 Vs, 5 Vs…
๏ volume
๏ variety
๏ velocity
๏ variability
๏ value
Sources: https://en.wikipedia.org/wiki/Big_data, https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data
11. Data is considered big when it needs more than one computer to be processed.
14. An analytics operating system?
Apps
Analytics OS
Distributed OS
OS | OS
Hardware | Hardware
16. Some use cases
๏ NCEatery.com
๏ Restaurant analytics
๏ 1.57×10^21 datapoints analyzed
๏ Lumeris
๏ General compute
๏ Distributed data transfer/pipeline
๏ CERN
๏ Analysis of the scientific experiments in the LHC (Large Hadron Collider)
๏ IBM
๏ Watson Data Studio
๏ Event Store - http://jgp.net/2017/06/22/spark-boosts-ibm-event-store/
๏ And much more…
17. What does a typical app look like?
๏ Connect to the cluster
๏ Load data
๏ Do something with the data
๏ Share the results
20. Get all the S T U F F
๏ Go to http://jgp.net/ato2018
๏ Install the software
๏ Access the source code
21. Download some tools
๏ Java JDK 1.8
๏ http://bit.ly/javadk8
๏ Eclipse Oxygen or later
๏ http://bit.ly/eclipseo2
๏ Other nice-to-haves
๏ Maven
๏ SourceTree or git (command line)
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
http://www.eclipse.org/downloads/eclipse-packages/
24. Lab #1 - ingestion
๏ Goal
In a Big Data project, ingestion is the first operation.
You get the data “in.”
๏ Source code
https://github.com/jgperrin/net.jgp.books.spark.ch01
25. Getting deeper
๏ Go to net.jgp.books.spark.ch01
๏ Open CsvToDataframeApp.java
๏ Right click, Run As, Java Application
26. +---+--------+--------------------+-----------+--------------------+
| id|authorId| title|releaseDate| link|
+---+--------+--------------------+-----------+--------------------+
| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|
| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|
| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|
| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|
| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|
+---+--------+--------------------+-----------+--------------------+
only showing top 5 rows
27. package net.jgp.books.sparkWithJava.ch01;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDataframeApp {

  public static void main(String[] args) {
    CsvToDataframeApp app = new CsvToDataframeApp();
    app.start();
  }

  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    // Reads a CSV file with header, called books.csv, stores it in a dataframe
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // Shows at most 5 rows from the dataframe
    df.show(5);
  }
}
/jgperrin/net.jgp.books.sparkWithJava.ch01
30. [Stack diagram: Your application sits on a unified API over Spark SQL, Spark Streaming, machine learning (& deep learning & artificial intelligence), and GraphX, which run across a cluster of nodes (Node 1 … Node 8, each with its own OS and hardware).]
31. [Same stack diagram: the dataframe sits between your application, the unified API (Spark SQL, Spark Streaming, machine learning & deep learning & artificial intelligence, GraphX), and the nodes' operating systems; the hardware is now abstracted away.]
32. [Zoom on the stack: Spark SQL, Spark Streaming, machine learning & deep learning & artificial intelligence, and GraphX, all built on the dataframe.]
33. Lab #2 - a bit of analytics
But really just a bit
34. Lab #2 - a little bit of analytics
๏ Goal
From two datasets, one containing books, the other
authors, list the authors with the most books, sorted
by number of books
๏ Source code
https://github.com/jgperrin/net.jgp.labs.spark
35. If it were in a relational database
books.csv:
๏ id: integer
๏ authorId: integer
๏ title: string
๏ releaseDate: string
๏ link: string
authors.csv:
๏ id: integer
๏ name: string
๏ link: string
๏ wikipedia: string
36. Basic analytics
๏ Go to net.jgp.labs.spark.l200_join.l030_count_books
๏ Open AuthorsAndBooksCountBooksApp.java
๏ Right click, Run As, Java Application
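The repository has the full Spark version of this lab; as a plain-Java sketch of the same join-and-count logic the app expresses with join, groupBy, and count (the class and sample data here are made up for illustration):

```java
import java.util.*;
import java.util.stream.*;

public class AuthorBookCount {

  // Joins each book's authorId to the author name, counts books per author,
  // and sorts by count descending -- the logic the Spark job runs at scale.
  static List<Map.Entry<String, Long>> countBooks(
      Map<Integer, String> authors, List<Integer> bookAuthorIds) {
    Map<String, Long> counts = bookAuthorIds.stream()
        .map(id -> authors.getOrDefault(id, "Unknown"))
        .collect(Collectors.groupingBy(name -> name, Collectors.counting()));
    return counts.entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Sample data shaped like the lab's books.csv/authors.csv
    Map<Integer, String> authors =
        Map.of(1, "J. K. Rowling", 2, "Jean Georges Perrin");
    List<Integer> bookAuthorIds = List.of(1, 1, 1, 1, 2);
    List<Map.Entry<String, Long>> top = countBooks(authors, bookAuthorIds);
    System.out.println(top.get(0).getKey() + ": " + top.get(0).getValue());
    // → J. K. Rowling: 4
  }
}
```

In Spark the same result comes from a dataframe join on authorId followed by groupBy and count; the point of the lab is that the API reads almost as simply as this.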
44. Popular beliefs
General AI:
๏ Robot with human-like behavior
๏ HAL from 2001
๏ Isaac Asimov
๏ Potential ethical problems
Narrow AI (the current state of the art):
๏ Lots of mathematics
๏ Heavy calculations
๏ Algorithms
๏ Self-driving cars
45. “I am an expert in general AI”
ARTIFICIAL INTELLIGENCE is Machine Learning
46. Machine learning
๏ Common algorithms
๏ Linear and logistic regressions
๏ Classification and regression trees
๏ K-nearest neighbors (KNN)
๏ Deep learning
๏ Subset of ML
๏ Artificial neural networks (ANNs)
๏ Super CPU intensive, use of GPU
47. There are two kinds of data scientists:
1) Those who can extrapolate from incomplete data.
48. Data Engineer vs. Data Scientist
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
Data Engineer:
๏ Develops, builds, tests, and operationalizes datastores and large-scale processing systems.
๏ DataOps is the new DevOps.
๏ Matches architecture with business needs.
๏ Develops processes for data modeling, mining, and pipelines.
๏ Improves data reliability and quality.
Data Scientist:
๏ Cleans, massages, and organizes data.
๏ Performs statistics and analysis to develop insights, build models, and search for innovative correlations.
๏ Prepares data for predictive models.
๏ Explores data to find hidden gems and patterns.
๏ Tells stories to key stakeholders.
49. Data Engineer, Data Scientist: SQL
Adapted from: https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
52. Lab #3 - projecting data
๏ Goal
As a restaurant manager, I want to predict how
much revenue a party of 40 will bring
๏ Source code
https://github.com/jgperrin/net.jgp.labs.sparkdq4ml
57. Using existing data quality rules
package net.jgp.labs.sparkdq4ml.dq.udf;

import org.apache.spark.sql.api.java.UDF1;

import net.jgp.labs.sparkdq4ml.dq.service.*;

public class MinimumPriceDataQualityUdf
    implements UDF1<Double, Double> {

  public Double call(Double price) throws Exception {
    return MinimumPriceDataQualityService.checkMinimumPrice(price);
  }
}
/jgperrin/net.jgp.labs.sparkdq4ml
If the price is OK, the price is returned;
if it is not, -1 is returned.
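The service behind the UDF is not shown on the slide; a minimal sketch of what checkMinimumPrice could look like, assuming the rule simply rejects non-positive prices (the threshold is an assumption, not the repo's actual value):

```java
public class MinimumPriceDataQualityService {

  // Hypothetical threshold: a check at or below this is treated as a data error.
  private static final double MINIMUM_PRICE = 0.0;

  // Returns the price unchanged when it passes the rule, -1 when it does not,
  // so downstream SQL can keep only rows where price_no_min > 0.
  public static Double checkMinimumPrice(Double price) {
    if (price == null || price <= MINIMUM_PRICE) {
      return -1d;
    }
    return price;
  }

  public static void main(String[] args) {
    System.out.println(checkMinimumPrice(45.0)); // valid price passes through
    System.out.println(checkMinimumPrice(-3.0)); // anomaly flagged as -1
  }
}
```

Encoding the rule as a UDF keeps the data-quality logic in one place, testable outside Spark, while the dataframe pipeline stays declarative.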
58. Telling Spark to use my DQ rules
SparkSession spark = SparkSession.builder()
    .appName("DQ4ML")
    .master("local")
    .getOrCreate();

spark.udf().register(
    "minimumPriceRule",
    new MinimumPriceDataQualityUdf(),
    DataTypes.DoubleType);

spark.udf().register(
    "priceCorrelationRule",
    new PriceCorrelationDataQualityUdf(),
    DataTypes.DoubleType);
/jgperrin/net.jgp.labs.sparkdq4ml
59. Loading my dataset
String filename = "data/dataset.csv";
Dataset<Row> df = spark.read().format("csv")
    .option("inferSchema", "true")
    .option("header", "false")
    .load(filename);

df = df.withColumn("guest", df.col("_c0")).drop("_c0");
df = df.withColumn("price", df.col("_c1")).drop("_c1");
df = df.withColumn(
    "price_no_min",
    callUDF("minimumPriceRule", df.col("price")));
df.createOrReplaceTempView("price");
df = spark.sql(
    "SELECT guest, price_no_min AS price FROM price WHERE price_no_min > 0");

Using CSV here, but it could be Hive, JDBC, you name it…
/jgperrin/net.jgp.labs.sparkdq4ml
65. Format the data for ML
๏ Convert/adapt the dataset to features and label
๏ Required for linear regression in MLlib
๏ Needs a column called label of type double
๏ Needs a column called features of type VectorUDT
66. Format the data for ML
spark.udf().register(
    "vectorBuilder",
    new VectorBuilder(),
    new VectorUDT());

df = df.withColumn("label", df.col("price"));
df = df.withColumn("features", callUDF("vectorBuilder", df.col("guest")));

// ... Lots of complex ML code goes here ...

double p = model.predict(features);
System.out.println("Prediction for " + feature + " guests is " + p);
/jgperrin/net.jgp.labs.sparkdq4ml
68. (the complex ML code)
LinearRegression lr = new LinearRegression()
    .setMaxIter(40)
    .setRegParam(1)
    .setElasticNetParam(1);

LinearRegressionModel model = lr.fit(df);

Double feature = 40.0;
Vector features = Vectors.dense(40.0);
double p = model.predict(features);
/jgperrin/net.jgp.labs.sparkdq4ml
Define the algorithm and its (hyper)parameters
Create a model from our data
Apply the model to new data: predict
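Conceptually, fit finds the line that best maps guests to revenue, and predict evaluates that line. A toy ordinary-least-squares version in plain Java (no Spark, no regularization, invented sample numbers) shows what is being computed:

```java
public class ToyLinearRegression {
  double slope, intercept;

  // Ordinary least squares for a single feature:
  // price = slope * guests + intercept.
  void fit(double[] x, double[] y) {
    double mx = 0, my = 0;
    for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
    mx /= x.length;
    my /= y.length;
    double num = 0, den = 0;
    for (int i = 0; i < x.length; i++) {
      num += (x[i] - mx) * (y[i] - my);
      den += (x[i] - mx) * (x[i] - mx);
    }
    slope = num / den;
    intercept = my - slope * mx;
  }

  double predict(double guests) {
    return slope * guests + intercept;
  }

  public static void main(String[] args) {
    ToyLinearRegression lr = new ToyLinearRegression();
    // Invented sample: revenue grows roughly linearly with party size.
    lr.fit(new double[] {2, 4, 10, 20}, new double[] {50, 100, 250, 500});
    System.out.println(lr.predict(40)); // → 1000.0
  }
}
```

MLlib's LinearRegression solves the same problem iteratively (hence setMaxIter) with elastic-net regularization (setRegParam, setElasticNetParam), which is what lets it scale to distributed dataframes.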
69. It’s all about the base model
Step 1 (learning phase): Dataset #1 + Trainer → Model
Steps 2..n (predictive phase): Dataset #2 + the same Model → Predicted data
71. A (Big) Data Scenario
Raw Data → Ingestion → Data Quality → Pure Data → Transformation → Rich Data → Load/Publish → Data
72. Key takeaways
๏ Big Data is easier than one might think
๏ Java is the way to go (or Python)
๏ New vocabulary for using Spark
๏ You have a friend to help (ok, me)
๏ Spark is fun
๏ Spark is easily extensible
73. Going further
๏ Contact me @jgperrin
๏ Join the Spark User mailing list
๏ Get help from Stack Overflow
๏ fb.com/TriangleSpark
๏ Start a Spark meetup in Columbia, SC?
74. Going further
Spark in Action, second edition (MEAP)
by Jean Georges Perrin
published by Manning
http://jgp.net/sia
Codes: sprkans-681D, sprkans-7538, ctwopen10119 (40% off, two free books)