Jean-Georges Perrin • @jgperrin
It's painful
how much
data rules the world
All Things Open Meetup
Raleigh Convention Center • Raleigh, NC
September 15th 2021
The opinions expressed in this presentation and on the
following slides are solely those of the presenter and not
necessarily those of The NPD Group. The NPD Group does
not guarantee the accuracy or reliability of the information
provided herein.
Jean-Georges “jgp” Perrin
Software since 1983 (>$0 since 1995)
Big Data since 1984 (>$0 since 2006)
AI since 1994 (>$0 since 2010)
x13
It’s a story
about
data
April 4, 1980
Air & Space
Source:
NASA
Find & process the
data, not in Excel
Display the data in
a palatable form
Source:
Pexels
Sources:
Bureau of Transportation Statistics: https://www.transtats.bts.gov/TRAFFIC/
+----------+----------------+-----------+-----+
|month |internationalPax|domesticPax|pax |
+----------+----------------+-----------+-----+
|2000-01-01|5394 |41552 |46946|
|2000-02-01|5249 |43724 |48973|
|2000-03-01|6447 |52984 |59431|
|2000-04-01|6062 |50349 |56411|
|2000-05-01|6342 |52320 |58662|
+----------+----------------+-----------+-----+
only showing top 5 rows
root
|-- month: date (nullable = true)
|-- internationalPax: integer (nullable = true)
|-- domesticPax: integer (nullable = true)
|-- pax: integer (nullable = true)
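For reference, this output comes from the two standard Spark calls below; a minimal sketch (df being the combined dataframe the labs build later):

df.show(5);       // prints the first five rows and "only showing top 5 rows"
df.printSchema(); // prints the "root |-- ..." schema tree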
दृिष्ट • dṛṣṭi
Open-source, React & IBM Carbon-based
data visualization framework
Download at https://jgp.ai/drsti
Apply light data quality
Create a session
Create a schema
SparkSession spark = SparkSession.builder()
    .appName("CSV to Dataset")
    .master("local[*]")
    .getOrCreate();
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("month", DataTypes.DateType, false),
    DataTypes.createStructField("pax", DataTypes.IntegerType, true) });
Dataset<Row> internationalPaxDf = spark.read().format("csv")
    .option("header", true)
    .option("dateFormat", "MMMM yyyy") // parses months like "January 2000"
    .schema(schema)
    .load("data/bts/International USCarrier_Traffic_20210902163435.csv");
internationalPaxDf = internationalPaxDf
    .withColumnRenamed("pax", "internationalPax")
    .filter(col("month").isNotNull())
    .filter(col("internationalPax").isNotNull());
Dataset<Row> domesticPaxDf = spark.read().format("csv")
    .option("header", true)
    .option("dateFormat", "MMMM yyyy")
    .schema(schema)
    .load("data/bts/Domestic USCarrier_Traffic_20210902163435.csv");
domesticPaxDf = domesticPaxDf
    .withColumnRenamed("pax", "domesticPax")
    .filter(col("month").isNotNull())
    .filter(col("domesticPax").isNotNull());
/jgperrin/ai.jgp.drsti-spark
Lab #300
Ingest international passengers
Ingest domestic passengers
Dataset<Row> df = internationalPaxDf
    .join(domesticPaxDf,
        internationalPaxDf.col("month").equalTo(domesticPaxDf.col("month")),
        "outer")
    .withColumn("pax", expr("internationalPax + domesticPax"))
    .drop(domesticPaxDf.col("month"))
    .filter(col("month").$less(lit("2020-01-01").cast(DataTypes.DateType)))
    .orderBy(col("month"))
    .cache();
df = DrstiUtils.setHeader(df, "month", "Month of");
df = DrstiUtils.setHeader(df, "pax", "Passengers");
df = DrstiUtils.setHeader(df, "internationalPax", "International Passengers");
df = DrstiUtils.setHeader(df, "domesticPax", "Domestic Passengers");
DrstiChart d = new DrstiLineChart(df);
d.setTitle("Air passenger traffic per month");
d.setXScale(DrstiK.SCALE_TIME);
d.setXTitle("Period from " + DataframeUtils.min(df, "month") + " to " + DataframeUtils.max(df, "month"));
d.setYTitle("Passengers (000s)");
d.render();
/jgperrin/ai.jgp.drsti-spark
Lab #300
All my data processing
Add metadata directly to the dataframe
Configure dṛṣṭi directly on the server
Aren’t you glad we
are using Java?
An analytics operating system?
[Diagrams: on a single machine, the stack is hardware, then OS, then apps, with distribution and analytics layers squeezed in between; on a cluster, each node has its own hardware and OS, a distributed OS spans them, an analytics OS sits on top of that, and the apps run above.]
Applying to our air traffic app
[Diagram: domestic passengers (CSV) and international passengers (CSV) are ingested as dataframes, combined through an outer join into a passengers dataframe, then enhanced; the enhanced data is written out as CSV along with visualization metadata (JSON) and rendered by the dṛṣṭi visualization. Server processing happens in Spark, followed by transfer, then visualization.]
[Diagrams: the Apache Spark stack. Spark SQL, Spark Streaming, Spark MLlib (machine learning & artificial intelligence), and Spark GraphX sit behind a unified API; your application talks to that API, and Spark runs the work across the cluster's nodes (node 1, node 2, ... node 8 and beyond, each with its own OS and hardware). The dataframe is the shared abstraction across all of these libraries.]
Source:
Pexels
Same session, schema, ingestion, and data-quality code as Lab #300.
/jgperrin/ai.jgp.drsti-spark
Lab #310
Same join code as Lab #300.
Dataset<Row> dfQuarter = df
.withColumn("year", year(col("month")))
.withColumn("q", ceil(month(col("month")).$div(3)))
.withColumn("period", concat(col("year"), lit("-Q"), col("q")))
.groupBy(col("period"))
.agg(sum("pax").as("pax"),
sum("internationalPax").as("internationalPax"),
sum("domesticPax").as("domesticPax"))
.drop("year")
.drop("q")
.orderBy(col("period"));
/jgperrin/ai.jgp.drsti-spark
Lab #310
New code for quarter (e.g., May 2000: q = ceil(5/3) = 2, so the period is "2000-Q2")
Same session, schema, ingestion, and data-quality code as Lab #300.
/jgperrin/ai.jgp.drsti-spark
Lab #320
Same join code as Lab #300.
Dataset<Row> dfYear = df
.withColumn("year", year(col("month")))
.groupBy(col("year"))
.agg(sum("pax").as("pax"),
sum("internationalPax").as("internationalPax"),
sum("domesticPax").as("domesticPax"))
.orderBy(col("year"));
/jgperrin/ai.jgp.drsti-spark
Lab #320
New code for year
A (Big) Data Scenario: building a pipeline
[Diagram: ingestion brings raw data in (Bronze); data quality rules are applied to produce pure data (Silver); transformation produces rich data (Gold); publication delivers actionable data, which acts as a “cache”.]
// Combining datasets
// ...
df.write()
.format("delta")
.mode("overwrite")
.save("./data/tmp/airtrafficmonth");
/jgperrin/ai.jgp.drsti-spark
Lab #400
Saving to Delta Lake
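One assumption the snippet makes: the SparkSession must be built with Delta Lake support. A minimal sketch of that setup (standard Delta Lake configuration, not shown in the original deck; it requires the io.delta:delta-core_2.12 dependency on the classpath):

SparkSession spark = SparkSession.builder()
    .appName("Delta-enabled session")
    .master("local[*]")
    // These two settings are Delta Lake's documented session requirements.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate();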
Dataset<Row> df = spark.read().format("delta")
.load("./data/tmp/airtrafficmonth")
.orderBy(col("month"));
Dataset<Row> dfYear = df
.withColumn("year", year(col("month")))
.groupBy(col("year"))
.agg(sum("pax").as("pax"),
...
/jgperrin/ai.jgp.drsti-spark
Lab #430
Reading from Delta Lake
Can we project future traffic?
Source:
Comedy Central
Do you remember January 2020?
And March?
Source:
Pexels
• Make a model for 2000-2019
• See the projection
• Use 2020 data & imputation for
the rest of 2021
• See the projection
What now?
Source:
Pexels
Label
Feature
Use my model
Split training & test data
String[] inputCols = { "year" };
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("features");
df = assembler.transform(df);
LinearRegression lr = new LinearRegression()
    .setMaxIter(10)
    .setRegParam(0.5)
    .setElasticNetParam(0.8)
    .setLabelCol("pax");
int threshold = 2019;
Dataset<Row> trainingData = df.filter(col("year").$less$eq(threshold));
Dataset<Row> testData = df.filter(col("year").$greater(threshold));
LinearRegressionModel model = lr.fit(trainingData);
Integer[] l = new Integer[] { 2020, 2021, 2022, 2023, 2024, 2025, 2026 };
List<Integer> data = Arrays.asList(l);
Dataset<Row> futuresDf = spark.createDataset(data, Encoders.INT())
    .toDF()
    .withColumnRenamed("value", "year");
assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("features");
futuresDf = assembler.transform(futuresDf);
df = df.unionByName(futuresDf, true);
df = model.transform(df);
Features are a vector - let’s build one
Build a linear regression
Building my model
/jgperrin/ai.jgp.drsti-spark
Lab #500
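One thing the lab creates but never uses is testData; a minimal sketch of how one could score it, assuming Spark MLlib's RegressionEvaluator (this evaluation step is not part of the original lab):

// import org.apache.spark.ml.evaluation.RegressionEvaluator;
Dataset<Row> testPredictions = model.transform(testData);
RegressionEvaluator evaluator = new RegressionEvaluator()
    .setLabelCol("pax")
    .setPredictionCol("prediction")
    .setMetricName("rmse"); // root mean squared error
double rmse = evaluator.evaluate(testPredictions);
System.out.println("RMSE on years after " + threshold + ": " + rmse);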
Something happened in 2020…
Source:
Pexels
[Chart: passengers (label) by month (feature), showing real data for 2021, the imputed months, and the projections of the 2000-2019 and 2000-2021 models.]
Connect now to dṛṣṭi: http://172.25.177.2:3000
Dataset<Row> df2021 = df.filter(expr(
"month >= TO_DATE('2021-01-01') and month <= TO_DATE('2021-12-31')"));
int monthCount = (int) df2021.count();
df2021 = df2021
.agg(sum("pax").as("pax"),
sum("internationalPax").as("internationalPax"),
sum("domesticPax").as("domesticPax"));
int pax = DataframeUtils.maxAsInt(df2021, "pax") / monthCount; // monthly average of the observed months
int intPax = DataframeUtils.maxAsInt(df2021, "internationalPax") / monthCount;
int domPax = DataframeUtils.maxAsInt(df2021, "domesticPax") / monthCount;
List<String> data = new ArrayList<>();
for (int i = monthCount + 1; i <= 12; i++) {
data.add("2021-" + i + "-01"); // first day of each missing month
}
Dataset<Row> dfImputation2021 = spark
.createDataset(data, Encoders.STRING()).toDF()
.withColumn("month", col("value").cast(DataTypes.DateType))
.withColumn("pax", lit(pax))
.withColumn("internationalPax", lit(intPax))
.withColumn("domesticPax", lit(domPax))
.drop("value");
Extract 2021 data
/jgperrin/ai.jgp.drsti-spark
Lab #600
Calculate imputation data
Create a new dataframe from scratch with the
additional data
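The deck does not show the next step; presumably the imputed rows are appended to the real monthly data before charting and modeling, along these lines (a hedged sketch, assuming the column names line up as built above):

// Append the imputed 2021 months to the observed data;
// true allows for columns missing on either side (Spark 3.1+).
df = df.unionByName(dfImputation2021, true);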
LinearRegression lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8).setLabelCol("pax");
LinearRegressionModel model2019 = lr.fit(df.filter(col("year").$less$eq(2019)));
df = model2019
.transform(df)
.withColumnRenamed("prediction", "prediction2019");
LinearRegressionModel model2021 = lr.fit(df.filter(col("year").$less$eq(2021)));
df = model2021
.transform(df)
.withColumnRenamed("prediction", "prediction2021");
Pretty much the same code as lab #500,
except for renaming the prediction columns
/jgperrin/ai.jgp.drsti-spark
Lab #610
Reusing the same linear regression trainer for
both models,
but the resulting models are different!
It's all about the base model
[Diagram: step 1 is the learning phase, where a trainer fits a model on dataset #1; steps 2..n are the predictive phase, where that same model transforms dataset #2 into predicted data.]
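In Spark MLlib terms (a general note, not from the deck; dataset1 and dataset2 are placeholders for the diagram's datasets): the trainer is an estimator and the model is a transformer, which is exactly the split the labs use:

LinearRegressionModel model = lr.fit(dataset1);        // step 1: learning phase
Dataset<Row> predictions = model.transform(dataset2);  // steps 2..n: predictive phase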
Scientist & Engineer
There are two kinds of
data scientists:
1) Those who can
extrapolate from
incomplete data.
DATA
Engineer
DATA
Scientist
Develop, build, test, and operationalize
datastores and large-scale processing systems.
DataOps is the new DevOps.
Match architecture
with business needs.
Develop processes for
data modeling,
mining, and pipelines.
Improve data
reliability and quality.
Clean, massage, and organize data.
Perform statistics and analysis to develop
insights, build models, and search for innovative
correlations. Prepare data for
predictive models.
Explore data to find hidden
gems and patterns.
Tell stories to key
stakeholders.
Source:
Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
DATA Engineer vs. DATA Scientist
[Diagram: typical tools for each role, mostly shown as logos in the original; SQL and IBM Watson Studio are among those listed.]
Source: Adapted from https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
Conclusion
Call for action
• We always need more data
• Air Traffic @ https://github.com/jgperrin/ai.jgp.drsti-spark
• COVID-19 @ https://github.com/jgperrin/net.jgp.books.spark.ch99
• Go try & contribute to dṛṣṭi at http://jgp.ai/drsti
• Follow me on Twitter @jgperrin & YouTube /jgperrin
Key takeaways
• Spark is very fun & powerful for any data application:
• Data engineering
• Data science
• New vocabulary & concepts around Apache Spark: the dataframe and the analytics
operating system
• Machine learning & AI work better with Big Data
• Data is fluid (and it’s really painful)
Thank you! http://jgp.ai/sia
See you next month
for All Things Open!