Scalable Predictive Analysis and The Trend with Big Data & AI

Jongwook Woo
BigDAI
CalStateLA
School of Business
Yonsei University
Oct 5 2021, Korea
Jongwook Woo, PhD, jwoo5@calstatela.edu
Big Data AI Center (BigDAI)
California State University Los Angeles

Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
• Myself
• Introduction To Big Data
• Scalable Business Intelligence
• Predictive Analysis with Big Data AI
• Summary

Myself
Experience:
Since 2002, Professor at Dept. of IS, California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC

Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits

Myself: Partners for Services

Myself: Collaborations
SOFTZEN

Collaboration with NVidia, Databricks, Oracle,
Amazon, CDH using Big Data AI
https://www.cloudera.com/more/customers/csula.html

Data Issues
Large-Scale data
Terabyte (10^12 bytes), Petabyte (10^15 bytes)
– Because of web
– IoT (Streaming data, Sensor Data) in SmartX
– Social Computing, smart phone, online game
– Bioinformatics, …
Legacy approach
• Can handle the massive data set
– Increase the storage size
– Improve the speed of CPU
• Only problem
– Too expensive

Data Handling: Traditional Way

Data Handling: Traditional Way
Becomes too Expensive

Big Data Definition
What is Big Data? Data or Systems?
Data view: Large Scale Data?
– 3 Vs, 5 Vs
• Velocity, Volume, Variety
– Many people see Big Data only from the data point of view
• Nothing new
Systems view:
– Yes, new systems for large-scale data
• Inexpensive

Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS (Google File System)
– Distributed systems on inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel computing with inexpensive computers
Own supercomputers
Published papers in 2003 (GFS) and 2004 (MapReduce)
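A minimal single-process sketch of the MapReduce idea named on this slide, assuming a word-count job. The function names are illustrative only and are not part of Hadoop's or Google's API; in a real cluster the map and reduce calls would run in parallel on many inexpensive machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different machine; here it is a plain loop.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data", "Big Data AI", "data"])
print(counts)
```

Because each map call sees only its own line and each reduce call sees only its own key, both phases can be spread over many servers without coordination.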

Data Handling: Another Way
But Works Well with the crazy massive data set
Battle of Nagashino,
1575, Japan

Not Expensive
From 2017 Korean
Blockbuster Movie,
“The Fortress”
(남한산성)

Not Expensive
http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677
AD 1409 (Year 9 of King Tae-Jong, Chosun Dynasty, Korea) By Choi family:
Choi Hae-san (崔海山) and his father Choi Mu-seon (崔茂宣)
[Ref] "Joseon's Secret Weapon: the Chongtonggi Hwacha (銃筒機火車)", written by Dosim

Big Data Solution: Large Scale Data
Big Data:
An inexpensive platform of distributed parallel computing systems
that can store large-scale data and process it in parallel
• Apache Hadoop and Spark since 2006
– Inexpensive supercomputer
– Any small company or university lab can own one

Big Data
Big Data (Hadoop, Spark, Distributed Deep Learning)
Cluster for Compute and Store
(Distributed File Systems: HDFS, GFS)
…

Big Data: Linearly Scalable
• Some people argue that a system handling only 1 ~ 3 GB of data is not Big Data
Well… add more servers to the Big Data platform as the data grows
– It is linearly scalable once built
– Ideally, n times more computing power with n servers
[Figure: from data size < 3 GB to data size > 200 TB by adding n servers]

Big Data is great for everyone: University labs and
Small Business, etc
• Big Data Analysis
Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, …
Big Data for data analysis
– How to store, compute, and analyze massive data sets?
You have your own specific data
Big companies do not have the specific data that you have
Your business data is the value
– Customer data
– Operational data

Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase), …
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView
Big Data Engineering
Big Data Analysis
Big Data Science

Big Data Analysis & Visualization
Sentiment Map of AlphaGo
Positive
Negative

K-Election 2017
(April 29 – May 9)

Popular businesses within 5 miles of CalStateLA,
USC, UCLA

Jams and other traffic incidents reported
by users in Dec 2017 – Jan 2018:
(Dalyapraz Dauletbak)

COVID 19 Dashboard
https://www.calstatela.edu/centers/hipic/covid-19-us-ca-confirmed-prediction

Big Data Prediction
Big Data Science
How to predict the future trend and pattern with the massive
dataset?
=> Machine Learning
[Figure: Deep Learning within Machine Learning within AI]

Spark
Limitation in MapReduce computing
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue for Machine Learning

Spark (Cont’d)
Spark by UC Berkeley AMPLab
Started by Matei Zaharia in 2009,
– and open sourced in 2010
In-memory storage for intermediate data
– 20 ~ 100 times faster than MapReduce
Good for Machine Learning => Big Data Science
– Iterative algorithms
Spark ML
Supports Machine Learning libraries
Processes massive data sets to build prediction models
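A toy sketch of why in-memory storage matters for iterative algorithms. It is plain Python rather than Spark, and `load_data` is a hypothetical stand-in for an expensive read from distributed storage. Gradient descent passes over the same data on every iteration, so keeping the data in memory turns hundreds of reads into one.

```python
def load_data():
    """Hypothetical stand-in for an expensive read from distributed storage."""
    load_data.calls += 1
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs with y = 2x


load_data.calls = 0


def fit_slope(iterations=200, learning_rate=0.05, cache=True):
    """Fit y = w * x by gradient descent, an iterative algorithm."""
    w = 0.0
    data = load_data() if cache else None  # read once when caching
    for _ in range(iterations):
        points = data if cache else load_data()  # re-read when not cached
        grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * grad
    return w


load_data.calls = 0
w_cached = fit_slope(cache=True)
reads_cached = load_data.calls

load_data.calls = 0
w_uncached = fit_slope(cache=False)
reads_uncached = load_data.calls

print(w_cached, reads_cached, reads_uncached)
```

Both runs reach the same slope, but the uncached run reads the data once per iteration, which is the pattern that makes disk-based intermediate storage slow for machine learning.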

Deep Learning
Machine Learning
Has been popular since Google released TensorFlow on Nov 9, 2015
Multiple Cores in GPU
– Even with multiple GPUs and CPUs
Parallel Computing in a chip
GPU (Nvidia GTX 1660 Ti)
1280 CUDA cores
Other Deep Learning Libraries
TensorFlow with Keras
PyTorch by Facebook
Apache MXNet
Caffe, Caffe2
Microsoft Cognitive Toolkit (Previously CNTK)
DeepLearning4j
…

From Neural Networks to Deep Learning
Deep learning – Different types of architectures
Generative Adversarial Networks (GAN)
Convolutional Neural Networks (CNN)
Neural Networks (NN)
Recurrent Neural Networks (RNN) & Long Short-Term Memory (LSTM)
Ref: SAP Enterprise Deep Learning with TensorFlow

Deep Learning
CNN
Image Recognition
Video Analysis
NLP for classification, prediction
RNN
Time Series Prediction
Text Analysis
– Conversation Q&A
Image/Video Captioning
Speech Recognition/Synthesis
GAN
Media Generation
– Photo Realistic Images
Human Image Synthesis: Fake faces

Data Scale Driving: Deep Learning Process
Deep Learning and Massive Data [3]
“Machine Learning Yearning” Andrew Ng 2016

[Figure: "The Chasm" between deep learning experts and Big Data engineers, scientists, analysts, etc.]
Another gap between the Deep Learning and Big Data communities [6]

Leveraging Big Data Cluster
• Existing Big Data cluster with a massive data set, analyzed without using Big Data
– Single GPU server for Deep Learning?
– Single server for Python and R traditional Machine Learning?
– Too slow in data migration, and the single server fails
[Figure: Big Data Cluster]

Deep Learning with Spark
What if we combine Deep Learning and Spark?

Leveraging Big Data Cluster with Deep Learning
• Existing Big Data cluster
Big Data Engineering
Big Data Analysis
Big Data Science
Distributed Deep Learning
– Integrate Deep Learning into the cluster
No data migration is needed, and it can leverage the
parallel computing and existing large-scale data
Big Data Cluster

Deep Learning with Spark
Deep Learning Pipelines for Apache Spark (Databricks)
BigDL, Distributed Deep Learning Library for Apache Spark (Intel)
TensorFlowOnSpark (Yahoo! Inc)
DL4J, Deeplearning4j on Spark (Skymind)
Elephas, Distributed Deep Learning with Keras & Spark

Spark ML and DDL [2-5]
[Figure: Distributed Deep Learning (DDL) libraries running Deep Learning in a Spark cluster]

Contents
• Myself
• Introduction To Big Data
• Scalable Business Intelligence
• Predictive Analysis with Big Data AI
• Big Data using GPU
• Summary

Leveraging Big Data Cluster with GPU
What if we use GPU for Big Data Cluster?

Again: Big Data Cluster with GPU
• Existing Big Data cluster with a massive data set
– Single GPU server for Machine (Deep) Learning?
– Too slow in data migration, and the single server fails
[Figure: Big Data Cluster]

Big Data Cluster: Unified Analytics Platform
Already built on site
– and mature for Data Engineering, Data Analysis, Data Science
Can we use the existing Big Data cluster with GPU?
– Can we integrate GPU into this Big Data cluster?
NVIDIA
RAPIDS and Spark

Distributed Parallel Computing using RAPIDS
• RAPIDS:
Parallel Machine Learning (ML) on GPU
• RAPIDS + Spark:
• Distributed Parallel ML in Big Data
– XGBoost:
(+) machine learning, not deep learning
(+) Leveraging Big Data
No bottleneck for large-scale data

Parallel Computing with GPU
• Apache Spark 3.0 on GPU

• Existing Big Data cluster
Big Data Engineering
Big Data Analysis
Big Data Science
– Integrate GPU chips into the cluster
– Big Data x GPU
• Improved parallelism
– Distributed Parallel x Parallel Chip Computing
No data migration is needed, and it can leverage the parallel
computing and existing large-scale data with GPU
Big Data Cluster with GPU

Case I: Traffic Data Analysis
• Dalyapraz Dauletbak, Junghoon Heo, Sooyoung Kim, Yeon Pyo Kim, and
Jongwook Woo, "Scalable Traffic Predictive Analysis for Smart City using
GPU in Big Data", KSII The 16th APIC-IST, June 20-22 2021, pp. 144-148, ISSN
2093-0542
• Columns to consider:
• Location/Time
– X and Y coordinates (Longitude & Latitude)
• Level of traffic intensity (1 - 5)
• Counts of jams/alerts
• Traffic Jam Analysis with Classification:
Found the times of traffic jams
– Rush hours from 7 am to 9 am produce a lot of traffic
– The heaviest traffic starts at 3 pm and eases after 6 pm

Features/columns in a dataset
Label to predict:
Level of traffic (0, 1)
Features:
location x, location y   X and Y coordinates of the location
date_pst*                Pacific Time of the publication of the traffic report
level                    Label: jam level (1: almost no jam, 5: standstill jam)
speed                    Driver's captured speed in mph
length                   Length of the traffic ahead on the user's route, in meters
* date_pst is split into month, day, hour, min, sec, and weekday
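A minimal sketch of the date_pst split described on this slide, using only the Python standard library. The timestamp format and the sample value are assumptions, since the slides do not show the raw column; adjust `fmt` to the real data.

```python
from datetime import datetime


def split_timestamp(date_pst, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a timestamp string into the calendar features listed on the slide.

    The input format is an assumption, not taken from the original data set.
    """
    ts = datetime.strptime(date_pst, fmt)
    return {
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "min": ts.minute,
        "sec": ts.second,
        "weekday": ts.weekday(),  # Monday = 0 ... Sunday = 6
    }


# Hypothetical sample value within the Dec 2017 - Jan 2018 collection window.
features = split_timestamp("2017-12-25 08:30:15")
print(features)
```

Separate hour and weekday columns let a tree-based classifier learn rush-hour patterns directly, which a single raw timestamp would hide.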

Experiment: H/W Specification
Dataproc cluster on GCP: Hadoop and Spark
• Spark 3.1.1 on Hadoop 3.2.2
Spark Cluster   2 worker nodes (CPU)   2 GPUs
Type            n1-highmem-32          nvidia-tesla-t4
Cores           32                     48
Memory          208 GB                 32 GB

Accuracy of Models
3 algorithms
• XGBoost, Gradient-Boosted Tree (GBT), Random Forest (RF)
XGBoost has 100% Recall, Precision, and AUC
High Recall: low FN (false negatives)
                 RF (CPU)            GBT (CPU)            XGBoost (GPU)
AUC              86.3%               89.6%                100%
Precision        0.890               0.922                1.0
Recall           0.956               0.947                1.0
Computing Time   1 hr 8 min 53 sec   3 hr 55 min 23 sec   21 sec

Computing Time to train Models
                             RF                  GBT                  XGBoost
Computing Time               1 hr 8 min 53 sec   3 hr 55 min 23 sec   21 sec
Computing Time: log10(sec)   3.62                4.15                 1.32
[Bar chart: Computing Time, log10(sec), for RF, GBT, and XGBoost]
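The log values on this slide can be reproduced with a few lines of arithmetic. This sketch assumes the log is base 10 and takes the three training times from the slide; the speedup ratio at the end is derived here and does not appear on the slide.

```python
import math


def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an hours/minutes/seconds duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Training times as reported on the slide.
train_seconds = {
    "RF": to_seconds(1, 8, 53),
    "GBT": to_seconds(3, 55, 23),
    "XGBoost": to_seconds(seconds=21),
}

# Base-10 log of the training time, rounded as on the slide.
log_seconds = {model: round(math.log10(sec), 2) for model, sec in train_seconds.items()}

# How many times longer each model takes than XGBoost on GPU.
slowdown_vs_xgboost = {
    model: sec / train_seconds["XGBoost"] for model, sec in train_seconds.items()
}

print(train_seconds)
print(log_seconds)
print({model: round(ratio, 1) for model, ratio in slowdown_vs_xgboost.items()})
```

A log scale is used because the raw times span almost three orders of magnitude, so the 21-second bar would otherwise be invisible next to the multi-hour ones.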

GCP Cluster Price
Price of a Cluster
• Number of nodes to be computationally equivalent
– Assuming the cluster is linearly scalable
                          RF        GBT        XGBoost
Equivalent No. of Nodes   199       673        2
Total Price               $753.35   $2547.77   $5.99
GCP price per hour
n1-highmem-32 (CPU)       $1.892848
nvidia-tesla-t4 (GPU)     $1.1
Total                     $2.99
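A sketch of the price arithmetic behind this slide. One reading reproduces the totals exactly: the equivalent count multiplies the 2-worker baseline cluster, and each GPU worker costs one CPU node plus one GPU per hour. That reading is an assumption inferred from the numbers, not something the slide states.

```python
CPU_NODE_PER_HOUR = 1.892848  # n1-highmem-32, from the slide
GPU_PER_HOUR = 1.1            # nvidia-tesla-t4, from the slide
WORKERS = 2                   # worker nodes in the measured cluster


def cpu_cluster_price(equivalent_count):
    """Hourly price of a CPU-only cluster scaled up by `equivalent_count`.

    Assumption: the count multiplies the 2-worker baseline cluster.
    """
    return round(equivalent_count * WORKERS * CPU_NODE_PER_HOUR, 2)


def gpu_cluster_price():
    """Hourly price of the measured cluster: each worker is a CPU node plus a GPU."""
    return round(WORKERS * (CPU_NODE_PER_HOUR + GPU_PER_HOUR), 2)


prices = {
    "RF": cpu_cluster_price(199),
    "GBT": cpu_cluster_price(673),
    "XGBoost": gpu_cluster_price(),
}
print(prices)
```

Under the linear-scalability assumption, matching the GPU cluster's training time with CPUs alone costs hundreds of times more per hour.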

Case II: Fraud Detection in Financial Data
Priyanka Purushu, Jongwook Woo, "Financial Fraud
Detection adopting Distributed Deep Learning in Big Data",
KSII The 15th APIC-IST 2020, July 5-7 2020, Seoul, Korea,
pp. 271-273, ISSN 2093-0542
Distributed Deep Learning without GPU
No publicly available data sets on financial services
• Private nature of financial transactions
– especially in the mobile money transactions domain
• PaySim
URL: https://www.kaggle.com/ntnu-testimon/paysim1

Data Understanding
Numeric attributes:
amount, oldbalanceOrg, newbalanceOrg, oldbalanceDest, newbalanceDest
Categorical attributes:
step, type, isFraud, isFlaggedFraud
String attributes:
nameOrig, nameDest

Label: isFraud
Data is imbalanced, as in other fraud data sets
isFraud has only a few positives
– which does not help in detecting a fraudulent transaction
Traditional approach
Need to generate sample data to balance the data
before building a model
– SMOTE (Synthetic Minority Over-sampling
Technique) algorithm adopted
• Minority data: raised to 11% from 0.2%
Large-scale data does not need synthetic samples,
as the data set is already large enough
• Just sample and balance the data to build a model
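A minimal sketch of the "just sample and balance" idea, shown as random undersampling of the majority class in plain Python. This is not SMOTE and not necessarily the exact procedure used in the paper; the toy records are made up and only the `isFraud` column name comes from the data set.

```python
import random


def undersample(records, label_key="isFraud", seed=42):
    """Randomly drop majority-class records until both classes are the same size."""
    positives = [r for r in records if r[label_key] == 1]
    negatives = [r for r in records if r[label_key] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Toy data: 5 fraud records among 1,000 (made-up values, not PaySim).
records = [
    {"amount": float(i), "isFraud": 1 if i % 200 == 0 else 0} for i in range(1000)
]
balanced = undersample(records)
print(len(balanced), sum(r["isFraud"] for r in balanced))
```

Undersampling discards data, so it only works when the minority class is already large in absolute terms, which is the point the slide makes about large-scale data.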

Experimental System Specification
Cluster in Google Cloud Platform
Hadoop Spark of Dataproc cluster
– Python 2.7.14, Spark 2.3.4
– Intel BigDL
Google Cloud Platform (GCP):
• Instances: n1-standard-64 (64 vCPUs, 240 GB memory, 257 TB storage)
• Number of nodes: 6
– Memory size:
• 1.44 TB = 1,440 GB (= 240 GB x 6)
– CPU:
• 384 vCPUs (= 64 vCPUs x 6), 2.0 GHz
– Storage:
• 1.542 PB = 1,542 TB (= 257 TB x 6)

Financial Data Set (Cont'd)
Size: 470 MB
6,362,620 records
Not large-scale compared with data sets measured in GB or more
But the Big Data architecture is applicable to much bigger data sets
– As it still adopts the Spark computing engine in Big Data
Attributes: 11
Predictive Analysis
The target column for predicting fraud:
– 'isFraud'
Comparing Spark ML and DDL for fraud detection
Spark ML algorithms
DT (Decision Tree)
RF (Random Forest)
LR (Logistic Regression)
DDL: Distributed Deep Learning in Spark
MFF (Multilayer Perceptron FF)
– Feed Forward (FF)
• a neural network system
– Cross Validation (CV)
– Train Split Validation (TSV)
BigDL FF (BFF)
Achieve High Recall: low FN

Summary: Accuracy and Performance
Model     Precision   Recall   AUC     Time (mins)
DT        0.976       0.975    0.976   3
RF        0.977       0.980    0.979   13
LR        0.946       0.860    0.905   3
MFF TSV   0.694       1        0.782   2
MFF CV    0.695       1        0.783   4
BFF       0.593       0.516    1       4

Summary: Confusion matrix of RF
RF should be the optimal model
It has high
– Recall: 0.980, and Precision: 0.977
Good AUC: 0.979
MFF: Recall 1
AUC is low:
– 0.782
BFF: AUC 1
Recall is low:
– 0.516
RF                   Actual Negative   Actual Positive
Predicted Negative   124,034           2,847
Predicted Positive   2,534             122,936
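A short sketch of how precision and recall follow from the RF confusion matrix. It assumes the matrix is read with rows as predictions and columns as actual classes, as laid out on the slide. If the rows and columns are read the other way around, the two values swap.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# RF counts, read with rows = predicted class and columns = actual class.
tn, fn = 124_034, 2_847   # predicted-negative row
fp, tp = 2_534, 122_936   # predicted-positive row

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 3), round(recall, 3), round(accuracy, 3))
```

Under this reading the precision is about 0.980 and the recall about 0.977, with an overall accuracy of about 0.979. Fewer false negatives raise recall, which is why the slides stress high recall for fraud detection.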

Summary: Performance
MFF TSV has the fastest computing time, about 2 minutes
Others:
– 3 – 4 mins
RF: 13 mins
[Bar chart: Computing Time (min): DT 3, RF 13, LR 3, MFF TSV 2, MFF CV 4, BFF 4]

Successful Enterprise: Business + Engineering
[Quadrant chart; axes: Business and Engineering / Technology (Cost?)]
– Low Tech (Cost?) but High Biz
– High {Biz + Tech (Cost?)}
– Low {Biz + Tech}
– High Tech but Low Biz

Collaboration
How do you adopt technology for your business?
High Tech
– Do not focus on technology itself
• Good-enough technology
Business
– Good business model
• Good-enough or latest technology?
Needs convergence and collaboration
Communication between business and engineering is needed
• Find the proper solution
– Leveraging the optimal tech
– Gaining the highest business profit

Summary
• Big Data platform for large-scale data
• High-performance solution for massive data sets
– Data Storage, Analysis, Prediction
• Unified Analytics Platform
• Big Data and AI
• Big Data
– without GPU but with Deep Learning
• GPU
– Leveraging Big Data with GPU
• Big Data Predictive Analysis Performance with GPU
Faster
More Accurate
Much cheaper

Questions?

References
1. J. Barbaresso, G. Cordahi, D. Garcia et al., “USDOT’s Intelligent Transportation Systems (ITS) ITS Strategic Plan 2015-2019,” 2014.
2. “Integrated Corridor Management,” Intelligent Transportation Systems - Integrated Corridor Management, www.its.dot.gov/research_archives/icms/. Accessed April 14, 2019.
3. J. Kestelyn, “Real-Time Data Visualization and Machine Learning for London Traffic Analysis,” Google Cloud, 2016, cloud.google.com/blog/products/gcp/real-time-data-visualization-and-machine-learning-for-london-traffic-analysis. Accessed April 14, 2019.
4. “Connected Citizens by Waze,” Waze, www.waze.com/ccp. Accessed April 14, 2019.
5. M. Schnuerle, “Louisville and Waze: Applying Mobility Data in Cities,” Harvard Civic Analytics Network Summit on Data-Smart Government, 2017.
6. Louisville Metro, “Thunder Jams, 2017 Traffic Delays,” CARTO, louisvillemetro-ms.carto.com/builder/d98732d0-1f6a-4db2-9f8a-e58026bf0d39/embed. Accessed April 14, 2019.
7. Louisville Metro, “Pothole Animation,” CARTO, cdolabs-admin.carto.com/builder/a80f62bf-98e1-4591-8354-acfa8e51a8de/embed. Accessed April 14, 2019.
8. E. Necula, “Analyzing Traffic Patterns on Street Segments Based on GPS Data Using R,” Transportation Research Procedia, Vol. 10, pp. 276–285, 2015.

References
9. J. Woo and Y. Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing,” in Proc. of International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, 2011.
10. “Pandas.io.json.json_normalize,” Pandas 0.24.2 Documentation, pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html. Accessed April 14, 2019.
11. United States, Chief Executive Office County of Los Angeles, “Cities within the County of Los Angeles,” lacounty.gov. Accessed April 14, 2019.
12. Garyericson, “What Is - Azure Machine Learning Studio,” Microsoft Docs, docs.microsoft.com/en-us/azure/machine-learning/studio/what-is-ml-studio. Accessed April 14, 2019.
13. A. Tharwat, “Classification Assessment Methods,” Applied Computing and Informatics, 2018.
14. M. Sokolova and G. Lapalme, “A Systematic Analysis of Performance Measures for Classification Tasks,” Information Processing & Management, Vol. 45, No. 4, pp. 427–437, 2009.
15. Performance of Dataframe in Spark and PySpark, https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
16. https://cities-today.com/smart-traffic-management-could-save-cities-us277-billion-by-2025/
17. https://www.greenbiz.com/article/advanced-traffic-management-next-big-thing-smart-cities
