Jongwook Woo
BigDAI
CalStateLA
School of Business
Yonsei University
Oct 5 2021, Korea
Jongwook Woo, PhD, jwoo5@calstatela.edu
Big Data AI Center (BigDAI)
California State University Los Angeles
Scalable Predictive Analysis
and The Trend with Big Data & AI
Big Data Artificial Intelligence Center (BigDAI)
Jongwook Woo
CalStateLA
Contents
 Myself
 Introduction To Big Data
 Scalable Business Intelligence
 Predictive Analysis with Big Data AI
 Summary
Myself
Experience:
Since 2002, Professor at Dept. of IS, California State University Los Angeles
– PhD in 2001: Computer Science and Engineering at USC
Myself: S/W Development Lead
http://www.mobygames.com/game/windows/matrix-online/credits
Myself: Partners for Services
Myself: Collaborations
SOFTZEN
Collaboration with NVidia, Databricks, Oracle,
Amazon, CDH using Big Data AI
https://www.cloudera.com/more/customers/csula.html
Contents
 Myself
 Introduction To Big Data
 Scalable Business Intelligence
 Predictive Analysis with Big Data AI
 Summary
Data Issues
Large-Scale data
Terabyte (10^12), Petabyte (10^15)
– Because of web
– IoT (Streaming data, Sensor Data) in SmartX
– Social Computing, smart phone, online game
– Bioinformatics, …
Legacy approach
 Can handle the massive data set
– Increase the storage size
– Improve the speed of CPU
 Only Problem
– Too expensive
Data Handling: Traditional Way
Data Handling: Traditional Way
Becomes too Expensive
Big Data Definition
What is Big Data? Data or Systems?
Data view: Large Scale Data?
–3 Vs, 5Vs
• Velocity, Volume, Variety
–Many people only see the data point of view
• Nothing new
Systems View:
– YES, new systems for large scale data
• Non-expensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– Distributed Systems on non-expensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with non-expensive computers
Own super computers
Published papers in 2003, 2004
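The compute side (MapReduce) can be sketched in a few lines. This is a minimal single-process illustration of the map, shuffle, and reduce steps using word count, not Google's or Hadoop's implementation; the function names and the sample documents are assumptions made for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # On a real cluster each document (or file block) would be mapped on a
    # different commodity node; here everything runs in one process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big cluster", "data on commodity computers"])
```

Because each map call touches only its own document and each reduce call only its own key, both phases can be spread over many inexpensive machines.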
Data Handling: Another Way
But Works Well with the crazy massive data set
Battle of Nagashino,
1575, Japan
Data Handling: Another Way
Not Expensive
From 2017 Korean
Blockbuster Movie,
“The Fortress”
(남한산성)
Data Handling: Another Way
Not Expensive
http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677
AD 1409 (Year 9 of King Tae-Jong, Chosun Dynasty, Korea) By Choi family:
Choi Hae-san (최해산, 崔海山), whose father was Choi Mu-seon (최무선, 崔茂宣)
[Ref] "Joseon's Secret Weapon: The Chongtonggi Hwacha (銃筒機火車)", by Dosim (도심)
Big Data Solution: Large Scale Data
Big Data:
An inexpensive platform: a distributed parallel computing system that can store large-scale data and process it in parallel
 Apache Hadoop (since 2006) and Spark
– Inexpensive supercomputer
– Any small company or university lab can own one
Big Data
Big Data (Hadoop, Spark, Distributed Deep Learning)
Cluster for Compute and Store
(Distributed File Systems: HDFS, GFS)
…
Big Data: Linearly Scalable
 Some people object that a system handling only 1 to 3 GB of data is not Big Data
Well, on a Big Data platform you add more servers as the data grows
– it is linearly scalable once built
– ideally, n times more computing power
[Figure: data size grows from under 3 GB to over 200 TB by adding n servers]
Contents
 Myself
 Introduction To Big Data
 Scalable Business Intelligence
 Predictive Analysis with Big Data AI
 Summary
Big Data is great for everyone: University labs and
Small Business, etc
 Big Data Analysis
Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch,..
Big Data for Data Analysis
– How to store, compute, analyze massive dataset?
You have your specific data
Big companies do not have the specific data that you have
Your Business data is the value
– Customer data
– Operational data
Big Data Analysis and Prediction Flow
Data Collection
– Batch API: Yelp, Google
– Streaming: Twitter, Apache NiFi, Kafka, Storm
– Open Data: Government
Data Storage
– HDFS, S3, Object Storage, NoSQL DB (Couchbase), …
Data Filtering
– Hive, Pig
Data Analysis and Science
– Hive, Pig, Spark, BI Tools (Datameer, Qlik, …)
Data Visualization
– Qlik, Datameer, Excel PowerView
Stages: Big Data Engineering, Big Data Analysis, Big Data Science
Big Data: Data Analysis & Visualization
Sentiment Map of Alphago
Positive
Negative
K-Election 2017
(April 29 – May 9)
Businesses popular in 5 miles of CalStateLA,
USC , UCLA
Jams and other traffic incidents reported
by users in Dec 2017 – Jan 2018:
(Dalyapraz Dauletbak)
COVID 19 Dashboard
https://www.calstatela.edu/centers/hipic/covid-19-us-ca-confirmed-prediction
Contents
 Myself
 Introduction To Big Data
 Scalable Business Intelligence
 Predictive Analysis with Big Data AI
 Summary
Big Data Prediction
Big Data Science
How to predict the future trend and pattern with the massive
dataset?
=> Machine Learning
[Figure: Deep Learning, Machine Learning, AI]
Spark
Limitation in MapReduce computing
Hard to program in Java
Batch Processing
– Not interactive
Disk storage for intermediate data
– Performance issue for Machine Learning
Spark (Cont’d)
Spark by UC Berkeley AMPLab
Started by Matei Zaharia in 2009,
– and open sourced in 2010
In-Memory storage for intermediate data
20 to 100 times faster than MapReduce
Good in Machine Learning => Big Data Science
– Iterative algorithms
Spark ML
Supports Machine Learning libraries
Process massive data set to build prediction models
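Why in-memory intermediate data helps iterative algorithms can be illustrated without a cluster. The sketch below is plain Python, not Spark code: the `Source` class is a hypothetical stand-in for slow storage, and the learning task (estimating a mean by gradient descent) is chosen only because it needs many passes over the same data.

```python
class Source:
    """Stands in for slow storage and counts how often the data set is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def fit_mean(load, iterations=20, lr=0.5):
    """Gradient descent on squared error; converges to the mean of the data."""
    estimate = 0.0
    for _ in range(iterations):
        data = load()  # one full pass over the data per iteration
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= lr * gradient
    return estimate


# MapReduce style: every iteration goes back to storage.
disk = Source([2.0, 4.0, 6.0])
uncached = fit_mean(disk.load)

# Spark style: read once, keep the data in memory, iterate over that copy.
disk2 = Source([2.0, 4.0, 6.0])
in_memory = disk2.load()
cached = fit_mean(lambda: in_memory)
```

Both runs produce the same estimate, but the first reads storage once per iteration and the second reads it once in total, which is the effect the slide attributes to Spark for machine learning workloads.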
Deep Learning
Machine Learning
Has been popular since Google TensorFlow, Nov 9 2015
Multiple Cores in GPU
– Even with multiple GPUs and CPUs
Parallel Computing in a chip
GPU (Nvidia GTX 1660 Ti)
1536 CUDA cores
Other Deep Learning Libraries
TensorFlow with Keras
PyTorch by Facebook
Apache MXNet
Caffe, Caffe2
Microsoft Cognitive Toolkit (Previously CNTK)
DeepLearning4j
…
From Neural Networks to Deep Learning
Deep learning – Different types of architectures
Generative Adversarial Networks (GAN)
Convolutional Neural Networks (CNN)
Neural Networks (NN)
Recurrent Neural Networks (RNN) & Long Short-Term Memory (LSTM)
Ref: SAP Enterprise Deep Learning with TensorFlow
Deep Learning
CNN
Image Recognition
Video Analysis
 NLP for classification, Prediction
RNN
Time Series Prediction
Text Analysis
– Conversation Q&A
Image/Video Captioning
Speech Recognition/Synthesis
GAN
 Media Generation
– Photo Realistic Images
Human Image Synthesis: Fake faces
Data Scale Driving: Deep Learning Process
Deep Learning and Massive Data [3]
“Machine Learning Yearning” Andrew Ng 2016
Deep learning experts
The
Chasm
Big Data Engineers, Scientists, Analysts, etc.
Another Gap between Deep Learning and Big Data
Communities [6]
Leveraging Big Data Cluster
 Existing Big Data cluster holds a massive data set, but the learning runs outside the cluster
Single GPU server for Deep Learning?
Single server for Python and R traditional Machine Learning?
– Too slow in data migration, and a single server fails
[Figure: Big Data Cluster]
Deep Learning with Spark
What if we combine Deep Learning and Spark?
Leveraging Big Data Cluster with Deep Learning
 Existing Big Data cluster
Big Data Engineering
Big Data Analysis
Big Data Science
Distributed Deep Learning
– Integrate Deep Learning to the cluster
No data migration is needed, and it can leverage the parallel computing and the existing large-scale data
Big Data Cluster
Deep Learning with Spark
Deep Learning Pipelines for Apache Spark (Databricks)
BigDL, Distributed Deep Learning Library for Apache Spark (Intel)
TensorFlowOnSpark (Yahoo! Inc)
DL4J, Deeplearning4j On Spark (Skymind)
Elephas, Distributed Deep Learning with Keras & Spark
Spark ML and DDL [2-5]
[Figure: Deep Learning in a Spark cluster using Distributed Deep Learning (DDL) libraries]
Contents
 Myself
 Introduction To Big Data
 Scalable Business Intelligence
 Predictive Analysis with Big Data AI
 Big Data using GPU
 Summary
Leveraging Big Data Cluster with GPU
What if we use GPU for Big Data Cluster?
Again: Big Data Cluster with GPU
 Existing Big Data cluster with massive data set
Single GPU server for Machine (Deep) Learning?
– Too slow in data migration, and a single server fails
[Figure: Big Data Cluster]
Leveraging Big Data Cluster with GPU
Big Data Cluster: Unified Analytics Platform
Already built in the site
– and matured for Data Engineering, Data Analysis, Data Science
Can we use the existing Big Data cluster with GPU?
– Can we integrate GPU to this Big Data Cluster?
NVidia
RAPIDS and Spark
Distributed Parallel Computing using RAPIDS
 RAPIDS:
Parallel Machine Learning (ML) on GPU
 RAPIDS + Spark:
 Distributed Parallel ML in Big Data
– XGBoost:
(+) machine learning not deep learning
(+) Leveraging Big Data
No bottleneck for large scale data
Parallel Computing with GPU
 Apache Spark 3.0 in GPU
Leveraging Big Data Cluster with GPU
 Existing Big Data cluster
Big Data Engineering
Big Data Analysis
Big Data Science
– Integrate GPU chips to the cluster
– Big Data x GPU
• Improved Parallelism
– Distributed Parallel x Parallel Chip Computing
No data migration is needed, and it can leverage the parallel computing and the existing large-scale data with GPU
Big Data Cluster with GPU
Case I: Traffic Data Analysis
 Dalyapraz Dauletbak, Junghoon Heo, Sooyoung Kim, Yeon Pyo Kim, and Jongwook Woo, "Scalable Traffic Predictive Analysis for Smart City using GPU in Big Data", KSII The 16th APIC-IST, June 20-22 2021, pp. 144-148, ISSN 2093-0542
 Columns to consider :
 Location/Time
– X and Y coordinates (Longitude & Latitude)
 Level of traffic intensity (1 - 5)
 Counts of jams/alerts
 Traffic Jam Analysis with Classification:
Found the Time for Traffic Jam
– Rush hours from 7 am to 9 am produce a lot of traffic
– The heaviest traffic
• starts from 3 pm and gets better after 6 pm
Features/columns in a dataset
Label to Predict:
Level of traffic (0, 1)
Features:
location x, location y: X and Y coordinates of the location
date_pst: Pacific Time of the publication of the traffic report
level: Label: jam level (1: almost no jam, 5: standstill jam)
speed: driver's captured speed in mph
length: length of the traffic ahead in the route of the user, in meters
*date_pst: the date is split into month, day, hour, min, sec, weekday
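The split of date_pst into separate columns can be sketched with the standard library. The timestamp format "YYYY-MM-DD HH:MM:SS" and the function name `split_timestamp` are assumptions for this example; the slides do not show the raw format or the code used in the study.

```python
from datetime import datetime


def split_timestamp(date_pst):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into the date parts used as features."""
    ts = datetime.strptime(date_pst, "%Y-%m-%d %H:%M:%S")
    return {
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "min": ts.minute,
        "sec": ts.second,
        "weekday": ts.weekday(),  # Monday = 0 ... Sunday = 6
    }


# Hypothetical report time inside the Dec 2017 to Jan 2018 collection window.
features = split_timestamp("2017-12-15 17:30:05")
```

Separate hour and weekday columns let a tree-based model pick up rush-hour and weekday patterns that a single timestamp value would hide.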
Experiment: H/W Specification
Dataproc Cluster of GCP: Hadoop Spark
 Spark 3.1.1 on Hadoop 3.2.2
Spark Cluster   2 worker nodes (CPU)   2 GPUs
Type            n1-highmem-32          nvidia-tesla-t4
Cores           32                     48
Memory          208 GB                 32 GB
Accuracy of Models
3 Algorithms
 XGBoost, Gradient Boost Tree (GBT), Random Forest (RF)
XGBoost has 100% Recall, Precision, and AUC
High Recall: low FN
                 RF (CPU)            GBT (CPU)            XGBoost (GPU)
AUC              86.3%               89.6%                100%
Precision        0.890               0.922                1.0
Recall           0.956               0.947                1.0
Computing Time   1 hr 8 min 53 sec   3 hr 55 min 23 sec   21 sec
Computing Time to train Models
                           RF                  GBT                  XGBoost
Computing Time             1 hr 8 min 53 sec   3 hr 55 min 23 sec   21 sec
Computing Time: Log(Sec)   3.62                4.15                 1.32
[Bar chart: Computing Time: Log(Sec) for RF, GBT, XGBoost]
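The Log(Sec) row is the base-10 logarithm of each training time in seconds. A small check, with the helper name `to_seconds` introduced only for this sketch:

```python
import math


def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert an hours/minutes/seconds duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Training times as reported on the slide.
train_times = {
    "RF": to_seconds(1, 8, 53),
    "GBT": to_seconds(3, 55, 23),
    "XGBoost": to_seconds(seconds=21),
}

# Base-10 logarithm, rounded to two decimals as in the table.
log_times = {name: round(math.log10(sec), 2) for name, sec in train_times.items()}
```

The log scale is what makes the three bars comparable on one chart, since the raw times span nearly three orders of magnitude.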
GCP Cluster Price
Price of a Cluster
 Number of nodes to be computationally equivalent
– Assuming the cluster is linearly scalable
                         RF        GBT        XGBoost
Equivalent No of Nodes   199       673        2
Total Prices             $753.35   $2547.77   $5.99
GCP Price/hour
n1-highmem-32 (CPU)      $1.892848
nvidia-tesla-t4 (GPU)    $1.1
Total                    $2.99
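The totals can be reproduced from the hourly prices under one reading of the table, which is an assumption of this sketch and not stated on the slide: the CPU figures 199 and 673 act as multipliers on the 2-worker CPU cluster, while the GPU cluster is the 2 workers as built, each paying the CPU and the GPU price.

```python
CPU_PRICE_PER_HOUR = 1.892848  # n1-highmem-32, from the slide
GPU_PRICE_PER_HOUR = 1.1       # nvidia-tesla-t4, from the slide
WORKERS = 2                    # assumption: the 2 worker nodes of the experiment


def hourly_price(multiplier, per_node_price, workers=WORKERS):
    """Hourly price of `multiplier` copies of a cluster with `workers` nodes."""
    return multiplier * workers * per_node_price


prices = {
    "RF": hourly_price(199, CPU_PRICE_PER_HOUR),
    "GBT": hourly_price(673, CPU_PRICE_PER_HOUR),
    "XGBoost": hourly_price(1, CPU_PRICE_PER_HOUR + GPU_PRICE_PER_HOUR),
}
rounded = {name: round(price, 2) for name, price in prices.items()}
```

Under this reading the rounded values match the Total Prices row, and the comparison rests on the linear-scalability assumption stated on the slide.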
Case II: Fraud Detection in Financial Data
Priyanka Purushu, Jongwook Woo, "Financial Fraud
Detection adopting Distributed Deep Learning in Big Data",
KSII The 15th APIC-IST 2020, July 5 -7 2020, Seoul, Korea,
pp271-273, ISSN 2093-0542
Distributed Deep Learning without GPU
No publicly available datasets on financial services
 because of the private nature of financial transactions
– especially in the mobile money transactions domain
 PaySim
URL: https://www.kaggle.com/ntnu-testimon/paysim1
Data Understanding
Numeric attributes:
amount, oldbalanceOrg, newbalanceOrg, oldbalanceDest, newbalanceDest
Categorical attributes:
step, type, isFraud, isFlaggedFraud
String attributes:
 nameOrig, nameDest
Label: isFraud
Data is imbalanced, as in other fraud data sets
isFraud has only a few positives
– not helpful in detecting a fraud transaction
Traditional Approach
Need to generate synthetic samples to balance the data before building a model
– SMOTE (Synthetic Minority Over-sampling Technique) algorithm adopted
• Minority Data: 11% from 0.2 %
Large-scale data does not need synthetic samples, as the data set is already large enough
 Just sample and balance the data to build a model
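Balancing by sampling, as opposed to generating synthetic rows with SMOTE, can be sketched as random undersampling of the majority class. This is an illustration in plain Python, not the Spark code used in the study; the function name, the seed, and the toy rows are assumptions.

```python
import random


def undersample(rows, label_key="isFraud", seed=42):
    """Keep every minority row plus an equally sized random sample of majority rows."""
    positives = [row for row in rows if row[label_key] == 1]
    negatives = [row for row in rows if row[label_key] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


# Toy data: 1,000 transactions, 1% of them labeled as fraud.
rows = [{"amount": i, "isFraud": 1 if i % 100 == 0 else 0} for i in range(1000)]
balanced = undersample(rows)
```

Undersampling discards majority rows, so it only makes sense when, as the slide argues, the data set is large enough that the remaining sample is still representative.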
Experimental System Specification
Cluster in Google Cloud Platform
Hadoop Spark of Dataproc cluster
– Python 2.7.14, Spark 2.3.4
– Intel BigDL
Google Cloud Platform (GCP):
 Instances: n1-standard-64 (64 vCPUs, 240 GB memory, 257 TB storage)
 Number of Nodes: 6
– Memory size:
• 1.44 TB = 1440 GB (= 240 GB x 6)
– CPU:
• 384 vCPU (= 64 vCPUs x 6), 2.0 GHz
– Storage:
• 1.542 PB = 1,542 TB (= 257TB x 6)
Financial Data Set (Cont‘d)
Size: 470 MB
6,362,620 records
Not large-scale data compared to data sets of many GB
But the Big Data architecture is applicable to much bigger data sets
– As it still adopts the Spark computing engine in Big Data
Attributes: 11
Predictive Analysis
The target column to predict fraud :
– ‘isFraud’
Comparing Spark ML and DDL for fraud detection
Spark ML algorithms
DT (Decision Tree)
RF (Random Forest)
LR (Logistic Regression)
DDL: Distributed Deep Learning in Spark
MFF (Multilayer Perceptron FF)
– Feed Forward (FF)
• a neural network system
– Cross Validation (CV)
– Train Split Validation (TSV)
BigDL FF (BFF)
Achieve High Recall: low FN
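A feed-forward (FF) network passes its input through fully connected layers in one direction. The sketch below is a minimal pure-Python forward pass, not Spark ML's multilayer perceptron or BigDL; the layer sizes, the sigmoid activation, and the untrained weights are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def feed_forward(features, layers):
    """Pass the features through each (weights, biases) layer in order."""
    activations = features
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Hypothetical, untrained weights: 3 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
fraud_score = feed_forward([0.2, 0.7, 0.1], layers)[0]
```

Training would adjust the weights so that the output score separates fraud from non-fraud rows; distributed libraries spread that training over the cluster.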
Summary: Accuracy and Performance
Model     Precision   Recall   AUC     Time (mins)
DT        0.976       0.975    0.976   3
RF        0.977       0.980    0.979   13
LR        0.946       0.860    0.905   3
MFF TSV   0.694       1        0.782   2
MFF CV    0.695       1        0.783   4
BFF       0.593       0.516    1       4
Summary: Confusion matrix of RF
RF should be the optimal model
– High Recall: 0.980, and Precision: 0.977
– Good AUC: 0.979
MFF: Recall 1
– AUC is low: 0.782
BFF: AUC 1
– Recall is low: 0.516
RF                   Actual Negative   Actual Positive
Predicted Negative   124,034           2,847
Predicted Positive   2,534             122,936
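Precision and recall follow directly from the four cells of the matrix. The sketch below reads the cells exactly as labeled (rows are predictions, columns are actual classes). With that reading the two metrics come out as the same pair of values the slide reports, about 0.98 and 0.977, but with the names exchanged; reading the rows as actual classes instead would swap them back.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Cells as labeled on the slide: rows = predicted, columns = actual.
tn, fn = 124_034, 2_847    # predicted negative
fp, tp = 2_534, 122_936    # predicted positive

precision, recall = precision_recall(tp, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

A low false-negative count is what keeps recall high, which is the property the slides single out for fraud detection.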
Summary: Performance
MFF TSV has the fastest computing time, about 2 minutes
Others:
– 3 to 4 mins
RF: 13 mins
[Bar chart: Computing Time (min): DT 3, RF 13, LR 3, MFF TSV 2, MFF CV 4, BFF 4]
Successful Enterprise: Business + Engineering
[Figure: quadrant chart with axes Business and Engineering / Technology (Cost?)]
– Low Tech (Cost?) but High Biz
– High {Biz + Tech (Cost?)}
– Low {Biz + Tech}
– High Tech but Low Biz
Collaboration
How do you adopt technology for your business
High Tech
– Not to focus on technology itself
• Good enough technology
Business
– Good Business model
• Good enough or Latest technology?
Needs Convergence and Collaboration
Communication between biz and eng needed
 Find the proper solution
– Leveraging the optimal Tech
– Gain the highest Business Profit
Contents
 Myself
 Introduction To Big Data
 Scalable Business Intelligence
 Predictive Analysis with Big Data AI
 Summary
Summary
 Big Data platform for Large Scale Data
 High Performance solution for massive data set
– Data Storage, Analysis, Prediction
 Unified Analytics Platform
 Big Data and AI
 Big Data
– without GPU but with Deep Learning
 GPU
– Leveraging Big Data with GPU
 Big Data Predictive Analysis Performance with GPU
Faster
More Accurate
Much cheaper
Questions?
References
1. J. Barbaresso, G. Cordahi, D. Garcia et al., “USDOT’s Intelligent Transportation Systems (ITS) ITS Strategic Plan 2015-2019,” 2014.
2. “Integrated Corridor Management,” Intelligent Transportation Systems - Integrated Corridor Management, www.its.dot.gov/research_archives/icms/. Accessed April 14, 2019.
3. J. Kestelyn, “Real-Time Data Visualization and Machine Learning for London Traffic Analysis,” Google Cloud, 2016, cloud.google.com/blog/products/gcp/real-time-data-visualization-and-machine-learning-for-london-traffic-analysis. Accessed April 14, 2019.
4. “Connected Citizens by Waze,” Waze, www.waze.com/ccp. Accessed April 14, 2019.
5. M. Schnuerle, “Louisville and Waze: Applying Mobility Data in Cities,” Harvard Civic Analytics Network Summit on Data-Smart Government, 2017.
6. Louisville Metro. “Thunder Jams, 2017 Traffic Delays.” CARTO, louisvillemetro-ms.carto.com/builder/d98732d0-1f6a-4db2-9f8a-e58026bf0d39/embed. Accessed April 14, 2019.
7. Louisville Metro. “Pothole Animation.” CARTO, cdolabs-admin.carto.com/builder/a80f62bf-98e1-4591-8354-acfa8e51a8de/embed. Accessed April 14, 2019.
8. E. Necula, “Analyzing Traffic Patterns on Street Segments Based on GPS Data Using R,” Transportation Research Procedia, Vol. 10, pp. 276–285, 2015.
9. J. Woo and Y. Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing,” in Proc. of International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, 2011.
10. “Pandas.io.json.json_normalize.” Pandas.io.json.json_normalize - Pandas 0.24.2 Documentation, pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html. Accessed April 14, 2019.
11. United States, Chief Executive Office County of Los Angeles. “Cities within the County of Los Angeles.” lacounty.gov. Accessed April 14, 2019.
12. Garyericson. “What Is - Azure Machine Learning Studio.” Microsoft Docs, docs.microsoft.com/en-us/azure/machine-learning/studio/what-is-ml-studio. Accessed April 14, 2019.
13. A. Tharwat, “Classification Assessment Methods.” Applied Computing and Informatics, 2018.
14. M. Sokolova and G. Lapalme, “A Systematic Analysis of Performance Measures for Classification Tasks,” Information Processing & Management, Vol. 45, No. 4, pp. 427–437, 2009.
15. Performance of Dataframe in Spark and PySpark, https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
16. https://cities-today.com/smart-traffic-management-could-save-cities-us277-billion-by-2025/
17. https://www.greenbiz.com/article/advanced-traffic-management-next-big-thing-smart-cities

Contenu connexe

Tendances

Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataJongwook Woo
 
Introduction to Big Data and its Trends
Introduction to Big Data and its TrendsIntroduction to Big Data and its Trends
Introduction to Big Data and its TrendsJongwook Woo
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraJongwook Woo
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligenceManish Jain
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceANOOP V S
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data ScienceKenny Daniel
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Data Science London
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsChandan Rajah
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)heba_ahmad
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceEdureka!
 
Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Benjamin Taylor
 

Tendances (20)

Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big Data
 
Introduction to Big Data and its Trends
Introduction to Big Data and its TrendsIntroduction to Big Data and its Trends
Introduction to Big Data and its Trends
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Analytics and Data Mining Industry Overview
Analytics and Data Mining Industry OverviewAnalytics and Data Mining Industry Overview
Analytics and Data Mining Industry Overview
 
Predictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial IntelligencePredictive Analytics - Big Data & Artificial Intelligence
Predictive Analytics - Big Data & Artificial Intelligence
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Data science
Data scienceData science
Data science
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
Big Data Science: Intro and Benefits
Big Data Science: Intro and BenefitsBig Data Science: Intro and Benefits
Big Data Science: Intro and Benefits
 
Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11Intro to Data Science by DatalentTeam at Data Science Clinic#11
Intro to Data Science by DatalentTeam at Data Science Clinic#11
 
Data mining
Data miningData mining
Data mining
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Are you ready for BIG DATA?
Are you ready for BIG DATA?Are you ready for BIG DATA?
Are you ready for BIG DATA?
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Predictive analytics and big data tutorial
Predictive analytics and big data tutorial Predictive analytics and big data tutorial
Predictive analytics and big data tutorial
 

Similaire à Scalable Predictive Analysis and The Trend with Big Data & AI

Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsJongwook Woo
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksJongwook Woo
 
A New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfA New Paradigm on Analytic-Driven Information and Automation V2.pdf
A New Paradigm on Analytic-Driven Information and Automation V2.pdfArmyTrilidiaDevegaSK
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsCaserta
 
Scalable Predictive Analysis and The Trend with Big Data & AI

  • 1. Jongwook Woo BigDAI CalStateLA School of Business Yonsei University Oct 5 2021, Korea Jongwook Woo, PhD, jwoo5@calstatela.edu Big Data AI Center (BigDAI) California State University Los Angeles Scalable Predictive Analysis and The Trend with Big Data & AI
  • 2. Contents: Myself; Introduction To Big Data; Scalable Business Intelligence; Predictive Analysis with Big Data AI; Summary
  • 3. Myself. Experience: Professor at the Dept. of IS, California State University Los Angeles, since 2002; PhD in Computer Science and Engineering at USC, 2001
  • 4. Myself: S/W Development Lead. http://www.mobygames.com/game/windows/matrix-online/credits
  • 5. Myself: Partners for Services
  • 6. Myself: Collaborations. SOFTZEN
  • 7. Collaboration with NVidia, Databricks, Oracle, Amazon, CDH using Big Data AI. https://www.cloudera.com/more/customers/csula.html
  • 8. Contents: Myself; Introduction To Big Data; Scalable Business Intelligence; Predictive Analysis with Big Data AI; Summary
  • 9. Data Issues. Large-scale data: terabytes (10^12 bytes), petabytes (10^15 bytes), because of the web; IoT (streaming data, sensor data) in SmartX; social computing, smart phones, online games; bioinformatics, … Legacy approach: can handle the massive data set by increasing the storage size and improving the speed of the CPU. Only problem: too expensive
  • 10. Data Handling: Traditional Way
  • 11. Data Handling: Traditional Way Becomes Too Expensive
  • 12. Big Data Definition. What is Big Data: data or systems? Data view: large-scale data? The 3 Vs or 5 Vs (Velocity, Volume, Variety); many people only see the data point of view, which is nothing new. Systems view: yes, new systems for large-scale data that are inexpensive
  • 13. Two Cores in Big Data: how to store Big Data and how to compute Big Data. Google: to store Big Data, GFS, a distributed system on inexpensive commodity computers; to compute Big Data, MapReduce, parallel computing with inexpensive computers. Own supercomputers. Published papers in 2003 and 2004
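The MapReduce model named on slide 13 can be illustrated with a minimal single-process sketch. This is not Google's or Hadoop's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample text are assumptions made for illustration, and a real framework would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of text, as a mapper would."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts for each word, as a reducer would."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks):
    # Each chunk could be mapped on a different machine; here they run in turn.
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    return reduce_phase(shuffle(mapped))


# Hypothetical input, split into two chunks as a distributed file system might.
chunks = ["big data big systems", "data systems store data"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, adding machines lets more chunks be processed at the same time, which is the property the slide attributes to inexpensive commodity clusters.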
  • 14. Data Handling: Another Way, but one that works well with an extremely massive data set. Battle of Nagashino, 1575, Japan
  • 15. Data Handling: Another Way, Not Expensive. From the 2017 Korean blockbuster movie “The Fortress” (남한산성)
  • 16. Data Handling: Another Way, Not Expensive. http://blog.naver.com/PostView.nhn?blogId=dosims&logNo=221127053677 AD 1409 (Year 9 of King Tae-Jong, Chosun Dynasty, Korea). By the Choi family: Choi Hae-san (최해산, 崔海山), whose father was Choi Mu-seon (최무선, 崔茂宣). [Ref] “Joseon's Secret Weapon: The Chongtonggi Hwacha” (조선의 비밀 병기: 총통기 화차, 銃筒機火車), written by Dosim (도심)
  • 17. Big Data Solution: Large-Scale Data. Big Data: an inexpensive platform of distributed parallel computing systems that can store large-scale data and process it in parallel. Apache Hadoop (since 2006) and Spark: an inexpensive supercomputer that any small company or university lab can own
  • 18. Big Data. Big Data (Hadoop, Spark, Distributed Deep Learning): cluster for compute and store (distributed file systems: HDFS, GFS) …
  • 19. Big Data: Linearly Scalable. Some people question whether a system that handles a 1 to 3 GB data set is Big Data. Well, on a Big Data platform you add more servers as the data grows: it is linearly scalable once built, with ideally n times more computing power. Data size: < 3 GB. Add n servers. Data size: > 200 TB
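The "ideally n times more computing power" claim on slide 19 can be worked through as simple arithmetic. The sketch below assumes perfectly even division of work and no coordination overhead, which real clusters do not achieve; the function name and the per-server rate of 0.5 TB per hour are hypothetical values chosen only for illustration.

```python
def ideal_runtime_hours(data_tb, servers, tb_per_server_hour):
    """Ideal runtime when work divides evenly across servers (no overhead)."""
    if servers < 1:
        raise ValueError("need at least one server")
    return data_tb / (servers * tb_per_server_hour)


RATE = 0.5  # hypothetical: TB that one server processes per hour

# The same 200 TB job on 1, 10, and 100 servers.
for n in (1, 10, 100):
    print(n, "servers:", ideal_runtime_hours(200, n, RATE), "hours")
```

Under these assumptions the runtime falls in direct proportion to the number of servers, which is what "linearly scalable" means here.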
  • 20. Contents: Myself; Introduction To Big Data; Scalable Business Intelligence; Predictive Analysis with Big Data AI; Summary
  • 21. Big Data is great for everyone: university labs, small businesses, etc. Big Data Analysis: Hadoop, Spark, NoSQL DB, SAP HANA, ElasticSearch, … Big Data for data analysis: how to store, compute, and analyze a massive dataset? You have your own specific data; a big company does not have the specific data that you have. Your business data is the value: customer data and operational data
  • 22. Big Data Analysis and Prediction Flow. Data Collection: batch API (Yelp, Google); streaming (Twitter, Apache NiFi, Kafka, Storm); open data (government). Data Storage: HDFS, S3, Object Storage, NoSQL DB (Couchbase), … Data Filtering: Hive, Pig. Data Analysis and Science: Hive, Pig, Spark, BI tools (Datameer, Qlik, …). Data Visualization: Qlik, Datameer, Excel PowerView. Big Data Engineering; Big Data Analysis; Big Data Science
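The collection, storage, filtering, and analysis stages of the flow on slide 22 can be sketched as a chain of plain functions. The records, field names, and function names below are hypothetical, and an in-memory list stands in for HDFS or S3; in the deck these stages are handled by tools such as Hive, Pig, and Spark.

```python
from collections import Counter


def collect():
    """Stand-in for batch or streaming collection (hypothetical records)."""
    return [
        {"business": "cafe", "stars": 5},
        {"business": "cafe", "stars": None},  # incomplete record
        {"business": "gym", "stars": 3},
        {"business": "cafe", "stars": 4},
    ]


def filter_records(records):
    """Data filtering: drop records with a missing rating."""
    return [r for r in records if r["stars"] is not None]


def analyze(records):
    """Data analysis: average stars per business."""
    totals, counts = Counter(), Counter()
    for r in records:
        totals[r["business"]] += r["stars"]
        counts[r["business"]] += 1
    return {b: totals[b] / counts[b] for b in totals}


storage = collect()              # storage step: an in-memory list here
clean = filter_records(storage)  # filtering step
summary = analyze(clean)         # analysis step, ready for visualization
print(summary)
```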
  • 23. Big Data: Data Analysis & Visualization. Sentiment Map of Alphago: Positive, Negative
  • 24. K-Election 2017 (April 29 – May 9)
  • 25. Businesses popular within 5 miles of CalStateLA, USC, UCLA
  • 26. Jams and other traffic incidents reported by users in Dec 2017 – Jan 2018 (Dalyapraz Dauletbak)
  • 27. COVID-19 Dashboard. https://www.calstatela.edu/centers/hipic/covid-19-us-ca-confirmed-prediction
  • 28. Contents: Myself; Introduction To Big Data; Scalable Business Intelligence; Predictive Analysis with Big Data AI; Summary
  • 29. Big Data Prediction. Big Data Science: how to predict future trends and patterns with a massive dataset? => Machine Learning. Deep Learning, Machine Learning, AI
  • 30. Spark. Limitations of MapReduce computing: hard to program in Java; batch processing, not interactive; disk storage for intermediate data, a performance issue for Machine Learning
  • 31. Spark (Cont’d). Spark by UC Berkeley AMPLab: started by Matei Zaharia in 2009 and open sourced in 2010. In-memory storage for intermediate data: 20 to 100 times faster than MapReduce. Good for Machine Learning => Big Data Science: iterative algorithms. Spark ML: supports Machine Learning libraries; processes massive data sets to build prediction models
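Why in-memory storage helps iterative algorithms (slides 30 and 31) can be shown with a toy simulation. This is not Spark code: `DiskDataset` is a hypothetical stand-in for data on disk that counts how often it is read, and `fit_mean` is a small gradient-descent loop chosen only because it scans the data once per iteration, as many Machine Learning algorithms do.

```python
class DiskDataset:
    """Stand-in for a data set on disk that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def fit_mean(get_data, iterations=20, lr=0.5):
    """Gradient descent toward the mean; each iteration scans the data once."""
    estimate = 0.0
    for _ in range(iterations):
        data = get_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= lr * gradient
    return estimate


values = [2.0, 4.0, 6.0, 8.0]

# MapReduce-like: every iteration goes back to disk for its input.
disk = DiskDataset(values)
fit_mean(disk.load)
reads_without_cache = disk.reads

# Spark-like: load once, keep the data in memory, iterate over the cached copy.
disk_cached = DiskDataset(values)
cached = disk_cached.load()
estimate = fit_mean(lambda: cached)
reads_with_cache = disk_cached.reads

print(reads_without_cache, reads_with_cache, round(estimate, 3))
```

The result is the same either way; only the number of trips to disk changes, and that count grows with the number of iterations when nothing is cached.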
  • 32. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Deep Learning Machine Learning Has been popular since Google Tensorflow, Nov 9 2015 Multiple Cores in GPU – Even with multiple GPUs and CPUs Parallel Computing in a chip GPU (Nvidia GTX 1660 Ti) 1280 CUDA cores Other Deep Learning Libraries Tensor Flow with Keras PyTorch by Facebook Apache Mxnet Caffe, Caffe2 Microsoft Cognitive Toolkit (Previously CNTK) DeepLearning4j …
  • 33. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA From Neural Networks to Deep Learning Deep learning – Different types of architectures Generative Adversarial Networks (GAN) Convolutional Neural Networks (CNN) Neural Networks (NN) 7 © 2017 SAP SE or an SAP affiliate company. All rights reserved. ǀ PUBLIC Recurrent Neural Networks (RNN) & Long-Short Term Memory (LSTM) Ref: SAP Enterprise Deep Learning with TensorFlow
  • 34. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Deep Learning CNN Image Recognition Video Analysis  NLP for classification, Prediction RNN Time Series Prediction Text Analysis – Conversation Q&A Image/Video Captioning Speech Recognition/Synthesis GAN  Media Generation – Photo Realistic Images Human Image Synthesis: Fake faces
  • 35. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Data Scale Driving: Deep Learning Process Deep Learning and Massive Data [3] “Machine Learning Yearning” Andrew Ng 2016
  • 36. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Deep learning experts The Chasm Big Data Engineers, Scientists, Analysts, etc. Another Gap between Deep Learning and Big Data Communities [6]
  • 37. Big Data Artificial Intelligence Center (BigDAI) Jongwook Woo CalStateLA Leveraging Big Data Cluster  Existing Big Data cluster with massive data set without using Big Data Too slow in data migration and single server fails Single GPU server for Deep Learning? Single server for Python and R Traditional Machine Learning? Big Data Cluster
  • 38. Deep Learning with Spark. What if we combine Deep Learning and Spark?
  • 39. Leveraging Big Data Cluster with Deep Learning. Existing Big Data cluster: Big Data Engineering; Big Data Analysis; Big Data Science; Distributed Deep Learning (integrate Deep Learning into the cluster). No data migration is needed, and the cluster can leverage parallel computing and the existing large-scale data.
  • 40. Deep Learning with Spark. Deep Learning Pipelines for Apache Spark (Databricks); BigDL, Distributed Deep Learning Library for Apache Spark (Intel); TensorFlowOnSpark (Yahoo! Inc); DL4J, Deeplearning4j on Spark (Skymind); Elephas, Distributed Deep Learning with Keras & Spark.
  • 41. Spark ML and DDL [2-5]. Deep Learning in a Spark cluster: Distributed Deep Learning (DDL). [Figure: Deep Learning in Spark with DDL libs]
  • 42. Contents: Myself; Introduction To Big Data; Scalable Business Intelligence; Predictive Analysis with Big Data AI; Big Data using GPU; Summary
  • 43. Leveraging Big Data Cluster with GPU. What if we use GPU for Big Data Cluster?
  • 44. Again: Big Data Cluster with GPU. An existing Big Data cluster holds a massive data set. A single GPU server for Machine (Deep) Learning? Data migration is too slow, and a single server fails.
  • 45. Leveraging Big Data Cluster with GPU. Big Data Cluster as a Unified Analytics Platform: already built on site and mature for Data Engineering, Data Analysis, and Data Science. Can we use the existing Big Data cluster with GPU? Can we integrate GPU into this Big Data cluster? NVidia RAPIDS and Spark.
  • 46. Distributed Parallel Computing using RAPIDS. RAPIDS: parallel Machine Learning (ML) on GPU. RAPIDS + Spark: distributed parallel ML in Big Data. XGBoost: (+) machine learning, not deep learning; (+) leverages Big Data; no bottleneck for large-scale data.
  • 47. Parallel Computing with GPU. Apache Spark 3.0 on GPU.
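To make the idea of running Apache Spark 3.x on GPUs a little more concrete, the sketch below assembles spark-submit flags that enable the NVIDIA RAPIDS Accelerator plugin. The configuration keys are documented Spark and RAPIDS settings, but the values, the helper name `to_submit_flags`, and the overall setup are illustrative assumptions, not the configuration used in the talk. The RAPIDS plugin jar would also need to be on the classpath, which is not shown here.

```python
# Hypothetical helper: turn a dict of Spark settings into spark-submit flags.
def to_submit_flags(conf):
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags

# Documented Spark 3 / RAPIDS Accelerator settings; the values are
# illustrative assumptions, not the talk's actual configuration.
rapids_conf = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",   # load the RAPIDS SQL plugin
    "spark.rapids.sql.enabled": "true",              # run supported SQL ops on GPU
    "spark.executor.resource.gpu.amount": "1",       # one GPU per executor
    "spark.task.resource.gpu.amount": "0.25",        # four tasks share a GPU
}

flags = to_submit_flags(rapids_conf)
print(" ".join(["spark-submit"] + flags))
```

With these settings the scheduler treats the GPU as a resource, so existing Spark SQL and DataFrame jobs can be accelerated without moving data off the cluster.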
  • 48. Leveraging Big Data Cluster with GPU. Existing Big Data cluster: Big Data Engineering; Big Data Analysis; Big Data Science. Integrate GPU chips into the cluster (Big Data x GPU) for improved parallelism: distributed parallel computing x parallel chip computing. No data migration is needed, and the cluster can leverage parallel computing and the existing large-scale data with GPU.
  • 49. Case I: Traffic Data Analysis. Dalyapraz Dauletbak, Junghoon Heo, Sooyoung Kim, Yeon Pyo Kim, and Jongwook Woo, "Scalable Traffic Predictive Analysis for Smart City using GPU in Big Data", KSII The 16th APIC-IST, June 20-22 2021, pp. 144-148, ISSN 2093-0542. Columns to consider: location/time (X and Y coordinates: longitude and latitude); level of traffic intensity (1 - 5); counts of jams/alerts. Traffic jam analysis with classification found the times of traffic jams: rush hours from 7 am to 9 am produce a lot of traffic; the heaviest traffic starts at 3 pm and eases after 6 pm.
  • 50. Features/columns in a dataset. Label to predict: level of traffic (0, 1). Features:
      location x, location y   X and Y coordinates of the location
      date_pst*                Pacific Time of the publication of the traffic report
      level                    Label: jam level (1: almost no jam, 5: standstill jam)
      speed                    driver's captured speed in mph
      length                   length of the traffic ahead on the user's route, in meters
      * date_pst is split into month, day, hour, min, sec, weekday
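The date split mentioned on the features slide can be sketched as follows. This is a minimal illustration in plain Python; the function name `split_timestamp` and the sample value are hypothetical, and the actual storage format of `date_pst` is not given in the slides.

```python
from datetime import datetime

def split_timestamp(ts):
    """Split a timestamp into the calendar fields listed on the slide."""
    return {
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "min": ts.minute,
        "sec": ts.second,
        "weekday": ts.weekday(),  # Monday = 0 ... Sunday = 6
    }

# Hypothetical sample value standing in for one date_pst entry.
sample = datetime(2021, 6, 21, 7, 30, 15)
features = split_timestamp(sample)
print(features)
```

Separate hour and weekday columns let a tree-based classifier pick up patterns such as the morning and afternoon rush hours directly.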
  • 51. Experiment: H/W Specification. Dataproc cluster of GCP (Hadoop Spark): Spark 3.1.1 on Hadoop 3.2.2.
      Spark Cluster   2 worker nodes (CPU)   2 GPUs
      Type            n1-highmem-32          nvidia-tesla-t4
      Cores           32                     48
      Memory          208 GB                 32 GB
  • 52. Accuracy of Models. 3 algorithms: XGBoost, Gradient Boost Tree (GBT), Random Forest (RF). XGBoost has 100% Recall, Precision, and AUC. High Recall means low FN (false negatives).
      Metric           RF (CPU)            GBT (CPU)            XGBoost (GPU)
      AUC              86.3%               89.6%                100%
      Precision        0.890               0.922                1.0
      Recall           0.956               0.947                1.0
      Computing Time   1 hr 8 min 53 sec   3 hr 55 min 23 sec   21 sec
  • 53. Computing Time to train Models.
      Metric                     RF                  GBT                  XGBoost
      Computing Time             1 hr 8 min 53 sec   3 hr 55 min 23 sec   21 sec
      Computing Time: Log(Sec)   3.62                4.15                 1.32
      [Bar chart: Computing Time, Log(Sec), for RF, GBT, and XGBoost]
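The Log(Sec) row can be reproduced from the raw training times. The short sketch below assumes the logarithm is base 10, which is consistent with the values on the slide; the helper name `to_seconds` is introduced here for illustration only.

```python
import math

def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds

# Training times as reported on the slide.
times = {
    "RF": to_seconds(1, 8, 53),
    "GBT": to_seconds(3, 55, 23),
    "XGBoost": to_seconds(0, 0, 21),
}

# Base-10 logarithm of the time in seconds, rounded to two decimals.
log_times = {name: round(math.log10(sec), 2) for name, sec in times.items()}
print(times)
print(log_times)
```

The log scale is what makes a 21-second bar visible next to bars for runs of more than an hour.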
  • 54. GCP Cluster Price. Price of a cluster: the number of nodes needed to be computationally equivalent, assuming the cluster is linearly scalable.
      Metric                    RF        GBT        XGBoost
      Equivalent No. of Nodes   199       673        2
      Total Price               $753.35   $2547.77   $5.99
      GCP price per hour: n1-highmem-32 (CPU) $1.892848; nvidia-tesla-t4 (GPU) $1.1; total $2.99
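The totals on the price slide can be checked with simple arithmetic. The sketch below is an inference from the printed numbers, not a formula stated in the talk: the XGBoost total matches 2 nodes at the combined CPU plus GPU hourly price, and the RF and GBT totals match the equivalent node count times the CPU hourly price times 2. The meaning of that factor of 2 is not given in the slides, so it is labeled as an assumption.

```python
CPU_PRICE = 1.892848  # n1-highmem-32, USD per hour (from the slide)
GPU_PRICE = 1.1       # nvidia-tesla-t4, USD per hour (from the slide)

# GPU cluster: 2 nodes, each priced as one CPU instance plus one GPU.
xgboost_total = 2 * (CPU_PRICE + GPU_PRICE)

# CPU-only equivalents. ASSUMPTION: the factor 2 is inferred because it
# reproduces the slide's totals; the slide does not explain it.
ASSUMED_FACTOR = 2
rf_total = 199 * CPU_PRICE * ASSUMED_FACTOR
gbt_total = 673 * CPU_PRICE * ASSUMED_FACTOR

for name, total in [("RF", rf_total), ("GBT", gbt_total), ("XGBoost", xgboost_total)]:
    print(f"{name}: ${total:.2f}")
```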
  • 55. Case II: Fraud Detection in Financial Data. Priyanka Purushu, Jongwook Woo, "Financial Fraud Detection adopting Distributed Deep Learning in Big Data", KSII The 15th APIC-IST 2020, July 5-7 2020, Seoul, Korea, pp. 271-273, ISSN 2093-0542. Distributed Deep Learning without GPU. No publicly available datasets on financial services exist because of the private nature of financial transactions, especially in the mobile money transactions domain. PaySim URL: https://www.kaggle.com/ntnu-testimon/paysim1
  • 56. Data Understanding. Numeric attributes: amount, oldbalanceOrg, newbalanceOrg, oldbalanceDest, newbalanceDest. Categorical attributes: step, type, isFraud, isFlaggedFraud. String attributes: nameOrig, nameDest.
  • 57. Label: isFraud. The data is biased, as in other fraud data sets: isFraud has only a few positives, which is not helpful in detecting a fraud transaction. Traditional approach: generate sample data to balance the data before building a model; the SMOTE (Synthetic Minority Over-sampling Technique) algorithm was adopted, raising the minority data to 11% from 0.2%. Large-scale data does not need generated samples because the data set is large enough: just sample and balance the data to build a model.
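The "just sample and balance" step can be illustrated by randomly undersampling the majority class. This is a minimal plain-Python sketch on toy data; the function name `undersample`, the seed, and the toy records are assumptions, and the study itself would have done the equivalent on a Spark DataFrame.

```python
import random

def undersample(rows, label_key="isFraud", seed=42):
    """Keep all minority rows and an equal-sized random sample of the majority."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)  # avoid all minority rows sitting at the front
    return balanced

# Toy data: 1% positives, standing in for the rare fraud label.
rows = [{"id": i, "isFraud": 1 if i % 100 == 0 else 0} for i in range(1000)]
balanced = undersample(rows)
print(len(balanced), sum(r["isFraud"] for r in balanced))
```

Unlike SMOTE, this creates no synthetic records; it only works well when the minority class is already large in absolute terms, which is the point the slide makes about large-scale data.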
  • 58. Experimental System Specification. Cluster in Google Cloud Platform (GCP): Hadoop Spark on a Dataproc cluster; Python 2.7.14, Spark 2.3.4; Intel BigDL. Instances: n1-standard-64 (64 vCPUs, 240 GB memory, 257 TB storage). Number of nodes: 6. Memory size: 1.44 TB = 1,440 GB (= 240 GB x 6). CPU: 384 vCPUs (= 64 vCPUs x 6), 2.0 GHz. Storage: 1.542 PB = 1,542 TB (= 257 TB x 6).
  • 59. Financial Data Set (Cont'd). Size: 470 MB, 6,362,620 records. This is not a very large data set compared with data sets over a GB, but the Big Data architecture is applicable to much bigger data sets, as it still adopts the Spark computing engine. Attributes: 11. Predictive analysis: the target column to predict fraud is 'isFraud'.
  • 60. Comparing Spark ML and DDL for fraud detection. Spark ML algorithms: DT (Decision Tree); RF (Random Forest); LR (Logistic Regression). DDL (Distributed Deep Learning in Spark): MFF (Multilayer Perceptron FF), where Feed Forward (FF) is a neural network system, evaluated with Cross Validation (CV) and Train Split Validation (TSV); BigDL FF (BFF). Goal: achieve high Recall, i.e. low FN.
  • 61. Summary: Accuracy and Performance.
      Model     Precision   Recall   AUC     Time (mins)
      DT        0.976       0.975    0.976   3
      RF        0.977       0.980    0.979   13
      LR        0.946       0.860    0.905   3
      MFF TSV   0.694       1        0.782   2
      MFF CV    0.695       1        0.783   4
      BFF       0.593       0.516    1       4
  • 62. Summary: Confusion matrix of RF. RF should be the optimal model, as it has high Recall (0.980) and Precision (0.977) and a good AUC (0.979). MFF: Recall is 1 but AUC is low (0.782). BFF: AUC is 1 but Recall is low (0.516).
      RF                   Actual Negative   Actual Positive
      Predicted Negative   124,034           2,847
      Predicted Positive   2,534             122,936
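Precision and recall can be recomputed from the RF confusion matrix as a sanity check. The sketch below reads the matrix exactly as it is labeled in the transcript (rows are predicted, columns are actual). Read that way, precision rounds to 0.980 and recall to 0.977, which are the two reported values in swapped positions; the reported figures would match exactly if the rows were actual classes and the columns predicted ones. Which orientation the original slide intended is an open question, not something this sketch settles.

```python
# Cells as labeled in the transcript: rows = predicted, columns = actual.
tn, fn = 124_034, 2_847   # "Predicted Negative" row
fp, tp = 2_534, 122_936   # "Predicted Positive" row

precision = tp / (tp + fp)   # of predicted frauds, the share that are real
recall = tp / (tp + fn)      # of real frauds, the share that are caught
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

The near-equal row and column totals also show that the evaluation set was balanced between fraud and non-fraud records.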
  • 63. Summary: Performance. MFF TSV has the fastest computing time, about 2 minutes. Others: 3 to 4 mins. RF: 13 mins. [Bar chart: Computing Time (min): DT 3, RF 13, LR 3, MFF TSV 2, MFF CV 4, BFF 4]
  • 64. Successful Enterprise: Business + Engineering. [Quadrant chart with axes Engineering / Technology (Cost?) and Business. Quadrants: Low Tech (Cost?) but High Biz; High {Biz + Tech (Cost?)}; Low {Biz + Tech}; High Tech but Low Biz]
  • 65. Collaboration. How do you adopt technology for your business? High Tech: do not focus on the technology itself; good-enough technology. Business: a good business model; good-enough or the latest technology? Convergence and collaboration are needed: communication between business and engineering, to find the proper solution that leverages the optimal tech and gains the highest business profit.
  • 66. Contents: Myself; Introduction To Big Data; Scalable Business Intelligence; Predictive Analysis with Big Data AI; Summary
  • 67. Summary. Big Data platform for large-scale data: high-performance solution for massive data sets (data storage, analysis, prediction); Unified Analytics Platform. Big Data and AI: Big Data without GPU but with Deep Learning; GPU, leveraging Big Data with GPU. Big Data predictive analysis performance with GPU: faster, more accurate, much cheaper.
  • 68. Questions?
  • 69. References
      1. J. Barbaresso, G. Cordahi, D. Garcia et al., “USDOT’s Intelligent Transportation Systems (ITS) ITS Strategic Plan 2015-2019,” 2014.
      2. “Integrated Corridor Management,” Intelligent Transportation Systems - Integrated Corridor Management, www.its.dot.gov/research_archives/icms/. Accessed April 14, 2019.
      3. J. Kestelyn, “Real-Time Data Visualization and Machine Learning for London Traffic Analysis,” Google Cloud, 2016, cloud.google.com/blog/products/gcp/real-time-data-visualization-and-machine-learning-for-london-traffic-analysis. Accessed April 14, 2019.
      4. “Connected Citizens by Waze,” Waze, www.waze.com/ccp. Accessed April 14, 2019.
      5. M. Schnuerle, “Louisville and Waze: Applying Mobility Data in Cities,” Harvard Civic Analytics Network Summit on Data-Smart Government, 2017.
      6. Louisville Metro, “Thunder Jams, 2017 Traffic Delays,” CARTO, louisvillemetro-ms.carto.com/builder/d98732d0-1f6a-4db2-9f8a-e58026bf0d39/embed. Accessed April 14, 2019.
      7. Louisville Metro, “Pothole Animation,” CARTO, cdolabs-admin.carto.com/builder/a80f62bf-98e1-4591-8354-acfa8e51a8de/embed. Accessed April 14, 2019.
      8. E. Necula, “Analyzing Traffic Patterns on Street Segments Based on GPS Data Using R,” Transportation Research Procedia, Vol. 10, pp. 276–285, 2015.
  • 70. References
      9. J. Woo and Y. Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing,” in Proc. of International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, 2011.
      10. “Pandas.io.json.json_normalize,” Pandas 0.24.2 Documentation, pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html. Accessed April 14, 2019.
      11. United States, Chief Executive Office County of Los Angeles, “Cities within the County of Los Angeles,” lacounty.gov. Accessed April 14, 2019.
      12. Garyericson, “What Is - Azure Machine Learning Studio,” Microsoft Docs, docs.microsoft.com/en-us/azure/machine-learning/studio/what-is-ml-studio. Accessed April 14, 2019.
      13. A. Tharwat, “Classification Assessment Methods,” Applied Computing and Informatics, 2018.
      14. M. Sokolova and G. Lapalme, “A Systematic Analysis of Performance Measures for Classification Tasks,” Information Processing & Management, Vol. 45, No. 4, pp. 427–437, 2009.
      15. Performance of Dataframe in Spark and PySpark, https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
      16. https://cities-today.com/smart-traffic-management-could-save-cities-us277-billion-by-2025/
      17. https://www.greenbiz.com/article/advanced-traffic-management-next-big-thing-smart-cities