jwoo Woo
HiPIC
CSULA
Big Data and Advanced Data Intensive
Computing
Yonsei University
Shin-Chon, Korea
June 18th 2014
Jongwook Woo (PhD)
High-Performance Information Computing Center (HiPIC)
Cloudera Academic Partner and Grants Awardee of Amazon AWS
Computer Information Systems Department
California State University, Los Angeles
High Performance Information Computing Center
Jongwook Woo
CSULA
Contents
Introduction
 Emerging Big Data Technology
 Big Data Use Cases
 Hadoop 2.0
 Training in Big Data
Me
 Name: Jongwook Woo
 Occupation:
 Professor (rank: Associate Professor), California State University, Los Angeles
– Capital City of Entertainment
 Experience:
 Professor since 2002: Computer Information Systems Dept, College of Business and Economics
– www.calstatela.edu/faculty/jwoo5
 Consulting for many companies in Hollywood and elsewhere since 1998
– Mainly building eBusiness applications with J2EE middleware
– Information extraction and information integration using the FAST, Lucene/Solr, and Sphinx search engines
– Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others
 Interested in Hadoop and Big Data since around 2009
Experience in Big Data
 Grants
 Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2014)
 Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
 Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
 Partnership
 Academic Education Partnership with Cloudera since June 2012
 Linked with Hortonworks since May 2013
– Hortonworks is positive about providing a partnership
Experience in Big Data
 Certificate
 Certified Cloudera Instructor
 Certified Cloudera Hadoop Developer / Administrator
 Certificate of Achievement in the Big Data University Training
Course, “Hadoop Fundamentals I”, July 8 2012
 Certificate of 10gen Training Course, “M101: MongoDB
Development”, (Dec 24 2012)
 Blog and Github for Hadoop and its ecosystems
 http://dal-cloudcomputing.blogspot.com/
– Hadoop, AWS, Cloudera
 https://github.com/hipic
– Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming,
RHadoop
 https://github.com/dalgual
Experience in Big Data
 Several publications regarding Hadoop and NoSQL
 Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, “Analysis of MovieLens Data Set using Hive”, in Journal of Science and Technology, Dec 2013, Vol. 3, No. 12, pp. 1194-1198, ARPN
 Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”, in Proceedings of the International Joint Conference on Neural Networks, 2013
 Jongwook Woo, “Apriori-Map/Reduce Algorithm”, PDPTA 2012, Las Vegas (July 16-19, 2012)
 Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, EDB 2012, Incheon, Aug. 25-27, 2011
 Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, PDPTA 2011, Las Vegas (July 18-21, 2011)
 Collaboration with Universities and companies
 USC, Texas A&M, Cloudera, Amazon, Microsoft
What is Big Data, Map/Reduce, Hadoop, NoSQL DB on
Cloud Computing
Data
Google
“We don’t have a better algorithm
than others but we have more data
than others”
New Data Trend
Sparsity
Unstructured
Schema free data with sparse attributes
– Semantic or social relations
No relational property
– nor complex join queries
• Log data
Immutable
No need to update and delete data
Data Issues
Large-Scale data
Terabyte ($10^{12}$ bytes), Petabyte ($10^{15}$ bytes)
– Because of web
– Sensor Data, Bioinformatics, Social Computing,
smart phone, online game…
Cannot be handled with the legacy approach
Too big
Un-/Semi-structured data
Too expensive
Need new systems
Inexpensive
Two Cores in Big Data
How to store Big Data
How to compute Big Data
Google
How to store Big Data
– GFS
– On inexpensive commodity computers
How to compute Big Data
– MapReduce
– Parallel Computing with multiple inexpensive computers
• Your own supercomputer
Hadoop 1.0
Hadoop
Doug Cutting
– Creator of Hadoop
– Founder of the Apache Lucene, Nutch, Avro, and Hadoop projects
– Board member of the Apache Software Foundation
– Chief Architect at Cloudera
MapReduce
HDFS
Restricted Parallel Programming
– Not for iterative algorithms
– Not for graph processing
Emerging Big Data Technology
Giraph
Spark and Shark
Use Cases
Use Cases experienced
Spark and Shark
High Speed In-Memory Analytics over
Hadoop and Hive data
http://www.slideshare.net/Hadoop_Summit/spark-and-shark
 Fast Data Sharing
–Iterative Graph Algorithms
• Data Mining (Classification/Clustering)
–Interactive Query
Giraph
BSP (Bulk Synchronous Parallel)
Facebook
http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph
Josh Wills (Cloudera)
 “I have found that many kinds of
scientists – such as astronomers,
geneticists, and geophysicists – are
working with very large data sets in order
to build models that do not involve
statistics or machine learning, and that
these scientists encounter data
challenges that would be familiar to data
scientists at Facebook, Twitter, and
LinkedIn.”
 “Data science is a set of techniques used
by many scientists to solve problems
across a wide array of scientific fields.”
Use Cases experienced
Log Analysis
 Log files from IPS and IDS
– 1.5 GB per day for each system
 Extracting unusual cases using Hadoop, Solr,
Flume on Cloudera
Customer Behavior Analysis
Market Basket Analysis Algorithm
 Machine Learning for Image Processing
with Texas A&M
Hadoop Streaming API
 Movie Data Analysis
 Hive, Impala
Scalable, Incremental Learning
with MapReduce Parallelization for
Cell Detection in High-Resolution 3D Microscopy
Data (IJCNN 2013)
Chul Sung, Yoonsuck Choe: Brain Networks Laboratory, Computer Science and Engineering, TAMU
Jongwook Woo: Computer Information Systems, CALSTATE-LA
Matthew Goodman, Todd Huffman: 3SCAN
Motivation
Analysis of neuronal distribution in the brain
plays an important role in the diagnosis of
disorders of the brain.
E.g., Purkinje cell reduction in autism [3]
[Figure: A. Normal cerebellum (normal human brain); B. Reduction of neurons in the Purkinje cell layer (autistic human brain)]
Approach
Use a machine learning approach to
detect neurons.
Learn a binary classifier:
$f: \mathbb{R}^{3} \to \{0, 1\}$
Input: local volume data
Output: cell center (1) or off-center (0)
Requirement: Effective
Incremental Learning
Several properties are desired:
Low computational cost
Non-iterative
No accumulation of data points
No retraining
Yet, sufficient accuracy
Proposed Algorithm
Principal Components Analysis
(PCA)-based supervised learning
No need for retraining
Highly scalable due to only the
eigenvector matrices being stored
Highly parallelizable due to its
incremental nature
–We keep the eigenvectors as new training
samples are made available and
additionally use them in the testing
process.
MapReduce Parallelization
 A highly effective and popular framework for big
data analytics
 Parallel data processing tasks
Map phase - tasks are divided and results are emitted
Reduce phase - the emitted results are sorted and
consolidated
 Apache Hadoop
 Open source project of the Apache Foundation
 Storage: Hadoop Distributed File System (HDFS)
 Processing: Map/Reduce (Fault Tolerant Distributed
Processing)
Slide from Dr. Jongwook Woo’s SWRC 2013 Presentation
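The map and reduce phases described above can be sketched as a single-process simulation. This is only an illustration of the data flow, not Hadoop code: `run_mapreduce`, `word_map`, and `word_reduce` are hypothetical names, and a real cluster would run the map and reduce tasks on separate workers.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in a single process."""
    # Map phase: each record is processed independently and emits (key, value) pairs.
    emitted = []
    for record in records:
        emitted.extend(map_fn(record))
    # Shuffle/sort: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in emitted:
        groups[key].append(value)
    # Reduce phase: consolidate the values of each key, visiting keys in sorted order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}

def word_map(line):
    # Emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def word_reduce(word, counts):
    # Sum all the 1s emitted for this word.
    return sum(counts)

lines = ["big data on Hadoop", "Hadoop stores big data"]
result = run_mapreduce(lines, word_map, word_reduce)
print(result)
```

Because each call to `map_fn` touches only its own record, the map loop is the part that a framework can spread over many machines.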
Hadoop Streaming
 Hadoop MapReduce for non-Java code: Python, Ruby
 Requirement
 A running Hadoop installation
 The Hadoop Streaming API
– hadoop-streaming.jar
 Mapper and Reducer code
– Simple conversion from sequential code
 STDIN > mapper > reducer > STDOUT
Hadoop Streaming
 MapReduce Python execution
 http://wiki.apache.org/hadoop/HadoopStreaming
 Syntax
$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
Options:
  -input <path>                  DFS input file(s) for the Map step
  -output <path>                 DFS output directory for the Reduce step
  -mapper <cmd|JavaClassName>    The streaming command to run
  -reducer <cmd|JavaClassName>   The streaming command to run
  -file <file>                   File/dir to be shipped in the Job jar file
 Example
$ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
    -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
    -input /user/jwoo/shakespeare/* -output /user/jwoo/shakespeare-output
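The slides do not show the contents of mapper.py and reducer.py. A minimal sketch of what such a pair might look like for word counting is given below, assuming the usual streaming convention of one tab-separated key/value pair per line and reducer input sorted by key. The function names and the sample text are hypothetical, and the last lines stand in locally for the mapper, sort, and reducer chain.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum the counts of each word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local stand-in for: cat input | mapper | sort | reducer
text = ["To be or not to be", "that is the question"]
output = list(reducer(sorted(mapper(text))))
print("\n".join(output))
```

In an actual streaming script, the same generator would be fed `sys.stdin` and each yielded line would be printed to standard output.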
Training
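PCA is run separately on these class-specific subsets, resulting in class-specific eigenvector matrices.
[Figure: the training set $\mathbf{X}$ is split into Class 1 ($\mathbf{X}^{+}$) and Class 2 ($\mathbf{X}^{-}$); PCA on each subset yields Eigenvectors 1 ($\mathbf{V}^{+}$) and Eigenvectors 2 ($\mathbf{V}^{-}$). Inset: an input vector, with panels labeled XY, XZ, YZ.]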
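Testing
 Each data vector x is projected using the two class-specific PCA eigenvector matrices
 The class associated with the more accurate reconstruction determines the label for the new data vector
[Figure: a novel input $\mathbf{x}$ (panels xy, xz, yz) is multiplied by $\mathbf{V}^{+T}$ and $\mathbf{V}^{-T}$ to give projections $\mathbf{y}^{+}$ and $\mathbf{y}^{-}$, which are multiplied by $\mathbf{V}^{+}$ and $\mathbf{V}^{-}$ to give reconstructions $\hat{\mathbf{x}}^{+}$ and $\hat{\mathbf{x}}^{-}$; if $\lVert \mathbf{x} - \hat{\mathbf{x}}^{+} \rVert < \lVert \mathbf{x} - \hat{\mathbf{x}}^{-} \rVert$ the label is Class 1, otherwise Class 2.]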
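The testing rule can be illustrated with a small pure-Python sketch. It assumes that each eigenvector matrix is given as a list of orthonormal eigenvectors and that the data are already centered; mean handling is not described on the slide. The toy eigenvectors and inputs are hypothetical and are not values from the paper.

```python
import math

def project(V, x):
    """y = V^T x, where V is a list of orthonormal eigenvectors."""
    return [sum(v_i * x_i for v_i, x_i in zip(v, x)) for v in V]

def reconstruct(V, y):
    """x_hat = V y: a weighted sum of the eigenvectors."""
    dim = len(V[0])
    return [sum(y_k * v[i] for y_k, v in zip(y, V)) for i in range(dim)]

def recon_error(V, x):
    """Euclidean distance between x and its reconstruction from V."""
    x_hat = reconstruct(V, project(V, x))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))

def classify(x, V_plus, V_minus):
    """Return 1 if the class-1 eigenvectors reconstruct x more accurately, else 2."""
    return 1 if recon_error(V_plus, x) < recon_error(V_minus, x) else 2

# Hypothetical toy eigenvectors, one per class.
V_plus = [[1.0, 0.0, 0.0]]
V_minus = [[0.0, 1.0, 0.0]]
print(classify([0.9, 0.1, 0.0], V_plus, V_minus))
print(classify([0.2, 0.8, 0.0], V_plus, V_minus))
```

Only the eigenvector matrices are needed at test time, which is why the training samples themselves do not have to be kept.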
Reconstruction Examples
 Reconstruction of cell center and off-center data using
matching vs. non-matching eigenvector matrices
 Reconstruction accurate only with matching eigenvector matrix
 Proximity: Cell center proximity value (e.g., 1.0 is cell center
and 0.1 off-center)
MapReduce Parallelization
Our algorithm is highly
parallelizable.
To exploit this property, we
developed a MapReduce-based
implementation of the algorithm.
MapReduce Parallelization (Training)
 Parallel PCA computations of the class-specific subsets
from the training sets, generating two eigenvector matrices
per training set
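[Figure: input files (Training Set 1 to Training Set k, each split into Class 1 and Class 2) are read by workers in the map phase, which perform eigendecomposition and write the eigenvectors to output files (Set 1 $\mathbf{V}^{+}$, Set 1 $\mathbf{V}^{-}$, ..., Set k $\mathbf{V}^{+}$, Set k $\mathbf{V}^{-}$).]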
MapReduce Parallelization (Testing)
- Map
1. Prepare data vectors from all voxels in the data volume so that each one can be tested for membership in the cell-center class.
[Figure: input files (novel input splits 1 to m, and eigenvector sets 1 to k with $\mathbf{V}^{+}$ and $\mathbf{V}^{-}$) are read by map workers, which perform projection and reconstruction and write the reconstruction errors of each vector against each set to intermediate files; reduce workers average the reconstruction errors and write the averages for $\mathbf{x}_{1}, \ldots, \mathbf{x}_{n}$ to output files.]
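The reduce step of the testing job can be sketched as follows. The intermediate records are hypothetical numbers standing in for what the map workers would emit, and the final labeling by comparing the two averages is an assumption based on the testing rule, since the slide only shows the averaging.

```python
from collections import defaultdict

# Hypothetical map-phase output:
# (vector id, training set k, error with V+, error with V-).
intermediate = [
    ("x1", 1, 0.10, 0.90),
    ("x1", 2, 0.20, 0.70),
    ("x2", 1, 0.80, 0.30),
    ("x2", 2, 0.60, 0.10),
]

def reduce_average(records):
    """Average the reconstruction errors of each vector over all training sets."""
    grouped = defaultdict(list)
    for vec_id, _k, err_plus, err_minus in records:
        grouped[vec_id].append((err_plus, err_minus))
    averages = {}
    for vec_id, errs in grouped.items():
        n = len(errs)
        averages[vec_id] = (sum(e[0] for e in errs) / n,
                            sum(e[1] for e in errs) / n)
    return averages

def label(avg_plus, avg_minus):
    """Cell center (1) if the V+ reconstruction is better on average, else 0."""
    return 1 if avg_plus < avg_minus else 0

averages = reduce_average(intermediate)
labels = {vec_id: label(*avg) for vec_id, avg in averages.items()}
print(labels)
```

Keying the intermediate records by vector id lets the framework route all errors for one vector to the same reduce task.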
Results: MapReduce Performance
 Performance comparison during testing
 35 map tasks and 10 reduce tasks per job (except for case A)
 Performance was greatly improved (nearly 10 times)
 Not much gain during training
[Figure: bar chart of averages (axis from 0 to 300) for cluster configurations A to D]
Cluster Configuration
A: Single Node
B: One Master, One Slave
C: One Master, Five Slaves
D: One Master, Ten Slaves
Each computing node has two quad-core Intel Xeon X5570 CPUs and 23.00 GB of memory.
Conclusion
Developed a novel scalable incremental
learning algorithm for fast quantitative
analysis of massive, growing, sparsely
labeled data.
Our algorithm showed high accuracy
(AUC of 0.9614).
Nearly 10 times speedup in testing using MapReduce.
Expected to be broadly applicable to
the analysis of high-throughput medical
imaging data.
Use Cases in Science
Seismology
HEP
Hadoop Use Cases in Science
Reflection Seismology
Marine Seismic Survey
Sears (Retail)
Gravity (Online Publishing,
Personalized Content)
Reflection Seismology
 Reflection seismology
 A set of techniques for solving a classic inverse problem:
– given a collection of seismograms (vibration records) and associated metadata,
– generate an image of the subsurface of the Earth that generated those seismograms.
 Big Data
– A modern seismic survey
• tens of thousands of shots and multiple terabytes of trace data.
 Purpose of reflection seismology
 To locate oil and natural gas deposits.
 To identify the location of the Chicxulub Crater
– that has been linked to the extinction of the dinosaurs.
Marine Seismic Survey
Common Depth Point (CDP) Gather
 Purpose of CDP
 By comparing the time it took for the seismic waves to travel from the different source and receiver locations and experimenting with different velocity models for the waves moving through the rock,
– we can estimate the depth of the common subsurface point that the waves reflected off of.
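The idea of trying velocity models against observed travel times can be sketched with the standard textbook simplification of a single flat reflector and a constant velocity, where the two-way travel time at offset x follows t(x)^2 = t0^2 + (x/v)^2 and the depth is v * t0 / 2. This is not code from Seismic Hadoop; the function names, offsets, and candidate velocities are hypothetical.

```python
import math

def reflector_depth(t0_seconds, velocity_m_s):
    """Depth of a flat reflector from the zero-offset two-way travel time."""
    return velocity_m_s * t0_seconds / 2.0

def travel_time(offset_m, t0_seconds, velocity_m_s):
    """Two-way travel time at a given source-receiver offset."""
    return math.sqrt(t0_seconds ** 2 + (offset_m / velocity_m_s) ** 2)

def best_velocity(observations, t0_seconds, candidates):
    """Pick the candidate velocity whose predicted times best fit the observed ones."""
    def misfit(v):
        return sum((travel_time(x, t0_seconds, v) - t) ** 2 for x, t in observations)
    return min(candidates, key=misfit)

# Hypothetical traces sharing one CDP: (offset in m, observed travel time in s),
# generated here with t0 = 1.0 s and v = 2000 m/s.
true_v, t0 = 2000.0, 1.0
observations = [(x, travel_time(x, t0, true_v)) for x in (0.0, 500.0, 1000.0, 1500.0)]
v = best_velocity(observations, t0, [1500.0, 2000.0, 2500.0])
print(v, reflector_depth(t0, v))
```

Each CDP gather can be fitted independently of the others, which is what makes this kind of estimate a natural fit for parallel processing.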
Reflection Seismology and Hadoop
 By aggregating a large number of these estimates,
 we can construct a complete image of the subsurface.
 As we increase the density and the number of traces,
– we can create higher quality images
• that improve our understanding of the subsurface geology
[Figure: A 3D seismic image of Japan’s southeastern margin]
Reflection Seismology and Hadoop
(Legacy Seismic Data Processing)
Geophysicists
Used the first Cray supercomputers
–as well as the massively parallel Connection Machine.
Parallel Computing
–Users must file a request to move the data into active storage
• and then consume precious cluster resources in order to process the data.
Reflection Seismology and Hadoop
(Legacy Seismic Data Processing)
Open-source software tools in seismic data processing
The Seismic Unix project
–from the Colorado School of Mines
SEPlib
–from Stanford University
SeisSpace
–commercial toolkit for seismic data
processing.
• Built on top of an open source foundation,
the JavaSeis project.
Emergence of Apache Hadoop for Seismology
 Seismic Hadoop by Cloudera
 Data Intensive Computing
– store and process seismic data in a Hadoop cluster.
• Makes it possible to export many of the most I/O-intensive steps of seismic data processing into the Hadoop cluster
 Combines Seismic Unix with Crunch,
– the Java library for creating MapReduce pipelines.
 Seismic Unix
– makes extensive use of Unix pipes in order to construct complex data processing tasks from a set of simple procedures
sufilter f=10,20,30,40 | suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | susort cdp gx
 This Seismic Unix pipeline
– first applies a filter to the trace data,
– then edits some metadata associated with each trace,
– and finally sorts the traces by the metadata just edited
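The header-editing and sorting steps of this pipeline can be mimicked on toy trace headers. The sketch assumes that suchw sets key1 = (b*key2 + c*key3) / d with the additive term left at zero, that the two key lists are applied left to right so the second update sees the new gx, and that header values are integers. The filtering step is left out because it changes samples rather than headers, and the trace values are hypothetical.

```python
def suchw(trace, key1, key2, key3, b=1, c=0, d=1):
    """Return a copy of the trace header with key1 = (b*key2 + c*key3) / d."""
    updated = dict(trace)
    updated[key1] = (b * trace[key2] + c * trace[key3]) // d
    return updated

def pipeline(traces):
    edited = []
    for tr in traces:
        # gx = offset + sx (receiver position from source position and offset)
        tr = suchw(tr, "gx", "offset", "sx", b=1, c=1, d=1)
        # cdp = (gx + sx) / 2 (midpoint between source and receiver)
        tr = suchw(tr, "cdp", "gx", "sx", b=1, c=1, d=2)
        edited.append(tr)
    # Sort by cdp first and gx second, like "susort cdp gx".
    return sorted(edited, key=lambda tr: (tr["cdp"], tr["gx"]))

# Hypothetical trace headers: source position sx and source-receiver offset.
traces = [
    {"sx": 200, "offset": 400},
    {"sx": 0, "offset": 200},
    {"sx": 0, "offset": 400},
    {"sx": 200, "offset": 200},
]
result = pipeline(traces)
for tr in result:
    print(tr)
```

Chaining small single-purpose steps in this way is what allows the same stages to be expressed as a MapReduce pipeline.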
What is HEP?
High Energy Physics
 Definition:
 Involves colliding highly energetic, common particles together
– in order to create small, exotic, and incredibly short-lived
particles.
Large Hadron Collider
Collides protons together at an energy of 7 TeV per particle.
 Protons travel around the rings and are collided inside particle detectors.
 Collisions occur every 25 nanoseconds.
Compact Muon Solenoid
 Big Data
 Collisions at a rate of 40 MHz
– Each collision has about 1 MB worth of data.
 40 MHz x 1 MB = 40 TB/s = 320 Tbps
– (unmanageable amount)
 A complex custom compute system (called the trigger) cuts the collision rate down to about 300 Hz, so that only statistically significant data are kept.
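The data-rate figures on this slide can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 MB = 10^6 bytes) and that an event kept by the trigger is still about 1 MB; the variable names are illustrative.

```python
collision_rate_hz = 40e6     # 40 MHz collision rate
event_size_bytes = 1e6       # about 1 MB per collision (decimal megabyte assumed)
trigger_rate_hz = 300        # rate kept after the trigger

raw_bytes_per_s = collision_rate_hz * event_size_bytes
raw_terabits_per_s = raw_bytes_per_s * 8 / 1e12
kept_bytes_per_s = trigger_rate_hz * event_size_bytes
reduction_factor = collision_rate_hz / trigger_rate_hz

print(f"raw: {raw_bytes_per_s / 1e12:.0f} TB/s = {raw_terabits_per_s:.0f} Tbps")
print(f"after trigger: {kept_bytes_per_s / 1e6:.0f} MB/s")
print(f"reduction: about {reduction_factor:,.0f}x")
```

Under these assumptions the trigger brings the stream from tens of terabytes per second down to a few hundred megabytes per second.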
From Raw Data to Significant
[Figure: data flow from Raw Sensor Data to Reconstructed Data, Analysis-oriented Data, and Physicist-specific N-tuples; sizes shown: 1 MB, 110 KB, 1 KB; tiers shown: Tier 1 (at CERN), Tier 2 (Big Data)]
Characteristics of Tier 2
 Need Hadoop
 Large amount of data (400 TB)
– Large data rate (in the range of 10Gbps) to analyze
 Need for reliability, but not archival storage
 Proper use of resources
 Need for interoperability
HDFS Structure
[Figure: HDFS, mounted with FUSE on the worker nodes; SRM (generic web-services interface); Globus GridFTP (standard grid protocol for WAN transfers)]
• FUSE
• allows physicists’ C++ applications to access HDFS
without modification.
• Two grid components
• allow interoperation with non-Hadoop sites.
MapReduce 1.0 Cons and Future
 Bad for
 Fast response time
 Large amount of shared data
 Fine-grained synch needed
 CPU-intensive not data-intensive
 Continuous input stream
 Hadoop 2.0: YARN
product
Hadoop 2.0: YARN
 Data processing applications and services
 Online Serving – HOYA (HBase on YARN)
 Real-time event processing – Storm, S4, other commercial
platforms
 Tez – Generic framework to run a complex DAG
 MPI: OpenMPI, MPICH2
 Master-Worker
 Machine Learning: Spark
 Graph processing: Giraph
 Enabled by allowing the use of paradigm-specific application
master
 [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
Training in Big Data
 Learn by yourself?
You may miss many important topics
Cloudera
With hands-on exercises
 Cloudera courses
Hadoop Developer
Hadoop System Administrator
Hadoop Data Analyst/Scientist
Conclusion
Era of Big Data
Need to store and compute Big Data
Many solutions but Hadoop
Hadoop is a supercomputer that you
can own
Hadoop 2.0
Training is important
Questions?

Contenu connexe

Tendances

IDs书友会 - 主题1 - Swinburne Next Generation Research
IDs书友会 - 主题1 - Swinburne Next Generation Research IDs书友会 - 主题1 - Swinburne Next Generation Research
IDs书友会 - 主题1 - Swinburne Next Generation Research
IDs Club 澳洲互联网俱乐部
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
Steve Watt
 

Tendances (14)

IDs书友会 - 主题1 - Swinburne Next Generation Research
IDs书友会 - 主题1 - Swinburne Next Generation Research IDs书友会 - 主题1 - Swinburne Next Generation Research
IDs书友会 - 主题1 - Swinburne Next Generation Research
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
Data storage in Cloud computing
Data storage in Cloud computingData storage in Cloud computing
Data storage in Cloud computing
 
Using the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science ResearchUsing the Open Science Data Cloud for Data Science Research
Using the Open Science Data Cloud for Data Science Research
 
Advanced Cyberinfrastructure Enabled Services and Applications in 2021
Advanced Cyberinfrastructure Enabled Services and Applications in 2021Advanced Cyberinfrastructure Enabled Services and Applications in 2021
Advanced Cyberinfrastructure Enabled Services and Applications in 2021
 
Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Tech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big DataTech4Africa - Opportunities around Big Data
Tech4Africa - Opportunities around Big Data
 
MIT Deep Learning Basics: Introduction and Overview by Lex Fridman
MIT Deep Learning Basics: Introduction and Overview by Lex FridmanMIT Deep Learning Basics: Introduction and Overview by Lex Fridman
MIT Deep Learning Basics: Introduction and Overview by Lex Fridman
 
Steve Watt Presentation
Steve Watt PresentationSteve Watt Presentation
Steve Watt Presentation
 
Top 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
Top 8 Data Science Tools | Open Source Tools for Data Scientists | EdurekaTop 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
Top 8 Data Science Tools | Open Source Tools for Data Scientists | Edureka
 
Digital Science: Towards the executable paper
Digital Science: Towards the executable paperDigital Science: Towards the executable paper
Digital Science: Towards the executable paper
 
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre ArrayScalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array
 
IPython Notebooks - Hacia los papers ejecutables
IPython Notebooks - Hacia los papers ejecutablesIPython Notebooks - Hacia los papers ejecutables
IPython Notebooks - Hacia los papers ejecutables
 
Creating a Big Data Machine Learning Platform in California
Creating a Big Data Machine Learning Platform in CaliforniaCreating a Big Data Machine Learning Platform in California
Creating a Big Data Machine Learning Platform in California
 

Similaire à Big Data and Advanced Data Intensive Computing

Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and Training
Jongwook Woo
 

Similaire à Big Data and Advanced Data Intensive Computing (20)

Introduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using HadoopIntroduction To Big Data and Use Cases using Hadoop
Introduction To Big Data and Use Cases using Hadoop
 
Big Data Platform adopting Spark and Use Cases with Open Data
Big Data  Platform adopting Spark and Use Cases with Open DataBig Data  Platform adopting Spark and Use Cases with Open Data
Big Data Platform adopting Spark and Use Cases with Open Data
 
Big Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use CasesBig Data and Data Intensive Computing: Use Cases
Big Data and Data Intensive Computing: Use Cases
 
Introduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on HadoopIntroduction To Big Data and Use Cases on Hadoop
Introduction To Big Data and Use Cases on Hadoop
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and Training
 
Big Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on NetworksBig Data and Data Intensive Computing on Networks
Big Data and Data Intensive Computing on Networks
 
AI on Big Data
AI on Big DataAI on Big Data
AI on Big Data
 
Big Data Trend with Open Platform
Big Data Trend with Open PlatformBig Data Trend with Open Platform
Big Data Trend with Open Platform
 
Chek mate geolocation analyzer
Chek mate geolocation analyzerChek mate geolocation analyzer
Chek mate geolocation analyzer
 
Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data Introduction to Spark: Data Analysis and Use Cases in Big Data
Introduction to Spark: Data Analysis and Use Cases in Big Data
 
Introduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryIntroduction to Big Data: Smart Factory
Introduction to Big Data: Smart Factory
 
Big Data Trend and Open Data
Big Data Trend and Open DataBig Data Trend and Open Data
Big Data Trend and Open Data
 
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the EcosystemsIntroduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
Introduction to Big Data, MapReduce, its Use Cases, and the Ecosystems
 
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
Special talk: Introduction to Big Data and FinTech at Financial Supervisory S...
 
Big Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and TrainingBig Data and Data Intensive Computing: Education and Training
Big Data and Data Intensive Computing: Education and Training
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
On Big Data
On Big DataOn Big Data
On Big Data
 
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark MLPredictive Analysis of Financial Fraud Detection using Azure and Spark ML
Predictive Analysis of Financial Fraud Detection using Azure and Spark ML
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Big Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure MLBig Data Analysis in Hydrogen Station using Spark and Azure ML
Big Data Analysis in Hydrogen Station using Spark and Azure ML
 

Plus de Jongwook Woo

Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015
Jongwook Woo
 

Plus de Jongwook Woo (17)

Machine Learning in Quantum Computing
Machine Learning in Quantum ComputingMachine Learning in Quantum Computing
Machine Learning in Quantum Computing
 
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost PlatformsComparing Scalable Predictive Analysis using Spark XGBoost Platforms
Comparing Scalable Predictive Analysis using Spark XGBoost Platforms
 
Scalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AIScalable Predictive Analysis and The Trend with Big Data & AI
Scalable Predictive Analysis and The Trend with Big Data & AI
 
Introduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionIntroduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and Prediction
 
Introduction to Big Data and its Trends
Introduction to Big Data and its TrendsIntroduction to Big Data and its Trends
Introduction to Big Data and its Trends
 
Rating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and SparkRating Prediction using Deep Learning and Spark
Rating Prediction using Deep Learning and Spark
 
History and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep LearningHistory and Trend of Big Data and Deep Learning
History and Trend of Big Data and Deep Learning
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Traffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big DataTraffic Data Analysis and Prediction using Big Data
Traffic Data Analysis and Prediction using Big Data
 
Big Data and Predictive Analysis
Big Data and Predictive AnalysisBig Data and Predictive Analysis
Big Data and Predictive Analysis
 
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon SungjaeWhose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
Whose tombs are so called Nakrang tombs in Pyungyang? By Moon Sungjae
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and SparkAlphago vs Lee Se-Dol: Tweeter Analysis using Hadoop and Spark
Alphago vs Lee Se-Dol : Tweeter Analysis using Hadoop and Spark
 
Big Data Analysis and Industrial Approach using Spark
Big Data Analysis and Industrial Approach using SparkBig Data Analysis and Industrial Approach using Spark
Big Data Analysis and Industrial Approach using Spark
 
Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015Spark tutorial @ KCC 2015
Spark tutorial @ KCC 2015
 
Introduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use CasesIntroduction to Hadoop, Big Data, Training, Use Cases
Introduction to Hadoop, Big Data, Training, Use Cases
 
2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul2014 International Software Testing Conference in Seoul
2014 International Software Testing Conference in Seoul
 


Big Data and Advanced Data Intensive Computing

  • 1. jwoo Woo HiPIC CSULA Big Data and Advanced Data Intensive Computing Yonsei University Shin-Chon, Korea June 18th 2014 Jongwook Woo (PhD) High-Performance Information Computing Center (HiPIC) Cloudera Academic Partner and Grants Awardee of Amazon AWS Computer Information Systems Department California State University, Los Angeles
  • 2. Contents
      – Introduction
      – Emerging Big Data Technology
      – Big Data Use Cases
      – Hadoop 2.0
      – Training in Big Data
  • 3. Me
      – Name: Jongwook Woo
      – Occupation: Professor (rank: Associate Professor), California State University, Los Angeles
          – Capital City of Entertainment
      – Career:
          – Professor since 2002: Computer Information Systems Dept, College of Business and Economics (www.calstatela.edu/faculty/jwoo5)
          – Consulting for many companies in and around Hollywood since 1998
              – Mainly building eBusiness applications using J2EE middleware
              – Information extraction and integration using the FAST, Lucene/Solr, and Sphinx search engines
              – Warner Bros (Matrix online game), E!, citysearch.com, ARM, and others
          – Interested in Hadoop and Big Data since around 2009
  • 4. Experience in Big Data
      – Grants
          – Received Microsoft Windows Azure Educator Grant (Oct 2013 - July 2014)
          – Received Amazon AWS in Education Research Grant (July 2012 - July 2014)
          – Received Amazon AWS in Education Coursework Grants (July 2012 - July 2013, Jan 2011 - Dec 2011)
      – Partnership
          – Academic Education Partnership with Cloudera since June 2012
          – Linked with Hortonworks since May 2013
              – Positive about providing a partnership
  • 5. Experience in Big Data
      – Certificate
          – Certified Cloudera Instructor
          – Certified Cloudera Hadoop Developer / Administrator
          – Certificate of Achievement in the Big Data University Training Course, “Hadoop Fundamentals I” (July 8 2012)
          – Certificate of 10gen Training Course, “M101: MongoDB Development” (Dec 24 2012)
      – Blog and GitHub for Hadoop and its ecosystems
          – http://dal-cloudcomputing.blogspot.com/ (Hadoop, AWS, Cloudera)
          – https://github.com/hipic (Hadoop, Cloudera, Solr on Cloudera, Hadoop Streaming, RHadoop)
          – https://github.com/dalgual
  • 6. Experience in Big Data
      – Several publications regarding Hadoop and NoSQL
          – Deeksha Lakshmi, Iksuk Kim, Jongwook Woo, “Analysis of MovieLens Data Set using Hive”, Journal of Science and Technology, Dec 2013, Vol 3, No 12, pp. 1194-1198, ARPN
          – Chul Sung, Jongwook Woo, Matthew Goodman, Todd Huffman, and Yoonsuck Choe, “Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data”, Proceedings of the International Joint Conference on Neural Networks, 2013
          – Jongwook Woo, “Apriori-Map/Reduce Algorithm”, PDPTA 2012, Las Vegas (July 16-19, 2012)
          – Jongwook Woo, Siddharth Basopia, Yuhang Xu, Seon Ho Kim, “Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop”, EDB 2012, Incheon, Aug. 25-27, 2011
          – Jongwook Woo and Yuhang Xu, “Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing”, PDPTA 2011, Las Vegas (July 18-21, 2011)
      – Collaboration with universities and companies
          – USC, Texas A&M, Cloudera, Amazon, Microsoft
  • 7. What is Big Data, Map/Reduce, Hadoop, NoSQL DB on Cloud Computing
  • 8. Data
      – Google: “We don’t have a better algorithm than others but we have more data than others”
  • 9. New Data Trend
      – Sparsity
      – Unstructured
          – Schema-free data with sparse attributes (semantic or social relations)
          – No relational property, nor complex join queries (e.g. log data)
      – Immutable
          – No need to update and delete data
  • 10. Data Issues
      – Large-scale data
          – Terabyte (10^12 bytes), petabyte (10^15 bytes)
          – Because of the web
          – Sensor data, bioinformatics, social computing, smart phones, online games…
      – Cannot be handled with the legacy approach
          – Too big
          – Un-/semi-structured data
          – Too expensive
      – Need new systems
          – Inexpensive
  • 11. Two Cores in Big Data
      – How to store Big Data
      – How to compute Big Data
      – Google
          – How to store Big Data: GFS, on inexpensive commodity computers
          – How to compute Big Data: MapReduce, parallel computing with multiple inexpensive computers
              – Own supercomputers
  • 12. Hadoop 1.0
      – Hadoop
          – Doug Cutting
              – Creator of Hadoop
              – Founder of the Apache Lucene, Nutch, Avro, and Hadoop projects
              – Board member of the Apache Software Foundation
              – Chief Architect at Cloudera
          – MapReduce
          – HDFS
      – Restricted parallel programming
          – Not for iterative algorithms
          – Not for graphs
  • 13. Emerging Big Data Technology
      – Giraph
      – Spark and Shark
      – Use Cases
      – Use Cases experienced
  • 14. Spark and Shark
      – High-speed in-memory analytics over Hadoop and Hive data
      – http://www.slideshare.net/Hadoop_Summit/spark-and-shark
      – Fast data sharing
          – Iterative graph algorithms
              – Data mining (classification/clustering)
          – Interactive query
  • 15. Giraph
      – BSP
      – Facebook
      – http://www.slideshare.net/aladagemre/a-talk-on-apache-giraph
  • 16. Josh Wills (Cloudera)
      – “I have found that many kinds of scientists – such as astronomers, geneticists, and geophysicists – are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.”
      – “Data science is a set of techniques used by many scientists to solve problems across a wide array of scientific fields.”
  • 17. Use Cases experienced
      – Log analysis
          – Log files from IPS and IDS: 1.5GB per day for each system
          – Extracting unusual cases using Hadoop, Solr, Flume on Cloudera
      – Customer behavior analysis
          – Market Basket Analysis algorithm
      – Machine learning for image processing with Texas A&M
          – Hadoop Streaming API
      – Movie data analysis
          – Hive, Impala
  • 18. Scalable, Incremental Learning with MapReduce Parallelization for Cell Detection in High-Resolution 3D Microscopy Data (IJCNN 2013)
      – Chul Sung, Yoonsuck Choe: Brain Networks Laboratory, Computer Science and Engineering, TAMU
      – Jongwook Woo: Computer Information Systems, CALSTATE-LA
      – Matthew Goodman, Todd Huffman: 3SCAN
  • 19. Motivation
      – Analysis of neuronal distribution in the brain plays an important role in the diagnosis of disorders of the brain.
      – E.g., Purkinje cell reduction in autism [3]
      – [Figure: A. Normal cerebellum (normal human brain); B. Reduction of neurons in the Purkinje cell layer (autistic human brain)]
  • 20. Approach
      – Use a machine learning approach to detect neurons.
      – Learn a binary classifier f: R^3 → {0, 1}
          – Input: local volume data
          – Output: cell center (1) or off-center (0)
  • 21. Requirement: Effective Incremental Learning
      – Several properties are desired:
          – Low computational cost
          – Non-iterative
          – No accumulation of data points
          – No retraining
          – Yet, sufficient accuracy
  • 22. Proposed Algorithm
      – Principal Components Analysis (PCA)-based supervised learning
          – No need for retraining
          – Highly scalable, because only the eigenvector matrices are stored
          – Highly parallelizable due to its incremental nature
              – We keep the eigenvectors as new training samples are made available and additionally use them in the testing process.
  • 23. MapReduce Parallelization
      – A highly effective and popular framework for big data analytics
      – Parallel data processing tasks
          – Map phase: tasks are divided and results are emitted
          – Reduce phase: the emitted results are sorted and consolidated
      – Apache Hadoop
          – Open source project of the Apache Foundation
          – Storage: Hadoop Distributed File System (HDFS)
          – Processing: Map/Reduce (fault-tolerant distributed processing)
      – Slide from Dr. Jongwook Woo’s SWRC 2013 presentation
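The two phases can be illustrated with a small in-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `run_job`) are hypothetical, and word counting is chosen only as a familiar example of emitting and consolidating key/value pairs.

```python
from collections import defaultdict

def map_phase(record):
    """Map: split one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle_and_sort(pairs):
    """Group emitted values by key and order the keys (done by the framework in a real job)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reduce: consolidate all values emitted for one key."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle/sort, and reduce sequentially over a list of input lines."""
    emitted = [pair for record in records for pair in map_phase(record)]
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(emitted)]
```

In a real cluster the map calls run on many workers in parallel and the grouping step moves data across the network; here everything runs in one process to keep the data flow visible.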
  • 24. Hadoop Streaming
      – Hadoop MapReduce for non-Java code: Python, Ruby
      – Requirements
          – A running Hadoop installation
          – The Hadoop Streaming API: hadoop-streaming.jar
          – Mapper and Reducer code
              – Simple conversion from sequential code
      – STDIN > mapper > reducer > STDOUT
  • 25. Hadoop Streaming
      – MapReduce Python execution
          – http://wiki.apache.org/hadoop/HadoopStreaming
      – Syntax
          $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
      – Options:
          -input <path>                    DFS input file(s) for the Map step
          -output <path>                   DFS output directory for the Reduce step
          -mapper <cmd|JavaClassName>      The streaming command to run
          -reducer <cmd|JavaClassName>     The streaming command to run
          -file <file>                     File/dir to be shipped in the Job jar file
      – Example
          $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
              -file /home/jwoo/mapper.py -mapper /home/jwoo/mapper.py \
              -file /home/jwoo/reducer.py -reducer /home/jwoo/reducer.py \
              -input /user/jwoo/shakespeare/* \
              -output /user/jwoo/shakespeare-output
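The slides do not show the contents of mapper.py and reducer.py. A minimal sketch of what such scripts might contain, assuming a word count over the Shakespeare input, is given below. The function names are hypothetical; the only Streaming conventions relied on are tab-separated key/value lines and reducer input sorted by key.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; assumes the lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

As standalone scripts, mapper.py would print each line of `mapper(sys.stdin)` and reducer.py each line of `reducer(sys.stdin)`. The same pair can be checked locally without Hadoop by piping a text file through the mapper, `sort`, and the reducer.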
  • 26. Training
      – PCA is run separately on these class-specific subsets, resulting in class-specific eigenvector matrices.
      – [Diagram: the training set X is split into Class 1 (X+) and Class 2 (X−); PCA on each subset yields Eigenvectors 1 (V+) and Eigenvectors 2 (V−). An input vector is shown with the labels XY, XZ, YZ.]
  • 27. Testing
      – Each data vector x is projected using the two class-specific PCA eigenvector matrices
      – The class associated with the more accurate reconstruction determines the label for the new data vector
      – [Diagram: a novel input x (xy, xz, yz) is multiplied by the transposed eigenvector matrices V+ᵀ and V−ᵀ to give the projections y+ and y−, which are multiplied by V+ and V− to give the reconstructions x+ and x−. If ║x − x+║ < ║x − x−║ the label is Class 1, otherwise Class 2.]
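The testing rule on this slide can be written out as follows. This is a sketch under the assumption that the columns of V+ and V− are the retained eigenvectors of each class; the reconstructions are written with a hat here, and any mean-centering step is omitted because the slides do not mention it.

```latex
\begin{align*}
y^{+} &= V^{+\top} x, & \hat{x}^{+} &= V^{+} y^{+},\\
y^{-} &= V^{-\top} x, & \hat{x}^{-} &= V^{-} y^{-},\\
\operatorname{label}(x) &=
\begin{cases}
\text{Class 1}, & \lVert x - \hat{x}^{+} \rVert < \lVert x - \hat{x}^{-} \rVert,\\
\text{Class 2}, & \text{otherwise.}
\end{cases}
\end{align*}
```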
  • 28. Reconstruction Examples
      – Reconstruction of cell center and off-center data using matching vs. non-matching eigenvector matrices
      – Reconstruction is accurate only with the matching eigenvector matrix
      – Proximity: cell center proximity value (e.g., 1.0 is cell center and 0.1 off-center)
  • 29. MapReduce Parallelization
      – Our algorithm is highly parallelizable.
      – To exploit this property, we developed a MapReduce-based implementation of the algorithm.
  • 30. MapReduce Parallelization (Training)
      – Parallel PCA computations of the class-specific subsets from the training sets, generating two eigenvector matrices per training set
      – [Diagram: the input files (Training Set 1 to Training Set k, each split into Class 1 and Class 2) are read by workers in the map phase, which perform the eigendecomposition and write the eigenvectors (Set 1 V+, Set 1 V−, ..., Set k V+, Set k V−) as output files.]
  • 31. MapReduce Parallelization (Testing) - Map
      – 1. Prepare data vectors from all voxels in the data volume, in order to test whether each data vector belongs to the cell-center class.
      – [Diagram: in the map phase, workers read the eigenvector files (Set 1 V+, Set 1 V−, ..., Set k V+, Set k V−) and splits 1 to m of the novel input, perform projection and reconstruction, and write the reconstruction errors (for x1, the differences between x1 and its + and − reconstructions under each of the sets 1 to k) as intermediate files. In the reduce phase, workers average the reconstruction errors and write the averages for x1 to xn as output files.]
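The reduce step can be sketched as below. The record layout `(input_id, class_sign, error)` and the function names are hypothetical, and the sketch assumes that errors are averaged separately for the + and − eigenvector matrices before the two averages are compared, which the slide suggests but does not state explicitly.

```python
from collections import defaultdict
from statistics import mean

def reduce_errors(emitted):
    """Average reconstruction errors per (input id, class) over all eigenvector sets.

    `emitted` holds hypothetical map-phase records of the form
    (input_id, class_sign, error), one per training set and class.
    """
    grouped = defaultdict(list)
    for input_id, class_sign, error in emitted:
        grouped[(input_id, class_sign)].append(error)
    return {key: mean(errors) for key, errors in grouped.items()}

def label(averages, input_id):
    """Return 1 (cell center) when the '+' eigenvectors reconstruct better, else 0."""
    return 1 if averages[(input_id, "+")] < averages[(input_id, "-")] else 0
```

With k training sets, each input vector contributes 2k error records, so the reducer only ever holds a short list per key.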
  • 32. Results: MapReduce Performance
      – Performance comparison during testing
          – 35 map tasks and 10 reduce tasks per job (except for case A)
          – Performance was greatly improved (nearly 10 times)
          – Not much gain during training
      – Cluster configuration
          – A: Single node
          – B: One master, one slave
          – C: One master, five slaves
          – D: One master, ten slaves
      – Each compute node has 2x quad-core Intel Xeon X5570 CPUs and 23.00 GB of memory.
      – [Chart: average for cluster configurations A to D; vertical axis from 0 to 300]
  • 33. Conclusion
      – Developed a novel scalable incremental learning algorithm for fast quantitative analysis of massive, growing, sparsely labeled data.
      – Our algorithm showed high accuracy (AUC of 0.9614).
      – 10 times speedup using MapReduce.
      – Expected to be broadly applicable to the analysis of high-throughput medical imaging data.
  • 34. Use Cases in Science
      – Seismology
      – HEP
  • 35. Hadoop Use Cases in Science
      – Reflection Seismology
          – Marine Seismic Survey
      – Sears (Retail)
      – Gravity (Online Publishing, Personalized Content)
  • 36. Reflection Seismology
      – Reflection seismology
          – A set of techniques for solving a classic inverse problem:
              – given a collection of seismograms (vibration records) and associated metadata,
              – generate an image of the subsurface of the Earth that generated those seismograms.
          – Big Data
              – A modern seismic survey: tens of thousands of shots and multiple terabytes of trace data.
      – Purpose of reflection seismology
          – To locate oil and natural gas deposits.
          – To identify the location of the Chicxulub Crater, which has been linked to the extinction of the dinosaurs.
  • 37. Marine Seismic Survey
  • 38. Common Depth Point (CDP) Gather
      – Common Depth Point (CDP)
      – Purpose of the CDP
          – By comparing the time it took for the seismic waves to travel between the different source and receiver locations, and experimenting with different velocity models for the waves moving through the rock,
              – we can estimate the depth of the common subsurface point that the waves reflected off of.
  • 39. Reflection Seismology and Hadoop
      – By aggregating a large number of these estimates,
          – we can construct a complete image of the subsurface.
      – As we increase the density and the number of traces,
          – we create higher-quality images
              – that improve our understanding of the subsurface geology
      – [Figure: a 3D seismic image of Japan’s southeastern margin]
  • 40. Reflection Seismology and Hadoop (Legacy Seismic Data Processing)
      – Geophysicists
          – Used the first Cray supercomputers
              – as well as the massively parallel Connection Machine.
      – Parallel computing
          – Must file a request to move the data into active storage,
              – then consume precious cluster resources in order to process the data.
  • 41. Reflection Seismology and Hadoop (Legacy Seismic Data Processing)
      – Open-source software tools in seismic data processing
          – The Seismic Unix project
              – from the Colorado School of Mines
          – SEPlib
              – from Stanford University
          – SeisSpace
              – commercial toolkit for seismic data processing
              – built on top of an open source foundation, the JavaSeis project
  • 42. Emergence of Apache Hadoop for Seismology
      – Seismic Hadoop by Cloudera
          – Data-intensive computing
              – Store and process seismic data in a Hadoop cluster.
              – Makes it possible to move many of the most I/O-intensive steps of seismic data processing into the Hadoop cluster
          – Combines Seismic Unix with Crunch,
              – the Java library for creating MapReduce pipelines.
      – Seismic Unix
          – Makes extensive use of Unix pipes in order to construct complex data processing tasks from a set of simple procedures
              sufilter f=10,20,30,40 | suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | susort cdp gx
      – A pipeline in Seismic Unix
          – first applies a filter to the trace data,
          – then edits some metadata associated with each trace,
          – and finally sorts the traces by the metadata that was just edited
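The header-editing and sorting stages of this pipeline can be mimicked in plain Python to show what the parameters do. This is only an illustration: traces are represented as hypothetical dicts of header values, the `sufilter` stage is left out because it acts on the samples rather than the headers, and integer division is an assumption about how the header values are stored. The arithmetic follows the documented `suchw` rule key1 = (a + b*key2 + c*key3) / d.

```python
def set_header(traces, key1, key2, key3, a=0, b=1, c=1, d=1):
    """Sketch of suchw-style header arithmetic: key1 = (a + b*key2 + c*key3) / d."""
    for trace in traces:
        updated = dict(trace)
        updated[key1] = (a + b * trace[key2] + c * trace[key3]) // d
        yield updated

def sort_traces(traces, *keys):
    """Sketch of susort: order traces by the given header keys."""
    return sorted(traces, key=lambda trace: tuple(trace[k] for k in keys))

def pipeline(traces):
    """Chain the stages the way the Unix pipe does."""
    # gx = offset + sx (receiver position), then cdp = (gx + sx) / 2 (midpoint)
    with_gx = set_header(traces, "gx", "offset", "sx")
    with_cdp = set_header(with_gx, "cdp", "gx", "sx", d=2)
    return sort_traces(with_cdp, "cdp", "gx")
```

Each stage consumes the output of the previous one, which is the property that lets such a chain be mapped onto MapReduce stages.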
  • 43. What is HEP?
      – High Energy Physics
      – Definition:
          – Involves colliding highly energetic, common particles together
              – in order to create small, exotic, and incredibly short-lived particles.
  • 44. Large Hadron Collider
      – Collides protons together at an energy of 7 TeV per particle.
          – Protons travel around the rings and are collided inside particle detectors.
          – Collisions occur every 25 nanoseconds.
  • 45. Compact Muon Solenoid
      – Big Data
          – Collisions at a rate of 40 MHz
              – Each collision has about 1 MB worth of data.
          – 40 MHz x 1 MB = 40 TB/s, i.e. 320 Tbps
              – (an unmanageable amount)
          – A complex custom compute system (called the trigger) cuts the collision rate down to about 300 Hz, which means that the significant data are determined statistically.
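The data-rate arithmetic on this slide can be checked with a few lines of Python. The sketch assumes 1 MB = 10^6 bytes; the constant and function names are made up for the example, and the rate after the trigger is derived from the slide's 300 Hz figure rather than quoted from it.

```python
COLLISION_RATE_HZ = 40e6   # 40 MHz, from the slide
EVENT_SIZE_BYTES = 1e6     # about 1 MB per collision, taken here as 10**6 bytes
TRIGGER_RATE_HZ = 300      # rate after the trigger, from the slide

def bits_per_second(rate_hz, event_bytes):
    """Data rate in bits per second for a given event rate and event size."""
    return rate_hz * event_bytes * 8

raw_tbps = bits_per_second(COLLISION_RATE_HZ, EVENT_SIZE_BYTES) / 1e12
triggered_gbps = bits_per_second(TRIGGER_RATE_HZ, EVENT_SIZE_BYTES) / 1e9
```

Under these assumptions `raw_tbps` comes out at 320 and `triggered_gbps` at 2.4, so the trigger reduces the stream by a factor of more than 100,000.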
  • 46. From Raw Data to Significant
      – [Diagram: Raw Sensor Data, Reconstructed Data, Analysis-oriented Data, Physicist-specific N-tuples; sizes shown: 1MB, 110KB, 1KB; tiers shown: Tier 1 (at CERN), Tier 2 (Big Data)]
  • 47. Characteristics of Tier 2
      – Need Hadoop
          – Large amount of data (400 TB)
              – Large data rate (in the range of 10 Gbps) to analyze
          – Need for reliability, but not archival storage
          – Proper use of resources
          – Need for interoperability
  • 48. HDFS Structure
      – [Diagram: HDFS, mounted with FUSE on the worker nodes; SRM (generic web-services interface); Globus GridFTP (standard grid protocol for WAN transfers)]
      – FUSE
          – allows physicists’ C++ applications to access HDFS without modification.
      – Two grid components
          – allow interoperation with non-Hadoop sites.
  • 49. MapReduce 1.0 Cons and Future
      – Bad for
          – Fast response time
          – Large amount of shared data
          – Fine-grained synchronization needed
          – CPU-intensive, not data-intensive, workloads
          – Continuous input stream
      – Hadoop 2.0: YARN product
  • 50. Hadoop 2.0: YARN
      – Data processing applications and services
          – Online serving: HOYA (HBase on YARN)
          – Real-time event processing: Storm, S4, other commercial platforms
          – Tez: generic framework to run a complex DAG
          – MPI: OpenMPI, MPICH2
          – Master-Worker
          – Machine learning: Spark
          – Graph processing: Giraph
          – Enabled by allowing the use of a paradigm-specific application master
      – [http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex]
  • 51. Training in Big Data
      – Learn by yourself?
          – You miss many important topics
      – Cloudera
          – With hands-on exercises
      – Cloudera courses
          – Hadoop Developer
          – Hadoop System Administrator
          – Hadoop Data Analyst/Scientist
  • 52. Conclusion
      – Era of Big Data
      – Need to store and compute Big Data
      – Many solutions, but Hadoop
      – Hadoop is a supercomputer that you can own
      – Hadoop 2.0
      – Training is important
  • 53. Questions?