The document discusses accelerating analytics on Hadoop using OpenCL. It outlines three key macro analytics trends – computational enablement, usability and visualization, and intelligent systems – and then focuses on computational enablement, describing how big data and Hadoop enable scalable analytics. It shows how OpenCL and GPUs can speed up computationally intensive Hadoop tasks, using as a worked example least-squares regression via map-reduce to understand customer purchase data and attribute sales to marketing creatives.
IS-4011, Accelerating Analytics on HADOOP using OpenCL, by Zubin Dowlaty and Krishnaraj Gharpure
1. ACCELERATING ANALYTICS ON HADOOP USING OPEN CL
November 2013
ZUBIN DOWLATY
KRISHNARAJ GHARPURE
ROHIT SAREWAR
PRASANNA LALINGKAR
AUSTIN CV
MU SIGMA INC.
2. AGENDA
High Level Trends in the Applied Analytics Arena
‒Computational Enablement
‒Usability & Visualization
‒Intelligent Systems
Computational Enablement
‒Big Data & Hadoop
‒Reference Diagram
‒Open CL and Hadoop Research
2 | Accelerating Analytics on HADOOP using Open CL | NOVEMBER 19, 2013 | CONFIDENTIAL
3. THE INFORMATION EXPLOSION IS LEADING TO AN UNPRECEDENTED ABILITY
TO STORE, MANAGE AND ANALYZE DATA; WE NEED TO RUN AHEAD..
Data Source Explosion
‒ Click-stream data, purchase intent data, social media & blogs, intra/inter-firewall data, RFID & geospatial information, events, email archives
Expanding Applications
‒ Offer optimization, opinion mining, social graph targeting, influencer marketing, clinical data analysis, fraud detection, SKU rationalization, geospatial predictive modeling
Advanced Analytical Techniques
‒ CART, radial basis functions, adaptive regression splines, text machine learning, support vector machines, neural networks, complex event processing
Technology Evolution
‒ Software as a service, cloud computing, in-memory analytics, search technologies, interactive visualization, massively large databases
Scale of the explosion
‒ Digital data in 2011: ~1,750 exabytes; Facebook: 100+ million users; Bing recommendations: 15 million records per day; mobile phone subscriptions: 4.6 billion; RFID tags: 40 billion
4. Macro Trends: Agenda in All Three
THREE KEY MACRO ANALYTICS TRENDS TARGETING CONSUMPTION
Computational Enablement
– Scalable Analytics, Big Data & NoSQL technologies, GP-GPU, In-Memory & High Performance Computing
– Master Data Management, Cloud, and Federated Data – the end of the central EDW?
– Analytics as a Service and the Cloud
– Open Source Maturity
– Rise of Agile Self-Service
– Data Model Disruption: ETL vs. ELT
Usability and Visualization
– Dashboards will Evolve into Cockpits
– Improved Interactive Data Visualization & Aesthetics
– Augmented / Virtual Reality / Natural User Interfaces
– Graph Analytics sparked by Social Data
– Discovery with Computational Geometry
– Mobile BI and the Integration of Consumer & Enterprise Devices
– Gamification: Do you want to play a game?
Intelligent Systems (“Anticipation Denotes Intelligence”)
‒ Operational Smart Systems are Back-propagating – Pervasive Analytics
‒ Real-Time Analytics and Event Streams – Beyond Batch..
‒ Artificial Intelligence (AI) is not what we expected
‒ Don’t forget the 4th tier, it’s Juicy
‒ Metaheuristics and the Ants
‒ Operational Analytics, You are the Exception
‒ Experiments in the Enterprise and the Quasi
‒ Just Say No to OLS
‒ Energy Informatics is Radiating
‒ Intelligent Search
5. What is Big Data?
“BIG DATA”, THE TERM, IS ABOUT APPLYING NEW TECHNOLOGIES TO
MEET UNFULFILLED NEEDS: ANALYTICS AT SCALE…
BIG Computation & Big Data – Scaling Economically
‒ Big Computation: Map-Reduce
‒ Intensive computational analysis not supported by the traditional data warehouse stack and architecture
6. Big Data Technology Evolution
BIG DATA’S RECENT ENTERPRISE POPULARITY CAN BE ATTRIBUTED TO
TWO KEY FACTORS
Technology Trigger
‒ Robust open source technology for parallel computation on commodity hardware (Hadoop / NoSQL)
Frustration Trigger..
‒ Tyranny of the data model; cost and complexity of data integration; agility, impact, bureaucracy; difficulty and cost of scaling existing EDW technology
Timeline
‒ 2003–2005 (early research): Google trigger – releases the “GFS” and “Map Reduce” papers; Hadoop begins as an Apache Lucene subproject
‒ 2006–2008 (open source dev momentum): Apache Hadoop project; Google Bigtable paper; Yahoo runs a 10K-core cluster
‒ 2009–2010 (initial success stories): Cloudera named most promising startup funded in 2009; Amazon Elastic MapReduce; EMC acquires Greenplum
‒ 2011–2012 (commercialization & adoption): Teradata buys Aster (map-reduce DB); HP acquires Vertica; IBM, Oracle, Microsoft and SAS enter the market
7. Big Data – What is it?
A MINDSET FOR ENABLING DECISION SCIENCES
AT SCALE
DATASET -> TOOLSET -> SKILLSET -> MINDSET
Skillset
‒ Computer Scientist
‒ Map-Reduce, HDFS, HIVE, R, PIG, NoSQL, Unstructured Data
‒ Parallel computation & Stream Processing
‒ Data & Visual Explorer
‒ Math / Stats / Data Mining / Machine Learning
‒ Math & Technology & Business
‒ Data “Artisans” / Design Thinking
‒ Systems Theory, Generalizations
Mindset
‒ Automation, Scale, Agility
‒ Higher utilization of machines for decisions
‒ Learning vs Knowing
‒ Consumption Focused
‒ Algorithmic and Heuristic (Man & Machine)
‒ Intrapreneur, Curious, Persuasive, Soft Skills
8. The Big Data Ecosystem
THE BIG DATA ECO-SYSTEM IS A CHALLENGE TO KEEP UP WITH..
MINDSET: KNOWING VS. LEARNING..
“Formulating a right question is always hard, but with big data, it is an order of magnitude harder, because you are blazing the trail (not grazing on the green field).” – Gartner
Source: The Big Data Group LLC
9. Usability and Visualization: Evolution
TREND: DASHBOARDS WILL EVOLVE TO BE MORE ACTIONABLE AND FORWARD
LOOKING
Dashboredom?
http://www.informationmanagement.com/issues/20_7/dashboards_data_management_quality_business_intelligence-10019101-1.html
“As we move from the information age into the intelligence age”..
10. Usability and Visualization
OPEN SOURCE: D3 (DATA-DRIVEN DOCUMENTS) HTTP://D3JS.ORG/
1. Helps bring data to life using HTML5, SVG and CSS.
2. It is the successor to “Protovis”, a very popular JavaScript library.
3. With minimal overhead, D3 is extremely fast, supporting large datasets and dynamic behavior for interaction and animation.
4. It provides many built-in reusable functions and function factories, such as graphical primitives for area, line and pie charts.
11. Usability and Visualization
OPEN SOURCE:
R is the premier language for data scientists
‒ Strength in data visualizations..
‒ http://www.r-project.org/
12. Usability and Visualization
GRAPH NETWORK VISUALIZATION APPLIED TO METRIC DATA – CORRELATION
NETWORKS
13. THE CURRENT BUSINESS SCENARIO DEMANDS AUTOMATING AND
INJECTING INTELLIGENCE INTO VARIOUS ENTERPRISE TOUCH POINTS
The slide sketches a shopper’s journey – Search (“buy laptop”) → Webpage → Email → Add to Cart → Checkout – narrated by the shopper: “Yearly bonus… let me buy a new laptop”; “Wow! Great offers… since I have some more money, let me have a look”; “Hey! I have some good deals being provided here”; “Yeah… I think I need these headphones too”; “Yippee! 5% discount…”
Behind the scenes: taxonomy, associations & nearest neighbor; customer segmentation; propensity scores; collaborative filtering & recommender systems; marketing attribution methods to optimize media spend; revenue management; CRM; pricing elasticity; popup optimization; need-state personalization mix; offer optimization
Add Intelligent Analytics Everywhere You Can to Unlock the Value in Information
14. WE ARE IN THE ‘INFORMATION AGE’ AND IT’S TIME THAT DATA AND
HUMAN ANALYTICS CAPABILITIES ARE AUGMENTED WITH MACHINE
INTELLIGENCE
Intelligent Systems: Next Wave of Innovation and Change (Iron Man Suit)
Convergence of the global industrial system with the power of advanced computing, analytics, low-cost sensing and new levels of connectivity permitted by the Internet*
Intelligent Machines + Advanced Analytics + People at Work
Data Mining Finally Meets the Operation.. Powered by Big Data Scale & Value
* Source: GE (Industrial Internet – Pushing the Boundaries of Minds and Machines)
15. CURRENT NEWS ON INTELLIGENT SYSTEMS
CEP Vendors Acquired June 2013: Streambase (Tibco) and Progress Apama (Software
AG)
USA Today, September 16, 2012
– http://www.usatoday.com/money/business/story/2012/09/16/review-automate-this-probes-math-inglobal-trading/57782600/1
Two Second Advantage: Published 2011
– Wayne Gretzky’s edge: he predicted, a little better than anyone else, where the hockey puck would be
– Enterprise 2.0: every transaction became a bit of digital data
– Enterprise 3.0: every EVENT can become a bit of digital data
– There will always be value in making projections – days, months, years out – with predictive analytics, data mining and analytics systems on historical data (batch systems)
– But in today’s world, enterprises need something more: an instantaneous, proactive, predictive capability (real-time systems)
GE Minds and Machines: November 2012
» Industrial Internet, Intelligent Systems, and Intelligent Machines
» http://www.ge.com/docs/chapters/Industrial_Internet.pdf
» http://www.youtube.com/watch?v=SvI3Pmv-DhE
16. AGENDA
High Level Trends in the Applied Analytics Arena
‒Computational Enablement
‒Usability & Visualization
‒Intelligent Systems
Computational Enablement
‒Big Data & Hadoop
‒Open CL and Hadoop Research
17. Computational Enablement
THE DIVERSE WORLD OF HADOOP
Manage complex data – relational and non-relational – in a single repository: the “Data Reservoir / Lake”
‒ Fault-tolerant distributed processing
‒ Self-healing, distributed storage
‒ Commodity hardware – inexpensive storage
‒ Store massive data sets; scale horizontally with inexpensive commodity nodes
‒ Process at source – eliminate data movement
‒ Abstraction for parallel computing
‒ Maintenance: Zookeeper, Puppet
‒ Monitoring: Nagios, Ganglia, Splunk
‒ Data management and analytics: HBase, Pig, Hive, RHIPE, RHadoop, Mahout
‒ Complements your existing relational database technology, which excels at structured data manipulation
**NEWS FLASH**
‒ Hadoop as the data “OS”: Hadoop 2.0 and YARN supporting other parallel frameworks such as MPI (SAS HPC announcement)
‒ High-speed SQL (STRATA 2013): Impala, Dremel, SHARK, Hadapt
‒ ETL & EDW disruption (GigaOM: Think Bigger..): schema-on-read, ELT
18. THE HPC LANDSCAPE IS VAST AND CONTINUOUS RESEARCH TO UNDERSTAND THE
STRENGTHS AND WEAKNESSES OF DIFFERENT COMPONENTS IS THE KEY
Techniques supporting parallelization on the infrastructure (distributed memory): SPARK, Map-Reduce, HPCC, STORM (data-driven); MPI, HAMA (computation-driven)
Techniques for writing parallel programs (shared memory): OpenMP, Intel TBBs, JumboMem, Agent / Actor model, GP-GPU
Legend
‒ Already in use by Mu Sigma
‒ Should be explored
‒ Not necessary in current context
‒ Size indicates our comfort level with the technique/tool
19. HADOOP 2.0
“THE OS FOR DATA”
Source: HortonWorks
20. Case study: pro CRM client
MU SIGMA IMPLEMENTED A BIG DATA SOLUTION THAT REDUCED SCORING
TIME FROM 18 HOURS TO 2 HOURS..
Pipeline: Data sourcing → Data management → Data processing → Data consumption
‒ Ingest: the marketer’s process to score a list was taking 18 hours…
‒ Big data infrastructure: data scientists add propensity scores, segmentation codes, LTV, etc.
‒ Score all customers based on prioritization logic and produce output in the specified format
‒ Load the scored customer data, in the specified format, back to the EDW
21. Application: Machine Logs
GLM on Map-Reduce, Attribution modeling on clickstream data
Example clickstream data:
IP      | Creative 1 | Creative 2 | Creative 3 | Purchase (1/0)
w.x.y.z | 3:15 AM    | 6:16 AM    | -          | 1
a.b.c.d | 5:49 AM    | 3:11 AM    | -          | 0
e.f.g.h | 9:54 PM    | 5:12 PM    | -          | 1
Background
‒ E-commerce sites are constantly collecting clickstream data, of the order of terabytes, on attributes such as users, the creatives they’ve viewed, their transactions, etc.
‒ Frequently, when an online customer makes a purchase, the sale is attributed to the most recent creative
‒ However, this inferred causality is not necessarily correct: a previous creative, or a sequence of creatives, might actually have triggered the customer’s decision to purchase
Solution
‒ Statistical techniques such as logistic regression can help better understand the contributions of different creatives towards purchases and identify key purchase-driving creatives
‒ Hadoop can store and process large volumes of data, and R is a powerful statistical programming language; leveraging R to perform statistical analysis on large clickstream data stored in Hadoop using map-reduce is the natural next step
‒ Least-squares regression, a crucial component of the iterative methods used to solve logistic regression problems, has been implemented here in R using map-reduce
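The deck implements the regression in R over map-reduce; purely to make the attribution setup concrete, here is a self-contained sketch in plain Java with hypothetical exposure data – one indicator column per creative plus an intercept, fitted by simple gradient ascent rather than the IRLS/least-squares machinery the slides describe:

```java
import java.util.Arrays;

public class CreativeAttribution {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Fit logistic regression by plain gradient ascent on the log-likelihood.
    // Each X row is [intercept, sawCreative1, sawCreative2]; y is purchase (0/1).
    static double[] fit(double[][] X, double[] y, int iters, double lr) {
        double[] beta = new double[X[0].length];
        for (int it = 0; it < iters; it++) {
            double[] grad = new double[beta.length];
            for (int i = 0; i < X.length; i++) {
                double z = 0;
                for (int j = 0; j < beta.length; j++) z += beta[j] * X[i][j];
                double err = y[i] - sigmoid(z);            // residual for row i
                for (int j = 0; j < beta.length; j++) grad[j] += err * X[i][j];
            }
            for (int j = 0; j < beta.length; j++) beta[j] += lr * grad[j];
        }
        return beta;
    }

    public static void main(String[] args) {
        // Hypothetical exposures: purchases track creative 2, not creative 1
        double[][] X = {{1, 1, 1}, {1, 0, 1}, {1, 1, 0}, {1, 0, 0}};
        double[] y = {1, 1, 0, 0};
        double[] beta = fit(X, y, 2000, 0.1);
        System.out.println(Arrays.toString(beta));
        // beta[2] (creative 2) dominates beta[1] (creative 1), quantifying attribution
    }
}
```

On real clickstream data the same fit would run as the mappers/reducers described on the following slides, not in a single loop.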
22. SPEEDING UP HADOOP WITH THE GPU
DOES THE GPU HELP?
Hadoop jobs are designed to run in batch, with data being read off disks and streamed into ‘mappers’ and ‘reducers’
‒ The execution time of a Hadoop job can be viewed as a fixed overhead, plus a variable overhead (dependent mainly on data size), plus the processing time, which depends on the complexity of computations and the size of data:
  Job time ≈ Fixed overhead + Variable overhead + (Processing complexity × Data size)
‒ For a given size of data, even the most computationally trivial job in Hadoop (an identity job) is usually measurable in tens of seconds – the fixed and variable overheads alone
The testing objective was to measure the speedups achievable by using GPUs within MapReduce for relatively complex data processing – shrinking the processing term while the overheads stay put
23. SETTING IT UP AND MAKING IT WORK
CONFIG AND TEETHING TROUBLES
Server configuration
‒ CPU: 4-core Intel Xeon E5-1603 @ 2.80 GHz
‒ RAM: 16 GB
‒ GPU: AMD/ATI Radeon HD 7970 with 3 GB GDDR5 memory
‒ Base software: Ubuntu 12.04.3 LTS, Java 1.6.0_31
‒ Hadoop + ecosystem: Hadoop 2.0.0-cdh4.4.0, MapReduce 0.20
‒ OpenCL + ecosystem: OpenCL 1.2, JOCL
Since mappers/reducers run in parallel, an ideal Hadoop cluster with a GPU per node was simulated by running mappers/reducers on one machine with one GPU
It was found that
‒ Hadoop doesn’t recognize the GPU card: OpenCL needs access to GUI services provided by the X-server to do so*
‒ A few modifications to the user’s shell configuration were required to get around this
* AMD is currently working on getting OpenCL working without the X-server
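The slide does not spell out the shell changes. On AMD Catalyst-era OpenCL stacks of that period, the usual workaround was to point the compute runtime at the running X display – a sketch only, since exact variable names vary by driver version:

```shell
# Hypothetical ~/.bashrc additions so Hadoop-spawned JVMs can reach the GPU.
# Names and values are driver-version dependent; treat as an assumption.
export DISPLAY=:0    # X display that owns the Radeon card
export COMPUTE=:0    # tells the AMD OpenCL runtime which display to attach to
```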
24. EXPERIMENTAL DETAILS
THE USE-CASE
Linear regression was written in Java MapReduce with and without a GPU
‒ Uses N rows and (M + 2)* columns of data to fit the expected value of the dependent variable as a linear combination of a set of M predictor variables, such that for the i-th row
  E[yᵢ] = β₀ + Σⱼ₌₁..M βⱼ xᵢⱼ
‒ Regression coefficients are computed by solving the Normal Equations
  β = (XᵀX)⁻¹(Xᵀy), where X is a matrix of dimension N × (M + 1) and y is a column vector of length N
Data in Hadoop is distributed across multiple nodes into B blocks that are mutually exclusive and collectively exhaustive (notwithstanding Hadoop’s replication factor)
‒ The X_bkᵀ X_bk and X_bkᵀ y_bk terms** are computed and appended in mappers for each input block
‒ The augmented terms output from each of the mappers are summed, and the inversion and multiplication of terms is performed in the reducer
Using GPUs, each 2D input*** to the mappers is flattened by row and sent as a 1D array to a GPU kernel function
‒ Mappers compute dot products of all sub-arrays formed by elements eₕ such that h mod (M + 2) = l, for l from 0 to M + 1
* The dependent variable, M predictor variables and a column of 1s used to compute the intercept coefficient, β₀
** b_k for k from 1 to B, indicating the blocks of data that constitute the input data
*** 1D in case data is processed 1 row at a time
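The block-wise scheme above can be sketched in plain Java without Hadoop – a simulation of what each mapper and the reducer computes. The Gaussian-elimination solver is an assumption for illustration; the slides do not name the inversion method used:

```java
import java.util.Arrays;

public class BlockwiseOLS {
    // Mapper-side work for one block: accumulate the augmented matrix [X^T X | X^T y].
    // Each input row is [x_1..x_M, y]; an intercept 1 is prepended to form x.
    static double[][] partial(double[][] block) {
        int p = block[0].length;                 // intercept + M predictors
        double[][] acc = new double[p][p + 1];
        for (double[] row : block) {
            double[] x = new double[p];
            x[0] = 1.0;
            System.arraycopy(row, 0, x, 1, p - 1);
            double y = row[p - 1];
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) acc[i][j] += x[i] * x[j];
                acc[i][p] += x[i] * y;
            }
        }
        return acc;
    }

    // Reducer-side work: sum the partial augmented matrices, then solve
    // (X^T X) beta = X^T y by Gauss-Jordan elimination with partial pivoting.
    static double[] reduce(double[][][] partials) {
        int p = partials[0].length;
        double[][] a = new double[p][p + 1];
        for (double[][] part : partials)
            for (int i = 0; i < p; i++)
                for (int j = 0; j <= p; j++) a[i][j] += part[i][j];
        for (int col = 0; col < p; col++) {
            int piv = col;
            for (int r = col + 1; r < p; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[piv][col])) piv = r;
            double[] tmp = a[col]; a[col] = a[piv]; a[piv] = tmp;
            for (int r = 0; r < p; r++) {
                if (r == col) continue;
                double f = a[r][col] / a[col][col];
                for (int j = col; j <= p; j++) a[r][j] -= f * a[col][j];
            }
        }
        double[] beta = new double[p];
        for (int i = 0; i < p; i++) beta[i] = a[i][p] / a[i][i];
        return beta;
    }

    public static void main(String[] args) {
        // y = 2 + 3x, split into two "blocks" the way HDFS would split the file
        double[][] block1 = {{1, 5}, {2, 8}};
        double[][] block2 = {{3, 11}, {4, 14}};
        double[] beta = reduce(new double[][][]{partial(block1), partial(block2)});
        System.out.println(Arrays.toString(beta));   // ≈ [2.0, 3.0]
    }
}
```

Because each block contributes only a small (M+1)×(M+2) summary, the reducer's work is independent of N – the property that makes the normal-equations approach map-reduce friendly.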
25. USING THE GPU FROM MAPPERS
DATA MOVEMENT AND COMPUTATIONS
Computing the XᵀX term (toy example):

      Xᵀ                X                  XᵀX
   [ 1  1 ]       [ 1  1  2 ]        [ 2   4   6 ]
   [ 1  3 ]   ×   [ 1  3  4 ]   =    [ 4  10  14 ]
   [ 2  4 ]                          [ 6  14  20 ]
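The same arithmetic, written the way the mappers feed the GPU: the 2D block is flattened row-major into a 1D array, and each XᵀX entry is a dot product of strided sub-arrays (a plain-Java stand-in for the OpenCL kernel):

```java
import java.util.Arrays;

public class FlattenedXtX {
    // X^T X from a row-major flattened block: entry (i,j) is the dot product
    // of the sub-arrays {e_h : h mod cols == i} and {e_h : h mod cols == j},
    // mirroring how the mappers index the flattened data for the kernel.
    static double[][] xtx(double[] flat, int cols) {
        int rows = flat.length / cols;
        double[][] out = new double[cols][cols];
        for (int i = 0; i < cols; i++)
            for (int j = 0; j < cols; j++)
                for (int r = 0; r < rows; r++)
                    out[i][j] += flat[r * cols + i] * flat[r * cols + j];
        return out;
    }

    public static void main(String[] args) {
        // The slide's toy block X = [[1,1,2],[1,3,4]], flattened row-major
        double[] flat = {1, 1, 2, 1, 3, 4};
        System.out.println(Arrays.deepToString(xtx(flat, 3)));
        // [[2.0, 4.0, 6.0], [4.0, 10.0, 14.0], [6.0, 14.0, 20.0]]
    }
}
```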
26. MAPREDUCE ON CPU
CPU OLS CODE LOGIC
Mapper class uses methods
‒ setup(): called once at the beginning of the map task
‒ map(): called for each (K,V) pair; creates a list of all the values in the input split
‒ cleanup(): called once at the end of the map task; OLS code (calculate XᵀX, Xᵀy for input blocks) implemented in this method
Reducer class uses methods
‒ setup(): called once at the beginning of the reduce task
‒ reduce(): called for each key; sum of intermediate matrices computed in this method
‒ cleanup(): called once at the end of the reduce task; matrix inversion and β values computed here
27. MAPREDUCE ON GPU + CPU
GPU OLS CODE LOGIC
Mapper class uses methods
‒ setup(): called once at the beginning of the map task
‒ map(): called for each (K,V) pair; creates a list of all the values in the input split
‒ cleanup(): called once at the end of the map task; OLS code implemented in this method using the JOCL library
Reducer class uses methods
‒ setup(): called once at the beginning of the reduce task
‒ reduce(): called for each key; sum of intermediate matrices computed in this method
‒ cleanup(): called once at the end of the reduce task; matrix inversion and β values computed here
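As a Hadoop-free illustration of the mapper lifecycle described on these two slides (hypothetical class and method names; the real code runs inside Hadoop's Mapper/Reducer classes), the pattern is: buffer values in map(), do the heavy math once in cleanup():

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {
    // Mimics the described Mapper lifecycle: setup() once, map() per record
    // buffering values, cleanup() once doing the OLS-style matrix math.
    static class OlsMapper {
        List<double[]> buffer;        // values collected from the input split
        double[][] emitted;           // what cleanup() would emit to the reducer

        void setup() { buffer = new ArrayList<>(); }
        void map(double[] row) { buffer.add(row); }   // called per (K,V) pair
        void cleanup() {                              // X^T X over the whole split
            int p = buffer.get(0).length;
            emitted = new double[p][p];
            for (double[] x : buffer)
                for (int i = 0; i < p; i++)
                    for (int j = 0; j < p; j++) emitted[i][j] += x[i] * x[j];
        }
    }

    public static void main(String[] args) {
        OlsMapper m = new OlsMapper();
        m.setup();
        m.map(new double[]{1, 2});
        m.map(new double[]{1, 4});
        m.cleanup();
        System.out.println(m.emitted[1][1]);   // 2*2 + 4*4 = 20.0
    }
}
```

Deferring the computation to cleanup() is what lets the GPU variant copy one large buffer per task instead of one per record.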
28. EXPERIMENTAL OUTCOMES
KEY RESULTS
Processing data one row at a time is not efficient – runtimes were measured for ‘chunks’ of data of increasing size
Performing linear regression using GPUs within MapReduce consistently outperformed the same exercise using ‘vanilla’ MapReduce code
[Charts: job execution time (s) for (1) row-by-row vs. all-at-once processing*, GPU vs. CPU; (2) chunk sizes increasing from 1,000 to 100,000 rows**, GPU vs. CPU; (3) the same, GPU only]
* Processing times for 1,000 rows and 256 columns, 1 row at a time and all together
** Processing times for 100,000 rows and 256 columns, processed in chunks of ‘Number of rows’
29. EXPERIMENTAL OUTCOMES
CPU AND GPU EXECUTION TIME
CPU execution times show a sharp rise on data beyond 16 columns and 100,000 rows
GPU execution time does not show a significant rise for rows from 1,000 to 1,000,000 and columns from 4 to 256
Beyond 1,000,000 rows, GPU execution times grow, but more slowly than CPU execution times
[Charts: CPU and GPU job execution time (s) surfaces over 4–256 columns and 1K–10M rows]
#1 CPU execution times of a few data sets exceed 600 seconds and are clipped in the chart:
Rows | Columns | Time (s)
1M   | 256     | 804
10M  | 128     | 1,863
10M  | 256     | 7,701
#2 The two bounds of the sharp increase in CPU execution times are 256 columns × 100K rows (= 51.2 MB) and 16 columns × 10M rows (= 320 MB)
* Processing time for data sets where the number of rows ranges from 1K to 10M and columns range from 4 to 256
30. EXPERIMENTAL OUTCOMES
COMPARING CPU AND GPU
Ratio of CPU to GPU execution time*
‒ For very small data sizes, CPU and GPU have comparable runtimes
‒ As the number of rows increases, CPU execution time increases drastically compared to GPU
‒ With an increase in the number of columns, CPU execution time shows a gradual increase compared to GPU
‒ For larger data sizes, the ratio of CPU to GPU execution time shows a steep increase
[Chart: CPU/GPU execution-time ratio surface over 4–256 columns and 1K–10M rows, with ratio bands from -1–1 up to 17–19]
* For data sets where the number of rows ranges from 1K to 10M and columns range from 4 to 256
31. TAKEAWAYS
RECOMMENDATIONS AND NEXT STEPS
GPUs can be effectively used within MapReduce to speed up relatively computationally heavy tasks like matrix operations
‒ Since copying data to/from a GPU is expensive, a large amount of data should be copied and processed ‘in one go’
‒ Computation time asymptotically decreases with increasing data chunk size
Multiple mappers/reducers running on a node can share the GPU
The size of the input data, the Hadoop block size, the data chunk size and memory allocation are key factors that must be chosen carefully to ensure all processes run successfully, and faster than they otherwise would, when one GPU is shared across mappers/reducers
‒ Using a GPU is not necessarily more ‘efficient’ for data with a small number of columns
‒ Smaller blocks increase the degree of MapReduce parallelism but also lead to more mappers using the GPU simultaneously
‒ Improper memory allocation in Hadoop and on the GPU can lead to either unused memory, or more memory being required than is available
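The ‘copy in one go’ takeaway follows from back-of-the-envelope arithmetic: a fixed per-transfer cost is paid once per chunk, so total overhead shrinks as chunks grow. The constants below are illustrative, not measurements from the experiments:

```java
public class ChunkingCost {
    // Total transfer overhead for n rows moved to the GPU in chunks:
    // each transfer pays a fixed setup cost plus a per-row cost.
    static double totalCost(int nRows, int chunkSize,
                            double fixedPerTransfer, double perRow) {
        int transfers = (nRows + chunkSize - 1) / chunkSize;  // ceil division
        return transfers * fixedPerTransfer + nRows * perRow;
    }

    public static void main(String[] args) {
        int n = 100_000;
        // Row-at-a-time pays the fixed cost 100,000 times; chunked pays it 10 times
        System.out.println(totalCost(n, 1, 2.0, 0.5));        // 250000.0
        System.out.println(totalCost(n, 10_000, 2.0, 0.5));   // 50020.0
    }
}
```

The per-row term is identical in both cases; only the amortization of the fixed term changes, which matches the asymptotic flattening observed as chunk size increases.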
Future exploration includes
‒ Measuring changes in job execution time with the GPU being used in the reducer as well, for OLS
‒ Empirically determining optimal values for data size and memory allocation with one or more jobs running multiple mappers/reducers on a node that share a GPU