2. Agenda
• Quick Review of In-Memory Data Grids
• The Need for Real-Time Analytics: Two Use Cases
• Data-Parallel Computation on an IMDG Using Parallel Method
Invocation (PMI)
• Implementing MapReduce Using PMI: ScaleOut hServer™
• Sample Use Cases
• Video Demo
• Comparison to Spark
ScaleOut Software, Inc.
3. About ScaleOut Software
• Develops and markets In-Memory Data Grids:
software middleware for:
• Scaling application performance and
• Performing real-time analytics using
• In-memory data storage and computing
• Dr. William Bain, Founder & CEO
• Career focused on parallel computing – Bell Labs, Intel, Microsoft
• 3 prior start-ups, last acquired by Microsoft and product now ships as
Network Load Balancing in Windows Server
• Eight years in the market; 400 customers, 9,000 servers
• Sample customers:
4. What is an In-Memory Data Grid?
In-memory storage for fast updates and retrieval of live data
• Fits in the business logic layer:
• Follows object-oriented view of data
(vs. relational view).
• Stores collections of Java/.NET
objects shared by multiple clients.
• Uses create/read/update/delete
and query APIs to access data.
• Implemented across a cluster of
servers or VMs:
• Scales storage and throughput
by adding servers.
• Provides high availability
in case a server fails.
5. Our Focus: Real-Time Analytics
Real-time (“Operational Intelligence”):
• Live data sets
• Gigabytes to terabytes
• In-memory storage
• Minutes to seconds
• Best uses: tracking live data; immediately identifying trends and capturing opportunities
Batch (“Business Intelligence”):
• Static data sets
• Petabytes
• Disk storage
• Hours to minutes
• Best uses: analyzing warehoused data; mining for long-term trends
[Diagram: Big Data Analytics, split into Real-Time and Batch; labels shown: Analytics Server, Hadoop, IBM, Teradata, SAS, SAP, hServer.]
6. Online Systems Need Real-Time Analysis
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: to optimize real-time shopping activity
• Reservations systems: to identify issues, reroute, etc.
• Credit cards: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues
7. Integrate MapReduce
into IMDG for Real-Time Analytics
Benefits:
• Enables use of widely used Hadoop MapReduce APIs:
• Hadoop MapReduce programs run without change.
• Accelerates data access by staging data in memory.
• Eliminates the batch scheduling and data shuffling overheads of standard Hadoop distributions.
• Analyzes and updates live data.
• Enables Hadoop deployment in live systems.
• ScaleOut’s implementation is called ScaleOut hServer™.
8. Data-Parallel Analysis Is Not New
• 1980’s: Special Purpose Hardware: “SIMD” (shown: Thinking Machines Connection Machine 5)
• 1990’s: General Purpose Parallel Supercomputers: “Domain Decomposition”, “SPMD” (shown: Intel iPSC/2, IBM SP1)
9. Data-Parallel Analysis Is Not New
• 1990’s – early 2000’s: HPC on Clusters: “MPI” (shown: HP Blade Servers)
• Since 2003: Clusters, the Cloud, and IMDGs: “MapReduce” (shown: Amazon EC2, Windows Azure)
10. Parallel Method Invocation
• Basic, well understood model of data-parallel computation
• Implemented for use on objects hosted in IMDGs:
• Executes user’s code in parallel across the grid.
• Uses parallel query to select objects for analysis.
[Figure: the In-Memory Data Grid runs the data-parallel analysis in two phases: Analyze Data (Eval), then Combine Results (Merge).]
11. Running the Analysis
The parallel analysis executes in three steps:
• Step 1: The application first selects all relevant objects in the
collection with a parallel query run on all grid servers.
• Note: Query spec matches data’s object-oriented properties.
12. Running the Analysis: Step 2
• Step 2: The IMDG automatically schedules analysis operations
across all grid servers and cores.
• The analysis runs on all objects selected
by the parallel query.
• Each grid server analyzes its locally stored
objects to minimize data motion.
• Parallel execution ensures fast
completion time:
• IMDG automatically distributes
workload across servers/cores.
• Scaling the IMDG automatically
handles larger data sets.
13. Running the Analysis: Step 3
• Step 3: The IMDG automatically merges all analysis results.
• The IMDG first merges all results within each grid server in parallel.
• It then merges results across all grid servers to create one combined
result.
• Efficient parallel merge
minimizes the delay in
combining all results.
• The IMDG delivers the
combined result to the
trader’s display as one
object.
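The three steps above (query, analyze, merge) can be sketched in plain Java. This is only a single-process illustration that uses a parallel stream as a stand-in for the grid's servers and cores; `PmiSketch`, `Stock`, and `invoke` are hypothetical names chosen for this sketch and are not the ScaleOut API.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical, single-process stand-in for the three PMI steps.
// None of these names come from the ScaleOut API.
final class PmiSketch {

    // Hypothetical stock object as it might be stored in the grid.
    record Stock(String ticker, double price, long totalShares) {}

    // Step 1 selects objects, step 2 evaluates each one,
    // step 3 merges the per-object results pair-wise.
    static <T, R> Optional<R> invoke(List<T> objects,
                                     Predicate<T> query,
                                     Function<T, R> eval,
                                     BinaryOperator<R> merge) {
        return objects.parallelStream()   // stands in for servers and cores
                .filter(query)            // step 1: parallel query
                .map(eval)                // step 2: per-object analysis
                .reduce(merge);           // step 3: pair-wise merge
    }

    // Sums price * shares over the selected tickers.
    static double demo() {
        List<Stock> grid = List.of(
                new Stock("GOOG", 100.0, 10),
                new Stock("ORCL", 50.0, 20),
                new Stock("MSFT", 30.0, 5));
        return invoke(grid,
                s -> s.ticker().equals("GOOG") || s.ticker().equals("ORCL"),
                s -> s.price() * s.totalShares(),
                Double::sum).orElse(0.0);
    }

    public static void main(String[] args) {
        System.out.println("Combined value: " + demo());
    }
}
```

The merge function must be associative, because results are combined in whatever order the parallel workers finish; the same constraint would apply to a merge that runs first within each server and then across servers.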
14. Sample Performance Results for PMI
Optimizing a stock trading platform with real-time analysis:
• IMDG hosted in Amazon
cloud using 75 servers.
• IMDG holds 1 TB of stock
history data in memory.
• IMDG handles continuous
stream of updates (1.1 GB/s).
• IMDG performs real-time
analysis on live data.
• Entire data set analyzed in
4.1 seconds (250 GB/s).
• IMDG scales linearly as
workload grows.
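The quoted scan rate follows from the data set size and the analysis time. The short check below assumes 1 TB is counted as 1024 GB (the slide does not say which convention it uses; with 1000 GB the result is closer to 244 GB/s) and that the work is spread evenly over the 75 servers. `ThroughputCheck` is a hypothetical helper written for this check.

```java
// Back-of-the-envelope check of the figures quoted above.
// Assumes 1 TB = 1024 GB and an even spread over all servers.
final class ThroughputCheck {

    // Aggregate scan rate in GB/s.
    static double aggregateGBps(double dataSetGB, double seconds) {
        return dataSetGB / seconds;
    }

    // Average scan rate each server must sustain, in GB/s.
    static double perServerGBps(double aggregateGBps, int servers) {
        return aggregateGBps / servers;
    }

    public static void main(String[] args) {
        double aggregate = aggregateGBps(1024.0, 4.1);
        System.out.printf("aggregate:  %.1f GB/s%n", aggregate);
        System.out.printf("per server: %.2f GB/s%n", perServerGBps(aggregate, 75));
    }
}
```

Under these assumptions the aggregate rate is about 250 GB/s, which works out to roughly 3.3 GB/s per server.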
15. Implementing Real-Time MapReduce
• Goal: Run MapReduce applications from a remote workstation.
• The IMDG automatically builds an “invocation grid” of JVMs on the
grid’s servers for PMI and ships the application’s jars.
• The invocation grid can be reused to shorten startup time.
• Use PMI to implement MapReduce.
16. Accelerating MapReduce Execution
PMI is the foundation of fast execution time:
• Data can be input from either the IMDG or an external data source.
• Works with any input/output format compatible with the Apache distribution.
• ScaleOut IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
• Eliminates batch scheduling overhead.
• Intermediate results are stored within the IMDG.
• Minimizes data motion between the mappers and reducers.
• Allows optional sorting.
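The flow described on this slide (mappers invoked in parallel, intermediate results held in memory, reducers invoked in parallel, optional sorting) can be illustrated with a small word count in plain Java. This is a sketch under simplifying assumptions: everything runs in one JVM, a concurrent map stands in for the grid's intermediate storage, and `InMemoryWordCount` is a hypothetical name that is not part of hServer or Hadoop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-JVM illustration of map -> in-memory
// intermediate results -> reduce. Not the hServer implementation.
final class InMemoryWordCount {

    static Map<String, Integer> run(List<String> lines, boolean sorted) {
        // Map phase: lines are tokenized in parallel; each word emits a 1,
        // and the emitted values are grouped by key in an in-memory map.
        Map<String, List<Integer>> intermediate = lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingByConcurrent(
                        word -> word,
                        Collectors.mapping(word -> 1, Collectors.toList())));

        // Reduce phase: the values for each key are summed in parallel.
        Map<String, Integer> reduced = intermediate.entrySet().parallelStream()
                .collect(Collectors.toConcurrentMap(
                        Map.Entry::getKey,
                        e -> e.getValue().stream().mapToInt(Integer::intValue).sum()));

        // Sorting by key is optional, mirroring the slide's last bullet.
        return sorted ? new TreeMap<>(reduced) : reduced;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be", "to do"), true));
    }
}
```

Because the intermediate map never leaves memory, nothing in this sketch is written to disk between the two phases, which is the point the slide makes about minimizing data motion.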
17. Only One-Line Code Change
ScaleOut hServer subclasses the Hadoop Job class:
// This job will run using the Hadoop
// job tracker:
public static void main(String[] args)
        throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(
            TextInputFormat.class);
    job.setOutputFormatClass(
            TextOutputFormat.class);
    FileInputFormat.addInputPath(
            job, new Path(args[0]));
    FileOutputFormat.setOutputPath(
            job, new Path(args[1]));
    job.waitForCompletion(true);
}

// This job will run using ScaleOut hServer:
public static void main(String[] args)
        throws Exception {
    Configuration conf = new Configuration();
    Job job = new HServerJob(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(
            TextInputFormat.class);
    job.setOutputFormatClass(
            TextOutputFormat.class);
    FileInputFormat.addInputPath(
            job, new Path(args[0]));
    FileOutputFormat.setOutputPath(
            job, new Path(args[1]));
    job.waitForCompletion(true);
}
18. Accessing IMDG Data for M/R
• IMDG adds grid input format for
accessing key/value pairs held in
the IMDG.
• MapReduce programs can optionally
output results to the IMDG with the
grid output format.
• Grid Record Reader optimizes
access to key/value pairs to
eliminate network overhead.
• Applications can access and
update key/value pairs as
operational data during analysis.
19. Optimized In-Memory Storage
Multiple in-memory storage
models:
• Named cache, optimized
for rich semantics:
• Property-based query
• Distributed locking
• Access from remote grids
• Named map, optimized for
efficient storage and bulk
analysis:
• Highly efficient object
storage
• Pipelined, bulk-access
mechanisms
20. Example: Ecommerce: Inventory Management
Fast map/reduce reconciles inventory and order systems
for an online retailer:
• Challenge: Inventory and online
order management are handled
by different applications.
• Reconciled once per day.
• Inaccurate orders reduce margins.
• Solution:
• Host SKUs in IMDG updated in real
time by order & inventory systems.
• Use PMI to reconcile in two minutes.
• Results: Real-time reconciliation ensures accurate orders.
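One way to picture the reconciliation is as an eval step that computes a shortfall for each SKU and a merge step that combines the per-SKU results. The sketch below is an assumption about what such a check could look like, not the retailer's actual logic; `InventoryReconcile`, `Sku`, and its fields are hypothetical, and a parallel stream again stands in for the grid.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical reconciliation check in the eval/merge style.
// The Sku fields are assumptions made for this illustration.
final class InventoryReconcile {

    // A SKU as both systems might update it: stock on hand comes from the
    // inventory system, open orders from the order system.
    record Sku(String id, int onHand, int openOrders) {}

    // Eval step: units short for one SKU (0 when stock covers orders).
    static int shortfall(Sku sku) {
        return Math.max(0, sku.openOrders() - sku.onHand());
    }

    // Merge step (sum): total units short across all SKUs.
    static int totalShortfall(List<Sku> skus) {
        return skus.parallelStream()
                .mapToInt(InventoryReconcile::shortfall)
                .sum();
    }

    // IDs of SKUs whose open orders exceed stock on hand, sorted by ID.
    static List<String> oversold(List<Sku> skus) {
        return skus.parallelStream()
                .filter(s -> shortfall(s) > 0)
                .map(Sku::id)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Sku> skus = List.of(
                new Sku("A", 10, 4), new Sku("B", 2, 5), new Sku("C", 0, 1));
        System.out.println("Oversold SKUs: " + oversold(skus));
        System.out.println("Units short:   " + totalShortfall(skus));
    }
}
```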
21. Example in Financial Services
Integrate analysis into a stock trading platform:
• The IMDG holds market data and hedging strategies.
• Updates to market data
continuously flow through
the IMDG.
• The IMDG performs
repeated map/reduce
analysis on hedging
strategies and alerts
traders in real time.
• IMDG automatically and dynamically
scales its throughput to handle new
hedging strategies by adding servers.
23. Comparison to Spark
• Spark is intended to accelerate data analysis using in-memory
computing.
• ScaleOut’s IMDG provides standard MapReduce for “live” systems.
Feature                  Spark                            ScaleOut IMDG
New MapReduce engine     Yes                              Yes
In-memory data storage   Resilient Distributed Datasets   Distributed Objects
Load/store from HDFS     Yes                              Yes
Avoid disk access        Yes                              Yes
CRUD on live data        No                               Yes
Query on properties      No                               Yes
High availability        Rebuild on failure               Replication and failover
Extensibility            Additional operators             PMI methods
Open source              Yes                              Hybrid
24. Summary
• Online systems need to analyze “live” data in real time.
• MapReduce has traditionally focused on analyzing
large, static (offline) datasets held in file systems.
• An in-memory data grid (IMDG) can accelerate
MapReduce applications, enabling real-time analytics:
• Enables the application to analyze and update live data.
• Leverages the IMDG’s load-balanced placement of data.
• Avoids batch-scheduled startup delays.
• Avoids data motion from secondary storage.
• MapReduce can be implemented using standard data-parallel
computing techniques (“parallel method invocation”):
• Tightly integrates Map/Reduce engine with the IMDG.
• Accelerates Map/Reduce execution by >20X in benchmark
tests.
25. Accelerating Start-Up Times
• The invocation grid can be re-used across MapReduce jobs:
public static void main(String argv[]) throws Exception {
    // Configure and load the invocation grid
    InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
            // Add JAR files as IG dependencies
            .addJar("main-job.jar")
            .addJar("first-library.jar")
            // Add classes as IG dependencies
            .addClass(MyMapper.class)
            .addClass(MyReducer.class)
            // Define custom JVM parameters
            .setJVMParameters("-Xms512M -Xmx1024M")
            .load();

    // Run 10 jobs on the same invocation grid
    for (int i = 0; i < 10; i++) {
        Configuration conf = new Configuration();
        // The preloaded invocation grid is passed as the parameter to the job
        Job job = new HServerJob(conf, "Job number " + i, false, grid);
        //......Configure the job here.........
        // Run the job
        job.waitForCompletion(true);
    }

    // Unload the invocation grid when we are done
    grid.unload();
}
26. Targeted Use Cases
Run continuous Hadoop on live data, while it’s being updated.
“Capture perishable business opportunities and identify issues.”
• Real-time risk analysis
• Credit card fraud detection
• ...
Accelerate Hadoop on static data with a one-line code change.
“Speed-up Hadoop execution by >10X for faster business insights.”
• Financial modeling
• Process simulations
• ...
Quickly prototype Hadoop code.
“Validate your Hadoop code before it goes into batch processing.”
• No need to install Hadoop stack
• Fast-turn debug and tuning
• ...
27. The Need for Real-Time Analytics
Many Use Cases:
• Authorizations / Payment Processing / Mobile Payments
• Operational Risk Compliance
• Financial: Risk, P&L, Pricing
• Execution Rules
• Market Feed / Event Handlers
• Churn Management
• Situational Awareness
• Fraud Detection
• Real Time Tracking
• Sensor Data / SCADA
• Inventory Management
• Service Activation
Across Key Industries:
• Health Care
• Government
• Life Sciences
• IC / DoD
• Logistics
• Manufacturing
• Utilities
• Retail
• Telco
• Financial
• CPG
• Law enforcement
28. Problem: Hadoop Cannot Efficiently
Perform Real-Time Analytics
• Typically used for very large, static, offline datasets
• Data must be copied from disk-based storage (e.g., HDFS) into
memory for analysis.
• Hadoop Map/Reduce adds lengthy batch scheduling and data
shuffling overhead.
29. Hadoop Users Need
Real-Time Analytics
• ScaleOut Software conducted an informal survey at the Strata 2013
Conference (Santa Clara).
• Based on 150 responses:
• 78% of organizations generate fast-changing data.
• 60% use Hadoop and 78% plan to expand usage of Hadoop within
12 months.
• Only 42% consider Hadoop to be an effective platform for real-time analysis, but…
• 93% would benefit from real-time data analytics.
• 71% consider a 10X improvement in performance meaningful.
• Take-away: Hadoop users need real-time analytics.
30. Optional Caching of HDFS Data
• ScaleOut hServer adds a Dataset Record Reader (wrapper) to
cache HDFS data during program execution.
• Hadoop automatically retrieves data from ScaleOut IMDG on
subsequent runs.
• Dataset Record Reader
stores and retrieves data
with minimum network
and memory overheads.
• Tests with the Terasort
benchmark have
demonstrated 11X
lower access latency
than HDFS without the IMDG.
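The caching behavior described above (read from the slow source on the first run, serve later runs from memory) is a read-through cache wrapped around a record reader. The sketch below illustrates that pattern only; `CachingReaderSketch` and its methods are hypothetical, a `HashMap` stands in for the IMDG, and a supplied iterator stands in for the HDFS reader.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical read-through cache around a record source.
// A HashMap stands in for the grid; this is not the hServer wrapper.
final class CachingReaderSketch {

    // Split name -> records captured during an earlier run.
    private final Map<String, List<String>> cache = new HashMap<>();
    private int sourceReads = 0;

    // Returns the records for a split, reading the slow source only on a miss.
    List<String> read(String split, Supplier<Iterator<String>> slowSource) {
        List<String> cached = cache.get(split);
        if (cached != null) {
            return cached;                 // later runs: served from memory
        }
        sourceReads++;                     // first run: go to the slow source
        List<String> records = new ArrayList<>();
        slowSource.get().forEachRemaining(records::add);
        cache.put(split, records);
        return records;
    }

    // Number of times the slow source was actually read.
    int sourceReads() {
        return sourceReads;
    }

    public static void main(String[] args) {
        CachingReaderSketch reader = new CachingReaderSketch();
        List<String> data = List.of("record-1", "record-2");
        reader.read("split-0", data::iterator);   // miss: reads the source
        reader.read("split-0", data::iterator);   // hit: served from the cache
        System.out.println("Source reads: " + reader.sourceReads());
    }
}
```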
31. Java Example: Parallel Method Invocation
• Create a method to analyze each queried stock object and another
method to pair-wise merge the results:
public class StockAnalysis implements
        Invokable<Stock, StockCalcParams, Double>
{
    // Analyze one stock object: its market value.
    public Double eval(Stock stock, StockCalcParams param)
            throws InvokeException {
        return stock.getPrice() * stock.getTotalShares();
    }

    // Pair-wise merge of two partial results.
    public Double merge(Double first, Double second)
            throws InvokeException {
        return first + second;
    }
}
32. Java Example: Parallel Method Invocation
• Run a parallel method invocation on the query results:
NamedCache cache = CacheFactory.getCache("Stocks");
InvokeResult valueOfSelectedStocks =
    cache.invoke(
        StockAnalysis.class,
        Stock.class,
        or(equal("ticker", "GOOG"), equal("ticker", "ORCL")),
        new StockCalcParams());
System.out.println("The value of selected stocks is " +
    valueOfSelectedStocks.getResult());