Presentation: ISPASS 2012, Session 6, Presentation 1
1. Evaluating FPGA-acceleration for Real-time
Unstructured Search
Sai Rahul Chalamalasetti†, Martin Margala†, Wim Vanderbauwhede*,
Mitch Wright‡, Parthasarathy Ranganathan‡‡
†University of Massachusetts Lowell, Lowell, MA
*University of Glasgow, Scotland, UK
‡Hewlett Packard, Houston, TX
‡‡Hewlett Packard Labs, Palo Alto, CA
4/9/2012 1
2. Outline
Motivation
Workload and Algorithm Description
Hardware Systems
Synthetic Datasets
Performance Results
Alternatives and Future Work
Conclusion
3. Motivation
The era of “big data”
Explosion in data – particularly unstructured data
Information doubling every 18 months or faster
Enterprise server systems processed and delivered over 9 zettabytes in
2008 (UCSD report)
Walmart: 1M transactions/hour; LHC: 1PB/second; YouTube: 48
hours of video/minute; Facebook: 100TB logs/day
Explosion in data-centric workloads
Collect, store, access, share, visualize, analyze, interpret, …
Consumer, Enterprise, Scientific, …
New applications emerging recently: search, live business
analytics, social correlation, collaborative filtering, …
Need better performance for deeper analytics across diverse data
4. Motivation
The era of “green computing”
Power and cooling are important constraints for
servers/datacenters
Only 4 countries consume more electricity than worldwide
datacenters; millions of dollars for cloud datacenters
Thermal density and costs of power delivery and cooling
infrastructure
Sustainability a growing concern
Lifecycle minimization of environmental effects and carbon
emissions
Corporate initiatives from HP, Cisco, Dell, Google, IBM, Intel;
Government initiatives from EPA, DOE, etc.
5. This Work
High-performance energy-efficient data-centric architectures
FPGAs/accelerators a good way to improve energy efficiency
Accelerated Unstructured Search, mainly data analysis (document filtering &
profile match)
GiDEL ProcStar IV board (Four Altera Stratix IV 530 FPGAs)
Recent developments offer promise
Better toolkits, and IPs for Host Computer Interfaces to FPGAs, e.g. GiDEL
Future platforms, e.g. ARM+FPGA on a single die by Altera and Xilinx
Recent commercial successes, e.g. Fusion-io, Netezza, etc.
What we achieved
Performance speed up of 23X to 38X
Energy efficiency improvements of 31X to 40X
Performance-per-cost improvement of 10X
6. Choice of Workload
Wide variety of emerging data-centric workloads
Operations: collect and store, maintain & manage; retrieve, interpret &
analyze
Focus on important emerging class: real-time unstructured
search
Searching patent repositories for related work comparison
Searching emails and SharePoint sites for enterprise information management
Detecting spam in incoming emails
Monitoring communications, e.g. for terrorist activity
News story topic detection and tracking
Searching through books, images, and videos for matching profiles
7. Algorithm Description
Document model
Each document modeled as bag of words “D” of pairs (t,f)
t is a term; f is number of occurrences of t in document D
Profile “M” is a set of pairs p = (t,w)
t is term; w is weight function
Bayesian algorithm used offline to precompute profile based on user requirements
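The document and profile model above can be sketched in a few lines of Python. The slide does not state how a document is scored against a profile, so the weighted sum of f times w over matching terms used here is an assumption for illustration, and the names bag_of_words and score_document and the profile weights are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Turn raw text into the document model D: a mapping term -> frequency."""
    return Counter(text.lower().split())

def score_document(doc, profile):
    """Sum f * w over the document terms that also appear in the profile.

    ASSUMPTION: the slide does not give the scoring rule; a weighted sum
    is used here purely as an illustration.
    """
    return sum(f * profile[t] for t, f in doc.items() if t in profile)

# Hypothetical profile M: term -> weight (precomputed offline in the real system)
profile = {"fpga": 2.0, "bloom": 1.5, "filter": 0.5}

doc = bag_of_words("FPGA bloom filter lookup on FPGA board")
score = score_document(doc, profile)
print(score)
```

Most document terms ("lookup", "on", "board") are not in the profile and contribute nothing, which is the property the hardware design exploits.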
8. Hardware Platform
FPGA Board
GiDEL PROCStar-IV development board
Internal FPGA memory of 20 Mb
External memory for a single FPGA
Bank A: 512 MB (profile and scores)
Bank B/C: 2 GB each (document stream)
Application Implementation
GiDEL External Memory IPs
Algorithm in VHDL in Altera Quartus
GiDEL ProcWizard to integrate the algorithm with its IPs
9. Baseline Systems
An optimized multi-threaded reference implementation
Written in C++, compiled with g++ at optimization level -O3
Different platforms
System 1 – Intel Core 2 Duo Mobile E8435, 3.06 GHz and 8GB RAM
System 2 – Intel Core i7-2600 (4 cores, 8 hardware threads), 3.4 GHz and 16GB RAM
The high-memory baselines are needed so that the whole document
collection fits in memory
Reading the data from disk would otherwise dominate the runtime
The collection is therefore preloaded into memory
10. Hardware Algorithm Description
Profile Storage    Ext. Mem (Bank A)    Bloom Filter
Latency/Term       20 cc                1 cc
The probability that an input term is a
profile hit is extremely small.
A Bloom filter is used to discard
misses.
How to extract parallelism from the FPGA?
Parallel term lookup in the Bloom filter
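The idea of discarding misses cheaply before paying for an external-memory lookup can be sketched in software as follows. This is a minimal illustration, not the VHDL design: the hash construction (SHA-256 with an index prefix), the filter size, the number of hash functions, and the weighted-sum scoring are all assumptions, and the class and function names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Bit-array Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, term):
        # ASSUMPTION: illustrative hashing, not the hash used on the FPGA
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(term))

def score_with_prefilter(doc_terms, profile, bloom):
    """Score a document, touching the slow profile store only on filter hits."""
    score, slow_lookups = 0.0, 0
    for term, freq in doc_terms:
        if not bloom.might_contain(term):
            continue                      # definite miss: fast path (1 cc on the slide)
        slow_lookups += 1                 # possible hit: slow path (20 cc on the slide)
        weight = profile.get(term)        # None means a Bloom false positive
        if weight is not None:
            score += freq * weight
    return score, slow_lookups

profile = {"fpga": 2.0, "bloom": 1.5}     # hypothetical weights
bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for term in profile:
    bloom.add(term)

doc_terms = [("fpga", 2), ("the", 10), ("bloom", 1), ("of", 7)]
score, slow_lookups = score_with_prefilter(doc_terms, profile, bloom)
```

Because a Bloom filter never reports a stored term as absent, the score is unaffected by the filter; false positives only cost an extra slow lookup.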
11. Hardware Algorithm Description
Multi-bank Bloom filter: reduces
congestion when several terms are looked up at once
Look up eight terms in parallel in the Bloom
filter
Individual banks are implemented in
Altera M9K (hard memory blocks) on
the FPGA
The current implementation uses only
half of the 1280 M9K blocks to map 4 Mb of
profile
To decrease/eliminate false positives,
future Bloom filter designs include
8 Mb on all the M9K blocks (130 MHz)
16 Mb profile size on both M9K and M144K
together (100 MHz)
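Why more banks reduce congestion for an eight-term parallel lookup can be illustrated with a small Monte Carlo sketch. The model is deliberately simplified and every part of it is an assumption: each term maps to one uniformly random bank, each bank serves one lookup per cycle, and a batch therefore takes as many cycles as its busiest bank has lookups. The bank counts tried are arbitrary and do not come from the slides.

```python
import random
from collections import Counter

def cycles_for_batch(bank_ids):
    """Cycles needed when each bank serves one lookup per cycle (assumed model)."""
    return max(Counter(bank_ids).values())

def mean_cycles(num_banks, batch_size=8, trials=20000, seed=1):
    """Average cycles per batch of parallel lookups over random bank assignments."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        batch = [rng.randrange(num_banks) for _ in range(batch_size)]
        total += cycles_for_batch(batch)
    return total / trials

# Hypothetical bank counts, chosen only to show the trend
results = {banks: mean_cycles(banks) for banks in (8, 16, 64, 256)}
for banks, cycles in results.items():
    print(f"{banks:4d} banks: {cycles:.2f} cycles per 8-term batch")
```

Under these assumptions the average cost per batch approaches one cycle as the number of banks grows, which is the effect the multi-bank design aims for.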
13. Synthetic Data Sets
Creating Synthetic Data Sets
Real-world data is hard to access, e.g. patent collections are governed by
licenses that restrict their use.
Synthetic document collections statistically match real-world collections.
Real-World Document Collections
Newspaper Collection (TREC Aquaint)
Patent collection from US Patent Office (USPTO), and European Patent
Office (EPO)
The Lemur Information Retrieval toolkit (www.lemurproject.org) is used to
determine the rank frequency for all the terms in the collection
Collection   # Docs      Avg. Doc. Len.   Avg. Uniq. Terms
Aquaint      1,033,461   437              169
USPTO        1,406,200   1718             353
EPO          989,507     3863             705
14. Synthetic Data Sets
Modeling Distributions of terms
Most natural language documents follow a Zipfian rank-frequency distribution
We use Montemurro’s extension to Zipf's law
Modeling Document Lengths
Sampled from a truncated Gaussian
Verified using a χ² test at 95% confidence
Synthetic documents of varying lengths
Each document's terms follow the fitted rank-frequency distribution
Convert documents into the standard bag-of-words representation
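The generation steps above can be sketched as follows. This is a simplified stand-in: it samples term ranks from a plain Zipf law rather than Montemurro's extension, and draws lengths from a Gaussian truncated by rejection. The mean length of 437 is the Aquaint average from the collection table; the vocabulary size, standard deviation, and length bounds are hypothetical values chosen for illustration.

```python
import random
from collections import Counter
from itertools import accumulate

def zipf_cum_weights(vocab_size, exponent=1.0):
    """Cumulative weights for term ranks 1..vocab_size under plain Zipf (1/rank^s)."""
    return list(accumulate(1.0 / rank ** exponent
                           for rank in range(1, vocab_size + 1)))

def truncated_gauss_length(rng, mean, stddev, low, high):
    """Draw a document length from a Gaussian, rejecting values outside [low, high]."""
    while True:
        length = round(rng.gauss(mean, stddev))
        if low <= length <= high:
            return length

def synthetic_document(rng, cum_weights, mean, stddev, low, high):
    """Return one document as a bag of words: term id -> frequency."""
    length = truncated_gauss_length(rng, mean, stddev, low, high)
    term_ids = rng.choices(range(len(cum_weights)),
                           cum_weights=cum_weights, k=length)
    return Counter(term_ids)

rng = random.Random(42)
cum = zipf_cum_weights(vocab_size=50_000)          # hypothetical vocabulary size
docs = [synthetic_document(rng, cum, mean=437, stddev=150, low=50, high=1024)
        for _ in range(100)]                       # stddev and bounds are assumptions
```

Each generated document is already in the bag-of-words form, so no separate conversion step is needed in this sketch.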
15. Experimental Parameters
The performance of the algorithm on the system depends on
the size of the collection
256K documents of 4096 terms (Patent collection)
1M documents of 1024 terms (Aquaint collection)
the size of the profile
4K, 16K and 64K terms, which are similar to those of TREC Aquaint and EPO
Profile Types
“Random”: selecting random documents from the
collection until the desired profile size is reached, hit probability 10^-5
“Selected”: selecting terms that occur in very few documents (most
representative of real-world usage), hit probability 5·10^-4
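A short calculation makes these parameters concrete. It assumes that "K" and "M" are binary multiples (1024 and 1024²) and that the quoted hit probability applies independently to each term of a document; neither assumption is stated on the slide.

```python
K = 1024            # ASSUMPTION: binary multiples
M = 1024 * 1024

# (number of documents, terms per document), taken from the slide
collections = {
    "patent-like": (256 * K, 4096),
    "Aquaint-like": (1 * M, 1024),
}
hit_probability = {"random": 1e-5, "selected": 5e-4}

def total_terms(num_docs, terms_per_doc):
    """Total number of terms streamed for one pass over the collection."""
    return num_docs * terms_per_doc

def expected_hits_per_doc(terms_per_doc, p_hit):
    """Expected profile hits in one document if p_hit applies per term (assumed)."""
    return terms_per_doc * p_hit

for name, (num_docs, terms_per_doc) in collections.items():
    print(name, "total terms:", total_terms(num_docs, terms_per_doc))
    for profile_type, p in hit_probability.items():
        hits = expected_hits_per_doc(terms_per_doc, p)
        print(f"  {profile_type}: about {hits:.3f} expected hits per document")
```

Under these assumptions both collections contain the same total number of terms (2^30), and even the "Selected" profile yields only a couple of expected hits in a 4096-term document.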
17. Performance versus Cost
We used the cost model from Shah and Patel’s work
Cost Breakdown                CPU          CPU+FPGA
Space                         21M$/y
Power & Cooling               52M$/y       29M$/y
IT Infrastructure             59M$/y       248M$/y
Total                         132M$/y      299M$/y
Performance (single system)   136Mops/s    3090Mops/s
Performance/Cost              32Mops/$     330Mops/$
To account for device demand
economics, Performance/Cost is calculated for
several FPGA costs: $2000,
$4000, and $8000
[Figure: Performance/Cost versus FPGA system cost, and the effect of various
speedup factors on performance gains; axis: Gain Factor]
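The headline ratios can be checked directly from the Total and Performance rows of the cost table. This sketch simply divides performance by total yearly cost; it does not attempt to reproduce the absolute Performance/Cost figures of 32 and 330, whose normalization is not given on the slide. The function name perf_per_cost is hypothetical.

```python
def perf_per_cost(perf_mops, cost_musd_per_year):
    """Performance per unit of yearly cost (Mops/s per M$/y), a relative measure."""
    return perf_mops / cost_musd_per_year

# Values from the Total and Performance rows of the table
cpu = perf_per_cost(perf_mops=136, cost_musd_per_year=132)
fpga = perf_per_cost(perf_mops=3090, cost_musd_per_year=299)

speedup = 3090 / 136          # raw single-system speedup
gain = fpga / cpu             # performance-per-cost improvement

print(f"speedup: {speedup:.1f}x, performance-per-cost gain: {gain:.1f}x")
```

The cost gain is smaller than the raw speedup because the CPU+FPGA configuration has the higher total yearly cost.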
18. Alternatives and Future Work
ASIC Bloom Filter
Frequency of operation, and its effect on Power Consumption
GPGPU
Frequency of operation, size of internal memory, and I/O bottleneck
Decrease congestion probability for multi term access of
Bloom Filter by increasing number of banks
In-depth characterization of other diverse workloads
Explore low-power host systems, such as ARM, Atom, etc.
Implement the hardware algorithm using high-level languages and frameworks
such as Impulse-C, Catapult-C, the MORA framework, and OpenCL
19. Conclusion
Growing demand for data-center computing and “green
computing” motivates high-performance systems
with improved energy efficiency
A new FPGA-accelerated system design for information retrieval
or unstructured search
The algorithm is implemented on the GiDEL ProcStar IV (Altera Stratix IV
530 FPGA), achieving 800 Mterms/sec of throughput with a power
consumption of 6 W
Comparison of the FPGA system with the baseline systems
Speed-up of 23X to 38X
Energy efficiency improvement of 31X to 40X
Performance-per-cost improvement of 10X
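The throughput and power figures quoted above translate into energy per term as follows. This is plain arithmetic on the two quoted numbers and assumes both refer to the same steady-state operating point.

```python
throughput_terms_per_s = 800e6   # 800 Mterms/sec, from the slide
power_watts = 6.0                # 6 W, from the slide

# 1 W = 1 J/s, so terms per joule is throughput divided by power
terms_per_joule = throughput_terms_per_s / power_watts
nanojoules_per_term = power_watts / throughput_terms_per_s * 1e9

print(f"{terms_per_joule / 1e6:.1f} Mterms/J, {nanojoules_per_term:.1f} nJ per term")
```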