Presentation Ispass 2012 Session6 Presentation1

Evaluating FPGA-acceleration for Real-time
Unstructured Search
Sai Rahul Chalamalasetti†, Martin Margala†, Wim Vanderbauwhede*,
Mitch Wright‡, Parthasarathy Ranganathan‡‡
†University of Massachusetts Lowell, Lowell, MA
*University of Glasgow, Scotland, UK
‡Hewlett Packard, Houston, TX
‡‡Hewlett Packard Labs, Palo Alto, CA

4/9/2012 1

Outline
• Motivation
• Workload and Algorithm Description
• Hardware Systems
• Synthetic Datasets
• Performance Results
• Alternatives and Future Work
• Conclusion


Motivation
The era of “big data”
• Explosion in data, particularly unstructured data
  • Information doubling every 18 months or faster
  • Enterprise server systems processed and delivered over 9 zettabytes in 2008 (UCSD report)
  • Walmart: 1M transactions/hour; LHC: 1 PB/second; YouTube: 48 hours of video/minute; Facebook: 100 TB of logs/day
• Explosion in data-centric workloads
  • Collect, store, access, share, visualize, analyze, interpret, …
  • Consumer, Enterprise, Scientific, …
  • New applications emerging recently: search, live business analytics, social correlation, collaborative filtering, …
• Need better performance for deeper analytics across diverse data


Motivation
• The era of “green computing”
• Power and cooling are important constraints for servers/datacenters
  • Only 4 countries consume more electricity than worldwide datacenters; millions of dollars for cloud datacenters
  • Thermal density and costs of power delivery and cooling infrastructure
• Sustainability is a growing concern
  • Lifecycle minimization of environmental effects and carbon emissions
  • Corporate initiatives from HP, Cisco, Dell, Google, IBM, Intel; government initiatives from EPA, DOE, etc.


This Work
• High-performance, energy-efficient data-centric architectures
  • FPGAs/accelerators are a good way to improve energy efficiency
  • Accelerated unstructured search, mainly data analysis (document filtering & profile match)
  • GiDEL ProcStar IV board (four Altera Stratix IV 530 FPGAs)
• Recent developments offer promise
  • Better toolkits and IPs for host-computer interfaces to FPGAs, e.g. GiDEL
  • Future platforms, e.g. ARM+FPGA on a single die by Altera and Xilinx
  • Recent commercial successes, e.g. Fusion-io, Netezza, etc.
• What we achieved
  • Performance speedup of 23X to 38X
  • Energy efficiency improvements of 31X to 40X
  • Performance-per-cost improvement of 10X


Choice of Workload
• Wide variety of emerging data-centric workloads
  • Operations: collect and store; maintain & manage; retrieve, interpret & analyze
• Focus on an important emerging class: real-time unstructured search
  • Searching patent repositories for related work comparison
  • Searching emails and share points for enterprise information management
  • Detecting spam in incoming emails
  • Monitoring communications for, e.g., terrorist activity
  • News story topic detection and tracking
  • Searching through books, images, and videos for matching profiles


Algorithm Description
• Document model
  • Each document is modeled as a bag of words “D” of pairs (t, f)
  • t is a term; f is the number of occurrences of t in document D
• Profile “M” is a set of pairs p = (t, w)
  • t is a term; w is the weight of that term
  • A Bayesian algorithm is used offline to precompute the profile based on user requirements
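
A minimal sketch of how a document could be scored against a profile under this model. The slide defines only the (t, f) and (t, w) pairs; the scoring rule used here (sum of f × w over terms that appear in both the document and the profile) is an assumption, and `score_document` and the sample terms are hypothetical names chosen for illustration.

```python
def score_document(document, profile):
    """Sum f * w over document terms that also appear in the profile.

    document: iterable of (term, frequency) pairs, i.e. the bag of words D
    profile:  dict mapping term -> weight, i.e. the profile M
    The scoring rule itself is an assumption, not taken from the slides.
    """
    return sum(freq * profile[term] for term, freq in document if term in profile)


# Hypothetical profile and document
profile = {"fpga": 2.0, "search": 0.5}
document = [("fpga", 3), ("cpu", 7), ("search", 4)]

score = score_document(document, profile)  # 3 * 2.0 + 4 * 0.5
print(score)
```

Terms that are absent from the profile ("cpu" here) contribute nothing to the score.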


Hardware Platform

• FPGA Board
  • GiDEL PROCStar-IV development board
  • Internal FPGA memory of 20 Mb
  • External memory for a single FPGA
    • Bank A: 512 MB (profile and scores)
    • Bank B/C: 2 GB each (document stream)
• Application Implementation
  • GiDEL external memory IPs
  • Algorithm in VHDL in Altera Quartus
  • GiDEL ProcWizard to integrate the algorithm with its IPs


Baseline Systems
• An optimized multi-threaded reference implementation
  • Written in C++, compiled with g++ at optimization level -O3
• Different platforms
  • System 1: Intel Core 2 Duo Mobile E8435, 3.06 GHz, 8 GB RAM
  • System 2: 8-core Intel Core i7-2600, 3.4 GHz, 16 GB RAM
• The high-memory baselines are required to provide sufficient memory for the data collection
  • Reading the data from disk would dominate the performance
  • The collection is preloaded in memory


Hardware Algorithm Description
Profile storage    Ext. Mem (Bank A)    Bloom Filter
Latency/term       20 cc                1 cc

• The probability of an input term being a profile hit is extremely small.
• A Bloom filter is used to discard misses.
• How to extract parallelism from the FPGA?
  • Parallel term lookup in the Bloom filter
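
The idea above can be sketched in software as follows: a Bloom filter answers "definitely not in the profile" cheaply, so the slow profile lookup (standing in for external memory Bank A) is attempted only for possible hits. This is an illustrative sketch, not the VHDL design; the hash construction (SHA-256 with a per-hash prefix), the number of hash functions, the filter size, and all names are assumptions.

```python
import hashlib


class BloomFilter:
    """Bit-array Bloom filter; sizes and hash functions are illustrative assumptions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, term):
        # Derive num_hashes bit positions by hashing the term with different prefixes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, term):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(term))


def score_with_filter(document, profile, bloom):
    """Score a document, consulting the slow profile store only on Bloom filter hits."""
    score = 0.0
    slow_lookups = 0
    for term, freq in document:
        if not bloom.might_contain(term):
            continue  # miss discarded without touching the profile store
        slow_lookups += 1
        score += freq * profile.get(term, 0.0)  # a false positive adds nothing
    return score, slow_lookups


profile = {"fpga": 2.0, "bloom": 1.5}  # hypothetical profile
bloom = BloomFilter(num_bits=1024, num_hashes=3)
for term in profile:
    bloom.add(term)

document = [("fpga", 3), ("cpu", 5), ("bloom", 2)]
score, slow_lookups = score_with_filter(document, profile, bloom)
print(score, slow_lookups)
```

A false positive costs one unnecessary slow lookup but never changes the score, because the profile itself remains the source of the weights.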


Hardware Algorithm Description
• Multi-bank Bloom filter: decreases congestion for multiple simultaneous lookups
• Look up eight terms in parallel in the Bloom filter
• Individual banks are implemented in Altera M9K hard memory blocks on the FPGA
• The current implementation uses only half of the 1280 M9K blocks to map 4 Mb of profile
• To decrease/eliminate false positives, future Bloom filter designs include
  • 8 Mb on all the M9K blocks (130 MHz)
  • 16 Mb profile size on both M9K and M144K blocks together (100 MHz)
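
To illustrate why banking matters for parallel lookups, the sketch below counts how many cycles one batch of lookups needs when lookups that land in the same bank must be served one after another. The bank selection rule (address modulo number of banks), the one-lookup-per-bank-per-cycle limit, and the function name are assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter


def lookup_cycles(addresses, num_banks):
    """Cycles to serve one batch, assuming each bank serves one lookup per cycle."""
    per_bank = Counter(addr % num_banks for addr in addresses)
    return max(per_bank.values(), default=0)


# Eight hypothetical Bloom filter bit addresses issued in parallel
batch = [0, 8, 16, 3, 4, 5, 6, 7]

cycles_8 = lookup_cycles(batch, 8)    # addresses 0, 8 and 16 all map to bank 0
cycles_16 = lookup_cycles(batch, 16)  # only addresses 0 and 16 still collide
print(cycles_8, cycles_16)
```

Under these assumptions, adding banks lowers the chance that several of the eight parallel lookups collide in the same bank.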


Implemented Algorithm

Utilization   Logic Elements (424 KLEs)   Memory MRAMs (20 Mbits)
Total         17,562                      4
Algorithm     4,561 (1%)                  4 (22%)


Synthetic Data Sets
• Creating synthetic data sets
  • Real-world data is hard to access, e.g. patent collections are governed by licenses that restrict their use.
  • Synthetic document collections statistically match real-world collections.
• Real-world document collections
  • Newspaper collection (TREC Aquaint)
  • Patent collections from the US Patent Office (USPTO) and the European Patent Office (EPO)
• The Lemur² Information Retrieval toolkit is used to determine the rank frequency for all the terms in the collection

Collection   # Docs      Average Doc. Len.   Average Uniq. Terms
Aquaint      1,033,461   437                 169
USPTO        1,406,200   1718                353
EPO            989,507   3863                705

² www.lemurproject.org

Synthetic Data Sets
• Modeling distributions of terms
  • Most natural-language documents follow a Zipfian rank-frequency distribution
  • We use Montemurro’s extension to Zipf's law
• Modeling document lengths
  • Sampled from a truncated Gaussian
  • Verified using a χ² test with 95% confidence
• Synthetic documents of varying lengths
  • Each document's terms follow the fitted rank-frequency distribution
  • Documents are converted into the standard bag-of-words representation
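
The generation steps above can be sketched as follows. This is a simplified illustration: it uses plain Zipf weights (1/rank) rather than Montemurro’s extension, and the vocabulary size, standard deviation, and length bounds are hypothetical values, not the fitted parameters. Only the mean length of 437 is borrowed from the Aquaint row of the collection statistics.

```python
import random
from collections import Counter


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized plain-Zipf weights: frequency proportional to 1 / rank**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def truncated_gauss(rng, mean, stddev, low, high):
    """Sample a Gaussian by rejection until the value falls inside [low, high]."""
    while True:
        x = rng.gauss(mean, stddev)
        if low <= x <= high:
            return int(round(x))


def synthetic_document(rng, weights, mean_len, stddev_len, min_len, max_len):
    """Return one document as a sorted bag of words: a list of (term_id, frequency)."""
    length = truncated_gauss(rng, mean_len, stddev_len, min_len, max_len)
    term_ids = rng.choices(range(1, len(weights) + 1), weights=weights, k=length)
    return sorted(Counter(term_ids).items())


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = zipf_weights(vocab_size=1000)
doc = synthetic_document(rng, weights, mean_len=437, stddev_len=100, min_len=50, max_len=1000)
total_terms = sum(freq for _, freq in doc)
print(len(doc), total_terms)
```

Term IDs are ranks, so low IDs are the frequent terms; a fitted distribution would replace `zipf_weights` without changing the rest of the sketch.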


Experimental Parameters
• The performance of the algorithm on the system depends on
  • the size of the collection
    • 256K documents of 4096 terms (Patent collection)
    • 1M documents of 1024 terms (Aquaint collection)
  • the size of the profile
    • 4K, 16K and 64K terms, which are similar to those of TREC Aquaint and EPO
• Profile types
  • “Random”: selecting a number of random documents from the collection until the desired profile size is reached; hit probability 10⁻⁵
  • “Selected”: selecting terms that occur in very few documents (most representative of real-world usage); hit probability 5·10⁻⁴


Performance Results
Throughput (M terms/sec)

                256K documents of 4096 terms       1M documents of 1024 terms
Profile         System1  System2  FPGA board       System1  System2  FPGA board
Random, 4K          269      416        3090           292     1118        3090
Random, 16K         245      324        3090           288     1014        3090
Random, 64K         223      379        3090           253      945        3090
Selected, 4K        118      232        3088           120      309        3088
Selected, 16K       107      164        3088            94      350        3088
Selected, 64K        82      136        3088            72      183        3088
Empty, 4K           710     1564        3090           911     2005        3090
Empty, 16K          711     1664        3090           844     1976        3090
Empty, 64K          710     1338        3090           877     1952        3090
Full, 4K              8       11          36             7       10          36
Full, 16K             8       12          36             8       12          36
Full, 64K             9       10          36             8       11          36

Power consumption of document filtering application (W)

#Threads    System1  System2  FPGA System
0 (Idle)         40       67         35
1                67       93         61.5
2                67      107         68
4                67      135         74.5
8                67*     141         81

FPGA / System *   System1  System2
Speed up              38X      23X
Perf. / Watt          31X      40X
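
The reported ratios can be reproduced from the measurements with a short calculation. Which rows the authors used is not stated on the slide; the sketch assumes the “Selected, 64K” throughput on 256K documents of 4096 terms and the 8-thread power figures, because those values give the reported numbers after rounding.

```python
# Assumed inputs: "Selected, 64K" row, 256K documents of 4096 terms (M terms/sec)
throughput = {"System1": 82, "System2": 136, "FPGA": 3088}
# Assumed inputs: power at 8 threads (W)
power = {"System1": 67, "System2": 141, "FPGA": 81}


def speedup(baseline):
    """FPGA throughput divided by the baseline system's throughput."""
    return throughput["FPGA"] / throughput[baseline]


def perf_per_watt_gain(baseline):
    """Ratio of throughput per watt: FPGA system versus baseline system."""
    fpga = throughput["FPGA"] / power["FPGA"]
    base = throughput[baseline] / power[baseline]
    return fpga / base


for system in ("System1", "System2"):
    print(system, round(speedup(system)), round(perf_per_watt_gain(system)))
```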


Performance versus Cost
• We used the cost model from Shah and Patel’s work

Cost Breakdown                CPU           CPU+FPGA
Space                         21M$/y        21M$/y
Power & Cooling               52M$/y        29M$/y
IT Infrastructure             59M$/y        248M$/y
Total                         132M$/y       299M$/y
Performance (single system)   136 Mops/s    3090 Mops/s
Performance/Cost              32 Mops/$     330 Mops/$

• Considering device demand economics, Performance/Cost is calculated for various FPGA costs, such as $2000, $4000, and $8000
• Effect of various speedup factors on the gain factor

[Figure: Performance/Cost versus FPGA system cost and performance gains]
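
As a rough cross-check of the performance-per-cost gain, the sketch below divides single-system performance by total yearly cost for each configuration and compares the two. It reproduces only the relative gain; the absolute Mops/$ figures depend on details of the cost model that the slide does not give, so they are not recomputed here. The function name is a hypothetical one chosen for illustration.

```python
def perf_per_cost(perf_mops, total_cost_musd_per_year):
    """Performance divided by total yearly cost (arbitrary units, used for ratios only)."""
    return perf_mops / total_cost_musd_per_year


cpu = perf_per_cost(136, 132)     # CPU-only: 136 Mops/s, 132 M$/y
fpga = perf_per_cost(3090, 299)   # CPU+FPGA: 3090 Mops/s, 299 M$/y
gain = fpga / cpu
print(f"relative performance/cost gain: {gain:.1f}X")
```

The result is close to the 10X improvement quoted in the summary slides.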


Alternatives and Future Work
• ASIC Bloom filter
  • Frequency of operation and its effect on power consumption
• GPGPU
  • Frequency of operation, size of internal memory, and I/O bottleneck
• Decrease the congestion probability for multi-term access to the Bloom filter by increasing the number of banks
• In-depth characterization of other diverse workloads
• Explore low-power host systems, such as ARM, Atom, etc.
• Implement the hardware algorithm using high-level languages such as Impulse-C, Catapult-C, the MORA framework and OpenCL


Conclusion
• Growing demand for data-center computing and “green computing” motivates the design of high-performance systems with improved energy efficiency
• A new FPGA-accelerated system design for information retrieval or unstructured search
• The algorithm is implemented on a GiDEL ProcStar IV (Altera Stratix IV 530 FPGA), achieving a throughput of 800 M terms/sec with a power consumption of 6 W
• Comparison of the FPGA system with the baseline systems
  • Speedup of 23X to 38X
  • Energy efficiency improvement of 31X to 40X
  • Performance-per-cost improvement of 10X


Thank you

Questions?
