Talk at Information on Demand Conference 2011. As part of the Informix Ultimate Warehouse Edition, Informix
Warehouse Accelerator (IWA) transparently provides up to several
orders of magnitude speedup in query performance for
Informix Dynamic Server (IDS), as well as enormous administrative
cost savings. Combined with the Intel Xeon E7 processor series,
Informix and the Accelerator bring the performance and
scalability of IDS solutions to new levels. This presentation will
give best practices and benefits of IWA and the Intel Xeon E7
processors, and highlight the implications and performance
benefits of running IDS and IWA on these processors, compared
to previous releases of IDS and prior Intel server platforms.
Performance and scalability of Informix Ultimate Warehouse Edition on Intel Xeon 7500 and E7 processors
1. Performance and Scalability of
Informix® Ultimate Warehouse Edition
on Intel Xeon® 7500 and E7 processors
Session Number 2864
Keshava Murthy, IBM®
Jantz Tran, Intel®
2. Agenda
• Intel Inside
• IWA Overview
• Key performance features in Intel
• How IWA is exploiting the Intel features.
• Performance results
3. Tick-Tock Development Model
Sustained Xeon® Microprocessor Leadership
Process cadence: Tick, Tock, Tick, Tock, Tick, Tock, Tick, Tock (65nm, 45nm, 32nm, 22nm)
Processors shown: Xeon® 5100, Xeon® 5300, Xeon® 7400, Xeon® 7500, Xeon® E7, Sandy Bridge-EP/EN, Ivy Bridge EP/EN
Intel® Core™ Microarchitecture
• First high-volume server Quad-Core CPUs
• Dedicated high-speed bus per CPU
• HW-assisted virtualization (VT-x)
Nehalem/Westmere Microarchitecture
• Up to 10 cores and 30MB Cache
• Integrated memory controller with DDR3 support
• Turbo Boost, Intel HT, AES-NI
• End-to-end HW-assisted virtualization (VT-x, -d, -c)
Sandy Bridge/Ivy Bridge Microarchitecture
• Up to 8 cores and 20MB Cache
• Integrated PCI Express
• Turbo Boost 2.0
• Intel Advanced Vector Extensions (AVX)
4. Intel® Xeon® Processor Family for Business
Platforms (increasing capability from bottom to top):
• Intel® Xeon® processor E7 platforms: Scalable (up to 256-way), reliable, powerful 64-bit multi-core servers offering industry-leading performance, expanded memory & I/O capacity, and advanced reliability ideal for the most demanding enterprise and mission critical workloads, large scale virtualization and large-node HPC applications.
• Intel® Xeon® processor 5000 sequence platforms (E5 in 2012): Versatile (up to 2-way) servers for all your infrastructure, high-density, workstation and HPC applications with features that enable optimal performance and power efficiency for the data center.
• Intel® Xeon® processor 3000 sequence platforms (E3 in 2012): Economical (1-way), dependable general purpose 64-bit servers well-suited for small businesses and education with features that optimize performance, uptime, and security.
Segments:
• Scalable Enterprise: Top-of-the-line performance, scalability, and reliability
• Mainstream Enterprise: Best combination of performance, power efficiency, and cost
• Small Business: Economical and more dependable vs. desktop
• Mission Critical: Performance and reliability for the most business critical workloads with outstanding economics
• Enterprise Server: Versatility for infrastructure apps (up to 4S)
• Cloud Computing: Efficient, secure, and open platforms for Internet datacenters and IAAS
• Cloud Computing: Highest virtualization density and advanced reliability for private cloud
• Entry Servers and Workstations: More features and performance than traditional desktop systems
• High Performance Computing & Workstations: Bandwidth-optimized for high performance analytics & visualization
• High Performance Computing: Greater scaling and memory capacity
5. Intel® Xeon® Processor E7-8800/4800/2800 Product Families
Building on Xeon® 7500 Leadership Capabilities
More Performance
• 10 cores / 20 threads
• 30MB of last level cache
More Expandable
• Supports 32GB DDR3 DIMMs (2TB per 4-socket system)1
More Efficient
• More performance within same max CPU TDP as Xeon 7500
• Lower partial active & idle power via Intel Intelligent Power Technology2
• Support for Low Voltage DIMMs3
• Reduced power memory buffers4
More Security & RAS
SECURITY
• Intel® Advanced Encryption Standard-New Instructions
• Intel® Trusted Execution Technology (TXT)
RELIABILITY, AVAILABILITY, SERVICEABILITY
• Enhanced DRAM Double Device Data Correction
• Fine Grained Memory Mirroring
[Diagram: four E7-4800 sockets]
Delivers more Performance, Expandability and RAS while improving Energy Efficiency
1. Up to 64 slots per standard 4 socket system x 32GB/DIMM = 2TB
2. Uses similar core and package C6 power states enabled on Intel Xeon 5500/5600 series processors. Requires OS support.
3. Savings dependent on workload and configuration.
4. Memory buffer power savings of up to 1.3W active and 3W idle per buffer per Intel estimates. Slightly more savings when used with LV DIMMs
6. Advantages of the Xeon® E7 Platform
4-socket systems can…
…process the biggest workloads…maximize consolidation
…increase system uptime…handle highly variable workloads
Intel® Xeon® Processor E7-4800 Product Family vs. Xeon® Processor 5600 Series
Large Workloads & Max. Consolidation
• Over 2X the compute performance across a range of benchmarks1
• Up to 7X memory capacity for greater performance, headroom and memory DIMM savings2
• Up to 2X higher consolidation3
Highly Variable Workloads
• More performance headroom to handle peak, unexpected, or underestimated workloads
• Compute, memory and I/O scalability extends useful server life in high-growth workloads
• Denser compute resources per server maximizes performance in constrained sites
Mission Critical Class System Availability
• Protects your data by preventing errors
• Increased availability via healing, redundancy and failover technologies
• Minimized downtime via failure prediction and proactive replacement of failing components
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and
MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that
product when combined with other products.
1. Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual
performance. For more information on performance tests and on the performance of Intel products, visit http://www.intel.com/performance/resources/limits.htm
2. 64 DIMM slots vs. 18 slots for the Xeon 5600 processor series platform
3. 2X higher consolidation refresh ratio based on ROI tool comparing Xeon 7500 and Xeon 5600 vs. older generations.
7. Advanced Reliability Starts With Silicon
Intel® Xeon® processor E7 family RAS Capabilities
Memory
• Inter-socket Memory Mirroring
• Intel® Scalable Memory Interconnect (Intel® SMI) Lane Failover
• Intel® SMI Clock Fail Over
• Intel® SMI Packet Retry
• Memory Address Parity
• Failed DIMM Isolation
• Memory Board Hot Add/Remove
• Dynamic Memory Migration*
• OS Memory On-lining*
• Recovery from Single DRAM Device Failure (SDDC) plus random bit error
• Memory Thermal Throttling
• Demand and Patrol scrubbing
• Fail Over from Single DRAM Device Failure (SDDC)
• Enhanced DRAM Double Device Data Correction
• Fine Grained Memory Mirroring
• Memory DIMM and Rank Sparing
• Intra-socket Memory Mirroring
• Mirrored Memory Board Hot Add/Remove
I/O Hub
• Physical IOH Hot Add
• OS IOH On-lining*
• PCI-E Hot Plug
CPU/Socket
• Machine Check Architecture (MCA) recovery (MCA-R)
• Corrected Machine Check Interrupt (CMCI)
• Corrupt Data Containment Mode
• Viral Mode
• OS Assisted Processor Socket Migration*
• OS CPU on-lining*
• CPU Board Hot Add at QPI
• Electronically Isolated (Static) Partitioning
• Single Core Disable for Fault Resilient Boot
Intel® QuickPath Interconnect
• Intel QPI Packet Retry
• Intel QPI Protocol Protection via CRC (8bit or 16bit rolling)
• QPI Clock Fail Over
• QPI Self-Healing
Advanced reliability features work to maintain data integrity
8. Intel® Xeon® processor E5-2600 product family (Sandy Bridge-EP)
New micro-architecture on the 32nm process technology
Higher performance
• Up to 8 cores, 20 MB cache
• New Intel® Advanced Vector Extensions
• Optimized Turbo Boost Technology
More Efficient
• Lower platform power1
More Intelligent
• Intel Node Manager enhancements
More Secure
• Intel AES-NI improvements
• More robust Intel TXT solutions
More Options
• Optimized platforms for: Performance, Smaller Form Factors, Best value
Platform Features (Sandy Bridge-EP)
• Up to 8 Cores
• Up to 2 QPI links between CPUs
• Up to 4 channels DDR3 1600 memory
• Integrated PCI Express* 3.0, up to 40 lanes per socket
1 Lower platform power claim based on a Xeon® 5600 CPU and Sandy Bridge-EP CPU with the same TDP specification and comparable platform configurations.
Platform power reduction is primarily attributed to TDP reduction from a two-chip solution based on the Intel 5520 chip set and ICH-10R, down to a one-chip south
bridge solution (Patsburg chip) on the Sandy Bridge platform.
9. INTEL: Breakthrough technologies for performance
1. Large memory support: 64-bit computing; System X with MAX5 supports up to 6TB on a single SMP box; up to 640GB on each node of blade center.
2. Large on-chip Cache: L1 cache 64KB per core, L2 cache is 256KB per core and L3 cache is about 24-30 MB. Additional Translation lookaside buffer (TLB).
3. Frequency Partitioning: Enabler for the effective parallel access of the compressed data for scanning. Horizontal and Vertical Partition Elimination.
4. Virtualization Performance: Lower overhead: Core micro-architecture enhancements, EPT, VPID, and End-to-End HW assist.
5. Hyperthreading: 2x logical processors; increases processor throughput and overall performance of threaded software.
6. Single Instruction Multiple Data: Specialized instructions for manipulating 128-bit data simultaneously.
7. Multi-core, multi-node environment: Nehalem has 8 cores and Westmere 10 cores. This trend is expected to continue.
11. Intel QuickPath Architecture
•Connectivity
– Fully-connected by 4 Intel® QuickPath interconnects per socket
– 6.4, 5.86, or 4.8 GT/s on all links
– With 2 IOHs: 82 PCIe lanes (72 Gen2 Boxboro lanes + 4 Gen1 lanes on unused ESI port + 6 Gen1 ICH10 lanes)
– PCI-E Gen 2.0
•Memory
– Registered DDR3 800/1066 MHz via on-board memory buffer
– 64 DIMM support (4:1 DIMM to buffer ratio)
[Diagram: four 7500/E7 CPUs, each with memory buffers (MB), linked by Intel® QuickPath interconnects, with two Boxboro IOHs]
12. Intel® Xeon® 7500/E7 8 Socket Configuration
4+4 (8S) IBM® System x3850 X5
• Up to 10 cores and 2.4 GHz per CPU
• Supports 8 socket mode by combining 2 systems via external QPI links
Memory Configuration
• 4TB in 8 socket server
• 6TB in 8 socket + MAX5
• Continued 1066MHz support
13. Intel®: SIMD – Single Instruction Multiple Data
technology
• The Intel Xeon® E7 processor supports up to SSE 4.2
• SIMD capabilities will be expanded to 256-bit registers with the new AVX
instruction set in the upcoming Intel® Xeon® E5 series processors
• Informix leverages SSE in the Warehouse Accelerator
14. Intel® Xeon® Processors: Virtualization Performance
[Chart: Virtualization Performance, VMmark* Performance. Greater virtualization efficiency is attributed to Intel QPI, DDR3 memory bandwidth and capacity, and Intel® VT (VT-x, VT-d, VT-c).]
1 Best published VMmark results as of 20 October 2010.
See legal information slide, speaker notes and backup foils (if needed) for notes and disclaimers.
15. Third Generation of Database Technology
According to IDC’s Article (Carl Olofson) – Feb. 2010
1st Generation:
- Vendor proprietary databases of IMS, IDMS, Datacom
2nd Generation:
- RDBMS for Open Systems, dependent on disk layout, limitations in scalability and
disk I/O
- Database tuning by updating stats, creating/dropping indexes, data
partitioning, summary tables & cubes, forcing query plans, resource governing
3rd Generation: IDC Predicts that within 5 years:
• Most data warehouses will be stored in a columnar fashion
• Most OLTP databases will either be augmented by an in-memory database (IMDB) or
reside entirely in memory
• Most large-scale database servers will achieve horizontal scalability through
clustering
16. Informix Warehouse Accelerator
Step 1. Install, configure, start Informix
Step 2. Install, configure, start Accelerator
Step 3. Connect Studio to Informix & add accelerator
Step 4. Design, validate, deploy data mart
Step 5. Load data to accelerator
Ready for Queries
[Diagram: IBM Smart Analytics Studio, Informix Database Server, Informix Warehouse Accelerator and BI Applications, annotated with Steps 1 to 5]
17. Informix Warehouse Accelerator
3rd Generation Database Technology is Here
What is it?
The Informix Warehouse Accelerator (IWA) is a workload optimized, appliance-like add-on that enables the integration of business insights into operational processes to drive winning strategies. It accelerates select queries, with unprecedented response times.
How is it different?
• Performance: Unprecedented response times to enable 'train of thought' analysis frequently blocked by poor query performance.
• Integration: Connects to IDS through deep integration providing transparency to all applications.
• Self-managed workloads: queries are executed in the most efficient way
• Transparency: applications connected to IDS are entirely unaware of IWA
• Simplified administration: appliance-like hands-free operations, eliminating many database tuning tasks
Breakthrough Technology Enabling New Opportunities
19. IWA Software Components
• Linux on Intel x86_64 (RHEL 5 or SUSE SLES 11)
• IDS 11.70 + IWA code modules including IDS Stored Procedures
– Linux on Intel (64 bit)
– AIX on Power (64 bit)
– HPUX on Itanium (64 bit)
– Solaris on Sparc (64bit)
• ISAO Studio Plug-in – GUI for Mart definition
• OnIWA – On Utilities for Monitoring IWA
20. INTEL/IWA: Breakthrough technologies for performance
1. Large memory support: 64-bit computing; System X with MAX5 supports up to 6TB on a single SMP box; up to 640GB on each node of blade center. IWA: Compress large dataset and keep it in memory; totally avoid IO.
2. Large on-chip Cache: L1 cache 64KB per core, L2 cache is 256KB per core and L3 cache is about 4-12 MB. Additional Translation lookaside buffer (TLB). IWA: New algorithms to avoid pipeline flushing and cache hash tables in L2/L3 cache.
3. Frequency Partitioning: IWA: Enabler for the effective parallel access of the compressed data for scanning. Horizontal and Vertical Partition Elimination.
4. Virtualization Performance: Lower overhead: Core micro-architecture enhancements, EPT, VPID, and End-to-End HW assist. IWA: Helps Informix and IWA to seamlessly run and perform in virtualized environment.
5. Hyperthreading: 2x logical processors; increases processor throughput and overall performance of threaded software. IWA: Does not exploit this since the software is written to avoid pipeline flushing.
6. Single Instruction Multiple Data: Specialized instructions for manipulating 128-bit data simultaneously. IWA: Compresses the data into deep columnar fashion optimized to exploit SIMD. Used in parallel predicate evaluation in scans.
7. Multi-core, multi-node environment: Nehalem has 8 cores and Westmere 10 cores. This trend is expected to continue. IWA: Parallelize the scan, join, group operations. Keep copies of dimensions to avoid cross-node synchronization.
21. IWA: Multi-core and Multi-node environment
Step 1. Submit SQL (Applications, BI Tools to Informix). DB protocol: SQLI or DRDA. Network: TCP/IP, SHM
Step 2. Query matching and redirection technology (otherwise local execution in Informix)
Step 3. Offload SQL to the Coordinator: DRDA over TCP/IP
Step 4. Results from the Coordinator: DRDA over TCP/IP
Step 5. Return results/describe/error. Database protocol: SQLI or DRDA. Network: TCP/IP, SHM
[Diagram: one Coordinator and four Workers; each Worker holds compressed data in memory and a memory image on disk]
22. IWA: Multi-core and Multi-node environment
Step 1: SQL from Informix
Step 2: Coordinator sends the queries to all the workers
Step 3: Each worker runs scan, filter, join, group over its compressed data in memory
Step 4: Coordinator merges intermediate results, applies ORDER BY, FIRSTN
Step 5: Send the results back to the Informix server
[Diagram: one Coordinator above four Workers, each with compressed data in memory]
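The coordinator/worker flow on this slide can be illustrated with a minimal Python sketch. It is not IWA code: the names `worker_scan` and `coordinator`, the `(region, amount)` row layout, and the use of a thread pool are all assumptions made for illustration. It only shows the shape of steps 2 to 4: the same query goes to every worker, each worker scans, filters and groups its own partition, and the coordinator merges the partial results before applying ORDER BY and FIRST n.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def worker_scan(partition, min_amount):
    """Step 3 (hypothetical worker): scan, filter and group one partition."""
    partial = Counter()
    for region, amount in partition:
        if amount > min_amount:           # filter
            partial[region] += amount     # group + aggregate
    return partial

def coordinator(partitions, min_amount, first_n):
    # Step 2: send the same query to every worker.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: worker_scan(p, min_amount), partitions))
    # Step 4: merge intermediate results, then ORDER BY sum DESC and FIRST n.
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    ordered = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:first_n]

# Three made-up partitions, one per worker.
partitions = [
    [("east", 10), ("west", 3)],
    [("east", 7), ("north", 20)],
    [("west", 9), ("north", 1)],
]
top = coordinator(partitions, min_amount=2, first_n=2)
print(top)
```

Because each worker returns only per-group partial sums, the data exchanged with the coordinator stays small relative to the data scanned.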
23. IWA: Multi-core and Multi-node environment
[Diagram: compressed and partitioned data split into Cell 1, Cell 2 and Cell 3, with dictionaries and a query executor; each cell is processed on a core with its own cache ("core + $ (HT)")]
• Cell is also the unit of processing, each cell…
– Assigned to one core
– Has its own hash table in cache (so no shared object that needs latching!)
• Main operator: SCAN over compressed, main-memory table
– Do selections, GROUP BY, and aggregation as part of this SCAN
– Only need de-compress for aggregation
• Response time ∝ (database size) / (# cores x # nodes)
– Embarrassing Parallelism – little data exchange across nodes
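The proportionality above can be turned into a back-of-the-envelope estimate. The sketch below is only a model under stated assumptions: the function name `estimated_scan_seconds` and the per-core scan rate `gb_per_core_second` are hypothetical and are not figures from the presentation. It assumes perfect scaling, which the slide suggests is approachable because little data is exchanged across nodes.

```python
def estimated_scan_seconds(data_gb, cores_per_node, nodes, gb_per_core_second):
    """Idealized model: time = size / (cores x nodes x per-core scan rate)."""
    if min(data_gb, cores_per_node, nodes, gb_per_core_second) <= 0:
        raise ValueError("all inputs must be positive")
    return data_gb / (cores_per_node * nodes * gb_per_core_second)

# Hypothetical numbers: 400 GB of compressed data, 40 cores per node,
# and an assumed scan rate of 0.5 GB per core per second.
one_node = estimated_scan_seconds(400, 40, 1, 0.5)
two_nodes = estimated_scan_seconds(400, 40, 2, 0.5)
print(one_node, two_nodes)
```

In this model, doubling either the cores or the nodes halves the estimate; real systems fall short of that to the extent that merge work and skew between cells matter.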
24. Exploiting Larger Memory: Row Oriented Data Store
Each row stored sequentially
• Optimized for record I/O
• Fetch and decompress entire row, every time
• Result:
– Very efficient for transactional workloads
– Not always efficient for analytical workloads
If only a few columns are required, the complete row is still fetched and uncompressed
25. Exploiting Larger Memory: Data is Processed in Compressed Format
• Within a Register-Store, several columns are grouped together.
• The sum of the widths of the compressed columns doesn't exceed a register-compatible width. This utilizes the full capabilities of a 64 bit system. It doesn't matter how many columns are placed within the register-wide data element.
• It is beneficial to place commonly used columns within the same register-wide data element. But this requires dynamic knowledge about the executed workload (runtime statistics).
• Having multiple columns within the same register-wide data element prevents ANDing of different results.
Predicate evaluation is done against compressed data!
The Register-Store is an optimization of the Column-Store approach where we try to make the best use of existing hardware. Reshuffling small data elements at runtime into a register is time consuming and can be avoided. The Register-Store also delivers good vectorization capabilities.
26. Exploiting Large memory: Compression: Frequency Partitioning
[Diagram: Trade Info table (volume, product, origin country) with columns Vol, Prod, Origin. A histogram of the number of occurrences on Origin splits the column into partitions from common values (China, USA) through GER, FRA to rare values (Rest). A histogram on Product splits it into the top 64 traded goods (6 bit code) and Rest. Combining the column partitions partitions the table into Cells 1 to 6.]
• Field lengths vary between cells
• Higher frequencies → shorter codes (approximate Huffman)
• Field lengths fixed within cells
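A minimal Python sketch of the idea for a single column follows. It is an illustration only, not IWA's algorithm: the function names, the two-way split into "common" and "rest" partitions, and the sample data are assumptions. Each partition gets its own dictionary and a fixed code width sized to that partition, so frequent values end up with shorter codes.

```python
from collections import Counter

def bits_needed(num_values):
    # Fixed-width code length able to distinguish num_values entries.
    return max(1, (num_values - 1).bit_length())

def frequency_partition(values, top_k):
    """Split a column into a 'common' and a 'rest' partition by frequency."""
    counts = Counter(values)
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    partitions = []
    for members in (ranked[:top_k], ranked[top_k:]):
        if members:
            codes = {value: code for code, value in enumerate(members)}
            partitions.append({"codes": codes, "bits": bits_needed(len(members))})
    return partitions, counts

# Made-up Origin column: two very common values and four rarer ones.
origins = ["China"] * 6 + ["USA"] * 5 + ["GER"] * 2 + ["FRA", "ITA", "JPN"]
partitions, counts = frequency_partition(origins, top_k=2)

# Total encoded size with per-partition widths vs. one width for all values.
total_bits = sum(counts[v] * p["bits"] for p in partitions for v in p["codes"])
fixed_bits = len(origins) * bits_needed(len(counts))
print(total_bits, fixed_bits)
```

With 64 values in the common partition, `bits_needed` gives 6, which matches the 6 bit code shown for the top 64 traded goods.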
27. IWA: SIMD: Register Stores Facilitate SIMD Parallelism
• Access only the banks referenced in the query (like a column store):
–SELECT SUM (T.G)
–FROM T
–WHERE T.A > 5
–GROUP BY T.D
• Pack multiple rows from the same bank into the 128-bit register
• Enables yet another layer of parallelism: SIMD (Single-Instruction, Multiple-Data)!
[Diagram: a cell block of rows 1 to 4 with columns stored in the order A, D, G, B, E, F, C, H across Bank β1 (32 bits), Bank β2 (32 bits) and Bank β3 (16 bits). Four 32-bit operands from one bank fill a 128-bit register, and a single vector operation produces Result1 to Result4.]
28. IWA:SIMD: Simultaneous Evaluation of Equality Predicates
• CPU operates on 128-bit units
• Lots of fields fit in 128 bits
• These fields are at fixed offsets
• Apply predicates to all columns simultaneously!
Example: State=='CA' && Quarter=='Q4'
Translate value query to code query: State==01001 && Quarter==1110
Row:       … State … Quarter …
& Mask:    11111 0 1111 0
== Selection: 01001 0 1110 0 → result
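The mask-and-compare step can be sketched with plain integers in Python. This is an illustration under assumptions, not IWA code: the 11-bit layout (5-bit State, one bit of another field, 4-bit Quarter, one more bit) is read off the mask on the slide, and the helper names `pack_row` and `matches` are hypothetical. A real implementation would apply the same operation to 128-bit SIMD registers.

```python
# Layout, high to low bits: state(5) other(1) quarter(4) other(1)
def pack_row(state_code, quarter_code):
    """Place the two predicate columns at their fixed offsets."""
    return (state_code << 6) | (quarter_code << 1)

MASK = pack_row(0b11111, 0b1111)          # 11111 0 1111 0

def matches(row_word, state_code, quarter_code):
    # One AND plus one compare checks both equality predicates at once.
    return (row_word & MASK) == pack_row(state_code, quarter_code)

CA, Q4 = 0b01001, 0b1110                  # dictionary codes from the slide

# A row whose unrelated bits are set still matches after masking.
row_hit = pack_row(CA, Q4) | 0b00000100001
# Same state but a different quarter code does not match.
row_miss = pack_row(CA, 0b1111)
print(matches(row_hit, CA, Q4), matches(row_miss, CA, Q4))
```

The predicate values are translated to dictionary codes once per query, so the per-row work never touches the uncompressed strings.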
29. Exploiting Large on-chip Cache
•Encoding makes grouping simple!
–Coded values assigned densely (by construction)
–Hence, in principle, grouping is simple: aggTable[group] += aggValue
•Challenges:
–Fitting hash table in L2 cache
–Avoiding all branches in hash table lookup
•IWA adaptively uses one of 2 techniques, depending on # of distinct groups
1.Use dictionary code as a perfect hash (i.e. collision-free), OR
•aggTable[groupCode] += aggValue
•No branches, no hash function computation
•Works great if groupCode is dense
– i.e., single column, or multiple column with little correlation
2.Use usual linear probing
•Involves branches, random access, …
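The two grouping techniques can be contrasted in a short Python sketch. It is illustrative only: the function names and sample data are assumptions, and Python lists stand in for cache-resident arrays. The first function uses the dense dictionary code directly as the array index (a collision-free perfect hash); the second is an ordinary open-addressing table with linear probing.

```python
def aggregate_dense(group_codes, agg_values, num_groups):
    """Technique 1: the dictionary code is the slot index, so no probing."""
    agg_table = [0] * num_groups
    for code, value in zip(group_codes, agg_values):
        agg_table[code] += value          # aggTable[groupCode] += aggValue
    return agg_table

def aggregate_linear_probing(group_keys, agg_values, capacity):
    """Technique 2: hash the key, then probe forward until a slot fits."""
    keys = [None] * capacity
    sums = [0] * capacity
    for key, value in zip(group_keys, agg_values):
        slot = hash(key) % capacity
        probes = 0
        while keys[slot] is not None and keys[slot] != key:
            slot = (slot + 1) % capacity  # data-dependent branch per probe
            probes += 1
            if probes == capacity:
                raise RuntimeError("hash table full")
        keys[slot] = key
        sums[slot] += value
    return {k: s for k, s in zip(keys, sums) if k is not None}

dense = aggregate_dense([0, 2, 1, 2, 0], [10, 5, 7, 1, 4], num_groups=3)
probed = aggregate_linear_probing([(1, 5), (3, 9), (1, 5)], [2, 3, 4], capacity=8)
print(dense, probed)
```

The dense variant only works when the code space is small enough for the array to stay in cache, which is why the choice depends on the number of distinct groups.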
31. Case Study #2: Datamart at a Government Agency
• A MicroStrategy report was run, which generates 667 SQL statements,
of which 537 were SELECT statements
• Datamart for this report has 250 Tables and 30 GB Data size
• Original report on XPS and Sun Sparc M9000 took 90 mins
• With IDS 11.7 on Linux Intel box, it took 40 mins
• With IWA, it took 67 seconds.
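The speedup implied by these timings is simple arithmetic; the short Python sketch below works it out. Only the three durations come from the slide; the dictionary labels are paraphrases.

```python
# Report runtimes from the case study, converted to seconds.
baseline_seconds = {
    "XPS on Sun SPARC M9000": 90 * 60,
    "IDS 11.7 on Linux/Intel": 40 * 60,
}
iwa_seconds = 67

# Speedup factor of IWA relative to each baseline.
speedups = {name: secs / iwa_seconds for name, secs in baseline_seconds.items()}
for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster with IWA")
```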
32. Case Study #3: Skechers, USA. Shoe Retailer
• Top 7 time-consuming queries in Retail BI and Warehouse:
(Against 1 Billion rows Fact Tables)
Query IDS 11.5 IDS 11.7 IWA
1 22 mins 4 secs
2 1 min 3 secs 2 secs
3 3 mins 40 secs 2 secs
4 30 mins & up 4 secs
5 2 mins 2 secs
6 30 mins 2 secs
7 45 mins & up 2 secs
Query acceleration 30x to 1400x – average acceleration 450x
42. Thank You!