3. “Pie DB” Project Requirements
• Yahoo Product Intelligence Engineering – Pie DB
– Several billion page views per day
– A unified data warehouse that could support clickstream, page view, and link view data
• Main requirements:
– Support > 1PB of data
– Linear scalability when adding storage or CPU
– Store data in a compressed format
– Standard SQL access
– Integrate with 3rd-party BI tools
– Support ~60 concurrent queries
– Resource management
– Reasonable and affordable cost
5. Goals
• High data compression rate
– Hadoop pre-processing improves the compression rate to 4-5x!
• ~4GB/s of reads (sustained)
– ~20GB/s effective read rate, based on the 5x compression rate
• Load 10TB in 3 hours
– 3.5TB/hr load rate; that is ~1GB/s writes
• No indexes for queries
– Avoid the additional space needed for indexes
– Avoid index build/rebuild time after data loading
6. Goals
• No SQL Hints!
• Standard Hardware / Software stack
– Avoid proprietary solutions as much as possible
– Easily repurpose if necessary
• Delete / expire / roll off old data
– Truncate / drop old partitions (see the sketch at the end of this slide)
– No vacuum process
• Leverage hardware investment before deciding on ETL tools
– Use the database as the transformation engine in the initial phase (ELT instead of ETL)
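A minimal sketch of the partition roll-off above; the table and partition names are hypothetical:

  -- Expire old data by dropping a whole range partition: no row-level
  -- DELETE, hence no vacuum-style cleanup afterwards.
  ALTER TABLE pageviews DROP PARTITION p_2008_01;

  -- Or empty a partition while keeping its structure for reuse.
  ALTER TABLE pageviews TRUNCATE PARTITION p_2008_02;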
7. Tests
• Load 3 months of clicks, page views, and link views historical data
– Almost 100TB of raw data
– 21TB in database (due to compression)
• Load and transform data (see the sketch at the end of this slide)
– Load raw data
– Create dimension tables and merge with existing dimensions
• 20 base queries to test system
– Typical queries we will see in production
– Run queries serially and concurrently
– The concurrent test has to finish faster than the serial test
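A minimal sketch of the load-and-transform (ELT) flow above; all table and column names are hypothetical:

  -- Stage raw data through an external table (the database itself is
  -- the transformation engine), then merge into an existing dimension.
  INSERT /*+ APPEND */ INTO pageviews_stage
    SELECT * FROM pageviews_ext;

  MERGE INTO dim_page d
  USING (SELECT DISTINCT page_id, page_url FROM pageviews_stage) s
  ON (d.page_id = s.page_id)
  WHEN MATCHED THEN UPDATE SET d.page_url = s.page_url
  WHEN NOT MATCHED THEN INSERT (page_id, page_url)
                        VALUES (s.page_id, s.page_url);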
8. Tests
• Scalability
– Performance increases close to linearly as we add RAC nodes
• Deep analytical queries
• Ad hoc queries
– Allow users to submit random queries to the system and see if it breaks!
-----------------------------------------------------------------------------------
| Id | Operation | Rows | Bytes | TempSpc | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | 16M| 7980M| | 610K (16)| 02:02:03|
|* 1 | VIEW | 16M| 7980M| | 610K (16)| 02:02:03|
. . .  (intermediate plan steps 2-9 elided)  . . .
| 10 | PX PARTITION HASH ALL | 16M| 3959M| | 610K (16)| 02:02:03|
|* 11 | HASH JOIN RIGHT OUTER| 16M| 3959M| 932M| 610K (16)| 02:02:03|
|* 12 | TABLE ACCESS FULL | 11G| 804G| | 25036 (7)| 00:05:01|
|* 13 | HASH JOIN | 16M| 2794M| | 543K (17)| 01:48:43|
|* 14 | TABLE ACCESS FULL | 16M| 1894M| | 69951 (1)| 00:14:00|
|* 15 | TABLE ACCESS FULL | 597G| 31T| | 471K (19)| 01:34:13|
-----------------------------------------------------------------------------------
9. System Requirements
• Network/Cluster Interconnects
– GigE does not meet the bandwidth requirement
– 10GigE is still too expensive
– InfiniBand is chosen (up to 20Gb/s)
• Storage
– Block-based storage / SAN solution
– Price/performance justified for warehouse workloads
• Oracle 10.2.0.3 x86_64 (RAC)
– Native IB support
– Many improvements and fixes on “warehousing” features
– Latest 10.2 patch set at that time
• Oracle Automatic Storage Management
– Provides LVM style striping of data
– Supports clustered access (required for RAC)
10. Overall System Topology
[Diagram: applications reach the cluster over the public LAN (1000TX); a private LAN (2 GigE NICs per server) connects the nodes to NAS storage holding the raw data. 16 IBM x3850 M2 servers (Node 1 through Node 16) share a redundant 20Gb full-duplex InfiniBand network and connect through two 4Gb FCP switch fabrics (4x4Gb FCP to SP-A and SP-B) to a SAN of 6 EMC CX3-40 arrays.]
11. Database Server Configuration
• IBM x3850 M2
– 64GB RAM (DDR2 SDRAM)
– 4 x Intel Xeon E7330 @ 2.40GHz (quad core)
• 4 x 4 = 16 cores per node
– One of the fastest servers in its class; power efficient
• 3 x QLogic QLE 2462 HBA (dual port)
– 4Gb FCP per port (for EMC SAN)
• 2 x QLogic 7104-HCA-128LPX-DDR
– 20Gb (for InfiniBand)
• RHEL4 Update 6
– Large SMP Kernel for x86_64 (2.6.9-67.ELlargesmp x86_64)
• Oracle 10.2.0.3 X86_64 Clusterware/ASM/RDBMS (with patches)
12. Database Server
Hardware Configuration (Simplified)
[Diagram: each IBM x3850 M2 connects via GigE (public / Oracle VIP) to a Cisco 4948 Ethernet switch, via Fibre Channel HBAs to a Brocade 4900 SAN switch in front of the EMC CX3-40 arrays, and via HCAs (RDS over IB or IP over IB) to a QLogic/SilverStorm 9024 IB switch.]
13. Database Server
Software Architecture (Simplified)
[Diagram: per-node software stack: Oracle Clusterware, ASM, and the RDBMS run on top of the operating system (SCSI multipath, IP, RDS/IB, IP/IB), which in turn drives the HBA, GigE NIC, and HCA hardware.]
15. Network/Cluster Interconnects
InfiniBand Architecture
[Diagram: two interconnect stacks compared. With IP over IB, the RAC database's IPC library crosses from user to kernel space through UDP, IP, and IPoIB before reaching the HCA; with RDS over IB, it goes directly to the kernel IB/RDS layer and then to the HCA.]
16. Network/Cluster Interconnects
InfiniBand Architecture
• InfiniBand Switch is required
• HCA is required
– Run INSTALL script to provide IP and netmask
• Relink Oracle
– cd $ORACLE_HOME/rdbms/lib
– make -f ins_rdbms.mk ipc_rds ioracle
• Oracle patch 6643259 – Intermittent hang for inter-instance parallel query using RDS over IB
– Patch available for 10.2.0.3 and 11.1.0.6
• Kernel panic on an idle system/IB hang at reboot
– Fixed by upgrading the HCA driver
$ cat /proc/iba/mt25218/config
SilverStorm Technologies Inc. MT25218/MT25204 Verbs Provider Driver, version 4.2.0.5.2
for SilverStorm Technologies Inc. InfiniBand(tm) Transport Driver, version 4.2.0.5.2
Built for Linux Kernel 2.6.9-67.ELlargesmp
19. Storage
EMC SAN Details
• 6 x CX3-40F arrays
– 900 x 400GB 10K drives (150 drives @ RAID5 4+1 = 40TB usable per array)
– 96GB cache (16GB per array)
– 48 x 4Gb ports (8 per array)
– Capable of ~7.5GB/s read throughput (1.25GB/s per array)
• 240TB usable storage capacity
– 200TB for Oracle data (1PB logical with 5:1 Oracle compression)
– 40TB additional storage required for Oracle TEMP space
20. Storage
EMC SAN Details
• 2 x EMC Brocade 4900 Departmental Switches
– 128 x 4Gb Ports (64 per Switch)
– Simple Dual-Fabric Design
• Ability to expand by adding drives and/or arrays
• Linear scaling with 6 arrays
• Oracle ASM to rebalance data when adding storage
• Best price/performance at the time
21. Storage
Oracle Automatic Storage Management
• Only stores metadata about where data lives – an LVM for Oracle data
• Stripe size is 1MB (_asm_stripesize=1048576)
• Stripe a datafile evenly across all storage arrays to use all spindles
• Vendor agnostic; can add / remove storage as needed (see the sketch below)
[Diagram: the ASM software layer maps 1MB stripes evenly onto SAN-based storage (iSCSI / FCP).]
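A minimal sketch of the ASM operations implied above; the diskgroup and disk paths are hypothetical:

  -- External redundancy: the arrays already provide RAID5 protection,
  -- so ASM only stripes data (1MB stripes) across the LUNs.
  CREATE DISKGROUP data EXTERNAL REDUNDANCY
    DISK '/dev/raw/raw1', '/dev/raw/raw2';

  -- Adding storage triggers a rebalance that restripes existing data
  -- evenly across all disks, old and new.
  ALTER DISKGROUP data ADD DISK '/dev/raw/raw3' REBALANCE POWER 8;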
22. Critical Success Factors (Oracle)
• gzip support for external tables (see the sketch at the end of this slide)
– Feature added by Oracle to make the POC succeed
– Patch 6522622: External tables need to read compressed files
• Compression
– Reduce required disk space
– More effective throughput (5x)
• Automatic Storage Management
– Distribute IO evenly; scale IO linearly
• Features and enhancements for data warehousing
– Partitioning and composite partitioning
– Patch 6402957: Adaptive aggregation push-down
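A minimal sketch of an external table over the raw logs; all names are hypothetical, and the ability to read the gzipped source directly comes from patch 6522622 rather than from any syntax shown here:

  CREATE TABLE clicks_ext (
    click_time  VARCHAR2(19),
    pvid        NUMBER,
    url         VARCHAR2(4000)
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY raw_data_dir   -- hypothetical directory object
    ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE
                       FIELDS TERMINATED BY ',')
    LOCATION ('clicks_20080401.gz')  -- gzip-compressed source file
  )
  PARALLEL;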
23. Critical Success Factors
• InfiniBand Interconnect
– Provide bandwidth needed
– Reduce latency/cluster wait
– Highest utilization is 7Gb/s but only for a brief period (when using RDS over IB)
– 1~2Gb/s is more typical under load
• EMC SAN solution
– IO throughput to support full table scans
– Max 1.25GB/s per array
26. PQ and RAC scaling issue
• All architectures, including parallel shared-nothing systems, eventually need a funnel point (the query coordinator)
– Lots of “select * from petabyte_table order by 1” queries will kill everyone
• During the POC, we had to ensure that Oracle could parallelize ALL operations; otherwise parallel query becomes useless
– This is a common source of PQ scaling problems, as any serial step forces too much data to traverse the interconnect
27. Scaling PQ on RAC
• A large number of sub-partitions is required to achieve a high degree of parallelism and performance (see the DDL sketch at the end of this slide)
• Reduce interconnect traffic
• Need an interconnect that can support the throughput requirements of the QC
• Avoid “broadcast” redistribution of PQ results
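A minimal DDL sketch of the sub-partitioning point above; names, dates, and counts are hypothetical:

  -- Range partitions by day support roll-off; hash sub-partitions on
  -- the join key give each PX server its own slice and enable
  -- partition-wise joins.
  CREATE TABLE pageviews (
    view_date  DATE,
    pvid       NUMBER,
    url        VARCHAR2(4000)
  )
  COMPRESS
  PARTITION BY RANGE (view_date)
  SUBPARTITION BY HASH (pvid) SUBPARTITIONS 64
  ( PARTITION p_2008_04_01 VALUES LESS THAN (DATE '2008-04-02'),
    PARTITION p_2008_04_02 VALUES LESS THAN (DATE '2008-04-03') )
  PARALLEL;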
28. Oracle Parallel Query
(More Realistic)
select … from pageviews, linkviews where pageviews.pvid = … group by date;
[Diagram: the QC receives group-by output from one set of PX servers, which consume the hash join produced by a second set of PX servers scanning the Link Views and Page Views PVID partitions (P1, P2).]
29. Need to Avoid
[Diagram: the same plan with the producer and consumer PX servers split across Node 1 and Node 2, so the hash join and table scans must ship rows over the interconnect.]
31. How PQ Survives in a RAC Environment
• Node affinity to avoid interconnect traffic
– The consumer / producer pair always lives on the same node
• Joining tables that have the same partition key and the same number of partitions results in a partition-wise join (see the sketch at the end of this slide)
– This is the key to scaling!
– Queries that join large tables that are not partitioned on the same key need a “brute force” interconnect to survive
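A minimal sketch of the partition-wise join condition above, with hypothetical tables: both are hash-partitioned on the join key with the same partition count, so each PX server can join matching partitions locally instead of redistributing rows over the interconnect.

  CREATE TABLE pv (pvid NUMBER, url VARCHAR2(4000))
    PARTITION BY HASH (pvid) PARTITIONS 64 PARALLEL;
  CREATE TABLE lv (lvid NUMBER, pvid NUMBER)
    PARTITION BY HASH (pvid) PARTITIONS 64 PARALLEL;

  -- Full partition-wise join: partition N of pv joins partition N of lv.
  SELECT COUNT(*)
  FROM   pv JOIN lv ON pv.pvid = lv.pvid;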
32. Lessons Learned and Challenges
• Parallel Shared Nothing does not always scale linearly
• Although most data warehouse technologies did very well within 25TB, things started to change quickly at 100TB
• At this data volume, do not expect any commercial solution to work without some growing pains
– Expect to see bugs!
• Avoiding proprietary solutions and staying open means multiple vendors may be involved
– Working with multiple vendors/teams can be challenging
– Select vendors with quality support and knowledge transfer
– Dedicated Oracle support and development teams helped make the POC successful
33. Backup and Restore Challenges
• Web logs/events (the fact tables) can be reloaded; no need to back up
• Aggregation/summary is backed up
– Range-partitioned by date
– Set read-only for historical partitions
– Only back up new partitions; skip read-only partitions
• Backup and Restore (see the sketch at the end of this slide)
– Oracle RMAN: 6 Channels; level 0
– NetVault with 6 Tapes
– 300+ MB/s backup and 200+ MB/s restore
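A minimal sketch of the backup scheme above; the tablespace name is hypothetical, and the NetVault media-manager configuration is not shown:

  -- SQL: freeze a historical partition's tablespace so backups can skip it.
  ALTER TABLESPACE pv_2008q1 READ ONLY;

  -- RMAN: level 0 backup over 6 channels, skipping read-only tablespaces.
  RMAN> CONFIGURE DEVICE TYPE sbt PARALLELISM 6;
  RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE SKIP READONLY;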
34. Challenges for Oracle
• Degree of parallelism (DOP) is fixed at query startup
• AWR report has no aggregation for parallel executions yet
• ORA-12805: parallel query server died unexpectedly
– Once that happens, all work is abandoned; resubmitting is the only solution so far
– Hope to see “auto-recovery” feature in the future!
• No DOP information is available in the execution plan
– Improved in 11g (AUTOTRACE can see the DOP!)
• Lacking detailed information on parallel server activity and progress
– Improved in 11g (GV$SQL_MONITOR; see the sketch below)
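A minimal 11g sketch of the GV$SQL_MONITOR view mentioned above (not available in the 10.2 release used for this POC); the column list is illustrative:

  -- One row per parallel server (plus the QC) for each monitored
  -- statement, making PX activity and progress visible.
  SELECT inst_id, sql_id, status, px_server#, elapsed_time
  FROM   gv$sql_monitor
  ORDER  BY sql_id, px_server#;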
35. Major Oracle Enhancements / Patches for Data Warehouse
• 6522622 – External tables need to read compressed files
• 6643259 – Intermittent hang for inter-instance parallel query using RDS over IB
• 6748058 – Transformed query does not parallelize
• 6402957 – Predicate pushdown not working with window functions in some cases
• 6808773 – Sub-optimal hash distribution when joining on highly skewed columns
• 6471770 – Parallel servers die unexpectedly
36. Future Plans
• Near future:
– ETL Tool
– Backup/Restore throughput enhancement
– Resource plans for different users and workloads
• Further collaboration/integration with Hadoop
• Oracle 11g evaluation and upgrade
• EMC CX4-960
– Up to 2x IO and 2x capacity (vs CX3)
– Upgrade without migrating data
• Intel 7400 series 6-core CPU “Dunnington”
– Up to 50% more performance and 10% less power consumption vs the 7300 series
• 10 GigE evaluation