4. History
- ESE/JET Blue: IOPS-bound, random-IO application
- Why? Small, expensive drives: a 1.6 GB SCSI disk cost ~$400 in 1996; 2 GB and 4 GB drives delivered ~100 IOPS
- Single Instance Storage
- Clustering with shared storage; backup an issue; single point of failure
- 32-bit: not enough RAM; RAM limited the number of users per server
5. History - Exchange 2007
Big improvements in Exchange Server 2007:
- Reduced storage input/output (I/O) by ~70%
- Uses large amounts of memory (64-bit)
- Increased page size (4 kilobytes (KB) -> 8 KB)
- Lower storage costs
- Support for large mailboxes (> 1 gigabyte (GB))
- Fast search (CI)
- Continuous replication (log shipping)
- High availability (HA) + fast recovery
- Eliminated single points of failure
7. Email Usage
- Radicati sees 165 mails per user per day, growing to 230 over the next couple of years
- Users are used to large free storage (25 GB, 5 GB): three years of mail
- Triage once per year to archive, not once per day!
- Mail available through all clients; Cached Mode/performance issues
- High item counts: 5,000, 20,000, 100,000
8. Disk Technology
- Currently 2 TB, moving to 8 TB
- Random IO not getting quicker: 15K RPM, 10K RPM, 7.2K RPM
- Density is getting better, so more data can be read in the same time
- Flash (SSD): didn't take that bet; optimised for spinning media for E14
- Expensive, so used as cache in a SAN
10. Exchange Server 2010 HA Storage Design Flexibility
More options to reduce storage cost:
- SAN: HA = shared storage clustering; 1.0 IOPS/mailbox; 3.5" 15K 146 GB FC disks; RAID10 for DB & logs; dedicated spindles; multi-path (HBAs, FC switches, SAN array controllers); backup = streaming off active; fast recovery = hardware VSS (snapshots/clones)
- DAS (SAS): HA = CCR; 0.33 IOPS/mailbox; 2.5" 146 GB 10K SAS disks; RAID5 for DB, RAID10 for logs; SAS array controller (w/ BBU); backup = VSS snapshot; fast recovery = CCR
- DAS (SATA): HA = DAG (2 DB copies); 0.11 IOPS/mailbox; 3.5" 2 TB 7.2K SATA/SAS disks; RAID10 for DB & logs; SAS array controller (w/ BBU); backup = optional/VSS; fast recovery = database failover
- JBOD (SATA): HA = DAG (3+ DB copies); 0.11 IOPS/mailbox; 3.5" 2 TB 7.2K SATA/SAS disks; 1 DB = 1 disk; backup = optional/VSS; fast recovery = database failover
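The per-mailbox IOPS profiles above translate directly into spindle counts. A minimal sketch of that arithmetic: the three IOPS/mailbox figures are from the slide, while the 1,000-mailbox server and the 150-IOPS disk budget are illustrative assumptions, not from the deck.

```python
import math

# IOPS/mailbox profiles from the slide: 1.0 (SAN-era shared storage clustering),
# 0.33 (Exchange 2007 CCR), 0.11 (Exchange 2010 DAG).
def disks_needed(mailboxes, iops_per_mailbox, disk_iops):
    """Spindles required to satisfy the aggregate random-IO load (ceiling)."""
    return math.ceil(mailboxes * iops_per_mailbox / disk_iops)

# Assumed example: 1,000 mailboxes on disks that each deliver 150 random IOPS.
for profile in (1.0, 0.33, 0.11):
    print(profile, disks_needed(1000, profile, 150))
```

The point of the slide in one function: each drop in the per-mailbox profile cuts the spindle count, which is what lets cheaper, slower disks replace dedicated FC spindles.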
11. JBOD/RAID-less Storage: Now an Option
- JBOD: 1 disk = 1 database (with logs)
- Requires Exchange Server 2010 high availability (3+ DB copies)
- Annual disk failure rate (AFR) = 5%
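A quick sketch of what a 5% AFR means operationally, and why 3+ database copies are required before going RAID-less. The 48-disk pool size is an illustrative assumption, not from the deck.

```python
# With a 5% annual failure rate (AFR), expected disk failures per year
# scale linearly with the disk population.
def expected_failures_per_year(num_disks, afr=0.05):
    """Expected number of disk failures per year for a pool of disks."""
    return num_disks * afr

# Assumed example: a 48-disk JBOD pool.
print(expected_failures_per_year(48))
```

In other words, a modest JBOD deployment should plan for a couple of disk (and therefore database-copy) losses every year; the extra DAG copies absorb them instead of a RAID controller.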
12. Exchange Server 2010 HA
Simplified mailbox high availability and disaster recovery with a new unified platform:
[Diagram: databases DB1-DB5 replicated across Mailbox Servers in New York and San Jose]
- Replicate databases to a remote datacenter
- Recover quickly from disk and database failures
- Evolution of continuous replication technology (database mobility)
- Easier than traditional clustering to deploy and manage
- Allows each database to have 16 replicated copies
- Provides full redundancy of Exchange roles on as few as two servers
14. Exchange 2010 Features
- Move to sequential IO: changed table structure, lazy view updates
- Page size 32 KB
- Database compression (LVC)
- Read/write coalescing; database contiguity
- Cache compression
- Storage groups gone; single point of failure gone
- Optimised for huge mailboxes
15. Random vs. Sequential Disk IO
- Random IO: the disk head has to move to process each subsequent IO; head movement = high IO latency; seek latency limits IOs per second (IOPS)
- Sequential IO: the disk head does not move to process subsequent IOs; stationary head = low IO latency; disk RPM limits IOPS
- 7.2K SATA disk (20 ms latency): random = ~50 IOPS, sequential = 300+ IOPS
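The slide's 50-IOPS figure falls straight out of the 20 ms latency: a disk that spends 20 ms per random IO can complete at most 1/0.020 of them per second. A one-line check:

```python
# A random IO on a 7.2K SATA disk takes ~20 ms (seek + rotational delay),
# so the disk tops out at the reciprocal of that latency.
def iops_from_latency(latency_s):
    """One IO completes every `latency_s` seconds, so IOPS is the reciprocal."""
    return 1.0 / latency_s

print(iops_from_latency(0.020))  # ~50 random IOPS, matching the slide
```

Sequential IO sidesteps the seek entirely, which is why the same spindle delivers 300+ IOPS once the head stops moving.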
16. IO Reduction: Store Table Architecture
- Exchange Server 2007: tables per database and per folder; secondary indexes used for views
- Exchange Server 2010: tables per database, per mailbox, and per view
- New store schema = no more single instance storage within a database
17. Store Schema Changes: Lazy View Updates
Timeline: M1 arrives, M2 arrives, M1 flagged, M3 arrives, M2 deleted; then the user opens OWA/Outlook Online and switches to this view.
- Exchange 2007, "nickel and dime" approach: many random IOs (1 per update)
- Exchange 2010, "pay to play" approach: fewer, sequential IOs (1 per view)
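The lazy-view idea can be sketched as a pending-changes queue that is only flushed when the view is actually read. This is a toy model under the assumption that each flush costs one (sequential) IO, not a description of the real store internals.

```python
class LazyView:
    """Toy sketch: defer view maintenance until the view is opened."""

    def __init__(self):
        self.rows = []        # materialized view contents
        self.pending = []     # changes not yet applied to the view
        self.io_count = 0     # modeled disk IOs spent on view maintenance

    def record_change(self, change):
        """A message arrives/is flagged/is deleted: queue it, no IO yet."""
        self.pending.append(change)

    def open_view(self):
        """User switches to this view: apply all queued changes in one batch."""
        if self.pending:
            self.io_count += 1          # one batched, sequential update
            self.rows.extend(self.pending)
            self.pending.clear()
        return list(self.rows)
```

Replaying the slide's timeline (five changes, then one view open) costs one modeled IO here, versus five per-change IOs in the eager 2007-style approach.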
18. IO Reduction: Database Page Size Increased to 32 KB
- Exchange Server 2007: reading a 20 KB message from disk into the DB cache takes 3 read IOs (8 KB pages)
- Exchange Server 2010: the same read takes 1 read IO (32 KB pages)
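The 3-vs-1 comparison is just ceiling division of the message size by the page size (assuming, as the slide's diagram does, one IO per page touched):

```python
import math

def read_ios(message_kb, page_kb):
    """Pages (and hence worst-case read IOs) needed to fetch a message."""
    return math.ceil(message_kb / page_kb)

print(read_ios(20, 8))   # Exchange 2007: 8 KB pages
print(read_ios(20, 32))  # Exchange 2010: 32 KB pages
```

With 8 KB pages a 20 KB message spans three pages; at 32 KB the header and body fit on a single page, so one IO suffices.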
19. Mitigate DB Space Growth: Database Compression
- Problem: the store schema change, space hints, B+Tree defragmentation and the 32 KB page size combine to increase DB file size by ~20%
- Solution: the growth is 100% mitigated by database compression
- Targeted compression for message headers and text/HTML bodies (7-bit/Express)
[Charts: DB space analysis and DB file size comparison. 1 database, 750 x 250 MB mailboxes; RTF = RTF Compressed; Mix = 77% HTML, 15% RTF, 8% text; average message size ~50 KB]
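A small demonstration of why targeting text/HTML bodies pays off: markup-heavy bodies are highly repetitive and compress very well. This uses zlib purely as a stand-in compressor; it is not the codec the slide names, and the sample body is invented for illustration.

```python
import zlib

# Hypothetical HTML message body: repetitive markup, as real HTML mail tends to be.
html_body = (
    "<html><body>"
    + "<p>Hello, this is a meeting update.</p>" * 50
    + "</body></html>"
).encode("ascii")

compressed = zlib.compress(html_body)
ratio = len(compressed) / len(html_body)
print(f"{len(html_body)} -> {len(compressed)} bytes ({ratio:.0%} of original)")
```

The same reasoning explains why compression alone can claw back the ~20% file-size growth for HTML-dominated databases.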
20. IO Reduction: Read IO Gap Coalescing
- Exchange Server 2007 DB read behavior: 3 separate read IOs to bring nearby pages from disk into the DB cache
- Exchange Server 2010 DB read behavior: 1 read IO spanning the gaps
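Gap coalescing can be sketched as merging reads for nearby pages into single larger sequential IOs, accepting that the unneeded gap pages get read too (and later evicted from cache). The `max_gap` threshold is an invented parameter for illustration.

```python
def coalesce(page_numbers, max_gap=2):
    """Merge reads for nearby pages into larger sequential IOs.

    Toy model: if the next wanted page is within `max_gap` pages of the
    current run, extend the run (reading the gap pages as well) instead
    of issuing a new IO. Returns a list of (first_page, last_page) IOs.
    """
    runs = []
    for p in sorted(page_numbers):
        if runs and p - runs[-1][1] <= max_gap:
            runs[-1][1] = p          # extend the current sequential read
        else:
            runs.append([p, p])      # start a new IO
    return [(a, b) for a, b in runs]

print(coalesce([10, 12, 14]))   # nearby pages: one IO covering 10-14
print(coalesce([10, 50, 90]))   # gaps too large: three separate IOs
```

One slightly longer sequential read is far cheaper than three seeks, per the random-vs-sequential numbers earlier in the deck.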
21. IO Reduction: Maintain Contiguity Over Time
New database maintenance architecture: database B+Tree defragmentation (aka OLD2), a background/throttled process that maintains the space and contiguity of database tables.
22. IO Reduction: Database Contiguity Results
[Charts of DB page numbers from production/dogfood database analysis; blue = contiguous (good), red = fragmented (bad)]
- Exchange Server 2007 message header table (aka MFT): fragmented, with random deletes at the tail
- Exchange Server 2010 message header table (aka MsgHeader): contiguous
25. Putting It All Together: Mailboxes/Disk
Exchange Server 2010 storage improvements cannot be quantified in IOPS reductions alone: 4x+ more mailboxes per disk (from 125 to 500+).
Test configuration: 250 MB mailbox size, 3 MB DB cache/user, 12 x 7.2K SATA disks (DB/logs on the same spindles), LoadGen Outlook 2007 Online Very Heavy profile, measured at < 20 ms average RPC latency.
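The IOPS profiles alone predict roughly a 3x gain (0.33 -> 0.11 IOPS/mailbox), so the measured 4x+ jump supports the slide's point that IOPS reduction is not the whole story. A back-of-envelope sketch, reusing the 50-IOPS SATA figure and the per-mailbox profiles quoted earlier in the deck (the function itself is illustrative):

```python
# How many mailboxes one disk can serve within its random-IO budget.
def mailboxes_per_disk(disk_iops, iops_per_mailbox):
    """Mailboxes a single spindle can support at a given IOPS profile."""
    return int(disk_iops / iops_per_mailbox)

print(mailboxes_per_disk(50, 0.33))  # Exchange 2007 profile on a 7.2K SATA disk
print(mailboxes_per_disk(50, 0.11))  # Exchange 2010 profile on the same disk
```

Contiguity, coalescing, and the larger page size make each remaining IO more productive, which is where the rest of the measured gain comes from.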
26. Summary
The Exchange Server 2010 store has:
- Reduced DB IOPS by 70%+... again!
- Been optimized for large mailboxes (10 GB+) and 100K item counts
- Been optimized for large/slow/low-cost disks (SATA/tier 2)
- Made JBOD/RAID-less storage a viable option
- Enabled unmatched storage flexibility to push storage capex costs down
- Provided many more backup/DR options
Editor's notes
JBOD, i.e. 1 disk per database and log set. RAID-less: disks will fail, hence the requirement for 3+ copies.
2007 is roughly the same as Exchange 4.0: one database and then a couple of really large tables. The message table and attachments hold all messages per database, and a message folder table per mailbox drives all the views. This gives the benefit of single instance storage: one copy in the message table, with pointers from the message folder table. Random IO! The 2010 schema is changed massively: data is now specific to the mailbox rather than pooled per database, so it can be kept sequential for quick retrieval from the same area of disk, and message view tables replace the secondary indexes.
Really important in the reduction of IO: update the view only when the user actually views it!
Page size is the smallest unit of IO, so a bigger page means fewer small IOs for a single message read. In 2007, the random layout of data on disk means 3 IOs for a 20 KB message. In 2010, pulling the same message gets the header and body on one page, which makes a huge difference to IO. The page size will be fine for handling large messages: 12-15 KB is the mean message size currently.
As you add larger page sizes and lay things out for sequential IO, the DB grows by 20% (much as the OST grew in SP2 for Office 2007). Message headers and text/HTML bodies are now compressed; compression is limited to these for speed. This can bring the database back to the same size as 2007, or even smaller if the bulk of messages are HTML. There are now many more tables and fewer, bigger pages. There is also cache compression: when you pull a 32 KB page (the smallest element of Exchange data) but that page only holds 16 KB of data, the free space is compressed so that only 16 KB of cache is used.
Coalescing can be done when pages are not next to each other. 2007 needs 3 IOs to get the pages off disk: random IO. 2010 brings all five pages up in one stream of IO, then evicts the unneeded middle pages.
Cleanup was done using online maintenance and defrag. This has changed: cleanup now happens when tombstone or dumpster cleanup happens, and page zeroing happens automatically because it occurs when the write is being done anyway, so there is no additional IO. 2003 and 2007 are great at compaction; 2007 SP1 changed this slightly to reduce IO during the maintenance window. In 2010 this has changed a lot: it is done at run time as space is seen. Contiguity has never been a concern until now; compaction never worried about contiguity, only about keeping the database small. 2010 makes trade-offs on size, as mentioned, to ensure contiguity, and analysis happens continuously. DB checksumming is also part of the new maintenance architecture.
This is a utility MSFT built to track the contiguity of the new DB. It shows the message folder table of an inbox on 2007: massively random. 2010 is contiguous, as pages are laid out sequentially, so reading a huge folder of 10,000 items is quick and easy!