Welcome to the future! The future of Exchange high availability, that is. In this session we reveal the changes and improvements to the built-in high availability platform in Exchange Server 2010. Exchange 2010 includes a unified solution for high availability and disaster recovery that is quick to deploy and easy to manage. Learn about all of the new features in Exchange 2010 that make it the most resilient, highly available version of Exchange ever.
UNC307 - Microsoft Exchange Server 2010 High Availability
2. High Availability. Scott Schnoll, Principal Technical Writer, Microsoft Corporation. Session Code: UNC307
3. Agenda Exchange 2010 High Availability Vision/Goals Exchange 2010 High Availability Features Exchange 2010 High Availability Deep Dive Deploying Exchange 2010 High Availability Features Transitioning to Exchange 2010 High Availability High Availability Design Examples
5. Exchange 2010 High Availability Vision and Goals Vision: Deliver a fast, easy-to-deploy and operate, economical solution that can provide messaging service continuity for all customers Goals Deliver a native solution for high availability/site resilience Enable less expensive and less complex storage Simplify administration and reduce support costs Increase end-to-end availability Support Exchange Server 2010 Online Support large mailboxes at low cost
6. Exchange Server 2003: Complex site resilience and recovery. Clustering knowledge required; the Clustered Mailbox Server had to be created manually; third-party data replication was needed for site resilience; failover occurred at the Mailbox server level. (Diagram: a San Jose active/passive cluster (NodeA/NodeB) hosting DB1–DB6 behind a Front End Server serving Outlook, OWA, ActiveSync, and Outlook Anywhere clients, with a standby cluster in Dallas.)
7. Exchange Server 2007: Complex activation for a remote server/datacenter. Clustering knowledge still required; a Clustered Mailbox Server could not co-exist with other roles; there was no GUI to manage SCR; failover occurred at the Mailbox server level. (Diagram: a San Jose CCR cluster (NodeA/NodeB) hosting DB1–DB6 behind a Client Access Server serving Outlook, OWA, ActiveSync, and Outlook Anywhere clients, with SCR copies on a standby cluster in Dallas.)
8. Exchange Server 2010: All clients connect via Client Access servers; failover is managed by/with Exchange at the database level; easy to extend across sites. (Diagram: Mailbox Servers 1–5 in San Jose and Mailbox Server 6 in Dallas hosting interleaved copies of DB1–DB5, with clients connecting through a Client Access Server.)
10. Exchange 2010 High Availability Terminology High Availability – Solution must provide data availability, service availability, and automatic recovery from failures Disaster Recovery – Process used to manually recover from a failure Site Resilience – Disaster recovery solution used for recovery from site failure *over – Short for switchover/failover; a switchover is a manual activation of one or more databases; a failover is an automatic activation of one or more databases after a failure
11. Exchange 2010 High Availability Feature Names Mailbox Resiliency – Name of the unified high availability and site resilience solution Database Mobility – The ability of a single mailbox database to be replicated to and mounted on other mailbox servers Incremental Deployment – The ability to deploy high availability/site resilience after Exchange is installed Exchange Third Party Replication API – An Exchange-provided API that enables use of third-party replication for a DAG in lieu of continuous replication
12. Exchange 2010 High Availability Feature Names Database Availability Group – A group of up to 16 Mailbox servers that host a set of replicated databases Mailbox Database Copy – A mailbox database (.edb file and logs) that is either active or passive RPC Client Access service – A Client Access server feature that provides a MAPI endpoint for Outlook clients Shadow Redundancy – A transport feature that provides redundancy for messages for the entire time they are in transit
13. Exchange 2010 *overs. Within a datacenter: database or server *overs; at the datacenter level: switchover. Between datacenters: database or server *overs. Assumptions: each datacenter is a separate Active Directory site; each datacenter has live, active messaging services; the standby datacenter must be active to support a single database *over.
14. Exchange 2007 Concepts Brought Forward Extensible Storage Engine (ESE) Databases and log files Continuous Replication Log shipping and replay Database seeding Store service/Replication service Database health and status monitoring Divergence Automatic database mount behavior Concepts of quorum and witness Concepts of *overs
15. Exchange 2010 Cut Concepts Storage Groups Databases identified by the server on which they live Server names as part of database names Clustered Mailbox Servers Pre-installing a Windows Failover Cluster Running Setup in Clustered Mode Moving a CMS network identity between servers Shared Storage Two HA Copy Limits Requirement of Two Networks Concepts of public, private and mixed networks
18. Exchange 2010 HA Fundamentals: Database Availability Group, Server, Database, Database Copy, Active Manager, RPC Client Access. (Diagram: a DAG of servers, each running Active Manager and hosting active and passive database copies, with clients connecting through RPC Client Access on the CAS.)
19. Database Availability Group (DAG) Base component of high availability and site resilience A group of up to 16 servers that host a set of replicated databases “Wraps” a Windows Failover Cluster Manages membership (DAG member = node) Provides heartbeat of DAG member servers Active Manager stores data in cluster database Defines a boundary for: Mailbox database replication Database and server *overs Active Manager
20. DAG Requirements Windows Server 2008 SP2 Enterprise Edition or Windows Server 2008 R2 Enterprise Edition Exchange Server 2010 Standard Edition or Exchange Server 2010 Enterprise Edition Standard supports up to 5 databases per server Enterprise supports up to 100 databases per server At least one network card per DAG member
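The per-edition database limits above can be captured in a trivial check. This is a sketch for illustration only; the dictionary and function names are my own, not Exchange's:

```python
# Database-per-server limits by Exchange 2010 edition, as listed above.
DB_LIMITS = {"Standard": 5, "Enterprise": 100}

def within_database_limit(edition, database_count):
    """True if a server of the given edition may host this many databases."""
    return database_count <= DB_LIMITS[edition]
```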
21. Active Manager Exchange component that manages *overs Runs on every server in the DAG Selects best available copy on failovers Is the definitive source of information on where a database is active Stores this information in cluster database Provides this information to other Exchange components (e.g., RPC Client Access and Hub Transport) Two Active Manager roles: PAM and SAM Active Manager client runs on CAS and Hub
22. Active Manager Primary Active Manager (PAM) Runs on the node that owns the cluster group Gets topology change notifications Reacts to server failures Selects the best database copy on *overs Standby Active Manager (SAM) Runs on every other node in the DAG Responds to queries about which server hosts the active copy of the mailbox database Both roles are necessary for automatic recovery If Replication service is stopped, automatic recovery will not happen
23. Active Manager: Selection of Active Database Copy. Active Manager selects the “best” copy to become active when the existing active copy fails. It ignores servers that are unreachable or whose activation is temporarily or permanently blocked, sorts copies by currency to minimize data loss, breaks ties during the sort based on Activation Preference, and selects from the sorted list based on the copy status of each copy.
24. Active Manager: Selection of Active Database Copy. Active Manager selects the “best” copy to become active when the existing active copy fails. In every round, a candidate copy must have a copy status of Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource. Active Manager then evaluates ten criteria sets, relaxing the catalog (content index), copy queue, and replay queue requirements in turn, from strictest to most lenient:
1. Catalog Healthy; CopyQueueLength < 10; ReplayQueueLength < 50
2. Catalog Crawling; CopyQueueLength < 10; ReplayQueueLength < 50
3. Catalog Healthy; CopyQueueLength < 10
4. Catalog Crawling; CopyQueueLength < 10
5. Catalog Healthy; ReplayQueueLength < 50
6. Catalog Crawling; ReplayQueueLength < 50
7. Catalog Healthy
8. Catalog Crawling
9. ReplayQueueLength < 50
10. Copy status only
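The selection logic can be sketched as below. This is a simplified illustration, not Exchange's actual code: copies are modeled as plain dicts, currency is approximated by copy queue length, and all names (including the exact ordering of the relaxation rounds) are my own reconstruction:

```python
# Copy statuses that are eligible for activation in every round.
HEALTHY_STATUSES = {"Healthy", "DisconnectedAndHealthy",
                    "DisconnectedAndResynchronizing", "SeedingSource"}

# Criteria rounds from strictest to most lenient:
# (allowed catalog states or None, max copy queue or None, max replay queue or None)
ROUNDS = [
    ({"Healthy"}, 10, 50),
    ({"Crawling"}, 10, 50),
    ({"Healthy"}, 10, None),
    ({"Crawling"}, 10, None),
    ({"Healthy"}, None, 50),
    ({"Crawling"}, None, 50),
    ({"Healthy"}, None, None),
    ({"Crawling"}, None, None),
    (None, None, 50),
    (None, None, None),
]

def select_best_copy(copies):
    """copies: dicts with keys status, catalog, copy_queue, replay_queue,
    activation_preference, blocked. Returns the copy to activate, or None."""
    # Ignore unreachable or activation-blocked copies
    candidates = [c for c in copies if not c["blocked"]]
    # Sort by currency (shortest copy queue first); break ties on
    # Activation Preference (lower value preferred)
    candidates.sort(key=lambda c: (c["copy_queue"], c["activation_preference"]))
    for catalog, max_cq, max_rq in ROUNDS:
        for c in candidates:
            if c["status"] not in HEALTHY_STATUSES:
                continue
            if catalog is not None and c["catalog"] not in catalog:
                continue
            if max_cq is not None and c["copy_queue"] >= max_cq:
                continue
            if max_rq is not None and c["replay_queue"] >= max_rq:
                continue
            return c
    return None  # no activatable copy found
```

Note how a copy with a Healthy catalog wins round 1 even if a more current copy only has a Crawling catalog; the more current copy is only considered in a later, more lenient round.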
25. Automatic Recovery Process When a failure occurs that affects a database: Active Manager determines the best copy to activate The Replication service on the target server attempts to copy missing log files from the source (ACLL) If successful, then the database will mount with zero data loss If unsuccessful (lossy failure), then the database will mount based on the AutoDatabaseMountDial setting The mounted database will generate new log files (using the same log generation sequence) Transport Dumpster requests will be initiated for the mounted database to recover lost messages When original server or database recovers, it will run through divergence detection and either perform an incremental resync or require a full reseed
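The lossy-mount decision in step 4 is governed by AutoDatabaseMountDial, whose documented Exchange 2010 settings are Lossless (0 lost logs), Good Availability (6), and Best Availability (12). A minimal sketch of that decision, with hypothetical function and parameter names:

```python
# Log-loss thresholds for the AutoDatabaseMountDial settings.
MOUNT_DIAL = {"Lossless": 0, "GoodAvailability": 6, "BestAvailability": 12}

def can_auto_mount(last_log_copied, last_log_generated, dial="GoodAvailability"):
    """After ACLL, decide whether the new active copy may mount automatically.

    last_log_copied: highest log generation present on the target copy
    last_log_generated: highest generation the failed active copy produced
    """
    logs_lost = last_log_generated - last_log_copied
    return logs_lost <= MOUNT_DIAL[dial]
```

If ACLL copied every log, the mount is lossless under any dial setting; with, say, 8 logs missing, the database mounts automatically only under Best Availability.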
26. Example: Database Failover. A database failure occurs; a failure item is raised; Active Manager moves the active database; the failed database copy is then restored. A similar flow applies within and across datacenters. (Diagram: a five-server DAG with copies of DB1–DB5 distributed across the servers.)
27. Example: Server Failover. A server failure occurs; the cluster raises a node-down notification; Active Manager moves the active databases. When the server is restored, the cluster raises a node-up notification and the server's database copies resynchronize with the active databases. A similar flow applies within and across datacenters. (Diagram: a five-server DAG with copies of DB1–DB5 distributed across the servers.)
28. Example: RPC Client Access service and Active Manager. Outlook clients connect through a load-balanced CAS array; each Client Access server runs an Active Manager client that asks Active Manager where the database is mounted. When a disk fails, Outlook's reconnect triggers a new Active Manager request; if the failover is still in progress, Active Manager returns the old server and the connection fails, so Outlook tries again; once the database failover is complete, Active Manager returns the new server. If a CAS fails, the load balancer directs the reconnect to a surviving CAS. (Diagram: Outlook1–Outlook3 → load balancer → CAS1–CAS3, each with an Active Manager client, in front of a four-server DAG where each Mailbox server runs MAPI RPC, Active Manager, and the Store.)
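The reconnect flow above can be modeled as a simple retry loop. This is a toy simulation, not Exchange code; all class, method, and server names are mine, and completing the failover between attempts stands in for the DAG finishing the database failover:

```python
class ActiveManagerStub:
    """Stand-in for the Active Manager lookup a CAS performs."""
    def __init__(self, active_server="MBX1", failover_in_progress=False):
        self.active_server = active_server
        self.failover_in_progress = failover_in_progress

    def locate(self, database):
        # May return a stale answer while a failover is in progress
        return self.active_server

    def complete_failover(self, new_server):
        self.active_server = new_server
        self.failover_in_progress = False

def reconnect(am, reachable_servers, attempts=3):
    """Outlook-style loop: ask AM, try to connect, retry on failure."""
    for _ in range(attempts):
        server = am.locate("DB1")
        if server in reachable_servers:
            return server  # connection succeeded
        if am.failover_in_progress:
            # Simulate the failover finishing before the next retry
            am.complete_failover("MBX2")
    return None  # gave up after all attempts
```

The first attempt gets the stale answer and fails; the retry, issued after the failover completes, is directed to the new active server.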
29. DAG Lifecycle. A DAG is created initially as an empty object in Active Directory, given a name and one or more IP addresses (or configured to use DHCP), and uses either continuous replication or third-party replication (Third Party Replication mode). When the first Mailbox server is added to a DAG: a Windows failover cluster is formed with a Node Majority quorum using the name of the DAG; the server is added to the DAG object in Active Directory; a cluster network object (CNO) for the DAG is created in the built-in Computers container; the name and IP address of the DAG are registered in DNS; and the cluster database for the DAG is updated with information on configured databases, including whether they are locally active (which they should be).
30. DAG Lifecycle. When the second and subsequent Mailbox servers are added to a DAG: the server is joined to the cluster for the DAG; the quorum model is automatically adjusted (Node Majority for DAGs with an odd number of members, Node and File Share Majority for DAGs with an even number of members); the file share witness cluster resource, directory, and share are automatically created by Exchange when needed; the server is added to the DAG object in Active Directory; and the cluster database for the DAG is updated with information on configured databases, including whether they are locally active (which they should be).
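The automatic quorum-model adjustment above reduces to a parity check on the member count. A minimal sketch (the function name is mine, not Exchange's):

```python
def quorum_model(member_count):
    """Quorum model a DAG uses for a given number of member servers."""
    if member_count % 2 == 1:
        return "Node Majority"
    # Even-membered DAGs add a file share witness vote
    return "Node and File Share Majority"
```

The witness exists precisely to break ties: with an even number of nodes, its extra vote keeps the total vote count odd so a majority is always well-defined.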
31. DAG Lifecycle After servers have been added to a DAG Configure the DAG Network Encryption Network Compression Configure DAG networks Network subnets Enable/disable MAPI traffic/replication Create mailbox database copies Seeding is performed automatically Monitor health and status of database copies Perform switchovers as needed
32. DAG Lifecycle Before you can remove a server from a DAG, you must first remove all replicated databases from the server When a server is removed from a DAG: The server is evicted from the cluster The cluster quorum is adjusted as needed The server is removed from the DAG object in Active Directory Before you can remove a DAG, you must first remove all servers from the DAG
37. Transition Steps Verify that you meet requirements for Exchange 2010 Deploy Exchange 2010 Use Exchange 2010 mailbox move features to migrate Unsupported Transitions In-place upgrade to Exchange 2010 from any previous version of Exchange Using database portability between Exchange 2010 and non-Exchange 2010 databases Backup and restore of earlier versions of Exchange databases on Exchange 2010 Using continuous replication between Exchange 2010 and Exchange 2007
39. High Availability Design Example: Branch/Small Office. Hardware load balancer; 8 processor cores recommended with a maximum of 64 GB RAM; DAG member servers can host other server roles (Client Access, Hub Transport, Mailbox); the UM role is not recommended for co-location; 2-server DAGs should use RAID. (Diagram: two multi-role Client Access/Hub Transport/Mailbox servers, each hosting copies of DB1–DB3.)
40. High Availability Design Example: Double Resilience (Maintenance + DB Failure). In a 3-server DAG, with 2 servers out quorum is lost, so the remaining server requires manual activation; DAGs with more servers sustain more failures and offer greater resiliency. Single site (AD: Dublin), 3 nodes, 3 HA copies, CAS NLB farm; JBOD → 3 physical copies. (Diagram: Mailbox Servers 1–3 each hosting copies of DB1–DB6, with two servers marked as failed.)
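The quorum arithmetic behind this scenario is a majority test over the DAG members. A sketch under the Node Majority model (odd-membered DAG, as on the slide; the function name is mine):

```python
def has_quorum(members, failed):
    """True if the surviving DAG members still form a majority
    of the original membership (Node Majority model)."""
    return (members - failed) > members // 2
```

For the slide's 3-server DAG, `has_quorum(3, 2)` is False, which is why the third server must be activated manually; a 5-server DAG, by contrast, survives two failures.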
48. Example: two active copies die. (Diagram: a four-server DAG behind a CAS NLB farm hosting copies of DB1–DB8; two servers fail and the affected databases activate on the surviving servers.)
49. High Availability on JBOD: 6 Servers, 3 Racks, 3-Copy DAG. 24,000 mailboxes; heavy profile: 100 messages/day, 0.1 IOPS/mailbox; 2 GB mailbox size. Each server: 8 cores, 48 GB RAM, 4,000 active mailboxes; separate MAPI and replication networks. 6 servers with 3 copies = double server failure resiliency: after a 1st failure ~5,000 active mailboxes per server, after a 2nd failure 6,000; soft active limit: 24 databases per server. Storage: 1 TB 7.2k SATA disks, JBOD with 48 disks per node (including 3 online spares), 288 disks total, 30 TB of database space, battery-backed caching array controller. (Diagram: active and passive copies of DB1–DB90 distributed across the six servers, plus spare disks.)
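The load figures on this slide are straightforward to verify with back-of-envelope arithmetic. A sketch (helper names are mine) assuming active databases redistribute evenly across the surviving servers:

```python
def active_mailboxes_per_server(total_mailboxes, servers, failed=0):
    """Active mailboxes each surviving server carries after failures."""
    return total_mailboxes // (servers - failed)

def required_iops(mailboxes, iops_per_mailbox=0.1):
    """Aggregate IOPS at the slide's heavy profile of 0.1 IOPS/mailbox."""
    return mailboxes * iops_per_mailbox
```

`active_mailboxes_per_server(24000, 6)` gives the slide's 4,000 per server; one failure raises it to 4,800 (the slide's "~5,000"), a second to 6,000; and 24,000 mailboxes demand about 2,400 IOPS in aggregate.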
50. Key Takeaways Greater end-to-end availability with Mailbox Resiliency Unified framework for high availability and site resilience Faster and easier to deploy with Incremental Deployment Reduced TCO with core ESE architecture changes and more storage options Supports large mailboxes for less money
52. Resources. Session recordings are available at TechEd Online. www.microsoft.com/teched (Sessions On-Demand & Community); www.microsoft.com/learning (Microsoft Certification & Training Resources); http://microsoft.com/technet (Resources for IT Professionals); http://microsoft.com/msdn (Resources for Developers)