Exchange Server 2013
High Availability | Site Resilience
Scott Schnoll
Principal Technical Writer
Microsoft Corporation
Agenda

Storage

High Availability

Site Resilience
Storage Challenges

Capacity is increasing, but IOPS are not
Database sizes must be manageable
Reseeds must be fast and reliable
Passive copy IOPS are inefficient
Lagged copies have asymmetric storage requirements
Low agility when recovering from low disk space conditions
Storage Innovations

Multiple Databases Per Volume
Automatic Reseed
Automatic Recovery from Storage Failures
Lagged Copy Enhancements
Multiple Databases Per Volume
4-member DAG
4 databases
4 copies of each database
4 databases per volume

Symmetrical design with balanced activation preference

Number of copies per database = number of databases per volume

[Figure: DB1–DB4 copies distributed across the four DAG members; legend: Active, Passive, Lagged]
Multiple Databases Per Volume

Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs

[Figure: one DB1 copy per disk; a single 20 MB/s reseed stream from the active copy to the reseed target; legend: Active, Passive]
Multiple Databases Per Volume

Single database copy/disk:
Reseed 2TB Database = ~23 hrs
Reseed 8TB Database = ~93 hrs

4 database copies/disk:
Reseed 2TB Disk = ~9.7 hrs
Reseed 8TB Disk = ~39 hrs

[Figure: four database copies per disk reseeding in parallel from multiple source servers at 20 MB/s and 12 MB/s; legend: Active, Passive, Lagged]
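A quick sanity check on these figures (my arithmetic, not from the deck): the single-copy numbers imply a sustained seed rate of roughly 24 MB/s, since 2 TB ≈ 2,000,000 MB ÷ 24 MB/s ≈ 83,000 s ≈ 23 hrs, and 8,000,000 MB ÷ 24 MB/s ≈ 93 hrs. The ~9.7 hr figure for a 2 TB disk holding four copies implies roughly 57 MB/s aggregate, which is what you would expect when the copies reseed in parallel from different source servers.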
Multiple Databases Per Volume

Requirements
Single logical disk/partition per physical disk

Best Practices
Same neighbors on all servers
Balance activation preferences
Database copies per volume = copies per database
Seeding Challenges

Disk failure on active copy = database failover
Failed disk and database corruption issues need to be addressed quickly
Fast recovery to restore redundancy is needed
Seeding Innovations

Automatic Reseed (Autoreseed) - use spares to automatically restore
database redundancy after a disk failure
Autoreseed




[Figure: a failed disk, marked with an X]
Autoreseed Workflow




1. Periodically scan for failed and suspended copies
2. Check prerequisites: single copy, spare availability
3. Allocate and remap a spare
4. Start the seed
5. Verify that the new copy is healthy
6. Admin replaces failed disk
Autoreseed Workflow

1. Detect a copy in a Failed and Suspended (F&S) state for 15 consecutive minutes
2. Try to resume the copy 3 times (with 5 min sleeps in between)
3. Try assigning a spare volume 5 times (with 1 hour sleeps in between)
4. Try InPlaceSeed with SafeDeleteExistingFiles 5 times (with 1 hour
   sleeps in between)
5. Once all retries are exhausted, workflow stops
6. If 3 days have elapsed and copy is still F&S, workflow state is reset
   and starts from Step 1
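Not in the deck, but a quick way to spot copies sitting in the trigger state from the shell (a sketch; MBX1 is a hypothetical server name):

# Find Failed and Suspended copies (the state that starts the Autoreseed workflow)
Get-MailboxDatabaseCopyStatus -Server MBX1 |
    Where-Object { $_.Status -eq 'FailedAndSuspended' } |
    Format-Table Name, Status, ContentIndexState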
Autoreseed Workflow

Prerequisites
Copy is not ReseedBlocked or ResumeBlocked
Logs and EDB files are on the same volume
Database and Log folder structure matches required naming convention
No active copies on failed volume
All copies are F&S on the failed volume
No more than 8 F&S copies on the server (if so, we might be in a controller failure situation)


For InPlaceReseed
If an EDB file exists, wait 2 days before in-place reseeding (based on the EDB file's LastWriteTime)
Only up to 10 concurrent seeds are allowed
Autoreseed

Configuration steps:
1. Configure storage subsystem with spare disks
2. Create DAG, add servers with configured storage
3. Create directory and mount points
4. Configure 3 AutoDAG properties: AutoDagVolumesRootFolderPath, AutoDagDatabasesRootFolderPath, AutoDagDatabaseCopiesPerVolume (= 1 in this example)
5. Create mailbox databases and database copies

[Figure: volumes mounted under the volumes root folder, and databases MDB1 and MDB2 mounted under the databases root folder, each with its MDB1.DB and MDB1.log folders]
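A minimal sketch of step 4, assuming a DAG named DAG1 (a placeholder name) and the conventional root paths used by Autoreseed:

# Set the three AutoDag properties that drive Autoreseed
Set-DatabaseAvailabilityGroup -Identity DAG1 `
    -AutoDagVolumesRootFolderPath 'C:\ExchangeVolumes' `
    -AutoDagDatabasesRootFolderPath 'C:\ExchangeDatabases' `
    -AutoDagDatabaseCopiesPerVolume 1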
Autoreseed

Requirements
Single logical disk/partition per physical disk
Specific database and log folder structure must be used


Recommendations
Same neighbors on all servers
Databases per volume should equal the number of copies per database
Balance activation preferences


Configuration instructions at http://aka.ms/autoreseed
Autoreseed

Numerous fixes in CU1
Autoreseed not detecting spare disks correctly, or not using detected disks

Get-MailboxDatabaseCopyStatus has a new field, 'ExchangeVolumeMountPoint', which
shows the mount point of the database volume under
C:\ExchangeVolumes

Better tracking around mount path and ExchangeVolume path

Increased autoreseed copy limits (previously 4, now 8)
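A quick way to inspect the new field from the shell (a sketch; the copy identity MDB1\MBX1 is hypothetical):

# List the mount-point fields, including ExchangeVolumeMountPoint, for one copy
Get-MailboxDatabaseCopyStatus -Identity 'MDB1\MBX1' | Format-List Name, Status, *MountPoint*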
Storage Challenges

Storage controllers are essentially mini-PCs and they can
crash/hang

Other operator-recoverable conditions can occur
Loss of vital system elements
Hung or highly latent IO
Storage Innovations

Exchange Server 2013 includes functionality to automatically recover
from a variety of new storage-related failures

Innovations added in Exchange 2010 also carried forward

Even more behaviors added in CU1
Automatic Recovery from Storage Failures

Exchange Server 2010:
ESE Database Hung IO (240s)
Failure Item Channel Heartbeat (30s)
SystemDisk Heartbeat (120s)

Exchange Server 2013 adds:
System Bad State (302s)
Long I/O times (41s)
MSExchangeRepl.exe memory threshold (4GB)
System Bus Reset (Event 129)
Replication service endpoints not responding
Lagged Copy Challenges

Activation is difficult
Require manual care
Cannot be page patched
Lagged Copy Innovations

Automatic play down of log files in critical situations
Integration with Safety Net
Lagged Copy Innovations

Automatic log play down
Low disk space (enable in registry)
Page patching (enabled by default)
Less than 3 other healthy copies (enable in AD; configure in registry)


Simpler activation with Safety Net
No need for log surgery or hunting for the point of corruption
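Not shown in the deck, but a minimal sketch of these knobs, with DAG1, MDB1, and MBX4 as placeholder names (illustrative, not a full activation runbook):

# Enable automatic log play down when fewer than 3 healthy copies remain (the AD-level switch)
Set-DatabaseAvailabilityGroup -Identity DAG1 -ReplayLagManagerEnabled $true

# Activate a lagged copy directly; Safety Net resubmits the missing messages,
# so no log surgery or hunting for the point of corruption is needed
Move-ActiveMailboxDatabase MDB1 -ActivateOnServer MBX4 -SkipLagChecks -MountDialOverride BestEffort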
High Availability Challenges

High availability focuses only on database health
Best copy selection insufficient for new architecture
Management challenges around maintenance and DAG network
configuration
High Availability Innovations

Managed Availability
Best Copy and Server Selection
DAG Network Autoconfig
Managed Availability

Key tenet for Exchange 2013
All access to a mailbox is provided by the protocol stack on the Mailbox server that hosts the active
copy of the user’s mailbox
If a protocol is down on a Mailbox server, all active databases lose access via that protocol


Managed Availability was introduced to detect these kinds of
failures and automatically correct them
For most protocols, quick recovery is achieved via a restart action
If the restart action fails, a failover can be triggered
Managed Availability

An internal framework used by component teams

Sequencing mechanism to control when recovery actions are taken
versus alerting and escalation

Includes a mechanism for taking servers in/out of service (maintenance
mode)

Enhancement to best copy selection algorithm
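The deck doesn't show cmdlets here; a sketch of how this surfaces in the shell, with MBX1 and the OWA health set as placeholder examples:

# Summarize health set state on a server
Get-HealthReport -Identity MBX1 | Format-Table HealthSet, AlertValue, MonitorCount

# Drill into the monitors behind one health set
Get-ServerHealth -Identity MBX1 -HealthSet OWA | Format-Table Name, AlertValue, TargetResource

# Maintenance mode (in/out of service) is driven by component states
Set-ServerComponentState MBX1 -Component ServerWideOffline -State Inactive -Requester Maintenance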
Managed Availability

MA failovers come in two forms
Server: Protocol failure can trigger server failover
Database: Store-detected database failure can trigger database failover


MA includes Single Copy Alert
Alert is per-server to reduce flow
Still triggered across all machines with copies
Monitoring triggered through a notification
Logs 4138 (red) and 4139 (green) events based on 4113 (red) and 4114 (green) events
Best Copy Selection Challenges

Process for finding the “best” copy of a specific database to
activate

Exchange 2010 uses several criteria
Copy queue length
Replay queue length
Database copy status – including activation blocked
Content index status


Not good enough for Exchange Server 2013, because protocol
health is not considered
Best Copy and Server Selection

Still an Active Manager algorithm, performed at *over (failover/switchover) time,
based on extracted health of the system
Replication health still determined by same criteria and phases

Criteria now includes health of the entire protocol stack
Considers a prioritized protocol health set in the selection
Four priorities – critical, high, medium, low (all health sets have a priority)
Failover responders trigger added checks to select a “protocol not worse”
target
Best Copy and Server Selection

1. All Healthy: checks for a server hosting a copy that has all health sets in a healthy state

2. Up to Normal Healthy: checks for a server hosting a copy that has all health sets Medium and above in a healthy state

3. All Better than Source: checks for a server hosting a copy that has health sets in a state that is better than the current server hosting the affected copy

4. Same as Source: checks for a server hosting a copy of the affected database that has health sets in a state that is the same as the current server hosting the affected copy
DAG Network Autoconfig

Automatic or manual DAG network config
Default is Automatic
Requires specific configuration settings on MAPI and Replication network interfaces


Manual edits and EAC controls blocked when automatic networking
is enabled
Set DAG to manual network setup to edit or change DAG networks


DAG networks automatically collapsed in multi-subnet environment
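A sketch of the manual path (DAG1 and ReplNet are placeholder names; -ManualDagNetworkConfiguration is the documented switch):

# Switch the DAG to manual network configuration so edits are allowed
Set-DatabaseAvailabilityGroup -Identity DAG1 -ManualDagNetworkConfiguration $true

# Example edit: dedicate a network to replication traffic
Set-DatabaseAvailabilityGroupNetwork -Identity 'DAG1\ReplNet' -ReplicationEnabled $true -IgnoreNetwork $false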
Site Resilience Challenges

Operationally complex
Mailbox and Client Access recovery connected
Namespace is a SPOF
Site Resilience Innovations

Operationally simplified
Mailbox and Client Access recovery independent
Namespace provides redundancy
Site Resilience

Key Characteristics
DNS resolves to multiple IP addresses
Almost all protocol access in Exchange 2013 is HTTP
HTTP clients have built-in IP failover capabilities
Clients skip past IPs that produce hard TCP failures
Admins can switch over by removing the VIP from DNS
Namespace no longer a SPOF
No dealing with DNS latency
Site Resilience

Previously, loss of CAS, the CAS array, VIP, load balancer, or some portion of the DAG
required the admin to perform a datacenter switchover

In Exchange Server 2013, recovery happens automatically
The admin focuses on fixing the issue, instead of restoring service
Site Resilience

Previously, CAS and Mailbox server recovery were tied together in site
recoveries

In Exchange Server 2013, recovery is independent, and may come
automatically in the form of failover
This is dependent on the customer’s business requirements and
configuration
Site Resilience

With the namespace simplification, consolidation of server
roles, separation of CAS array and DAG recovery, de-coupling of CAS
and Mailbox by AD site, and load balancing changes…

three locations, if available, can simplify mailbox recovery in
response to datacenter-level events
Site Resilience

You must have at least three locations
Two locations with Exchange; one with witness server

Exchange sites must be well-connected

Witness server site must be isolated from network failures affecting
Exchange sites
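A sketch of placing the witness in that third location (DAG1 and FS1 are placeholder names; WitnessServer and WitnessDirectory are the standard DAG witness settings):

# Host the file share witness in a site isolated from both Exchange sites
Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer FS1.thirdsite.contoso.com -WitnessDirectory 'C:\DAG1\Witness'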
Site Resilience




[Figure: cas1 and cas2 in Redmond; cas3 and cas4 in Portland, behind a common namespace]
Site Resilience




[Figure: the same topology — cas1 and cas2 in Redmond; cas3 and cas4 in Portland]
Site Resilience

Assuming MBX3 and MBX4 are operating and one of them can lock the witness.log file, automatic failover should occur.

[Figure: DAG1 with mbx1 and mbx2 in Redmond, mbx3 and mbx4 in Portland, and the witness server in a third site]
Site Resilience




[Figure: DAG1 with mbx1 and mbx2 in Redmond, mbx3 and mbx4 in Portland]
Site Resilience
1. Mark the failed servers/site as down: Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Redmond

2. Stop the Cluster service on remaining DAG members: Stop-Clussvc

3. Activate DAG members in the 2nd datacenter: Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite:Portland

[Figure: DAG1 with mbx1 and mbx2 in Redmond (failed), mbx3 and mbx4 in Portland]
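The same three steps as a runnable sequence (a sketch; -ConfigurationOnly is the documented variant of Stop-DatabaseAvailabilityGroup for when the failed site is unreachable, and Stop-Service ClusSvc is one way to stop the Cluster service):

# 1. Mark the failed Redmond members as down
Stop-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite Redmond -ConfigurationOnly

# 2. On each surviving DAG member, stop the Cluster service
Stop-Service ClusSvc

# 3. Activate the surviving members in Portland
Restore-DatabaseAvailabilityGroup DAG1 -ActiveDirectorySite Portland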
Questions?

Scott Schnoll
Principal Technical Writer
scott.schnoll@microsoft.com
http://aka.ms/schnoll

   schnoll

Editor's notes

  1. The primary input condition for the Autoreseed workflow is a copy in an F&S state for 15 consecutive minutes. The reset/resume behavior is useful since it can take a few days to replace a failed disk or controller.