Sunday, September 29, 2019

WSFC Quorum Modes and Voting Configuration (SQL Server)

Both SQL ServerAlways On availability groups and Always On Failover Cluster Instances (FCI) take advantage of Windows Server Failover Clustering (WSFC) as a platform technology. WSFC uses a quorum-based approach to monitoring overall cluster health and maximize node-level fault tolerance. 
A fundamental understanding of WSFC quorum modes and node voting configuration is very important to designing, operating, and troubleshooting your Always On high availability and disaster recovery solution.
In this topic:
  • Cluster Health Detection by Quorum
  • Quorum Modes
  • Voting and Non-Voting Nodes
  • Recommended Adjustments to Quorum Voting
  • Related Tasks
  • Related Content

Cluster Health Detection by Quorum

Each node in a WSFC cluster participates in periodic heartbeat communication to share the node's health status with the other nodes. Unresponsive nodes are considered to be in a failed state.
quorum node set is a majority of the voting nodes and witnesses in the WSFC cluster. The overall health and status of a WSFC cluster is determined by a periodic quorum vote. The presence of a quorum means that the cluster is healthy and able to provide node-level fault tolerance.
The absence of a quorum indicates that the cluster is not healthy. Overall WSFC cluster health must be maintained in order to ensure that healthy secondary nodes are available for primary nodes to fail over to. If the quorum vote fails, the WSFC cluster will be set offline as a precautionary measure. This will also cause all SQL Server instances registered with the cluster to be stopped.
 Important
If a WSFC cluster is set offline because of quorum failure, manual intervention is required to bring it back online.

Quorum Modes

quorum mode is configured at the WSFC cluster level that dictates the methodology used for quorum voting. The Failover Cluster Manager utility will recommend a quorum mode based on the number of nodes in the cluster.
The following quorum modes can be used to determine what constitutes a quorum of votes:
  • Node Majority. More than one-half of the voting nodes in the cluster must vote affirmatively for the cluster to be healthy.
  • Node and File Share Majority. Similar to Node Majority quorum mode, except that a remote file share is also configured as a voting witness, and connectivity from any node to that share is also counted as an affirmative vote. More than one-half of the possible votes must be affirmative for the cluster to be healthy.
    As a best practice, the witness file share should not reside on any node in the cluster, and it should be visible to all nodes in the cluster.
  • Node and Disk Majority. Similar to Node Majority quorum mode, except that a shared disk cluster resource is also designated as a voting witness, and connectivity from any node to that shared disk is also counted as an affirmative vote. More than one-half of the possible votes must be affirmative for the cluster to be healthy.
  • Disk Only. A shared disk cluster resource is designated as a witness, and connectivity by any node to that shared disk is counted as an affirmative vote.
 Tip
When using an asymmetric storage configuration for Always On availability groups, you should generally use the Node Majority quorum mode when you have an odd number of voting nodes, or the Node and File Share Majority quorum mode when you have an even number of voting nodes.

Voting and Non-Voting Nodes

By default, each node in the WSFC cluster is included as a member of the cluster quorum; each node has a single vote in determining the overall cluster health, and each node will continuously attempt to establish a quorum. The quorum discussion to this point has carefully qualified the set of WSFC cluster nodes that vote on cluster health as voting nodes.
No individual node in a WSFC cluster can definitively determine that the cluster as a whole is healthy or unhealthy. At any given moment, from the perspective of each node, some of the other nodes may appear to be offline, or appear to be in the process of failover, or appear unresponsive due to a network communication failure. A key function of the quorum vote is to determine whether the apparent state of each of node in the WSFC cluster is indeed that actual state of those nodes.
For all of the quorum models except 'Disk Only', the effectiveness of a quorum vote depends on reliable communications between all of the voting nodes in the cluster. Network communications between nodes on the same physical subnet should be considered reliable; the quorum vote should be trusted.
However, if a node on another subnet is seen as non-responsive in a quorum vote, but it is actually online and otherwise healthy, that is most likely due to a network communications failure between subnets. Depending upon the cluster topology, quorum mode, and failover policy configuration, that network communications failure may effectively create more than one set (or subset) of voting nodes.
When more than one subset of voting nodes is able to establish a quorum on its own, that is known as a split-brain scenario. In such a scenario, the nodes in the separate quorums may behave differently, and in conflict with one another.
 Note
The split-brain scenario is only possible when a system administrator manually performs a forced quorum operation, or in very rare circumstances, a forced failover; explicitly subdividing the quorum node set.
In order to simplify your quorum configuration and increase up-time, you may want to adjust each node's NodeWeight setting so that the node's vote is not counted towards the quorum.
 Important
In order to use NodeWeight settings, the following hotfix must be applied to all servers in the WSFC cluster:
KB2494036: A hotfix is available to let you configure a cluster node that does not have quorum votes in Windows Server 2008 and in Windows Server 2008 R2

Recommended Adjustments to Quorum Voting

When enabling or disabling a given WSFC node's vote, follow these guidelines:
  • No vote by default. Assume that each node should not vote without explicit justification.
  • Include all primary replicas. Each WSFC node that hosts an availability group primary replica or is the preferred owner of an FCI should have a vote.
  • Include possible automatic failover owners. Each node that could host a primary replica, as the result of an automatic availability group failover or FCI failover, should have a vote. If there is only one availability group in the WSFC cluster and availability replicas are hosted only by standalone instances, this rule includes only the secondary replica that is the automatic failover target.
  • Exclude secondary site nodes. In general, do not give votes to WSFC nodes that reside at a secondary disaster recovery site. You do not want nodes in the secondary site to contribute to a decision to take the cluster offline when there is nothing wrong with the primary site.
  • Odd number of votes. If necessary, add a witness file share, a witness node, or a witness disk to the cluster and adjust the quorum mode to prevent possible ties in the quorum vote.
  • Re-assess vote assignments post-failover. You do not want to fail over into a cluster configuration that does not support a healthy quorum.
 Important
When validating WSFC quorum vote configuration, the Always On Availability Group Wizard shows a warning if any of the following conditions are true:
  • The cluster node that hosts the primary replica does not have a vote
  • A secondary replica is configured for automatic failover and its cluster node does not have a vote.
  • KB2494036 is not installed on all cluster nodes that host availability replicas. This patch is required to add or remove votes for cluster nodes in multi-site deployments. However, in single-site deployments, it is usually not required and you may safely ignore the warning.
 Tip
SQL Server exposes several system dynamic management views (DMVs) that can help you manage settings related WSFC cluster configuration and node quorum voting.

Failover Policy for Failover Cluster Instances

In a SQL Server failover cluster instance (FCI), only one node can own the Windows Server Failover Cluster (WSFC) cluster resource group at a given time. The client requests are served through this node in the FCI. 
In the case of a failure and an unsuccessful restart, the group ownership is moved to another WSFC node in the FCI. This process is called failover. SQL Server 2017 increases the reliability of failure detection and provides a flexible failover policy.
A SQL Server FCI depends on the underlying WSFC service for failover detection. Therefore, two mechanisms determine the failover behavior for FCI: the former is native WSFC functionality, and the latter is functionality added by SQL Server setup.
  • The WSFC cluster maintains the quorum configuration, which ensures a unique failover target in an automatic failover. The WSFC service determines whether the cluster is in optimal quorum health at all times and brings the resource group online and offline accordingly.
  • The active SQL Server instance periodically reports a set of component diagnostics to the WSFC resource group over a dedicated connection. The WSFC resource group maintains the failover policy, which defines the failure conditions that trigger restarts and failovers.
 Important
Automatic failovers to and from an FCI are not allowed in an Always On availability group. However, manual failovers to and from and FCI are allowed in an Always On availability group.

Failover Policy Overview

The failover process can be broken down into the following steps:
  1. Monitor the Health Status
  2. Determining Failures
  3. Responding to Failures

Monitor the Health Status

There are three types of health statuses that are monitored for the FCI:
  • State of the SQL Server service
  • Responsiveness of the SQL Server instance
  • SQL Server component diagnostics

State of the SQL Server service

The WSFC service monitors the start state of the SQL Server service on the active FCI node to detect when the SQL Server service is stopped.

Responsiveness of the SQL Server instance

During SQL Server startup, the WSFC service uses the SQL Server Database Engine resource DLL to create a new connection to on a separate thread that is used exclusively for monitoring the health status. 
This ensures that there the SQL instance has the required resources to report its health status while under load. Using this dedicated connection, SQL Server runs the sp_server_diagnostics (Transact-SQL) system stored procedure in repeat mode to periodically report the health status of the SQL Server components to the resource DLL.
The resource DLL determines the responsiveness of the SQL instance using a health check timeout. The HealthCheckTimeout property defines how long the resource DLL should wait for the sp_server_diagnostics stored procedure before it reports the SQL instance as unresponsive to the WSFC service. This property is configurable using T-SQL as well as in the Failover Cluster Manager snap-in. 
The following items describe how this property affects timeout and repeat interval settings:
  • The resource DLL calls the sp_server_diagnostics stored procedure and sets the repeat interval to one-third of the HealthCheckTimeout setting.
  • If the sp_server_diagnostics stored procedure is slow or is not returning information, the resource DLL will wait for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive.
  • If the dedicated connection is lost, the resource DLL will retry the connection to the SQL instance for the interval specified by HealthCheckTimeout before it reports to the WSFC service that the SQL instance is unresponsive.

SQL Server component diagnostics

The system stored procedure sp_server_diagnostics periodically collects component diagnostics on the SQL instance. The diagnostic information that is collected is surfaced as a row for each of the following components and passed to the calling thread.
  1. system
  2. resource
  3. query process
  4. io_subsystem
  5. events
The system, resource, and query process components are used for failure detection. The io_subsytem and events components are used for diagnostic purposes only.
Each rowset of information is also written to the SQL Server cluster diagnostics log.
 Tip
While the sp_server_diagnostic stored procedure is used by SQL Server Always On technology, it is available for use in any SQL Server instance to help detect and troubleshoot problems.

Determining Failures

The SQL Server Database Engine resource DLL determines whether the detected health status is a condition for failure using the FailureConditionLevel property. The FailureConditionLevel property defines which detected health statuses cause restarts or failovers. Multiple levels of options are available, ranging from no automatic restart or failover to all possible failure conditions resulting in an automatic restart or failover.
The failure conditions are set on an increasing scale. For levels 1-5, each level includes all the conditions from the previous levels in addition to its own conditions. This means that with each level, there is an increased probability of a failover or restart. The failure condition levels are described in the following table.
Review sp_server_diagnostics (Transact-SQL) as this system stored procedure plays in important role in the failure condition levels.
LevelConditionDescription
0No automatic failover or restartIndicates that no failover or restart will be triggered automatically on any failure conditions. This level is for system maintenance purposes only.
1Failover or restart on server downIndicates that a server restart or failover will be triggered if the following condition is raised:

SQL Server service is down.
2Failover or restart on server unresponsiveIndicates that a server restart or failover will be triggered if any of the following conditions are raised:

SQL Server service is down.

SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).
3*Failover or restart on critical server errorsIndicates that a server restart or failover will be triggered if any of the following conditions are raised:

SQL Server service is down.

SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).

System stored procedure sp_server_diagnostics returns 'system error'.
4Failover or restart on moderate server errorsIndicates that a server restart or failover will be triggered if any of the following conditions are raised:

SQL Server service is down.

SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).

System stored procedure sp_server_diagnostics returns 'system error'.

System stored procedure sp_server_diagnostics returns 'resource error'.
5Failover or restart on any qualified failure conditionsIndicates that a server restart or failover will be triggered if any of the following conditions are raised:

SQL Server service is down.

SQL Server instance is not responsive (Resource DLL cannot receive data from sp_server_diagnostics within the HealthCheckTimeout settings).

System stored procedure sp_server_diagnostics returns 'system error'.

System stored procedure sp_server_diagnostics returns 'resource error'.

System stored procedure sp_server_diagnostics returns 'query_processing error'.
*Default Value

Responding to Failures

After one or more failure conditions are detected, how the WSFC service responds to the failures depends on the WSFC quorum state and the restart and failover settings of the FCI resource group. 
If the FCI has lost its WSFC quorum, then the entire FCI is brought offline and the FCI has lost its high availability. If the FCI still retains its WSFC quorum, then the WSFC service may respond by first attempting to restart the failed node and then failover if the restart attempts are unsuccessful. The restart and failover settings are configured in the Failover Cluster Manager snap-in.

Database Checkpoints (SQL Server)

checkpoint creates a known good point from which the SQL Server Database Engine can start applying changes contained in the log during recovery after an unexpected shutdown or crash.

Overview

For performance reasons, the Database Engine performs modifications to database pages in memory-in the buffer cache-and does not write these pages to disk after every change. Rather, the Database Engine periodically issues a checkpoint on each database. 
checkpoint writes the current in-memory modified pages (known as dirty pages) and transaction log information from memory to disk and, also records the information in the transaction log.
The Database Engine supports several types of checkpoints: automatic, indirect, manual, and internal. The following table summarizes the types of checkpoints:
NameTransact-SQL InterfaceDescription
AutomaticEXEC sp_configure 'recovery interval','seconds'Issued automatically in the background to meet the upper time limit suggested by the recovery interval server configuration option. Automatic checkpoints run to completion. Automatic checkpoints are throttled based on the number of outstanding writes and whether the Database Engine detects an increase in write latency above 50 milliseconds.
IndirectALTER DATABASE ... SET TARGET_RECOVERY_TIME =target_recovery_time { SECONDS | MINUTES }Issued in the background to meet a user-specified target recovery time for a given database. Beginning with SQL Server 2016 (13.x), the default value is 1 minute. The default is 0 for older versions, which indicates that the database will use automatic checkpoints, whose frequency depends on the recovery interval setting of the server instance.
ManualCHECKPOINT [checkpoint_duration]Issued when you execute a Transact-SQL CHECKPOINT command. The manual checkpoint occurs in the current database for your connection. By default, manual checkpoints run to completion. Throttling works the same way as for automatic checkpoints. Optionally, the checkpoint_duration parameter specifies a requested amount of time, in seconds, for the checkpoint to complete.
InternalNone.Issued by various server operations such as backup and database-snapshot creation to guarantee that disk images match the current state of the log.
 Note
The -k SQL Server advanced setup option enables a database administrator to throttle checkpoint I/O behavior based on the throughput of the I/O subsystem for some types of checkpoints. The -k setup option applies to automatic checkpoints and any otherwise unthrottled manual and internal checkpoints.
For automatic, manual, and internal checkpoints, only modifications made after the latest checkpoint need to be rolled forward during database recovery. This reduces the time required to recover a database.
 Important
Long-running, uncommitted transactions increase recovery time for all checkpoint types.

Interaction of the TARGET_RECOVERY_TIME and 'recovery interval' Options

The following table summarizes the interaction between the server-wide sp_configure'recovery interval' setting and the database-specific ALTER DATABASE ... TARGET_RECOVERY_TIME setting.
TARGET_RECOVERY_TIME'recovery interval'Type of Checkpoint Used
00automatic checkpoints whose target recovery interval is 1 minute.
0>0Automatic checkpoints whose target recovery interval is specified by the user-defined setting of the sp_configure 'recovery interval' option.
>0Not applicable.Indirect checkpoints whose target recovery time is determined by the TARGET_RECOVERY_TIME setting, expressed in seconds.

Automatic checkpoints

An automatic checkpoint occurs each time the number of log records reaches the number the Database Engine estimates it can process during the time specified in the recovery interval server configuration option. 
In every database without a user-defined target recovery time, the Database Engine generates automatic checkpoints. The frequency depends on the recovery interval advanced server configuration option, which specifies the maximum time that a given server instance should use to recover a database during a system restart. 
The Database Engine estimates the maximum number of log records it can process within the recovery interval. When a database using automatic checkpoints reaches this maximum number of log records, the Database Engine issues an checkpoint on the database.
The time interval between automatic checkpoints can be highly variable. A database with a substantial transaction workload will have more frequent checkpoints than a database used primarily for read-only operations. Under the simple recovery model, an automatic checkpoint is also queued if the log becomes 70 percent full.
Under the simple recovery model, unless some factor is delaying log truncation, an automatic checkpoint truncates the unused section of the transaction log. By contrast, under the full and bulk-logged recovery models, once a log backup chain has been established, automatic checkpoints do not cause log truncation. 
After a system crash, the length of time required to recover a given database depends largely on the amount of random I/O needed to redo pages that were dirty at the time of the crash. This means that the recovery interval setting is unreliable. It cannot determine an accurate recovery duration. Furthermore, when an automatic checkpoint is in progress, the general I/O activity for data increases significantly and quite unpredictably.

Impact of recovery interval on recovery performance

For an online transaction processing (OLTP) system using short transactions, recovery interval is the primary factor determining recovery time. However, the recovery interval option does not affect the time required to undo a long-running transaction. Recovery of a database with a long-running transaction can take much longer than the time specified in the recovery interval setting.
For example, if a long-running transaction took two hours to perform updates before the server instance became disabled, the actual recovery takes considerably longer than the recovery interval value to recover the long transaction. 
Typically, the default values provides optimal recovery performance. However, changing the recovery interval might improve performance in the following circumstances:
  • If recovery routinely takes significantly longer than 1 minute when long-running transactions are not being rolled back.
  • If you notice that frequent checkpoints are impairing performance on a database.
If you decide to increase the recovery interval setting, we recommend increasing it gradually by small increments and evaluating the effect of each incremental increase on recovery performance. 
This approach is important because as the recovery interval setting increases, database recovery takes that many times longer to complete. For example, if you change recovery interval to 10 minutes, recovery takes approximately 10 times longer to complete than when recovery interval is set to 1 minute.

Indirect checkpoints

Indirect checkpoints, introduced in SQL Server 2012 (11.x), provide a configurable database-level alternative to automatic checkpoints. This can be configured by specifying the target recovery time database configuration option. 
In the event of a system crash, indirect checkpoints provide potentially faster, more predictable recovery time than automatic checkpoints. Indirect checkpoints offer the following advantages:
  • An online transactional workload on a database configured for indirect checkpoints can experience performance degradation. Indirect checkpoints ensure that the number of dirty pages are below a certain threshold so the database recovery completes within the target recovery time.
The recovery interval configuration option uses the number of transactions to determine the recovery time, as opposed to indirect checkpoints which makes use of the number of dirty pages. 
When indirect checkpoints are enabled on a database receiving a large number of DML operations, the background writer can start aggressively flushing dirty buffers to disk to ensure that the time required to perform recovery is within the target recovery time set of the database. 
This can cause additional I/O activity on certain systems which can contribute to a performance bottleneck if the disk subsystem is operating above or nearing the I/O threshold.
  • Indirect checkpoints enable you to reliably control database recovery time by factoring in the cost of random I/O during REDO. This enables a server instance to stay within an upper-bound limit on recovery times for a given database (except when a long-running transaction causes excessive UNDO times).
  • Indirect checkpoints reduce checkpoint-related I/O spiking by continually writing dirty pages to disk in the background.
However, an online transactional workload on a database configured for indirect checkpoints can experience performance degradation. This is because the background writer used by indirect checkpoint sometimes increases the total write load for a server instance.
 Important
Indirect checkpoint is the default behavior for new databases created in SQL Server 2016 (13.x), including the Model and TempDB databases.
Databases that were upgraded in-place, or restored from a previous version of SQL Server, will use the previous automatic checkpoint behavior unless explicitly altered to use indirect checkpoint.

Improved indirect checkpoint scalability

Prior to SQL Server 2019, you may experience non-yielding scheduler errors when there is a database that generates a large number of dirty pages, such as tempdb. SQL Server 2019 introduces improved scalability for indirect checkpoint, which should help avoid these errors on databases that have a heavy UPDATE/INSERT workload.

Internal checkpoints

Internal Checkpoints are generated by various server components to guarantee that disk images match the current state of the log. Internal checkpoint are generated in response to the following events:
  • Database files have been added or removed by using ALTER DATABASE.
  • A database backup is taken.
  • A database snapshot is created, whether explicitly or internally for DBCC CHECKDB.
  • An activity requiring a database shutdown is performed. For example, AUTO_CLOSE is ON and the last user connection to the database is closed, or a database option change is made that requires a restart of the database.
  • An instance of SQL Server is stopped by stopping the SQL Server (MSSQLSERVER) service . Either action causes a checkpoint in each database in the instance of SQL Server.
  • Bringing a SQL Server failover cluster instance (FCI) offline.

Lab 09: Publish and subscribe to Event Grid events

  Microsoft Azure user interface Given the dynamic nature of Microsoft cloud tools, you might experience Azure UI changes that occur after t...