| Oracle® High Availability Architecture and Best Practices 10g Release 1 (10.1) Part Number B10726-01 |
This chapter describes scheduled and unscheduled outages and the Oracle recovery process and architectural framework that can manage each outage and minimize downtime. This chapter contains the following sections:
Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:
Table 9-1 describes the unscheduled outages that impact the primary or secondary site components.
The rest of this section provides outage decision trees for unscheduled outages on the primary site and the secondary site. The decision trees appear in the following sections:
The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. These descriptions are found in Chapter 10, "Detailed Recovery Steps".
Some outages require multiple recovery steps. For example, when a site failure occurs, the outage decision matrix states that Data Guard failover must occur before site failover. Some outages are handled automatically without any loss of availability. For example, instance failure is managed automatically by RAC. Multiple recovery options for each outage are listed wherever relevant.
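The ordering rule in the preceding paragraph can be illustrated with a small Python sketch. It is not Oracle tooling: the `RECOVERY_PLANS` dictionary, its keys, and both function names are hypothetical, and only the two examples named in the text are encoded. Assigning the site failure example to the MAA architecture is an assumption made for illustration.

```python
# Hypothetical sketch: ordered manual recovery steps per (outage, architecture).
# Only the two examples stated in the text are encoded; all names are illustrative.
RECOVERY_PLANS = {
    # Site failure: Data Guard failover must occur before site failover.
    # (Filing this under "MAA" is an assumption for the example.)
    ("site failure", "MAA"): ["Data Guard failover", "site failover"],
    # Instance failure is managed automatically by RAC: no manual steps.
    ("instance failure", "RAC only"): [],
}


def recovery_steps(outage, architecture):
    """Return the manual recovery steps in the order they must be run."""
    try:
        return list(RECOVERY_PLANS[(outage, architecture)])
    except KeyError:
        raise LookupError(
            f"no plan recorded for {outage!r} on {architecture!r}"
        ) from None


def is_automatic(outage, architecture):
    """True when the recorded plan needs no manual recovery steps."""
    return not recovery_steps(outage, architecture)
```

Keeping the steps in an ordered list, rather than a set, preserves the constraint that one step must finish before the next begins.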
If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for maximum availability of the system. Only the "Data Guard only" and MAA architectures have a secondary site to protect from site disasters. The estimated recovery times (ERT) are strictly examples derived from customer and actual testing experiences and do not reflect a guaranteed recovery time.
Table 9-2 summarizes the recovery steps for unscheduled outages on the primary site.
| Reason for Outage | Recovery Steps for "Database Only" Architecture | Recovery Steps for "RAC Only" Architecture | Recovery Steps for "Data Guard Only" Architecture | Recovery Steps for MAA |
|---|---|---|---|---|
|
ERT: hours to days |
ERT: hours to days |
ERT: minutes to an hour |
ERT: minutes to an hour | |
|
ERT: minutes to an hour |
Managed automatically by RAC Recovery |
ERT: minutes to an hour ERT: minutes to an hour |
Managed automatically by RAC Recovery | |
|
ERT: minutes |
Managed automatically by RAC Recovery |
ERT: minutes |
Managed automatically by RAC Recovery | |
|
N/A |
ERT: hours to days |
N/A |
ERT: minutes to an hour | |
|
Recovery Solutions for Data Failures ERT: minutes to an hour Note: For primary database media failures or media corruptions, database failover may minimize data loss. | ||||
Outages on the secondary site do not directly affect availability because the clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may impact the MTTR if there are concurrent failures on the primary site. For most cases, outages on the secondary site can be managed with no impact on availability. However, if maximum protection mode is part of the configuration, then an unscheduled outage on the last surviving standby database causes downtime on the production database. After downgrading the data protection mode, you can restart the production database.
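The rule in the preceding paragraph can be written as a short decision function. The Python below is a minimal sketch under stated assumptions: the function name, the mode strings, and the returned messages are hypothetical and do not correspond to any Oracle interface.

```python
# Hypothetical sketch of the rule above; names and strings are illustrative.
PROTECTION_MODES = (
    "maximum protection",
    "maximum availability",
    "maximum performance",
)


def production_impact(protection_mode, surviving_standbys):
    """Describe how an unscheduled standby outage affects production.

    surviving_standbys is the number of standby databases that are
    still available after the outage.
    """
    if protection_mode not in PROTECTION_MODES:
        raise ValueError(f"unknown protection mode: {protection_mode!r}")
    if surviving_standbys < 0:
        raise ValueError("surviving_standbys cannot be negative")
    if protection_mode == "maximum protection" and surviving_standbys == 0:
        # Losing the last surviving standby stops the production database.
        return ("production downtime: downgrade the data protection mode, "
                "then restart the production database")
    return "no direct impact on availability"
```

For example, `production_impact("maximum protection", 0)` reports downtime, while the same outage under maximum availability or maximum performance does not.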
Table 9-3 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site.
| Reason for Outage | Recovery Steps for "Data Guard Only" Architecture | Recovery Steps for MAA |
|---|---|---|
| Standby apply instance failure | If there is only one standby database and if maximum database protection is configured, then the production database will shut down to ensure that there is no data divergence with the standby database. | There is no effect on production availability if the production database Oracle Net descriptor is configured to use connect-time failover to an available standby instance. Restart node and instance when they are available. |
| Standby non-apply instance failure | N/A | There is no effect on availability because the primary node or instance receives redo logs and applies them with the recovery process. The production database continues to communicate with this standby instance. Restart node and instance when they are available. |
| Data failure such as media failure or disk corruption | Restoring Fault Tolerance after a Standby Database Data Failure | Restoring Fault Tolerance after a Standby Database Data Failure |
| Primary database resets logs because of flashback operations or media recovery | Restoring Fault Tolerance After the Production Database Has Opened Resetlogs | Restoring Fault Tolerance After the Production Database Has Opened Resetlogs |
Scheduled outages are planned outages. They are required for regular maintenance of the technology infrastructure that supports the application and include tasks such as hardware maintenance, repair, and upgrades; software upgrades and patching; application changes and patching; and changes to improve performance and manageability of systems. Scheduled outages should be scheduled at times best suited for continual application availability.
Table 9-4 describes the scheduled outages that impact either the primary or secondary site components.
The rest of this section provides outage decision trees for scheduled outages. They appear in the following sections:
The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. The detailed descriptions of the recovery operations are found in Chapter 10, "Detailed Recovery Steps".
This section also includes the following topic:
If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for continued availability of the system.
Table 9-5 shows the recovery steps for scheduled outages on the primary site.
| Scope of Outage | Reason for Outage | Recovery Steps for "Database Only" Architecture | Recovery Steps for "RAC Only" Architecture | Recovery Steps for "Data Guard Only" Architecture | Recovery Steps for MAA |
|---|---|---|---|---|---|
|
Site |
Site shut down |
Downtime for entire duration |
Downtime for entire duration |
||
|
Primary database |
Hardware maintenance (node impact) |
Downtime for entire duration |
Managed automatically by RAC Recovery |
Managed automatically by RAC Recovery | |
|
Primary database |
Hardware maintenance (clusterwide impact) |
Downtime for entire duration |
Downtime for entire duration |
||
|
Primary database |
System software maintenance (node impact) |
Downtime for entire duration |
Managed automatically by RAC Recovery |
Managed automatically by RAC Recovery | |
|
Primary database |
System software maintenance (clusterwide impact) |
Downtime for entire duration |
Downtime for entire duration |
||
|
Primary database |
Oracle patch upgrade for the database |
Downtime for entire duration |
Downtime for entire duration |
||
|
Primary database |
Oracle patch set or software upgrade for the database |
Downtime for entire duration |
Downtime for entire duration |
||
|
Primary database |
Database object reorganization |
Outages on the secondary site do not impact availability because the clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may affect the MTTR if there are concurrent failures on the primary site. Outages on the secondary site can be managed with no impact on availability. If maximum protection database mode is configured, then downgrade the protection mode before a scheduled outage on the standby instance or database so that there will be no downtime on the production database.
Table 9-6 describes the recovery steps for scheduled outages on the secondary site.
| Scope of Outage | Reason for Outage | Recovery Steps for "Data Guard Only" Architecture | Recovery Steps for MAA |
|---|---|---|---|
| Site | Site shutdown | Before the outage: "Preparing for Scheduled Secondary Site Maintenance" After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage" | Before the outage: "Preparing for Scheduled Secondary Site Maintenance" After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage" |
| Standby database | Hardware or software maintenance on the node that is running the managed recovery process (MRP) | Before the outage: "Preparing for Scheduled Secondary Site Maintenance" | Before the outage: "Preparing for Scheduled Secondary Site Maintenance" |
| Standby database | | N/A | No impact because the primary standby node or instance receives redo logs that are applied with the managed recovery process. After the outage: Restart node and instance when available. |
| Standby database | Hardware or software maintenance (clusterwide impact) | N/A | Before the outage: "Preparing for Scheduled Secondary Site Maintenance" After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage" |
| Standby database | Oracle patch and software upgrades | Downtime needed for upgrade, but there is no impact on primary node unless the configuration is in maximum protection database mode. | Downtime needed for upgrade, but there is no impact on primary node unless the configuration is in maximum protection database mode. |
To achieve continued service during a secondary site scheduled outage, downgrade the maximum protection mode to maximum availability or maximum performance. When you are scheduling secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time the standby lags behind the production database, which lengthens the time to restore fault tolerance.
Table 9-7 shows how to prepare for scheduled secondary site maintenance.
| Production Database Protection Mode | Reason for Outage | Preparation Steps for "Data Guard Only" Architecture and MAA |
|---|---|---|
| Maximum protection | Site shutdown | Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode" |
| Maximum protection | Hardware maintenance (clusterwide impact) | Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode" |
| Maximum protection | Software maintenance (clusterwide impact) | Switch the production data protection mode to either maximum availability or maximum performance. See Also: "Changing the Data Protection Mode" |
| Maximum protection | Hardware maintenance on the primary node (the node that is running the recovery process) | Apply Instance Failover (MAA only) Switch the production data protection mode to either maximum availability or maximum performance |
| Maximum protection | Software maintenance on the primary node (the node that is running the recovery process) | Apply Instance Failover (MAA only) Switch the production data protection mode to either maximum availability or maximum performance |
| Maximum availability or maximum performance | Site shutdown | None; no impact on production database |
| Maximum availability or maximum performance | Hardware maintenance (clusterwide impact) | None; no impact on production database |
| Maximum availability or maximum performance | Software maintenance (clusterwide impact) | None; no impact on production database |
| Maximum availability or maximum performance | Hardware maintenance on the primary node (the node that is running the recovery process) | Apply Instance Failover (MAA only) None; no impact on production database |
| Maximum availability or maximum performance | Software maintenance on the primary node (the node that is running the recovery process) | Apply Instance Failover (MAA only) None; no impact on production database |
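The lookup that Table 9-7 describes can be sketched in Python. This is an illustration only, not part of any Oracle product: every identifier below is hypothetical, and the reason strings are shortened forms of the table entries. The table lists "Apply Instance Failover (MAA only)" in the same cell as the other entry without stating how the two relate, so the sketch simply returns every listed action and makes no claim about whether they are alternatives or a sequence.

```python
# Hypothetical encoding of Table 9-7; identifiers are illustrative only.
CLUSTERWIDE_REASONS = {
    "site shutdown",
    "hardware maintenance (clusterwide impact)",
    "software maintenance (clusterwide impact)",
}
PRIMARY_NODE_REASONS = {
    "hardware maintenance on the primary node",
    "software maintenance on the primary node",
}
PROTECTION_MODES = {
    "maximum protection",
    "maximum availability",
    "maximum performance",
}
DOWNGRADE = ("Switch the production data protection mode to either "
             "maximum availability or maximum performance")


def preparation_actions(protection_mode, reason, architecture):
    """Return the actions Table 9-7 lists for a scheduled standby outage.

    An empty list corresponds to "None; no impact on production database".
    """
    if protection_mode not in PROTECTION_MODES:
        raise ValueError(f"unknown protection mode: {protection_mode!r}")
    if reason not in CLUSTERWIDE_REASONS | PRIMARY_NODE_REASONS:
        raise ValueError(f"unknown reason for outage: {reason!r}")
    if architecture not in ("Data Guard only", "MAA"):
        raise ValueError(f"architecture has no secondary site: {architecture!r}")

    actions = []
    # Apply instance failover is listed for MAA only, and only when the
    # maintenance affects the node that is running the recovery process.
    if reason in PRIMARY_NODE_REASONS and architecture == "MAA":
        actions.append("Apply Instance Failover")
    # Maximum protection must be downgraded before the standby goes away.
    if protection_mode == "maximum protection":
        actions.append(DOWNGRADE)
    return actions
```

A call such as `preparation_actions("maximum availability", "site shutdown", "Data Guard only")` returns an empty list, matching the rows where no preparation is required.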