
Oracle® High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-01

9
Recovering from Outages


This chapter describes scheduled and unscheduled outages and the Oracle recovery process and architectural framework that can manage each outage and minimize downtime. This chapter contains the following sections:

  • Recovery Steps for Unscheduled Outages
  • Recovery Steps for Scheduled Outages

Recovery Steps for Unscheduled Outages

Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:

The monitoring and HA infrastructure should provide rapid detection and recovery from failures. Detection is described in Chapter 8, "Using Oracle Enterprise Manager for Monitoring and Detection", while this chapter focuses on the recovery operations for each outage.

Table 9-1 describes the unscheduled outages that impact the primary or secondary site components.

Table 9-1 Unscheduled Outages  

Outage Description Examples

Site failure

The entire site where the current production database resides is unavailable. This includes all tiers of the application.

  • Disaster at the production site such as a fire, flood, or earthquake
  • Power outages. (If there are multiple power grids and backup generators for critical systems, then this should affect only part of the data center.)

Node failure

A node of the RAC cluster is unavailable or fails

  • A database tier node fails or has to be shut down because of bad memory or bad CPU
  • The database tier node is unreachable
  • Both of the redundant cluster interconnects fail, resulting in another node taking ownership

Instance failure

A database instance is unavailable or fails

An instance of the RAC database on the data server fails because of a software bug or an operating system or hardware problem

Clusterwide failure

The whole cluster hosting the RAC database is unavailable or fails. This includes failures of nodes in the cluster as well as any other components that result in the cluster being unavailable and the Oracle database and instances on the site being unavailable.

  • The last surviving node on the RAC cluster fails and cannot be restarted
  • Both of the redundant cluster interconnects fail
  • Database corruption is severe enough to disallow continuity on the current data server
  • Disk storage fails

Data failure

This failure results in unavailability of parts of the database because of media corruptions, inaccessibility, or inconsistencies.

  • A datafile is accidentally removed or is unavailable
  • Media corruption affects blocks of the database
  • Oracle block corruption is caused by operating system or other node-related problems

User error

This failure results in unavailability of parts of the database and causes transactional or logical data inconsistencies. It is usually caused by the operator or by bugs in the application code.

This is estimated to be the greatest single cause of downtime.

Localized damage (needs surgical repair)

  • User error results in a table being dropped or in rows being deleted from a table

Widespread damage (needs drastic action to avoid downtime)

  • Application errors result in logical corruptions in the database
  • Operator error results in a batch job being run more times than specified.

Note: This category focuses on user errors that affect database availability and, in particular, cause transactional or logical data inconsistencies.

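The rest of this section provides outage decision trees for unscheduled outages on the primary site and the secondary site. The decision trees appear in the following sections:

  • Recovery Steps for Unscheduled Outages on the Primary Site
  • Recovery Steps for Unscheduled Outages on the Secondary Site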

The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. These descriptions are found in Chapter 10, "Detailed Recovery Steps".

Some outages require multiple recovery steps. For example, when a site failure occurs, the outage decision matrix states that Data Guard failover must occur before site failover. Some outages are handled automatically without any loss of availability. For example, instance failure is managed automatically by RAC. Multiple recovery options for each outage are listed wherever relevant.

Recovery Steps for Unscheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for maximum availability of the system. Only the "Data Guard only" and MAA architectures have a secondary site to protect from site disasters. The estimated recovery times (ERT) are strictly examples derived from customer and actual testing experiences and do not reflect a guaranteed recovery time.

Table 9-2 summarizes the recovery steps for unscheduled outages on the primary site.

Table 9-2 Recovery Steps for Unscheduled Outages on the Primary Site

Reason for Outage: Site failure

Recovery steps for "Database Only" architecture:

ERT: hours to days

  1. Restore site.
  2. Restore from tape backups.
  3. Recover database.

Recovery steps for "RAC Only" architecture:

ERT: hours to days

  1. Restore site.
  2. Restore from tape backups.
  3. Recover database.

Recovery steps for "Data Guard Only" architecture:

ERT: minutes to an hour

  1. Database Failover
  2. Complete or Partial Site Failover

Recovery steps for MAA:

ERT: minutes to an hour

  1. Database Failover
  2. Complete or Partial Site Failover

Reason for Outage: Node failure

Recovery steps for "Database Only" architecture:

ERT: minutes to an hour

  1. Restart node and restart database.
  2. Reconnect users.

Recovery steps for "RAC Only" architecture:

ERT: seconds to minutes

Managed automatically by RAC Recovery

Recovery steps for "Data Guard Only" architecture:

ERT: minutes to an hour

  1. Restart node and restart database.
  2. Reconnect users.

or

ERT: minutes to an hour

  1. Database Failover
  2. Complete or Partial Site Failover

Recovery steps for MAA:

ERT: seconds to minutes

Managed automatically by RAC Recovery

Reason for Outage: Instance failure

Recovery steps for "Database Only" architecture:

ERT: minutes

  1. Restart instance.
  2. Reconnect users.

Recovery steps for "RAC Only" architecture:

ERT: seconds to minutes

Managed automatically by RAC Recovery

Recovery steps for "Data Guard Only" architecture:

ERT: minutes

  1. Restart instance.
  2. Reconnect users.

Recovery steps for MAA:

ERT: seconds to minutes

Managed automatically by RAC Recovery

Reason for Outage: Clusterwide failure

Recovery steps for "Database Only" architecture: N/A

Recovery steps for "RAC Only" architecture:

ERT: hours to days

  1. Restore cluster or restore at least one node.
  2. Restore from tape backups.
  3. Recover database.

Recovery steps for "Data Guard Only" architecture: N/A

Recovery steps for MAA:

ERT: minutes to an hour

  1. Database Failover
  2. Complete or Partial Site Failover

Reason for Outage: Data failure

Recovery steps for "Database Only" architecture:

ERT: minutes to an hour

Recovery Solutions for Data Failures

Recovery steps for "RAC Only" architecture:

ERT: minutes to an hour

Recovery Solutions for Data Failures

Recovery steps for "Data Guard Only" architecture:

ERT: minutes to an hour

Recovery Solutions for Data Failures

or

ERT: minutes to an hour

  1. Database Failover
  2. Complete or Partial Site Failover

Note: For primary database media failures or media corruptions, database failover may minimize data loss.

Recovery steps for MAA:

ERT: minutes to an hour

Recovery Solutions for Data Failures

or

ERT: minutes to an hour

  1. Database Failover
  2. Complete or Partial Site Failover

Note: For primary database media failures or media corruptions, database failover may minimize data loss.

Reason for Outage: User error

Recovery steps for "Database Only" architecture:

ERT: minutes

Recovering from User Error with Flashback Technology

Recovery steps for "RAC Only" architecture:

ERT: minutes

Recovering from User Error with Flashback Technology

Recovery steps for "Data Guard Only" architecture:

ERT: minutes

Recovering from User Error with Flashback Technology

Recovery steps for MAA:

ERT: minutes

Recovering from User Error with Flashback Technology
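
As an illustration only, Table 9-2 can be read as a lookup keyed by outage and architecture. The following Python sketch encodes the table under that reading. The names `TABLE_9_2`, `recovery_options`, and the shared option constants are hypothetical and are not part of any Oracle tool or API, and the MAA entry for data failure is assumed to offer the same two alternatives as the "Data Guard Only" entry.

```python
from typing import Dict, List, Tuple

# One recovery option: (estimated recovery time, ordered steps).
Option = Tuple[str, List[str]]

# Column order of Table 9-2.
ARCHITECTURES = ("Database Only", "RAC Only", "Data Guard Only", "MAA")

# Options that recur across rows of the table (hypothetical constant names).
TAPE_RESTORE: Option = ("hours to days",
                        ["Restore site.", "Restore from tape backups.", "Recover database."])
CLUSTER_RESTORE: Option = ("hours to days",
                           ["Restore cluster or restore at least one node.",
                            "Restore from tape backups.", "Recover database."])
FAILOVER: Option = ("minutes to an hour",
                    ["Database Failover", "Complete or Partial Site Failover"])
RESTART_NODE: Option = ("minutes to an hour",
                        ["Restart node and restart database.", "Reconnect users."])
RESTART_INSTANCE: Option = ("minutes", ["Restart instance.", "Reconnect users."])
RAC_AUTO: Option = ("seconds to minutes", ["Managed automatically by RAC Recovery"])
DATA_REPAIR: Option = ("minutes to an hour", ["Recovery Solutions for Data Failures"])
FLASHBACK: Option = ("minutes", ["Recovering from User Error with Flashback Technology"])

# Each row holds one list of alternative options per architecture,
# in ARCHITECTURES order. An empty list stands for N/A.
TABLE_9_2: Dict[str, List[List[Option]]] = {
    "site failure": [[TAPE_RESTORE], [TAPE_RESTORE], [FAILOVER], [FAILOVER]],
    "node failure": [[RESTART_NODE], [RAC_AUTO], [RESTART_NODE, FAILOVER], [RAC_AUTO]],
    "instance failure": [[RESTART_INSTANCE], [RAC_AUTO], [RESTART_INSTANCE], [RAC_AUTO]],
    "clusterwide failure": [[], [CLUSTER_RESTORE], [], [FAILOVER]],
    "data failure": [[DATA_REPAIR], [DATA_REPAIR],
                     [DATA_REPAIR, FAILOVER], [DATA_REPAIR, FAILOVER]],
    "user error": [[FLASHBACK], [FLASHBACK], [FLASHBACK], [FLASHBACK]],
}


def recovery_options(outage: str, architecture: str) -> List[Option]:
    """Return the alternative recovery options for an outage on the primary site."""
    column = ARCHITECTURES.index(architecture)  # raises ValueError if unknown
    return TABLE_9_2[outage][column]            # raises KeyError if unknown


if __name__ == "__main__":
    for ert, steps in recovery_options("node failure", "Data Guard Only"):
        print(f"ERT: {ert}")
        for number, step in enumerate(steps, start=1):
            print(f"  {number}. {step}")
```

Keeping each step list ordered preserves the rule that database failover precedes site failover.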

Recovery Steps for Unscheduled Outages on the Secondary Site

Outages on the secondary site do not directly affect availability because the clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may impact the MTTR if there are concurrent failures on the primary site. For most cases, outages on the secondary site can be managed with no impact on availability. However, if maximum protection mode is part of the configuration, then an unscheduled outage on the last surviving standby database causes downtime on the production database. After downgrading the data protection mode, you can restart the production database.
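
The maximum protection rule in the preceding paragraph can be expressed as a small predicate. This is a minimal sketch for reasoning about a configuration, not an Oracle interface: the function name `production_stays_available` and its parameters are hypothetical, and the mode names are written as plain strings.

```python
PROTECTION_MODES = ("maximum protection", "maximum availability", "maximum performance")


def production_stays_available(protection_mode: str, standbys_remaining: int) -> bool:
    """Return True if the production database keeps running after a standby outage.

    standbys_remaining is the number of standby databases still available
    once the outage has occurred (hypothetical parameter).
    """
    if protection_mode not in PROTECTION_MODES:
        raise ValueError(f"unknown protection mode: {protection_mode!r}")
    if standbys_remaining < 0:
        raise ValueError("standbys_remaining cannot be negative")
    if protection_mode == "maximum protection":
        # Losing the last surviving standby causes downtime on production.
        return standbys_remaining > 0
    # In the other modes, standby outages do not stop production.
    return True
```

For example, `production_stays_available("maximum protection", 0)` evaluates to `False`, which corresponds to the case where the protection mode must be downgraded before the production database can be restarted.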

Table 9-3 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site.

Table 9-3 Recovery Steps for Unscheduled Outages of the Standby Database on the Secondary Site

Reason for Outage: Standby apply instance failure

Recovery steps for "Data Guard Only" architecture:

  1. Restart node and standby instance.
  2. Restart recovery.

If there is only one standby database and if maximum database protection is configured, then the production database will shut down to ensure that there is no data divergence with the standby database.

Recovery steps for MAA:

ERT: seconds

Apply Instance Failover

There is no effect on production availability if the production database Oracle Net descriptor is configured to use connect-time failover to an available standby instance.

Restart node and instance when they are available.

Reason for Outage: Standby non-apply instance failure

Recovery steps for "Data Guard Only" architecture: N/A

Recovery steps for MAA:

There is no effect on availability because the primary node or instance receives redo logs and applies them with the recovery process. The production database continues to communicate with this standby instance.

Restart node and instance when they are available.

Reason for Outage: Data failure such as media failure or disk corruption

Recovery steps for "Data Guard Only" architecture: Restoring Fault Tolerance after a Standby Database Data Failure

Recovery steps for MAA: Restoring Fault Tolerance after a Standby Database Data Failure

Reason for Outage: Primary database resets logs because of flashback operations or media recovery

Recovery steps for "Data Guard Only" architecture: Restoring Fault Tolerance After the Production Database Has Opened Resetlogs

Recovery steps for MAA: Restoring Fault Tolerance After the Production Database Has Opened Resetlogs

Recovery Steps for Scheduled Outages

Scheduled outages are planned outages. They are required for regular maintenance of the technology infrastructure that supports the application and include tasks such as hardware maintenance, repair, and upgrades; software upgrades and patching; application changes and patching; and changes to improve performance and manageability of systems. Scheduled outages should be scheduled at times best suited for continual application availability.

Table 9-4 describes the scheduled outages that impact either the primary or secondary site components.

Table 9-4 Scheduled Outages  
Outage Class Description Examples

Site-wide

The entire site where the current production database resides is unavailable. This is usually known well in advance and can be scheduled.

  • Scheduled power outages
  • Site maintenance
  • Regular planned switchovers to test infrastructure

Hardware maintenance (node impact)

This is scheduled downtime of a database server node for hardware maintenance. The scope of this downtime is restricted to a node of the database cluster.

  • Repair of a failed component such as a memory card or CPU board
  • Addition of memory or CPU to an existing node in the database tier

Hardware maintenance (clusterwide impact)

This is scheduled downtime of the database server cluster for hardware maintenance.

  • Some cases of adding a node to the cluster
  • Upgrade or repair of the cluster interconnect
  • Upgrade to the storage tier that requires downtime on the database tier

System software maintenance (node impact)

This is scheduled downtime of a database server node for system software maintenance. The scope of the downtime is restricted to a node.

  • Upgrade of a software component such as the operating system
  • Changes to the configuration parameters for the operating system

System software maintenance (clusterwide impact)

This is scheduled downtime of the database server cluster for system software maintenance.

  • Upgrade or patching of the cluster software
  • Upgrade of the volume management software

Oracle patch upgrade for the database

Scheduled downtime for an Oracle patch

Patch Oracle software to fix a specific customer issue

Oracle patch set or software upgrade for the database

Scheduled downtime for Oracle patch set or software upgrade

  • Patching Oracle software with a patch set
  • Upgrade Oracle software

Database object reorganization

These are changes to the logical structure or the physical organization of Oracle database objects. The primary reason for these changes is to improve performance or manageability. This is always a planned activity. The method and the time chosen to do the reorganization should be planned and appropriate.

Using Oracle's online reorganization features enables objects to be available during the reorganization.

  • Moving an object to a different tablespace
  • Converting a table to a partitioned table
  • Renaming or dropping columns of a table

The rest of this section provides outage decision trees for scheduled outages. They appear in the following sections:

  • Recovery Steps for Scheduled Outages on the Primary Site
  • Recovery Steps for Scheduled Outages on the Secondary Site

The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. The detailed descriptions of the recovery operations are found in Chapter 10, "Detailed Recovery Steps".

This section also includes the following topic:

  • Preparing for Scheduled Secondary Site Maintenance

Recovery Steps for Scheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for continued availability of the system.

Table 9-5 shows the recovery steps for scheduled outages on the primary site.

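Table 9-5 Recovery Steps for Scheduled Outages on the Primary Site

Scope of Outage: Site

Reason for Outage: Site shut down

Recovery steps for "Database Only" architecture: Downtime for entire duration

Recovery steps for "RAC Only" architecture: Downtime for entire duration

Recovery steps for "Data Guard Only" architecture:

  1. Database Switchover
  2. Complete or Partial Site Failover

Recovery steps for MAA:

  1. Database Switchover
  2. Complete or Partial Site Failover

Scope of Outage: Primary database

Reason for Outage: Hardware maintenance (node impact)

Recovery steps for "Database Only" architecture: Downtime for entire duration

Recovery steps for "RAC Only" architecture: Managed automatically by RAC Recovery

Recovery steps for "Data Guard Only" architecture:

  1. Database Switchover
  2. Complete or Partial Site Failover

Recovery steps for MAA: Managed automatically by RAC Recovery

Scope of Outage: Primary database

Reason for Outage: Hardware maintenance (clusterwide impact)

Recovery steps for "Database Only" architecture: Downtime for entire duration

Recovery steps for "RAC Only" architecture: Downtime for entire duration

Recovery steps for "Data Guard Only" architecture:

  1. Database Switchover
  2. Complete or Partial Site Failover

Recovery steps for MAA:

  1. Database Switchover
  2. Complete or Partial Site Failover

Scope of Outage: Primary database

Reason for Outage: System software maintenance (node impact)

Recovery steps for "Database Only" architecture: Downtime for entire duration

Recovery steps for "RAC Only" architecture: Managed automatically by RAC Recovery

Recovery steps for "Data Guard Only" architecture:

  1. Database Switchover
  2. Complete or Partial Site Failover

Recovery steps for MAA: Managed automatically by RAC Recovery

Scope of Outage: Primary database

Reason for Outage: System software maintenance (clusterwide impact)

Recovery steps for "Database Only" architecture: Downtime for entire duration

Recovery steps for "RAC Only" architecture: Downtime for entire duration

Recovery steps for "Data Guard Only" architecture:

  1. Database Switchover
  2. Complete or Partial Site Failover

Recovery steps for MAA:

  1. Database Switchover
  2. Complete or Partial Site Failover

Scope of Outage: Primary database

Reason for Outage: Oracle patch upgrade for the database

Recovery steps for "Database Only" architecture: Downtime for entire duration

Recovery steps for "RAC Only" architecture: RAC Rolling Upgrade

Recovery steps for "Data Guard Only" architecture: Downtime for entire duration

Recovery steps for MAA: RAC Rolling Upgrade

Scope of Outage: Primary database

Reason for Outage: Oracle patch set or software upgrade for the database

Recovery steps for "Database Only" architecture: Downtime for entire duration

Recovery steps for "RAC Only" architecture: Downtime for entire duration

Recovery steps for "Data Guard Only" architecture: Upgrade with Logical Standby Database

Recovery steps for MAA: Upgrade with Logical Standby Database

Scope of Outage: Primary database

Reason for Outage: Database object reorganization

Recovery steps for "Database Only" architecture: Online Object Reorganization

Recovery steps for "RAC Only" architecture: Online Object Reorganization

Recovery steps for "Data Guard Only" architecture: Online Object Reorganization

Recovery steps for MAA: Online Object Reorganization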

Recovery Steps for Scheduled Outages on the Secondary Site

Outages on the secondary site do not impact availability because the clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may affect the MTTR if there are concurrent failures on the primary site. Outages on the secondary site can be managed with no impact on availability. If maximum protection database mode is configured, then downgrade the protection mode before a scheduled outage on the standby instance or database so that there will be no downtime on the production database.

Table 9-6 describes the recovery steps for scheduled outages on the secondary site.

Table 9-6 Recovery Steps for Scheduled Outages on the Secondary Site

Scope of Outage: Site

Reason for Outage: Site shutdown

Recovery steps for "Data Guard Only" architecture:

Before the outage: "Preparing for Scheduled Secondary Site Maintenance"

After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage"

Recovery steps for MAA:

Before the outage: "Preparing for Scheduled Secondary Site Maintenance"

After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage"

Scope of Outage: Standby database

Reason for Outage: Hardware or software maintenance on the node that is running the managed recovery process (MRP)

Recovery steps for "Data Guard Only" architecture:

Before the outage: "Preparing for Scheduled Secondary Site Maintenance"

Recovery steps for MAA:

Before the outage: "Preparing for Scheduled Secondary Site Maintenance"

Scope of Outage: Standby database

Reason for Outage: Hardware or software maintenance on a node that is not running the MRP

Recovery steps for "Data Guard Only" architecture: N/A

Recovery steps for MAA:

No impact because the primary standby node or instance receives redo logs that are applied with the managed recovery process

After the outage: Restart node and instance when available.

Scope of Outage: Standby database

Reason for Outage: Hardware or software maintenance (clusterwide impact)

Recovery steps for "Data Guard Only" architecture: N/A

Recovery steps for MAA:

Before the outage: "Preparing for Scheduled Secondary Site Maintenance"

After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage"

Scope of Outage: Standby database

Reason for Outage: Oracle patch and software upgrades

Recovery steps for "Data Guard Only" architecture: Downtime needed for upgrade, but there is no impact on primary node unless the configuration is in maximum protection database mode.

Recovery steps for MAA: Downtime needed for upgrade, but there is no impact on primary node unless the configuration is in maximum protection database mode.

Preparing for Scheduled Secondary Site Maintenance

To achieve continued service during a secondary site scheduled outage, downgrade the maximum protection mode to maximum availability or maximum performance. When you are scheduling secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time the standby lags behind the production database, which lengthens the time to restore fault tolerance.

Table 9-7 shows how to prepare for scheduled secondary site maintenance.

Table 9-7 Preparing for Scheduled Secondary Site Maintenance  
Production Database Protection Mode Reason for Outage Preparation Steps for "Data Guard Only" Architecture and MAA

Maximum protection

Site shutdown

Switch the production data protection mode to either maximum availability or maximum performance

See Also: "Changing the Data Protection Mode"

Maximum protection

Hardware maintenance (clusterwide impact)

Switch the production data protection mode to either maximum availability or maximum performance

See Also: "Changing the Data Protection Mode"

Maximum protection

Software maintenance (clusterwide impact)

Switch the production data protection mode to either maximum availability or maximum performance

See Also: "Changing the Data Protection Mode"

Maximum protection

Hardware maintenance on the primary node (the node that is running the recovery process)

Apply Instance Failover (MAA only)

or

Switch the production data protection mode to either maximum availability or maximum performance

Maximum protection

Software maintenance on the primary node (the node that is running the recovery process)

Apply Instance Failover (MAA only)

or

Switch the production data protection mode to either maximum availability or maximum performance

Maximum availability or maximum performance

Site shutdown

None; no impact on production database

Maximum availability or maximum performance

Hardware maintenance (clusterwide impact)

None; no impact on production database

Maximum availability or maximum performance

Software maintenance (clusterwide impact)

None; no impact on production database

Maximum availability or maximum performance

Hardware maintenance on the primary node (the node that is running the recovery process)

Apply Instance Failover (MAA only)

or

None; no impact on production database

Maximum availability or maximum performance

Software maintenance on the primary node (the node that is running the recovery process)

Apply Instance Failover (MAA only)

or

None; no impact on production database
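
The rows of Table 9-7 follow a regular pattern that can be sketched as a single function. The sketch below is illustrative only: `preparation_steps`, its parameters, and the lowercase reason strings are hypothetical names chosen for this example, and the returned strings simply repeat the table entries. Each returned list holds the alternatives that the table joins with "or".

```python
from typing import List

SWITCH_MODE = ("Switch the production data protection mode to either "
               "maximum availability or maximum performance")
NO_IMPACT = "None; no impact on production database"
APPLY_FAILOVER = "Apply Instance Failover"

# Reasons that take the whole standby site or cluster offline.
SITE_OR_CLUSTER_REASONS = {
    "site shutdown",
    "hardware maintenance (clusterwide impact)",
    "software maintenance (clusterwide impact)",
}
# Reasons limited to the node that is running the recovery process.
APPLY_NODE_REASONS = {
    "hardware maintenance on the primary node",
    "software maintenance on the primary node",
}


def preparation_steps(protection_mode: str, reason: str, architecture: str) -> List[str]:
    """Return the alternative preparation steps from Table 9-7."""
    if protection_mode == "maximum protection":
        fallback = SWITCH_MODE
    elif protection_mode in ("maximum availability", "maximum performance"):
        fallback = NO_IMPACT
    else:
        raise ValueError(f"unknown protection mode: {protection_mode!r}")
    if architecture not in ("Data Guard Only", "MAA"):
        raise ValueError(f"architecture has no secondary site: {architecture!r}")

    if reason in APPLY_NODE_REASONS:
        # Apply instance failover is listed for MAA only.
        if architecture == "MAA":
            return [APPLY_FAILOVER, fallback]
        return [fallback]
    if reason in SITE_OR_CLUSTER_REASONS:
        return [fallback]
    raise ValueError(f"unknown reason for outage: {reason!r}")
```

For example, `preparation_steps("maximum protection", "site shutdown", "MAA")` returns a one-element list containing the instruction to switch the protection mode.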