
Oracle® High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-01

5
Operational Policies for High Availability

This chapter describes the policies and procedures that are essential for maintaining high availability.

Introduction to Operational Policies for High Availability

Operational policies together with service management are fundamental to avoiding and minimizing outages, as well as reducing the time to recover from an outage. Operational policies are the foundation for managing the information technology infrastructure. They focus on process, policy, and management.

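Operational policies for high availability focus on setting and establishing processes, policies, and management. They are divided into the following categories:

  • Service Level Management for High Availability
  • Planning Capacity to Promote High Availability
  • Change Management for High Availability
  • Backup and Recovery Planning for High Availability
  • Disaster Recovery Planning
  • Planning Scheduled Outages
  • Staff Training for High Availability
  • Documentation as a Means of Maintaining High Availability
  • Physical Security Policies and Procedures for High Availability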

Service Level Management for High Availability

Information Technology (IT) departments are required to deliver increasing levels of service and availability while reducing costs. Service level management is an accepted method to ensure that IT services are meeting the business requirements. Service level management requires a dialogue between IT managers and the company's lines of business. It starts with mapping business requirements to IT investments.

Service level management encompasses complete end-to-end management of the service infrastructure. The foundation of service level management is the service level agreement (SLA). The SLA is critical for building accountability into the provider-client relationship and for evaluating the provider's performance. SLAs are becoming more accepted and necessary as a monitoring and control instrument for the relationship between a customer and the IT supplier (external or internal). SLAs are developed for critical business processes and application systems, such as order processing. The business individuals who specify the functionality of systems should develop the SLA for those systems. The SLA represents a detailed, complete description of the services that the supplier is obligated to deliver, and the responsibilities of the users of that service. Developing an SLA challenges the client to rank the requirements and focus resources toward the most important requirements. An SLA should evolve with the business requirements.

There is no standardized SLA that meets the needs of all companies, but a typical SLA should contain the following sections:

Developing an SLA and service level measurements requires commitment and hard work by all parties. Service levels should be measured by business requirements, such as cost for each order processed. Any shared services or components must perform at the level of the most stringent SLA. Furthermore, SLAs should be developed between interdependent IT groups and with external suppliers. Many technologists advocate an integrated, comprehensive SLA rather than individual SLAs for infrastructure components.

The benefits of developing an SLA are:

  • A professional relationship between the supplier and customer with documented accountability
  • A mutual goal of understanding and meeting the business requirements
  • A system of measurement for service delivery so that IT can quantify their capabilities and results and continuously improve upon them
  • IT can prevent or respond faster to events that decrease availability
  • A documented set of communication and escalation procedures

    Planning Capacity to Promote High Availability

    Planning capacity and monitoring thresholds is essential to prevent downtime or unacceptably delayed transactions. Understanding average and maximum usage and the requirements to maintain that load over time helps ensure acceptable performance.

    Capacity planning includes the ability to estimate the time remaining before a tablespace becomes completely full and planning ahead to add disk space. Capacity planning can also delay or prevent scheduled outages that would otherwise be needed, for example, to increase the maximum number of sessions in the database.
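As a rough illustration of estimating the time remaining before a tablespace fills, the following Python sketch extrapolates linearly from two usage samples. The function name `days_until_full`, the sample figures, and the assumption of constant growth are all hypothetical and not part of this guide; real usage data would come from your own monitoring.

```python
from datetime import date


def days_until_full(samples, capacity_mb):
    """Estimate days until a tablespace fills, from (date, used_mb) samples.

    Assumes constant growth, using the average rate between the first and
    last sample. Returns None when usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    elapsed_days = (last_day - first_day).days
    if elapsed_days <= 0:
        raise ValueError("samples must span at least one day")
    growth_per_day = (last_used - first_used) / elapsed_days
    if growth_per_day <= 0:
        return None
    return (capacity_mb - last_used) / growth_per_day


# Hypothetical usage samples for one tablespace, in megabytes.
samples = [
    (date(2024, 1, 1), 6000),
    (date(2024, 1, 31), 6600),
]
remaining = days_until_full(samples, capacity_mb=10000)
print(f"Estimated days until full: {remaining:.0f}")
```

Comparing an estimate like this against the lead time needed to add disk space gives a simple threshold for raising an alert before the tablespace fills.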

    Change Management for High Availability

    Change management is a set of procedures or rules that ensure that changes to the hardware, software, application, and data on a system are authorized, scheduled, and tested. A stable system in which unexpected, untested, and unauthorized changes are not permitted is one that guarantees integrity to its users. The users can rely on the hardware, software, and data to perform as anticipated. Knowing exactly when and what changes have been applied to the system is vital to debugging a problem. Each customer handles change management of systems, databases, and application code differently, but there are general guidelines that can help prevent unnecessary system outages, thus protecting the system's integrity. With proper change management, application and hardware systems have greater stability, and new problems are easier to debug.

    Figure 5-1 describes a typical change control flow. For emergency cases such as disasters, the change control process may need to be shortened.

    Figure 5-1 Change Control


    The following recommendations are the foundation of good change management:

    • Develop a change control process

      A change control process for both nonescalated and escalated cases should be created, documented, and implemented. Ad hoc and emergency changes to the hardware, database, and software in a system are inevitable, but the change control process must ensure that they are later incorporated into the change management system so their effects and ramifications are examined and recorded.

    • Form a change control group

      Representatives from applications, databases, systems, and management should be members of the change control board. Both hardware and software representatives must be present.

    • Determine meeting frequency and minimum assessment time

      Change control processes should allow essential changes to be implemented in a reasonable time. Change control meetings should be frequent enough to address and discuss the most important issues. There should be a minimum grace period from the time a change is submitted until the time it is scheduled for review to provide adequate assessment time. This assessment time should be bypassed only with upper management approval.

    • Evaluate proposed changes

      Changes must provide short-term or long-term benefit. The change control team needs to assess whether the benefits of a change outweigh the risks and whether the change is consistent with the overall vision of the business, the application, and its rules. A proposed change must document the following:

      • Purpose of the change
      • Risk assessment
      • Fall-back plans
      • Test plans and results of the tests
      • Estimated time to implement and back out change, including outage times
    • Gather statistics for base comparisons

      Gather snapshots of system, hardware, database, and application configuration and performance statistics. These base numbers can be used for comparison when a change is implemented. After changing a database parameter, you can gather new statistics and compare them with the base statistics to determine the impact of a change.

    • Track database changes

      Database structure changes are easy to make on demand and easy to slip through the change management process. Therefore, it may be necessary to have a special procedure in place for these changes. It is also beneficial to track these changes for trend analysis. In addition, when files are added to the database, the files must be incorporated into the backup and monitoring schemes; proper tracking of this type of change can act as a reminder.

    • Use a version control system for application code

      Some version control system must exist for application code to help track changes and enable fallback to a previous code base. Internally developed applications are modified and enhanced frequently, and new versions are put in place. When a problem is found with the new version, testing the case in the old version provides valuable debugging information. Depending on the type of application and the likelihood of the users' need to revert to an earlier version, the company must decide how many previous versions to keep on hand. At least one is mandatory.

    • Develop quality assurance and testing procedures

      Quality assurance should validate test specifications, test plans, and the results of tests to ensure that the test simulation mimics your applications or at least considers all critical points of the application being tested. Tests and test environments should be designed to test both essential functionality and scalability of the application.

    • Perform internal audits

      Internal audits may be used to verify that your hardware, software, database and application are certified with vendors, performing within service levels, and achieving high availability.
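The recommendation to gather statistics for base comparisons can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names, the sample values, the 10 percent tolerance, and the helper `compare_to_baseline` are hypothetical, and the snapshots are plain dictionaries rather than output from any particular Oracle tool.

```python
def compare_to_baseline(baseline, current, tolerance_pct=10.0):
    """Return {metric: percent_change} for metrics that moved more than
    tolerance_pct relative to the baseline snapshot.

    Metrics missing from the current snapshot, or with a zero baseline,
    are skipped because a percent change cannot be computed for them.
    """
    flagged = {}
    for metric, base_value in baseline.items():
        if metric not in current or base_value == 0:
            continue
        change_pct = (current[metric] - base_value) / base_value * 100.0
        if abs(change_pct) > tolerance_pct:
            flagged[metric] = round(change_pct, 1)
    return flagged


# Hypothetical snapshots taken before and after a database parameter change.
baseline = {"avg_response_ms": 120.0, "cpu_pct": 40.0, "physical_reads_per_sec": 800.0}
after_change = {"avg_response_ms": 150.0, "cpu_pct": 42.0, "physical_reads_per_sec": 600.0}

flagged = compare_to_baseline(baseline, after_change)
for metric, change in flagged.items():
    print(f"{metric}: {change:+.1f}% versus baseline")
```

Keeping the baseline snapshot alongside the change record makes it straightforward to rerun the same comparison if a problem surfaces later.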

    Backup and Recovery Planning for High Availability

    Proper backup and recovery plans are essential and must be constructed to meet your specific service levels. Both disk and tape database backups are recommended. Disk and tape backups are essential for disaster recovery and for cases when you need to restore from a very old backup.

    A robust backup and recovery scheme requires an understanding of how it is strengthened or compromised by the physical location of files, the order of events during the backup process, and the handling of errors. A robust scheme is resilient in the face of media failures, programmatic failures, and, if possible, operator failures. Complete, successful, and tested processes are fundamental to the successful recovery of any failed environment.

    Take the following steps to construct useful backup and recovery plans:

    • Create recovery plans

      Create recovery plans for different types of outages. Start with the most common outage types and progress to the least probable. An outage matrix with recommended recovery actions and a validated MTTR estimate enables you to assess if you can meet your SLAs for different types of outages.

    • Test backups on a regular basis

      Monitor the backup tasks for errors and validate backups by testing your recovery procedures periodically.

    • Automate backup and recovery procedures
    • Choose an appropriate backup frequency

      Having up-to-date backups reduces recovery time.

    • Maintain offsite tape backups

      Offsite backups of the database are essential to protect from site failures.

    • Maintain updated documentation for backup and recovery plans

      Documentation is, for obvious reasons, a safeguard against mistakes, loss of knowledgeable people, and misinterpretations. Maintaining accurate documentation on backup and recovery procedures is as important as having procedures in place.
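The outage matrix mentioned under "Create recovery plans" can be represented as a small data structure and checked against SLA targets. The following Python sketch is one possible shape, not a prescribed format: the class `OutageEntry`, the helper `sla_gaps`, the outage types, the recovery actions, and every number are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class OutageEntry:
    outage_type: str
    recovery_action: str
    validated_mttr_minutes: float    # MTTR measured during recovery tests
    sla_max_downtime_minutes: float  # downtime the SLA tolerates for this outage


def sla_gaps(matrix):
    """Return the outage types whose validated MTTR exceeds the SLA limit."""
    return [
        entry.outage_type
        for entry in matrix
        if entry.validated_mttr_minutes > entry.sla_max_downtime_minutes
    ]


# Hypothetical matrix, ordered from most common to least probable outage.
matrix = [
    OutageEntry("media failure (single datafile)",
                "restore datafile from disk backup and recover", 30, 60),
    OutageEntry("site failure",
                "restore from offsite tape backup", 720, 240),
]

gaps = sla_gaps(matrix)
print("Outage types that miss the SLA:", gaps)
```

Any outage type reported here signals that either the recovery plan or the SLA needs to be revisited.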

    Disaster Recovery Planning

    Disaster recovery (DR) planning is a process designed and developed specifically to deal with catastrophic, large-scale interruptions in service to allow timely resumption of operations. These interruptions can be caused by disasters like fire, flood, earthquakes, or malicious attacks. The basic assumption is that the building where the data center and computers reside may not be accessible, and that the operations need to resume elsewhere. It assumes the worst and tries to deal with the worst. As an organization increasingly relies on its electronic systems and data, access to these systems and data becomes a fundamental component of success. This underscores the importance of disaster recovery planning. Proper disaster planning reduces MTTR during a catastrophe and provides continual availability of critical applications, helping to preserve customers and revenue.

    Take the following steps to plan recovery from disasters:

    • Choose the right disaster recovery plans (DRPs)

      DRPs must deliver the expected MTTR service levels. The implementation costs must also be justified by the service levels. One DRP may not accommodate all disasters or even the common disasters.

    • Determine what is covered under the disaster recovery plans

      The first question to ask when trying to decide whether to include an application in the disaster recovery plans is whether that application supports a key business operation that must be brought back online within a few hours or days of a disaster. This may not be the same as the availability requirements of the application, although it is closely related. It has more to do with the cost to the company every hour or day that the system is not available. Disaster recovery planning requires securing off-site equipment, personnel, and supporting components such as phone lines and networks that can function at an acceptable level on an interim basis. This is costly, and care must be taken to consider only those applications that are key to the survival of the company.

    • Document DRPs, including diagrams of affected areas and systems

      A DRP should clearly identify the outage it protects against and the steps to implement in case of that outage. A general diagram of the system is essential. It needs to be detailed enough to determine hardware fault tolerance, including controllers, mirrored disks, the disaster backup site, processors, communication lines, and power. It also helps identify the current resources and any additional resources that may be necessary. Understanding how and when data flows in and out of the system is essential in identifying parts of the system that merit special attention. Special attention can be in the form of additional monitoring requirements or the frequency and types of backups taken. Conversely, it may also show areas that only require minimal attention and fewer system resources to monitor and manage.

    • Set up disaster recovery processes

      Ensure that critical applications, database instances, systems, or business processes are included in your disaster recovery plan. Use application, system and network diagrams to assess fault tolerance and alternative routes during a disaster.

    • Assess all of the important business components

      Consider all the components that allow your business to run. Ensure that the DRP includes all system, hardware, application and people resources. Verify that network, telephone service and security measures are in place.

    • Assign a DR coordinator

      A DR coordinator and a backup coordinator should be pre-assigned to ensure that all operations and communications are passed on.

    • Test and validate the DRP

      The DRP must be rehearsed periodically, which implies that the facilities to test the DRP must be available.

    Planning Scheduled Outages

    Scheduled outages can affect the application server tier, the database tier, or the entire site. These outages may include one or more of the following: node hardware maintenance, node software maintenance, Oracle software maintenance, redundant component maintenance, entire site maintenance. Proper scheduled outage planning reduces MTTR and reduces risk when changes do not go as planned.

    Take the following steps to plan scheduled outages:

    • Create a list of scheduled outages

      Creating a list of possible scheduled outages, their projected frequency, and estimated duration enables advanced planning and a better assessment of availability requirements. A reliability assessment to understand the mean time between failures (MTBF) of critical components can be used to plan for certain scheduled outages in order to prevent an unscheduled outage. In many cases, only one large scheduled outage is allotted each year, so maintenance must be prioritized and justified.

    • For each possible scheduled outage, document the impact and assess the risk

      Scheduled outages that do not require software or application changes can usually be done with minimum downtime if a subsequent system can take over the new transactions. With Real Application Clusters and Data Guard switchover, you can upgrade hardware and do some standard system maintenance with minimum downtime to your business. For most software upgrades such as Oracle upgrades, the actual downtime can be less than an hour if prepared correctly. For more complex application changes that require schema changes or database object reorganizations, customers must assess whether Oracle's online reorganization features suffice or whether to use some of Oracle's rolling upgrade capabilities.

    • Justify the scheduled outage

      Each change must be consistent with the overall vision of the application and business and adhere to compatibility and change control rules.

    • Create and automate change, testing, and fallback procedures

      Each planned change, such as an Oracle upgrade, should be tested in a simulated real-world environment to assess performance and availability impacts. Oracle recommends using complete stress tests and a simulated load to accurately assess performance and load impact. Fallback plans should be created and incorporated into the scheduled outage. An automated process should be in place to implement the change and properly fall back if required.
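A list of scheduled outages with projected frequency and estimated duration translates directly into an expected yearly downtime. The Python sketch below shows that arithmetic. The outage names echo the categories in this section, but the frequencies, durations, and the helper `projected_availability` are hypothetical, and the calculation assumes each outage causes full downtime for its entire duration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def projected_availability(outages):
    """Compute yearly scheduled downtime and the resulting availability.

    outages is a list of (name, times_per_year, hours_each) tuples.
    Returns (downtime_hours, availability_percent).
    """
    downtime_hours = sum(freq * hours for _name, freq, hours in outages)
    availability_pct = 100.0 * (1 - downtime_hours / HOURS_PER_YEAR)
    return downtime_hours, availability_pct


# Hypothetical list of scheduled outages for one system.
scheduled = [
    ("Oracle software maintenance", 2, 1.0),
    ("node hardware maintenance", 4, 0.5),
    ("entire site maintenance", 1, 4.0),
]

downtime_hours, availability_pct = projected_availability(scheduled)
print(f"Scheduled downtime: {downtime_hours:.1f} hours per year")
print(f"Projected availability: {availability_pct:.3f}%")
```

Comparing the projected figure with the availability promised in the SLA shows how much scheduled maintenance the business can accept and which outages need to be shortened or justified.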

      Staff Training for High Availability

      Highly trained people can make better and more informed decisions and are less likely to make mistakes. A comprehensive plan for the continued technical education of the systems administration, database administration, development, and users groups can help ensure higher availability of your systems and databases. Additionally, just as redundancy of system components eliminates a single point of failure, knowledge management and cross training should eradicate the effects to operations of losing an employee.

      • Cross-train for business-critical positions

        Any business-critical systems should have cross-training of technical staff to reduce the impact to operations if an employee becomes unavailable or leaves the company. For example, the system administration group should be familiar with Oracle RDBMS and tools. Maintain formal and regular forms of communication (such as weekly meetings) between different support groups.

      • Develop guidelines to ensure continued technical education

        Ensure that there is a process in place to notify and train staff about new features or procedures associated with the hardware and software your company uses. Additionally, allow time for investigation into new technologies that can improve service levels or reduce costs.

      • Implement a knowledge management process

        Effectively managing the intellectual assets of a company reduces the risk of losing those assets. Create a process to promote central access to information about "lessons learned" within the IT group. For example, group round tables, internal white papers, new features related to upgrades, and repositories for problem analysis and resolutions are ways of making information accessible.

      • Update training materials when applications or systems are changed

        Training material should be kept up to date with application and system changes. Incorporate training materials into the change management and documentation procedures.

      Documentation as a Means of Maintaining High Availability

      Clear and complete documentation should be part of every set of HA operational policies. Without documenting the steps for implementing or executing a process, you run the risk of losing that knowledge, increasing the risk for human error during the execution of that process, and omitting a step within a process. All of these risks affect availability.

      Clearly defined operational procedures contribute to shorter learning curves when new employees join your organization. Properly documented operational procedures can greatly reduce the number of questions for support personnel, especially when the people who put the procedures in place are no longer with the group. Proper documentation can also eliminate confusion by clearly defining roles and responsibilities within the organization.

      Clear documentation of applications is essential for new employees. When internally developed applications need to be maintained and enhanced, documentation helps developers refresh their knowledge of the internal details of the programs. If the original developers are no longer with the group, this documentation becomes especially valuable to new developers who would otherwise have to struggle through reading the code itself. Readily available application documentation also can greatly reduce the number of questions for your support organization.

      • Ensure that documentation is kept up to date

        Update operational procedures when an application or system changes. Keep users informed of documentation updates.

      • Approve documentation changes through the change management process

        The change management team should review and approve changes to the documentation to ensure accuracy.

      • Document lessons learned and problem resolutions

        Documenting problem resolutions and lessons learned can improve the recovery time for repeated problems. Ideally, this documentation can be part of a periodic review process to help set priorities for system enhancements.

      • Protect the documentation

        Secure access to documentation and keep an offsite copy of your operational procedure documentation and any other critical documentation. All critical documentation should also be part of any remote site implementations. Whether the remote site is intended for restarting a system after a disaster or for disaster recovery, the site should also contain a copy of the documented procedures for failing over to that site.

      Physical Security Policies and Procedures for High Availability

      Security policies consider the physical security and operations of the hardware and the data center. Physical security includes protection from unauthorized access, as well as from physical damage such as from fire, heat, and electrical surges. Physical security is the most fundamental security precaution and is essential for the system to meet the customer's availability requirements. Physical security protects against external and internal security breaches. The CSI/FBI Computer Crime and Security Survey documents a trend toward increasing external intrusions and maintains that internal security violations still pose a large threat. A detailed discovery process into the security of data center operations and organization is outside the scope of this book. However, a properly secured infrastructure reduces the risk of damage, downtime, and financial loss.

      Take the following steps to maintain physical security of the hardware and data center:

      • Provide a suitable physical environment for computer equipment

        Not every room or closet in an office environment can be used to house computer equipment. The data center should not only account for the appropriate temperature, humidity, and security of the systems, it should also attempt to prevent potential hazards such as electrical surge, fire, and flood.

      • Restrict access to the operations area to authorized personnel

        All security-conscious operations centers need to have some sort of secure access, either in the form of biometric authentication devices or smart-card readers.

      • Use internal security monitoring

        Devices such as cameras and closed-circuit television are essential to a secure operations center because they help prevent crime and damage caused by people who are inside the facility.

      • Conduct background checks on DBAs, system administrators, and operational staff

        DBAs, system administrators, and operational staff are inherently privileged users and hold positions of trust. Organizations must perform adequate background checks to ensure that these privileged individuals are worthy of the trust placed in them. There is no technical solution that can completely protect against a determined, malicious, and poorly evaluated person holding a position of power.

      See Also:

      "Recovery Steps for Scheduled Outages"

      See Also:

      Chapter 6, "System and Network Configuration" for information about data security