| Oracle® High Availability Architecture and Best Practices 10g Release 1 (10.1) Part Number B10726-01 |
This chapter describes the policies and procedures that are essential for maintaining high availability.
Operational policies together with service management are fundamental to avoiding and minimizing outages, as well as reducing the time to recover from an outage. Operational policies are the foundation for managing the information technology infrastructure. They focus on process, policy, and management.
Operational policies for high availability focus on setting and establishing processes, policies, and management. They are divided into the following categories:
| See Also:
Chapter 6, "System and Network Configuration" for information about technical best practices |
Information Technology (IT) departments are required to deliver increasing levels of service and availability while reducing costs. Service level management is an accepted method to ensure that IT services are meeting the business requirements. Service level management requires a dialogue between IT managers and the company's lines of business. It starts with mapping business requirements to IT investments.
Service level management encompasses complete end-to-end management of the service infrastructure. The foundation of service level management is the service level agreement (SLA). The SLA is critical for building accountability into the provider-client relationship and for evaluating the provider's performance. SLAs are becoming more accepted and necessary as a monitoring and control instrument for the relationship between a customer and the IT supplier (external or internal). SLAs are developed for critical business processes and application systems, such as order processing. The business individuals who specify the functionality of systems should develop the SLA for those systems. The SLA represents a detailed, complete description of the services that the supplier is obligated to deliver, and the responsibilities of the users of that service. Developing an SLA challenges the client to rank the requirements and focus resources toward the most important requirements. An SLA should evolve with the business requirements.
There is no standardized SLA that meets the needs of all companies, but a typical SLA should contain the following sections:
Developing an SLA and service level measurements requires commitment and hard work by all parties. Service levels should be measured by business requirements, such as cost for each order processed. Any shared services or components must perform at the level of the most stringent SLA. Furthermore, SLAs should be developed between interdependent IT groups and with external suppliers. Many technologists advocate an integrated, comprehensive SLA rather than individual SLAs for infrastructure components.
The benefits of developing an SLA are:
Planning capacity and monitoring thresholds is essential to prevent downtime or unacceptably delayed transactions. Understanding average and maximum usage and the requirements to maintain that load over time helps ensure acceptable performance.
Capacity planning includes the ability to estimate the time remaining before a tablespace becomes completely full, and planning ahead to add disk space. Capacity planning can also delay or prevent scheduled outages that would otherwise be needed, for example to increase the maximum number of sessions in the database.
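The tablespace estimate described above can be approximated with a simple linear extrapolation. The following Python sketch is an illustration only, not an Oracle utility: the function name `days_until_full`, the sample readings, and the assumption that usage grows linearly are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[float, float]],
                    capacity_mb: float) -> Optional[float]:
    """Estimate the days left before a tablespace fills.

    samples: (day, used_mb) readings, oldest first. These are
    hypothetical values from whatever monitoring is in place.
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    sxx = sum((day - mean_day) ** 2 for day, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    growth_mb_per_day = sxy / sxx  # least-squares slope
    if growth_mb_per_day <= 0:
        return None
    _, last_used = samples[-1]
    # Free space measured from the latest reading, never negative.
    return max(0.0, (capacity_mb - last_used) / growth_mb_per_day)
```

For example, readings of 100, 110, and 120 MB on three consecutive days against a 200 MB limit give an estimate of 8 days. Real growth is rarely linear, so an estimate like this is best treated as an early warning rather than a deadline.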
Change management is a set of procedures or rules that ensure that changes to the hardware, software, application, and data on a system are authorized, scheduled, and tested. A stable system in which unexpected, untested, and unauthorized changes are not permitted is one that guarantees integrity to its users. The users can rely on the hardware, software, and data to perform as anticipated. Knowing exactly when and what changes have been applied to the system is vital to debugging a problem. Each customer handles change management of systems, databases, and application code differently, but there are general guidelines that can help prevent unnecessary system outages, thus protecting the system's integrity. With proper change management, application and hardware systems have greater stability, and new problems are easier to debug.
Figure 5-1 describes a typical change control flow. For emergency cases such as disasters, the change control process may need to be shortened.
The following recommendations are the foundation of good change management:
A change control process for both nonescalated and escalated cases should be created, documented, and implemented. Ad hoc and emergency changes to the hardware, database, and software in a system are inevitable, but the change control process must ensure that they are later incorporated into the change management system so their effects and ramifications are examined and recorded.
Representatives from applications, databases, systems, and management should be members of the change control board. Both hardware and software representatives must be present.
Determine meeting frequency and minimum assessment time. Change control processes should allow essential changes to be implemented in a reasonable time. Change control meetings should be frequent enough to address and discuss the most important issues. There should be a minimum grace period from the time a change is submitted until the time it is scheduled for review to provide adequate assessment time. This assessment time should be bypassed only with upper management approval.
Changes must provide short-term or long-term benefit. The change control team needs to assess whether the benefits of a change outweigh the risks and whether the change is consistent with the overall vision of the business, the application, and its rules. A proposed change must document the following:
Gather snapshots of system, hardware, database, and application configuration and performance statistics. These base numbers can be used for comparison when a change is implemented. After changing a database parameter, you can gather new statistics and compare them with the base statistics to determine the impact of a change.
Database structure changes are easy to make on demand and easy to slip through the change management process. Therefore, it may be necessary to have a special procedure in place for these changes. It is also beneficial to track these changes for trend analysis. In addition, when files are added to the database, the files must be incorporated into the backup and monitoring schemes; proper tracking of this type of change can act as a reminder.
A version control system must exist for application code to help track changes and enable fallback to a previous code base. Internally developed applications are modified and enhanced frequently, and new versions are put in place. When a problem is found with the new version, testing the case in the old version provides valuable debugging information. Depending on the type of application and the likelihood of the users' need to revert to an earlier version, the company must decide how many previous versions to keep on hand. At least one is mandatory.
Quality assurance should validate test specifications, test plans, and the results of tests to ensure that the test simulation mimics your applications or at least considers all critical points of the application being tested. Tests and test environments should be designed to test both essential functionality and scalability of the application.
Internal audits may be used to verify that your hardware, software, database, and application are certified with vendors, performing within service levels, and achieving high availability.
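One of the recommendations above is to gather baseline statistics before a change and compare them with new statistics afterward. A minimal Python sketch of that comparison follows. The function name, the metric names, the sample values, and the 10 percent threshold are invented for illustration; real numbers would come from your own monitoring and statistics tools.

```python
def compare_to_baseline(baseline, current, threshold_pct=10.0):
    """Return {metric: percent change} for metrics that moved more than
    threshold_pct (up or down) relative to the baseline snapshot."""
    flagged = {}
    for name, base_value in baseline.items():
        # Skip metrics missing from the new snapshot or with a zero
        # baseline, where a percentage change is undefined.
        if name not in current or base_value == 0:
            continue
        change_pct = (current[name] - base_value) / base_value * 100.0
        if abs(change_pct) > threshold_pct:
            flagged[name] = round(change_pct, 1)
    return flagged


# Hypothetical snapshots taken before and after a parameter change.
before = {"avg_response_ms": 200.0, "cpu_pct": 40.0}
after = {"avg_response_ms": 260.0, "cpu_pct": 42.0}
print(compare_to_baseline(before, after))
```

Here only `avg_response_ms` is reported, because it moved by 30 percent while `cpu_pct` moved by 5 percent. Keeping the snapshots themselves, not just the differences, also gives the change control board a record for later trend analysis.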
Proper backup and recovery plans are essential and must be constructed to meet your specific service levels. Both disk and tape database backups are recommended. Disk and tape backups are essential for disaster recovery and for cases when you need to restore from a very old backup.
A robust backup and recovery scheme requires an understanding of how it is strengthened or compromised by the physical location of files, the order of events during the backup process, and the handling of errors. A robust scheme is resilient in the face of media failures, programmatic failures, and, if possible, operator failures. Complete, successful, and tested processes are fundamental to the successful recovery of any failed environment.
Take the following steps to construct useful backup and recovery plans:
Create recovery plans for different types of outages. Start with the most common outage types and progress to the least probable. An outage matrix with recommended recovery actions and a validated mean time to recover (MTTR) estimate enables you to assess whether you can meet your SLAs for different types of outages.
Monitor the backup tasks for errors and validate backups by testing your recovery procedures periodically.
Having up-to-date backups reduces recovery time.
Offsite backups of the database are essential to protect from site failures.
Documentation is a safeguard against mistakes, loss of knowledgeable people, and misinterpretations. Maintaining accurate documentation on backup and recovery procedures is as important as having procedures in place.
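The outage matrix mentioned in the steps above can be kept as simple structured data and checked against the SLA. The Python sketch below assumes a hypothetical record layout; the outage types, recovery actions, MTTR figures, and SLA limits are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class OutageEntry:
    outage_type: str
    recovery_action: str
    validated_mttr_min: float    # MTTR measured in a recovery test
    sla_max_downtime_min: float  # downtime the SLA tolerates


def sla_gaps(matrix):
    """Return the outage types whose validated MTTR exceeds the SLA limit."""
    return [entry.outage_type for entry in matrix
            if entry.validated_mttr_min > entry.sla_max_downtime_min]


# Placeholder matrix, ordered from most common to least probable outage.
matrix = [
    OutageEntry("datafile media failure", "restore from disk backup", 45, 60),
    OutageEntry("site failure", "restore from offsite tape backup", 480, 240),
]
print(sla_gaps(matrix))
```

With these placeholder figures, only the site failure row is reported, which signals that either the recovery procedure or the SLA needs to be revisited for that outage type.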
Disaster recovery (DR) planning is a process designed and developed specifically to deal with catastrophic, large-scale interruptions in service to allow timely resumption of operations. These interruptions can be caused by disasters such as fires, floods, earthquakes, or malicious attacks. The basic assumption is that the building where the data center and computers reside may not be accessible, and that the operations need to resume elsewhere. It assumes the worst and tries to deal with the worst. As an organization increasingly relies on its electronic systems and data, access to these systems and data becomes a fundamental component of success. This underscores the importance of disaster recovery planning. Proper disaster planning reduces MTTR during a catastrophe and provides continual availability of critical applications, helping to preserve customers and revenue.
Take the following steps to plan recovery from disasters:
Disaster recovery plans (DRPs) must deliver the expected MTTR service levels. The implementation costs must also be justified by the service levels. One DRP may not accommodate all disasters or even the common disasters.
The first question to ask when trying to decide whether to include an application in the disaster recovery plans is whether that application supports a key business operation that must be brought back online within a few hours or days of a disaster. This may not be the same as the availability requirements of the application, although it is closely related. It has more to do with the cost to the company every hour or day that the system is not available. Disaster recovery planning requires securing off-site equipment, personnel, and supporting components such as phone lines and networks that can function at an acceptable level on an interim basis. This is costly, and care must be taken to consider only those applications that are key to the survival of the company.
A DRP should clearly identify the outage it protects against and the steps to implement in case of that outage. A general diagram of the system is essential. It needs to be detailed enough to determine hardware fault tolerance, including controllers, mirrored disks, the disaster backup site, processors, communication lines, and power. It also helps identify the current resources and any additional resources that may be necessary. Understanding how and when data flows in and out of the system is essential in identifying parts of the system that merit special attention. Special attention can be in the form of additional monitoring requirements or the frequency and types of backups taken. Conversely, it may also show areas that only require minimal attention and fewer system resources to monitor and manage.
Ensure that critical applications, database instances, systems, or business processes are included in your disaster recovery plan. Use application, system, and network diagrams to assess fault tolerance and alternative routes during a disaster.
Consider all the components that allow your business to run. Ensure that the DRP includes all system, hardware, application, and people resources. Verify that network, telephone service, and security measures are in place.
A DR coordinator and a backup coordinator should be pre-assigned to ensure that all operations and communications are passed on.
The DRP must be rehearsed periodically, which implies that the facilities to test the DRP must be available.
Scheduled outages can affect the application server tier, the database tier, or the entire site. These outages may include one or more of the following: node hardware maintenance, node software maintenance, Oracle software maintenance, redundant component maintenance, and entire site maintenance. Proper scheduled outage planning reduces MTTR and reduces risk when changes do not go as planned.
Take the following steps to plan scheduled outages:
Creating a list of possible scheduled outages, their projected frequency, and estimated duration enables advance planning and a better assessment of availability requirements. A reliability assessment to understand the mean time between failures (MTBF) of critical components can be used to plan for certain scheduled outages in order to prevent an unscheduled outage. In many cases, only one large scheduled outage is allotted each year, so maintenance must be prioritized and justified.
Scheduled outages that do not require software or application changes can usually be done with minimum downtime if a subsequent system can take over the new transactions. With Real Application Clusters and Data Guard switchover, you can upgrade hardware and do some standard system maintenance with minimum downtime to your business. For most software upgrades such as Oracle upgrades, the actual downtime can be less than an hour if prepared correctly. For more complex application changes that require schema changes or database object reorganizations, customers must assess whether Oracle's online reorganization features suffice or whether to use some of Oracle's rolling upgrade capabilities.
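The list of scheduled outages described above, with projected frequency and estimated duration, translates directly into an annual planned-downtime budget. The Python sketch below shows the arithmetic. The outage names and figures are hypothetical, and a 365-day year is assumed.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years ignored


def planned_downtime_hours(outages):
    """Sum the planned downtime for one year.

    outages: iterable of (name, times_per_year, hours_each).
    """
    return sum(times * hours for _, times, hours in outages)


def availability_pct(downtime_hours):
    """Percentage of the year the service is up, given total downtime."""
    return (1 - downtime_hours / HOURS_PER_YEAR) * 100


# Hypothetical schedule for one year.
outages = [
    ("node hardware maintenance", 2, 4.0),
    ("Oracle software upgrade", 1, 1.0),
    ("entire site maintenance", 1, 8.0),
]
total = planned_downtime_hours(outages)
print(f"{total} hours planned, {availability_pct(total):.3f}% availability")
```

With this schedule, the planned total is 17 hours, or roughly 99.8 percent availability before any unscheduled outage is counted. Comparing that figure with the SLA shows how much room remains for unplanned downtime.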
| See Also:
Chapter 6, "System and Network Configuration" for information about data security |