Oracle® High Availability Architecture and Best Practices 10g Release 1 (10.1) Part Number B10726-01
This chapter includes the following topics:
The interconnected nature of today's global businesses demands continuous availability for more of the business components. However, a business that is designing and implementing an HA strategy must perform a thorough analysis and have a complete understanding of the business drivers that require high availability, because implementing high availability is costly. It may involve critical tasks such as:
Higher degrees of availability reduce downtime significantly, as shown in the following table:
| Availability Percentage | Approximate Downtime Per Year |
|---|---|
| 95% | 18 days |
| 99% | 4 days |
| 99.9% | 9 hours |
| 99.99% | 1 hour |
| 99.999% | 5 minutes |
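The figures in the table follow from simple arithmetic on the number of hours in a year. The sketch below reproduces them, assuming a 365-day year; the function name `downtime_hours_per_year` is illustrative and not part of any Oracle tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (95.0, 99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% available: {hours / 24:.2f} days "
              f"= {hours:.2f} hours = {hours * 60:.1f} minutes of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the table drops from days to hours to minutes.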
Businesses with higher availability requirements must deploy more fault-tolerant, redundant systems for their business components and have a larger investment in IT staff, processes, and services to ensure that the risk of business downtime is minimized.
An analysis of the business requirements for high availability, together with an understanding of the accompanying costs, enables an optimal solution that keeps the business available as much as possible within its financial and resource limitations. This chapter provides a simple framework that can be used to evaluate the high availability requirements of a business.
The elements of this analysis framework are:
A rigorous business impact analysis identifies the critical business processes within an organization, calculates the quantifiable loss risk for unplanned and planned IT outages affecting each of these business processes, and outlines the less tangible impacts of these outages. It takes into consideration essential business functions, people and system resources, government regulations, and internal and external business dependencies. This analysis uses objective and subjective data gathered from interviews with knowledgeable and experienced personnel and from reviews of business practice histories, financial reports, IT systems logs, and so on.
The business impact analysis categorizes the business processes based on the severity of the impact of IT-related outages. For example, consider a semiconductor manufacturer, with chip design centers located worldwide. An internal corporate system providing access to human resources, business expenses and internal procurement is not likely to be considered as mission-critical as the internal e-mail system. Any downtime of the e-mail system is likely to severely affect the collaboration and communication capabilities among the global R&D centers, causing unexpected delays in chip manufacturing, which in turn will have a material financial impact on the company.
In a similar fashion, an internal knowledge management system is likely to be considered mission-critical for a management consulting firm because the business of a client-focused company is based on internal research accessibility for its consultants and knowledge workers. The cost of downtime of such a system is extremely high for this business. This leads us to the next element in the high availability requirements framework: cost of downtime.
A well-implemented business impact analysis provides insights into the costs that result from unplanned and planned downtimes of the IT systems supporting the various business processes. Understanding this cost is essential because this has a direct influence on the high availability technology chosen to minimize the downtime risk.
Various reports have been published, documenting the costs of downtime across industry verticals. These costs range from millions of dollars per hour for brokerage operations and credit card sales, to tens of thousands of dollars per hour for package shipping services.
While these numbers are staggering, the reasons are quite obvious. The Internet has brought millions of customers directly to the businesses' electronic storefronts. Critical and interdependent business issues such as customer relationships, competitive advantages, legal obligations, industry reputation, and shareholder confidence are even more critical now because of their increased vulnerability to business disruptions.
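As a rough illustration of how a business impact analysis might roll up the quantifiable part of this cost, the sketch below multiplies an hourly loss figure by the expected planned and unplanned outage hours for each business process. All names and dollar figures are hypothetical, and the sketch assumes that a planned hour of downtime costs as much as an unplanned one, which a real analysis may not accept.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessOutageEstimate:
    name: str
    loss_per_hour: float     # hypothetical quantifiable loss, in dollars per hour
    unplanned_hours: float   # expected unplanned outage hours per year
    planned_hours: float     # expected planned outage hours per year

    def annual_cost(self) -> float:
        """Quantifiable yearly cost of downtime for this business process."""
        return self.loss_per_hour * (self.unplanned_hours + self.planned_hours)


# Hypothetical inputs; a real analysis would take them from interviews and logs.
ESTIMATES = [
    ProcessOutageEstimate("clickstream data mining", 500.0, 24.0, 48.0),
    ProcessOutageEstimate("online storefront", 50_000.0, 4.0, 8.0),
]

if __name__ == "__main__":
    # Rank processes so the most expensive outages are examined first.
    for est in sorted(ESTIMATES, key=lambda e: e.annual_cost(), reverse=True):
        print(f"{est.name}: ${est.annual_cost():,.0f} per year")
```

The less tangible impacts, such as reputation and shareholder confidence, do not fit this arithmetic and still have to be weighed separately.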
A business impact analysis, as well as the calculated cost of downtime, provides insights into the recovery time objective (RTO), an important statistic in business continuity planning. It is defined as the maximum amount of time that an IT-based business process can be down before the organization starts suffering significant material losses. RTO indicates the downtime tolerance of a business process or an organization in general.
RTO requirements become more stringent as the business becomes more mission-critical. Thus, for a system running a stock exchange, the RTO is zero or very near to zero.
An organization is likely to have varying RTO requirements across its various business processes. Thus, for a high volume e-commerce Web site, for which there is an expectation of rapid response times and for which customer switching costs are very low, the web-based customer interaction system that drives e-commerce sales is likely to have an RTO close to zero. However, the RTO of the systems that support the backend operations such as shipping and billing can be higher. If these backend systems are down, then the business may resort to manual operations temporarily without a significantly visible impact.
A systems statistic related to RTO is the network recovery objective (NRO), which indicates the maximum time that network operations can be down for a business. Components of network operations include communication links, routers, name servers, load balancers, and traffic managers. NRO impacts the RTO of the whole organization because individual servers are useless if they cannot be accessed when the network is down.
Recovery point objective (RPO) is another important statistic for business continuity planning and is calculated through an effective business impact analysis. It is defined as the maximum amount of data an IT-based business process may lose before causing detrimental harm to the organization. RPO indicates the data-loss tolerance of a business process or an organization in general. This data loss is often measured in terms of time, for example, 5 hours or 2 days worth of data loss.
A stock exchange where millions of dollars worth of transactions occur every minute cannot afford to lose any data. Thus, its RPO must be zero. Referring to the e-commerce example, the web-based sales system does not strictly require an RPO of zero, although a low RPO is essential for customer satisfaction. However, its backend merchandising and inventory update system may have a higher RPO; lost data in this case can be re-entered.
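Because both RTO and RPO are commonly expressed as durations, they can be recorded per business process and compared against an actual or simulated outage. The following is a minimal sketch of that comparison; the class name, the process names, and every duration other than the stock exchange's zero values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    process: str
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss, measured as a span of time

    def tolerates(self, outage: timedelta, data_lost: timedelta) -> bool:
        """True if an outage stays within both the RTO and the RPO."""
        return outage <= self.rto and data_lost <= self.rpo


# Hypothetical objectives, loosely following the examples in the text.
OBJECTIVES = [
    RecoveryObjectives("stock exchange trading", timedelta(0), timedelta(0)),
    RecoveryObjectives("e-commerce web sales", timedelta(minutes=1), timedelta(minutes=5)),
    RecoveryObjectives("backend inventory updates", timedelta(hours=8), timedelta(days=1)),
]

if __name__ == "__main__":
    # A simulated incident: two hours of downtime, thirty minutes of lost data.
    outage, data_lost = timedelta(hours=2), timedelta(minutes=30)
    for obj in OBJECTIVES:
        verdict = "within objectives" if obj.tolerates(outage, data_lost) else "objectives missed"
        print(f"{obj.process}: {verdict}")
```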
Using the high availability analysis framework, a business can complete a business impact analysis, identify and categorize the critical business processes that have the high availability requirements, formulate the cost of downtime, and establish RTO and RPO goals for these various business processes.
This enables the business to define service level agreements (SLAs) in terms of high availability for critical aspects of its business. For example, it can categorize its businesses into several HA tiers:
The next step for the business is to evaluate the capabilities of the various HA systems and technologies and choose the ones that meet its SLA requirements, within the guidelines as dictated by business performance issues, budgetary constraints, and anticipated business growth.
Figure 2-1 illustrates this process.
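One way to picture the selection step is as a filter over candidate solutions: discard those that cannot meet the RTO, the RPO, or the budget, then take the least expensive of what remains. The sketch below assumes a hypothetical catalog with made-up recovery characteristics and costs; it is not a description of any particular product, and a real evaluation would also weigh performance impact and anticipated growth.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class HASolution:
    name: str                    # hypothetical label, not a product name
    recovery_time: timedelta     # time needed to restore service
    data_loss_window: timedelta  # worst-case data lost when failing over
    annual_cost: float           # hypothetical yearly cost, in dollars


def cheapest_fit(solutions: Sequence[HASolution], rto: timedelta,
                 rpo: timedelta, budget: float) -> Optional[HASolution]:
    """Return the cheapest solution meeting RTO, RPO, and budget, or None."""
    candidates = [
        s for s in solutions
        if s.recovery_time <= rto and s.data_loss_window <= rpo and s.annual_cost <= budget
    ]
    return min(candidates, key=lambda s: s.annual_cost, default=None)


# Hypothetical catalog for illustration only.
CATALOG = (
    HASolution("nightly tape backup", timedelta(days=2), timedelta(days=1), 20_000.0),
    HASolution("local snapshots", timedelta(hours=4), timedelta(hours=1), 60_000.0),
    HASolution("synchronized remote site", timedelta(minutes=5), timedelta(0), 400_000.0),
)

if __name__ == "__main__":
    choice = cheapest_fit(CATALOG, rto=timedelta(hours=8), rpo=timedelta(hours=2), budget=100_000.0)
    print(choice.name if choice else "no solution fits; revisit the SLA or the budget")
```

A `None` result is informative in itself: it signals that the stated objectives and the budget cannot both be satisfied by the options considered.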
The following sections provide further details about this methodology:
A broad range of high availability and business continuity solutions exists today. As the sophistication and scope of these systems increase, they make more of the IT infrastructure, such as the data storage, server, network, applications, and facilities, highly available. They also reduce RTO and RPO from days to hours, or even to minutes. But increased availability comes with an increased cost, and on some occasions, with an increased impact on systems performance.
Organizations need to carefully analyze the capabilities of these HA systems and map their capabilities to the business requirements to make sure they have an optimal combination of HA solutions to keep their business running. Consider the business with a significant e-commerce presence as an example.
For this business, the IT infrastructure that supports the system that customers encounter, the core e-commerce engine, needs to be highly available and disaster-proof. The business may consider clustering for the web servers, application servers and the database servers serving this e-commerce engine. With built-in redundancy, clustered solutions eliminate single points of failure. Also, modern clustering solutions are application-transparent, provide scalability to accommodate future business growth, and provide load-balancing to handle heavy traffic. Thus, such clustering solutions are ideally suited for mission-critical high-transaction applications.
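The effect of redundancy on availability can be estimated with standard reliability arithmetic: a cluster is down only when all of its nodes are down, while a chain of tiers is up only when every tier is up. The sketch below applies those two rules, assuming that node failures are independent, which real clusters only approximate. The 99% single-node figure is a hypothetical input.

```python
from typing import Iterable


def parallel_availability(node_availability: float, nodes: int) -> float:
    """Availability of a cluster that survives as long as one node is up.

    Assumes independent node failures (a simplification).
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return 1.0 - (1.0 - node_availability) ** nodes


def series_availability(tier_availabilities: Iterable[float]) -> float:
    """Availability of a chain of tiers that must all be up at once."""
    result = 1.0
    for availability in tier_availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    single_node = 0.99  # hypothetical availability of one server
    clustered = parallel_availability(single_node, nodes=2)
    # Web, application, and database tiers, with and without clustering.
    print(f"unclustered stack: {series_availability([single_node] * 3):.6f}")
    print(f"clustered stack:   {series_availability([clustered] * 3):.6f}")
```

The series rule also shows why every tier needs attention: an unclustered tier caps the availability of the whole stack no matter how redundant the other tiers are.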
The data that supports the high volume e-commerce transactions must be protected adequately and be available with minimal downtime if unplanned and planned outages occur. This data should not only be backed up at regular intervals at the local data centers, but should also be replicated to databases at a remote data center connected over a high-speed, redundant network. This remote data center should be equipped with secondary servers and databases readily available, and be synchronized with the primary servers and databases. This gives the business the capability to switch to these servers at a moment's notice with minimal downtime if there is an outage, instead of waiting for hours and days to rebuild servers and recover data from backed-up tapes.
Maintaining synchronized remote data centers is an example where redundancy is built along the entire system's infrastructure. This may be expensive. However, the mission-critical nature of the systems and the data they protect may warrant this expense.
Consider another aspect of the business: the high availability requirements are less stringent for systems that gather clickstream data and perform data mining. The cost of downtime is low, and the RTO and RPO requirements for these systems could be a few days, because even if they are down and some data is lost, that will not have a detrimental effect on the business. While the business may need powerful machines to perform data mining, it does not need to mirror this data on a real-time basis. Data protection may be obtained by simply performing regularly scheduled backups, and archiving the tapes for offsite storage.
For this e-commerce business, the back-end merchandising and inventory systems are expected to have higher HA requirements than the data mining systems, and thus they may employ technologies such as local mirroring or local snapshots, in addition to scheduled backups and offsite archiving.
The business should employ a management infrastructure that performs overall systems management, administration and monitoring, and provides an executive dashboard. This management infrastructure should be highly available and fault-tolerant.
Finally, the overall IT infrastructure for this e-commerce business should be extremely secure, to protect against malicious external and internal electronic attacks.
High availability solutions must also be chosen keeping in mind business performance issues. For example, a business may use a zero-data-loss solution that synchronously mirrors every transaction on the primary database to a remote database. However, considering the speed-of-light limitations and the physical limitations associated with a network, there will be round-trip delays in the network transmission. This delay increases with distance, and varies based on network bandwidth, traffic congestion, router latencies, and so on. Thus, this synchronous mirroring, if performed over large WAN distances, may impact the primary site performance. Online buyers may notice these system latencies and be frustrated with long system response times; they may go somewhere else for their purchases. This is an example where the business must make a trade-off between having a zero data loss solution and maximizing system performance.
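The speed-of-light limit alone puts a floor under the round-trip delay, which the sketch below estimates. It assumes that signals in optical fiber travel at roughly two thirds of the vacuum speed of light and that the cable follows the straight-line distance; both are simplifications, and real delays are higher once routers, congestion, and bandwidth limits are included. The function names are illustrative.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 0.67  # assumption: light in fiber travels at about 2/3 of c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on the round-trip time between two sites, in milliseconds."""
    if distance_km < 0:
        raise ValueError("distance cannot be negative")
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2.0 * one_way_seconds * 1000.0


def max_serial_commits_per_second(distance_km: float) -> float:
    """Upper bound on commits per second for one session that waits for a
    remote acknowledgment before each commit completes."""
    return 1000.0 / min_round_trip_ms(distance_km)


if __name__ == "__main__":
    for km in (10, 100, 1_000, 5_000):
        print(f"{km:>5} km: at least {min_round_trip_ms(km):.2f} ms per round trip, "
              f"at most {max_serial_commits_per_second(km):.0f} serial commits per second")
```

The bound applies to each session that waits on the remote site, so concurrent sessions can still reach a higher aggregate throughput; what the distance cannot hide is the added latency on every individual commit.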
High availability solutions must also be chosen keeping in mind financial considerations and future growth estimates. It is tempting to build redundancies throughout the IT infrastructure and claim that the infrastructure is completely failure-proof. However, going overboard with such solutions may not only lead to budget overruns, it may lead to an unmanageable and unscalable combination of solutions that are extremely complex and expensive to integrate and maintain.
An HA solution that has very impressive performance benchmark results may look good on paper. However, if an investment is made in such a solution without a careful analysis of how the technology capabilities match the business drivers, then a business may end up with a solution that does not integrate well with the rest of the system infrastructure, has annual integration and maintenance costs that easily exceed the upfront license costs, and forces a vendor lock-in. Cost-conscious and business-savvy CIOs must invest only in solutions that are well-integrated, standards-based, easy to implement, maintain and manage, and have a scalable architecture for accommodating future business growth.
Choosing and implementing the architecture that best fits the availability requirements of a business can be a daunting task. This architecture must encompass appropriate redundancy, provide adequate protection from all types of outages, ensure consistent high performance and robust security, while being easy to deploy, manage, and scale. Needless to say, this architecture should be driven by well-understood business requirements.
To build, implement and maintain such an architecture, a business needs high availability best practices that involve both technical and operational aspects of its IT systems and business processes. Such a set of best practices removes the complexity of designing an HA architecture, maximizes availability while using minimum system resources, reduces the implementation and maintenance costs of the HA systems in place, and makes it easy to duplicate the high availability architecture in other areas of the business.
An enterprise with a well-articulated set of high availability best practices that encompass HA analysis frameworks, business drivers and system capabilities, will enjoy an improved operational resilience and enhanced business agility. The remaining chapters in this book will provide technical details on the various high availability technologies offered by Oracle, along with best practice recommendations on configuring and using such technologies.