| Oracle® Database Con
cepts 10g Release 1 (10.1) Part Number B10743-01 |
|
|
View PDF |
This chapter describes some of the basic ideas in business intelligence.
This chapter contains the following topics:
A data warehouse is a relational database that is designed for query and analysis rather than for transaction proce ssing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates an alysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) soluti on, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:
Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, y ou can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer f or this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subj ect oriented.
Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems a s naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to a nalyze what has occurred.
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) syste ms, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time i s what is meant by the term time variant.
Typically, data flows from one or more online transaction processing (OLTP) databases into a d
ata warehouse on a monthly, weekly, or daily basis. The data is normally processed in a
Data warehouses and OLTP syst ems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:
Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in adv ance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.
OLTP systems su pport only predefined operations. Your applications might be specifically tuned or designed to support only these operations.
A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.
Data warehou ses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
OLTP syste ms often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.
A typical data warehouse query scans thousands or millions of rows.For example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For example, "Retriev e the current order for this customer."
Data warehouses usually store many mon ths or years of data. This is to support historical analysis.
OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.
Data wa rehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:
Figure 16-1 shows a simple architecture for a data warehouse. End users directly access data derived from several sou rce systems through the data warehouse.
In Figure 16-1, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summari es are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse qu ery is to retrieve something like August sales.
Summaries in Oracle are called materialized views.
Figure 16-1, y ou need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although m ost data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 16-2 illustrates this typical architecture.
Figure 16-2 Architecture of a Data Warehou se with a Staging Area

Although the architecture in Figure 16-2 is quite common, you might want to customize your warehouse's architecture for different groups within your organization.
Do this by adding data marts, which are systems design ed for a particular line of business. Figure 16-3 illustrates an example where purchasing, sales, and inventori es are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.
Figure 16-3 Architecture of a Data Warehouse with a Staging Area and Data Marts

|
See Also: Oracle Data Warehousing Guide |
You need to load your data warehouse regularly so that it can serve its purpose of facilitating bus iness analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The proces s of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stand s for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase a nd implies that each of the other phases of the process is distinct. We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
The methodology and tasks of ETL h ave been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary appl ications and database systems are the IT backbone of any enterprise. Data has to be shared between applications or systems, trying to integrate them, giving at least two applications the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what we now call ETL.
Data warehouse environments face the same challenge with the additional burden that they not only have to exchange but to integrate, rearrange and consolidate data over many systems, thereby providing a new unified informatio n base for business intelligence. Additionally, the data volume in data warehouse environments tends to be very large.
What ha ppens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data th an necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction pr ocess. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the busin ess situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days /hours and minutes to near real-time. Web server log files for example can easily become hundreds of megabytes in a very short period of time.
After extracting data, it has to be physically transported to the target system or an intermediate system for furthe
r processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example,
a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement.
If any errors occur during loading, an error is logged and the operation can continue.
Transportable tablespaces are the fastest way for moving large volumes of data between two Oracle databases. You can tran sport tablespaces between different machine architectures and operating systems.
Previously, the most scalable data transporta tion mechanisms relied on moving flat files containing raw data. These mechanisms required that data be unloaded or exported into fil es from the source database. Then, after transportation, these files were loaded or imported into the target database. Transportable tablespaces entirely bypass the unload and reload steps.
Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported from one database to another. Furthermore, like import and export, transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data.
The most common applications of transportable tablespaces in data warehouses are in moving data from a staging database to a data warehouse, or in moving data from a data warehouse to a data mart.
Table func tions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. Scenarios as men tioned earlier can be done without requiring the use of intermediate staging tables, which interrupt the data flow through various tr ansformations steps.
A table function is defined as a function that can produce a set of rows as output. Additionally, table f unctions can take a set of rows as input. Table functions extend database functionality by allowing:
Mul tiple rows to be returned from a function
Results of SQL subqueries (that select multiple rows) to be pa ssed directly to functions
Functions take cursors as input
Functions can be parallelized
Returning result sets incrementally for further processing as soon as they are created. Th is is called incremental pipelining
Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).
External tables le t you use external data as a virtual table that can be queried and joined directly and in parallel without requiring the external dat a to be first loaded in the database. You can then use SQL, PL/SQL, and Java to access the external data.
External tables enab
le the pipelining of the loading phase with the transformation phase. The transformation process can be merged with the loading proce
ss without any interruption of the data streaming. It is no longer necessary to stage the data inside the database for further proces
sing inside the database, such as comparison or transformation. For example, the conversion functionality of a conventional load can
be used for a direct-path INSERT AS SELECT statement in conjunction with the SELECT from an external table. Figure 16-4 illustrates a typical example of pipelining.
The main difference between external tables and regular tables is that externally organized table
s are read-only. No DML operations (UPDATE/INSERT/DELETE) are possible and no indexes can be c
reated on them.
External tables are a complement to SQL*Loader and are especially useful for environments where the complete e xternal source has to be joined with existing database objects and transformed in a complex manner, or where the external data volume is large and used only once. SQL*Loader, on the other hand, might still be the better choice for loading of data where additional in dexing of the staging table is necessary. This is true for operations where the data is used in independent complex transformations o r the data is only partially used in further processing.
You can save disk space by compress ing heap-organized tables. A typical type of heap-organized table you should consider for table compression is partitioned tables.
To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a slight cost in CPU overhead.
Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed ta bles or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compress ion by causing some space to be wasted.
Change data capture efficiently identifies and capt ures data that has been added to, updated, or removed from Oracle relational tables, and makes the change data available for use by a pplications.
Oftentimes, data warehousing involves the extraction and transportation of relational data from one or more sourc e databases into the data warehouse for analysis. Change data capture quickly identifies and processes only the data that has changed, not entire tables, and makes the change data available for further use.
Change data
capture does not depend on intermediate flat files to stage the data outside of the relational database. It captures the change data
resulting from INSERT, UPDATE, and DELETE operations made to user tables. The change data is
then stored in a database object called a change table, and the change data is made available to applications in a controlled way.
One technique employed in data warehouses to improve performance is the cr eation of summaries. Summaries are special kinds of aggregate views that improve query execution times by precalculating expensive jo ins and aggregation operations prior to execution and storing the results in a table in the database. For example, you can create a t able to contain the sums of sales by region and by product.
The summaries or aggregates that are referred to in this book and
in literature on data warehousing are created in Oracle using a schema object called a
Previously, organizations using summaries spent a significant amount of time and effort creatin g summaries manually, identifying which summaries to create, indexing the summaries, updating them, and advising their users on which ones to use. Summary management eased the workload of the database administrator and meant that the user no longer needed to be awar e of the summaries that had been defined. The database administrator creates one or more materialized views, which are the equivalent of a summary. The end user queries the tables and views at the detail data level.
The query rewrite mechanism in the Oracle d atabase server automatically rewrites the SQL query to use the summary tables. This mechanism reduces response time for returning res ults from the query. Materialized views within the data warehouse are transparent to the end user or to the database application.
Although materialized views are usually accessed through the query rewrite mechanism, an end user or database application can con struct queries that directly access the summaries. However, serious consideration should be given to whether users should be allowed to do this because any change to the summaries will affect the queries that reference them.
To help you select from among the
many possible materialized views in your schema, Oracle provides a collection of materialized view analysis and advisor functions and
procedures in the DBMS_ADVISOR package. Collectively, these functions are called the SQLAccess Advisor, and they are ca
llable from any PL/SQL program. The SQLAccess Advisor recommends materialized views from a hypothetical or user-defined workload or o
ne obtained from the SQL cache. You can run the SQLAccess Advisor from Oracle Enterprise Manager or by invoking the DBMS_ADVISO
R package.
|
See Also: Oracle Data Warehousing Guide for information about materialized views and the SQLAccess Advisor |
Bitmap indexes are widely used in data warehousing environments. The environments typically have large amounts of data and ad hoc queries, but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:
Reduced response time for large classes of ad hoc queries
Reduced storage requirement s compared to other indexing techniques
Dramatic performance gains even on hardware with a relatively sm all number of CPUs or a small amount of memory
Efficient maintenance during parallel DML and loads
li>Fully indexing a large table with a traditional B-tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the inde xed data in the table.
An index provides pointers to the rows in a table that contain a given key value. A regular index store s a list of rowids for each key corresponding to the rows with that key value. In a bitmap index, a bitmap for each key value replace s a list of rowids.
Each bit in the bitmap corresponds to a possible rowid, and if the bit is set, it means that the row with the corresponding rowid contains the key value. A mapping function converts the bit position to an actual rowid, so that the bitmap i ndex provides the same functionality as a regular index. If the number of different key values is small, bitmap indexes save space. p>
Bitmap indexes are most effective for queries that contain multiple conditions in the WHERE clause. Rows that sati
sfy some, but not all, conditions are filtered out before the table itself is accessed. This improves response time, often dramatical
ly. A good candidate for a bitmap index would be a gender column due to the low number of possible values.
Parallel query and parallel DML work with bitmap indexes as they do with traditi onal indexes. Bitmap indexing also supports parallel create indexes and concatenated indexes.
When Oracle runs SQL statements
in parallel, multiple processes work together simultaneously to run a single SQL statement. By dividing the work necessary to run a
statement among multiple processes, Oracle can run the statement more quickly than if only a single process ran it. This is called parallel execution or parallel proc
essing.
Parallel execution dramatically reduces response time for data-intensive operations on large databases typica lly associated with decision support systems (DSS) and data warehouses. Symmetric multiprocessing (SMP), clustered systems, and large -scale cluster systems gain the largest performance benefits from parallel execution because statement processing can be split up amo ng many CPUs on a single Oracle system. You can also implement parallel execution on certain types of online transaction processing ( OLTP) and hybrid systems.
Parallelism is the idea of breaking down a task so that, instead of one process doi ng all of the work in a query, many processes do part of the work at the same time. An example of this is when 12 processes handle 12 different months in a year instead of one process handling all 12 months by itself. The improvement in performance can be quite high .
Parallel execution helps systems scale in performance by making optimal use of hardware resources. If your system's CPUs and disk controllers are already heavily loaded, you n eed to alleviate the system's load or increase these hardware resources before using parallel execution to improve performance.
< p>Some tasks are not well-suited for parallel execution. For example, many OLTP operations are relatively fast, completing in mere se conds or fractions of seconds, and the overhead of utilizing parallel execution would be large, relative to the overall execution tim e.|
See Also: Oracle Data Warehousing Guide for specific information on tun ing your parameter files and database to take full advantage of parallel execution |
W
hen parallel execution is not used, a single server process performs all necessary processing for the sequential execution of a SQL s
tatement. For example, to perform a full table scan (such as SELECT * FROM emp), one process performs the e
ntire operation, as illustrated in Figure 16-5.
Figure 16-6 illustrates several parallel execution servers performing
a scan of the table emp. The table is divided dynamically (dynamic partitioning) into
load units called granules and each granule is read by a single parallel execution server. The granules are generated by the coordina
tor. Each granule is a range of physical blocks of the table emp. The mapping of granules to execution servers is not st
atic, but is determined at execution time. When an execution server finishes reading the rows of the table emp correspon
ding to a granule, it gets another granule from the coordinator if there are any granules remaining. This continues until all granule
s are exhausted, in other words, until the entire table emp has been read. The parallel execution servers send results b
ack to the parallel execution coordinator, which assembles the pieces into the desired full table scan.
Figure 16-6 Parallel Full Table Scan

Given a query plan for a SQL query, the parallel execution coordi nator breaks down each operator in a SQL query into parallel pieces, runs them in the right order as specified in the query, and then integrates the partial results produced by the parallel execution servers executing the operators. The number of parallel execution servers assigned to a single operation is the degree of parallelism (DOP) for an operation. Multiple operations within the same SQL s tatement all have the same degree of parallelism.
|
See Also: Oracle Data Warehousing Guide for information regarding granules as well as how Oracle divides work and handles DOP in multiuser environments | <
/tr>
Oracle has introduced many SQL operations for performing analytic ope rations in the database. These operations include ranking, moving averages, cumulative sums, ratio-to-reports, and period-over-period comparisons. Although some of these calculations were previously possible using SQL, the new syntax offers much better performance.< /p>
This section discusses:
Aggregation is a fundamental part of data warehousing. To improve aggregation performa nce in your warehouse, Oracle provides extensions to the GROUP BY clause to make querying and reporting easier and faster. Some of th ese extensions enable you to:
Aggregate at increasing levels of aggregation, from the most detailed up t o a grand total
Calculate all possible combinations of aggregations with a single statement
Generate the information needed in cross-tabulation reports with a single query
These extension let
you specify exactly the groupings of interest in the GROUP BY clause. This allows efficient analysis acros
s multiple dimensions without performing a CUBE operation. Computing a full cube creates a heavy processing load, so rep
lacing cubes with grouping sets can significantly increase performance. CUBE, ROLLUP, and grouping sets pro
duce a single result set that is equivalent to a UNION ALL of differently grouped rows.
To enhance p erformance, these extensions can be parallelized: multiple processes can simultaneously run all of these statements. These capabiliti es make aggregate calculations more efficient, thereby enhancing database performance, and scalability.
One of the key concept
s in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. W
e use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimens
ions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties
of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as
Here are some examples of multidimensional requests:
Show total sales across al l products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000.
Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000 . Include all possible subtotals.
List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.
All these requests involve multiple dimensions. Many mu ltidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.
Oracle has advan ced SQL analytical processing capabilities using a family of analytic SQL functions. These analytic functions enable you to calculate :
Rankings and percentiles
Moving window calculations
Lag/lead analysis
First/last analysis
Linear regression statistics< /p>
Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you c an calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group.
Ot
her features include the CASE expression. CASE expressions provide if-then logic useful in many situations.
To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously run all of these statem ents. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simpli city.
|
See Also: Oracle Data Warehousing Guide |
Oracle's MODEL clause brings a new level of power and flexibility to SQL calculations. With the MODEL cl
ause, you can create a multidimensional array from query results and then apply formulas to this array to calculate new values. The f
ormulas can range from basic arithmetic to simultaneous equations using recursion. For some applications, the MODEL clau
se can replace PC-based spreadsheets. Models in SQL leverage Oracle's strengths in scalability, manageability, collaboration, and sec
urity. The core query engine can work with unlimited quantities of data. By defining and executing models within the database, users
avoid transferring large datasets to and from separate modeling environments. Models can be shared easily across workgroups, ensuring
that calculations are consistent for all applications. Just as models can be shared, access can also be controlled precisely with Or
acle's security features. With its rich functionality, the MODEL clause can enhance all types of applications.
Oracle OLAP provides the query performance and calculation capability previously found only in multidi mensional databases to Oracle's relational platform. In addition, it provides a Java OLAP API that is appropriate for the development of internet-ready analytical applications. Unlike other combinations of OLAP and RDBMS technology, Oracle OLAP is not a multidimensi onal database using bridges to move data from the relational data store to a multidimensional data store. Instead, it is truly an OLA P-enabled relational database. As a result, Oracle provides the benefits of a multidimensional database along with the scalability, a ccessibility, security, manageability, and high availability of the Oracle database. The Java OLAP API, which is specifically designe d for internet-based analytical applications, offers productive data access.
Basing an OLAP system directly on the Oracle database server offers the fo llowing benefits:
There is tremendous growth along three dimensions of analytic applications: number of users, size of data, and complexity of analyses. There are more users of analytical applications, and they need access to more data to perform more sophisticated analysis and target marketing. For example, a telephone company might want a customer dimension to include detail such as all telephone number s as part of an application that is used to analyze customer turnover. This would require support for multi-million row dimension tab les and very large volumes of fact data. Oracle can handle very large data sets using parallel execution and partitioning, as well as offering support for advanced hardware and clustering.
Partitioning allows management of precise subsets of tables and indexes, so that management operations affect only small pi eces of these data structures. By partitioning tables and indexes, data management processing time is reduced, thus minimizing the ti me data is unavailable. Transportable tablespaces also support high availability. With transportable tablespaces, large data sets, in cluding tables and indexes, can be added with almost no processing to other databases. This enables extremely rapid data loading and updates.
Oracle lets you precisely control re source utilization. The Database Resource Manager, for example, provides a mechanism for allocating the resources of a data warehouse among different sets of end-users.
Another resource management facility is the progress monitor, which gives end users and ad ministrators the status of long-running operations. Oracle maintains statistics describing the percent-complete of these operations. Oracle Enterprise Manager lets you view a bar-graph display of these operations showing what percent complete they are. Moreover, any other tool or any database administrator can also retrieve progress information directly from the Oracle data server using system vi ews.
Oracle provides a server-managed i nfrastructure for backup, restore, and recovery tasks that enables simpler, safer operations at terabyte scale. Some of the highlight s are:
Details related to backup, restore, and recovery operations are maintained by the server in a rec overy catalog and automatically used as part of these operations.
Backup and recovery operations are ful ly integrated with partitioning. Individual partitions, when placed in their own tablespaces, can be backed up and restored independe ntly of the other partitions of a table.
Oracle includes support for incremental backup and recovery usi ng Recovery Manager, enabling operations to be completed efficiently within times proportional to the amount of changes, rather than the overall size of the database.
The security features in Oracle have reached the highest levels of U.S. government certification for datab ase trustworthiness. Oracle's fine grained access control enables cell-level security for OLAP users. Fine grained access control wor ks with minimal burden on query processing, and it enables efficient centralized security management.
Oracle Data Mining (ODM) embeds data mining within the Oracle Database. The data never leaves the database — the data, data preparat ion, model building, and model scoring results all remain in the database. This enables Oracle to provide an infrastructure for appli cation developers to integrate data mining seamlessly with database applications. Some typical examples of the applications that data mining are used in are call centers, ATMs, ERM, and business planning applications.
By eliminating the need for extracting da ta into specialized tools and then importing the results back into the database, you can save significant amounts of time. In additio n, by having the data and the data model in the same location (an Oracle database), there is no need to export the model as code.
Data mining functions such as model building, testing, and scoring are provided through a Java API.
Oracle Data Mining supports the following algorithms:
For classification, Naive Bayes , Adaptive Bayes Networks, and Support Vector Machines (SVM)
For regression, Support Vector Machines
For clustering, k-means and O-Cluster
For feature extraction, Non-Neg ative Matrix Factorization (NMF)
For sequence matching and annotation, BLAST
ODM also i ncludes several feature and performance enhancements.
|
See Also:
|