link rel="Stylesheet" href="../../dcommon/css/doccd.css" title="Default" type="text/css" />
|
Oracle® Database Data Warehousing Guide 10g Release 1 (10.1) Part Number B10736-01 |
|
| View PDF |
This chapter discusses aggregation of SQL, a basic aspect of data warehousing. It co ntains these topics:
Aggregation is a fundamental part of data warehousing. To improve aggregation performance in your wa
rehouse, Oracle Database provides the following extensions to the GROUP BY clause:
CUBE and ROLLUP extensions to the GROUP BY clause
Three GROUPING functions
GROUPING SETS expression
The CUBE, ROLLUP, and GROUPING SETS extensions to SQL make querying and repo
rting easier and faster. CUBE, ROLLUP, and grouping sets produce a single result set that is equivalent to
a UNION ALL of differently grouped rows. ROLLUP calculates aggregations such as SUM, COUNT, MAX, MIN, and AVG at increasing levels of aggregation, from the most d
etailed up to a grand total. CUBE is an extension similar to ROLLUP, enabling a single statement to calcula
te all possible combinations of aggregations. The CUBE, ROLLUP, and the GROUPING SETS extension lets you specify just the groupings needed in the GROUP BY clause. This allows efficient ana
lysis across multiple dimensions without performing a CUBE operation. Computing a CUBE creates a heavy proc
essing load, so replacing cubes with grouping sets can significantly increase performance.
To enhance performance, CUBE<
/code>, ROLLUP, and GROUPING SETS can be parallelized: multiple processes can simultaneously e
xecute all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance
, and scalability.
The three GROUPING functions help you identify the group each row belongs to and enable sortin
g subtotal rows and filtering results.
One of the key concepts in decision support systems is multidimensional analysis: examining the en terprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, depart ment, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or en tities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or lo cal currency, profits, customer counts, production volumes, or anything else worth tracking.
Here are some examples of multidi mensional requests:
Show total sales across all products at increasing aggregation levels for a geograph y dimension, from state to country to region, for 1999 and 2000.
Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals.
List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commission s.
All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and compa risons of data sets, often across time, geography or budgets.
To visualize data that has many dimensions, analysts commonly us e the analogy of a data cube, that is, a space where facts are stored at the intersection of n dimensions. Figure 20-1 shows a data cube and how it can be used differently by various groups. The cube stores sales data organized by the dimensi ons of product, market, sales, and time. Note that this is only a metaphor: the actual data is physically stored in normal tables. Th e cube data consists of both detail and aggregated data.
Not only multidimensional issues, but all types of processing can benefit from enhanced aggregation facilities. Transaction processing, financial and manufacturing systems— ;all of these generate large numbers of production reports needing substantial system resources. Improved efficiency when creating th ese reports will reduce system load. In fact, any computer process that aggregates data from details to higher levels will benefit fr om optimized aggregation performance.
These extensions provide aggregation features and bring many benefits, including:
Simplified programming requiring less SQL code for many tasks.
Quicker and more e fficient query processing.
Reduced client processing loads and network traffic because aggregation work is shifted to servers.
Opportunities for caching aggregations because similar queries can leverage exist ing work.
To illustrate
the use of the GROUP BY extension, this chapter uses the sh data of the sample schema. All the examples re
fer to data from this scenario. The hypothetical company has sales across the world and tracks sales by both dollars and quantities i
nformation. Because there are many rows of data, the queries shown here typically have tight constraints on their WHERE
clauses to limit the results to a small number of rows.
Example 20-1 Simple Cross-Tabular Report With Su btotals
Table 20-1 is a sample cross-tabular report showing the total sales by channel_desc for the US and France through the Internet and direct sales in September 2000.
Table 20-1 Simple Cross-Tabular Report With Subtotals
| Channel | Country | ||
|---|---|---|---|
| France | US | Total | |
| Internet | 9,597 | 124,224 | 133,821 |
| Direct Sales | 61,202 | 638,201 | 699,403 |
| Total | 70,799 | 762,425 | 833,224 |
Consider that even a simple report such as this, with just nine values in its grid, generates four
subtotals and a grand total. Half of the values needed for this report would not be calculated with a query that requested SUM
(amount_sold) and did a GROUP BY(channel_desc, country_id). To get the higher-level aggregates would require addi
tional queries. Database commands that offer improved calculation of subtotals bring major benefits to querying, reporting, and analy
tical operations.
SELECT channels.channel_desc, countries.country_iso_code,
TO_CHAR(SUM(amount_sold),
'9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=custom
ers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_desc IN
('Direct Sales', 'Internet') AND times.calenda
r_month_desc='2000-09'
AND customers.country_id=countries.country_id
AND countries.country_iso_code IN ('US','FR')
GROUP BY CUBE(
channels.channel_desc, countries.country_iso_code);
CHANNEL_DESC CO SALES$
-------------------- -- --------------
833,224
FR 70,799
US 762,425
Internet
133,821
Internet FR 9,597
Internet US 124,224
Direct Sales 699,403
Dir
ect Sales FR 61,202
Direct Sales US 638,201
NULLs returned by the GROUP BY extensions are not always the traditional null
meaning value unknown. Instead, a NULL may indicate that its row is a subtotal. To avoid introducing another non-value i
n the database system, these subtotal values are not given a special tag. See "GROUPING Functions" for detail
s on how the NULLs representing subtotals are distinguished from NULLs stored in the data.
ROLLUP enables a SELECT statement to ca
lculate multiple levels of subtotals across a specified group of dimensions. It also calculates a grand total. ROLLUP is
a simple extension to the GROUP BY clause, so its syntax is extremely easy to use. The ROLLUP
extension is highly efficient, adding minimal overhead to a query.
The action of ROLLUP is straightforward: it c
reates subtotals that roll up from the most detailed level to a grand total, following a grouping list specified in the ROLLUP<
/code> clause. ROLLUP takes as its argument an ordered list of grouping columns. First, it calculates the standard aggre
gate values specified in the GROUP BY clause. Then, it creates progressively higher-level subtotals, moving
from right to left through the list of grouping columns. Finally, it creates a grand total.
ROLLUP creates subto
tals at n+1 levels, where n is the number of grouping columns. For instance, if a query specifies ROLLUP on grouping col
umns of time, region, and department (n=3), the result set will include rows at four aggregati
on levels.
You might want to compress your data when using ROLLUP. This is particularly useful when there are few
updates to older partitions.
Use the ROLLUP exten
sion in tasks involving subtotals.
It is very helpful for subtotaling along a hierarchical dimension suc
h as time or geography. For instance, a query could specify a ROLLUP(y, m, day) or ROLLU
P(country, state, city).
For data warehouse administrators using summar
y tables, ROLLUP can simplify and speed up the maintenance of summary tables.
ROLLUP appears in the GROUP BY clause in a SELECT statement.
Its form is:
SELECT … GROUP BY ROLLUP(grouping_column_reference_list)
Example 20-2 ROLLUP
This example uses the data in the sh sample schema data, the same data as was used in Figure 20-1. The ROLLUP is across three dimensions.
SELECT channels.channel_desc,
calendar_month_desc, countries.country_iso_code,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, c
hannels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id
AND channels.channel_desc IN
('Direct Sales', 'Internet') AND times.calendar_month_desc IN
('2000-09', '2000-10') AND countries.c
ountry_iso_code IN ('GB', 'US')
GROUP BY ROLLUP(channels.channel_desc, calendar_month_desc,
countries.country_iso_code);
CHANNEL_DES
C CALENDAR CO SALES$
-------------------- -------- -- --------------
Internet 2000-09 GB 228,241
Internet
2000-09 US 228,241
Internet 2000-09 456,482
Internet 2000-10 GB 239,2
36
Internet 2000-10 US 239,236
Internet 2000-10 478,473
Internet
934,955
Direct Sales 2000-09 GB 1,217,808
Direct Sales 2000-09 US 1,217,808
Direct Sales 20
00-09 2,435,616
Direct Sales 2000-10 GB 1,225,584
Direct Sales 2000-10 US 1,225,584
Direct Sales
2000-10 2,451,169
Direct Sales 4,886,784
5,821,739
Note that results do not always add due to rounding.
This query returns the following sets of rows:
Regular aggregation rows that would be produced by GROUP BY wi
thout using ROLLUP.
First-level subtotals aggregating across country_id for ea
ch combination of channel_desc and calendar_month.
Second-level subtotals aggr
egating across calendar_month_desc and country_id for each channel_desc value.
A grand total row.
You can also roll up so that only some of the sub-totals will be included. This partial rollup uses the following syntax:
GROUP BY expr1, ROLLUP(expr2, expr3);
In this case, the GROUP BY clause
creates subtotals at (2+1=3) aggregation levels. That is, at level (expr1, expr2, expr3), (expr2), and (expr1).
Example 20-3 Partial ROLLUP
SELECT channel_desc, calendar_month_desc, countries.country_iso_code,
TO_CHAR(SUM(amount_s
old), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=c
ustomers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_desc IN
('Direct Sales', 'Internet') AND times.c
alendar_month_desc IN
('2000-09', '2000-10') AND countries.country_iso_code IN ('GB', 'US')
GROUP BY channel_desc, ROLLUP(calendar_
month_desc, countries.country_iso_code);
CHANNEL_DESC CALENDAR CO SALES$
-------------------- -------- -- --------------
Int
ernet 2000-09 GB 228,241
Internet 2000-09 US 228,241
Internet 2000-09
456,482
Internet 2000-10 GB 239,236
Internet 2000-10 US 239,236
Internet 2000-10
478,473
Internet 934,955
Direct Sales 2000-09 GB 1,217,808
Direct Sales
2000-09 US 1,217,808
Direct Sales 2000-09 2,435,616
Direct Sales 2000-10 GB 1,225,584
Direct
Sales 2000-10 US 1,225,584
Direct Sales 2000-10 2,451,169
Direct Sales 4,886,
784
This query returns the following sets of rows:
Regular ag
gregation rows that would be produced by GROUP BY without using ROLLUP.
First-level subtotals aggregating across country_id for each combination of channel_desc and c
alendar_month_desc.
Second-level subtotals aggregating across calendar_month_desc an
d country_id for each channel_desc value.
It does not produce a grand total ro w.
CUBE takes a specified set of grouping columns and creates subto
tals for all of their possible combinations. In terms of multidimensional analysis, CUBE generates all the subtotals tha
t could be calculated for a data cube with the specified dimensions. If you have specified CUBE(time, region, department), the result set will include all the values that would be included in an equivalent ROLLUP s
tatement plus additional combinations. For instance, in Figure 20-1, the departmental totals across regions (
279,000 and 319,000) would not be calculated by a ROLLUP(time, region, department) clause, but
they would be calculated by a CUBE(time, region, department) clause. If n columns ar
e specified for a CUBE, there will be 2 to the n combinations of subtotals returned. Exampl
e 20-4 gives an example of a three-dimension cube. See Oracl
e Database SQL Reference for syntax and restrictions.
Consider Using CUBE in any situation requiring cross-tabular reports. The data needed for cross-tabular reports can be gen
erated with a single SELECT using CUBE. Like ROLLUP, CUBE can be helpful in gener
ating summary tables. Note that population of summary tables is even faster if the CUBE query executes in parallel.
CUBE is typically most suitable in queries that use columns from multiple dimensions rather than columns representing
different levels of a single dimension. For instance, a commonly requested cross-tabulation might need subtotals for all the combina
tions of month, state, and product. These are three independent dimensions, and analysis of all possible subtotal combinations is com
monplace. In contrast, a cross-tabulation showing all possible combinations of year, month, and day would have several values of limi
ted interest, because there is a natural hierarchy in the time dimension. Subtotals such as profit by day of month summed across year
would be unnecessary in most analyses. Relatively few users need to ask "What were the total sales for the 16th of each month across
the year?" See "Hierarchy Handling in ROLLUP and CUBE" for an example of handling rollup calculations effici
ently.
CUBE appears in the GROUP BY clause
in a SELECT statement. Its form is:
SELECT … GROUP BY CUBE (grouping_column_refere nce_list)
Example 20-4 CUBE
SELECT channel_desc, c
alendar_month_desc, countries.country_iso_code,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times
, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel
_id AND channels.channel_desc IN
('Direct Sales', 'Internet') AND times.calendar_month_desc IN
('2000-09', '2000-10') AND countri
es.country_iso_code IN ('GB', 'US')
GROUP BY CUBE(channel_desc, calendar_month_desc, countries.country_iso_code);
CHANNEL_DESC
CALENDAR CO SALES$
-------------------- -------- -- --------------
5,821,739
GB 2,910,870
US 2,910,870
2000-09 2,892,098
2000-09 GB 1,446,049
2000-09 US 1,446,049
2000-10 2,9
29,641
2000-10 GB 1,464,821
2000-10 US 1,464,821
Internet
934,955
Internet GB 467,478
Internet US 467,478
Internet
2000-09 456,482
Internet 2000-09 GB 228,241
Internet 2000-09 US 228,241
Internet
2000-10 478,473
Internet 2000-10 GB 239,236
Internet 2000-10 US 239,2
36
Direct Sales 4,886,784
Direct Sales GB 2,443,392
Direct Sales US
2,443,392
Direct Sales 2000-09 2,435,616
Direct Sales 2000-09 GB 1,217,808
Direct Sales 20
00-09 US 1,217,808
Direct Sales 2000-10 2,451,169
Direct Sales 2000-10 GB 1,225,584
Direct Sales
2000-10 US 1,225,584
This query illustrates CUBE aggregation across three dimensions.
Partial C
UBE resembles partial ROLLUP in that you can limit it to certain dimensions and precede it with columns outside t
he CUBE operator. In this case, subtotals of all possible combinations are limited to the dimensions within the cube lis
t (in parentheses), and they are combined with the preceding items in the GROUP BY list.
The syntax
for partial CUBE is as follows:
GROUP BY expr1, CUBE(expr2, expr3)
This syntax example calculates 2*2, or 4, subtotals. That is:
(expr1, expr2, expr3<
/code>)
(expr1, expr2)
(expr1, expr3)
(expr1)
Example 20-5 Partial CUBE
Using the
SELECT channel_desc, calendar_month_d
esc, countries.country_iso_code,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, cou
ntries
WHERE sales.time_id = times.time_id AND sales.cust_id = customers.cust_id AND
sales.channel_id = channels.channel_id AND cha
nnels.channel_desc IN
('Direct Sales', 'Internet') AND times.calendar_month_desc IN
('2000-09', '2000-10') AND countries.country_is
o_code IN ('GB', 'US')
GROUP BY channel_desc, CUBE(calendar_month_desc, countries.country_iso_code);
CHANNEL_DESC CALENDAR C
O SALES$
-------------------- -------- -- --------------
Internet 934,955
Internet
GB 467,478
Internet US 467,478
Internet 2000-09 456,482
Internet
2000-09 GB 228,241
Internet 2000-09 US 228,241
Internet 2000-10 478,473
Intern
et 2000-10 GB 239,236
Internet 2000-10 US 239,236
Direct Sales 4,886
,784
Direct Sales GB 2,443,392
Direct Sales US 2,443,392
Direct Sales 2000-09
2,435,616
Direct Sales 2000-09 GB 1,217,808
Direct Sales 2000-09 US 1,217,808
Direct Sales
2000-10 2,451,169
Direct Sales 2000-10 GB 1,225,584
Direct Sales 2000-10 US 1,225,584
Just as for ROLLUP,
multiple SELECT statements combined with UNION ALL statements could provide the same informati
on gathered through CUBE. However, this might require many SELECT statements. For an n-dimensional cube, 2
to the n SELECT statements are needed. In the three-dimension example, this would mean issuing SELECT statements linked with UNION ALL. So many SELECT statements yield inefficient processing a
nd very lengthy SQL.
Consider the impact of adding just one more dimension when calculating all possible combinations: the num
ber of SELECT statements would double to 16. The more columns used in a CUBE clause, the greater the saving
s compared to the UNION ALL approach.
Two challenges arise with the use of ROLLUP and CUBE. First, how can you programmatically determine
which result set rows are subtotals, and how do you find the exact level of aggregation for a given subtotal? You often need to use s
ubtotals in calculations such as percent-of-totals, so you need an easy way to determine which rows are the subtotals. Second, what h
appens if query results contain both stored NULL values and "NULL" values created by a ROLLUP or CUBE
? How can you differentiate between the two? See Oracle D
atabase SQL Reference for syntax and restrictions.
GROUPING handles these problems.
Using a single column as its argument, GROUPING returns 1 when it encounters a NULL value created by a CUBE operation. That is, if the NULL indicates the row is a subtotal, GROUPING returns a 1. Any other type of value, including a stored NULL, returns a 0.
GROUPING appears in
the selection list portion of a SELECT statement. Its form is:
SELECT … [GROUPING
(dimension_column)…] …
GROUP BY … {CUBE | ROLLUP| GROUPING SETS} (dimension_column)
Example 20-6 GROUPING to Mask Columns
This example uses GROUPING to create a set of mask columns for the result set s
hown in Example 20-3. The mask columns are easy to analyze programmatically.
SE
LECT channel_desc, calendar_month_desc, country_iso_code,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$, GROUPING(channel_desc)
AS Ch,
GROUPING(calendar_month_desc) AS Mo, GROUPING(country_iso_code) AS Co
FROM sales, customers, times, channels, countries
WHE
RE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_d
esc IN
('Direct Sales', 'Internet') AND times.calendar_month_desc IN
('2000-09', '2000-10') AND countries.country_iso_code IN ('GB'
, 'US')
GROUP BY ROLLUP(channel_desc, calendar_month_desc, countries.country_iso_code);
CHANNEL_DESC CALENDAR CO SALES$
CH MO CO
-------------------- -------- -- -------------- ---------- ---------- ----------
Internet
2000-09 GB 228,241 0 0 0
Internet 2000-09 US 228,241 0
0 0
Internet 2000-09 456,482 0 0 1
Internet 2000-10 GB
239,236 0 0 0
Internet 2000-10 US 239,236 0 0 0
Internet
2000-10 478,473 0 0 1
Internet 934,955 0
1 1
Direct Sales 2000-09 GB 1,217,808 0 0 0
Direct Sales 2000-09 US
1,217,808 0 0 0
Direct Sales 2000-09 2,435,616 0 0 1
Dir
ect Sales 2000-10 GB 1,225,584 0 0 0
Direct Sales 2000-10 US 1,225,584
0 0 0
Direct Sales 2000-10 2,451,169 0 0 1
Direct Sales
4,886,784 0 1 1
5,821,739 1 1
1
A program can easily identify the detail rows by a mask of "0 0 0" on the T, R, and D columns. The first level subtota l rows have a mask of "0 0 1", the second level subtotal rows have a mask of "0 1 1", and the overall total row has a mask of "1 1 1" .
You can improve the readability of result sets by using the GROUPING and DECODE functions as shown
in Example 20-7.
Example 20-7 GROUPING For Reada bility
SELECT DECODE(GROUPING(channel_desc), 1, 'Multi-channel sum', channel_desc)
AS
Channel, DECODE (GROUPING (country_iso_code), 1, 'Multi-country sum',
country_iso_code) AS Country, TO_CHAR(SUM(amount_sold), '
9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customer
s.cust_id
AND sales.channel_id= channels.channel_id AND channels.channel_desc
IN ('Direct Sales', 'Internet') AND times.calendar_m
onth_desc= '2000-09'
AND country_iso_code IN ('GB', 'US')
GROUP BY CUBE(channel_desc, country_iso_code);
CHANNEL COU
NTRY SALES$
-------------------- ----------------- --------------
Multi-channel sum Multi-country sum 2,892,098
Mul
ti-channel sum GB 1,446,049
Multi-channel sum US 1,446,049
Internet Multi-c
ountry sum 456,482
Internet GB 228,241
Internet US 228,241
Direct Sales Multi-country sum 2,435,616
Direct Sales GB 1,217,808
Direct Sales US
1,217,808
To understand the previous statement, note its first column specification, which handles th e channel_desc column. Consider the first line of the previous statement:
SELECT DECODE(GROUPING(channe l_desc), 1, 'All Channels', channel_desc)AS Channel
In this, the channel_desc value is determined with a GROUPING function. The GROUPING function returns a 1 if a row val
ue is an aggregate created by ROLLUP or CUBE, otherwise it returns a 0. The DECODE function th
en operates on the GROUPING function's results. It returns the text "All Channels" if it receives a 1 and the chan
nel_desc value from the database if it receives a 0. Values from the database will be either a real value such as "Internet" o
r a stored NULL. The second column specification, displaying country_id, works the same way.
The GROU
PING function is not only useful for identifying NULLs, it also enables sorting subtotal rows and filtering resul
ts. In Example 20-8, you retrieve a subset of the subtotals created by a CUBE and none of the ba
se-level aggregations. The HAVING clause constrains columns that use GROUPING functions.
Example 20-8 GROUPING Combined with HAVING
SELECT channel_desc, calenda
r_month_desc, country_iso_code, TO_CHAR(
SUM(amount_sold), '9,999,999,999') SALES$, GROUPING(channel_desc) CH, GROUPING
(calendar_m
onth_desc) MO, GROUPING(country_iso_code) CO
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND
sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_desc IN
('Direct Sales', 'Internet
') AND times.calendar_month_desc IN
('2000-09', '2000-10') AND country_iso_code IN ('GB', 'US')
GROUP BY CUBE(channel_desc, calendar
_month_desc, country_iso_code)
HAVING (GROUPING(channel_desc)=1 AND GROUPING(calendar_month_desc)= 1 AND
GROUPING(country_iso_code)
=1) OR (GROUPING(channel_desc)=1 AND
GROUPING (calendar_month_desc)= 1) OR (GROUPING(country_iso_code)=1
AND GROUPING(calendar_mo
nth_desc)= 1);
CHANNEL_DESC C CO SALES$ CH MO CO
-------------------- - -- -------------- --
-------- ---------- ----------
GB 2,910,870 1 1 0
US
2,910,870 1 1 0
Direct Sales 4,886,784 0 1 1
Internet
934,955 0 1 1
5,821,739 1 1
1
Compare the result set of Example 20-8 with that in Example 20-3 to see how Example 20-8 is a precisely specified group: it contains only the yearly totals, regional totals aggrega ted over time and department, and the grand total.
To find the GROUP BY
code> level of a particular row, a query must return columns. If we do this using the GROUPING function information for each of the GROUP GROUPING function, every GROUP BY column requir
es another column using the GROUPING function. For instance, a four-column GROUP BY clause nee
ds to be analyzed with four GROUPING functions. This is inconvenient to write in SQL and increases the number of columns
required in the query. When you want to store the query result sets in tables, as with materialized views, the extra columns waste s
torage space.
To address these problems, you can use the GROUPING_ID function. GROUPING_ID returns a
single number that enables you to determine the exact GROUP BY level. For each row, GROUPING_ID takes the set of 1's and 0's that would be generated if you used the appropriate GROUPING functions and concatenates
them, forming a bit vector. The bit vector is treated as a binary number, and the number's base-10 value is returned by the GRO
UPING_ID function. For instance, if you group with the expression CUBE(a, b) the possible values are
as shown in Table 20-2.
Table 20-2 GROUPING_ID Example for CUBE(a, b)< /em>
| Bit Vector | GROUPING_ID | |
|---|---|---|
| a, b | 0 0 | 0 |
| a | 0 1 | 1 |
| b | 1 0 | 2 |
| Grand Total | 1 1 | 3 |
GROUPING_ID clearly distinguishes groupings created by groupin
g set specification, and it is very useful during refresh and rewrite of materialized views.
While the extensions to GROU
P BY offer power and flexibility, they also allow complex result sets that can include duplicate groupings. The <
code>GROUP_ID function lets you distinguish among duplicate groupings. If there are multiple sets of rows calculated for a giv
en level, GROUP_ID assigns the value of 0 to all the rows in the first set. All other sets of duplicate rows for a parti
cular grouping are assigned higher values, starting with 1. For example, consider the following query, which generates a duplicate gr
ouping:
Example 20-9 GROUP_ID
SELECT country_iso_code, SUBSTR(cust_state_province,1,12), S
UM(amount_sold),
GROUPING_ID(country_iso_code, cust_state_province) GROUPING_ID, GROUP_ID()
FROM sales, customers, times, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id
AND customers.country_id=countries.country_id AND times.tim
e_id= '30-OCT-00'
AND country_iso_code IN ('FR', 'ES')
GROUP BY GROUPING SETS (country_iso_code,
ROLLUP(country_iso_code, cust_stat
e_province));
CO SUBSTR(CUST_ SUM(AMOUNT_SOLD) GROUPING_ID GROUP_ID()
-- ------------ ---------------- ----------- ----------
ES Ali
cante 135.32 0 0
ES Valencia 4133.56 0 0
ES Barcelona
24.22 0 0
FR Centre 74.3 0 0
FR Aquitaine 231.97 0
0
FR Rhtne-Alpes 1624.69 0 0
FR Ile-de-Franc 1860.59 0 0
FR Languedoc-
Ro 4287.4 0 0
12372.05 3 0
ES 4293.1
1 0
FR 8078.95 1 0
ES 4293.1 1 1
FR 8078.95 1 1
This query generates the following groupings: (country_id<
/code>, ) is repeated twice. The syntax for cust_state_province), (country_id), (country_id), and (). Note that the grouping (GROUPING SETS is explained in "G
ROUPING SETS Expression".
This function helps you filter out duplicate groupings from the result. For example, you can fil
ter out duplicate (region) groupings from the previous example by adding a HAVING clause condition GR
OUP_ID()=0 to the query.
You can selectively specify
the set of groups that you want to create using a GROUPING SETS expression within a GROUP CUBE. For e
xample, you can say:
SELECT channel_desc, calendar_month_desc, country_iso_code,
TO_CHAR(SUM(amo
unt_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust
_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_desc IN
('Direct Sales', 'Internet') AND time
s.calendar_month_desc IN
('2000-09', '2000-10') AND country_iso_code IN ('GB', 'US')
GROUP BY GROUPING SETS((channel_desc, calendar_
month_desc, country_iso_code),
(channel_desc, country_iso_code), (calendar_month_desc, country_iso_code));
Note that this statement uses composite columns, described in "Composite Columns". This statement calculates aggregate s over three groupings:
(channel_desc, calendar_month_desc, country_iso_code)
(channel_desc, country_iso_code)
(calendar_month_desc, country_iso_co
de)
Compare the previous statement with the following alternative, which uses the CUBE operation
and the GROUPING_ID function to return the desired rows:
SELECT channel_desc, calendar_mo
nth_desc, country_iso_code,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$,
GROUPING_ID(channel_desc, calendar_month
_desc, country_iso_code) gid
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=cu
stomers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_desc IN
('Direct Sales', 'Internet') AND times.cale
ndar_month_desc IN
('2000-09', '2000-10') AND country_iso_code IN ('GB', 'US')
GROUP BY CUBE(channel_desc, calendar_month_desc, coun
try_iso_code)
HAVING GROUPING_ID(channel_desc, calendar_month_desc, country_iso_code)=0
OR GROUPING_ID(channel_desc, calendar_month
_desc, country_iso_code)=2
OR GROUPING_ID(channel_desc, calendar_month_desc, country_iso_code)=4;
This statement comput es all the 8 (2 *2 *2) groupings, though only the previous 3 groups are of interest to you.
Another alternative is the followi
ng statement, which is lengthy due to several unions. This statement requires three scans of the base table, making it inefficient. <
code>CUBE and ROLLUP can be thought of as grouping sets with very specific semantics. For example, consider the f
ollowing statement:
CUBE(a, b, c)
This statement is equivalent to:
GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ()) ROLLUP(a, b, c)
And this statement is equivale nt to:
GROUPING SETS ((a, b, c), (a, b), ())
GROUPING GROUP BY computes all the grou
pings specified and combines them with UNION ALL. For example, consider the following statement:
GROUP BY GROUPING sets (channel_desc, calendar_month_desc, country_id )
This statement is equivalent to:
GROUP BY channel_desc UNION ALL GROUP BY calendar_month_desc UNION ALL GROUP BY country_id
Table 20-3 shows grouping sets specification and equivalent GROUP BY specificat
ion. Note that some examples use composite columns.
Table 20-3 GROUPING SETS Statements and Equivalent GROUP BY
| Equivalent GROUP BY Statement | |
|---|---|
GROUP BY GROUPING SETS(a, b, c) |
GR
OUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY c |
GROUP BY GROUPING SETS(a, b, (b, c)) |
GROUP BY a UNIO
N ALL GROUP BY b UNION ALL GROUP BY b, c |
GROUP BY GROUPING SETS((a, b, c)) |
GROUP BY a, b, c |
GROUP BY GROUPING SETS(a, (b), ())<
/td>
| GROUP BY a UNION ALL GROUP BY b UNION ALL GROUP BY () |
GROUP BY GROUPING SETS(a, ROLLUP(b, c)) |
GROUP BY a UNION ALL GROUP BY ROLLUP(b, c) |
In the absence of an optimizer that looks across query blocks to generate the execution plan, a query
based on UNION would need multiple scans of the base table, sales. This could be very inefficient as fact tables will no
rmally be huge. Using GROUPING SETS statements, all the groupings of interest are available in the same que
ry block.
A composite column is a collection of columns that are treated as a unit during the compu tation of groupings. You specify the columns in parentheses as in the following statement:
ROLLUP (year , (quarter, month), day)
In this statement, the data is not rolled up across year and quarter, but is instead equivalent
to the following groupings of a UNION ALL:
(year, quarter
code>, month, day),
(year, quarter, month),
(year)
()
Here, (quarter, <
code>month) form a composite column and are treated as a unit. In general, composite columns are useful in ROLLUP
, CUBE, GROUPING SETS, and concatenated groupings. For example, in CUBE or
ROLLUP, composite columns would mean skipping aggregation across certain levels. That is, the following statement:
GROUP BY ROLLUP(a, (b, c))
This is equivalent to:
GROUP BY a, b, c UNION ALL GROUP BY a UNION ALL GROUP BY ()
Here, (b, c) are treated as a unit and rollup will not be
applied across (b, c). It is as if you have an alias, for example z, for (b, c)
and the GROUP BY expression reduces to GROUP BY ROLLUP(a, z)
code>. Compare this with the normal rollup as in the following:
GROUP BY ROLLUP(a, b, c)
Th is would be the following:
GROUP BY a, b, c UNION ALL GROUP BY a, b UNION ALL GROUP BY a UNION ALL GROU P BY ().
Similarly, the following statement is equivalent to the four GROUP BYs:
GROUP BY CUBE((a, b), c) GROUP BY a, b, c UNION ALL GROUP BY a, b UNION ALL GROUP BY c UNION ALL GROUP By ()
In GROUPING SETS, a composite column is used to denote a particular level of GROUP
BY. See Table 20-3 for more examples of composite columns.
Example 20-10 Composite Columns
You do not have full control over what aggregation levels you want with CUBE and ROLLUP. For examp
le, the following statement:
SELECT channel_desc, calendar_month_desc, country_iso_code,
TO_CHAR(SUM(a
mount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cu
st_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_desc IN ('Direct Sales', 'Internet')
AND time
s.calendar_month_desc IN ('2000-09', '2000-10')
AND country_iso_code IN ('GB', 'US')
GROUP BY ROLLUP(channel_desc, calendar_month_de
sc, country_iso_code);
This statement results in Oracle computing the following groupings:
(channel_desc, calendar_month_desc, country_iso_code)
(channel_desc, calendar_month_desc)
(channel_desc)
If you are just interested in grouping of lines (1), (3) and (4) in this example, you cannot limit the c
alculation to those groupings without using composite columns. With composite columns, this is possible by treating month and country
as a single unit while rolling up. Columns enclosed in parentheses are treated as a unit while computing CUBE and ROLLUP. Thus, you would say:
SELECT channel_desc, calendar_month_desc, country_iso_code,
TO
_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id=times.time_id A
ND sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND channels.channel_desc IN
('Direct Sales', 'I
nternet') AND times.calendar_month_desc IN
('2000-09', '2000-10') AND country_iso_code IN ('GB', 'US')
GROUP BY ROLLUP(channel_desc,
(calendar_month_desc, country_iso_code));
CHANNEL_DESC CALENDAR CO SALES$
-------------------- -------- -- --------------
I
nternet 2000-09 GB 228,241
Internet 2000-09 US 228,241
Internet 2000-10 GB
239,236
Internet 2000-10 US 239,236
Internet 934,955
Direct Sales 2000-0
9 GB 1,217,808
Direct Sales 2000-09 US 1,217,808
Direct Sales 2000-10 GB 1,225,584
Direct Sales
2000-10 US 1,225,584
Direct Sales 4,886,784
5,821,739
Concatenated groupings offer a concise way to generate useful combinations of groupings. Groupings specified with concatenate d groupings yield the cross-product of groupings from each grouping set. The cross-product operation enables even a small number of c oncatenated groupings to generate a large number of final groups. The concatenated groupings are specified simply by listing multiple grouping sets, cubes, and rollups, and separating them with commas. Here is an example of concatenated grouping sets:
GROUP BY GROUPING SETS(a, b), GROUPING SETS(c, d)
This SQL defines the following groupings:
(a, c), (a, d), (b, c), (b, d)
Concatenation of grouping sets is very helpful for these reasons:
Ease of query development
You need not enumerate all groupings manually.
Us e by applications
SQL generated by OLAP applications often involves concatenation of grouping sets, with each grouping set def ining groupings needed for a dimension.
Example 20-11 Concatenated Groupings
You can also specify more
than one grouping in the GROUP BY clause. For example, if you want aggregated sales values for each product
rolled up across all levels in the time dimension (year, month and day), and acr
oss all levels in the geography dimension (region), you can issue the following statement:
SELECT channel_desc, calendar_year, calendar_quarter_desc, country_iso_code,
cust_state_province, TO_CHAR(SUM(amount_
sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id = times.time_id AND sales.cust_i
d = customers.cust_id
AND sales.channel_id = channels.channel_id AND countries.country_id =
customers.country_id AND channels.c
hannel_desc IN
('Direct Sales', 'Internet') AND times.calendar_month_desc IN ('2000-09',
'2000-10') AND countries.country_iso_cod
e IN ('GB', 'FR')
GROUP BY channel_desc, GROUPING SETS (ROLLUP(calendar_year,
calendar_quarter_desc),
ROLLUP(country_iso_code, cus
t_state_province));
This results in the following groupings:
(channel_desc, calendar_year, calendar_quarter_desc)
(chann
el_desc, calendar_year)
(channel_desc)
(country_iso_code, cust_state_province)
(channel
_desc, country_iso_code)
(channel_desc)
This is the cr oss-product of the following:
The expression, channel_desc
calendar_year, calendar_quarter_desc), which is equivalent to ((calendar_year
, calendar_quarter_desc), (calendar_year), ())
ROLLUP(country_iso_code,
cust_state_province), which is equivalent to ((country_iso_code, cust_state_province), (countr
y_iso_code), ())
Note that the output contains two occurrences of (channel_desc) group. To filte
r out the extra (channel_desc) group, the query could use a GROUP_ID function.
Another concatenated join example is the following, showing the cross product of two grouping sets:
Example 20-12 Concatenated Groupings (Cross-Product of Two Gro uping Sets)
SELECT country_iso_code, cust_state_province, calendar_year, calendar_
quarter_desc,
TO_CHAR(SUM(amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries
WHERE sales.time_id
=times.time_id AND sales.cust_id=customers.cust_id AND
countries.country_id=customers.country_id AND
sales.channel_id= channels.c
hannel_id AND channels.channel_desc IN
('Direct Sales', 'Internet') AND times.calendar_month_desc IN
('2000-09', '2000-10') AND cou
ntry_iso_code IN ('GB', 'FR')
GROUP BY GROUPING SETS (country_iso_code, cust_state_province),
GROUPING SETS (calendar_year,
calendar_quarter_desc);
This statement results in the computation of groupings:
(country_iso_code, year), (country_iso_code, calendar_quarter_desc), (cust_state_province, year) and (cust_state_province, calendar_quarter_desc)
One of the most important uses for concatenated groupings is to generate the aggregates needed for a hierarchical cube of data. A hierarchical cube is
a data set where the data is aggregated along the rollup hierarchy of each of its dimensions and these aggregations are combined acro
ss dimensions. It includes the typical set of aggregations needed for business intelligence queries. By using concatenated groupings,
you can generate all the aggregations needed by a hierarchical cube with just n ROLLUPs (where n is t
he number of dimensions), and avoid generating unwanted aggregations.
Consider just three of the dimensions in the sh sample schema data set, each of which has a multilevel hierarchy:
time: year, qu
arter, month, day (week is in a separate hierarchy)
produc
t: category, subcategory, prod_name
geography: region, subregion, country, state, city
This data is represented usin g a column for each level of the hierarchies, creating a total of twelve columns for dimensions, plus the columns holding sales figur es.
For our business intelligence needs, we would like to calculate and store certain aggregates of the various combinations o
f dimensions. In Example 20-13, we create the aggregates for all levels, except for "day", which would create
too many rows. In particular, we want to use ROLLUP within each dimension to generate useful aggregates. Once we have t
he ROLLUP-based aggregates within each dimension, we want to combine them with the other dimensions. This will generate
our hierarchical cube. Note that this is not at all the same as a CUBE using all twelve of the dimension columns: that w
ould create 2 to the 12th power (4,096) aggregation groups, of which we need only a small fraction. Concatenated grouping sets make i
t easy to generate exactly the aggregations we need. Example 20-13 shows where a GROUP BY<
/code> clause is needed.
Example 20-13 Concatenated Groupings and Hierarchical Cubes
SELECT calendar_year, calendar_quarter_desc, calendar_month_desc,
country_region, country_subregion
, countries.country_iso_code, cust_state_province,
cust_city, prod_category_desc, prod_subcategory_desc, prod_name,
TO_CHAR(SUM (
amount_sold), '9,999,999,999') SALES$
FROM sales, customers, times, channels, countries, products
WHERE sales.time_id=times.time_id A
ND sales.cust_id=customers.cust_id AND
sales.channel_id= channels.channel_id AND sales.prod_id=products.prod_id AND
customers.cou
ntry_id=countries.country_id AND channels.channel_desc IN
('Direct Sales', 'Internet') AND times.calendar_month_desc IN
('2000-09',
'2000-10') AND prod_name IN ('Envoy Ambassador',
'Mouse Pad') AND countries.country_iso_code IN ('GB', 'US')
GROUP BY ROLLUP(calend
ar_year, calendar_quarter_desc, calendar_month_desc),
ROLLUP(country_region, country_subregion, countries.country_iso_code,
cust_state_province, cust_city),
ROLLUP(prod_category_desc, prod_subcategory_desc, prod_name);
The ROLLUPs in the GROUP BY specification generate the following groups, four for each dimension.
| ROLLUP By Product | ||
|---|---|---|
year, quarter, month |
category, subcategory, name |
region, subregion, country, state,
city
|
year, quarter |
category, subcategory |
region, subregion |
tr>
year |
category |
region |
all times |
all products |
all geographies |
The concatenated grouping sets specified in the previous SQL will take the ROLLUP
aggregations listed in the table and perform a cross-product on them. The cross-product will create the 96 (4x4x6) aggregate
groups needed for a hierarchical cube of the data. There are major advantages in using three ROLLUP expressions to repla
ce what would otherwise require 96 grouping set expressions: the concise SQL is far less error-prone to develop and far easier to mai
ntain, and it enables much better query optimization. You can picture how a cube with more dimensions and more levels would make the
use of concatenated groupings even more advantageous.
See "Working with Hierarchical Cubes in SQL" for more information regarding hierarchical cubes.
This section discusses the following topics.
The ROLLUP and CUBE extensions work independently of any hierarchy metadata in your syst
em. Their calculations are based entirely on the columns specified in the SELECT statement in which they appear. This ap
proach enables CUBE and ROLLUP to be used whether or not hierarchy metadata is available. The simplest way
to handle levels in hierarchical dimensions is by using the ROLLUP extension and indicating levels explicitly through se
parate columns. The following code shows a simple example of this with months rolled up to quarters and quarters rolled up to years.<
/p>
Example 20-14 ROLLUP and CUBE Hierarchy Handling
SELECT calendar_year, calendar_quarter_nu
mber,
calendar_month_number, SUM(amount_sold)
FROM sales, times, products, customers, countries
WHERE sales.time_id=times.time_id
AND sales.prod_id=products.prod_id AND
sales.cust_id=customers.cust_id AND prod_name IN ('Envoy Ambassador',
'Mouse Pad') AND cou
ntry_iso_code = 'GB' AND calendar_year=1999
GROUP BY ROLLUP(calendar_year, calendar_quarter_number, calendar_month_number);
CALENDAR
_YEAR CALENDAR_QUARTER_NUMBER CALENDAR_MONTH_NUMBER SUM(AMOUNT_SOLD)
------------- ----------------------- --------------------- ----
------------
1999 1 1 168419.74
1999 1
2 332348.02
1999 1 3 169511.52
1999
1 670279.28
1999 2 4 247291.88
1999 2 5 182338.86
1999 2
6 264493.91
1999 2 694124.65
1999
3 7 192268.11
1999 3 8 182550.88
1999
3 9 270309.26
1999 3 6451
28.25
1999 4 10 180400.32
1999 4
11 232830.87
1999 4 12 168008.07
1999
4 581239.26
1999 2590771.44
<
/div>
CUBE, ROLLUP, and GROUPING SETS do not restrict the GROUP GROUP BY clause, with or without the extensions, can work with up
to 255 columns. However, the combinatorial explosion of CUBE makes it unwise to specify a large number of columns with
the CUBE extension. Consider that a 20-column list for CUBE would create 2 to the 20 combinations in the re
sult set. A very large CUBE list could strain system resources, so any such query needs to be tested carefully for perfo
rmance and the load it places on the system.
The HAVING clause of SELECT statements is unaffected by the use of
GROUP BY. Note that the conditions specified in the HAVING clause apply to both the subtotal and non
-subtotal rows of the result set. In some cases a query may need to exclude the subtotal rows or the non-subtotal rows from the HAVING clause. This can be achieved by using a GROUPING or GROUPING_ID function together with the <
code>HAVING clause. See Example 20-8 and its associated SQL statement for an example.
In many cases, a query ne
eds to order the rows in a certain way, and this is done with the ORDER BY clause. The ORDER <
code>BY clause of a SELECT statement is unaffected by the use of GROUP BY, since the BY clause is applied after the GROUP BY calculations are complete.
Not
e that the ORDER BY specification makes no distinction between aggregate and non-aggregate rows of the resu
lt set. For instance, you might wish to list sales figures in declining order, but still have the subtotals at the end of each group.
Simply ordering sales figures in descending sequence will not be sufficient, since that will place the subtotals (the largest values
) at the start of each group. Therefore, it is essential that the columns in the ORDER BY clause include co
lumns that differentiate aggregate from non-aggregate columns. This requirement means that queries using ORDER BY<
/code> along with aggregation extensions to GROUP BY will generally need to use one or more of the GR
OUPING functions.
The examples in this chapter show ROLLUP and CUBE used with the SUM fun
ction. While this is the most common type of aggregation, these extensions can also be used with all other functions available to the
GROUP BY clause, for example, COUNT, AVG, MIN, MAX, VARIANCE. COUNT, which is often needed in cross-tabular analyses, is likely to be th
e second most commonly used function.
The WITH clause (formally known as subquery_factoring_clause
) enables you to reuse the same query block in a SELECT statement when it occurs more than once within a complex query.
WITH is a part of the SQL-99 standard. This is particularly useful when a query has multiple references to the same quer
y block and there are joins and aggregations. Using the WITH clause, Oracle retrieves the results of a query block and s
tores them in the user's temporary tablespace. Note that Oracle Database does not support recursive use of the WITH clau
se.
The following query is an example of where you can improve performance and write SQL more simply by using the WITH
code> clause. The query calculates the sum of sales for each channel and holds it under the name channel_summary. Then i
t checks each channel's sales total to see if any channel's sales are greater than one third of the total sales. By using the W
ITH clause, the channel_summary data is calculated just once, avoiding an extra scan through the large sales tabl
e.
Examp le 20-15 WITH Clause
WITH channel_summary AS (SELECT channels.channel_desc, SUM(am ount_sold) AS channel_total FROM sales, channels WHERE sales.channel_id = channels.channel_id GROUP BY channels.channel_desc) SELECT channel_desc, channel_total FROM channel_summary WHERE channel_total > (SELECT SUM(channel_total) * 1/3 FROM channel_summary); CH ANNEL_DESC CHANNEL_TOTAL -------------------- ------------- Direct Sales 57875260.6
Note that this exa mple could also be performed efficiently using the reporting aggregate functions described in Chapter 21, " SQL for Analysis and Reporting".
This section illustrates example s of working with hierarchical cubes.
Oracle
Database can specify hierarchical cubes in a simple and efficient SQL query. These hierarchical cubes represent the logical cubes ref
erred to in many OLAP products. To specify data in the form of hierarchical cubes, you can use one of the extensions to the GRO
UP BY clause, concatenated grouping sets, to generate the aggregates needed for a hierarchical cube of data. By u
sing concatenated rollup (rolling up along the hierarchy of each dimension and then concatenate them across multiple dimensions), you
can generate all the aggregations needed by a hierarchical cube.
Example 20-16 Concatenated ROLLUP
The following
shows the GROUP BY claus
e needed to create a hierarchical cube for a 2-dimensional example similar to 2013. The following simple syntax performs a concatenat
ed rollup:
GROUP BY ROLLUP(year, quarter, month), ROLLUP(Division, brand, item)
This concat
enated rollup takes the ROLLUP aggregations similar to those listed in Table 20-4, "Hierarchical CUB
E Example" in the prior section and performs a cross-product on them. The cross-product will create the 16 (4x4) aggregate groups
needed for a hierarchical cube of the data.
Analytic applications treat data as cubes, but they want only certain slices and regions of the cube. Concatenated rollup (hierarchical cube) enables relational data to be treated as cubes. To handle complex analytic queries, the fundamental techni que is to enclose a hierarchical cube query in an outer query that specifies the exact slice needed from the cube. Oracle Database op timizes the processing of hierarchical cubes nested inside slicing queries. By applying many powerful algorithms, these queries can b e processed at unprecedented speed and scale. This enables OLAP tools and analytic applications to use a consistent style of queries to handle the most complex questions.
Example 20-17 Hierarchical Cube Query
Consider the following analytic query . It consists of a hierarchical cube query nested in a slicing query.
SELECT month, division, sum_sales
FROM
(SELECT year, quarter, month, division, brand, item, SUM(sales) sum_sales,
GROUPING_ID(grouping-columns) gid
FROM sales, products, time
WHERE join-condition
GROUP BY ROLLUP(year, quarter, month),
ROLLUP(division,
brand, item))
WHERE division = 25 AND month = 200201 AND gid = gid-for-Division-Month;
The inner hierarchical cu
be specified defines a simple cube, with two dimensions and four levels in each dimension. It would generate 16 groups (4 Time levels
* 4 Product levels). The GROUPING_ID function in the query identifies the specific group each row belongs to, based on
the aggregation level of the grouping-columns in its argument.
The outer query applies the constraints needed for our
specific query, limiting Division to a value of 25 and Month to a value of 200201 (representing January 2002 in this case). In conce
ptual terms, it slices a small chunk of data from the cube. The outer query's constraint on the GID column, indicated in
the query by gid-for-division-month would be the value of a key indicating that the data is grouped as a combination of month. The GID constraint selects only those rows that are aggregated at the level o
f a GROUP BY month, division clause.
Oracle Database removes unneeded aggregation groups from query
processing based on the outer query conditions. The outer conditions of the previous query limit the result set to a single group agg
regating division and month. Any other groups involving year, month, brand<
/code>, and item are unnecessary here. The group pruning optimization recognizes this and transforms the query into:
SELECT month, division, sum_sales
FROM (SELECT null, null, month, divi
sion, null, null, SUM(sales) sum_sales,
GROUPING_ID(grouping-columns) gid
FROM
sales, products, time WHERE join-condition
GROUP BY month, division)
WHERE division = 25
AND month = 200201 AND gid = gid-for-Division-Month;
The bold items highlight the changed SQL. The inner query n
ow has a simple GROUP BY clause of month, division. The columns year
, quarter, brand and item have been converted to null to match the simplified GROUP BY clause. Because the query now requests just one group, fifteen out of sixteen groups are removed from the processi
ng, greatly reducing the work. For a cube with more dimensions and more levels, the savings possible through group pruning can be far
greater. Note that the group pruning transformation works with all the GROUP BY extensions: ROLLUP
code>, CUBE, and GROUPING SETS.
While the optimizer has simplified the previous query t
o a simple GROUP BY, faster response times can be achieved if the group is precomputed and stored in a mate
rialized view. Because OLAP queries can ask for any slice of the cube many groups may need to be precomputed and stored in a material
ized view. This is discussed in the next section.
OLAP requires fast response times for multiple users, and this in turn demands that significant parts of an OLA P cube be precomputed and held in materialized views. The Oracle Database enables great flexibility in the use of materialized views for OLAP.
Data warehouse designers can choose exactly how much data to materialize. A data warehouse can have the full hierarc hical cube materialized. While this will take the most storage space, it ensures quick response for any query within the cube. On the other hand, a data warehouse could have just partial materialization, saving storage space, but allowing only a subset of possible q ueries to be answered at highest speed. If an OLAP environment's queries cover the full range of aggregate groupings possible in its data set, it may be best to materialize the whole hierarchical cube.
This means that each dimension's aggregation hierarchy is precomputed in combination with each of the other dimensions. Naturally, precomputing a full hierarchical cube requires more disk sp ace and higher creation and refresh times than a small set of aggregate groups. The trade-off in processing time and disk space versu s query performance needs to be considered before deciding to create it. An additional possibility you could consider is to use data compression to lessen your disk space requirements.
See O racle Database SQL Reference for compression syntax and restrictions and "Storage And Table C ompression" for details regarding compression.
This section shows complete and partial hierarchical cube materialized views. Many of th e examples are meant to illustrate capabilities, and do not actually run.
In a data warehouse where rolling window scenario is
very common, it is recommended that you store the hierarchical cube in multiple materialized views - one for each level of time you
are interested in. Hence, a complete hierarchical cube will be stored in four materialized views: sales_hierarchical_mon_cube_m
v, sales_hierarchical_qtr_cube_mv, sales_hierarchical_yr_cube_mv, and sales_hierarchical_all_c
ube_mv.
The following statements create a complete hierarchical cube stored in a set of three composite partitioned and one list partitioned materialized view.
Example 20-18 Complete Hierarchical Cube Materialized View
CREATE MATERIALIZED VIEW sales_hierarchical_mon_cube_mv
PARTITION BY RANGE (mon)
SUBPA
RTITION BY LIST (gid)
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT calendar_year yr, calendar_quarter_desc qtr, calendar_mon
th_desc mon,
country_id, cust_state_province, cust_city,
prod_category, prod_subcategory, prod_name,
GROUPING_ID(calendar
_year, calendar_quarter_desc, calendar_month_desc,
country_id, cust_state_province, cust_city,
prod_c
ategory, prod_subcategory, prod_name) gid,
SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales,
COUNT(*) c_star
FROM sales s
, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id = t.time_id
GROUP BY calendar_
year, calendar_quarter_desc, calendar_month_desc,
ROLLUP(country_id, cust_state_province, cust_city),
ROLLUP(prod_category, prod_
subcategory, prod_name),
...;
CREATE MATERIALIZED VIEW sales_hierarchical_qtr_cube_mv
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT calendar_year yr, calendar_quarter_desc qtr,
country_id, cust_state_province, cust_city,
prod_category, prod_subcate
gory, prod_name,
GROUPING_ID(calendar_year, calendar_quarter_desc,
country_id, cust_state_province, cust_city,
prod_category, prod_subcategory, prod_name) gid,
SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales,
COUNT(*
) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id
AND s.time_id =
t.time_id
GROUP BY calendar_year, calendar_quarter_desc,
ROLLUP(country_id, cust_state_province, cust_city),
ROLLUP(prod_category
, prod_subcategory, prod_name),
PARTITION BY RANGE (qtr)
SUBPARTITION BY LIST (gid)
...;
CREATE MATERIALIZED VIEW sales_hierarchica
l_yr_cube_mv
PARTITION BY RANGE (year)
SUBPARTITION BY LIST (gid)
REFRESH FAST ON DEMAND
ENABLE QUERY REWRITE AS
SELECT calendar_year
yr, country_id, cust_state_province, cust_city,
prod_category, prod_subcategory, prod_name,
GROUPING_ID(calendar_year, cou
ntry_id, cust_state_province, cust_city,
prod_category, prod_subcategory, prod_name) gid,
SUM(amount_sold) s_sale
s, COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_
id = p.prod_id AND s.time_id = t.time_id
GROUP BY calendar_year,
ROLLUP(country_id, cust_state_province, cust_city),
ROLLUP(prod_
category, prod_subcategory, prod_name),
...;
CREATE MATERIALIZED VIEW sales_hierarchical_all_cube_mv
REFRESH FAST ON DEMAND
ENABLE Q
UERY REWRITE AS
SELECT country_id, cust_state_province, cust_city,
prod_category, prod_subcategory, prod_name,
GROUPING_ID(
country_id, cust_state_province, cust_city,
prod_category, prod_subcategory, prod_name) gid,
SUM(amount_sold) s_s
ales, COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.pr
od_id = p.prod_id AND s.time_id = t.time_id
GROUP BY ROLLUP(country_id, cust_state_province, cust_city),
ROLLUP(prod_categor
y, prod_subcategory, prod_name),
PARTITION BY LIST (gid)
...;
This allows use of PCT refresh on the materialized views sales_hierarchical_qtr_cube_mv, and sales_hierarchical_yr_cube_mv on partition maintenance operations to sales table. PCT refresh can also be used when there have been significant changes to the ba
se table and log based fast refresh is estimated to be slower than PCT refresh. You can just specify the method as force (method =>
; '?') in to refresh sub-programs in the DBMS_MVIEW package and Oracle Database will pick the best method o
f refresh. See "Partition Change Tracking (PCT) Refresh" for more information regarding PCT refres
h.
Because sales_hierarchical_qtr_cube_mv does not contain any column from times table, PCT refresh
is not enabled on it. But, you can still call refresh sub-programs in the DBMS_MVIEW package with method as force (metho
d => '?') and Oracle Database will pick the best method of refresh.
If you are interested in a partial cube (t hat is, a subset of groupings from the complete cube), then Oracle Corporation recommends storing the cube as a "federated cube". A f ederated cube stores each grouping of interest in a separate materialized view.
calendar_year yr, calen
dar_quarter_desc qtr, calendar_month_desc mon,
country_id, cust_state_province, cust_city, prod_category,
prod_subcategory, prod_nam
e,
CREATE MATERIALIZED VIEW sales_mon_city_prod_mv
PARTITION BY RANGE (mon)
...
BUILD DEFERRED
REFRESH FAST ON DEMAND
USING TRUSTE
D CONSTRAINTS
ENABLE QUERY REWRITE AS
SELECT calendar_month_desc mon, cust_city, prod_name, SUM(amount_sold) s_sales,
COUNT(am
ount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod
_id
AND s.time_id = t.time_id
GROUP BY calendar_month_desc, cust_city, prod_name;
CREATE MATERIALIZED VIEW sales_qtr_city_prod_mv
P
ARTITION BY RANGE (qtr)
...
BUILD DEFERRED
REFRESH FAST ON DEMAND
USING TRUSTED CONSTRAINTS
ENABLE QUERY REWRITE AS
SELECT calendar
_quarter_desc qtr, cust_city, prod_name,SUM(amount_sold) s_sales,
COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products
p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id =p.prod_id AND s.time_id = t.time_id
GROUP BY calendar_quarter_des
c, cust_city, prod_name;
CREATE MATERIALIZED VIEW sales_yr_city_prod_mv
PARTITION BY RANGE (yr)
...
BUILD DEFERRED
REFRESH FAST ON D
EMAND
USING TRUSTED CONSTRAINTS
ENABLE QUERY REWRITE AS
SELECT calendar_year yr, cust_city, prod_name, SUM(amount_sold) s_sales,
COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_
id =p.prod_id AND s.time_id = t.time_id
GROUP BY calendar_year, cust_city, prod_name;
CREATE MATERIALIZED VIEW sales_mon_city_scat_m
v
PARTITION BY RANGE (mon)
...
BUILD DEFERRED
REFRESH FAST ON DEMAND
USING TRUSTED CONSTRAINTS
ENABLE QUERY REWRITE AS
SELECT calen
dar_month_desc mon, cust_city, prod_subcategory,
SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sa
les s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id =p.prod_id AND s.time_id =t.time_id
GROUP BY calend
ar_month_desc, cust_city, prod_subcategory;
CREATE MATERIALIZED VIEW sales_qtr_city_cat_mv
PARTITION BY RANGE (qtr)
...
BUILD DEFERR
ED
REFRESH FAST ON DEMAND
USING TRUSTED CONSTRAINTS
ENABLE QUERY REWRITE AS
SELECT calendar_quarter_desc qtr, cust_city, prod_categ
ory cat,
SUM(amount_sold) s_sales, COUNT(amount_sold) c_sales, COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id =p.prod_id AND s.time_id =t.time_id
GROUP BY calendar_quarter_desc, cust_city, prod_categor
y;
CREATE MATERIALIZED VIEW sales_yr_city_all_mv
PARTITION BY RANGE (yr)
...
BUILD DEFERRED
REFRESH FAST ON DEMAND
USING TRUSTED C
ONSTRAINTS
ENABLE QUERY REWRITE AS
SELECT calendar_year yr, cust_city, SUM(amount_sold) s_sales,
COUNT(amount_sold) c_sales,
COUNT(*) c_star
FROM sales s, products p, customers c, times t
WHERE s.cust_id = c.cust_id AND s.prod_id = p.prod_id AND s.time_id =
t.time_id
GROUP BY calendar_year, cust_city;
These materialized views can be created as BUILD DEFERRED
and then, you can execute DBMS_MVIEW.REFRESH_DEPENDENT(number_of_failures, 'SALES', 'C' ...) so that the complet
e refresh of each of the materialized views defined on the detail table SALES is scheduled in the most efficient order.
Please refer to section on "Scheduling Refresh".
Because each of these materialized views i
s partitioned on the time level (month, quarter, or year) present in the SELECT list, PCT is enabled on SALES table for each one of them, thus providing an opportunity to apply PCT refresh method in addition to FAST and C
OMPLETE refresh methods.