Microsoft SQL Server 2008 Study Guide, Part 34

Nielsen c12.tex V4 - 07/21/2009 12:46pm Page 292
Part II Manipulating Data With Select
Because SQL is now returning information from a set, rather than building a record set of rows, as soon
as a query includes an aggregate function, every column (in the column list, in the expression, or in the
ORDER BY) must participate in an aggregate function. This makes sense because if a query returned
the total number of order sales, then it could not return a single order number on the summary row.
Because aggregate functions are expressions, the result will have a null column name. Therefore, use an
alias to name the column in the results.
To demonstrate the mathematical aggregate functions, the following query produces a SUM(), AVG(), MIN(), and MAX() of the Amount column. SQL Server warns in the result that null values are ignored by aggregate functions, which are examined in more detail soon:
SELECT SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData ;
Result:
Sum     Avg     Min     Max
------- ------- ------- -------
946     47      11      91
Warning: Null value is eliminated by an aggregate or other SET operation.
There’s actually more to the COUNT() function than appears at first glance. The next query exercises four variations of the COUNT() aggregate function:
SELECT COUNT(*) AS CountStar,
COUNT(RawDataID) AS CountPK,
COUNT(Amount) AS CountAmount,
COUNT(DISTINCT Region) AS Regions
FROM RawData;


Result:
CountStar  CountPK  CountAmount  Regions
---------- -------- ------------ --------
24         24       20           4
Warning: Null value is eliminated by an aggregate or other SET operation.
To examine this query in detail, the first column, COUNT(*), counts every row, regardless of any values in the row. COUNT(RawDataID) counts all the rows with a non-null value in the primary key. Because primary keys, by definition, can’t have any nulls, this column also counts every row. These two methods of counting rows have the same query execution plan, same performance, and same result.
The third column, COUNT(Amount), demonstrates why these aggregate queries include a warning. It counts the number of rows with an actual value in the Amount column, and it ignores any rows
with a null value in the Amount column. Because there are four rows with null amounts, this COUNT(Amount) finds only 20 rows.
COUNT(DISTINCT Region) is the oddball of this query. Instead of counting rows, it counts the unique values in the Region column. The RawData table data has four regions: MidWest, NorthEast, South, and West. Therefore, COUNT(DISTINCT Region) returns 4. Note that COUNT(DISTINCT *) is invalid; it requires a specific column.
Aggregates, averages, and nulls
Aggregate functions ignore nulls, which creates a special situation when calculating averages. A SUM() or AVG() aggregate function will not error out on a null, but simply skip the row with a null. For this reason, a SUM()/COUNT(*) calculation may provide a different result from an AVG() function. The COUNT(*) function includes every row, whereas the AVG() function might divide using a smaller count of rows.
To test this behavior, the next query uses three methods of calculating the average amount, and each
method generates a different result:
SELECT AVG(Amount) AS [Integer Avg],
SUM(Amount) / COUNT(*) AS [Manual Avg],
AVG(CAST((Amount) AS NUMERIC(9, 5))) AS [Numeric Avg]
FROM RawData;
Result:
Integer Avg  Manual Avg  Numeric Avg
------------ ----------- ------------
47           39          47.300000
The first column performs the standard AVG() aggregate function and divides the sum of the amount (946) by the number of rows with a non-null value for the amount (20). Because the Amount values are integers, the result, 47.3, is truncated to 47.
The SUM(Amount)/COUNT(*) calculation in column two actually divides 946 by the total number of rows in the table (24), yielding a different answer (39.4, truncated to 39).
The last column provides the best answer. It uses the AVG() function so it ignores null values, but it also improves the precision of the answer. The trick is that the precision of the aggregate function is determined by the data type precision of the source values. The CAST() function first converts the Amount values to a numeric(9,5) data type, and the converted values are then passed to the AVG() function.
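The same effect is easy to reproduce on a small scale. The following is a minimal sketch rather than part of the RawData sample: the table variable @Demo and its five values are made up for illustration, chosen so that one amount is null and the true average is not a whole number. The fourth column is an assumption about one possible requirement, namely treating a null as a zero amount.

```sql
-- Hypothetical demo data (not the RawData table): five rows, one of them null.
DECLARE @Demo TABLE (Amount INT NULL);

INSERT @Demo (Amount)
  VALUES (10), (20), (30), (41), (NULL);

SELECT AVG(Amount) AS [Integer Avg],                -- 101 / 4 rows, truncated to 25
       SUM(Amount) / COUNT(*) AS [Manual Avg],      -- 101 / 5 rows, truncated to 20
       AVG(CAST(Amount AS NUMERIC(9, 5))) AS [Numeric Avg],  -- 25.250000
       AVG(ISNULL(Amount, 0)) AS [Nulls As Zero]    -- 101 / 5 rows, truncated to 20
  FROM @Demo;
```

If a null really should count as a zero amount, wrapping the column in ISNULL() or COALESCE() makes AVG() divide by every row; otherwise the plain AVG() behavior of skipping nulls is usually what you want.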
Using aggregate functions within the Query Designer
When using Management Studio’s Query Designer (select a table in the Object Explorer ➪ Context
Menu ➪ Edit Top 200 Rows), a query can be converted into an aggregate query using the Group By
toolbar button, as illustrated in Figure 12-2.
FIGURE 12-2
Performing an aggregate query within Management Studio’s Query Designer. The aggregate function for the column is selected using the drop-down box in the Group By column.
For more information on using the Query Designer to build and execute queries, turn to Chapter 6, "Using Management Studio."
Beginning statistics
Statistics is a large and complex field of study, and while SQL Server does not pretend to replace a full
statistical analysis software package, it does calculate standard deviation and variance, both of which are
important for understanding the bell-curve spread of numbers.
An average alone is not sufficient to summarize a set of values (in the lexicon of statistics, a "set" is referred to as a population). The value in the exact middle of a population is the median (which is different from the average, or arithmetic mean). How widely dispersed the values are from the mean is called the population’s variance. For example, the populations
(1, 2, 3, 4, 5, 6, 7, 8, 9) and (4, 4, 5, 5, 5, 5, 6, 6) both average to 5, but the values in the first set vary widely from the mean, whereas the second set’s values are all close to the mean. The standard deviation is the square root of the variance and describes the shape of the bell curve formed by the population.
The following query uses the StDevP() and VarP() functions to return the standard deviation and the statistical variance of the entire population of the RawData table:
SELECT
StDevP(Amount) AS [StDevP],
VarP(Amount) AS [VarP]
FROM RawData;
Result:
StDevP            VarP
----------------- -------
24.2715883287435  589.11
To perform extensive statistical data analysis, I recommend exporting the query result set
to Excel and tapping Excel’s broad range of statistical functions.
The statistical formulas differ slightly when calculating variance and standard deviation from the entire population versus a sampling of the population. If the aggregate query includes the entire population, then use the StDevP() and VarP() aggregate functions, which use the biased or n method of calculating the deviation.
However, if the query is using a sampling or subset of the population, then use the StDev() and Var() aggregate functions so that SQL Server will use the unbiased or n-1 statistical method. Because GROUP BY queries slice the population into subsets, these queries should always use StDevP() and VarP() functions.
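To see the n and n-1 methods side by side, here is a minimal sketch. The table variable @Scores and its eight values are made up for illustration; they were picked because their mean is exactly 5 and the squared differences from the mean add up to 32, which keeps the arithmetic easy to check by hand.

```sql
-- Hypothetical demo data (not the RawData table): eight values with a mean of 5.
DECLARE @Scores TABLE (Score INT NOT NULL);

INSERT @Scores (Score)
  VALUES (2), (4), (4), (4), (5), (5), (7), (9);

-- Squared differences from the mean: 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32.
SELECT VarP(Score)   AS [VarP],    -- 32 / 8 = 4            (n method)
       StDevP(Score) AS [StDevP],  -- square root of 4 = 2
       Var(Score)    AS [Var],     -- 32 / 7, about 4.57    (n-1 method)
       StDev(Score)  AS [StDev]    -- about 2.14
  FROM @Scores;
```

The two methods differ only in the divisor, so the gap between them shrinks as the number of rows grows.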
All of these aggregate functions also work with the OVER() clause; see Chapter 13, "Windowing and Ranking."
Grouping within a Result Set
Aggregate functions are all well and good, but how often do you need a total for an entire table? Most aggregate requirements will include a date range, department, type of sale, region, or the like. That presents a problem. If the only tool to restrict the aggregate function were the WHERE clause, then database developers would waste hours replicating the same query, or writing a lot of dynamic SQL queries and the code to execute the aggregate queries in sequence.
Fortunately, aggregate functions are complemented by the GROUP BY clause, which automatically partitions the data set into subsets based on the values in certain columns. Once the data set is divided into
subgroups, the aggregate functions are performed on each subgroup. The final result is one summation row for each group, as shown in Figure 12-3.
A common example is grouping the sales result by salesperson. A SUM() function without the grouping would produce the SUM() of all sales. Writing a query for each salesperson would provide a SUM() for each person, but maintaining that over time would be cumbersome. The grouping function automatically creates a subset of data grouped for each unique salesperson, and then the SUM() function is calculated for each salesperson’s sales. Voilà.
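The salesperson scenario can be sketched as follows. This is only an illustration: the table variable @Sales, its column names, and its five rows are all assumptions invented for the sketch, not part of the sample database.

```sql
-- Hypothetical sales data; the table and column names are assumptions.
DECLARE @Sales TABLE (
  SalesPerson VARCHAR(25) NOT NULL,
  Amount      INT         NOT NULL
);

INSERT @Sales (SalesPerson, Amount)
  VALUES ('Ann', 100), ('Ann', 250), ('Bob', 75), ('Bob', 125), ('Cy', 300);

-- Without grouping: a single total for all sales (850).
SELECT SUM(Amount) AS [Sum]
  FROM @Sales;

-- With grouping: one summary row per salesperson
-- (Ann 350, Bob 200, Cy 300).
SELECT SalesPerson,
       SUM(Amount) AS [Sum]
  FROM @Sales
  GROUP BY SalesPerson;
```

Adding a new salesperson to the data adds a new summary row without any change to the query.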
FIGURE 12-3
The GROUP BY clause slices the data set into multiple subgroups.
[Figure: query flow diagram with the labels Data Source(s), From, Where, Col(s), Expr(s), Data Set, Having, Predicate, and Order By; the rows of the data set are collected into groups.]
Simple groupings
Some queries use descriptive columns for the grouping, so the data used by the GROUP BY clause is
the same data you need to see to understand the groupings. For example, the next query groups by
category:
SELECT Category,
Count(*) as Count,
Sum(Amount) as [Sum],
Avg(Amount) as [Avg],
Min(Amount) as [Min],
Max(Amount) as [Max]
FROM RawData
GROUP BY Category;
Result:
Category  Count  Sum  Avg  Min  Max
--------- ------ ---- ---- ---- ----
X         5      225  45   11   86
Y         15     506  46   12   91
Z         4      215  53   33   83
The first column of this query returns the Category column. While this column does not have an aggregate function, it still participates within the aggregate because that’s the column by which the query is being grouped. It may therefore be included in the result set because, by definition, there can be only a single category value in each group. Each row in the result set summarizes one category, and the aggregate functions now calculate the row count, sum, average, minimum value, and maximum value for each category.
SQL is not limited to grouping by a column. It’s possible to group by an expression, but note that the exact same expression must be used in the SELECT list, not the individual columns used to generate the expression.
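As a minimal sketch of that rule, the following uses a made-up table variable, @Orders, whose name, columns, and three rows are assumptions for illustration only. The first query repeats the grouped expression in the SELECT list; the second, left commented out, selects the underlying column instead and would fail.

```sql
-- Hypothetical demo data; the table and column names are assumptions.
DECLARE @Orders TABLE (OrderDate DATETIME NOT NULL, Amount INT NOT NULL);

INSERT @Orders (OrderDate, Amount)
  VALUES ('20090115', 10), ('20090220', 20), ('20090705', 30);

-- Valid: the SELECT list repeats the grouped expression exactly.
-- Returns quarter 1 with a sum of 30 and quarter 3 with a sum of 30.
SELECT DatePart(q, OrderDate) AS [Quarter],
       SUM(Amount) AS [Sum]
  FROM @Orders
  GROUP BY DatePart(q, OrderDate);

-- Invalid: OrderDate on its own is neither grouped nor aggregated,
-- so uncommenting this query raises error 8120.
-- SELECT OrderDate, SUM(Amount) AS [Sum]
--   FROM @Orders
--   GROUP BY DatePart(q, OrderDate);
```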
Nor is SQL limited to grouping by a single column or expression. Grouping by multiple columns and expressions is quite common. The following query is an example of grouping by two expressions that calculate year number and quarter from SalesDate:
SELECT Year(SalesDate) as [Year], DatePart(q,SalesDate) as [Quarter],
Count(*) as Count,
Sum(Amount) as [Sum],
Avg(Amount) as [Avg],
Min(Amount) as [Min],
Max(Amount) as [Max]
FROM RawData
GROUP BY Year(SalesDate), DatePart(q,SalesDate);
Result:
Year  Quarter  Count  Sum  Avg  Min  Max
----- -------- ------ ---- ---- ---- ----
2009  1        6      218  36   11   62
2009  2        6      369  61   33   86
2009  3        8      280  70   54   91
2008  4        4      79   19   12   28
For the purposes of a GROUP BY, null values are considered equal to other nulls and are
grouped together into a single result row.
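A minimal sketch of the null-grouping behavior, using a made-up table variable (the name @Demo and its four rows are assumptions for illustration):

```sql
-- Hypothetical demo data: two rows with a region and two without.
DECLARE @Demo TABLE (Region VARCHAR(10) NULL, Amount INT NOT NULL);

INSERT @Demo (Region, Amount)
  VALUES ('West', 10), (NULL, 20), ('West', 30), (NULL, 40);

-- Returns two rows: NULL (count 2, sum 60) and West (count 2, sum 40).
-- The two null regions land in the same group even though
-- NULL = NULL is not true in a WHERE clause comparison.
SELECT Region,
       COUNT(*)    AS [Count],
       SUM(Amount) AS [Sum]
  FROM @Demo
  GROUP BY Region;
```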
Grouping sets
Normally, SQL Server groups by every unique combination of values in every column listed in the GROUP BY clause. Grouping sets is a variation of that theme that’s new for SQL Server 2008. With grouping sets, a summation row is generated for each unique value in each set. You can think of grouping sets as executing several GROUP BY queries (one for each grouping set) and then combining, or unioning, the results.
For example, the following two queries produce the same result. The first query uses two GROUP BY queries unioned together; the second query uses the new grouping set feature:
SELECT NULL AS Category,
Region,
COUNT(*) AS Count,
SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData
GROUP BY Region
UNION
SELECT Category,
Null,
COUNT(*) AS Count,
SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData
GROUP BY Category;
SELECT Category,
Region,
COUNT(*) AS Count,
SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData
GROUP BY GROUPING SETS (Category, Region);
Result (same for both queries):
Category  Region     Count  Sum  Avg  Min  Max
--------- ---------- ------ ---- ---- ---- ----
NULL      MidWest    3      145  48   24   83
NULL      NorthEast  6      236  59   28   91
NULL      South      12     485  44   11   86
NULL      West       3      80   40   36   44
X         NULL       7      225  45   11   86
Y         NULL       12     506  46   12   91
Z         NULL       5      215  53   33   83
There’s more to grouping sets than merging multiple GROUP BY queries; they’re also used with ROLLUP and CUBE, covered later in this chapter.
Filtering grouped results
When combined with grouping, filtering can be a problem. Are the row restrictions applied before the GROUP BY or after the GROUP BY? Some databases use nested queries to properly filter before or after the GROUP BY. SQL, however, uses the HAVING clause to filter the groups. At the beginning of this chapter, you saw the simplified order of the SQL SELECT statement’s execution. A more complete order is as follows:
1. The FROM clause assembles the data from the data sources.
2. The WHERE clause restricts the rows based on the conditions.
3. The GROUP BY clause assembles subsets of data.
4. Aggregate functions are calculated.
5. The HAVING clause filters the subsets of data.
6. Any remaining expressions are calculated.
7. The ORDER BY sorts the results.
Continuing with the RawData sample table, the following query removes from the analysis any grouping "having" an average of less than or equal to 25 by accepting only those summary rows with an average greater than 25:
SELECT Year(SalesDate) as [Year],
DatePart(q,SalesDate) as [Quarter],

Count(*) as Count,
Sum(Amount) as [Sum],
Avg(Amount) as [Avg]
FROM RawData
GROUP BY Year(SalesDate), DatePart(q,SalesDate)
HAVING Avg(Amount) > 25
ORDER BY [Year], [Quarter];
Result:
Year  Quarter  Count  Sum  Avg
----- -------- ------ ---- ----
2006  1        6      218  36
2006  2        6      369  61
2006  3        8      280  70
Without the HAVING clause, the fourth quarter of 2005, with an average of 19, would have been
included in the result set.
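The difference between filtering before and after the GROUP BY is easiest to see when the same threshold is applied both ways. This is a minimal sketch with a made-up table variable; the name @Demo, its columns, and its four rows are assumptions for illustration.

```sql
-- Hypothetical demo data: two categories with two amounts each.
DECLARE @Demo TABLE (Category CHAR(1) NOT NULL, Amount INT NOT NULL);

INSERT @Demo (Category, Amount)
  VALUES ('X', 10), ('X', 50), ('Y', 40), ('Y', 60);

-- WHERE filters rows before grouping: the X row with 10 is removed first,
-- so both groups survive (X averages 50, Y averages 50).
SELECT Category, AVG(Amount) AS [Avg]
  FROM @Demo
  WHERE Amount > 35
  GROUP BY Category;

-- HAVING filters groups after the aggregates are calculated:
-- X averages 30 over both of its rows and is removed; only Y (50) remains.
SELECT Category, AVG(Amount) AS [Avg]
  FROM @Demo
  GROUP BY Category
  HAVING AVG(Amount) > 35;
```

The two clauses can also be combined in one query, in which case the WHERE clause decides which rows feed the aggregates and the HAVING clause decides which summary rows are returned.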
Aggravating Queries
A few aspects of GROUP BY queries can be aggravating when developing applications. Some developers
simply avoid aggregate queries and make the reporting tool do the work, but the Database Engine will
be more efficient than any client tool. Here are four typical aggravating problems and my recommended
solutions.
Including group by descriptions
The previous aggregate queries all executed without error because every column participated in the aggregate purpose of the query. To test the rule, the following script adds a category table and then attempts to return a column that isn’t included as an aggregate function or GROUP BY column:
CREATE TABLE RawCategory (
RawCategoryID CHAR(1) NOT NULL PRIMARY KEY,
CategoryName VARCHAR(25) NOT NULL
);
INSERT RawCategory (RawCategoryID, CategoryName)
VALUES ('X', 'Sci-Fi'),
('Y', 'Philosophy'),
('Z', 'Zoology');
ALTER TABLE RawData
ADD CONSTRAINT FT_Category
FOREIGN KEY (Category)
REFERENCES RawCategory(RawCategoryID);
-- including data outside the aggregate function or group by
SELECT R.Category, C.CategoryName,
Sum(R.Amount) as [Sum],
Avg(R.Amount) as [Avg],
Min(R.Amount) as [Min],
Max(R.Amount) as [Max]
FROM RawData AS R
INNER JOIN RawCategory AS C
ON R.Category = C.RawCategoryID
GROUP BY R.Category;
As expected, including CategoryName in the column list causes the query to return an error message:
Msg 8120, Level 16, State 1, Line 1
Column 'RawCategory.CategoryName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Here are three solutions for including non-aggregate descriptive columns. Which solution performs best depends on the size and mix of the data and indexes.
The first solution is to simply include the additional columns in the GROUP BY clause:
SELECT R.Category, C.CategoryName,
Sum(R.Amount) as [Sum],
Avg(R.Amount) as [Avg],
Min(R.Amount) as [Min],
Max(R.Amount) as [Max]
FROM RawData AS R
INNER JOIN RawCategory AS C
ON R.Category = C.RawCategoryID
GROUP BY R.Category, C.CategoryName
ORDER BY R.Category, C.CategoryName;
Result:
Category  CategoryName  Sum  Avg  Min  Max
--------- ------------- ---- ---- ---- ----
X         Sci-Fi        225  45   11   86
Y         Philosophy    506  46   12   91
Z         Zoology       215  53   33   83
Another simple solution might be to include the descriptive column in an aggregate function that accepts text, such as MIN() or MAX(). This solution returns the descriptor while avoiding grouping by an additional column:
SELECT Category,
MAX(CategoryName) AS CategoryName,
SUM(Amount) AS [Sum],
AVG(Amount) AS [Avg],
MIN(Amount) AS [Min],
MAX(Amount) AS [Max]
FROM RawData R
JOIN RawCategory C
ON R.Category = C.RawCategoryID
GROUP BY Category
ORDER BY Category,
CategoryName;
Another possible solution, although more complex, is to embed the aggregate function in a subquery and then include the additional columns in the outer query. In this solution, the subquery does the grunt work of the aggregate function and GROUP BY, leaving the outer query to handle the JOIN and bring in the descriptive column(s). For larger data sets, this may be the best-performing solution:
SELECT sq.Category, C.CategoryName,
sq.[Sum], sq.[Avg], sq.[Min], sq.[Max]
FROM (SELECT Category,
Sum(Amount) as [Sum],
Avg(Amount) as [Avg],
Min(Amount) as [Min],
Max(Amount) as [Max]
FROM RawData
GROUP BY Category ) AS sq
INNER JOIN RawCategory AS C
ON sq.Category = C.RawCategoryID
ORDER BY sq.Category, C.CategoryName;
Which solution performs best depends on the data mix. If it’s an ad hoc query, then the simplest query to write is probably the first solution. If the query is going into production as part of a stored procedure, then I recommend testing all three solutions against a full data load to determine which solution actually performs best. Never underestimate the optimizer.
Including all group by values
The GROUP BY clause is processed after the WHERE clause in the logical order of the query. This can present a problem if the query needs to report all of the GROUP BY column values even though the data needs to be filtered. For example, a report might need to include all the months even though there’s no data for a given month. A GROUP BY query won’t return a summary row for a group that has no data.
The simple solution is to use the GROUP BY ALL option, which includes all GROUP BY values regardless of the WHERE clause. However, it has a limitation: It only works well when grouping by a single expression. A more severe limitation is that Microsoft lists it as deprecated, meaning it will be removed from a future version of SQL Server. Nonetheless, here’s an example.
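As a minimal sketch of the behavior, the following compares a plain GROUP BY with GROUP BY ALL under the same WHERE clause. It uses a made-up table variable rather than the RawData table; the name @Demo, its columns, and its four rows are assumptions for illustration.

```sql
-- Hypothetical demo data: only the South rows pass the filter below.
DECLARE @Demo TABLE (Region VARCHAR(10) NOT NULL, Amount INT NOT NULL);

INSERT @Demo (Region, Amount)
  VALUES ('South', 80), ('South', 90), ('West', 20), ('West', 30);

-- Plain GROUP BY: West has no rows left after the WHERE clause,
-- so only South (count 2, sum 170) is returned.
SELECT Region,
       COUNT(*)    AS [Count],
       SUM(Amount) AS [Sum]
  FROM @Demo
  WHERE Amount > 50
  GROUP BY Region;

-- GROUP BY ALL: West is still listed, with a count of 0 and a null sum,
-- because the group exists in the table even though no row passed the filter.
SELECT Region,
       COUNT(*)    AS [Count],
       SUM(Amount) AS [Sum]
  FROM @Demo
  WHERE Amount > 50
  GROUP BY ALL Region;
```

Because the option is deprecated, treat this as a way to understand existing code rather than a pattern for new development.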