
backups and reduce the size of the backup file. As SQL Server backup compression compresses the backup, the device I/O decreases because of the reduced size, but the CPU usage increases because of the compression overhead. However, the reduced size and I/O make backups much faster. If you find that the CPU usage has increased to a level where it starts affecting other applications, you can create a low-priority compressed backup in a session whose CPU usage is limited by the Resource Governor. The Resource Governor is another new component released in SQL Server Enterprise Edition that enables you to manage SQL Server workloads and resources by specifying limits on resource consumption by incoming requests. Refer to Books Online for more details on the Resource Governor. Though backup compression is an Enterprise Edition feature, you can restore compressed backups on any edition of SQL Server.
There are some restrictions that you should be aware of; for instance, you cannot
mix compressed and uncompressed backups on the same media, and you don’t get
much compression of the database when you use transparent data encryption at the
same time; hence the use of both together is not recommended. You can use backup
compression in large data warehouse implementations to achieve the following goals:
- Reduce the size of SQL backups in the vicinity of 50 percent or more.
- Save hard disk space to keep backups online.
- Be able to keep more copies of backups online for the same storage space.
- Reduce the time required to back up or restore a database.
Backup compression configuration is specified at the server level but can be overridden. The default setting for the server is off. You can change this setting in SSMS by checking the Compress Backup option in the Database Settings page of Server Properties, or by using the sp_configure stored procedure to set the default value of backup compression and then executing the RECONFIGURE statement (a short sketch follows the list below). You can override the server-level setting for one-off or single backups by using one of the following methods:
- Specify the Compress Backup or Do Not Compress Backup option in the Set Backup Compression field while using the Back Up Database task in an SSIS package. Refer to the bottom of Figure 5-30 in Chapter 5 to see this option.
- Specify the Compress Backup or Do Not Compress Backup option in the Options page of the Back Up Database dialog box while backing up a database in SSMS.
- Specify the WITH NO_COMPRESSION or WITH COMPRESSION switch in the BACKUP statement.
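As a minimal T-SQL sketch (the database name and backup path are placeholders, not taken from the text), changing the server default and then overriding it for a single backup might look like this:

-- Set compressed backups as the server-wide default.
EXEC sp_configure 'backup compression default', 1;
RECONFIGURE;

-- Override the setting for a single backup with the COMPRESSION switch;
-- AdventureWorks and the disk path are illustrative only.
BACKUP DATABASE AdventureWorks
TO DISK = N'E:\Backups\AdventureWorks.bak'
WITH COMPRESSION;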
MERGE Statement
SQL Server 2008 includes this new T-SQL statement, which can be used to perform multiple DML operations (INSERT, UPDATE, and DELETE) on a table or view in a single statement. As the operations are applied in one statement, by their very nature they are executed as a single atomic operation. Using a MERGE statement, you join a source table with a target table or view and then perform multiple DML operations against the target table based on the results of that join. Typically, you would apply these operations individually, using a batch or a stored procedure, which means the source and the target tables are evaluated and processed multiple times, once for each operation. With a MERGE statement, this evaluation happens only once for all the operations, making it more efficient than applying the DML operations individually.
The MERGE statement specifies a target table in the MERGE INTO clause and
the source table in the USING clause. Then it uses the ON clause to specify the join
condition for the source and target tables, which primarily determines the rows in the
source table that have a match in the target table, the rows that do not, and the rows in
the target table that do not have a match in the source table. For the ON clause, it is recommended that you have at least unique indexes on the join columns in both the source and target tables for better performance. After evaluating the match cases, you can use
any or all of the three WHEN clauses to perform a specific DML action on a given
row. The clauses are:

- WHEN MATCHED THEN: You can UPDATE or DELETE the given row in the target table for every row that exists in both the target table and the source table.
- WHEN NOT MATCHED [BY TARGET] THEN: You can INSERT the given row in the target table for every row that exists in the source table, but not in the target table.
- WHEN NOT MATCHED BY SOURCE THEN: You can UPDATE or DELETE the given row in the target table for every row that exists in the target table, but not in the source table.
You can also specify a search condition with each of the WHEN clauses to choose
the rows to apply to the DML operation. The MERGE statement also supports the
OUTPUT clause, enabling you to return attributes from the modified rows. The
OUTPUT clause includes a virtual column called $action, which returns the action that
modified the row—i.e., INSERT, UPDATE, or DELETE.
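The following skeleton is a sketch with placeholder table and column names (not an example from the text); it shows how the three WHEN clauses and the OUTPUT clause fit together in a single statement:

MERGE TargetTable AS TGT
USING SourceTable AS SRC
    ON TGT.KeyColumn = SRC.KeyColumn
WHEN MATCHED THEN
    -- Row exists in both source and target: update the target row.
    UPDATE SET TGT.Amount = SRC.Amount
WHEN NOT MATCHED BY TARGET THEN
    -- Row exists only in the source: insert it into the target.
    INSERT (KeyColumn, Amount) VALUES (SRC.KeyColumn, SRC.Amount)
WHEN NOT MATCHED BY SOURCE THEN
    -- Row exists only in the target: delete it from the target.
    DELETE
OUTPUT $action, inserted.KeyColumn, deleted.KeyColumn;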
You can embed a MERGE statement inside the Execute SQL task to use it in your SSIS package. Typically, you would stage data to a staging table for change capture before loading it into the data warehouse with an Integration Services package. Such an SSIS package would include a Lookup transformation to identify whether a row is a new row or an update, an SQL Server Destination to INSERT new rows, and OLE DB Command transformations to perform UPDATE and DELETE operations. The Lookup transformation and the OLE DB Command transformation work on a row-by-row basis; thus they perform at a speed that can't match the set-based operation of a MERGE statement. If you stage data for change capture and use the staging table as the source table for a MERGE statement, you will find that the data is loaded at a far better rate than when using SSIS transformations, with or without staging the data. The performance advantage is especially noticeable on servers where the lookup works against large data sets and runs short of memory.

You can also replace a Slowly Changing Dimension (SCD) transformation with a MERGE statement in some cases. Again, as MERGE evaluates the source and the target data only once, it performs much better than the SCD transformation, which has otherwise been recognized as a performance pain point in SSIS. In the following example code, DimProduct is updated with the changes received in the ProductChanges table. The changes might include price changes for some existing products that need to be updated and some new products that need to be added. Treating the changes as Type 2, the IsCurrent flag has to be reset to 0 for the changed products, and new current rows for them need to be inserted along with the rows for the new products. Look at the inner query, where the MERGE statement sets the IsCurrent flag to 0 under the WHEN MATCHED clause and adds a new row under the WHEN NOT MATCHED clause for the new products. It then outputs the affected rows along with the action taken for them, and the outer query inserts the rows that were updated back into the DimProduct table as current rows.
INSERT INTO DimProduct (ProductID, ProductName, Price, IsCurrent)
SELECT ProductID, ProductName, Price, 1
FROM
(
    MERGE DimProduct AS TGT
    USING ProductChanges AS SRC
        ON (TGT.ProductID = SRC.ProductID AND TGT.IsCurrent = 1)
    -- Existing current row for the product: expire it (Type 2 change).
    WHEN MATCHED THEN
        UPDATE SET TGT.IsCurrent = 0
    -- Brand-new product: insert it as the current row.
    WHEN NOT MATCHED THEN
        INSERT VALUES (SRC.ProductID, SRC.ProductName, SRC.Price, 1)
    -- Return the action and the source values for the outer INSERT.
    OUTPUT $action, SRC.ProductID, SRC.ProductName, SRC.Price
) AS Changes (action, ProductID, ProductName, Price)
-- Re-insert a current row only for products that were expired above.
WHERE action = 'UPDATE';

Just as MERGE can replace some of the SSIS components for performance reasons, it can be used with the change data capture (CDC) functionality to perform inserts and updates, replacing the parameterized OLE DB Command transformation and resulting in a considerable performance gain. One such application could be moving staged data into production servers.
GROUP BY Extensions
The GROUP BY clause has been enhanced in SQL Server 2008 and now adds a new operator: GROUPING SETS. In fact, the GROUP BY clause in SQL Server 2008 now supports the ROLLUP, CUBE, and GROUPING SETS operators in line with the ANSI SQL-2006 standard. So, an ISO-compliant syntax has been adopted, though the non-ISO-compliant syntax is still supported for backward compatibility; for example, the earlier operators supported in SQL Server 2005 (WITH CUBE and WITH ROLLUP) still work but will be removed in a future version. As you know, a GROUP BY clause enables you to select one summary row for each group of rows created from a selected set of rows on the basis of the values of one or more columns or expressions. Adding these operators to a GROUP BY clause provides an enhanced result set.
The ROLLUP operator generates simple aggregate groupings with the expressions rolled up from right to left, and the number of groupings in the result set equals the number of expressions plus one. The CUBE operator generates groupings for every combination of the expressions without any regard to expression order and generates 2^n groupings, where n is the number of expressions used with the CUBE operator.
GROUPING SETS allows you to produce multiple groupings of data, but only for the specified groups, in a single result set. This is simpler and more targeted than the ROLLUP and CUBE operators, which generate the full set of aggregations. GROUPING SETS can specify groupings equivalent to those returned by ROLLUP or CUBE and can also work in conjunction with ROLLUP and CUBE. The result set generated is equivalent to a UNION ALL of the specified groups. So, using GROUPING SETS, you can generate only the required levels of groupings, and the information can be readily made available to reports or requesting applications in a ready-to-digest format.
You might use a pivot table to display the results. The important thing to take away, however, is that the GROUPING SETS operator allows you to retrieve aggregated information at the required levels of grouping in one single statement, and to do so efficiently. This can enhance your data analysis experience and improve reporting performance. If you are using an Aggregate transformation to produce aggregations for a report, you may find that GROUPING SETS performs better in some cases, especially where you are running a data flow task only to perform aggregations.
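A minimal sketch, assuming a SalesFact table with Year, Region, and SalesAmount columns (placeholder names, not objects from the text), of producing only the required groupings in one statement:

SELECT [Year], Region, SUM(SalesAmount) AS TotalSales
FROM SalesFact
GROUP BY GROUPING SETS
(
    ([Year], Region),  -- one summary row per year and region
    ([Year]),          -- one summary row per year
    ()                 -- grand total
);

The result is equivalent to a UNION ALL of three separate GROUP BY queries, but it is expressed in a single statement rather than three.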
Star Join Query Processing Enhancement
This is one of those improvements in SQL Server 2008 Enterprise Edition that is
applied automatically and does not require users to make any changes to their queries.
If you are using a dimensional data model for your warehouse, you may realize this
performance gain just by upgrading to SQL Server 2008 and without making any change
to your database structure or the queries. Microsoft claims to have achieved 15 to 30 percent
improvement in the whole of the star join payload on the dimensional database server,
whereas some individual queries can benefit from it by a factor of seven. The star join
optimization is provided in Microsoft SQL Server 2008 Enterprise Edition.
In a dimensional data warehouse, most queries perform aggregations on columns of a large fact table, join it to one or more smaller dimension tables, apply filter conditions on the non-key dimension table columns, and group by one or more dimension table columns. Such queries are called star join queries. These queries follow a similar processing pattern, which you can see by looking at the execution plans of some of these queries. Typically, the fact table is scanned for a range of data by running a seek on the clustered index, and the result is hash-joined with the result of a seek on one of the dimension tables. These hash joins are repeated as many times as there are dimensions involved in the query, and finally the results are sent to a hash aggregation.
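To illustrate the pattern, here is a sketch of a typical star join query; FactSales, DimDate, DimProduct, and their columns are placeholder names, not objects from the text:

SELECT d.CalendarYear, p.Category, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
    JOIN DimDate AS d ON f.DateKey = d.DateKey
    JOIN DimProduct AS p ON f.ProductKey = p.ProductKey
WHERE d.CalendarYear = 2008           -- filter on a non-key dimension column
GROUP BY d.CalendarYear, p.Category;  -- group by dimension columns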
If you check the plan of such a query on a SQL Server 2008 server, you will notice that some bitmap filters have also been applied. The bitmap filters, also called bloom filters, are generated as a by-product of the hash joins. Bloom filters are data structures that can probabilistically test whether an element belongs to a set. Though this technique can allow some false positives, it does not allow any false negatives, so once a fact table row fails the test, it can be excluded from further processing. SQL Server 2008 can generate multiple bloom filters, whereas SQL Server 2005 could generate only one. These filters are pushed down the query execution path to the scan of the fact table to eliminate the nonqualifying rows. Also, the SQL Server 2008 query execution engine can change the sequence in which these filters are applied on the basis of their selectivity, placing the most selective filter first, then the next most selective filter, and so on, thus extracting maximum performance from this enhancement. So, the bitmap filters enable SQL Server to eliminate nonqualifying fact table rows from further processing early during query evaluation. This avoids unnecessary processing of data rows that would have been dropped later in query processing anyway and saves a considerable amount of CPU time.
Change Data Capture
When you are loading data into a data warehouse from a source system, you need to know which rows in the source system have been changed, inserted, or deleted. Until now, the options were either to extract the complete source system data to a staging table and compare it against the data warehouse to determine the changes, or to use alternate techniques such as timestamp columns, triggers, or complex queries. Though these methods work, they have some issues. The first option is very inefficient because of the amount of work required to find the changes. The second option is also not a good choice, as timestamp columns require application changes, and triggers or complex queries are not efficient. SQL Server 2008 Enterprise Edition (and Developer Edition) provides an efficient feature called Change Data Capture (CDC) to track data changes such as inserts, updates, and deletes. So, instead of comparing staged data with data warehouse data or using intrusive methods to find the changes, you can use CDC to identify the changes and load the data warehouse in incremental steps.
The Change Data Capture feature is not enabled by default, so you have to enable CDC first in order to use it. CDC is enabled at two levels: the database level and the table level. When you enable CDC on a database, it creates a new schema, cdc, and a user account, cdc, in the database. The cdc schema is used to store all the change tables and their metadata. When you decide to track a table for changes, you enable CDC on that table (see the sketch after this list). Enabling CDC on the first table creates the following:

- A change table to capture changes to data and metadata.
- Up to two query functions to allow you to retrieve data from the change tables. By default, all changes are captured and only one query function is created; however, if you also want to track the net changes over a period of time, you can specify the parameter @supports_net_changes = 1 in the CDC enabling command. This setting creates the second query function, which returns only the net changes, and can potentially reduce the number of updates you perform on your data warehouse.
- A group of change data capture metadata tables to retain metadata configuration detail.
- Two SQL Server jobs: a capture job and a cleanup job. The SQL Server Agent service must be running when you enable CDC tracking on a table.
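As a sketch of the two enabling steps (dbo.Product is a placeholder table name; the parameters shown are the commonly used ones, not an exhaustive list):

-- Enable CDC at the database level; this creates the cdc schema and the cdc user.
EXEC sys.sp_cdc_enable_db;

-- Enable CDC on an individual table. @supports_net_changes = 1 creates the
-- second query function for net changes and requires a primary key or a
-- unique index on the table.
EXEC sys.sp_cdc_enable_table
    @source_schema        = N'dbo',
    @source_name          = N'Product',
    @role_name            = NULL,
    @supports_net_changes = 1;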
When the changes are applied to the tracked source table, the database engine writes the changes into the transaction log as usual. The CDC capture job reads the log automatically and adds information about the changes into the associated change table.
This table holds the capture columns from the source table to capture the changed data
and five additional metadata columns to provide information relevant to the captured
change. Two columns are of particular importance and worth mentioning here. The first column, __$start_lsn, records the commit log sequence number (LSN) that was assigned to the change. The commit LSN not only identifies changes that were committed within the same transaction but orders them as well.
The second column we want to discuss is __$operation, which records the operation associated with the change: 1 = delete, 2 = insert, 3 = source record prior to update, and 4 = source record after update. This column makes the ETL very efficient by eliminating the need to work out whether a row is an insert, an update, or a delete. Refer to Figure 12-5, which shows methods of loading a data warehouse both with and without CDC. With CDC, instead of performing an expensive lookup operation, you just split the data conditionally using a Conditional Split transformation; the split condition uses the __$operation column to divert the rows to the appropriate output.

(Figure 12-5, Loading a data warehouse using Change Data Capture, contrasts the two flows: without CDC, read data from source systems, stage data for change capture, read data from the stage database, perform a lookup operation against the data warehouse, and then update or insert into the data warehouse; with CDC, map the starting and ending LSNs for the increment capture interval, prepare the query, read data from the CDC schema, perform a conditional split operation, and then update or insert into the data warehouse.)
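For the extraction side, here is a sketch of reading the captured rows with the generated query function, assuming the dbo_Product capture instance from the earlier sketch; in a real incremental load, the starting LSN would come from the last LSN you processed rather than the minimum LSN of the capture instance:

DECLARE @from_lsn binary(10), @to_lsn binary(10);
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Product');  -- start of the capture interval
SET @to_lsn   = sys.fn_cdc_get_max_lsn();               -- end of the capture interval

-- __$operation: 1 = delete, 2 = insert, 3 = value before update, 4 = value after update
SELECT __$start_lsn, __$operation, ProductID, ProductName, Price
FROM cdc.fn_cdc_get_all_changes_dbo_Product(@from_lsn, @to_lsn, N'all');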
Partitioned Table Parallelism
In SQL Server 2005, when you run a query against large tables that have been partitioned across several disks, typically by date, it has been observed that the executor allocates threads in an inconsistent manner. If a query touches only one partition, the executor allocates all the available threads to the query; however, if a query needs to touch more
than one partition, it gets only one thread per partition, though additional threads might still be available on the server. Sometimes this has produced inconsistent results for the same or similar queries, depending on when they were executed. In SQL Server 2008, parallel query plans against partitioned tables have been improved to utilize more threads regardless of how many partitions a query touches. The SQL Server query executor allocates all the available threads to partitioned table queries in a round-robin fashion, keeping the CPUs fully utilized at all times, so that the performance of the same or similar queries over time is more comparable. The performance boost achieved with this round-robin thread allocation is noticeably high, especially when more processor cores are available than the number of partitions a query touches. You don't need to configure anything in SQL Server, as this feature works by default.
Partition-Aligned Indexed Views
In SQL Server 2005, indexed views and partitioned tables can be used together. For instance, if you have a very large fact table that has been partitioned, say, on a yearly basis and you want to create summarized aggregates on this fact table, you use indexed views to achieve this. So far, so good; however, there is an issue with the way fact table partitioning is typically used. The fact table is generally partitioned on the basis of a time period, which could be a month, a quarter, or a year, depending upon your business needs, and as a time period completes, the oldest partition needs to be switched out and a new partition has to be switched in to keep data for a fixed period of time. For example, if you are required to keep data for five years and you are keeping partitions on a year-by-year basis, then at the beginning of a year you need to remove the oldest partition (six years old) and add a new partition for the new running year. The point to note here is that switching a partition in and out of the fact table is very efficient and is the best possible way to manage data retention. This way, your data warehouse always keeps the data for the defined period of time, and the retained data is partitioned for performance. This is known as a sliding window scenario. The issue with SQL Server 2005 is that you cannot switch a partition in or out of a partitioned table that has indexed views created on top of it. You end up dropping the indexed view, switching the partition in or out, and then re-creating the indexed view. The creation of an indexed view can be a very expensive process that delays data delivery to your end users.
SQL Server 2008 aligns these two features and makes them work together. You don't need to drop and re-create indexed views when switching a partition in or out of a partitioned table; a sketch of the switch operation follows. With the enhancement of partition-aligned indexed views, you save a lot of extra processing that would otherwise be required to rebuild aggregates on the entire partitioned table.
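As a sketch of the sliding window maintenance itself (FactSales, FactSalesArchive, the partition function pfSalesDate, the partition scheme psSalesDate, and the boundary dates are all placeholder names, not objects from the text), the switch and boundary adjustments might look like this; in SQL Server 2008, partition-aligned indexed views on FactSales can stay in place during the SWITCH:

-- Switch the oldest partition out into an empty, identically structured archive table.
ALTER TABLE FactSales SWITCH PARTITION 1 TO FactSalesArchive;

-- Remove the now-empty oldest boundary.
ALTER PARTITION FUNCTION pfSalesDate() MERGE RANGE ('20030101');

-- Designate the filegroup for the next partition, then add a boundary for the new year.
ALTER PARTITION SCHEME psSalesDate NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfSalesDate() SPLIT RANGE ('20090101');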
Summary

This chapter has covered the basics of data warehousing and some of the enhancements provided in the SQL Server database engine. The two new editions introduced in the R2 release are interesting to watch, as they bring together software, hardware, and best practices to achieve the highest performance. This chapter is intended only to get your feet wet, so if you are interested, go and take a deep dive into the area of data warehousing. For now, stay on, as we still have some interesting topics to cover: migration, package deployment, troubleshooting, and performance tuning.
Chapter 13
Deploying Integration Services Packages

In This Chapter
- Package Configurations
- Deployment Utility
- Deploying Integration Services Projects
- Custom Deployment
- Summary
