56 Incorporating data profiling in the ETL process
inside the package. Typically, you will store the XML in a file if you are profiling data to be reviewed by a person at a later time, and plan on using the Data Profile Viewer application to review it. Storing the XML output in a variable is most often done when you want to use the profile information later in the same package, perhaps to make an automated decision about data quality.
The XML output includes both the profile requests (the input to the task) and the output from each profile requested. The format of the output varies depending on which profile generated it, so you will see different elements in the XML for a Column Null Ratio profile than you will for a Column Length Distribution profile. The XML contains a lot of information, and it can be difficult to sort through to find the information you are looking for. Fortunately, there is an easier user interface to use.
The Data Profile Viewer, shown in figure 2, provides a graphical interface to the data profile information. You can open XML files generated by the Data Profiling task in it and find specific information much more easily. In addition, the viewer represents some of the profile information graphically, which is useful when you are looking at large quantities of data. For example, the Column Length Distribution profile displays the count associated with specific lengths as a stacked bar chart, which means you can easily locate the most frequently used lengths.

Figure 2 Data Profile Viewer
The Data Profile Viewer lets you sort most columns in the tables that it displays, which can aid you in exploring the data. It also allows you to drill down into the detail data in the source system. This is particularly useful when you have located some bad data in the profile, because you can see the source rows that contain the data. This can be valuable if, for example, the profile shows that several customer names are unusually long. You can drill into the detail data to see all the data associated with these outlier rows. This feature does require a live connection to the source database, though, because the source data is not directly included in the data profile output.

One thing to be aware of with the Data Profile Viewer: not all values it shows are directly included in the XML. It does some additional work on the data profiles before presenting them to you. For example, in many cases it calculates the percentage of rows that a specific value in the profile applies to. The raw XML for the data profile only stores the row counts, not the percentages. This means that if you want to use the XML directly, perhaps to display the information on a web page, you may need to calculate some values manually. This is usually a straightforward task.
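As a minimal sketch of that kind of calculation, the following C# fragment loads a profile that was saved to a file and derives a null percentage from the raw counts. It is not part of the chapter's sample package; the file path is a placeholder, and the element and attribute names are assumptions based on the DataProfile.xsd output, so verify them against your own profile file before relying on them.

using System;
using System.Xml.Linq;

class NullRatioPercentages
{
    static void Main()
    {
        // Namespace used by the Data Profiling task output (per DataProfile.xsd).
        XNamespace dp = "http://schemas.microsoft.com/sqlserver/2008/DataDebugger/";

        // Placeholder path: a profile previously written to disk by the task.
        XDocument profile = XDocument.Load(@"C:\Profiles\Product.xml");

        foreach (XElement p in profile.Descendants(dp + "ColumnNullRatioProfile"))
        {
            // Assumption: the row count is an attribute of the Table element and
            // the null count is a child element; confirm against your own output.
            double rowCount = (double)p.Element(dp + "Table").Attribute("RowCount");
            double nullCount = (double)p.Element(dp + "NullCount");
            string column = (string)p.Element(dp + "Column").Attribute("Name");

            // The viewer computes this for you; the raw XML stores only the counts.
            Console.WriteLine("{0}: {1:F1}% null", column, nullCount / rowCount * 100);
        }
    }
}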
Constraints of the Data Profiling task
As useful as the Data Profiling task is, there are still some constraints that you need to keep in mind when using it. The first one most people encounter is in the types of data sources it will work with. The Data Profiling task requires that the data to be profiled be in SQL Server 2000 or later. This means you can’t use it to directly profile data in Oracle tables, Access databases, Excel spreadsheets, or flat files. You can work around this by importing the data you need into SQL Server prior to profiling it. In fact, there are other reasons why you may want the data in SQL Server in advance, which will be touched on in this section.

The Data Profiling task also requires that you use an ADO.NET connection manager. Typically, in SSIS, OLE DB connection managers are used, as they tend to perform better. This may mean creating two connection managers to the same database, if you need to both profile data and import it in the same package.
Using the Data Profile Viewer does require a SQL Server installation, because the viewer is not packaged or licensed as a redistributable component. It is possible to transform the XML output into a more user-friendly format by using XSL Transformations (XSLT) to translate it into HTML, or to write your own viewer for the information.
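If you go the XSLT route, the transform itself is only a few lines of .NET. The sketch below assumes a stylesheet named ProfileToHtml.xslt that you would write yourself; the stylesheet name and file paths are placeholders, not part of the chapter's sample code.

using System.Xml.Xsl;

class ProfileReport
{
    static void Main()
    {
        // Compile the (user-supplied) stylesheet that renders the profile as HTML.
        XslCompiledTransform xslt = new XslCompiledTransform();
        xslt.Load(@"C:\Profiles\ProfileToHtml.xslt");

        // Transform the saved Data Profiling task output into an HTML report.
        xslt.Transform(@"C:\Profiles\Product.xml", @"C:\Profiles\Product.html");
    }
}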
The task’s performance can vary greatly, depending both on the volume of data you are profiling and on the types of profiles you have requested. Some profiles, such as the Column Pattern profile, are resource intensive and can take quite a while on a large table. One way to address this is to work with a subset of the data, rather than the entire table. It’s important to get a representative sample of the data for these purposes, so that the data profile results aren’t skewed. This is another reason that having the data in SQL Server can be valuable. You can copy a subset of the data to another table for profiling, using a SELECT that returns a random sampling of rows (as discussed in “Selecting Rows Randomly from a Large Table” on MSDN: http://msdn.microsoft.com/en-us/library/cc441928.aspx). If the data is coming from an external source, such as a flat file, you can use the Row Sampling or Percentage Sampling components in an SSIS data flow to create a representative sample of the data to profile. Note that when sampling data, care must be taken to ensure the sample is truly representative, or the results can be misleading; when profiling the entire data set is practical, it is generally the better option.
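As a rough illustration of copying a random subset into a work table, the following C# fragment runs a NEWID()-based sample into a new table that the Data Profiling task can then be pointed at. The connection string and object names are placeholders, and NEWID() ordering can be expensive on very large tables; the MSDN article cited above covers more scalable options.

using System.Data.SqlClient;

class SampleForProfiling
{
    static void Main()
    {
        // Copy roughly 10 percent of the rows into a separate table for profiling.
        string sql = @"SELECT TOP (10) PERCENT *
                       INTO dbo.Product_ProfileSample
                       FROM Production.Product
                       ORDER BY NEWID();";

        using (SqlConnection conn = new SqlConnection(
            "Data Source=(local);Initial Catalog=AdventureWorks;Integrated Security=SSPI;"))
        using (SqlCommand cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}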
Making the Data Profiling task dynamic
Why would you want to make the Data Profiling task dynamic? Well, as an example, think about profiling a new database. You could create a new SSIS package, add a Data Profiling task, and use the Quick Profile option to create profile requests for all the tables in the database. You’d then have to repeat these steps for the next new database that you want to profile. Or what if you don’t want to profile all the tables, but only a subset of them? To do this through the task’s editor, you would need to add each table individually. Wouldn’t it be easier to be able to dynamically update the task to profile different tables in your database?

Most tasks in SSIS can be made dynamic by using configurations and expressions. Configurations are used for settings that you wish to update each time a package is loaded, and expressions are used for settings that you want to update during the package execution. Both expressions and configurations operate on the properties of tasks in the package, but depending on what aspect of the Data Profiling task you want to change, it may require special handling to behave in a dynamic manner.

Changing the database
Because the Data Profiling task uses connection managers to control the connection to the database, it is relatively easy to change the database it points to. You update the connection manager using one of the standard approaches in SSIS, such as an expression that sets the ConnectionString property, or a configuration that sets the same property. You can also accomplish this by overriding the connection manager’s setting at runtime using the /Connection switch of DTEXEC.
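For example, a DTEXEC command line along the following lines would point an existing connection manager at a different server or database when the package runs; the package name, connection manager name, and connection string here are placeholders for illustration only:

dtexec /File ProfileTables.dtsx
    /Connection ProfileSource;"Data Source=STAGINGSRV;Initial Catalog=Staging;Integrated Security=SSPI;"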
Bear in mind that although you can switch databases this way, the task will only work if it is pointing to a SQL Server database. Also, connection managers only control the database that you are connecting to, and not the specific tables. The profile requests in the task will still be referencing the original tables, so if the new database does not contain tables with the same names, the task will fail. What is needed is a way to change the profile requests to reference new tables.
Altering the profile requests
As noted earlier, you can configure the Data Profiling task through the Data Profiling Task Editor, which configures and stores the profile requests in the task’s ProfileRequests property. But this property is a collection object, and collection objects can’t be set through expressions or configurations, so, at first glance, it appears that you can’t update the profile requests.

Fortunately, there is an additional property that can be used for this on the Data Profiling task. This is the ProfileInputXml property, which stores the XML representation of the profile requests. The ProfileInputXml property is not visible in the Properties window in BIDS, but you can see it in the Property Expressions Editor dialog box, or in the Package Configuration Wizard’s property browser. You can set an XML string into this property using either an expression or a configuration. For it to work properly, the XML must conform to the DataProfile.xsd schema mentioned earlier.
Setting the ProfileInputXml property
So how can you go about altering the ProfileInputXml property to profile a different table? One way that works well is to create a string variable in the SSIS package to hold the table name (named TableName) and a second variable to hold the schema name (named SchemaName). Create a third variable that will hold the XML for the profile requests (named ProfileXML), and set the EvaluateAsExpression property of the ProfileXML variable to True. In the Expression property, you’ll need to enter the XML string for the profile, and concatenate in the table and schema variables.

To get the XML to use as a starting point, you can configure and run the Data Profiling task with its output directed to a file. You’ll then need to remove the output information from the file, which can be done by removing all of the elements between the <DataProfileOutput> and <Profiles> tags, so that the XML looks similar to listing 1. You may have more or less XML, depending on how many profiles you configured the task for initially.

Listing 1 Data profile XML prior to making it dynamic
<?xml version="1.0" encoding="utf-16"?>
<DataProfile xmlns:xsi=" /> xmlns:xsd=" /> xmlns=" /> <DataSources />
<DataProfileInput>
<ProfileMode>Exact</ProfileMode>
<Timeout>0</Timeout>
<Requests>
<ColumnNullRatioProfileRequest ID="NullRatioReq">
<DataSourceID>{8D7CF241-6773-464A-87C8-60E95F386FB2}</DataSourceID>
<Table Schema="Production" Table="Product" />
<Column IsWildCard="true" />
</ColumnNullRatioProfileRequest>
<ColumnStatisticsProfileRequest ID="StatisticsReq">
<DataSourceID>{8D7CF241-6773-464A-87C8-60E95F386FB2}</DataSourceID>
<Table Schema="Production" Table="Product" />
<Column IsWildCard="true" />
</ColumnStatisticsProfileRequest>
</Requests>
</DataProfileInput>
<DataProfileOutput>
<Profiles />
</DataProfileOutput>
</DataProfile>
Note that no profile output is included; the <Profiles /> element under <DataProfileOutput> is left empty.
Once you have the XML, you need to change a few things to use it in an expression. First, the entire string needs to be put inside double quotes ("). Second, any existing double quotes need to be escaped, using a backslash (\). For example, the ID attribute ID="StatisticsReq" needs to be formatted as ID=\"StatisticsReq\". Finally, the profile requests need to be altered to include the schema and table name variables created previously. These modifications are shown in listing 2.

Listing 2 Data profiling XML after converting to an expression
"<?xml version=\"1.0\" encoding=\"utf-16\"?>
<DataProfile xmlns:xsi=\" /> xmlns:xsd=\" /> xmlns=\" /> <DataSources />
<DataProfileInput>
<ProfileMode>Exact</ProfileMode>
<Timeout>0</Timeout>
<Requests>
<ColumnNullRatioProfileRequest ID=\"NullRatioReq\">
<DataSourceID>{8D7CF241-6773-464A-87C8-60E95F386FB2}</DataSourceID>
<Table Schema=\"" + @[User::SchemaName] +
"\" Table=\"" +
@[User::TableName] + "\" />
<Column IsWildCard=\"true\" />
</ColumnNullRatioProfileRequest>
<ColumnStatisticsProfileRequest ID=\"StatisticsReq\">
<DataSourceID>{8D7CF241-6773-464A-87C8-60E95F386FB2}</DataSourceID>
<Table Schema=\"" + @[User::SchemaName] +
"\" Table=\"" +
@[User::TableName] + "\"/>
<Column IsWildCard=\"true\" />
</ColumnStatisticsProfileRequest>
</Requests>
</DataProfileInput>
<DataProfileOutput>
<Profiles />
</DataProfileOutput>
</DataProfile>"
To apply this XML to the Data Profiling task, open the Property Expressions Editor by opening the Data Profiling Task Editor and going to the Expressions page. Select the ProfileInputXml property, and set the expression to be the ProfileXML variable.
Now the task is set up so that you can change the target table by updating the SchemaName and TableName variables, with no modification to the task necessary.
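If the variables are also exposed on the command line, you can point the task at a different table without even opening the package. The following DTEXEC call is only a sketch; the package name and the schema and table values are placeholders:

dtexec /File ProfileTables.dtsx
    /Set \Package.Variables[User::SchemaName].Properties[Value];Sales
    /Set \Package.Variables[User::TableName].Properties[Value];Customer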
Expressions in SSIS
Expressions in SSIS are limited to producing output no longer than 4,000 characters. Although that is enough for the example in this chapter, you may need to take it into account when working with multiple profiles. You can work around the limitation by executing the Data Profiling task multiple times, with a subset of the profiles in each execution to keep the expression under the 4,000-character limit.

Now that we’ve made the task dynamic, let’s move on to making decisions based on the output of the task.
Making data-quality decisions in the ETL
The Data Profiling task output can be used to make decisions about the quality of your data, and by incorporating the task output into your ETL process, you can automate these decisions. By taking things a little further, you can make these decisions self-adjusting as your data changes over time. We’ll take a look at both scenarios in the following sections.
Excluding data based on quality
Most commonly, the output of the Data Profiling task will change the flow of your ETL depending on the quality of the data being processed. A simple example of this might be using the Column Null Ratio profile to evaluate a Customer table prior to extracting it from the source system. If the null ratio is greater than 30 percent for the Customer Name column, you might have your SSIS package set up to abort the processing and log an error message. This is an example of using data profiling information to prevent bad data from entering your data warehouse.

In situations like the preceding, though, a large percentage of rows that may have had acceptable data quality would also be excluded. For many data warehouses, that’s not acceptable. It’s more likely that these “hard” rules, such as not allowing null values in certain columns, will be implemented on a row-by-row basis, so that all acceptable data will be loaded into the warehouse, and only bad data will be excluded. In SSIS, this is often accomplished in the data flow by using Conditional Split transformations to send invalid data to error tables.
Adjusting rules dynamically
A more complex example involves using data profiling to establish what good data looks like, and then using this information to identify data of questionable quality. For example, if you are a retailer of products from multiple manufacturers, your Product table will likely have the manufacturer’s original part number, and each manufacturer may have its own format for part numbers. In this scenario, you might use the Column Pattern profile against a known good source of part numbers, such as your Product table or your Product master, to identify the regular expressions that match the part numbers. During the execution of your ETL process, you could compare new incoming part numbers with these regular expressions to determine if they match the known formats for part numbers. As new products are added to the known good source of part numbers, new patterns will be included in the profile, and the rule will be adjusted dynamically.
It’s worth noting that this type of data-quality check is often implemented as a “soft” rule, so the row is not prohibited from entering the data warehouse. After all, the manufacturer may have implemented a new part-numbering scheme, or the part number could have come from a new manufacturer that is not in the Product dimension yet. Instead of redirecting the row to an error table, you might set a flag on the row indicating that there is a question as to the quality of the information, but allow it to enter the data warehouse anyway. This would allow the part number to be used for recording sales of that product, while still identifying a need for someone to follow up and verify that the part number is correct. Once they have validated the part number, and corrected it if necessary, the questionable data flag would be removed, and that product could become part of the known good set of products. The next time that you generate a Column Pattern profile against the part numbers, the new pattern will be included, and new rows that conform to it will no longer be flagged as questionable.

As mentioned earlier, implementing this type of logic in your ETL process can allow it to dynamically adjust data-quality rules over time, and as your data quality gets better, the ETL process will get better at flagging questionable data.

Now let’s take a look at how to use the task output in the package.
Consuming the task output
As mentioned earlier, the Data Profiling task produces its output as XML, which can be stored in a variable or a file. This XML output will include both the profile requests and the output profiles for each request.
Capturing the output
If you are planning to use the output in the same package that the profiling task is in, you will usually want to store the output XML in a package variable. If the output will be used in another package, how you store it will depend on how the other package will be executed. If the second package will be executed directly from the package performing the profiling through an Execute Package task, you can store the output in a variable and use a Parent Package Variable configuration to pass it between the packages. On the other hand, if the second package will be executed in a separate process or at a different time, storing the output in a file is the best option.
Regardless of whether the output is stored in a variable or a file, it can be accessed in a few different ways. Because the output is stored as XML, you can make use of the XML task to use it in the control flow, or the XML source to use it in the data flow. You can also use the Script task or the Script component to manipulate the XML output directly using .NET code.
Using SSIS XML functionality
The XML task is provided in SSIS so that you can work with XML in the control flow. Because the Data Profiling task produces XML, it is a natural fit to use the XML task to process the data profile output. Primarily, the XSLT or XPath operations can be used with the profile XML.

The XSLT operation can be used to transform the output into a format that’s easier to use, such as filtering the profile output down to specific profiles that you are interested in, which is useful if you want to use the XML source to process it. The XSLT operation can also be used to remove the default namespace from the XML document, which makes using XPath against it much easier.

XPath operations can be used to retrieve a specific value or set of nodes from the profile. This option is illustrated by the Trim Namespaces XML task in the sample package that accompanies this chapter, showing how to retrieve the null count for a particular column using XPath.
NOTE The sample package for this chapter can be found on the book’s website at http://www.manning.com/SQLServerMVPDeepDives.
In the data flow, the XML source component can be used to get information from the Data Profiling task output. You can do this in two ways, one of which is relatively straightforward if you are familiar with XSLT. The other is more complex to implement but has the benefit of not requiring in-depth XSLT knowledge.

If you know XSLT, you can use an XML task to transform and simplify the Data Profiling task output prior to using it in the XML source, as mentioned previously. This can help avoid having to join multiple outputs from the XML source, which is discussed shortly.
If you don’t know XSLT, you can take a few additional steps and use the XML source directly against the Data Profiling task output. First, you must provide an .XSD file for the XML source, but the .XSD published by Microsoft at http://schemas.microsoft.com/sqlserver/2008/DataDebugger/DataProfile.xsd is too complex for the XML source. Instead, you will need to generate a schema using an existing data profile that you have saved to a file. Second, you have to identify the correct outputs from the XML source. The XML source creates a separate output for each distinct element type in the XML: the output from the Data Profiling task includes at least three distinct elements for each profile you include, and for most profiles it will have four or more. This can lead to some challenges in finding the appropriate output information from the XML source. Third, because the XML source does not flatten the XML output, you have to join the multiple outputs together to assemble meaningful information. The sample package on the book’s website (http://www.manning.com/SQLServerMVPDeepDives) has an example of doing this for the Column Pattern profile. The data flow is shown in figure 3.
In the data flow shown in figure 3, the results of the Column Pattern profile are being transformed from a hierarchical structure (typical for XML) to a flattened structure suitable for saving into a database table. The hierarchy for a Column Pattern profile has five levels that need to be used for the information we are interested in, and each output from the XML source includes one of these levels. Each level contains a column that ties it to the levels used below it. In the data flow, each output from the XML source is sorted, so that consistent ordering is ensured. Then, each output, which represents one level in the hierarchical structure, is joined to the output representing the next level down in the hierarchy. Most of the levels have a ColumnPatternProfile_ID, which can be used in the Merge Join transformation to join the levels, but there is some special handling required for the level representing the patterns, as they need to be joined on the TopRegexPatterns_ID instead of the ColumnPatternProfile_ID. This data flow is included in the sample package for this chapter, so you can review the logic if you wish.
Figure 3 Data flow to reassemble a Column Pattern profile

New to XML?
If you are new to XML, the preceding discussion may be a bit confusing, and the reasons for taking these steps may not be obvious. If you’d like to learn more, online resources and tutorials covering both general XML concepts and working with XML in SSIS are readily available.
Using scripts
Script tasks and components provide another means of accessing the information in the Data Profiling task output. By saving the output to a package variable, you make it accessible within a Script task. Once in the Script task, you have the choice of performing direct string manipulation to get the information you want, or you can use the XmlDocument class from the System.Xml namespace to load and process the output XML. Both of these approaches offer a tremendous amount of flexibility in working with the XML. As working with XML documents using .NET is well documented, we won’t cover it in depth here.
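As a brief, hedged sketch of the XmlDocument approach, the code below could go in the Main method of a Script task that reads the profile XML from a variable named ProfileOutput and writes a null count into a variable named NullCount. The variable names are hypothetical, and the XPath and element names are illustrative; check them against your own profile output.

// Inside the Script task's generated ScriptMain class (C#, SSIS 2008).
// The task is assumed to list User::ProfileOutput as a read variable and
// User::NullCount as a read/write variable; both names are placeholders.
public void Main()
{
    System.Xml.XmlDocument doc = new System.Xml.XmlDocument();
    doc.LoadXml(Dts.Variables["User::ProfileOutput"].Value.ToString());

    // The profile output uses the DataProfile.xsd namespace, so register it
    // for the XPath query.
    System.Xml.XmlNamespaceManager ns =
        new System.Xml.XmlNamespaceManager(doc.NameTable);
    ns.AddNamespace("dp", "http://schemas.microsoft.com/sqlserver/2008/DataDebugger/");

    // Illustrative XPath: the NullCount of the first Column Null Ratio result.
    System.Xml.XmlNode node =
        doc.SelectSingleNode("//dp:ColumnNullRatioProfile/dp:NullCount", ns);

    Dts.Variables["User::NullCount"].Value =
        (node == null) ? 0 : int.Parse(node.InnerText);

    Dts.TaskResult = (int)ScriptResults.Success;
}

A precedence constraint expression downstream can then compare the NullCount variable against a threshold, as described in the next section.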
Another approach that requires scripting is the use of the classes in the DataProfiler.dll assembly. These classes facilitate loading and interacting with the data profile through a custom API, and the approach works well, but this is an undocumented and unsupported API, so there are no guarantees when using it. If this doesn’t scare you off, and you are comfortable working with unsupported features (that have a good chance of changing in new releases), take a look at “Accessing a data profile programmatically” on the SSIS Team Blog for an example of using the API to load and retrieve information from a data profile.

Incorporating the values in the package
Once you have retrieved values from the data profile output, using one of the methods discussed in the previous sections, you need to incorporate them into the package logic. This is fairly standard SSIS work.

Most often, you will want to store specific values retrieved from the profile in package variables, and use those variables to make dynamic decisions. For example, consider the Column Null Ratio profiling we discussed earlier. After retrieving the null count from the profile output, you could use an expression on a precedence constraint to have the package stop processing if the null count is too high.

In the data flow, you will often use Conditional Split or Derived Column transformations to implement the decision-making logic. For example, you might use the Data Profiling task to run a Column Length Distribution profile against the product description column in your Product table. You could use a Script task to process the profile output and determine that 95 percent of your product descriptions fall between 50 and 200 characters. By storing those boundary values in variables, you could check for new product descriptions that fall outside of this range in your ETL. You could use the Conditional Split transformation to redirect these rows to an error table, or the Derived Column transformation to set a flag on the row indicating that there might be a data-quality issue.
Some data-quality checking is going to require more sophisticated processing. For the Column Pattern checking scenario discussed earlier, you would need to implement a Script component in the data flow that can take a list of regular expressions and apply them against the column that you wanted to check. If the column value matched one or more of the regular expressions, it would be flagged as OK. If the column value didn’t match any of the regular expressions, it would be flagged as questionable, or redirected to an error table. Listing 3 shows an example of the code that can perform this check. It takes in a delimited list of regular expression patterns, and then compares each of them to a specified column.

Listing 3 Script component to check column values against a list of patterns
// Requires the following directives at the top of the Script component code file
// (in addition to those generated by the project template):
//     using System.Collections.Generic;
//     using System.Text.RegularExpressions;
public class ScriptMain : UserComponent
{
    List<Regex> regex = new List<Regex>();

    public override void PreExecute()
    {
        base.PreExecute();

        // Read the ~-delimited pattern list from the RegExPatterns variable and
        // compile each pattern once, before any rows are processed.
        string[] regExPatterns;
        IDTSVariables100 vars = null;
        this.VariableDispenser.LockOneForRead("RegExPatterns", ref vars);
        regExPatterns =
            vars["RegExPatterns"].Value.ToString().Split("~".ToCharArray());
        vars.Unlock();

        foreach (string pattern in regExPatterns)
        {
            regex.Add(new Regex(pattern, RegexOptions.Compiled));
        }
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (Row.Size_IsNull) return;

        // Flag the row as good if the Size column matches any of the known
        // patterns; stop checking as soon as one pattern matches.
        Row.GoodRow = false;
        foreach (Regex r in regex)
        {
            if (r.Match(Row.Size).Success)
            {
                Row.GoodRow = true;
                break;
            }
        }
    }
}
Summary
Over the course of this chapter, we’ve looked at a number of ways that the Data Profiling task can be used in SSIS, from using it to get a better initial understanding of your data to incorporating it into your ongoing ETL processes. Being able to make your ETL process dynamic and more resilient to change is important for ongoing maintenance and usability of the ETL system. As data volumes continue to grow, and more data is integrated into data warehouses, the importance of data quality increases as well. Establishing ETL processes that can adjust to new data and still provide valid feedback about the quality of that data is vital to keeping up with the volume of information we deal with today.
About the author
John Welch is Chief Architect with Mariner, a consulting firm specializing in enterprise reporting and analytics, data warehousing, and performance management solutions. John has been working with business intelligence and data warehousing technologies for seven years, with a focus on Microsoft products in heterogeneous environments. He is an MVP and has presented at Professional Association for SQL Server (PASS) conferences, the Microsoft Business Intelligence conference, Software Development West (SD West), Software Management Conference (ASM/SM), and others. He has also contributed to two recent books on SQL Server 2008: Microsoft SQL Server 2008 Management and Administration (Sams, 2009) and Smart Business Intelligence Solutions with Microsoft SQL Server 2008 (Microsoft Press, 2009).
57 Expressions in SQL Server Integration Services

Matthew Roche
SQL Server Integration Services (SSIS) is Microsoft’s enterprise extract, transform, and load (ETL) platform, and is used in large-scale business intelligence projects and small-scale import/export jobs around the world. Although SSIS contains an impressive set of features for solving a range of data-centric problems, one feature, expressions, stands out as the most important for SSIS developers to master.

Expressions in SSIS are a mechanism to add dynamic functionality to SSIS packages; they are the primary tool that SSIS developers can use to build packages to solve complex real-world problems. This chapter examines SSIS expressions from the perspective of providing elegant solutions to common problems and presents a set of tested techniques that will allow you to take your SSIS packages to the next level.
SSIS packages: a brief review
Before we can dive into the deep end with expressions, we need to look at SSIS packages, the context in which expressions are used. Packages in SSIS are the units of development and deployment; they’re what you build and execute, and they have a few common components, including:

- Control flow: The execution logic of the package, which is made up of tasks, containers, and precedence constraints. Each package has a single control flow.
- Data flow: The high-performance data pipeline that powers the core ETL functionality in SSIS, and is made up of sources, transformations, and destinations. The SSIS data flow is implemented as a task, which allows multiple data flow tasks to be added to a package’s control flow.
- Connection managers: Shared components that allow the control flow and data flow to connect to databases, files, and other resources outside of the package.
- Variables: The sole mechanism for sharing information between components in an SSIS package; variables have deep integration with expressions as well.

SSIS packages include more than just these elements, but for the purposes of this chapter, that’s enough review. Let’s move on to the good stuff: expressions!
Expressions: a quick tour
Expressions add dynamic functionality to SSIS packages using a simple syntax based on a subset of the C language. Expression syntax does not include any control of flow (looping, branching, and so on) or data modification capabilities. Each expression evaluates to a single scalar value, and although this can often seem restrictive to developers who are new to SSIS, it allows expressions to be used in a variety of places within a package.

How can we use expressions in a package? The simplest way is to use property expressions. All containers in SSIS, including tasks and the package itself, have an Expressions property, which is a collection of expressions and the properties to which their values will be assigned. This allows SSIS package developers to specify their own code (the expression) that is evaluated whenever a property of a built-in or third-party component is accessed. How many other development tools let you do that?

Let’s look at an example. Figure 1 shows the properties for an Execute SQL Task configured to execute a DELETE statement.

Figure 1 Static task properties

Although this Execute SQL Task is functional, it isn’t particularly useful unless the package always needs to delete the order details for [OrderID]=5. This task would be much more useful if it instead deleted whatever order number was current for the package execution. To implement this dynamic behavior, we’re going to take two steps. First, we’re going to add a new variable, named OrderID, to the package. (If you don’t know how to do this already, consider it an exercise; we won’t walk through adding a variable step by step.) Second, we’re going to add a property expression to the SqlStatementSource property of the Execute SQL Task. To do this, we’ll follow the steps illustrated in figure 2.
1 In the Properties window, select the Execute SQL Task and then click on the ellipsis (...) button next to the Expressions property. This will cause the Property Expressions Editor dialog box to be displayed.
2 In the Property Expressions Editor dialog box, select the SqlStatementSource property from the drop-down list in the Property column.
3 Click on the ellipsis button in the Expression column. This will cause the Expression Builder dialog box to be displayed. (Please note that figure 2 shows only a subset of the Expression Builder dialog box to better fit on the printed page.)
4 Enter the following expression in the Expression text box:

   "DELETE FROM [dbo].[Order Details] WHERE [OrderID] = " + (DT_WSTR, 50)
   @[User::OrderID]

5 Click on the Evaluate Expression button to display the output of the expression in the Evaluated Value text box. (At this point it may be useful to copy and paste this value into a SQL Server Management Studio query window to ensure that the expression was constructed correctly.)
6 Click on the OK buttons to close the Expression Builder and Property Expressions Editor windows and save all changes.
7 Execute the package to ensure that the functionality added through the expression behaves as required.

Figure 2 Adding a property expression
Several important techniques are demonstrated in these steps:

- We started with a valid static value before we added the expression. Instead of starting off with a dynamic SQL statement, we started with a static statement which we tested to ensure that we had a known good starting point.
- We added a single piece of dynamic functionality at a time. Because our example was simple, we only added a single piece of dynamic functionality in total; but if we were adding both a dynamic WHERE clause and a dynamic table name, we would’ve added each dynamic expression element to the static SQL statement individually.
- We tested the expression after each change. This basic technique is often overlooked, but it’s a vital timesaver. The Expression Editor has limited debugging capabilities, and locating errors in a complex expression can be painfully difficult. By testing the expression after each change, the scope of debugging can be significantly reduced.

With this example setting the stage, let’s dive deeper into SSIS expressions by illustrating how they can be used to add dynamic functionality to our packages, and solve real-world problems.
Expressions in the control flow
We’ll continue by looking at expressions in the SSIS control flow. Although the example in the previous section is technically a control flow example (because we applied a property expression to a property of a task, and tasks are control flow components), there are more interesting examples and techniques we can explore. One of the most important, and most overlooked, techniques is using expressions with precedence constraints to conditionally execute tasks.

Consider the following requirements:

- If a specific table exists in the target database, execute a data flow task.
- If the table does not exist, execute an Execute SQL Task to create the table, and then execute the data flow task.

If this problem needed to be solved using a traditional programming language, the developer would add an if statement and that would be that. But SSIS does not include an if statement, a branching task, or the like, so the solution, although simple, is not always obvious.

An often-attempted approach to solve this problem is to add a property expression to the Disabled property of the Execute SQL Task. The rationale here is that if the Execute SQL Task is disabled then it won’t execute, and only the data flow task will run. The main problem with this approach is that the Disabled property is designed to be used