30. Drop an Excel Destination just below the Splitting Fuzzy Grouped Duplicates
component and join both of these components using the green arrow. Select
Fuzzy Grouped Matches in the Output field of the Input Output Selection dialog
box and click OK. Rename this Excel destination Fuzzy Grouped Duplicates.
31. Double-click the Excel destination to configure it. Select DuplicateOwners
Connection in the OLE DB Connection Manager field if it’s not already selected.
32. Click the New button next to the “Name of the Excel sheet” field and verify in the
CREATE TABLE statement that it is creating the Fuzzy Grouped Duplicates
table—i.e., worksheet in Excel—then click OK to accept. Select Fuzzy_Grouped_
Duplicates in this field.
33. Go to the Mappings page to create the necessary column mappings automatically.
Note that in addition to the _key_in, _key_out, and _score columns, the Fuzzy Grouping
transformation has added one _Similarity_ColumnName column for each column
that participated in the fuzzy grouping. Click OK to close this editor.
34. Drop an OLE DB destination on the Data Flow surface below the Splitting
Fuzzy Grouped Duplicates component and join the component to the OLE DB
destination using the available green arrow. Select Canonical Row in the Output
field of the Input Output Selection dialog box and click OK.
35. Double-click the OLE DB destination and select [dbo].[Owner] table in the
Name of the table or the view field. Go to the Mappings page and verify that the
necessary mappings have been automatically created. Click OK to close this editor.
Rename the OLE DB Destination Owner. Press CTRL-SHIFT-S to save all the files
in the project.
Exercise (Execute Removing Duplicates Package)
Finally, execute the package to see how the various transformations remove the
duplicate data.
36. You can add data viewers after each transformation to see the records that
have been removed from the pipeline. Later, I explain how you can re-run
the package to see the workings of the various transformations over and over. As a
first go, just run it without any data viewers.


37. Press 5 to execute the package. When the package completes execution, note the
number of records after each transformation (see Figure 10-32).
Following is an explanation of the execution results:
- The Excel source brings 13 rows into the data flow.
- The Sort transformation removes two exact duplicate records, one for Johnathon
Skinner and one for Kathrine Morris.
- The Lookup transformation then matches Johnathon Skinner in the reference
table and diverts it to the Excel destination; the remaining ten rows are then
passed on to the Fuzzy Lookup transformation.
- The Fuzzy Lookup transformation fuzzy-matches John, Jonothon, and Jonathon
with the Johnathon in the reference table and marks all three records with a high
similarity and confidence score. This high similarity and confidence score is then
used by the Conditional Split transformation to filter out these three records to
the Excel destination.
- The remaining seven records are then passed on to the Fuzzy Grouping
transformation, which looks for records that are likely duplicates in the data flow
and groups them together using the _key_in, _key_out, and _score column values.
These values are then used by the Conditional Split transformation to filter out
the records for Kathy, Kath, and Kathey that are fuzzy-grouped together (a row
whose _key_in equals its _key_out is the canonical row of its group). The remaining
four unique records are then sent to the OLE DB destination for loading into the
Owner table.

Figure 10-32 Executing the Removing Duplicates package
Review
You’ve seen various types of duplicates and the methods for removing them in this package.
This package will give you a kick start into real-life de-duplication problems. However, bear
in mind that whenever you use the Fuzzy Lookup and Fuzzy Grouping transformations, you need
to find the similarity threshold values by running the package on sample data that correctly

represents the main data. The more effort you put into finding the value of the similarity
threshold that works with your data, the more accurate your results will be.
If you want to re-run the package, you need to execute the following SQL statements
against the Campaign database in SQL Server Management Studio. This will
refresh the Owner table so that you can run the package with the same results.
USE Campaign;
DROP TABLE [dbo].[Owner];
SELECT * INTO [dbo].[Owner] FROM Owner_original;
Summary
Having used lots of Data Flow transformations, you must by now feel a lot more
confident and ready for real-life challenges. You have used various transformations
to perform functions such as pivoting; sorting; performing an exact lookup for
de-duplication of data; standardizing data; using fuzzy lookups and fuzzy grouping to
eliminate duplicates in a pipeline; aggregating data; and loading a slowly changing
dimension table.
You’ve also studied several other preconfigured transformations that are straightforward
to use in your packages. Having come so far, you can now create control flow and data
flow in your packages to perform workflow and transformation functions, store and
manage your SSIS packages, and secure them as well. In the next chapter, you will
study how to deploy your packages in an enterprise environment.
Chapter 11
Programming Integration Services
In This Chapter
- The Two Engines of Integration Services
- Programming Options
- Extending Packages with Scripting
- The Legacy Scripting Task: ActiveX Script Task
- Script Task
- Script Component
- Script Task vs. Script Component
- Summary
By now you have worked with almost all the preconfigured tasks and
components that Integration Services provides and can appreciate the power
and ease it provides to developers for building enterprise-wide solutions.
However, businesses are doing so many different things that it is sometimes not easy or
even possible to build a solution for every scenario that can exist in an enterprise using
the preconfigured tasks and components. SSIS does provide a way to cover even those
complex scenarios. You can extend SSIS by writing your own custom code.
Not only can you extend SSIS using custom code, but Microsoft has also made it
easier to do so by providing various programming levels. SSIS provides a much
enhanced object model that can be easily programmed, with different options to
choose from based on the problem you're trying to solve.
You can choose to extend your packages using scripting, you can develop custom
components that can be deployed into SSIS and used as preconfigured components,
or you can program your packages all over from scratch. In this chapter you will learn
more about these options and when to choose the one that is most appropriate for
your particular scenario. At this point, however, I want to clarify that programming
Integration Services is a vast subject, and it is very difficult to cover it completely, or
even do it justice, in just one chapter; the subject probably requires a complete book in
itself. So, in this chapter scripting SSIS is covered in detail, as I think it is the one area
that will be of interest to most readers, while the other options are covered only at an
introductory level so that you can choose the best method to extend SSIS. Refer to
Books Online for the other programming options.
The Two Engines of Integration Services
As you know, all of your packages contain Control Flow tasks and most of them also
contain a Data Flow task, which is a special task and has its own components such
as sources, destinations, and transformations. You also understand that the work
flow and the management of the tasks are designed in the Control Flow pane, while the
data movement and the transformations are designed in the Data Flow pane. If you refer
back to Figure 1-1, you will notice that the top half of the figure—the object model of
the Integration Services run time—includes connection managers, event handlers,
and log providers, along with tasks and containers. This represents the Integration
Services run-time engine and, as you can see, provides the necessary infrastructure for
package execution and management support such as execution order, logging, event
handling, connections, breakpoints, and transactions. The second engine of Integration
Services, shown in the lower half of the Architecture diagram, manages the data
movement and the transformations. The Data Flow task that performs the actual work
of data movement and transformation runs under the management of the Data Flow
engine, also popularly called the pipeline. When you drop the first Data Flow task on
the Control Flow Designer surface, you invoke the data flow engine. Though you’ll
have only one Control Flow within a package, you can include multiple Data Flow
tasks in the package, with each Data Flow task able to support multiple data sources,
transformations, and destinations.
These two engines of Integration Services provide complete control over the
execution of a package and the flexibility to deal with buffer-oriented data movement

and transformations in a very efficient way. You may also notice in the architecture
diagram that both engines provide scope for building custom objects such as custom
tasks, custom log providers, and custom connection managers for the run-time engine,
and custom data flow components such as custom sources, custom transformations, and
custom destinations for the pipeline. In fact, whenever you extend Integration Services
programmatically, you will be working with these engines using different classes,
methods, and properties exposed by the engines.
Some of the tasks and components provided in Integration Services are written in
managed code, whereas the run-time engine and the data flow engine have been written
in native code for enhanced performance; both, however, are exposed for development
through the managed object model of Integration Services, which provides ease of
extension. The run-time engine is exposed through the Microsoft.SqlServer.Dts.Runtime
namespace, which contains the classes and interfaces used to create packages, custom
tasks, and other package control flow elements, and the data flow engine is exposed
through the Microsoft.SqlServer.Dts.Pipeline namespace, which contains the classes
used to develop managed data flow components.
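To make this concrete, here is a minimal sketch of the two engines meeting in code: the
Package class belongs to the run-time engine, and adding a Data Flow task (via the
"STOCK:PipelineTask" moniker) brings the pipeline engine into the package. This assumes
a console project referencing Microsoft.SqlServer.ManagedDTS.dll and is illustrative
only, not a complete solution.

' Minimal sketch, assuming a reference to Microsoft.SqlServer.ManagedDTS.dll.
Imports Microsoft.SqlServer.Dts.Runtime

Module TwoEnginesDemo
    Sub Main()
        ' The Package object is a run-time engine object.
        Dim pkg As New Package()
        ' Adding a Data Flow task invokes the data flow (pipeline) engine inside the package.
        Dim exe As Executable = pkg.Executables.Add("STOCK:PipelineTask")
        Dim dataFlowTask As TaskHost = CType(exe, TaskHost)
        dataFlowTask.Name = "Data Flow Task"
        Console.WriteLine("Executables in package: " & pkg.Executables.Count.ToString())
    End Sub
End Module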
Programming Options
The object model of Integration Services allows you to program any aspect of SSIS; for
instance, you can extend the prebuilt functionality, manage the interaction of SSIS with
other applications by building interfaces, or have SSIS packages created programmatically
by your custom-built application. This is possible because Integration Services fully
supports the Microsoft .NET Framework and allows you to choose any of the .NET-
compliant languages. The SSIS development team has done an excellent job of making
it easier to program SSIS by extending packages, yet more powerful by enabling the
development of custom objects that are built outside the package but, once deployed
into the Integration Services object model, can be included in a package just like the
prebuilt objects. You can use Microsoft Visual Studio or any other development
environment with your preferred .NET-compliant language to write custom code. So,
depending upon your requirements and your ability to program, you can choose among
simple one-off scripting, developing custom SSIS objects, or building complete
packages from scratch. Let's explore these options in detail.
Scripting
As mentioned earlier, when you have a need that can’t be met using prebuilt tasks or
components, you’ll need to create the required functionality. Now, if your need is one
off—i.e., the particular functionality is not going to be required in other packages and
you are looking for the least development effort—scripting SSIS should be your choice.
The code developed using scripting is generally reused within the development team
working on the same project. The developers who have worked with SQL Server Data
Transformation Services (versions 7.0 and 2000) might have used the scripting option
already. The ActiveX Script task was the only method to extend Data Transformation
Services (DTS), so many developers used it extensively and some actually built
quite complex scripting solutions that were deployed throughout the enterprise.
It was later realized that the cost of maintaining such solutions is quite high, as
the ActiveX Script task was not designed to create enterprise solutions. Integration
Services has overcome this limitation and includes many flexible scripting options
along with the possibility of developing reusable custom objects that are easy to
maintain. Though the ActiveX Script task is still provided in Integration Services,
you should refrain from using it for new development work; it is provided only
for backward compatibility, as interim support while you migrate your packages to
Integration Services.
The two scripting objects, the Script task and the Script component, replace the
DTS scripting functionality with a much better and more powerful programming
environment, Microsoft Visual Studio Tools for Applications (VSTA). This embedded
scripting environment allows you to choose Microsoft Visual Basic 2008 or Microsoft
Visual C# 2008 as your preferred language for scripting. You can create a custom task
for use in the Control Flow with the Script task or a custom component such as a
source or a transformation, or else a destination for use in the Data Flow task with the
Script component. When you use the VSTA environment to write scripts for either of
these script objects, the scripting environment creates lots of infrastructure code for you

and leaves you to focus on writing the code for the required functionality. This makes
writing scripts much easier using VSTA. There are several other benefits to using this
powerful IDE, such as extensive debugging and testing of the written code. Due to
.NET Framework support, you can use the .NET namespaces, take advantage of the
class libraries, and also reference external .NET assemblies quite easily in your scripts.
This is a very powerful feature that can save you many man-days of redevelopment
effort for assemblies that already exist. You can simply reference existing assemblies
and use already-developed business rules or functionality within your package.
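As a quick illustration, the following is the kind of code you write inside a Script task
once VSTA has generated the surrounding ScriptMain infrastructure (including the
ScriptResults enum) for you. This is a minimal sketch: it assumes a package variable
named User::FileName exists and has been listed in the task's ReadOnlyVariables property.

Public Sub Main()
    ' Dts is the object VSTA provides for reaching package variables, connections, and events.
    Dim fileName As String = CStr(Dts.Variables("User::FileName").Value)
    If System.IO.File.Exists(fileName) Then
        ' Raise an informational event that any configured log provider can capture.
        Dim fireAgain As Boolean = True
        Dts.Events.FireInformation(0, "ScriptMain", "File found: " & fileName, "", 0, fireAgain)
        Dts.TaskResult = ScriptResults.Success
    Else
        Dts.TaskResult = ScriptResults.Failure
    End If
End Sub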
All this power and ease of use doesn’t come free, though the cost of these benefits
is very minimal in this case. The code that you write in the Script task or Script
component resides in the package and is not available to other packages. If you want to
reuse the code, you will have to copy the script to other packages. To explain it further,
when you deploy a package that has been developed using only the prebuilt objects to
a different server, the code for prebuilt objects is not sent along with the code for your
package. The prebuilt objects are available as a compiled binary library to the package
within Integration Services, while the custom scripts obviously have not been published
and hence are not available for use with other packages. You can copy the code quite
easily to script objects in other packages if you need to; however, that will increase the
maintenance cost. Think about code that needs to be modified but has been copied into
hundreds of packages across the enterprise; updating it won't be a welcome task.
The facility to script yourself out of a requirement should therefore be used carefully,
where you know that the requirement is unique and will not be needed in many
packages. If that's not the case, you'll be better off with the custom-built extensions
that have been developed from scratch by deriving from the base classes provided by the
Integration Services object model.
Developing Custom Objects from Scratch
If you do not want to use scripting to extend your packages because the custom code
might be used in multiple packages and you don’t want to undertake the hassle of
fixing several packages later on, you can build custom extensions in the managed code

from scratch. Using the managed object model of Integration Services, you can develop
extensions such as control flow tasks, connection managers, log providers, enumerators,
data flow sources, data flow transformations, and data flow destinations. To develop
a custom object, you will inherit from the appropriate base class as provided for the
functionality and build on that. For example, to develop a control flow task, you will
inherit from the Task base class, and for a data flow component you will inherit from
the PipelineComponent base class. The provision of a base class as a starting point
makes it much easier to develop custom extensions.
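To make the starting point concrete, the skeleton below shows what deriving from the
Task base class looks like for a control flow task. The class name and attribute values
are illustrative only; the project would reference Microsoft.SqlServer.ManagedDTS.dll,
and the resulting assembly would then be deployed as described next.

' Illustrative skeleton of a custom control flow task; names are placeholders.
Imports Microsoft.SqlServer.Dts.Runtime

<DtsTask(DisplayName:="My Custom Task", Description:="A custom task skeleton")> _
Public Class MyCustomTask
    Inherits Task

    ' Called at design time and before execution to check the task's configuration.
    Public Overrides Function Validate(ByVal connections As Connections, _
            ByVal variableDispenser As VariableDispenser, _
            ByVal componentEvents As IDTSComponentEvents, _
            ByVal log As IDTSLogging) As DTSExecResult
        Return DTSExecResult.Success
    End Function

    ' Called at run time to do the task's actual work.
    Public Overrides Function Execute(ByVal connections As Connections, _
            ByVal variableDispenser As VariableDispenser, _
            ByVal componentEvents As IDTSComponentEvents, _
            ByVal log As IDTSLogging, ByVal transaction As Object) As DTSExecResult
        ' Custom functionality goes here.
        Return DTSExecResult.Success
    End Function
End Class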
Once the object development has been completed, you will then build and deploy
the object assembly into the appropriate Integration Services and global assembly cache
(GAC). The object then can be added into the Toolbox within Visual Studio and can
be used as any other prebuilt object. You will need to deploy the custom extensions on
all the servers wherever you want to use them. For example, suppose a developer builds
a package using a custom component that has been installed on his computer and wants
to share this package with another team member. This new team member cannot use
the package until he or she installs the custom component on his or her computer, as the
component will be referenced locally on the computer. The availability of the custom
extension in the SSIS designer makes it very easy to reuse it. As mentioned earlier,
the code for the prebuilt components does not get copied into the package; rather, the
package references the objects only. This also applies to the custom-built extensions,
and you do not need to worry about the deployment of the custom objects within your
packages. The custom-built extensions are not deployed with your packages; rather, they
are handled separately and keep your package deployments simple. This means that if
you need to make an enhancement or a change to a custom extension, you do not need
to modify all your packages; you need to change only the custom extension,
and the packages will automatically pick up the changed object at their next run.
Building Packages Programmatically
When you want to work with your packages programmatically, the object model
allows you to create, configure, load, and execute packages. You can create packages

dynamically and define the sources, transformations, and metadata of the selected
columns and destinations. Just to explain, think of a CRM application that you may
want to extend with ETL capabilities so that you can create a reporting data mart. This
CRM application is configured with different metadata for different clients, so you can
create SSIS packages programmatically reading metadata of the deployed application
from your application interfaces and avoid configuring SSIS packages manually for
each client. Such extended applications can save you and the customer a lot of time and
effort. Depending on your requirements, you can create a grand application by creating
packages from scratch, including all the package objects; you can simplify your solution
by loading a template package and configuring it with the relevant changes; or you can
simply load and run an existing package, as sketched below.
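The following sketch shows the simplest of these options: loading an existing package,
repointing one of its connection managers, and executing it. The file path, connection
manager name, and connection string are placeholders, and the code assumes a reference
to Microsoft.SqlServer.ManagedDTS.dll.

' Minimal sketch: load, reconfigure, and run an existing package.
Imports Microsoft.SqlServer.Dts.Runtime

Module LoadAndRunPackage
    Sub Main()
        Dim app As New Application()
        ' Load a saved .dtsx from the file system (Nothing = no events listener).
        Dim pkg As Package = app.LoadPackage("C:\SSIS\MartLoad.dtsx", Nothing)
        ' Repoint an existing connection manager for this client before executing.
        pkg.Connections("Campaign").ConnectionString = _
            "Data Source=localhost;Initial Catalog=Campaign;" & _
            "Provider=SQLNCLI10.1;Integrated Security=SSPI;"
        Dim result As DTSExecResult = pkg.Execute()
        Console.WriteLine("Execution result: " & result.ToString())
    End Sub
End Module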
Extending Packages with Scripting
Now that you’ve an overview of programming options, let’s explore them in detail and
try some Hands-On exercises along the way as we proceed. I would like to clarify some
of the concepts about my approach in this chapter before we get deep into the exercises.
The focus will be to demonstrate on how you can implement your code in Integration
Services rather than on how to write the code, and to keep things simple and within
reason, only Visual Basic 2008 code will be listed.
The Legacy Scripting Task: ActiveX Script Task
If you have used DTS 2000, you might have used the ActiveX Script task to extend
your Dts packages. This powerful task was provided in DTS 2000 and helped
database developers to develop packages that otherwise wouldn’t be possible. Many
database developers and information analysts have exploited this task to customize
data transformation; apply business logic in the Dts package; manage files and folders;
dynamically set properties on tasks, connections, or global variables; and perform
complex computations on the data. To help smooth the migration from DTS 2000 to
SSIS, Microsoft provided the ActiveX Script task in SSIS to run those custom-built
scripts until such time as they can be upgraded to a more advanced scripting
task, simply called the Script task in SSIS.

The ActiveX Script task provided in SSIS is quite different from the one provided
in DTS 2000 in look and feel. The basic purpose of the ActiveX Script task in SSIS
is to allow you to run existing scripts, not to develop new scripts; in fact, this task will
be removed from future releases of Integration Services. It is better not to use this task
to develop new scripts, and to opt instead for the more advanced and efficient Script
task for new development work.
Here are some of the benefits of using the Script task over the ActiveX Script task:
- The Script task uses a much more powerful development environment, Visual Studio
Tools for Applications, which provides an integrated development environment
(IDE) rich in features such as IntelliSense, color-coded syntax highlighting, line-
by-line debugging support, and online help.
- It is easier to develop scripts in the Script task using either Visual Basic 2008
or Visual C# 2008, both of which are fully capable of referencing external .NET
assemblies in addition to .NET Framework classes and libraries.
- All the scripts developed in the Script task (and in the Script component) are
precompiled and hence yield enhanced performance due to fast execution at run time.
If you have to use this task to run an existing ActiveX script, follow these steps:
1. Drop the ActiveX Script task on the Designer surface and double-click it to
configure it.
2. Specify a Name and a Description for the task in the General page.
3. On the Script page, in the Language field drop-down list, choose the scripting
language that was used to write the ActiveX script. The default choices available are
the VB Script Language and the JScript Language, though the ActiveX Script task
can support other scripting languages, depending on the scripting engines installed
on the local computer.
4. The Script field provides a simple interface where you can paste or type in your
ActiveX script. If you have an ActiveX script saved into a file, you can click
Browse and select the file, and your script will be read in by the task and shown

in the Script field. Click Save to save the contents of the Script field to a file, and
click Parse to parse the script.
5. The EntryMethod specifies the name of the method that is called from the
ActiveX Script task at run time.
