Hands-On Microsoft SQL Server 2008 Integration Services part 10 pps

68 Hands-On Microsoft SQL Server 2008 Integration Services
Summary
You created an Integration Services blank project in Chapter 1. In this chapter, you
created packages using the SQL Server Import and Export Wizard and then added those
packages into your blank project. You also created a package directly in the BIDS again
using the SQL Server Import and Export Wizard. But above all, you explored those
packages by opening component properties and configurations, and now hopefully you
better understand the constitution of an Integration Services package. Last, you worked
with the Data Profiling Task to identify quality issues with your data. In the next chapter,
you will learn about the basic components, the nuts and bolts of Integration Services
packages, before jumping in to make complex packages in Chapter 4 using various
preconfigured components provided in BIDS.
Figure 2-14 Column Length Distribution Profiles
Chapter 3
Nuts and Bolts of the SSIS Workflow
In This Chapter
• Integration Services Objects
• Solutions and Projects
• File Formats
• Connection Managers
• Data Sources and Data Source Views
• SSIS Variables
• Precedence Constraints
• Integration Services Expressions
• Summary
So far, you have moved data using the SQL Server Import and Export Wizard
and viewed the packages created by opening them in the Business Intelligence
Development Studio (BIDS). In this chapter, you will extend your learning
by understanding the nuts and bolts of Integration Services, such as the use of variables,
connection managers, precedence constraints, and SSIS expressions. If you have
used Data Transformation Services (DTS 2000), you may grasp these topics quickly;
however, much about them is new in Integration Services. Usability and
management of variables have been greatly enhanced, connectivity needs for packages
are now satisfied by connection managers, enhanced precedence constraints
give you total control over the package workflow, and, above all, the SSIS
Expression language offers a powerful programming interface that lets you generate
values at run time.
Integration Services Objects
Integration Services performs its operations with the help of various objects and
components such as connection managers, sources, tasks and transformations,
containers, event handlers, and destinations. All these components are threaded
together to achieve the desired functionality—that is, they work hand in hand, yet
they can be configured separately.
A major enhancement that Microsoft made in evolving DTS 2000 into Integration
Services is the separation of workflow from the data flow. SSIS provides two different
designer surfaces, which are effectively different integrated development environments
(IDEs) for developing packages. You can design and configure workflow in the Control
Flow Designer surface and the data movement and transformations in the Data Flow
Designer surface. Different components are provided in each designer
environment, and the Toolbox window is unique to each environment.
The following objects are involved in an Integration Services package:
• Integration Services package: The top-level object in the SSIS component
  hierarchy. All the work performed by SSIS tasks occurs within the context of
  a package.
• Control flow: Helps to build the workflow in an ordered sequence using containers,
  tasks, and precedence constraints. Containers provide structure to the package
  and a looping facility, tasks provide functionality, and precedence constraints build
  an ordered workflow by connecting containers, tasks, and other executables in an
  orderly fashion.
• Data flow: Helps to build the data movement and transformations in a package
  using data adapters and transformations in ordered sequential paths.
Chapter 3: Nuts and Bolts of the SSIS Workflow 71
• Connection managers: Handle all the connectivity needs.
• Integration Services variables: Help to reuse or pass values between objects and
  provide a facility to derive values dynamically at run time.
• Integration Services event handlers: Help extend package functionality using
  events occurring at run time.
• Integration Services log providers: Help in capturing information when
  log-enabled events occur at run time.
To enhance the learning experience while you are working with the SSIS components,
first you will be introduced to the easier and more often-used objects, and later will be
presented with the more complex configurations.

Solutions and Projects
Integration Services offers different environments for developing and managing your
SSIS packages. You will most likely design and develop SSIS packages in the
development environment using BIDS, while SQL Server Management Studio can
be used to deploy, manage, and run packages, though there are other options for deploying
and managing packages, as you will study in Chapter 13. Both environments have
special features and toolsets to help you perform these jobs efficiently.
While BIDS has the whole toolset to develop and deploy SSIS packages, SQL
Server Management Studio cannot be used to edit or design Integration Services
solutions or projects. However, in both environments, you use solutions and projects to
organize and manage your files and code in a logical, hierarchical manner. A solution is
a container that allows you to bring together scattered projects so that you can organize
and manage them as one unit. In general, you will use a solution to focus on one area of
the business—such as one solution for accounts and a separate solution for marketing.
However, complex business problems may require multiple solutions to achieve specific
objectives. Figure 3-1 shows a solution that not only contains multiple projects but also
includes projects of multiple types. This figure shows an Analysis Services project with
a Sales cube, an Integration Services project with two SSIS packages, and a Reporting
Services project with a Monthly Sales report, all in one solution.
Within a solution, one or more projects, along with related files for databases,
connections, scripts, and miscellaneous files, can be saved together. Not only can multiple
projects be stored under one solution, but multiple types of projects can be stored under
one solution. For example, while working in BIDS, you can store a data transformation
project as well as a data-mining project under the same solution. Grouping multiple
projects in one solution has several benefits such as reduced development time, code
reusability, management of interdependencies, settings management for all the projects
at a single location, and the facility to save all the projects to Visual SourceSafe or
Team Foundation Server in the same hierarchical manner as you have in the development
environment. Both SQL Server Management Studio and BIDS provide templates
for working with different types of projects. These templates provide appropriate
environments (such as designer surfaces, scripts, connections, and so on) for each
project with which you are working.
When you create a new project, Visual Studio tools automatically generate a solution
for you while giving you an option to create a separate folder for the solution. If you
don’t choose to create a directory for the solution, the solution file is created along
with the other project files in the same folder; however, if you choose to create a directory
for the solution, a solution folder is created with the project folder under it as a
subfolder. So, you get a hierarchical structure to which you can then add
various items (data sources, data source views, SSIS packages, scripts, and miscellaneous
files) as and when required. Solution Explorer lists the projects and the files contained
in them in a tree view that helps you to manage the projects and the files (as shown
in Figure 3-1). The logical hierarchy reflected in the tree view of a solution does not
necessarily relate to the physical storage of files and folders on the hard disk drive,
however. Solution Explorer provides the facility to integrate with Visual SourceSafe or
Team Foundation Server for version control, which is a great feature when you want to
track changes or roll back code.
Figure 3-1 Solution Explorer showing a solution with different types of projects
File Formats
Whenever an ETL tool has to integrate with legacy systems, mainframes, or any other
proprietary database systems, the easiest way to transfer data between the systems is to
use flat files. Integration Services can deal with flat files that are fixed width, delimited,
and ragged right format types. For the benefit of users who are new to the ETL world,
these formats are explained next.
Fixed Width
If you have been working with mainframes or legacy systems, you may be familiar with
this format. Fixed-width files use different widths for columns, but the chosen width
per column stays fixed for all the rows, regardless of the contents of those columns. If
you open such a file, you will likely see lots of blank space between columns.
As most of the data in a column with variable data tends to be smaller than the width
provided, you’ll see a lot of wasted space. As a result, these types of files are likely
to be larger than the other formats.
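To make the idea concrete, here is a minimal Python sketch of reading one fixed-width record by slicing at known offsets. The column names, offsets, and sample data are hypothetical and chosen only for illustration; this is not how Integration Services itself is implemented.

```python
# Hypothetical layout: (column name, start offset, width) for each column.
LAYOUT = [("id", 0, 5), ("name", 5, 12), ("city", 17, 10)]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict, trimming the padding."""
    return {name: line[start:start + width].strip()
            for name, start, width in layout}


# Build a sample row by padding every value to its column width.
sample = "00042" + "Ann Lee".ljust(12) + "Leeds".ljust(10)
record = parse_fixed_width(sample)
```

Every row occupies the full 27 characters no matter how short the values are, which is the wasted space described above.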
Delimited
The most common format that systems use to exchange data with foreign
systems, delimited files separate the columns using a delimiter such as a comma or tab
and typically use a character combination (for example, carriage return
plus linefeed, {CR}{LF}) to delimit rows/records. Generally, importing
data using this format is quite easy, unless the delimiter used also appears in the data.
For example, if users are allowed to enter free-form notes in a field, some may type a
comma in their notes, and that comma will be treated as a column
delimiter and will distort the whole row. Such free-format data entry conflicts
with the delimiter, so data is imported into the wrong columns. Because of these potential
conflicts, you need to pay particular attention to the quality of the data you are dealing
with when choosing a delimiter. Delimited files are usually smaller than
fixed-width files, as the padding is removed by the use of a delimiter.
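The delimiter conflict can be sketched in a few lines of Python. The sample row is hypothetical. Quoting the field that contains the comma, as the standard csv module does, is a common convention for delimited files rather than something this section prescribes.

```python
import csv
import io

# A notes field containing a comma breaks a naive split on the delimiter.
naive_row = "101,Widget,Fragile, handle with care"
naive_fields = naive_row.split(",")  # four fields where three were intended

# The csv module quotes any field that contains the delimiter and, by
# default, ends each row with carriage return plus linefeed.
buffer = io.StringIO()
csv.writer(buffer).writerow(["101", "Widget", "Fragile, handle with care"])
quoted_row = buffer.getvalue()

# Reading the quoted row back restores the three original columns.
parsed_fields = next(csv.reader(io.StringIO(quoted_row)))
```

The naive split shifts the trailing text into an extra column, while the quoted row keeps the comma inside the notes field.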
Ragged Right
If you have a fixed-width file and one of the columns (the rightmost one) is a
nonuniform column, and you want to save some space, you can add a delimiter (such
as {CR}{LF}) at the end of the row and make it a ragged-right file. Ragged-right files
are similar to fixed-width files except that they use a delimiter to mark the end of a
row/record; that is, in ragged-right files, the last column is of variable size. This makes the
file easier to work with when displayed in Notepad or imported into an application.
Also, some vendors use this type of format when they want the flexibility to change
the number of columns in the file. In such situations, they keep all the regular columns
(the columns that always exist) in the first part of the file and combine the columns that
may or may not exist into a single string of data at the end of the row. Depending
upon the columns that have been included, the length of the last column will vary. The
applications generally use substring logic to separate out the columns from the last
variable-length combined column.
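As a rough illustration of the substring logic mentioned above, the following Python sketch reads ragged-right rows: the leading columns are sliced at fixed offsets, and whatever remains before the row delimiter becomes the last column. The layout and the sample rows are hypothetical.

```python
# Hypothetical layout for the fixed part of each row: (name, start, width).
FIXED_LAYOUT = [("id", 0, 5), ("name", 5, 12)]
FIXED_WIDTH = 17  # total width of the fixed columns


def parse_ragged_right(line):
    """Slice the fixed columns, then take the rest of the row as the last column."""
    line = line.rstrip("\r\n")
    record = {name: line[start:start + width].strip()
              for name, start, width in FIXED_LAYOUT}
    record["extra"] = line[FIXED_WIDTH:]
    return record


# Two rows ending in {CR}{LF}; the last column differs in length.
data = ("00042" + "Ann Lee".ljust(12) + "Leeds\r\n"
        + "00043" + "Bo".ljust(12) + "San Francisco\r\n")
records = [parse_ragged_right(row) for row in data.splitlines()]
```

Only the final column varies in length, so no padding is stored after it.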
Connection Managers
As data grow in random places, it’s the job of the information analyst to bring it all
together to draw out pertinent information. The biggest problem of bringing together
such data sets and merging them to a single storage location is how to handle different
data sources, such as legacy mainframe systems, Oracle databases, flat files, Excel
spreadsheets, Microsoft Access files, and so on. Connection managers provided in
Integration Services come to the rescue.
In Chapter 2, you saw how the connection managers were used inside the package to
import data. The components defined inside an Integration Services package require that
physical connections be made to data stores during run time. The source adapter reads
data from the data source and then passes it on to the data flow for transformations, while
the destination adapter loads the transformed data to the destination store. Not only do
the extraction and loading components require connections, but these connections are
also required by some other components. For example, the Lookup transformation
reads values from a reference table to perform transformations based on the values
in the lookup table. Then there are logging and auditing requirements that also need
connections to storage systems such as databases or text files.
A connection manager is a logical representation of a connection. You use a connection
manager to describe the connection properties at design time, and these are interpreted
to make a physical connection at run time by Integration Services. For example, at
design time, you can set a connection string property within a connection manager,
which is then read by the Integration Services run-time engine to make a physical
connection. A connection manager is stored in the package metadata and cannot be
shared with other packages.
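The split between a design-time description and a run-time physical connection can be pictured with a small Python analogy. The ConnectionDescription class and its acquire method are hypothetical names invented for this sketch, and an in-memory SQLite database stands in for a real data store. None of this is the Integration Services API.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class ConnectionDescription:
    """Hypothetical stand-in for a connection manager: design-time settings only."""
    name: str
    connection_string: str

    def acquire(self):
        """Open the physical connection; nothing is opened until this is called."""
        return sqlite3.connect(self.connection_string)


# Design time: only the properties are recorded.
staging = ConnectionDescription(name="Staging", connection_string=":memory:")

# Run time: the stored connection string is read and a connection is made.
conn = staging.acquire()
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
conn.close()
```

The description object can be created and edited without touching the data store, which is the point of keeping the logical representation separate from the physical connection.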
Connection managers enhance connection flexibility. Multiple connection managers
of the same type can be created to meet the needs of Integration Services packages and
enhance performance. For example, a package can use, say, five OLE DB connection
managers, all built on the same data connection.

You can add connection managers to your package using one of the following
methods in BIDS:
• Choose New Connection from the SSIS menu.
• Choose the New Connection command from the context menu that opens when
  you right-click the blank surface in the Connection Managers area.
• Add a connection manager from within the editor or advanced editor dialog boxes
  of some of the tasks, transformations, source adapters, and destination adapters
  that require connection to a data store.
The connection managers you add to the project at design time appear in the
Connection Managers area in the BIDS designer surfaces, but they do not appear in
the Connection Managers collection in Package Explorer until you run the package
successfully for the first time. At run time, Integration Services resolves the settings of
all the added connections, sets the connection manager properties to each of them, and
then adds them to the Connection Managers collection in Package Explorer.
You will be using many of the connection managers in Hands-On exercises while
you create solutions for business problems later on. For now, open BIDS, create a
new blank project, and check out the properties of all the connection managers as you
read through the following descriptions. Figure 3-2, which appears in the later section
“Microsoft Connector 1.0 for SAP BI,” shows the list of all the connection managers
provided in SQL Server 2008 Integration Services.
ADO Connection Manager
The ADO Connection Manager enables a package to connect to an ADO recordset.
This connection manager has been provided mainly for legacy support. You will most
likely use it when you’re working with a legacy application that is using ActiveX Data
Objects (ADO) to connect to the data sources. You might have to use this connection
manager when developing a custom component where such a legacy application is used.
ADO.NET Connection Manager
The current model of software applications is very different from the earlier connected,
tightly coupled client/server scenario, where a connection was held open for the lifetime.
Now, you have varied types of data stores, and these data stores are being hit with several
hundred connections every minute. ADO.NET overcomes these shortcomings and
provides disconnected data access, integration with XML, optimized interaction with
databases, and the ability to combine data from numerous data sources. These features
make ADO.NET connection managers quite reliable and flexible with lots of options;
however, they might be a little bit slower than the customized or dedicated connection
managers for a particular source. You can also have consistent access to data sources
using ADO.NET providers. The ADO.NET Connection Manager provides access
to data sources, such as SQL Server or sources exposed through OLE DB or XML,
using a .NET provider. You can choose from the .NET Framework Data Provider
for SQL Server (SqlClient), the .NET Framework Data Provider for Oracle Server
(OracleClient), the .NET Framework Data Provider for ODBC (Open Database
Connectivity), and the .NET Framework Data Provider for OLE DB. The configuration
options of the ADO.NET Connection Manager change, depending on the choice of
.NET provider.
Cache Connection Manager
The Cache Connection Manager is primarily used for creating cache for the Lookup
Transformation. When you have to repeatedly run a Lookup Transformation in a
package or have to share the reference (lookup) data set among multiple packages, then
you might prefer to persist this cache to a file to improve the performance. You would
then use a cache transformation, which in turn uses the Cache Connection Manager
to write the cached information to a cache file (.caw). Later, in Chapter 10, “Data Flow
Transformations,” when you work with the Lookup Transformation, you
will use this connection manager to cache data to a file.
Excel Connection Manager
This connection manager provides access to a Microsoft Excel workbook file. It
is used when you add an Excel Source or Excel Destination to your package. With the
launch of Excel 2007, the data provider for Excel changed from the earlier Microsoft
Jet OLE DB Provider to the OLE DB provider for the Microsoft Office 12.0 Access
Database Engine. If you check the ConnectionString property of the Excel
Connection Manager after adding it using the Microsoft Excel 97-2003 version,
you will see the provider listed as Microsoft.Jet.OLEDB.4.0, whereas this property
will show the provider as Microsoft.ACE.OLEDB.12.0 when you add the
Excel Connection Manager using the Microsoft Excel 2007 version. It is important to
understand the connection string, as you may need to write it
yourself in some packages, for example, if you’re getting the file path at run time and
you want to create the connection string dynamically. Here are the connection strings
for both versions of the Excel driver:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\SSIS\RawFiles\RawDataTxt.xls;Extended Properties="Excel 8.0;HDR=YES";
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\SSIS\RawFiles\RawDataTxt.xlsx;Extended Properties="Excel 12.0;HDR=YES";
Note the differences between the providers for the two versions, as explained
earlier. There are some additional properties that you need to specify in the extended
properties section. First, you use Excel 8.0 for Excel versions 97, 2000, 2002,
and 2003 in the extended properties, while you use Excel 12.0 for the Excel 2007 version.
Second, you use the HDR property to specify whether the first row has column names.
The default value is YES; that is, if you do not specify this property, the first row will
be deemed to contain column names. Also, sometimes the Excel driver fails to pick up some
values in columns where string and numeric values are mixed. By default, the Excel
driver samples the first eight rows to determine the data type of the column
and returns null for values of any other data type in the column. You can override this
behavior by importing all the values as strings using the import mode setting IMEX=1
in the extended properties of the connection string.
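If you do need to assemble the connection string yourself, the rules above can be captured in a small helper. The following Python sketch is only an illustration of the string format: the function name and its parameters are hypothetical, and inside a package you would more likely build the same string with an SSIS expression.

```python
def excel_connection_string(path, has_header=True, import_as_text=False):
    """Build an Excel connection string from the file extension (illustrative helper)."""
    lowered = path.lower()
    if lowered.endswith(".xlsx"):
        provider, version = "Microsoft.ACE.OLEDB.12.0", "Excel 12.0"
    elif lowered.endswith(".xls"):
        provider, version = "Microsoft.Jet.OLEDB.4.0", "Excel 8.0"
    else:
        raise ValueError("expected an .xls or .xlsx file: " + path)

    properties = [version, "HDR=YES" if has_header else "HDR=NO"]
    if import_as_text:
        properties.append("IMEX=1")  # import mixed-type columns as strings
    extended = ";".join(properties)
    return f'Provider={provider};Data Source={path};Extended Properties="{extended}";'


legacy = excel_connection_string(r"C:\SSIS\RawFiles\RawDataTxt.xls")
modern = excel_connection_string(r"C:\SSIS\RawFiles\RawDataTxt.xlsx",
                                 import_as_text=True)
```

Here legacy uses the Jet provider with Excel 8.0, while modern uses the ACE provider with Excel 12.0 and adds IMEX=1 to the extended properties.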
If you will be deploying this connection manager to a 64-bit server, which is most
likely the case these days, you will need to run the package in 32-bit mode, as both
of the aforementioned providers are available in 32-bit versions only. You will need to
run the package using the 32-bit version of dtexec.exe from the 32-bit area, which is by
default in the C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn folder.
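As a small illustration, the following Python sketch assembles the command line for the 32-bit dtexec.exe without running it. The package path is hypothetical, the folder is the default 32-bit location on a standard install, and /FILE is the dtexec option that loads a package from the file system.

```python
import ntpath

# Default 32-bit folder on a standard install; adjust for other installations.
DTEXEC_32BIT_DIR = r"C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn"


def build_dtexec_command(package_path, dtexec_dir=DTEXEC_32BIT_DIR):
    """Return the argument list for running a package file with 32-bit dtexec."""
    return [ntpath.join(dtexec_dir, "dtexec.exe"), "/FILE", package_path]


# Hypothetical package path, used only for illustration.
command = build_dtexec_command(r"C:\SSIS\Packages\LoadExcel.dtsx")
```

On the server itself, you could pass this list to subprocess.run(command, check=True) to start the package with the 32-bit executable.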
File Connection Manager
This connection manager enables you to reference a file or folder that already exists
or is created at run time. While executing a package, Integration Services tasks and
data flow components need input for values of property attributes to perform their
functions. These input values can be directly configured by you within the component’s
properties, or they can be read from external sources such as files or variables. When
you configure to get this input information from a file, you use the File Connection
Manager. For example, the Execute SQL task executes an SQL statement, which can
be directly input by you in the Execute SQL task, or this SQL statement can be read
from a file.
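The two ways of supplying a statement, direct input or a file, can be sketched in Python as an analogy. The load_statement helper, the script file name, and the orders table are all hypothetical, and SQLite stands in for the target database. This is not the Execute SQL task itself.

```python
import os
import sqlite3
import tempfile


def load_statement(direct_input=None, source_file=None):
    """Return SQL text from direct input or, failing that, from a file."""
    if direct_input is not None:
        return direct_input
    with open(source_file, encoding="utf-8") as handle:
        return handle.read()


# Hypothetical script file standing in for the file that the connection references.
with tempfile.TemporaryDirectory() as folder:
    script_path = os.path.join(folder, "count_rows.sql")
    with open(script_path, "w", encoding="utf-8") as handle:
        handle.write("SELECT COUNT(*) FROM orders;")
    statement = load_statement(source_file=script_path)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
result = conn.execute(statement).fetchone()[0]
conn.close()
```

Keeping the statement in a file means it can be changed without editing the component that runs it.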
You can use an existing file or folder, or you can create a file or a folder by using the
File Connection Manager. However, you can reference only one file or folder. If you
want to reference multiple files or folders, you must use a Multiple Files Connection
Manager, described a bit later.
To configure this connection manager, choose from the four available options in the
Usage Type field of the File Connection Manager Editor. Your choice in this field sets
