Hands-On Microsoft SQL Server 2008 Integration Services, part 58

548 Hands-On Microsoft SQL Server 2008 Integration Services
In this model, the dimensions are denormalized and the business view of a dimension is represented as a single table. The dimensions have simple primary keys, and the fact table has a set of foreign keys pointing to the dimensions; the combination of these keys forms a compound primary key for the fact table. This structure makes the model simple for users to understand. Writing SELECT queries against this model is easier because of the simple joins between the dimensions and the fact table, and query performance is generally better because fewer joins are required than in other models. Finally, a star schema can easily be extended by simply adding new dimensions, as long as the fact entity holds the related data.
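For illustration, a typical star-join query against the AdventureWorksDW2008 sample database might look like the following sketch (table and column names follow that sample; adjust them to your own schema):

```sql
-- Internet sales by product and calendar year: one simple join
-- per dimension, which is what makes star schemas easy to query.
SELECT p.EnglishProductName,
       d.CalendarYear,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactInternetSales AS f
JOIN dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
JOIN dbo.DimDate    AS d ON d.DateKey    = f.OrderDateKey
GROUP BY p.EnglishProductName, d.CalendarYear;
```

Each dimension is reached in exactly one join from the fact table, which is the property that keeps star schema queries easy to write and generally fast.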
Snowflake Model
Sometimes it is difficult to denormalize a dimension; in other words, it makes more sense to keep the dimension in a normalized form, especially when multiple levels of relationships exist in the dimension data or a child member has multiple parents. In this model, a dimension suitable for snowflaking is split along its hierarchies, resulting in multiple tables linked to each other via relationships, generally one-to-many. Many-to-many relationships are handled using a bridge table between the dimensions, sometimes called a factless fact table. For example, the AdventureWorksDW2008 product dimension DimProduct is a snowflake dimension that is linked to DimProductSubcategory, which is further linked to the DimProductCategory table (refer to Figure 12-2). This structure makes sense to database developers and data modelers and helps users write useful queries, especially when an OLAP tool such as SSAS supports the structure and optimizes snowflaked queries. However, business users might find it a bit difficult to work with and may prefer a star schema, so you have to strike a balance when choosing a snowflake schema. Though breaking a dimension into a snowflake schema saves some space, that saving is rarely the deciding factor: disk space is not very costly, and dimension tables are usually not large enough for the savings to be considerable. Snowflaking is done for functional reasons rather than to save disk space. Finally, queries written against a snowflake schema tend to use more joins (because more dimension tables are involved) than those against a star schema, and this can affect query performance. You need to test whether users accept the speed at which results are returned.
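To see the extra joins in practice, a category-level query against the snowflaked DimProduct hierarchy has to walk up through the subcategory and category tables (column names as in AdventureWorksDW2008):

```sql
-- Two additional joins are needed just to reach the category name.
SELECT pc.EnglishProductCategoryName,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactInternetSales AS f
JOIN dbo.DimProduct            AS p
     ON p.ProductKey = f.ProductKey
JOIN dbo.DimProductSubcategory AS ps
     ON ps.ProductSubcategoryKey = p.ProductSubcategoryKey
JOIN dbo.DimProductCategory    AS pc
     ON pc.ProductCategoryKey = ps.ProductCategoryKey
GROUP BY pc.EnglishProductCategoryName;
```

In the equivalent star schema, the category name would be a denormalized column on DimProduct and the last two joins would disappear.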
Chapter 12: Data Warehousing and SQL Server 2008 Enhancements 549
Building a Star Schema
The very first requirement in designing a data warehouse is to focus on the subject area for which the business has engaged you to build a star schema. It is easy to be pulled in different directions while building a data warehouse, but whatever you do later in the process, you must always keep a focus on the business value you are delivering with the model. As a first step, capture all the business requirements and the purposes behind the initiative. You might end up meeting several business managers and process owners to understand the requirements and match them against the data available in the source systems. At this stage, you will be creating a high-level data source mapping document for the requirements. Once you have identified at a high
Figure 12-2 AdventureWorksDW2008 simplified snowflake schema
level that the requirements can be met with the available data, the next step is to define the dimensions and the measures. While defining measures, or facts, it is important to discuss with the business and clearly understand the level of detail they will be interested in; this decides the grain of your fact data. Typically, you would want to keep the lowest grain of data so that you have maximum flexibility for future changes. In fact, defining the grain is one of the first steps in building a data warehouse, as it is the cornerstone for collating the required information, for instance, defining the roll-up measures.
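As a sketch of how the grain decision surfaces in the design, a sales fact captured at the lowest (order line) grain might be declared as follows; the table and column names are illustrative, not taken from any particular source system:

```sql
-- Grain: one row per order line. Roll-ups to order, day, or month
-- can always be derived later; a coarser grain cannot be split back.
CREATE TABLE dbo.FactSales
(
    OrderDateKey     int          NOT NULL,  -- FK to a date dimension
    ProductKey       int          NOT NULL,  -- FK to a product dimension
    CustomerKey      int          NOT NULL,  -- FK to a customer dimension
    SalesOrderNumber nvarchar(20) NOT NULL,  -- source order identifier
    OrderLineNumber  tinyint      NOT NULL,
    OrderQuantity    smallint     NOT NULL,
    SalesAmount      money        NOT NULL
);
```

Choosing the order line as the grain means every measure in the table describes exactly one line, and any coarser summary the business later asks for can be computed with a GROUP BY.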
At this stage, you are ready to create a low-level star schema data model and will be defining attributes and fields for the dimension and fact tables. As part of the process, you will also define primary keys for the dimension tables; these keys will also exist in the fact tables as foreign keys. These primary keys need to be a new set of keys known as surrogate keys. Using surrogate keys instead of source system keys, or business keys, provides many benefits in the design: they protect against changes in source system keys, help maintain history (by using the SCD transformation), make it possible to integrate data from multiple sources, and handle late-arriving members, including facts for which dimension members are missing. Generally, a surrogate key is an auto-incrementing non-null integer value, such as an identity column, and forms a clustered index on the dimension. However, the Date dimension, commonly used in data warehouses, is a special case, with a primary key based on the date instead of a continuous number; for instance, 20091231 and 20100101 are consecutive date-based values but are not in serial number order.
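A minimal dimension built this way might look like the sketch below; the IDENTITY column supplies the surrogate key, while the business key from the source system is kept as an ordinary attribute (the names here are illustrative):

```sql
CREATE TABLE dbo.DimCustomer
(
    CustomerKey          int IDENTITY(1,1) NOT NULL,  -- surrogate key
    CustomerAlternateKey nvarchar(15)      NOT NULL,  -- business key from source
    CustomerName         nvarchar(100)     NOT NULL,
    CONSTRAINT PK_DimCustomer
        PRIMARY KEY CLUSTERED (CustomerKey)
);
```

Because CustomerKey is generated inside the warehouse, the fact table can reference it safely even if the source system later reuses or reformats its own customer identifiers.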
While working on dimensions, you will need to identify some special dimensions. First, look for role-playing dimensions. Refer to Figure 12-2 and note that the DateKey of the DimDate dimension is joined to the fact table multiple times, once each for the OrderDateKey, DueDateKey, and ShipDateKey columns; here, DimDate acts as a role-playing dimension. Next, identify any degenerate dimensions in your data and place them alongside the facts in the fact table. Then identify any low-cardinality indicators or flags used in the facts that can be grouped together in one junk dimension table; holding these miscellaneous flags in one place makes the user's life much easier when they need to classify an analysis by such indicators. Finally, complete the exercise by listing each attribute's change type, that is, whether it is a Type 1 or Type 2 candidate. These attributes, especially Type 2, help maintain history in the data warehouse; keeping history using the SCD transformation has been discussed earlier in this chapter as well as in Chapter 10. Though your journey to implement a data warehouse still has a long way to go, after completing the preceding steps you will have a basic star schema structure in place. From here it will be easy to work with and extend the model further to meet specific business process requirements, such as real-time data delivery or snowflaking.
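Of these special dimensions, the role-playing pattern shows up in queries as the same physical DimDate table joined under a different alias for each role, as in this sketch against the AdventureWorksDW2008 schema:

```sql
-- DimDate plays three roles: order date, due date, and ship date.
SELECT od.CalendarYear,
       AVG(DATEDIFF(day, od.FullDateAlternateKey,
                         sd.FullDateAlternateKey)) AS AvgDaysToShip
FROM dbo.FactInternetSales AS f
JOIN dbo.DimDate AS od ON od.DateKey = f.OrderDateKey
JOIN dbo.DimDate AS dd ON dd.DateKey = f.DueDateKey
JOIN dbo.DimDate AS sd ON sd.DateKey = f.ShipDateKey
GROUP BY od.CalendarYear;
```

Only one date table is loaded and maintained, yet the model behaves as if there were three separate date dimensions.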
SQL Server 2008 R2 Features and Enhancements
Several features provided in SQL Server 2008 R2 can help in your data warehousing project, either by providing new functionality or by improving the performance of commonly used features. It is not possible to cover them all here without stretching beyond the scope of this book, especially those that do not reside in the realm of Integration Services. I will still cover some features that most data warehousing projects will use, while other features such as data compression, sparse columns, new data types, large UDTs, and minimal logging are not covered.
SQL Server 2008 R2 Data Warehouse Editions
Microsoft has just launched SQL Server 2008 R2, which builds on the strong foundations and successes of SQL Server 2008 and is targeted at very large-scale data warehouses, higher mission-critical scale, and self-service business intelligence. SQL Server 2008 R2 introduces two new premium editions to meet the demands of large-scale data warehouses.
SQL Server 2008 R2 Datacenter
This edition is built on the Enterprise Edition code base but provides the highest levels of scalability and manageability. SQL Server 2008 R2 Datacenter is designed for the highest levels of scalability that the SQL Server platform can provide, for virtualization and consolidation, and it delivers a high-performing data platform. Typical implementations include a large-scale data warehouse server that can scale up to support tens of terabytes of data, provide Master Data Services, and run very large-scale BI applications such as PowerPivot for SharePoint. Following are the key features:
- As the Enterprise Edition is restricted to up to 25 managed instances and 4 virtual machines (VMs), the Datacenter Edition is the next level if you need more than 25 instances or more VMs. It also provides application and multi-server management for enrolling servers and gaining insights.
- The Datacenter Edition places no limit on maximum server memory; it is restricted only by the limits of the operating system. For example, it can support up to 2TB of RAM when running on the Windows Server 2008 R2 Datacenter Edition.
- It supports more than 8 processors and up to 256 logical processors for the highest levels of scale.
- It has the highest virtualization support for maximum ROI on consolidation and virtualization.
- It provides high-scale complex event processing with SQL Server StreamInsight.
- Advanced features such as the Resource Governor, data compression, and backup compression are included.
SQL Server 2008 R2 Parallel Data Warehouse
Since acquiring DATAllegro, a provider of large-volume, high-performance data warehouse appliances, Microsoft has been working on consolidating hardware and software solutions for high-end data warehousing under a project named Madison. SQL Server 2008 R2 Parallel Data Warehouse is the result of Project Madison. Parallel Data Warehouse is an appliance-based, highly scalable, highly reliable, and high-performance data warehouse solution. Using SQL Server 2008 on Windows Server 2008 in a massively parallel processing (MPP) configuration, Parallel Data Warehouse can scale from tens to hundreds of terabytes, providing better and more predictable performance, increased reliability, and a lower cost per terabyte. It comes with preconfigured hardware and software that is carefully balanced in one appliance, making deployment quick and easy. Massively parallel processing enables it to perform ultra-fast loading and high-speed backups, addressing two of the major challenges facing modern data warehouses: data load and backup times. You can integrate existing SQL Server 2008-based data marts or mini-data warehouses with Parallel Data Warehouse via a hub-and-spoke architecture. This product was targeted to ship alongside the SQL Server 2008 R2 release; however, its release has been slightly delayed pending customer feedback from the Technology Adoption Program (TAP). The Parallel Data Warehouse principles and architecture are detailed later in this chapter.
SQL Server 2008 R2 Data Warehouse Solutions
Microsoft has recognized the need to develop data warehouse solutions to build
upon the successes of SQL Server 2008. This has resulted in Microsoft partnering
with several industry-leading hardware vendors to create best-of-breed balanced
configurations combining hardware and software to achieve the highest levels of
performance. Two such solutions are now available under the names of Fast Track
Data Warehouse and Parallel Data Warehouse.
Fast Track Data Warehouse
The Fast Track Data Warehouse solution implements a CPU core-balanced approach on a symmetric multiprocessor (SMP)-based SQL Server data warehouse, using a core set of configurations and database best-practice guidelines. The Fast Track reference architecture is a combination of hardware that is balanced for performance across all components and software configurations, such as Windows OS settings, SQL Server database layout, and indexing, along with a whole raft of other settings, best practices, and documents to implement these objectives. Fast Track Data Warehouse servers can have two-, four-, or eight-processor configurations and can scale from 4 terabytes to 30-plus terabytes, and even more if compression capabilities are used. Earlier reference architectures were found to suffer from various performance issues for the simple reason that they were not specifically designed to suit the needs of one particular problem and hence suffered from an unbalanced architecture. For example, you may have seen a server that is busy processing a great deal of I/O while its CPU utilization is not high enough to reflect the workload; this is a simple example of the mismatch, or imbalance, between components in many available servers. Fast Track Data Warehouse servers are built on the use cases of a particular scenario; that is, they are built to a capacity that matches the required workload on the server rather than with a one-size-fits-all approach. Designing a balanced server in this way provides predictable performance and minimizes the risk of over-specifying components, such as providing CPU or storage that will never be utilized. The predictable performance and scalability are achieved by adopting core principles, best practices, and methodologies, some of which are listed next.
- It is built for data warehouse workloads. First of all, understand that the data warehouse workload is quite different from that of OLTP servers. While OLTP transactions are made up of small read and write operations, data warehouse queries usually perform large read and write operations. Data warehouse queries are generally fewer in number, but they are more complex, require heavy aggregation, and generally have date-range restrictions. Furthermore, OLTP transactions generate more random I/O, which causes slow response. To overcome this, a large number of disks have traditionally been used, along with other optimization techniques such as building heavy indexes. These techniques have their own maintenance overheads and increase data warehouse loading time. The Fast Track Data Warehouse uses a new way of optimizing data warehouse workloads by laying data out sequentially. Considering that a data warehouse workload requests ranges of data, reading sequential data off the disk drives is much more efficient than random I/O. All efforts are targeted at preserving sequential storage, with data preferably served from disk rather than from memory, as the performance achieved with sequential storage is much higher. This results in fewer indexes, yielding savings in maintenance, decreased loading time, less fragmentation of data, and reduced storage requirements.
- It offers a holistic approach to component architecture. The server is built with balance across all components, starting with the disks, disk controllers, Fibre Channel HBAs, and the Windows operating system, and ranging up to SQL Server and then the CPU cores. For example, on the basis of how much data can be consumed per CPU core (200 MBps), the number of CPU cores is calculated for a given workload, and then backward calculations are applied to all the components to support the same bandwidth or capacity. This balance, or synchronization in response, among the individual components provides the required throughput to match the capabilities of the data warehouse application.
- It is optimized for workload type. Fast Track Data Warehouse servers are designed and built considering the very nature of the database application. To capture these details, templates and tools are provided for designing and building a Fast Track server. Several industry-leading vendors participate in this program to provide out-of-the-box performance reference architecture servers. Businesses can benefit from reduced hardware testing and tuning and from rapid deployment.
Parallel Data Warehouse
When your data growth needs can no longer be satisfied with the scale-up approach
of the Fast Track Data Warehouse, you can choose the scale-out approach of Parallel
Data Warehouse, which has been built for very large data warehouse applications
using Microsoft SQL Server 2008. The symmetric multiprocessing (SMP) architecture
used in the Fast Track Data Warehouse is limited by the capacity of the components
such as CPU, memory, and hard disk drives that form part of the single computer.
This limitation is addressed by scaling out to a configuration consisting of multiple
computing nodes that have dedicated resources—i.e., CPUs, memory, and hard disk
space, along with an instance of SQL Server—connected in an MPP configuration.
Architecture and Hardware
The appliance hardware is built on industry-standard technology and is not proprietary to one manufacturer, so you can choose from well-known hardware vendors such as HP, Dell, IBM, EMC, and Bull. This way, you can keep your hardware maintenance costs low, as the appliance will integrate nicely with your existing infrastructure.
As mentioned earlier, the Parallel Data Warehouse is built on Microsoft Windows Server 2008 Enterprise nodes with their own dedicated hardware connected via a high-speed Fibre Channel link, with each node running an instance of the SQL Server 2008 database server. The MPP appliance basically has one control node and several compute nodes, depending on the data requirements. This configuration is extendable from single-rack to multirack configurations; in the latter case, one rack could act as the control node. The nodes are connected in a configuration called Ultra Shared Nothing (refer to Figure 12-3), in which large database tables are partitioned across multiple nodes to improve query performance. This architecture has no single point of failure, and redundancy has been built in at all component levels.
Applications or users send requests to a control node that balances the requests
intelligently across all the compute nodes. Each compute node processes the request
it gets from the control node using its local resources and passes back the results to
the control node, which then collates the results before returning to the requesting
application or user. As the data is evenly distributed across multiple nodes and the
nodes process the requests in parallel, queries run many times faster on an MPP
appliance than on an SMP database server.
Like a Fast Track Data Warehouse server, an MPP appliance is also built under
tight specifications and carefully balanced configurations to eliminate performance
bottlenecks. Reference configurations have been designed for different use case
scenarios taking into account different types of workloads such as data loading,
reporting, and ad hoc queries. A control node automatically distributing workload
evenly, compute nodes being able to work on queries autonomously, system resources
balanced against each other, and design of reference configurations on use case
Figure 12-3 Parallel Data Warehouse architecture
scenarios enable an MPP appliance to achieve predictable performance. Scalability
follows from here with the simple addition of capacity as the data volumes grow.
Hub-and-Spoke Architecture
Another important advantage with an MPP appliance is that it can be deployed
in a hub-and-spoke architecture. In this way, you can use an MPP appliance as
a hub while the spokes could be either MPP appliances or standard SQL Server
2008–based symmetric multiprocessing (SMP) servers (see Figure 12-4). Typically,
department users will connect to spokes to access data in their required formats. In
this configuration, the MPP appliance at the hub will host the enterprise data in the
lowest granularity and the spokes will contain data for their relevant department in
the schema and aggregations they require. This is possible because the spokes could
host any database application such as the SQL Server 2008 data mart or SQL Server
Analysis Services data mart, as best fits the user requirements. So, this architecture with
an MPP appliance at the hub and SMP database servers or MPP appliances as spokes
is a specialized configuration in which a grid of computers forms a very large-scale
data warehouse in a federated model. This grid of computers can be connected via a
high-speed network. Also, the nodes of an MPP appliance are connected via a high-
speed link and the hub processes data differently in different nodes, enabling parallel
Figure 12-4 A parallel data warehouse in hub-and-spoke architecture
high-speed data transfer from node to node between hub and spoke units. Data transfer
speeds approaching 500GB per minute can be achieved, thus minimizing the overhead
associated with export and load operations.
The SQL Server 2008 R2 Parallel Data Warehouse MPP appliance integrates very
well with BI applications such as Integration Services, Reporting Services, and Analysis
Services. So, if you have an existing SQL Server 2008 data mart, it can be easily added
as a node in the grid. As spokes can be any SQL Server 2008 database application,
this architecture provides a best-fit approach to the problem. The enterprise data is
managed at the center in the MPP appliance under the enforcement of IT policies
and standards, while a business unit can still have a spoke that they can manage
autonomously. This flexible model is a huge business benefit and provides quick
deployments of data marts, bypassing the sensitive political issues. This way, you can
easily expand an enterprise data warehouse by adding an additional node that can be
configured according to the business unit requirements.
The Parallel Data Warehouse hub-and-spoke architecture utilizes the available
processing power in the best possible way by distributing work across multiple locations
in the grid. While the basic data management such as cleansing, standardization, and
metadata management is done in the hub according to the enterprise policies, the
application of business rules relevant to the business units and the analytical processing
is handled in the relevant spokes. The hub-and-spoke model offers benefits such as
parallel high-speed data movement among different nodes in the grid, the distribution
of workload, and the massively parallel architecture of the hub where all nodes work in
parallel autonomously, thus providing outstanding performance.
With all these listed benefits and many more that can be realized in individual
deployment scenarios, the hub-and-spoke reference architecture provides the best of
both worlds—i.e., the ease of data management of centralized data warehouses and the
flexibility to build data marts on use-case scenarios as with federated data marts.
SQL Server 2008 R2 Data Warehouse Enhancements
In this section, some of the SQL Server 2008 enhancements are covered, such as backup compression, the MERGE T-SQL statement, change data capture, and partition-aligned indexed views.
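As a preview of the MERGE statement in this list, the following sketch performs a simple Type 1 upsert of a dimension from a staging table; the table and column names are hypothetical:

```sql
-- Insert new customers and update changed ones in a single statement.
MERGE dbo.DimCustomer AS tgt
USING dbo.StgCustomer AS src
    ON tgt.CustomerAlternateKey = src.CustomerAlternateKey
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerAlternateKey, CustomerName)
    VALUES (src.CustomerAlternateKey, src.CustomerName);
```

Before MERGE was introduced in SQL Server 2008, this logic typically required separate UPDATE and INSERT statements against the same staging data.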
Backup Compression
Backup compression is a new feature provided in SQL Server 2008 Enterprise Edition and above; due to its popularity, it has since been included in the Standard Edition in the SQL Server 2008 R2 release as well. Backup compression helps to speed up
