
Hands-On Microsoft SQL Server 2008 Integration Services part 57 ppt


538 Hands-On Microsoft SQL Server 2008 Integration Services
of an ODS cannot provide answers to such complex questions because either the data
is not available in the ODS systems or the data model doesn’t support such analysis.
This results in the creation of a data warehouse or a decision support system (DSS). A data
warehouse or a DSS collects data from OLTP or ODS systems and might keep it
in multiple forms, that is, in its most granular form and in aggregated form. A data
warehouse keeps data for longer periods of time (generally spread across several years)
even after it has been deleted from the source systems.
Data Warehouse Design Approaches
Now we know a bit about a data warehouse: that it keeps years of history, that it keeps
data in granular as well as aggregated format, and that the primary function of a data
warehouse is to do business intelligence or analytical reporting. So the next question is
how we design a data warehouse. There are two primary schools of thought, the top-down
and the bottom-up approaches. Both are sound and relevant to their own applications,
but each carries its own risks.
Top-Down Design Approach
Bill Inmon, who is best known as the father of data warehousing, defined
this approach, in which the data warehouse sits at the center of the
Corporate Information Factory (CIF) and is designed using a normalized data model. The
Corporate Information Factory approach takes the holistic view of the enterprise
and its informational needs. In such a data warehouse, data is collected from most
of the organization’s operational systems and is held at the atomic level, that is,
at the lowest level of detail. Further, the subject-oriented dimensional data marts
containing aggregated data are built from the central atomic data warehouse to serve
the departmental needs. As the data warehouse becomes a source system for all the
analytical data marts and organizational reporting, this creates a consistent view—a
single version of truth across the enterprise. It is easier to realign the data warehouse
built with this approach to support business changes, and the data marts can be
recreated easily from the central data warehouse. However, the downside of this
approach is that building such a data warehouse means collating data from almost
all of the enterprise’s operational data sources, which makes this a very large
project. Implementing it requires a high upfront investment, a large effort, and
a long time to deliver. This long delivery time results in users losing
patience and developing their own solutions, thus deviating from the original point
of consolidation. The departments or business units generally don’t buy into such
projects with large delivery timescales.
Chapter 12: Data Warehousing and SQL Server 2008 Enhancements 539
Bottom-Up Design Approach
In the absence of a central data warehouse, when departments or business units
need a data mart for a specific business process, they can’t wait for the central data
warehouse to be built but create their own departmental data mart. Ralph Kimball,
who is a well-known author on data warehousing, is a proponent of this approach, for
which he suggests creating business-specific dimensional-modeled data marts first.
These data marts can contain atomic as well as aggregated data in a denormalized data
model. These data marts are created in phases and are linked together via conformed
dimensions. A strict discipline is adopted to implement conformed dimensions that
are consistent with each other and use the same structure, attributes, and attribute
values. Either these dimensions are exactly the same, including the keys, or one
is a perfect subset of the other. The backbone of this approach is the conformed
dimension principle, and management of the conformed dimensions is fundamental
to maintaining integrity of the data warehouse. Such a structure in which a data mart
is linked to another data mart via a conformed dimension is called data warehouse bus
architecture. Later these data marts are joined together to create an enterprise data
warehouse. Because dimensional data marts are at the center of the bus architecture
approach, the data model is simple to understand and use. The dimensional or star
schema data model enables queries to retrieve data very efficiently. Implementing a data
warehouse using this approach realizes business investment quite quickly—as soon as
the data marts start functioning. The data warehouse is developed in phases and hence
is more feasible than the central data warehouse approach. The downsides of choosing
this approach are complicated data loading routines to maintain integrity within the
data marts and the difficulty in modifying data warehouse structures when business
changes occur.
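The conformed-dimension rule described above (two dimensions are either exactly the same,
including the keys, or one is a perfect subset of the other) can be checked mechanically.
The following is a minimal Python sketch and is not part of the original text; the function
name is_conformed and the sample date dimensions are hypothetical.

```python
def is_conformed(detailed, candidate):
    """Return True if `candidate` conforms to `detailed`.

    Both arguments map a dimension key to a dict of attribute values.
    The candidate conforms when each of its keys exists in the detailed
    dimension and each attribute it carries has the same name and the
    same value there (identical, or a perfect subset).
    """
    missing = object()  # sentinel for an attribute the detailed row lacks
    for key, attrs in candidate.items():
        if key not in detailed:
            return False
        for name, value in attrs.items():
            if detailed[key].get(name, missing) != value:
                return False
    return True


# Hypothetical date dimensions used by three different data marts.
dim_date_sales = {
    20080101: {"Year": 2008, "Quarter": "Q1", "Month": "Jan"},
    20080401: {"Year": 2008, "Quarter": "Q2", "Month": "Apr"},
}
dim_date_finance = {20080101: {"Year": 2008, "Quarter": "Q1"}}  # subset
dim_date_rogue = {20080101: {"Year": 2008, "Quarter": "1"}}     # different value

print(is_conformed(dim_date_sales, dim_date_finance))
print(is_conformed(dim_date_sales, dim_date_rogue))
```

A check like this could run as part of the load routine, so that a data mart whose
dimension has drifted from the shared definition is caught before it is linked to the bus.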
Data Warehouse Architecture
Based on the previous two design approaches, you can have two different architecture
layouts.
Centralized EDW with Dependent Data Marts
This architecture is based on Bill Inmon’s Corporate Information Factory model.
As the name implies, in this approach the data and applications reside on a central
mainframe system. In this model, the data marts are fed exclusively from the centralized
EDW in order to assure the single version of truth, thus creating a structure in which
the data marts depend on the centralized EDW. Large enterprises use this approach, as
it provides some key benefits that are vital to their success. However, there are instances
when this approach has failed to achieve desired results due to long deployment periods
and huge upfront investments. The benefits and the issues involved are listed here:
• A single version of truth is the biggest reason why enterprises adopt this approach.
• Centralized data management makes it easy to apply corporate standards and
  practices and allow businesses to comply with the legal requirements; a recent
  example of compliance is the application of PCI DSS standards issued by the PCI
  Security Standards Council.
• With a holistic approach, the development of such a warehouse does not follow
  the general principle for phased-iterative development.
• Due to their implementation plans, centralized EDWs can become quite inflexible.
  When business units want to create analytical data marts, they get frustrated by
  this inflexibility and start moving away from this approach by creating federated
  data marts.
• Historically, the systems deployed in this architecture have been proprietary to a
  vendor that is slow to respond to fast improvements in technology and hence, they
  do not benefit from the advances in general-purpose computer systems.
• This is a high-risk approach, as implementing a centralized EDW is very expensive
  with a huge upfront investment that is realized only after the data warehouse
  project is completed and data marts are beginning to emerge. The maintenance and
  scalability costs have been found to be high in such an approach due to inflexibility.
  All these costs add up to a very high total cost of ownership.
Distributed Independent Data Marts
This architecture is based on Ralph Kimball’s Bus Architecture approach and
is the opposite of the monolithic design of a centralized EDW. In this architecture,
distributed departmental data marts are created as and when the need arises. These
data marts are designed independently of other data marts, and they get their data from
operational source systems rather than from a centralized data warehouse. These data marts
are highly relevant to their departments or the business units and are quick to build,
as the scope is generally quite clear. However, due to the lack of control over their
implementation, they often result in many versions of truth and are very difficult to
keep consistent across the enterprise. The benefits and issues affecting this architecture
are listed here:
• The data marts are easy and relatively quick to build. The cost to build one data
  mart seems low, as the return on investment starts pretty soon.
• Though the cost of developing a data mart seems low, the overall cost to develop a
  distributed data warehouse is not low. The cost is spread across several data marts
  and remains unnoticed in the overall expenditure.
• The data management across independent data marts is not easy. A lot of redundant
  data exists across many data marts.
• Many versions of truth exist across an enterprise, and a data mart does not align
  with another department’s data mart, though the dimensions may align.

Data Warehouse Data Models
You can choose either of two data modeling techniques, Entity-Relationship (ER)
modeling or dimensional modeling, depending on your application scenario. The entity
relationship model has been in use much longer than the relatively new dimensional
model.
Entity-Relationship Model
ER modeling is more popular in operational applications or OLTP systems,
though it is also used in data warehousing. It is a requirement in the top-down
approach, where a central data warehouse is built using the ER model. There are
several graphical tools that help you to create an entity-relationship diagram (ERD) to
conceptualize data. These tools primarily use three basic constructs: entity, relationship,
and attribute.
• Entity: An entity is a concept, real or abstract, about which information is
  collected. It represents a class of objects such as products, an object such as
  a car, or an event such as a sale that can be classified by their properties and
  characteristics. It will usually have a business definition and is defined uniquely
  using primary keys in the data model.
• Relationship: A relationship is an association among entities in a model and
  indicates how two or more entities are related to one another. For example, a
  customer owns a car; this indicates the relationship, how the customer is related
  to the car. It is represented by a line drawn between entities. The entities are related
  to each other in differing cardinality, which defines the maximum number of
  instances of one entity related to a single instance of another entity. The relationship
  has one-to-one, one-to-many, and many-to-many cardinalities.
• Attributes: Both entities and relationships have attributes that describe their
  characteristics or properties. For example, a car has a VIN number attribute.

Relational databases are designed using the normalized ER model to remove
redundant data. Six normal forms have been defined to date, but a database is
considered adequately normalized if it is in third normal form (3NF). Normalization
is a systematic process for assigning attributes to entities to protect the integrity
of a database by breaking down information to its smallest divisible parts and removing data
duplication. It is an incremental process, which means that to be transformed into
3NF, the entities must first qualify for 2NF. 1NF removes repetition in data by
creating one-to-many relationships between master and detail entities. For example,
you move repeating similar columns out of a table into another table. 2NF takes
removal of redundant data a step further by moving attributes that depend on only
part of a composite key into a separate table. 3NF removes the columns that depend on
other non-key columns rather than on the primary key.
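The first normalization step described above, moving repeating similar columns out of a
table into a detail table, can be illustrated with a small Python sketch. This is only an
illustration under stated assumptions: the function to_first_normal_form, the column
names, and the sample orders are hypothetical and do not come from the book.

```python
# Hypothetical flat table with repeating columns Item1..Item3.
flat_orders = [
    {"OrderID": 1, "Customer": "Ann", "Item1": "Bolt", "Item2": "Nut", "Item3": None},
    {"OrderID": 2, "Customer": "Bob", "Item1": "Screw", "Item2": None, "Item3": None},
]


def to_first_normal_form(rows, key, repeating_prefix, detail_column):
    """Split repeating columns into a detail table (one-to-many).

    Returns (master_rows, detail_rows). Each non-empty repeating column
    becomes one detail row that carries the master key.
    """
    master, detail = [], []
    for row in rows:
        master.append(
            {k: v for k, v in row.items() if not k.startswith(repeating_prefix)}
        )
        for column, value in row.items():
            if column.startswith(repeating_prefix) and value is not None:
                detail.append({key: row[key], detail_column: value})
    return master, detail


orders, order_items = to_first_normal_form(flat_orders, "OrderID", "Item", "Item")
print(orders)
print(order_items)
```

The master table keeps one row per order, and the detail table holds one row per item,
linked back to the order through OrderID.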
As the normalization level increases, the data is further broken down and granularity
of data is increased. Such highly granular data models are very efficient in returning
small amounts of information or updating small amounts of data. This is required by
OLTP systems that have many users working on small pieces of information. That’s the
reason the ER Model is highly successful in relational database applications. However,
the requirements of a data warehouse are different. The queries are usually small in
number, but they can perform huge I/O activity on the server. These requirements are
met with the dimensional model.
Dimensional Model
Dimensional modeling is a newer modeling technique than ER modeling.
Recent trends in modeling preference favor dimensional data modeling, as it is simple
to build, is easy to understand even by business folks, and aligns with the questions
usually asked of a data warehouse. This model is very efficient at summarizing values
and presenting data to analytical tools. This model keeps the numerical values called
facts in one table, while the attributes that describe these values are grouped together in
tables called dimensions. This structure makes it particularly suitable to answer business
questions such as sales this quarter or the sales this quarter compared to sales the
previous quarter.

Fact
The numeric values along with some contextual data are stored in fact tables. A data
warehouse contains one or more subject-oriented fact tables. A fact typically represents
an item, an event, or a business transaction such as sale of an item that can be used to
perform business analysis. Further, a fact consists of some columns containing values
and some foreign keys linking to dimensions. As the transactions are added to the fact
table, it can soon become quite large; consider that each transaction could represent a
row in the fact table.
It is the granularity that can really affect the size of your fact data. You have to
understand the business requirements completely—what they call for now and in
the future—to decide the level of detail you want to keep in the fact data. You might
choose to keep every transaction, or you might choose to keep aggregated data at the
day level if business is never going to drill down to the transaction-level detail. Thus
the finer the grain, the more data there is and hence the more disk space will
be required. However, disk space should not be an issue, because data warehouses are
meant to be large in size. Always be careful while choosing granularity, as a change in
granularity means the whole data warehouse has to be reseeded.
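The effect of grain on fact table size can be seen in a small Python sketch that collapses
transaction-level rows to the day level. The sample transactions, the tuple layout, and the
function name aggregate_to_day_grain are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical transaction-grain facts: (date, product, quantity, amount in cents).
transactions = [
    ("2008-03-01", "Beans", 2, 240),
    ("2008-03-01", "Beans", 1, 120),
    ("2008-03-01", "Soup", 1, 150),
    ("2008-03-02", "Beans", 3, 360),
]


def aggregate_to_day_grain(rows):
    """Collapse transaction-level rows to one row per (date, product)."""
    totals = defaultdict(lambda: [0, 0])
    for sale_date, product, quantity, amount in rows:
        totals[(sale_date, product)][0] += quantity
        totals[(sale_date, product)][1] += amount
    return {key: tuple(value) for key, value in totals.items()}


daily = aggregate_to_day_grain(transactions)
print(len(transactions), "transaction rows ->", len(daily), "day-level rows")
```

The day-level table is smaller, but the individual transactions can no longer be recovered
from it, which is why the grain has to be chosen against current and future requirements.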
Measure
A measure is closely related to a fact, and sometimes the terms are used interchangeably.
A measure is what you want to measure, and the fact is a measure with context. So, a
measure is a numeric value used to indicate the performance of the business, and the fact
equates to a row in the Fact table. A measure is used in combination with dimension
members, while the value is taken from facts. For example, TotalSales-by-year and
SupplyCost-by-month are measures.
Dimension
A dimension contains the same type of information broken down in levels of interest. For
example, a time dimension can contain year, quarter, month, and day levels. A dimension
contains the information that a business wants to analyze the facts with. This information
does not change very often. Typically, a dimension table is quite small compared
to a fact table. A data warehouse can have many dimensions attached to each fact table.
Some of the common dimensions could be Time, Product, Employee, Location, and
Customer. A dimension is made up of members and hierarchies.
Dimension Members
Members of a dimension are arranged in levels; for instance, members in a time dimension
have different levels: day, week, month, quarter, and year. Similarly, all cities, states, and
countries are members of a geography dimension. While analyzing the data, you can choose
any dimension member level and associated facts will be used in analysis automatically.
Dimension Hierarchies
The members of a dimension can be arranged in a hierarchical order with multiple
levels to create dimension hierarchies. You can create more than one hierarchy and a
dimension member can participate in more than one hierarchy. For example, you can
create two hierarchies for the time dimension: one with Day, Month, Quarter, and Year
as levels and the other one with Day and Week as levels.
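The two time hierarchies mentioned above can be sketched in Python as two functions that
place the same Day member on different paths, plus a roll-up that sums facts at any level.
The names calendar_path, week_path, and roll_up, the label formats, and the sample sales
are hypothetical; weeks are assumed to follow ISO numbering.

```python
from collections import defaultdict
from datetime import date


def calendar_path(day):
    """Day -> Month -> Quarter -> Year hierarchy for one date."""
    quarter = (day.month - 1) // 3 + 1
    return {
        "Year": str(day.year),
        "Quarter": f"{day.year}-Q{quarter}",
        "Month": f"{day.year}-{day.month:02d}",
        "Day": day.isoformat(),
    }


def week_path(day):
    """Day -> Week hierarchy for one date (ISO week numbering assumed)."""
    iso_year, iso_week, _ = day.isocalendar()
    return {"Week": f"{iso_year}-W{iso_week:02d}", "Day": day.isoformat()}


def roll_up(facts, path_fn, level):
    """Sum the daily amounts at the requested level of a hierarchy."""
    totals = defaultdict(int)
    for day, amount in facts:
        totals[path_fn(day)[level]] += amount
    return dict(totals)


# Hypothetical daily sales that straddle a quarter boundary.
daily_sales = [(date(2008, 3, 31), 100), (date(2008, 4, 1), 50), (date(2008, 4, 2), 25)]

print(roll_up(daily_sales, calendar_path, "Quarter"))
print(roll_up(daily_sales, week_path, "Week"))
```

The same three days fall into two quarters in the calendar hierarchy but into a single week
in the week hierarchy, which is why the two are kept as separate hierarchies rather than
one chain of levels.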
Dimension Types
Dimensions can be designed in different ways to meet specific business functions and
to enhance performance. Four types of configurations are covered here.
Conformed Dimension
At times you will use a dimension in more than one subject-oriented data mart. If you
keep the dimensions exactly the same, or source them from the central data
warehouse, which makes them the same, then these dimensions are called conformed
dimensions. A conformed dimension does not need to be exactly the same as the
main dimension; it is still conformed to the main dimension if it is a subset of the detailed
dimension. In this case, the attributes in a conformed dimension need to be labeled exactly
in the same way as in a detailed dimension. The most common example could be a date
dimension used across many data marts having the same attributes such as date, month,
quarter, and year.
Junk Dimension
The business data generally contains some attributes that are not related to any
dimension but are associated more with the fact. These can be easily identified, as
generally these attributes represent themselves in the form of indicators or flags. For
example, in a car rental business flags such as IsDamaged, IsStolen, and IsExchanged
are common in the databases. These flags are not exactly part of any dimension, but
businesses do want to analyze data using these flags. If you leave them with the fact
data, the performance of the queries will be very poor due to the large size of the
fact table, and indexes won’t help due to yes/no nature of such flags. You could put
them in their own dimension, in which case you would have as many dimensions as
there are such attributes. Very frequently you will see that the number of indicators or
flags that exist in the data reaches 20 plus. In this case, your data mart design will be
cluttered with lots of dimensions that have only one member, enough to confuse users.
A recommended approach is to group all these unrelated attributes and put them in a
table, thus creating a junk dimension. For instance, you can select distinct combinations
of the attributes and add them to the junk dimension where each distinct row is
assigned a surrogate key that is referenced in the Fact table. Keeping the flags and
indicators in one dimension makes it easier for users to find these attributes and
helps the queries perform much better.
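The junk dimension technique described above, selecting distinct flag combinations and
assigning each a surrogate key that the fact row references, might look like the following
Python sketch. The function build_junk_dimension, the JunkKey column, and the rental rows
are hypothetical; only the flag names come from the text.

```python
def build_junk_dimension(fact_rows, flag_names):
    """Move flag columns out of the fact rows into a junk dimension.

    Returns (junk_dim, slim_facts). junk_dim maps each distinct flag
    combination to a surrogate key; slim_facts are the fact rows with
    the flags replaced by a single JunkKey column.
    """
    junk_dim = {}
    slim_facts = []
    for row in fact_rows:
        combination = tuple(row[name] for name in flag_names)
        # Reuse the key of a known combination, otherwise assign the next one.
        junk_key = junk_dim.setdefault(combination, len(junk_dim) + 1)
        slim = {k: v for k, v in row.items() if k not in flag_names}
        slim["JunkKey"] = junk_key
        slim_facts.append(slim)
    return junk_dim, slim_facts


# Hypothetical car rental facts carrying the flags named in the text.
rentals = [
    {"RentalID": 1, "Amount": 120, "IsDamaged": "N", "IsStolen": "N", "IsExchanged": "N"},
    {"RentalID": 2, "Amount": 300, "IsDamaged": "Y", "IsStolen": "N", "IsExchanged": "Y"},
    {"RentalID": 3, "Amount": 90, "IsDamaged": "N", "IsStolen": "N", "IsExchanged": "N"},
]

junk_dim, slim_rentals = build_junk_dimension(
    rentals, ["IsDamaged", "IsStolen", "IsExchanged"]
)
print(junk_dim)
print(slim_rentals)
```

Three fact rows produce only two junk dimension rows here, because the first and third
rentals share the same flag combination.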
Degenerate Dimension
Another type of attribute in the data is the number associated with each business
transaction, such as an invoice number or a ticket number. Though these attributes are
not actively used in analysis, every now and then businesses do have a requirement to
look for such attributes, as they link back with operational systems. These attributes
are also very useful in reconciling the data warehouse with ODS systems. When the
grain of the fact table is the same as that of a transaction, the likelihood of a degenerate
dimension increases. As the fact grain is the same as that of the degenerate dimension,
it sometimes forms an integral part of the primary key of the fact table. A degenerate
dimension should exist along with the facts in the fact table. There is no point in
creating a separate dimension table for a degenerate dimension, as it will grow with the
fact and will become quite large.

Role-Playing Dimensions
These types of dimensions exist when the same dimension is referenced multiple times
by a fact. For instance, a fact row could contain SaleDate and DeliveryDate columns, both pointing
to the date attribute of the date dimension. In this case, the key of the date attribute
of the date dimension is used multiple times in the fact row, and the date dimension is
often referred to as a role-playing dimension. You create a table alias or a view of the
date dimension for each role and join it to the corresponding foreign key in the fact table.
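A role-playing date dimension can be sketched with Python's built-in sqlite3 module, using
one table alias per role. This is an illustration only: the table and column names
(DimDate, FactSales, SaleDateKey, DeliveryDateKey, and so on) and the sample rows are
assumptions, and SQLite stands in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE DimDate (
        DateKey   INTEGER PRIMARY KEY,
        FullDate  TEXT,
        MonthName TEXT
    );
    CREATE TABLE FactSales (
        SalesKey        INTEGER PRIMARY KEY,
        SaleDateKey     INTEGER REFERENCES DimDate (DateKey),
        DeliveryDateKey INTEGER REFERENCES DimDate (DateKey),
        Amount          INTEGER
    );
    INSERT INTO DimDate VALUES (20080130, '2008-01-30', 'January');
    INSERT INTO DimDate VALUES (20080202, '2008-02-02', 'February');
    INSERT INTO FactSales VALUES (1, 20080130, 20080202, 500);
    """
)

# The single DimDate table plays two roles through two aliases.
rows = conn.execute(
    """
    SELECT sale.MonthName, delivery.MonthName, f.Amount
    FROM FactSales AS f
    JOIN DimDate AS sale     ON sale.DateKey = f.SaleDateKey
    JOIN DimDate AS delivery ON delivery.DateKey = f.DeliveryDateKey
    """
).fetchall()
print(rows)
```

Only one physical date table is stored and maintained; each alias simply gives it a
different meaning in the query.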
Loading a Dimension Using a Slowly Changing Dimension
Loading a static dimension is a simple one-off task, but you will come across dimensions
that change with time and will find loading such a dimension a challenging task. Some of
the data warehouse dimensions do not change that often. For instance, a date dimension’s
members stay as is and never change. By contrast, other business dimensions do change
with time as a business evolves; for example, a product could change in size and volume.
Typically, a row in the dimension table will have different attribute requirements:
• Some attributes never change, such as IDs, or in the case of a car, a VIN number.
• Some attributes will change but a business always wants to see the current value;
  for instance, a business may not bother to see the old description of the product
  if the description of a product is changed. This type of change is called a Type 1
  change in which the attribute value is overwritten and a business cannot go back
  and see the old value or learn when the change in value has been applied.
• Some attributes will change and the business will want to see the current value as
  well as the historical value. This happens when a business is interested in analyzing
  the facts before and after a particular change has been applied. For instance, a
  business might want to see the effect on sales when the size of a can of beans is
  changed from 400 g to 350 g. This type of change is called a Type 2 change.
• Most of the business requirements can be covered with use of the preceding types.
  Other types of changes have been defined such as Type 3 and so on; however, their
  primary function is to improve storage efficiency and query performance. These
  change types are not covered here. Refer to data warehousing books if you want to
  know more about them. Also, the SCD transformation in SSIS supports only up
  to Type 2 changes.
You have studied the Slowly Changing Dimension (SCD) transformation in
Chapter 10 and have done a Hands-On exercise as well. Here just to recap: the SCD
transformation is designed to help you load a dimension that changes in time, which
is generally a challenge with an ETL tool. The SCD transformation supports the
following attribute change types to support the previously mentioned requirements.
• Fixed Attribute Change Type: This change type supports the attributes that do
  not change (Type 0) and align with the first scenario mentioned previously.
• Changing Attribute Change Type: Using this change type, you can load the
  attribute values that are Type 1 changes, and it overwrites the existing values with
  the new values. This is an in-place modification.
• Historic Attribute Change Type: In this change type, a new row is added that
  will be valid for the current or future transactions. Typically, three columns are
  used to handle this type of change: StartDate, EndDate, and IsActive. When
  a row is getting a Type 2 update, it will update the IsActive flag to ‘N’ and will
  timestamp the EndDate with the current date and time to indicate that this
  row is no longer active, while the activity period of this row can be found using
  StartDate and EndDate. Also, at the same time, a new row is added with same
  values in all the columns except in the Type 2 column that is getting the update.
  In this Type 2 column, the updated new value is inserted, StartDate gets the
  current date and time stamp, EndDate is kept as null, and the IsActive column
  gets a ‘Y’ value to indicate that this row is active for the particular member.
Among other methods, you can try the SCD transformation to load a dimension. If
you think that your dimension is too large and the SCD transformation is not fit for
the purpose, you can always create a script in your package to load a slowly changing
dimension.
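The Type 1 and Type 2 behavior described above can be sketched as a small script. The
Python below is a simplified illustration and not the logic of the SSIS SCD transformation
itself; the function names, the ProductKey and ProductID columns, and the sample product
row are hypothetical, and the Type 1 update is assumed to overwrite every row of the member.

```python
from datetime import datetime


def apply_type1(dimension, business_key, column, new_value):
    """Changing attribute: overwrite in place, no history kept.

    Assumption: every row of the member is overwritten, not only the active one.
    """
    for row in dimension:
        if row["ProductID"] == business_key:
            row[column] = new_value


def apply_type2(dimension, business_key, column, new_value, now):
    """Historic attribute: expire the active row and add a new active row."""
    for row in dimension:
        if row["ProductID"] == business_key and row["IsActive"] == "Y":
            if row[column] == new_value:
                return  # nothing changed, keep the current row
            new_row = dict(row)
            new_row.update(
                {
                    "ProductKey": max(r["ProductKey"] for r in dimension) + 1,
                    column: new_value,
                    "StartDate": now,
                    "EndDate": None,
                    "IsActive": "Y",
                }
            )
            row["EndDate"] = now
            row["IsActive"] = "N"
            dimension.append(new_row)
            return


# Hypothetical product dimension with one active row.
dim_product = [
    {
        "ProductKey": 1,
        "ProductID": "BEANS",
        "Description": "Baked beans",
        "Size": "400 g",
        "StartDate": datetime(2008, 1, 1),
        "EndDate": None,
        "IsActive": "Y",
    }
]

apply_type1(dim_product, "BEANS", "Description", "Baked beans in tomato sauce")
apply_type2(dim_product, "BEANS", "Size", "350 g", datetime(2008, 6, 1))
for row in dim_product:
    print(row)
```

After both calls the dimension holds two rows for the product: the expired 400 g row with
an EndDate, and a new active 350 g row, both carrying the overwritten description.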
Data Model Schema
You can build your data model using two different types of schema.
Star Schema
As mentioned earlier, a data mart is a subject-oriented mini–data warehouse and will
have one or a few fact tables surrounded by a relatively large number of dimensions.
When these are drawn on a piece of paper, the structure looks like a star, hence the
term star schema. The star schema is a simple model in which the denormalized
dimensions are connected to facts using foreign key relationships. This model has
become a building block in dimensional modeling. In a large data warehouse where
multiple fact entities can exist, you can imagine a multiple star schema model with each
dimension connected with several fact entities. Figure 12-1 shows a simple example of
a star schema data model.
Figure 12-1 A simple star schema data model (the FactSales fact table surrounded by the
DimCustomer, DimSalesTerritory, DimDate, DimCurrency, DimProduct, and DimSalesPerson dimensions)
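A cut-down version of such a star schema can be built and queried with Python's built-in
sqlite3 module. This is a minimal sketch, not the book's schema: only the table names
FactSales, DimDate, and DimProduct come from the figure, while the columns, the sample
rows, and the use of SQLite in place of SQL Server are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Denormalized dimensions.
    CREATE TABLE DimDate (
        DateKey         INTEGER PRIMARY KEY,
        CalendarYear    INTEGER,
        CalendarQuarter INTEGER
    );
    CREATE TABLE DimProduct (
        ProductKey  INTEGER PRIMARY KEY,
        ProductName TEXT
    );
    -- Fact table: foreign keys to the dimensions plus a numeric measure.
    CREATE TABLE FactSales (
        DateKey     INTEGER REFERENCES DimDate (DateKey),
        ProductKey  INTEGER REFERENCES DimProduct (ProductKey),
        SalesAmount INTEGER
    );
    INSERT INTO DimDate VALUES (20080215, 2008, 1);
    INSERT INTO DimDate VALUES (20080520, 2008, 2);
    INSERT INTO DimProduct VALUES (1, 'Beans');
    INSERT INTO DimProduct VALUES (2, 'Soup');
    INSERT INTO FactSales VALUES (20080215, 1, 100);
    INSERT INTO FactSales VALUES (20080215, 2, 40);
    INSERT INTO FactSales VALUES (20080520, 1, 130);
    """
)

# Sales per quarter: one join from the fact table to the date dimension.
sales_by_quarter = conn.execute(
    """
    SELECT d.CalendarYear, d.CalendarQuarter, SUM(f.SalesAmount)
    FROM FactSales AS f
    JOIN DimDate AS d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarYear, d.CalendarQuarter
    ORDER BY d.CalendarYear, d.CalendarQuarter
    """
).fetchall()
print(sales_by_quarter)
```

Each business question becomes a join from the fact table to one or more dimensions
followed by an aggregation, which is what makes the star layout easy to query.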
