Microsoft SQL Server 2008 Learning Guide, Part 8


Planning Data Stores
The enterprise data architect helps an organization plan the most effective use of information throughout the organization. An organization’s data store configuration includes multiple types of data stores, as illustrated in the following figure, each with a specific purpose:

■ Operational databases, or OLTP (online transaction processing) databases, collect first-generation transactional data that is essential to the day-to-day operation of the organization and unique to the organization. An organization might have an operational data store to serve each unit or function within it. Regardless of the organization’s size, an organization with a singly focused purpose may very well have only one operational database.
■ For performance, operational stores are tuned for a balance of data retrieval and
updates, so indexes and locking are key concerns. Because these databases receive
first-generation data, they are subject to data update anomalies, and benefit from
normalization. A typical organizational data store configuration includes several
operational data stores feeding multiple data marts and a single master data store (see
graphic).
[Figure: A typical data store configuration: Manufacturing, Sales, and Mobile Sales OLTP databases feeding a Data Warehouse, a Manufacturing Data Mart, a Sales Data Mart, a ReferenceDB, and an Alternate Location.]
■ Caching data stores, sometimes called reporting databases, are optional read-only copies of all or part of an operational database. An organization might have multiple caching data stores to deliver data throughout the organization. Caching data stores might use SQL Server replication or log shipping to populate the database and are tuned for high-performance data retrieval.

■ Reference data stores are primarily read-only, and store generic data that is required by the organization but seldom changes — similar to the reference section of the library. Examples of reference data might be unit of measure conversion factors or ISO country codes. A reference data store is tuned for high-performance data retrieval.

■ Data warehouses collect large amounts of data from multiple data stores across the entire enterprise using an extract-transform-load (ETL) process to convert the data from the various formats and schemas into a common format, designed for ease of data retrieval. Data warehouses also serve as the archival location, storing historical data and releasing some of the data load from the operational data stores. The data is also pre-aggregated, making research and reporting easier, thereby improving the accessibility of information and reducing errors.
■ Because the primary task of a data warehouse is data retrieval and analysis, the data-
integrity concerns present with an operational data store don’t apply. Data warehouses
are designed for fast retrieval and are not normalized like master data stores. They
are generally designed using a basic star schema or snowflake design. Locks generally
aren’t an issue, and the indexing is applied without adversely affecting inserts or
updates.
Chapter 70, "BI Design," discusses star schemas and snowflake designs used in data warehousing.
■ The analysis process usually involves more than just SQL queries, and uses data cubes that consolidate gigabytes of data into dynamic pivot tables. Business intelligence (BI) is the combination of the ETL process, the data warehouse data store, and the acts of creating and browsing cubes.
■ A common data warehouse is essential for ensuring that the entire organization researches the same
data set and achieves the same result for the same query — a critical aspect of the Sarbanes-Oxley
Act and other regulatory requirements.

■ Data marts are subsets of the data warehouse with pre-aggregated data organized specifically to serve the needs of one organizational group or one data domain.

■ Master data store, or master data management (MDM), refers to the data warehouse that combines the data from throughout the organization. The primary purpose of the master data store is to provide a single version of the truth for organizations with a complex set of data stores and multiple data warehouses.
Smart Database Design
My career has focused on turning around database projects that were previously considered failures and
recommending solutions for ISV databases that are performing poorly. In nearly every case, the root
cause of the failure was the database design. It was too complex, too clumsy, or just plain inadequate.
Without exception, where I found poor database performance, I also found data modelers who insisted
on modeling alone or who couldn’t write SQL queries to save their lives.
Throughout my career, what began as an observation was reinforced into a firm conviction. The
database schema is the foundation of the database project; and an elegant, simple database design
outperforms a complex database both in terms of the development process and the final performance of
the database application. This is the basic idea behind the Smart Database Design.
While I believe in a balanced set of goals for any database, including performance, usability, data
integrity, availability, extensibility, and security, all things being equal, the crown goes to the database
that always provides the right answer with lightning speed.
Database system
A database system is a complex system. By complex, I mean that the system consists of multiple compo-
nents that interact with one another, as shown in Figure 2-1. The performance of one component affects
the performance of other components and thus the entire system. Stated another way, the design of one
component will set up other components, and the whole system, to either work well together or to frustrate those trying to make the system work.
FIGURE 2-1
The database system is the collective effort of the server environment, maintenance jobs, the client application, and the database.
Instead of randomly trying performance tips (and the Internet has an overwhelming number of SQL
Server performance and optimization tips), it makes more sense to think about the database as a system
and then figure out how the components of the database system affect one another. You can then use
this knowledge to apply the performance techniques in a way that provides the most benefit.
Every database system contains four broad technologies or components: the database itself, the server
platform, the maintenance jobs, and the client’s data access code, as illustrated in Figure 2-2. Each component affects the overall performance of the database system:
■ The server environment is the physical hardware configuration (CPUs, memory, disk spindles,
I/O bus), the operating system, and the SQL Server instance configuration, which together provide the working environment for the database. The server environment is typically optimized
by balancing the CPUs, memory and I/O, and identifying and eliminating bottlenecks.
■ The database maintenance jobs are the steps that keep the database running optimally (index defragmentation, DBCC integrity checks, and maintaining index statistics); a brief example follows this list.
■ The client application is the collection of data access layers, middle tiers, front-end applications,
ETL (extract, transform, and load) scripts, report queries, or SSIS (SQL Server Integration
Services) packages that access the database. These can not only affect the user’s perception of
database performance, but can also reduce the overall performance of the database system.
■ Finally, the database component includes everything within the data file: the physical schema,
T-SQL code (queries, stored procedures, user-defined functions (UDFs), and views), indexes,
and data.
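To make the maintenance-job component concrete, here is a minimal T-SQL sketch of the three routine tasks just listed. The database, table, and index names (MyDatabase, dbo.Orders, IX_Orders_CustomerID) are hypothetical, and a real job would normally check fragmentation levels before choosing between REORGANIZE and REBUILD.

-- Defragment a specific nonclustered index (REORGANIZE runs online)
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders REORGANIZE;

-- Check the logical and physical integrity of the whole database
DBCC CHECKDB (N'MyDatabase') WITH NO_INFOMSGS;

-- Refresh the distribution statistics the Query Optimizer relies on
UPDATE STATISTICS dbo.Orders WITH FULLSCAN;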

FIGURE 2-2
Smart Database Design is the premise that an elegant physical schema makes the data intuitively obvious and enables writing great set-based queries that respond well to indexing. This in turn creates short, tight transactions, which improves concurrency and scalability, while reducing the aggregate workload of the database. This flow from layer to layer becomes a methodology for designing and optimizing databases.
[Figure: The schema enables set-based queries, which enable indexing, which enables concurrency, which enables advanced scalability.]
All four database components must function well together to produce a high-performance database system; if one of the components is weak, then the database system will fail or perform poorly.
However, of these four components, the database itself is the most difficult component to design and
the one that drives the design of the other three components. For example, the database workload
determines the hardware requirements. Maintenance jobs and data access code are both designed around
the database; and an overly complex database will complicate both the maintenance jobs and the data
access code.
Physical schema
The base layer of Smart Database Design is the database’s physical schema. The physical schema includes
the database’s tables, columns, primary and foreign keys, and constraints. Basically, the "physical"
schema is what the server creates when you run data definition language (DDL) commands. Designing
an elegant, high-performance physical schema typically involves a team effort and requires numerous
design iterations and reviews.
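As a simple illustration, DDL statements such as the following are what produce the physical schema. The Customer and [Order] tables are hypothetical, but they show the tables, columns, keys, and constraints being described.

CREATE TABLE dbo.Customer (
    CustomerID   INT IDENTITY(1,1) NOT NULL,
    CustomerName NVARCHAR(100)     NOT NULL,
    CONSTRAINT PK_Customer PRIMARY KEY (CustomerID)
);

CREATE TABLE dbo.[Order] (
    OrderID     INT IDENTITY(1,1) NOT NULL,
    CustomerID  INT               NOT NULL,
    OrderDate   DATETIME          NOT NULL
        CONSTRAINT DF_Order_OrderDate DEFAULT (GETDATE()),
    TotalAmount NUMERIC(9,2)      NOT NULL,
    CONSTRAINT PK_Order PRIMARY KEY (OrderID),
    CONSTRAINT FK_Order_Customer FOREIGN KEY (CustomerID)
        REFERENCES dbo.Customer (CustomerID),
    CONSTRAINT CK_Order_TotalAmount CHECK (TotalAmount >= 0)
);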
Well-designed physical schemas avoid overcomplexity by generalizing similar types of objects, thereby
creating a schema with fewer entities. While designing the physical schema, make the data obvious to
the developer and easy to query. The prime consideration when converting the logical database design
into a physical schema is how much work is required in order for a query to navigate the data structures while maintaining a correctly normalized design. Not only is the schema then a joy to use, but it
also makes it easier to code correct queries, reducing the chance of data integrity errors caused by faulty
queries.
Other hallmarks of a well-designed schema include the following:
■ The primary and foreign keys are designed for raw physical performance.
■ Optional data (e.g., second address lines, name suffixes) is designed using patterns (nullable columns, surrogate nulls, or missing rows) that protect the integrity of the data both within the database and through the query; a brief sketch of two of these patterns follows this list.
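A minimal sketch, using hypothetical Person tables, of two of those patterns: a nullable column for an optional name suffix, and a missing row for an optional secondary address.

-- Nullable-column pattern: the suffix is simply NULL when it doesn't apply
CREATE TABLE dbo.Person (
    PersonID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    LastName   NVARCHAR(50) NOT NULL,
    NameSuffix NVARCHAR(10) NULL
);

-- Missing-row pattern: an optional secondary address is a row that may not exist
CREATE TABLE dbo.PersonAddress (
    PersonID    INT           NOT NULL REFERENCES dbo.Person (PersonID),
    AddressType CHAR(1)       NOT NULL,  -- 'P' = primary, 'S' = secondary
    AddressLine NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_PersonAddress PRIMARY KEY (PersonID, AddressType)
);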
Conversely, a poorly designed (either non-normalized or overly complex) physical schema encourages
developers to write iterative code, code that uses temporary buckets to manipulate data, or code that
will be difficult to debug or maintain.
Agile Modeling
Agile development is popular for good reasons. It gets the job done more quickly and often produces a better result than traditional methods. Agile development also fits well with database design and development.
The traditional waterfall process steps through four project phases: requirements gathering, design, development, and implementation. While this method may work well for some endeavors, when creating software,
the users often don’t know what they want until they see it, which pushes discovery beyond the requirements
gathering phase and into the development phase.
Agile development addresses this problem by replacing the single long waterfall with numerous short cycles
or iterations. Each iteration builds out a working model that can be tested, and enables users to play with the
software and further discover their needs. When users see rapid progress and trust that new features can be
added, they become more willing to allow features to be planned into the life cycle of the software, instead
of insisting that every feature be implemented in the next version.
When I’m developing a database, each iteration is usually 2–5 days long and is a mini cycle of discovery,
coding, unit testing, and more discoveries with the client. A project might consist of a dozen of these tight
iterations; and with each iteration, more features are fleshed out in the database and code.
Set-based queries
SQL Server is designed to handle data in sets. SQL is a declarative language, meaning that the SQL
query describes the problem, and the Query Optimizer generates an execution plan to resolve the
problem as a set.
Application programmers typically develop while-loops that handle data one row at a time. Iterative
code is fine for application tasks such as populating a grid or combo box, but it is inappropriate for
server-side code. Iterative T-SQL code, typically implemented via cursors, forces the database engine
to perform thousands of wasteful single-row operations, instead of handling the problem in one larger,
more efficient set. The performance cost of these single-row operations is huge. Depending on the task,
SQL cursors perform about half as well as set-based code, and the performance differential grows with the size of the data. This is why set-based queries, based on an obvious physical schema, are so critical to database performance.
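As an illustration only (the Product table and the 5 percent price adjustment are hypothetical), the two snippets below do the same work. The first walks a cursor one row at a time; the second states the whole problem as a single set-based statement that the Query Optimizer can resolve in one pass.

-- Iterative approach: a cursor forces thousands of single-row operations
DECLARE @ProductID INT;
DECLARE ProductCursor CURSOR FOR
    SELECT ProductID FROM dbo.Product WHERE Discontinued = 0;
OPEN ProductCursor;
FETCH NEXT FROM ProductCursor INTO @ProductID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE dbo.Product
        SET ListPrice = ListPrice * 1.05
        WHERE ProductID = @ProductID;
    FETCH NEXT FROM ProductCursor INTO @ProductID;
END;
CLOSE ProductCursor;
DEALLOCATE ProductCursor;

-- Set-based approach: one statement describes the entire set
UPDATE dbo.Product
    SET ListPrice = ListPrice * 1.05
    WHERE Discontinued = 0;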
A good physical schema and set-based queries set up the database for excellent indexing, further
improving the performance of the query (see Figure 2-2).
However, queries cannot overcome the errors of a poor physical schema and won’t solve the performance issues of poorly written code. It’s simply impossible to fix a clumsy database design by throwing
code at it. Poor database designs tend to require extra code, which performs poorly and is difficult to
maintain. Unfortunately, poorly designed databases also tend to have code that is tightly coupled (refers
directly to tables), instead of code that accesses the database’s abstraction layer (stored procedures and
views). This makes it all that much harder to refactor the database.
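A brief, hypothetical sketch of such an abstraction layer: client code calls the view and the stored procedure, never the base table, so the table can later be refactored without breaking the applications. The object names are illustrative only.

-- The view and procedure form the public face of the database
CREATE VIEW dbo.vOrderSummary
AS
SELECT OrderID, CustomerID, OrderDate, TotalAmount
FROM dbo.[Order];
GO
CREATE PROCEDURE dbo.pOrder_FetchByCustomer
    @CustomerID INT
AS
SET NOCOUNT ON;
SELECT OrderID, OrderDate, TotalAmount
FROM dbo.vOrderSummary
WHERE CustomerID = @CustomerID;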
Indexing
An index is an organized pointer used to locate information in a larger collection. An index is only
useful when it matches the needs of a question. In this case, it becomes the shortcut between a ques-
tion and the right answer. The key is to design the fewest number of shortcuts between the right
questions and the right answers.
A sound indexing strategy identifies a handful of queries that represent 90% of the workload and, with
judicious use of clustered indexes and covering indexes, solves the queries without expensive bookmark
lookup operations.
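For example (index and column names hypothetical), if the dominant query returns a customer's order dates and amounts, a nonclustered index keyed on CustomerID that includes the other two columns covers the query completely, so no bookmark lookup into the base table is needed.

-- A covering index for the workload's most frequent order-history query
CREATE NONCLUSTERED INDEX IX_Order_CustomerID
    ON dbo.[Order] (CustomerID)
    INCLUDE (OrderDate, TotalAmount);

-- This query can now be satisfied entirely from the index
SELECT OrderDate, TotalAmount
FROM dbo.[Order]
WHERE CustomerID = 42;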
An elegant physical schema, well-written set-based queries, and excellent indexing reduce transaction
duration, which implicitly improves concurrency and sets up the database for scalability.
Nevertheless, indexes cannot overcome the performance difficulties of iterative code. Poorly written SQL code that returns unnecessary columns is much more difficult to index and will likely not take advantage of covering indexes. Moreover, it’s extremely difficult to properly index an overly complex or non-normalized physical schema.
Concurrency
SQL Server, as an ACID-compliant database engine, supports transactions that are atomic, consistent, isolated, and durable. Whether the transaction is a single statement or an explicit transaction within BEGIN TRAN ... COMMIT TRAN statements, locks are typically used to prevent one transaction from seeing another transaction’s uncommitted data. Transaction isolation is great for data integrity, but locking and blocking hurt performance.
Multi-user concurrency can be tuned by limiting the extraneous code within logical transactions, setting
the transaction isolation level no higher than required, keeping trigger code to a minimum, and perhaps
using snapshot isolation.
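Two of those levers, sketched with hypothetical names: keep the explicit transaction down to the statements that must commit together, and turn on READ_COMMITTED_SNAPSHOT so readers see committed row versions rather than waiting on writers' locks.

-- Database-level row versioning: readers no longer block behind writers
-- (the database must have no other active connections when this is changed)
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;

-- A tight logical transaction: nothing extraneous inside BEGIN/COMMIT
BEGIN TRAN;
    UPDATE dbo.Account SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE dbo.Account SET Balance = Balance + 100 WHERE AccountID = 2;
COMMIT TRAN;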
A database with an excellent physical schema, well-written set-based queries, and the right set of indexes
will have tight transactions and perform well with multiple users.
When a poorly designed database displays symptoms of locking and blocking issues, no amount of
transaction isolation level tuning will solve the problem. The sources of the concurrency issue are
the long transactions and additional workload caused by the poor database schema, lack of set-based
queries, or missing indexes. Concurrency tuning cannot overcome the deficiencies of a poor database
design.
Advanced scalability
With each release, Microsoft has consistently enhanced SQL Server for the enterprise. These technologies
can enhance the scalability of heavy transaction databases.
The Resource Governor, new in SQL Server 2008, can restrict the resources available for different sets of queries, enabling the server to maintain the SLA for some queries at the expense of other, less critical queries.
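A minimal sketch of the idea, assuming a hypothetical reporting login and pool names; the classifier function must be created in the master database.

-- Cap ad hoc reporting at 20 percent CPU so critical queries keep their SLA
CREATE RESOURCE POOL ReportPool WITH (MAX_CPU_PERCENT = 20);
CREATE WORKLOAD GROUP ReportGroup USING ReportPool;
GO
CREATE FUNCTION dbo.fnResourceClassifier() RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    -- Route the hypothetical ReportUser login into the limited group
    RETURN CASE WHEN SUSER_SNAME() = N'ReportUser'
                THEN N'ReportGroup'
                ELSE N'default' END;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnResourceClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;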
Indexed views were introduced in SQL Server 2000. They actually materialize the view as a clustered
index and can enable queries to select from joined data without hitting the joined tables, or to pre-aggregate data. In effect, an indexed view is a custom covering index that can cover across multiple
tables.
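A hypothetical sketch (assuming the Customer and [Order] tables, with a non-nullable TotalAmount column): the schema-bound view pre-joins and pre-aggregates order data, and it is the unique clustered index that actually materializes it.

CREATE VIEW dbo.vSalesByCustomer
WITH SCHEMABINDING
AS
SELECT c.CustomerID,
       c.CustomerName,
       COUNT_BIG(*)       AS OrderCount,
       SUM(o.TotalAmount) AS TotalSales
FROM dbo.Customer AS c
    JOIN dbo.[Order] AS o ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID, c.CustomerName;
GO
-- The unique clustered index materializes the view's result set
CREATE UNIQUE CLUSTERED INDEX IXv_SalesByCustomer
    ON dbo.vSalesByCustomer (CustomerID);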
Partitioned tables can automatically segment data across multiple filegroups, which can serve as an auto-archive device. By reducing the size of the active data partition, the requirements for maintaining the
data, such as defragging the indexes, are also reduced.
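A simplified sketch with hypothetical filegroups (which must already exist) and yearly boundaries:

-- Two boundary values create three partitions, one per filegroup
CREATE PARTITION FUNCTION pfOrderYear (DATETIME)
    AS RANGE RIGHT FOR VALUES ('2007-01-01', '2008-01-01');

CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear TO (FG_Archive, FG_2007, FG_Current);

-- Rows land in a partition according to their OrderDate
CREATE TABLE dbo.OrderHistory (
    OrderID     INT          NOT NULL,
    OrderDate   DATETIME     NOT NULL,
    TotalAmount NUMERIC(9,2) NOT NULL
) ON psOrderYear (OrderDate);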
Service Broker can collect transactional data and process it after the fact, thereby providing an "overtime" load leveling as it spreads a five-second peak load over a one-minute execution without delaying the calling transaction.
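A rough sketch of that pattern (object names hypothetical, the database must have Service Broker enabled, and error handling is omitted): the calling transaction only SENDs a message and returns, and a background task RECEIVEs from the queue later.

CREATE MESSAGE TYPE OrderAuditMsg VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT OrderAuditContract (OrderAuditMsg SENT BY INITIATOR);
CREATE QUEUE OrderAuditQueue;
CREATE SERVICE OrderAuditService ON QUEUE OrderAuditQueue (OrderAuditContract);
GO
-- Inside the calling transaction: enqueue the work and return immediately
DECLARE @dialog UNIQUEIDENTIFIER;
BEGIN DIALOG CONVERSATION @dialog
    FROM SERVICE OrderAuditService
    TO SERVICE 'OrderAuditService'
    ON CONTRACT OrderAuditContract
    WITH ENCRYPTION = OFF;
SEND ON CONVERSATION @dialog
    MESSAGE TYPE OrderAuditMsg (N'<Order ID="42" />');
GO
-- Later, an asynchronous process drains the queue at its own pace
DECLARE @body XML;
RECEIVE TOP (1) @body = CAST(message_body AS XML)
    FROM OrderAuditQueue;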

While these high-scalability features can extend the scalability of a well-designed database, they are lim-
ited in their ability to add performance to a poorly designed database, and they cannot overcome long
transactions caused by a lack of indexes, iterative code, or the many other problems caused by an overly complex database design.
The database component is the principal factor determining the overall monetary cost of the database. A
well-designed database minimizes hardware costs, simplifies data access code and maintenance jobs, and
significantly lowers both the initial and the total cost of the database system.
A performance framework
By describing the dependencies between the schema, queries, indexing, transactions, and scalability,
Smart Database Design is a framework for performance.
The key to mastering Smart Database Design is understanding the interaction, or cause-and-effect
relationship, between these hierarchical layers (schema, queries, indexing, concurrency). Each layer
enables the next layer; conversely, no layer can overcome deficiencies in lower layers. The practical
application of Smart Database Design takes advantage of these dependencies when developing or
optimizing a database by employing the right best practices within each layer to support the next layer.
Reducing the aggregate workload of the database component has a positive effect on the rest of the
database system. An efficient database component reduces the performance requirements of the server
platform, increasing capacity. Maintenance jobs are easier to plan and also execute faster when the
database component is designed well. There is less client access code to write and the code that needs
to be written is easier to write and maintain. The result is an overall database system that’s simpler to
maintain, cheaper to run, easier to connect to from the data access layer, and that scales beautifully.
Although it’s not a perfect analogy, picturing a water fountain on a hot summer day can help demon-
strate how shorter transactions improve overall database performance. If everyone takes a small, quick
sip from the fountain, then no queue forms; but as soon as someone fills up a liter-sized Big Gulp cup,
others begin to wait. Regardless of the amount of hardware resources available to a database, time is
finite, and the greatest performance gain is obtained by eliminating the excess work of wastefully long transactions, or throwing away the Big Gulp cup.
The quick sips of a well-designed query hitting an elegant, properly indexed database will outperform
and be significantly easier on the budget than the Big Gulp cup, with its poorly written query or cursor,
on a poorly designed database missing an index.
Striving for database design excellence is a smart business move with an excellent estimated return on
investment. From my experience, every day spent on database design saves two to three months of
development and maintenance time. In the long term, it’s far cheaper to design the database correctly
than to throw money or labor at project overruns or hardware upgrades.
The cause-and-effect relationship between the layers helps diagnose performance problems as well.
When a system is experiencing locking and blocking problems, the cause is likely found in the indexing
or query layers. I’ve seen databases that were drowning under the weight of poorly written code.
However, the root cause wasn’t the code; it was the overly complex, anti-normalized database design
that was driving the developers to write horrid code.
The bottom line? Designing an elegant database schema is the first step in maximizing the performance
of the overall database system, while reducing costs.
Issues and objections
I’ve heard objections to the Smart Database Design framework, and I’d like to address them here. Some
say that buying more hardware is the best way to improve performance. I disagree. More hardware
only masks the problem until it explodes later. Performance problems tend to grow exponentially as
DB size grows, whereas hardware performance grows more or less linearly over time. One can almost
predict when even the ‘‘best’’ hardware available no longer suffices to get acceptable performance. In
several cases, I’ve seen companies spend incredible amounts to upgrade their hardware and they saw
little or no improvement because the bottleneck was the transaction locking and blocking and poor
code. Sometimes, a faster CPU only waits faster. Strategically, reducing the workload is cheaper than
increasing the capacity of the hardware.
Some claim that fixing one layer can overcome deficiencies in lower layers. It’s true that a poor schema will perform better when properly indexed than without indexes. However, adding the indexes doesn’t really solve the deficiencies; it only masks them. The code is still doing extra work to compensate for the poor schema. The cost of developing code and designing correct indexes is still higher for the poor schema. Any data integrity or extensibility risks are still there.
Some argue that they would like to apply Smart Database Design but they can’t because the database is
a third-party database and they can’t modify the schema or the code. True, for most third-party products, the database schema and queries are not open for optimization, and this can be very frustrating if
the database needs optimization. However, most vendors are interested in improving their product and
keeping their clients happy. Both clients and vendors have contracted with me to help identify areas of
opportunity and suggest solutions for the next revision.
Some say they’d like to apply Smart Database Design but they can’t because any change to the schema
would break hundreds of other objects. It’s true — databases without abstraction layers are expensive
to alter. An abstraction layer decouples the database from the client applications, making it possible
to change the database component without affecting the client applications. In the absence of a
well-designed abstraction layer, the first step toward gaining system performance is to create one. As
expensive as it may seem to refactor the database and every application so that all communications
go through an abstraction layer, the cost of not doing so could very well be that IT can’t respond to
the organization’s needs, forcing the company to outsource or develop wasteful extra databases. At the
worst, the failure of the database to be extensible could force the end of the organization.
In both the case of the third-party database and the lack of abstraction, it’s still a good idea to optimize
at the lowest level possible, and then move up the layers; but the best performance gains are made when
you can start optimizing at the lowest level of the database component, the physical schema.
Some say that a poorly designed database can be solved by adding more layers of code and converting the
database to an SOA-style application. I disagree. The database should be refactored with a clean normalized
design and a proper abstraction layer. This will reduce the overall workload and solve a host of usability
and performance issues much better than simply wrapping a poorly designed database with more code.
Summary
When introducing the optimization chapter in her book Inside SQL Server 2000, Kalen Delaney correctly
writes that optimization can’t be added to a database after it has been developed; it has to be designed
into the database from the beginning.

This chapter presented the concept of the Information Architecture Principle, unpacked the six database
objectives, and then discussed the Smart Database Design, showing the dependencies between the layers
and how each layer enables the next layer.
In a chapter packed with ideas, I’d like to highlight the following:
■ The database architect position should be equally involved in the enterprise-level design and
the project-level designs.
■ Any database design or implementation can be measured by six database objectives: usability,
extensibility, data integrity, performance, availability, and security. These objectives don’t have
to compete — it’s possible to design an elegant database that meets all six objectives.
■ Each day spent on the database design will save three months later.
■ Extensibility is the most expensive database objective to correct after the fact. A brittle
database — one that has ad hoc SQL directly accessing the table from the client — is the
worst design possible. It’s simply impossible to fix a clumsy database design by throwing code
at it.
■ Smart Database Design is the premise that an elegant physical schema makes the data intuitively
obvious and enables writing great set-based queries that respond well to indexing. This in turn
creates short, tight transactions, which improves concurrency and scalability while reducing the
aggregate workload of the database. This flow from layer to layer becomes a methodology for
designing and optimizing databases.
■ Reducing the aggregate workload of the database has a greater positive effect than buying more
hardware.
From this overview of data architecture, the next chapter digs deeper into the concepts and patterns of
relational database design, which are critical for usability, extensibility, data integrity, and performance.