developing transactions with the ACID properties. In addition to providing a uniform
interface to the services of different resource managers, a TP monitor also routes
transactions to the appropriate resource managers. Finally, a TP monitor ensures that
an application behaves as a transaction by implementing concurrency control, logging,
and recovery functions, and by exploiting the transaction processing capabilities of the
underlying resource managers.
TP monitors are used in environments where applications require advanced features
such as access to multiple resource managers; sophisticated request routing (also called
workflow management); assigning priorities to transactions and doing priority-
based load-balancing across servers; and so on. A DBMS provides many of the func-
tions supported by a TP monitor in addition to processing queries and database up-
dates efficiently. A DBMS is appropriate for environments where the wealth of trans-
action management capabilities provided by a TP monitor is not necessary and, in
particular, where very high scalability (with respect to transaction processing activ-
ity) and interoperability are not essential.
The transaction processing capabilities of database systems are improving continually.
For example, many vendors offer distributed DBMS products today in which a transac-
tion can execute across several resource managers, each of which is a DBMS. Currently,
all the DBMSs must be from the same vendor; however, as transaction-oriented services
from different vendors become more standardized, distributed, heterogeneous DBMSs
should become available. Eventually, perhaps, the functions of current TP monitors
will also be available in many DBMSs; for now, TP monitors provide essential infras-
tructure for high-end transaction processing environments.
28.1.2 New Transaction Models
Consider an application such as computer-aided design, in which users retrieve large
design objects from a database and interactively analyze and modify them. Each
transaction takes a long time (minutes or even hours, whereas TPC benchmark
transactions take under a millisecond), and holding locks this long affects performance.
Further, if a crash occurs, undoing an active transaction completely is unsatisfactory,
since considerable user effort may be lost. Ideally we want to be able to restore most
of the actions of an active transaction and resume execution. Finally, if several users
are concurrently developing a design, they may want to see changes being made by
others without waiting until the end of the transaction that changes the data.
To address the needs of long-duration activities, several refinements of the transaction
concept have been proposed. The basic idea is to treat each transaction as a collection
of related subtransactions. Subtransactions can acquire locks, and the changes made
by a subtransaction become visible to other transactions after the subtransaction ends
(and before the main transaction of which it is a part commits). In multilevel
transactions, locks held by a subtransaction are released when the subtransaction ends.
In nested transactions, locks held by a subtransaction are assigned to the parent
(sub)transaction when the subtransaction ends. These refinements to the transaction
concept have a significant effect on concurrency control and recovery algorithms.
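For comparison, SQL savepoints (standardized in SQL:1999 and supported by many
systems) provide a weaker, single-transaction form of partial rollback. A minimal
sketch, using a purely illustrative table:

-- Hypothetical table standing in for a long-lived design object.
CREATE TABLE Designs ( did INTEGER PRIMARY KEY, version INTEGER );
INSERT INTO Designs VALUES (1, 1);

BEGIN;                                          -- or START TRANSACTION
UPDATE Designs SET version = 2 WHERE did = 1;   -- first unit of work
SAVEPOINT step1;                                -- mark a recoverable point
UPDATE Designs SET version = 3 WHERE did = 1;   -- second unit of work
ROLLBACK TO SAVEPOINT step1;                    -- undo only the second unit
COMMIT;                                         -- the first update survives

Unlike the multilevel and nested models above, a savepoint's changes stay invisible
to other transactions until the enclosing transaction commits; savepoints only bound
how much work a rollback undoes.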
28.1.3 Real-Time DBMSs
Some transactions must be executed within a user-specified deadline. A hard deadline
means the value of the transaction is zero after the deadline. For example, in a
DBMS designed to record bets on horse races, a transaction placing a bet is worthless
once the race begins. Such a transaction should not be executed; the bet should not
be placed. A soft deadline means the value of the transaction decreases after the
deadline, eventually going to zero. For example, in a DBMS designed to monitor some
activity (e.g., a complex reactor), a transaction that looks up the current reading of a
sensor must be executed within a short time, say, one second. The longer it takes to
execute the transaction, the less useful the reading becomes. In a real-time DBMS, the
goal is to maximize the value of executed transactions, and the DBMS must prioritize
transactions, taking their deadlines into account.
28.2 INTEGRATED ACCESS TO MULTIPLE DATA SOURCES
As databases proliferate, users want to access data from more than one source. For
example, if several travel agents market their travel packages through the Web, customers would like to look at packages from different agents and compare them. A
more traditional example is that large organizations typically have several databases,
created (and maintained) by different divisions such as Sales, Production, and Pur-
chasing. While these databases contain much common information, determining the
exact relationship between tables in different databases can be a complicated prob-
lem. For example, prices in one database might be in dollars per dozen items, while
prices in another database might be in dollars per item. The development of XML
DTDs (see Section 22.3.3) offers the promise that such semantic mismatches can be
avoided if all parties conform to a single standard DTD. However, there are many
legacy databases and most domains still do not have agreed-upon DTDs; the problem
of semantic mismatches will be frequently encountered for the foreseeable future.
Semantic mismatches can be resolved and hidden from users by defining relational
views over the tables from the two databases. Defining a collection of views to give
a group of users a uniform presentation of relevant data from multiple databases is
called semantic integration. Creating views that mask semantic mismatches in a
natural manner is a difficult task and has been widely studied. In practice, the task
is made harder by the fact that the schemas of existing databases are often poorly
documented; thus, it is difficult to even understand the meaning of rows in existing
tables, let alone define unifying views across several tables from different databases.
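To make the price example above concrete, here is a sketch of an integrating view;
all table and column names are hypothetical:

-- Source tables from two divisions' databases.
CREATE TABLE SalesParts ( pno INTEGER, price_per_dozen REAL );
CREATE TABLE PurchParts ( pno INTEGER, price_per_item REAL );

-- The view presents a uniform per-item price over both sources.
CREATE VIEW AllParts (pno, price_per_item, source) AS
SELECT pno, price_per_dozen / 12.0, 'Sales' FROM SalesParts
UNION ALL
SELECT pno, price_per_item, 'Purchasing' FROM PurchParts;

Users query AllParts as if it were a single table; the unit conversion is hidden
inside the view definition.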
If the underlying databases are managed using different DBMSs, as is often the case,
some kind of ‘middleware’ must be used to evaluate queries over the integrating views,
retrieving data at query execution time by using protocols such as Open Database Con-
nectivity (ODBC) to give each underlying database a uniform interface, as discussed
in Chapter 5. Alternatively, the integrating views can be materialized and stored in
a data warehouse, as discussed in Chapter 23. Queries can then be executed over the
warehoused data without accessing the source DBMSs at run-time.
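Materializing such a view is not standardized; Oracle and PostgreSQL, for example,
accept syntax along these lines (reusing the hypothetical AllParts view sketched
above):

-- Store the integrating view's result in the warehouse. The copy must be
-- refreshed periodically; it does not track changes at the sources.
CREATE MATERIALIZED VIEW WarehouseParts AS
SELECT pno, price_per_item, source FROM AllParts;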
28.3 MOBILE DATABASES
The availability of portable computers and wireless communications has created a new
breed of nomadic database users. At one level these users are simply accessing a
database through a network, which is similar to distributed DBMSs. At another level
the network as well as data and user characteristics now have several novel properties,
which affect basic assumptions in many components of a DBMS, including the query
engine, transaction manager, and recovery manager:
Users are connected through a wireless link whose bandwidth is a tenth that of
Ethernet and a hundredth that of ATM networks. Communication costs are
therefore significantly higher in proportion to I/O and CPU costs.
Users’ locations are constantly changing, and mobile computers have a limited
battery life. Therefore, the true communication costs reflect connection time and
battery usage in addition to bytes transferred, and change constantly depending
on location. Data is frequently replicated to minimize the cost of accessing it from
different locations.
As a user moves around, data could be accessed from multiple database servers
within a single transaction. The likelihood of losing connections is also much
greater than in a traditional network. Centralized transaction management may
therefore be impractical, especially if some data is resident at the mobile comput-
ers. We may in fact have to give up on ACID transactions and develop alternative
notions of consistency for user programs.
28.4 MAIN MEMORY DATABASES
The price of main memory is now low enough that we can buy enough main memory
to hold the entire database for many applications; with 64-bit addressing, modern
CPUs also have very large address spaces. Some commercial systems now have several
gigabytes of main memory. This shift prompts a reexamination of some basic DBMS
design decisions, since disk accesses no longer dominate processing time for a memory-
resident database:
Main memory does not survive system crashes, and so we still have to implement
logging and recovery to ensure transaction atomicity and durability. Log records
must be written to stable storage at commit time, and this process could become
a bottleneck. To minimize this problem, rather than commit each transaction as
it completes, we can collect completed transactions and commit them in batches;
this is called group commit. Recovery algorithms can also be optimized since
pages rarely have to be written out to make room for other pages.
The implementation of in-memory operations has to be optimized carefully since
disk accesses are no longer the limiting factor for performance.
A new criterion must be considered while optimizing queries, namely the amount
of space required to execute a plan. It is important to minimize the space overhead
because exceeding available physical memory would lead to swapping pages to disk
(through the operating system’s virtual memory mechanisms), greatly slowing
down execution.
Page-oriented data structures become less important (since pages are no longer the
unit of data retrieval), and clustering is not important (since the cost of accessing
any region of main memory is uniform).
28.5 MULTIMEDIA DATABASES
In an object-relational DBMS, users can define ADTs with appropriate methods, which
is an improvement over an RDBMS. Nonetheless, supporting just ADTs falls short of
what is required to deal with very large collections of multimedia objects, including
audio, images, free text, text marked up in HTML or variants, sequence data, and
videos. Illustrative applications include NASA’s EOS project, which aims to create a
repository of satellite imagery, the Human Genome project, which is creating databases
of genetic information such as GenBank, and NSF/DARPA’s Digital Libraries project,
which aims to put entire libraries into database systems and then make them accessible
through computer networks. Industrial applications such as collaborative development
of engineering designs also require multimedia database management, and are being
addressed by several vendors.
We outline some applications and challenges in this area:
Content-based retrieval: Users must be able to specify selection conditions
based on the contents of multimedia objects. For example, users may search for
images using queries such as: “Find all images that are similar to this image” and
“Find all images that contain at least three airplanes.” As images are inserted into
the database, the DBMS must analyze them and automatically extract features
that will help answer such content-based queries. This information can then be
used to search for images that satisfy a given query, as discussed in Chapter 26.
As another example, users would like to search for documents of interest using
information retrieval techniques and keyword searches. Vendors are moving to-
wards incorporating such techniques into DBMS products. It is still not clear how
these domain-specific retrieval and search techniques can be combined effectively
with traditional DBMS queries. Research into abstract data types and ORDBMS
query processing has provided a starting point, but more work is needed.
Managing repositories of large objects: Traditionally, DBMSs have concen-
trated on tables that contain a large number of tuples, each of which is relatively
small. Once multimedia objects such as images, sound clips, and videos are stored
in a database, individual objects of very large size have to be handled efficiently.
For example, compression techniques must be carefully integrated into the DBMS
environment. As another example, distributed DBMSs must develop techniques
to efficiently retrieve such objects. Retrieval of multimedia objects in a distributed
system has been addressed in limited contexts, such as client-server systems, but
in general remains a difficult problem.
Video-on-demand: Many companies want to provide video-on-demand services
that enable users to dial into a server and request a particular video. The video
must then be delivered to the user’s computer in real time, reliably and inex-
pensively. Ideally, users must be able to perform familiar VCR functions such as
fast-forward and reverse. From a database perspective, the server has to contend
with specialized real-time constraints; video delivery rates must be synchronized
at the server and at the client, taking into account the characteristics of the com-
munication network.
28.6 GEOGRAPHIC INFORMATION SYSTEMS
Geographic Information Systems (GIS) contain spatial information about cities,
states, countries, streets, highways, lakes, rivers, and other geographical features, and
support applications to combine such spatial information with non-spatial data. As
discussed in Chapter 26, spatial data is stored in either raster or vector formats. In
addition, there is often a temporal dimension, as when we measure rainfall at several
locations over time. An important issue with spatial data sets is how to integrate data
from multiple sources, since each source may record data using a different coordinate
system to identify locations.
Now let us consider how spatial data in a GIS is analyzed. Spatial information is most
naturally thought of as being overlaid on maps. Typical queries include “What cities
lie on I-94 between Madison and Chicago?” and “What is the shortest route from
Madison to St. Louis?” These kinds of queries can be addressed using the techniques
discussed in Chapter 26. An emerging application is in-vehicle navigation aids. With
Global Positioning Systems (GPS) technology, a car’s location can be pinpointed, and
by accessing a database of local maps, a driver can receive directions from his or her
current location to a desired destination; this application also involves mobile database
access!
In addition, many applications involve interpolating measurements at certain locations
across an entire region to obtain a model, and combining overlapping models. For ex-
ample, if we have measured rainfall at certain locations, we can use the TIN approach
to triangulate the region with the locations at which we have measurements being the
vertices of the triangles. Then, we use some form of interpolation to estimate the
rainfall at points within triangles. Interpolation, triangulation, map overlays, visual-
izations of spatial data, and many other domain-specific operations are supported in
GIS products such as ESRI Systems’ ARC-Info. Thus, while spatial query processing
techniques as discussed in Chapter 26 are an important part of a GIS product, con-
siderable additional functionality must be incorporated as well. How best to extend
ORDBMS systems with this additional functionality is an important problem yet to
be resolved. Agreeing upon standards for data representation formats and coordinate
systems is another major challenge facing the field.
28.7 TEMPORAL AND SEQUENCE DATABASES

Currently available DBMSs provide little support for queries over ordered collections
of records, or sequences, and over temporal data. Typical sequence queries include
“Find the weekly moving average of the Dow Jones Industrial Average,” and “Find the
first five consecutively increasing temperature readings” (from a trace of temperature
observations). Such queries can be easily expressed and often efficiently executed by
systems that support query languages designed for sequences. Some commercial SQL
systems now support such SQL extensions.
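For instance, in systems that support SQL window functions (one such extension),
the moving-average query might be written as follows, over a hypothetical table of
closing values:

CREATE TABLE DowJones ( trading_date DATE PRIMARY KEY, close_val REAL );

-- Average of the current row and the four preceding trading days,
-- i.e., a five-day (weekly) moving average.
SELECT trading_date,
       AVG(close_val) OVER (ORDER BY trading_date
                            ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)
         AS weekly_avg
FROM DowJones;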
The first example is also a temporal query. However, temporal queries involve more
than just record ordering. For example, consider the following query: “Find the longest
interval in which the same person managed two different departments.” If the period
during which a given person managed a department is indicated by two fields from and
to, we have to reason about a collection of intervals, rather than a sequence of records.
Further, temporal queries require the DBMS to be aware of the anomalies associated
with calendars (such as leap years). Temporal extensions are likely to be incorporated
in future versions of the SQL standard.
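A sketch of the interval reasoning involved, over a hypothetical Manages(mid, did,
from_date, to_date) table; picking the longest overlap would further require ordering
by the length of the interval, for which date arithmetic is vendor-specific:

CREATE TABLE Manages ( mid INTEGER, did INTEGER,
                       from_date DATE, to_date DATE );

-- Intervals in which the same person (mid) managed two different
-- departments: two periods overlap iff each starts before the other ends.
SELECT M1.mid,
       CASE WHEN M1.from_date > M2.from_date
            THEN M1.from_date ELSE M2.from_date END AS overlap_start,
       CASE WHEN M1.to_date < M2.to_date
            THEN M1.to_date ELSE M2.to_date END AS overlap_end
FROM Manages M1, Manages M2
WHERE M1.mid = M2.mid AND M1.did < M2.did
  AND M1.from_date <= M2.to_date AND M2.from_date <= M1.to_date;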
A distinct and important class of sequence data consists of DNA sequences, which are
being generated at a rapid pace by the biological community. These are in fact closer
to sequences of characters in text than to time sequences as in the above examples.
The field of biological information management and analysis has become very popular
in recent years, and is called bioinformatics. Biological data, such as DNA sequence
data, is characterized by complex structure and numerous relationships among data
elements, many overlapping and incomplete or erroneous data fragments (because ex-
perimentally collected data from several groups, often working on related problems,
is stored in the databases), a need to frequently change the database schema itself as
new kinds of relationships in the data are discovered, and the need to maintain several
versions of data for archival and reference.
28.8 INFORMATION VISUALIZATION
As computers become faster and main memory becomes cheaper, it becomes increas-
ingly feasible to create visual presentations of data, rather than just text-based reports.

Data visualization makes it easier for users to understand the information in large
complex datasets. The challenge here is to make it easy for users to develop visual
presentations of their data and to interactively query such presentations. Although a
number of data visualization tools are available, efficient visualization of large datasets
presents many challenges.
The need for visualization is especially important in the context of decision support;
when confronted with large quantities of high-dimensional data and various kinds of
data summaries produced by using analysis tools such as SQL, OLAP, and data mining
algorithms, the information can be overwhelming. Visualizing the data, together with
the generated summaries, can be a powerful way to sift through this information and
spot interesting trends or patterns. The human eye, after all, is very good at finding
patterns. A good framework for data mining must combine analytic tools to process
data, and bring out latent anomalies or trends, with a visualization environment in
which a user can notice these patterns and interactively drill down to the original data
for further analysis.
28.9 SUMMARY
The database area continues to grow vigorously, both in terms of technology and in
terms of applications. The fundamental reason for this growth is that the amount of
information stored and processed using computers is growing rapidly. Regardless of
the nature of the data and its intended applications, users need database management
systems and their services (concurrent access, crash recovery, easy and efficient query-
ing, etc.) as the volume of data increases. As the range of applications is broadened,
however, some shortcomings of current DBMSs become serious limitations. These
problems are being actively studied in the database research community.
The coverage in this book provides a good introduction, but is not intended to cover
all aspects of database systems. Ample material is available for further study, as this
chapter illustrates, and we hope that the reader is motivated to pursue the leads in
the bibliography. Bon voyage!
BIBLIOGRAPHIC NOTES

[288] contains a comprehensive treatment of all aspects of transaction processing. An intro-
ductory textbook treatment can be found in [77]. See [204] for several papers that describe new
transaction models for nontraditional applications such as CAD/CAM. [1, 668, 502, 607, 622]
are some of the many papers on real-time databases.
Determining which entities are the same across different databases is a difficult problem;
it is an example of a semantic mismatch. Resolving such mismatches has been addressed
in many papers, including [362, 412, 558, 576]. [329] is an overview of theoretical work in
this area. Also see the bibliographic notes for Chapter 21 for references to related work on
multidatabases, and see the notes for Chapter 2 for references to work on view integration.
[260] is an early paper on main memory databases. [345, 89] describe the Dali main memory
storage manager. [359] surveys visualization idioms designed for large databases, and [291]
discusses visualization for data mining.
Visualization systems for databases include DataSpace [515], DEVise [424], IVEE [23], the
Mineset suite from SGI, Tioga [27], and VisDB [358]. In addition, a number of general tools
are available for data visualization.
Querying text repositories has been studied extensively in information retrieval; see [545] for
a recent survey. This topic has generated considerable interest in the database community
recently because of the widespread use of the Web, which contains many text sources. In
particular, HTML documents have some structure if we interpret links as edges in a graph.
Such documents are examples of semistructured data; see [2] for a good overview. Recent
papers on queries over the Web include [2, 384, 457, 493].
See [501] for a survey of multimedia issues in database management. There has been much
recent interest in database issues in a mobile computing environment, for example, [327, 337].
See [334] for a collection of articles on this subject. [639] contains several articles that cover
all aspects of temporal databases. The use of constraints in databases has been actively
investigated in recent years; [356] is a good overview. Geographic Information Systems have
also been studied extensively; [511] describes the Paradise system, which is notable for its
scalability.
The book [695] contains detailed discussions of temporal databases (including the TSQL2
language, which is influencing the SQL standard), spatial and multimedia databases, and
uncertainty in databases. Another SQL extension to query sequence data, called SRQL, is
proposed in [532].
APPENDIX A
DATABASE DESIGN CASE STUDY: THE INTERNET SHOP
Advice for software developers and horse racing enthusiasts: Avoid hacks.
—Anonymous
We now present an illustrative, ‘cradle-to-grave’ design example. DBDudes Inc., a
well-known database consulting firm, has been called in to help Barns and Nobble
(B&N) with their database design and implementation. B&N is a large bookstore
specializing in books on horse racing, and they’ve decided to go online. DBDudes first
verify that B&N is willing and able to pay their steep fees and then schedule a lunch
meeting (billed to B&N, naturally) to do requirements analysis.
A.1 REQUIREMENTS ANALYSIS
The owner of B&N has thought about what he wants and offers a concise summary:
“I would like my customers to be able to browse my catalog of books and to place orders
over the Internet. Currently, I take orders over the phone. I have mostly corporate
customers who call me and give me the ISBN number of a book and a quantity. I
then prepare a shipment that contains the books they have ordered. If I don’t have
enough copies in stock, I order additional copies and delay the shipment until the new
copies arrive; I want to ship a customer’s entire order together. My catalog includes
all the books that I sell. For each book, the catalog contains its ISBN number, title,
author, purchase price, sales price, and the year the book was published. Most of my
customers are regulars, and I have records with their name, address, and credit card
number. New customers have to call me first and establish an account before they can
use my Web site.
On my new Web site, customers should first identify themselves by their unique cus-
tomer identification number. Then they should be able to browse my catalog and to
place orders online.”

DBDudes’s consultants are a little surprised by how quickly the requirements phase
was completed, since it usually takes them weeks of discussions (and many lunches
and dinners) to get this done, but they return to their offices to analyze this
information.
A.2 CONCEPTUAL DESIGN
In the conceptual design step, DBDudes develop a high-level description of the data
in terms of the ER model. Their initial design is shown in Figure A.1. Books and
customers are modeled as entities and are related through orders that customers place.
Orders is a relationship set connecting the Books and Customers entity sets. For each
order, the following attributes are stored: quantity, order date, and ship date. As soon
as an order is shipped, the ship date is set; until then the ship date is set to null,
indicating that this order has not been shipped yet.
DBDudes has an internal design review at this point, and several questions are raised.
To protect their identities, we will refer to the design team leader as Dude 1 and the
design reviewer as Dude 2:
Dude 2: What if a customer places two orders for the same book on the same day?
Dude 1: The first order is handled by creating a new Orders relationship and the second
order is handled by updating the value of the quantity attribute in this relationship.
Dude 2: What if a customer places two orders for different books on the same day?
Dude 1: No problem. Each instance of the Orders relationship set relates the customer
to a different book.
Dude 2: Ah, but what if a customer places two orders for the same book on different
days?
Dude 1: We can use the attribute order date of the orders relationship to distinguish
the two orders.
Dude 2: Oh no you can’t. The attributes of Customers and Books must jointly contain
a key for Orders. So this design does not allow a customer to place orders for the same
book on different days.
Dude 1: Yikes, you’re right. Oh well, B&N probably won’t care; we’ll see.
DBDudes decides to proceed with the next phase, logical database design.
A.3 LOGICAL DATABASE DESIGN
Using the standard approach discussed in Chapter 3, DBDudes maps the ER diagram
shown in Figure A.1 to the relational model, generating the following tables:
CREATE TABLE Books ( isbn CHAR(10),
                     title CHAR(80),
                     author CHAR(80),
                     qty_in_stock INTEGER,
                     price REAL,
                     year_published INTEGER,
                     PRIMARY KEY (isbn) )
[Figure A.1, “ER Diagram of the Initial Design”: Books (isbn, title, author,
qty_in_stock, price, year_published) and Customers (cid, cname, address, cardnum)
are entity sets related through the Orders relationship set, which carries the
attributes qty, order_date, and ship_date.]
CREATE TABLE Orders ( isbn CHAR(10),
                      cid INTEGER,
                      qty INTEGER,
                      order_date DATE,
                      ship_date DATE,
                      PRIMARY KEY (isbn, cid),
                      FOREIGN KEY (isbn) REFERENCES Books,
                      FOREIGN KEY (cid) REFERENCES Customers )
CREATE TABLE Customers ( cid INTEGER,
                         cname CHAR(80),
                         address CHAR(200),
                         cardnum CHAR(16),
                         PRIMARY KEY (cid),
                         UNIQUE (cardnum) )
The design team leader, who is still brooding over the fact that the review exposed
a flaw in the design, now has an inspiration. The Orders table contains the field
order_date, and the key for the table contains only the fields isbn and cid. Because of
this, a customer cannot order the same book on different days, a restriction that was
not intended. Why not add the order_date attribute to the key for the Orders table?
This would eliminate the unwanted restriction:

CREATE TABLE Orders ( isbn CHAR(10),
                      cid INTEGER,
                      qty INTEGER,
                      order_date DATE,
                      ship_date DATE,
                      PRIMARY KEY (isbn, cid, order_date),
                      FOREIGN KEY (isbn) REFERENCES Books,
                      FOREIGN KEY (cid) REFERENCES Customers )
The reviewer, Dude 2, is not entirely happy with this solution, which he calls a ‘hack’.
He points out that there is no natural ER diagram that reflects this design, and stresses
the importance of the ER diagram as a design document. Dude 1 argues that while
Dude 2 has a point, it is important to present B&N with a preliminary design and get
feedback; everyone agrees with this, and they go back to B&N.
The owner of B&N now brings up some additional requirements that he did not mention
during the initial discussions: “Customers should be able to purchase several different
books in a single order. For example, if a customer wants to purchase three copies of
‘The English Teacher’ and two copies of ‘The Character of Physical Law,’ the customer
should be able to place a single order for both books.”
The design team leader, Dude 1, asks how this affects the shipping policy. Does B&N
still want to ship all books in an order together? The owner of B&N explains their
shipping policy: “As soon as we have enough copies of an ordered book we ship
it, even if an order contains several books. So it could happen that the three copies
of ‘The English Teacher’ are shipped today because we have five copies in stock, but
that ‘The Character of Physical Law’ is shipped tomorrow, because we currently have
only one copy in stock and another copy arrives tomorrow. In addition, my customers
could place more than one order per day, and they want to be able to identify the
orders they placed.”
The DBDudes team thinks this over and identifies two new requirements: first, it
must be possible to order several different books in a single order, and second, a
customer must be able to distinguish between several orders placed the same day. To
accommodate these requirements, they introduce a new attribute into the Orders table
called ordernum, which uniquely identifies an order and therefore the customer placing
the order. However, since several books could be purchased in a single order, ordernum
and isbn are both needed to determine qty and ship_date in the Orders table.
Orders are assigned order numbers sequentially and orders that are placed later have
higher order numbers. If several orders are placed by the same customer on a single
day, these orders have different order numbers and can thus be distinguished. The
SQL DDL statement to create the modified Orders table is given below:
CREATE TABLE Orders ( ordernum INTEGER,
                      isbn CHAR(10),
                      cid INTEGER,
                      qty INTEGER,
                      order_date DATE,
                      ship_date DATE,
                      PRIMARY KEY (ordernum, isbn),
                      FOREIGN KEY (isbn) REFERENCES Books,
                      FOREIGN KEY (cid) REFERENCES Customers )
A.4 SCHEMA REFINEMENT
Next, DBDudes analyzes the set of relations for possible redundancy. The Books rela-
tion has only one key (isbn), and no other functional dependencies hold over the table.
Thus, Books is in BCNF. The Customers relation has the key (cid), and since a credit
card number uniquely identifies its card holder, the functional dependency cardnum →
cid also holds. Since cid is a key, cardnum is also a key. No other dependencies hold,
and so Customers is also in BCNF.
DBDudes has already identified the pair ordernum, isbn as the key for the Orders
table. In addition, since each order is placed by one customer on one specific date, the
following two functional dependencies hold:
ordernum → cid and ordernum → order_date
The experts at DBDudes conclude that Orders is not even in 3NF. (Can you see why?)
They decide to decompose Orders into the following two relations:
Orders(ordernum, cid, order_date) and
Orderlists(ordernum, isbn, qty, ship_date)
The resulting two relations, Orders and Orderlists, are both in BCNF, and the decomposition is lossless-join since ordernum is a key for (the new) Orders. The reader is
invited to check that this decomposition is also dependency-preserving. For complete-
ness, we give the SQL DDL for the Orders and Orderlists relations below:
CREATE TABLE Orders ( ordernum INTEGER,
                      cid INTEGER,
                      order_date DATE,
                      PRIMARY KEY (ordernum),
                      FOREIGN KEY (cid) REFERENCES Customers )

CREATE TABLE Orderlists ( ordernum INTEGER,
                          isbn CHAR(10),
                          qty INTEGER,
                          ship_date DATE,
                          PRIMARY KEY (ordernum, isbn),
                          FOREIGN KEY (isbn) REFERENCES Books )
[Figure A.2, “ER Diagram Reflecting the Final Design”: Orders is now an entity set
with key attribute ordernum, connected to Customers through the Place_Order
relationship set and to Books through the Order_List relationship set; the attributes
order_date, qty, and ship_date appear on these sets, and Books and Customers retain
the attributes of Figure A.1.]

Figure A.2 shows an updated ER diagram that reflects the new design. Note that
DBDudes could have arrived immediately at this diagram if they had made Orders an
entity set instead of a relationship set right at the beginning. But at that time they did
not understand the requirements completely, and it seemed natural to model Orders
as a relationship set. This iterative refinement process is typical of real-life database
design processes. As DBDudes has learned over time, it is rare to achieve an initial
design that is not changed as a project progresses.
The DBDudes team celebrates the successful completion of logical database design and
schema refinement by opening a bottle of champagne and charging it to B&N. After
recovering from the celebration, they move on to the physical design phase.
A.5 PHYSICAL DATABASE DESIGN
Next, DBDudes considers the expected workload. The owner of the bookstore expects
most of his customers to search for books by ISBN number before placing an order.
Placing an order involves inserting one record into the Orders table and inserting one
or more records into the Orderlists relation. If a sufficient number of books is available,
a shipment is prepared and a value for the ship_date in the Orderlists relation is set. In
addition, the available quantities of books in stock change all the time: orders decrease
the quantity available, and new arrivals from suppliers increase it.
The DBDudes team begins by considering searches for books by ISBN. Since isbn is
a key, an equality query on isbn returns at most one record. Thus, in order to speed
up queries from customers who look for books with a given ISBN, DBDudes decides
to build an unclustered hash index on isbn.
Next, they consider updates to book quantities. To update the qty_in_stock value for
a book, we must first search for the book by ISBN; the index on isbn speeds this
up. Since the qty_in_stock value for a book is updated quite frequently, DBDudes also
considers partitioning the Books relation vertically into the following two relations:
BooksQty(isbn, qty) and
BooksRest(isbn, title, author, price, year_published).
Unfortunately, this vertical partition would slow down another very popular query:
equality search on ISBN to retrieve full information about a book would require a
join between BooksQty and BooksRest. So DBDudes decides not to vertically partition
Books.
DBDudes thinks it is likely that customers will also want to search for books by title
and by author, and decides to add unclustered hash indexes on title and author; these
indexes are inexpensive to maintain because the set of books rarely changes, even
though the quantity in stock for a book changes often.
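Index creation is not part of the SQL-92 standard, but most systems accept
declarations along the following lines; whether a hash or tree structure is used is
system-specific and, in some systems, controllable through extra clauses:

CREATE INDEX BooksIsbn ON Books (isbn);     -- ideally hash-based, unclustered
CREATE INDEX BooksTitle ON Books (title);
CREATE INDEX BooksAuthor ON Books (author);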
Next, they consider the Customers relation. A customer is first identified by the unique
customer identification number. Thus, the most common queries on Customers are
equality queries involving the customer identification number, and DBDudes decides
to build a clustered hash index on cid to achieve maximum speedup for this query.
Moving on to the Orders relation, they see that it is involved in two queries: insertion
of new orders and retrieval of existing orders. Both queries involve the ordernum
attribute as search key and so they decide to build an index on it. What type of
index should this be—a B+ tree or a hash index? Since order numbers are assigned
sequentially and thus correspond to the order date, sorting by ordernum effectively
sorts by order date as well. Thus DBDudes decides to build a clustered B+ tree index
on ordernum. Although the operational requirements that have been mentioned until
now favor neither a B+ tree nor a hash index, B&N will probably want to monitor
daily activities, and the clustered B+ tree is a better choice for such range queries. Of
course, this means that retrieving all orders for a given customer could be expensive
for customers with many orders, since clustering by ordernum precludes clustering by
other attributes, such as cid.
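A representative monitoring query that favors the clustered B+ tree is a range scan
over order numbers; since ordernum increases with order_date, one day's orders
occupy a contiguous range (the constants below are illustrative):

SELECT O.ordernum, O.cid, O.order_date
FROM Orders O
WHERE O.ordernum BETWEEN 71000 AND 71999;   -- efficient clustered range scan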
The Orderlists relation mostly involves insertions, with an occasional update of a
shipment date or a query to list all components of a given order. If Orderlists is kept
sorted on ordernum, all insertions are appends at the end of the relation and thus very
efficient. A clustered B+ tree index on ordernum maintains this sort order and also
speeds up retrieval of all items for a given order. To update a shipment date, we need
to search for a tuple by ordernum and isbn. The index on ordernum helps here as well.
Although an index on (ordernum, isbn) would be better for this purpose, insertions
would not be as efficient as with an index on just ordernum; DBDudes therefore decides
to index Orderlists on just ordernum.
A.5.1 Tuning the Database
We digress from our discussion of the initial design to consider a problem that arises
several months after the launch of the B&N site. DBDudes is called in and told that
customer enquiries about pending orders are being processed very slowly. B&N has
become very successful, and the Orders and Orderlists tables have grown huge.
Thinking further about the design, DBDudes realizes that there are two types of orders:
completed orders, for which all books have already shipped, and partially completed or-
ders, for which some books are yet to be shipped. Most customer requests to look up
an order involve partially completed orders, which are a small fraction of all orders.
DBDudes therefore decides to horizontally partition both the Orders table and the Or-
derlists table by ordernum. This results in four new relations: NewOrders, OldOrders,
NewOrderlists, and OldOrderlists.
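A sketch of the partitioned design for Orders (Orderlists is analogous), together with
a view that keeps the infrequent all-orders queries simple:

-- Both partitions keep the schema of the original Orders table.
CREATE TABLE NewOrders ( ordernum INTEGER, cid INTEGER, order_date DATE,
                         PRIMARY KEY (ordernum) );
CREATE TABLE OldOrders ( ordernum INTEGER, cid INTEGER, order_date DATE,
                         PRIMARY KEY (ordernum) );

-- Queries that must span both partitions can go through a view.
CREATE VIEW AllOrders (ordernum, cid, order_date) AS
SELECT ordernum, cid, order_date FROM NewOrders
UNION ALL
SELECT ordernum, cid, order_date FROM OldOrders;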
An order and its components are always in exactly one pair of relations—and we
can determine which pair, old or new, by a simple check on ordernum—and queries
involving that order can always be evaluated using only the relevant relations. Some
queries are now slower, such as those asking for all of a customer’s orders, since they
require us to search two sets of relations. However, these queries are infrequent and
their performance is acceptable.
A.6 SECURITY
Returning to our discussion of the initial design phase, recall that DBDudes completed
physical database design. Next, they address security. There are three groups of users:
customers, employees, and the owner of the book shop. (Of course, there is also the
database administrator who has universal access to all data and who is responsible for
regular operation of the database system.)
The owner of the store has full privileges on all tables. Customers can query the Books
table and can place orders online, but they should not have access to other customers’
records nor to other customers’ orders. DBDudes restricts access in two ways. First,
they design a simple Web page with several forms similar to the page shown in Figure
22.1 in Chapter 22. This allows customers to submit a small collection of valid requests
without giving them the ability to directly access the underlying DBMS through an
SQL interface. Second, they use the security features of the DBMS to limit access to
sensitive data.
The Web page allows customers to query the Books relation by ISBN number, name of
the author, and title of a book. The Web page also has two buttons. The first button
retrieves a list of all of the customer’s orders that are not completely fulfilled yet. The
second button will display a list of all completed orders for that customer. Note that
customers cannot specify actual SQL queries through the Web; they can only fill in
some parameters in a form to instantiate an automatically generated SQL query. All
queries that are generated through form input have a WHERE clause that includes the
cid attribute value of the current customer, and evaluation of the queries generated
by the two buttons requires knowledge of the customer identification number. Since
all users have to log on to the Web site before browsing the catalog, the business logic
(discussed in Section A.7) must maintain state information about a customer (i.e., the
customer identification number) during the customer’s visit to the Web site.
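For example, the first button might instantiate a query template like the following,
where the value 112 stands for the current customer's identification number, filled in
from session state rather than typed by the user:

-- Pending (not fully shipped) items in the current customer's orders;
-- ship_date is null until an item ships.
SELECT O.ordernum, O.order_date, L.isbn, L.qty
FROM NewOrders O, NewOrderlists L
WHERE O.ordernum = L.ordernum
  AND O.cid = 112
  AND L.ship_date IS NULL;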
The second step is to configure the database to limit access according to each user
group’s need to know. DBDudes creates a special customer account that has the
following privileges:
SELECT ON Books, NewOrders, OldOrders, NewOrderlists, OldOrderlists
INSERT ON NewOrders, OldOrders, NewOrderlists, OldOrderlists
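In SQL GRANT syntax, and assuming the account is named CustomerAcct (the name
is illustrative), these privileges read:

GRANT SELECT ON Books TO CustomerAcct;
GRANT SELECT, INSERT ON NewOrders TO CustomerAcct;
GRANT SELECT, INSERT ON OldOrders TO CustomerAcct;
GRANT SELECT, INSERT ON NewOrderlists TO CustomerAcct;
GRANT SELECT, INSERT ON OldOrderlists TO CustomerAcct;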
Employees should be able to add new books to the catalog, update the quantity of a
book in stock, revise customer orders if necessary, and update all customer information
except the credit card information. In fact, employees should not even be able to see a
customer’s credit card number. Thus, DBDudes creates the following view:
CREATE VIEW CustomerInfo (cid, cname, address)
AS SELECT C.cid, C.cname, C.address
FROM Customers C
They give the employee account the following privileges:
SELECT ON CustomerInfo, Books,
NewOrders, OldOrders, NewOrderlists, OldOrderlists
INSERT ON CustomerInfo, Books,
NewOrders, OldOrders, NewOrderlists, OldOrderlists
UPDATE ON CustomerInfo, Books,
NewOrders, OldOrders, NewOrderlists, OldOrderlists
DELETE ON Books, NewOrders, OldOrders, NewOrderlists, OldOrderlists
In addition, there are security issues when the user first logs on to the Web site using
the customer identification number. Sending the number unencrypted over the Internet
is a security hazard, and a secure protocol such as SSL should be used.
There are companies such as CyberCash and DigiCash that offer electronic commerce
payment solutions, even including ‘electronic’ cash. Discussion of how to incorporate
such techniques into the Web site is outside the scope of this book.
A.7 APPLICATION LAYERS
DBDudes now moves on to the implementation of the application layer and considers
alternatives for connecting the DBMS to the World-Wide Web (see Chapter 22).
DBDudes note the need for session management. For example, users who log in to
the site, browse the catalog, and then select books to buy do not want to re-enter
their customer identification number. Session management has to extend to the whole
process of selecting books, adding them to a shopping cart, possibly removing books
from the cart, and then checking out and paying for the books.
DBDudes then considers whether Web pages for books should be static or dynamic.

If there is a static Web page for each book, then we need an extra database field in
the Books relation that points to the location of the file. Even though this enables
special page designs for different books, it is a very labor intensive solution. DBDudes
convinces B&N to dynamically assemble the Web page for a book from a standard
template instantiated with information about the book in the Books relation.
This leaves DBDudes with one final decision, namely how to connect applications to
the DBMS. They consider the two main alternatives that we presented in Section 22.2:
CGI scripts versus using an application server infrastructure. If they use CGI scripts,
they would have to encode session management logic—not an easy task. If they use
an application server, they can make use of all the functionality that the application
server provides. Thus, they recommend that B&N implement server-side processing
using an application server.
B&N, however, refuses to pay for an application server and decides that for their
purposes CGI scripts are fine. DBDudes accepts B&N’s decision and proceeds to build
the following pieces:
The top level HTML pages that allow users to navigate the site, and various forms
that allow users to search the catalog by ISBN, author name, or title. An
example page containing a search form is shown in Figure 22.1 in Chapter 22. In
addition to the input forms, DBDudes must develop appropriate presentations for
the results.
The logic to track a customer session. Relevant information must be stored either
in a server-side data structure or be cached in the customer’s browser using a
mechanism like cookies. Cookies are pieces of information that a Web server
can store in a user’s Web browser. Whenever the user generates a request, the
browser passes along the stored information, thereby enabling the Web server to
‘remember’ what the user did earlier.
The scripts that process the user requests. For example, a customer can use a
form called ‘Search books by title’ to type in a title and search for books with that
title. The CGI interface communicates with a script that processes the request.

An example of such a script written in Perl using DBI for data access is shown in
Figure 22.4 in Chapter 22.
For completeness, we remark that if B&N had agreed to use an application server,
DBDudes would have had the following tasks:
As in the CGI-based architecture, they would have to design top level pages that
allow customers to navigate the Web site as well as various search forms and result
presentations.
Assuming that DBDudes select a Java-based application server, they have to write
Java Servlets to process form-generated requests. Potentially, they could reuse
existing (possibly commercially available) JavaBeans. They can use JDBC as a
database interface; examples of JDBC code can be found in Section 5.10. Instead
of programming Servlets, they could resort to Java Server Pages and annotate
pages with special JSP markup tags. An example of a Web page that includes
JSP commands is shown in Section 22.2.1.
If DBDudes select an application server that uses proprietary markup tags, they
have to develop Web pages by using such tags. An example using Cold Fusion
markup tags can be found in Section 22.2.1.
Our discussion thus far only covers the ‘client-interface’, the part of the Web site that
is exposed to B&N’s customers. DBDudes also need to add applications that allow
the employees and the shop owner to query and access the database and to generate
summary reports of business activities.
This completes our discussion of Barns and Nobble. While this study only describes
a small part of a real problem, we saw that a design even at this scale involved non-
trivial tradeoffs. We would like to emphasize again that database design is an iterative
process and that it is therefore very important not to lock oneself early into a
fixed model that is too inflexible to accommodate a changing environment. Welcome to
the exciting world of database management!
APPENDIX B
THE MINIBASE SOFTWARE
Practice is the best of all instructors.

—Publius Syrus, 42 B.C.
Minibase is a small relational DBMS, together with a suite of visualization tools, that
has been developed for use with this book. While the book makes no direct reference to
the software and can be used independently, Minibase offers instructors an opportunity
to design a variety of hands-on assignments, with or without programming. To see an
online description of the software, visit this URL:
dbbook/minibase.html
The software is available freely through ftp. By registering themselves as users at
the URL for the book, instructors can receive prompt notification of any major bug
reports and fixes. Sample project assignments, which elaborate upon some of the
briefly sketched ideas in the project-based exercises at the end of chapters, can be seen
at
dbbook/minihwk.html
Instructors should consider making small modifications to each assignment to discourage undesirable ‘code reuse’ by students; assignment handouts formatted using LaTeX
are available by ftp. Instructors can also obtain solutions to these assignments by
contacting the authors.
B.1 WHAT’S AVAILABLE
Minibase is intended to supplement the use of a commercial DBMS such as Oracle or
Sybase in course projects, not to replace them. While a commercial DBMS is ideal
for SQL assignments, it does not allow students to understand how the DBMS works.
Minibase is intended to address the latter issue; the subset of SQL that it supports is
intentionally kept small, and students should also be asked to use a commercial DBMS
for writing SQL queries and programs. Minibase is provided on an as-is basis with no
warranties or restrictions for educational or personal use. It includes the following:
Code for a small single-user relational DBMS, including a parser and query optimizer for a subset of SQL, and components designed to be (re)written by students
as project assignments: heap files, buffer manager, B+ trees, sorting, and joins.

Graphical visualization tools to aid in students’ exploration and understanding of
the behavior of the buffer management, B+ tree, and query optimization components of the system. There is also a graphical tool to refine a relational database
design using normalization.
B.2 OVERVIEW OF MINIBASE ASSIGNMENTS
Several assignments involving the use of Minibase are described below. Each of these
has been tested in a course already, but the details of how Minibase is set up might vary
at your school, so you may have to modify the assignments accordingly. If you plan to
use these assignments, you are advised to download and try them at your site well in
advance of handing them to students. We have done our best to test and document
these assignments, and the Minibase software, but bugs undoubtedly persist. Please
report bugs at this URL:
dbbook/minibase.comments.html
I hope that users will contribute bug fixes, additional project assignments, and exten-
sions to Minibase. These will be made publicly available through the Minibase site,
together with pointers to the authors.
B.2.1 Overview of Programming Projects
In several assignments, students are asked to rewrite a component of Minibase. The
book provides the necessary background for all of these assignments, and the assign-
ment handout provides additional system-level details. The online HTML documen-
tation provides an overview of the software, in particular the component interfaces,
and can be downloaded and installed at each school that uses Minibase. The projects
listed below should be assigned after covering the relevant material from the indicated
chapter.
Buffer manager (Chapter 7): Students are given code for the layer that man-
ages space on disk and supports the concept of pages with page ids. They are
asked to implement a buffer manager that brings requested pages into memory if
they are not already there. One variation of this assignment could use different
replacement policies. Students are asked to assume a single-user environment,
with no concurrency control or recovery management.

HF page (Chapter 7): Students must write code that manages records on a
page using a slot-directory page format to keep track of records on a page. Possible
variants include fixed-length versus variable-length records and other ways to keep
track of records on a page.
Heap files (Chapter 7): Using the HF page and buffer manager code, students
are asked to implement a layer that supports the abstraction of files of unordered
pages, that is, heap files.
B+ trees (Chapter 9): This is one of the more complex assignments. Students
have to implement a page class that maintains records in sorted order within a
page and implement the B+ tree index structure to impose a sort order across
several leaf-level pages. Indexes store (key, record-pointer) pairs in leaf pages, and
data records are stored separately (in heap files). Similar assignments can easily
be created for Linear Hashing or Extendible Hashing index structures.
External sorting (Chapter 11): Building upon the buffer manager and heap
file layers, students are asked to implement external merge-sort. The emphasis is
on minimizing I/O, rather than on the in-memory sort used to create sorted runs.
Sort-merge join (Chapter 12): Building upon the code for external sorting,
students are asked to implement the sort-merge join algorithm. This assignment
can be easily modified to create assignments that involve other join algorithms.
Index nested-loop join (Chapter 12): This assignment is similar to the sort-
merge join assignment, but relies on B+ tree (or other indexing) code, instead of
sorting code.
B.2.2 Overview of Nonprogramming Assignments
Four assignments that do not require students to write any code (other than SQL, in
one assignment) are also available.
Optimizer exercises (Chapter 13): The Minibase optimizer visualizer offers
a flexible tool to explore how a typical relational query optimizer works. It ac-
cepts single-block SQL queries (including some queries that cannot be executed
in Minibase, such as queries involving grouping and aggregate operators). Students can inspect and modify synthetic catalogs, add and drop indexes, enable or
disable different join algorithms, enable or disable index-only evaluation strate-
gies, and see the effect of such changes on the plan produced for a given query.
All (sub)plans generated by an iterative System R style optimizer can be viewed,
ordered by the iteration in which they are generated, and details on a given plan
can be obtained readily. All interaction with the optimizer visualizer is through a
GUI and requires no programming.
The assignment introduces students to this tool and then requires them to answer
questions involving specific catalogs, queries, and plans generated by controlling
various parameters.
Buffer manager viewer (Chapter 12): This viewer lets students visualize
how pages are moved in and out of the buffer pool, their status (e.g., dirty bit,
pin count) while in the pool, and some statistics (e.g., number of hits). The as-
signment requires students to generate traces by modifying some trace-generation
code (provided) and to answer questions about these traces by using the visual-
izer to look at them. While this assignment can be used after covering Chapter
7, deferring it until after Chapter 12 enables students to examine traces that are
representative of different relational operations.
B+ tree viewer (Chapter 9): This viewer lets students see a B+ tree as it is
modified through insert and delete statements. The assignment requires students
to work with trace files, answer questions about them, and generate operation
traces (i.e., a sequence of inserts and deletes) that create specified kinds of trees.
Normalization tool (Chapter 15): The normalization viewer is a tool for nor-
malizing relational tables. It supports the concept of a refinement session, in
which a schema is decomposed repeatedly and the resulting decomposition tree is
then saved. For a given schema, a user might consider several alternative decom-
positions (more precisely, decomposition trees), and each of these can be saved
as a refinement session. Refinement sessions are a very flexible and convenient
mechanism for trying out several alternative decomposition strategies. The normalization assignment introduces students to this tool and asks design-oriented
questions involving the use of the tool.
Assignments that require students to evaluate various components can also be devel-
oped. For example, students can be asked to compare different join methods, different
index methods, and different buffer management policies.
B.3 ACKNOWLEDGMENTS
The Minibase software was inspired by Minirel, a small relational DBMS developed by
David DeWitt for instructional use. Minibase was developed by a large number of
dedicated students over a long time, and the design was guided by Mike Carey and R.
Ramakrishnan. See the online documentation for more on Minibase’s history.

REFERENCES
[1] R. Abbott and H. Garcia-Molina. Scheduling real-time transactions: a performance
evaluation. ACM Transactions on Database Systems, 17(3), 1992.
[2] S. Abiteboul. Querying semi-structured data. In Intl. Conf. on Database Theory, 1997.
[3] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[4] S. Abiteboul and P. Kanellakis. Object identity as a query language primitive. In Proc.
ACM SIGMOD Conf. on the Management of Data, 1989.
[5] S. Abiteboul and V. Vianu. Regular path queries with constraints. In Proc. ACM Symp.
on Principles of Database Systems, 1997.
[6] K. Achyutuni, E. Omiecinski, and S. Navathe. Two techniques for on-line index mod-
ification in shared nothing parallel databases. In Proc. ACM SIGMOD Conf. on the
Management of Data, 1996.
[7] S. Adali, K. Candan, Y. Papakonstantinou, and V. Subrahmanian. Query caching and
optimization in distributed mediator systems. In Proc. ACM SIGMOD Conf. on the
Management of Data, 1996.
[8] M. E. Adiba. Derived relations: A unified mechanism for views, snapshots and dis-
tributed data. In Proc. Intl. Conf. on Very Large Databases, 1981.
[9] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. Naughton, R. Ramakrishnan, and
S. Sarawagi. On the computation of multidimensional aggregates. In Proc. Intl. Conf.
on Very Large Databases, 1996.
[10] D. Agrawal and A. El Abbadi. The generalized tree quorum protocol: an efficient
approach for managing replicated data. ACM Transactions on Database Systems, 17(4),
1992.
[11] D. Agrawal, A. El Abbadi, and R. Jeffers. Using delayed commitment in locking pro-
tocols for real-time databases. In Proc. ACM SIGMOD Conf. on the Management of
Data, 1992.
[12] R. Agrawal, M. Carey, and M. Livny. Concurrency control performance modeling:
Alternatives and implications. In Proc. ACM SIGMOD Conf. on the Management of
Data, 1985.
[13] R. Agrawal and D. DeWitt. Integrated concurrency control and recovery mecha-
nisms: Design and performance evaluation. ACM Transactions on Database Systems,
10(4):529–564, 1985.
[14] R. Agrawal and N. Gehani. ODE (Object Database and Environment): The language
and the data model. In Proc. ACM SIGMOD Conf. on the Management of Data, 1989.
[15] R. Agrawal, J. E. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clus-
tering of high dimensional data for data mining. In Proc. ACM SIGMOD Conf. on
Management of Data, 1998.
[16] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective.
IEEE Transactions on Knowledge and Data Engineering, 5(6):914–925, December 1993.
[17] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast Discovery
of Association Rules. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthu-
rusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 12, pages
307–328. AAAI/MIT Press, 1996.