29.2 Multimedia Databases
• Marketing, advertising, retailing, entertainment, and travel: There are virtually no limits to using multimedia information in these applications, from effective sales presentations to virtual tours of cities and art galleries. The film industry has already shown the power of special effects in creating animations and synthetically designed animals, aliens, and special effects. The use of predesigned stored objects in multimedia databases will expand the range of these applications.
• Real-time control and monitoring: Coupled with active database technology (see Chapter 24), multimedia presentation of information can be a very effective means for monitoring and controlling complex tasks such as manufacturing operations, nuclear power plants, patients in intensive care units, and transportation systems.
Commercial Systems for Multimedia Information Management. There are no DBMSs designed for the sole purpose of multimedia data management, and therefore there are none that have the range of functionality required to fully support all of the multimedia information management applications that we discussed above. However, several DBMSs today support multimedia data types; these include Informix Dynamic Server, DB2 Universal Database (UDB) of IBM, Oracle 9 and 10, CA-Jasmine, Sybase, and ODB II. All of these DBMSs have support for objects, which is essential for modeling a variety of complex multimedia objects. One major problem with these systems is that the "blades, cartridges, and extenders" for handling multimedia data are designed in a very ad hoc manner. The functionality is provided without much apparent attention to scalability and performance. There are products available that operate either stand-alone or in conjunction with other vendors' systems to allow retrieval of image data by content. They include Virage, Excalibur, and IBM's QBIC. Operations on multimedia need to be standardized. The MPEG-7 and other standards are addressing some of these issues.
29.2.5 Selected Bibliography on Multimedia Databases
Multimedia database management is becoming a very heavily researched area with several industrial projects on the way. Grosky (1994, 1997) provides two excellent tutorials on the topic. Pazandak and Srivastava (1995) provide an evaluation of database systems related to the requirements of multimedia databases. Grosky et al. (1997) contains contributed articles, including a survey on content-based indexing and retrieval by Jagadish (1997). Faloutsos et al. (1994) also discuss a system for image querying by content. Li et al. (1998) introduce image modeling in which an image is viewed as a hierarchically structured complex object with both semantics and visual properties. Nwosu et al. (1996) and Subramanian and Jajodia (1997) have written books on the topic. Lassila (1998) discusses the need for metadata for accessing multimedia information on the Web; the semantic Web effort is summarized in Fensel (2000). Khan (2000) did a dissertation on ontology-based information retrieval. Uschold and Gruninger (1996) is a good resource on ontologies. Corcho et al. (2003) compare ontology languages and discuss methodologies to build ontologies. Multimedia content analysis, indexing, and filtering are discussed in Dimitrova (1999). A survey of content-based multimedia retrieval is provided by Yoshitaka and Ichikawa (1999).

The following WWW references may be consulted for additional information: CA-Jasmine (multimedia ODBMS); Excalibur Technologies; Virage, Inc. (content-based image retrieval); and IBM's QBIC (Query by Image Content) product.
29.3 GEOGRAPHIC INFORMATION SYSTEMS
Geographic information systems (GIS) are used to collect, model, store, and analyze information describing physical properties of the geographical world. The scope of GIS broadly encompasses two types of data: (1) spatial data, originating from maps, digital images, administrative and political boundaries, roads, and transportation networks, as well as physical data such as rivers, soil characteristics, climatic regions, and land elevations; and (2) nonspatial data, such as socioeconomic data (like census counts), economic data, and sales or marketing information. GIS is a rapidly developing domain that offers highly innovative approaches to meet some challenging technical demands.
29.3.1 GIS Applications
It is possible to divide GISs into three categories: (1) cartographic applications, (2) digital terrain modeling applications, and (3) geographic objects applications. Figure 29.3 summarizes these categories.
In cartographic and terrain modeling applications, variations in spatial attributes are captured, for example, soil characteristics, crop density, and air quality. In geographic objects applications, objects of interest are identified from a physical domain, for example, power plants, electoral districts, property parcels, product distribution districts, and city landmarks. These objects are related with pertinent application data, which may be, for this specific example, power consumption, voting patterns, property sales volumes, product sales volume, and traffic density. The first two categories of GIS applications require a field-based representation, whereas the third category requires an object-based one.
The cartographic approach involves special functions that can include the overlapping of layers of maps to combine attribute data that will allow, for example, the measuring of distances in three-dimensional space and the reclassification of data on the map. Digital terrain modeling requires a digital representation of parts of the earth's surface using land elevations at sample points that are connected to yield a surface model such as a three-dimensional net (connected lines in 3D) showing the surface terrain. It requires functions of interpolation between observed points as well as visualization. In object-based geographic applications, additional spatial functions are needed to deal with data related to roads, physical pipelines, communication cables, power lines, and such. For example, for a given region, comparable maps can be used for comparison at various points of time to show changes in certain data such as locations of roads, cables, buildings, and streams.

FIGURE 29.3 A possible classification of GIS applications (adapted from Adam and Gangopadhyay (1997)).
• Cartographic applications: irrigation, crop yield analysis, land evaluation, planning and facilities management, landscape studies, traffic pattern analysis.
• Digital terrain modeling applications: earth science resource studies, civil engineering and military evaluation, soil surveys, air and water pollution studies, flood control, water resource management.
• Geographic objects applications: car navigation systems, geographic market analysis, utility distribution and consumption, consumer product and services economic analysis.
29.3.2 Data Management Requirements of GIS
The functional requirements of the GIS applications above translate into the following database requirements.
Data Modeling and Representation. GIS data can be broadly represented in two formats: (1) vector and (2) raster. Vector data represents geometric objects such as points, lines, and polygons. Thus a lake may be represented as a polygon, a river by a series of line segments. Raster data is characterized as an array of points, where each point represents the value of an attribute for a real-world location. Informally, raster images are n-dimensional arrays where each entry is a unit of the image and represents an attribute. Two-dimensional units are called pixels, while three-dimensional units are called voxels. Three-dimensional elevation data is stored in a raster-based digital elevation model (DEM) format. Another raster format called triangular irregular network (TIN) is a topological vector-based approach that models surfaces by connecting sample points as vertices of triangles and has a point density that may vary with the roughness of the terrain. Rectangular grids (or elevation matrices) are two-dimensional array structures. In digital terrain modeling (DTM), the model also may be used by substituting the elevation with some attribute of interest such as population density or air temperature. GIS data often includes a temporal structure in addition to a spatial structure. For example, traffic flow or average vehicular speeds in traffic may be measured every 60 seconds at a set of points in a roadway network.
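To make the two representations concrete, the following minimal sketch (ours, not taken from any particular GIS package) shows vector data as geometric objects and raster data as a grid of attribute values:

from dataclasses import dataclass

@dataclass
class Point:          # vector primitive
    x: float
    y: float

@dataclass
class Polygon:        # vector object, e.g., a lake boundary
    vertices: list    # list of Point

# Raster: a 2D array of attribute values for grid cells (pixels);
# here each entry is an elevation value, i.e., a tiny DEM.
dem = [[12.0, 12.5, 13.1],
       [11.8, 12.2, 12.9],
       [11.5, 11.9, 12.4]]

lake = Polygon([Point(0, 0), Point(4, 0), Point(4, 3), Point(0, 3)])
print(dem[1][2], len(lake.vertices))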
Data Analysis. GIS data undergoes various types of analysis. For example, in applications such as soil erosion studies, environmental impact studies, or hydrological runoff simulations, DTM data may undergo various types of geomorphometric analysis: measurements such as slope values, gradients (the rate of change in altitude), aspect (the compass direction of the gradient), profile convexity (the rate of change of gradient), plan convexity (the convexity of contours), and other parameters. When GIS data is used for decision support applications, it may undergo aggregation and expansion operations using data warehousing, as we discussed in Section 28.3. In addition, geometric operations (to compute distances, areas, volumes), topological operations (to compute overlaps, intersections, shortest paths), and temporal operations (to compute interval-based or event-based queries) are involved. Analysis involves a number of temporal and spatial operations, which were discussed in Chapter 24.
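As an illustration of the geomorphometric measurements just mentioned, here is a small sketch (ours; the central-difference formulas are one standard convention, and GIS packages differ in the exact aspect convention) that estimates slope and aspect at an interior cell of a DEM grid:

import math

def slope_and_aspect(dem, row, col, cell_size=30.0):
    """Estimate slope (degrees) and aspect (compass degrees) at a DEM
    cell from its four edge neighbors, using central differences."""
    dz_dx = (dem[row][col + 1] - dem[row][col - 1]) / (2 * cell_size)
    dz_dy = (dem[row + 1][col] - dem[row - 1][col]) / (2 * cell_size)
    slope_deg = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    # One common aspect convention: bearing of the steepest downhill
    # direction; packages differ in sign and origin conventions.
    aspect_deg = (math.degrees(math.atan2(-dz_dx, -dz_dy)) + 360.0) % 360.0
    return slope_deg, aspect_deg

dem = [[100.0, 101.0, 103.0],
       [101.0, 102.0, 104.0],
       [102.0, 103.0, 105.0]]
print(slope_and_aspect(dem, 1, 1))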
Data Integration. GISs must integrate both vector and raster data from a variety of sources. Sometimes edges and regions are inferred from a raster image to form a vector model, or conversely, raster images such as aerial photographs are used to update vector models. Several coordinate systems such as Universal Transverse Mercator (UTM), latitude/longitude, and local cadastral systems are used to identify locations. Data originating from different coordinate systems requires appropriate transformations. Major public sources of geographic data, including the TIGER files maintained by the U.S. Department of Commerce, are used for road maps by many Web-based map drawing tools. Often there are high-accuracy, attribute-poor maps that have to be merged with low-accuracy, attribute-rich maps. This is done with a process called "rubber-banding," where the user defines a set of control points in both maps and the transformation of the low-accuracy map is accomplished by lining up the control points. A major integration issue is to create and maintain attribute information (such as air quality or traffic flow), which can be related to and integrated with appropriate geographical information over time as both evolve.
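One simple way to realize such a control-point alignment (a sketch under our own assumptions; production rubber-sheeting typically uses piecewise rather than single global transformations) is to fit an affine transformation to the control points by least squares:

import numpy as np

def fit_affine(src_pts, dst_pts):
    """Fit x' = a*x + b*y + c, y' = d*x + e*y + f by least squares,
    given matching control points in the two maps."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    A = np.hstack([src, np.ones((len(src), 1))])   # n x 3 design matrix
    coef_x, *_ = np.linalg.lstsq(A, dst[:, 0], rcond=None)
    coef_y, *_ = np.linalg.lstsq(A, dst[:, 1], rcond=None)
    return coef_x, coef_y                          # (a, b, c), (d, e, f)

def warp(pt, coef_x, coef_y):
    x, y = pt
    return (coef_x @ [x, y, 1], coef_y @ [x, y, 1])

src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [(10, 20), (12, 20), (10, 23), (12, 23)]
cx, cy = fit_affine(src, dst)
print(warp((0.5, 0.5), cx, cy))   # approximately (11.0, 21.5)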

Data Capture. The first step in developing a spatial database for cartographic modeling is to capture the two-dimensional or three-dimensional geographical information in digital form, a process that is sometimes impeded by source map characteristics such as resolution, type of projection, map scales, cartographic licensing, diversity of measurement techniques, and coordinate system differences. Spatial data can also be captured from remote sensors in satellites such as Landsat, NOAA, and Advanced Very High Resolution Radiometer (AVHRR), as well as SPOT HRV (High Resolution Visible Range Instrument), which is free of interpretive bias and very accurate. For digital terrain modeling, data capture methods range from manual to fully automated. Ground surveys are the traditional approach and the most accurate, but they are very time consuming. Other techniques include photogrammetric sampling and digitizing cartographic documents.
29.3.3 Specific GIS Data Operations
GIS applications are conducted through the use of special operators such as the following:
1. Interpolation: This process derives elevation data for points at which no samples have been taken. It includes computation at single points, computation for a rectangular grid or along a contour, and so forth. Most interpolation methods are based on triangulation that uses the TIN method for interpolating elevations inside a triangle based on those of its vertices (a sketch of this computation appears at the end of this list).
2. Interpretation: Digital terrain modeling involves the interpretation of operations on terrain data such as editing, smoothing, reducing details, and enhancing. Additional operations involve patching or zipping the borders of triangles (in TIN data), and merging, which implies combining overlapping models and resolving conflicts among attribute data. Conversions among grid models, contour models, and TIN data are involved in the interpretation of the terrain.
3. Proximity analysis: Several classes of proximity analysis include computations of "zones of interest" around objects, such as the determination of a buffer around a car on a highway. Shortest path algorithms using 2D or 3D information are an important class of proximity analysis.
4. Raster image processing: This process can be divided into two categories: (1) map algebra, which is used to integrate geographic features on different map layers to produce new maps algebraically; and (2) digital image analysis, which deals with analysis of a digital image for features such as edge detection and object detection. Detecting roads in a satellite image of a city is an example of the latter.
5. Analysis of networks: Networks occur in GIS in many contexts that must be analyzed and may be subjected to segmentations, overlays, and so on. Network overlay refers to a type of spatial join where a given network, for example, a highway network, is joined with a point database, for example, incident locations, to yield, in this case, a profile of high-incident roadways.
6. Visualization: A crucial function in GIS is related to visualization, the graphical display of terrain information and the appropriate representation of application attributes to go with it. Major visualization techniques include (1) contouring through the use of isolines, spatial units of lines or arcs of equal attribute values; (2) hillshading, an illumination method used for qualitative relief depiction using varied light intensities for individual facets of the terrain model; and (3) perspective displays, three-dimensional images of terrain model facets using perspective projection methods from computer graphics. These techniques impose cartographic data and other three-dimensional objects on terrain data, providing animated scene renderings such as those in flight simulations and animated movies.
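The following sketch illustrates the TIN-based interpolation of item 1: the elevation at a point inside a triangle is computed from the elevations of the triangle's vertices using barycentric weights. The code is ours, not taken from any GIS product:

def tin_interpolate(p, v1, v2, v3):
    """Interpolate elevation at point p = (x, y) inside a TIN triangle
    whose vertices v1, v2, v3 are (x, y, z) tuples."""
    (x, y) = p
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = v1, v2, v3
    # Signed doubled areas; their ratios to the whole triangle's area
    # are the barycentric coordinates (weights) of p.
    def area2(ax, ay, bx, by, cx, cy):
        return (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    a = area2(x1, y1, x2, y2, x3, y3)
    w1 = area2(x, y, x2, y2, x3, y3) / a
    w2 = area2(x1, y1, x, y, x3, y3) / a
    w3 = area2(x1, y1, x2, y2, x, y) / a
    return w1 * z1 + w2 * z2 + w3 * z3

# Elevation at (2, 1) inside a triangle with vertex elevations 10, 16, 13:
print(tin_interpolate((2.0, 1.0), (0, 0, 10.0), (6, 0, 16.0), (0, 3, 13.0)))
# 13.0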
Other Database Functionality. The functionality of a GIS database is also subject to other considerations.
• Extensibility: GISs are required to be extensible to accommodate a variety of constantly evolving applications and corresponding data types. If a standard DBMS is used, it must allow a core set of data types with a provision for defining additional types and methods for those types.
• Data quality control: As in many other applications, quality of source data is of paramount importance for providing accurate results to queries. This problem is particularly significant in the GIS context because of the variety of data, sources, and measurement techniques involved and the absolute accuracy expected by applications users.
Such requirements clearly illustrate that standard RDBMSs or ODBMSs do not meet the special needs of GIS. It is therefore necessary to design systems that support the vector and raster representations and the spatial functionality as well as the required DBMS features. A popular GIS software called ARC-INFO, which is not a DBMS but integrates RDBMS functionality in the INFO part of the system, is briefly discussed in the subsection that follows. More systems are likely to be designed in the future to work with relational or object databases that will contain some of the spatial and most of the nonspatial information.
29.3.4 An Example of a GIS Software: ARC-INFO
ARC/INFO, a popular GIS software launched in 1981 by Environmental Systems Research Institute (ESRI), uses the arc node model to store spatial data. A geographic layer, called a coverage in ARC/INFO, consists of three primitives: (1) nodes (points), (2) arcs (similar to lines), and (3) polygons. The arc is the most important of the three and stores a large amount of topological information. An arc has a start node and an end node (and it therefore has direction too). In addition, the polygons to the left and the right of the arc are also stored along with each arc. As there is no restriction on the shape of the arc, shape points that have no topological information are also stored along with each arc. The database managed by the INFO RDBMS thus consists of three required tables: (1) node attribute table (NAT), (2) arc attribute table (AAT), and (3) polygon attribute table (PAT). Additional information can be stored in separate tables and joined with any of these three tables.
The NAT contains an internal ID for the node, a user-specified ID, the coordinates of the node, and any other information associated with that node (e.g., names of the intersecting roads at the node). The AAT contains an internal ID for the arc, a user-specified ID, the internal IDs of the start and end nodes, the internal IDs of the polygons to the left and the right, a series of coordinates of shape points (if any), the length of the arc, and any other data associated with the arc (e.g., the name of the road the arc represents). The PAT contains an internal ID for the polygon, a user-specified ID, the area of the polygon, the perimeter of the polygon, and any other associated data (e.g., name of the county the polygon represents).
Typical spatial queries are related to adjacency, containment, and connectivity. The arc node model has enough information to satisfy all three types of queries, but the RDBMS is not ideally suited for this type of querying. A simple example will highlight the number of times a relational database has to be queried to extract adjacency information. Assume that we are trying to determine whether two polygons, A and B, are adjacent to each other. We would have to exhaustively look at the entire AAT to determine whether there is an edge that has A on one side and B on the other. The search cannot be limited to the edges of either polygon, as we do not explicitly store all the arcs that make up a polygon in the PAT. Storing all the arcs in the PAT would be redundant because all the information is already there in the AAT.
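To make the cost of such a query concrete, here is a sketch (ours, with illustrative column names rather than ARC/INFO's actual ones) of the exhaustive AAT scan just described:

def polygons_adjacent(aat_rows, a_id, b_id):
    """Scan every AAT row; each row stores the internal IDs of the
    polygons on the left and right of one arc. A and B are adjacent
    if some arc separates them, so the whole table must be scanned."""
    for row in aat_rows:
        if {row["left_poly"], row["right_poly"]} == {a_id, b_id}:
            return True
    return False

# Example: three arcs, only one of which separates polygons 1 and 2.
aat = [
    {"arc_id": 10, "left_poly": 1, "right_poly": 2},
    {"arc_id": 11, "left_poly": 2, "right_poly": 3},
    {"arc_id": 12, "left_poly": 1, "right_poly": 3},
]
print(polygons_adjacent(aat, 1, 2))   # True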
ESRI has released Arc/Storm (Arc Storage Manager), which allows multiple users to use the same GIS, handles distributed databases, and integrates with other commercial RDBMSs like ORACLE, INFORMIX, and SYBASE. While it offers many performance and functional advantages over ARC/INFO, it is essentially an RDBMS embedded within a GIS.
29.3.5 Problems and Future Issues in GIS
GIS is an expanding application area of databases, reflecting an explosion in the number of end users using digitized maps, terrain data, space images, weather data, and traffic information support data. As a consequence, an increasing number of problems related to GIS applications has been generated and will need to be solved:
1. New architectures: GIS applications will need a new client-server architecture that will benefit from existing advances in RDBMS and ODBMS technology. One possible solution is to separate spatial from nonspatial data and to manage the latter entirely by a DBMS. Such a process calls for appropriate modeling and integration as both types of data evolve. Commercial vendors find that it is more viable to keep a small number of independent databases with an automatic posting of updates across them. Appropriate tools for data transfer, change management, and workflow management will be required.
2. Versioning and object life-cycle approach: Because of constantly evolving geographical features, GISs must maintain elaborate cartographic and terrain data, a management problem that might be eased by incremental updating coupled with update authorization schemes for different levels of users. Under the object life-cycle approach, which covers the activities of creating, destroying, and modifying objects as well as promoting versions into permanent objects, a complete set of methods may be predefined to control these activities for GIS objects.
3. Data standards: Because of the diversity of representation schemes and models, formalization of data transfer standards is crucial for the success of GIS. The international standardization body (ISO TC211) and the European standards body (CEN TC278) are now in the process of debating relevant issues, among them conversion between vector and raster data for fast query performance.
4. Matching applications and data structures: Looking again at Figure 29.3, we see that a classification of GIS applications is based on the nature and organization of data. In the future, systems covering a wide range of functions, from market analysis and utilities to car navigation, will need boundary-oriented data and functionality. On the other hand, applications in environmental science, hydrology, and agriculture will require more area-oriented and terrain model data. It is not clear that all this functionality can be supported by a single general-purpose GIS. The specialized needs of GISs will require that general-purpose DBMSs be enhanced with additional data types and functionality before full-fledged GIS applications can be supported.
5. Lack of semantics in data structures: This is evident especially in maps. Information such as highway and road crossings may be difficult to determine based on the stored data. One-way streets are also hard to represent in the present GISs. Transportation CAD systems have incorporated such semantics into GIS.
29.3.6 Selected Bibliography for GIS
There are a number of books written on GIS. Adam and Gangopadhyay (1997) and Laurini and Thompson (1992) focus on GIS database and information management problems. Kemp (1993) gives an overview of GIS issues and data sources. Huxhold (1991) gives an introduction to urban GIS. Maguire et al. (1991) have a very good collection of GIS-related papers. Antenucci (1998) presents a discussion of the GIS technologies. Shekhar and Chawla (2002) discuss issues and approaches to spatial data management, which is at the core of all GIS. DeMers (2002) is another recent book on the fundamentals of GIS. Bossomaier and Green (2002) is a primer on GIS operations, languages, metadata paradigms, and standards. Peng and Tsou (2003) discuss Internet GIS, which includes a suite of emerging new technologies aimed at making GIS more mobile, powerful, and flexible, as well as better able to share and communicate geographic information. The TIGER files for road data in the United States are managed by the U.S. Department of Commerce (1993). Laser-Scan's Web site is a good source of information. Environmental Systems Research Institute (ESRI) has an excellent library of GIS books for all levels. The GIS terminology is defined at http://www.esri.com/library/glossary/glossary.html. The University of Edinburgh maintains a GIS WWW resource list.

29.4 GENOME DATA MANAGEMENT
29.4.1 Biological Sciences and Genetics
The biological sciences encompass an enormous variety of information. Environmental science gives us a view of how species live and interact in a world filled with natural phenomena. Biology and ecology study particular species. Anatomy focuses on the overall structure of an organism, documenting the physical aspects of individual bodies. Traditional medicine and physiology break the organism into systems and tissues and strive to collect information on the workings of these systems and the organism as a whole. Histology and cell biology delve into the tissue and cellular levels and provide knowledge about the inner structure and function of the cell. This wealth of information that has been generated, classified, and stored for centuries has only recently become a major application of database technology.
Genetics has emerged as an ideal field for the application of information technology. In a broad sense, it can be thought of as the construction of models based on information about genes, which can be defined as basic units of heredity, and populations, and the seeking out of relationships in that information. The study of genetics can be divided into three branches: (1) Mendelian genetics, (2) molecular genetics, and (3) population genetics. Mendelian genetics is the study of the transmission of traits between generations. Molecular genetics is the study of the chemical structure and function of genes at the molecular level. Population genetics is the study of how genetic information varies across populations of organisms.

Molecular genetics provides a more detailed look at genetic information by allowing researchers to examine the composition, structure, and function of genes. The origins of molecular genetics can be traced to two important discoveries. The first occurred in 1869 when Friedrich Miescher discovered nuclein and its primary component, deoxyribonucleic acid (DNA). In subsequent research DNA and a related compound, ribonucleic acid (RNA), were found to be composed of nucleotides (a sugar, a phosphate, and a base, which combined to form nucleic acid) linked into long polymers via the sugar and phosphate. The second discovery was the demonstration in 1944 by Oswald Avery that DNA was indeed the molecular substance carrying genetic information. Genes were thus shown to be composed of chains of nucleic acids arranged linearly on chromosomes and to serve three primary functions: (1) replicating genetic information between generations, (2) providing blueprints for the creation of polypeptides, and (3) accumulating changes, thereby allowing evolution to occur. Watson and Crick found the double-helix structure of the DNA in 1953, which gave molecular genetics research a new direction.6 Discovery of the DNA and its structure is hailed as probably the most important biological work of the last 100 years, and the field it opened may be the scientific frontier for the next 100. In 1962, Watson, Crick, and Wilkins won the Nobel Prize for physiology/medicine for this breakthrough.

6. See Nature, 171:737, 1953.
29.4.2 Characteristics of Biological Data
Biological data exhibits many special characteristics that make management of biological information a particularly challenging problem. We will thus begin by summarizing the characteristics related to biological information, focusing on the multidisciplinary field called bioinformatics that has emerged, with graduate degree programs now in place in several universities. Bioinformatics addresses information management of genetic information with special emphasis on DNA sequence analysis. It needs to be broadened into a wider scope to harness all types of biological information: its modeling, storage, retrieval, and management. Moreover, applications of bioinformatics span design of targets for drugs, study of mutations and related diseases, anthropological investigations on migration patterns of tribes, and therapeutic treatments.
Characteristic 1: Biological data is highly complex when compared with most other domains or applications. Definitions of such data must thus be able to represent a complex substructure of data as well as relationships and to ensure that no information is lost during biological data modeling. The structure of biological data often provides an additional context for interpretation of the information. Biological information systems must be able to represent any level of complexity in any data schema, relationship, or schema substructure, not just hierarchical, binary, or table data. As an example, MITOMAP is a database documenting the human mitochondrial genome.8 This single genome is a small, circular piece of DNA encompassing information about 16,569 nucleotide bases; 52 gene loci encoding messenger RNA, ribosomal RNA, and transfer RNA; 1000 known population variants; over 60 known disease associations; and a limited set of knowledge on the complex molecular interactions of the biochemical energy-producing pathway of oxidative phosphorylation. As might be expected, its management has encountered a large number of problems; we have been unable to use the traditional RDBMS or ODBMS approaches to capture all aspects of the data.

8. Details of MITOMAP and its information complexity can be seen in Kogelnik et al. (1997, 1998) and at http://www.mitomap.org.
Characteristic 2: The amount and range of variability in data is high. Hence, biological systems must be flexible in handling data types and values. With such a wide range of possible data values, placing constraints on data types must be limited since this may exclude unexpected values, e.g., outlier values, that are particularly common in the biological domain. Exclusion of such values results in a loss of information. In addition, frequent exceptions to biological data structures may require a choice of data types to be available for a given piece of data.
Characteristic 3: Schemas in biological databases change at a rapid pace. Hence, for improved information flow between generations or releases of databases, schema evolution and data object migration must be supported. The ability to extend the schema, a frequent occurrence in the biological setting, is unsupported in most relational and object database systems. Presently, systems such as GenBank rerelease the entire database with new schemas once or twice a year rather than incrementally changing the system as changes become necessary. Such an evolutionary database would provide a timely and orderly mechanism for following changes to individual data entities in biological databases over time. This sort of tracking is important for biological researchers to be able to access and reproduce previous results.
Characteristic 4: Representations of the same data by different biologists will likely be different (even when using the same system). Hence, mechanisms for "aligning" different biological schemas or different versions of schemas should be supported. Given the complexity of biological data, there are a multitude of ways of modeling any given entity, with the results often reflecting the particular focus of the scientist. While two individuals may produce different data models if asked to interpret the same entity, these models will likely have numerous points in common. In such situations, it would be useful to biological investigators to be able to run queries across these common points. By linking data elements in a network of schemas, this could be accomplished.
Characteristic 5: Most users of biological data do not require write access to the database; read-only access is adequate. Write access is limited to privileged users called curators. For example, the database created as part of the MITOMAP project has on average more than 15,000 users per month on the Internet. There are fewer than twenty noncurator-generated submissions to MITOMAP every month. In other words, the number of users requiring write access is small. Users generate a wide variety of read-access patterns into the database, but these patterns are not the same as those seen in traditional relational databases. User-requested ad hoc searches demand indexing of often unexpected combinations of data instance classes.
Characteristic 6: Most biologists are not likely to have any knowledge of the internal structure of the database or about schema design. Biological database interfaces should display information to users in a manner that is applicable to the problem they are trying to address and that reflects the underlying data structure. Biological users usually know which data they require, but they have no technical knowledge of the data structure or how a DBMS represents the data. They rely on technical users to provide them with views into the database. Relational schemas fail to provide cues or any intuitive information to the user regarding the meaning of their schema. Web interfaces in particular often provide preset search interfaces, which may limit access into the database. However, if these interfaces are generated directly from database structures, they are likely to produce a wider possible range of access, although they may not guarantee usability.

Characteristic 7: The context of data gives added meaning for its use in biological applications. Hence, context must be maintained and conveyed to the user when appropriate. In addition, it should be possible to integrate as many contexts as possible to maximize the interpretation of a biological data value. Isolated values are of less use in biological systems. For example, the sequence of a DNA strand is not particularly useful without additional information describing its organization, function, and such. A single nucleotide on a DNA strand, for example, seen in context with nondisease-causing DNA strands, could be seen as a causative element for sickle cell anemia.
Characteristic 8: Defining and representing complex queries is extremely important to the biologist. Hence, biological systems must support complex queries. Without any knowledge of the data structure (see Characteristic 6), average users cannot construct a complex query across data sets on their own. Thus, in order to be truly useful, systems must provide some tools for building these queries. As mentioned previously, many systems provide predefined query templates.
Characteristic 9: Users of biological information often require access to "old" values of the data, particularly when verifying previously reported results. Hence, changes to the values of data in the database must be supported through a system of archives. Access to both the most recent version of a data value and its previous version are important in the biological domain. Investigators consistently want to query the most up-to-date data, but they must also be able to reconstruct previous work and reevaluate prior and current information. Consequently, values that are about to be updated in a biological database cannot simply be thrown away.
All of these characteristics clearly point to the fact that today's DBMSs do not fully cater to the requirements of complex biological data. A new direction in database management systems is necessary.9

9. See Kogelnik et al. (1997, 1998) for further details.
29.4.3 The Human Genome Project and Existing Biological Databases
The term genome is defined as the total genetic information that can be obtained about an entity. The human genome, for example, generally refers to the complete set of genes required to create a human being, estimated to be more than 30,000 genes spread over 23 pairs of chromosomes, with an estimated 3 to 4 billion nucleotides. The goal of the Human Genome Project (HGP) has been to obtain the complete sequence, the ordering of the bases, of those nucleotides. A rough draft of the entire human genome sequence was announced in June 2000, and the 13-year effort will end in year 2003 with the completion of the human genetic sequence. In isolation, the human DNA sequence is not particularly useful. The sequence can, however, be combined with other data and used as a powerful tool to help address questions in genetics, biochemistry, medicine, anthropology, and agriculture.
In the existing genome databases, the focus has been on "curating" (or collecting with some initial scrutiny and quality check) and classifying information about genome sequence data. In addition to the human genome, numerous organisms such as E. coli, Drosophila, and C. elegans have been investigated. We will briefly discuss some of the existing database systems that are supporting or have grown out of the Human Genome Project.
GenBank. The preeminent DNA sequence database in the world today is GenBank, maintained by the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM). It was established in 1978 as a central repository for DNA sequence data. Since then it has expanded somewhat in scope to include expressed sequence tag data, protein sequence data, three-dimensional protein structure, taxonomy, and links to the biomedical literature (MEDLINE). As of release 135.0 in April 2003, GenBank contains over 31 billion nucleotide bases of more than 24 million sequences from over 100,000 species, with roughly 1400 new organisms being added each month. The database size in flat file format is over 100 GB uncompressed, and it has been doubling every 15 months. Through international collaboration with the European Molecular Biology Laboratory (EMBL) in the U.K. and the DNA Data Bank of Japan (DDBJ), data are exchanged among the three sites on a daily basis. The mirroring of sequence data at the three sites affords fast access to this data to scientists in various geographical parts of the world.
While it is a complex, comprehensive database, the scope of its coverage is focused on human sequences and links to the literature. Other limited data sources (e.g., three-dimensional structure and OMIM, discussed below) have been added recently by reformatting the existing OMIM and PDB databases and redesigning the structure of the GenBank system to accommodate these new data sets. The system is maintained as a combination of flat files, relational databases, and files containing Abstract Syntax Notation One (ASN.1), a syntax for defining data structures developed for the telecommunications industry. Each GenBank entry is assigned a unique identifier by the NCBI.

Updates are assigned a new identifier, with the identifier of the
original entity remaining unchanged for archival purposes. Older references
to an entity
thus do
not
inadvertently indicate a new
and
possibly inappropriate value.
The
most
current
concepts also receive a second set of unique identifiers (UIDs), which mark the
29.4
Genome
Data Management I 941
most up-to-date form of a
concept
while allowing older versions to be accessed via
their
original identifier.
The
average user of
the
database is
not
able to access
the
structure of
the
data

directly
for querying or
other
functions,
although
complete snapshots of
the
database are available
for
export
in a
number
of formats, including ASN.1.
The
query
mechanism
provided is via
the
Entrez application (or its World Wide
Web
version),
which
allows keyword,
sequence,
and
GenBank
UID searching
through
a static interface.
The Genome Database (GDB). Created in 1989, the Genome Database (GDB) is a catalog of human gene mapping data, a process that associates a piece of information with a particular location on the human genome. The degree of precision of this location on the map depends upon the source of the data, but it is usually not at the level of individual nucleotide bases.
GDB data includes data describing primarily map information (distance and confidence limits) and Polymerase Chain Reaction (PCR) probe data (experimental conditions, PCR primers, and reagents used). More recently, efforts have been made to add data on mutations linked to genetic loci, cell lines used in experiments, DNA probe libraries, and some limited polymorphism and population data.
The GDB system is built around SYBASE, a commercial relational DBMS, and its data are modeled using standard Entity-Relationship techniques (see Chapters 3 and 4). The implementors of GDB have noted difficulties in using this model to capture more than simple map and probe data.
In order to improve data integrity and to simplify the programming for application writers, GDB distributes a Database Access Toolkit. However, most users use a Web interface to search the ten interlinked data managers. Each manager keeps track of the links (relationships) for one of the ten tables within the GDB system. As with GenBank, users are given only a very high-level view of the data at the time of searching and thus cannot easily make use of any knowledge gleaned from the structure of the GDB tables. Search methods are most useful when users are simply looking for an index into map or probe data. Exploratory ad hoc searching of the database is not encouraged by present interfaces. Integration of the database structures of GDB and OMIM (see below) was never fully established.
Online Mendelian Inheritance in Man. Online Mendelian Inheritance in Man (OMIM) is an electronic compendium of information on the genetic basis of human disease. Begun in hard-copy form by Victor McKusick in 1966 with 1500 entries, it was converted to a full-text electronic form between 1987 and 1989 by the GDB. In 1991 its administration was transferred from Johns Hopkins University to the NCBI, and the entire database was converted to NCBI's GenBank format. Today it contains more than 14,000 entries.
OMIM covers material on five disease areas based loosely on organs and systems. Any morphological, biochemical, behavioral, or other properties under study are referred to as the phenotype of an individual (or a cell). Mendel realized that genes can exist in numerous different forms known as alleles. A genotype refers to the actual allelic composition of an individual. The structure of the phenotype and genotype entries contains textual data loosely structured as general descriptions, nomenclature, modes of inheritance, variations, gene structure, mapping, and numerous lesser categories.
The full-text entries were converted to an ASN.1-structured format when OMIM was transferred to the NCBI. This greatly improved the ability to link OMIM data to other databases and it also provided a rigorous structure for the data. However, the basic form of the database remained difficult to modify.
EcoCyc. The Encyclopedia of Escherichia coli Genes and Metabolism (EcoCyc) is a recent experiment in combining information about the genome and the metabolism of E. coli K-12. The database was created in 1996 as a collaboration between Stanford Research Institute and the Marine Biological Laboratory. It catalogs and describes the known genes of E. coli, the enzymes encoded by those genes, and the biochemical reactions catalyzed by each enzyme and their organization into metabolic pathways. In so doing, EcoCyc spans the sequence and function domains of genomic information. It contains 1283 compounds with 965 structures as well as lists of bonds and atoms, molecular weights, and empirical formulas. It contains 3038 biochemical reactions described using 269 data classes.
An object-oriented data model was first used to implement the system, with data stored on Ocelot, a frame knowledge representation system. EcoCyc data was arranged in a hierarchy of object classes based on the observations that (1) the properties of a reaction are independent of an enzyme that catalyzes it, and (2) an enzyme has a number of properties that are "logically distinct" from its reactions.
EcoCyc provides two methods of querying: (1) direct (via predefined queries) and (2) indirect (via hypertext navigation). Direct queries are performed using menus and dialogs that can initiate a large but finite set of queries. No navigation of the actual data structures is supported. In addition, no mechanism for evolving the schema is documented.
Table 29.1 summarizes the features of the major genome-related databases, as well as the HGMDB and ACEDB databases. Some additional protein databases exist; they contain information about protein structures. Prominent protein databases include SWISS-PROT at the University of Geneva, Protein Data Bank (PDB) at Brookhaven National Laboratory, and Protein Identification Resource (PIR) at the National Biomedical Research Foundation.
Over the past ten years, there has been an increasing interest in the applications of databases in biology and medicine. GenBank, GDB, and OMIM have been created as central repositories of certain types of biological data but, while extremely useful, they do not yet cover the complete spectrum of the Human Genome Project data. However, efforts are under way around the world to design new tools and techniques that will alleviate the data management problem for the biological scientists and medical researchers.
Gene Ontology. We already explained the concept of ontologies in Section 29.2.3 in the context of modeling of multimedia information. The Gene Ontology (GO) Consortium was formed in 1998 as a collaboration among three model organism databases: FlyBase, Mouse Genome Informatics (MGI), and the Saccharomyces or yeast Genome Database (SGD). Its goal is to produce a structured, precisely defined, common, controlled vocabulary for describing the roles of genes and gene products in any organism. With the completion of genome sequencing of many species, it has been observed that a large fraction of genes among organisms display similarity in biological roles, and biologists have acknowledged that there is likely to be a single limited universe of genes and proteins that are conserved in most or all living cells. On the other hand, genome data is increasing exponentially and there is no uniform way to interpret and conceptualize the shared biological elements. Gene Ontology makes possible the annotation of gene products using a common vocabulary based on their shared biological attributes, and interoperability between genomic databases.

The GO Consortium has developed three ontologies: molecular function, biological process, and cellular component, to describe attributes of genes, gene products, or gene product groups. Molecular function is defined as the biochemical activity of a gene product. Biological process refers to a biological objective to which the gene or gene product contributes. Cellular component refers to the place in the cell where a gene product is active. Each ontology comprises a set of well-defined vocabularies of terms and relationships. The terms are organized in the form of directed acyclic graphs (DAGs), in which a term node may have multiple parents and multiple children. A child term can be an instance of (is a) or a part of its parent. In the latest release of the GO database, there are over 13,000 terms and more than 18,000 relationships between terms. The annotation of gene products is operated independently by each of the collaborating databases. A subset of the annotations is included in the GO database, which contains over 1,386,000 gene products and 5,244,000 associations between gene products and GO terms.

TABLE 29.1 SUMMARY OF THE MAJOR GENOME-RELATED DATABASES

DATABASE NAME | MAJOR CONTENT | INITIAL TECHNOLOGY | CURRENT TECHNOLOGY | DB PROBLEM AREAS | PRIMARY DATA TYPES
GenBank | DNA/RNA sequence, protein | Text files | Flat-file/ASN.1 | Schema browsing, schema evolution, linking to other dbs | Text, numeric, some complex types
OMIM | Disease phenotypes and genotypes, etc. | Index cards/text files | Flat-file/ASN.1 | Unstructured, free-text entries, linking to other dbs | Text
GDB | Genetic map linkage data | Flat file | Relational | Schema expansion/evolution, complex objects, linking to other dbs | Text, numeric
ACEDB | Genetic map linkage data, sequence data (non-human) | OO | OO | Schema expansion/evolution, linking to other dbs | Text, numeric
HGMDB | Sequence and sequence variants | Flat file (application specific) | Flat file (application specific) | Schema expansion/evolution, linking to other dbs | Text
EcoCyc | Biochemical reactions and pathways | OO | OO | Locked into class hierarchy, schema evolution | Complex types, text, numeric

The Gene Ontology was implemented using MySQL, an open source relational DBMS, and a monthly database release is available in SQL and XML formats. A set of tools and libraries, written in C, Java, Perl, XML, etc., is available for database access and development of applications. Web-based and stand-alone GO browsers are available from the GO Consortium.
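As an illustration of how such a term DAG can be traversed, here is a minimal sketch (ours, not the GO Consortium's API; the term IDs and edges are made up) that collects all ancestors of a term, following both "is a" and "part of" links:

# Hypothetical GO-style term DAG: each term maps to its parent terms,
# each parent tagged with the relationship type (is_a or part_of).
go_dag = {
    "GO:0001": [],                              # root term
    "GO:0002": [("GO:0001", "is_a")],
    "GO:0003": [("GO:0001", "is_a")],
    "GO:0004": [("GO:0002", "is_a"), ("GO:0003", "part_of")],
}

def ancestors(term, dag):
    """Collect every ancestor of a term; a DAG allows multiple parents,
    so visited nodes are tracked to avoid revisiting shared ancestors."""
    seen = set()
    stack = [parent for parent, _ in dag[term]]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parent for parent, _ in dag[t])
    return seen

print(sorted(ancestors("GO:0004", go_dag)))
# ['GO:0001', 'GO:0002', 'GO:0003']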
29.4.4 Selected Bibliography for Genome Databases
Bioinformatics has become a popular area of research in recent years and many workshops and conferences are being organized around this topic. Robbins (1993) gives a good overview, while Frenkel (1991) surveys the human genome project with its special role in bioinformatics at large. Cuticchia et al. (1993), Benson et al. (2002), and Pearson et al. (1994) are references on GDB, GenBank, and OMIM. In an international collaboration among GenBank (USA), the DNA Data Bank of Japan (DDBJ), and the European Molecular Biology Laboratory (EMBL) (Stoesser, 2003), data are exchanged amongst the collaborating databases on a daily basis to achieve optimal synchronization. Wheeler et al. (2000) discuss the various tools that currently allow users access and analysis of the data available in the databases.
Wallace (1995) has been a pioneer in the mitochondrial genome research, which deals with a specific part of the human genome; the sequence and organizational details of this area appear in Anderson et al. (1981). Recent work in Kogelnik et al. (1997, 1998) and Kogelnik (1998) addresses the development of a generic solution to the data management problem in biological sciences by developing a prototype solution. Apweiler et al. (2003) review the core bioinformatics resources maintained at the European Bioinformatics Institute (EBI) (such as SWISS-PROT + TrEMBL) and summarize important issues of database management of such resources. They discuss three main types of databases: sequence databases such as the DDBJ/EMBL/GenBank Nucleotide Sequence Database; secondary databases such as PROSITE, PRINTS, and Pfam; and integrated databases such as InterPro, which integrates data from six major protein signature databases (Pfam, PRINTS, ProDom, PROSITE, SMART, and TIGRFAMs).
The European Bioinformatics Institute Macromolecular Structure Database (E-MSD), a relational database (Boutselakis et al., 2003), is designed to be a single access point for protein and nucleic acid structures and related information. The database is derived from Protein Data Bank (PDB) entries. The search database contains an extensive set of derived properties, goodness-of-fit indicators, and links to other EBI databases including InterPro, GO, and SWISS-PROT, together with links to SCOP, CATH, PFAM, and PROSITE. Karp (1996) discusses the problems of interlinking the variety of databases mentioned in this section. He defines two types of links: those that integrate the data and those that relate the data between databases. These were used to design the EcoCyc database.
Some of the important Web links include the following: the Human Genome sequence information; the MITOMAP database developed in Kogelnik (1998), accessible at http://www.mitomap.org; the biggest protein database, SWISS-PROT; and the ACEDB database information.
Alternative Diagrammatic Notations for ER Models
Figure A.1 shows a number of different diagrammatic notations for representing ER and EER model concepts. Unfortunately, there is no standard notation: different database design practitioners prefer different notations. Similarly, various CASE (computer-aided software engineering) tools and OOA (object-oriented analysis) methodologies use various notations. Some notations are associated with models that have additional concepts and constraints beyond those of the ER and EER models described in Chapters 3 and 4, while other models have fewer concepts and constraints. The notation we used in Chapter 3 is quite close to the original notation for ER diagrams, which is still widely used. We discuss some alternative notations here.
Figure A.1(a) shows different notations for displaying entity types/classes, attributes, and relationships. In Chapters 3 and 4, we used the symbols marked (i) in Figure A.1(a), namely, rectangle, oval, and diamond. Notice that symbol (ii) for entity types/classes, symbol (ii) for attributes, and symbol (ii) for relationships are similar, but they are used by different methodologies to represent three different concepts. The straight line symbol (iii) for representing relationships is used by several tools and methodologies.
Figure A.1(b) shows some notations for attaching attributes to entity types. We used notation (i). Notation (ii) uses the third notation (iii) for attributes from Figure A.1(a). The last two notations in Figure A.1(b), (iii) and (iv), are popular in OOA methodologies and in some CASE tools. In particular, the last notation displays both the attributes and the methods of a class, separated by a horizontal line.
FIGURE A.1 Alternative notations. (a) Symbols for entity type/class, attribute, and relationship. (b) Displaying attributes. (c) Displaying cardinality ratios. (d) Various (min, max) notations. (e) Notations for displaying specialization/generalization.
Figure A.1(c) shows various notations for representing the cardinality ratio of binary relationships. We used notation (i) in Chapters 3 and 4. Notation (ii), known as the chicken feet notation, is quite popular. Notation (iv) uses the arrow as a functional reference (from the N to the 1 side) and resembles our notation for foreign keys in the relational model (see Figure 7.7); notation (v), used in Bachman diagrams, uses the arrow in the reverse direction (from the 1 to the N side). For a 1:1 relationship, (ii) uses a straight line without any chicken feet; (iii) makes both halves of the diamond white; and (iv) places arrowheads on both sides. For an M:N relationship, (ii) uses chicken feet at both ends of the line; (iii) makes both halves of the diamond black; and (iv) does not display any arrowheads.
Figure A.1(d) shows several variations for displaying (min, max) constraints, which are used to display both cardinality ratio and total/partial participation. Notation (ii) is the alternative notation we used in Figure 3.15 and discussed in Section 3.7.4. Recall that our notation specifies the constraint that each entity must participate in at least min and at most max relationship instances. Hence, for a 1:1 relationship, both max values are 1; and for M:N, both max values are n. A min value greater than 0 (zero) specifies total participation (existence dependency). In methodologies that use the straight line for displaying relationships, it is common to reverse the positioning of the (min, max) constraints, as shown in (iii). Another popular technique, which follows the same positioning as (iii), is to display the min as O ("oh" or circle, which stands for zero) or as | (vertical dash, which stands for 1), and to display the max as | (vertical dash, which stands for 1) or as chicken feet (which stands for n), as shown in (iv).
Figure A.1(e) shows some notations for displaying specialization/generalization. We used notation (i) in Chapter 4, where a d in the circle specifies that the subclasses (S1, S2, and S3) are disjoint and an o specifies overlapping subclasses. Notation (ii) uses G (for generalization) to specify disjoint, and Gs to specify overlapping; some notations use the solid arrow, while others use the empty arrow (shown at the side). Notation (iii) uses a triangle pointing toward the superclass, and notation (v) uses a triangle pointing toward the subclasses; it is also possible to use both notations in the same methodology, with (iii) indicating generalization and (v) indicating specialization. Notation (iv) places the boxes representing subclasses within the box representing the superclass. Of the notations based on (vi), some use a single-lined arrow, and others use a double-lined arrow (shown at the side).
The notations shown in Figure A.1 are only some of the diagrammatic symbols that have been used or suggested for displaying database conceptual schemas. Other notations, as well as various combinations of the preceding, have also been used. It would be useful to establish a standard that everyone would adhere to, in order to prevent misunderstandings and reduce confusion.
Parameters of Disks
The most important disk parameter is the time required to locate an arbitrary disk block, given its block address, and then to transfer the block between the disk and a main memory buffer. This is the random access time for accessing a disk block. There are three time components to consider:

1. Seek time (s): This is the time needed to mechanically position the read/write head on the correct track for movable-head disks. (For fixed-head disks, it is the time needed to electronically switch to the appropriate read/write head.) For movable-head disks this time varies, depending on the distance between the current track under the read/write head and the track specified in the block address. Usually, the disk manufacturer provides an average seek time in milliseconds. The typical range of average seek time is 10 to 60 msec. This is the main "culprit" for the delay involved in transferring blocks between disk and memory.
2. Rotational delay (rd): Once the read/write head is at the correct track, the user must wait for the beginning of the required block to rotate into position under the read/write head. On the average, this takes about the time for half a revolution of the disk, but it actually ranges from immediate access (if the start of the required block is in position under the read/write head right after the seek) to a full disk revolution (if the start of the required block just passed the read/write head after the seek). If the speed of disk rotation is p revolutions per minute (rpm), then the average rotational delay rd is given by

rd = (1/2)*(1/p) min = (60*1000)/(2*p) msec

A typical value for p is 10,000 rpm, which gives a rotational delay of rd = 3 msec. For fixed-head disks, where the seek time is negligible, this component causes the greatest delay in transferring a disk block.
3. Block transfer time (btt): Once the read/write head is at the beginning of the required block, some time is needed to transfer the data in the block. This block transfer time depends on the block size, the track size, and the rotational speed. If the transfer rate for the disk is tr bytes/msec and the block size is B bytes, then

btt = B/tr msec

If we have a track size of 50 Kbytes and p is 3600 rpm, then the transfer rate in bytes/msec is

tr = (50*1000)/(60*1000/3600) = 3000 bytes/msec

In this case, btt = B/3000 msec, where B is the block size in bytes.
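As a quick check of these formulas, here is a short Python sketch (ours, not part of the text) that recomputes the example figures above; the 4 KB block size in the last line is an arbitrary illustrative value.

def rotational_delay_msec(p_rpm):
    # rd = (60 * 1000) / (2 * p) msec, for p in revolutions per minute
    return (60 * 1000) / (2 * p_rpm)

def transfer_rate_bytes_per_msec(track_size_bytes, p_rpm):
    # One revolution takes (60 * 1000 / p) msec and passes over one full track
    time_per_revolution_msec = 60 * 1000 / p_rpm
    return track_size_bytes / time_per_revolution_msec

def block_transfer_time_msec(block_size_bytes, tr):
    # btt = B / tr msec
    return block_size_bytes / tr

print(rotational_delay_msec(10_000))           # 3.0 msec, as in the text
tr = transfer_rate_bytes_per_msec(50_000, 3600)
print(tr)                                      # 3000.0 bytes/msec, as in the text
print(block_transfer_time_msec(4096, tr))      # btt for an assumed 4 KB block
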
The average time needed to find and transfer a block, given its block address, is estimated by

(s + rd + btt) msec

This holds for either reading or writing a block. The principal method of reducing this time is to transfer several blocks that are stored on one or more tracks of the same cylinder; then the seek time is required only for the first block. To transfer consecutively k noncontiguous blocks that are on the same cylinder, we need approximately

s + (k * (rd + btt)) msec
In this case, we need two or more buffers in main storage, because we are continuously reading or writing the k blocks, as we discussed in Section 4.3. The transfer time per block is reduced even further when consecutive blocks on the same track or cylinder are transferred. This eliminates the rotational delay for all but the first block, so the estimate for transferring k consecutive blocks is

s + rd + (k * btt) msec
A more accurate estimate for transferring consecutive blocks takes into account the interblock gap (see Section 5.2.1), which includes the information that enables the read/write head to determine which block it is about to read. Usually, the disk manufacturer provides a bulk transfer rate (btr) that takes the gap size into account when reading consecutively stored blocks. If the gap size is G bytes, then

btr = (B/(B + G)) * tr bytes/msec

The bulk transfer rate is the rate of transferring useful bytes in the data blocks. The disk read/write head must go over all bytes on a track as the disk rotates, including the bytes in the interblock gaps, which store control information but not real data. When the bulk transfer rate is used, the time needed to transfer the useful data in one block out of several consecutive blocks is B/btr msec. Hence, the estimated time to read k blocks consecutively stored on the same cylinder becomes

s + rd + (k * (B/btr)) msec
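The following Python sketch (again our own illustration) puts the three estimates side by side; the parameter values are invented for illustration, except that tr and rd reuse the example figures from earlier.

def random_block_access_msec(s, rd, btt):
    # Average time to locate and transfer one arbitrary block: s + rd + btt
    return s + rd + btt

def noncontiguous_same_cylinder_msec(s, rd, btt, k):
    # One seek, then a rotational delay plus a transfer per block
    return s + k * (rd + btt)

def consecutive_blocks_msec(s, rd, B, btr, k):
    # One seek and one rotational delay; each block then costs B/btr
    return s + rd + k * (B / btr)

# Illustrative values (assumptions, not from the text):
s, rd, tr = 10.0, 3.0, 3000.0      # msec, msec, bytes/msec
B, G, k = 4096, 128, 20            # block size, gap size, number of blocks
btt = B / tr
btr = (B / (B + G)) * tr           # bulk transfer rate, accounting for gaps

print(random_block_access_msec(s, rd, btt))
print(noncontiguous_same_cylinder_msec(s, rd, btt, k))
print(consecutive_blocks_msec(s, rd, B, btr, k))
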
Another parameter of disks is the rewrite time. This is useful in cases when we read a block from the disk into a main memory buffer, update the buffer, and then write the buffer back to the same disk block on which it was stored. In many cases, the time required to update the buffer in main memory is less than the time required for one disk revolution. If we know that the buffer is ready for rewriting, the system can keep the disk heads on the same track, and during the next disk revolution the updated buffer is rewritten back to the disk block. Hence, the rewrite time Trw is usually estimated to be the time needed for one disk revolution:

Trw = 2 * rd msec
To summarize, here is a list of the parameters we have discussed and the symbols we use for them:

seek time: s msec
rotational delay: rd msec
block transfer time: btt msec
rewrite time: Trw msec
transfer rate: tr bytes/msec
bulk transfer rate: btr bytes/msec
block size: B bytes
interblock gap size: G bytes
Overview of the QBE Language
The Query-By-Example (QBE) language is important because it is one of the first graphical query languages with minimum syntax developed for database systems. It was developed at IBM Research and is available as an IBM commercial product as part of the QMF (Query Management Facility) interface option to DB2. The language was also implemented in the PARADOX DBMS, and is related to a point-and-click type interface in the ACCESS DBMS (see Chapter 10). It differs from SQL in that the user does not have to specify a structured query explicitly; rather, the query is formulated by filling in templates of relations that are displayed on a monitor screen. Figure D.1 shows how these templates may look for the database of Figure 7.6. The user does not have to remember the names of attributes or relations, because they are displayed as part of these templates. In addition, the user does not have to follow any rigid syntax rules for query specification; rather, constants and variables are entered in the columns of the templates to construct an example related to the retrieval or update request. QBE is related to the domain relational calculus, as we shall see, and its original specification has been shown to be relationally complete.
D.1 BASIC RETRIEVALS IN QBE

In QBE, retrieval queries are specified by filling in one or more rows in the templates of the tables. For a single relation query, we enter either constants or example elements (a QBE term) in the columns of the template of that relation.
[Figure D.1 (QBE skeleton templates for the relations of the schema of Figure 7.6) is not reproduced here.]

FIGURE D.1 The relational schema of Figure 7.6 as it may be displayed by QBE.
An example element stands for a domain variable and is specified as an example value preceded by the underscore character (_). Additionally, a P. prefix (called the P dot operator) is entered in certain columns to indicate that we would like to print (or display) values in those columns for our result. The constants specify values that must be exactly matched in those columns.
For example, consider the query Q0: "Retrieve the birthdate and address of John B. Smith." We show in Figures D.2(a) through D.2(d) how this query can be specified in a progressively more terse form in QBE. In Figure D.2(a) an example of an employee is presented as the type of row that we are interested in. By leaving John B. Smith as constants in the FNAME, MINIT, and LNAME columns, we are specifying an exact match in those columns. All the rest of the columns are preceded by an underscore, indicating that they are domain variables (example elements).
[Figure D.2 is not reproduced here.]

FIGURE D.2 Four ways of specifying the query Q0 in QBE.
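Although QBE itself is a visual language, the way a filled-in template maps to a query is easy to simulate in code. The following Python sketch is our own illustration, not part of QBE or the original text: it translates one template row, with constants for exact matches and P.-prefixed entries for columns to print, into an equivalent SQL SELECT for Q0. The column and table names follow the EMPLOYEE relation of Figure 7.6.

def qbe_to_sql(table, row):
    # row maps column names to entries: a plain string is a constant
    # (exact match); "P." marks a column whose values should be printed;
    # None means the column is left as an unconstrained variable.
    printed = [col for col, entry in row.items() if entry == "P."]
    conditions = [f"{col} = '{entry}'" for col, entry in row.items()
                  if entry not in (None, "P.")]
    sql = f"SELECT {', '.join(printed)} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

# Query Q0: retrieve the birthdate and address of John B. Smith.
q0 = {"FNAME": "John", "MINIT": "B", "LNAME": "Smith",
      "SSN": None, "BDATE": "P.", "ADDRESS": "P."}
print(qbe_to_sql("EMPLOYEE", q0))
# Prints: SELECT BDATE, ADDRESS FROM EMPLOYEE
#         WHERE FNAME = 'John' AND MINIT = 'B' AND LNAME = 'Smith'
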
