for Environmental Management of Military Lands
CEMML
Database Design Primer
A Beginners Guide
to
Creating a Database
By William Sprouse
CEMML TPS 05-10
CENTER FOR ENVIRONMENTAL MANAGEMENT OF MILITARY LANDS
Colorado State University
Fort Collins, CO 80523-1490
June 2005
Database Design Primer
A Beginners Guide to Creating a Database
By William Sprouse
Center for Environmental Management of Military Lands
Colorado State University
1490 Campus Delivery
Fort Collins, CO 80523-1490
(970) 491-2748
This document was produced under the Land Condition Trend Analysis (LCTA) Technical Support for the
U.S. Army Environmental Center (AEC) contract; contract number DAAD13-00-D-5003-0041, task order
number CS01-SC-03-016.
Table of Contents
1 Introduction 1
2 Concepts of Creating a Database 1
2.1 Tables 1
2.2 Relationships 2
2.3 Attributes 2
3 Creating a Database in Microsoft Access 3
3.1 Create Tables (Entities) and Their Attributes 3
4 Defining Relationships in Microsoft Access 7
5 Conclusion 10
5.1 Introduction to Data Management 10
5.1.1 LCTA Data Management 11
5.2 Further Reading 12
6 References 13
1
1 Introduction
Creating a database is often looked upon as an intimidating process. However, with a little knowledge and
some thought this process can become manageable. This document will attempt to provide the reader with
the knowledge required to develop their own databases. The next section will discuss the steps in creating
the conceptual database, which will describe the database and its' components. Section 3 will discuss
creating the physical database in Microsoft Access.
2 Concepts of Creating a Database
A database is a collection of information typically stored on a computer. A database can be thought of as
an electronic filing system. One type of database, and the type discussed here, is a relational database. A
relational database is a collection of tables with relationships.
A database is designed to describe a situation. A situation is a well-defined set of circumstances that can be
described using a sufficiently complete natural language. For example, Fort X is responsible for managing
their natural resources. Land Condition Trend Analysis (LCTA) is the component of the Integrated
Training Area Management (ITAM) Program that provides for the collecting, inventorying, monitoring,
managing, and analyzing of tabular and spatial data concerning land conditions on an installation. These
data are intended to provide information to effectively manage land use and natural resources and supply
information for a variety of decision support and information management systems. A database can be
developed that is a collection of tables with relationships that represent the situation above and store LCTA
data.
A database will contain three major components; tables, also called entities, relations, and attributes.
Tables are the nouns of the situation. They represent a person, place, or thing that can be uniquely
identified by type and occurrence. Relations are the verbs of a situation and relate the nouns of a situation
with one another. Attributes are the modifiers of a situation. They are qualifiers of an entity or a relation
describing its character, quantity, or degree of extent.
2.1 Tables
Tables can be described as either major or minor. Major tables are the important, dynamic tables of a
situation. Addition and deletion of data from these tables are common. Minor tables are small and static
where additions and deletions are rare. Domain tables, discussed later, are minor tables.
The fist step in developing a relational database is discovering the entities or tables. Ask yourself to name
a person, place or thing (a noun) that you'd like to keep track of. Keep in mind that all entities are nouns
but not all nouns are entities. Plant name, for example, is an attribute of ground cover. Tables are
characterized by attributes. Do not confuse tables with instances of the table. Ground cover is a valid
table; litter, bare ground, and red oak are instances of that table.
A table is a set or collection of like things called instances. These instances correspond to rows of data in
the table. The key concept here is "like things". Consider the following example. Using the original
LCTA protocol, ground cover data is collected every meter along a 100 meter long plot, resulting in 100
rows of data for each plot in a given year. Additional data that is required for each plot includes the
training area, vegetation type, and the type of inventory (i.e. initial inventory, long-term inventory, or short-
term monitoring). One way of storing this data is to add all of the ground cover and plot information items
2
to one table. This would not be a good choice because training area, vegetation type, and the type of
inventory do not change for those 100 points. It is true that plot number and recording date repeat for the
100 points, but these are used to identify the data. A better design would be to store the plot information
data in a separate table. This is precisely what is done in the original LCTA database design. Information
about each plot for a given year is stored in the PlotSurv table. The ground cover data is stored in the
GndCover table.
After discovering all of the major tables consider any domain tables that are needed. These tables will
provide a means of restricting data for a given data field to a set of valid values after relations have been
defined. They can also be a source of values for data entry forms in Microsoft Access.
For each table determine a means for distinguishing each occurrence of the table from all other occurrences
of the table. This will be known as the primary key of the table. By definition a primary key can not
contain nulls, duplicates, or can not change. Make sure that the selected primary key always exits, is
always unique, and the attributes that make up the primary key do not change over time. It is possible to
have more than one attribute defined as the primary key. The combination of plot number and date, for
example, are legitimate candidates for a primary key.
2.2 Relationships
Once all of the tables have been defined the next step is to define the relationships. Compare each of the
tables, two at a time, and determine if they are related. Only direct, not indirect, relationships are of
interest here. For example, employees, departments, and buildings are items of interest in a given situation.
If it is true that each employee works in only one department, and that each department is located in only
one building, the relation between employee and building is indirect. If it is known that the Sales
department is located in building A, and that Smith works in Sales, it can be inferred that Smith works in
building A.
These relations will be defined in the database through the use of referential integrity constraints. The
process of defining these constraints in Access is discussed in Section 4.
2.3 Attributes
The final step of developing a database is discovering attributes. Attributes are characteristics of an entity
that the user wishers to capture. For each table that has been defined, ask yourself: "What other
characteristics of the entity are of interest?" Only real attributes, and not derived attributes, are of interest.
A derived attribute can be calculated from values elsewhere in the database and should be avoided if at all
possible. The only time a derived attribute should be considered is when it will overcome a performance
issue. This is common when database code is written to perform calculations and temporary, or working,
tables are created to store data during this process.
After discovering an attribute determine the key of the attribute. Does the value for the attribute in any row
depend on the value of the primary key for that row? If the answer is no the attribute is either misplaced or
there is a missing table. For example, suppose a table has been defined with the name of Part and primary
key of Part Number. A possible attribute for this entity is Part Name. The part name depends on the part
number and thus is a valid attribute for this table. Likewise the price of the part depends on the part
number and is also a valid attribute for this table. Consider order number as an attribute for this table.
Each order placed by a customer is assigned a unique order number. Although each order contains parts,
order number does not depend on the value of the part number. Order number is not a valid attribute for
this table. If order number is not valid for any of the defined tables this is probably an indication of a
missing table.
3
For each attribute discovered determine if it can contain null values. Also determine if all values for that
attribute must always be unique, no duplicates allowed.
3 Creating a Database in Microsoft Access
Section 2 discussed the process of designing a conceptual database. Once the tables, their attributes, and
relations have been determined the physical database can be created. This section will cover the processes
of creating a database specific to Microsoft Access.
3.1 Create Tables (Entities) and Their Attributes
New tables are created in Access by selecting the Tables option on the left of the database window and
clicking the New toolbar button (see Figure 1).
Figure 1: Database Window
A new dialog will appear showing the new table options (Figure 2). Select the Design View option and
click the OK button.
4
Figure 2: New Table Options
The table design view will be displayed. When creating a new table the view will be empty. Figure 3
shows the design view for an existing table. The design view is used to define the table attributes, also
called fields. For each field of a table certain properties must be defined. Three of these properties are
required for every data element and should be considered carefully during the design process. These
include data type, field size (applicable only for text and number fields), and requirement rule.
Dates can be stored in the database as a text string or a date data type. If a field is defined as a date then
special Structured Query Language (SQL) functions can be used to extract the day, month, or year from the
date. If the date is stored as a text string it must first be converted to a date before built-in date functions
can be used against it. Also, the date data type will insure all dates entered in the database are valid dates.
Numbers can also be stored as text or a type of number, which includes integer, long integer, and real
numbers. If any type of analyses will be done on this data define the data type as a number. As discussed
above, this will allow for the use of built-in number functions.
Figure 3 shows the design view for the PlotSurv table. The primary keys for this table are InstalID, PlotID,
and RecDate, denoted by the key icon to the left of each field name.
5
Figure 3: Design View of the PlotSurv Table
Primary keys are defined by selecting the field, or fields, and clicking the primary key button shown in
Figure 4. To select a field name click the small box to the left of the field name. To select multiple fields
hold down the Ctrl button while selecting the fields.
Figure 4: Primary Key Toolbar Button
Figure 5 and Figure 6 show the field properties for a text and number field respectively. Notice there is a
slight difference in these two. Field size for the text field represents the maximum number of characters
allowed; the field size for the number field represents the type of number.
6
Figure 5: Field Properties of the InstalID field in the PlotSurv Table
Figure 6: Field Properties of the PlotID field in the PlotSurv Table
Avoid creating field names with spaces or dashes "-", use the underscore "_" instead. Access allows the
uses of spaces in field names but many databases do not. When a field name containing spaces is used in a
query, or referenced in code, the name must be surrounded by brackets "[]". Using the underscore
character avoids the use of these brackets and will help avoid problems in the future. The same holds true
for table names.
Another good practice is to enter descriptions for all fields. This will facilitate the use of the database by
others less familiar with the design. The descriptions are displayed in the lower left portion of the Access
window, status bar, when a table is open in datasheet view and the cursor is resting in a data field. Table
descriptions are entered in the table property dialog box (Figure 7). To see this dialog single click a table
name in the database window (Figure 1) then click the right mouse button while the cursor is over the table
name. Select Properties from the menu. To display these descriptions in the database window select View
from the main menu and select the Details option.
7
Figure 7: Table Properties
4 Defining Relationships in Microsoft Access
Defining relationships in Microsoft Access is relatively easy once tables have been created, primary keys
defined, and relationships determined. Access offers a graphical environment in which to define
relationships eliminating the need to learn the SQL syntax for these procedures.
Relationships can be viewed in Microsoft Access by selecting Tools from the main menu then selecting
Relationships. Once the relationships window is open the relationship toolbar should be visible. To view
the relationship toolbar, if it is not visible, select View / Toolbars / Relationship from the main menu.
The most common buttons used in this toolbar are Show All Relationships (Figure 8), Clear Layout
(Figure 9), and Show Table (Figure 10). The Show All Relationships button will add all tables to the
window that have a defined relationship, The Clear Layout button will remove all items currently
displayed in the window and the Show Table button allows the user to add specific tables to the view.
Figure 8: Show All Relationships Button
Figure 9: Clear Layout Button
8
Figure 10: Show Table Button
Tables in the relationship window are represented similar to that of Figure 11. This figure represents the
PlotSurv table, also called an entity, and its' data fields, or attributes. The bolded field names represent the
primary key of the table.
Figure 11: The PlotSurv Entity and its Attributes
Figure 12 illustrates the relationship window with three tables visible, PlotMast, PlotSurv, and GndCover.
The lines connecting the tables represent relationships. For example, two lines are drawn between
PlotMast and PlotSurv representing a relationship which involves the InstalID and PlotID data fields.
These lines have a number 1 above the left side of the lines and an infinity symbol (∞) above the right side
of the lines. These marks represent a 1-to-many relationship between PlotMast and PlotSurv. In other
words, for every unique combination of InstalID and PlotID in PlotMast there can be zero or more rows of
data in PlotSurv having the same combination of InstalID and PlotID. In this case PlotMast is the parent
table, which contains the primary keys for the relationship, and PlotSurv is the child table, which contains
the foreign keys.
Not only does PlotSurv contain the foreign keys of the relationship with PlotMast, but it also has its' own
primary key. The primary key for the PlotSurv table is made up of the InstalID, PlotID, and RecDate
fields.
9
Figure 12: Relationships window in Micrsoft Access
Relationships are defined in the relationship window (Figure 12). To define a new relationship click the
field name, or names, in the parent table that represent the primary key. The primary key are bolded in the
relationships window. To select more than one column hold down the CTRL key while clicking the names.
After selecting the names move the cursor over one of the highlighted names and click the left mouse
button. While keeping the mouse button depressed drag the cursor over the child table and release the
mouse button. A dialog box similar to Figure 13 will appear. If the primary key of the parent table only
contains one field the Edit Relationships dialog box will already have the relationships defined. If the
primary key of the parent table contains multiple fields the links between the parent and child table will
need to be defined. The fields listed in the left column of the Edit Relationships dialog box represent the
primary key of the parent table. Select the corresponding fields in the right column, which represent the
foreign keys of the child table.
The final steps in defining the relationships include selecting the options at the bottom of the Edit
Relationships dialog box. Place a check mark in the Enforce Referential Integrity option. This will enforce
the relationship between the parent table and the child table. The other two options, Cascade Update
Related Fields and Cascade Delete Related Records, are optional. Cascade Update Related Fields will
automatically update the foreign key in the child table when the primary key in the parent table is changed.
For example, if the RecDate field of the PlotSurv table is changed the database will automatically change
the corresponding data in the GndCover table (see Figure 12). Similarly, if data is deleted in the PlotSurv
table the corresponding data in the GndCover table will be deleted if the Cascade Delete Related Records
option is turned on. It is recommended that the Cascade Delete Related Records is not used if the database
will be used by multiple people. This option can cause the unintentional loss of a great amount of data.
The properties for a relationship can be viewed by double clicking on one of the relationship lines. Figure
13 shows the resulting dialog box that appears from this action. The information in Figure 13 represents
the properties for the relationship between PlotMast and PlotSurv. Here we see that PlotMast is the parent
table, listed on the left side, and PlotSurv is the child table, listed on the right. The InstalID data field of
the PlotMast table controls the InstalID of the PlotSurv table. The PlotID data field of the PlotMast table
controls the PlotID of the PlotSurv table. When more than one data field is listed the primary key is called
a composite key, which means it contains more than one data field.
10
Figure 13: Edit Relationships Dialog Box
5 Conclusion
In this brief document the concepts of designing a database have been discussed along with the steps
required to create the physical database in Microsoft Access. Data management is the next item to consider
when dealing with databases. Data management is beyond the scope of this document but a brief
introduction specific to LCTA is presented below.
5.1 Introduction to Data Management
Data management is not a new concept. Even Charles Babbage (1792-1871) recognized its importance
over one hundred year ago. Charles Babbage was commissioned by the Royal Navy to produce accurate
navigation tables, which resulted in the design of the Analytical Engine (thought by some to be the first
crude computer). In his writings Charles Babbage stated, "Verification of data is part of the cost of
obtaining it, and the two can’t be separated." This is very true when the data in question is ecological data.
The cost of collecting ecological data is high and many studies of natural phenomena cannot be replicated
because time is unidirectional. This suggests that data management play a very important roll in the
collection and analysis of ecological data.
LCTA depends heavily on the reliability and validity of its data. However, LCTA has no well-defined
organizational element charged with all aspects of data management, which can quickly become a serious
and large function. Historically, data management was often performed external to the installation through
the U.S. Army Construction Engineering Research Laboratories (USACERL) and technical support
provided by the Army Environmental Center (AEC). More recently, one person is often responsible for the
entire information management element of LCTA. Unfortunately data administration of LCTA is the most
overlooked role in LCTA.
11
Some of the driving factors that make data management of LCTA data an essential process of ITAM
follow. Insufficient quality control processes in the past have left many databases with incorrect data.
Errors during data collection are inherent to the data collection process. Insufficient data management
skills of the LCTA personnel have also led to incorrect data. These include lack of knowledge of the
database concept, database software, and the knowledge of potential errors and their effect.
As more and more data is collected the task of data management becomes more intensive. Once LCTA
data is collected and uploaded to a database it is often hard to visualize just how much data an installation
owns. Megabytes mean little beyond the amount of disk space a file occupies. To help add proportions to
the amount of LCTA data an installation can collect let us look at an example.
Assume an installation with six years of data and an average of 224 plots/year recorded all of its' LCTA
data using the standard LCTA paper forms (Tazik et. al. 1992) on single sided sheets. After the sixth year
of data collection a total of 42,822 pages would have accumulated. This equates to 85.65 reams of paper or
a stack of paper that would stand 14.27 feet.
Unfortunately there is no automatic tool that can be pointed at a data set and told to just "Fix" the data.
Data management starts before data collection, continues during data collection, and plays a large role after
the data is collected. Next LCTA data management is discussed.
5.1.1 LCTA Data Management
5.1.1.1 A Priori
One key process that reduces the need of extensive data management, after data collection, is proper
training of the field crew. Ensuring the field crew understands the data they are collecting, collection
methodologies, and valid data requirements ensures good data going into the database. Some a priori data
management tasks are listed below.
Vegetation Identification
Make sure the field crew is familiar with all common species found on the installation. Use voucher
specimens if available for examples. Explain the coding used for vegetation identification and provide a
list of valid codes.
Data Collection Methodologies
Thoroughly cover the methodologies used on the LCTA plots. If using the standard LCTA methodology
acquaint the field crew with USA CERL technical report U.S Army Land Condition - Trend Analysis
(LCTA) Plot Inventory Field Methods. If additional, or different, methodologies are used create an
installation LCTA inventory methods manual outlining all requirements.
Acceptable Codes
In addition to vegetation codes much of the LCTA data utilizes codes. Examples include military
disturbance, basal cover categories, and vegetation condition. The LCTA handheld data logger restricts the
entry of data to only acceptable codes in most cases. Be prepared for those instances when paper forms are
required due to data logger failure or when paper forms are used to collect installation specific needs. Also
develop a protocol for coding unknown species.
Metadata Collection
Instruct the field crew to record any information that will help explain discrepancies in the data later. For
example, if an unknown plant species is found and collected for later identification, the field crew should
note the code used when entering the data and any site information that may be helpful for identification.
12
5.1.1.2 During The Collection Process
The best time to find errors in the data is while the field crew is still on the installation. What at first
inspection appears to be an error could be some special circumstance that the field crew failed to document.
The LCTA coordinator should develop procedures for entering the data into the database weekly and
checking the data for errors. Some of the more common errors to check for include unknown vegetation
codes, missing data, improperly recorded data, and missing plots.
If data is collected with the handheld data loggers the process of loading data is easy. If data is recorded on
field sheets check them each week for any errors. Once the data has been transcribed and loaded into the
database a second check is needed to find any transcription errors. In the next section data management
tasks are discussed that will help facilitate the weekly data checks.
5.1.1.3 Post Hoc
Post hoc data management tasks are performed on data that is already in the database. Most of the analyses
for LCTA are based on the data that are found in the installation database, thus, this task is very important.
Earlier we defined the terms reliability and validity. Internal validity is the ability to draw proper
conclusions from a database. This includes understanding the meaning of the data. To truly understand the
meaning of the data in a database you must understand the interpretation of the various measures within a
database (variables).
5.2 Further Reading
Refer to Chapter 5, Data Management, of the LCTA II Technical Reference Manual for further discussion
on data management, which is available on the CEMML web site at:
13
6 References
Anderson, A.B., Sprouse, W., Kowalski, D., and Brozka, R. Land Condition Trend Analysis (LCTA) Data
Collection Software Users Manual: Version 1.0. USACERL ADP Report 95/13, 1995. Champaign, IL.
Bruce, T.A Designing Quality Databases with IDEF1X Information Models. Dorset House Publishing,
1992.
How to Design a Relational Data Model, Extended Relational Analysis, Workshop Version 6.2 (1998).
Relational Systems Corporation. Birmingham, MI.
Michener, William K. (ed.). Research Data Management in the Ecological Sciences (1986). University of
South Carolina.
Newton, J.J. and D.C. Wahl, eds Manual for Data Administration. NIST Special Publication 500-208.
National Institute of Standards and Technology [NIST], Gaithersburg, MD, March 1993.
Sprouse, W. and Anderson, A. B. Land Condition Trend Analysis (LCTA) Program Data Dictionary:
Version 1.0. USACERL ADP Report EN-95/03. Champaign, IL.
Tazik, D.J., Warren, S.D., Diersing, V.E., Shaw, R.B., Brozka, R.J., Bagley, C.F., and Whitworth, W.R.
(1992). U.S. Army Land Condition-Trend Analysis (LCTA) Plot Inventory Field Methods. USACERL
Technical Report N-92/03. Champaign, IL.