
PART FOUR - MAINTAINING
THE DATA
CHAPTER 13
IMPORTING DATA
The most important component of any data management system is the data in it. Manual and
automated entry, importing, and careful checking are critical components in ensuring that the data
in the system can be trusted, at least to the level of its intended use. For many data management
projects, the bulk of the work is finding, organizing, and inputting the data, and then keeping up
with importing new data as it comes along. The cost of implementing the technology to store the
data should be secondary. The EDMS can be a great time-saver, and should more than pay for its
cost in the time saved and greater quality achieved using an organized database system. The time
savings and quality improvement will be much greater if the EDMS facilitates efficient data
importing and checking.
MANUAL ENTRY
Sometimes there’s no way to get data into the system other than transcribing it from hard copy, usually by typing it in. This process is slow and error prone, but if it’s the only way, and if
the data is important enough to justify it, then it must be done. The challenge is to do the entry
cost-effectively while maintaining a sufficient level of data quality.
Historical entry
Often the bulk of manual entry is for historical data. Usually this is data in hard-copy files. It
can be found in old laboratory reports, reports which have been submitted to regulators, and many
other places.
DATA SELECTION - WHAT’S REALLY IMPORTANT?
Before embarking on a manual entry project, it is important to place a value on the data to be
entered. The importance of the data and the cost to enter it must be balanced. It is not unusual for a
data entry project for a large site, where an effort is made to locate and input a comprehensive set
of data for the life of the facility, to cost tens or hundreds of thousands of dollars. The decision to
proceed should not be taken lightly.
LOCATING AND ORGANIZING DATA
The next step, and often the most difficult, is to find the data. This is often complicated by the


fact that over time many different people or even different organizations may have worked on the
project, and the data may be scattered across many different locations. It may even be difficult to
locate people who know or can find out what happened in the past. It is important to locate as
much of this historical data as possible, and then the portion selected as described in the previous
section can be inventoried and input.
Once the data has been found, it should be inventoried. On small projects this can be done in
word processor or spreadsheet files. For larger projects it is appropriate to build a database just to
track documents and other items containing the data, or include this information in the EDMS.
Either way, a list should be made of all of the data that might be entered. This list should be
updated as decisions are made about what data is to be entered, and then updated again as the data
is entered and checked. If the data inventory is stored in the EDMS, it should be set up so that after
the data is imported it can be tracked back to the original source documents to help answer
questions about the origin of the data.
TOOLS TO HELP WITH CORRECT ENTRY
There are a number of ways to enter the data, and these options provide various levels of
assistance in getting clean data into the system.
Entry and review process – Probably the most common approach used in the environmental
industry is manual entry followed by visual review. In this process, someone types in the data, then
it is printed out in a format similar to the one that was used for import. Then a second person
compares every piece of data between the two pieces of paper, and marks any inconsistencies.
These are then remedied in the database, and the corrections checked. The end result, if done
conscientiously, is reliable data. The process is tedious for those involved, and care should be
taken that those doing it keep up their attention to detail, or quality goes down. Often it is best to
mix this work with other work, since it is hard to do this accurately for days on end. Some people
are better at it than others, and some like it more than others. (Most don’t like it very much.)
Double entry – Another approach is to have the data entered twice, by two different people,
and then have special software compare the two copies. Data that does not match is then entered
again. This technique is not as widely used as the previous one in the environmental industry
perhaps because existing EDMS software does not make this easy to do, and maybe also because

the human checking in the previous approach sounds more reliable.
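One way to implement the comparison step, assuming both passes are loaded into staging tables with identical layouts, is a query that lists rows from one pass with no exact match in the other. The table and field names here (Entry1, Entry2, StationName, SampleDate, ParameterName, Value) are purely illustrative:

    -- Rows from the first entry pass that have no exact match in the second pass.
    -- Entry1 and Entry2 are hypothetical staging tables with identical layouts.
    SELECT e1.StationName, e1.SampleDate, e1.ParameterName, e1.Value
    FROM Entry1 AS e1
    LEFT JOIN Entry2 AS e2
           ON e1.StationName = e2.StationName
          AND e1.SampleDate = e2.SampleDate
          AND e1.ParameterName = e2.ParameterName
          AND e1.Value = e2.Value
    WHERE e2.StationName IS NULL;

Any rows returned, along with the same query run in the opposite direction, identify entries that need to be keyed again.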
Scanning and OCR – Hardware and software are widely available to scan hard copy documents into digital format and then convert them into editable text using optical character recognition (OCR). The tools to do this have improved immensely over the last few years, such
that error rates are down to just a few errors per page. Unfortunately, the highest error rates are
with older documents and with numbers, both of which are important in historical entry of
environmental data. Also, because the formats of old documents are widely variable, it is difficult
to fit the data into a database structure after it has been scanned. These problems are most likely to
be overcome, from the point of view of environmental data entry, when there is a large amount of
data in a consistent format, with the pages in good condition. Unless you have this situation, scanning probably won’t work. However, this approach has been known to work on some projects.
After scanning, a checking step is required to maintain quality.
Voice entry – As with scanning, voice recognition has taken great strides in recent years.
Systems are available that do a reasonable job of converting a continuous stream of spoken words
into a word processing document. Voice recognition is also starting to be used for on-screen
navigation, especially for the handicapped. It is probably too soon to tell whether this technology
will have a large impact on data entry.
Offshore entry – There are a number of organizations in countries outside the United States,
especially Mexico and India, that specialize in high-volume data entry. They have been very
successful in some industries, such as processing loan applications. Again, the availability of a
large number of documents in the same format seems to be the key to success in this approach, and
a post-entry checking step is required.
Figure 55 - Form entry of analysis data
Form entry vs. spreadsheet entry – EDMS programs usually provide a form-based system
for entering data, and the form usually has fields for all the data at each level, such as site, station,
sample, and analysis. Figure 55 shows an example of this type of form. This is usually best for
entering a small amount of data. For larger data entry projects, it may be useful to make a
customized form that matches the source documents to simplify input. Another common approach
is to enter the data into a spreadsheet, and then use the import tool of the EDMS to check and

import the data. Figure 56 shows this approach. This has two benefits. The EDMS may have better
data checking and cleanup tools as part of the import than it does for form entry. Also, the person
entering the data into the spreadsheet doesn’t necessarily need a license for the EDMS software,
which can save the project money. Sometimes it is helpful to create spreadsheet templates, filling in items such as station names, dates, and parameter lists with cut and paste in one step, and then having the results entered in a second step.
Ongoing entry
There may be situations where data needs to be manually entered on an ongoing basis. This is
becoming less common as most sources of data involve a computerized step, so there is usually a
way to import the data electronically. If not, approaches as described above can be used.
ELECTRONIC IMPORT
The majority of data placed into the EDMS is usually in digital format in some form or other
before it is brought into the system. The implementers of the system should provide a data transfer
standard (DTS) so that the electronic data deliverables (EDDs) created by the laboratory for the
EDMS contain the appropriate data elements in a format suitable for easy import. An example DTS
is shown in Appendix C.
Figure 56 - Spreadsheet entry of analysis data
Automated import routines should be provided in the EDMS so that data in the specified
format (or formats if the system supports more than one) can be easily brought into the system and
checked for consistency. Data review tracking options and procedures must be provided. In
addition, if it is found that a significant amount of digital data exists in other formats, then imports
for those formats should be provided. In some cases, importing those files may require operator
involvement if, for example, the file is a spreadsheet file of sample and analytical data but does not
contain site or station information. These situations usually must be addressed on a case-by-case
basis.
Historical entry
Electronic entry of historical data involves several issues including selecting, locating, and
organizing data, and format and content issues.
Data selection, location, and organization – The same issues exist here as in manual input in

terms of prioritizing what data will be brought into the EDMS. Then it is necessary to locate and
catalog the data, whatever format it is in, such as on a hard drive or on diskettes.
Format issues – Importing historical data in digital format involves figuring out what is in the
files and how it is formatted, and then finding a way to import it, either interactively using queries
or automatically with a menu-driven system. Most modern data management programs can read a
variety of file formats including text files, spreadsheets, word processing documents, and so on.
Usually the data needs to be organized and reformatted before it can be merged with other data
already in the EDMS. This can be done either in its native format, such as in a spreadsheet, or
imported into the database program and organized there. If each file is in a different format, then
there can be a big manual component to this. If there are a lot of data files in the same format, it
may be possible to automate the process to a large degree.
Content issues – It is very important that the people responsible for importing the data have a
detailed understanding of the content of the data being imported. This includes knowing where the
data was acquired and when, how it is organized, and other details like detection limits, flags, and
units, if they are not in the data files. Great care must be exercised here, because often details like
these change over time, often with little or no documentation, and are important in interpreting the
data.
Ongoing entry
The EDMS should provide the capability to import analytical data in the format(s) specified in
the data transfer standard. This import capability must be robust and complete, and the software
and import procedures must address data selection, format, and content issues, and special issues
such as field data, along with consistency checking as described in a later section.
Data selection – For current data in a standard format, importing may not be very time-
consuming, but it may still be necessary to prioritize data import for various projects. The return on
the time invested is the key factor.
Format and content issues – It may be necessary to provide other import formats in addition
to those in the data transfer standard. The identification of the need to implement other data
formats will be made by project staff members. The content issues for ongoing entry may be less
than for historical data, since the people involved in creating the files are more likely to be

available to provide guidance, but care must still be taken to understand the data in order to get it
in right.
Field data – In the sampling process for environmental data there is often a field component
and a laboratory component. More and more the data is being gathered in the field electronically. It
is sometimes possible to move this data digitally into the EDMS. Some hard copy information is
usually still required, such as a chain of custody to accompany the samples, but this can be
generated in the field and printed there. The EDMS needs to be able to associate the field data
arriving from one route with the laboratory data from another route so both types of data are
assigned to the correct sample.
Understanding duplicated and superseded data
Environmental projects generate duplicated data in a variety of ways. Particular care should be
taken with duplicated data at the Samples and Analyses levels. Duplicate samples are usually the
result of the quality assurance process, where a certain number of duplicates of various types are
taken and analyzed to check the quality of the sampling and analysis processes. QC samples are
described in more detail in Chapter 15. A sample can also be reanalyzed, resulting in duplicated
results at the Analyses level. These results can be represented in two ways, either as the original
result plus the reanalysis, or as a superseded (replaced) original result plus the new, unsuperseded
result. The latter is more useful for selection purposes, because the user can easily choose to see
just the most current (unsuperseded) data, whereas selecting reanalyzed data is not as helpful
because not all samples will have been reanalyzed. Examples of data at these two levels and the
various fields that can be involved in the duplications at the levels are shown in Figure 57.
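With the superseded convention, retrieving only the current data is a one-condition filter. The sketch below assumes an Analyses table with a numeric Superseded field in which zero marks the current result, the convention used in the examples in this chapter:

    -- Return only the most current (unsuperseded) results,
    -- assuming Superseded = 0 marks the current result.
    SELECT *
    FROM Analyses
    WHERE Superseded = 0;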
Obtaining clean data from laboratories
Having an accurate, comprehensive, historical database for a facility provides a variety of
benefits, but requires that consistency be enforced when data is being added to the database.
Matching analytical data coming from laboratories with previous data in a database can be a time-
consuming process.
Figure 57 - Duplicate and superseded data. The original figure contains example tables for the two cases described above: duplicates at the Samples level, where the unique index for water samples is station, sample date, matrix, filtration, and duplicate number; and superseded results at the Analyses level, where the unique index for water analyses also includes the parameter name, leach method, basis, and superseded number. Other fields shown include value, detection limits, QC code (original, field duplicate, or split), dilution factor, lab ID, and the reportable result flag.
Variation in station names, spelling of constituent names, abbreviation of units, and problems
with other data elements can result in data that does not tie in with historical data, or, even worse,
does not get imported at all because of referential integrity constraints. The alternative to enforcing consistency up front is a time-consuming data checking and cleanup process with each data deliverable, which is standard operating procedure on many projects.

WORKING WITH LABS - STANDARDIZING DELIVERABLES
The process of getting the data from the laboratory in a consistent, usable format is a key
element of a successful data management system. Appendix C contains a data transfer standard
(DTS) that can be used to inform the lab how to deliver data. EDDs should be in the same format
every time, with all of the information necessary to successfully import the data into the database
and tie it with field samples, if they are already there. Problems with EDDs fall into two general
areas: 1) data format problems and 2) data content problems. In addition, if data is gathered in the
field (pH, turbidity, water level, etc.) then that data must be tied to the laboratory data once the
data administrator has received both data sets. Data format problems fall into two areas: 1) file
format and 2) data organization. The DTS can help with both of these by defining the formats (text
file, Excel spreadsheet, etc.) acceptable to the data management system, and the columns of data in
the file (data elements, order, width, etc.). Data content problems are more difficult, because they
involve consistency between what the lab is generating and what is already in the database.
Variation in station names (is it “MW1” or “MW-1”?), spelling of constituent names, abbreviation
of units, and problems with other data elements can result in data that does not tie in with historical
data. Even worse, the data may not get imported at all because of referential integrity constraints
defined in the data management system.
Figure 58 - Export laboratory reference file
USING REFERENCE FILES AND A CLOSED-LOOP SYSTEM
While project managers expect their laboratories to provide them with “clean” data, on most
projects it is difficult for the laboratory to deliver data that is consistent with data already in the
database. What is needed is a way for the project personnel to keep the laboratory updated with
information on the various data elements that must be matched in order for the data to import
properly. Then the laboratory needs a way to efficiently check its electronic data deliverable
(EDD) against this information prior to delivering it to the user. When this is done, then project
personnel can import the data cleanly, with minimal impact on the data generation process at the
laboratory.
It is possible to implement a system that cuts the time to import a laboratory deliverable by a
factor of five to ten over traditional methods. The process involves a DTS as described in

Appendix C to define how the data is to be delivered, and a closed-loop reference file system
where the laboratory compares the data it is about to deliver to a reference file provided by the
database user. Users employ their database software to create the reference file. This reference file
is then sent to the laboratory. The laboratory prepares the electronic data deliverable (EDD) in the
usual way, following the DTS, and then uses the database software to do a test import against the
reference file. If the EDD imports successfully, the laboratory sends it to the user. If it does not, the
laboratory can make changes to the file, test it again, and once successful, send it to the user. Users
can then import this file with a minimum of effort because consistency problems have been
eliminated before they receive it. This results in significant time-savings over the life of a project.
If the database tracks which laboratories are associated with which sites, then the creation of
the reference file can start with selection of the laboratory. An example screen to start the process
is shown in Figure 58.
In this example, the software knows which sites are associated with the laboratory, and also
knows the name to be used for the reference file. The user selects the laboratory, confirms the file
name, and clicks on Create File. The file can then be sent to the laboratory via email or on a disk.
This process is done any time there are significant changes to the database that might affect the
laboratory, such as installation of new stations (for that laboratory’s sites) or changes to the lookup
tables.
There are many benefits to having a centralized, open database available to project personnel.
In order to have this work effectively the data in the database must be accurate and consistent.
Achieving this consistency can be a time-consuming process. By using a comprehensive data
transfer standard, and the closed-loop system described above, this time can be minimized. In one
organization the average time to import a laboratory deliverable was reduced from 30 minutes
down to 5 minutes using this process. Another major benefit of this process is higher data quality.
This increase in quality comes from two sources. The first is that there will be fewer errors in the
data deliverable, and consequently fewer errors in the database, because a whole class of errors
related to data mismatches has been completely eliminated. A second increase in quality is a
consequence of the increased efficiency of the import process. The data administrator has more
time to scrutinize the data during and after import, making it easier to eliminate many other errors

that would have been missed without this scrutiny.
Automated checking
Effective importing of laboratory and other data should include data checking prior to import
to identify errors and to assist with the resolution of those errors prior to placing the data in the
system. Data checking spans a range of activities from consistency checking through verification
and validation. Performing all of the checks won’t ensure that no bad data ever gets into the
database, but it will cut down significantly on the number of errors. The verification and validation
components are discussed in more detail in Chapter 16. The consistency checks should include
evaluation of key data elements, including referential integrity (existence of parents); valid site
(project) and station (well); valid parameters, units, and flags; handling of duplicate results (same
station, sample date and depth, and parameter); reasonable values for each parameter; comparison
with like data; and comparison with previous data.
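Many of these checks reduce to simple queries against the database. As one illustrative sketch, the following lists stations in an incoming deliverable that do not match the Stations table; ImportStaging, SiteName, and StationName are assumed names for a staging table and its fields:

    -- Stations in the staged deliverable with no matching entry in the Stations table.
    SELECT DISTINCT i.SiteName, i.StationName
    FROM ImportStaging AS i
    LEFT JOIN Stations AS s
           ON i.SiteName = s.SiteName
          AND i.StationName = s.StationName
    WHERE s.StationName IS NULL;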
The software importing the data should perform all of the data checks and report on the results
before importing the data. It’s not helpful to have it give up after finding one error, since there may
well be more, and it might as well find and flag all of them so you can fix them all at once.
Unfortunately, this is not always possible. For example, valid station names are associated with a
specific site, so if the site in the import file is wrong, or hasn’t been entered in the sites table, then
the program can’t check the station names. Once the program has a valid site, though, it should be
able to perform the rest of the checks before stopping.
Of course, all of this assumes that the file being imported is in a format that matches what the
software is looking for. If site name is in the column where the result values should be, the import
should fail, unless the software is smart enough to straighten it out for you.
Figure 59 shows an example of a screen where the user is being asked what software-assisted
data checking they want performed, and how to handle specific situations resulting from the
checking.
Figure 59 - Screen for software-assisted data checking
Figure 60 - Screen for editing data prior to import
You might want to look at the data prior to importing it. Figure 60 shows an example of a
screen to help you do this. If edits are made to the laboratory deliverable, it is important that a

record be kept of these changes for future reference.
REFERENTIAL INTEGRITY CHECKING
A properly designed EDMS program based on the relational model should require that a
parent entry exist before related child entries can be imported. (Surprisingly, not all do.) This
means that a site must exist before stations for that site can be entered, and so on through stations,
samples, and analyses. Relationships with lookups should also be enforced, meaning that values
related to a lookup, such as sample matrix, must be present and match entries in the lookup table.
This helps ensure that “orphan” data does not exist in the tables. Unfortunately, the database
system itself, such as Access, usually doesn’t give you much help when referential integrity
problems occur. It fails to import the record(s) and provides an error message that may or may not give you useful information about what happened. Usually it is the job of the application
software running within the database system to check the data and provide more detailed
information about problems.
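A minimal sketch of such an application-level check is an outer join from the child table to its parent; any rows returned are orphans. The key field name SampleNumber is an assumption for illustration:

    -- Analyses whose parent sample record does not exist (orphans).
    SELECT a.*
    FROM Analyses AS a
    LEFT JOIN Samples AS s
           ON a.SampleNumber = s.SampleNumber
    WHERE s.SampleNumber IS NULL;

The same pattern applies to lookup relationships such as the sample matrix.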
CHECKING SITES AND STATIONS
When data is obtained from the lab it must contain information about the sites and stations associated with the data. It is usually not a good idea to add this data to the main data tables
automatically based on the lab data file. This is because it is too easy to get bad records in these
two tables and then have the data being imported associated with those bad records. In our
experience, it is more likely that the lab has misspelled the station name than that you really drilled
a new well, although obviously this is not always the case. It is better to enter the sites and stations
first, and then associate the samples and analyses with that data during import. Then the import
should check to make sure the sites and stations are there, and tell you if they aren’t, so you can do
something about it.
On many projects the sample information follows two paths. The samples and field data are
gathered in the field. The samples go to the laboratory for analysis, and that data arrives in the
electronic data deliverable (EDD) from the laboratory. The field data may arrive directly from the
field, or may be input by the laboratory.
Figure 61 - Helper screen for checking station names
If the field data arrives separately from the laboratory data, it can be entered into the EDMS

prior to arrival of the EDD from the laboratory. This entry can be done in the field in a portable
computer or PDA, in a field office at the site, or in the main office. Then the EDMS needs to be
able to associate the field information with the laboratory information when the EDD is imported.
Another approach is to enter the sample information prior to the sampling event. Then the
EDD can check the field data and laboratory data as it arrives for completeness. The process needs
to be flexible enough to accommodate legitimate changes resulting from field activities (well MW-
1 was dry), but also notify the data administrator of data that should be there but is missing. This
checking can be performed on data at both the sample and analyses levels.
Figure 61 shows the software helping with the data checking process.
The user has imported a laboratory data file that has some problems with station names. The
program is showing the names of the stations that don’t match entries already in the database, and
providing a list of valid stations to choose from. The user can step through the problem stations,
choosing the correct names. If they are able to correctly match all of the stations, the import can
proceed. If not, they will need to put this import aside while they research the station names that
have problems.
The import routine may provide an option to convert the data to consistent units, and this is
useful for some projects. For other projects (perhaps most), data is imported as it was reported by
the laboratory, and conversion to consistent units is done, if at all, at retrieval time. This is
discussed in Chapter 19. The decision about whether to convert to consistent units during import
should be made on a project-by-project basis, based on the needs of the data users. In general, if
the data will be used entirely for site analysis, it probably makes sense to convert to consistent units
so retrieval errors due to mixed units are eliminated. If the data will be used for regulatory and
litigation purposes, it is better to import the data as-is, and do conversion on output.
CHECKING PARAMETERS, UNITS, AND FLAGS
After the import routine is happy with the sites and stations in the file, it should check the
other data, as much as possible, to try to eliminate inconsistent data. Data in the import file should
be compared to lookup tables in the database to weed out errors. Parameter names in particular
provide a great opportunity for error, as do reporting units, flags, and other data.
Figure 62 - Screen for entering defaults for required values

The system should provide screens similar to Figure 61 to help fix bad values, and flag records
that have issues that can’t be resolved so that they can be researched and fixed.
Note that comparing values against existing data like sites and stations, or against lookups,
only makes sure that the data makes sense, not that it is really right. A value can pass a comparison
test against a lookup and still be wrong. After a successful test of the import file, it is critical that
the actual data values be checked to an appropriate level before the data is used.
Sometimes the data being imported may not contain all of the data necessary to satisfy
referential integrity constraints. For example, historical data being imported may not have
information on sample filtration or measurement basis, or even the sample matrix, if all of the data
in the file has the same matrix. The records going into the tables need to have values in these fields
because of their relationships to the lookup tables, and also so that the data is useful. It is helpful if
the software provides a way to set reasonable defaults for these values, as shown in Figure 62, so
the data can be imported without a lot of manual editing. Obviously, this feature should be used
with care, based on good knowledge of the data being imported, to avoid assigning incorrect
values.
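In practice this default-filling step can be a simple update on the staged data before it moves into the main tables. The staging table name, field name, and default value below are illustrative:

    -- Fill in a default matrix for staged records that arrived without one.
    -- The default must be chosen based on knowledge of the data being imported.
    UPDATE ImportStaging
    SET Matrix = 'Water'
    WHERE Matrix IS NULL;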
OTHER CHECKS
There are a number of other checks that the software can perform to improve the quality of the
data being imported.
Checking for repeated import – In the confusion of importing data, it is easy to accidentally
import, or at least try to import, the same data more than once. The software should look for this,
tell you about it, and give you the opportunity to stop the import. It is also helpful if the software
gives you a way to undo an import later if a file shouldn’t have been imported for one reason or
another.
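One simple way to support both features is an import log table with one row per import, and an import number stamped on every record loaded. Before importing, the software can check whether the incoming file has been seen before; the table, field, and file names here are assumptions:

    -- Has a file with this name already been imported?
    SELECT ImportID, ImportDate, FileName
    FROM Imports
    WHERE FileName = 'LAB_EDD_2000_08.txt';

The same import number is also what makes the undo feature described later in this chapter practical.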
Parameter-specific reasonableness – Going beyond checking names, codes, etc., the
software should check the data for reasonableness of values on a parameter-by-parameter basis.
For example, if a pH value comes in outside the range of 0 to 14, then the software could notice
and complain. Setting up and managing a process like this takes a considerable amount of effort,
but results in better data quality.
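One way to set this up is a lookup table of acceptable ranges that the import compares against. ParameterLimits, MinValue, and MaxValue are hypothetical names; the staging table is the same illustrative one used above:

    -- Staged results outside the reasonable range defined for their parameter.
    SELECT i.StationName, i.ParameterName, i.Value
    FROM ImportStaging AS i
    INNER JOIN ParameterLimits AS pl
            ON i.ParameterName = pl.ParameterName
    WHERE i.Value < pl.MinValue
       OR i.Value > pl.MaxValue;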
Comparison with like data – Sometimes there are comparisons that can be made within the
data set to help identify incorrect values. One example is comparing total dissolved solids reported

by the lab with the sum of all of the individual constituents and flagging the data if the difference
exceeds a certain amount. Another is to do a charge balance comparison. Again, this is not easy to
set up and operate, but results in better data quality.
Comparison with previous data – In situations where data is being gathered on a regular
basis, new data can be compared to historical data, and data that differs from previous data by more than a certain amount (usually some number of standard deviations from the mean) is suspect.
These data points are often referred to as outliers. The data point can then be researched for error,
re-sampled, or excluded, depending on the regulations for that specific project. The field of
statistical quality control has various tools for performing this analysis, including Shewhart and
Cumulative Sum control charts and other graphical and non-graphical techniques. See Chapters 20
and 23 for more information.
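A rough sketch of this screen in generic SQL is shown below. It flags staged results more than three standard deviations from the historical mean for the same station and parameter; three is a common but project-specific choice, the name of the standard deviation function varies between database systems, and for brevity the sketch assumes station and parameter fields are available directly on the Analyses table rather than reached through joins:

    -- Flag staged results far from the historical mean for that station and parameter.
    SELECT i.StationName, i.ParameterName, i.Value
    FROM ImportStaging AS i
    INNER JOIN
         (SELECT StationName, ParameterName,
                 AVG(Value) AS MeanValue,
                 STDDEV(Value) AS SdValue
          FROM Analyses
          GROUP BY StationName, ParameterName) AS h
            ON i.StationName = h.StationName
           AND i.ParameterName = h.ParameterName
    WHERE ABS(i.Value - h.MeanValue) > 3 * h.SdValue;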
CONTENT-SPECIFIC FILTERING
At times there will be specific data content that needs to be handled in a special way, or given specific attention, whenever it is present in an import. For example, one
project that we worked on had various problems over time with phenols. At different times the
laboratory reported phenols in different ways. For this project, any file that contained any variety
of phenol required specific attention. In another case, the procedure for a project specified that
tentatively identified compounds (TICs) should not be imported at all. The database software
should be able to handle these two situations, allowing records with specific data content to be
either flagged or not imported. Figure 63 shows an example of a screen to help with this.
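In a staging table, both behaviors amount to short statements run before the final import. The field names ReviewFlag and TicFlag and the wildcard match on the parameter name are illustrative; how TICs are identified in a deliverable varies:

    -- Flag any variety of phenol for manual attention before import.
    UPDATE ImportStaging
    SET ReviewFlag = 'CHECK'
    WHERE ParameterName LIKE '%phenol%';

    -- Drop tentatively identified compounds entirely, per the project procedure.
    -- TicFlag is a hypothetical marker field.
    DELETE FROM ImportStaging
    WHERE TicFlag = 'Y';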
Some projects allow the data administrator to manually select which data will be imported.
This sounds strange to many people, but we have worked on projects where each line in the EDD is
inspected to make sure that it should be imported. If a particular constituent is not required by the
project plan, and the laboratory delivered it anyway, that line is deleted prior to import. In Figure
60 the checkbox near the top of the screen is used for this purpose. The software should allow
deleted records to be saved to a file for later reference if necessary.
Figure 63 - Screen to configure content-specific filtering
Figure 64 - Screens showing results of a successful and an unsuccessful import

TRACKING IMPORTS
Part of the administration of the data management task should include keeping records of the
import process. After trying an import, the software should notify you of the result. Records should
be kept of both unsuccessful and successful imports. Figures 65 and 66 are examples of reports that
can be printed and saved for this purpose.
Figure 65 - Report showing an unsuccessful import
Figure 66 - Report showing a successful import
The report from the unsuccessful import can be used to resolve problems prior to trying the
import again. At this stage it is helpful for the software to be able to summarize the errors so an
error that occurs many times is shown only once. Then each type of error can be fixed generically
and the report re-run to make sure all of the errors have been remedied so you can proceed with the
import.
The report from the successful import provides a permanent record of what was imported. This
report can be used for another purpose as well. In the upper left corner is a panel (shown larger in
Figure 67) showing the data review steps that may apply to this data. This report can be circulated
among the project team members and used to track which review steps have been performed. After
all of the appropriate steps have been performed, the report can be returned to the data
administrator to enter the upgraded review status for the analyses.
UNDOING AN IMPORT
Despite your best efforts, sometimes data is imported that either should not have been
imported or is incorrect. If the software provides an Undo Import feature, it can back such data out automatically. The database software should track the data that you import so you can undo
an import if necessary. You might need to do this if you find out that a particular file that you
imported has errors and is being replaced, or if you accidentally imported a file twice. An undo
import feature should be easy to use but sophisticated, leaving samples in the database that have
analyses from a different import, and undoing superseded values that were incremented by the
import of the file being undone. Figure 68 shows a form to help you select an import for deletion.
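If every imported record carries an import number, the core of the undo operation is two deletes, taking care not to remove samples that still have analyses from other imports; renumbering of superseded values would follow as a separate step, as described above. The ImportID and SampleNumber field names and the value 1234 are illustrative:

    -- Remove the analyses loaded by the import being undone.
    DELETE FROM Analyses
    WHERE ImportID = 1234;

    -- Remove samples from that import only if they no longer have any analyses.
    DELETE FROM Samples
    WHERE ImportID = 1234
      AND SampleNumber NOT IN (SELECT SampleNumber FROM Analyses);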
Figure 67 - Section of successful import report used for data review

Figure 68 - Form to select an import for deletion
TRACKING QUALITY
A constant focus on quality should be maintained during the import process. Each result in the
database should be marked with flags regarding lab and other problems, and should also be marked
with the level of data review that has been applied to that result. An example of a screen to assist
with maintaining data review status is shown in Figure 75 in Chapter 15.
If the import process is managed properly using software with a sufficiently sophisticated
import tool, and if the data is checked properly after import, then the resulting data will be of a
quality that makes it useful to the project. The old axiom of “garbage in, garbage out” holds true with
environmental data. Another old axiom says “a job worth doing is worth doing well,” or in other
words, “If you don’t have time to do it right the first time, how are you ever going to find time to
do it again?” These old saws reinforce the point that the time invested in implementing a robust
checking system and using it properly will be rewarded by producing data that people can trust.
CHAPTER 14
EDITING DATA
Once the data is in the database it is sometimes necessary to modify it. This can be done
manually or using automated tools, depending on the task to be accomplished. These two processes
are described here. Due to the focus on data integrity, a log of all changes to the data should be
maintained, either by the software or manually in a logbook.
MANUAL EDITING
Sometimes it is necessary to go into the database and change specific pieces of data content.
Actually, modification of data in an EDMS is not as common as an outsider might expect. For the
most part, the data comes from elsewhere, such as the field or the laboratory, and once it is in it
stays the way it is. Data editing is mostly limited to correcting errors (which, if the process is
working correctly, should be minimal) and modifying data qualifiers such as review status and
validation flags.
The data management system will usually provide at least one way to manually edit data.
Sometimes the user interface will provide more than one way to view and edit data. Two examples

include form view (Figure 69) and datasheet view (Figure 70).
Figure 69 - Site data editing screen in form view
Figure 70 - Site data editing screen in datasheet view
AUTOMATED EDITING
If the changes involve more than one record at a time, then it probably makes sense to use an
automated approach. For specific types of changes that are a standard part of data maintenance,
this should be programmed into the system. Other changes might be a one-time action, but involve
multiple records with the same change, so a bulk update approach using ad hoc queries is better.
Standardized tasks
Some data editing activities are relatively common. For these activities, especially if
they involve a lot of records to be changed or a complicated change process, the software should
provide an automated or semi-automated process to assist the data administrator with making the
changes. The examples given here include both a simple process and a complicated one to show
how the system can provide this type of capability.
UPDATING REVIEW STATUS
It’s important to track the review status of the data, that is, what review steps have been
performed on the data. An automated editing step can help update the data as review steps are
completed. Automated queries should allow the data administrators to update the review status
flags after appropriate data checks have been made. An example of a screen to assist with
maintaining data review status is shown in Figure 75 in Chapter 15.
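Such a query typically advances the status code for a well-defined set of records, for example everything from one import that has passed a given review step. The field names, code values, and import number below are illustrative:

    -- Advance the review status for all analyses from one import.
    UPDATE Analyses
    SET ReviewStatus = '2'
    WHERE ImportID = 1234
      AND ReviewStatus = '1';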
REMOVAL OF DUPLICATED ENTRIES
Repeated records can enter the database in several ways. The laboratory may deliver data that
has already been delivered, either a whole EDD or part of one. Data administrators may import the
same file twice without noticing. (The EDMS should notify them if they try to do this.) Data that
has been imported from the lab may also be imported from a data validator with partial or complete
overlap. The lab may include field data, which has already been imported, along with its data.
However it gets in, this repeated data provides no value and should be removed, and records kept
of the changes that were made to the database. However, duplicated data resulting from the quality
control process usually is of value to the project, and should not be removed.

Repeated information can be present in the database at the samples level, the analyses level, or
both. The removal of duplicated records should address both levels, starting at the samples level,
and then moving down to the analyses level. This order is important because removing repeated
samples can result in more repeated analyses, which will then need to be removed. The samples
component of the duplicated record removal process is complicated by the fact that samples have
analyses underneath them, and when a duplicate sample is removed, the analyses should probably
not be lost, but rather moved to the remaining sample. The software should help you do this by
letting you pick the sample to which you want to move the analyses. Then the software should
modify the superseded value of the affected analyses, if necessary, and assign them to the other
sample.
Figure 71 - Form for moving analyses from a duplicate sample
The analyses being moved may in fact represent duplicated data themselves, and the
duplicated record removal at the analyses level can be used to remove these results. The analyses
component of the duplicated record removal process must deal with the situation that, in some
cases, redundant data is desirable. The best example is groundwater samples, where four
independent observations of pH are often taken, and should all be saved. The database should
allow you to specify for each parameter and each site and matrix how many observations should be
allowed before the data is considered redundant.
The first step in the duplicated record removal process is to select the data to be examined. Normally you will want to work with all of the data for a sampling event. Once
you have selected the set of data to work on, the program should look for samples that might be
repeated information. It should do this by determining samples that have the same site, station,
matrix, sample date, top and base, and lab sample ID. Once the software has made its
recommendations for samples you might want to remove, the data should be displayed for you to
confirm the action. Before removing any samples, you should print a report showing the samples
that are candidates for removal. You should then make notes on this report about any actions taken
regarding removal of duplicated sample records, and save the printed report in the project file.
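The search for candidate duplicate samples is a grouping query on the identifying fields listed above; the field names are illustrative:

    -- Candidate duplicate samples: more than one record with the same identifying fields.
    SELECT SiteName, StationName, Matrix, SampleDate,
           TopDepth, BaseDepth, LabSampleID,
           COUNT(*) AS NumberOfRecords
    FROM Samples
    GROUP BY SiteName, StationName, Matrix, SampleDate,
             TopDepth, BaseDepth, LabSampleID
    HAVING COUNT(*) > 1;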
If a sample to be removed has related analyses, then the analyses must be moved to another
sample before the candidate sample can be deleted. This might be the case if somehow some

analyses were associated with one sample in the database and other analyses with another, and in
fact only one sample was taken. In that case, the analyses should be moved to the sample with a
duplicate value of zero from the one with a higher duplicate value, and then the sample with a
higher duplicate value should be deleted. The software should display the sample with the higher
duplicate value first, as this is the one most likely to be removed, and display a sample that is a
likely target to move the analyses to. A screen for a sample with analyses to be moved might look
like Figure 71. The screen has a notation that the sample has analyses, and provides a combo box,
in gray, for you to select a sample to move the analyses to. If the sample being displayed does not
have analyses, or once they have been moved to another sample, then it can be deleted. In this case,
the screen might look like Figure 72.
Once you have moved analyses as necessary and deleted improper duplicates, the program
should look for analyses that might contain repeated information. It can do this using the following
process: 1) Determine all of the parameters in the selection set. 2) Determine the number of desired
observations for each parameter. Use site-specific information if it is present. If it is not, use global
information. If observation data is not available, either site-specific or global, for one or more
parameters, the software should notify you, and provide the option of stopping or proceeding. 3)
Determine which analyses for each parameter exceed the observations count.
Figure 72 - Form for deleting duplicate samples for a sample without analyses
Next, the software should recommend analyses for removal. The goal of this process is to
remove duplicated information, while, for each sample, retaining the records with the most data.
The program can use the following process: 1) Group all analyses where the sample and parameter
are the same. 2) If all of the data is exactly the same in all of the fields (except for AnalysisNumber
and Superseded), mark all but one for deletion. 3) If all of the data is not exactly the same, look at
the Value, AnalyticMethod, AnalDate_D, Lab, DilutionFactor, QCAnalysisCode, and
AnalysisLabID fields. If the records are different in any of these fields, keep them. For records that
are the same in all of these fields, mark all but one for deletion. (The user should be able to modify
the program’s selections prior to deletion.) If the data in all of these fields is the same, then keep
the record with the greatest number of other data fields populated, and mark the others for removal.
Once the software has made its recommendations for analyses to be removed, the data should be

displayed in a form such as that shown in Figure 73.
In this example, the software has selected several analyses for removal. Visible on the screen
are two Arsenic and two Chloride analyses, and one of each has been selected for removal. In this
case, this appears appropriate, since the data is exactly duplicated. The information on this screen
should be reviewed carefully by someone very familiar with the site. You should look at each
analysis and the recommendation to confirm that the software has selected the correct action. After
selecting analyses for removal, but before removing any analyses, you should print a report
showing the analyses that have been selected for removal. You should save the printed report in the
project file.
There are two parts to the Duplicated Record Removal process for analyses. The first part is
the actual removal of the analytical records. This can be done with a simple delete query, after
users are asked to confirm that they really want to delete the records. The second part is to modify
the superseded values as necessary to remove any gaps caused by the removal process. This should
be done automatically after the removal has been performed.
PARAMETER PRINT REORDERING
This task is an example of a relatively simple process that the software can automate. It has to
do with the order that results appear on reports. A query or report may display the results in
alphabetical order by parameter name. The data user may not want to see it this way. A more useful
order may be to see the data grouped by category, such as all of the metals followed by all of the
organics. Or perhaps the user wants to enter some specific order, and have the system remember it
and use it.
Figure 73 - Form for deleting duplicated analyses
A good way to implement a specific order is to have a field somewhere in the database, such
as in the Parameters table, that can be used in queries to display the data in the desired order. For
the case where users want the results in a specific order, they can manually edit this field until the
order is the way they want it.
For the case of putting the parameters in order by category, the software can also help. A tool
can be provided to do the reordering automatically. The program needs to open a query of the
parameters in order by category and name, and then assign print orders in increasing numbers from

the first to the last. If the software is set up to skip some increment between each, then the user can
slip a new one in the middle without needing to redo the reordering process. The software can also
be set up to allow you to specify an order for the categories themselves that is different from
alphabetical, in case you want the organics first instead of the metals.
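In generic SQL the renumbering can be sketched with a correlated count that leaves a gap of ten between entries; in an Access-based system this is more commonly done with a short code loop over a query sorted by category and name. PrintOrder and Category are assumed field names in the Parameters table:

    -- Assign print orders in steps of ten, ordered by category and then parameter name.
    UPDATE Parameters
    SET PrintOrder = 10 *
        (SELECT COUNT(*)
         FROM Parameters AS p2
         WHERE p2.Category < Parameters.Category
            OR (p2.Category = Parameters.Category
                AND p2.ParameterName <= Parameters.ParameterName));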
Ad hoc queries
Where the change to be made affects multiple records, but will be performed only once, or a
small number of times over the life of the database, it doesn’t make sense to provide an automated
tool, but manual entry is too tedious. An example of this is shown in Figure 74. The problem is that
when the stations were entered, their current status was set to “z” for “Unknown,” even though
only active wells were entered at that time. Now that some inactive wells are to be entered, the status of the existing wells needs to be changed to “s” for “In service.”
Figure 74 shows an example of an update query to do this. The left panel shows the query in
design view, and the right panel in SQL view. The data administrator has told the software to
update the Stations table, setting the CurrentStatusCode field to “s” where it is currently “z.” The
query will then make this change for all of the appropriate records in one step, instead of the data
administrator having to make the change to each record individually.
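The SQL behind a query like this is essentially a single statement (the SQL that Access generates is more heavily punctuated but equivalent):

    -- Set the status of stations currently marked Unknown to In service.
    UPDATE Stations
    SET CurrentStatusCode = 's'
    WHERE CurrentStatusCode = 'z';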
This type of ad hoc query can be a great time saver in the hands of a knowledgeable user. It
should be used with great care, though, because of the potential to cause great damage to the
database. Changes made in this way should be fully documented in the activity log, and backup
copies of the database maintained in case it is done wrong.
Figure 74 - Ad hoc query showing a change to the CurrentStatusCode field
CHAPTER 15
MAINTAINING AND TRACKING
DATA QUALITY
If the data in your database is not of sufficient quality, people won’t (and shouldn’t) use it.
Managing the quality of the data is just as important as managing the data itself. This chapter and
the next cover a variety of issues related to quality terminology, QA/QC samples, data quality
procedures and standards, database software support for quality analysis and tracking, and

protection from loss. General data quality issues are contained in this chapter, and issues specific to
data verification and validation in the next.
QA VS. QC
Quality assurance (QA) is an integrated system of activities involving planning, quality
control, quality assessment, reporting, and quality improvement to ensure that a product or service
meets defined standards of quality with a stated level of confidence. Quality control (QC) is the
overall system of technical activities whose purpose is to measure and control the quality of a
product or service so that it meets the needs of users. The aim is to provide quality that is
satisfactory, adequate, dependable, and economical (EPA, 1997a). In an over-generalization, QA
talks about it and QC does it. Since the EDMS involves primarily the technical data and activities
that surround it, including quantification of the quality of the data, it comes under QC more than
QA. An EMS and the related EMIS (see Chapter 1), on the other hand, cover the QA component.
THE QAPP
The quality assurance project plan (QAPP) provides guidance to the project to maintain the
quality of the data gathered for the project. The following are typical minimum requirements for a
QAPP for EPA projects:
Project management
• Title and approval sheet.
• Table of Contents – Document control format.
• Distribution List – Distribution list for the QAPP revisions and final guidance.
• Project/Task Organization – Identify individuals or organizations participating in the project
and discuss their roles, responsibilities, and organization.
• Problem Definition/Background – 1) State the specific problem to be solved or the decision to
be made. 2) Identify the decision maker and the principal customer for the results.
• Project/Task Description – 1) Hypothesis test, 2) expected measurements, 3) ARARs or other
appropriate standards, 4) assessment tools (technical audits), 5) work schedule and required
reports.
• Data Quality Objectives for Measurement – Data decision(s), population parameter of interest,
action level, summary statistics, and acceptable limits on decision errors. Also, scope of the

project (domain or geographical locale).
• Special Training Requirements/Certification – Identify special training that personnel will
need.
• Documentation and Records – Itemize the information and records that must be included in a
data report package, including report format and requirements for storage, etc.
Measurement/data acquisition
• Sampling Process Designs (Experimental Design) – Outline the experimental design, including
sampling design and rationale, sampling frequencies, matrices, and measurement parameter of
interest.
• Sampling Methods Requirements – Sample collection method and approach.
• Sample Handling and Custody Requirements – Describe the provisions for sample labeling,
shipment, chain of custody forms, and procedures for transferring and maintaining custody of
samples.
• Analytical Methods Requirements – Identify analytical method(s) and equipment for the study,
including method performance requirements.
• Quality Control Requirements – Describe routine (real-time) QC procedures that should be
associated with each sampling and measurement technique. List required QC checks and
corrective action procedures.
• Instrument/Equipment Testing, Inspection, and Maintenance Requirements – Discuss how
inspection and acceptance testing, including the use of QC samples, must be performed to
ensure their intended use as specified by the design.
• Instrument Calibration and Frequency – Identify tools, gauges and instruments, and other
sampling or measurement devices that need calibration. Describe how the calibration should
be done.
• Inspection/Acceptance Requirements for Supplies and Consumables – Define how and by
whom the sampling supplies and other consumables will be accepted for use in the project.
• Data Acquisition Requirements (Non-direct Measurements) – Define the criteria for the use of
non-measurement data such as data that comes from databases or literature.
• Data Management – Outline the data management scheme including the path and storage of
the data and the data record-keeping system. Identify all data handling equipment and

procedures that will be used to process, compile, and analyze the data.
Assessment/oversight
• Assessments and Response Actions – Describe the assessment activities for this project.
• Reports to Management – Identify the frequency, content, and distribution of reports issued to
keep management informed.
Data validation and usability
• Data Review, Validation, and Verification Requirements – State the criteria used to accept or
reject the data based on quality.
