
Business Research Methods, Part 4 (pages 451 to 600)


>appendix 15a Determining Sample Size

Previous research on the topic.
A pilot test or pretest of the data instrument among
a sample drawn from the population.
A rule of thumb (one-sixth of the range based on six
standard deviations within 99.73 percent confidence).


If the range is from 0 to 30 meals, the rule-of-thumb
method produces a standard deviation of 5 meals. The researchers want more precision than the rule-of-thumb
method provides, so they take a pilot sample of 25 and
find the standard deviation to be 4.1 meals.

Population Size


A final factor affecting the size of a random sample is the
size of the population. When the size of the sample exceeds 5 percent of the population, the finite limits of the
population constrain the sample size needed. A correction
factor is available in that event.
The sample size is computed for the first construct,
meal frequency, as follows:

±0.5 meal = desired interval range within which the population mean is expected (subjective decision)
1.96 σx̄ = 95 percent confidence level for estimating the interval within which to expect the population mean (subjective decision)
σx̄ = 0.255 = standard error of the mean (0.5/1.96)
s = 4.1 meals = standard deviation from the pilot test
n = sample size

n = (s/σx̄)² = (4.1/0.255)² ≈ 259


If the researchers are willing to accept a larger interval range (±1 meal), and thus a larger amount of risk, they can reduce the sample size to n = 65.

Calculating the Sample Size for
Questions Involving Proportions
The second key question concerning the dining club study was "What percentage of the population says it would join the dining club, based on the projected rates and services?" In business, we often deal with proportion data.
An example is a CNN poll that projects the percentage of
people who expect to vote for or against a proposition or a
candidate. This is usually reported with a margin of error
of 5 percent.
In the Metro U study, a pretest answers this question using the same general procedure as before. But instead of the arithmetic mean, with proportions, it is p (the proportion of the population that has a given attribute), in this case, interest in joining the dining club. And instead of the standard deviation, dispersion is measured in terms of p × q (in which q is the proportion of the population not having the attribute, and q = 1 - p). The measure of dispersion of the sample statistic also changes from the standard error of the mean to the standard error of the proportion, σp.
We calculate a sample size based on these data by making the same two subjective decisions: deciding on an acceptable interval estimate and the degree of confidence. Assume that from a pilot test, 30 percent of the students and employees say they will join the dining club. We decide to estimate the true proportion in the population within 10 percentage points of this figure (p = 0.30 ± 0.10). Assume further that we want to be 95 percent confident that the population parameter is within ±0.10 of the sample proportion. The calculation of the sample size proceeds as before:

±0.10 = desired interval range within which the population proportion is expected (subjective decision)
1.96 σp = 95 percent confidence level for estimating the interval within which to expect the population proportion (subjective decision)
σp = 0.051 = standard error of the proportion (0.10/1.96)
pq = measure of sample dispersion (used here as an estimate of the population dispersion)
n = sample size

n = pq/σp² = (0.30)(0.70)/(0.051)² ≈ 81

The sample size of 81 persons is based on an infinite population assumption. If the sample size is less than 5 percent of the population, there is little to be gained by using a finite population adjustment. The students interpreted the data found with a sample of 81 chosen randomly from the population as: "We can be 95 percent confident that 30 percent of the respondents would say they would join the dining club with a margin of error of ±10 percent."
Previously, the researchers used pilot testing to generate the variance estimate for the calculation. Suppose this is not an option. Proportions data have a feature concerning




the variance that is not found with interval or ratio data.
The pq ratio can never exceed 0.25. For example, if p = 0.5, then q = 0.5, and their product is 0.25. If either p or q is greater than 0.5, their product is smaller than 0.25 (0.4 × 0.6 = 0.24, and so on). When we have no information regarding the probable p value, we can assume that p = 0.5 and solve for the sample size:

n = pq/σp² = (0.50)(0.50)/(0.051)² ≈ 96

where
pq = measure of dispersion
n = sample size
σp = standard error of the proportion

If we use this maximum variance estimate in the dining club example, we find the sample size needs to be 96 persons in order to have an adequate sample for the question about joining the club.
When there are several investigative questions of strong interest, researchers calculate the sample size for each such variable, as we did in the Metro U study for "meal frequency" and "joining." The researcher then chooses the calculation that generates the largest sample. This ensures that all data will be collected with the necessary level of precision. A short Python sketch of these calculations follows.
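The sketch below (ours, not the text's) reproduces the Metro U figures; the function names are invented, and the finite population adjustment on the last lines is one common form, included only because the text notes that such a correction exists without giving it.

```python
import math

Z_95 = 1.96  # z value for a 95 percent confidence level

def n_for_mean(s, half_width, z=Z_95):
    """n to estimate a mean within ±half_width: n = (s / (half_width / z)) ** 2."""
    return (s * z / half_width) ** 2

def n_for_proportion(p, half_width, z=Z_95):
    """n to estimate a proportion within ±half_width: n = pq / (half_width / z) ** 2."""
    return p * (1 - p) * (z / half_width) ** 2

print(math.ceil(n_for_mean(4.1, 0.5)))          # meal frequency, ±0.5 meal -> 259
print(math.ceil(n_for_mean(4.1, 1.0)))          # wider ±1 meal interval    -> 65
print(math.ceil(n_for_proportion(0.30, 0.10)))  # joining, p = 0.30, ±0.10  -> 81
print(n_for_proportion(0.50, 0.10))             # maximum variance pq = 0.25 -> ~96

# One common finite population adjustment (an assumption; the text does not give it):
n = n_for_proportion(0.30, 0.10)
print(n / (1 + n / 2000))                       # e.g., for a population of N = 2000
```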






>chapter 16 Data Preparation and Description

Once the data begin to flow in, a researcher's attention turns to data analysis. Data preparation includes editing, coding, and data entry and is the activity that ensures the accuracy of the data and their conversion from raw form to reduced and classified forms that are more appropriate for analysis. Preparing a descriptive statistical summary is another preliminary step leading to an understanding of the collected data. It is during this step that data entry errors may be revealed and corrected. Exhibit 16-1 reflects the steps in this phase of the research process.

> Exhibit 16-1 Data Preparation in the Research Process


> Editing
The customary first step in analysis is to edit the raw data. Editing detects errors and omissions, corrects them when possible, and certifies that maximum data quality standards are
achieved. The editor's purpose is to guarantee that data are:
Accurate.
Consistent with the intent of the question and other information in the survey.
Uniformly entered.
Complete.
Arranged to simplify coding and tabulation.
In the following question asked of adults 18 or older, one respondent checked two categories, indicating that he was a retired officer and currently serving on active duty.
Please indicate your current military status:
Active duty
National Guard
Reserve
Retired
Separated
Never served in the military


The editor's responsibility is to decide which of the responses is both consistent with the
intent of the question or other information in the survey and most accurate for this individual participant.

Field Editing
In large projects, field editing review is a responsibility of the field supervisor. It, too,

should be done soon after the data have been gathered. During the stress of data collection
in a personal interview and paper-and-pencil recording in an observation, the researcher often uses ad hoc abbreviations and special symbols. Soon after the interview, experiment, or
observation, the investigator should review the reporting forms. It is difficult to complete
what was abbreviated or written in shorthand or noted illegibly if the entry is not caught
that day. When entry gaps are present from interviews, a callback should be made rather
than guessing what the respondent "probably would have said." Self-interviewing has no
place in quality research.
A second important control function of the field supervisor is to validate the field results. This normally means he or she will reinterview some percentage of the respondents,
at least on some questions, verifying that they have participated and that the interviewer
performed adequately. Many research firms will recontact about 10 percent of the respondents in this process of data validation.

Central Editing

Western Wats, a data
collection specialist, reminds
us that speed without accuracy
won't help a researcher
choose the right direction.
"After all, being quick on the
draw doesn't do any good if
you miss the mark."
www.westernwats.com

At this point, the data should get a thorough editing. For a small study, the use of a single
editor produces maximum consistency. In large studies, editing tasks should be allocated so
that each editor deals with one entire section. Although the latter approach will not identify
inconsistencies between answers in different sections, the problem can be handled by identifying questions in different sections that might point to possible inconsistency and having one editor check the data generated by these questions.
Sometimes it is obvious that an entry is
incorrect-for example, when data clearly
specify time in days (e.g., 13) when it was
requested in weeks (you expect a number of
4 or less)-or is entered in the wrong place.
When replies are inappropriate or missing,
the editor can sometimes detect the proper
answer by reviewing the other information
in the data set. This practice, however,
should be limited to the few cases where it
is obvious what the correct answer is. It
may be better to contact the respondent for
correct information, if time and budget allow. Another alternative is for the editor to
strike out the answer if it is inappropriate.
Here an editing entry of "no answer" or
"unknown" is called for.
Another problem that editing can detect concerns faking an interview that never took place. This "armchair interviewing" is difficult to spot, but the editor is in the best position to do so. One approach is to check responses to open-ended questions. These are most
difficult to fake. Distinctive response patterns in other questions will often emerge if data
falsification is occurring. To uncover this, the editor must analyze as a set the instruments
used by each interviewer.
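Mechanical checks like the days-versus-weeks example above are easy to script. A small sketch (ours; the variable name and values are invented) that flags out-of-range entries for editor review rather than correcting them silently:

```python
import pandas as pd

# The question asked for weeks (4 or less expected), so entries like 13 or 28
# suggest the respondent answered in days; flag them instead of guessing.
answers = pd.Series([2, 4, 13, 1, 28], name="weeks_reported")
suspect = answers[(answers < 0) | (answers > 4)]
print(suspect)  # cases 2 and 4 go back to the editor for review
```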

Here are some useful rules to guide editors in their work:
Be familiar with instructions given to interviewers and coders.
Do not destroy, erase, or make illegible the original entry by the interviewer; original
entries should remain legible.
Make all editing entries on an instrument in some distinctive color and in a standardized form.
Initial all answers changed or supplied.
Place initials and date of editing on each instrument completed.

> Coding
Coding involves assigning numbers or other symbols to answers so that the responses can be grouped into a limited number of categories. In coding, categories are the partitions of a data set of a given variable (for example, if the variable is gender, the partitions are male and female). Categorization is the process of using rules to partition a body of data. Both closed and free-response questions must be coded.
The categorization of data sacrifices some data detail but is necessary for efficient analysis. Most statistical and banner-table software programs work more efficiently in the numeric mode. Instead of entering the word male or female in response to a question that asks for the identification of one's gender, we would use numeric codes (for example, 0 for male and 1 for female). Numeric coding simplifies the researcher's task in converting a nominal variable, like gender, to a "dummy variable," a topic we discuss in Chapter 20. Statistical software also can use alphanumeric codes, as when we use M and F, or other letters, in combination with numbers and symbols for gender.
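A minimal sketch of that recoding step in Python (illustrative only; the response values are invented):

```python
import pandas as pd

# Text responses to a gender question, recoded as a numeric dummy variable
responses = pd.Series(["male", "female", "female", "male", "female"])
codes = responses.map({"male": 0, "female": 1})  # 0 = male, 1 = female
print(codes.tolist())  # [0, 1, 1, 0, 1]
```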

[SPSS frequency tables: quality ratings from Very Good to Poor for two variables, Qual2 and Qual3, showing counts, percentages, and cumulative percentages for the 83 cases entered, with 3 cases missing on each variable.]

The researcher here requested a frequency printout of all variables when 83 cases had been entered. SPSS presents them sequentially in one document. The left frame indicates all the variables included in this particular output file. Both variables Qual2 and Qual3 indicate 3 missing cases. This would be a cautionary flag to a good researcher. During editing the researcher would want to verify that these are true instances where participants did not rate the quality of both objects, rather than data entry errors.
www.spss.com



Codebook Construction
A codebook, or coding scheme, contains each variable in the study and specifies the application of coding rules to the variable. It is used by the researcher or research staff to promote more accurate and more efficient data entry. It is also the definitive source for locating the positions of variables in the data file during analysis. In many statistical programs, the coding scheme is integral to the data file. Most codebooks, computerized or not, contain the question number, variable name, location of the variable's code on the input medium (e.g., spreadsheet or SPSS data file), descriptors for the response options, and whether the variable is alphabetic or numeric. An example of a paper-based codebook is shown in Exhibit 16-2. Pilot testing of an instrument provides sufficient information about the variables to prepare a codebook. A codebook used with pilot data may reveal coding problems that will need to be corrected before the data for the final study are collected and processed.
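In software, a codebook can live in a simple lookup structure. A minimal sketch keyed to the marital-status item of Exhibit 16-2 (the variable name MARITAL and the dictionary layout are our own, hypothetical choices):

```python
# One hypothetical codebook entry; a full codebook holds one per variable
codebook = {
    "MARITAL": {
        "question": 4,                    # question number on the instrument
        "type": "numeric",
        "labels": {1: "Married", 2: "Widow(er)", 3: "Divorced",
                   4: "Separated", 5: "Never married"},
        "missing": 9,                     # code reserved for missing values
    },
}
print(codebook["MARITAL"]["labels"][3])   # -> "Divorced"
```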

Coding Closed Questions
The responses to closed questions include scaled items for which answers can be anticipated. Closed questions are favored by researchers over open-ended questions for their efficiency and specificity. They are easier to code, record, and analyze. When codes are established in the instrument design phase of the research process, it is possible to precode the questionnaire during the design stage. With computerized survey design, and computer-assisted, computer-administered, or online collection of data, precoding is necessary as the software tallies data as they are collected. Precoding is particularly helpful for manual data entry (for example, from mail or self-administered surveys) because it makes the intermediate step of completing a data entry coding sheet unnecessary. With a precoded instrument, the codes for variable categories are accessible directly from the questionnaire. A participant, interviewer, field supervisor, or researcher (depending on the data collection method) is able to assign the appropriate code on the instrument by checking, circling, or printing it in the proper coding location.



> Exhibit 16-2 Sample Codebook of Questionnaire Items

Record number (RECNUM)
Respondent number (RESID)
2-digit birth year (99 = Missing)
Marital status: 1 = Married; 2 = Widow(er); 3 = Divorced; 4 = Separated; 5 = Never married

Exhibit 16-3 shows questions in the sample codebook. When precoding is used, editing
precedes data processing. Note question 4, where the respondent may choose between five
categories of marital status and enter the number of the item best representing present status in the coding portion of the questionnaire. This code is later transferred to an input
medium for analysis.

Coding Free-Response Questions
One of the primary reasons for using open questions is that insufficient information or lack
of a hypothesis may prohibit preparing response categories in advance. Researchers are forced to categorize responses after the data are collected. Other reasons for using open-ended responses include the need to measure sensitive or disapproved behavior, discover salience or importance, or encourage natural modes of expression. Also, it may be easier and more efficient for the participant to write in a known short answer rather than read through a long list of options. Whatever the reason for their use, analyzing enormous volumes of open-ended questions slows the analysis process and increases the opportunity for error. The variety of answers to a single question can be staggering, hampering postcollection categorization. Even when categories are anticipated and precoded for open-ended questions, once data are collected researchers may find it useful to reassess the predetermined categories. One example is a 7-point scale where the researcher offered the participant three levels of agreement, three levels of disagreement, and one neutral position. Once the data are collected, if these finer nuances of agreement do not materialize, the editor may choose to recategorize the data into three levels: one level of agreement, one level of disagreement, and one neutral position.
Exhibit 16-3, question 6, illustrates the use of an open-ended question for which advance knowledge of response options was not available. The answer to "What prompted you to purchase your most recent life insurance policy?" was to be filled in by the participant as a short-answer essay. After preliminary evaluation, response categories (shown in the codebook, Exhibit 16-2) were created for that item.

> Exhibit 16-3 Sample Questionnaire Items



Coding Rules
Four rules guide the pre- and postcoding and categorization of a data set. The categories
within a single variable should be:
Appropriate to the research problem and purpose.
Exhaustive.
Mutually exclusive.
Derived from one classification principle.


Researchers address these issues when developing or choosing each specific measurement

question. One of the purposes of pilot testing of any measurement instrument is to identify
and anticipate categorization issues.

Appropriateness
Appropriateness is determined at two levels: (1) the best partitioning of the data for testing hypotheses and showing relationships and (2) the availability of comparison data. For example, when actual age is obtained (ratio scale), the editor may decide to group data by age ranges to simplify pattern discovery within the data. The number of age groups and breadth of each range, as well as the endpoints in each range, should be determined by comparison data, for example, U.S. census age ranges, a customer database that includes age ranges, or the age data available from Fox TV used for making an advertising media buy.

Exhaustiveness
Researchers often add an "other" option to a measurement question because they know they cannot anticipate all possible answers. A large number of "other" responses, however, suggests the measurement scale the researcher designed did not anticipate the full range of information. The editor must determine if "other" responses appropriately fit into established categories, if new categories must be added, if "other" data will be ignored, or if some combination of these actions will be taken.
While the exhaustiveness requirement for a single variable may be obvious, a second aspect is less apparent. Does one set of categories, often determined before the data are collected, fully capture all the information in the data? For example, responses to an open-ended question about family economic prospects for the next year may originally be categorized only in terms of being "optimistic" or "pessimistic." It may also be enlightening to classify responses in terms of other concepts such as the precise focus of these expectations (income or jobs) and variations in responses between family heads and others in the family.

Mutual Exclusivity
Another important rule when adding categories or redesigning categories is that category components should be mutually exclusive. This standard is met when a specific answer can be placed in one and only one cell in a category set. For example, in a survey, assume that you asked participants for their occupation. One categorization scheme might include (1) professional, (2) managerial, (3) sales, (4) clerical, (5) crafts, (6) operatives, and (7) unemployed. As an editor, how would you code a participant's answer that specified "salesperson at Gap and full-time student" or maybe "elementary teacher and tax preparer"? According to census data, it is not uncommon in our society to have more than one job. Here, operational definitions of the occupations categorized as "professional," "managerial," and "sales" should help clarify the classification. But the editor facing


QSR, the company that provided us with N6, the latest version of NUD*IST, and N-VIVO, introduced a commercial version of the content analysis software in 2004, XSight. XSight was developed for and with the input of researchers.
www.qsrinternational.com


this situation also would need to determine how the second-occupation data are handled.
One option would be to add a second-occupation field to the data set; another would be to
develop distinct codes for each unique multiple-occupation combination.

Single Dimension
The problem of how to handle an occupation entry like "unemployed salesperson" brings up a fourth rule of category design. The need for a category set to follow a single classificatory principle means every option in the category set is defined in terms of one concept or construct. Returning to the occupation example, the person in the study might be both a salesperson and unemployed. The "salesperson" label expresses the concept occupation type; the response "unemployed" is another dimension concerned with current employment status without regard to the respondent's normal occupation. When a category set encompasses more than one dimension, the editor may choose to split the dimensions and develop an additional data field; "occupation" now becomes two variables: "occupation type" and "employment status."
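A sketch of that split in code (ours; the data values and column names are invented):

```python
import pandas as pd

# Overloaded entries mixing two concepts: employment status and occupation type
raw = pd.Series(["unemployed salesperson", "employed teacher", "unemployed clerk"])
parts = raw.str.split(" ", n=1, expand=True)
recoded = pd.DataFrame({"employment_status": parts[0], "occupation_type": parts[1]})
print(recoded)  # one variable per classification principle
```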

Using Content Analysis for Open Questions
Increasingly, text-based responses to open-ended measurement questions are analyzed with content analysis software. Content analysis measures the semantic content or the what



aspect of a message. Its breadth makes it a flexible and wide-ranging tool that may be used
as a stand-alone methodology or as a problem-specific technique. Trend-watching organizations like BrainReserve, the Naisbitt Group, SRI International, and Inferential Focus use variations on content analysis for selected projects, often spotting changes from newspaper or magazine articles before they can be confirmed statistically. The Naisbitt Group's content analysis of 2 million local newspaper articles compiled over a 12-year period resulted in the publication of Megatrends.

Types of Content
Content analysis has been described as "a research technique for the objective, systematic, and quantitative description of the manifest content of a communication." Because this definition is sometimes confused with simply counting obvious message aspects such as words or attributes, more recent interpretations have broadened the definition to include latent as well as manifest content, the symbolic meaning of messages, and qualitative analysis. One author states:
In any single written message, one can count letters, words, or sentences. One can categorize phrases, describe the logical structure of expressions, ascertain associations, connotations, denotations, illocutionary forces, and one can also offer psychiatric, sociological, or political interpretations. All of these may be simultaneously valid. In short, a message may convey a multitude of contents even to a single receiver.

Content analysis follows a systematic process for coding and drawing inferences from
texts. It starts by determining which units of data will be analyzed. In written or verbal
texts, data units are of four types: syntactical, referential, propositional, or thematic. Each
unit type is the basis for coding texts into mutually exclusive categories in our search for
meaning.
Syntactical units can be words, phrases, sentences, or paragraphs; words are the
smallest and most reliable data units to analyze. While we can certainly count these
units, we are more interested in the meaning their use reveals. In content analysis we
might determine the words that are most commonly used to describe product A versus its competitor, product B. We ask, "Are these descriptions for product A more
likely to lead to favorable opinions and thus to preference and ultimately selection,
compared to the descriptions used for product B?"
Referential units are described by words, phrases, and sentences; they may be objects, events, persons, and so forth, to which a verbal or textual expression refers.
Participants may refer to a product as a "classic," a "power performer," or "ranked
first in safety"-each word or phrase may be used to describe different objects, and it
is the object that the researcher codes and analyzes in relation to the phrase.
Propositional units are assertions about an object, event, person, and so on. For example, a researcher assessing advertising for magazine subscriptions might conclude, "Subscribers who respond to offer A will save $15 over the single issue rate." It is the assertion of savings that is attached to the text of this particular ad claim.
Thematic units are topics contained within (and across) texts; they represent higher-level abstractions inferred from the text and its context. The responses to an open-ended question about purchase behavior may reflect a temporal theme: the past ("I never purchased an alternative brand before you changed the package"), the present ("I really like the new packaging"), or the future ("I would buy the product more often if it came in more flavors"). We could also look at the comments as relating to the themes or topics of "packaging" versus a product characteristic, "flavors."

As with all other research methodologies, the analytical use of content analysis is influenced by decisions made prior to data collection. Content analysis guards against selective
perception of the content, provides for the rigorous application of reliability and validity
criteria, and is amenable to computerization.



What Content Is Analyzed?
Content analysis may be used to analyze written, audio, or video data from experiments, observations, surveys, and secondary data studies. The obvious data to be content-analyzed include transcripts of focus groups, transcripts of interviews, and open-ended survey responses.
But researchers also use content analysis on advertisements, promotional brochures, press releases, speeches, Web pages, historical documents, and conference proceedings, as well as
magazine and newspaper articles. In competitive intelligence and the marketing of political
candidates content analysis is a primary methodology.

Example
Let's look at an informal application of content analysis to a problematic open question. In this example, which we are processing without the use of content analysis software, suppose employees in the sales department of a manufacturing firm are asked, "How might company-customer relations be improved?" A sample of the responses yields the following:
We should treat the customer with more respect.
We should stop trying to speed up the sales process when the customer has expressed
objections or concerns.
We should have software that permits real-time tracking of a customer's order.
Our laptops are outdated. We can't work with the latest software or access information quickly when we are in the field.
My [the sales department] manager is rude with customers when he gets calls while
I'm in the field. He should be transferred or fired.
Management should stop pressuring us to meet sales quotas when our customers

have restricted their open-to-buy status.

> These categories are called analysis in XSight. See the screenshot on page 448.

The first step in analysis requires that the units selected or developed help answer the research question. In our example, the research question is concerned with learning who or
what the sales force thinks is a source for improving company-customer relations. The first
pass through the data produces a few general categories in one concept dimension: source
of responsibility, shown in Exhibit 16-4. These categories are mutually exclusive. The use
of "other" makes the category set exhaustive. If, however, many of the sample participants
suggested the need for action by other parties-for example, the government or a trade association-then including all those responses in the "other" category would ignore much
of the richness of the data. As with coding schemes for numerical responses, category
choices are very important.
Since responses to this type of question often suggest specific actions, the second evaluation of the data uses propositional units. If we used only the set of categories in Exhibit
16-4, the analysis would omit a considerable amount of information. The second analysis
produces categories for action planning:

> Exhibit 16-4 Open Question Coding Example (before revision)
Question: "How can company-customer relations be improved?"



Human relations.
Technology.
Training.
Strategic planning.

Other action areas.
No action area identified.
How can we categorize a response suggesting a combined training-technology process?
Exhibit 16-5 illustrates a combination of alternatives. By taking the categories of the first
list of the action areas, it is possible to get an accurate frequency count of the joint classification possibilities for this question.
Using available software, the researcher can spend much less time coding open-ended
responses and capturing categories. Software also eliminates the high cost of sending responses to outside coding firms. What used to take a coding staff several days may now be
done in a few hours.
Content analysis software applies statistical algorithms to open-ended question responses. This permits stemming, aliasing, and exclusion processes. Stemming uses derivations of common root words to create aliases (e.g., using searching, searches, searched for search). Aliasing searches for synonyms (wise or smart for intelligent). Exclusion filters out trivial words (be, is, the, of) in the search for meaning.
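The following toy Python sketch (ours, not any commercial package) illustrates the three processes on a single response; the word lists and the crude suffix rule are invented:

```python
import re
from collections import Counter

STOPWORDS = {"be", "is", "the", "of", "we", "to", "a"}     # exclusion list
ALIASES = {"wise": "intelligent", "smart": "intelligent"}  # aliasing (synonyms)

def stem(word: str) -> str:
    # Crude suffix stripping: searching, searches, searched -> search
    return re.sub(r"(ing|es|ed|s)$", "", word)

def categorize(text: str) -> Counter:
    words = re.findall(r"[a-z]+", text.lower())
    kept = (ALIASES.get(w, stem(w)) for w in words if w not in STOPWORDS)
    return Counter(kept)

print(categorize("We searched; searching the smart, wise customers"))
# Counter({'search': 2, 'intelligent': 2, 'customer': 1})
```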
When you are using menu-driven programs, an autocategorization option creates manageable categories by clustering terms that occur together throughout the textual data set.
Then, with a few keystrokes, you can modify categorization parameters and refine your results. Once your categories are consistent with the research and investigative questions,
you select what you want to export to a data file or in tab-delimited format. The output, in
the form of tables and plots, serves as modules for your final report. Exhibit 16-6 shows a
plot produced by a content analysis of the MindWriter complaints data. The distances between pairs of terms reveal how likely it is that the terms occur together, and the colors represent categories.

> Exhibit 16-5 Open Question Coding (after revision)
Question: "How can company-customer relations be improved?"

[Table: Locus of Responsibility categories with frequencies, n = 100]




> Exhibit 16-6 Proximity Plot of Mindwriter Customer Complaints

"Don't Know" Responses
The "don't know" (DK) response presents special problems for data preparation. When
the DK response group is small, it is not troublesome. But there are times when it is of major concern, and it may even be the most frequent response received. Does this mean the
question that elicited this response is useless? The answer is, "It all depends." Most DK answers fall into two categorie~.~ there is the legitimate DK response when the responFirst,
dent does not know the answer. This response meets our research objectives; we expect DK
responses and consider them to be useful.
In the second situation, a DK reply illustrates the researcher's failure to get the appropriate information, Consider the following illustrative questions:

1. Who developed the Managerial-Grid concept?
2. Do you believe the new president's fiscal policy is sound?
3. Do you like your present job?
4. Which of the various brands of chewing gum do you believe has the best quality?
5. How often each year do you go to the movies?
It is reasonable to expect that some legitimate DK responses will be made to each of these
questions. In the first question, the respondents are asked for a level of information that
they often will not have. There seems to be little reason to withhold a correct answer if
known. Thus, most DK answers to this question should be considered as legitimate. A DK
response to the second question presents a different problem. It is not immediately clear
whether the respondent is ignorant of the president's fiscal policy or knows the policy but
has not made a judgment about it. The researchers should have asked two questions: In the
first, they would have determined the respondent's level of awareness of fiscal policy. If the
interviewee passed the awareness test, then a second question would have secured judgment on fiscal policy.



> Exhibit 16-7 Handling "Don't Know" Responses
Question: Do you have a productive relationship with your present salesperson?

[Table: Yes, No, and Don't Know answers cross-tabulated by years of purchasing]

In the remaining three questions, DK responses are more likely to be a failure of the questioning process, although some will surely be legitimate. The respondent may be reluctant to
give the information. A DK response to question 3 may be a way of saying, "I do not want to
answer that question." Question 4 might also elicit a DK response in which the reply translates to "This is too unimportant to talk about." In question 5, the respondents are being asked
to do some calculation about a topic to which they may attach little importance. Now the DK
may mean "I do not want to do that work for something of so little consequence."

Dealing with Undesired DK Responses

The best way to deal with undesired DK answers is to design better questions at the beginning. Researchers should identify the questions for which a DK response is unsatisfactory and design around it. Interviewers, however, often inherit this problem and must deal with it in the field. Several actions are then possible. First, good interviewer-respondent rapport will motivate respondents to provide more usable answers. When interviewers recognize an evasive DK response, they can repeat the question or probe for a more definite answer. The interviewer may also record verbatim any elaboration by the respondent and pass the problem on to the editor.
If the editor finds many undesired responses, little can be done unless the verbatim comments can be interpreted. Understanding the real meaning relies on clues from the respondent's answers to other questions. One way to do this is to estimate the allocation of DK
answers from other data in the questionnaire. The pattern of responses may parallel income, education, or experience levels. Suppose a question concerning whether customers
like their present salesperson elicits the answers in Exhibit 16-7. The correlation between
years of purchasing and the "don't know" answers and the "no" answers suggests that most
of the "don't knows" are disguised "no" answers.
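A quick way to run that kind of check (a sketch; the data are invented and only echo the pattern described for Exhibit 16-7):

```python
import pandas as pd

# Invented answers in the spirit of Exhibit 16-7, tabulated by years of purchasing
df = pd.DataFrame({
    "years":  ["<1", "<1", "<1", "1-3", "1-3", "4+", "4+", "4+"],
    "answer": ["Don't Know", "No", "Don't Know", "Yes", "Don't Know", "Yes", "Yes", "No"],
})
# Row-normalized cross-tab: do DK answers cluster among newer purchasers?
print(pd.crosstab(df["years"], df["answer"], normalize="index"))
```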
There are several ways to handle "don't know" responses in the tabulations. If there are
only a few, it does not make much difference how they are handled, but they will probably
be kept as a separate category. If the DK response is legitimate, it should remain as a separate reply category. When we are not sure how to treat it, we should keep it as a separate
reporting category and let the research sponsor make the decision.

Missing Data

Missing data are information from a participant or case that is not available for one or more
variables of interest. In survey studies, missing data typically occur when participants accidentally skip, refuse to answer, or do not know the answer to an item on the questionnaire. In
longitudinal studies, missing data may result from participants dropping out of the study, or
being absent for one or more data collection periods. Missing data also occur due to researcher error, corrupted data files, and changes in the research or instrument design after data
were collected from some participants, such as when variables are dropped or added. The


>part IV

Analys~s
drld Presentatlor1of Data


> Exhibit 16-8 MindWriter Data Set: Missing and Out-of-Range Data

strategy for handling missing data consists of a two-step process: the researcher first explores
the pattern of missing data to determine the mechanism for missingness (the probability that
a value is missing rather than observed) and then selects a missing-data technique.
Examine the sample distribution of variables from the MindWriter dataset shown in Exhibit 16-8. These data were collected on a five-point interval scale. There are no missing data in variable 1A, although it is apparent that a range of 6 and a maximum value of 7 invalidate the calculated mean or average score. Variables 1B and 2B have one case missing but values that are within range. Variable 2A is missing four cases, or 27 percent of its data points. The last variable, 2C, has a range of 6, two missing values, and three values coded as "9." A "9" is often used as a DK or missing-value code when the scale has a range of less than 9 points. In this case both blanks and 9s are present, a coding concern. Notice that the fifth respondent answered only two of the five questions and the second respondent had two miscoded answers and one missing value. Finally, using descriptive indexes of shape, discussed in Appendix 16a, you can find three variables that depart from the symmetry of the normal distribution. They are skewed (or pulled) to the left by a disproportionately small number of 1s and 2s. One variable's distribution is peaked beyond normal dimensions. We have just used the minimum and maximum values, the range, and the mean and have already discovered errors in coding, problems with respondent answer patterns, and missing cases.

Mechanisms for Missing Data
In order to select a missing-data technique, the researcher must first determine what caused
the data to be missing. There are three basic mechanisms for this: data missing completely at random (MCAR); data missing at random (MAR); and data not missing at random
(NMAR). If the probability of missingness for a particular variable is dependent on neither
the variable itself nor any other variable in the data set, then data are MCAR. Data are considered MAR if the probability of missingness for a particular variable is dependent on another variable but not itself when other variables are held constant. The practical
significance of this distinction is that the proper missing-data technique can be selected that
will minimize bias in subsequent analyses. The third type of mechanism, NMAR, occurs
when data are not missing completely at random and they are not predictable from other
variables in the data set. Data NMAR are considered nonignorable and must be treated on
an improvised basis.
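The distinction is easier to see in a small simulation (ours, not the text's; all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
gender = rng.integers(0, 2, n)           # observed variable (0 = male, 1 = female)
pref = rng.normal(5.0, 1.0, n)           # shopping preference score

# MCAR: every value equally likely to be missing, regardless of any variable
mcar = np.where(rng.random(n) < 0.10, np.nan, pref)
# MAR: missingness depends on another observed variable (gender), not on pref itself
mar = np.where((gender == 0) & (rng.random(n) < 0.30), np.nan, pref)

print(np.isnan(mcar).mean())             # ~0.10 overall
print(np.isnan(mar[gender == 0]).mean()) # ~0.30 for men, 0 for women
```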

Missing-Data Techniques
Three basic types of techniques can be used to salvage data sets with missing data: (1) listwise deletion, (2) pairwise deletion, and (3) replacement of missing values with estimated scores. Listwise deletion, or complete case analysis, is perhaps the simplest approach, and is the default option in most statistical packages like SPSS and SAS. With this method, cases are deleted from the sample if they have missing values on any of the variables in the analysis. Listwise deletion is most appropriate when data are MCAR. In this situation, no bias will be introduced because the subsample of complete cases is essentially a random sample of the original sample. However, if data are MAR but not MCAR, then a bias may be introduced, especially if a large number of cases are deleted. For example, if men were more likely than women to be responsible for missing data on the variable shopping preference, then the results would be biased toward women's shopping preferences.
Pairwise deletion, also called available case analysis, assumes that data are MCAR. In the past, this technique was used frequently with linear models that are functions of means, variances, and covariances. Missing values would be estimated using all cases that had data for each variable or pair of variables in the analysis. Today most experts caution against pairwise deletion, and recommend alternative approaches.
The replacement of missing values with estimated values includes a variety of techniques. This option generally assumes that data are MAR, since the missing values on one
variable are predicted from observed values on another variable. A common option available on many software packages is the replacement of missing values with a mean or other
central tendency score. This is a simple approach, but has the disadvantage of reducing the
variability in the original data, which can cause bias. Another option is to use a regression
or likelihood-based method. Such techniques are found in specialty software packages and

the procedures for using them are beyond the scope of this text.
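As a concrete sketch of the three options in pandas (our own illustration; the item names and values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical survey items with scattered missing values
df = pd.DataFrame({"q1": [4, 5, np.nan, 3, 4],
                   "q2": [2, np.nan, 4, 5, 3],
                   "q3": [5, 4, 4, np.nan, 2]})

listwise = df.dropna()              # (1) listwise deletion: drop incomplete cases
pairwise_corr = df.corr()           # (2) pandas computes correlations pairwise
mean_filled = df.fillna(df.mean())  # (3) mean replacement; reduces variability

print(len(df), "cases;", len(listwise), "after listwise deletion")  # 5 cases; 2 after
```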

> Data Entry
Data entry converts information gathered by secondary or primary methods to a medium
for viewing and manipulation. Keyboarding remains a mainstay for researchers who need
to create a data file immediately and store it in a minimal space on a variety of media.
However, researchers have profited from more efficient ways of speeding up the research
process, especially from bar coding and optical character and mark recognition.

Alternative Data Entry Formats
Keyboarding
A full-screen editor, where an entire data file can be edited or browsed, is a viable means
of data entry for statistical packages like SPSS or SAS. SPSS offers several data entry
products, including Data Entry Builder™, which enables the development of forms and surveys, and Data Entry Station™, which gives centralized entry staff, such as telephone interviewers or online participants, access to the survey. Both SAS and SPSS offer software that effortlessly accesses data from databases, spreadsheets, data warehouses, or data marts.

> Exhibit 16-9 Data Fields, Records, Files, and Databases

Data fields represent single elements of information (e.g., an answer to a particular question) from all participants in a study. Data fields can contain numeric, alphabetic, or symbolic information. A record is a set of data fields that are related to one case or participant (e.g., the responses to one completed survey). Records represent rows in a data file or spreadsheet program worksheet. Data files are sets of records (e.g., responses from all participants in a single study) that are grouped together for storage on diskettes, disks, tapes, CD-ROMs, or optical disks. Databases are made up of one or more data files that are interrelated. A database might contain all customer survey information collected quarterly for the last 10 years.

Database Development For large projects, database programs serve as valuable
data entry devices. A database is a collection of data organized for computerized retrieval.
Programs allow users to define data fields and link files so that storage, retrieval, and updating are simplified. The relationship between data fields, records, files, and databases is
illustrated in Exhibit 16-9. A company's orders serve as an example of a database. Ordering
information may be kept in several files: salesperson's customer files, customer financial
records, order production records, and order shipping documentation. The data are separated so that authorized people can see only those parts pertinent to their needs. However,
the files may be linked so that when, say, a customer changes his or her shipping address,
the change is entered once and all the files are updated. Another database entry option is
e-mail data capture. It has become popular with those using e-mail-delivered surveys. The
e-mail survey can be delivered to a specific respondent whose e-mail address is known.
Questions are completed on the screen, returned via e-mail, and incorporated into a database.6 An intranet can also capture data. When participants linked by a network take an online survey by completing a database form, the data are captured in a database in a network
server for later or real-time analysis. ID and password requirements can keep unwanted participants from skewing the results of an online survey.
Researchers consider database entry when they have large amounts of potentially linked
data that will be retrieved and tabulated in different ways over time. Another application of
a database program is as a "front-end" entry mechanism. A telephone interviewer may ask
the question "How many children live in your household?" The computer's software has
been programmed to accept any answer between 0 and 20. If a "P" is accidentally struck, the program will not accept the answer and will return the interviewer to the question. With a precoded online instrument, some of the editing previously discussed is done by the program. In addition, the program can be set for automatic conditional branching. In the example, an answer of 1 or greater causes the program to prompt the questioner to ask the
ages of the children. A 0 causes the age question to be automatically skipped. Although this
option is available whenever interactive computing is used, front-end processing is typically done within the database design. The database will then store the data in a set of
linked files that allow the data to be easily sorted. Descriptive statistics and tables-the first
steps in exploring data-are readily generated from within the database.
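A bare-bones sketch of that front-end logic (ours, not any particular package's):

```python
# Accept only answers 0-20, then branch on the response, as described above
def ask_children() -> int:
    while True:
        raw = input("How many children live in your household? ")
        if raw.isdigit() and 0 <= int(raw) <= 20:
            return int(raw)                        # valid entry: store and move on
        print("Invalid entry; please try again.")  # reject stray keys like "P"

children = ask_children()
if children >= 1:                                  # automatic conditional branching
    ages = [input(f"Age of child {i + 1}? ") for i in range(children)]
```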




Spreadsheet Spreadsheets are a specialized type of database for data that need organizing, tabulating, and simple statistics. They also offer some database management, graphics, and presentation capabilities. Data entry on a spreadsheet uses numbered rows and lettered columns with a matrix of thousands of cells into which an entry may be placed. Spreadsheets allow you to type numbers, formulas, and text into appropriate cells. Many statistics programs for personal computers and also charting and graphics applications have data editors similar to the Excel spreadsheet format shown in Exhibit 16-10. This is a convenient and flexible means for entering and viewing data.

Optical Recognition
If you use a PC image scanner, you probably are familiar with optical character recognition (OCR) programs that transfer printed text into computer files in order to edit and use it without retyping. There are other, related applications. Optical scanning of instruments, the choice of testing services, is efficient for researchers. Examinees darken small circles, ellipses, or spaces between sets of parallel lines to indicate their answers. A more flexible format, optical mark recognition (OMR), uses a spreadsheet-style interface to read and process user-created forms. Optical scanners process the mark-sensed questionnaires and store the answers in a file. This method, most often associated with standardized and preprinted forms, has been adopted by researchers for data entry and preprocessing due to its speed (10 times faster than keyboarding), cost savings on data entry, convenience in charting and reporting data, and improved accuracy. It reduces the number of times data are handled, thereby reducing the number of errors that are introduced.
Other techniques include direct-response entry, of which voting procedures used in several states are an example. With a specially prepared punch card, citizens cast their votes
by pressing a pen-shaped instrument against the card next to the preferred candidate. This




In 2004 Princess Cruises had 15 ships and 30,000 berths sailing 7 to 72 days to more than 6 continents on more than 150 itineraries. Princess carries more than 700,000 passengers each year and processes 245,000 customer satisfaction surveys each year, distributed to each cabin on the last day of each cruise. Princess uses scannable surveys rather than human data entry for one reason: in the 1-week to 10-day analysis lag created by human data entry, 10 cruises could be completed with another 10 under way. For a business that prides itself on customer service, not knowing about a problem could be enormously damaging. Princess has found that scannable surveys generate more accurate data entry, while reducing processing and decision-response time, critical time in the cruise industry.
www.princesscruises.com

opens a small hole in a specific column and row of the card. The cards are collected and
placed directly into a card reader. This method also removes the coding and entry steps.
Another governmental application is the 1040EZ form used by the Internal Revenue
Service. It is designed for computerized number and character recognition. Similar character recognition techniques are employed for many forms of data collection. Again, both approaches move the response from the question to data analysis with little handling.

Voice Recognition
The increase in computerized random dialing has encouraged other data collection innovations. Voice recognition and voice response systems are providing some interesting alternatives for the telephone interviewer. Upon getting a voice response to a randomly dialed number, the computer branches into a questionnaire routine. These systems are advancing quickly and will soon translate recorded voice responses into data files.

Digital
Telephone keypad response, frequently used by restaurants and entertainment venues to evaluate customer service, is another capability made possible by computers linked to telephone lines. Using the telephone keypad (touch-tone), an invited participant answers questions by pressing the appropriate number. The computer captures the data by decoding the tone's electrical signal and storing the numeric or alphabetic answer in a data file. While not originally designed for collecting survey data, each of the software components within Microsoft Office XP includes advanced speech recognition functionality, enabling people to enter and edit data by speaking into a microphone.




Field interviewers can use mobile computers or notebooks instead of clipboards and
pencils. With a built-in communications modem, wireless LAN, or cellular link, their files
can be sent directly to another computer in the field or to a remote site. This lets supervisors inspect data immediately or simplifies processing at a central facility. This is the technology that Nielsen Media is using with its portable People Meter.

Bar Code Since adoption of the Universal Product Code (UPC) in 1973, the bar code has developed from a technological curiosity to a business mainstay. After a study by McKinsey & Company, the Kroger grocery chain pilot-tested a production system, and bar codes became ubiquitous in that industry.
Bar-code technology is used to simplify the interviewer's role as a data recorder. When
an interviewer passes a bar-code wand over the appropriate codes, the data are recorded in
a small, lightweight unit for translation later. In the large-scale processing project Census
2000, the Census Data Capture Center used bar codes to identify residents. Researchers
studying magazine readership can scan bar codes to denote a magazine cover that is recognized by an interview participant.
The bar code is used in numerous applications: point-of-sale terminals, hospital patient ID bracelets, inventory control, product and brand tracking, promotional technique evaluation, shipment tracking, marathon runners, rental car locations (to speed the return of cars and generate invoices), and tracking of insects' mating habits. The military uses 2-foot-long bar codes to label boats in storage. The codes appear on business documents, truck parts, and timber in lumberyards. Federal Express shipping labels use a code called Codabar. Other codes, containing letters as well as numbers, have potential for researchers.

On the Horizon
Even with these time reductions between data collection and analysis, continuing innovations in multimedia technology are being &weloped by the personal computer business.
The capability to integrate visual images, streaming video, audio, and data may soon replace video equipment as the preferred method for recording an experiment, interview, or
focus group. A copy of the response data could be extracted for data analysis, but the audio
and visual images would remain intact for later evaluation. Although technology will never

replace researcher judgment, it can reduce data-handling errors, decrease time between
data collection and analysis, and help provide more usable information.

