
Shore, J. “Software Tools for Speech Research and Development”
Digital Signal Processing Handbook
Ed. Vijay K. Madisetti and Douglas B. Williams
Boca Raton: CRC Press LLC, 1999
© 1999 by CRC Press LLC
50
Software Tools for Speech Research and Development

John Shore
Entropic Research Laboratory, Inc.
50.1 Introduction
50.2 Historical Highlights
50.3 The User's Environment (OS-Based vs. Workspace-Based)
    Operating-System-Based Environment • Workspace-Based Environment
50.4 Compute-Oriented vs. Display-Oriented
    Compute-Oriented Software • Display-Oriented Software • Hybrid Compute/Display-Oriented Software
50.5 Compiled vs. Interpreted
    Interpreted Software • Compiled Software • Hybrid Interpreted/Compiled Software • Computation vs. Display
50.6 Specifying Operations Among Signals
    Text-Based Interfaces • Visual ("Point-and-Click") Interfaces • Parametric Control of Operations
50.7 Extensibility (Closed vs. Open Systems)
50.8 Consistency Maintenance
50.9 Other Characteristics of Common Approaches
    Memory-Based vs. File-Based • Documentation of Processing History • Personalization • Real-Time Performance • Source Availability • Hardware Requirements • Cross-Platform Compatibility • Degree of Specialization • Support for Speech Input and Output
50.10 File Formats (Data Import/Export)
50.11 Speech Databases
50.12 Summary of Characteristics and Uses
50.13 Sources for Finding Out What is Currently Available
50.14 Future Trends
References
50.1 Introduction
Experts in every field of study depend on specialized tools. In the case of speech research and
development, the dominant tools today are computer programs. In this article, we present an
overview of key technical approaches and features that are prevalent today.
We restrict the discussion to software intended to support R&D, as opposed to software for commercial applications of speech processing. For example, we ignore DSP programming (which is discussed in the previous article). Also, we concentrate on software intended to support the specialities of speech analysis, coding, synthesis, and recognition, since these are the main subjects of this chapter. However, much of what we have to say applies as well to the needs of those in such closely related areas as psycho-acoustics, clinical voice analysis, sound and vibration, etc.
We do not attempt to survey available software packages, as the result would likely be obsolete by the time this book is printed. The examples mentioned are illustrative, and not intended to provide a thorough or balanced review. Our aim is to provide sufficient background so that readers can assess their needs and understand the differences among available tools. Up-to-date surveys are readily available online (see Section 50.13).
In general, there are three common uses of speech R&D software:
• Teaching, e.g., homework assignments for a basic course in speech processing
• Interactive, free-form exploration, e.g., designing a filter and evaluating its effects on a speech processing system
• Batch experiments, e.g., training and testing speech coders or speech recognizers using a large database
The relative importance of various features differs among these uses. For example, in conducting batch experiments, it is important that large signals can be handled, and that complicated algorithms execute efficiently. For teaching, on the other hand, these features are less important than simplicity, quick experimentation, and ease-of-use. Because of practical limitations, such differences in priority mean that no one software package today can meet all needs.
To explain the variation among current approaches, we identify a number of distinguishing characteristics. These characteristics are not independent (i.e., there is considerable overlap), but they do help to present the overall view.
For simplicity, we will refer to any particular speech R&D software as “the speech software”.
50.2 Historical Highlights
Early or significant examples of speech R&D software include "Visible Speech" [5], MITSYN [1], and Lloyd Rice's WAVE program of the mid 1970s (not to be confused with David Talkin's waves [8]).
The first general, commercial system that achieved widespread acceptance was the Interactive
Laboratory System (ILS) from Signal Technology Incorporated, which was popular in the late 1970s
and early 1980s. Using the terminology defined below, ILS is compute-oriented software with an
operating-system-based environment. The first popular, display-oriented, workspace-based speech
software was David Shipman’s LISP-machine application called Spire [6].
50.3 The User's Environment (OS-Based vs. Workspace-Based)
In some cases, the user sees the speech software as an extension of the computer's operating system. We call this "operating-system-based" (or OS-based); an example is the Entropic Signal Processing System (ESPS) [7].
In other cases, the software provides its own operating environment. We call this "workspace-based" (from the term used in implementations of the programming language APL); an example is MATLAB™ (from The MathWorks).
50.3.1 Operating-System-Based Environment
In this approach, signals are represented as files under the native operating system (e.g., Unix, DOS),
and the software consists of a set of programs that can be invoked separately to process or display
signals in various ways. Thus, the user sees the software as an extension of an already-familiar oper-
ating system. Because signals are represented as files, the speech software inherits file manipulation
capabilities from the operating system. Under Unix, for example, signals can be copied and moved
respectively using the cp and mv programs, and they can be organized as directory trees in the Unix
hierarchical file system (including NFS).
Similarly, the speech software inherits extension capabilities inherent in the operating system. Under Unix, for example, extensions can be created using shell scripts in various languages (sh, csh, Tcl, perl, etc.), as well as such facilities as pipes and remote execution. OS-based speech software packages are often called command-line packages because usage typically involves providing a sequence of commands to some type of shell.
50.3.2 Workspace-Based Environment
In this approach, the user interacts with a single application program that takes over from the
operating system. Signals, which may or may not correspond to files, are typically represented as
variables in some kind of virtual space. Various commands are available to process or display the
signals. Such a workspace is often analogous to a personal blackboard.
Workspace-based systems usually offer means for saving the current workspace contents and for
loading previously saved workspaces.
An extension mechanism is typically provided by a command interpreter for a simple language that includes the available operations and a means for encapsulating and invoking command sequences (e.g., in a function or procedure definition). In effect, the speech software provides its own shell to the user.
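The workspace idea can be sketched in a few lines. This is a hypothetical illustration (Python standing in for whatever interpreter language the package provides, with invented command names), not any particular product's design: signals live as named variables, commands operate on them, and the whole workspace can be saved and resumed.

```python
import os, pickle, tempfile

class Workspace:
    """Toy workspace: signals live as named variables, not as visible files."""
    def __init__(self):
        self.vars = {}

    def let(self, name, samples):            # create or replace a signal
        self.vars[name] = list(samples)

    def gain(self, dst, src, factor):        # an example built-in "command"
        self.vars[dst] = [factor * x for x in self.vars[src]]

    def save(self, path):                    # snapshot the whole workspace
        with open(path, "wb") as f:
            pickle.dump(self.vars, f)

    def load(self, path):                    # resume a saved session
        with open(path, "rb") as f:
            self.vars = pickle.load(f)

ws = Workspace()
ws.let("x", [1.0, -2.0, 3.0])
ws.gain("y", "x", 0.5)

path = os.path.join(tempfile.mkdtemp(), "session.ws")
ws.save(path)

ws2 = Workspace()                            # a "new session"
ws2.load(path)                               # y comes back as [0.5, -1.0, 1.5]
```

The save/load pair is exactly the "personal blackboard" behavior described above: nothing need correspond to a file until the user asks for a snapshot.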
50.4 Compute-Oriented vs. Display-Oriented
This distinction concerns whether the speech software emphasizes computation or visualization or both.
50.4.1 Compute-Oriented Software
If there is a large number of signal processing operations relative to the number of signal display operations, we say that the software is compute-oriented. Such software typically can be operated without a display device, and the user thinks of it primarily as a computation package that supports such functions as spectral analysis, filtering, linear prediction, quantization, analysis/synthesis, pattern classification, Hidden Markov Model (HMM) training, speech recognition, etc.
Compute-oriented software can be either OS-based or workspace-based. Examples include ESPS, MATLAB™, and the Hidden Markov Model Toolkit (HTK) (from Cambridge University and Entropic).
50.4.2 Display-Oriented Software
In contrast, display-oriented speech software is not intended to and often cannot operate without
a display device. The primary purpose is to support visual inspection of waveforms, spectrograms,
and other parametric representations. The user typically interacts with the software using a mouse
or other pointing device to initiate display operations such as scrolling, zooming, enlarging, etc.
While the software may also provide computations that can be performed on displayed signals (or marked segments of displayed signals), the user thinks of the software as supporting visualization more than computation. An example is the waves program [8].
50.4.3 Hybrid Compute/Display-Oriented Software
Hybrid compute/display software combines the best of both. Interactions are typically by means
of a display device, but computational capabilities are rich. The computational capabilities may be
built-in to workspace-based speech software, or may be OS-based but accessible from the display
program. Examples include the Computerized Speech Lab (CSL) from Kay Elemetrics Corp., and
the combination of ESPS and waves.
50.5 Compiled vs. Interpreted

Here we distinguish according to whether the bulk of the signal processing or display code (whether written by developers or users) is interpreted or compiled.
50.5.1 Interpreted Software
The interpreter language may be specially designed for the software (e.g., S-PLUS from Statistical Sciences, Inc., and MATLAB™), or may be an existing, general purpose language (e.g., LISP is used in N!Power from Signal Technology, Inc.).
Compared to compiler languages, interpreter languages tend to be simpler and easier to learn. Furthermore, it is usually easier and faster to write and test programs under an interpreter. The disadvantage, relative to compiled languages, is that the resulting programs can be quite slow to run. As a result, interpreted speech software is usually better suited for teaching and interactive exploration than for batch experiments.
50.5.2 Compiled Software
Compared to interpreted languages, compiled languages (e.g., FORTRAN, C, C++) tend to be more complicated and harder to learn. Compared to interpreted programs, compiled programs are slower
to write and test, but considerably faster to run. As a result, compiled speech software is usually
better suited for batch experiments than for teaching.
50.5.3 Hybrid Interpreted/Compiled Software
Some interpreters make it possible to create new language commands with an underlying implementation that is compiled. This allows a hybrid approach that can combine the best of both.
Some languages provide a hybrid approach in which the source code is pre-compiled quickly into intermediate code that is then (usually!) interpreted. Java is a good example.
If compiled speech software is OS-based, signal processing scripts can typically be written in an interpretive language (e.g., a sh script containing a sequence of calls to ESPS programs). Thus, hybrid systems can also be based on compiled software.
50.5.4 Computation vs. Display
The distinction between compiled and interpreted languages is relevant mostly to the computational aspects of the speech software. However, the distinction can apply as well to display software, since some display programs are compiled (e.g., using Motif) while others exploit interpreters (e.g., Tcl/Tk, Java).
50.6 Specifying Operations Among Signals
Here we are concerned with the means by which users specify what operations are to be done and on what signals. This consideration is relevant to how speech software can be extended with user-defined operations (see Section 50.7), but is an issue even in software that is not extensible.
The main distinction is between a text-based interface and a visual ("point-and-click") interface. Visual interfaces tend to be less general but easier to use.
50.6.1 Text-Based Interfaces
Traditional interfaces for specifying computations are based on a textual representation in the form
of scripts and programs. For OS-based speech software, operations are typically specified by typing
the name of a command (with possible options) directly to a shell. One can also enter a sequence of
such commands into a text editor when preparing a script.
This style of specifying operations also is available for workspace-based speech software that is
based on a command interpreter. In this case, the text comprises legal commands and programs in
the interpreter language.
Both OS-based and workspace-based speech software may also permit the specification of opera-
tions using source code in a high-level language (e.g., C) that gets compiled.
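The command-interpreter style can be made concrete with a toy sketch. This is purely illustrative (the command names "add" and "scale" are invented here, not taken from any real package): each text line names an operation, its operands, and an output signal, and a small dispatcher executes the script.

```python
# Toy text-based interface: signals are named entries in a table, and a
# script is a sequence of "command output inputs..." lines.
signals = {"a": [1, 2, 3], "b": [10, 20, 30]}

def cmd_add(args):                     # "add out in1 in2"
    out, x, y = args
    signals[out] = [p + q for p, q in zip(signals[x], signals[y])]

def cmd_scale(args):                   # "scale out in factor"
    out, x, k = args
    signals[out] = [float(k) * v for v in signals[x]]

COMMANDS = {"add": cmd_add, "scale": cmd_scale}

def run(script):
    """Interpret a script, one whitespace-separated command per line."""
    for line in script.strip().splitlines():
        name, *args = line.split()
        COMMANDS[name](args)

run("""
add   s  a  b
scale t  s  2
""")
```

Because the script is plain text, it can be typed interactively, kept in a file, or generated by another program, which is exactly the generality that Section 50.6.2 notes visual interfaces tend to lose.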
50.6.2 Visual (“Point-and-Click”) Interfaces
The point-and-click approach has become the ubiquitous user interface of the 1990s. Operations and operands (signals) are specified by using a mouse or other pointing device to interact with on-screen graphical user-interface (GUI) controls such as buttons and menus. The interface may also have a text-based component to allow the direct entry of parameter values or formulas relating signals.
Visual Interfaces for Display-Oriented Software
In display-oriented software, the signals on which operations are to be performed are visible
as waveforms or other directly representative graphics.
A typical user-interaction proceeds as follows: A relevant signal is specified by a mouse-click operation (if a signal segment is involved, it is selected by a click-and-drag operation or by a pair of mouse-click operations). The operation to be performed is then specified by mouse click operations on screen buttons, pull-down menus, or pop-up menus.
This style works very well for unary operations (e.g., compute and display the spectrogram of a
given signal segment), and moderately well for binary operations (e.g., add two signals). But it is
awkward for operations that have more than two inputs. It is also awkward for specifying chained
calculations, especially if you want to repeat the calculations for a new set of signals.
One solution to these problems is provided by a "calculator-style" interface that looks and acts like a familiar arithmetic calculator (except the operands are signal names and the operations are signal processing operations).
Another solution is the “spreadsheet-style” interface. The analogy with spreadsheets is tight.
Imagine a spreadsheet in which the cells are replaced by images (waveforms, spectrograms, etc.) connected logically by formulas. For example, one cell might show a test signal, a second might show the results of filtering it, and a third might show a spectrogram of a portion of the filtered signal. This exemplifies a spreadsheet-style interface for speech software.
A spreadsheet-style interface provides some means for specifying the “formulas” that relate the
various “cells”. This formula interface might itself be implemented in a point-and-click fashion,
or it might permit direct entry of formulas in some interpretive language. Speech software with
a spreadsheet-style interface will maintain consistency among the visible signals. Thus, if one of
the signals is edited or replaced, the other signal graphics change correspondingly, according to the
underlying formulas.
DADiSP (from DSP Development Corporation) is an example of a spreadsheet-style interface.
Visual Interfaces for Compute-Oriented Software
In a visual interface for display-oriented software, the focus is on the signals themselves. In
a visual interface for compute-oriented software, on the other hand, the focus is on the operations.
Operations among signals typically are represented as icons with one or more input and output lines that interconnect the operations. In effect, the representation of a signal is reduced to a straight line indicating its relationship (input or output) with respect to operations. Such visual interfaces are often called block-diagram interfaces. In effect, a block-diagram interface provides a visual representation of the computation chain. Various point-and-click means are provided to support the user in creating, examining, and modifying block diagrams.
Ptolemy [4] and N!Power are examples of systems that provide a block-diagram interface.
Limitations of Visual Interfaces
Although much in vogue, visual interfaces are inherently limited as a means for specifying
signal computations.
For example, the analogy between spreadsheets and spreadsheet-style speech software continues.
For simple signal computations, the spreadsheet-style interface can be very useful; computations
are simple to set up and informative when operating. For complicated computations, however,
the spreadsheet-style interface inherits all of the worst features of spreadsheet programming. It is
difficult to encapsulate common sub-calculations, and it is difficult to organize the "program" so that the computational structure is self-evident. The result is that spreadsheet-style programs are hard to write, hard to read, and error-prone.
In this respect, block-diagram interfaces do a better job since their main focus is on the underlying computation rather than on the signals themselves. Thus, screen "real estate" is devoted to the computation rather than to the signal graphics. However, as the complexity of computations grows, the geometric and visual approach eventually becomes unwieldy. When was the last time you used a flowchart to design or document a program?
It follows that visual interfaces for specifying computations tend to be best suited for teaching and interactive exploration.
50.6.3 Parametric Control of Operations
Speech processing operations often are based on complicated algorithms with numerous parameters. Consequently, the means for specifying parameters is an important issue for speech software.
The simplest form of parametric control is provided by command-line options on command-line
programs. This is convenient, but can be cumbersome if there are many parameters. A common
alternative is to read parameter values from parameter files that are prepared in advance. Typically,
command-line values can be used to override values in the parameter file. A third input source for
parameter values is directly from the user in response to prompts issued by the program.
Some systems offer the flexibility of a hierarchy of inputs for parameter values, for example:
• default values
• values from a global parameter file read by all programs
• values from a program-specific parameter file
• values from the command line
• values from the user in response to run-time prompts
In some situations, it is helpful if a current default value is replaced by the most recent input from a given parameter source. We refer to this property as "parameter persistence".
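The layered lookup just listed amounts to a simple override chain, which can be sketched as follows (the parameter names are invented for illustration; real packages define their own):

```python
# Sketch of hierarchical parameter resolution: each later source overrides
# the ones before it, ending with run-time prompts.
def resolve_params(defaults, global_file=None, program_file=None,
                   command_line=None, prompts=None):
    merged = dict(defaults)
    for layer in (global_file, program_file, command_line, prompts):
        if layer:
            merged.update(layer)      # later layers win over earlier ones
    return merged

params = resolve_params(
    defaults={"order": 10, "window": "hamming", "step_ms": 5.0},
    global_file={"step_ms": 10.0},          # site-wide parameter file
    program_file={"order": 12},             # program-specific parameter file
    command_line={"window": "hanning"},     # explicit user override
)
```

"Parameter persistence" would correspond to writing the winning values back into one of the earlier layers so they become the new defaults on the next run.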
50.7 Extensibility (Closed vs. Open Systems)
Speech software is “closed” if there is no provision for the user to extend it. There is a fixed set of
operations available to process and display signals. What you get is all you get.
OS-based systems are always extensible to a degree because they inherit scripting capabilities from the OS, which permits the creation of new commands. They may also provide programming libraries so that the user can write and compile new programs and use them as commands.
Workspace-based systems may be extensible if they are based on an interpreter whose programming language includes the concept of an encapsulated procedure. If so, then users can write scripts that define new commands. Some systems also allow the interpreter to be extended with commands that are implemented by underlying code in C or some other compiled language.
In general, for speech software to be extensible, it must be possible to specify operations (see
Section 50.6) and also to re-use the resulting specifications in other contexts. A block-diagram
interface is extensible, for example, if a given diagram can be reduced to an icon that is available for
use as a single block in another diagram.
For speech software with visual interfaces, extensibility considerations also include the ability to specify new GUI controls (visible menus and buttons), the ability to tie arbitrary internal and external computations to GUI controls, and the ability to define new display methods for new signal types.
In general, extended commands may behave differently from the built-in commands provided with the speech software. For example, built-in commands may share a common user interface that is difficult to implement in an independent script or program (such a common interface might provide standard parameters for debug control, standard processing of parameter files, etc.).
If user-defined scripts, programs, and GUI components are indistinguishable from built-in facilities, we say that the speech software provides seamless extensibility.
50.8 Consistency Maintenance
A speech processing chain involves signals, operations, and parameter sets. An important consideration for speech software is whether or not consistency is maintained among all of these. Thus, for example, if one input signal is replaced with another, are all intermediate and output signals recalculated automatically? Consistency maintenance is primarily an issue for speech software with visual interfaces, namely whether or not the software guarantees that all aspects of the visible displays are consistent with each other.
Spreadsheet-style interfaces (for display-oriented software) and block-diagram interfaces (for
compute-oriented software) usually provide consistency maintenance.
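The recompute-on-change behavior can be sketched with a toy dependency graph. This is a hypothetical illustration in Python, not any particular product's mechanism: each derived signal records the formula that produced it, and replacing an input recomputes everything downstream.

```python
# Toy consistency maintenance: "cells" hold values, formulas relate them,
# and any change triggers re-evaluation of the derived cells.
class Sheet:
    def __init__(self):
        self.values = {}
        self.formulas = {}              # name -> (function, input names)

    def set(self, name, value):
        self.values[name] = value
        self._recompute()

    def define(self, name, fn, *inputs):
        self.formulas[name] = (fn, inputs)
        self._recompute()

    def _recompute(self):
        # Re-evaluate until stable; fine for small acyclic formula graphs.
        for _ in range(len(self.formulas) + 1):
            for name, (fn, inputs) in self.formulas.items():
                if all(i in self.values for i in inputs):
                    self.values[name] = fn(*(self.values[i] for i in inputs))

sheet = Sheet()
sheet.set("x", [1, 2, 3])
sheet.define("doubled", lambda s: [2 * v for v in s], "x")
sheet.define("total", sum, "doubled")
before = sheet.values["total"]          # 12
sheet.set("x", [10, 20])                # edit an input signal...
after = sheet.values["total"]           # ...and the derived cells follow: 60
```

A production system would track the dependency graph explicitly and recompute only what changed, but the guarantee offered to the user is the same.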
50.9 Other Characteristics of Common Approaches
50.9.1 Memory-based vs. File-based
"Memory-based" speech software carries out all of its processing and display operations on signals that are stored entirely within memory, regardless of whether or not the signals also have an external representation as a disk file. This approach has obvious limitations with respect to signal size, but it simplifies programming and yields fast operation. Thus, memory-based software is well-suited for teaching and the interactive exploration of small samples.
In "file-based" speech software, on the other hand, signals are represented and manipulated as disk files. The software partially buffers portions of the signal in memory as required for processing and display operations. Although programming can be more complicated, the advantage is that there are no inherent limitations on signal size. The file-based approach is, therefore, well-suited for large scale experiments.
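The buffering idea can be sketched in a few lines (the file name and frame size here are invented for illustration): only a bounded window of samples is ever resident, so the same code handles arbitrarily long signal files.

```python
import os, struct, tempfile

# Write a small 16-bit "signal file" to read back in chunks.
samples = [0, 3, -7, 2, 9, -1, 4]
path = os.path.join(tempfile.mkdtemp(), "tone.sd")
with open(path, "wb") as f:
    f.write(struct.pack("<%dh" % len(samples), *samples))

def frame_peaks(path, frame_len):
    """Max |sample| per frame, holding only frame_len samples in memory."""
    peaks = []
    with open(path, "rb") as f:
        while True:
            buf = f.read(frame_len * 2)        # 2 bytes per 16-bit sample
            if not buf:
                break
            frame = struct.unpack("<%dh" % (len(buf) // 2), buf)
            peaks.append(max(abs(s) for s in frame))
    return peaks

peaks = frame_peaks(path, 3)   # frames (0,3,-7), (2,9,-1), (4,) -> [7, 9, 4]
```

A memory-based package would instead read the whole file into a variable up front; the trade-off is exactly the one described above.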
50.9.2 Documentation of Processing History
Modern speech processing involves complicated algorithms with many processing steps and operating parameters. As a result, it is often important to be able to reconstruct exactly how a given signal was produced. Speech software can help here by creating appropriate records as signal and parameter files are processed.
The most common method for recording this information about a given signal is to put it in the same file as the signal. Most modern speech software uses a file format that includes a "file header" that is used for this purpose. Most systems store at least some information in the header, e.g., the sampling rate of the signal. Others, such as ESPS, attempt to store all relevant information. In this approach, the header of a signal file produced by any program includes the program name, values of processing parameters, and the names and headers of all source files. The header is a recursive structure, so that the headers of the source files themselves contain the names and headers of files that were prior sources. Thus, a signal file header contains the headers of all source files in the processing chain. It follows that files contain a complete history of the origin of the data in the file and all the intermediate processing steps. The importance of record keeping grows with the complexity of computation chains and the extent of available parametric control.
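The recursive-header idea can be sketched as nested records. This is only the concept, not the actual ESPS header layout (which is binary and far more elaborate); the program and parameter names are invented:

```python
# Toy recursive history headers: each output embeds the program name, its
# parameters, and the full headers of its sources, so the whole processing
# chain is recoverable from the final file alone.
def make_header(program, params, sources):
    return {
        "program": program,
        "params": params,
        "sources": [{"name": n, "header": h} for n, h in sources],
    }

raw = {"header": make_header("record", {"sample_rate": 16000}, []),
       "data": [0.1, 0.2]}
filtered = {"header": make_header("lowpass", {"cutoff_hz": 3400},
                                  [("raw.sd", raw["header"])]),
            "data": [0.05, 0.1]}

def history(header, depth=0):
    """Walk the embedded headers and return one indented line per step."""
    lines = ["  " * depth + header["program"]]
    for src in header["sources"]:
        lines += history(src["header"], depth + 1)
    return lines
```

Calling `history(filtered["header"])` reconstructs the chain (lowpass applied to a recorded signal) without consulting any file other than the final one.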
50.9.3 Personalization
There is considerable variation in the extent to which speech software can be customized to suit
personal requirements and tastes. Some systems cannot be personalized at all; they start out the
same way, every time. But most systems store personal preferences and use them again next time.
Savable preferences may include color selections, button layout, button semantics, menu contents,
currentlyloadedsignals,visiblewindows,windowarrangement,anddefaultparametersetsforspeech
processing operations.
Attheextreme, some systemscansaveacomplete“snapshot”thatpermitsexactresumption. This
isparticularlyimportantfortheinteractivestudyofcomplicatedsignalconfigurationsacrossrepeated
software sessions.
50.9.4 Real-Time Performance
Software is generally described as “real-time” if it is able to keep up with relevant, changing inputs.
In the case of speech software, this usually means that the software can keep up with input speech.
Even this definition is not particularly meaningful unless the input speech is itself coming from a

human speaker and digitized in real-time. Otherwise, the real issue is whether or not the software is
fast enough to keep up with interactive use.
For example, if one is testing speech recognition software by directly speaking into the computer,
real-time performance is important. It is less important, on the other hand, if the test procedure
involves running batch scripts on a database of speech files.
If the speech software is designed to take input directly from devices (or pipes, in the case of Unix), then the issue becomes one of CPU speed.
50.9.5 Source Availability
It is unfortunate but true that the best documentation for a given speech processing command is often the source code. Thus, the availability of source code may be an important factor for this reason alone. Typically, this is more important when the software is used in advanced R&D applications. Sources also are needed if users have requirements to port the speech software to additional platforms. Source availability may also be important for extensibility, since it may not be possible to extend the speech software without the sources.
If the speech software is interpreter-based, sources of interest will include the sources for any
built-in operations that are implemented as interpreter scripts.
50.9.6 Hardware Requirements
Speech software may require the installation of special purpose hardware. There are two main reasons for such requirements: to accelerate particular computations (e.g., spectrograms), and to provide speech I/O with A/D and D/A converters.
Such hardware has several disadvantages. It adds to the system cost, and it decreases the overall
reliability of the system. It may also constrain system software upgrades; for example, the extra
hardware may use special device drivers that do not survive OS upgrades. Special purpose hardware
used to be common, but is less so now owing to the continuing increase in CPU speeds and the
prevalence of built-in audio I/O. It is still important, however, when maximum speed and high-quality
audio I/O are important. CSL is a good example of an integrated hardware/software approach.
50.9.7 Cross-Platform Compatibility
If your hardware platform may change or your site has a variety of platforms, then it is important
to consider whether the speech software is available across a variety of platforms. Source availability
(Section 50.9.5) is relevant here.

If you intend to run the speech software on several platforms that have different underlying numeric representations (a byte order difference being most likely), then it is important to know whether the file formats and signal I/O software support transparent data exchange.
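One standard way to get transparent exchange is to declare a fixed byte order in the file format and always encode and decode against that declaration rather than the machine's native order. A minimal sketch using Python's struct module:

```python
import struct

# Pack 16-bit samples with "<" (explicit little-endian) so the byte layout
# is identical on every platform; native packing ("=" or no prefix on some
# formats) would differ between little- and big-endian machines.
samples = [-32768, 0, 12345, 32767]
packed = struct.pack("<4h", *samples)

def read_samples(blob, count, little_endian=True):
    """Decode using the file's declared byte order, never the machine's."""
    fmt = ("<" if little_endian else ">") + "%dh" % count
    return list(struct.unpack(fmt, blob))
```

A format that instead stores a byte-order flag in its header works the same way: the reader consults the flag and picks `"<"` or `">"` accordingly.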
50.9.8 Degree of Specialization
Some speech software is intended for general purpose work in speech (e.g., ESPS/waves, MATLAB™). Other software is intended for more specialized usage. Some of the areas where specialized software tools may be relevant include linguistics, recognition, synthesis, coding, psycho-acoustics, clinical voice, music, multi-media, sound and vibration, etc. Two examples are HTK for recognition, and Delta (from Eloquent Technology) for synthesis.
50.9.9 Support for Speech Input and Output
In the past, built-in speech I/O hardware was uncommon in workstations and PCs, so speech software typically supported speech I/O by means of add-on hardware supplied with the software or available from other third parties. This provided the desired capability, albeit with the disadvantages mentioned earlier (see Section 50.9.6).
Today most workstations and PCs have built-in audio support that can be used directly by the speech software. This avoids the disadvantages of add-on hardware, but the resulting A/D-D/A quality can be too noisy or otherwise inadequate for use in speech R&D (the built-in audio is typically designed for more mundane requirements). There are various reasons why special-purpose hardware may still be needed, including:
• need for more than two channels
• need for very high sampling rates
• compatibility with special hardware (e.g., DAT tape)
50.10 File Formats (Data Import/Export)
Signal file formats are fundamentally important because they determine how easy it is for independent programs to read and write the files (interoperability). Furthermore, the format determines whether files can contain all of the information that a program might need to operate on the file's primary data (e.g., can the file contain the sampling frequency in addition to a waveform itself?).
The best way to design speech file formats is hotly debated, but the clear trend has been towards "self-describing" file formats that include information about the names, data types, and layout of all data in the file. (For example, this permits programs to retrieve data by name.)
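A toy self-describing layout makes the idea concrete. This is an invented format for illustration only (real formats such as NIST Sphere define their own header conventions): a length-prefixed JSON header names each field, and readers retrieve data by name rather than by fixed offset.

```python
import io, json, struct

def write_signal(fp, sample_rate, samples):
    """Length-prefixed JSON header, then raw little-endian 16-bit samples."""
    header = {"fields": {"sample_rate": sample_rate,
                         "samples": {"type": "int16", "count": len(samples)}}}
    hdr = json.dumps(header).encode()
    fp.write(struct.pack("<I", len(hdr)))    # header length comes first
    fp.write(hdr)
    fp.write(struct.pack("<%dh" % len(samples), *samples))

def read_field(fp, name):
    """Look a field up by name, as a self-describing format permits."""
    fp.seek(0)
    (hlen,) = struct.unpack("<I", fp.read(4))
    header = json.loads(fp.read(hlen))
    if name == "samples":
        count = header["fields"]["samples"]["count"]
        return list(struct.unpack("<%dh" % count, fp.read(count * 2)))
    return header["fields"][name]

buf = io.BytesIO()
write_signal(buf, 16000, [1, -2, 3])
```

Because the header declares the name, type, and count of every field, an independent program can read this file without any out-of-band agreement beyond the format itself.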
There are many popular file formats, and various programs are available for converting among
them (e.g., SOX). For speech sampled data, the most important file format is Sphere (from NIST),
which is used in the speech databases available from the Linguistic Data Consortium (LDC). Sphere
supports several data compression formats in a variety of standard and specialized formats.
Sphere works well for sampled data files, but is limited for more general speech data files. A general purpose, public-domain format (Esignal) has recently been made available by Entropic.
50.11 Speech Databases
Numerous databases (or corpora) of speech are available from various sources. For a current list,
see the comp.speech Frequently Asked Questions (FAQ) (see Section 50.13). The largest supplier of
speech data is the Linguistic Data Consortium, which publishes a large number of CDs containing
speech and linguistic data.
50.12 Summary of Characteristics and Uses
In Section 50.1, we mentioned that the three most common uses for speech software are teaching, interactive exploration, and batch experiments. And at various points during the discussion of speech software characteristics, we mentioned their relative importance for the different classes of software uses. We attempt to summarize this in Table 50.1, where the symbol "•" indicates that a characteristic is particularly useful or important.
It is important not to take Table 50.1 too seriously. As we mentioned at the outset, the various distinguishing characteristics discussed in this section are not independent (i.e., there is considerable
overlap). Furthermore, the three classes of software use are broad and not always easily distinguishable; i.e., the importance of particular software characteristics depends a lot on the details of intended use. Nevertheless, Table 50.1 is a reasonable starting point for evaluating particular software in the context of intended use.
TABLE 50.1 Relative Importance of Software Characteristics

                                          Teaching   Interactive    Batch
                                                     exploration    experiments
  OS-based (50.3.1)                                                 •
  Workspace-based (50.3.2)                •
  Compute-oriented (50.4.1)                                         •
  Display-oriented (50.4.2)               •          •
  Compiled (50.5.2)                                                 •
  Interpreted (50.5.1)                    •
  Text-based interface (50.6.1)                      •              •
  Visual interface (50.6.2)               •          •
  Memory-based (50.9.1)                   •          •
  File-based (50.9.1)                                               •
  Parametric control (50.6.3)             •          •              •
  Consistency maintenance (50.8)          •          •
  History documentation (50.9.2)                     •              •
  Extensibility (50.7)                               •              •
  Personalization (50.9.3)                •          •
  Real-time performance (50.9.4)          •          •
  Source availability (50.9.5)                       •              •
  Cross-platform compatibility (50.9.7)   •          •              •
  Support for speech I/O (50.9.9)         •          •
50.13 Sources for Finding Out What is Currently Available
The best single online source of general information is the Internet news group comp.speech, and in particular its FAQ. Use this as a starting point.
50.14 Future Trends
From the user's viewpoint, speech software will continue to become easier to use, with a heavier
reliance on visual interfaces with consistency maintenance.
Calculator, spreadsheet, and block-diagram interfaces will become more common, but will not
eliminate text-based (programming, scripting) interfaces for specifying computations and system
extensions.
Software will become more open. Seamless extensibility will be more common, and extensions
will be easier. GUI extensions as well as computation extensions will be supported.
There will be less of a distinction between compute-oriented and display-oriented speech software.
Hybrid compute/display-oriented software will dominate.
Visualization will become more important and more sophisticated, particularly for multidimensional data. "Movies" will be used to show arbitrary 3D data. Sound will be used to represent an arbitrary dimension. Various methods will be available to project N-dimensional data into 2- or 3-space. (This will be used, for example, to show aspects of vector quantization or HMM clustering.)
Public-domain file formats will dominate proprietary formats. Networked computers will be used for parallel computation if available. Tcl/Tk and Java will grow in popularity as a base for graphical data displays and user interfaces.
References
[1] Henke, W.L., Speech and audio computer-aided examination and analysis facility, Quarterly Progress Rep. No. 95, MIT Research Laboratory of Electronics, 1969, 69–73.
[2] Henke, W.L., MITSYN — An interactive dialogue language for time signal processing, MIT Research Laboratory of Electronics, Report RLE TM-1, 1975.
[3] Kopec, G., The integrated signal processing system ISP, IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-32(4), 842–851, Aug. 1984.
[4] Pino, J.L., Ha, S., Lee, E.A., and Buck, J.T., Software synthesis for DSP using Ptolemy, J. VLSI Signal Processing, 9(1), 7–21, Jan. 1995.
[5] Potter, R.K., Kopp, G.A., and Green, H.C., Visible Speech, D. Van Nostrand Company, New York, 1946.
[6] Shipman, D., SpireX: Statistical analysis in the SPIRE acoustic-phonetic workstation, Proc. ICASSP, Boston, 1983.
[7] Shore, J., Interactive signal processing with UNIX, Speech Technol., 3, March/April 1988.
[8] Talkin, D., Looking at speech, Speech Technol., 4, April/May 1989.