continued from page 3...
Figure 2 shows how we implement the familiar dialog processing
cycle used in all voice browsers. In step (1), the
document interpreter asks a URLFetcher to request a voice
markup document from the Internet. The URLFetcher
fetches the document and returns it (2). The returned
page is fed to an XMLParser object, which turns it into
a DOM tree (3). In step (4) the document's DOM tree
is processed into some interpretable form. In step
(5), the document is interpreted, which causes the next
markup language document to be fetched, completing the
cycle. The VoxGateway follows this cycle with the
following variations:
- The components driving the cycle are all created
based on Factory methods, so substantially different
configurations are possible. In theory, each
channel can be set up to use a different configuration,
and the configuration of a channel can change from
session to session.
- There is an extremely strict division between the
core interpreter and the specific speech engines in
use. The core knows absolutely nothing about
the speech engines it is relying on. (In Figure
2, the core interpreter boxes are darker blue, while
the framework's speech engines are in light blue.)
- There is also a strict, though not perfect, division
between the specific voice markup language a document
uses, and the generic compiled dialog structure generated
from it. This means an application could consist
of alternating VoiceXML and VoxML pages. More
usefully it means that a licensee of the VoxGateway
can develop a new voice markup language and intermingle
pages written in this new language with pages written
in VoiceXML.
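The cycle and the factory-based configuration described above can be sketched roughly as follows. This is an illustration only, not VoxGateway code: URLFetcher and XMLParser are names taken from the text, but their method signatures, and the names ComponentFactory, DialogCompiler, Dialog, DocumentInterpreter, and DemoFactory, are assumptions invented for this example. A plain string stands in for the DOM tree to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical interfaces; only the names URLFetcher and XMLParser come from the text.
interface URLFetcher { String fetch(String url); }            // steps (1)-(2)
interface XMLParser { String parse(String page); }            // step (3), string stands in for a DOM tree
interface DialogCompiler { Dialog compile(String domTree); }  // step (4)
interface Dialog { String interpret(); }                      // step (5), returns the next URL or null

// A channel is configured by handing it a factory (assumed design).
interface ComponentFactory {
    URLFetcher createFetcher();
    XMLParser createParser();
    DialogCompiler createCompiler();
}

class DocumentInterpreter {
    private final URLFetcher fetcher;
    private final XMLParser parser;
    private final DialogCompiler compiler;

    DocumentInterpreter(ComponentFactory factory) {
        this.fetcher = factory.createFetcher();
        this.parser = factory.createParser();
        this.compiler = factory.createCompiler();
    }

    /** Runs the cycle until a dialog yields no next URL; returns the URLs visited. */
    List<String> run(String startUrl) {
        List<String> visited = new ArrayList<>();
        String url = startUrl;
        while (url != null) {
            visited.add(url);
            String page = fetcher.fetch(url);       // fetch the markup document
            String dom = parser.parse(page);        // parse it
            Dialog dialog = compiler.compile(dom);  // compile to an interpretable form
            url = dialog.interpret();               // interpreting yields the next document
        }
        return visited;
    }
}

// In-memory factory for demonstration: each "page" is just the URL of the next page,
// and an empty page ends the session.
class DemoFactory implements ComponentFactory {
    private final Map<String, String> pages;

    DemoFactory(Map<String, String> pages) { this.pages = pages; }

    public URLFetcher createFetcher() { return url -> pages.getOrDefault(url, ""); }
    public XMLParser createParser() { return page -> page.trim(); }
    public DialogCompiler createCompiler() {
        return dom -> () -> dom.isEmpty() ? null : dom;
    }
}
```

Because DocumentInterpreter sees only the interfaces, swapping in a different factory per channel changes the fetcher, parser, or compiler without touching the loop itself.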
This architecture has served us well. It is flexible,
and supports the diverse needs of our source licensees.
It solves all the key problems that faced us in late
1998. The ExecutionContext API (and others) shield
the rest of the VoxGateway from having to know about
the particular speech engines and voice server being
used. The strict separation between a document's
specific voice markup language and the generic dialog
structure compiled from it means that the VoxGateway
is able to handle evolving standards well. And
other classes (OutputFilter, URLFetcher) shield the
VoxGateway from Windows dependencies, so that it is able
to run on BSD Unix, Solaris, Linux, and other operating
systems, and run on all Java Virtual Machines.
There are a few decisions I'd change if we were to start over.
For instance the ExecutionContext abstract superclass
both defines the abstract API to the speech and telephony
resources, and provides the interface to the
concrete, generic dialog interpretation engine.
There would be less explaining to do if these concepts
were separated. The architecture also contains
part of a scripting language interpreter used before
ECMAScript was made part of VoiceXML 1.0. This
old interpreter hasn't been entirely refactored out
yet due to concerns with performance -- it's the basis
of some effective optimizations -- but code clarity
would benefit from its removal.
Efficiency
Efficiency is a vitally important attribute of a voice
browser. Improvements in performance can cut the
cost of a voice platform significantly, so we've always
been looking for ways to improve our efficiency and
increase channel density. Worries about Java performance
goaded us especially at the start of the project.
In the earliest days, we ran a separate VoxGateway process
for each channel. Our development node was a pair
of 266 MHz PIIs ("Ren" and "Stimpy"),
with one running browsers and Nuance RecClient processes,
and the other terminating a four-port analog Dialogic
card and running the Nuance RecServer and the TTS process.
One of our first optimizations was to run only one JVM
with multiple channels in separate threads. This
was pretty effective and it allowed for some other optimizations
(shared caches) and features (e.g., an administrative
web server) not easily implemented in a one-process-per-channel
architecture.
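As a rough illustration of why a single JVM helps, the sketch below runs several channels as threads that share one cache, so a document requested by every channel is fetched only once. The class ChannelHost and its methods are hypothetical names for this example, and the "fetch" is simulated in memory; none of this is taken from the VoxGateway source.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical host for several channels inside one JVM.
class ChannelHost {
    // One cache shared by every channel thread; impossible with one process per channel.
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger fetches = new AtomicInteger();

    /** Returns the document for url, performing the (simulated) fetch at most once. */
    String fetch(String url) {
        return cache.computeIfAbsent(url, u -> {
            fetches.incrementAndGet();           // counts real fetches only
            return "<vxml>" + u + "</vxml>";     // stand-in for a network fetch
        });
    }

    /** Runs the given number of channels, each in its own thread, all requesting the same URL. */
    void runChannels(int channels, String url) {
        ExecutorService pool = Executors.newFixedThreadPool(channels);
        for (int i = 0; i < channels; i++) {
            pool.execute(() -> fetch(url));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for channels", e);
        }
    }

    int fetchCount() { return fetches.get(); }
}
```

With four channels requesting the same URL, fetchCount() reports a single fetch, because computeIfAbsent applies its function at most once per key.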
Our next effort wasn't so effective, and taught us again
the principle that intuition is a poor basis for finding
performance hot spots. In this case, our intuition
was that the process of compiling a voice markup language
document into a generic dialog tree was expensive.
We spent time adding in diagnostic code to measure the
number of dialog tree nodes being created, and then
reduced this number by half. This led to only
a tiny improvement, because in fact, document compilation
was not particularly expensive.
The next step along these lines was the DialogCache, a cache
of the n most recently created compiled dialog
trees. This cache proved to be difficult to write,
since it uncovered several subtle contraventions of
the principle that there should be no execution state
in the compiled dialog trees. But eventually the
cache was completed and it does reduce the processing
time of the core interpreter by about 10-20 percent
under ideal circumstances, in which every caller
listens to the same small set of VoiceXML documents,
such as a set of twelve daily horoscopes. But
constant content tends to be the exception -- content
is often tailored for the particular time, user, data,
and circumstance, so caching it rarely pays off.
However, in the new W3C VoiceXML
2.1 specification, now in progress, several features
support a style of application delivery where all the
VoiceXML document's variability is banished to subsidiary
data fetches. Where this approach is used,
the DialogCache could be a reasonable optimization to
enable.
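A cache of the n most recently created compiled trees can be sketched in a few lines of Java. This is a minimal guess at the shape of such a cache, not the actual DialogCache: the class body, the URL key, and the generic tree type D are assumptions. The sketch only works if the cached trees hold no execution state, which is exactly the principle the text says was hard to enforce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache of compiled dialog trees, keyed by document URL.
// Cached trees are shared between sessions, so they must carry no per-call state.
class DialogCache<D> {
    private final Map<String, D> trees;

    DialogCache(final int capacity) {
        // The default LinkedHashMap ordering is insertion order, so the eldest entry
        // is the least recently created tree. Using the three-argument constructor
        // with accessOrder=true would give least-recently-used eviction instead.
        this.trees = new LinkedHashMap<String, D>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, D> eldest) {
                return size() > capacity;   // evict once more than 'capacity' trees are held
            }
        };
    }

    // synchronized because channels in separate threads would share the cache
    synchronized D get(String url) { return trees.get(url); }
    synchronized void put(String url, D tree) { trees.put(url, tree); }
    synchronized int size() { return trees.size(); }
}
```

With a capacity of two, caching a third tree evicts the first one created, while the two most recent remain available to all channels.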
At this point we began serious performance studies based
on automated tests using Empirix Hammer machines to
generate tens of thousands of calls. Audio handling
was soon identified as a key hot spot. We implemented
several optimizations, including a shared local FetchResultCache
that integrated with the Nuance audio system.
As part of this effort, we made sure that we implemented
all the proper VoiceXML 2.0 fetch semantics, including
the maxage and maxstale attributes and properties.
We also added the HTTP If-Modified-Since optimization,
so that if we already had a cached response, and got
back a "not modified" result, we could continue
using that cached response. This work had a very
solid payoff, teaching us that application developers
really need to think carefully about audio age and staleness,
and voice browser implementers need to focus in on audio
handling efficiency.
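The freshness decision behind this kind of cache can be sketched as below. This is a simplified reading of the maxage and maxstale rules, written for illustration; the VoiceXML 2.0 specification remains the normative source, and the class FetchPolicy, its method names, and its parameters are assumptions rather than anything from the VoxGateway or Nuance code. The only HTTP detail relied on is that status 304 means "Not Modified".

```java
// Hypothetical helper for deciding whether a cached audio file or document may be reused.
class FetchPolicy {
    enum Decision { USE_CACHED, REVALIDATE }

    /**
     * @param ageSeconds        seconds since the cached copy was obtained
     * @param secondsPastExpiry seconds past the copy's expiration time (zero or negative if unexpired)
     * @param maxAge            maxage in seconds, or null if not specified
     * @param maxStale          maxstale in seconds, or null if not specified
     */
    static Decision decide(long ageSeconds, long secondsPastExpiry, Long maxAge, Long maxStale) {
        // A copy older than maxage is never used without asking the server.
        if (maxAge != null && ageSeconds > maxAge) {
            return Decision.REVALIDATE;
        }
        // An expired copy is tolerated only within the maxstale allowance.
        long allowedStaleness = (maxStale != null) ? maxStale : 0;
        if (secondsPastExpiry > allowedStaleness) {
            return Decision.REVALIDATE;
        }
        return Decision.USE_CACHED;
    }

    /** After a conditional request (If-Modified-Since): 304 keeps the cached body. */
    static String resolve(int httpStatus, String cachedBody, String responseBody) {
        return httpStatus == 304 ? cachedBody : responseBody;
    }
}
```

Under these assumptions a prompt with a generous maxage is served straight from the local cache, and even when revalidation is required, a 304 response avoids transferring the audio again.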
These performance studies also showed that the voice browser
process in its totality was a relative performance hog.
Over the next year (2002) we put a lot of effort into
measuring performance and fixing hot spots. Overall
we applied hundreds of small (and not so small)
optimizations to the core interpreter, getting it to
run three to four times faster. We used Rational
Quantify, HPjmeter, and various timed tests. Towards
the end of the year we found that the core interpreter
was now taking very little time in the full system test.
The hot spots in the voice browser process had become
logging and speech engine interaction. So today
we're focused on improving framework (as opposed to
core interpreter) performance. Garbage collection
improvements also look like they could have significant
payoffs, so we are experimenting with JVM garbage collection
property settings and object reference nullification.
Today's VoxGateway
Motorola was forced to discontinue its voice server business
in late 2001 as the telecom industry suffered through
its historic downturn. MIX was aimed primarily
at the large carrier marketplace, and carriers were
terribly burdened with debt from the 3G spectrum auctions
and the capital investment made during the Internet
boom. But even though MIX was no longer a product,
it still made good sense to keep on developing and source
licensing the VoxGateway: we had a reasonable base of
licensees, with an ongoing flow of interest from other
companies.
The VoxGateway is also extremely useful to us as we work
to understand the new area of multimodal (visual and
voice) interfaces. In 2001 we were able to use
it to quickly put together a research multimodal server
called the Multimodal Fusion Server (MFS). The
MFS runs a dialect of VoiceXML called MMVXML, and it
uses a research version of SpeechWorks OSR 1.0.
The MFS runs in conjunction with multimodal-enabled
iDEN i90c handsets on the Nextel national network.
A technology called Distributed Speech Recognition (DSR)
runs on both handset and server to greatly improve recognition
rates. One of our sample Java multimodal applications
on the i90c handsets consistently amazed people with
good recognition rates for a huge Chicago street address
grammar. We are continuing these experiments to
be ready to support multimodal applications as the W3C
multimodal standards emerge.

Copyright © 2001-2003 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards
and Technology Organization (IEEE-ISTO).