VoiceXML Review - Feature Articles

Volume 3, Issue 4 - July/August 2003


Figure 2 shows how we implement the familiar dialog processing cycle used in all voice browsers. In step (1), the document interpreter asks a URLFetcher to request a voice markup document from the Internet. The URLFetcher fetches the document and returns it (2). The returned page is fed to an XMLParser object, which turns it into a DOM tree (3). In step (4), the document's DOM tree is processed into some interpretable form. In step (5), the document is interpreted, which causes the next markup language document to be fetched, completing the cycle. The VoxGateway follows this cycle with the following variations:

The components driving the cycle are all created based on Factory methods, so substantially different configurations are possible. In theory, each channel can be set up to use a different configuration, and the configuration of a channel can change from session to session.
There is an extremely strict division between the core interpreter and the specific speech engines in use. The core knows absolutely nothing about the speech engines it is relying on. (In Figure 2, the core interpreter boxes are darker blue, while the framework's speech engines are in light blue.)
There is also a strict, though not perfect, division between the specific voice markup language a document uses, and the generic compiled dialog structure generated from it. This means an application could consist of alternating VoiceXML and VoxML pages. More usefully it means that a licensee of the VoxGateway can develop a new voice markup language and intermingle pages written in this new language with pages written in VoiceXML.
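
The cycle and its factory-created components can be sketched roughly as follows. This is only an illustration, not the VoxGateway source: URLFetcher and XMLParser are named in the text above, but every method signature, the DialogCompiler, CompiledDialog, and ComponentFactory types, and the toy MapBackedFactory are assumptions made for the example. A plain String stands in for the DOM tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Steps (1)-(2): fetches the raw markup for a URL. Signature is assumed. */
interface URLFetcher { String fetch(String url); }

/** Step (3): parses markup into a tree (a String stands in for a DOM here). */
interface XMLParser { String parse(String markup); }

/** Step (4): compiles a parsed tree into a markup-independent dialog. Hypothetical. */
interface DialogCompiler { CompiledDialog compile(String tree); }

/** Step (5): interpreting a dialog yields the next URL, or null to end. Hypothetical. */
interface CompiledDialog { String interpret(); }

/** Hypothetical factory: each channel could be handed a different implementation. */
interface ComponentFactory {
    URLFetcher newFetcher();
    XMLParser newParser();
    DialogCompiler newCompiler();
}

final class DialogCycle {
    private final ComponentFactory factory;

    DialogCycle(ComponentFactory factory) { this.factory = factory; }

    /** Runs the fetch/parse/compile/interpret loop and returns the URLs visited. */
    List<String> run(String startUrl) {
        URLFetcher fetcher = factory.newFetcher();
        XMLParser parser = factory.newParser();
        DialogCompiler compiler = factory.newCompiler();
        List<String> visited = new ArrayList<>();
        String url = startUrl;
        while (url != null) {
            visited.add(url);
            String markup = fetcher.fetch(url);              // steps (1) and (2)
            String tree = parser.parse(markup);              // step (3)
            CompiledDialog dialog = compiler.compile(tree);  // step (4)
            url = dialog.interpret();                        // step (5)
        }
        return visited;
    }
}

/** Toy factory: each "page" is just the URL of the next page ("" means hang up). */
final class MapBackedFactory implements ComponentFactory {
    private final Map<String, String> pages;

    MapBackedFactory(Map<String, String> pages) { this.pages = pages; }

    public URLFetcher newFetcher() { return url -> pages.getOrDefault(url, ""); }
    public XMLParser newParser() { return markup -> markup.trim(); }
    public DialogCompiler newCompiler() {
        return tree -> () -> tree.isEmpty() ? null : tree;
    }
}
```

Because DialogCycle only sees the factory interface, swapping in a different fetcher, parser, or compiler for one channel does not touch the loop itself.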

This architecture has served us well. It is flexible, and supports the diverse needs of our source licensees. It solves all the key problems that faced us in late 1998. The ExecutionContext API (and others) shield the rest of the VoxGateway from having to know about the particular speech engines and voice server being used. The strict separation between a document's specific voice markup language and the generic dialog structure compiled from it means that the VoxGateway is able to handle evolving standards well. And other classes (OutputFilter, URLFetcher) shield the VoxGateway from Windows dependencies, so that it is able to run on BSD Unix, Solaris, Linux, and other operating systems, and run on all Java Virtual Machines.

There are a few decisions I'd change if we were to start over. For instance the ExecutionContext abstract superclass both defines the abstract API to the speech and telephony resources, and provides the interface to the concrete, generic dialog interpretation engine. There would be less explaining to do if these concepts were separated. The architecture also contains part of a scripting language interpreter used before ECMAScript was made part of VoiceXML 1.0. This old interpreter hasn't been entirely refactored out yet due to concerns with performance -- it's the basis of some effective optimizations -- but code clarity would benefit from its removal.
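
To make the first point concrete, here is one way the two roles could be pulled apart. It is a hypothetical sketch, not the actual ExecutionContext: the SpeechResources and DialogEngine names, their methods, and the ScriptedResources test double are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Role 1 (assumed shape): the abstract API to speech and telephony resources. */
interface SpeechResources {
    void playPrompt(String text);
    String recognize(String grammar);
}

/** Role 2 (assumed shape): the concrete, generic dialog interpretation engine.
 *  It depends only on the SpeechResources interface, never on a specific engine. */
final class DialogEngine {
    private final SpeechResources resources;

    DialogEngine(SpeechResources resources) { this.resources = resources; }

    /** Plays a prompt, then returns whatever the recognizer reports. */
    String ask(String prompt, String grammar) {
        resources.playPrompt(prompt);
        return resources.recognize(grammar);
    }
}

/** Stand-in "engine": records prompts and returns a canned recognition result. */
final class ScriptedResources implements SpeechResources {
    final List<String> played = new ArrayList<>();
    private final String cannedResult;

    ScriptedResources(String cannedResult) { this.cannedResult = cannedResult; }

    public void playPrompt(String text) { played.add(text); }
    public String recognize(String grammar) { return cannedResult; }
}
```

With the roles split this way, a licensee would implement only SpeechResources for a new engine, and the interpreter logic would live in one concrete class.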

Efficiency

Efficiency is a vitally important attribute of a voice browser. Improvements in performance can cut the cost of a voice platform significantly, so we've always been looking for ways to improve our efficiency and increase channel density. Worries about Java performance goaded us especially at the start of the project.

In the earliest days, we ran a separate VoxGateway process for each channel. Our development node was a pair of 266 MHz PIIs ("Ren" and "Stimpy") with one running browsers and Nuance RecClient processes, and the other terminating a four-port analog Dialogic card and running the Nuance RecServer and the TTS process. One of our first optimizations was to run only one JVM with multiple channels in separate threads. This was pretty effective and it allowed for some other optimizations (shared caches) and features (e.g., an administrative web server) not easily implemented in a one-process-per-channel architecture.
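
A minimal sketch of the one-JVM, thread-per-channel arrangement with a shared cache might look like this. All class names here (SharedDocumentCache, Channel, MultiChannelGateway) are hypothetical, and the "document load" is faked with a string so the example runs without a network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Cache shared by every channel thread in the JVM (hypothetical). */
final class SharedDocumentCache {
    private final ConcurrentHashMap<String, String> documents = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();

    /** Returns the cached document, loading it at most once per URL. */
    String get(String url) {
        return documents.computeIfAbsent(url, u -> {
            loads.incrementAndGet();           // counts real (simulated) loads
            return "<vxml src='" + u + "'/>";  // placeholder for a real fetch
        });
    }

    int loadCount() { return loads.get(); }
}

/** One telephony channel, run on its own thread (hypothetical). */
final class Channel implements Runnable {
    private final SharedDocumentCache cache;
    private final String startUrl;
    private volatile String document;

    Channel(SharedDocumentCache cache, String startUrl) {
        this.cache = cache;
        this.startUrl = startUrl;
    }

    @Override public void run() { document = cache.get(startUrl); }

    String document() { return document; }
}

final class MultiChannelGateway {
    /** Starts one thread per channel and returns how many loads actually happened. */
    static int serveAll(int channelCount, String url) {
        SharedDocumentCache cache = new SharedDocumentCache();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < channelCount; i++) {
            Thread t = new Thread(new Channel(cache, url), "channel-" + i);
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return cache.loadCount();
    }
}
```

In a process-per-channel design each process would hold its own copy of such a cache. Here every channel shares one, so a document requested by several channels is loaded only once.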

Our next effort wasn't so effective, and taught us again the principle that intuition is a poor basis for finding performance hot spots. In this case, our intuition was that the process of compiling a voice markup language document into a generic dialog tree was expensive. We spent time adding in diagnostic code to measure the number of dialog tree nodes being created, and then reduced this number by half. This led to only a tiny improvement, because in fact, document compilation was not particularly expensive.

The next step along these lines was the DialogCache, a cache of the n most recently created compiled dialog trees. This cache proved to be difficult to write, since it uncovered several subtle contraventions of the principle that there should be no execution state in the compiled dialog trees. But eventually the cache was completed and it does reduce the processing time of the core interpreter by about 10-20 percent under ideal circumstances: every caller always listens to the same small set of VoiceXML documents, such as a set of twelve daily horoscopes. But constant content tends to be the exception -- content is often tailored for the particular time, user, data, and circumstance, so caching it rarely pays off. However, in the new W3C VoiceXML 2.1 specification, now in progress, several features support a style of application delivery where all the VoiceXML document's variability is banished to subsidiary data fetches. Where this approach is used, the DialogCache could be a reasonable optimization to enable.
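
A cache of this kind can be approximated in a few lines with an access-ordered LinkedHashMap. The sketch below is not the real DialogCache: the class shape, the get and contains methods, and keying by URL are assumptions, and it evicts the least recently used entry, which may differ from the actual eviction rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Bounded cache of compiled dialog trees, keyed by document URL (assumed design). */
final class DialogCache<D> {
    private final Map<String, D> entries;

    DialogCache(final int capacity) {
        // accessOrder = true makes iteration order least-recently-used first.
        this.entries = new LinkedHashMap<String, D>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, D> eldest) {
                return size() > capacity;
            }
        };
    }

    /**
     * Returns the cached tree for url, compiling it on a miss. Sharing a tree
     * across calls is only safe if the tree holds no per-call execution state.
     */
    synchronized D get(String url, Function<String, D> compiler) {
        D dialog = entries.get(url);
        if (dialog == null) {
            dialog = compiler.apply(url);
            entries.put(url, dialog);
        }
        return dialog;
    }

    /** Reports whether url is cached, without changing the eviction order. */
    synchronized boolean contains(String url) {
        return entries.containsKey(url);
    }
}
```

The comment on get is the crux: a cached tree is handed to many calls, so any per-call state left inside it would leak from one caller to the next.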

At this point we began serious performance studies based on automated tests using Empirix Hammer machines to generate tens of thousands of calls. Audio handling was soon identified as a key hot spot. We implemented several optimizations, including a shared local FetchResultCache that integrated with the Nuance audio system. As part of this effort, we made sure that we implemented all the proper VoiceXML 2.0 fetch semantics, including the maxage and maxstale attributes and properties. We also added the HTTP If-Modified-Since optimization, so that if we already had a cached response, and got back a "not modified" result, we could continue using that cached response. This work had a very solid payoff, teaching us that application developers really need to think carefully about audio age and staleness, and voice browser implementers need to focus on audio handling efficiency.
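
The decision a cache has to make for each fetch can be sketched as a small pure function. This follows my reading of the VoiceXML 2.0 maxage and maxstale rules and is not taken from the FetchResultCache: the CachedEntry, FetchDecision, and FetchPolicy names are hypothetical, times are in seconds, and a null maxAge or maxStale means the attribute was not supplied.

```java
/** What the cache knows about a stored response (hypothetical shape). */
final class CachedEntry {
    final long ageSeconds;                // time since the response left the origin
    final long freshnessLifetimeSeconds;  // lifetime derived from the HTTP headers

    CachedEntry(long ageSeconds, long freshnessLifetimeSeconds) {
        this.ageSeconds = ageSeconds;
        this.freshnessLifetimeSeconds = freshnessLifetimeSeconds;
    }
}

enum FetchDecision {
    USE_CACHED,  // serve the stored copy without contacting the server
    REVALIDATE,  // conditional GET with If-Modified-Since
    FETCH        // nothing cached: plain GET
}

final class FetchPolicy {
    /** Decides how to satisfy a fetch given the cache state and the document's limits. */
    static FetchDecision decide(CachedEntry entry, Long maxAge, Long maxStale) {
        if (entry == null) {
            return FetchDecision.FETCH;
        }
        // maxage: the document refuses copies older than this, fresh or not.
        if (maxAge != null && entry.ageSeconds > maxAge) {
            return FetchDecision.REVALIDATE;
        }
        // Still within its HTTP freshness lifetime: use it.
        if (entry.ageSeconds < entry.freshnessLifetimeSeconds) {
            return FetchDecision.USE_CACHED;
        }
        // Expired: maxstale says how far past expiry is still acceptable.
        long staleness = entry.ageSeconds - entry.freshnessLifetimeSeconds;
        if (maxStale != null && staleness <= maxStale) {
            return FetchDecision.USE_CACHED;
        }
        return FetchDecision.REVALIDATE;
    }

    /** After a conditional GET, HTTP 304 Not Modified means the cached copy stays valid. */
    static boolean reuseCachedAfterRevalidation(int httpStatus) {
        return httpStatus == 304;
    }
}
```

For example, a prompt that is 90 seconds old with a 60-second lifetime is served from cache when maxstale is 60, but triggers a conditional request when maxstale is 10.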

These performance studies also showed that the voice browser process in its totality was a relative performance hog. Over the next year (2002) we put a lot of effort into measuring performance and fixing hot spots. Overall we applied hundreds of small (and not so small) optimizations to the core interpreter, getting it to run three to four times faster. We used Rational Quantify, HPjmeter, and various timed tests. Towards the end of the year we found that the core interpreter was now taking very little time in the full system test. The hot spots in the voice browser process had become logging and speech engine interaction. So today we're focused on improving framework (as opposed to core interpreter) performance. Garbage collection improvements also look like they could have significant payoffs, so we are experimenting with JVM garbage collection property settings and object reference nullification.

Today's VoxGateway

Motorola was forced to discontinue its voice server business in late 2001 as the telecom industry suffered through its historic downturn. MIX was aimed primarily at the large carrier marketplace, and carriers were terribly burdened with debt from the 3G spectrum auctions and the capital investment made during the Internet boom. But even though MIX was no longer a product, it still made good sense to keep on developing and source licensing the VoxGateway: we had a reasonable base of licensees, with an ongoing flow of interest from other companies.

The VoxGateway is also extremely useful to us as we work to understand the new area of multimodal (visual and voice) interfaces. In 2001 we were able to use it to quickly put together a research multimodal server called the Multimodal Fusion Server (MFS). The MFS runs a dialect of VoiceXML called MMVXML, and it uses a research version of SpeechWorks OSR 1.0. The MFS runs in conjunction with multimodal-enabled iDEN i90c handsets on the Nextel national network. A technology called Distributed Speech Recognition (DSR) runs on both handset and server to greatly improve recognition rates. One of our sample Java multimodal applications on the i90c handsets consistently amazed people with good recognition rates for a huge Chicago street address grammar. We are continuing these experiments to be ready to support multimodal applications as the W3C multimodal standards emerge.


Copyright © 2001-2003 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).