VoiceXML Review - Feature Articles

Volume 1, Issue 11 - December 2001

What's New with VoiceXML 2.0?

By Jim A. Larson

(Continued from Part 1)

3.2. Speech Recognition Grammar Markup Language

Grammar processors, and speech recognizers in particular, use a grammar to define the input language they can accept: the words, and the sequences of words, that a user may say. The major task of a grammar processor is to find the sequence of words described by the grammar that best matches a given utterance, or to report that no such sequence exists. The Speech Recognition Grammar Specification defines the allowable sequences of words to be recognized by the automatic speech recognizer (ASR) and supports the definition of context-free grammars (CFGs). The specification defines an XML Grammar Markup Language and an optional Augmented Backus-Naur Format (ABNF) language. Automatic transformations between the two formats are possible; for example, XSLT can convert the XML format to ABNF. We anticipate that development tools will present the familiar ABNF format to developers while enabling XML software to manipulate the XML grammar format. The ABNF and XML languages are modeled after Sun's JSpeech Grammar Format. A complementary speech recognition grammar language specification is defined for N-Gram language models. An N-Gram language model may be used in place of a CFG when the CFG would be too large or complex to specify.
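To make the grammar processor's task concrete, here is a minimal Python sketch of checking whether a word sequence is described by a context-free grammar. The toy grammar, the `$rule` naming, and the function names `match_seq` and `accepts` are assumptions made for illustration; they are not part of the specification. A real recognizer works on audio and ranks scored hypotheses rather than testing an exact word string.

```python
from typing import Dict, Iterator, List

# Hypothetical toy grammar: each rule maps to a list of alternatives, and
# each alternative is a sequence of words or $rule references.
GRAMMAR: Dict[str, List[List[str]]] = {
    "$order": [["i", "would", "like", "a", "$size", "$item"]],
    "$size": [["small"], ["medium"], ["large"]],
    "$item": [["pizza"], ["coke"]],
}


def match_seq(symbols: List[str], words: List[str], pos: int) -> Iterator[int]:
    """Yield every word position reachable after matching symbols from pos."""
    if not symbols:
        yield pos
        return
    head, rest = symbols[0], symbols[1:]
    if head.startswith("$"):
        # Rule reference: try each alternative, then match what follows it.
        for alternative in GRAMMAR[head]:
            for mid in match_seq(alternative, words, pos):
                yield from match_seq(rest, words, mid)
    elif pos < len(words) and words[pos] == head:
        # Plain word: it must equal the next word of the utterance.
        yield from match_seq(rest, words, pos + 1)


def accepts(utterance: str, start: str = "$order") -> bool:
    """Return True if the whole utterance is described by the grammar."""
    words = utterance.lower().split()
    return any(end == len(words) for end in match_seq([start], words, 0))
```

For instance, `accepts("I would like a large pizza")` succeeds, while an utterance containing a word the grammar does not list is reported as having no match. The sketch assumes the grammar has no left-recursive rules; otherwise the recursion would not terminate.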

An example grammar, expressed using the ABNF notation, for a pizza ordering application follows:

3.3. Semantic Attachments

After recognizing sequences of words uttered by the user, the ASR performs semantic processing to generate semantic results representing the meaning of the words uttered by the user. For example, the user utterance

"I would like a medium coca cola and a large pizza with pepperoni and mushrooms."

would be converted to the following semantic result:

{
drink: {
    liquid:"coke"
    size:"medium"}
pizza: {
    number: "1"
    size: "large"
    topping: [ "pepperoni" "mushrooms" ]
}
}

Semantic Attachments provide a means to attach instructions to the grammar for generating semantic results. Semantic attachments are scripts specified using a scripting language defined in Semantic Attachments for Speech Recognition Grammars. This scripting language has the following characteristics:

It has no constructs that give rise to performance problems or side effects.
It behaves as much as possible like ECMAScript, so that developers familiar with ECMAScript find semantic attachments intuitive to use.
The specification does not imply that an implementation requires a full ECMAScript interpreter, or anything close to one.
If desired, the specification can easily be implemented or emulated using an ECMAScript interpreter, with some features optimized or removed.

The ABNF speech grammar for the pizza application shown above extended to include semantic attachments follows:

$order = I would like a $drink {drink.liquid = $drink.type;
drink.size = $drink.size}
and $pizza {pizza=$pizza};
// two properties on $order, both are structs
// drink was passed property by property to change a property name
// pizza is passed as whole struct

$kindofdrink = coke | pepsi | coca cola {"coke"};

$size = {"medium"}
[small | medium | large | regular {"medium"}];
// medium is default if nothing said

$tops = $top {Append([],$top)}
(and $top {Append($,$top)})+ ;
// construct list of toppings, return list

$top = anchovies | pepperoni | mushroom {"mushrooms"} | mushrooms;

$drink = $size $kindofdrink {size=$size; type=$kindofdrink };
// two named properties (size and type) on left hand side attribute

$pizza = $number $size {size=$size; number=$number} pizzas
with $tops {topping=$tops};
// three properties on $pizza attribute

$number = (a | one){1} | two {2}| three {3};
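The way the attachments assemble a result can be sketched in Python. This is a hand-written recursive-descent parser with one method per grammar rule, not an implementation of the semantic attachment language; the class name `OrderParser`, the helper names, and the function `interpret` are hypothetical. One deliberate relaxation is labeled in the code: the sketch accepts the singular "pizza" as well as the "pizzas" that the `$pizza` rule spells out, so that the example utterance from section 3.3 parses.

```python
import re
from typing import Any, Dict, List

# Word-to-value tables mirroring the attachments in the grammar above.
SIZES = {"small": "small", "medium": "medium", "large": "large", "regular": "medium"}
NUMBERS = {"a": 1, "one": 1, "two": 2, "three": 3}
TOPPINGS = {"anchovies": "anchovies", "pepperoni": "pepperoni",
            "mushroom": "mushrooms", "mushrooms": "mushrooms"}


class OrderParser:
    """Hypothetical recursive-descent parser, one method per grammar rule."""

    def __init__(self, utterance: str) -> None:
        self.words: List[str] = re.findall(r"[a-z]+", utterance.lower())
        self.pos = 0

    def peek(self) -> str:
        return self.words[self.pos] if self.pos < len(self.words) else ""

    def expect(self, *expected: str) -> None:
        for word in expected:
            if self.peek() != word:
                raise ValueError(f"expected {word!r} at word {self.pos}")
            self.pos += 1

    def lookup(self, table: Dict[str, Any], rule: str) -> Any:
        word = self.peek()
        if word not in table:
            raise ValueError(f"no match for {rule} at word {self.pos}")
        self.pos += 1
        return table[word]

    def size(self) -> str:
        # $size: "medium" is the default if no size word is said.
        return self.lookup(SIZES, "$size") if self.peek() in SIZES else "medium"

    def kind_of_drink(self) -> str:
        if self.peek() == "coca":
            self.expect("coca", "cola")
            return "coke"  # "coca cola" is normalized to "coke"
        return self.lookup({"coke": "coke", "pepsi": "pepsi"}, "$kindofdrink")

    def drink(self) -> Dict[str, str]:
        size = self.size()
        return {"size": size, "type": self.kind_of_drink()}

    def tops(self) -> List[str]:
        toppings = [self.lookup(TOPPINGS, "$top")]
        self.expect("and")  # the "+" in $tops requires at least one "and $top"
        toppings.append(self.lookup(TOPPINGS, "$top"))
        while self.peek() == "and":
            self.expect("and")
            toppings.append(self.lookup(TOPPINGS, "$top"))
        return toppings

    def pizza(self) -> Dict[str, Any]:
        number = self.lookup(NUMBERS, "$number")
        size = self.size()
        # Assumption: accept singular "pizza" as well as the grammar's "pizzas".
        self.lookup({"pizza": None, "pizzas": None}, "pizza(s)")
        self.expect("with")
        return {"size": size, "number": number, "topping": self.tops()}

    def order(self) -> Dict[str, Any]:
        self.expect("i", "would", "like", "a")
        drink = self.drink()
        self.expect("and")
        pizza = self.pizza()
        if self.pos != len(self.words):
            raise ValueError("unexpected trailing words")
        # drink is passed property by property (type becomes liquid);
        # pizza is passed as a whole struct.
        return {"drink": {"liquid": drink["type"], "size": drink["size"]},
                "pizza": pizza}


def interpret(utterance: str) -> Dict[str, Any]:
    return OrderParser(utterance).order()
```

Calling `interpret` on the utterance from section 3.3 yields a nested structure with a `drink` and a `pizza` entry. The pizza's `number` comes out as 1, because the `$number` rule maps "a" to 1.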

3.4. Speech Synthesis Markup Language

A text document may be produced automatically, authored by people, or created by a combination of both. The Speech Synthesis Markup Language supports high-level specifications, including the selection of voice characteristics (name, gender, and age) and the speed, volume, and emphasis of individual words. The language can also describe how to pronounce acronyms, either as a word, such as "Nasa" for NASA, or spelled out, such as "N, double A, C, P" for NAACP. At a lower level, designers may specify prosodic control, which includes pitch, timing, pausing, and speaking rate. The Speech Synthesis Markup Language is modeled on Sun's JSpeech Markup Language.

As an example, the sentence

Welcome to Ajax pizza

could be specified as

<voice gender="female" category = "adult">
Welcome to <emphasis level="strong">Ajax Pizza. </emphasis>
</voice>

which specifies that the speech synthesis system should use an adult female voice to speak the sentence "Welcome to Ajax pizza", emphasizing the words "Ajax Pizza."
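Applications that produce such markup automatically usually build it with an XML library rather than by string concatenation, so that special characters in the prompt text are escaped. The following Python sketch assembles the fragment above with the standard `xml.etree.ElementTree` module. The function name `build_prompt` and its parameters are hypothetical; the element and attribute names are taken from the example above.

```python
import xml.etree.ElementTree as ET


def build_prompt(plain_text: str, emphasized_text: str) -> str:
    """Return a <voice> fragment whose trailing words are strongly emphasized."""
    # Adult female voice, as in the example above.
    voice = ET.Element("voice", {"gender": "female", "category": "adult"})
    voice.text = plain_text
    # The emphasized words go into a nested <emphasis> element.
    emphasis = ET.SubElement(voice, "emphasis", {"level": "strong"})
    emphasis.text = emphasized_text
    return ET.tostring(voice, encoding="unicode")


markup = build_prompt("Welcome to ", "Ajax Pizza.")
```

Because the library serializes the text nodes, a prompt such as "Fish & Chips" comes out with the ampersand escaped, which keeps the document well-formed for the synthesizer's XML parser.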

3.5. Future markup languages from the Voice Browser Working Group

The VBWG has also published requirements and working drafts of other languages within the W3C Speech Interface Framework, including:

N-gram Grammar Markup Language, for describing automatically generated grammars for large vocabularies.
Natural Language Semantics Markup Language, for describing semantic representations of spoken utterances.
Reusable Modules, components that meet specific interface requirements. Their purpose is to reduce the effort needed to implement a dialog by reusing encapsulations of common dialog tasks, and to promote consistency across applications.
Pronunciation Lexicon Markup Language, to enable open, portable specification of pronunciation information for speech recognition and speech synthesis engines.
Call Control Markup Language, to enable the management of telephone calls and conferences.

Additional work is needed before these markup languages become candidates for W3C standardization.

4. Summary

The W3C Voice Browser Working Group has extended VoiceXML 1.0 to form VoiceXML 2.0 and has added several new markup languages, covering speech recognition grammars, semantic attachments, and speech synthesis. The speech recognition and speech synthesis markup languages were designed to be used with VoiceXML 2.0 as well as with non-VoiceXML applications. The speech community is invited to review and comment on working drafts of these languages.


Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE Industry Standards and Technology Organization (IEEE-ISTO).