What's New with VoiceXML 2.0? (Continued from Part 1)
3.2. Speech Recognition Grammar Markup Language
Grammar processors, and in particular speech recognizers, use a grammar to define the input language they can accept: the words and sequences of words that a user may say. The major task of a grammar processor is to find the sequence of words described by the grammar that best matches a given utterance, or to report that no such sequence exists. The Speech Recognition Grammar Specification defines the allowable sequences of words to be recognized by the automatic speech recognizer (ASR).
The Speech Recognition Grammar Specification supports
the definition of Context-Free Grammars. The specification
defines an XML Grammar Markup Language, and an optional
Augmented Backus-Naur Format (ABNF) Markup Language.
Automatic transformations between the two formats are
possible, for example, by XSLT to convert the XML format
to ABNF. We anticipate that development tools
will be constructed that provide the familiar ABNF format
to developers, and enable XML software to manipulate
the XML grammar format. The ABNF and XML languages
are modeled after Sun's JSpeech Grammar Format. A complementary
speech recognition grammar language specification is
defined for N-Gram language models. An N-Gram language
model may be used in place of a context-free grammar (CFG) when the
grammar would be too large or too complex to specify.
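To make the contrast with a hand-written CFG concrete, the following is a minimal sketch of a bigram (N = 2) model estimated from example sentences. The function name and the toy corpus are assumptions made for illustration; they are not part of the N-Gram specification, which defines a markup format rather than an estimation procedure.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(current word | previous word) by counting word pairs."""
    previous_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # "<s>" is a hypothetical start-of-sentence marker.
        words = ["<s>"] + sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            previous_counts[previous] += 1
            pair_counts[(previous, current)] += 1
    return {
        pair: count / previous_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# Toy corpus, invented for this example.
CORPUS = ["I would like a pizza", "I would like a coke"]
PROBABILITIES = bigram_probabilities(CORPUS)
```

Instead of listing every legal word sequence, the model stores how likely each word is to follow the previous one, which is why it scales to vocabularies that would be impractical to enumerate in a CFG.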
An example grammar, expressed using the ABNF notation, for a pizza ordering application follows:
$order = I would like a $drink and $pizza;
$kindofdrink = coke | pepsi | coca cola;
$size = small | medium | large | regular;
$tops = $top (and $top)+ ;
$top = anchovies | pepperoni | mushrooms;
$drink = $size $kindofdrink;
$pizza = $number $size pizzas with $tops;
$number = (a | one) | two | three;
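The matching task described above, deciding whether an utterance is a word sequence the grammar allows, can be sketched in a few lines of Python. This is an illustration only, not how a speech recognizer is implemented: it works on text that has already been recognized, all helper names (`phrase`, `seq`, `alt`, `ref`, `plus`, `ends`, `accepts`) are hypothetical, and it assumes the `$order` rule continues with `and $pizza`, as it does in the version with semantic attachments in Section 3.3.

```python
def phrase(text):
    """A literal phrase, stored as a sequence of word tokens."""
    return ("seq", *(("word", w) for w in text.split()))

def seq(*parts):
    return ("seq", *parts)

def alt(*options):
    return ("alt", *options)

def ref(name):
    return ("ref", name)

def plus(expr):
    """One or more repetitions, the assumed meaning of '+' in the grammar."""
    return ("plus", expr)

RULES = {
    "order": seq(phrase("i would like a"), ref("drink"), phrase("and"), ref("pizza")),
    "kindofdrink": alt(phrase("coke"), phrase("pepsi"), phrase("coca cola")),
    "size": alt(phrase("small"), phrase("medium"), phrase("large"), phrase("regular")),
    "tops": seq(ref("top"), plus(seq(phrase("and"), ref("top")))),
    "top": alt(phrase("anchovies"), phrase("pepperoni"), phrase("mushrooms")),
    "drink": seq(ref("size"), ref("kindofdrink")),
    "pizza": seq(ref("number"), ref("size"), phrase("pizzas with"), ref("tops")),
    "number": alt(phrase("a"), phrase("one"), phrase("two"), phrase("three")),
}

def ends(expr, tokens, pos, rules):
    """Yield every position at which expr can end when it starts at pos."""
    kind = expr[0]
    if kind == "word":
        if pos < len(tokens) and tokens[pos] == expr[1]:
            yield pos + 1
    elif kind == "ref":
        yield from ends(rules[expr[1]], tokens, pos, rules)
    elif kind == "alt":
        for option in expr[1:]:
            yield from ends(option, tokens, pos, rules)
    elif kind == "seq":
        positions = {pos}
        for part in expr[1:]:
            positions = {e for p in positions for e in ends(part, tokens, p, rules)}
        yield from positions
    elif kind == "plus":
        frontier = set(ends(expr[1], tokens, pos, rules))
        seen = set()
        while frontier:
            seen |= frontier
            frontier = {e for p in frontier
                        for e in ends(expr[1], tokens, p, rules)} - seen
        yield from seen

def accepts(utterance, start="order"):
    """Return True if the whole utterance is described by the start rule."""
    tokens = utterance.lower().split()
    return len(tokens) in set(ends(ref(start), tokens, 0, RULES))
```

With these rules, `accepts("I would like a medium coke and two large pizzas with pepperoni and mushrooms")` is true. Note that, reading `+` as one or more repetitions, `$tops` requires at least two toppings, so the same order ending in "with pepperoni" alone is rejected.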
3.3. Semantic Attachments
After recognizing the sequence of words uttered by the user, the ASR performs semantic processing to generate semantic results that represent the meaning of those words. For example, the user utterance
"I would like a medium coca cola and a large pizza with pepperoni and mushrooms."
would be converted to the following semantic result:
{
  drink: {
    liquid: "coke",
    size: "medium"
  },
  pizza: {
    number: "1",
    size: "large",
    topping: [ "pepperoni", "mushrooms" ]
  }
}
Semantic attachments provide a means to attach instructions to the grammar for generating semantic results. Semantic attachments are scripts specified using a scripting language defined in Semantic Attachments for Speech Recognition Grammars. This scripting language has the following characteristics:
- There are no constructs that give rise to performance problems or side effects.
- It behaves as much as possible like ECMAScript, so that developers familiar with ECMAScript find semantic attachments intuitive to use.
- The specification does not imply that an implementation requires a full ECMAScript interpreter, or anything close to one.
- If desired, the specification can easily be implemented or emulated using an ECMAScript interpreter, with some features optimized or removed.
The ABNF speech grammar for the pizza application shown above, extended to include semantic attachments, follows:
$order = I would like a $drink {drink.liquid = $drink.type;
                                drink.size = $drink.size}
         and $pizza {pizza=$pizza};
// two properties on $order, both are structs
// drink was passed property by property to change a property name
// pizza is passed as whole struct
$kindofdrink = coke | pepsi | coca cola {"coke"};
$size = {"medium"} [small | medium | large | regular {"medium"}];
// medium is default if nothing said
$tops = $top {Append([],$top)} (and $top {Append($,$top)})+ ;
// construct list of toppings, return list
$top = anchovies | pepperoni | mushroom {"mushrooms"} | mushrooms;
$drink = $size $kindofdrink {size=$size; type=$kindofdrink};
// two named properties (size and type) on left hand side attribute
$pizza = $number $size {size=$size; number=$number} pizzas
         with $tops {topping=$tops};
// three properties on $pizza’s attribute
$number = (a | one) {1} | two {2} | three {3};
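The way the attachments assemble a result can be approximated with a small hand-written parser in Python, where each method plays the role of one rule and returns that rule's value. This is a sketch under stated assumptions, not the scripting language defined by the specification: the class and method names are hypothetical, input is already-recognized text, and alternatives without an attachment (such as "pepsi") are assumed to return the spoken word.

```python
# Lookup tables mirror the alternatives and their attachments in the grammar.
DRINKS = {"coke": "coke", "pepsi": "pepsi", "coca cola": "coke"}
SIZES = {"small": "small", "medium": "medium", "large": "large", "regular": "medium"}
TOPS = {"anchovies": "anchovies", "pepperoni": "pepperoni",
        "mushroom": "mushrooms", "mushrooms": "mushrooms"}
NUMBERS = {"a": 1, "one": 1, "two": 2, "three": 3}


class OrderParser:
    def __init__(self, utterance):
        self.tokens = utterance.lower().split()
        self.pos = 0

    def expect(self, phrase):
        """Consume a literal phrase or fail."""
        words = phrase.split()
        if self.tokens[self.pos:self.pos + len(words)] != words:
            raise ValueError(f"expected {phrase!r} at word {self.pos}")
        self.pos += len(words)

    def choose(self, table, default=None):
        """Consume the longest matching key of table and return its value."""
        for key in sorted(table, key=lambda k: -len(k.split())):
            words = key.split()
            if self.tokens[self.pos:self.pos + len(words)] == words:
                self.pos += len(words)
                return table[key]
        if default is None:
            raise ValueError(f"no alternative matches at word {self.pos}")
        return default

    def size(self):
        # $size is optional; "medium" is the default if nothing is said.
        return self.choose(SIZES, default="medium")

    def drink(self):
        size = self.size()
        return {"size": size, "type": self.choose(DRINKS)}

    def tops(self):
        # Build the list one topping at a time, as Append does in the grammar.
        toppings = [self.choose(TOPS)]
        while True:
            self.expect("and")
            toppings.append(self.choose(TOPS))
            if self.tokens[self.pos:self.pos + 1] != ["and"]:
                return toppings

    def pizza(self):
        number = self.choose(NUMBERS)
        size = self.size()
        self.expect("pizzas with")
        return {"size": size, "number": number, "topping": self.tops()}

    def order(self):
        self.expect("i would like a")
        drink = self.drink()
        self.expect("and")
        pizza = self.pizza()
        if self.pos != len(self.tokens):
            raise ValueError("unexpected trailing words")
        # drink is copied property by property so that "type" becomes "liquid";
        # pizza is passed through as a whole struct.
        return {"drink": {"liquid": drink["type"], "size": drink["size"]},
                "pizza": pizza}
```

For example, `OrderParser("I would like a coca cola and three large pizzas with pepperoni and mushroom").order()` returns a drink with liquid "coke" and the default size "medium", and a pizza with number 3, size "large", and toppings "pepperoni" and "mushrooms".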
3.4. Speech Synthesis Markup Language
A text document may be produced automatically, authored by people, or created by a combination of both. The Speech Synthesis Markup Language supports high-level specifications, including the selection of voice characteristics (name, gender, and age) and the speed, volume, and emphasis of individual words. The language can also describe how to pronounce acronyms, whether spoken as a word, such as "Nasa" for NASA, or spelled out, such as "N, double A, C, P" for NAACP. At a lower level, designers may specify prosodic control, which includes pitch, timing, pausing, and speaking rate. The Speech Synthesis Markup Language is modeled on Sun's JSpeech Markup Language.
As an example, the sentence "Welcome to Ajax Pizza." could be specified as
<voice gender="female" category="adult">
  Welcome to <emphasis level="strong">Ajax Pizza.</emphasis>
</voice>
which specifies that the speech synthesis system should use an adult female voice to speak the sentence "Welcome to Ajax Pizza", emphasizing the words "Ajax Pizza."
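Because such prompts are often produced automatically, it can help to build the markup with an XML library rather than by pasting strings together. The following sketch uses Python's standard `xml.etree.ElementTree` module to generate the markup above; the function name `welcome_prompt` is hypothetical, and the element and attribute names are simply those used in the example, not a complete rendering of the synthesis markup language.

```python
import xml.etree.ElementTree as ET


def welcome_prompt(business):
    """Return synthesis markup that welcomes the caller to `business`."""
    # Voice characteristics: an adult female voice, as in the example above.
    voice = ET.Element("voice", {"gender": "female", "category": "adult"})
    voice.text = "Welcome to "
    # Strong emphasis on the business name only.
    emphasis = ET.SubElement(voice, "emphasis", {"level": "strong"})
    emphasis.text = f"{business}."
    return ET.tostring(voice, encoding="unicode")
```

Calling `welcome_prompt("Ajax Pizza")` yields a `voice` element wrapping the sentence with the name inside `emphasis`. A side benefit of this approach is that the library escapes characters such as `&` in the business name, so the generated document stays well-formed.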
3.5. Future markup languages from the Voice Browser Working Group
The VBWG has also published requirements and working drafts of other languages within the W3C Speech Interface Framework, including:
- N-gram Grammar Markup Language for describing automatically generated grammars for large vocabularies.
- Natural Language Semantics Markup Language for describing semantic representations of spoken utterances.
- Reusable Modules, which are components that meet specific interface requirements. Their purpose is to reduce the effort needed to implement a dialog by reusing encapsulations of common dialog tasks, and to promote consistency across applications.
- Pronunciation Lexicon Markup Language to enable open, portable specification of pronunciation information for speech recognition and speech synthesis engines.
- Call Control Markup Language to enable the management of telephone calls and conferences.
Additional
work is needed before these markup languages become
candidates for W3C standardization.
4. Summary
The
W3C Voice Browser Working Group has extended VoiceXML
1.0 to form VoiceXML 2.0 plus several new markup languages,
including speech recognition grammar, semantic attachment,
and speech synthesis. The speech recognition and speech
synthesis markup languages were designed to be used
in conjunction with VoiceXML 2.0, as well as with non-VoiceXML
applications. The speech community is invited to review
and comment on working drafts of these languages.

Copyright © 2001 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).