VoiceXML Events
Welcome to "First Words", the VoiceXML Review's column
to teach you about VoiceXML and how you can use it.
We hope you enjoy the lesson.
Handling Complex Recognition Results
One of the changes in the April release of the VoiceXML
2.0 working draft was the formalization of how recognition
results would be made available at the VoiceXML level.
Last month, we had a look at how this comes into play
with fairly simple VoiceXML pages. This month we're
going to start discussing more advanced assignments,
which arise when making use of form-level grammars and
more complex user interactions.
You may recall that last issue we discussed the use of 'slot-filling'
to assign values to recognition results, rather than
just returning the raw utterance value. This is useful
for providing consistent recognition results for a range
of phrases. We've written pages using simple recognition
results in the past; here is one of last issue's examples:
<field name="command">
  <grammar xml:lang="en-US" version="1.0" root="Help">
    <rule id="Help" scope="public">
      <one-of>
        <item> help </item>
        <item> save me </item>
        <item> succour </item>
      </one-of>
    </rule>
  </grammar>
  <filled>
    You said <value expr="command"/>
  </filled>
</field>
In this example, we have the ECMAScript variable 'returnvalue'
receiving the value 'help' in all three cases, regardless
of which of the three legal user utterances is recognized.
This keeps application processing simpler, and decouples
the contents of the grammar from the application itself.
The benefit shows up as grammars are extended or tuned:
changes can be made to the grammar without affecting
other parts of the application.
Although we are using a simple assignment here, the contents
of the <tag> element can be quite complex, allowing
powerful manipulation prior to returning a value to
the application. This processing is known as Semantic
Interpretation, and is important in advanced applications.
Unfortunately, the language for Semantic Interpretation
has not yet been standardized; this work is currently
being performed by the W3C Voice Browser Working Group.
The current working draft, although somewhat dated, is available at:
http://www.w3.org/TR/semantic-interpretation/
As we described last issue, the interpretation is returned
to the VoiceXML interpreter from the recognizer as an
ECMAScript object. The more traditional nomenclature
of 'slots' can now be considered as the properties of
this ECMAScript object. This object can be complex,
including component objects. Recall that the complete
object is provided in the application-scoped VoiceXML
variable application.lastresult$.interpretation.
An example object result from the VoiceXML 2.0 specification
is shown below:
{
  drink: "coke",
  pizza: {
    number: "3",
    size: "large",
    topping: [
      "pepperoni",
      "mushrooms"
    ]
  }
}
In this case, we have two top-level properties: 'drink'
and 'pizza', where 'drink' is a simple property, while
'pizza' is a compound property consisting of an object
with its own properties ('number', 'size' and 'topping').
Such a result might be produced by an utterance such
as "I would like a coca cola and three large pizzas
with pepperoni and mushrooms", as opposed to requiring
the user to be prompted for each part of the order and
its characteristics. Both styles of input (mixed initiative
and directed dialog respectively) can be supported within
VoiceXML with a combination of form and field level
grammars.
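To make the shape of that result concrete, here is a minimal ECMAScript sketch of reading simple and compound properties from it. The variable name 'interpretation' is our own stand-in for application.lastresult$.interpretation, and the 'summary' string is purely illustrative; neither comes from the specification.

```javascript
// Hypothetical stand-in for application.lastresult$.interpretation,
// holding the example result from the VoiceXML 2.0 specification.
const interpretation = {
  drink: "coke",
  pizza: {
    number: "3",
    size: "large",
    topping: ["pepperoni", "mushrooms"]
  }
};

// Simple property: a plain string value.
const drink = interpretation.drink;

// Compound property: an object with its own properties.
const pizzaCount = interpretation.pizza.number;
const toppings = interpretation.pizza.topping.join(" and ");

// Illustrative only: assemble a read-back of the order.
const summary = pizzaCount + " " + interpretation.pizza.size +
  " pizza(s) with " + toppings + ", plus a " + drink;
```

Note that 'number' arrives as the string "3" in this result, so an application that needs arithmetic would have to convert it first.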
In the mixed initiative case, a form level grammar would
be used to allow 'pizza power users' to describe their
complete order in a single breath, while the field level
grammars would drive the input for each component of
the order for the directed dialog caller. Next month,
we'll write this application, but for now, we're just
going to talk about how results are mapped when a form-level
grammar is triggered by a user utterance.
The rules for taking the results from an object such as
the one shown above, and assigning values to the fields within
the form containing the form grammar, are fairly straightforward.
Again, from the VoiceXML 2.0 working draft:
- The "slot" attribute of a <field> is a (very restricted)
ECMAScript expression that selects some portion of
the result to be assigned to the field. In addition
to selecting the top-level result property, the attribute
can select properties at arbitrary levels of nesting,
using a dot-separated list of element/property names,
as in "pizza.number" and "order.pizza.topping". Note
that it is possible for a specific slot value to fill
more than one field, if the slot names of the fields
are the same.
We've only talked about 'slots' as top-level properties in the
past, but the slot attribute can be used to select components of
the results as well. This may make it simpler to reuse more complex
grammars in both the mixed-initiative and directed dialog cases.
- If the portion of the result named by the "slot" (or "name") attribute of
a <field> doesn't exist in a given result then the field item's value is
unchanged.
Recall that after assigning everything possible, the form
interpretation algorithm (FIA) will drive the collection of
information for any fields that remain unfilled. So, for example,
if the caller could have also selected a credit card for payment,
and it wasn't supplied in the initial utterance, the credit card
field in the form would be visited after filling other fields with
the results from processing the user utterance.
- The default value for the "slot" attribute is supplied by the
value of "name" attribute.
Don't forget, though, that if the grammar returns a slot and you
don't specify a matching name (either with the slot attribute, or with
the default of the field name), you will actually end up with
an ECMAScript object assigned to your field variable, as opposed to
the value of the object.
Using these rules, the FIA will populate as many fields
as possible using the results from the form-level recognition,
and then perform directed dialog processing to collect
the rest.
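The three rules above can be sketched in ECMAScript as follows. This is only an illustration under our own assumptions, not how any particular VoiceXML interpreter is implemented: the function names 'selectSlot' and 'fillFields', the field records with 'name', 'slot' and 'value' properties, and the sample field names are all hypothetical.

```javascript
// Example result from the VoiceXML 2.0 specification.
const interpretation = {
  drink: "coke",
  pizza: { number: "3", size: "large", topping: ["pepperoni", "mushrooms"] }
};

// Resolve a dot-separated slot expression such as "pizza.number"
// against a result object. Returns undefined if any step is missing.
function selectSlot(result, slot) {
  let current = result;
  for (const part of slot.split(".")) {
    if (current === null || typeof current !== "object" || !(part in current)) {
      return undefined;
    }
    current = current[part];
  }
  return current;
}

// Apply a form-level result to a list of hypothetical field records.
// Returns the names of fields that still have no value.
function fillFields(fields, result) {
  for (const field of fields) {
    // Rule 3: the slot defaults to the field name.
    const slot = field.slot !== undefined ? field.slot : field.name;
    // Rule 1: the slot may select a nested portion of the result.
    const selected = selectSlot(result, slot);
    // Rule 2: if that portion doesn't exist, the value is left unchanged.
    if (selected !== undefined) {
      field.value = selected;
    }
  }
  // These are the fields directed dialog would still need to collect.
  return fields.filter((f) => f.value === undefined).map((f) => f.name);
}

const fields = [
  { name: "drink" },                      // filled from the top-level property
  { name: "count", slot: "pizza.number" }, // filled from a nested property
  { name: "pizza" },                      // receives the whole compound object
  { name: "card" }                        // not in the result, stays unfilled
];
const unfilled = fillFields(fields, interpretation);
```

In this sketch 'unfilled' contains only "card", which mirrors the credit card case described above: the FIA would visit that field after the others have been filled from the utterance.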
Summary
We're starting to look at how form-level grammars can be used
to populate multiple fields with a single utterance,
and how that information is assigned according to VoiceXML
2.0. Understanding how both field and form-level grammars
are processed is important for writing both flexible
applications and reusable grammars. More on this next
issue!
VoiceXML Users Group Call for Participation
The VoiceXML Forum is beginning to prepare for the Spring Users Group
Meeting, to be held in conjunction with the AVIOS Speech Developers
Conference and Expo, from March 31st to April 3rd, 2003, at the Fairmont
Hotel in San Jose, California. The VoiceXML Users Group Meeting will be
held on April 3rd.
In the past, the VoiceXML Users Group has provided tutorials,
technology overviews, and other such features to allow
technology leaders to become familiar with speech technologies.
VoiceXML is clearly now in the mainstream of speech
application development. So for the Spring Meeting,
the VoiceXML Forum is looking to the VoiceXML user community
to share its experience to date by providing live demos
of their VoiceXML technologies and/or sharing practical
feedback on their experiences with VoiceXML.
Some possible topics would include live demos and/or
experience reports in the areas of:
· Writing Portable VoiceXML Applications;
· Speech Application Development;
· VoiceXML Platforms;
· Speech Application Tuning;
· Deployment concerns;
· Grammar development;
· Systems Integration Issues.
Or pick another topic related to VoiceXML and the real world.
Take this opportunity to pass along your successes (and failures!)
and help the industry to evolve.
If you would like to participate in this UGM, by presenting a demo of
your VoiceXML application or related technology or by sharing your VoiceXML
experiences, please submit a short abstract on the topic you would
like to present to
brett.mcdowell@ieee-isto.org by February 21, 2003.


Copyright © 2001-2003 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE
Industry Standards and Technology Organization (IEEE-ISTO).