VoiceXML Review - Columns

Volume 3, Issue 1 - January/February 2003

VoiceXML Events

By Rob Marchand

Welcome to "First Words" - the VoiceXML Review's column to teach you about VoiceXML and how you can use it. We hope you enjoy the lesson.

Handling Complex Recognition Results

One of the changes in the April release of the VoiceXML 2.0 working draft was the formalization of how recognition results would be made available at the VoiceXML level. Last month, we had a look at how this comes into play with fairly simple VoiceXML pages. This month we're going to start discussing more advanced assignments, which arise when making use of form-level grammars and more complex user interactions.

You may recall that last issue we discussed the use of 'slot-filling' to assign values to recognition results, rather than just returning the raw utterance value. This is useful for providing consistent recognition results for a range of phrases.

We've written pages using simple recognition results in the past. Here is one of last issue's examples:

<field name="command">
     <grammar xml:lang="en-US" version="1.0" root="Help">
          <rule id="Help" scope="public">
               <one-of>
                    <item> help </item>
                    <item> save me </item>
                    <item> succour </item>
               </one-of>
               <!-- Same value returned whichever phrase was spoken -->
               <tag> returnvalue="help" </tag>
          </rule>
     </grammar>
</field>

In this example, the ECMAScript variable 'returnvalue' receives the value 'help' in all three cases, regardless of which of the three legal user utterances is recognized. This keeps application processing simpler, and decouples the contents of the grammar from the application itself. The benefit shows up as grammars are extended or tuned: changes can be made to the grammar without affecting other parts of the application.
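To make the decoupling concrete, here is a minimal ECMAScript sketch of what the grammar does on the application's behalf. It is only an illustration: the function name interpretHelp and the phrase table are our own assumptions, and a real recognizer does this work inside the grammar, not in page script.

```javascript
// Hypothetical stand-in for the Help grammar: every phrase it
// accepts is mapped to the same semantic value.
const HELP_PHRASES = ["help", "save me", "succour"];

function interpretHelp(utterance) {
  const normalized = utterance.trim().toLowerCase();
  if (HELP_PHRASES.includes(normalized)) {
    // Mirrors the grammar's single slot assignment.
    return { returnvalue: "help" };
  }
  return null; // the utterance is not covered by the grammar
}

// The application only ever tests for "help", never the raw phrase.
const spoken = interpretHelp("Save me");
console.log(spoken.returnvalue);
```

Adding a fourth phrase means touching only the phrase table (in practice, the grammar); the code that checks for 'help' stays as it is.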

Although we are using a simple assignment here, the contents of the <tag> element can be quite complex, allowing powerful manipulation prior to returning a value to the application. This processing is known as Semantic Interpretation, and is important in advanced applications. Unfortunately, the language for Semantic Interpretation has not yet been standardized - this work is currently being performed by the W3C Voice Browser Working Group. The current working draft is available at:

http://www.w3.org/TR/semantic-interpretation/

although it is somewhat dated.

As we described last issue, the interpretation is returned to the VoiceXML interpreter from the recognizer as an ECMAScript object. What were traditionally called 'slots' can now be thought of as the properties of this ECMAScript object. This object can be complex, including component objects. Recall that the complete object is provided in the application-scoped VoiceXML variable application.lastresult$.interpretation.

An example object result from the VoiceXML 2.0 specification is shown below:

{
     drink: "coke"
     pizza: {
          number: "3"
          size: "large"
          topping: [
               "pepperoni"
               "mushrooms"
          ]
     }
}

In this case, we have two top-level properties: 'drink' and 'pizza', where 'drink' is a simple property, while 'pizza' is a compound property consisting of an object with its own properties ('number', 'size' and 'topping'). Such a result might be produced by an utterance such as "I would like a Coca-Cola and three large pizzas with pepperoni and mushrooms," as opposed to requiring the user to be prompted for each part of the order and its characteristics. Both styles of input (mixed initiative and directed dialog respectively) can be supported within VoiceXML with a combination of form-level and field-level grammars.
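As a rough sketch of how a page might read such a result, the ECMAScript below walks the same structure. The application object here is a local stand-in we build by hand so the snippet runs on its own; in a real page the interpreter supplies application.lastresult$.interpretation, and the summary string is purely our own invention.

```javascript
// Stand-in for the interpreter-provided variable (an assumption for
// this sketch; a VoiceXML platform populates this for you).
const application = {
  lastresult$: {
    interpretation: {
      drink: "coke",
      pizza: {
        number: "3",
        size: "large",
        topping: ["pepperoni", "mushrooms"]
      }
    }
  }
};

const interp = application.lastresult$.interpretation;

// 'drink' is a simple property; 'pizza' is an object with its own properties.
const drink = interp.drink;
const pizza = interp.pizza;

// 'topping' is an array, so it can be joined for a confirmation prompt.
const summary = pizza.number + " " + pizza.size + " pizzas with " +
  pizza.topping.join(" and ") + ", plus a " + drink;

console.log(summary);
```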

In the mixed-initiative case, a form-level grammar would be used to allow 'pizza power users' to describe their complete order in a single breath, while the field-level grammars would drive the input for each component of the order for the directed dialog caller. Next month, we'll write this application, but for now, we're just going to talk about how results are mapped when a form-level grammar is triggered by a user utterance.

The rules for taking the results from an object such as shown above, and assigning values to the fields within the form containing the form grammar are fairly straightforward. Again, from the VoiceXML 2.0 working draft:

The "slot" attribute of a <field> is a (very restricted) ECMAScript expression that selects some portion of the result to be assigned to the field. In addition to selecting the top-level result property, the attribute can select properties at arbitrary levels of nesting, using a dot-separated list of element/property names, as in "pizza.number" and "order.pizza.topping". Note that it is possible for a specific slot value to fill more than one field, if the slot names of the fields are the same.

We've only talked about 'slots' as top-level properties in the past, but the slot attribute can be used to select components of the results as well. This may make it simpler to reuse more complex grammars in both the mixed-initiative and directed dialog cases.
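The dot-separated selection can be pictured with a few lines of ECMAScript. This is a sketch of the idea only, not how any particular interpreter implements it; the function name selectSlot is hypothetical, and the order object is a hand-built copy of the pizza example.

```javascript
// Walk a result object along a dot-separated slot name.
// Returns undefined if any step of the path is missing.
function selectSlot(result, slot) {
  let current = result;
  for (const name of slot.split(".")) {
    if (current === null || typeof current !== "object" || !(name in current)) {
      return undefined;
    }
    current = current[name];
  }
  return current;
}

const order = {
  drink: "coke",
  pizza: { number: "3", size: "large", topping: ["pepperoni", "mushrooms"] }
};

console.log(selectSlot(order, "drink"));          // top-level property
console.log(selectSlot(order, "pizza.number"));   // nested property
console.log(selectSlot(order, "pizza.crust"));    // not present in this result
```

Note that "order.pizza.topping" would select nothing from this particular object, because its top-level properties are 'drink' and 'pizza'; that slot name assumes a result with an enclosing 'order' property.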

If the portion of the result named by the "slot" (or "name") attribute of a <field> doesn't exist in a given result then the field item's value is unchanged.

Recall that after assigning everything possible, the form interpretation algorithm (FIA) will drive the collection of information for any fields that remain unfilled. So, for example, if the caller could have also selected a credit card for payment, and it wasn't supplied in the initial utterance, the credit card field in the form would be visited after filling other fields with the results from processing the user utterance.

The default value for the "slot" attribute is supplied by the value of "name" attribute.

Don't forget, though, that if the grammar returns a slot and no field matches it by name (either through the slot attribute or through the default of the field name), you will actually end up with an ECMAScript object assigned to your field variable, as opposed to the value of the object.
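This pitfall is easy to see in a small ECMAScript sketch. The function below is our own approximation of the field-level assignment rule as we read it in the VoiceXML 2.0 draft (a matching property is selected, otherwise the whole result is assigned); the name assignFieldResult and the plain-object field descriptions are assumptions made for illustration.

```javascript
// Approximate the value a field receives from its own grammar's result.
function assignFieldResult(field, interpretation) {
  // The slot name defaults to the field name.
  const slot = field.slot !== undefined ? field.slot : field.name;
  if (interpretation !== null && typeof interpretation === "object") {
    // Matching property: take its value. No match: the whole object.
    return slot in interpretation ? interpretation[slot] : interpretation;
  }
  return interpretation; // a simple (non-object) result is assigned as-is
}

const helpResult = { returnvalue: "help" };

// Field named 'command' with no slot attribute: nothing matches 'command'.
const unmatched = assignFieldResult({ name: "command" }, helpResult);

// Same field with slot="returnvalue": the property value is selected.
const matched = assignFieldResult({ name: "command", slot: "returnvalue" }, helpResult);

console.log(typeof unmatched); // an object, not the string you wanted
console.log(matched);
```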

Using these rules, the FIA will populate as many fields as possible using the results from the form-level recognition, and then perform directed dialog processing to collect the rest.
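Putting the rules together, the following ECMAScript sketch mimics the assignment step for a form-level result and then reports which fields a directed dialog would still need to collect. It is a simplification under our own assumptions: fields are modeled as plain objects, the names lookup and fillFields are hypothetical, and the real FIA does considerably more (prompting, event handling, filled actions).

```javascript
// Resolve a dot-separated slot name; undefined if any step is missing.
function lookup(result, slot) {
  return slot.split(".").reduce(
    (node, key) => (node !== null && typeof node === "object" ? node[key] : undefined),
    result
  );
}

// Fill every field whose slot exists in the result, leave the others
// unchanged, and return the names of the fields still unfilled.
function fillFields(fields, result) {
  for (const field of fields) {
    const slot = field.slot !== undefined ? field.slot : field.name; // default slot is the name
    const value = lookup(result, slot);
    if (value !== undefined) {
      field.value = value;
    }
  }
  return fields.filter((f) => f.value === undefined).map((f) => f.name);
}

const formFields = [
  { name: "drink" },
  { name: "quantity", slot: "pizza.number" },
  { name: "size", slot: "pizza.size" },
  { name: "creditcard" }
];

const stillNeeded = fillFields(formFields, {
  drink: "coke",
  pizza: { number: "3", size: "large", topping: ["pepperoni", "mushrooms"] }
});

console.log(stillNeeded); // only the credit card field remains to be collected
```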

Summary

We're starting to look at how form-level grammars can be used to populate multiple fields with a single utterance, and how that information is assigned according to VoiceXML 2.0. Understanding how both field-level and form-level grammars are processed is important both for writing flexible applications and for writing reusable grammars. More on this next issue!

VoiceXML Users Group Call for Participation

The VoiceXML Forum is beginning to prepare for the Spring Users Group Meeting, to be held in conjunction with the AVIOS Speech Developers Conference and Expo, from March 31st to April 3rd, 2003, at the Fairmont Hotel in San Jose, California. The VoiceXML Users Group Meeting will be held on April 3rd.

In the past, the VoiceXML Users Group has provided tutorials, technology overviews, and other such features to allow technology leaders to become familiar with speech technologies. VoiceXML is clearly now in the mainstream of speech application development. So for the Spring Meeting, the VoiceXML Forum is looking to the VoiceXML user community to share its experience to date by providing live demos of its VoiceXML technologies and/or sharing practical feedback on its experiences with VoiceXML.

Some possible topics would include live demos and/or experience reports in the areas of:

· Writing Portable VoiceXML Applications;
· Speech Application Development;
· VoiceXML Platforms;
· Speech Application Tuning;
· Deployment concerns;
· Grammar development;
· Systems Integration Issues.

Or pick another topic related to VoiceXML and the real world. Take this opportunity to pass along your successes (and failures!) and help the industry to evolve.

If you would like to participate in this UGM, by presenting a demo of your VoiceXML application or related technology or by sharing your VoiceXML experiences, please submit a short abstract on the topic you would like to present to by February 21, 2003.
