VoiceXML Review - Feature Articles

Volume 2, Issue 4 - June 2002

XHTML+Voice --- Bringing Spoken Interaction To The WWW

By T. V. Raman, IBM Research

Executive Summary

XHTML+Voice provides a set of technologies that enable WWW developers to add voice interaction to Web content by giving them access to appropriate features of VoiceXML from an XHTML context. XHTML+Voice was submitted by IBM, Opera and Motorola to the W3C in December 2001 to help integrate the visual and voice components of the W3C standards framework in creating multimodal user experiences. As a published document, it provides guidance to WWW developers on how various W3C technologies (XHTML, XML Events and VoiceXML in this case) can be used to bring spoken interaction to traditional WWW content without the need to create a new language for that purpose.

VoiceXML was originally designed for voice-only interaction. It has since been further developed within the W3C Voice Browser Activity to fulfill the role of a dialog markup language. XHTML+Voice re-uses those aspects of VoiceXML that were designed for declarative authoring of rich spoken dialogs to bring multimodal interaction to XHTML content. XHTML+Voice brings spoken and visual interaction together to enable WWW sites to deliver multimodal interaction.

INTRODUCTION

XHTML+Voice brings spoken interaction to standard WWW content by integrating a set of mature WWW technologies such as XHTML and XML Events with XML vocabularies developed as part of the W3C Speech Interface Framework. Documents conforming to the XHTML+Voice profile include voice modules that support speech synthesis, speech dialogs, command and control, and speech grammars. XHTML content is brought to life by attaching voice handlers that respond to specific DOM events, thereby re-using the event model familiar to web developers. Voice interaction features are integrated directly with XHTML and CSS, and can consequently be used directly within XHTML content.

XHTML+Voice is designed for Web clients that support visual and spoken interaction. To this end, the XHTML+Voice specification first re-formulates VoiceXML 2.0 as a collection of modules. These modules, along with Speech Synthesis Markup Language and Speech Recognition Grammar Format, are then integrated with XHTML using XHTML modularization to create the XHTML+Voice profile. Finally, the result is integrated with the XML Events module so that voice handlers can be invoked through a standard DOM2 EventListener interface.

How Does It Work?

WWW content continues to be authored in XHTML, as is done today.
Voice interaction is authored as VoiceXML dialogs.
VoiceXML-based handlers are attached to XHTML content using the DOM event model familiar to WWW developers.

As a result, traditional XHTML WWW content can be speech-enabled to support voice interaction.
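The attachment pattern described above can be illustrated with a minimal sketch. It is written in Python rather than XHTML+Voice markup, and it only models the idea of registering a voice handler as an event listener on an element. The `Element` class, the `make_prompt_handler` helper, the `spoken` list and the prompt text are all hypothetical names introduced for illustration; none of them come from the specification.

```python
from typing import Callable, Dict, List


class Element:
    """Hypothetical stand-in for an XHTML element that can observe events."""

    def __init__(self, element_id: str) -> None:
        self.element_id = element_id
        self._listeners: Dict[str, List[Callable[["Element"], None]]] = {}

    def add_event_listener(self, event_type: str,
                           handler: Callable[["Element"], None]) -> None:
        # Register a handler for one event type, as a DOM2-style listener would be.
        self._listeners.setdefault(event_type, []).append(handler)

    def dispatch_event(self, event_type: str) -> None:
        # Invoke every handler registered for this event type, in order.
        for handler in self._listeners.get(event_type, []):
            handler(self)


# Stands in for a speech synthesizer: each "spoken" prompt is recorded here.
spoken: List[str] = []


def make_prompt_handler(prompt: str) -> Callable[[Element], None]:
    """Build a voice handler (assumed shape) that speaks a fixed prompt."""
    def handler(target: Element) -> None:
        spoken.append(f"{target.element_id}: {prompt}")
    return handler


# The visual content and the voice dialog are defined separately,
# then tied together through the event model.
city_field = Element("city")
city_field.add_event_listener("focus", make_prompt_handler("Which city are you flying to?"))

city_field.dispatch_event("focus")  # the voice handler runs
city_field.dispatch_event("click")  # no handler attached, nothing is spoken
```

In this sketch the element knows nothing about speech. The voice behaviour lives entirely in the handler, which is what lets a dialog written by one author be re-used on content written by another.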

Relevance To Developers

Web developers can add voice interaction to WWW content without having to learn a whole new language.
Simple voice dialogs can be authored by the average WWW developer.
Complex voice dialogs can be authored by designers well-versed in speech interfaces; such complex dialogs can be re-used by XHTML authors whose expertise lies in traditional WWW design.

Usage Scenarios

WWW sites enhanced with voice interaction are likely to offer a significant advantage when the WWW is accessed from wireless hand-held devices with small displays and no keyboard. By serving such devices, WWW developers can make their content accessible to a significantly larger user base.

Here are some sample usage scenarios:

Looking up stock quotes with a hand-held device.
Browsing and replying to email using a hand-held device.
Deploying concierge services, e.g., location-based search, to hand-held devices.
Taking part in Web-based auctions using a hand-held device.

Motivation

XHTML+Voice enables web authors to leverage the power of voice interaction provided by the W3C Speech Interface Framework within standard WWW content. The W3C Speech Interface Framework consists of XML vocabularies for authoring speech interaction, including XML vocabularies for speech synthesis, speech grammars and dialog markup. XHTML+Voice uses the event-driven programming model familiar to XHTML developers. It enhances this model by allowing authors to attach voice handlers to enable speech interaction. XHTML+Voice is designed to keep simple things simple while making complex things possible. Simple voice interaction can be authored by XHTML developers new to voice interaction by following set design patterns. More complex voice interaction can be authored by speech user interface designers; such complex dialogs can be turned into re-usable dialog components that are used by web developers.

Traditional XHTML content is static, i.e., nothing happens until the user performs some action. Such static XHTML content can be speech-enabled by attaching voice handlers that provide spoken prompts and process spoken input, and by arranging for those handlers to be triggered by user actions. Static XHTML can be made dynamic via the Document Object Model (W3C DOM). Traditionally, such dynamism is the result of a user action, such as a mouse-click, triggering an appropriate event, which in turn results in an update of the DOM. XHTML+Voice extends this model by specifying events specific to voice interaction. The web developer can trigger DOM updates based on spoken events to produce web content that reacts dynamically to spoken input.
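As a rough illustration of a spoken event driving a DOM update, the Python sketch below models a form field whose value is filled in when a voice event arrives. It is an analogy under stated assumptions, not XHTML+Voice markup: the `VoiceEvent` and `Field` classes, the `fill_from_speech` handler and the event name `"voice-result"` are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class VoiceEvent:
    """Hypothetical voice-specific event carrying what the recognizer heard."""
    type: str
    utterance: str


@dataclass
class Field:
    """Hypothetical stand-in for an XHTML form field in the DOM."""
    name: str
    value: str = ""
    listeners: Dict[str, List[Callable[["Field", VoiceEvent], None]]] = field(
        default_factory=dict
    )

    def add_event_listener(
        self, event_type: str, handler: Callable[["Field", VoiceEvent], None]
    ) -> None:
        # Register a handler for one event type.
        self.listeners.setdefault(event_type, []).append(handler)

    def dispatch_event(self, event: VoiceEvent) -> None:
        # Deliver the event to every handler registered for its type.
        for handler in self.listeners.get(event.type, []):
            handler(self, event)


def fill_from_speech(target: Field, event: VoiceEvent) -> None:
    """Update the field (the DOM update) from the recognized utterance."""
    target.value = event.utterance.strip().upper()


# A stock-quote lookup field that reacts to spoken input.
symbol = Field("symbol")
symbol.add_event_listener("voice-result", fill_from_speech)

# Simulate the recognizer reporting that the user said "ibm".
symbol.dispatch_event(VoiceEvent(type="voice-result", utterance=" ibm "))
```

The handler here plays the same role a mouse-click handler plays in traditional dynamic content; only the source of the event differs.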


Copyright © 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).