XHTML+Voice --- Bringing Spoken Interaction To The WWW
Executive Summary
XHTML+Voice provides a set of technologies that enable WWW developers
to add voice interaction to Web content by providing access to
appropriate features of Voice XML from an XHTML context. XHTML+Voice
was submitted by IBM, Opera and Motorola
to the W3C in December 2001 to help integrate the visual and voice
components of the W3C standards framework in creating multimodal
user experiences. As a published document, it provides guidance
to WWW developers on how various W3C technologies (XHTML, XML
Events and Voice XML in this case) can be used to bring spoken
interaction to traditional WWW
content without the need to create a new language for
that purpose.
Voice XML was originally designed for voice-only interaction.
It has since been further developed within the W3C voice browser
activity to fulfill the role of a dialog markup
language. XHTML+Voice re-uses those aspects of Voice
XML that were designed for declarative authoring of rich
spoken dialogs to bring multimodal interaction to XHTML content.
XHTML+Voice brings spoken and visual interaction
together so that WWW sites can deliver multimodal interaction.
Introduction
XHTML+Voice brings spoken interaction to standard WWW content by
integrating a set of mature WWW technologies such as XHTML
and XML Events with XML vocabularies developed as part of the
W3C Speech Interface Framework. Documents conforming to the
XHTML+Voice profile include voice modules that support speech
synthesis, speech dialogs, command and control, and speech grammars.
XHTML content is brought to life by attaching voice handlers that
respond to specific DOM events, thereby re-using the event model
familiar to web developers. Voice interaction features are integrated
directly with XHTML and CSS, and can consequently be used directly
within XHTML content.
XHTML+Voice is designed for Web clients that support visual and spoken
interaction. To this end, the XHTML+Voice specification first
re-formulates VoiceXML 2.0 as a collection of modules. These modules,
along with Speech Synthesis Markup Language and Speech Recognition
Grammar Format, are then integrated with XHTML using XHTML
modularization to create the XHTML+Voice profile. Finally, the
result is integrated with the XML Events module so that voice
handlers can be invoked through a standard DOM2 EventListener
interface.
How Does It Work?
- WWW content continues to be authored in XHTML as is done
today.
- Voice interaction is authored as Voice XML dialogs.
- Voice XML based handlers are attached to XHTML content
using the DOM event model familiar to WWW developers.
As a result, traditional XHTML WWW content can be speech-enabled
to support voice interaction.
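The attachment step in the list above can be pictured with a small toy model. The sketch below is not XHTML+Voice markup and does not use any real browser or speech API; the names `Element`, `add_event_listener`, `dispatch` and `make_voice_handler` are assumptions made up for illustration. It only shows the shape of the idea: a voice handler is registered against an event type on an element, and it runs when that event is dispatched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Element:
    """Toy stand-in for an XHTML element that can receive DOM-style events."""
    element_id: str
    listeners: Dict[str, List[Callable]] = field(default_factory=dict)

    def add_event_listener(self, event_type: str, handler: Callable) -> None:
        # Register a handler for one event type (hypothetical API, for illustration).
        self.listeners.setdefault(event_type, []).append(handler)

    def dispatch(self, event_type: str) -> List[str]:
        # Run every handler registered for this event type and collect the results.
        return [handler(self) for handler in self.listeners.get(event_type, [])]


def make_voice_handler(prompt: str) -> Callable:
    """Build a handler that stands in for a voice dialog by returning its prompt."""
    def handler(target: Element) -> str:
        return f"[voice] {prompt} (on #{target.element_id})"
    return handler


# Attach a voice handler to the "focus" event of a hypothetical form field.
city_field = Element("city")
city_field.add_event_listener("focus", make_voice_handler("Which city?"))

spoken = city_field.dispatch("focus")
print(spoken[0])
```

In this model the XHTML side (the element) and the voice side (the handler) are authored separately and joined only by the event registration, which mirrors the division of labour the list describes.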
Relevance To Developers
- Web developers can add voice interaction to WWW content
without having to learn a whole new language.
- Simple voice dialogs can be authored by the average
WWW developer.
- Complex voice dialogs can be authored by designers
well-versed in speech interfaces; such complex dialogs
can be re-used by XHTML authors whose expertise lies
in traditional WWW design.
Usage Scenarios
WWW sites enhanced with voice interaction are likely to be of
significant advantage when accessing the WWW from wireless
hand-held devices with small displays and no keyboard. This
technology enables WWW developers to make their content accessible
to a significantly larger user base that includes users of such
devices.
Here are some sample usage scenarios:
- Looking up stock quotes with a hand-held device.
- Browsing and replying to email using a hand-held device.
- Deploying concierge services, e.g., location-based
search, to hand-held devices.
- Taking part in Web-based auctions using a hand-held device.
Motivation
XHTML+Voice enables web authors to leverage the power of voice
interaction enabled by the W3C Speech Interface
Framework within standard WWW content. The W3C Speech Interface
Framework consists of XML vocabularies for authoring speech interaction,
including XML vocabularies for speech synthesis, speech grammars and
dialog markup. XHTML+Voice uses the event-driven programming model
familiar to XHTML developers. It enhances this model by allowing
authors to attach voice handlers to enable speech interaction.
XHTML+Voice is designed to keep simple things simple while
making complex things possible. Simple voice interaction can be
authored by XHTML developers new to voice interaction by
following set design patterns. More complex voice interaction can be
authored by speech user interface designers; such complex dialogs
can be turned into re-usable dialog components that are used by web
developers.
Traditional XHTML content is static, i.e., nothing happens until the
user performs some action. Such static XHTML content
can be speech-enabled by attaching voice handlers that provide
spoken prompts and process spoken input, and by arranging for those
handlers to be triggered by user actions. Static XHTML can be
made dynamic via the Document Object Model (W3C DOM). Traditionally,
such dynamism is the result of a user action, such as a mouse-click,
triggering an appropriate event that in turn results in an update
of the DOM. XHTML+Voice extends this model by specifying
voice-interaction specific events. The web developer can trigger
DOM updates based on spoken events to produce web content that
reacts dynamically to spoken input.
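As a rough illustration of a spoken event driving a document update, the toy model below treats recognized speech as just another event whose listener changes document state. Everything here is an assumption for the sake of the sketch: the `Document` class, its `on` and `fire` methods, and the event name `voice-result` are invented and are not events or interfaces defined by XHTML+Voice or the W3C DOM.

```python
from typing import Callable, Dict, List


class Document:
    """Toy document model: named fields plus per-event listener lists."""

    def __init__(self) -> None:
        self.fields: Dict[str, str] = {"city": ""}
        self._listeners: Dict[str, List[Callable[[str], None]]] = {}

    def on(self, event_type: str, listener: Callable[[str], None]) -> None:
        # Register a listener for an event type (hypothetical API).
        self._listeners.setdefault(event_type, []).append(listener)

    def fire(self, event_type: str, payload: str) -> None:
        # Deliver the payload to every listener registered for this event type.
        for listener in self._listeners.get(event_type, []):
            listener(payload)


doc = Document()


def fill_city(recognized: str) -> None:
    # Update the document from the recognized utterance.
    doc.fields["city"] = recognized.strip().title()


# "voice-result" is a made-up event name, not one defined by XHTML+Voice.
doc.on("voice-result", fill_city)

# Simulate the recognizer reporting what the user said.
doc.fire("voice-result", "  san francisco ")
print(doc.fields["city"])
```

The point of the sketch is that the update path is the same one a mouse-click would use: an event fires, a listener runs, and the document changes.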


Copyright © 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and
Technology Organization (IEEE-ISTO).