Developing X+V Applications Using the Multimodal Tools
By:
Executive Summary
As computing devices become smaller and more pervasive,
customers expect access to their data anytime, anywhere.
With advances in the function, speed, and size of Personal
Digital Assistants (PDAs) and cellular phones, coupled
with an increasingly diverse set of users, the demands
placed on application developers for flexible user
interfaces have multiplied. Traditional visual interfaces, such
as those provided by HTML pages, are no longer adequate
to meet the public’s rising expectations for convenience,
performance, and usability. End-users of these embedded
devices are no longer satisfied with low-resolution
versions of desktop-based Web applications and cumbersome
methods of data entry. Consumers expect multiple methods,
or modes, for interacting with a device. They want the
ability to use the interaction method that most naturally
fits the situation – to make the interface work
for them, instead of being forced to work with an interface.
Traditionally, to create these “multimodal”
applications, developers had to master the development
of both visual and voice software, resulting in a daunting
learning curve. Many of these applications required
extensive porting work to adapt to new platforms, and
there was no way to leverage existing Web applications.
Drawing on its 40-year commitment to voice technology,
and on emerging software and hardware that make
applications faster and more powerful, IBM has created
a practical solution for application developers seeking
to integrate both voice and visual technologies: the
Multimodal Toolkit and Multimodal Browser.
This document provides an overview of the XHTML+Voice
(X+V) language, an introduction to the development toolkit,
and a general description of how to use the features
in the toolkit to develop a multimodal application.
For specific details on creating and implementing a
multimodal application, refer to the documents that
accompany the Multimodal Tools.
The Multimodal Tools
The Multimodal Tools release builds on the WebSphere®
Studio framework to add the functionality you need to
create, test, and run multimodal applications.
The Multimodal Toolkit V4.1 for WebSphere Studio,
which adds extensions to a WebSphere Studio development
product to provide multimodal functionality, introduces
a user interface that can minimize both the skills and
time needed to develop high-tech applications for PDAs
and other handheld, wireless devices. IBM has used similar
technology for years to facilitate the rapid development
of server-based voice applications.
The Multimodal Toolkit provides an integrated development
environment that lets you integrate visual and voice
applications efficiently without requiring expertise
in all the development languages. The toolkit provides
multiple tools, editors, and views that are operated
using standard menus, icons, toolbars, and basic XHTML
and VoiceXML programming skills.
The toolkit’s Reusable Dialog Components
let you add common form components, such as mailing
address, credit card, and social security number fields,
with only a few button clicks, and each field gives
the user multiple methods of data entry.
The WebSphere Everyplace® Multimodal Browser
V1.0, developed in a strategic relationship
with Opera Software, provides a Web browser in which
you can test voice-enabled Web applications. The browser
is enhanced with extensions that include IBM's automatic
speech recognition and text-to-speech technology, allowing
you to view and interact with multimodal applications
that you have built using XHTML+Voice. When you install
the Multimodal Browser, the icon for the Opera Browser
appears on your desktop, and you can use it to open
the browser and run your multimodal applications.
The Voice Server SDK V3.1.1 contains
the programs that are needed to play and compose pronunciations
in the Multimodal Toolkit.
Multimodal applications consist of visual (XHTML) and
voice (VoiceXML) components.
What is XHTML?
The eXtensible HyperText Markup Language (XHTML) is
an XML-based markup language for creating visual applications
that users can access from their desktops or wireless
devices. XHTML is the successor to HTML 4.01, reformulated
as an XML application.
If you have existing programs with HTML pages, you will
have to make some simple structural changes to comply
with XHTML conventions. XHTML has replaced HTML as the
language supported by the World Wide Web Consortium®
(W3C), so future-proofing your Web pages by using XHTML
will not only help you with multimodal applications,
but will also help ensure that users with all types of
devices can access your pages correctly.
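
As a rough illustration of the kind of structural changes
involved, the following sketch shows a small page written
to XHTML 1.0 Strict conventions. The page content, the
form action ("search"), and the field names are
hypothetical and are not taken from the Multimodal Tools
documentation.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!-- The root element declares the XHTML namespace. -->
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Departure time</title>
  </head>
  <body>
    <form action="search" method="get">
      <!-- Element and attribute names are lowercase,
           and every element is explicitly closed. -->
      <p>
        <label for="time">Departure time:</label>
        <!-- Attribute values are always quoted; empty
             elements end with "/>". -->
        <input type="text" id="time" name="time" />
        <br />
        <!-- Minimized attributes such as "checked" are
             written out in full. -->
        <input type="checkbox" id="pm" name="pm" value="pm"
               checked="checked" />
        <label for="pm">PM</label>
      </p>
    </form>
  </body>
</html>
```

An HTML 4.01 page that relied on uppercase tags, unquoted
attribute values, or unclosed elements would need those
points corrected before an XML parser would accept it.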
For more information, refer to the XHTML 1.0 specification
on the W3C Web site (see the References section at the
end of this paper).
What is VoiceXML?
The Voice eXtensible Markup Language (VoiceXML) is an
XML-based markup language for creating distributed voice
applications, just as HTML is a language for distributed
visual applications. VoiceXML was defined and promoted
by an industry forum, the VoiceXML Forum™, founded
by AT&T®, Lucent®, Motorola®, and IBM,
and supported by approximately 500 member companies.
Updates to VoiceXML are a product of the W3C Voice Browser
Working Group. The language is designed to create audio dialogs
that feature text-to-speech, pre-recorded audio, recognition
of both spoken and DTMF key input, recording of spoken
input, telephony, and mixed-initiative conversations.
Its goal is to provide voice access and interactive
voice response (such as by telephone, PDA, or desktop)
to Web-based content and applications.
Users can interact with these Web-based voice applications
by speaking or by pressing telephone keys rather than
solely through a graphical user interface.
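
To make this concrete, here is a minimal sketch of a
VoiceXML 2.0 dialog that prompts the caller, listens for
one of three words, and reads the answer back. The form
id, field name, and vocabulary are hypothetical, and a
given voice browser may require a different grammar
format than the inline XML grammar assumed here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      xml:lang="en-US">
  <form id="order">
    <!-- A field collects one piece of input from the user. -->
    <field name="drink">
      <!-- Rendered with text-to-speech. -->
      <prompt>Would you like coffee, tea, or milk?</prompt>
      <!-- Inline grammar listing what the user may say. -->
      <grammar mode="voice" version="1.0" root="drink">
        <rule id="drink">
          <one-of>
            <item>coffee</item>
            <item>tea</item>
            <item>milk</item>
          </one-of>
        </rule>
      </grammar>
      <!-- Handlers for silence and unrecognized input. -->
      <noinput>I did not hear anything. <reprompt/></noinput>
      <nomatch>I did not understand. <reprompt/></nomatch>
      <!-- Runs once the field has a recognized value. -->
      <filled>
        <prompt>You chose <value expr="drink"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```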
For more information, refer to the VoiceXML 2.0 specification
on the W3C Web site (see the References section at the
end of this paper).
What is XHTML+Voice?
XHTML+Voice, or X+V for short, is a
markup language for multimodal Web pages. With X+V,
Web developers can create Web pages that let end-users
select voice input and output as well as traditional
visual (GUI) interaction. X+V does this by providing
a simple way to add voice markup to XHTML. Hence the
name "XHTML plus Voice."
X+V fits into the Web environment by taking a normal
visual Web user-interface and speech-enabling each part
of it. That is, if you take a visual interface and break
it up into its basic parts (such as an input field for
a time of day, a check box for AM or PM, and so on),
you can then simply enable the use of voice by adding
voice markup to the visual markup. X+V consists of visual
markup, a collection of snippets of voice markup for
each element in the user interface, and a specification
of which snippets to activate when. For visual markup,
X+V uses the familiar XHTML standard. For voice markup,
it uses a (simplified) subset of VoiceXML. For associating
the snippets of VoiceXML and user-interface elements,
X+V uses the XML Events standard. All three are defined
by the World Wide Web Consortium (W3C).
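
The three parts described above can be sketched in a
single page. This is an illustrative example only: the
element ids, the external grammar file (drink.grxml), and
the form action are hypothetical, the DOCTYPE declaration
is omitted because it varies by X+V version, and the way
the recognized value is copied into the visual field
(through a document object expression) is an assumption
about the browser rather than something stated in this
paper.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Drink order</title>
    <!-- Voice markup: one VoiceXML snippet for one
         element of the visual interface. -->
    <vxml:form id="voice_drink">
      <vxml:field name="drink">
        <vxml:prompt>Would you like coffee, tea, or milk?</vxml:prompt>
        <vxml:grammar src="drink.grxml"
                      type="application/srgs+xml"/>
        <vxml:filled>
          <!-- Assumed mechanism: copy the spoken result
               into the visual input field. -->
          <vxml:assign name="document.getElementById('drink').value"
                       expr="drink"/>
        </vxml:filled>
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <!-- Visual markup: ordinary XHTML. -->
    <form action="order" method="get">
      <p>
        <label for="drink">Drink:</label>
        <!-- XML Events: when this field receives focus,
             activate the voice snippet above. -->
        <input type="text" id="drink" name="drink"
               ev:event="focus" ev:handler="#voice_drink"/>
      </p>
    </form>
  </body>
</html>
```

The user can still type into the field as usual. The voice
snippet becomes active only when the event named in the
ev:event attribute occurs on that element.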
Motorola, Opera Software ASA, and IBM submitted the
X+V specification to the W3C, which passed it to
its multimodal working group in January of 2002. To
locate the XHTML+Voice Profile 1.0 specification,
see the References section at the end of this paper.
Note: The specific details of creating
and implementing a multimodal application are beyond
the scope of this white paper. See the companion article,
also published in this issue of the VoiceXML Review,
for a discussion of how to write XHTML+Voice markup.

Copyright © 2001-2003 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry
Standards and Technology Organization (IEEE-ISTO).