The IETF Speech Services Control Working Group
Introduction
Speech recognition technology has become an essential building
block for a new wave of next-generation enhanced services.
Speech resources such as automated speech recognition
(ASR), text-to-speech (TTS), and speaker verification
(SV) are becoming key features in a range of new services
that help businesses manage their workforce and customer
base more efficiently and enable consumers to communicate
in compelling new ways. We are just now seeing interesting
applications where you speak to an application and it
responds to you, such as automated stock trading, airline
reservations, and e-mail by phone. Speech resources
make these interesting and useful applications possible.
We are only at the beginning of the speech-enabled application
era. Right now, most of these applications are experiments,
trials, and limited deployments. There are a number
of challenges still facing speech resource providers,
application developers, and platform manufacturers.
The IESG recently chartered speechsc, the Speech
Services Control Work Group of the IETF, to develop a
more effective protocol for speech recognition technology
in next-generation networks. This article will briefly
discuss what speechsc is, what the expected benefits
of the protocol will be, the role of the work group,
and the speech services vision.
Protocol Benefits and Vision
Manufacturers
of media processing devices would like to have a uniform
way of accessing speech-processing resources. Having
a uniform protocol allows manufacturers to easily integrate
speech resources into applications. This will reduce
the cost and anxiety of integrating speech resources,
which will result in the development of compelling new
speech-enabled applications. Speech resource vendors
will benefit from the opening of a much larger market
for their engines.
There are a number of proprietary ASR, TTS, and SV APIs,
as well as two IETF drafts that address the control
of speech resources. However, proprietary APIs
do not address the interoperability goal. Moreover,
the experience of people implementing the IETF drafts
has shown a number of shortcomings.
One exciting and challenging area of work is to ensure the
protocol will support wireless networks. There is a lot
of research going on to extend interesting applications
to wireless handsets, as in 3GPP, or to wireless PDAs.
Some have even talked about service providers offering
speech resources over the open Internet.
Speechsc Background
Last year, Messrs. Shanmugham, Monaco, and Eberman
published an Internet Draft entitled MRCP: Media
Resource Control Protocol. This document described
a protocol that enables a client to control speech recognition
engines and text-to-speech resources. The protocol was
principally the result of work done by Cisco, Nuance,
and SpeechWorks, with input from others.
The target implementation of MRCP was for media servers,
media-rich media gateways, and VoiceXML interpreters
to be able to control external speech resources. Note
that there is nothing in the protocol that limits
it to this configuration. However, it is where the protocol
gained the most traction.
A number of vendors implemented the protocol with quite
mixed results. While the protocol basically worked,
implementers experienced a number of problems in trials
and deployments. As these technical issues persisted,
like-minded vendors began searching for solutions. The
movement to develop a better version of MRCP was seeded
at IETF 52 in December 2001 as some interested people
got together to discuss experience with MRCP. At
IETF 53 in March 2002, the group held a formal Birds-of-a-Feather
(BOF) meeting to gauge interest within the IETF
in forming a work group to develop a more effective
version of MRCP.
There was an overwhelming consensus to form a Work Group composed
of leading protocol and speech technology experts to
address this issue. Following the BOF at IETF 53, the
IESG chartered the speechsc Work Group. The work
group held its first meeting in Yokohama at IETF 54
in July 2002. Dave Oran from Cisco and I serve as co-chairs,
and Scott Bradner of Harvard is the Work Group's
Area Director.
The speechsc Work Group
The speechsc Work Group will develop protocols to
support distributed media processing of audio streams.
The focus of the working group is to develop protocols
to support ASR, TTS, and SV. The working group will
only focus on the secure distributed control of these
servers. The reason for this limit to the scope of the
work group is simple. There is quite a body of work
in SIP for controlling media resources, such as prompting,
digit collection, script initiation, transcoding, conferencing,
and so on. Replicating this work in a new protocol is
not of interest to the IETF and is confusing to the
market.
Currently, the group is working on the formal requirements
for a distributed speech resource control protocol and
an analysis of existing protocols. With the results
of that work, we will develop either changes to existing
protocols or new protocols, as appropriate.
How does the speechsc group conduct business?
The group meets at IETF meetings, which occur three
times a year. You can find a list of upcoming meetings
at
http://www.ietf.org/meetings/meetings.html. All meetings
are open to interested parties. Quite a lot of work
gets done at the meetings. However, all formal discussion
takes place on mailing lists. The work group also may have
interim meetings. By IETF rules, the group will announce
interim meetings well in advance to both the speechsc
and general IETF announcement mailing lists. The charter
page of the work group,
http://www.ietf.org/html.charters/speechsc-charter.html,
has information on joining the list.
The work of the group is complementary to work going on
in other standards bodies. We are coordinating with
ETSI Aurora, ITU-T Study Group 16 Question 15, the W3C
Multi-Modal Interaction Work Group, and other groups,
as appropriate.
Conclusion
The speechsc Work Group of the IETF is taking on the
interesting work of enabling media servers, VoiceXML
interpreters, arbitrary speech applications, and possibly
even wireless handsets to access distributed speech
resources. This will enable new and useful applications
that are speech-driven and integrate multiple media
types.
The work group will
improve upon the existing protocols and produce a robust, extensible
protocol that meets the needs of ASR, TTS, and SV today and into the
future.
We welcome your interest and participation.


Copyright
© 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE
Industry Standards and Technology Organization
(IEEE-ISTO).