VoiceXML Review - Feature Articles

Volume 2, Issue 5 - July/August 2002

The IETF Speech Services Control Working Group

By Eric Burger

Introduction

Speech recognition technology has become an essential building block for a new wave of next-generation enhanced services. Speech resources such as automated speech recognition (ASR), text-to-speech (TTS), and speaker verification (SV) are becoming key features in a range of new services that help businesses manage their workforce and customer base more efficiently and enable consumers to communicate in compelling new ways. We are just now seeing interesting applications where you speak to an application and it responds to you, such as automated stock trading, airline reservations, and e-mail by phone. Speech resources make these interesting and useful applications possible.

We are only at the beginning of the speech-enabled application era. Right now, most of these applications are experiments, trials, and limited deployments. There are a number of challenges still facing speech resource providers, application developers, and platform manufacturers.

The IESG recently chartered speechsc, the Speech Services Control Work Group of the IETF, to develop a more effective protocol for controlling speech resources in next-generation networks. This article briefly discusses what speechsc is, the expected benefits of the protocol, the role of the work group, and the speech services vision.

Protocol Benefits and Vision

Manufacturers of media processing devices would like to have a uniform way of accessing speech-processing resources. A uniform protocol allows manufacturers to integrate speech resources into applications easily. This will reduce the cost and anxiety of integrating speech resources, which in turn will encourage the development of compelling new speech-enabled applications. Speech resource vendors will benefit from the opening of a much larger market for their engines.

There are a number of proprietary ASR, TTS, and SV APIs, as well as two IETF drafts that address the control of speech resources. However, proprietary APIs do not address the interoperability goal. Moreover, the experience of people implementing the IETF drafts has shown a number of shortcomings.

One exciting and challenging area of work is to ensure the protocol will support wireless networks. There is a lot of research going on to extend interesting applications to wireless handsets, as in 3GPP, or to wireless PDAs. Some have even talked about service providers offering speech resources over the open Internet.

Speechsc Background

Last year, Messrs. Shanmugham, Monaco, and Eberman published an Internet Draft entitled MRCP: Media Resource Control Protocol. This document described a protocol that enables a client to control speech recognition engines and text-to-speech resources. The protocol was principally the result of work done by Cisco, Nuance, and SpeechWorks, with input from others.

The target implementation of MRCP was for media servers, media-rich media gateways, and VoiceXML interpreters to be able to control external speech resources. Note that there is nothing in the protocol that limits it to this configuration. However, it is where the protocol gained the most traction.

A number of vendors implemented the protocol with quite mixed results. While the protocol basically worked, implementers experienced a number of problems in trials and deployments. As these technical issues persisted, like-minded vendors began searching for solutions. The movement to develop a better version of MRCP was seeded at IETF 52 in December 2001, when some interested people got together to discuss their experience with MRCP. At IETF 53 in March 2002, the group held a formal Birds-of-a-Feather (BOF) meeting to gauge interest within the IETF in forming a work group to develop a more effective version of MRCP.

There was an overwhelming consensus to form a Work Group composed of leading protocol and speech technology experts to address this issue. Following the BOF at IETF 53, the IESG chartered the speechsc Work Group. The work group held its first meeting in Yokohama at IETF 54 in July 2002. Dave Oran from Cisco and I serve as co-chairs, and Scott Bradner of Harvard is the Work Group's Area Director.

The speechsc Work Group

The speechsc Work Group will develop protocols to support distributed media processing of audio streams. The focus of the working group is to develop protocols to support ASR, TTS, and SV. The working group will only focus on the secure distributed control of these servers. The reason for this limit to the scope of the work group is simple. There is quite a body of work in SIP for controlling media resources, such as prompting, digit collection, script initiation, transcoding, conferencing, and so on. Replicating this work in a new protocol is not of interest to the IETF and is confusing to the market.

Currently, the group is working on the formal requirements for a distributed speech resource control protocol and an analysis of existing protocols. With the results of that work, we will develop either changes to existing protocols or new protocols, as appropriate.

How does the speechsc group conduct business? The group meets at IETF meetings, which occur three times a year. You can find a list of upcoming meetings at http://www.ietf.org/meetings/meetings.html. All meetings are open to interested parties. Quite a lot of work gets done at the meetings. However, all formal discussion takes place on mail lists. The work group also may have interim meetings. By IETF rules, the group will announce interim meetings well in advance to both the speechsc and general IETF announcement mail lists. The charter page of the work group, http://www.ietf.org/html.charters/speechsc-charter.html, has information on joining the list.

The work of the group is complementary to work going on in other standards bodies. We are coordinating with ETSI Aurora, ITU-T Study Group 16 Question 15, the W3C Multi-Modal Interaction Work Group, and other groups, as appropriate.

Conclusion

The speechsc Work Group of the IETF is taking on the interesting work of enabling media servers, VoiceXML interpreters, arbitrary speech applications, and possibly even wireless handsets to access distributed speech resources. This will enable new and useful applications that are speech-driven and integrate multiple media types.

The work group will improve upon the existing protocols and produce a robust, extensible protocol that meets the needs of ASR, TTS, and SV today and into the future.

We welcome your interest and participation.

Copyright © 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).