Update on SSML
The
Speech Synthesis Markup Language (SSML) [1], as its
name implies, provides a standardized annotation for
instructing speech synthesizers on how to convert written
language input into spoken language output. This language
has been under development within the
Voice Browser Working Group (VBWG) of the
World Wide Web Consortium
(W3C) for a few years. This article provides a brief
update on the status and future of SSML. For background
on SSML and an introduction to its features, see [2].
Status
In
the previous VoiceXML Review article on SSML [2], the
authors indicated that SSML was "nearly completed".
Although the January 2001 version of the specification
was issued as a Last Call Working Draft (WD) [3], it
was found to have a number of contentious items. In
April of this year, the Voice Browser Working Group
issued another Working Draft [1] (not a Last Call this
time) with some minor content changes. The group is
now working towards publication of a new Last Call WD.
Changes in the most recent Working Draft
The
April 2002 draft has a fairly small number of changes
from the January 2001 draft. It was released primarily
to provide XML Schema support for use in VoiceXML [4]
and to bring the definition of valid SSML documents
in line with that in the other Voice Browser Working
Group specifications.
Schema
Programmers
in the world of XML are probably familiar with Document
Type Definitions (DTDs) [5], documents that roughly
define syntax constraints for XML documents. Although
DTDs can provide some help with XML document validation,
they are notoriously weak at representing complicated
mixed content models (both elements and text permitted
as content) and cross-element constraints. The XML Schema
language is a more powerful language for representing
such constraints, allowing for more of the syntactic
constraints of the language to be caught by validating
parsers.
The
W3C has now moved from encouraging the use of XML Schema
to the stronger position of explicitly discouraging
the use of DTDs. While the creation of a schema when
you already have a DTD is fairly straightforward, the
fact that SSML is expected to be embedded in other markup
languages (of which VoiceXML is the first example) brought
additional requirements to the table:
- the need to be able to incorporate SSML elements into the host language namespace
- the need to modify the SSML elements to add host language-specific attributes and functionality
In
the SSML specification the DTD is now informational
only, while the schema provides the normative definition
of the syntax of the language.
Document structure
All of the Voice Browser Working Group specifications
have undergone revision in how they define valid documents.
The most recent Working Draft of SSML brought the definition
of valid SSML documents in line with the definitions
of valid VoiceXML and SRGS [6] documents. The most obvious
change is the addition of the "SSML documents"
section (now section 3). This section describes the
headers required and permitted for valid SSML documents.
Some key things to note here are that
- the XML declaration is required
- a DOCTYPE declaration is optional, but if present should reference the public and system identifiers given in section 3.1
- the <speak> root element is required, and the xmlns attribute must specify the SSML namespace as given in section 3.1
There
is also now a version attribute on <speak>.
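As a rough illustration of the header rules listed above, the following Python sketch checks a document for the XML declaration, the <speak> root element, the namespace, and the version attribute. The namespace URI and the version value "1.0" are assumptions taken from later published versions of SSML rather than from this article, and treating a missing version attribute as a problem is likewise an assumption; the function name check_ssml_document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI as published in later versions of the SSML spec.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

GOOD = (
    '<?xml version="1.0"?>\n'
    f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">\n'
    '  Hello, world.\n'
    '</speak>'
)


def check_ssml_document(text):
    """Return a list of header problems found in an SSML document string."""
    problems = []
    # The XML declaration must be the very first thing in the document.
    if not text.startswith("<?xml"):
        problems.append("missing XML declaration")

    root = ET.fromstring(text)
    # ElementTree reports namespaced tags as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, local_name = root.tag[1:].split("}", 1)
    else:
        namespace, local_name = None, root.tag

    if local_name != "speak":
        problems.append("root element is not <speak>")
    if namespace != SSML_NS:
        problems.append("xmlns does not specify the SSML namespace")
    # Assumption: the version attribute is treated as expected here.
    if root.get("version") is None:
        problems.append("missing version attribute on <speak>")
    return problems


if __name__ == "__main__":
    print(check_ssml_document(GOOD))
    print(check_ssml_document("<speak>Hello</speak>"))
```

A check like this only covers the document headers; it is no substitute for validating against the normative schema.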
The
Conformance section (now section 4) has been cleaned
up a bit as well, again following the example of the
VoiceXML and SRGS specifications. In particular, it
more clearly distinguishes SSML fragments from
SSML documents and specifies the requirements
for each.
Other miscellaneous changes
In
an attempt to move non-normative sections out of the
main body, the Future Study, Examples, and DTD sections
have been moved into appendices. There are also new
appendices on:
- Audio file formats -- This appendix lists the audio formats that an SSML processor must be able to read and play out.
- Internationalization -- It is important that the syntax of SSML be able to indicate the input language (the language in which the text content is written) and the output language/dialect/speaker. Of course, SSML is not a universal translator and will not arbitrarily convert text written in one human language into the spoken form of some other human language. Also, there is no guarantee that any particular written (input) or spoken (output) language will be supported by a given SSML processor. Nevertheless, it would be convenient to allow the output language for items like dates (say, in a year-month-day format) to be changed with no more work than updating a flag indicating the output language or speaker.
The
availability and operating characteristics of features
such as pitch, volume, rate, etc. depend heavily on
the synthesis technology used in any given implementation.
Thus, requests for changes in these values via SSML
elements will have varied effects across platforms.
While some amount of testing can be done to increase
interoperability at a gross level, different engines
will never produce the same synthesized output given
the same input. The most recent Working Draft includes
a new section, section 1.4, whose purpose is to forewarn
a potential user of SSML about this issue.
The
references section has also been cleaned up a bit and
converted into the format used by the other Voice Browser
WG specifications.
What to expect in the future
Any
changes for the next draft are likely to fall into two
categories: clarifications of ambiguous or confusing
features and text, and the addition of features requested
or encouraged by other groups in the W3C. Two portions
of the specification that were vague in the last Working
Draft are the use of the xml:lang attribute and
the <say-as> element.
Clarification and refinement
xml:lang
In the XML namespace there is an attribute lang [7].
The valid values for this attribute are human language
identifiers defined by IETF RFC3066 [8]. In most specifications,
this attribute is used to indicate the language in which
the enclosed text is written. While it has that meaning
in SSML as well, in both the <voice> and <say-as>
elements there is a question as to whether the xml:lang
attribute provides some indication of what the output
language/voice should be. If so, does it represent both
the written language and the intended output language?
If not, how are these distinguished? These are some
of the questions raised by the use of the xml:lang attribute
in a markup language regulating the pronunciation of
text and not just using text as an input.
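To make the "written language" reading concrete, here is a small Python sketch that computes the xml:lang value in effect for each element, using the usual XML rule that the attribute is inherited from the nearest ancestor that sets it. The sample markup and the function name effective_langs are illustrative assumptions. The sketch reports only the language of the enclosed text and does nothing to settle the open question of what the output language or voice should be.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative markup only; not taken from the specification.
DOC = (
    '<speak version="1.0" xml:lang="en-US">'
    'Say <voice xml:lang="fr-FR">bonjour</voice> '
    'to <emphasis>everyone</emphasis>.'
    '</speak>'
)


def effective_langs(elem, inherited=None):
    """Yield (element name, language in effect) for elem and its descendants."""
    lang = elem.get(XML_LANG, inherited)  # fall back to the ancestor's value
    local_name = elem.tag.split("}")[-1]  # drop any "{namespace}" prefix
    yield local_name, lang
    for child in elem:
        yield from effective_langs(child, lang)


if __name__ == "__main__":
    for name, lang in effective_langs(ET.fromstring(DOC)):
        print(name, lang)
```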
say-as
Whether or not the original intention of the <say-as>
element was clear, it is not clear today. Although this
element has some characteristics of a formatter for
spoken output, its most common use is as an input formatter:
it indicates how the text content should be interpreted ("interpret-as")
when that content might otherwise be ambiguous, such
as <say-as type="date"> 5/12 </say-as>.
The optional format specifier can in this case clarify
whether this is May 12 ("date:md"), December
5 ("date:dm"), May 2012 ("date:my"),
etc.
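The effect of the format specifier on the "5/12" example can be sketched in a few lines of Python. This is only an illustration of the three readings named above, not how any synthesis engine actually works; the function name interpret_date is hypothetical, and mapping a two-digit year such as "12" to 2012 is an assumption made to match the example.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def interpret_date(text, fmt):
    """Expand a compact 'a/b' date according to a say-as style format specifier."""
    first, second = (int(part) for part in text.strip().split("/"))
    if fmt == "date:md":  # month/day
        return f"{MONTHS[first - 1]} {second}"
    if fmt == "date:dm":  # day/month
        return f"{MONTHS[second - 1]} {first}"
    if fmt == "date:my":  # month/year
        # Assumption: two-digit years belong to the 2000s.
        year = second if second >= 100 else 2000 + second
        return f"{MONTHS[first - 1]} {year}"
    raise ValueError(f"unsupported format specifier: {fmt}")


if __name__ == "__main__":
    for fmt in ("date:md", "date:dm", "date:my"):
        print(fmt, "->", interpret_date(" 5/12 ", fmt))
```

The same four characters of input yield three different spoken forms, which is why an unambiguous input format matters.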
In one sense the <say-as> element is unnecessary,
as any text to be spoken can always be written out,
in the SSML document, in its full orthographic form
and will be spoken as such. In other words, if you really
care about which text is spoken, you can just write
it out yourself. However, it is extremely common for
an application to receive data such as a date or time
in a compact form (such as "1/1/2000" or "23:59:59")
and expect the synthesis engine to be able to render
it. In fact, most synthesis engines have significant
knowledge built in about how best to read out dates,
times, etc. -- frequently more knowledge than the application
writer himself may have.
Given the input-vs-output role confusion of this element,
there are at least two problems:
- Descriptions of some of the types (currency, measure, and address in particular) are too sketchy to determine either how the input should be interpreted or how it should be spoken.
- The output, if not otherwise specified, is assumed to be based on the current locale. This is problematic for applications that routinely output the same content in multiple languages. In order to successfully build such applications, authors must do these conversions outside the markup itself, even if the underlying synthesis engine is capable of such transformations.
The <say-as> element is a significant convenience,
but in order for it to reach its full potential, it
needs to better distinguish between and allow for indication
of the input and output formats.
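The workaround mentioned in the second item above, doing the conversion outside the markup, might look roughly like the following Python sketch. The application spells out the date itself for each target language and emits plain text inside <speak>, leaving nothing for <say-as> to interpret. The locale table, the function names, the French "1er" rule, and the namespace URI are all assumptions made for illustration.

```python
from datetime import date

# Assumption: namespace URI as published in later versions of the SSML spec.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

MONTHS = {
    "en-US": ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"],
    "fr-FR": ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
              "août", "septembre", "octobre", "novembre", "décembre"],
}


def spell_out(d, locale):
    """Write a date in full orthographic form for the given locale."""
    if locale not in MONTHS:
        raise ValueError(f"unsupported locale: {locale}")
    month = MONTHS[locale][d.month - 1]
    if locale == "fr-FR":
        day = "1er" if d.day == 1 else str(d.day)
        return f"{day} {month} {d.year}"
    return f"{month} {d.day}, {d.year}"


def to_ssml(d, locale):
    """Wrap the spelled-out date in a <speak> element for that locale."""
    return (f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="{locale}">'
            f'{spell_out(d, locale)}</speak>')


if __name__ == "__main__":
    for locale in ("en-US", "fr-FR"):
        print(to_ssml(date(2002, 5, 12), locale))
```

Every application that targets several languages has to carry a table like this one, duplicating knowledge the synthesis engine may already have.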
Alignment with other W3C work
The second category of likely changes is in the area
of features encouraged by other work in the W3C. Both
the VoiceXML specification and the Speech Recognition
Grammar Specification have added support for xml:base
and <metadata>/rdf, so it is reasonable to consider
that these might be added to SSML at some point in the
future.
xml:base
Many HTML programmers are familiar with the <BASE>
element, an element used to set the base path used for
resolution of relative URIs. The XML Base specification
[9] standardizes this by establishing a common attribute
in the XML namespace, xml:base, that could be
used to indicate this base path. Although a base path
for relative URI resolution can often be obtained from
protocol header information, it is still convenient
to be able to set the path directly within the document.
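As a sketch of what this would mean in practice, the Python below resolves the relative src of an <audio> element against an xml:base attribute on the root element. Since xml:base is only a candidate feature for SSML at this point, the markup is hypothetical; the example host, the namespace URI, and the function name resolve_audio_uris are assumptions as well. Only a root-level xml:base is handled here, whereas XML Base also allows the attribute on nested elements.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XML_NS = "http://www.w3.org/XML/1998/namespace"
# Assumption: namespace URI as published in later versions of the SSML spec.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Hypothetical document: xml:base is not yet part of SSML.
DOC = (
    '<?xml version="1.0"?>\n'
    f'<speak version="1.0" xmlns="{SSML_NS}"\n'
    '       xml:base="http://www.example.com/prompts/">\n'
    '  <audio src="welcome.wav"/>\n'
    '</speak>'
)


def resolve_audio_uris(text, fallback_base=""):
    """Return absolute URIs for every <audio src> using the root's xml:base."""
    root = ET.fromstring(text)
    # fallback_base stands in for a base obtained from protocol headers.
    base = root.get(f"{{{XML_NS}}}base", fallback_base)
    return [
        urljoin(base, audio.get("src"))
        for audio in root.iter(f"{{{SSML_NS}}}audio")
    ]


if __name__ == "__main__":
    print(resolve_audio_uris(DOC))
```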
metadata/rdf
The <metadata> element in VoiceXML and SRGS provides
a mechanism for expressing information about the document.
Both recommend the use of the Resource Description Framework
(RDF) syntax [10] and schema [11] as the content format
for this element. RDF "provides a standard way
for using XML to represent metadata in the form of statements
about properties and relationships of items on the Web."
([4], section 6.2.2).
This element (with suggested content structure) is part
of the W3C's Semantic Web Initiative, an attempt to
develop standard ways of representing the meaning of
XML-structured data on the World Wide Web. As such,
it is likely that such a capability will be encouraged
for SSML.
Conclusion
Although the movement of SSML from a Last Call Working
Draft back to a generic Working Draft may at first suggest
that the specification is not progressing, the contrary
is true. The most recent Working Draft represents an
improvement in clarity over the prior version and sets
the stage for clearing out some of the last substantial
ambiguities in the specification. It also paves the
way for the introduction of features that connect it
more fully with the W3C's vision for the World Wide
Web. The Conclusion of the previous article on SSML
began, "Widespread adoption of SSML by TTS engine
developers may energize the development of new classes
of speech-enabled applications . . . ." [2] SSML
is already supported on a handful of text-to-speech
engines, with more expected as the specification moves
closer to Recommendation.
References
[1] D. C. Burnett, M. R. Walker and
A. Hunt, editors, Speech Synthesis Markup Language Specification,
W3C Working Draft, April 5, 2002, work in progress.
(http://www.w3.org/TR/2002/WD-speech-synthesis-20020405/)
[2] M. R. Walker and A. Hunt, "The
Speech Synthesis Markup Language for the W3C VoiceXML
Standard", VoiceXML Review, April 2001, work in
progress, Feature article #2. (http://www.voicexmlreview.org/Apr2001/features/ssml1.html)
[3] M. R. Walker and A. Hunt, editors,
Speech Synthesis Markup Language Specification, W3C
Last-Call Working Draft, Jan 3, 2001, work in progress.
(http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/)
[4] S. McGlashan et al., editors, Voice
Extensible Markup Language (VoiceXML) Version 2.0, W3C
Last-Call Working Draft, April 24, 2002, work in progress. (http://www.w3.org/TR/2002/WD-voicexml20-20020424/)
[5] See Information Processing -- Text and Office Systems
-- Standard Generalized Markup Language (SGML), ISO
8879:1986. (http://www.iso.ch/cate/d16387.html)
[6] A. Hunt and S. McGlashan, editors,
Speech Recognition Grammar Specification Version 1.0,
W3C Candidate Recommendation, June 26, 2002, work in
progress. (http://www.w3.org/TR/2002/CR-speech-grammar-20020626/)
[7] See Section 2.12 of T. Bray, et al., Extensible
Markup Language (XML) 1.0 (Second Edition), W3C Recommendation,
October 6, 2000. (http://www.w3.org/TR/2000/REC-xml-20001006/)
[8] H. Alvestrand, Tags for the Identification of Languages,
IETF RFC3066, January 2001. (http://www.ietf.org/rfc/rfc3066.txt)
[9] J. Marsh, editor, XML Base, W3C Recommendation,
June 27, 2001. (http://www.w3.org/TR/2001/REC-xmlbase-20010627/)
[10] O. Lassila and R. R. Swick, editors, Resource Description
Framework (RDF) Model and Syntax Specification, W3C
Recommendation, February 22, 1999. (http://www.w3.org/TR/REC-rdf-syntax/)
[11] D. Brickley and R.V. Guha, editors,
Resource Description Framework (RDF) Schema Specification,
W3C Candidate Recommendation, March 27, 2000, work in
progress. (http://www.w3.org/TR/2000/CR-rdf-schema-20000327/)


Copyright © 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the IEEE Industry Standards and Technology Organization (IEEE-ISTO).