In this monthly column, an industry expert will answer
common questions about VoiceXML and related technologies.
Readers are encouraged to submit questions about VoiceXML,
including development, voice-user interface design,
and speech technology in general, or how VoiceXML is
being used commercially in the marketplace. If you have
a question about VoiceXML, e-mail it to speak.and.listen@voicexmlreview.org
and be sure to read future issues of VoiceXML Review
for the answer.
Q: How do I begin playing an audio file starting from an offset?
A: You can implement code on your Web server that accepts
an HTTP request including the following parameters:
- The path to the audio file on the server or an uploaded
audio file (e.g. via a submitted <record> variable)
- The offset from which you want the interpreter to
begin playing.
The code parses, modifies, and sends the updated audio file
header to the VoiceXML interpreter. The code then seeks
to the designated offset in the data segment of the
original file and streams the remainder of the file
back to the interpreter.
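The header rewrite described above can be sketched in ECMAScript. This is a minimal illustration, not the column's own code: it assumes Node.js (for Buffer), a well-formed RIFF/WAVE file with a constant byte rate whose "fmt " chunk precedes its "data" chunk, and it works on a file already held in memory rather than streaming it. The function name trimWav is hypothetical, and any chunks after the audio data are dropped.

```javascript
// Sketch: drop the first offsetMs milliseconds of a RIFF/WAVE file held in a Buffer.
// Assumes a constant byte rate and a "fmt " chunk that precedes the "data" chunk.
function trimWav(wav, offsetMs) {
  if (wav.toString("latin1", 0, 4) !== "RIFF" ||
      wav.toString("latin1", 8, 12) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  let pos = 12;        // the first chunk follows the 12-byte RIFF header
  let byteRate = 0;    // bytes of audio per second, read from "fmt "
  let blockAlign = 1;  // bytes per sample frame, read from "fmt "
  while (pos + 8 <= wav.length) {
    const id = wav.toString("latin1", pos, pos + 4);
    const size = wav.readUInt32LE(pos + 4);
    const body = pos + 8;
    if (id === "fmt ") {
      byteRate = wav.readUInt32LE(body + 8);
      blockAlign = wav.readUInt16LE(body + 12);
    } else if (id === "data") {
      if (byteRate === 0) throw new Error("no fmt chunk before data");
      const end = Math.min(body + size, wav.length);
      // Convert milliseconds to bytes, rounded down to a whole sample frame.
      let skip = Math.floor((offsetMs / 1000) * byteRate);
      skip -= skip % blockAlign;
      skip = Math.max(0, Math.min(skip, end - body));
      const header = Buffer.from(wav.subarray(0, body)); // copy, then patch sizes
      const audio = wav.subarray(body + skip, end);
      header.writeUInt32LE(audio.length, pos + 4);               // data chunk size
      header.writeUInt32LE(header.length + audio.length - 8, 4); // RIFF size
      return Buffer.concat([header, audio]);
    }
    pos = body + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error("no data chunk found");
}
```

A server-side handler could send the returned buffer as the response body. For large files, the same size arithmetic applies to a streaming version that writes the patched header first and then copies the remainder of the file.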
Appendix E of the VoiceXML 2.0 specification describes the various
audio file formats that a VoiceXML interpreter must
support. A robust server-side script should be able
to handle all of these formats. If you know the format
of the audio files that comprise your voice application,
your job is easier.
Even if you only have to deal with a single type of audio
file, unraveling the file format may seem daunting.
Fortunately, open source utilities are available to
ease the burden. For example, you can use "Sound Exchange" (SOX).
- Download SOX from http://sox.sourceforge.net/
- Compile it into an executable (Windows developers
can use the Win32 binary 'out of the box')
- Run SOX, passing the following four arguments:
  - The path to the input file
  - The path to the output file
  - The value 'trim' (without the quotes)
  - The number of milliseconds to remove from the beginning
    of the file.
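The following command creates a new file, "sample-new.wav",
containing the contents of "sample.wav" excluding
the first 15 seconds.
sox sample.wav sample-new.wav trim 15000
Check the trim documentation for your version of SOX: recent
releases interpret a bare number as seconds, in which case the
argument would be 15 rather than 15000.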
If performance is not an issue (it's best to avoid creating
new processes and writing large files to disk in response
to an HTTP request), you can call this utility from
a CGI script running on your Web server. If your application
requires optimal performance, you can write code that
integrates more tightly into your Web server environment.
SOX is not a bad place to start, however, and an example
CGI script follows:
#!/usr/local/bin/perl -w
use strict;
use CGI qw(param);

# TODO: location of SOX utility on your Web server
use constant SOX => "sox";
# TODO: base location of the audio files
use constant AUDIODIR => "/var/audio/myapp";
# TODO: temporary location to store 'trimmed' files
use constant TEMPDIR => "/var/tmp/";

my $wav = param("wav");
my $offset = param("offset");

if (!$wav)
{
    Log("Missing parameter 'wav'");
    print "Status: 400 Bad Request\n\n";
    exit(0);
}

# SECURITY: don't allow unrestricted access to your file system
$wav =~ s/(^[A-Za-z]:[\/\\]|^[\/\\]|\.\.)//g;

# AUDIODIR has no trailing slash, so add the separator explicitly
my $in = AUDIODIR . "/" . $wav;
if (!-f $in)
{
    Log("Invalid path: $in");
    print "Status: 400 Bad Request\n\n";
    exit(0);
}

# trim only when the offset is a positive whole number
my $trim = (defined($offset) && $offset =~ /^\d+$/ && $offset > 0);

# the blank line after the header ends the HTTP header section
print "Content-Type: audio/x-wav\n\n";

my $tmpfile = $in;
if ($trim)
{
    $tmpfile = TEMPDIR . "trim-$$.wav";
    Log("Running SOX on $in to produce $tmpfile [$offset]");
    # list form of system() bypasses the shell
    if (system(SOX, $in, $tmpfile, "trim", $offset) != 0)
    {
        Log("SOX failed with status $?");
    }
}

# enable autoflush; aka disable output buffering
my $old = select STDOUT; $| = 1; select $old;

# open the file and stream it to the client/interpreter as binary data
my $audio;
if (!open($audio, "<", $tmpfile))
{
    Log("Cannot open $tmpfile: $!");
    exit(0);
}
binmode($audio);
binmode(STDOUT);
my $buf;
while (read($audio, $buf, 8192))
{
    print $buf;
}
close($audio);

if ($trim)
{
    unlink $tmpfile; # clean up
}

# diagnostic messages; check server log
sub Log
{
    print STDERR $_[0] . "\n";
}
The script accepts two parameters:
- wav - the name of the .wav file to be returned to the browser
- offset - the offset into the file where playing should begin
Given these two parameters, the script checks if the file exists.
If not, it returns an HTTP error to the client. Otherwise, it
sends an HTTP "Content-Type" header indicating the response
contains an audio file. If the offset parameter was missing
or the value is not greater than zero, the script simply returns
the original audio file. Otherwise, it runs SOX to generate
the trimmed audio to a temporary file,
returns the contents of the temporary file to the client,
and deletes the temporary file.
To use this script on your own Web server, you'll need Perl.
You'll also need to update your Web server configuration
to allow Perl scripts to be executed, and to update the constants
at the top of the script to point to valid locations
specific to your server environment.
Now that we've got a server-side script that trims audio
files, let's use it.
The following snippet of VoiceXML calls the CGI script, passing
both the 'wav' and 'offset' parameters.
<vxml version="2.0"
  xmlns="http://www.w3.org/2001/vxml">
<form>
  <block>
    <audio
      src="http://www.example.org/voice/cgi-bin/play.cgi?wav=test.wav&amp;offset=15000"/>
  </block>
</form>
</vxml>
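When the file name or offset is only known at run time, the URL can be assembled in ECMAScript instead of being written out by hand. The helper below is a hypothetical sketch (playUrl and PLAY_CGI are not names from the column); it assumes the script lives at the example address used above and that the offset is a number of milliseconds. It could be referenced from the expr attribute of <audio> in place of a fixed src.

```javascript
// Hypothetical helper: build the URL that asks the CGI script for trimmed audio.
var PLAY_CGI = "http://www.example.org/voice/cgi-bin/play.cgi";

function playUrl(wav, offsetMs) {
  // Round to a whole number of milliseconds and never send a negative offset.
  var offset = Math.max(0, Math.round(offsetMs));
  // encodeURIComponent escapes spaces, '&' and other reserved characters.
  return PLAY_CGI + "?wav=" + encodeURIComponent(wav) +
         "&offset=" + encodeURIComponent(String(offset));
}
```

Because the URL is built inside a script rather than typed into an XML attribute, the '&' separator does not need to be written as an entity here.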
Q: You've shown me how to play an
audio file beginning at an offset. I want to write a
voice mail application that allows the user to move forward and
back at any point while listening to a message.
How do I do that?
A: To allow the user to move forward
and back while listening to a message, you need to determine
the position in the message where the user "barged in",
for example, by pressing 3 on their telephone keypad
to move "forward" in the message, or by pressing
1 to move backward.
Although some VoiceXML interpreter implementations may
support this feature today,
the VoiceXML 2.0 specification doesn't formally specify
how this bargein data should be exposed.
If you want to stick to the standard and write code that's
portable across implementations, you can obtain a rough
approximation of where bargein occurred using two ECMAScript
date objects and some simple math. I need to emphasize
that this will only provide you with an approximate
result. Without native support for determining when
a bargein occurred during audio playback, you will undoubtedly
obtain slightly different behavior on different platforms
due to disparities in execution and recognition performance.
Because voice recognition performance will vary most
markedly across implementations, it's best to stick to
DTMF input for the commands that you allow while the
message is being played back.
Somewhere in your application, you'll have a dialog that plays
back the current message in the user's voice mailbox.
Before queueing the message (via the <audio> element),
initialize a variable with the 'start time'. You can
do this in a <block>. In response to a user command
(e.g. fast forward), call a function that calculates
a new offset based on the following formula:
elapsed = current_time - start_time
new_offset = old_offset + elapsed + increment - fudge_factor
Since the audio file containing the recorded message doesn't
get altered each time you play it,
you'll need to keep track of the accumulated offsets (old_offset).
The elapsed time is easy to calculate:
simply instantiate a new ECMAScript Date
object and subtract the Date object that you initialized
in the <block> prior to queueing the message.
The increment is up to you. In the sample code below,
I've chosen an increment of plus or minus 5 seconds.
See the variables FORWARD and REVERSE.
The fudge_factor will depend on platform performance.
In the sample code below, I've chosen 3 seconds. See
the variable FUDGE.
Once you've calculated the new offset, you'll pass the
value to a server-side script like the one demonstrated
in the Q&A above.
You'll probably modify the script to accept a unique
message identifier that integrates with your backend
message storage system.
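As a rough illustration of the formula, the calculation might look like the following ECMAScript. This is a sketch rather than the column's own sample code: the constant names FORWARD, REVERSE, and FUDGE and their values (5 seconds and 3 seconds) come from the text above, while the function name newOffset, its parameter list, and the clamp at zero are assumptions added here. All values are in milliseconds.

```javascript
var FORWARD = 5000;   // move ahead 5 seconds
var REVERSE = -5000;  // move back 5 seconds
var FUDGE = 3000;     // platform-dependent allowance for processing delay

// oldOffset: accumulated offset at which the current playback started
// startTime: Date created just before the message was queued
// increment: FORWARD or REVERSE
// now: Date created when the user's command is handled (defaults to the current time)
function newOffset(oldOffset, startTime, increment, now) {
  var current = now || new Date();
  var elapsed = current.getTime() - startTime.getTime();
  var offset = oldOffset + elapsed + increment - FUDGE;
  // Assumption: never ask the server for a position before the start of the message.
  return offset < 0 ? 0 : offset;
}
```

For example, if playback began at offset 0 and the user presses the forward key 20 seconds in, the formula gives 0 + 20000 + 5000 - 3000 = 22000 milliseconds as the offset to request next.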
Continued...


Copyright
© 2001-2002 VoiceXML Forum. All rights reserved.
The VoiceXML Forum is a program of the
IEEE
Industry Standards and Technology Organization (IEEE-ISTO).