[g-a-devel]Re: BAUM Status Report Friday, January 18, 2002



Hi Marc, Hi All,


Our speech component architecture specifies the SR approach to a speech
stack. It is similar to the architecture of the Braille component.

Let's see first what is important for an SR. At a minimum, the SR needs to
produce speech in reaction to notable SR events like focus changes,
key presses, changes in the navigation "cursor", and so on. It would be useful
if each type of event were spoken with a different "SR Voice" so that the
blind user can distinguish between these types. It is also possible that
some of these events happen simultaneously; however, in order not to
confuse the user, only one of these concurrent events can be spoken at a
given time. Some of these events are more important than others, and should
be able to interrupt the speaking of other events; some events are
"perishable", some are not.

It is possible to have multiple TTS engines on a single machine, possibly
from different vendors and driving different hardware, and the blind user
would like to use them simultaneously because each one is better optimized
for a given task (speech quality, intonation, semantics, etc.). As our
experience has shown us, it is also important to support different languages
simultaneously (e.g. reading a Spanish text in an English word processor).

For the moment I won't go deeper into the Speech->SR direction because it is
not so relevant for the purpose of this discussion (although it is also
very important).

Let's see then what the current situation of the speech synthesizers is. To
put it shortly: it is chaos. Each TTS engine has its own protocol, its own
ideas of what is important or not and how it encapsulates information, and
especially its own concept of voice. Efforts have been made to create a
standard for driving TTS engines, and the current trend is to use XML
dialects for this. Some TTS engines support these dialects better than others.

The task of the Gnopernicus speech component (the SRS or SR Speech) is to
provide a model able to mediate between the SR's needs and the available
speech infrastructure. It has to be general and flexible, and to solve only
the aspects relevant to an SR.

The SRS introduces two concepts (for output): the SR Voice and the SR Text.

An SRS Voice describes *** all *** the parameters possible for a TTS
conversion, including rate and pitch. This might differ from the voice
definition of some TTS engines, so don't be confused by the overloading of
the word "voice". An SRS Voice also includes a priority, for the case where
it competes with other voices, and a "behaviour on collision" (i.e.
preemption). An SRS Voice has an ID which uniquely identifies it (e.g.
"fcs_trk", "kbd_echo", etc.).

An SRText is always spoken in a given SR Voice.
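As a rough illustration of the data the SRS has to carry around, a minimal
C sketch of these two concepts could look like the one below. The field
names are again only illustrative, not the actual Gnopernicus structures:

 /* Illustrative sketch only -- not the real Gnopernicus data types. */

 typedef struct {
     char *id;          /* unique ID, e.g. "fcs_trk" or "kbd_echo"     */
     char *tts_engine;  /* target engine, e.g. "Festival"              */
     char *tts_voice;   /* engine-specific voice, e.g. "kal_diphone"   */
     int   priority;    /* lower value wins on collision               */
     int   preempt;     /* if set, interrupt whatever is speaking      */
     int   rate;        /* speaking rate, engine-neutral units         */
     int   pitch;       /* baseline pitch, engine-neutral units        */
     /* ... as many further parameters as the supported TTS need       */
 } SRSVoice;

 typedef struct {
     char *voice_id;    /* the SRS Voice this text is spoken in        */
     char *marker;      /* optional marker reported back when reached  */
     char *text;        /* the text to be spoken                       */
 } SRSText;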

I have implemented this model using XML in order to be consistent with the
Braille component. Take a look at this example:

 <?xml version="1.0" ?>
 <SRSOUT>
  <VOICE ID="fcstrk" TTSengine="Festival" TTSVoice="kal_diphone"
         priority="0" preempt="yes" rate="80" pitch="100" />
  <VOICE ID="kbd" TTSengine="Festival" TTSVoice="kal_diphone"
         priority="1" preempt="no" rate="120" pitch="50" />
  <TEXT voice="fcstrk" marker="my_marker">Hello this is a Gnopernicus XML.</TEXT>
 </SRSOUT>

You can see that all voice parameters are specified as attributes. Right now
I have just a few, but the idea is to have as many as necessary to cover
all the existing TTS engines we want to support.
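For the curious, walking such a document is straightforward with libxml2.
The fragment below is only a sketch of the idea, assuming the SRSML stays
this simple; it is not the code I actually use:

 /* Sketch: walking an SRSOUT document with libxml2 (illustration only). */
 #include <stdio.h>
 #include <libxml/parser.h>
 #include <libxml/tree.h>

 static void walk_srsout(const char *buf, int len)
 {
     xmlDocPtr  doc;
     xmlNodePtr root, n;

     doc = xmlParseMemory(buf, len);
     if (!doc)
         return;

     root = xmlDocGetRootElement(doc);          /* <SRSOUT> */
     for (n = root->children; n; n = n->next) {
         if (n->type != XML_ELEMENT_NODE)
             continue;
         if (!xmlStrcmp(n->name, (const xmlChar *) "VOICE")) {
             xmlChar *id = xmlGetProp(n, (const xmlChar *) "ID");
             printf("define voice %s\n", (char *) id);
             xmlFree(id);
         } else if (!xmlStrcmp(n->name, (const xmlChar *) "TEXT")) {
             xmlChar *voice = xmlGetProp(n, (const xmlChar *) "voice");
             xmlChar *text  = xmlNodeGetContent(n);
             printf("speak \"%s\" with voice %s\n",
                    (char *) text, (char *) voice);
             xmlFree(text);
             xmlFree(voice);
         }
     }
     xmlFreeDoc(doc);
 }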

But this is only an implementation detail. Let's see what happens as we go
down the stack.

A lower layer in the SR Speech stack will map the SRVoice to:

- an underlying abstraction layer like gnome speech
- a sequence of commands to the TTS engine (in order to set the TTS Voice,
or the TTS Voice plus some extra parameters like rate, pitch, etc.)
- a speech XML dialect like JavaSpeech, Sable, VoiceXML, etc.

As you can see from the SRSML example above, the SRS Voice can specify a
target TTS engine and a target TTS voice, so in many cases the mapping will
be trivial. However, the system should also work if most of the SR Voice
attributes are not specified, in which case they will map to some defaults.
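As an example of the second mapping option, here is a rough sketch of how
an SRS Voice could be turned into Festival Scheme commands, reusing the
illustrative SRSVoice/SRSText structs from above. The Duration_Stretch
scaling is an assumption on my side, not the final rate mapping:

 /* Sketch: mapping an SRSVoice to Festival commands (illustration only). */
 #include <stdio.h>

 static void srs_voice_to_festival(const SRSVoice *v, FILE *out)
 {
     /* Select the engine-specific voice, e.g. (voice_kal_diphone). */
     fprintf(out, "(voice_%s)\n", v->tts_voice);

     /* Rate: 100 means "normal"; Festival stretches durations, so the
      * factor is the inverse of the requested rate (assumed scaling). */
     if (v->rate > 0)
         fprintf(out, "(Parameter.set 'Duration_Stretch %.2f)\n",
                 100.0 / v->rate);
 }

 /* Speaking an SRSText then becomes something like: */
 static void srs_text_speak(const SRSText *t, FILE *out)
 {
     fprintf(out, "(SayText \"%s\")\n", t->text);
 }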

Without going too deep into implementation now, I'd like to mention the need
for a priority queue with flexible preemption logic. I hope I will get this
from the gnome-speech library. gnome-speech could also be used as the
infrastructure for the SRVoice/Text mapping.
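To illustrate what I mean by that, a toy version of such a queue could look
like the sketch below (the real thing will hopefully come from gnome-speech;
the names are again only illustrative and reuse the SRSText sketch above):

 /* Toy sketch of a priority queue with preemption -- illustration only. */
 #include <stdlib.h>

 typedef struct QueueItem {
     SRSText          *text;      /* what to speak                */
     int               priority;  /* lower value = more urgent    */
     int               preempt;   /* interrupt current utterance  */
     struct QueueItem *next;
 } QueueItem;

 static QueueItem *queue_head = NULL;

 /* Insert keeping the list sorted by priority; if the new item may
  * preempt, the caller should also stop the utterance in progress. */
 static void queue_push(SRSText *text, int priority, int preempt)
 {
     QueueItem  *item = malloc(sizeof(QueueItem));
     QueueItem **pos  = &queue_head;

     if (!item)
         return;
     item->text = text;
     item->priority = priority;
     item->preempt = preempt;

     while (*pos && (*pos)->priority <= priority)
         pos = &(*pos)->next;
     item->next = *pos;
     *pos = item;

     if (preempt) {
         /* e.g. tell the TTS layer to stop the current utterance */
     }
 }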

I'm currently using Festival to validate my design. I did my own Festival
mapping (in fact I took some code from the AT-SPI demo) and it worked. My
goal for a proof of concept is to use Festival for focus tracking and an
external DECtalk Express (if I can get one) for keyboard echo, or some
combination like this. However, the primary goal is to have, A.S.A.P., any
kind of speech produced through this architecture, most likely with Festival,
so that the SR can validate its own design.

There are more details I could discuss here, but I think this is enough to
present the ideas behind the SRSpeech component. I will send you some code
as soon as I have more "meat" and fewer bugs in it.

Best regards,
Draghi



----- Original Message -----
From: "Marc Mulcahy" <marc mulcahy sun com>
To: "Draghi Puterity" <mp baum de>; "BAUM GNOME Development list"
<gnome-baum-dev basso SFBay Sun COM>
Sent: Tuesday, January 22, 2002 7:47 AM
Subject: Re: BAUM Status Report Friday, January 18, 2002


> Please keep us apprized of the speech component work-- I need to ensure it
> integrates with and uses gnome-speech appropriately.
>
> Thanks,
>
> Marc
>




