Re: [g-a-devel]Gnome Speech Architecture Proposal



Draghi,

I thought I'd give you my humble reaction. I really like it. I am curious about the voice parameters idea -- is it often the case that the same "Speaker" can have such vastly different voices? For example, is it likely that an implementation of a Speaker such as "Beautiful Betty" would be able to speak different languages? (Not rhetorical -- I really don't know.)

Also, I am not clear about the step:

"- pick one Speaker object and ask it to create a Voice object with some
given parameters."

-- it seems to raise the question: how do you pick a Speaker? Would it make sense to ask for a Speaker with certain parameters and have the GS give you the closest match(es)?
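To illustrate the "closest match" idea, here is a minimal sketch of how the GS might rank Speakers against requested parameters. Every name here (the `Speaker` class, `find_speakers`, the property keys) is hypothetical, invented for this sketch -- nothing like it exists in GNOME Speech yet.

```python
# Hypothetical sketch: rank Speakers by how many requested properties
# they satisfy, and return the best matches. Not an existing GS API.
from dataclasses import dataclass, field

@dataclass
class Speaker:
    name: str
    properties: dict = field(default_factory=dict)

def find_speakers(speakers, wanted, max_results=3):
    """Return up to max_results Speakers, best match first."""
    def score(speaker):
        return sum(1 for key, value in wanted.items()
                   if speaker.properties.get(key) == value)
    ranked = sorted(speakers, key=score, reverse=True)
    return [s for s in ranked[:max_results] if score(s) > 0]

speakers = [
    Speaker("Beautiful Betty", {"gender": "female", "language": "en"}),
    Speaker("Perfect Paul", {"gender": "male", "language": "en"}),
    Speaker("kal_diphone", {"gender": "male", "language": "en"}),
]
matches = find_speakers(speakers, {"gender": "female", "language": "en"})
```

A real implementation would presumably weight properties (an exact language match matters more than pitch range), but the shape of the call -- parameters in, ranked Speakers out -- is the point.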

You might consider cross-posting this to the java-speech people (JAVASPEECH-INTEREST@JAVA.SUN.COM), but I'll leave that up to you.

cheers,

~~David

Draghi Puterity wrote:

Hi Marc, hi All,

our team discussed the GS architecture at length, and this is how we think it could look:

The most important idea is that we should hide the TTS engines from the GS
clients. Instead of exposing the TTS engines and their respective "voices"
to the GS clients, we suggest introducing the concept of a "Speaker". The GS
would expose at its highest level only the Speaker objects to its clients.
Examples of Speakers are kal_diphone (a Festival voice), Perfect Paul or
Beautiful Betty (DECtalk Express voices), the ViaVoice male voice #5, etc.
For the GS client, it shouldn't matter where or what these Speakers are.
They are just entities that can produce speech output.

A Voice is a Speaker "instantiated" with a number of parameters (I know that
"voice" is a heavily overloaded term, but I couldn't find anything better
yet). An example of a Voice would be "a slow-spoken Beautiful Betty". So,
Speakers describe the properties available for instantiating a Voice (e.g.
pitch range, rate range, supported languages (an enumeration), etc.). Voices
can actually speak, pause, resume, shut up, and have "current values" for
the parameters.
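The Speaker/Voice split described above could be sketched roughly as follows. The class and method names are assumptions for illustration only, not a proposed interface: the Speaker advertises the ranges its parameters may take and validates them when asked to create a Voice, while the Voice holds concrete current values and does the actual speaking.

```python
# Illustrative sketch only: a Speaker describes parameter ranges and
# creates Voices; a Voice holds current values and can speak/pause/resume.

class Speaker:
    def __init__(self, name, pitch_range, rate_range, languages):
        self.name = name
        self.pitch_range = pitch_range  # (min, max), e.g. in Hz
        self.rate_range = rate_range    # (min, max), e.g. words per minute
        self.languages = languages      # enumeration of supported languages

    def create_voice(self, pitch, rate, language):
        # Reject parameters outside what this Speaker supports.
        if not self.pitch_range[0] <= pitch <= self.pitch_range[1]:
            raise ValueError("pitch outside the Speaker's supported range")
        if not self.rate_range[0] <= rate <= self.rate_range[1]:
            raise ValueError("rate outside the Speaker's supported range")
        if language not in self.languages:
            raise ValueError("language not supported by this Speaker")
        return Voice(self, pitch, rate, language)

class Voice:
    def __init__(self, speaker, pitch, rate, language):
        self.speaker = speaker
        self.pitch = pitch      # "current values", mutable after creation
        self.rate = rate
        self.language = language
        self.paused = False

    def say(self, text):
        # A real Voice would drive the underlying TTS engine here.
        return f"[{self.speaker.name} @ {self.rate} wpm] {text}"

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

betty = Speaker("Beautiful Betty", (80, 300), (90, 250), ["en"])
slow_betty = betty.create_voice(pitch=180, rate=100, language="en")
```

The point of the validation in `create_voice` is that the Speaker's advertised ranges are the contract the client programs against, without ever seeing the TTS engine behind it.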

A typical usage scenario would be:

- ask the bonobo infrastructure for a GS object
- query the GS object for the Speakers available in the system
- pick one Speaker object and ask it to create a Voice object with some
given parameters.
- ask the Voice object to say something and receive its markers
...
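The scenario above could be sketched end to end with stand-in objects like these. Everything here is hypothetical -- `GnomeSpeech`, `get_speakers`, `create_voice`, and the marker callback are placeholders for whatever the Bonobo interfaces would actually look like.

```python
# Self-contained sketch of the usage scenario, with stand-in classes.

class Speaker:
    def __init__(self, name):
        self.name = name

    def create_voice(self, **params):
        return Voice(self.name, params)

class Voice:
    def __init__(self, speaker_name, params):
        self.speaker_name = speaker_name
        self.params = params

    def say(self, text, marker_callback=None):
        # A real Voice would emit markers asynchronously as speech
        # progresses; here we simply report one marker per word.
        for index, word in enumerate(text.split()):
            if marker_callback:
                marker_callback(index, word)

class GnomeSpeech:
    """Stand-in for the GS object obtained via Bonobo activation."""
    def __init__(self):
        self._speakers = [Speaker("Perfect Paul"), Speaker("Beautiful Betty")]

    def get_speakers(self):
        return self._speakers

gs = GnomeSpeech()                        # step 1: obtain the GS object
speakers = gs.get_speakers()              # step 2: query available Speakers
betty = next(s for s in speakers if s.name == "Beautiful Betty")
voice = betty.create_voice(rate=100)      # step 3: instantiate a Voice
markers = []                              # step 4: speak, receiving markers
voice.say("hello world", marker_callback=lambda i, w: markers.append((i, w)))
```

Note that the client only ever touches the GS object, Speakers, and Voices -- the engine behind them never appears in the scenario.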

All the plumbing, such as starting other servers and initializing devices and
TTS engines, should be internal to the GS. The GS clients shouldn't be aware
of these implementation details.

There are many other issues that could be discussed here (e.g. the optimal
balance between Speaker properties and Voices, concurrent speaking voices,
multi-client issues, etc.), but I would leave these for later discussion if
you agree that we should follow this architecture.

Best regards,
Draghi

_______________________________________________
Gnome-accessibility-devel mailing list
Gnome-accessibility-devel@gnome.org
http://mail.gnome.org/mailman/listinfo/gnome-accessibility-devel






