Re: Orca Speech API

Willie Walker wrote:
Yep.  I have still yet to write up my notes on this, but we basically
want the interface to speech from scripts to be very simple, and then
provide layers underneath that to talk with different speech systems
(e.g., gnome-speech, speech-dispatcher, emacspeak).  

These various layers may need to provide code for support that is not in
the speech system - for example, we need verbalized punctuation logic in
the orca code that talks to gnome-speech because we cannot rely upon the
gnome-speech stuff to do this for us.  We may not need this logic in the
orca code that talks to speech dispatcher, though, since speech
dispatcher provides this support.

Yes, exactly.  In particular, all the verbalization features should be
separated from Orca itself, since this is mostly a job for the
synthesizer (it is language dependent, etc.).  Since many synthesizers
don't support all the required features, the features can be emulated
for them, but that should not prevent those which do support them from
performing them.  This is essential for i18n.
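For synthesizers that lack the feature, the driver layer could emulate verbalized punctuation along these lines (a minimal sketch; the table and function names are invented, and a real driver would use per-language tables):

```python
# Hypothetical driver-side emulation of verbalized punctuation for
# synthesizers that cannot do it themselves.
PUNCTUATION_NAMES = {
    u'.': u'dot', u',': u'comma',
    u'?': u'question mark', u'!': u'exclamation',
}

def verbalize_punctuation(text, engine_supports_it=False):
    # If the synthesizer can verbalize punctuation itself (and do so in
    # a language-aware way), pass the text through untouched.
    if engine_supports_it:
        return text
    out = []
    for ch in text:
        if ch in PUNCTUATION_NAMES:
            out.append(u' %s ' % PUNCTUATION_NAMES[ch])
        else:
            out.append(ch)
    return u''.join(out)
```

The point is only that the emulation lives in one driver, not in Orca, so a capable synthesizer is never second-guessed.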


  Is not used within Orca, except for the http interface.
  It is a question whether it is used by anything behind
  this interface, but it would definitely be nice to avoid
  this method altogether.

FestVox has support to use Orca as its speech service.  The model for
FestVox is to poll for isSpeaking, so we added this expressly for that
purpose.

Ok, I think this is a sign of bad design, but I will have to discuss
that with the FestVox authors, of course.  The fact that you avoid using
it in Orca indicates that it should not be necessary...

sayAll is used primarily to automatically read an entire document
(imagine you are browsing "Journey to the Center of the Earth" as one
large document).  As such, it is doing more than breaking things up into
SSML-like things.  It is intended to do several things:

1) Take an iterator that can provide one utterance at a time, along with
ACSS information.  The main goal here is to be able to lazily provide
the engine with information to speak rather than queuing it all up at
once.
Ok, I see.  Constructing the whole SSML document at once is not
possible.  So what is a typical "utterance" here?  I guess that if you
need to change the voice, e.g. for an emphasized word within a sentence,
you need to split the sentence into several utterances.  Does the
synthesizer then get each piece separately, or the whole sentence, or
paragraph?  How do you recognize where to split the text?
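For concreteness, the lazy iterator from point 1 might be sketched like this (names are hypothetical, not Orca's actual API, and the split logic is deliberately naive):

```python
def utterance_iterator(paragraphs, default_acss=None):
    # Lazily yield (text, acss) pairs; nothing is queued up front, so
    # the engine pulls one utterance at a time.
    for paragraph in paragraphs:
        # A real implementation would also split wherever the voice
        # changes, e.g. around an emphasized word, yielding a different
        # ACSS for that piece.
        for sentence in paragraph.split(u'. '):
            if sentence:
                yield (sentence, default_acss)
```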

2) Provide us with progress information as speech is happening - we want
to be able to highlight the word being spoken and/or move the region of
interest of the magnifier to show the word being spoken.

Yes, this is possible with SSML too.

3) Let us know when/where speech has been interrupted or stopped so we
can position the caret appropriately.

Yes, this is also possible.
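For illustration, SSML covers both progress (point 2) and stop positions (point 3) with mark elements that the synthesizer reports back as it reaches them.  A sketch of generating such marked-up text (a hypothetical helper, not an Orca API):

```python
def ssml_with_marks(words):
    # Insert a named <mark/> before each word; the synthesizer fires a
    # callback for each mark it reaches, which gives word-level progress
    # and, on interruption, the last position spoken.
    parts = [u'<speak>']
    for i, word in enumerate(words):
        parts.append(u'<mark name="w%d"/>%s' % (i, word))
    parts.append(u'</speak>')
    return u' '.join(parts)
```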

Please note that I am not pushing for SSML; I just want to get an idea
of how it was meant, and comparing it to SSML can help me...

It is also unfortunate that SSML is not supported by a number of
synthesizers.  So, we'd need to provide some sort of layer for SSML if
we center on SSML.

Yes, but most synthesizers support some sort of embedded markup, and it
is quite simple to translate SSML into anything else in the output
driver.  That's how Speech Dispatcher works, and the advantage is that we
have a common (and even W3C-standardized) markup for the common interface.
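A sketch of that kind of output-driver translation, mapping a tiny SSML subset onto an invented native escape syntax (Speech Dispatcher's real output modules are considerably more thorough):

```python
def ssml_to_native(ssml):
    # Map SSML tags onto the (made-up) native escape sequences of a
    # particular synthesizer; unsupported tags could simply be stripped.
    replacements = [
        (u'<emphasis>', u'[emph on]'),
        (u'</emphasis>', u'[emph off]'),
        (u'<break/>', u'[pause]'),
    ]
    for ssml_tag, native in replacements:
        ssml = ssml.replace(ssml_tag, native)
    return ssml
```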

speakUtterances(self, list, acss=None, interrupt=True):

  This method seems redundant to me. 

I think we might be able to simplify things greatly if we had just one
call for speaking that looks something like a cross between sayAll and
speakUtterances:  it would take an iterator and a callback.
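Such a unified call might look roughly like this (a sketch only; the signature and callback shape are assumptions, not an agreed API):

```python
def speak(utterances, progress_callback=None, interrupt=True):
    # One entry point: pull (text, acss) pairs lazily from the iterator
    # and report progress back through the callback.
    for index, (text, acss) in enumerate(utterances):
        # ... hand (text, acss) to the engine here ...
        if progress_callback is not None:
            progress_callback(index, text)
```

A convenience wrapper around this could then reproduce the simple speakUtterances behaviour for callers that just have a list.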

  The 'acss' argument is never used within Orca, BTW.

It actually is used, for doing things such as changing the voice for
uppercase (something the underlying system might be able to detect and
manage for us)

Yes, this should be done by the synthesizer and emulated by the driver
for those which don't support it.

and for hyperlinks (something the
underlying system may not be able to detect and manage for us).

Then, could sayAll be used?  There could be a convenience wrapper to
make it simple to use within Orca.

increaseSpeechRate(self, step=5), decreaseSpeechRate(self, step=5),
increaseSpeechPitch(self, step=0.5), decreaseSpeechPitch(self,step=0.5):

  The argument 'step' is never used, so it might be omitted.  Moreover,
  it might be better to implement increasing and decreasing in a
  layer above and only set an absolute value at the speech API level.

That's kind of what we're doing now.  The step value is part of Orca's
settings, and the parameter you see above is more of a vestige that we
should get rid of.

Ok.  These methods are only used for the Insert-arrow shortcuts.
Setting the pitch and rate within the setup GUI uses some other
mechanism.  It would be nice to be able to use the same methods for both.
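One way the shortcuts and the setup GUI could share a single mechanism is to keep the stepping logic in a layer above the speech API and expose only an absolute setter below it; a sketch with invented names:

```python
class SpeechSettings(object):
    # The layer above the speech API: it owns the step size and the
    # current value, and only ever calls an absolute setter below.
    def __init__(self, server, rate=50, rate_step=5):
        self.server = server
        self.rate = rate
        self.rate_step = rate_step

    def increase_rate(self):
        self.rate = min(100, self.rate + self.rate_step)
        self.server.set_rate(self.rate)   # the only call the API needs

    def decrease_rate(self):
        self.rate = max(0, self.rate - self.rate_step)
        self.server.set_rate(self.rate)
```

The setup GUI would call the same `set_rate` directly with its slider value, so both paths go through one API method.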

In addition, I suggest a new method 'speakKey(self, key)'.  Currently
key names are constructed by Orca and spoken using the 'speak' method,
but some backends (such as Speech Dispatcher) then lose the chance to
handle keys in a better way, such as playing a sound instead of the key
name or caching the synthesized key name for a key identifier, etc.

Can you describe more about what a "key" is and where Orca might call
this?
Well, anywhere you need to verbalize a key.  Typically for keyboard
echo.  It can get much more responsive when you allow caching.
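A sketch of how a backend might cache synthesized key names behind a speakKey method (all names hypothetical):

```python
class KeyEchoBackend(object):
    # A backend that synthesizes each key name once and replays the
    # cached audio afterwards, which is what makes key echo responsive.
    def __init__(self, synthesize):
        self._synthesize = synthesize   # key identifier -> audio (slow)
        self._cache = {}

    def speakKey(self, key):
        if key not in self._cache:
            self._cache[key] = self._synthesize(key)
        self._play(self._cache[key])

    def _play(self, audio):
        pass  # hand the (possibly cached) audio to the output device
```

A backend could equally map certain keys to a short sound instead of speech; the point is that the decision moves behind the API.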

A lot of what Orca does is just push text around and concatenate it.
So far, we've been kind of lucky, but there are holes, especially where
we are going through a string byte-by-byte versus
character-by-character.  Not having to worry about encoding types
internally would be nice - there are a couple of ugly hacks in Orca that
check for some specific UTF-8 patterns, and I'd like to get away from
those.
Yes, this is all very comfortable when using the Python unicode type.

A good source of information is the Python Unicode HOWTO.

Some practical hints:

If using unicode is the right thing to do, then we need to take a
careful survey of where we are going wrong.  I'd like to understand what
kind of impact this is going to have on the code - for example, will all
the *.po files need changing, where do we now need to move from unicode
type to encoding (e.g., interfaces to gnome-speech and BrlTTY?), etc.
Do you have an idea of this impact?

The basic idea is that you decode all the input, whether it is UTF-8 or
any other encoding.  Then you work with the Python unicode type
internally and you encode all the output (to whatever encoding the
target wants).  So it is UTF-8 for gnome-speech, BrlTTY, Speech
Dispatcher etc.
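In code, the decode-at-input / encode-at-output pattern is simply (a minimal sketch):

```python
def from_engine(raw_bytes, encoding='utf-8'):
    # Decode at the input boundary: bytes in whatever encoding -> unicode.
    return raw_bytes.decode(encoding)

def to_engine(text, encoding='utf-8'):
    # Encode at the output boundary: unicode -> bytes for the target
    # (UTF-8 for gnome-speech, BrlTTY, Speech Dispatcher, ...).
    return text.encode(encoding)
```

Everything in between works on unicode strings only, so byte-by-byte versus character-by-character confusion cannot arise internally.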

But in practice you don't need to care too much.  The encoding is in
fact mostly done within the Python binding library (for example when
sending the data to Speech Dispatcher's socket or when passing it to
the C interface function in the BrlAPI binding).  Changing the PO files
is not needed, since the Python gettext library decodes the input for
you (when you pass the unicode flag to the install function).

Best regards


PS: I'm leaving today for a few days, so I hope to get back to you
sometime next week...

Brailcom, o.p.s.
Free(b)soft project
Eurochance project
