Re: Orca Speech API



As I promised at GUADEC, here are my brief comments on the current
speech API in Orca.  

Thanks Tomas!

Hope you find that useful.  As we agreed, the goal
is to make the API as simple as possible.

Yep.  I still have yet to write up my notes on this, but we basically
want the interface to speech from scripts to be very simple, and then
provide layers underneath that to talk to different speech systems
(e.g., gnome-speech, speech-dispatcher, emacspeak).

These various layers may need to provide code for support that is not
in the speech system itself - for example, we need verbalized
punctuation logic in the Orca code that talks to gnome-speech because
we cannot rely upon gnome-speech to do this for us.  We may not need
this logic in the Orca code that talks to Speech Dispatcher, though,
since Speech Dispatcher provides this support.
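
Purely as an illustration of that layering (none of these names are
real Orca code, and the engine objects are imaginary):

    PUNCTUATION_NAMES = {".": " period ", ",": " comma ", "?": " question mark "}

    class GnomeSpeechLayer:
        def __init__(self, engine):
            self.engine = engine

        def speak(self, text, verbalizePunctuation=True):
            # The engine cannot verbalize punctuation for us, so this
            # layer rewrites the text itself before speaking it.
            if verbalizePunctuation:
                for char, name in PUNCTUATION_NAMES.items():
                    text = text.replace(char, name)
            self.engine.speak(text)

    class SpeechDispatcherLayer:
        def __init__(self, engine):
            self.engine = engine

        def speak(self, text, verbalizePunctuation=True):
            # The engine already knows how to verbalize punctuation, so
            # this layer only tells it which mode to use.
            if verbalizePunctuation:
                mode = "all"
            else:
                mode = "none"
            self.engine.setPunctuationMode(mode)
            self.engine.speak(text)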

getInfo(self):

  This method is often used as 'server.getInfo()[0]' or
  'server.getInfo()[1]', thus it might be more practical
  to have two methods: 'name()' and 'id()'.

The two are most commonly used together, so I kind of prefer keeping
them coupled.
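
Just to spell out what I mean (hypothetical values, not real server
code):

    class SpeechServer:
        def getInfo(self):
            # Return the human readable name and the identifier as a
            # pair, so callers that want both get them in one call.
            return ["Festival", "festival"]

    server = SpeechServer()
    [name, serverId] = server.getInfo()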

queueText(...), queueTone(...), queueSilence(...):

  These methods are not used anywhere in Orca code, so
  they may be removed.

Sounds good.  I'm for ripping things out that don't belong, and I'm
usually hesitant to add them.

isSpeaking(self):

  It is not used within Orca, except for the HTTP interface.
  It is an open question whether anything behind this interface
  uses it, but it would definitely be nice to avoid this method
  altogether.

FestVox has support for using Orca as its speech service.  The model
for FestVox is to poll isSpeaking, so we added this expressly for that
purpose.
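
The polling pattern is roughly this (a sketch, with a made-up helper
name):

    import time

    def waitWhileSpeaking(server, interval=0.1):
        # Keep asking the server whether it is still speaking and
        # return once it has stopped.
        while server.isSpeaking():
            time.sleep(interval)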

sayAll(self, utteranceIterator, progressCallback):

  This is the most complex point of the interface.  In fact, it
  does the same thing as SSML.  You can change voices and their
  properties within the text and you get progress notifications.
  I'm mentioning this because it might be practical to use SSML
  directly, since it avoids certain limitations.  For example,
  you can change voice properties within a sentence without
  breaking it into pieces (and breaking the lexical structure for
  the synthesizer).  Did you consider such problems, and are you
  satisfied with the current solution?

sayAll is used primarily to read an entire document automatically
(imagine you are browsing "Journey to the Center of the Earth" as one
large document).  As such, it does more than break things up into
SSML-like chunks.  It is intended to do several things (a rough sketch
follows the list):

1) Take an iterator that can provide one utterance at a time, along with
ACSS information.  The main goal here is to be able to lazily provide
the engine with information to speak rather than queuing it all up at
once.

2) Provide us with progress information as speech is happening - we want
to be able to highlight the word being spoken and/or move the region of
interest of the magnifier to show the word being spoken.

3) Let us know when/where speech has been interrupted or stopped so we
can position the caret appropriately.
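
Roughly like this - the utterance and callback shapes are simplified
and the helper names are made up, so treat it as a sketch rather than
the real Orca types:

    def documentIterator(paragraphs, acss):
        # Lazily yield one (text, acss) utterance at a time instead of
        # queueing the whole document up front.
        for paragraph in paragraphs:
            yield (paragraph, acss)

    def onProgress(context, offset, eventType):
        # 'context' is whatever the iterator yielded, 'offset' is a
        # character offset into it, and 'eventType' is a simplified
        # progress marker.
        if eventType == "WORD_STARTED":
            pass    # highlight the word / move the magnifier
        elif eventType in ("INTERRUPTED", "COMPLETED"):
            pass    # position the caret where speech stopped

    # server.sayAll(documentIterator(chapterText, defaultACSS), onProgress)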

It is also unfortunate that a number of synthesizers do not support
SSML, so we'd need to provide some sort of SSML layer ourselves if we
centered on SSML.

speakUtterances(self, list, acss=None, interrupt=True):

  This method seems redundant to me. 

I think we might be able to simplify things greatly if we had just one
call for speaking that looks something like a cross between sayAll and
speakUtterances:  it would take an iterator and a callback.
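
Roughly this shape - just a sketch of what I have in mind, not a
committed design:

    class SpeechServer:
        def speak(self, utteranceIterator, progressCallback=None, interrupt=True):
            # One entry point covering both the speakUtterances and the
            # sayAll use cases.
            if interrupt:
                self.stop()
            for (text, acss) in utteranceIterator:
                self._speakOne(text, acss, progressCallback)

        def stop(self):
            pass    # stop anything currently being spoken

        def _speakOne(self, text, acss, progressCallback):
            pass    # hand a single utterance to the backend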

  The 'acss' argument is never used within Orca, BTW.

It is indeed used, for things such as changing the voice for uppercase
(something the underlying system might be able to detect and manage for
us) and for hyperlinks (something the underlying system may not be able
to detect and manage for us).
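
For example, the scripts build up ACSS-style settings along these lines
and hand them to the speak calls (the values here are illustrative, not
Orca's defaults):

    # A raised pitch for uppercase characters and a distinct voice for
    # hyperlinks, passed per utterance.
    uppercaseVoice = {"average-pitch": 7.0}
    hyperlinkVoice = {"average-pitch": 5.5, "rate": 60}

    # server.speakUtterances(["HELLO"], acss=uppercaseVoice)
    # server.speakUtterances(["home page"], acss=hyperlinkVoice)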

increaseSpeechRate(self, step=5), decreaseSpeechRate(self, step=5),
increaseSpeechPitch(self, step=0.5), decreaseSpeechPitch(self,step=0.5):

  The argument 'step' is never used, so it might be omitted.  Moreover,
  it might be better to implement increasing and decreasing in a
  layer above and only set an absolute value at the speech API level.

That's kind of what we're doing now.  The step value is part of Orca's
settings, and the parameter you see above is more of a vestige that we
should get rid of.
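
In other words, the layering would end up looking something like this
(sketch only; the real settings module and server call differ):

    class Settings:
        # Stand-in for Orca's settings: the step lives here, not in the
        # speech API.
        speechRate = 50
        rateStep = 5

    def increaseSpeechRate(server, settings):
        settings.speechRate = settings.speechRate + settings.rateStep
        server.setRate(settings.speechRate)   # the API only sees an absolute value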

In addition, I suggest a new method 'speakKey(self, key)'.  Currently
key names are constructed by Orca and spoken using the 'speak' method,
but some backends (such as Speech Dispatcher) then lose the chance to
handle keys in a better way, such as playing a sound instead of the key
name or caching the synthesized key name for a key identifier, etc.

Can you describe more about what a "key" is and where Orca might call
speakKey?
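
If it's along the lines of the key echo strings we already construct,
I'd picture the backend hook looking roughly like this - purely a
hypothetical sketch, which is exactly why I'd like the details spelled
out:

    class SpeechDispatcherLayer:
        def __init__(self, client):
            self._client = client

        def speakKey(self, key):
            # Hand the raw key identifier to the backend, which can
            # then play a sound or reuse a cached rendering rather than
            # synthesizing a constructed name.
            self._client.key(key)

    class GnomeSpeechLayer:
        def __init__(self, engine):
            self._engine = engine

        def speakKey(self, key):
            # Backends with no special key support fall back to
            # speaking the constructed name as ordinary text.
            self._engine.speak(key)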

Another concern I have is the use of Unicode.  Currently, AFAIK, Orca
works with UTF-8 strings.  This is not a very Pythonic approach.  It
would be better to use Python's unicode type internally and only
encode/decode to UTF-8 on output/input.  This would have many practical
advantages (especially when handling character offsets in callback
contexts).  I don't know what your plans are in this respect, so I
would be grateful if you could let me know.

The main goal here is to make sure we manage things so that we consume
and present text accurately.  We also need to make sure we play well
with the CORBA interfaces to AT-SPI and gnome-speech, as well as the
interface to BrlTTY.

A lot of what Orca does is just push text around and concatenate it.
So far, we've been kind of lucky, but there are holes, especially where
we are going through a string byte-by-byte versus
character-by-character.  Not having to worry about encoding types
internally would be nice - there are a couple ugly hacks in Orca that
check for some specific UTF-8 patterns and I'd like to get away from
that.
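
For example, a tiny Python 2 sketch of the pitfall and of the
decode-at-input / encode-at-output pattern you're describing (the
string below is just the UTF-8 bytes for u"naïve"):

    textBytes = "na\xc3\xafve"            # UTF-8 bytes
    print len(textBytes)                  # 6 -- byte count, wrong for offsets

    text = textBytes.decode("utf-8")      # decode once, at the input boundary
    print len(text)                       # 5 -- character count, what we want

    # ... work with the unicode object internally ...

    print text.encode("utf-8")            # encode only at the output boundary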

If using unicode is the right thing to do, then we need to take a
careful survey of where we are going wrong.  I'd like to understand what
kind of impact this is going to have on the code - for example, will all
the *.po files need changing, where do we now need to move from unicode
type to encoding (e.g., interfaces to gnome-speech and BrlTTY?), etc.
Do you have an idea of this impact?

Thanks for your feedback!

Will




