[g-a-devel] Re: New revision of GNOME Speech IDL



Marc:

Thanks for sending along the new IDL for gnome-speech. I've taken a look at it and have made some comments, which are attached.


Paul

*** Comment (0) - Overall
    It would be nice to have the API behavior documented a bit better
    (e.g., what does the boolean return from 'stop' mean?).

*** Comment (1) - GNOME_Speech_SynthesisDriver:voice_language

    Since voice_language is an enum:

      enum voice_language {
        language_english,
        language_german
      };

    Does this mean that a new version of gnome-speech needs to be
    released to support a new language?  How does Gnome represent a
    locale?  If Gnome doesn't have a standard way of representing a
    locale, then perhaps ISO language/country codes should be used.
    This would allow speech engines to support new languages without
    any changes to the gnome-speech API.

*** Comment (2) - GNOME_Speech_SynthesisDriver:voice_info

    VoiceInfo should probably be expanded to include:
       * age - the age of the speaker.
       * style - the style of the voice, to allow voices to be further
                 distinguished.  For instance, "business" and "casual"
                 could be styles supported by a particular engine.
                 (A rough sketch of these additions follows below.)
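
    As a rough sketch (I don't know the exact shape of the current
    VoiceInfo struct, so only the suggested additions are shown):

      struct VoiceInfo {
        // ... existing fields ...
        short  age;      // approximate age of the speaker, in years
        string style;    // e.g. "business" or "casual"
      };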
       

*** Comment (3) - GNOME_Speech_SynthesisDriver

    Is it possible for a system to have more than one vendor TTS engine
    installed?  If so, it is not clear to me how an application would
    choose one vendor TTS engine over another with this API.

    How does an application get an instance of a SynthesisDriver?
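
    One possibility, purely as a sketch (none of these names are in the
    current IDL, as far as I know), would be a way to enumerate the
    installed drivers and pick one:

      // Hypothetical sketch of driver enumeration.
      struct DriverInfo {
        string driverName;      // vendor/engine name
        string driverVersion;
      };
      typedef sequence<DriverInfo> DriverInfoList;

      interface SynthesisDriverRegistry {
        DriverInfoList getDrivers ();
        SynthesisDriver getDriver (in string driverName);
      };

    Alternatively, if the drivers are shipped as Bonobo components, a
    bonobo-activation query might be the natural way to list and select
    them.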

*** Comment (4) - GNOME_Speech_SynthesisDriver

    In a typical system, would there be a single running instance of a TTS
    engine that all speech applications would share, or would each
    application have its own instance?   How can speech applications
    coordinate their output with each other?  For instance, a book reader
    application may queue up an hour's worth of text to be spoken.  If a
    speech-enabled calendar application needs to announce an appointment,
    would it:

       a) Interrupt the book reader
       b) Speak over the book reader
       c) Wait until the book is finished?

    How is this coordination managed in the API?

*** Comment (5) - GNOME_Speech_Speaker

   - Speaker: Needs a 'pause' method.
   - Speaker: Needs a 'resume' method.
   - Speaker: Needs a 'getVoiceInfo' method.
   - Perhaps add a 'sayURL' that takes a URL pointing to marked-up
     text.
   - Perhaps add a 'sayPlainText' that speaks text with no markup.
     (A sketch of these additions follows at the end of this comment.)

   - Does 'say' return immediately after queueing up text?  I presume
     that since there is a 'wait' call, it does.  If an application has
     queued up ten strings to be spoken, does 'stop' stop the speaking
     of the first string in the queue or of all strings in the queue?
     Likewise, does 'wait' wait for the queue to be empty, or for the
     current (head of the queue) text to be spoken?  Generally speaking,
     if there is a TTS output queue, an application will need some
     methods to allow fine-grained manipulation of the queue.
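
   To make the suggestions above concrete, here is a rough sketch of
   the kind of additions I have in mind (the names and signatures are
   only illustrative, not part of the current IDL):

      // Sketch of suggested additions to Speaker.
      interface Speaker {
        // ... existing operations ('say', 'stop', 'wait', ...) ...
        boolean   pause ();
        boolean   resume ();
        VoiceInfo getVoiceInfo ();
        long      sayPlainText (in string text);  // speak text with no markup
        long      sayURL (in string url);         // fetch and speak marked-up text

        // finer-grained queue control:
        boolean   stopCurrent ();   // stop only the utterance being spoken
        boolean   stopAll ();       // flush the entire output queue
      };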


*** Comment (6) - Vocabulary management
   - How does an application add new words/pronunciations to the TTS
     dictionary?
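
   For instance, operations on the Speaker (or perhaps on the driver)
   along these lines might do (a sketch only; not in the current IDL):

      boolean addPronunciation (in string word, in string pronunciation);
      boolean removePronunciation (in string word);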


*** Comment (7) - Callbacks
    It is unclear from the API when callbacks are generated. There are
    a number of interesting speech events that an application may want
    to receive callbacks for:

       - utterance started
       - utterance ended
       - word started
       - word ended
       - phoneme started
       - marker reached
       - utterance cancelled
       - utterance paused
       - utterance resumed

    I suggest adding an event type to the SpeechCallback 'notify'
    method that can be used to indicate the type of event.
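
    For example, mirroring the existing enum style (the names are only
    illustrative):

      enum speech_event_type {
        speech_event_utterance_started,
        speech_event_utterance_ended,
        speech_event_word_started,
        speech_event_word_ended,
        speech_event_phoneme_started,
        speech_event_marker_reached,
        speech_event_utterance_cancelled,
        speech_event_utterance_paused,
        speech_event_utterance_resumed
      };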

    Some applications may want to see all of these events, while others
    won't want to see any.  Since some of the events can add overhead to
    the speech engine, it may be useful to allow an application to
    register callbacks only for specific event types.

    The 'notify' method gets a 'text_id' as a long type.  I presume
    this is the long value that is returned by 'say'.  I think this is
    a bit awkward for an application.  If an application wants access
    to the text that is currently being output, it will have to store
    all the text, associate the text with the id, and remember to clean
    it all up afterwards.  I would prefer to see the SpeechCallback
    'notify' method get a struct that includes a reference to the
    current string being output.
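
    Something like the following is what I have in mind (again, only a
    sketch):

      struct SpeechEvent {
        speech_event_type type;
        long   text_id;    // the value returned by 'say'
        string text;       // the string currently being output
        long   offset;     // e.g. character offset, for word/marker events
      };

      interface SpeechCallback {
        void notify (in SpeechEvent event);
      };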

*** Comment (8) - Parameters
    How does setting parameters interact with text that is queued to
    be spoken?  If an application queues ten strings to be spoken and
    then calls 'setParameterValue' to change a property such as the
    speaking volume, when does the change take place? Some options:
        a) Immediately (in the middle of speaking the current text)
        b) After the current utterance is finished
        c) After all the items currently in the queue have been
        spoken.

    Note that some engines will not be able to change some parameters
    in mid-utterance.

*** Comment (9) - Parameters
        What is the intended use of the 'enumerated' field of the
        Parameter struct?

        What does 'getParameterValue' return if a request is made for
        an unknown parameter?

        What does 'getParameterValueDescription' do?

        Does the 'setParameterValue' return value indicate the
        success/failure of the set?

        I'm not sure if the 'double' type is appropriate for all
        parameters. 


*** Comment (10) - Global Parameters

    If I have a number of speech applications, do I need to set
    parameters for each application, or is there a way to set them
    globally and have each application use those global settings?
    For instance, I can imagine that a user may want to globally
    increase the speaking rate for all applications.  Perhaps a
    mechanism akin to X resources would be appropriate.

*** Comment (11) - Audio Redirection
   Is there any way for an application to redirect audio?

*** Comment (12) - releaseSpeaker?

    When an application is done with a 'Speaker' that was created with
    SynthesisDriver::createSpeaker, how does the application release it?
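
    If Speaker is a Bonobo object, I assume the usual Bonobo::Unknown
    unref would do it; if not, perhaps an explicit operation is needed,
    e.g. (a sketch only):

      // Hypothetical; only needed if Speaker is not reference counted.
      void releaseSpeaker (in Speaker speaker);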



