Re: [g-a-devel] speech recognition



Hi Peter:

Sounds like you are doing some fun work!

Robert Brewer has done some work in this space with his SpeechLion project.

I've been thinking about speech recognition engines and where they
should be integrated. I now think that the window manager is not the
place for it; it should be at a 'lower' level.

One way to think about a speech recognition engine is the same way we
think about a speech synthesis engine: as a service that can be used
by assistive technologies.

The speech synthesis problem is a bit simpler because the interface
between the engine and the assistive technology is not so complex.
Speech recognition is more involved: the engine typically needs to be
told which grammars to listen for, and there is a high degree of
two-way communication between the speech application and the speech
engine.  It's a solvable problem, however, and emerging standards such
as MRCPv2 are addressing it.
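
To make "listening for a grammar" concrete, here is a tiny command
grammar written in JSGF (the JSpeech Grammar Format).  This is purely
illustrative; individual engines have their own grammar and language
model formats, and MRCP-based systems typically carry SRGS grammars:

  #JSGF V1.0;
  grammar commands;
  public <command> = open file | close file | save file | quit;

The engine would be asked to listen for <command> while, say, a file
manager has focus, and for some other rule set elsewhere.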

I'm currently hacking on a little daemon which uses the sphinx2
recognition engine to convert speech to text, after which it sends the
text to the kernel input layer (using the uinput device driver). This
means that I'll be able to use my voice to 'type' every keyboard
character. (My current implementation already does this for a limited
set of characters.)

This is an interesting first step.  I've had many conversations with
various folks who have gone down this path.  My personal opinion is
that turning speech into keyboard events is a potentially workable
approach, but much more compelling access can be provided via a
higher-level interface to the application, such as the AT-SPI.
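
For concreteness, here is roughly what the keyboard-event path Peter
describes looks like at the uinput level: a tiny virtual keyboard that
"types" the letter 'a'.  This is only a sketch, not Peter's code;
error handling is omitted and the device name "speech-kbd" is made up:

  /* Minimal sketch: create a virtual keyboard via /dev/uinput and
   * inject one key press/release.  Requires permission to open
   * /dev/uinput on Linux. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/input.h>
  #include <linux/uinput.h>

  static void emit(int fd, int type, int code, int value)
  {
      struct input_event ev;
      memset(&ev, 0, sizeof(ev));
      ev.type  = type;
      ev.code  = code;
      ev.value = value;
      write(fd, &ev, sizeof(ev));
  }

  int main(void)
  {
      int fd = open("/dev/uinput", O_WRONLY | O_NONBLOCK);

      /* Declare a keyboard device that can send the 'a' key. */
      ioctl(fd, UI_SET_EVBIT, EV_KEY);
      ioctl(fd, UI_SET_KEYBIT, KEY_A);

      struct uinput_user_dev uidev;
      memset(&uidev, 0, sizeof(uidev));
      snprintf(uidev.name, UINPUT_MAX_NAME_SIZE, "speech-kbd");
      uidev.id.bustype = BUS_VIRTUAL;
      write(fd, &uidev, sizeof(uidev));
      ioctl(fd, UI_DEV_CREATE);

      /* A recognized word becomes key press + release + sync. */
      emit(fd, EV_KEY, KEY_A, 1);
      emit(fd, EV_KEY, KEY_A, 0);
      emit(fd, EV_SYN, SYN_REPORT, 0);

      ioctl(fd, UI_DEV_DESTROY);
      close(fd);
      return 0;
  }

Mapping arbitrary text onto keycodes (shift states, layouts, dead
keys) is where this gets messy, which is part of why the higher-level
route starts to look attractive.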

As one goes further down the speech input path, one starts to realize
that speech recognition is not perfect.  As such, one needs to start
tuning the speech engine and the grammars it uses to squeeze the best
accuracy and performance out of it.  Really good tuning comes from
understanding just which utterances are acceptable input to the
application in its current state, and that understanding is much
easier to obtain through something such as the AT-SPI.
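
As a sketch of what that introspection can look like, the following
walks the accessible hierarchy and prints each object's role and name,
which is the raw material for deciding which utterances make sense at
the moment.  It uses today's libatspi (AT-SPI2) C API purely as an
illustration of the idea, with error handling omitted:

  #include <atspi/atspi.h>
  #include <stdio.h>

  /* Recursively print role and name for every accessible object.
   * In practice you would scope this to the focused application. */
  static void walk(AtspiAccessible *node, int depth)
  {
      gchar *name = atspi_accessible_get_name(node, NULL);
      gchar *role = atspi_accessible_get_role_name(node, NULL);
      printf("%*s%s \"%s\"\n", depth * 2, "",
             role ? role : "", name ? name : "");
      g_free(name);
      g_free(role);

      gint n = atspi_accessible_get_child_count(node, NULL);
      for (gint i = 0; i < n; i++) {
          AtspiAccessible *child =
              atspi_accessible_get_child_at_index(node, i, NULL);
          if (child) {
              walk(child, depth + 1);
              g_object_unref(child);
          }
      }
  }

  int main(void)
  {
      atspi_init();
      AtspiAccessible *desktop = atspi_get_desktop(0);
      walk(desktop, 0);
      g_object_unref(desktop);
      return atspi_exit();
  }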

Furthermore, once users can start talking to an application, they
start expecting more than just "speech buttons."  For example, one
might want to say "change the current selection to 12 point bold
helvetica."  That involves several UI operations.  While it might be
possible to do this by injecting a sequence of well-known keyboard
events, direct semantic access via something such as the AT-SPI is
probably a better way to go.
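
Setting font attributes through AT-SPI is more involved than a single
call, but to give a flavor of "direct semantic access" as opposed to
faking keystrokes, here is how an assistive technology could invoke a
widget's action directly (again sketched against the libatspi C API;
'obj' is assumed to come from a tree walk like the one above):

  #include <atspi/atspi.h>

  /* Trigger an accessible object's default action instead of
   * synthesizing the keystrokes that would normally reach it. */
  static gboolean invoke_default_action(AtspiAccessible *obj)
  {
      AtspiAction *action = atspi_accessible_get_action_iface(obj);
      gboolean ok = FALSE;
      if (action) {
          /* Action 0 is typically "click"/"press" for a push button. */
          ok = atspi_action_do_action(action, 0, NULL);
          g_object_unref(action);
      }
      return ok;
  }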

In any case, it sounds like you are getting pretty interested in this
space, and I'd be excited to hear more about your progress!

Will


