Hi Peter: Sounds like you are doing some fun work!

Robert Brewer has done some work in this space with his SpeechLion project. The model to aim for here is similar to that of a speech synthesis engine: it's a service that can be used by assistive technologies. The speech synthesis problem is a bit simpler because the interface between the engine and the assistive technology is not so complex. Speech recognition involves a bit more complexity because of the typical need to tell the engine to listen for different grammars, as well as the high degree of two-way communication between the speech application and the speech engine. It's a solvable problem, however, and emerging standards such as MRCPv2 are addressing it.
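Just to make the "engine as a service" idea a bit more concrete, here's a very rough sketch of the kind of two-way interface an assistive technology might see. The names and signatures below are purely hypothetical (they don't come from MRCPv2 or any existing engine); the point is simply that grammars flow down to the engine and recognition results flow back up:

from typing import Callable, Dict, List

class RecognitionService:
    """Hypothetical speech recognition engine exposed as a service."""

    def __init__(self) -> None:
        self._grammars: Dict[str, str] = {}
        self._active: List[str] = []
        self._listeners: List[Callable[[str, float], None]] = []

    def load_grammar(self, name: str, srgs_source: str) -> None:
        # The assistive technology tells the engine which utterances
        # it should be prepared to listen for.
        self._grammars[name] = srgs_source

    def set_active_grammars(self, names: List[str]) -> None:
        # Grammars get switched as the application's state changes.
        self._active = [n for n in names if n in self._grammars]

    def add_result_listener(self, listener: Callable[[str, float], None]) -> None:
        # The engine pushes (utterance, confidence) results back to the
        # assistive technology; that is the other half of the two-way traffic.
        self._listeners.append(listener)

An assistive technology would load a grammar per window or dialog, activate the one matching the current state, and register a listener to act on whatever the engine hears.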
I've seen various folks run down the path of turning speech into keyboard events. My personal opinion is that it's a potentially workable approach, but I believe much more compelling access can be provided via higher-level access to the application, such as through the AT-SPI. As one goes further down the speech input path, one starts to realize that speech recognition is not perfect. As such, one needs to start tuning and modifying the speech engine and the grammars it uses to squeeze the best accuracy and performance out of the engine. Really good tuning can be done by understanding just which utterances are acceptable input to the application in its current state, and that understanding is much easier to obtain through something such as the AT-SPI.

Furthermore, once users can start talking to an application, they start expecting more than just "speech buttons." For example, one might want to be able to say "change the current selection to 12 point bold helvetica." That involves several UI operations. While it might be possible to do this by injecting a sequence of well-known keyboard events, direct semantic access via something such as the AT-SPI is probably a better way to go.

In any case, it sounds like you are getting pretty interested in this space, and I'd be excited to hear more about your progress!

Will
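P.S. To make the "direct semantic access" idea a bit more concrete, here is a very rough sketch using pyatspi (the Python bindings for the AT-SPI). It walks a running application's accessible hierarchy and collects the widgets that expose actions; their names could seed a speech grammar, and a recognized utterance could then be dispatched by invoking the widget's action directly rather than by synthesizing keystrokes. The application name "gedit" is just an example, and this is only a sketch, not working dictation code:

import pyatspi

def find_actionable_widgets(acc, found=None):
    # Recursively collect accessibles that expose the Action interface.
    if found is None:
        found = []
    try:
        action = acc.queryAction()
        if action.nActions > 0 and acc.name:
            found.append((acc.name, acc))
    except NotImplementedError:
        pass  # this accessible exposes no actions
    for i in range(acc.childCount):
        find_actionable_widgets(acc.getChildAtIndex(i), found)
    return found

desktop = pyatspi.Registry.getDesktop(0)
for i in range(desktop.childCount):
    app = desktop.getChildAtIndex(i)
    if app and app.name == "gedit":  # example target application
        for name, acc in find_actionable_widgets(app):
            # Each name/role pair is a candidate utterance for a grammar.
            print(name, acc.getRoleName())
            # A recognized phrase could then be acted on semantically:
            # acc.queryAction().doAction(0)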