Re: [orca-list] making Linux a hospitable place for TTS engines like Voxin



Thanks for the very detailed response!  I only just saw this, because my speech stack was broken until this morning, when I finally got ibmtts working through my initial speechsw module for speech-dispatcher.  Comments inline below.  My module still needs a lot of work, but it is talking!

On Sat, Dec 5, 2020 at 9:32 AM Samuel Thibault <sthibault hypra fr> wrote:
Hello,

Ok, so let's discuss!

Thanks!
 
Bill Cox, on Wed, Dec 2, 2020 at 12:54:17 -0800, wrote:
> Specifically, modules are required to render speech through the sound system,
> rather than generating speech samples.

Yes. I do not know the historical rationale for this. Possibly it was
meant to avoid yet another round of data passing between processes (we already
have the ssip client to the ssip server, then to the output module, then
to the sound audio server), each hop of which adds some latency. Possibly
this does not matter that much any more with today's faster machines.

Possibly it is because it was thought that some synthesizers would
not support producing samples, only playing them.

But the TODO file actually lists moving audio to the server :)

Awesome.  I would be interested in helping with that.  IIRC, the last time I looked at the module code for Espeak, it was very complex, like ibmtts.c.  This was several years ago, maybe 2011.  Now espeak.c is only 800-ish lines of code, and is far simpler than I recall.  Nice work!

I don't mean to disparage the current implementation of modules like espeak.c, which has improved a great deal over the years.  However, it is simply not portable at the binary level between Linux distros.  Just run ldd on sd_espeak:

 linux-vdso.so.1 (0x00007ffc569d5000)
libespeak.so.1 => /usr/lib/x86_64-linux-gnu/libespeak.so.1 (0x00007fc579272000)
libsndfile.so.1 => /usr/lib/x86_64-linux-gnu/libsndfile.so.1 (0x00007fc5791f7000)
libdotconf.so.0 => /usr/lib/x86_64-linux-gnu/libdotconf.so.0 (0x00007fc5791ef000)
libglib-2.0.so.0 => /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0 (0x00007fc5790c0000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc578efb000)
libltdl.so.7 => /usr/lib/x86_64-linux-gnu/libltdl.so.7 (0x00007fc578ef0000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc578ecc000)
libpulse.so.0 => /usr/lib/x86_64-linux-gnu/libpulse.so.0 (0x00007fc578e78000)
libpulse-simple.so.0 => /usr/lib/x86_64-linux-gnu/libpulse-simple.so.0 (0x00007fc578e71000)
libportaudio.so.2 => /usr/lib/x86_64-linux-gnu/libportaudio.so.2 (0x00007fc578e40000)
libsonic.so.0 => /usr/lib/x86_64-linux-gnu/libsonic.so.0 (0x00007fc578e38000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc578cf4000)
libFLAC.so.8 => /usr/lib/x86_64-linux-gnu/libFLAC.so.8 (0x00007fc578cb3000)
libogg.so.0 => /usr/lib/x86_64-linux-gnu/libogg.so.0 (0x00007fc578ca9000)
libvorbis.so.0 => /usr/lib/x86_64-linux-gnu/libvorbis.so.0 (0x00007fc578c7c000)
libvorbisenc.so.2 => /usr/lib/x86_64-linux-gnu/libvorbisenc.so.2 (0x00007fc578bd1000)
libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007fc578b5e000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc579341000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc578b58000)
libpulsecommon-13.99.so => /usr/lib/x86_64-linux-gnu/pulseaudio/libpulsecommon-13.99.so (0x00007fc578ad2000)
libdbus-1.so.3 => /lib/x86_64-linux-gnu/libdbus-1.so.3 (0x00007fc578a7f000)
libasound.so.2 => /usr/lib/x86_64-linux-gnu/libasound.so.2 (0x00007fc578983000)
libjack.so.0 => /usr/lib/x86_64-linux-gnu/libjack.so.0 (0x00007fc578932000)
libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007fc578908000)
libsystemd.so.0 => /lib/x86_64-linux-gnu/libsystemd.so.0 (0x00007fc578853000)
libwrap.so.0 => /usr/lib/x86_64-linux-gnu/libwrap.so.0 (0x00007fc578847000)
libasyncns.so.0 => /usr/lib/x86_64-linux-gnu/libasyncns.so.0 (0x00007fc57883f000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fc578834000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc578667000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc57864d000)
libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007fc578646000)
libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007fc57863e000)
liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007fc578616000)
libzstd.so.1 => /usr/lib/x86_64-linux-gnu/libzstd.so.1 (0x00007fc57853a000)
liblz4.so.1 => /usr/lib/x86_64-linux-gnu/liblz4.so.1 (0x00007fc578518000)
libgcrypt.so.20 => /usr/lib/x86_64-linux-gnu/libgcrypt.so.20 (0x00007fc5783f8000)
libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00007fc5783dc000)
libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007fc5783c2000)
libbsd.so.0 => /usr/lib/x86_64-linux-gnu/libbsd.so.0 (0x00007fc5783a8000)
libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x00007fc578382000)

In contrast, here's the dependency list for my more portable (and currently feature-poor) binary sw_espeak:

 linux-vdso.so.1 (0x00007ffdc133b000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f0a83bfe000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f0a83bdc000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0a83a17000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0a83e08000)

The missing features can be added without introducing any new dependencies.  For binary compatibility between Linux distros, minimizing these dependencies is critical.  Binary portability appears to have been a non-goal in speech-dispatcher.  Is there any chance I can contribute code to speech-dispatcher to fix this?

I see two approaches, and would be happy to work on either, or any other approach if there is one.  The two I see look something like:

1) Create a meta-module (currently called speechsw.c) to support alternative portable binary modules.  The meta-modules and synth data would be under libexec/speechsw, one directory per meta-module (see the layout sketch after this list).
2) Move much of the complex code out of modules into speech-dispatcher, including the audio queue, and remove glib and the other dependencies beyond libc.  Also, link libraries such as espeak.a statically, and move the module data files needed by the synthesizers into a libexec directory, one per module, under speech-dispatcher-modules.
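
To make 1) a bit more concrete, here is the kind of layout I have in mind.  The engine and data directory names below are placeholders, nothing is settled yet:

    libexec/speechsw/         # one directory for the speechsw meta-module
        sw_espeak             # portable engine binary, libc-only dependencies
        sw_picotts
        espeak-data/          # synth/voice data shipped next to the binary
        picotts-data/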

I think my preferred approach is to start with 1), and eventually migrate to 2).  That way, users who need binary portability can start to benefit in the near term while the more complex tasks in 2) are implemented.  As a 20% project, it might take me a couple of years to implement the whole refactoring.  Espeak would be simpler, but some of the other engines that don't use the new module_utils_* code need a major rewrite.

If folks feel binary portability is a nice goal, but not worth the price (e.g. not being able to use glib), then I'd be happy to maintain the meta-module long-term, to provide portable engines for folks who need them.
 
> No speech synth for Linux I know of is incapable of returning speech samples
> rather than playing them.

This is not completely theoretical: at least Kali didn't know how to fill
a buffer with samples; you'd have to make it write a file and reread
that. Also, Luke mentioned in the TODO file that he knows one synth
whose licensing model doesn't allow for direct audio retrieval.

I guess supporting engines that handle their own audio is a requirement.  Those engine binaries are not likely to be portable.  However, the other engines could be made binary-portable.
 
That being said, we can keep supporting modules that produce their
own audio output (e.g. the generic modules for which it's usually hard
to avoid).

> This greatly complicates them,

I would have agreed a few years ago, but this is not true any more,
thanks to the factorization I made recently. In the espeak module, the
only pieces of audio management left are

Thanks for writing that!  It is a major improvement.  I think we could move the audio queue into the speech-dispatcher daemon without making things more complex.  Some of the other engines could be refactored to make use of the module_utils_* functionality to simplify them and help with binary portability.

> I'll send you an off-list email showing how I've simplified the
> modules.  If folks want to look, here's [1]my Espeak module vs the
> one in [2]Speech Dispatcher.

They are not comparable: your module does not support

- sound icons
- character/key spelling
- setting the synth volume
- setting the pitch range
- selecting an appropriate voice according to the requested locale
- index marking

I generally agree that speech-dispatcher engines are easier to write nowadays, and the difference in lines of code has a lot to do with features not yet implemented in my engines.  However, there is room for simplification, especially with TTS engines that don't have many features, such as picotts.

I've added character/key spelling, which did make the code a bit more complex.  I am confused about what pitch range is for.  I use a floating-point value to scale pitch and speed, where 1.0 is the default: 2.0 means twice as fast or twice the pitch, and 0.5 means half.  Pitch range seems like something that could be dealt with upstream, as could volume.  I deal with locale and voice in my speechsw module code for speech-dispatcher.  I plan to have the speechsw module also control volume, handle index marking, and play sound icons.  This is needed for backend engines that do not support SSML, like pico (last I checked).  I'll also enable SSML for engines that do support it, like espeak.  My speechsw module also links in libsonic, and engines can select whether to change pitch and speed themselves (the default for ibmtts) or to have sonic do it (the default for picotts).
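
To make the scaling concrete, speech-dispatcher's -100..100 rate and pitch settings could be mapped onto such a multiplier roughly as sketched below.  The 4.0 ceiling is just a number I picked for illustration, not a value from any existing code:

    #include <math.h>

    /* Map a speech-dispatcher style setting (-100..100, 0 = default) onto a
     * speechsw-style multiplier (1.0 = default, 2.0 = double, 0.5 = half).
     * The 4.0 maximum is a made-up example value. */
    static float settingToMultiplier(int setting) {
        const float maxMultiplier = 4.0f;
        /* Exponential mapping: -100 -> 0.25, 0 -> 1.0, 100 -> 4.0. */
        return powf(maxMultiplier, setting / 100.0f);
    }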

Notably, index marking is quite a beast to support, but it is really
important for a lot of users. Getting it right is tricky, and is the
reason for the seemingly long list of module_speak_queue_* calls.  In
the end, if you implement these features, you will end up with basically
the same complexity.

I've implemented index marking before, for NVDA to talk to a speech stack called Speech-Hub.  I agree it is painful to support.  For engines that don't support SSML, that complexity can live upstream rather than in the engine binary.  I think if the speech queue moves to speech-dispatcher, so can the logic that wraps more primitive synthesizers to emulate index marking, volume, and even speed and pitch for engines that do not do it well on their own.
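
Sketching what that upstream wrapping could look like for an engine without SSML (the function names here are placeholders, not real speech-dispatcher or speechsw calls): split the text at the index marks, synthesize each piece, and report the mark once the piece has been spoken.

    #include <stddef.h>

    /* Placeholder prototypes standing in for the real queueing and
     * event-reporting calls. */
    void synthesizeAndQueue(const char *text);
    void reportIndexMark(const char *markName);

    typedef struct {
        const char *text;   /* text up to the next index mark */
        const char *mark;   /* mark to report after the text, or NULL at the end */
    } TextChunk;

    static void speakWithEmulatedMarks(const TextChunk *chunks, size_t numChunks) {
        for (size_t i = 0; i < numChunks; i++) {
            synthesizeAndQueue(chunks[i].text);
            /* In reality the mark should only be reported once the queued
             * audio for this chunk has actually played, not merely been
             * queued. */
            if (chunks[i].mark != NULL)
                reportIndexMark(chunks[i].mark);
        }
    }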
 
Note that speech-dispatcher *also* supports much simpler cases: for a
given speech synthesis, an initial completely synchronous speechd module
is possible by just synthesizing+playing in the module_speak() function
and being done. It would not support indexing etc., but that can be added
progressively, and yes, that also makes the module progressively more
complex, but that's inherent to supporting indexing.

I don't think that quite works right now.  For example, the BEGIN message is not sent to Orca until module_speak returns, which leads to a message being stuck in Orca that should be speaking but isn't.  However, I don't think it would take much work to support simpler early-stage engines.  For example, enable BEGIN to be sent as soon as the first audio samples are written to the queue.  Not only simple modules, but most modules should rely on module_utils to handle the audio.  If we do move the audio stack into the speech-dispatcher executable, those modules would not be impacted, since we could keep supporting the existing APIs.
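
Something along these lines in the sample callback would do it.  I'm assuming module_utils' module_report_event_begin() here (declared inline rather than pulling in the header), and queueAudio() is a placeholder for whatever the queue call ends up being:

    #include <stddef.h>

    /* Assumed to come from module_utils: reports BEGIN to the server. */
    void module_report_event_begin(void);
    /* Placeholder for the audio-queue call. */
    void queueAudio(const short *samples, size_t numSamples);

    static int reportedBegin = 0;

    /* Called by the engine with each chunk of synthesized samples. */
    static void onSamples(const short *samples, size_t numSamples) {
        if (!reportedBegin) {
            module_report_event_begin();   /* BEGIN as soon as audio exists */
            reportedBegin = 1;
        }
        queueAudio(samples, numSamples);
    }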
 
> and makes these binaries specific to not only the distro, but the
> distro version. 

That, however, is a very convincing argument. Making it simple for
vendors to just ship a binary to a known place, whatever the distro and
version, can simplify things a lot for them.

I am relieved to hear you say that.  There are many folks who contribute to FOSS who feel strongly that all binaries should be compiled from source for each distro, and there are some good reasons to do so, such as improving security.  Black-box binaries should be sand-boxed and should not be trusted.  A TODO for me is to look into sandboxing these shady binaries from TTS vendors :)

While I don't want to debate FOSS versus non-free software, I am a strong advocate for FOSS.  However, a11y comes first for me, and that means having some closed-source TTS engines on the system.
 
We however have to be very careful with the protocol for the data
exchange with modules. Notably if you want to support indexing in
speechswitch, you'd have to break compatibility somehow, or introduce
backward/forward compatibility management complexity (which might not
even be possible if nothing was prepared in the initial design for the
server and the module to announce what they actually support).

I plan for two modes in speechswitch: SSML supported, and SSML not supported.  Each TTS engine has to call swSetSSML to enable/disable SSML at startup time.  swSpeak is synchronous, and it is possible for the speech-dispatcher meta-module to add the indexing with the current API.  I've already done this in Speech-Hub.  It was painful, but sometimes problems are just hard.
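
Paraphrasing the shape of that API (these are not the literal speechswitch declarations, just the idea):

    #include <stdbool.h>
    #include <stdint.h>

    /* Called by the engine once at startup to declare whether it accepts
     * SSML in the text passed to swSpeak(). */
    void swSetSSML(bool enabled);

    /* Audio callback: return true to keep receiving chunks, false to
     * cancel the utterance. */
    typedef bool (*swAudioCallback)(const int16_t *samples, uint32_t numSamples,
                                    void *userData);

    /* Synchronous speak call used from the speech-dispatcher meta-module;
     * it returns once synthesis finishes or the callback cancels it. */
    bool swSpeak(const char *text, swAudioCallback callback, void *userData);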
 
So, now, where do we start?  We need to specify the extension of the
module protocol to transfer audio from the module to the server. AIUI,
what we would want is to add a new "SERVER" case to the AUDIO
configuration command, that speech-dispatcher would try first by
default, and revert to alsa/pulse/etc. if that is rejected. When
accepted, the module can emit its audio snippets and index marks as SSIP
events.

That sounds like a good approach to me.  It stays backwards compatible with existing modules while we work on binary portability.  I would be interested in the task of making sd_espeak binary portable, or working with you and others in whatever manner is sensible.
 
Also, I have been thinking about simplifying modules into not using
a separate speak thread. Ideally modules should only care about
synchronously calling the synthesizing function from module_speak,
possibly piece by piece or with a periodic callback, and synchronously
calling some functions to determine whether stopping is wanted etc. The
current way (main()'s while(1) loop managing all communications) makes
it difficult for modules to juggle events. We can probably rework
this. Also I am thinking that this should be rewritten with a BSD
licence, so people can use it as a skeleton for their proprietary module
implementation.

It looks like we have similar opinions here.  I actually implemented speechswitch that way.  Fewer module-specific threads is better.  The latencies can mostly be eliminated.  The one I did not eliminate is cancelling speech synthesis in the middle of generating an audio chunk.  Instead, the callback returns true to continue getting audio chunks and false to cancel.  I do not hear significant latency from this, but it could be a problem for some engines.
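
In other words, the engine-side loop looks roughly like this (synthesizeNextChunk() is a placeholder for the real incremental engine call); a false return from the callback aborts at the next chunk boundary, so the cancel latency is bounded by the chunk size:

    #include <stdbool.h>
    #include <stdint.h>

    /* Placeholder for the engine's incremental synthesis call; returns
     * false once the utterance is finished. */
    bool synthesizeNextChunk(int16_t *buffer, uint32_t *numSamples);

    typedef bool (*swAudioCallback)(const int16_t *samples, uint32_t numSamples,
                                    void *userData);

    static void speakLoop(swAudioCallback callback, void *userData) {
        int16_t buffer[4096];
        uint32_t numSamples;
        while (synthesizeNextChunk(buffer, &numSamples)) {
            if (!callback(buffer, numSamples, userData))
                break;   /* caller requested cancellation; drop the rest */
        }
    }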
 
Anything I would have forgotten?

Ha!  We will only know what we forgot when we write the code!  Try as I might, the compiler refuses to implement functionality I forgot to build.  In my experience, there is always something we forgot, usually a lot of it.
 
Samuel

