Hi folks,
after a few days of work, I am happy to announce that Chinfusor v1.1
has finally been released!
For those who didn't read my previous thread, Chinfusor is a program
that allows highly flexible and efficient reading of texts written in
non-Latin alphabets directly from Orca, with the ability to configure
speech properties for each alphabet individually.
There are a few significant changes in version 1.1 that I want to
mention here.
First of all, alphabets in Chinfusor are no longer configured, but
defined. This small change of one word means that, while previously
Chinfusor recognized a few built-in alphabets which you were able to
configure, now there are no built-in ranges and you can define any
alphabet yourself.
The file was renamed from settings.csv to alphabets_settings.csv and
has in fact the same format as before, with only three important
differences:
* Comments must now be marked with # at the beginning of the commented
  line.
* The column for the alphabet name is purely informational and doesn't
  have any functional role.
* A column specifying the character ranges in the Unicode table was
  added.
A Unicode range can be specified as u0xa-u0xb, where a and b stand for
the first and last index in the Unicode table. With 0x, the numbers are
expected to be hexadecimal; without 0x, they are treated as decimal.
The notation is case-insensitive.
You can write multiple ranges in this form for one alphabet, optionally
separating them with a delimiter of your choice.
If you specify nothing in place of the Unicode ranges, or write the
star sign (*), the entry is treated as the Latin alphabet. There should
be exactly one Latin alphabet definition; any duplicates will be
ignored.
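
To illustrate the range notation described above, here is a minimal
Python sketch. It is not Chinfusor's actual parser; the names
parse_ranges and _to_int are hypothetical, and treating both range ends
as inclusive is an assumption.

```python
import re

# Matches one range such as u0x4e00-u0x9fff or u1024-u1279.
# Anything between two ranges is treated as a free-form delimiter.
_RANGE = re.compile(r"u(0x[0-9a-f]+|\d+)-u(0x[0-9a-f]+|\d+)", re.IGNORECASE)


def _to_int(token):
    """Hexadecimal when prefixed with 0x (any case), decimal otherwise."""
    return int(token, 16) if token.lower().startswith("0x") else int(token)


def parse_ranges(field):
    """Return a list of (start, end) tuples, or None for the Latin entry."""
    field = field.strip()
    if field in ("", "*"):
        return None  # empty field or star: the Latin alphabet
    ranges = []
    for start, end in _RANGE.findall(field):
        lo, hi = _to_int(start), _to_int(end)
        if lo > hi:
            raise ValueError(f"range start {lo} is above range end {hi}")
        ranges.append((lo, hi))
    if not ranges:
        raise ValueError(f"no range found in {field!r}")
    return ranges
```

For example, parse_ranges("U0X400-U0X4FF; u19968-u40959") yields two
ranges, the first written in hexadecimal and the second in decimal.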
Also note that Unicode ranges mustn't overlap. There is currently no
check for this, so you should guard against it yourself. Two
overlapping ranges won't produce an immediate error, but can cause
undefined behavior while parsing.
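
Since the program does not check for overlaps, you could verify your
definitions yourself before saving the file. The following Python
sketch is one possible way to do it; find_overlaps and the dictionary
layout are hypothetical and not part of Chinfusor, and the ranges are
assumed to be inclusive.

```python
def find_overlaps(alphabets):
    """Report conflicting ranges.

    alphabets maps an alphabet name to a list of inclusive (start, end)
    tuples. Returns a list of pairs of (start, end, name) entries that
    overlap. At least one pair is reported whenever any overlap exists.
    """
    flat = sorted(
        (lo, hi, name)
        for name, ranges in alphabets.items()
        for lo, hi in ranges
    )
    overlaps = []
    furthest = None  # the range seen so far that reaches the highest end
    for entry in flat:
        lo, hi, _ = entry
        if furthest is not None and lo <= furthest[1]:
            overlaps.append((furthest, entry))
        if furthest is None or hi > furthest[1]:
            furthest = entry
    return overlaps
```

An empty result means that no two ranges share a character.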
settings.conf now contains general settings, where you can configure,
for example, punctuation characters.
The second big change is rather internal, but it affects the program
from the outside as well. In Chinfusor 1.0, every alphabet had its own
speech module, which in turn required its own thread to read its
output, as it's not possible to check whether there is new content in
the pipes or not.
Thus in the default state, the program was using 3 threads just for
alphabets and two additional threads for its other parts, which means 5
threads in total. That's not a problem with 3 alphabets. But the number
would grow: if you specified, for example, 10 alphabets, 12 threads
would be required. 20 alphabets would require 22 threads, and so on.
That's rather a lot for a middle-layer speech processor, especially
when we consider that only one speech module's thread is really active
at a time, as just one module should speak at once. The rest of them
are mostly passive, as there is nothing to read.
Thus I have developed my own ThreadPool to solve this issue. It's
specially designed for Chinfusor and, without going into technical
details, it lowers the number of threads required for speech modules
from theoretically unbounded to just one.
And why am I mentioning this?
Because it means that you can define any number of alphabets, using any
number of speech modules, without any loss of performance. Chinfusor
will run just as well with 100 alphabets as with three, using just 3
threads in total.
The last thing I would like to mention here is that I have done some
benchmarks of Chinfusor's parsing algorithm.
I have designed it with speed in mind, so it always does the minimal
amount of work necessary, doesn't start loops if they can be avoided
and, most importantly, eliminates additional alphabet checks by
comparing characters with a threshold, which marks most commonly used
characters as Latin without even checking the alphabets list. You don't
need to worry: this threshold is set dynamically in such a way that it
won't affect any of your alphabets.
Thus regardless of whether you have 1 or 100 defined alphabets,
performance on Latin text will be the same.
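
The threshold idea can be sketched in Python as follows. This is only
an illustration, not Chinfusor's real code: build_classifier is a
hypothetical name, and it is an assumption of this sketch that the
threshold equals the lowest start of all defined ranges and that the
ranges are inclusive and don't overlap.

```python
import bisect
import sys


def build_classifier(alphabets):
    """alphabets maps a name to a list of inclusive (start, end) ranges."""
    flat = sorted(
        (lo, hi, name)
        for name, ranges in alphabets.items()
        for lo, hi in ranges
    )
    starts = [lo for lo, _, _ in flat]
    # Assumed rule: everything below the lowest defined range is Latin.
    threshold = starts[0] if starts else sys.maxunicode + 1

    def classify(ch):
        cp = ord(ch)
        if cp < threshold:
            return "latin"  # fast path, the alphabets list is not touched
        # Slow path: find the last range starting at or before cp.
        i = bisect.bisect_right(starts, cp) - 1
        if i >= 0 and cp <= flat[i][1]:
            return flat[i][2]
        return "latin"  # above the threshold but in no defined range

    return classify
```

With such a rule, the cost for ordinary Latin text is a single
comparison per character, no matter how many alphabets are defined.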
But still, I was worried about how efficient the algorithm really is.
So, to answer this question, I ran some benchmarks and found
interesting results.
The first benchmark consisted of parsing a million Latin characters. It
took about 19 ms.
The second benchmark consisted of parsing a million Chinese characters.
This one took approximately 33 ms.
And the third benchmark, aimed at measuring copying operations, with
500000 Latin and 500000 Chinese characters in groups of ten, took about
39 ms.
When you take the average value for the worst-case scenario, Chinese
characters, which can't be optimized out by the threshold, the
performance is approximately 33/1000000 ms per character. That means
0.000033 ms per character, or 33 nanoseconds per character.
The maximum recommended length for a line of Python code is about 80
characters, if I remember right, so if you really had such a long line,
parsing it with Chinfusor would take about 2.6 microseconds.
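
For anyone who wants to redo the arithmetic, here is a small Python
sketch using only the numbers quoted above. The variable names are
just illustrative.

```python
CHARS = 1_000_000        # characters parsed in the benchmark
WORST_CASE_MS = 33       # measured time for Chinese characters
LINE_LENGTH = 80         # characters in a long line of code

# Milliseconds to nanoseconds, then divided by the character count.
ns_per_char = WORST_CASE_MS * 1_000_000 / CHARS

# Nanoseconds for the whole line, converted to microseconds.
line_us = LINE_LENGTH * ns_per_char / 1000

print(f"{ns_per_char:.0f} ns per character")
print(f"{line_us:.2f} microseconds per {LINE_LENGTH}-character line")
```

This gives 33 ns per character and 2.64 microseconds per line, which
rounds to the 2.6 microseconds mentioned above.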
Thus it seems that there is really no need to worry about speed when
using Chinfusor. :)
You can find more details about the changes in v1.1 in Chinfusor's
changelog. The documentation was updated with a section explaining the
steps necessary to update the program.
You can download the new version from Chinfusor's official page:
https://rastisoftslabs.com/2020/07/22/chinfusor-a-universal-solution-for-reading-texts-in-foreign-alphabets-on-linux/
Happy reading!
Best regards
Rastislav