Hi folks,
after a few days of work, I am happy to announce that Chinfusor v1.1 has finally been released!
For those who didn't read my previous thread, Chinfusor is a program that allows highly flexible and efficient reading of texts written in non-Latin alphabets directly from Orca, with the ability to configure speech properties for each alphabet individually.
There are a few significant changes in version 1.1 which I want to mention here.
First of all, alphabets in Chinfusor are no longer configured, but defined. This small change of one word means that while previously there were a few alphabets recognized by Chinfusor which you could configure, now there are no built-in ranges and you can define any alphabet yourself.
The file was renamed from settings.csv to alphabets_settings.csv and in fact has the same format as before, with only three important differences:
* Comments must now be marked with # at the beginning of the commented line.
* The column for the name of the alphabet is purely informational and doesn't have any functional role.
* A column for specifying a range of characters in the Unicode table was added.
A Unicode range can be specified as u0xa-u0xb, where a and b stand for the beginning and ending index in the Unicode table. With 0x, the numbers are expected to be hexadecimal; without 0x, they are treated as decimal. The notation is case-insensitive.
You can write multiple ranges in this form for one alphabet, separating them with a delimiter of your choice.
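To make the notation concrete, here is a small Python sketch of how such a field could be parsed. This is my own illustration, not Chinfusor's actual parser; the function name and return shape are assumptions:

```python
import re

# One pattern per range: "u", optional "0x", digits, "-", and the same again.
# Matching range by range means any delimiter between ranges is accepted.
RANGE_RE = re.compile(r'u(0x)?([0-9a-f]+)-u(0x)?([0-9a-f]+)', re.IGNORECASE)

def parse_ranges(field: str) -> list[tuple[int, int]]:
    """Extract all (start, end) Unicode ranges from one CSV field.

    With the 0x prefix a number is read as hexadecimal, without it as
    decimal; the whole notation is case-insensitive.
    """
    ranges = []
    for hex_a, a, hex_b, b in RANGE_RE.findall(field):
        start = int(a, 16 if hex_a else 10)
        end = int(b, 16 if hex_b else 10)
        ranges.append((start, end))
    return ranges
```

For example, `parse_ranges("u0x4E00-u0x9FFF")` and `parse_ranges("u19968-u40959")` describe the same range, since 0x4E00 is 19968 in decimal.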
If you specify nothing in place of the Unicode ranges, or write the star sign (*), the entry is treated as the Latin alphabet. There should be exactly one Latin alphabet definition; any duplicates will be ignored.
Also note that Unicode ranges mustn't overlap. There is currently no check for this, so you should guard against it yourself. Two overlapping ranges won't produce an immediate error, but can cause undefined behavior while parsing.
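Putting the rules together, a definition file could look roughly like this. This fragment is purely illustrative: the column order and the speech-settings columns (marked with ...) are my guesses from the description above, so please check Chinfusor's documentation for the real layout:

```
# Lines starting with # are comments.
# name (informational), <speech settings columns as before>, unicode range(s)
latin,...,*
chinese,...,u0x4E00-u0x9FFF
```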
settings.conf now contains general settings, where you can, for example, configure punctuation characters.
The second big change is rather internal, but it affects the program from the outside as well. In Chinfusor 1.0, every alphabet had its own speech module, which in turn required its own thread to read its output, as it's not possible to check whether or not there is new content in the pipes.
Thus in the default state, the program was using 3 threads just for alphabets and two additional threads for its other parts, that means 5 threads in total. That's not a problem with 3 alphabets. But the number would grow: if you specified, for example, 10 alphabets, 12 threads would be required. 20 alphabets would require 22 threads. And so on.
That's rather a lot for a middle-layer speech processor, especially when we consider that just one speech module thread is really active at a time, as just one module should speak at once. The rest are mostly passive, as there is nothing to read.
Thus I have developed my own ThreadPool to solve this issue. It's specially designed for Chinfusor and, without going into technical details, it simply lowers the number of threads required for speech modules from theoretically unlimited to just one.
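Chinfusor's ThreadPool itself is internal to the program, but the general idea of replacing one reader thread per pipe with a single multiplexing thread can be sketched in Python, where the selectors module provides the readiness check. All names here are mine, purely for illustration:

```python
import os
import selectors

def read_all(pipe_fds):
    """Drain every given pipe using a single thread.

    Instead of blocking one thread per pipe, register all pipes with a
    selector and only read from those that actually have data (or EOF).
    """
    sel = selectors.DefaultSelector()
    for fd in pipe_fds:
        os.set_blocking(fd, False)
        sel.register(fd, selectors.EVENT_READ)
    output = {fd: b"" for fd in pipe_fds}
    while sel.get_map():                      # until every pipe is closed
        for key, _ in sel.select():
            chunk = os.read(key.fd, 4096)
            if chunk:
                output[key.fd] += chunk
            else:                             # writer closed: stop watching
                sel.unregister(key.fd)
                os.close(key.fd)
    return output
```

One thread handles any number of pipes, which is exactly the scaling property described above, however Chinfusor achieves it internally.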
And why am I mentioning this? Because it means that you can define any number of alphabets, using any number of speech modules, without any loss of performance. Chinfusor will run just fine with three alphabets as well as with 100 alphabets, using just 3 threads in total.
The last thing I would like to mention here is that I have done some benchmarks of Chinfusor's parsing algorithm.
I designed it with speed in mind, so it always does the minimal amount of work necessary, doesn't start loops if they can be avoided, and most importantly, eliminates extra alphabet checks by comparing characters against a threshold, which marks most normally used characters as Latin without even consulting the alphabets list. You don't need to worry: this threshold is set dynamically in a way that won't affect any of your alphabets.
Thus regardless of whether you have 1 or 100 defined alphabets, performance on Latin text will be the same.
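The threshold trick can be illustrated with a simplified Python sketch. The data layout and names below are my own assumptions, not Chinfusor's internals:

```python
# Hypothetical alphabet definitions: (name, list of (start, end) ranges).
ALPHABETS = [("chinese", [(0x4E00, 0x9FFF)])]

# Every defined non-Latin range starts at or above this code point, so any
# character below it can be classified as Latin without scanning the list.
THRESHOLD = min(start for _, ranges in ALPHABETS for start, _ in ranges)

def classify(ch: str) -> str:
    code = ord(ch)
    if code < THRESHOLD:  # fast path taken by most ordinary text
        return "latin"
    for name, ranges in ALPHABETS:
        if any(start <= code <= end for start, end in ranges):
            return name
    return "latin"        # not in any defined range
```

Because the threshold is derived from the defined ranges themselves, adding more alphabets never misclassifies characters; it only moves the fast-path boundary.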
But still, I wondered how efficient the algorithm really is. To answer this question, I made some benchmarks and found interesting answers.
The first benchmark consisted of parsing a million Latin characters. It took about 19 ms.
The second consisted of parsing a million Chinese characters. This one took approximately 33 milliseconds.
And the third benchmark, aimed at measuring copying operations, with 500,000 Latin and 500,000 Chinese characters in groups of ten, took about 39 ms.
If you take the worst-case scenario, Chinese characters, which can't be optimized out by the threshold, the performance is approximately 33/1,000,000 ms per character. That means 0.000033 ms per character, or 33 nanoseconds per character.
The maximum recommended length for a line of Python code is about 80 characters, if I remember right, so even if you had such a long line, parsing it with Chinfusor would take only about 2.6 microseconds.
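For completeness, the arithmetic behind these figures:

```python
# Worst-case benchmark: 33 ms for 1,000,000 Chinese characters.
total_ns = 33 * 1_000_000           # 33 ms expressed in nanoseconds
ns_per_char = total_ns / 1_000_000  # 33 ns per character
line_us = 80 * ns_per_char / 1000   # an 80-character line: 2.64 microseconds
```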
Thus it seems that there is really no need to worry about speed when using Chinfusor. :)
You can find more details about the changes in v1.1 in Chinfusor's changelog; the documentation was updated with a section explaining the steps necessary to update the program.
You can download the new version from Chinfusor's official
page:
https://rastisoftslabs.com/2020/07/22/chinfusor-a-universal-solution-for-reading-texts-in-foreign-alphabets-on-linux/
Happy reading!
Best regards
Rastislav
_______________________________________________
orca-list mailing list
orca-list gnome org
https://mail.gnome.org/mailman/listinfo/orca-list
Orca wiki: https://wiki.gnome.org/Projects/Orca
Orca documentation:
https://help.gnome.org/users/orca/stable/
GNOME Universal Access guide:
https://help.gnome.org/users/gnome-help/stable/a11y.html