[orca-list] Chinfusor 1.1 was released



Hi folks,
so, after a few days of work, I am happy to announce that Chinfusor
v1.1 has finally been released!
For those who didn't read my previous thread, Chinfusor is a program
which allows highly flexible and efficient reading of texts written in
non-Latin alphabets directly from Orca, with the ability to configure
speech properties for each alphabet individually.

There are a few significant changes in version 1.1 which I want to
mention here.

First of all, alphabets in Chinfusor are no longer configured, but
defined. This small change of one word means that while previously
there were a few alphabets recognized by Chinfusor which you could
configure, now there are no built-in ranges and you can define any
alphabet yourself.
The file was renamed from settings.csv to alphabets_settings.csv and
has in fact the same format as before, with three important
differences:
* Comments must now be marked with # at the beginning of the commented
line.
* The column for the name of the alphabet is purely informational and
doesn't have any functional role.
* A column for specifying a character range in the Unicode table was
added.

A Unicode range can be specified as u0xa-u0xb, where a and b stand for
the beginning and ending index in the Unicode table. With the 0x
prefix, the numbers are expected to be hexadecimal; without it, they
are treated as decimal. The notation is case-insensitive.
You can write multiple ranges in this form for one alphabet, separating
them with an optional delimiter of your choice.
If you specify nothing in place of the Unicode ranges, or write the
star sign (*), the entry is treated as the Latin alphabet. There should
be exactly one Latin alphabet definition; any duplicates will be
ignored.
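To make the notation concrete, here is a small Python sketch (my own
illustration, not Chinfusor's actual code) of how such a range
specification could be parsed:

```python
import re

def parse_ranges(spec):
    """Parse a range spec like 'u0x4E00-u0x9FFF'.

    Each bound is prefixed with 'u'; a '0x' prefix on the number means
    hexadecimal, otherwise it is read as decimal. The notation is
    case-insensitive. An empty field or '*' denotes the Latin alphabet
    (returned here as None).
    """
    spec = spec.strip()
    if spec == "" or spec == "*":
        return None  # Latin alphabet
    def to_int(s):
        return int(s, 16) if s.lower().startswith("0x") else int(s)
    ranges = []
    # Pick out every u<number>-u<number> pair; anything between pairs
    # acts as the user's delimiter and is simply skipped.
    for lo, hi in re.findall(r"u((?:0x)?[0-9a-f]+)-u((?:0x)?[0-9a-f]+)",
                             spec, re.IGNORECASE):
        ranges.append((to_int(lo), to_int(hi)))
    return ranges
```

For example, parse_ranges("u0x4E00-u0x9FFF") yields [(19968, 40959)],
while "*" or an empty field stands for Latin.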

Also note that Unicode ranges mustn't overlap. There is currently no
check for this, so you should guard against it yourself. Overlapping
ranges won't produce an immediate error, but can cause undefined
behavior during parsing.

settings.conf now contains general settings, where you can configure,
for example, the punctuation characters.

The second big change is rather internal, but it affects the program
from the outside as well. In Chinfusor 1.0, every alphabet had its own
speech module, which in turn required its own thread to read its
output, as it's not possible to check whether there is new content in
a pipe or not.
Thus in the default state, the program was using 3 threads just for
alphabets and two additional threads for its other parts, that means 5
threads in total. That's not a problem with 3 alphabets. But the number
would grow: if you specified, for example, 10 alphabets, 12 threads
would be required; 20 alphabets would require 22 threads; and so on.
That's quite a lot for a middle-layer speech processor, especially
when we consider that only one speech module's thread is really active
at a time, as only one module should speak at once. The rest are
mostly passive, as there is nothing to read.
Thus I have developed my own ThreadPool to solve this issue. It's
specially designed for Chinfusor and, without going into technical
details, simply put, it lowers the number of threads required for
speech modules from a theoretically unbounded number to just one.
And why am I mentioning this?
Because it means that you can define any number of alphabets, using
any number of speech modules, without any loss of performance.
Chinfusor will run just fine with 3 alphabets as well as with 100,
using just 3 threads in total.
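To illustrate the general idea, here is a rough Python sketch for a
POSIX system (not Chinfusor's actual implementation, which uses its own
ThreadPool): a single thread can service the output pipes of many
subprocesses by multiplexing them, reading only from whichever pipe
currently has data:

```python
import selectors
import subprocess

# One thread multiplexes the stdout pipes of several subprocesses,
# instead of dedicating a reader thread to each one.
sel = selectors.DefaultSelector()
procs = [subprocess.Popen(["echo", f"module {i} ready"],
                          stdout=subprocess.PIPE)
         for i in range(3)]
buffers = {p.stdout: b"" for p in procs}
for p in procs:
    sel.register(p.stdout, selectors.EVENT_READ)

lines = []
while buffers:
    for key, _ in sel.select():          # wait until some pipe is readable
        chunk = key.fileobj.read1(4096)  # read whatever is available
        if chunk:
            buffers[key.fileobj] += chunk
        else:                            # EOF: this module's pipe closed
            lines.append(buffers.pop(key.fileobj).decode().strip())
            sel.unregister(key.fileobj)
            key.fileobj.close()

for p in procs:
    p.wait()
print(lines)
```

Since only one module should be speaking at any moment, a single
reader is enough no matter how many modules are defined.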

The last thing I would like to mention here is that I have done some
benchmarks of Chinfusor's parsing algorithm.
I have designed it with speed in mind, so it always does the minimal
amount of work necessary, doesn't start loops if they can be avoided,
and most importantly, eliminates additional alphabet checks by
comparing characters against a threshold, which marks most commonly
used characters as Latin without even checking the alphabets list. You
don't need to worry; this threshold is set dynamically in a way that
won't affect any of your alphabets.
Thus, regardless of whether you have 1 or 100 defined alphabets,
performance on Latin text will be the same.
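The threshold trick could look roughly like this (a Python sketch for
illustration; the names and the dynamic threshold rule are my guesses,
not Chinfusor's actual code):

```python
def classify(ch, alphabets, threshold):
    """Classify a character against the user-defined alphabets.

    Characters below the threshold are treated as Latin without
    consulting the alphabet list at all, which covers the common case
    of ordinary Latin text.
    """
    cp = ord(ch)
    if cp < threshold:
        return "latin"
    for name, ranges in alphabets:
        if any(lo <= cp <= hi for lo, hi in ranges):
            return name
    return "latin"  # above the threshold but in no defined range

# One defined alphabet: the CJK Unified Ideographs block.
alphabets = [("chinese", [(0x4E00, 0x9FFF)])]
# Set the threshold dynamically, so no defined range falls below it.
threshold = min(lo for _, ranges in alphabets for lo, _ in ranges)
```

With this setup, classifying a plain ASCII character never touches the
alphabet list, so adding more alphabets costs nothing on Latin text.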
But still, I wondered how efficient the algorithm really is. To answer
this question, I made some benchmarks and found interesting answers.

The first benchmark consisted of parsing a million Latin characters.
It took about 19 ms.
The second benchmark consisted of parsing a million Chinese
characters. This one took approximately 33 ms.
And the third benchmark, aimed at measuring copying operations, with
500,000 Latin and 500,000 Chinese characters in groups of ten, took
about 39 ms.

When you take the average value for the worst-case scenario, Chinese
characters, which can't be optimized out by the threshold, the
performance is approximately 33/1,000,000 ms per character. That means
0.000033 ms, or 33 nanoseconds, per character.
The maximum recommended length for a line of Python code is about 80
characters, if I remember right, so even if you had such a long line,
parsing it with Chinfusor would take about 2.6 microseconds.
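The arithmetic checks out:

```python
# 33 ms spread over one million characters, expressed in nanoseconds.
ns_per_char = 33_000_000 / 1_000_000
# An 80-character line at that rate, expressed in microseconds.
line_us = ns_per_char * 80 / 1000
print(ns_per_char, line_us)  # 33.0 2.64
```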

Thus it seems that there is really no need to worry about speed when
using Chinfusor. :)

You can find more details about the changes in v1.1 in Chinfusor's
changelog; the documentation was updated with a section explaining the
steps necessary to update the program.
You can download the new version from Chinfusor's official page:
https://rastisoftslabs.com/2020/07/22/chinfusor-a-universal-solution-for-reading-texts-in-foreign-alphabets-on-linux/

Happy reading!

Best regards

Rastislav


