Re: [File Roller] Supporting Unicode Enabled ZIP Archive When Using Info-ZIP Stack



On Mon, Nov 12, 2012 at 9:44 AM, Ma Xiaojun <damage3025 gmail com> wrote:
> Bug Hint (not reported by me):
> https://bugzilla.gnome.org/show_bug.cgi?id=648673
>
> There are basically two kinds of ZIP archive. Those with random file
> name encoding (not Unicode enabled) and those with UTF-8 file name
> encoding and proper meta data set (Unicode enabled).
>

Also see
https://bugzilla.gnome.org/show_bug.cgi?id=306403
https://bugzilla.redhat.com/show_bug.cgi?id=225576

> UnZip 6.0 (the current latest released version) from Info-ZIP can
> extract Unicode enabled archives correctly. However, its listing
> feature shows every non-ASCII character in a file name as '?', even
> for Unicode enabled archives. This affects File Roller too, hence the
> above-mentioned bug.
>
> Fortunately, UnZip has a -U option. When dealing with Unicode enabled
> archives, it escapes each non-ASCII character as #UXXXX or #LYYYYYY. I
> have already made a working patch for File Roller that uses this.
> https://gist.github.com/4057999
>
> Unfortunately, #UXXXX or #LYYYYYY are also legitimate file names in
> ZIP archives and UnZip's -U option doesn't escape literal # currently.
> I'm trying to contact the upstream already.
> http://www.info-zip.org/phpBB3/viewtopic.php?f=4&t=405
>
> On the File Roller side, we may list the archive twice, once without -U
> and once with -U. Then we can determine which # is literal and which #
> starts an escape. There is another annoying detail worth noting here:
> vanilla UnZip shows exactly one ? per Unicode character, while
> patched UnZip (found in at least Arch and Ubuntu) shows several ? per
> Unicode character (the number of ? equals the number of UTF-8
> bytes).
>
> What do you think?
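
To make the double-listing idea concrete, here is a rough Python sketch (not the File Roller patch, which I have not reproduced here). It assumes the two listings have already been obtained as plain strings; the function names and the `per_byte` flag are made up for illustration. `per_byte=True` models the patched UnZip that prints one ? per UTF-8 byte.

```python
import re

# "#U" + 4 hex digits or "#L" + 6 hex digits, the escape forms produced by -U.
_ESCAPE = re.compile(r"#U([0-9A-Fa-f]{4})|#L([0-9A-Fa-f]{6})")


def _char_for(match):
    """Return the character an escape stands for, or None if it is not a valid code point."""
    code = int(match.group(1) or match.group(2), 16)
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return None
    return chr(code)


def unescape_with_plain(plain, escaped, per_byte=False):
    """Decode `escaped` (listing with -U) using `plain` (listing without -U).

    An escape-looking sequence is treated as a real escape only when the
    plain listing shows '?' at the corresponding position; otherwise the
    '#' is kept as a literal character.
    """
    out = []
    i = j = 0  # i indexes plain, j indexes escaped
    while j < len(escaped):
        m = _ESCAPE.match(escaped, j)
        ch = _char_for(m) if m else None
        if ch is not None:
            # Vanilla UnZip: one '?' per character; patched: one per UTF-8 byte.
            width = len(ch.encode("utf-8")) if per_byte else 1
            if plain.startswith("?" * width, i):
                out.append(ch)
                i += width
                j = m.end()
                continue
        out.append(escaped[j])
        i += 1
        j += 1
    return "".join(out)
```

For example, `unescape_with_plain("??.txt", "#U4E2D#U6587.txt")` gives `中文.txt`, while `unescape_with_plain("#U4E2D.txt", "#U4E2D.txt")` leaves the name untouched. A file name that contains a literal ? right where an escape-like sequence sits would still be ambiguous, so this only narrows the problem.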

I think that the wider issue is how to deal with legacy
(=non-UTF-8) encodings.
This affects not only filenames inside ZIP archives, but also text files
in legacy encodings (such as subtitles), ID3 tags and so on.

There have been some proposals to guess the legacy encoding (using
letter frequencies, etc.); however, they add complexity.

AFAIK, if gtk/glib finds invalid UTF-8 in text, it tries
to convert from iso-8859-1 to UTF-8.
What I believe should happen is for gtk/glib to get a hint derived from
the operating system locale (e.g. a variable GTK_LEGACY_ENCODING), and
autoconvert any invalid text from GTK_LEGACY_ENCODING to UTF-8.

In your case with ZIP archives, you are dealing with archives that may
have been created with a localised version of Windows, so the filenames
may be in a legacy encoding.

Thus, my easy recommendation:

   File-roller assumes that all ZIP files contain UTF-8 encoded
filenames. When it detects that a filename is not valid UTF-8, it
tries to convert from a legacy encoding to UTF-8. File-roller can
guess the encoding from the system locale, or it can show the user a
dialog box with the best guess and allow the user to change the
encoding on the fly until the filenames in the textbox make sense.
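
A minimal Python sketch of that fallback, assuming the raw filename bytes are available. The locale-to-code-page table and both function names are illustrative guesses of mine, not an existing File Roller or glib API, and cp437 is used as the default only because it is the traditional ZIP/DOS code page.

```python
import locale

# Illustrative mapping from locale name to the code page that a localised
# Windows ZIP tool plausibly used. An assumption, not an authoritative list.
LEGACY_BY_LOCALE = {
    "zh_CN": "gbk",
    "zh_TW": "big5",
    "ja_JP": "cp932",
    "ko_KR": "cp949",
    "ru_RU": "cp866",
    "el_GR": "cp737",
}


def guess_legacy_encoding(default="cp437"):
    """Best guess for the legacy encoding, based on the current locale."""
    lang = locale.getlocale()[0]  # e.g. "el_GR", or None if unset
    return LEGACY_BY_LOCALE.get(lang, default)


def decode_filename(raw, legacy):
    """Treat `raw` as UTF-8; if that fails, fall back to `legacy`."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD so the name can still be shown.
        return raw.decode(legacy, errors="replace")
```

A dialog box could call `decode_filename(raw, choice)` again each time the user picks a different encoding, starting from `guess_legacy_encoding()`. Note that some legacy byte sequences happen to be valid UTF-8, so the UTF-8 check alone can misfire on short names.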

Simos

