Re: [File Roller] Supporting Unicode Enabled ZIP Archive When Using Info-ZIP Stack



On Mon, Nov 12, 2012 at 5:29 PM, Simos Xenitellis
<simos lists googlemail com> wrote:

> Also see
> https://bugzilla.gnome.org/show_bug.cgi?id=306403
> https://bugzilla.redhat.com/show_bug.cgi?id=225576

I will check later.

> I think that the wider issue is about how to deal with legacy
> (=non-UTF8) encodings.
> Not only with filenames from within ZIP archives, but also text files
> in legacy encoding (such as subtitles),
> ID3 tags and so on.

For ZIP, the best tool I know of so far is lsar/unar from The
Unarchiver project.
http://manpages.ubuntu.com/manpages/precise/en/man1/lsar.1.html
http://manpages.ubuntu.com/manpages/precise/en/man1/unar.1.html
As you can see from the man pages, both tools natively support automatic
encoding detection and manual encoding conversion.

However, I don't think we can get rid of the Info-ZIP stack. The most
annoying thing about Info-ZIP's UnZip is that it only gives you '?' for
non-ASCII characters in a non-UTF-8 archive. There is no way to get the
raw file name data.

The patched UnZip that adds -I and -O just gives more '?'.

I hope Info-Zip's next release can resolve these issues.
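
To illustrate what "raw file name data" buys you, here is a minimal
sketch using Python's zipfile module instead of Info-ZIP. It relies on
the fact that zipfile decodes names as CP437 when the archive's UTF-8
flag (general purpose bit 11) is not set, so re-encoding as CP437
recovers the original bytes. The helper names and the legacy_encoding
parameter are made up for this example, and the caller still has to
supply or guess the right encoding.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def decode_member_name(info, legacy_encoding):
    """Return a member name, re-decoding it if it was not stored as UTF-8."""
    if info.flag_bits & UTF8_FLAG:
        # zipfile has already decoded this name from UTF-8.
        return info.filename
    # zipfile decoded the raw bytes as CP437; undo that to get them back.
    raw = info.filename.encode("cp437")
    # Assumption: the caller knows (or guesses) the legacy encoding.
    return raw.decode(legacy_encoding, errors="replace")


def list_names(archive, legacy_encoding):
    """List member names of a ZIP file (path or file object)."""
    with zipfile.ZipFile(archive) as zf:
        return [decode_member_name(i, legacy_encoding) for i in zf.infolist()]
```

For an archive made on a Simplified Chinese Windows system, something
like list_names("old.zip", "gbk") would be the call to try; "old.zip" is
only a placeholder.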

For plain text and Gedit, see my post at gedit-list:
https://mail.gnome.org/archives/gedit-list/2012-November/msg00008.html

For ID3, can you show me a legitimate way to buy MP3s that contain
problematic ID3 tags? I bought some songs from the Ubuntu Music Store,
but they contain English metadata only.

If problematic ID3 only comes from other sources, I think users should
convert ID3 encoding themselves. There are tools out there.

> There have been some proposals to guess the legacy encoding (using
> frequencies of letters, etc), however they add to the complexity.

Most people port Mozilla's detector. There is no GNOMEism port of that
library yet, but there is a KDEism one; try Kate on a plain text file
in a local encoding for inspiration.
I already mentioned a similar idea on gedit-list.
https://mail.gnome.org/archives/gedit-list/2012-October/msg00001.html

> AFAIK, if gtk/glib finds an invalid UTF-8 encoding in text, it tries
> to convert from iso-8859-1 to UTF-8.
> What I believe should happen is for gtk/glib to get a hint from the
> operating system locale (i.e. a variable GTK_LEGACY_ENCODING), and
> autoconvert any invalid text from GTK_LEGACY_ENCODING to UTF-8.

I don't think the fallback is currently done at the GTK/GLib level;
please correct me if I am wrong.

> For your case with ZIP archives, you deal with archives that may have
> been created with a localised version of Windows, thus the filenames
> may have a legacy encoding.

Well, decent ZIP software on Windows, e.g., 7-Zip, does create
Unicode-enabled ZIP archives now.
Microsh*t's built-in ZIP support is another story.

> Thus, my easy recommendation:
>
>    File-roller considers all ZIP files to contain UTF-8 encoded
> filenames. When it detects that the encoding is not UTF-8, then it
> tries to convert from a legacy encoding to UTF-8. File-roller can
> guess based on the system locale, or it can show to the user a dialog
> box with the best guess, and allow to change encoding on the fly until
> the filenames in the textbox make sense.

File Roller is not that smart, I guess. It accepts whatever Info-ZIP or
p7zip returns. That's why, after I hacked the Info-ZIP interfacing code, I
realized that Info-ZIP itself needs some hacking too.
p7zip returns file names correctly for Unicode-enabled ZIP archives
and garbage otherwise, and it doesn't support encoding conversion.
I thought about hacking p7zip, but I really don't like its Windowsism
code base and convoluted build system.
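
To make the quoted recommendation concrete, here is a rough Python
sketch of the "try UTF-8 first, then fall back to a legacy encoding"
logic. It is not File Roller code. The function name and the candidate
list are assumptions for illustration; a real implementation would
derive the candidates from the system locale or from a user's choice in
a dialog.

```python
def to_text(raw, candidates=("gbk", "shift_jis")):
    """Decode raw name bytes, returning (text, encoding_used)."""
    try:
        # Unicode-enabled archives and plain ASCII names succeed here.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Hypothetical candidate list; order matters, first success wins.
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: Latin-1 maps every byte, so this never fails,
    # but the result may still be garbage the user has to fix.
    return raw.decode("latin-1"), "latin-1"
```

Returning the encoding along with the text lets a caller show the guess
to the user and offer a different one when the names look wrong.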

