Re: UTF-16 support?



Inline below.

On Thu, 2013-04-04 at 05:50 +1000, Kai Willadsen wrote:
(Answering lots of things at once, and not in order.)

On 4 April 2013 00:06, Nick <nospam codesniffer com> wrote:
I think I found a solution.  It required 3 pieces:

1.  Set Meld's Encodings to:
       utf8, utf-16le, iso8859

2.  Set SVN's mime-type property on the UTF-16 files to
    text/plain;encoding=UTF-16LE.

3.  Placed a BOM in the UTF-16 files.

With this configuration I am able to view UTF-8 and UTF-16 files in Meld
without changing the configuration.  The files can come directly from the
filesystem (i.e. meld file1 file2) or via the SVN hook within Meld.


In the process of experimenting on this (and I think contributing to the
problem), I think I found a bug in Meld.  It seems that once I attempt
to view/diff a file that's in SVN which fails, other files which
normally work also fail.  Here are the steps with which I observe this
happening (using Meld 1.7.0):

(1)  Open Meld for a directory inside a SVN working copy, which contains
3 files:  a.xml (a UTF-16LE file without a BOM), b.xml (a UTF-16LE file
with a BOM), c.txt (a UTF-8 file).

The issue seems to be tied to opening a binary file.  In this case,
a.xml only needs to be considered binary (SVN's svn:mime-type property
set to application/octet-stream).

(2)  Set Meld's Encodings configuration to "utf8, utf-16le"
(3)  Open/View b.xml.  This should work.
(4)  Open/View c.txt.  This should work.
(5)  Attempt to open a.xml.  This should yield an error that the file is
binary (as expected).
(6)  Now attempt to open/view b.xml again.  It fails with the same
error.

The only way I've found to get it out of this stuck state is to refresh
the listing.

I can try creating a screen recording of this behavior if it helps.

A screen recording wouldn't really help - those are pretty clear
instructions - but if there's any way you could provide a SVN working
copy to reproduce the problem, then that would be great.

I've attached a patch that exhibits the problem.  You should be able to
apply it to any WC, since it doesn't require modifications to existing
files (i.e. new files marked for addition exhibit the problem fine).  Let
me know if you can't apply this patch to a WC and I'll provide a
procedure or a script to create the same (it's very easy).

Note the svn:mime-type property on each file:
 - File1.txt is UTF-16LE that's marked as binary by SVN
(application/octet-stream).
 - File2.txt is UTF-16LE that's marked as UTF-16LE by SVN.
 - File3.txt is UTF-8.
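
For anyone without the attachment, here is a rough Python sketch of how
three comparable files could be created.  It is not the attached patch:
the file contents and the function name write_test_files are made up for
illustration, and putting a BOM in File2.txt is an assumption.  The svn
commands in the comments still have to be run by hand inside a WC.

```python
import codecs
import os


def write_test_files(directory):
    """Write File1.txt, File2.txt (UTF-16LE + BOM) and File3.txt (UTF-8)."""
    content = "<root>test</root>\n"  # placeholder content, not from the patch
    utf16 = codecs.BOM_UTF16_LE + content.encode("utf-16-le")
    files = {
        "File1.txt": utf16,                    # to be marked as binary in SVN
        "File2.txt": utf16,                    # to be marked as UTF-16LE in SVN
        "File3.txt": content.encode("utf-8"),  # plain UTF-8
    }
    for name, data in files.items():
        with open(os.path.join(directory, name), "wb") as handle:
            handle.write(data)
    return sorted(files)


# Afterwards, inside the working copy:
#   svn add File1.txt File2.txt File3.txt
#   svn propset svn:mime-type application/octet-stream File1.txt
#   svn propset svn:mime-type "text/plain;encoding=UTF-16LE" File2.txt
```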

To see the problem:
1.  Open meld for the directory in the WC where you applied the patch.
2.  Set the Encodings to "utf8, utf-16le".
3.  Open/view the files in reverse order: (File3.txt, File2.txt,
File1.txt).  Notice that you can view File3 & File2, but not File1.
That's sort-of expected since File1.txt is marked as binary.  (I say
sort-of because the file is really UTF-16LE and includes a suitable
BOM).
4.  Now open/view File2.txt again, and notice that it does not open this
time.  Refresh the directory listing in Meld, and you can once again
open it.

Let me know if I can help w/ more info.



On Wed, 2013-04-03 at 09:41 -0400, Nick wrote:
Looks like if I change the order of the codecs such that utf16 is listed
first, then Meld displays the file fine.  But then I lose the ability to
view UTF-8 files.  So it seems like it's one or the other, but not both.

If this is true, I don't understand the purpose of being able to specify
more than one encoding in the Preferences dialog.

Can Meld support going through each specified encoding while the file is
not displayable (including the finding that it's a 'binary' file)?  This
will allow me to specify "utf8, utf16" for the encodings which will
support UTF-8 and UTF-16 files to be used in Meld w/out changing the
configuration.

That's exactly what we do... except that the binary file check is
unrelated to the rest. Having said that, reordering those really
shouldn't avoid the binary file check.
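
A minimal sketch of what "trying each encoding in order" can look like in
Python.  This is not Meld's code; decode_with_fallback is a hypothetical
name.  It may also hint at why the order and the BOM matter: the BOM
bytes FF FE are not valid UTF-8, so a UTF-16LE file with a BOM falls
through to the next codec, while a BOM-less file containing only ASCII
characters decodes "successfully" as UTF-8 (with embedded NULs).

```python
import codecs


def decode_with_fallback(data, encodings):
    """Return (text, name) for the first codec that decodes data cleanly."""
    for name in encodings:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown codec: try the next candidate.
            continue
    raise ValueError("none of the listed encodings could decode the data")


utf16_bom = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
utf16_no_bom = "<a/>".encode("utf-16-le")
utf8_bytes = "<a/>".encode("utf-8")

# With a BOM, UTF-8 decoding fails and utf-16le is picked.
print(decode_with_fallback(utf16_bom, ["utf8", "utf-16le"])[1])     # utf-16le
# Without a BOM, the bytes are also valid UTF-8, so utf8 wins.
print(decode_with_fallback(utf16_no_bom, ["utf8", "utf-16le"])[1])  # utf8
# With utf-16le listed first, even-length UTF-8 data is misread as UTF-16.
print(decode_with_fallback(utf8_bytes, ["utf-16le", "utf8"])[1])    # utf-16le
```

The last case suggests why listing a UTF-16 codec first costs you UTF-8
files: most even-length byte strings decode as UTF-16 without error, so
the fallback never gets a chance to run.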

On Wed, 2013-04-03 at 08:48 -0400, Nick wrote:
Hi,

First and foremost, thanks for a great diff & merge tool!

My project involves XML files which need to be encoded in UTF-16 Little
Endian.  I cannot seem to view or diff UTF-16 files with Meld.

In the Encoding tab of the Preferences dialog I have this for the
codecs:

    utf8, iso8859, utf16, utf-16, utf16le, utf-16le

When I try to open a UTF-16LE file that's in SVN, Meld displays a yellow
error bar on top which reads, "Error fetching original comparison file".
I've confirmed UTF-8 files in the repo open fine--it's only an issue w/
UTF-16 files.

It behaves the same even for files which are marked for addition in the
repo but not yet added (so in this case, there's nothing to diff
against, but normally Meld will display the contents of the file
alongside a blank pane).

I've tried UTF-16 files that contain a BOM and files which do not; no
difference.

I notice that SVN sets the mime-type on these files as binary
(application/octet-stream).  If I manually change it to UTF-16LE
(text/plain;encoding=UTF-16LE), Meld displays a yellow error bar on top
which reads, "Could not read file" "test.xml appears to be a binary
file."--but it still doesn't display the contents of the file.

I had no idea the mime-type behaviour would be different... we
certainly don't do anything on the SVN end with regards to that. I
guess that's a possibly-interesting issue with the new SVN support.

Yeah, I got the idea from
http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/ which speaks only
about subversion support.  I confirmed that the svn command line tool
functions fine according to the mime-type.


What version of Subversion are you using here? We fetch files in very
different ways for <=1.6 and 1.7.

Client and server (same machine) are 1.7:

nick nimble ~/test_repo $ svn --version
svn, version 1.7.7 (r1393599)
   compiled Jan  5 2013, 15:01:56


If I call meld and pass it 2 UTF-16 files on the file system (i.e. not
trying to open a file from the SVN listing), I still get a yellow error
bar on top which reports "Could not read file" "test.xml appears to be a
binary file."

Is there something else I need to do?

Has anyone used Meld to diff UTF-16 files?

No, and it's known not to work. In fact, it shouldn't be possible to
view UTF-16 files in Meld. Or at least, this is what I would have said
if I'd seen this email before I saw your follow-ups.

The problem is that in FileDiff._load_files, we check for null bytes
in the file we're reading in, and throw up our hands and declare a
file to be binary if there are any. This works shockingly well,
considering how wrong it is. Obviously it falls over pretty badly for
UTF16. What I'm actually more puzzled by is that you've somehow
managed to find a way around this!

Also, this is bug 632540:
    https://bugzilla.gnome.org/show_bug.cgi?id=632540
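
To illustrate why a null-byte check trips over UTF-16, here is a small
sketch of that kind of heuristic.  It is not the code from
FileDiff._load_files; looks_binary is a hypothetical name.

```python
def looks_binary(data):
    """Heuristic: treat any NUL byte as a sign of binary content."""
    return b"\x00" in data


text = "<root/>"
# ASCII-range characters take one byte each in UTF-8, with no NULs.
print(looks_binary(text.encode("utf-8")))      # False
# In UTF-16LE each of those characters is followed by a 0x00 byte.
print(looks_binary(text.encode("utf-16-le")))  # True
```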


It's nice when things work when they ought not to, huh?  Certainly nicer
than the opposite.  :)

I'm too lazy to look at the code now, so do you scan the entire file for
null bytes, or just the beginning?  As I mentioned, the presence of the
BOM makes a difference.  Is it possible it's only looking at the
beginning couple bytes?

One thing I noticed was that specifying invalid encodings makes a
difference.  I got the clue from some error messages printed to the
console about one of my encodings being invalid.  I got the initial list
from iconv's list, where they are all valid, but I guess Meld's
underlying library has a different list.  Anyway, that's how my list
shrank

from: utf8, iso8859, utf16, utf-16, utf16le, utf-16le
to:   utf8, utf-16le, iso8859

Using process of elimination I found the only encoding that actually
worked for UTF-16LE files is utf-16le.  The others I had (like utf16)
did not work.  I mention this because in the bug you referenced, Martin
Weis reports Meld does not work and cites the encodings "utf16 utf-16".
It's possible that is the cause of the problem for him.
But I can say for sure the prescriptive steps I provided above work.
Please don't break it. :)
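
One way to check which names in an encodings list are recognised,
assuming they are resolved through Python's codec registry (Meld is
written in Python, but whether it looks names up this way is an
assumption).  split_valid_codecs is a hypothetical helper, and the set of
accepted aliases depends on the Python version in use.

```python
import codecs


def split_valid_codecs(names):
    """Split codec names into those Python recognises and those it does not."""
    valid, invalid = [], []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            invalid.append(name)
        else:
            valid.append(name)
    return valid, invalid


valid, invalid = split_valid_codecs(["utf-8", "no-such-codec-xyz"])
print(valid)    # ['utf-8']
print(invalid)  # ['no-such-codec-xyz']
```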


cheers,
Kai

Attachment: BinaryFileCausesStuckState.diff
Description: Text Data


