Re: UTF-16 support?



On 4 April 2013 06:38, Nick <nospam codesniffer com> wrote:
On Thu, 2013-04-04 at 05:50 +1000, Kai Willadsen wrote:
A screen recording wouldn't really help - those are pretty clear
instructions - but if there's any way you could provide a SVN working
copy to reproduce the problem, then that would be great.

I've attached a patch which exhibits the problem that you should be able
to apply to any WC since it doesn't require modifications to existing
files (i.e. new files marked for addition exhibit the problem fine).  Let
me know if you can't apply this patch to a WC and I'll provide a
procedure or a script to create the same (it's very easy).

Note the svn:mime-type property on each file:
 - File1.txt is UTF-16LE that's marked as binary by SVN
(application/octet-stream).
 - File2.txt is UTF-16LE that's marked as UTF-16LE by SVN.
 - File3.txt is UTF-8.

To see the problem:
1.  Open meld for the directory in the WC where you applied the patch.
2.  Set the Encodings to "utf8, utf-16le".
3.  Open/view the files in reverse order: (File3.txt, File2.txt,
File1.txt).  Notice that you can view File3 & File2, but not File1.
That's sort-of expected since File1.txt is marked as binary.  (I say
sort-of because the file is really UTF-16LE and includes a suitable
BOM).
4.  Now open/view File2.txt again, and notice that it does not open this
time.  Refresh the directory listing in Meld, and you can once again
open it.
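
For anyone without the patch, the three sample files can be approximated with a short script. This is only a sketch: the file contents are placeholder text (the real contents are in the attached patch), `write_samples` is a made-up helper name, and the files would still need `svn add` plus the svn:mime-type properties set by hand, since the exact property value used for File2.txt isn't given here.

```python
import codecs
import os
import tempfile

SAMPLE_TEXT = "Hello, Meld\n"  # placeholder content, not taken from the patch


def write_samples(directory):
    """Write File1/File2 as UTF-16LE with a BOM and File3 as UTF-8."""
    paths = {}
    for name in ("File1.txt", "File2.txt"):
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            # The "utf-16-le" codec never emits a BOM, so prepend it explicitly.
            handle.write(codecs.BOM_UTF16_LE + SAMPLE_TEXT.encode("utf-16-le"))
        paths[name] = path

    path = os.path.join(directory, "File3.txt")
    with open(path, "wb") as handle:
        handle.write(SAMPLE_TEXT.encode("utf-8"))
    paths["File3.txt"] = path
    return paths


if __name__ == "__main__":
    target = tempfile.mkdtemp()
    for name, path in sorted(write_samples(target).items()):
        with open(path, "rb") as handle:
            # Show the leading bytes so the BOM (or its absence) is visible.
            print(name, handle.read()[:4])
```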

So I've applied your patch and can confirm the first few bits, but I
can't get the stuck state to occur. When you hit the problem, do you
get any command-line tracebacks?

Also, this is Meld 1.7.0, right? 1.7.1 (and head) is very different,
but I can't reproduce with either of them from your sample. I'd
certainly be interested to know whether you can reproduce with the
current git version (it's easy to try! clone and run from the
directory).

What version of Subversion are you using here? We fetch files in very
different ways for <=1.6 and 1.7.

Client and server (same machine) are 1.7:

nick nimble ~/test_repo $ svn --version
svn, version 1.7.7 (r1393599)
   compiled Jan  5 2013, 15:01:56

Right. When I said this, I was thinking of Meld 1.7.1+. It won't make
any difference on 1.7.0, but I'm testing with the same SVN version
anyway.

No, and it's known not to work. In fact, it shouldn't be possible to
view UTF-16 files in Meld. Or at least, this is what I would have said
if I'd seen this email before I saw your follow-ups.

The problem is that in FileDiff._load_files, we check for null bytes
in the file we're reading in, and throw up our hands and declare a
file to be binary if there are any. This works shockingly well,
considering how wrong it is. Obviously it falls over pretty badly for
UTF-16. What I'm actually more puzzled by is that you've somehow
managed to find a way around this!
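
To make the heuristic concrete, here is a rough sketch of that kind of check. It is not the actual FileDiff._load_files code; `looks_binary` and the chunk size are invented for illustration, and the only thing it assumes is "read in chunks, call the file binary if any chunk contains a null byte".

```python
import io


def looks_binary(stream, chunk_size=4096):
    """Return True if any chunk read from the stream contains a null byte."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False
        if b"\x00" in chunk:
            return True


ascii_utf8 = "plain text\n".encode("utf-8")
ascii_utf16 = "plain text\n".encode("utf-16-le")

print(looks_binary(io.BytesIO(ascii_utf8)))   # False
print(looks_binary(io.BytesIO(ascii_utf16)))  # True
```

Every ASCII character in UTF-16LE is followed by a zero byte, which is why ordinary UTF-16 text trips a check like this straight away.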

Also, this is bug 632540:
    https://bugzilla.gnome.org/show_bug.cgi?id=632540


It's nice when things work when they ought not to, huh?  Certainly nicer
than the opposite.  :)

No! I like knowing why things that work work, and why things that
break break! :)

I'm too lazy to look at the code now, so do you scan the entire file for
null bytes, or just the beginning?  As I mentioned, the presence of the
BOM makes a difference.  Is it possible it's only looking at the
first couple of bytes?

We scan each chunk as we read it from the file. Of course, it's not
like UTF-16 is guaranteed to have null bytes, so maybe your sample just
happens to work?
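
A quick illustration of that point, using made-up sample strings rather than the contents of the files in the patch (`has_null_bytes` is just a helper for this example):

```python
import codecs


def has_null_bytes(text):
    """Encode text as UTF-16LE with a BOM and report whether any byte is zero."""
    data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return b"\x00" in data


# Three CJK characters (U+65E5, U+672C, U+8A9E): neither byte of any code unit is zero.
print(has_null_bytes("\u65e5\u672c\u8a9e"))  # False

# Plain ASCII: the high byte of every code unit is zero.
print(has_null_bytes("abc"))                 # True
```

Note that a single ASCII character, including a newline, is enough to introduce a null byte, so a UTF-16 file would have to be fairly unusual to avoid them entirely.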

One thing I noticed was that specifying invalid encodings makes a
difference.  I got the clue from some error messages printed to the
console about one of my encodings being invalid.  I got the initial list
from iconv's list, where they are all valid, but I guess Meld's
underlying library has a different list.  Anyway, that's how my list shrank
from: utf8, iso8859, utf16, utf-16, utf16le, utf-16le
to:   utf8, utf-16le, iso8859

We use Python's list of encodings which is... not authoritative. I
would have thought that most of those would be recognised aliases, but
I haven't checked.
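
One way to check would be to ask Python's codec registry directly. This is a sketch using the standard `codecs.lookup` call; `resolve` is a hypothetical helper, and I'm not asserting here which of the names below any given Python version accepts, nor that Meld validates its encoding list this way.

```python
import codecs


def resolve(names):
    """Map each encoding name to Python's canonical codec name, or None if unknown."""
    resolved = {}
    for name in names:
        try:
            resolved[name] = codecs.lookup(name).name
        except LookupError:
            # Python does not recognise this spelling as a codec or alias.
            resolved[name] = None
    return resolved


if __name__ == "__main__":
    candidates = ["utf8", "iso8859", "utf16", "utf-16", "utf16le", "utf-16le"]
    for name, canonical in resolve(candidates).items():
        print(name, "->", canonical)
```

Any name that comes back as None is one Python would reject, which would line up with the "invalid encoding" messages on the console.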

By process of elimination, I found that the only encoding that actually
worked for UTF-16LE files was utf-16le.  The others I had (like utf16)
did not work.  I mention this because in the bug you referenced, Martin
Weis reports Meld does not work and cites the encodings "utf16 utf-16".
It's possible that is the cause of the problem for him.
But I can say for sure the prescriptive steps I provided above work.
Please don't break it. :)

I'll try not to, though I'd still like to know why it's working. Could
I ask you to dump whatever you currently have to hand in that bugzilla
bug, just for future reference?

...and I'm glad that it's working for you, even if it scares me slightly.

cheers,
Kai

