Re: UTF-16 support?



Would it be helpful to have the option to change the encoding on the fly like jEdit's "reload with encoding"?

-Keegan


On Wed, Apr 3, 2013 at 4:38 PM, Nick <nospam codesniffer com> wrote:
Inline below.

On Thu, 2013-04-04 at 05:50 +1000, Kai Willadsen wrote:
> (Answering lots of things at once, and not in order.)
>
> On 4 April 2013 00:06, Nick <nospam codesniffer com> wrote:
> > I think I found a solution.  It required 3 pieces:
> >
> > 1.  Set Meld's Encodings to:
> >        utf8, utf-16le, iso8859
> >
> > 2.  Set SVN's mime-type property on the UTF-16 files to
> >     text/plain;encoding=UTF-16LE.
> >
> > 3.  Placed a BOM in the UTF-16 files.
> >
> > With this configuration I am able to view UTF-8 and UTF-16 files in Meld
> > without changing the configuration.  The files can be directly from the
> > filesystem (ie. meld file1 file2) or via the SVN hook within Meld.
> >
> >
> > In the process of experimenting on this (and I think contributing to the
> > problem), I think I found a bug in Meld.  It seems that once I attempt
> > to view/diff a file that's in SVN which fails, other files which
> > normally work also fail.  Here's a breakdown of the steps I observe this
> > happening (using Meld 1.7.0):
> >
> > (1)  Open Meld for a directory inside a SVN working copy, which contains
> > 3 files:  a.xml (a UTF-16LE file without a BOM), b.xml (a UTF-16LE file
> > with a BOM), c.txt (a UTF-8 file).

The issue seems to be tied to opening a binary file.  In this case,
a.xml only needs to be treated as binary (i.e. have SVN's svn:mime-type
property set to application/octet-stream).

> > (2)  Set Meld's Encodings configuration to "utf8, utf-16le"
> > (3)  Open/View b.xml.  This should work.
> > (4)  Open/View c.txt.  This should work.
> > (5)  Attempt to open a.xml.  This should yield an error that the file is
> > binary (as expected).
> > (6)  Now attempt to open/view b.xml again.  It fails with the same
> > error.
> >
> > The only way I've found to get it out of this stuck state is to refresh
> > the listing.
> >
> > I can try creating a screen recording of this behavior if it helps.
>
> A screen recording wouldn't really help - those are pretty clear
> instructions - but if there's any way you could provide a SVN working
> copy to reproduce the problem, then that would be great.

I've attached a patch that reproduces the problem.  You should be able
to apply it to any WC, since it doesn't modify existing files (i.e. new
files marked for addition exhibit the problem just fine).  Let me know
if you can't apply this patch to a WC and I'll provide a procedure or a
script to create the same (it's very easy).

Note the svn:mime-type property on each file:
 - File1.txt is UTF-16LE that's marked as binary by SVN
(application/octet-stream).
 - File2.txt is UTF-16LE that's marked as UTF-16LE by SVN.
 - File3.txt is UTF-8.
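
For anyone who cannot apply the patch, here is a rough Python sketch of
how equivalent files could be generated.  It is not the attached patch:
the sample text is a placeholder, and writing a BOM into File2.txt is an
assumption (the thread only states it explicitly for File1.txt).  The
SVN steps are left as comments and would be run by hand in the WC.

```python
import codecs
import tempfile
from pathlib import Path

SAMPLE_TEXT = "<root>hello</root>\n"  # placeholder content, not from the patch


def write_sample_files(directory, text=SAMPLE_TEXT):
    """Write the three test files and return their contents keyed by name."""
    utf16le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    contents = {
        "File1.txt": utf16le_with_bom,      # to be marked as binary in SVN
        "File2.txt": utf16le_with_bom,      # to be marked as UTF-16LE text in SVN
        "File3.txt": text.encode("utf-8"),  # plain UTF-8
    }
    for name, data in contents.items():
        (Path(directory) / name).write_bytes(data)
    return contents


# Afterwards, inside the working copy (run manually):
#   svn add File1.txt File2.txt File3.txt
#   svn propset svn:mime-type application/octet-stream File1.txt
#   svn propset svn:mime-type "text/plain;encoding=UTF-16LE" File2.txt

# Demo in a throwaway directory rather than a real working copy.
with tempfile.TemporaryDirectory() as tmp:
    written = write_sample_files(tmp)
    on_disk = {name: (Path(tmp) / name).read_bytes() for name in written}
```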

To see the problem:
1.  Open Meld for the directory in the WC where you applied the patch.
2.  Set the Encodings to "utf8, utf-16le".
3.  Open/view the files in reverse order: (File3.txt, File2.txt,
File1.txt).  Notice that you can view File3 & File2, but not File1.
That's sort-of expected since File1.txt is marked as binary.  (I say
sort-of because the file is really UTF-16LE and includes a suitable
BOM).
4.  Now open/view File2.txt again, and notice that it does not open this
time.  Refresh the directory listing in Meld, and you can once again
open it.

Let me know if I can help w/ more info.


>
> > On Wed, 2013-04-03 at 09:41 -0400, Nick wrote:
> >> Looks like if I change the order of the codecs such that utf16 is listed
> >> first, then Meld displays the file fine.  But then I lose the ability to
> >> view UTF-8 files.  So it seems like it's one or the other, but not both.
> >>
> >> If this is true, I don't understand the purpose of being able to specify
> >> more than one encoding in the Preferences dialog.
> >>
> >> Can Meld support going through each specified encoding while the file is
> >> not displayable (including the finding that it's a 'binary' file)?  This
> >> will allow me to specify "utf8, utf16" for the encodings which will
> >> support UTF-8 and UTF-16 files to be used in Meld w/out changing the
> >> configuration.
>
> That's exactly what we do... except that the binary file check is
> unrelated to the rest. Having said that, reordering those really
> shouldn't avoid the binary file check.
>
> >> On Wed, 2013-04-03 at 08:48 -0400, Nick wrote:
> >> > Hi,
> >> >
> >> > First and foremost, thanks for a great diff & merge tool!
> >> >
> >> > My project involves XML files which need to be encoded in UTF-16 Little
> >> > Endian.  I cannot seem to view or diff UTF-16 files with Meld.
> >> >
> >> > In the Encoding tab of the Preferences dialog I have this for the
> >> > codecs:
> >> >
> >> >     utf8, iso8859, utf16, utf-16, utf16le, utf-16le
> >> >
> >> > When I try to open a UTF-16LE file that's in SVN, Meld displays a yellow
> >> > error bar on top which reads, "Error fetching original comparison file".
> >> > I've confirmed UTF-8 files in the repo open fine--it's only an issue w/
> >> > UTF-16 files.
> >> >
> >> > It behaves the same even for files which are marked for addition in the
> >> > repo but not yet added (so in this case, there's nothing to diff
> >> > against, but normally Meld will display the contents of the file
> >> > alongside a blank pane).
> >> >
> >> > I've tried UTF-16 files that contain a BOM and files which do not; no
> >> > difference.
> >> >
> >> > I notice that SVN sets the mime-type on these files as binary
> >> > (application/octet-stream).  If I manually change it to UTF-16LE
> >> > (text/plain;encoding=UTF-16LE), Meld displays a yellow error bar on top
> >> > which reads, "Could not read file" "test.xml appears to be a binary
> >> > file."--but it still doesn't display the contents of the file.
>
> I had no idea the mime-type behaviour would be different... we
> certainly don't do anything on the SVN end with regards to that. I
> guess that's a possibly-interesting issue with the new SVN support.

Yeah, I got the idea from
http://rhubbarb.wordpress.com/2012/04/28/svn-unicode/ which speaks only
about subversion support.  I confirmed that the svn command line tool
functions fine according to the mime-type.


> What version of Subversion are you using here? We fetch files in very
> different ways for <1.6 and 1.7.

Client and server (same machine) are 1.7:

nick nimble ~/test_repo $ svn --version
svn, version 1.7.7 (r1393599)
   compiled Jan  5 2013, 15:01:56


> >> > If I call meld and pass it 2 UTF-16 files on the file system (ie. not
> >> > trying to open a file from the SVN listing), I still get a yellow error
> >> > bar on top which reports "Could not read file" "test.xml appears to be a
> >> > binary file."
> >> >
> >> > Is there something else I need to do?
> >> >
> >> > Has anyone used Meld to diff UTF-16 files?
>
> No, and it's known not to work. In fact, it shouldn't be possible to
> view UTF-16 files in Meld. Or at least, this is what I would have said
> if I'd seen this email before I saw your follow-ups.
>
> The problem is that in FileDiff._load_files, we check for null bytes
> in the file we're reading in, and throw up our hands and declare a
> file to be binary if there are any. This works shockingly well,
> considering how wrong it is. Obviously it falls over pretty badly for
> UTF16. What I'm actually more puzzled by is that you've somehow
> managed to find a way around this!
>
> Also, this is bug 632540:
>     https://bugzilla.gnome.org/show_bug.cgi?id=632540
>

It's nice when things work when they ought not to, huh?  Certainly nicer
than the opposite.  :)

I'm too lazy to look at the code now, so do you scan the entire file for
null bytes, or just the beginning?  As I mentioned, the presence of the
BOM makes a difference.  Is it possible it's only looking at the
first couple of bytes?
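
To illustrate why a null-byte check trips on UTF-16, here is a minimal
Python sketch.  It is not Meld's code: both function names are made up
for the example, and the BOM test is only one possible way a checker
could tell UTF-16 text apart from binary data.

```python
import codecs


def looks_binary_by_nul(data: bytes) -> bool:
    """Naive heuristic: any NUL byte means the data is treated as binary."""
    return b"\x00" in data


def has_utf16_bom(data: bytes) -> bool:
    """True if the data starts with a UTF-16 byte order mark (LE or BE).

    Note that a UTF-32LE BOM begins with the same two bytes, so this is
    only a hint, not proof.
    """
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


text = "<root/>"
utf8_bytes = text.encode("utf-8")
utf16le_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# ASCII-range characters encode as one byte plus a NUL byte in UTF-16LE,
# so the naive check flags perfectly ordinary UTF-16 text.
utf8_flagged = looks_binary_by_nul(utf8_bytes)
utf16_flagged = looks_binary_by_nul(utf16le_bytes)
utf16_has_bom = has_utf16_bom(utf16le_bytes)
```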

One thing I noticed was that specifying invalid encodings makes a
difference.  I got the clue from some error messages printed to the
console about one of my encodings being invalid.  I got the initial list
from iconv's list, where they are all valid, but I guess Meld's
underlying library has a different list.  Anyway, that's how my list
shrank:

from: utf8, iso8859, utf16, utf-16, utf16le, utf-16le
to:   utf8, utf-16le, iso8859

Using process of elimination I found the only encoding that actually
worked for UTF-16LE files is utf-16le.  The others I had (like utf16)
did not work.  I mention this because in the bug you referenced, Martin
Weis reports Meld does not work and cites the encodings "utf16 utf-16".
It's possible that is the cause of the problem for him.
But I can say for sure the prescriptive steps I provided above work.
Please don't break it. :)
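
As a rough sketch of how an ordered encoding list can behave, the
Python below tries each name in turn and skips names the codec registry
does not know.  The function name is hypothetical and this is not taken
from Meld's source.  It only shows one possible reason the BOM could
matter: the BOM bytes are not valid UTF-8, so utf8 is rejected and
utf-16le gets its turn, whereas BOM-less UTF-16LE text in the ASCII
range decodes "successfully" as UTF-8 with NUL characters embedded.

```python
import codecs


def decode_with_fallback(data: bytes, encodings):
    """Return (text, name) for the first encoding that decodes the data.

    Unknown codec names are skipped.  Returns (None, None) if nothing fits.
    """
    for name in encodings:
        try:
            codecs.lookup(name)
        except LookupError:
            continue  # name not known to Python's codec registry
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding, try the next
    return None, None


ENCODINGS = ["utf8", "utf-16le", "iso8859"]

with_bom = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
without_bom = "<a/>".encode("utf-16-le")

# The 0xFF 0xFE BOM is invalid UTF-8, so decoding falls through to utf-16le.
bom_text, bom_encoding = decode_with_fallback(with_bom, ENCODINGS)

# Without a BOM every byte is valid UTF-8, so the first entry wins and
# the decoded text is full of NUL characters.
plain_text, plain_encoding = decode_with_fallback(without_bom, ENCODINGS)
```

Putting iso8859 last matters in a scheme like this, since a single-byte
encoding accepts any byte sequence and would otherwise shadow the rest.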


> cheers,
> Kai
