This bug is reopened to track some new UTF-8 filename issues caused by GHC 7.4. In this version of GHC, git-annex's hack to support filenames in any encoding no longer works. Even unicode filenames fail to work when git-annex is built with 7.4. --Joey

This bug is now fixed in current master. Once again, git-annex will work for all filename encodings, and all system encodings. It will only build with the new GHC. done --Joey


Old, now fixed bug report follows:

There are problems with displaying filenames in UTF8 encoding, as shown here:

$ echo $LANG
en_GB.UTF-8
$ git init
$ git annex init test
[...]
$ touch "Umlaut Ü.txt"
$ git annex add Uml*
add Umlaut Ã.txt ok
(Recording state in git...)
$ find -name U\* | hexdump -C
00000000  2e 2f 55 6d 6c 61 75 74  20 c3 9c 2e 74 78 74 0a  |./Umlaut ...txt.|
00000010
$ git annex find | hexdump -C
00000000  55 6d 6c 61 75 74 20 c3  83 c2 9c 2e 74 78 74 0a  |Umlaut .....txt.|
00000010
$

It looks like the common latin1-to-UTF8 double encoding: the UTF-8 bytes are read as if they were latin1 and then encoded as UTF-8 a second time. Functionality other than output seems not to be affected.

Yes, I believe that git-annex is reading filename data from git as a stream of char8s, and not decoding the unicode in it into logical characters. Haskell then, I guess, tries to unicode-encode it when it's output to the console. This only seems to matter for output to the console; the data does not get mangled internally, so git-annex accesses the right files under the hood.
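
The byte pattern in the two hexdumps above can be reproduced with a small sketch. This is not git-annex code; `rawName`, `misdecoded`, `utf8Encode` and `hex` are names made up for the illustration, and the hand-written encoder only covers code points below 0x800, which is all this filename needs. It assumes only the `bytestring` package that ships with GHC.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Char (ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Raw bytes of "Umlaut Ü.txt" as stored on disk in UTF-8 (Ü is c3 9c).
rawName :: B.ByteString
rawName = B.pack [0x55,0x6d,0x6c,0x61,0x75,0x74,0x20,0xc3,0x9c,0x2e,0x74,0x78,0x74]

-- Char8 unpack maps every byte to the Char with the same code point,
-- which amounts to decoding the bytes as latin1.
misdecoded :: String
misdecoded = B8.unpack rawName

-- Minimal UTF-8 encoder for code points below 0x800 (enough for this demo).
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80  = [fromIntegral n]
  | n < 0x800 = [fromIntegral (0xc0 + n `div` 64), fromIntegral (0x80 + n `mod` 64)]
  | otherwise = error "utf8Encode: code point out of range for this sketch"
  where n = ord c

-- Render bytes as two-digit hex, separated by spaces.
hex :: [Word8] -> String
hex = unwords . map (\w -> let s = showHex w "" in if length s == 1 then '0' : s else s)

main :: IO ()
main = do
  putStrLn (hex (B.unpack rawName))                 -- the bytes on disk
  putStrLn (hex (concatMap utf8Encode misdecoded))  -- the bytes after the second encoding
```

The first line printed ends in `20 c3 9c 2e 74 78 74`, as in the `find` hexdump; the second contains `c3 83 c2 9c`, as in the `git annex find` hexdump.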

I am too new to haskell to really have a handle on how to deal with unicode and other encoding issues in it. In general, there are three valid approaches: --Joey

  1. Convert all input data to unicode and be unicode clean end-to-end internally. Problematic here since filenames may not necessarily be encoded in utf-8 (an archive could have historical filenames using varying encodings), and you don't want which files are accessed to depend on locale settings.

    I tried to do this by making parts of GitRepo call Codec.Binary.UTF8.String.decodeString when reading filenames from git. This seemed to break attempts to operate on the files; weirdly encoded strings showed up in syscalls in strace.

  2. Keep input and internal data un-decoded, but decode it when outputting a filename (assuming the filename is encoded using the user's configured encoding), and allow haskell's output encoding to then encode it according to the user's locale configuration.

    This is now implemented. I'm not very happy that I have to watch out for any place that a filename is output and call filePathToString on it, but there are really not too many such places in git-annex.

    Note that this only affects filenames apparently. (Names of files in the annex, and also some places where names of keys are displayed.) Utf-8 in the uuid.map file etc seems to be handled cleanly.

  3. Avoid encodings entirely. Mostly what I'm doing now; probably could find a way to disable encoding of console output. Then the raw filename would be displayed, which should work ok. git-annex does not really need to pull apart filenames; they are almost entirely opaque blobs. I guess that the --exclude option is the exception to that, but it is currently not unicode safe anyway. (Update: tried --exclude again, seems it is unicode clean.) One other possible issue would be that this could cause problems if git-annex were translated.

    On second thought, I switched to this. Any decoding of a filename is going to make someone unhappy; the previous approach broke non-utf8 filenames.
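
A rough sketch of what approach 3 amounts to, assuming the filename is held as a String with one Char per byte and stdout is switched to `latin1`, so that each Char is written back as exactly one byte. This is only an illustration of the idea and not necessarily how git-annex disables the stdio encoding; `rawName` and `name` are hypothetical names.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import System.IO (hSetEncoding, latin1, stdout)

-- Raw bytes of a filename; they happen to be UTF-8 here, but any byte
-- sequence would be handled the same way.
rawName :: B.ByteString
rawName = B.pack [0x55,0x6d,0x6c,0x61,0x75,0x74,0x20,0xc3,0x9c,0x2e,0x74,0x78,0x74]

-- One Char per byte; nothing is decoded.
name :: String
name = B8.unpack rawName

main :: IO ()
main = do
  -- latin1 writes every Char below 256 as the single byte with that value,
  -- so the original bytes reach the terminal unchanged.
  hSetEncoding stdout latin1
  putStrLn name
  -- The String converts back to exactly the bytes we started with.
  print (B8.pack name == rawName)
```

Whether the terminal then shows the right glyphs depends only on the terminal's own encoding, which is the point of this approach: the program never has to guess.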

I just noticed this issue, and was wondering what the current status is.

% ls -l 04\ -\ Orixás.mp3
-rw-r--r-- 1 adam users 8377816 Jul 12  2007 04 - Orixás.mp3
% echo 04\ -\ Orixás.mp3 | od -c
0000000   0   4       -       O   r   i   x 303 241   s   .   m   p   3
0000020  \n
0000021
% git annex add 04\ -\ Orixás.mp3
git-annex: /home/adam/music/RotC/transcribe/04 - Orixás.mp3: getSymbolicLinkStatus: does not exist (No such file or directory)
Comment by http://adamspiers.myopenid.com/ Sat Dec 24 01:05:07 2011

This (rather longish) thread discusses the current situation, the planned changes for 7.2 and the various issues: http://haskell.org/pipermail/glasgow-haskell-users/2011-November/021115.html

The summary seems to be: From 7.2 on, getDirectoryContents will return proper Strings, i.e. where a Char represents a Unicode code point, and not a Word8, which will fix the problem of outputting them.

Comment by http://www.joachim-breitner.de/ Sat Dec 24 12:49:40 2011
An alternative that is available from ghc 7.4 on is a pure ByteString based unix API: http://thread.gmane.org/gmane.comp.lang.haskell.libraries/16556
Comment by http://www.joachim-breitner.de/ Sat Dec 24 12:51:43 2011

Adam, this bug was fixed a long time ago, first using option #2 above, but later switching to option #3 -- git-annex treats filenames as opaque binary blobs and never decodes them in any encoding; haskell's normal encoding support for stdio is disabled.

And it never resulted in a failure like you show. I cannot reproduce your problem, but it is a different bug, please open a new bug report.

Comment by http://joey.kitenet.net/ Sat Dec 24 16:49:13 2011

I also encountered Adam's bug. The problem seems to be that communication with the git process is done with Char8-bytestrings. So, when L.unpack is called, all filenames that git outputs (with ls-files or ls-tree) are interpreted to be in latin-1, which wreaks havoc if they are really in UTF-8.

I suspect that it would be enough to just switch to standard Strings (or Data.Text.Text) instead of bytestrings for textual data, and to Word8-bytestrings for pure binary data. GHC should nowadays handle locale-dependent encoding of Strings transparently.

Lauri, what version of GHC do you have that behaves this way? 7.0.4 does not.
Comment by http://joey.kitenet.net/ Fri Jan 27 21:00:06 2012
7.2. nomeata already explained the issue. I got utf-8 filenames to work on a utf-8 locale by switching from Char8-bytestrings to UTF8-bytestrings, and adding hSetEncoding h localeEncoding to suitable places. Making things work properly with an arbitrary locale encoding would be more complicated.
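
A minimal sketch of the handle-encoding part of that idea (not the actual patch). A temporary file stands in for the pipe from git, and `utf8` stands in for `localeEncoding`, which would be UTF-8 on a locale such as en_GB.UTF-8; `readBack` is a hypothetical helper name. It assumes the `bytestring` and `directory` packages that ship with GHC.

```haskell
import qualified Data.ByteString as B
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Write the UTF-8 bytes of "Umlaut Ü.txt" plus a newline to a temp file,
-- then read the line back through a handle that decodes UTF-8.
readBack :: IO String
readBack = do
  tmp <- getTemporaryDirectory
  (path, hOut) <- openBinaryTempFile tmp "names.txt"
  B.hPut hOut (B.pack [0x55,0x6d,0x6c,0x61,0x75,0x74,0x20,0xc3,0x9c,0x2e,0x74,0x78,0x74,0x0a])
  hClose hOut
  h <- openFile path ReadMode
  hSetEncoding h utf8   -- assumption: stands in for localeEncoding on a UTF-8 locale
  line <- hGetLine h
  hClose h
  removeFile path
  return line

main :: IO ()
main = do
  name <- readBack
  print (length name)                -- 12 Chars, decoded from 13 bytes
  print (name == "Umlaut \220.txt")  -- \220 is U+00DC, the letter Ü
```

With the handle decoding UTF-8, Ü arrives as a single Char instead of two latin1 characters. A filename that is not valid UTF-8 would make the read fail, which is the "arbitrary locale encoding" difficulty mentioned above.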

Lauri, a scratch patch would be very helpful. Encoding stuff makes my head explode.

However, I am very worried by haskell's changes WRT unicode and filenames. Based on user input, git-annex users like to use it on diverse sets of files, with diverse and ill-defined encodings. Faffing about with converting between encodings seems likely to fail spectacularly.

Comment by http://joey.kitenet.net/ Sat Jan 28 19:40:34 2012