It seems that git-annex copies every individual file in a separate transaction. This is quite costly for mass transfers: each file involves a separate rsync invocation and the creation of a new commit. Even with a meager thousand files or so in the annex, I have to wait for fifteen minutes to copy the contents to another disk, simply because every individual file involves some disk thrashing. Also, it seems suspicious that the git-annex branch would get thousands of commits of history from the simple procedure of copying everything to a new repository. Surely it would be better to first copy everything and then create only a single commit that registers the changes to the files' availability?
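
For concreteness, the kind of mass transfer I mean is roughly this (the remote name and path are just placeholders for my setup):

    git remote add otherdisk /media/otherdisk/annex
    git annex copy . --to otherdisk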

git-annex is very careful to commit as infrequently as possible, and the current version makes 1 commit after all the copies are complete, even if it transferred a billion files. The only overhead incurred for each file is writing a journal file. You must have an old version. --Joey
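
You can check how many commits a mass transfer actually produced by looking at the git-annex branch afterwards, for example:

    git log --oneline git-annex | head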

(I'm also not quite clear on why rsync is being used when both repositories are local. It seems to be just overhead.)

Even when copying to another disk, that disk is often on some slow bus, and the files are by definition large, so it's nice to support resuming interrupted transfers. Also, rsync has a handy progress display that is hard to get with cp.
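
The exact options git-annex passes may differ, but the point is the kind of resumable, progress-reporting transfer you get from something like this (paths are placeholders):

    rsync --partial --progress /path/to/source/file /media/otherdisk/path/to/dest/file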

(However, if the copy is to another directory in the same disk, it does use cp, and even supports really fast copies on COW filesystems.) --Joey
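
On a COW filesystem such as btrfs, that fast copy is essentially what cp's reflink support gives you; a sketch, not necessarily the exact invocation git-annex uses:

    cp --reflink=auto sourcefile destfile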


Oneshot mode is now implemented, making git-annex-shell and other short-lived processes not bother with committing changes. done --Joey

Update: Now it makes one commit at the very end of such a mass transfer. --Joey

To me it very much seems that a commit per file is indeed created at the remote end, although not at the local end. See the following transcript: https://gist.github.com/1691714.

Ah, I see, I was not thinking about the location log update that's done on the remote side.

For transfers over ssh, that's a separate git-annex-shell invoked per change. For local-to-local transfers, it's all done in a single process, but it spins up state to handle the remote and then immediately shuts it down, which also generates a commit.

In either case, I think there is a nice fix. Since git-annex does have a journal nowadays, and goes to all the bother of supporting recovery when a process is interrupted after journalling changes that did not get committed, there's really no reason in either of these cases for the remote end to do anything more than journal the change. The next time git-annex is actually run on the remote and needs to look up location information, it will merge the journalled changes into the branch, in a single commit.
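
As a sketch of how that would look on the remote, assuming the journal lives under .git/annex/journal as in current versions (the file name below is just an example):

    # after receiving content, the remote has only written journal entries
    ls .git/annex/journal/
    # the next time git-annex needs location info here, it folds the journal
    # into the git-annex branch as a single commit
    git annex whereis somefile
    git log --oneline git-annex | head -1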

My only real concern is that some remotes might never have git-annex run in them directly, and would just continue to accumulate journal files forever. Although, due to the way the journal is structured, it can contain at most as many files as the git-annex branch has. The number of files in it is expected to be relatively small, but it might get a trifle inefficient, as it lacks directory hashing. These performance problems could certainly be dealt with if they do turn out to be a problem.
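
For illustration of that difference: location logs on the git-annex branch live under hashed directories, while the journal is one flat directory (actual paths will vary):

    # hashed layout on the git-annex branch
    git ls-tree -r --name-only git-annex | head
    # flat layout in the journal
    ls .git/annex/journal/ | head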

Comment by http://joey.kitenet.net/ Sat Jan 28 19:32:36 2012

That sounds just fine, but indeed my use case was a bare backup/transfer repository that is meant to always be only at the remote end of git-annex operations. So why not do a single commit there as well, after everything has been copied and journaled? That's what's done at the other end, after all. Or, if commits are to be minimized, just stage the journal into the index before finishing, but don't commit it yet?

(I would actually prefer this mode of usage for other git-annex operations, too. In git you can add stuff little by little and commit it all in one go. In git-annex, the add immediately creates a commit, which is unexpected and a bit annoying.)
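
I mean the ordinary git workflow, where nothing is committed until you ask for it (file names are placeholders):

    git add file1
    git add file2
    git commit -m 'add both files in a single commit'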
