The git-annex assistant is being crowd funded on Kickstarter. I'll be blogging about my progress here on a semi-daily basis.
Random improvements day..
Got the merge conflict resolution code working in git annex assistant.
Did some more fixes to the pushing and pulling code, covering some cases I missed earlier.
Git syncing seems to work well for me now; I've seen it recover from a variety of error conditions, including merge conflicts and repos that were temporarily unavailable.
There is definitely an MVar deadlock if the merger thread's inotify event handler tries to run code in the Annex monad. Luckily, it doesn't currently seem to need to do that, so I have put off debugging what's going on there.
Reworked how the inotify thread runs, so that the two inotify threads now in the assistant don't both need to wait for program termination in possibly conflicting ways.
Hmm, that seems to have fixed the MVar deadlock problem.
Been thinking about how to fix the "watcher commits unlocked files" bug. Posted some thoughts there.
It's about time to move on to data syncing. While eventually that will need to build a map of the repo network to efficiently sync data over the fastest paths, I'm thinking that I'll first write a dumb version. So, two more threads:
- Uploads new data to every configured remote. Triggered by the watcher thread when it adds content. Easy; just use a TSet of Keys to send (see the sketch below).
- Downloads new data from the cheapest remote that has it. Could be triggered by the merger thread, after it merges in a git sync. Rather hard: how does it work out what new keys are in the tree without scanning it all? Scan through the git history to find newly created files? Maybe the watcher triggers this thread instead, when it sees a new symlink, without data, appear.
Both threads will need to be able to be stopped, and restarted, as needed to control the data transfer. And a lot of other control smarts will eventually be needed, but my first pass will be to do a straightforward implementation. Once it's done, the git annex assistant will be basically usable.
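Here's a minimal sketch of the TSet-based upload queue from the first item above. The Key type and all the function names are stand-ins, not git-annex's real ones:

[[!format haskell """
import Control.Concurrent.STM
import qualified Data.Set as S

-- Stand-in for git-annex's real Key type.
type Key = String

-- A "TSet": a Set inside a TVar, shared between threads.
newtype TSet a = TSet (TVar (S.Set a))

newTSet :: IO (TSet a)
newTSet = do
	tv <- newTVarIO S.empty
	return (TSet tv)

-- Called by the watcher thread when it has added content.
queueUpload :: Ord a => TSet a -> a -> IO ()
queueUpload (TSet tv) k = atomically $ modifyTVar' tv (S.insert k)

-- The upload thread blocks here until a key is available.
nextUpload :: Ord a => TSet a -> IO a
nextUpload (TSet tv) = atomically $ do
	s <- readTVar tv
	case S.minView s of
		Nothing -> retry
		Just (k, rest) -> do
			writeTVar tv rest
			return k
"""]]

Using a Set rather than a queue means a key that is touched repeatedly only gets uploaded once.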
Worked on automatic merge conflict resolution today. I had expected to be able to use git's merge driver interface for this, but that interface is not sufficient. There are two problems with it:
- The merge program is run when git is in the middle of an operation that locks the index. So it cannot delete or stage files. I need to do both as part of my conflict resolution strategy.
- The merge program is not run at all when the merge conflict is caused by one side deleting a file, and the other side modifying it. This is an important case to handle.
So, instead, git-annex will use a regular git merge, and if it fails, it will fix up the conflicts.

That presented its own difficulty: finding which files in the tree conflict. git ls-files --unmerged is the way to do that, but its output is in quite a raw form:
120000 3594e94c04db171e2767224db355f514b13715c5 1 foo
120000 35ec3b9d7586b46c0fd3450ba21e30ef666cfcd6 3 foo
100644 1eabec834c255a127e2e835dadc2d7733742ed9a 2 bar
100644 36902d4d842a114e8b8912c02d239b2d7059c02b 3 bar
I had to stare at the rather impenetrable documentation for hours and write a lot of parsing and processing code to get from that to these mostly self-explanatory data types:
data Conflicting v = Conflicting
	{ valUs :: Maybe v
	, valThem :: Maybe v
	} deriving (Show)

data Unmerged = Unmerged
	{ unmergedFile :: FilePath
	, unmergedBlobType :: Conflicting BlobType
	, unmergedSha :: Conflicting Sha
	} deriving (Show)
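To give a flavor of the parsing, here's a simplified sketch that picks apart one line of that output; the real code additionally converts modes to BlobTypes and pairs up the stages into the Conflicting values above:

[[!format haskell """
-- A line looks like: "100644 36902d4d... 3\tbar"
-- Fields: blob mode, sha, merge stage (2 = us, 3 = them), tab, filename.
parseUnmergedLine :: String -> Maybe (FilePath, Int, String, String)
parseUnmergedLine l = case words metadata of
	[mode, sha, stage] -> case reads stage of
		[(n, "")] -> Just (file, n, mode, sha)
		_ -> Nothing
	_ -> Nothing
  where
	(metadata, rest) = break (== '\t') l
	file = drop 1 rest
"""]]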
Not the first time I've whined here about time spent parsing unix command output, is it? :)
From there, it was relatively easy to write the actual conflict cleanup code, and make git annex sync use it. Here's how it looks:
$ ls -1
foo.png
bar.png
$ git annex sync
commit
# On branch master
nothing to commit (working directory clean)
ok
merge synced/master
CONFLICT (modify/delete): bar.png deleted in refs/heads/synced/master and modified in HEAD. Version HEAD of bar.png left in tree.
Automatic merge failed; fix conflicts and then commit the result.
bar.png: needs merge
(Recording state in git...)
[master 0354a67] git-annex automatic merge conflict fix
ok
$ ls -1
foo.png
bar.variant-a1fe.png
bar.variant-93a1.png
There are very few options for how the conflict resolution code can name conflicting variants of files. It can only use data present in git to generate the names, because the same conflict needs to be resolved the same way everywhere.
So I had to choose between using the full key name in the filenames produced when resolving a merge, and using a shorter checksum of the key, which would be more user-friendly but could theoretically collide with another key. I chose the checksum, and weakened it horribly by only using 32 bits of it!
Surprisingly, I think this is a safe choice. The worst that can happen if such a collision occurs is another conflict, and the conflict resolution code will happily work on conflicts produced by the conflict resolution code! In such a case, it does fall back to putting the whole key in the filename: "bar.variant-SHA256-s2550--2c09deac21fa93607be0844fefa870b2878a304a7714684c4cc8f800fda5e16b.png"
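To illustrate the naming scheme (this is not git-annex's actual hash choice; any stable 32-bit digest of the key behaves the same way):

[[!format haskell """
import Data.Bits (xor)
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word32)
import Numeric (showHex)
import System.FilePath (splitExtension)

-- A stand-in 32-bit checksum (FNV-1a) of the key.
checksum32 :: String -> Word32
checksum32 = foldl' step 2166136261
  where
	step h c = (h `xor` fromIntegral (ord c)) * 16777619

-- "bar.png" + key -> "bar.variant-xxxxxxxx.png"
variantFile :: FilePath -> String -> FilePath
variantFile file key = base ++ ".variant-" ++ showHex (checksum32 key) "" ++ ext
  where
	(base, ext) = splitExtension file
"""]]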
Still need to hook this code into git annex assistant.
Not much available time today, only a few hours.
Main thing I did was to fix up the failed push tracking to use a better data structure. No need for a queue of failed pushes; all it needs is a map of remotes that have an outstanding failed push, and a timestamp. Now its memory use won't keep growing. :)
Finding the right thread mutex type for this turned out to be a bit of a challenge. I ended up with a STM TMVar, which is left empty when there are no pushes to retry, so the thread using it blocks until there are some. And, it can be updated transactionally, without races.
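A minimal sketch of that structure, with a stand-in Remote type and hypothetical function names:

[[!format haskell """
import Control.Concurrent.STM
import qualified Data.Map as M
import Data.Maybe (fromMaybe)
import Data.Time.Clock (UTCTime)

type Remote = String -- stand-in for the real Remote type

-- Left empty when there is nothing to retry, so takers block.
type FailedPushMap = TMVar (M.Map Remote UTCTime)

recordFailedPush :: FailedPushMap -> Remote -> UTCTime -> STM ()
recordFailedPush v r t = do
	m <- fromMaybe M.empty `fmap` tryTakeTMVar v
	putTMVar v (M.insert r t m)

-- Blocks until at least one failed push needs retrying.
takeFailedPushes :: FailedPushMap -> STM (M.Map Remote UTCTime)
takeFailedPushes = takeTMVar
"""]]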
I also fixed a bug outside the git-annex assistant code. It was possible to crash git-annex if a local git repository was configured as a remote, and the repository was not available on startup. git-annex now ignores such remotes. This does impact the assistant, since it is a long running process and git repositories will come and go. Now it ignores any that were not available when it started up. This will need to be dealt with when making it support removable drives.
I released a version of git-annex over the weekend that includes the git annex watch command. There's a minor issue installing it from cabal on OSX, which I've fixed in my tree. Nice timing: at least the watch command should be shipped in the next Debian release, which freezes at the end of the month.
Jimmy found out how kqueue blows up when there are too many directories to keep all open. I'm not surprised this happens, but it's nice to see exactly how. Odd that it happened to him at just 512 directories; I'd have guessed more. I have plans to fork watcher programs that each watch 512 directories (or whatever the ulimit is), to deal with this. What a pitiful interface kqueue is.. I have not thought yet about how the watcher programs would communicate back to the main program.
Back on the assistant front, I've worked today on making git syncing more robust. Now when a push fails, it tries a pull, and a merge, and repushes. That ensures that the push is, almost always, a fast-forward. Unless something else gets in a push first, anyway!
If a push still fails, there's Yet Another Thread, added today, that will wake up after 30 minutes and retry the push. It currently keeps retrying every 30 minutes until the push finally gets through. This will deal, to some degree, with those situations where a remote is only sometimes available.
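The shape of that thread is simple; here's a sketch, with the queue inspection and the actual push passed in as stand-ins for the real code:

[[!format haskell """
import Control.Concurrent (threadDelay)
import Control.Monad (forever)

halfhour :: Int
halfhour = 30 * 60 * 1000000 -- threadDelay takes microseconds

-- Wake up every half hour and retry pushes that failed earlier.
pushRetryThread :: IO [remote] -> (remote -> IO Bool) -> IO ()
pushRetryThread getFailed push = forever $ do
	threadDelay halfhour
	failed <- getFailed
	mapM_ push failed
"""]]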
I need to refine the code a bit, to avoid it keeping an ever-growing queue of failed pushes, if a remote is just dead. And to clear old failed pushes from the queue when a later push succeeds.
I also need to write a git merge driver that handles conflicts in the tree. If two conflicting versions of a file foo are saved, this would merge them, renaming them to foo.X and foo.Y. Probably X and Y are the git-annex keys for the content of the files; this way all clones will resolve the conflict in a way that leads to the same tree. It's also possible to get a conflict by one repo deleting a file, and another modifying it. In this case, renaming the deleted file to foo.Y may be the right approach, I am not sure.
I glanced through some Haskell dbus bindings today. I believe there are dbus events available to detect when drives are mounted, and on Linux this would let git-annex notice and sync to usb drives, etc.
Good news! My beta testers report that the new kqueue code works on OSX. At least "works" as well as it does on Debian kFreeBSD. My crazy development strategy of developing on Debian kFreeBSD while targeting Mac OSX is vindicated. ;-)
So, I've been beating the kqueue code into shape for the last 12 hours, minus a few hours sleep.
First, I noticed it was seeming to starve the other threads. I'm using Haskell's non-threaded runtime, which does cooperative multitasking between threads, and my C code was never returning to let the other threads run. Changed that around, so the C code runs until it's interrupted by the runtime's SIGALRM, and then that thread calls yield before looping back into the C code. Wow, cooperative multitasking.. I last dealt with that when programming for Windows 3.1! (Should try to use Haskell's -threaded runtime sometime, but git-annex doesn't work under it, and I have not tried to figure out why not.)
Then I made a single commit, with no testing, in which I made the kqueue code maintain a cache of what it expects in the directory tree, and use that to determine what files changed, and how, when a change is detected. Serious code. It worked on the first go. If you were wondering why I'm writing in Haskell ... yeah, that's why.
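The shape of the idea, as a rough sketch (the real cache tracks more per entry than just names):

[[!format haskell """
import qualified Data.Set as S
import System.Directory (getDirectoryContents)

data Change = Added FilePath | Deleted FilePath
	deriving (Show)

-- Compare a directory's cached contents against a fresh listing, to
-- work out what a generic kqueue "this directory changed" event meant.
diffDir :: S.Set FilePath -> FilePath -> IO ([Change], S.Set FilePath)
diffDir old dir = do
	entries <- getDirectoryContents dir
	let new = S.fromList (filter (`notElem` [".", ".."]) entries)
	    added = map Added (S.toList (new `S.difference` old))
	    deleted = map Deleted (S.toList (old `S.difference` new))
	return (added ++ deleted, new)
"""]]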
And I've continued to hammer on the kqueue code, making lots of little fixes, and at this point it seems almost able to handle the changes I throw at it. It does have one big remaining problem; kqueue doesn't tell me when a writer closes a file, so it will sometimes miss adding files. To fix this, I'm going to need to make it maintain a queue of new files, and periodically check them, with lsof, to see when they're done being written to, and add them to the annex. So while a file is being written to, git annex watch will have to wake up every second or so, and run lsof ... and it'll take at least 1 second to notice a file's complete. Not ideal, but the best that can be managed with kqueue.
A rather frustrating and long day coding went like this:
1-3 pm
Wrote a single function; all any Haskell programmer needs to know about it is its type signature:
Lsof.queryDir :: FilePath -> IO [(FilePath, LsofOpenMode, ProcessInfo)]
When I'm spending another hour or two taking a unix utility like lsof and parsing its output, which in this case is in a rather complicated machine-parsable output format, I often wish unix streams were strongly typed, which would avoid this bother.
3-9 pm
Six hours spent making it defer annexing files until the commit thread wakes up and is about to make a commit. Why did it take so horribly long? Well, there were a number of complications, and some really bad bugs involving races that were hard to reproduce reliably enough to deal with.
In other words, I was lost in the weeds for a lot of those hours...
At one point, something glorious happened, and it was always making exactly one commit for batch-mode modifications of a lot of files (like untarring them). Unfortunately, I had to lose that gloriousness due to another potential race, which, while unlikely, would have made the program deadlock if it happened.
So, it's back to making 2 or 3 commits per batch-mode change. I also have a buglet that sometimes causes a second, empty commit after a file is added. I know why (the inotify event for the symlink gets in late, after the commit); will try to improve commit frequency later.
9-11 pm
Put the capstone on the day's work, by calling lsof on a directory full of hardlinks to the files that are about to be annexed, to check if any are still open for write.
This works great! Starting up git annex watch when processes have files open is no longer a problem, and even if you're evil enough to try having multiple processes open the same file, it will complain and not annex it until all the writers close it.

(Well, someone really evil could turn the write bit back on after git annex clears it, and open the file again, but then really evil people can do that to files in .git/annex/objects too, and they'll get their just deserts when git annex fsck runs. So, that's ok..)
Anyway, will beat on it more tomorrow, and if all is well, this will finally go out to the beta testers.
Followed my plan from yesterday, and wrote a simple C library to interface to kqueue, and Haskell code to use that library. By now I think I understand kqueue fairly well -- there are some very tricky parts to the interface.
But... it still didn't work. After building all this, my code was failing the same way that the Haskell kqueue library failed yesterday. I filed a bug report with a testcase.
Then I thought to ask on #haskell. Got sorted out in quick order! The problem turns out to be that Haskell's runtime has a periodic SIGALRM that was interrupting my kevent call. It can be worked around with +RTS -V0, but I put in a fix to retry the kevent when it's interrupted.
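The retry is the standard EINTR idiom, and Foreign.C.Error even provides a helper for it. A sketch, with a hypothetical binding to my C code:

[[!format haskell """
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Error (throwErrnoIfRetry)
import Foreign.C.Types (CInt(..))

-- Hypothetical binding to the C helper that blocks in kevent().
foreign import ccall safe "kevent_wait"
	c_kevent_wait :: CInt -> IO CInt

-- throwErrnoIfRetry re-runs the action when it fails with EINTR,
-- which is what the runtime's periodic SIGALRM causes.
waitChange :: CInt -> IO CInt
waitChange kq = throwErrnoIfRetry (== -1) "kevent" (c_kevent_wait kq)
"""]]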
And now git-annex watch can detect changes to directories on BSD and OSX!
Note: I said "detect", not "do something useful in response to". Getting from the limited kqueue events to actually staging changes in the git repo is going to be another day's work. Still, brave FreeBSD or OSX users might want to check out the watch branch from git and see if git annex watch will at least say it sees changes you make to your repository.
Syncing works! I have two clones, and any file I create in the first is immediately visible in the second. Delete that file from the second, and it's immediately removed from the first.
Most of my work today felt like stitching existing limbs onto a pre-existing monster. Took the committer thread, that waits for changes and commits them, and refashioned it into a pusher thread, that waits for commits and pushes them. Took the watcher thread, that watches for files being made, and refashioned it into a merger thread, that watches for git refs being updated. Pulled in bits of the git annex sync command to reanimate this. It may be a shambling hulk, but it works.
Actually, it's not much of a shambling hulk; I refactored my code after copying it. ;)
I think I'm up to 11 threads now in the new git annex assistant command, each with its own job, and each needing to avoid stepping on the others' toes. I did see one MVar deadlock error today, which I have not managed to reproduce after some changes. I think the committer thread was triggering the merger thread, which probably then waited on the Annex state MVar the committer thread had held.
Anyway, it even pushes to remotes in parallel, and keeps track of remotes it failed to push to, although as yet it makes no attempt at periodic retries.
One bug I need to deal with is that the push code assumes any change made to the remote has already been pushed back to it. When it hasn't, the push will fail due to not being a fast-forward. I need to make it detect this case and pull before pushing.
(I've pushed this work out in a new assistant branch.)
... I'm getting tired of kqueue.
But the end of the tunnel is in sight. Today I made git-annex handle files that are still open for write after a kqueue creation event is received. Unlike with inotify, which has a new event each time a file is closed, kqueue only gets one event when a file is first created, and so git-annex needs to retry adding files until there are no writers left.
Eventually I found an elegant way to do that. The committer thread already wakes up every second as long as there's a pending change to commit. So for adds that need to be retried, it can just push them back onto the change queue, and the committer thread will wait one second and retry the add. One second might be too frequent to check, but it will do for now.
This means that git annex watch should now be usable on OSX, FreeBSD, and NetBSD! (It'll also work on Debian kFreeBSD once lsof is ported to it.) I've merged kqueue support to master.
I also think I've squashed the empty commits that were sometimes made.
Incidentally, I'm 50% through my first month, and finishing inotify was the first half of my roadmap for this month. Seem to be right on schedule.. Now I need to start thinking about syncing.
Pondering syncing today. I will be doing syncing of the git repository first, and working on syncing of file data later.
The former seems straightforward enough, since we just want to push all changes to everywhere. Indeed, git-annex already has a sync command that uses a smart technique to allow syncing between clones without a central bare repository. (Props to Joachim Breitner for that.)
But it's not all easy. Syncing should happen as fast as possible, so changes show up without delay. Eventually it'll need to support syncing between nodes that cannot directly contact one another. Syncing needs to deal with nodes coming and going; one example of that is a USB drive being plugged in, which should immediately be synced, but network can also come and go, so it should periodically retry nodes it failed to sync with. To start with, I'll be focusing on fast syncing between directly connected nodes, but I have to keep this wider problem space in mind.
One problem with git annex sync is that it has to be run in both clones in order for changes to fully propagate. This is because git doesn't allow pushing changes into a non-bare repository; so instead it drops off a new branch in .git/refs/remotes/$foo/synced/master. Then when it's run locally it merges that new branch into master.
So, how to trigger a clone to run git annex sync when syncing to it? Well, I just realized I have spent two weeks developing something that can be repurposed to do that! Inotify can watch for changes to .git/refs/remotes, and the instant a change is made, the local sync process can be started. This avoids needing to make another ssh connection to trigger the sync, so is faster, and allows the data to be transferred over a protocol other than ssh, which may come in handy later.
So, in summary, here's what will happen when a new file is created:
- inotify event causes the file to be added to the annex, and immediately committed.
- new branch is pushed to remotes (probably in parallel)
- remotes notice new sync branch and merge it
- (data sync, TBD later)
- file is fully synced and available
Steps 1, 2, and 3 should all be able to be accomplished in under a second. The speed of git push making an ssh connection will be the main limit to making it fast. (Perhaps I should also reuse git-annex's existing ssh connection caching code?)
I've been investigating how to make git annex watch work on FreeBSD, and by extension, OSX.
One option is kqueue, which works on both operating systems, and allows very basic monitoring of file changes. There's also an OSX specific hfsevents interface.
Kqueue is far from optimal for git annex watch, because it provides even less information than inotify (which didn't really provide everything I needed, thus the lsof hack). Kqueue doesn't have events for files being closed, only an event when a file is created. So it will be difficult for git annex watch to know when a file is done being written to and can be annexed. git annex will probably need to run lsof periodically to check when recently added files are complete. (hfsevents shares this limitation)
Kqueue also doesn't provide specific events when a file or directory is moved. Indeed, it doesn't provide specific events about what changed at all. All you get with kqueue is a generic "oh hey, the directory you're watching changed in some way", and it's up to you to scan it to work out how. So git annex will probably need to run git ls-files --others to find changes in the directory tree. This could be expensive with large trees. (hfsevents has per-file events on current versions of OSX)
Despite these warts, I want to try kqueue first, since it's more portable than hfsevents, and will surely be easier for me to develop support for, since I don't have direct access to OSX.
So I went to a handy Debian kFreeBSD porter box, and tried some kqueue stuff to get a feel for it. I got a python program that does basic directory monitoring with kqueue to work, so I know it's usable there.
Next step was getting kqueue working from Haskell. Should be easy, there's a Haskell library already. I spent a while trying to get it to work on Debian kFreeBSD, but ran into a problem that could be caused by the Debian kFreeBSD being different, or just a bug in the Haskell library. I didn't want to spend too long shaving this yak; I might install "real" FreeBSD on a spare laptop and try to get it working there instead.
But for now, I've dropped down to C instead, and have a simple C program that can monitor a directory with kqueue. Next I'll turn it into a simple library, which can easily be linked into my Haskell code. The Haskell code will pass it a set of open directory descriptors, and it'll return the one that it gets an event on. This is necessary because kqueue doesn't recurse into subdirectories on its own.
I've generally had good luck with this approach to adding stuff in Haskell; rather than writing a bit-banging and structure packing low level interface in Haskell, write it in C, with a simpler interface between C and Haskell.
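From the Haskell side, the interface might look something like this; all names here are hypothetical at this point:

[[!format haskell """
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CInt(..))
import Foreign.Marshal.Array (withArray)
import Foreign.Ptr (Ptr)

-- C helper: given an array of open directory fds, block in kevent()
-- until one of them sees a change, and return that fd.
foreign import ccall safe "waitchange"
	c_waitchange :: Ptr CInt -> CInt -> IO CInt

waitChange :: [CInt] -> IO CInt
waitChange fds =
	withArray fds $ \ptr ->
		c_waitchange ptr (fromIntegral (length fds))
"""]]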
Since my last post, I've worked on speeding up git annex watch's startup time in a large repository.
The problem was that its initial scan was naively staging every symlink in the repository, even though most of them are, presumably, staged correctly already. This was done in case the user copied or moved some symlinks around while git annex watch was not running -- we want to notice and commit such changes at startup.
Since I already had the stat info for the symlink, it can look at the ctime to see if the symlink was made recently, and only stage it if so.
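A sketch of the check, using the unix package (the real code reuses the stat info it already has in hand):

[[!format haskell """
import System.Posix.Files (getSymbolicLinkStatus, statusChangeTime)
import System.Posix.Types (EpochTime)

-- Only symlinks changed since the daemon last ran need the expensive
-- re-staging during the startup scan.
madeRecently :: EpochTime -> FilePath -> IO Bool
madeRecently lastRunning file = do
	s <- getSymbolicLinkStatus file
	return (statusChangeTime s >= lastRunning)
"""]]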
This sped up startup in my big repo from longer than I cared to wait (10+ minutes, or half an hour while profiling) to a minute or so. Of course, inotify events are already serviced during startup, so making it scan quickly is really only important so people don't think it's a resource hog. First impressions are important. :)
But what does "made recently" mean exactly? Well, my answer is possibly overengineered, but most of it is really groundwork for things I'll need later anyway. I added a new data structure for tracking the status of the daemon, which is periodically written to disk by another thread (thread #6!) to .git/annex/daemon.status. Currently it looks like this; I anticipate adding lots more info as I move into the syncing stage:
lastRunning:1339610482.47928s
scanComplete:True
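A sketch of the data structure behind that file, with the serialization matching the format above (field details guessed from the file contents):

[[!format haskell """
import Data.Time.Clock.POSIX (POSIXTime)

data DaemonStatus = DaemonStatus
	{ lastRunning :: Maybe POSIXTime
	, scanComplete :: Bool
	}

-- show on a POSIXTime yields e.g. "1339610482.47928s", which is
-- exactly the form seen in the file above.
serialize :: DaemonStatus -> String
serialize st = unlines $
	maybe [] (\t -> ["lastRunning:" ++ show t]) (lastRunning st)
	++ ["scanComplete:" ++ show (scanComplete st)]
"""]]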
So, only symlinks made after the daemon was last running need to be expensively staged on startup. Although, as RichiH pointed out, this fails if the clock is changed. But I have been planning to have a cleanup thread anyway, that will handle this, and other potential problems, so I think that's ok.
Stracing its startup scan, it's fairly tight now. There are some repeated getcwd syscalls that could be optimised out for a minor speedup.
Added the sanity check thread. Thread #7! It currently only does one sanity check per day, but the sanity check is a fairly lightweight job, so I may make it run more frequently. OTOH, it may never ever find a problem, so once per day seems a good compromise.
Currently it's only checking that all files in the tree are properly staged in git. I might make it run git annex fsck later, but fscking the whole tree once per day is a bit much. Perhaps it should only fsck a few files per day? TBD
Currently any problems found in the sanity check are just fixed and logged. It would be good to do something about getting problems that might indicate bugs fed back to me, in a privacy-respecting way. TBD
I also refactored the code, which was getting far too large to all be in one module.
I have been thinking about renaming git annex watch to git annex assistant, but I think I'll leave the command name as-is. Some users might want a simple watcher and stager, without the assistant's other features like syncing and the webapp. So the next stage of the roadmap will be a different command that also runs watch.
At this point, I feel I'm done with the first phase of inotify. It has a couple known bugs, but it's ready for brave beta testers to try. I trust it enough to be running it on my live data.
First day of Kickstarter funded work!
Worked on inotify today. The watch branch in git now does a pretty good job of following changes made to the directory, annexing files as they're added and staging other changes into git. Here's a quick transcript of it in action:
joey@gnu:~/tmp>mkdir demo
joey@gnu:~/tmp>cd demo
joey@gnu:~/tmp/demo>git init
Initialized empty Git repository in /home/joey/tmp/demo/.git/
joey@gnu:~/tmp/demo>git annex init demo
init demo ok
(Recording state in git...)
joey@gnu:~/tmp/demo>git annex watch &
[1] 3284
watch . (scanning...) (started)
joey@gnu:~/tmp/demo>dd if=/dev/urandom of=bigfile bs=1M count=2
add ./bigfile 2+0 records in
2+0 records out
2097152 bytes (2.1 MB) copied, 0.835976 s, 2.5 MB/s
(checksum...) ok
(Recording state in git...)
joey@gnu:~/tmp/demo>ls -la bigfile
lrwxrwxrwx 1 joey joey 188 Jun 4 15:36 bigfile -> .git/annex/objects/Wx/KQ/SHA256-s2097152--e5ced5836a3f9be782e6da14446794a1d22d9694f5c85f3ad7220b035a4b82ee/SHA256-s2097152--e5ced5836a3f9be782e6da14446794a1d22d9694f5c85f3ad7220b035a4b82ee
joey@gnu:~/tmp/demo>git status -s
A bigfile
joey@gnu:~/tmp/demo>mkdir foo
joey@gnu:~/tmp/demo>mv bigfile foo
"del ./bigfile"
joey@gnu:~/tmp/demo>git status -s
AD bigfile
A foo/bigfile
Due to Linux's inotify interface, this is surely some of the most subtle, race-heavy code that I'll need to deal with while developing the git annex assistant. But I can't start wading, need to jump off the deep end to make progress!
The hardest problem today involved the case where a directory is moved outside of the tree that's being watched. Inotify will still send events for such directories, but it doesn't make sense to continue to handle them.
Ideally I'd stop inotify watching such directories, but a lot of state would need to be maintained to know which inotify handle to stop watching. (Seems like Haskell's inotify API makes this harder than it needs to be...)
Instead, I put in a hack that will make it detect inotify events from directories moved away, and ignore them. This is probably acceptable, since this is an unusual edge case.
The notable omission in the inotify code, which I'll work on next, is staging deletions of files. This is tricky because adding a file to the annex happens to cause a deletion event. I need to make sure there are no races where that deletion event causes data loss.
Since my last post, I've been polishing the git annex watch command.
First, I fixed the double commits problem. There's still some extra committing going on in the git-annex branch that I don't understand. It seems like a shutdown event is somehow being triggered whenever a git command is run by the commit thread.
I also made git annex watch run as a proper daemon, with locking to prevent multiple copies running, and a pid file, and everything. I made git annex watch --stop stop it.
Then I managed to greatly increase its startup speed. At startup, it generates "add" events for every symlink in the tree. This is necessary because it doesn't really know if a symlink is already added, or was manually added before it started, or indeed was added while it was starting up. The problem was that these events were causing a lot of work staging the symlinks -- most of which were already correctly staged.
You'd think it could just check if the same symlink was in the index. But it can't, because the index is in a constant state of flux. The symlinks might have just been deleted and re-added, or changed, and the index might still have the old value.
Instead, I got creative. :) We can't trust what the index says about the symlink, but if the index happens to contain a symlink that looks right, we can trust that the SHA1 of its blob is the right SHA1, and reuse it when re-staging the symlink. Wham! Massive speedup!
Then I started running git annex watch on my own real git annex repos, and noticed some problems.. Like it turns normal files already checked into git into symlinks. And it leaks memory scanning a big tree. Oops..
I put together a quick screencast demoing git annex watch.
While making the screencast, I noticed that git-annex watch was spinning in strace, which is bad news for powertop and battery usage. This seems to be a GHC bug also affecting Xmonad. I tried switching to GHC's threaded runtime, which solves that problem, but causes git-annex to hang under heavy load. Tried to debug that for quite a while, but didn't get far. Will need to investigate this further.. Am seeing indications that this problem only affects ghc 7.4.1; in particular 7.4.2 does not seem to have the problem.
After a few days otherwise engaged, back to work today.
My focus was on adding the committing thread mentioned in "day 4 speed". I got rather further than expected!
First, I implemented a really dumb thread, that woke up once per second, checked if any changes had been made, and committed them. Of course, this rather sucked. In the middle of a large operation like untarring a tarball, or rm -r of a large directory tree, it made lots of commits and made things slow and ugly. This was not unexpected.
So next, I added some smarts to it. First, I wanted to stop it waking up every second when there was nothing to do, and instead block waiting for a change to occur. Secondly, I wanted it to know when past changes happened, so it could detect batch-mode scenarios, and avoid committing too frequently.
I played around with combinations of various Haskell thread communication tools to get that information to the committer thread: MVar, Chan, QSem, QSemN. Eventually, I realized all I needed was a simple channel through which the timestamps of changes could be sent. However, Chan wasn't quite suitable, and I had to add a dependency on Software Transactional Memory, and use a TChan. Now I'm cooking with gas!
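The channel itself is simple. A minimal sketch of its two ends:

[[!format haskell """
import Control.Concurrent.STM
import Data.Time.Clock (UTCTime, getCurrentTime)

-- The watcher side: note the time at which each change happened.
recordChange :: TChan UTCTime -> IO ()
recordChange chan = do
	now <- getCurrentTime
	atomically $ writeTChan chan now

-- The committer side: block until the first change arrives, then
-- drain whatever else is already pending.
getChanges :: TChan UTCTime -> IO [UTCTime]
getChanges chan = atomically $ do
	first <- readTChan chan
	rest <- drain
	return (first : rest)
  where
	drain = do
		mt <- tryReadTChan chan
		case mt of
			Nothing -> return []
			Just t -> fmap (t :) drain
"""]]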
With that data channel available to the committer thread, it quickly got some very nice smart behavior. Playing around with it, I find it commits instantly when I'm making some random change that I'd want the git-annex assistant to sync out instantly; and that its batch job detection works pretty well too.
There's surely room for improvement, and I made this part of the code be an entirely pure function, so it's really easy to change the strategy. This part of the committer thread is so nice and clean, that here's the current code, for your viewing pleasure:
[[!format haskell """
{- Decide if now is a good time to make a commit.
 - Note that the list of change times has an undefined order.
 - Current strategy: If there have been 10 commits within the past second,
 - a batch activity is taking place, so wait for later.
 -}
shouldCommit :: UTCTime -> [UTCTime] -> Bool
shouldCommit now changetimes
	| len == 0 = False
	| len > 4096 = True -- avoid bloating queue too much
	| length (filter thisSecond changetimes) < 10 = True
	| otherwise = False -- batch activity
  where
	len = length changetimes
	thisSecond t = now `diffUTCTime` t <= 1
"""]]
Still some polishing to do to eliminate minor inefficiencies and deal with more races, but this part of the git-annex assistant is now very usable, and will be going out to my beta testers soon!
Only had a few hours to work today, but my current focus is speed, and I have indeed sped up parts of git annex watch.
One thing folks don't realize about git is that despite a rep for being fast, it can be rather slow in one area: writing the index. You don't notice it until you have a lot of files, and the index gets big. So I've put a lot of effort into git-annex in the past to avoid writing the index repeatedly, and to queue up big index changes that can happen all at once. The new git annex watch was not able to use that queue. Today I reworked the queue machinery to support the types of direct index writes it needs, and now repeated index writes are eliminated.
... Eliminated too far, it turns out, since it currently never flushes that queue until shutdown! So the next step here will be to have a worker thread that wakes up periodically, flushes the queue, and autocommits. (This will, in fact, be the start of the syncing phase of my roadmap!) There's lots of room here for smart behavior. Like, if a lot of changes are being made close together, wait for them to die down before committing. Or, if it's been idle and a single file appears, commit it immediately, since this is probably something the user wants synced out right away. I'll start with something stupid and then add the smarts.
(BTW, in all my years of programming, I have avoided threads like the nasty bug-prone plague they are. Here I already have three threads, and am going to add probably 4 or 5 more before I'm done with the git annex assistant. So far, it's working well -- I give credit to Haskell for making it easy to manage state in ways that make it possible to reason about how the threads will interact.)
What about the races I've been stressing over? Well, I have an ulterior motive in speeding up git annex watch, and that's to also be able to slow it down. Running in slow-mo makes it easy to try things that might cause a race and watch how it reacts. I'll be using this technique when I circle back around to dealing with the races.
Another tricky speed problem came up today that I also need to fix. On startup, git annex watch scans the whole tree to find files that have been added or moved etc while it was not running, and takes care of them. Currently, this scan involves re-staging every symlink in the tree. That's slow! I need to find a way to avoid re-staging symlinks; I may use git cat-file to check if the currently staged symlink is correct, or I may come up with some better and faster solution. Sleeping on this problem.
Oh yeah, I also found one more race bug today. It only happens at startup and could only make it miss staging file deletions.
git merge watch_
My cursor has been mentally poised here all day, but I've been reluctant to merge watch into master. It seems solid, but is it correct? I was able to think up a lot of races it'd be subject to, and deal with them, but did I find them all?
Perhaps I need to do some automated fuzz testing to reassure myself. I looked into using genbackupdata to that end. It's not quite what I need, but could be moved in that direction. Or I could write my own fuzz tester, but it seems better to use someone else's, because a) laziness and b) they're less likely to have the same blind spots I do.
My reluctance to merge isn't helped by the known bugs with files that are either already open before git annex watch starts, or are opened by two processes at once, and confuse it into annexing the still-open file when one process closes it.
I've been thinking about just running lsof on every file as it's being annexed to check for that, but in the end, lsof is too slow. Since its check involves trawling through all of /proc, it takes a good half a second to check a file, and adding 25 seconds to the time it takes to process 100 files is just not acceptable.
But an option that could work is to run lsof after a bunch of new files have been annexed. It can check a lot of files nearly as fast as a single one. In the rare case that an annexed file is indeed still open, it could be moved back out of the annex. Then when its remaining writer finally closes it, another inotify event would re-annex it.
Today I worked on the race conditions, and fixed two of them. Both were fixed by avoiding using git add, which looks at the files currently on disk. Instead, git annex watch injects symlinks directly into git's index, using git update-index.
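The trick is that git update-index --index-info accepts ls-files --stage style lines on stdin, so a symlink whose blob sha is already known can be staged without ever looking at the working tree. A minimal sketch:

[[!format haskell """
import System.Process (readProcess)

-- Stage symlinks directly in the index, bypassing `git add`. Each
-- input line is "<mode> <sha> <stage>\t<file>"; mode 120000 marks a
-- symlink, stage 0 a normally resolved entry.
stageSymlinks :: [(FilePath, String)] -> IO ()
stageSymlinks links = do
	_ <- readProcess "git" ["update-index", "--index-info"] input
	return ()
  where
	input = unlines [ "120000 " ++ sha ++ " 0\t" ++ file
	                | (file, sha) <- links ]
"""]]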
There is one bad race condition remaining. If multiple processes have a file open for write, one can close it, and it will be added to the annex. But then the other can still write to it.
Getting away from race conditions for a while, I made git annex watch not annex .gitignore and .gitattributes files.
And, I made it handle running out of inotify descriptors. By default, /proc/sys/fs/inotify/max_user_watches is 8192, and that's how many directories inotify can watch. Now when it needs more, it will print a nice message showing how to increase it with sysctl.
FWIW, DropBox also uses inotify and has the same limit. It seems to not tell the user how to fix it when it goes over. Here's what git annex watch will say:
Too many directories to watch! (Not watching ./dir4299)
Increase the limit by running:
echo fs.inotify.max_user_watches=81920 | sudo tee -a /etc/sysctl.conf; sudo sysctl -p
Kickstarter is over. Yay!
Today I worked on the bug where git annex watch turned regular files that were already checked into git into symlinks. So I made it check if a file is already in git before trying to add it to the annex.
The tricky part was doing this check quickly. Unless I want to write my own git index parser (or use one from Hackage), this check requires running git ls-files, once per file to be added. That won't fly if a huge tree of files is being moved or unpacked into the watched directory.
Instead, I made it only do the check during git annex watch's initial scan of the tree. This should be ok, because once it's running, you won't be adding new files to git anyway, since it'll automatically annex new files. This is good enough for now, but there are at least two problems with it:
- Someone might git merge in a branch that has some regular files, and it would add the merged-in files to the annex.
- Once git annex watch is running, if you modify a file that was checked into git as a regular file, the new version will be added to the annex.
I'll probably come back to this issue, and may well find myself directly querying git's index.
I've started work to fix the memory leak I see when running git annex watch in a large repository (40 thousand files). As always with a Haskell memory leak, I crack open Real World Haskell's chapter on profiling. Eventually this yields a nice graph of the problem.
So, it looked like a few minor memory leaks, and one huge leak. Stared at this for a while, tried a few things, and got a much better result.
I may come back later and try to improve this further, but it's not bad memory usage. But, it's still rather slow to start up in such a large repository, and its initial scan is still doing too much work. I need to optimize more..
Last night I got git annex watch to also handle deletion of files. This was not as tricky as feared; the key is using git rm --ignore-unmatch, which avoids most problematic situations (such as a just-deleted file being added back before git is run).
Also fixed some races when git annex watch is doing its startup scan of the tree, which might be changed as it's being traversed. Now only one thread performs actions at a time, so inotify events are queued up during the scan, and dealt with once it completes. It's worth noting that inotify can only buffer so many events .. which might have been a problem except for a very nice feature of Haskell's inotify interface: it has a thread that drains the limited inotify buffer and does its own buffering.
Right now, git annex watch is not as fast as it could be when doing something like adding a lot of files, or deleting a lot of files. For each file, it currently runs a git command that updates the index. I did some work toward coalescing these into one command (which git annex already does normally). It's not quite ready to be turned on yet, because of some races involving git add that become much worse if it's delayed by event coalescing.
And races were the theme of today. Spent most of the day really getting to grips with all the fun races that can occur between modifications to files and git annex watch. The inotify page now has a long list of known races: some benign, and several, all involving adding files, that are quite nasty.
I fixed one of those races this evening. The rest will probably involve moving away from using git add, which necessarily examines the file on disk, to directly shoving the symlink into git's index.
BTW, it turns out that dvcs-autosync has grappled with some of these same races: http://comments.gmane.org/gmane.comp.version-control.home-dir/665
I hope that git annex watch will be in a better place to deal with them, since it's only dealing with git, and with a restricted portion of it relevant to git-annex.
It's important that git annex watch be rock solid. It's the foundation of the git annex assistant. Users should not need to worry about races when using it. Most users won't know what race conditions are. If only I could be so lucky!