(Hi, this is paulproteus@debian, AKA Asheesh).
I've been enjoying using git-annex to archive my data.
It's great that, by using git-annex and the SHA1 backend, I get a space-saving kind of deduplication through the symbolic links.
I'm looking for the ability to filter files, before they get added to the annex, so that I don't add new files whose content is already in the annex. That would help me in terms of personal file organization.
It seems there is not, so I'm filing this wishlist bug in the hope that such a thing might someday exist. What I would really like to do is:
- $ git annex add --no-add-if-already-present .
- $ git commit -m "Slurping in some photos I found on my old laptop hard drive"
And then I'd do something like:
- $ git clean -f
to remove the files that didn't get annexed in this run. That way, only one filename would ever point to a particular SHA1.
I want this because I have copies of various files of mine (photos, in particular) scattered across various hard disks. If this feature existed, I could comfortably toss them all into one git annex that grew, bit by bit, to store all of these files exactly once.
(I would be even happier for "git annex add --unlink-duplicates .")
(Another way to do this would be to "git annex add" them all, and then use a "git annex remove-duplicates" that could prompt me about which files are duplicates of each other, and then I could pipe that command's output into xargs git rm.)
(As I write this, I realize it's possible to parse the destination of the symlink in a way that does this..)
done; see finding duplicate files --Joey
Hey Asheesh, I'm happy you're finding git-annex useful.
So, there are two forms of duplication going on here. There's duplication of the content, and duplication of the filenames pointing at that content.
Duplication of the filenames is probably not a concern, although it's what I thought you were talking about at first. It's probably info worth recording that backup-2010/some_dir/foo and backup-2009/other_dir/foo are two names you've used for the same content in the past. If you really wanted to remove backup-2009/foo, you could do it by writing a script that looks at the basenames of the symlink targets and removes files that point to the same content as other files.
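A rough sketch of such a script (assuming an indirect-mode repository, where each annexed file is a symlink whose target's basename is the key) might look like this; it prints every filename after the first one seen for each key:

    # Group annexed files by the key encoded in their symlink target and
    # print the extra names.  (Filenames containing tabs or newlines would
    # need more care.)
    find . -type l -not -path './.git/*' | while IFS= read -r file; do
        key=$(basename "$(readlink "$file")")
        printf '%s\t%s\n' "$key" "$file"
    done | awk -F'\t' 'seen[$1]++ { print $2 }'

Piping that output into git rm would then leave just one name per unique content.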
Using SHA1 ensures that the same key is used for identical files, so generally avoids duplication of content. But if you have 2 disks with an identical file on each, and make them both into annexes, then git-annex will happily retain both copies of the content, one per disk. It generally considers keeping copies of content a good thing. :)
So, what if you want to remove the unnecessary copies? Well, there's a really simple way:
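Roughly along these lines (the mount points and the "other-disk" remote name are only examples):

    # in the new annex, e.g. on the usb-1 disk
    cd /media/usb-1/annex
    git remote add other-disk /media/usb-0/annex
    git annex add .
    git annex drop .    # only drops content it can verify is present on other-disk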
This asks git-annex to add everything to the annex, but then remove any file contents that it can safely remove. What can it safely remove? Well, anything that it can verify is on another repository such as "other-disk"! So, this will happily drop any duplicated file contents, while leaving all the rest alone.
In practice, you might not want to have all your old backup disks mounted at the same time and configured as remotes. Look into configuring trust to avoid needing to do that. If usb-0 is already a trusted disk, all you need is a simple "git annex drop" on usb-1.
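For example (a sketch, with "usb-0" standing in for however that repository is named):

    # run in the repository on usb-1
    git annex trust usb-0   # record that usb-0 reliably retains its copies
    git annex drop .        # content recorded as present on the trusted usb-0 can now be dropped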
I really do want just one filename per file, at least for some cases.
For my photos, there's no benefit to having a few filenames point to the same file. As I'm putting them all into the git-annex, that is a good time to remove the pure duplicates so that I don't e.g. see them twice when browsing the directory as a gallery. Also, I am uploading my photos to the web, and I want to avoid uploading the same photo (by content) twice.
I hope that makes things clearer!
For now I'm just doing this:
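Roughly like this, that is (a sketch; the annex path and what I do with new files are just placeholders):

    # skip any file whose SHA1 content already appears in the annex's object store
    annex_objects=~/Photos/annex/.git/annex/objects     # example path
    for file in *; do
        [ -f "$file" ] || continue
        hash=$(sha1sum "$file" | awk '{ print $1 }')
        if find "$annex_objects" -name "*--$hash*" -print -quit | grep -q .; then
            echo "already annexed by content: $file"
        else
            echo "new content: $file"   # e.g. upload it and/or move it into the annex
        fi
    done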
(Yeah, Flickr for my photos for now. I feel sad about betraying the principle of autonomo.us-ness.)
For what it's worth, yes, I want to actually forget I ever had the same file in the filesystem with a duplicated name. I'm not just aiming to clean up the disk's space usage; I'm also aiming to clean things up so that navigating the filesystem is easier.
I can write my own script to do that based on the symlinks' target (and I wrote something along those lines), but I still think it'd be nicer if git-annex supported this use case.
Perhaps something like a "git annex drop --by-content <file>" option (I'm making that name up) could let me remove a file from git-annex if the contents are available through a different name. (Right now, "git annex drop" requires the name and contents match.)
-- Asheesh.
I have the same use case as Asheesh, but I want to be able to see which filenames point to the same objects and then decide which of the duplicates to drop myself. I think an automatic option such as the "--unlink-duplicates" idea above would be the wrong approach, because how does git-annex know which ones to drop? There's too much potential for error.
Instead it would be great to have something like a "git annex finddups" subcommand (or whatever it might end up being called).
While it's easy enough to knock up a bit of shell or Perl to achieve this, that relies on knowledge of the annex symlink structure, so I think really it belongs inside git-annex.
If this command gave output similar to the excellent fastdup utility, then you could do stuff like the following.
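(Just a sketch: "git annex finddups" is an invented name here, and I'm assuming fastdup-style output, i.e. one blank-line-separated group per set of identical files, with the first name in each group being the one to keep.)

    # remove every name in each duplicate group except the first one listed
    git annex finddups |
        awk 'BEGIN { keep = 1 }  NF == 0 { keep = 1; next }  keep { keep = 0; next }  { print }' |
        xargs -d '\n' -r git rm --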
My attempt at this so far is
https://github.com/aspiers/git-config/blob/master/bin/git-annex-finddups
but it would be better in git-annex itself ...
My main concern with putting this in git-annex is that finding duplicates necessarily involves storing a list of every key and file in the repository, and git-annex is very carefully built to avoid things that require non-constant memory use, so that it can scale to very big repositories. (The only exception is the unused command, and reducing its memory usage is a continuing goal.)

So I would rather come at this from a different angle.. like providing a way to output a list of files and their associated keys, which the user can then use in their own shell pipelines to find duplicate keys:
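For instance, roughly (using the "%f %k" format idea; exact escaping details glossed over):

    # print "file key" pairs, then group the lines that share a key
    git annex find --include '*' --format="%f %k\n" |
        sort --key=2 | uniq --all-repeated=separate --skip-fields=1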
Which is implemented now!
(Making that pipeline properly handle filenames with spaces is left as an exercise for the reader..)
Only if you want to search the whole repository for duplicates, and if you do, then you're necessarily going to have to chew up memory in some process anyway, so what difference does it make whether it's git-annex or (say) a Perl wrapper?
That's a worthy goal, but if everything could be implemented with an O(1) memory footprint then we'd be in a much more pleasant world :-) Even O(n) isn't that bad ...
That aside, I like your --format="%f %k\n" idea a lot. That opens up the "black box" of .git/annex/objects and makes nice things possible, as your pipeline already demonstrates. However, I'm not sure why you think "git annex find | sort | uniq" would be more efficient. Not only does the sort require the very thing you were trying to avoid (i.e. the whole list in memory), but it's also O(n log n), which is significantly slower than my O(n) Perl script linked above.

More considerations about this pipeline:
- Why the need for --include '*'? Doesn't "git annex find" with no arguments already include all files, modulo the requirement above that they're locally available?
- The "git annex find | ..." approach is likely to run up against its limitations sooner rather than later, because users who know find(1) are already used to the plethora of options it provides. Rather than reinventing the wheel, is there some way "git annex find" could harness the power of find(1)?

Those considerations aside, a combined approach would be to implement something like the --format option discussed above
and then alter my Perl wrapper to popen(2) from that rather than using File::Find. But I doubt you would want to ship Perl wrappers in the distribution, so if you don't provide a Haskell equivalent then users who can't code are left high and dry.

Adam, to answer a lot of points briefly..
What's your source for this assertion? I would expect an amortized average of O(1) per insertion, i.e. O(n) for full population.

None of which necessarily change the algorithmic complexity. However, real benchmarks are far more useful here than complexity analysis, and the dangers of premature optimization should not be forgotten.
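For instance, a single hash-based pass needs no sort at all; a sketch, reusing the "%f %k" output idea from above and ignoring filenames with spaces:

    # one O(n) pass: remember the first filename seen for each key and
    # report any later filename carrying the same key as a duplicate
    git annex find --include '*' --format="%f %k\n" |
        awk '{ if ($2 in seen) print $1 " duplicates " seen[$2]; else seen[$2] = $1 }'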
Sure, I was aware of that, but my point still stands. Even 500k keys per 1GB of RAM does not sound expensive to me.
Why not? What's the maximum it should use? 512MB? 256MB? 32MB? I don't see the sense in the author of a program dictating thresholds which are entirely dependent on the context in which the program is run, not the context in which it's written. That's why systems have files such as /etc/security/limits.conf.

You said you want git-annex to scale to enormous repositories. If you impose an arbitrary memory restriction such as the above, that means avoiding implementing any kind of functionality which requires O(n) memory or worse. Isn't it reasonable to assume that many users use git-annex on repositories which are not enormous? Even when they do work with enormous repositories, just like with any other program, they would naturally expect certain operations to take longer or become impractical without sufficient RAM. That's why I say that this restriction amounts to throwing out the baby with the bathwater. It just means that those who need the functionality would have to reimplement it themselves, assuming they are able, which is likely to result in more wheel reinventions. I've already shared my implementation but how many people are likely to find it, let alone get it working?

Interesting. Presumably you are referring to some undocumented behaviour, rather than --batch-size, which only applies when merging multiple files, and not when only sorting STDIN.

It's the best choice for sorting. But sorting purely to detect duplicates is a dismally bad choice.